
Truncated Linear Regression in High Dimensions

Constantinos Daskalakis
MIT
[email protected]
   Dhruv Rohatgi
MIT
[email protected]
   Manolis Zampetakis
MIT
[email protected]
Abstract

As in standard linear regression, in truncated linear regression we are given access to observations (A_{i},y_{i})_{i} whose dependent variable equals y_{i}=A_{i}^{\rm T}\cdot x^{*}+\eta_{i}, where x^{*} is some fixed unknown vector of interest and \eta_{i} is independent noise; except we are only given an observation if its dependent variable y_{i} lies in some “truncation set” S\subset\mathbb{R}. The goal is to recover x^{*} under some favorable conditions on the A_{i}’s and the noise distribution. We prove that there exists a computationally and statistically efficient method for recovering k-sparse n-dimensional vectors x^{*} from m truncated samples, which attains an optimal \ell_{2} reconstruction error of O(\sqrt{(k\log n)/m}). As a corollary, our guarantees imply a computationally efficient and information-theoretically optimal algorithm for compressed sensing with truncation, which may arise from measurement saturation effects. Our result follows from a statistical and computational analysis of the Stochastic Gradient Descent (SGD) algorithm for solving a natural adaptation of the LASSO optimization problem that accommodates truncation. This generalizes the works of both: (1) Daskalakis et al. [9], where no regularization is needed due to the low-dimensionality of the data, and (2) Wainwright [26], where the objective function is simple due to the absence of truncation. In order to deal with both truncation and high-dimensionality at the same time, we develop new techniques that not only generalize the existing ones but are also, we believe, of independent interest.

1 Introduction

In the vanilla linear regression setting, we are given mnm\geq n observations of the form (Ai,yi)(A_{i},y_{i}), where AinA_{i}\in\mathbb{R}^{n}, yi=AiTx+ηiy_{i}=A_{i}^{\rm T}x^{*}+\eta_{i}, xx^{*} is some unknown coefficient vector that we wish to recover, and ηi\eta_{i} is independent and identically distributed across different observations ii random noise. Under favorable conditions about the AiA_{i}’s and the distribution of the noise, it is well-known that xx^{*} can be recovered to within 2\ell_{2}-reconstruction error O(n/m)O(\sqrt{n/m}).

The classical model and its associated guarantees might, however, be inadequate to address many situations which frequently arise in both theory and practice. We focus on two common and widely studied deviations from the standard model. First, it is often the case that mnm\ll n, i.e. the number of observations is much smaller than the dimension of the unknown vector xx^{*}. In this “under-determined” regime, it is fairly clear that it is impossible to expect a non-trivial reconstruction of the underlying xx^{*}, since there are infinitely many xnx\in\mathbb{R}^{n} such that AiTx=AiTxA_{i}^{\rm T}x=A_{i}^{\rm T}x^{*} for all i=1,,mi=1,\ldots,m. To sidestep this impossibility, we must exploit additional structural properties that we might know xx^{*} satisfies. One such property might be sparsity, i.e. that xx^{*} has knk\ll n non-zero coordinates. Linear regression under sparsity assumptions has been widely studied, motivated by applications such as model selection and compressed sensing; see e.g. the celebrated works of [23, 6, 11, 26] on this topic. It is known, in particular, that a kk-sparse xx^{*} can be recovered to within 2\ell_{2} error O(klogn/m)O(\sqrt{k\log n/m}), when the AiA_{i}’s are drawn from the standard multivariate Normal, or satisfy other favorable conditions [26]. The recovery algorithm solves a least squares optimization problem with 1\ell_{1} regularization, i.e. what is called LASSO optimization in Statistics, in order to reward sparsity.

Another common deviation from the standard model is the presence of truncation. Truncation occurs when the sample (Ai,yi)(A_{i},y_{i}) is not observed whenever yiy_{i} falls outside of a subset SS\subseteq\mathbb{R}. Truncation arises quite often in practice as a result of saturation of measurement devices, bad data collection practices, incorrect experimental design, and legal or privacy constraints which might preclude the use of some of the data. Truncation is known to affect linear regression in counter-intuitive ways, as illustrated in Fig. 1,

Figure 1: Truncation in one-dimensional linear regression, along with the linear fit obtained via least squares regression before and after truncation.

where the linear fits obtained via least squares regression before and after truncation of the data based on the value of the response variable are also shown. More broadly, it is well-understood that naive statistical inference using truncated data commonly leads to bias. Accordingly, a long line of research in Statistics and Econometrics has strived to develop regression methods that are robust to truncation [24, 1, 15, 18, 16, 5, 14]. This line of work falls into the broader field of Truncated Statistics [22, 8, 2], which finds its roots in the early works of [3], [13], [19, 20], and [12]. Despite this voluminous work, computationally and statistically efficient methods for truncated linear regression have only recently been obtained in [9], where it was shown that, under favorable assumptions about the AiA_{i}’s, the truncation set SS, and assuming the ηi\eta_{i}’s are drawn from a Gaussian, the negative log likelihood of the truncated sample can be optimized efficiently, and approximately recovers the true parameter vector with an 2\ell_{2} reconstruction error O(nlogmm){O}\left(\sqrt{n\log m\over m}\right).

Our contribution.

In this work, we solve the general problem, addressing both of the aforementioned challenges together. Namely, we provide efficient algorithms for the high-dimensional (m\ll n) truncated linear regression problem. This problem arises frequently, including in compressed sensing applications with measurement saturation, as studied e.g. in [10, 17].

Under standard conditions on the design matrix and the noise distribution (namely that the AiA_{i}’s and ηi\eta_{i}’s are sampled from independent Gaussian distributions before truncation), and under mild assumptions on the truncation set SS (roughly, that it permits a constant fraction of the samples yi=AiTx+ηiy_{i}=A_{i}^{\rm T}x^{*}+\eta_{i} to survive truncation), we show that the SGD algorithm on the truncated LASSO optimization program, our proposed adaptation of the standard LASSO optimization to accommodate truncation, is a computationally and statistically efficient method for recovering xx^{*}, attaining an optimal 2\ell_{2} reconstruction error of O((klogn)/m)O(\sqrt{(k\log n)/m}), where kk is the sparsity of xx^{*}.

We formally state the model and assumptions in Section 2, and our result in Section 3.

1.1 Overview of proofs and techniques

The problem that we solve in this paper encompasses the two difficulties of the problems considered in: (1) Wainwright [26], which tackles the problem of high-dimensional sparse linear regression with Gaussian noise, and (2) Daskalakis et al. [9], which tackles the problem of truncated linear regression. The tools developed in those papers do not suffice to solve our problem, since each difficulty interferes with the other. Hence, we introduce new ideas and develop new interesting tools that allow us to bridge the gap between [26] and [9]. We begin our overview in this section with a brief description of the approaches of [26, 9] and subsequently outline the additional challenges that arise in our setting, and how we address them.

Wainwright [26] uses as an estimator the solution of the regularized least squares program, also called LASSO, to handle the high-dimensionality of the data. Wainwright then uses a primal-dual witness method to bound the number of samples that are needed in order for the solution of the LASSO program to be close to the true coefficient vector xx^{*}. The computational task is not discussed in detail in Wainwright [26], since the objective function of the LASSO program is very simple and standard convex optimization tools can be used.

Daskalakis et al. [9] use as their estimator the solution to the log-likelihood maximization problem. In contrast to [26], their convex optimization problem takes a very complicated form due to the presence of truncation, which introduces an intractable log-partition function term in the log-likelihood. The main idea of Daskalakis et al. [9] to overcome this difficulty is identifying a convex set DD such that: (1) it contains the true coefficient vector xx^{*}, (2) their objective function is strongly convex inside DD, (3) inside DD there exists an efficient rejection sampling algorithm to compute unbiased estimates of the gradients of their objective function, (4) the norm of the stochastic gradients inside DD is bounded, and (5) projecting onto DD is efficient. These five properties are essentially what they need to prove that the SGD with projection set DD converges quickly to a good estimate of xx^{*}.

Our reconstruction algorithm is inspired by both [26] and [9]. We formulate our optimization program as the 1\ell_{1}-regularized version of the negative log-likelihood function in the truncated setting, which we call truncated LASSO. In particular, our objective contains an intractable log-partition function term. Our proof then consists of two parts. First, we show statistical recovery, i.e. we upper bound the number of samples that are needed for the solution of the truncated LASSO program to be close to the true coefficient vector xx^{*}. Second, we show that this optimization problem can be solved efficiently. The cornerstones of our proof are the two seminal approaches that we mentioned: the Primal-Dual Witness method for statistical recovery in high dimensions, and the Projected SGD method for efficient maximum likelihood estimation in the presence of truncation. Unfortunately, these two techniques are not a priori compatible to stitch together.

Roughly speaking, the technique of [9] relies heavily on the very carefully chosen projection set D to which the SGD is restricted, as we explained above. This projection set cannot be used in high dimensions because it effectively requires knowing the low-dimensional subspace in which the true solution lies. The projection set was the key to the aforementioned nice properties: strong convexity, efficient gradient estimation, and bounded gradients. In its absence, we need to deal with each of these issues individually. The primal-dual witness method of [26] cannot be applied directly in our setting either: in our case the gradient of the truncated LASSO does not have a nice closed form, and hence finding the correct way to construct the primal-dual witness requires a more delicate argument. Our proof manages to overcome all these issues. In a nutshell, the architecture of our full proof is the following.

  1.

    Optimality on the low dimensional subspace. The first thing that we need to prove is that the optimum of the truncated LASSO program, when restricted to the low dimensional subspace defined by the non-zero coordinates of x^{*}, is close to the true solution. This step of the proof was unnecessary in [9] due to the lack of regularization in their objective, and was trivial in [26] due to the simple loss function, i.e. regularized least squares.

  2.

    Optimality on the high dimensional space. We prove that the optimum of the truncated LASSO program in the low dimensional subspace is also optimal for the whole space. This step is done using the primal-dual witness method as in [26]. However, in our case the expression of the gradient is much more complicated due to the very convoluted objective function. Hence, we find a more general way to prove this step that does not rely on the exact expression of the gradient.

These two steps of the proof suffice to upper bound the number of samples that we need to recover the coefficient vector xx^{*} via the truncated LASSO program. Next, we provide a computationally efficient method to solve the truncated LASSO program.

  3.

    Initialization of SGD. The first step of our algorithm to solve truncated LASSO is finding a good initial point for the SGD. This was unnecessary in [26] due to the simple objective and in [9] due to the existence of the projection set DD (where efficient projection onto DD immediately gave an initial point). We propose the simple answer of bootstrapping: start with the solution of the 1\ell_{1}-regularized ordinary least squares program. This is a biased estimate, but we show it’s good enough for initialization.

  4.

    Projection of SGD. Next, we need to choose a projection set to make sure that Projected-SGD (PSGD) converges. The projection set chosen in [9] is not helpful in our case unless we a priori know the set of non-zero coordinates of xx^{*}. Hence, we define a different, simpler set which admits efficient projection algorithms. As a necessary side-effect, in contrast to [9], our set cannot guarantee many of the important properties that we need to prove fast convergence of SGD.

  5.

    Lack of strong convexity and gradient estimation. Our different projection set cannot guarantee the strong convexity and efficient gradient estimation enjoyed in [9]. There are two problems here:

    First, we know that PSGD converges to a point with small loss, but why must the point be near the optimum? Since strong convexity fails in high dimensions, it is not clear. We provide a workaround to resolve this issue that can be applied to other regularized programs with stochastic access to the gradient function.

    Second, computing unbiased estimates of the gradient is now difficult. The prior work employed rejection sampling, but in our setting this may take exponential time. For this reason we provide a more explicit method for estimating the gradient much faster, whenever the truncation set is reasonably well-behaved.

An important tool that we leverage repeatedly in our analysis and we have not mentioned above is a strong isometry property for our measurement matrix, which has truncated Gaussian rows. Similar properties have been explored in the compressed sensing literature for matrices with i.i.d. Gaussian and sub-Gaussian entries [25].

We refer to Section 5 for a more detailed overview of the proofs of our main results.

2 High-dimensional truncated linear regression model

Notation.

Let Z\sim N(0,1) refer to a standard normal random variable. For t\in\mathbb{R} and measurable S\subseteq\mathbb{R}, let Z_{t}\sim N(t,1;S) refer to the truncated normal (Z+t)|Z+t\in S. Let \mu_{t}=\mathbb{E}[Z_{t}]. Let \gamma_{S}(t)=\Pr[Z+t\in S]. Additionally, for a,x\in\mathbb{R}^{n} let Z_{a,x} refer to Z_{a^{T}x} (or Z_{ax} if a is a row vector), and let \gamma_{S}(a,x) refer to \gamma_{S}(a^{T}x). For a matrix A\in\mathbb{R}^{m\times n}, let Z_{A,x}\in\mathbb{R}^{m} be the random vector with (Z_{A,x})_{j}=Z_{A_{j},x}. For sets I\subseteq[m] and J\subseteq[n], let A_{I,J} refer to the submatrix [A_{i,j}]_{i\in I,j\in J}. For i\in[m] we treat the row A_{i} as a row vector. In a slight abuse of notation, we will often write A_{U} (or sometimes, A_{V}); this will always mean A_{[m],U}. By A_{U}^{T} we mean (A_{[m],U})^{T}. For x\in\mathbb{R}^{n}, define \operatorname{supp}(x) to be the set of indices i\in[n] such that x_{i}\neq 0.

2.1 Model

Let xnx^{*}\in\mathbb{R}^{n} be the unknown parameter vector which we are trying to recover. We assume that it is kk-sparse; that is, supp(x)\operatorname{supp}(x^{*}) has cardinality at most kk. Let SS\subseteq\mathbb{R} be a measurable subset of the real line. The main focus of this paper is the setting of Gaussian noise: we assume that we are given mm truncated samples (Ai,yi)(A_{i},y_{i}) generated by the following process:

  1.

    Pick AinA_{i}\in\mathbb{R}^{n} according to the standard normal distribution N(0,1)nN(0,1)^{n}.

  2.

    Sample ηiN(0,1)\eta_{i}\sim N(0,1) and compute yiy_{i} as

    yi=Aix+ηi.y_{i}=A_{i}x^{*}+\eta_{i}. (1)
  3.

    If y_{i}\in S, then return sample (A_{i},y_{i}). Otherwise restart the process from step 1.
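
For concreteness, the following sketch simulates this generative process. It is only an illustration: the function name truncated_samples and the membership oracle in_S are ours, and the rejection loop is of course slow when the survival probability is small.

```python
import numpy as np

def truncated_samples(x_star, m, in_S, seed=0):
    """Draw m samples (A_i, y_i) with A_i ~ N(0, I_n), y_i = A_i x* + eta_i,
    eta_i ~ N(0, 1), keeping a sample only when y_i lies in the survival set S."""
    rng = np.random.default_rng(seed)
    n = x_star.shape[0]
    rows, ys = [], []
    while len(ys) < m:
        a = rng.standard_normal(n)               # step 1: A_i ~ N(0,1)^n
        y = a @ x_star + rng.standard_normal()   # step 2: y_i = A_i x* + eta_i
        if in_S(y):                              # step 3: keep only if y_i in S
            rows.append(a)
            ys.append(y)
    return np.vstack(rows), np.array(ys)         # A in R^{m x n}, y in R^m
```

For instance, truncated_samples(x_star, 200, lambda t: t >= 0) simulates truncation to S=[0,\infty).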

We also briefly discuss the setting of arbitrary noise, in which ηi\eta_{i} may be arbitrary and we are interested in approximations to xx^{*} which have guarantees bounded in terms of η2\left\lVert\eta\right\rVert_{2}.

Together, mm samples define a pair (A,y)(A,y) where Am×nA\in\mathbb{R}^{m\times n} and ymy\in\mathbb{R}^{m}. We make the following assumptions about set SS.

Assumption I (Constant Survival Probability).

Taking expectation over vectors aN(0,1)na\sim N(0,1)^{n}, we have 𝔼γS(a,x)α\mathbb{E}\gamma_{S}(a,x^{*})\geq\alpha for a constant α>0\alpha>0.

Assumption II (Efficient Sampling).

There is a T(\gamma_{S}(t))-time algorithm which takes input t\in\mathbb{R} and produces an unbiased sample z\sim N(t,1;S).

We do not require that T(\cdot) is a constant, but it will affect the efficiency of our algorithm. To be precise, our algorithm will make \operatorname{poly}(n) queries to the sampling algorithm. As we explain in Lemma K.4 in Section K, if the set S is a union of r intervals \cup_{i=1}^{r}[a_{i},b_{i}], then Assumption II is satisfied with T(\gamma_{S}(t))=\operatorname{poly}(\log(1/\gamma_{S}(t)),r). We state the theorems below under the assumption that S is a union of r intervals, in which case the algorithms have polynomial running time, but in all the statements below this assumption can be replaced with the more general Assumption II, in which case the running time changes from \operatorname{poly}(n,r) to \operatorname{poly}(n,T(e^{m/\alpha})).
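
For intuition about how Assumption II can be met when S is a union of intervals, here is a minimal floating-point sketch of an inverse-CDF sampler (function and argument names are ours). It relies on scipy's normal CDF and quantile function, so unlike the arbitrary-precision procedure behind Lemma K.4 it is only reliable when the survival probability \gamma_{S}(t) is not vanishingly small.

```python
import numpy as np
from scipy.stats import norm

def sample_truncated_normal(t, intervals, rng):
    """Sample z ~ N(t, 1) conditioned on z lying in S = union of [a_i, b_i].
    Pick an interval with probability proportional to its Gaussian mass,
    then invert the standard normal CDF within that interval."""
    lo = np.array([a for a, _ in intervals], dtype=float)
    hi = np.array([b for _, b in intervals], dtype=float)
    mass = norm.cdf(hi - t) - norm.cdf(lo - t)            # Gaussian mass of each interval
    i = rng.choice(len(intervals), p=mass / mass.sum())   # mass.sum() = gamma_S(t)
    u = rng.uniform(norm.cdf(lo[i] - t), norm.cdf(hi[i] - t))
    return t + norm.ppf(u)
```

For example, with intervals = [(0.0, np.inf)] this draws from a normal with mean t truncated to the nonnegative reals.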

3 Statistically and computationally efficient recovery

In this section we formally state our main results for recovery of a sparse high-dimensional coefficient vector from truncated linear regression samples. In Section 3.1, we present our result under the standard assumption that the error distribution is Gaussian, whereas in Section 3.2 we present our results for the case of adversarial error.

3.1 Gaussian noise

In the setting of Gaussian noise (before truncation), we prove the following theorem.

Theorem 3.1.

Suppose that Assumption I holds, and that we have m samples (A_{i},y_{i}) generated from Process (1), with n\geq m\geq Mk\log n for a sufficiently large constant M. Then, there is an algorithm which outputs \bar{x} satisfying \left\lVert\bar{x}-x^{*}\right\rVert_{2}\leq O(\sqrt{(k\log n)/m}) with probability 1-O(1/\log n). Furthermore, if the survival set S is a union of r intervals, the running time of our algorithm is \operatorname{poly}(n,r).

From now on, we will use the term “with high probability” when the rate of decay is not of importance. This phrase means “with probability 1o(1)1-o(1) as nn\to\infty”.

Observe that even without the added difficulty of truncation (e.g. if S=S=\mathbb{R}), sparse linear regression requires Ω(klogn)\Omega(k\log n) samples by known information-theoretic arguments [26]. Thus, our sample complexity is information-theoretically optimal.

In one sentence, the algorithm optimizes the 1\ell_{1}-regularized sample negative log-likelihood via projected SGD. The negative log-likelihood of xnx\in\mathbb{R}^{n} for a single sample (a,y)(a,y) is

\operatorname{nll}(x;a,y)=\frac{1}{2}(a^{T}x-y)^{2}+\log\int_{S}e^{-(a^{T}x-z)^{2}/2}\,dz.

Given multiple samples (A,y), we can then write \operatorname{nll}(x;A,y)=\frac{1}{m}\sum_{j=1}^{m}\operatorname{nll}(x;A_{j},y_{j}). We also define the regularized negative log-likelihood f:\mathbb{R}^{n}\to\mathbb{R} by f(x)=\operatorname{nll}(x;A,y)+\lambda\left\lVert x\right\rVert_{1}. We claim that optimizing the following program approximately recovers the true parameter vector x^{*} with high probability, for sufficiently many samples and appropriate regularization \lambda:

\min_{x\in\mathbb{R}^{n}}\operatorname{nll}(x;A,y)+\lambda\left\lVert x\right\rVert_{1}. (2)
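
To make Program (2) concrete, here is a minimal sketch (the helper names are ours, not the paper's) that evaluates the truncated LASSO objective when S is a union of intervals; in that case the log-partition term has the closed form \log\int_{S}e^{-(a^{T}x-z)^{2}/2}\,dz=\log\big(\sqrt{2\pi}\sum_{i}[\Phi(b_{i}-a^{T}x)-\Phi(a_{i}-a^{T}x)]\big). The floating-point Gaussian mass can underflow under extreme truncation, in which case the arbitrary-precision treatment of Section K would be needed.

```python
import numpy as np
from scipy.stats import norm

def log_partition(mu, intervals):
    """log of the integral over S of exp(-(mu - z)^2 / 2) dz,
    for S a union of intervals [a_i, b_i]."""
    mass = sum(norm.cdf(b - mu) - norm.cdf(a - mu) for a, b in intervals)
    return 0.5 * np.log(2 * np.pi) + np.log(mass)

def truncated_lasso_objective(x, A, y, intervals, lam):
    """Objective of Program (2): average nll(x; A_j, y_j) plus the l1 penalty."""
    mu = A @ x  # predicted (pre-truncation) means A_j x
    nll = 0.5 * (mu - y) ** 2 + np.array([log_partition(t, intervals) for t in mu])
    return nll.mean() + lam * np.abs(x).sum()
```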

The first step is to show that any solution to Program (2) will be near the true solution xx^{*}. To this end, we prove the following theorem, which already shows that O(klogn)O(k\log n) samples are sufficient to solve the problem of statistical recovery of xx^{*}:

Proposition 3.2.

Suppose that Assumption I holds. There are constants \kappa, d, and \sigma (in the entirety of this paper, constants may depend on \alpha) with the following property. Suppose that m>\kappa\cdot k\log n, and let (A,y) be m samples drawn from Process (1). Let \hat{x} be any optimal solution to Program (2) with regularization constant \lambda=\sigma\sqrt{(\log n)/m}. Then \left\lVert\hat{x}-x^{*}\right\rVert_{2}\leq d\sqrt{(k\log n)/m} with high probability.

Then it remains to show that Program (2) can be solved efficiently.

Proposition 3.3.

Suppose that Assumption I holds and let (A,y)(A,y) be mm samples drawn from Process 1 and x^\hat{x} be any optimal solution to Program (2). There exists a constant MM such that if mMklognm\geq Mk\log n then there is an algorithm which outputs x¯n\bar{x}\in\mathbb{R}^{n} satisfying x¯x^2O((klogn)/m)\left\lVert\bar{x}-\hat{x}\right\rVert_{2}\leq O(\sqrt{(k\log n)/m}) with high probability. Furthermore, if the survival set SS is a union of rr intervals the running time of our algorithm is poly(n,r)\operatorname{poly}(n,r).

We present a more detailed description of the algorithm that we use in Section 4.

3.2 Adversarial noise

In the setting of arbitrary noise, optimizing negative log-likelihood no longer makes sense, and indeed our results from the setting of Gaussian noise no longer hold. However, we may apply results from compressed sensing which describe sufficient conditions on the measurement matrix for recovery to be possible in the face of adversarial error. We obtain the following theorem:

Theorem 3.4.

Suppose that Assumption I holds and let ϵ>0\epsilon>0. There are constants cc and MM such that if mMklognm\geq Mk\log n, Axy2ϵ\left\lVert Ax^{*}-y\right\rVert_{2}\leq\epsilon, and x^\hat{x} minimizes x1\left\lVert x\right\rVert_{1} in the region {xn:Axy2ϵ}\{x\in\mathbb{R}^{n}:\left\lVert Ax-y\right\rVert_{2}\leq\epsilon\}, then x^x2cϵ/m\left\lVert\hat{x}-x^{*}\right\rVert_{2}\leq c\epsilon/\sqrt{m}.

The proof is a corollary of our result that AA satisfies the Restricted Isometry Property from [6] with high probability even when we only observe truncated samples; see Corollary G.6 and the subsequent discussion in Section G.

The remainder of the paper is dedicated to the case where the noise is Gaussian before truncation.

4 The efficient estimation algorithm

Define \mathscr{E}_{r}=\{x\in\mathbb{R}^{n}:\left\lVert Ax-y\right\rVert_{2}\leq r\sqrt{m}\}. To solve Program (2), our algorithm is Projected Stochastic Gradient Descent (PSGD) with projection set \mathscr{E}_{r}, for an appropriate constant r (specified in Lemma 5.4). We pick an initial feasible point by computing

x^{(0)}=\operatorname*{argmin}_{x\in\mathscr{E}_{r}}\left\lVert x\right\rVert_{1}.

Subsequently, the algorithm performs N updates, where N=\operatorname{poly}(n). Define a random update to x^{(t)}\in\mathbb{R}^{n} as follows. Pick j\in[m] uniformly at random. Sample z^{(t)}\sim Z_{A_{j},x^{(t)}}. Then set

v^{(t)}:=A_{j}(z^{(t)}-y_{j})+\lambda\cdot\operatorname{sign}(x^{(t)})
w^{(t)}:=x^{(t)}-\sqrt{\frac{1}{nN}}\,v^{(t)};\qquad x^{(t+1)}:=\operatorname*{argmin}_{x\in\mathscr{E}_{r}}\left\lVert x-w^{(t)}\right\rVert_{2}.

Finally, the algorithm outputs \bar{x}=\frac{1}{N}\sum_{t=0}^{N-1}x^{(t)}.

See Section 5.2 for the motivation of this algorithm, and a proof sketch of correctness and efficiency. Section E contains a summary of the complete algorithm in pseudocode.
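
The following is a compact, illustrative rendering of the procedure above; it is not the paper's implementation (see Section E for the pseudocode and Section F for the efficient projection and sampling subroutines). It reuses sample_truncated_normal from the sketch in Section 2, and, purely for brevity, delegates the \ell_{1}-minimizing initialization and the Euclidean projection onto \mathscr{E}_{r} to the generic convex solver cvxpy.

```python
import numpy as np
import cvxpy as cp

def psgd_truncated_lasso(A, y, intervals, lam, r, N, seed=0):
    """Projected SGD on the truncated LASSO objective, following Section 4."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    radius = r * np.sqrt(m)

    def project(w):
        # Euclidean projection onto E_r = {x : ||Ax - y||_2 <= r sqrt(m)}
        x = cp.Variable(n)
        cp.Problem(cp.Minimize(cp.sum_squares(x - w)),
                   [cp.norm(A @ x - y, 2) <= radius]).solve()
        return x.value

    # Initialization: x^(0) = argmin_{x in E_r} ||x||_1
    x0 = cp.Variable(n)
    cp.Problem(cp.Minimize(cp.norm(x0, 1)),
               [cp.norm(A @ x0 - y, 2) <= radius]).solve()
    x = x0.value

    eta = np.sqrt(1.0 / (n * N))                  # step size sqrt(1/(nN)) from Section 4
    iterates = []
    for _ in range(N):
        iterates.append(x)
        j = rng.integers(m)                       # pick a sample uniformly at random
        z = sample_truncated_normal(A[j] @ x, intervals, rng)
        v = A[j] * (z - y[j]) + lam * np.sign(x)  # stochastic (sub)gradient v^(t)
        x = project(x - eta * v)                  # w^(t) -> x^(t+1)
    return np.mean(iterates, axis=0)              # x_bar = average of x^(0),...,x^(N-1)
```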

5 Overview of proofs and techniques

This section outlines our techniques. The first step is proving Proposition 3.2. The second step is proving Proposition 3.3, by showing that the algorithm described in Section 4 efficiently recovers an approximate solution to Program 2.

5.1 Statistical recovery

Our approach to proving Proposition 3.2 is the Primal-Dual Witness (PDW) method introduced in [26]. Namely, we are interested in showing that the solution of Program (2) is near xx^{*} with high probability. Let UU be the (unknown) support of the true parameter vector xx^{*}. Define the following (hypothetical) program in which the solution space is restricted to vectors with support in UU:

argminxn:supp(x)Unll(x;A,y)+λx1\operatorname*{argmin}_{x\in\mathbb{R}^{n}:\operatorname{supp}(x)\subseteq U}\operatorname{nll}(x;A,y)+\lambda\left\lVert x\right\rVert_{1} (3)

In the untruncated setting, the PDW method is to apply the subgradient optimality condition to the solution x˘\breve{x} of this restricted program, which is by definition sparse. Proving that x˘\breve{x} satisfies subgradient optimality for the original program implies that the original program has a sparse solution x˘\breve{x}, and under mild extra conditions x˘\breve{x} is the unique solution. Thus, the original program recovers the true basis. We use the PDW method for a slightly different purpose; we apply subgradient optimality to x˘\breve{x} to show that x˘x2\left\lVert\breve{x}-x^{*}\right\rVert_{2} is small, and then use this to prove that x˘\breve{x} solves the original program.

Truncation introduces its own challenges. While Program 2 is still convex [9], it is much more convoluted than ordinary least squares. In particular, the gradient and Hessian of the negative log-likelihood have the following form (see Section C for the proof).

Lemma 5.1.

For all (A,y), the gradient of the negative log-likelihood is \nabla\operatorname{nll}(x;A,y)=\frac{1}{m}\sum_{j=1}^{m}A_{j}^{T}(\mathbb{E}Z_{A_{j},x}-y_{j}). The Hessian is H(x;A,y)=\frac{1}{m}\sum_{j=1}^{m}A_{j}^{T}A_{j}\operatorname{Var}(Z_{A_{j},x}).
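
As a quick sanity check (our remark, not from the paper's text): when there is no truncation, i.e. S=\mathbb{R}, we have \mathbb{E}Z_{A_{j},x}=A_{j}x and \operatorname{Var}(Z_{A_{j},x})=1, so these expressions reduce to the familiar least-squares quantities

\nabla\operatorname{nll}(x;A,y)=\frac{1}{m}A^{T}(Ax-y),\qquad H(x;A,y)=\frac{1}{m}A^{T}A.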

We now state the two key facts which make the PDW method work. First, the solution x˘\breve{x} to the restricted program must have a zero subgradient in all directions in U\mathbb{R}^{U}. Second, if this subgradient can be extended to all of n\mathbb{R}^{n}, then x˘\breve{x} is optimal for the original program. Formally:

Lemma 5.2.

Fix any (A,y)(A,y). Let x˘\breve{x} be an optimal solution to Program 3.

  (a)

    There is some z˘UU\breve{z}_{U}\in\mathbb{R}^{U} such that z˘U1\left\lVert\breve{z}_{U}\right\rVert_{\infty}\leq 1 and λz˘U=1mAUT(𝔼ZA,x˘y).-\lambda\breve{z}_{U}=\frac{1}{m}A_{U}^{T}(\mathbb{E}Z_{A,\breve{x}}-y).

  (b)

    Extend z˘U\breve{z}_{U} to n\mathbb{R}^{n} by defining λz˘Uc=1mAUcT(𝔼ZA,x˘y).-\lambda\breve{z}_{U^{c}}=\frac{1}{m}A_{U^{c}}^{T}(\mathbb{E}Z_{A,\breve{x}}-y). If z˘Uc<1\left\lVert\breve{z}_{U^{c}}\right\rVert_{\infty}<1, and AUTAUA_{U}^{T}A_{U} is invertible, then x˘\breve{x} is the unique optimal solution to Program 2.

See Section C for the proof. The utility of this lemma is in reducing Proposition 3.2 to showing that with high probability over (A,y)(A,y), the following conditions both hold:

  1.

    x˘x\left\lVert\breve{x}-x^{*}\right\rVert is small (Theorem A.7).

  2.

    z˘Uc<1\left\lVert\breve{z}_{U^{c}}\right\rVert_{\infty}<1 (Theorem B.3) and AUTAUA_{U}^{T}A_{U} is invertible (corollary of Theorem G.1).

In Section A, we prove Condition (1) by dissecting the subgradient optimality condition. In Section B we then prove Condition (2) and complete the proof of Proposition 3.2.

5.2 Computational recovery

For Proposition 3.3, we want to solve Program 2, i.e. minimize f(x)=nll(x;A,y)+λx1.f(x)=\operatorname{nll}(x;A,y)+\lambda\left\lVert x\right\rVert_{1}. The gradient of nll(x;A,y)\operatorname{nll}(x;A,y) doesn’t have a closed form, but it can be written cleanly as an expectation:

nll(x;A,y)=1mj=1mAjT(𝔼ZAj,xyj).\nabla\operatorname{nll}(x;A,y)=\frac{1}{m}\sum_{j=1}^{m}A_{j}^{T}(\mathbb{E}Z_{A_{j},x}-y_{j}).

Let us assume that ZAj,xZ_{A_{j},x} can be sampled efficiently. Then we may hope to optimize f(x)f(x) by stochastic gradient descent. But problematically, in our high-dimensional setting ff is nowhere strongly convex. So while we can apply the following general result from convex optimization, it has several strings attached:

Theorem 5.3.

Let f:nf:\mathbb{R}^{n}\to\mathbb{R} be a convex function achieving its optimum at x˘n\breve{x}\in\mathbb{R}^{n}. Let x(0),x(1),,x(N)x^{(0)},x^{(1)},\dots,x^{(N)} be a sequence of random vectors in n\mathbb{R}^{n}. Suppose that x(i+1)=x(i)ηv(i)x^{(i+1)}=x^{(i)}-\eta v^{(i)} where 𝔼[v(i)|x(i)]f(x(i))\mathbb{E}[v^{(i)}|x^{(i)}]\in\partial f(x^{(i)}). Set x¯=1Ni=1Nx(i)\bar{x}=\frac{1}{N}\sum_{i=1}^{N}x^{(i)}. Then

\mathbb{E}[f(\bar{x})]-f(\breve{x})\leq(\eta N)^{-1}\mathbb{E}\left[\left\lVert x^{(0)}-\breve{x}\right\rVert_{2}^{2}\right]+\eta N^{-1}\sum_{i=1}^{N}\mathbb{E}\left[\left\lVert v^{(i)}\right\rVert_{2}^{2}\right].

In particular, to apply this result, we need to solve three technical problems:

  1.

    We need to efficiently find an initial point x(0)x^{(0)} with bounded distance from x˘\breve{x}.

  2.

    The gradient does not have bounded norm for arbitrary xnx\in\mathbb{R}^{n}. Thus, we need to pick a projection set in which the bound holds.

  3.

    Since ff is not strongly convex, we need to convert the bound on f(x¯)f(x˘)f(\bar{x})-f(\breve{x}) into a bound on x¯x˘2\left\lVert\bar{x}-\breve{x}\right\rVert_{2}.

As defined in Section 4, our solution is the projection set r={xn:Axy2rm}\mathscr{E}_{r}=\{x\in\mathbb{R}^{n}:\left\lVert Ax-y\right\rVert_{2}\leq r\sqrt{m}\}, for an appropriate constant r>0r>0. To pick an initial point in r\mathscr{E}_{r}, we solve the program x(0)=argminxrx1.x^{(0)}=\operatorname*{argmin}_{x\in\mathscr{E}_{r}}\left\lVert x\right\rVert_{1}. This estimate is biased due to the truncation, but the key point is that by classical results from compressed sensing, it has bounded distance from xx^{*} (and therefore from x˘\breve{x}).

The algorithm then consists of projected stochastic gradient descent with projection set r\mathscr{E}_{r}. To bound the number of update steps required for the algorithm to converge to a good estimate of x˘\breve{x}, we need to solve several statistical problems (which are direct consequences of assumptions in Theorem 5.3).

Properties of r\mathscr{E}_{r}.

First, x˘\breve{x} must be feasible and a bounded distance from the initial point (due to high-dimensionality, r\mathscr{E}_{r} is unbounded, so this is not immediate). The following lemmas formalize this; see Sections J.11 and J.12 for the respective proofs. Lemma 5.4 specifies the choice of rr.

Lemma 5.4.

With high probability, x˘r\breve{x}\in\mathscr{E}_{r} for an appropriate constant r>0r>0.

Lemma 5.5.

With high probability, x(0)x˘2O(1)\left\lVert x^{(0)}-\breve{x}\right\rVert_{2}\leq O(1).

Second, the SGD updates at points within r\mathscr{E}_{r} must be unbiased estimates of the gradient, with bounded square-norm in expectation. The following lemma shows that the updates v(t)v^{(t)} defined in Section 4 satisfy this property. See Section J.13 for the proof.

Lemma 5.6.

With high probability over AA, the following statement holds. Let 0t<T0\leq t<T. Then 𝔼[v(t)|x(t)]f(x(t))\mathbb{E}[v^{(t)}|x^{(t)}]\in\partial f(x^{(t)}), and 𝔼[v(t)22]O(n)\mathbb{E}[\left\lVert v^{(t)}\right\rVert_{2}^{2}]\leq O(n).
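
To see how these pieces feed into Theorem 5.3, here is a rough back-of-the-envelope calculation (our summary; the precise argument, which also accounts for the projection step, is in Section J.16): with the step size \eta=\sqrt{1/(nN)} from Section 4, Lemma 5.5 controls the first term of the bound and Lemma 5.6 the second, so

\mathbb{E}[f(\bar{x})]-f(\breve{x})\leq\frac{1}{\eta N}\cdot O(1)+\frac{\eta}{N}\cdot N\cdot O(n)=O\left(\sqrt{\frac{n}{N}}\right),

which can be driven below any desired inverse polynomial by taking N=\operatorname{poly}(n) large enough. Converting this guarantee on the objective value into a guarantee on \left\lVert\bar{x}-\breve{x}\right\rVert_{2} is exactly the role of the next two lemmas.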

Addressing the lack of strong convexity.

Third, we need to show that the algorithm converges in parameter space and not just in loss. That is, if f(x)f(x˘)f(x)-f(\breve{x}) is small, then we want to show that xx˘2\left\lVert x-\breve{x}\right\rVert_{2} is small as well. Note that ff is not strongly convex even in r\mathscr{E}_{r}, due to the high dimension. So we need a more careful approach. In the subspace U\mathbb{R}^{U}, ff is indeed strongly convex near x˘\breve{x}, as shown in the following lemma (proof in Section J.14):

Lemma 5.7.

There is a constant ζ\zeta such that with high probability over AA, f(x)f(x˘)ζ2xx˘22f(x)-f(\breve{x})\geq\frac{\zeta}{2}\left\lVert x-\breve{x}\right\rVert_{2}^{2} for all xnx\in\mathbb{R}^{n} with supp(x)U\operatorname{supp}(x)\subseteq U and xx˘21\left\lVert x-\breve{x}\right\rVert_{2}\leq 1.

But we need a bound for all n\mathbb{R}^{n}. The idea is to prove a lower bound on f(x)f(x˘)f(x)-f(\breve{x}) for xx near x˘\breve{x}, and then use convexity to scale the bound linearly in xx˘2\left\lVert x-\breve{x}\right\rVert_{2}. The above lemma provides a lower bound for xx near x˘\breve{x} if supp(x)U\operatorname{supp}(x)\subseteq U, and we need to show that adding an Uc\mathbb{R}^{U^{c}}-component to xx increases ff proportionally. This is precisely the content of Theorem B.4. Putting these pieces together we obtain the following lemma. See Section J.15 for the full proof.

Lemma 5.8.

There are constants cf>0c_{f}>0 and cfc^{\prime}_{f} such that with high probability over AA the following holds. Let xnx\in\mathbb{R}^{n}. If f(x)f(x˘)cf(logn)/m3f(x)-f(\breve{x})\leq c_{f}(\log n)/m^{3}, then

f(x)f(x˘)cflognmxx˘22.f(x)-f(\breve{x})\geq c^{\prime}_{f}\frac{\log n}{m}\left\lVert x-\breve{x}\right\rVert_{2}^{2}.

Convergence of PSGD.

It now follows from the above lemmas and Theorem 5.3 that the PSGD algorithm, as outlined above and described in Section 4, converges to a good approximation of x˘\breve{x} in a polynomial number of updates. The following theorem formalizes the guarantee. See Section J.16 for the proof.

Theorem 5.9.

With high probability over AA and over the execution of the algorithm, we get

x¯x˘2O((klogn)/m).\left\lVert\bar{x}-\breve{x}\right\rVert_{2}\leq O(\sqrt{(k\log n)/m}).

Efficient implementation.

Finally, in Section F we prove that initialization and each update step is efficient. Efficient gradient estimation in the projection set (i.e. sampling ZAj,xZ_{A_{j},x}) does not follow from the prior work, since our projection set is by necessity laxer than that of the prior work [9]. So we replace their rejection sampling procedure with a novel approximate sampling procedure under mild assumptions about the truncation set. Together with the convergence bound claimed in Theorem 5.9, these prove Proposition 3.3.

Proof of Proposition 3.3.

The correctness guarantee follows from Theorem 5.9. For the efficiency guarantee, note that the algorithm performs initialization and then N=poly(n)N=\operatorname{poly}(n) update steps. By Section F, the initialization takes poly(n)\operatorname{poly}(n) time, and each update step takes poly(n)+T(eΘ(m))\operatorname{poly}(n)+T(e^{-\Theta(m)}). This implies the desired bounds on overall time complexity. ∎

Acknowledgments

This research was supported by NSF Awards IIS-1741137, CCF-1617730 and CCF-1901292, by a Simons Investigator Award, by the DOE PhILMs project (No. DE-AC05-76RL01830), by the DARPA award HR00111990021, by the MIT Undergraduate Research Opportunities Program, and by a Google PhD Fellowship.

References

  • [1] Takeshi Amemiya. Regression analysis when the dependent variable is truncated normal. Econometrica: Journal of the Econometric Society, pages 997–1016, 1973.
  • [2] N Balakrishnan and Erhard Cramer. The art of progressive censoring. Springer, 2014.
  • [3] Daniel Bernoulli. Essai d’une nouvelle analyse de la mortalité causée par la petite vérole, et des avantages de l’inoculation pour la prévenir. Histoire de l’Acad., Roy. Sci.(Paris) avec Mem, pages 1–45, 1760.
  • [4] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • [5] Richard Breen et al. Regression models: Censored, sample selected, or truncated data, volume 111. Sage, 1996.
  • [6] Emmanuel J. Candés, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
  • [7] Sylvain Chevillard. The functions erf and erfc computed with arbitrary precision and explicit error bounds. Information and Computation, 216:72–95, 2012.
  • [8] A Clifford Cohen. Truncated and censored samples: theory and applications. CRC press, 2016.
  • [9] Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Computationally and Statistically Efficient Truncated Regression. In the 32nd Conference on Learning Theory (COLT), 2019.
  • [10] Mark A Davenport, Jason N Laska, Petros T Boufounos, and Richard G Baraniuk. A simple proof that random matrices are democratic. arXiv preprint arXiv:0911.0736, 2009.
  • [11] David L Donoho. For most large underdetermined systems of linear equations the minimal 1\ell_{1}-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(6):797–829, 2006.
  • [12] RA Fisher. Properties and applications of Hh functions. Mathematical tables, 1:815–852, 1931.
  • [13] Francis Galton. An examination into the registered speeds of american trotting horses, with remarks on their value as hereditary data. Proceedings of the Royal Society of London, 62(379-387):310–315, 1897.
  • [14] Vassilis A Hajivassiliou and Daniel L McFadden. The method of simulated scores for the estimation of ldv models. Econometrica, pages 863–896, 1998.
  • [15] Jerry A Hausman and David A Wise. Social experimentation, truncated distributions, and efficient estimation. Econometrica: Journal of the Econometric Society, pages 919–938, 1977.
  • [16] Michael P Keane. 20 simulation estimation for panel data models with limited dependent variables. 1993.
  • [17] Jason N Laska. Democracy in action: Quantization, saturation, and compressive sensing. PhD thesis, 2010.
  • [18] Gangadharrao S Maddala. Limited-dependent and qualitative variables in econometrics. Number 3. Cambridge university press, 1986.
  • [19] Karl Pearson. On the systematic fitting of frequency curves. Biometrika, 2:2–7, 1902.
  • [20] Karl Pearson and Alice Lee. On the generalised probable error in multiple normal correlation. Biometrika, 6(1):59–68, 1908.
  • [21] Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathematicians 2010 (ICM 2010) (In 4 Volumes) Vol. I: Plenary Lectures and Ceremonies Vols. II–IV: Invited Lectures, pages 1576–1602. World Scientific, 2010.
  • [22] Helmut Schneider. Truncated and censored samples from normal populations. Marcel Dekker, Inc., 1986.
  • [23] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [24] James Tobin. Estimation of relationships for limited dependent variables. Econometrica: journal of the Econometric Society, pages 24–36, 1958.
  • [25] Vladislav Voroninski and Zhiqiang Xu. A strong restricted isometry property, with an application to phaseless compressed sensing. Applied and Computational Harmonic Analysis, 40(2):386 – 395, 2016.
  • [26] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5):2183–2202, 2009.

Appendix A Bounding solution of restricted program

In this section we prove that x˘x2\left\lVert\breve{x}-x^{*}\right\rVert_{2} is small with high probability, where x˘\breve{x} is a solution to Program 3. Specifically, we use regularization parameter λ=Θ((logn)/m)\lambda=\Theta(\sqrt{(\log n)/m}), and prove that x˘x2O((klogn)/m)\left\lVert\breve{x}-x^{*}\right\rVert_{2}\leq O(\sqrt{(k\log n)/m}).

The proof is motivated by the following rephrasing of part (a) of Lemma 5.2:

λz˘U=1mAUT(𝔼[ZA,x˘]𝔼[ZA,x])+1mAUT(𝔼[ZA,x]y)-\lambda\breve{z}_{U}=\frac{1}{m}A_{U}^{T}(\mathbb{E}[Z_{A,\breve{x}}]-\mathbb{E}[Z_{A,x^{*}}])+\frac{1}{m}A_{U}^{T}(\mathbb{E}[Z_{A,x^{*}}]-y) (4)

where z˘U1\left\lVert\breve{z}_{U}\right\rVert_{\infty}\leq 1. For intuition, consider the untruncated setting: then 𝔼[Zt]=t\mathbb{E}[Z_{t}]=t, so the equation is simply

λz˘U=1mAUTAU(x˘UxU)1mAUTw-\lambda\breve{z}_{U}=\frac{1}{m}A_{U}^{T}A_{U}(\breve{x}_{U}-x^{*}_{U})-\frac{1}{m}A_{U}^{T}w

where w\sim N(0,1)^{m}. Since w is independent of A_{U}^{T} and has squared norm \Theta(m), each entry of A_{U}^{T}w is Gaussian with variance \Theta(m), so \frac{1}{m}A_{U}^{T}w has norm \Theta(\sqrt{k/m}). Additionally, \left\lVert\lambda\breve{z}_{U}\right\rVert_{2}\leq\lambda\sqrt{k}=O(\sqrt{(k\log n)/m}). Finally, \frac{1}{m}A_{U}^{T}A_{U} is a \Theta(1)-isometry, so we get the desired bound on \breve{x}_{U}-x^{*}_{U}.

Returning to the truncated setting, one bound still holds, namely λz˘U2λk\left\lVert\lambda\breve{z}_{U}\right\rVert_{2}\leq\lambda\sqrt{k}. The remainder of the above sketch breaks down for two reasons. First, 𝔼[ZA,x]y\mathbb{E}[Z_{A,x^{*}}]-y is no longer independent of AA. Second, bounding 1mAUT(𝔼[ZA,x˘]𝔼[ZA,x])\frac{1}{m}A_{U}^{T}(\mathbb{E}[Z_{A,\breve{x}}]-\mathbb{E}[Z_{A,x^{*}}]) no longer implies a bound on x˘UxU\breve{x}_{U}-x^{*}_{U}.

The first problem is not so hard to work around; we can still bound AUT(𝔼[ZA,x]y)A_{U}^{T}(\mathbb{E}[Z_{A,x^{*}}]-y) as follows; see Section J.1 for the proof.

Lemma A.1.

With high probability over AA and yy, AUT(𝔼[ZA,x]y)22α1kmlogn.\left\lVert A_{U}^{T}(\mathbb{E}[Z_{A,x^{*}}]-y)\right\rVert^{2}_{2}\leq\alpha^{-1}km\log n.

So in equation 4, the last term is O((klogn)/m)O(\sqrt{(k\log n)/m}) with high probability. The first term is always O((klogn)/m)O(\sqrt{(k\log n)/m}), since z˘U2k\left\lVert\breve{z}_{U}\right\rVert_{2}\leq\sqrt{k}. So we know that 1mAUT(𝔼[ZA,x˘]𝔼[ZA,x])\frac{1}{m}A_{U}^{T}(\mathbb{E}[Z_{A,\breve{x}}]-\mathbb{E}[Z_{A,x^{*}}]) has small norm. Unfortunately this does not imply that 𝔼[ZA,x˘]𝔼[ZA,x]\mathbb{E}[Z_{A,\breve{x}}]-\mathbb{E}[Z_{A,x^{*}}] has small norm, but as motivation, assume that we have such a bound.

Since AUA_{U} is a Θ(m)\Theta(\sqrt{m})-isometry, bounding x˘x\breve{x}-x^{*} is equivalent to bounding Ax˘AxA\breve{x}-Ax^{*}. To relate this quantity to 𝔼[ZA,x˘]𝔼[ZA,x]\mathbb{E}[Z_{A,\breve{x}}]-\mathbb{E}[Z_{A,x^{*}}], our approach is to lower bound the derivative of μt=𝔼[Zt]\mu_{t}=\mathbb{E}[Z_{t}] with respect to tt. The derivative turns out to have the following elegant form (proof in Section J.2):

Lemma A.2.

For any t\in\mathbb{R}, \frac{d}{dt}\mu_{t}=\operatorname{Var}(Z_{t}).
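
A short calculation behind this identity (our sketch; the full proof is in Section J.2): write \varphi for the standard normal density, D(t)=\int_{S}\varphi(z-t)\,dz, and N(t)=\int_{S}z\,\varphi(z-t)\,dz, so that \mu_{t}=N(t)/D(t). Since \frac{\partial}{\partial t}\varphi(z-t)=(z-t)\varphi(z-t), we get D^{\prime}(t)=D(t)(\mu_{t}-t) and N^{\prime}(t)=D(t)\left(\mathbb{E}[Z_{t}^{2}]-t\mu_{t}\right), hence

\frac{d}{dt}\mu_{t}=\frac{N^{\prime}(t)D(t)-N(t)D^{\prime}(t)}{D(t)^{2}}=\left(\mathbb{E}[Z_{t}^{2}]-t\mu_{t}\right)-\mu_{t}(\mu_{t}-t)=\mathbb{E}[Z_{t}^{2}]-\mu_{t}^{2}=\operatorname{Var}(Z_{t}).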

Crucially, Var(Zt)\operatorname{Var}(Z_{t}) is nonnegative, and relates to survival probability. By integrating a lower bound on the derivative, we get the following lower bound on μtμt\mu_{t}-\mu_{t^{*}} in terms of ttt-t^{*}. The bound is linear for small |tt||t-t^{*}|, but flattens out as |tt||t-t^{*}| grows. See Section J.3 for the proof.

Lemma A.3.

Let t,tt,t^{*}\in\mathbb{R}. Then sign(μtμt)=sign(tt)\operatorname{sign}(\mu_{t}-\mu_{t^{*}})=\operatorname{sign}(t-t^{*}). Additionally, for any constant β>0\beta>0 there is a constant c=c(β)>0c=c(\beta)>0 such that if γS(t)β\gamma_{S}(t^{*})\geq\beta, then |μtμt|cmin(1,|tt|).|\mu_{t}-\mu_{t^{*}}|\geq c\min(1,|t-t^{*}|).

If we want to use this lemma to prove that 𝔼[ZA,x˘]𝔼[ZA,x]2\left\lVert\mathbb{E}[Z_{A,\breve{x}}]-\mathbb{E}[Z_{A,x^{*}}]\right\rVert_{2} is at least a constant multiple of A(x˘x)2\left\lVert A(\breve{x}-x^{*})\right\rVert_{2}, we face two obstacles: (1) γS(Ajx)\gamma_{S}(A_{j}x^{*}) may not be large for all jj, and (2) the lemma only gives linear scaling if |Aj(x˘x)|=O(1)|A_{j}(\breve{x}-x^{*})|=O(1): but this is essentially what we’re trying to prove!

To deal with obstacle (1), we restrict to the rows j[m]j\in[m] for which γS(Ajx)\gamma_{S}(A_{j}x^{*}) is large. To deal with obstacle (2), we have a two-step proof. In the first step, we use the Ω(1)\Omega(1)-lower bound provided by Lemma A.3 to show that A(x˘x)2=O(m)\left\lVert A(\breve{x}-x^{*})\right\rVert_{2}=O(\sqrt{m}) (so that |Aj(x˘x)|=O(1)|A_{j}(\breve{x}-x^{*})|=O(1) on average). In the second step, we use this to get linear scaling in Lemma A.3, and complete the proof, showing that A(x˘x)2=O(klogn)\left\lVert A(\breve{x}-x^{*})\right\rVert_{2}=O(\sqrt{k\log n}).

Formally, define IgoodI_{\text{good}} to be the set of indices j[m]j\in[m] such that γS(Ajx)α/2\gamma_{S}(A_{j}x^{*})\geq\alpha/2 and |AjxAjx˘|2(6/(αm))AxAx˘2|A_{j}x^{*}-A_{j}\breve{x}|^{2}\leq(6/(\alpha m))\left\lVert Ax^{*}-A\breve{x}\right\rVert^{2}. In the following lemmas we show that IgoodI_{\text{good}} contains a constant fraction of the indices, so by the isometry properties we retain a constant fraction of A(x˘x)2\left\lVert A(\breve{x}-x^{*})\right\rVert_{2} when restricting to IgoodI_{\text{good}}. See Appendices J.4 and J.5 for the proofs of Lemmas A.4 and A.5 respectively.

Lemma A.4.

With high probability, |Igood|(α/6)m|I_{\text{good}}|\geq(\alpha/6)m.

Lemma A.5.

For some constant ϵ>0\epsilon>0, we have that with high probability, AxAx˘Igood2ϵAxAx˘2\left\lVert Ax^{*}-A\breve{x}\right\rVert_{I_{\text{good}}}^{2}\geq\epsilon\left\lVert Ax^{*}-A\breve{x}\right\rVert^{2}.

We now prove the weaker, first-step bound on \left\lVert A(\breve{x}-x^{*})\right\rVert_{2}. But there is one glaring issue we must address: we made a simplifying assumption that \left\lVert\mathbb{E}[Z_{A,\breve{x}}]-\mathbb{E}[Z_{A,x^{*}}]\right\rVert is small. All we actually know is that \left\lVert A_{U}^{T}(\mathbb{E}[Z_{A,\breve{x}}]-\mathbb{E}[Z_{A,x^{*}}])\right\rVert_{2} is small. And A_{U}^{T} has a nontrivial null space.

Here is a sketch of how we resolve this issue. Let a=A(x˘x)a=A(\breve{x}-x^{*}) and b=μAx˘μAxb=\mu_{A\breve{x}}-\mu_{Ax^{*}}; we want to show that if a\left\lVert a\right\rVert is large then AUTb\left\lVert A_{U}^{T}b\right\rVert is large. Geometrically, AUTb\left\lVert A_{U}^{T}b\right\rVert is approximately proportional to the distance from bb to the subspace Null(AUT)\text{Null}(A_{U}^{T}). Oversimplifying for clarity, we know that |bj|c|aj||b_{j}|\geq c|a_{j}| for all jj. This is by itself insufficient. The key observation is that we also know sign(aj)=sign(bj)\operatorname{sign}(a_{j})=\operatorname{sign}(b_{j}) for all jj. Thus, bb lies in a hyperoctant shifted to have corner caca. Since caca lies in the row space of AUTA_{U}^{T}, it’s perpendicular to Null(AUT)\text{Null}(A_{U}^{T}), so the closest point to Null(AUT)\text{Null}(A_{U}^{T}) in the shifted hyperoctant should be caca.

Formalizing this geometric intuition yields the last piece of the proofs of the following theorems. See Section J.6 for the full proofs.

Theorem A.6.

There are positive constants creg=creg(α)c^{\prime}_{\text{reg}}=c^{\prime}_{\text{reg}}(\alpha), M=M(α)M^{\prime}=M^{\prime}(\alpha), and C=C(α)C^{\prime}=C^{\prime}(\alpha) with the following property. Suppose that λcreg/k\lambda\leq c^{\prime}_{\text{reg}}/\sqrt{k} and mMklognm\geq M^{\prime}k\log n. Then with high probability, AUxAUx˘2Cm\left\lVert A_{U}x^{*}-A_{U}\breve{x}\right\rVert_{2}\leq C^{\prime}\sqrt{m}.

Theorem A.7.

There are positive constants creg′′=creg′′(α)c^{\prime\prime}_{\text{reg}}=c^{\prime\prime}_{\text{reg}}(\alpha), M′′=M′′(α)M^{\prime\prime}=M^{\prime\prime}(\alpha), and C′′=C′′(α)C^{\prime\prime}=C^{\prime\prime}(\alpha) with the following property. Suppose that λcreg′′/k\lambda\leq c^{\prime\prime}_{\text{reg}}/\sqrt{k} and mM′′klognm\geq M^{\prime\prime}k\log n. Then xx˘2C′′(λk+(klogn)/m)\left\lVert x^{*}-\breve{x}\right\rVert_{2}\leq C^{\prime\prime}(\lambda\sqrt{k}+\sqrt{(k\log n)/m}) with high probability.

Appendix B Proof of statistical recovery

Extend z˘\breve{z} to n\mathbb{R}^{n} by defining

z˘Uc=1λmAUcT(𝔼ZA,x˘y).\breve{z}_{U^{c}}=-\frac{1}{\lambda m}A_{U^{c}}^{T}(\mathbb{E}Z_{A,\breve{x}}-y).

We would like to show that zUc<1\left\lVert z_{U^{c}}\right\rVert_{\infty}<1. Since AUcTA_{U^{c}}^{T} is independent of 𝔼[ZA,x˘]y\mathbb{E}[Z_{A,\breve{x}}]-y, each entry of AUcT(𝔼[ZA,x˘]y)A_{U^{c}}^{T}(\mathbb{E}[Z_{A,\breve{x}}]-y) is Gaussian with standard deviation 𝔼[ZA,x˘]y2\left\lVert\mathbb{E}[Z_{A,\breve{x}}]-y\right\rVert_{2}. It turns out that a bound of O(λkm+m)O(\lambda\sqrt{km}+\sqrt{m}) suffices. To get this bound, we decompose

𝔼[ZA,x˘]y=A(x˘x)+𝔼RA,x˘(yAx)\mathbb{E}[Z_{A,\breve{x}}]-y=A(\breve{x}-x^{*})+\mathbb{E}R_{A,\breve{x}}-(y-Ax^{*})

and bound each term separately. Here we are defining Rt=ZttR_{t}=Z_{t}-t, and Ra,x=Za,xaTxR_{a,x}=Z_{a,x}-a^{T}x and RA,x=ZA,xAxR_{A,x}=Z_{A,x}-Ax similarly.

We present the proof of the following lemmas in Section J.7 and Section J.8 respectively.

Lemma B.1.

There is a constant c=c(α)c=c(\alpha) such that under the conditions of Theorem A.7, with high probability over (A,y)(A,y), 𝔼[RA,x˘]22cm.\left\lVert\mathbb{E}[R_{A,\breve{x}}]\right\rVert_{2}^{2}\leq cm.

Lemma B.2.

There is a constant cy=cy(α)c_{y}=c_{y}(\alpha) such that RA,x22cym\left\lVert R_{A,x^{*}}\right\rVert_{2}^{2}\leq c_{y}m with high probability.

Combining the above lemmas with the bound on x˘x2\left\lVert\breve{x}-x^{*}\right\rVert_{2} from the previous section, we get the desired theorem. See Section J.9 for the full proof.

Theorem B.3.

There are constants M=M(α)M=M(\alpha), σ=σ(α)\sigma=\sigma(\alpha), and d=d(α)d=d(\alpha) with the following property. Suppose mMklognm\geq Mk\log n, and λ=σ(logn)/m\lambda=\sigma\sqrt{(\log n)/m}. Then with high probability we have z˘Uc<1\left\lVert\breve{z}_{U^{c}}\right\rVert_{\infty}<1.

As an aside that we’ll use later, this proof can be extended to any random vector near x˘\breve{x} with support contained in UU (proof in Section J.10).

Theorem B.4.

There are constants M=M(α)M=M(\alpha), σ=σ(α)\sigma=\sigma(\alpha), and d=d(α)d=d(\alpha) with the following property. Suppose mMklognm\geq Mk\log n and λ=σ(logn)/m\lambda=\sigma\sqrt{(\log n)/m}. If XnX\in\mathbb{R}^{n} is a random variable with supp(X)U\operatorname{supp}(X)\subseteq U always, and x˘X21/m\left\lVert\breve{x}-X\right\rVert_{2}\leq 1/m with high probability, then with high probability 1mAUc(𝔼ZA,Xy)λ/2\left\lVert\frac{1}{m}A_{U^{c}}(\mathbb{E}Z_{A,X}-y)\right\rVert_{\infty}\leq\lambda/2.

Returning to the goal of this section, it remains to show that AUTAUA_{U}^{T}A_{U} is invertible with high probability. But this follows from the isometry guarantee of Theorem G.1. Our main statistical result, Proposition 3.2, now follows.

Proof of Proposition 3.2.

Take MM, σ\sigma, and dd as in the statement of Theorem B.3. Let mMklognm\geq Mk\log n and λ=σ(logn)/m\lambda=\sigma\sqrt{(\log n)/m}. Let x^n\hat{x}\in\mathbb{R}^{n} be any optimal solution to the regularized program, and let x˘U\breve{x}\in\mathbb{R}^{U} be any solution to the restricted program. By Theorem B.3, with high probability we have xx˘d(klogn)/m\left\lVert x^{*}-\breve{x}\right\rVert\leq d\sqrt{(k\log n)/m} and z˘Uc<1\left\lVert\breve{z}_{U^{c}}\right\rVert<1; and by Theorem G.1, AUTAUA_{U}^{T}A_{U} is invertible. So by Lemma 5.2, it follows that x˘=x^\breve{x}=\hat{x}. Therefore xx^d(klogn)/m\left\lVert x^{*}-\hat{x}\right\rVert\leq d\sqrt{(k\log n)/m}. ∎

Appendix C Primal-dual witness method

Proof of Lemma 5.1.

For a single sample (Aj,yj)(A_{j},y_{j}), the partial derivative in direction xix_{i} is

xinll(x;Aj,yj)\displaystyle\frac{\partial}{\partial x_{i}}\operatorname{nll}(x;A_{j},y_{j}) =Aji(Ajxy)+xiSe(Ajxz)2/2𝑑zSe(Ajxz)2/2𝑑z\displaystyle=A_{ji}(A_{j}x-y)+\frac{\frac{\partial}{\partial x_{i}}\int_{S}e^{-(A_{j}x-z)^{2}/2}\,dz}{\int_{S}e^{-(A_{j}x-z)^{2}/2}\,dz}
=Aji(Ajxy)SAji(Ajxz)e(Ajxz)2/2𝑑zSe(Ajxz)2/2𝑑z\displaystyle=A_{ji}(A_{j}x-y)-\frac{\int_{S}A_{ji}(A_{j}x-z)e^{-(A_{j}x-z)^{2}/2}\,dz}{\int_{S}e^{-(A_{j}x-z)^{2}/2}\,dz}
=Aji(Ajxy)𝔼[Aji(AjxZAjx)]\displaystyle=A_{ji}(A_{j}x-y)-\mathbb{E}[A_{ji}(A_{j}x-Z_{A_{j}x})]

where expectation is taken over the random variable ZAjxZ_{A_{j}x} (for fixed AjA_{j}). Simplifying yields the expression

nll(x;Aj,yj)=Aj(𝔼[ZAjx]y).\nabla\operatorname{nll}(x;A_{j},y_{j})=A_{j}(\mathbb{E}[Z_{A_{j}x}]-y).

The second partial derivative of nll(x;Aj,yj)\operatorname{nll}(x;A_{j},y_{j}) in directions xi1x_{i_{1}} and xi2x_{i_{2}} is therefore

2xi1xi2nll(x;Aj,yj)\displaystyle\frac{\partial^{2}}{\partial x_{i_{1}}\partial x_{i_{2}}}\operatorname{nll}(x;A_{j},y_{j}) =xi1Aji2(𝔼[ZAjx]y)\displaystyle=\frac{\partial}{\partial x_{i_{1}}}A_{ji_{2}}(\mathbb{E}[Z_{A_{j}x}]-y)
=Aji2xi1(Sze(Ajxz)2/2𝑑zSe(Ajxz)2/2𝑑zy)\displaystyle=A_{ji_{2}}\frac{\partial}{\partial x_{i_{1}}}\left(\frac{\int_{S}ze^{-(A_{j}x-z)^{2}/2}\,dz}{\int_{S}e^{-(A_{j}x-z)^{2}/2}\,dz}-y\right)
=Aji2(xi1Sze(Ajxz)2/2𝑑zSe(Ajxz)2/2𝑑z\displaystyle=A_{ji_{2}}\Bigg{(}\frac{\frac{\partial}{\partial x_{i_{1}}}\int_{S}ze^{-(A_{j}x-z)^{2}/2}\,dz}{\int_{S}e^{-(A_{j}x-z)^{2}/2}\,dz}-
Sze(Ajxz)2/2𝑑zxi1Se(Ajxz)2/2𝑑z(Se(Ajxz)2/2𝑑z)2)\displaystyle\qquad\frac{\int_{S}ze^{-(A_{j}x-z)^{2}/2}\,dz\frac{\partial}{\partial x_{i_{1}}}\int_{S}e^{-(A_{j}x-z)^{2}/2}\,dz}{\left(\int_{S}e^{-(A_{j}x-z)^{2}/2}\,dz\right)^{2}}\Bigg{)}
=Aji2(𝔼[Aji1ZAjx(AjxZAjx)]𝔼[ZAjx]𝔼[Aji1(AjxZAjx)]\displaystyle=A_{ji_{2}}(\mathbb{E}[-A_{ji_{1}}Z_{A_{j}x}(A_{j}x-Z_{A_{j}x})]-\mathbb{E}[Z_{A_{j}x}]\mathbb{E}[-A_{ji_{1}}(A_{j}x-Z_{A_{j}x})]
=Aji1Aji2Var(ZAjx).\displaystyle=A_{ji_{1}}A_{ji_{2}}\operatorname{Var}(Z_{A_{j}x}).

We conclude that

H(x;Aj,yj)=AjTAjVar(ZAjx).H(x;A_{j},y_{j})=A_{j}^{T}A_{j}\operatorname{Var}(Z_{A_{j}x}).

Averaging over all samples yields the claimed result. ∎

The following lemma collects several useful facts that are needed for the PDW method. Parts (a) and (b) are generically true for any 1\ell_{1}-regularized convex program; part (c) is a holdover from the untruncated setting that is still true. The proof is essentially due to [26], although part (c) now requires slightly more work.

Lemma C.1.

Fix any (A,y)(A,y).

  (a)

    A vector xnx\in\mathbb{R}^{n} is optimal for Program 2 if and only if there exists some zx1z\in\partial\left\lVert x\right\rVert_{1} such that

    nll(x;A,y)+λz=0.\nabla\operatorname{nll}(x;A,y)+\lambda z=0.
  (b)

    Suppose that (x,z)(x,z) are as in (a), and furthermore |zi|<1|z_{i}|<1 for all isupp(x)i\not\in\operatorname{supp}(x). Then necessarily supp(x^)supp(x)\operatorname{supp}(\hat{x})\subseteq\operatorname{supp}(x) for any optimal solution x^\hat{x} to Program 2.

  (c)

    Suppose that (x,z)(x,z) are as in (b), with I=supp(x)I=\operatorname{supp}(x). If AITAIA_{I}^{T}A_{I} is invertible, then xx is the unique optimal solution to Program 2.

Proof.

Part (a) is simply the subgradient optimality condition in a convex program.

Part (b) is a standard fact about duality; we provide a proof here. Let x^\hat{x} be any optimal solution to Program 2. We claim that x^Tz=x^1\hat{x}^{T}z=\left\lVert\hat{x}\right\rVert_{1}. To see this, first note that xTz=x1x^{T}z=\left\lVert x\right\rVert_{1}, since xizi=|xi|x_{i}z_{i}=|x_{i}| always holds by definition of a subgradient for the 1\ell_{1} norm. Now, by optimality of xx and x^\hat{x}, we have f(x)=f(x^)f(tx+(1t)x^)f(x)=f(\hat{x})\leq f(tx+(1-t)\hat{x}) for all 0t10\leq t\leq 1. Therefore by convexity, f(tx+(1t)x^)=f(x)f(tx+(1-t)\hat{x})=f(x) for all 0t10\leq t\leq 1. Since ff is the sum of two convex functions, both must be linear on the line segment between xx and x^\hat{x}. Therefore

nll(tx+(1t)x^)=tnll(x)+(1t)nll(x^)\operatorname{nll}(tx+(1-t)\hat{x})=t\operatorname{nll}(x)+(1-t)\operatorname{nll}(\hat{x})

for all 0t10\leq t\leq 1. We conclude that

(nll(x))(x^x)=nll(x^)nll(x)=x1x^1.(\nabla\operatorname{nll}(x))\cdot(\hat{x}-x)=\operatorname{nll}(\hat{x})-\operatorname{nll}(x)=\left\lVert x\right\rVert_{1}-\left\lVert\hat{x}\right\rVert_{1}.

Since nll(x)+z=0\nabla\operatorname{nll}(x)+z=0 by subgradient optimality, it follows that zT(x^x)=x^1x1z^{T}(\hat{x}-x)=\left\lVert\hat{x}\right\rVert_{1}-\left\lVert x\right\rVert_{1}. Hence, zTx^=x^1z^{T}\hat{x}=\left\lVert\hat{x}\right\rVert_{1}. Since |zi|1|z_{i}|\leq 1 for all ii, if |zi|<1|z_{i}|<1 for some ii then necessarily x^i=0\hat{x}_{i}=0 for equality to hold.

For part (c), if AITAIA_{I}^{T}A_{I} is invertible, then it is (strictly) positive definite. The Hessian of Program 3 is

1mj=1mAI,jTAI,jVar(ZAj,x).\frac{1}{m}\sum_{j=1}^{m}A_{I,j}^{T}A_{I,j}\operatorname{Var}(Z_{A_{j},x}).

Since Var(ZAj,x)\operatorname{Var}(Z_{A_{j},x}) is always positive, there is some ϵ>0\epsilon>0 (not necessarily a constant) such that

1mj=1mAI,jTAI,jVar(ZAj,x)1mϵj=1mAI,jTAI,j=1mϵAITAI.\frac{1}{m}\sum_{j=1}^{m}A_{I,j}^{T}A_{I,j}\operatorname{Var}(Z_{A_{j},x})\succcurlyeq\frac{1}{m}\epsilon\sum_{j=1}^{m}A_{I,j}^{T}A_{I,j}=\frac{1}{m}\epsilon A_{I}^{T}A_{I}.

Thus, the Hessian of the restricted program is positive definite, so the restricted program is strictly convex. Therefore the restricted program has a unique solution. By part (b), any solution to the original program has support in II, so the original program also has a unique solution, which must be xx. ∎

As with the previous lemma, the following proof is essentially due to [26] (with a different subgradient optimality condition).

Proof of Lemma 5.2.

By part (a) of Lemma C.1, a vector xnx\in\mathbb{R}^{n} is optimal for Program 2 if and only if there is some zx1z\in\partial\left\lVert x\right\rVert_{1} such that

1mAT(𝔼ZA,xy)+λz=0.\frac{1}{m}A^{T}(\mathbb{E}Z_{A,x}-y)+\lambda z=0.

This vector equality can be written in block form as follows:

1m[AUTAUcT](𝔼ZA,xy)+λ[zUzUc]=0.\frac{1}{m}\begin{bmatrix}A_{U}^{T}\\ A_{U^{c}}^{T}\end{bmatrix}\left(\mathbb{E}Z_{A,x}-y\right)+\lambda\begin{bmatrix}z_{U}\\ z_{U^{c}}\end{bmatrix}=0.

Since x˘\breve{x} is optimal in U\mathbb{R}^{U}, there is some z˘Ux˘1\breve{z}_{U}\in\partial\left\lVert\breve{x}\right\rVert_{1} such that (x˘,z˘U)(\breve{x},\breve{z}_{U}) satisfy the first of the two block equations. This is precisely part (a). If furthermore x˘\breve{x} is zero-extended to n\mathbb{R}^{n}, and z˘\breve{z} is extended as in part (b), and z˘\breve{z} satisfies z˘Uc1\left\lVert\breve{z}_{U^{c}}\right\rVert_{\infty}\leq 1, then since xi=0x_{i}=0 for all iUi\not\in U, we have that z˘\breve{z} is a subgradient for x˘1\left\lVert\breve{x}\right\rVert_{1}. Therefore x˘\breve{x} is optimal for Program 2. If z˘Uc<1\left\lVert\breve{z}_{U^{c}}\right\rVert_{\infty}<1 and AUTAUA_{U}^{T}A_{U} is invertible, then x˘\breve{x} is the unique solution to Program 2 by parts (b) and (c) of Lemma C.1. ∎

Appendix D Sparse recovery from the Restricted Isometry Property

In this section we restate a theorem due to [6] about sparse recovery in the presence of noise. Our statement is slightly generalized to allow a trade-off between the isometry constants and the sparsity. That is, as the sparsity kk decreases relative to the isometry order ss, the isometry constants τ,T\tau,T are allowed to worsen.

Theorem D.1 ([6]).

Let Bm×nB\in\mathbb{R}^{m\times n} be a matrix satisfying the ss-Restricted Isometry Property

τv2Bv2Tv2\tau\left\lVert v\right\rVert_{2}\leq\left\lVert Bv\right\rVert_{2}\leq T\left\lVert v\right\rVert_{2}

for all ss-sparse vnv\in\mathbb{R}^{n}. Let wnw^{*}\in\mathbb{R}^{n} be kk-sparse for some k<sk<s, and let wnw\in\mathbb{R}^{n} satisfy w1w1\left\lVert w\right\rVert_{1}\leq\left\lVert w^{*}\right\rVert_{1}. Then

B(ww)2(τ(1ρ)Tρ)ww2\left\lVert B(w-w^{*})\right\rVert_{2}\geq\left(\tau(1-\rho)-T\rho\right)\left\lVert w-w^{*}\right\rVert_{2}

where ρ=k/(sk)\rho=\sqrt{k/(s-k)}.

Proof.

Let h=wwh=w-w^{*} and let T0=supp(w)T_{0}=\operatorname{supp}(w^{*}). Then

w1w1=hT0C1+(h+w)T01hT0C1+w1hT01,\left\lVert w^{*}\right\rVert_{1}\geq\left\lVert w\right\rVert_{1}=\left\lVert h_{T_{0}^{C}}\right\rVert_{1}+\left\lVert(h+w^{*})_{T_{0}}\right\rVert_{1}\geq\left\lVert h_{T_{0}^{C}}\right\rVert_{1}+\left\lVert w^{*}\right\rVert_{1}-\left\lVert h_{T_{0}}\right\rVert_{1},

so hT01hT0C1\left\lVert h_{T_{0}}\right\rVert_{1}\geq\left\lVert h_{T_{0}^{C}}\right\rVert_{1}. Without loss of generality assume that T0C={1,,|T0C|}T_{0}^{C}=\{1,\dots,|T_{0}^{C}|\}, and |hi||hi+1||h_{i}|\geq|h_{i+1}| for all 1i<|T0C|1\leq i<|T_{0}^{C}|. Divide T0CT_{0}^{C} into sets of size s=sks^{\prime}=s-k respecting this order:

T0C=T1T2Tr.T_{0}^{C}=T_{1}\cup T_{2}\cup\dots\cup T_{r}.

Then the Restricted Isometry Property gives

Bh2BhT0T12t=2rBhTt2τhT0T12Tt=2rhTt2\left\lVert Bh\right\rVert_{2}\geq\left\lVert Bh_{T_{0}\cup T_{1}}\right\rVert_{2}-\sum_{t=2}^{r}\left\lVert Bh_{T_{t}}\right\rVert_{2}\geq\tau\left\lVert h_{T_{0}\cup T_{1}}\right\rVert_{2}-T\sum_{t=2}^{r}\left\lVert h_{T_{t}}\right\rVert_{2} (5)

For any t1t\geq 1 and iTt+1i\in T_{t+1}, we have hihTt1/sh_{i}\leq\left\lVert h_{T_{t}}\right\rVert_{1}/s^{\prime}, so that

hTt+122hTt12s.\left\lVert h_{T_{t+1}}\right\rVert_{2}^{2}\leq\frac{\left\lVert h_{T_{t}}\right\rVert_{1}^{2}}{s^{\prime}}.

Summing over all t2t\geq 2, we get

t=2rhTt21st=1rhTt1=hT0C1shT01sksh2.\sum_{t=2}^{r}\left\lVert h_{T_{t}}\right\rVert_{2}\leq\frac{1}{\sqrt{s^{\prime}}}\sum_{t=1}^{r}\left\lVert h_{T_{t}}\right\rVert_{1}=\frac{\left\lVert h_{T_{0}^{C}}\right\rVert_{1}}{\sqrt{s^{\prime}}}\leq\frac{\left\lVert h_{T_{0}}\right\rVert_{1}}{\sqrt{s^{\prime}}}\leq\sqrt{\frac{k}{s^{\prime}}}\left\lVert h\right\rVert_{2}.

The triangle inequality implies that hT0T12(1k/s)h2\left\lVert h_{T_{0}\cup T_{1}}\right\rVert_{2}\geq(1-\sqrt{k/s^{\prime}})\left\lVert h\right\rVert_{2}. Returning to Equation 5, it follows that

Bh2(τ(1k/s)Tk/s)h2\left\lVert Bh\right\rVert_{2}\geq\left(\tau(1-\sqrt{k/s^{\prime}})-T\sqrt{k/s^{\prime}}\right)\left\lVert h\right\rVert_{2}

as claimed. ∎

Appendix E Summary of the algorithm

Algorithm 1 Projected Stochastic Gradient Descent.
1:procedure Sgd(N,λN,\lambda)\triangleright NN: number of steps, λ\lambda: parameter
2:     x^{(0)}\leftarrow\operatorname*{argmin}\left\lVert x\right\rVert_{1} s.t. x\in\mathscr{E}_{r} \triangleright see Appendix F for details
3:     for t=1,,Nt=1,\dots,N do
4:         ηt1nN\eta_{t}\leftarrow\frac{1}{\sqrt{nN}}
5:         v^{(t)}\leftarrow\textsc{GradientEstimation}(x^{(t-1)})
6:         w^{(t)}\leftarrow x^{(t-1)}-\eta_{t}v^{(t)}
7:         x^{(t)}\leftarrow\operatorname*{argmin}_{x\in\mathscr{E}_{r}}\left\lVert x-w^{(t)}\right\rVert_{2} \triangleright see Appendix F for details
8:     return x¯1Nt=1Nx(t)\bar{x}\leftarrow\frac{1}{N}\sum_{t=1}^{N}x^{(t)}\triangleright output the average
Algorithm 2 The function to estimate the gradient of the 1\ell_{1} regularized negative log-likelihood.
1:function GradientEstimation(xx)
2:     Pick j uniformly at random from [m]
3:     Use Assumption II or Lemma K.4 to sample z\sim Z_{A_{j}x}
4:     return A_{j}(z-y_{j})+\lambda\cdot\operatorname{sign}(x) \triangleright \lambda as in Algorithm 1
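For concreteness, here is a minimal NumPy sketch of Algorithms 1 and 2 combined into a single routine. It is a sketch under stated assumptions rather than a reference implementation: the truncated-Gaussian sampler sample_truncated (Assumption II, or Lemma K.4 for a union of intervals), the projection project_Er onto \mathscr{E}_{r} (the QCQP of Appendix F), and the initial point x0 (the SOCP of Appendix F) are assumed to be supplied by the caller, and the function names are ours.

import numpy as np

def sgd(A, y, lam, N, sample_truncated, project_Er, x0, seed=0):
    # Projected SGD for the l1-regularized negative log-likelihood (sketch).
    # A: (m, n) design matrix of surviving samples; y: (m,) truncated responses.
    # sample_truncated(t): one sample from N(t, 1; S)      [assumed to be given]
    # project_Er(w): Euclidean projection of w onto E_r    [assumed to be given]
    # x0: feasible initial point, e.g. the l1-minimizer over E_r.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.asarray(x0, dtype=float).copy()
    avg = np.zeros(n)
    eta = 1.0 / np.sqrt(n * N)                     # step size eta_t = 1/sqrt(nN)
    for _ in range(N):
        j = rng.integers(m)                        # pick j uniformly from [m]
        z = sample_truncated(A[j] @ x)             # z ~ Z_{A_j x}
        v = A[j] * (z - y[j]) + lam * np.sign(x)   # stochastic subgradient estimate
        x = project_Er(x - eta * v)                # gradient step, then projection
        avg += x
    return avg / N                                 # average of the iterates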

Appendix F Algorithm details

In this section we fill in the missing details about the algorithm’s efficiency. Since we have already seen that the algorithm converges in O(poly(n))O(\text{poly}(n)) update steps, all that remains is to show that the following algorithmic problems can be solved efficiently:

  1. 1.

    (Initial point) Compute x(0)=argminxrx1.x^{(0)}=\operatorname*{argmin}_{x\in\mathscr{E}_{r}}\left\lVert x\right\rVert_{1}.

  2. 2.

    (Stochastic gradient) Given x(t)rx^{(t)}\in\mathscr{E}_{r} and j[m]j\in[m], compute a sample Aj(zyj)A_{j}(z-y_{j}), where zZAjx(t)z\sim Z_{A_{j}x^{(t)}}.

  3. 3.

    (Projection) Given w(t)nw^{(t)}\in\mathbb{R}^{n}, compute x(t+1)=argminxrxw(t)2x^{(t+1)}=\operatorname*{argmin}_{x\in\mathscr{E}_{r}}\left\lVert x-w^{(t)}\right\rVert_{2}.

Initial point.

To obtain the initial point x(0)x^{(0)}, we need to solve the program

minimizex1subject toAxy2rm.\begin{array}[]{ll}\text{minimize}&\left\lVert x\right\rVert_{1}\\ \text{subject to}&\left\lVert Ax-y\right\rVert_{2}\leq r\sqrt{m}.\end{array}

This program has come up previously in the compressed sensing literature (see, e.g., [6]). It can be recast as a Second-Order Cone Program (SOCP) by writing x=x^{+}-x^{-} with variables x^{+},x^{-}\in\mathbb{R}^{n}:

\begin{array}[]{ll}\text{minimize}&\sum_{i=1}^{n}(x^{+}_{i}+x^{-}_{i})\\ \text{subject to}&\left\lVert Ax^{+}-Ax^{-}-y\right\rVert_{2}\leq r\sqrt{m},\\ &x^{+}\geq 0,\\ &x^{-}\geq 0.\end{array}

Thus, it can be solved in polynomial time by interior-point methods (see [4]).
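For illustration, the SOCP above can also be handed directly to an off-the-shelf convex solver. The sketch below uses the CVXPY modeling package, which is an assumption of this example only; the analysis above relies on generic interior-point methods [4], not on this particular tool.

import numpy as np
import cvxpy as cp

def initial_point(A, y, r):
    # Solve: minimize ||x||_1 subject to ||Ax - y||_2 <= r*sqrt(m)   (sketch)
    m, n = A.shape
    x = cp.Variable(n)
    constraints = [cp.norm(A @ x - y, 2) <= r * np.sqrt(m)]
    cp.Problem(cp.Minimize(cp.norm1(x)), constraints).solve()
    return x.value

The call to solve() simply dispatches to whichever conic solver CVXPY has installed; any interior-point solver suffices for the polynomial-time claim above.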

Stochastic gradient.

In computing an unbiased estimate of the gradient, the only challenge is sampling from ZAjx(t)Z_{A_{j}x^{(t)}}. By Assumption II, this takes T(γS(Ajx(t)))T(\gamma_{S}(A_{j}x^{(t)})) time. We know from Lemma I.4 that γS(Ajx)α2m\gamma_{S}(A_{j}x^{*})\geq\alpha^{2m}. Since x(t),xrx^{(t)},x^{*}\in\mathscr{E}_{r}, we have from Lemma I.2 that

\gamma_{S}(A_{j}x^{(t)})\geq\gamma_{S}(A_{j}x^{*})^{2}e^{-|A_{j}(x^{(t)}-x^{*})|^{2}-2}\geq\alpha^{4m}e^{-4r^{2}m-2}\geq e^{-\Theta(m/\alpha)}.

Thus, the time complexity of computing the stochastic gradient is T(eΘ(m/α))T(e^{-\Theta(m/\alpha)}).

In the special case when the truncation set S is a union of r intervals, there is a sampling algorithm with time complexity T(\beta)=\operatorname{poly}(r,\log(1/\beta),\log n) (Lemma K.4, instantiated with the accuracy chosen below). Hence, in this case the time complexity of computing the stochastic gradient is \operatorname{poly}(r,n).

To be more precise, we instantiate Lemma K.4 with accuracy \zeta=1/(nL), where L=\operatorname{poly}(n) is the number of update steps performed. This gives some sampling algorithm \mathscr{A}. In each step, \mathscr{A}'s output distribution is within total variation distance \zeta of the true distribution N(t,1;S). Consider a hypothetical sampling algorithm \mathscr{A}^{\prime} in which \mathscr{A} is run and its output is then altered, by rejection, so as to match the true distribution exactly; the alteration occurs with probability at most \zeta in each step. Thus, running the PSGD algorithm with \mathscr{A}^{\prime}, the probability that any alteration ever occurs is at most L\zeta=o(1). As shown by Theorem H.1, PSGD with \mathscr{A}^{\prime} succeeds with high probability. Hence, PSGD with \mathscr{A} succeeds with high probability as well.

Projection.

The other problem we need to solve is projection onto set r\mathscr{E}_{r}:

minimizexv2subject toAxy2rm.\begin{array}[]{ll}\text{minimize}&\left\lVert x-v\right\rVert_{2}\\ \text{subject to}&\left\lVert Ax-y\right\rVert_{2}\leq r\sqrt{m}.\end{array}

This is a convex QCQP, and therefore solvable in polynomial time by interior point methods (see [4]).
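The same modeling-layer approach applies to the projection step; again a hedged sketch using CVXPY rather than a hand-rolled interior-point method:

import numpy as np
import cvxpy as cp

def project_Er(A, y, r, v):
    # Euclidean projection of v onto E_r = {x : ||Ax - y||_2 <= r*sqrt(m)}   (sketch)
    m, n = A.shape
    x = cp.Variable(n)
    cp.Problem(cp.Minimize(cp.norm(x - v, 2)),
               [cp.norm(A @ x - y, 2) <= r * np.sqrt(m)]).solve()
    return x.value

In the PSGD loop one would fix A, y, r once and pass only the point v to be projected (e.g. via functools.partial), matching the one-argument projection routine assumed in the sketch of Appendix E.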

Appendix G Isometry properties

Let Am×nA\in\mathbb{R}^{m\times n} consist of mm samples AiA_{i} from Process 1. In this section we prove the following theorem:

Theorem G.1.

For every \epsilon>0 there are constants \delta>0, M, \tau>0 and T with the following property. Let V\subseteq[n]. Suppose that m\geq M|V|. With probability at least 1-e^{-\delta m} over A, for every subset J\subseteq[m] with |J|\geq\epsilon m, the |J|\times|V| submatrix A_{J,V} satisfies

τmv2AJ,Vv2Tmv2vV.\tau\sqrt{m}\left\lVert v\right\rVert_{2}\leq\left\lVert A_{J,V}v\right\rVert_{2}\leq T\sqrt{m}\left\lVert v\right\rVert_{2}\qquad\forall\,v\in\mathbb{R}^{V}.

We start with the upper bound, for which it suffices to take J=[m]J=[m].

Lemma G.2.

Let V[n]V\subseteq[n]. Suppose that m|V|m\geq|V|. There is a constant T=T(α)T=T(\alpha) such that

\Pr[s_{\max{}}(A_{V})>T\sqrt{m}]\leq e^{-\Omega(m)}.
Proof.

In the process for generating A, consider the matrix A^{\prime} obtained by not discarding any of the samples a\in\mathbb{R}^{n}. Then A^{\prime} is a \xi\times n matrix for a random variable \xi; each row of A^{\prime} is a spherical Gaussian independent of all previous rows, but \xi depends on the rows. Nonetheless, by a Chernoff bound, \Pr[\xi>2m/\alpha]\leq e^{-m/(3\alpha)}. On the complementary event, A^{\prime} is a submatrix of a \lceil 2m/\alpha\rceil\times n matrix B with i.i.d. Gaussian entries. By [21],

Pr[smax(BV)>C2m/α]ecm\Pr[s_{\max{}}(B_{V})>C\sqrt{2m/\alpha}]\leq e^{-cm}

for some absolute constants c,C>0c,C>0. Since AA^{\prime} is a submatrix of BB with high probability, and AA is a submatrix of AA^{\prime}, it follows that

Pr[smax(AV)>C2m/α]eΩ(m)\Pr[s_{\max{}}(A_{V})>C\sqrt{2m/\alpha}]\leq e^{-\Omega(m)}

as desired. ∎

For the lower bound, we use an ϵ\epsilon-net argument.

Lemma G.3.

Let ϵ>0\epsilon>0 and let vnv\in\mathbb{R}^{n} with v2=1\left\lVert v\right\rVert_{2}=1. Let aN(0,1)na\sim N(0,1)^{n}. Then

Pr[|aTv|<αϵπ/2|aTx+ZS]<ϵ.\Pr[|a^{T}v|<\alpha\epsilon\sqrt{\pi/2}|a^{T}x^{*}+Z\in S]<\epsilon.
Proof.

From the constant survival probability assumption,

Pr[|aTv|<δ|aTx+ZS]α1Pr[|aTv|<δ].\Pr[|a^{T}v|<\delta|a^{T}x^{*}+Z\in S]\leq\alpha^{-1}\Pr[|a^{T}v|<\delta].

But aTvN(0,1)a^{T}v\sim N(0,1), so Pr[|aTv|<δ]2δ/2π.\Pr[|a^{T}v|<\delta]\leq 2\delta/\sqrt{2\pi}. Taking δ=αϵπ/2\delta=\alpha\epsilon\sqrt{\pi/2} yields the desired bound. ∎
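The unconditional small-ball estimate \Pr[|a^{T}v|<\delta]\leq 2\delta/\sqrt{2\pi} used above is easy to sanity-check numerically; the following sketch is an illustration only, with arbitrary choices of the dimension and of \delta.

import numpy as np

rng = np.random.default_rng(0)
n, trials, delta = 50, 200_000, 0.1
v = rng.standard_normal(n)
v /= np.linalg.norm(v)                          # unit vector
a = rng.standard_normal((trials, n))            # rows a ~ N(0,1)^n
empirical = np.mean(np.abs(a @ v) < delta)      # estimate of Pr[|a^T v| < delta]
print(empirical, 2 * delta / np.sqrt(2 * np.pi))  # empirical value vs. the bound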

Lemma G.4.

Let V[n]V\subseteq[n]. Fix ϵ>0\epsilon>0 and fix vVv\in\mathbb{R}^{V} with v2=1\left\lVert v\right\rVert_{2}=1. There are positive constants τ0=τ0(α,ϵ)\tau_{0}=\tau_{0}(\alpha,\epsilon) and c0=c0(α,ϵ)c_{0}=c_{0}(\alpha,\epsilon) such that

\Pr\left[\exists J\subseteq[m]:(|J|\geq\epsilon m)\land(\left\lVert A_{J,V}v\right\rVert_{2}<\tau_{0}\sqrt{m})\right]\leq e^{-c_{0}m}.
Proof.

For each j[m]j\in[m] let BjB_{j} be the indicator random variable for the event that |Aj,Vv|<αϵ/3|A_{j,V}v|<\alpha\epsilon/3. Let B=j=1mBjB=\sum_{j=1}^{m}B_{j}. By Lemma G.3, 𝔼B<ϵm/3\mathbb{E}B<\epsilon m/3. Each BjB_{j} is independent, so by a Chernoff bound,

Pr[B>ϵm/2]eϵm/18.\Pr[B>\epsilon m/2]\leq e^{-\epsilon m/18}.

In the event [Bϵm/2][B\leq\epsilon m/2], for any J[m]J\subseteq[m] with |J|ϵm|J|\geq\epsilon m it holds that

\left\lVert A_{J,V}v\right\rVert_{2}^{2}=\sum_{j\in J}(A_{j,V}v)^{2}\geq\sum_{j\in J:B_{j}=0}(A_{j,V}v)^{2}\geq(\alpha\epsilon/3)^{2}(|J|-B)\geq\alpha^{2}\epsilon^{3}m/18.

So, taking \tau_{0}=\alpha\epsilon^{3/2}/\sqrt{18}, the event in the lemma statement occurs with probability at most e^{-\epsilon m/18}.

Now we can prove the isometry property claimed in Theorem G.1.

Proof of Theorem G.1.

Let V\subseteq[n] and write k=|V|. Let \epsilon>0. Take \gamma=4|V|/(c_{0}m), where c_{0}=c_{0}(\alpha,\epsilon) is the constant in the statement of Lemma G.4. Let \mathscr{B}\subseteq\mathbb{R}^{V} be the k-dimensional unit ball. Let \mathscr{D}\subset\mathscr{B} be a maximal packing of (1+\gamma/2)\mathscr{B} by radius-(\gamma/2) balls with centers on the unit sphere. By a volume argument,

|𝒟|(1+γ/2)k(γ/2)ke2k/γec0m/2.|\mathscr{D}|\leq\frac{(1+\gamma/2)^{k}}{(\gamma/2)^{k}}\leq e^{2k/\gamma}\leq e^{c_{0}m/2}.

Applying Lemma G.4 to each v𝒟v\in\mathscr{D} and taking a union bound,

\Pr[\exists J\subseteq[m],v\in\mathscr{D}:(|J|\geq\epsilon m)\land(\left\lVert A_{J,V}v\right\rVert_{2}<\tau_{0}\sqrt{m})]\leq e^{-c_{0}m/2}.

So with probability 1eΩ(m)1-e^{-\Omega(m)}, the complement of this event holds. And by Lemma G.2, the event smax(AV)Tms_{\max{}}(A_{V})\leq T\sqrt{m} holds with probability 1eΩ(m)1-e^{-\Omega(m)}. In these events we claim that the conclusion of the theorem holds. Take any vVv\in\mathbb{R}^{V} with v2=1\left\lVert v\right\rVert_{2}=1, and take any J[m]J\subseteq[m] with |J|ϵm|J|\geq\epsilon m. Since 𝒟\mathscr{D} is maximal, there is some w𝒟w\in\mathscr{D} with vw2γ\left\lVert v-w\right\rVert_{2}\leq\gamma. Then

\left\lVert A_{J,V}v\right\rVert_{2}\geq\left\lVert A_{J,V}w\right\rVert_{2}-\left\lVert A_{J,V}(v-w)\right\rVert_{2}\geq\tau_{0}\sqrt{m}-\gamma T\sqrt{m}.

But γ4/(c0M)\gamma\leq 4/(c_{0}M). For sufficiently large MM, we get γ<τ0/(2T)\gamma<\tau_{0}/(2T). Taking τ=τ0/2\tau=\tau_{0}/2 yields the claimed lower bound. ∎

As a corollary, we get that AUTA_{U}^{T} is a m\sqrt{m}-isometry on its row space up to constants (of course, this holds for any V[n]V\subseteq[n] with |V|=k|V|=k, but we only need it for V=UV=U).

Corollary G.5.

With high probability, for every uku\in\mathbb{R}^{k},

τ2TmAUu2AUTAUu2T2τmAUu2.\frac{\tau^{2}}{T}\sqrt{m}\left\lVert A_{U}u\right\rVert_{2}\leq\left\lVert A_{U}^{T}A_{U}u\right\rVert_{2}\leq\frac{T^{2}}{\tau}\sqrt{m}\left\lVert A_{U}u\right\rVert_{2}.
Proof.

By Theorem G.1, with high probability all singular values of A_{U} lie in the interval [\tau\sqrt{m},T\sqrt{m}]. Hence, all eigenvalues of A_{U}^{T}A_{U} lie in [\tau^{2}m,T^{2}m], and all eigenvalues of (A_{U}^{T}A_{U})^{2} lie in [\tau^{4}m^{2},T^{4}m^{2}]. But then

\left\lVert A_{U}^{T}A_{U}u\right\rVert_{2}^{2}=u^{T}(A_{U}^{T}A_{U})^{2}u\geq\tau^{4}m^{2}\,u^{T}u\geq\frac{\tau^{4}}{T^{2}}m\left\lVert A_{U}u\right\rVert_{2}^{2},

and taking square roots gives the lower bound.

The upper bound is similar. ∎

We also get a Restricted Isometry Property, by applying Theorem G.1 to all subsets V[n]V\subseteq[n] of a fixed size.

Corollary G.6 (Restricted Isometry Property).

There is a constant MM such that for any s>0s>0, if mMslognm\geq Ms\log n, then with high probability, for every vnv\in\mathbb{R}^{n} with |supp(v)|s|\operatorname{supp}(v)|\leq s,

τmv2Av2Tmv2.\tau\sqrt{m}\left\lVert v\right\rVert_{2}\leq\left\lVert Av\right\rVert_{2}\leq T\sqrt{m}\left\lVert v\right\rVert_{2}.
Proof.

We apply Theorem G.1 to all V[n]V\subseteq[n] with |V|=s|V|=s, and take a union bound over the respective failure events. The probability that there exists some set V[n]V\subseteq[n] of size ss such that the isometry fails is at most

(ns)eδmeslognδm.\binom{n}{s}e^{-\delta m}\leq e^{s\log n-\delta m}.

If mMslognm\geq Ms\log n for a sufficiently large constant MM, then this probability is o(1)o(1). ∎

From this corollary, our main result for adversarial noise (Theorem 3.4) follows almost immediately:

Proof of Theorem 3.4.

Let M^{\prime} be the constant in Corollary G.6. Let \rho=\min(\tau/(4T),1/4), and let M=(1+1/\rho^{2})M^{\prime}. Finally, let s=(1+1/\rho^{2})k.

Let ϵ>0\epsilon>0. Suppose that mMklognm\geq Mk\log n and Axyϵ\left\lVert Ax^{*}-y\right\rVert\leq\epsilon. Then mMslognm\geq M^{\prime}s\log n, so by Corollary G.6, A/mA/\sqrt{m} satisfies the ss-Restricted Isometry Property.

By definition, x^\hat{x} satisfies Ax^y2ϵ\left\lVert A\hat{x}-y\right\rVert_{2}\leq\epsilon and x^1x1\left\lVert\hat{x}\right\rVert_{1}\leq\left\lVert x^{*}\right\rVert_{1} (by feasibility of xx^{*}). Finally, xx^{*} is kk-sparse. We conclude from Theorem D.1 and our choice of ρ\rho that

(A/m)(x^x)2(τ(1ρ)Tρ)x^x2τ2x^x2.\left\lVert(A/\sqrt{m})(\hat{x}-x^{*})\right\rVert_{2}\geq(\tau(1-\rho)-T\rho)\left\lVert\hat{x}-x^{*}\right\rVert_{2}\geq\frac{\tau}{2}\left\lVert\hat{x}-x^{*}\right\rVert_{2}.

But \left\lVert A(\hat{x}-x^{*})\right\rVert_{2}\leq 2\epsilon by the triangle inequality. Thus, \left\lVert\hat{x}-x^{*}\right\rVert_{2}\leq 4\epsilon/(\tau\sqrt{m}).
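The isometry of Corollary G.6 is also easy to observe empirically. The following sketch is an illustration only, not a verification (the corollary quantifies over all s-sparse vectors, which a simulation cannot exhaust): it draws an untruncated Gaussian design, for simplicity, and checks that \left\lVert Av\right\rVert_{2}/(\sqrt{m}\left\lVert v\right\rVert_{2}) stays within a constant range over random s-sparse directions; the dimensions are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
m, n, s = 400, 2000, 10                    # m >> s*log(n), so an RIP is expected
A = rng.standard_normal((m, n))            # untruncated Gaussian design, for simplicity
ratios = []
for _ in range(1000):
    supp = rng.choice(n, size=s, replace=False)   # random support of size s
    v = np.zeros(n)
    v[supp] = rng.standard_normal(s)
    ratios.append(np.linalg.norm(A @ v) / (np.sqrt(m) * np.linalg.norm(v)))
print(min(ratios), max(ratios))            # both should be close to 1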

Appendix H Projected Stochastic Gradient Descent

In this section we present the exact PSGD convergence theorem which we use, together with a proof for completeness.

Theorem H.1.

Let f:nf:\mathbb{R}^{n}\to\mathbb{R} be a convex function achieving its optimum at x˘n\breve{x}\in\mathbb{R}^{n}. Let 𝒫n\mathscr{P}\subseteq\mathbb{R}^{n} be a convex set containing x˘\breve{x}. Let x(0)𝒫x^{(0)}\in\mathscr{P} be arbitrary. For 1tT1\leq t\leq T define a random variable x(t)x^{(t)} by

x(t)=Proj𝒫(x(t1)ηv(t1)),x^{(t)}=\text{Proj}_{\mathscr{P}}(x^{(t-1)}-\eta v^{(t-1)}),

where 𝔼[v(t)|x(t)]f(x(t))\mathbb{E}[v^{(t)}|x^{(t)}]\in\partial f(x^{(t)}) and η\eta is fixed. Then

𝔼[f(x¯)]f(x˘)(ηT)1𝔼[x(0)x˘22]+ηT1i=1T𝔼[v(i)22]\mathbb{E}[f(\bar{x})]-f(\breve{x})\leq(\eta T)^{-1}\mathbb{E}\left[\left\lVert x^{(0)}-\breve{x}\right\rVert_{2}^{2}\right]+\eta T^{-1}\sum_{i=1}^{T}\mathbb{E}\left[\left\lVert v^{(i)}\right\rVert_{2}^{2}\right]

where x¯=1Ti=1Tx(i)\bar{x}=\frac{1}{T}\sum_{i=1}^{T}x^{(i)}.

Proof.

Fix 0t<T0\leq t<T. We can write

x(t+1)x˘22(x(t)ηv(t))x˘22=x(t)x˘222ηv(t),x(t)x˘+η2v(t)22\left\lVert x^{(t+1)}-\breve{x}\right\rVert_{2}^{2}\leq\left\lVert(x^{(t)}-\eta v^{(t)})-\breve{x}\right\rVert_{2}^{2}=\left\lVert x^{(t)}-\breve{x}\right\rVert_{2}^{2}-2\eta\langle v^{(t)},x^{(t)}-\breve{x}\rangle+\eta^{2}\left\lVert v^{(t)}\right\rVert_{2}^{2}

since projecting onto \mathscr{P} cannot increase the distance to \breve{x}\in\mathscr{P}.

Taking expectation over v^{(t)} for fixed x^{(0)},\dots,x^{(t)}, we have

\displaystyle\mathbb{E}\left[\left\lVert x^{(t+1)}-\breve{x}\right\rVert_{2}^{2}\middle|x^{(0)},\dots,x^{(t)}\right]\leq\left\lVert x^{(t)}-\breve{x}\right\rVert_{2}^{2}-2\eta\langle\mathbb{E}v^{(t)},x^{(t)}-\breve{x}\rangle+\eta^{2}\mathbb{E}\left[\left\lVert v^{(t)}\right\rVert_{2}^{2}\right]
x(t)x˘222η(f(x(t))f(x˘))+η2𝔼[v(t)22]\displaystyle\leq\left\lVert x^{(t)}-\breve{x}\right\rVert_{2}^{2}-2\eta(f(x^{(t)})-f(\breve{x}))+\eta^{2}\mathbb{E}\left[\left\lVert v^{(t)}\right\rVert_{2}^{2}\right]

where the last inequality uses the fact that 𝔼v(t)\mathbb{E}v^{(t)} is a subgradient for ff at x(t)x^{(t)}. Rearranging and taking expectation over x(0),,x(t)x^{(0)},\dots,x^{(t)}, we get that

2(𝔼[f(x(t))]f(x˘))η1(𝔼[x(t)x˘22]𝔼[x(t+1)x˘22])+η𝔼[v(t)22].2\left(\mathbb{E}\left[f(x^{(t)})\right]-f(\breve{x})\right)\leq\eta^{-1}\left(\mathbb{E}\left[\left\lVert x^{(t)}-\breve{x}\right\rVert_{2}^{2}\right]-\mathbb{E}\left[\left\lVert x^{(t+1)}-\breve{x}\right\rVert_{2}^{2}\right]\right)+\eta\mathbb{E}\left[\left\lVert v^{(t)}\right\rVert_{2}^{2}\right].

But now summing over 0t<T0\leq t<T, the right-hand side of the inequality telescopes, giving

𝔼[f(x¯)]f(x˘)\displaystyle\mathbb{E}[f(\bar{x})]-f(\breve{x}) 1Tt=0T1𝔼[f(x(t))]f(x˘)\displaystyle\leq\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[f(x^{(t)})]-f(\breve{x})
1ηT𝔼[x(0)x˘22]+ηTt=0T1𝔼[v(t)22].\displaystyle\leq\frac{1}{\eta T}\mathbb{E}\left[\left\lVert x^{(0)}-\breve{x}\right\rVert_{2}^{2}\right]+\frac{\eta}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\lVert v^{(t)}\right\rVert_{2}^{2}\right].

This is the desired bound. ∎

Appendix I Survival probability

In this section we collect useful lemmas about truncated Gaussian random variables and survival probabilities.

Lemma I.1 ([9]).

Let tt\in\mathbb{R} and let SS\subset\mathbb{R} be a measurable set. Then Var(Zt)CγS(t)2\operatorname{Var}(Z_{t})\geq C\gamma_{S}(t)^{2} for a constant C>0C>0.

Lemma I.2 ([9]).

For t,tt,t^{*}\in\mathbb{R},

log1γS(t)2log1γS(t)+|tt|2+2.\log\frac{1}{\gamma_{S}(t)}\leq 2\log\frac{1}{\gamma_{S}(t^{*})}+|t-t^{*}|^{2}+2.
Lemma I.3 ([9]).

For tt\in\mathbb{R},

𝔼[Rt2]2log1γS(t)+4.\mathbb{E}[R_{t}^{2}]\leq 2\log\frac{1}{\gamma_{S}(t)}+4.
Lemma I.4.

With high probability,

j=1mlog1γS(Ajx)2mlog(1α).\sum_{j=1}^{m}\log\frac{1}{\gamma_{S}(A_{j}x^{*})}\leq 2m\log\left(\frac{1}{\alpha}\right).
Proof.

Let Xj=log1/γS(Ajx)X_{j}=\log 1/\gamma_{S}(A_{j}x^{*}) for j[m]j\in[m], and let X=X1++XmX=X_{1}+\dots+X_{m}. Since X1,,XmX_{1},\dots,X_{m} are independent and identically distributed,

𝔼[eX]=𝔼[eXj]m=𝔼[1γS(Ajx)]m=(𝔼aN(0,1)n[1]𝔼aN(0,1)n[γS(aTx)])mαm.\mathbb{E}[e^{X}]=\mathbb{E}[e^{X_{j}}]^{m}=\mathbb{E}\left[\frac{1}{\gamma_{S}(A_{j}x^{*})}\right]^{m}=\left(\frac{\mathbb{E}_{a\sim N(0,1)^{n}}[1]}{\mathbb{E}_{a\sim N(0,1)^{n}}[\gamma_{S}(a^{T}x^{*})]}\right)^{m}\leq\alpha^{-m}.

Therefore

Pr[X>2mlog1/α]=Pr[eX>e2mlog1/α]emlog1/α\Pr[X>2m\log 1/\alpha]=\Pr[e^{X}>e^{2m\log 1/\alpha}]\leq e^{-m\log 1/\alpha}

by Markov’s inequality. ∎

Appendix J Omitted proofs

J.1 Proof of Lemma A.1

We need the following computation:

Lemma J.1.

For any i[n]i\in[n] and j[m]j\in[m],

𝔼AAji2Var(ZAj,x)α1.\mathbb{E}_{A}A_{ji}^{2}\operatorname{Var}(Z_{A_{j},x^{*}})\leq\alpha^{-1}.
Proof.

By Assumption I,

𝔼Aj[Aji2Var(ZAj,x)]\displaystyle\mathbb{E}_{A_{j}}[A_{ji}^{2}\operatorname{Var}(Z_{A_{j},x^{*}})] =𝔼aN(0,1)n[γS(aTx)ai2Var(Za,x)]𝔼aN(0,1)n[γS(aTx)]\displaystyle=\frac{\mathbb{E}_{a\sim N(0,1)^{n}}[\gamma_{S}(a^{T}x^{*})a_{i}^{2}\operatorname{Var}(Z_{a,x^{*}})]}{\mathbb{E}_{a\sim N(0,1)^{n}}[\gamma_{S}(a^{T}x^{*})]}
α1𝔼aN(0,1)n[γS(aTx)ai2Var(Za,x)].\displaystyle\leq\alpha^{-1}\mathbb{E}_{a\sim N(0,1)^{n}}[\gamma_{S}(a^{T}x^{*})a_{i}^{2}\operatorname{Var}(Z_{a,x^{*}})].

But for any fixed tt\in\mathbb{R},

Var(Zt)𝔼[(Ztt)2]=𝔼[Z2|Z+tS]γS(t)1𝔼[Z2]=γS(t)1.\operatorname{Var}(Z_{t})\leq\mathbb{E}[(Z_{t}-t)^{2}]=\mathbb{E}[Z^{2}|Z+t\in S]\leq\gamma_{S}(t)^{-1}\mathbb{E}[Z^{2}]=\gamma_{S}(t)^{-1}.

Therefore

𝔼Aj[Aji2Var(ZAj,x)]α1𝔼aN(0,1)n[ai2]α1\mathbb{E}_{A_{j}}[A_{ji}^{2}\operatorname{Var}(Z_{A_{j},x^{*}})]\leq\alpha^{-1}\mathbb{E}_{a\sim N(0,1)^{n}}[a_{i}^{2}]\leq\alpha^{-1}

as desired. ∎

Now we can prove Lemma A.1.

Proof of Lemma A.1.

We prove that the bound holds in expectation, and then apply Markov’s inequality. We need to show that each entry of the vector AUT(𝔼[ZA,x]ZA,x)kA_{U}^{T}(\mathbb{E}[Z_{A,x^{*}}]-Z_{A,x^{*}})\in\mathbb{R}^{k} has expected square O(m)O(m). A single entry of the vector AUT(𝔼[ZA,x]ZA,x)A_{U}^{T}(\mathbb{E}[Z_{A,x^{*}}]-Z_{A,x^{*}}) is

j=1mAji(ZAj,x𝔼ZAj,x),\sum_{j=1}^{m}A_{ji}(Z_{A_{j},x^{*}}-\mathbb{E}Z_{A_{j},x^{*}}),

so its expected square is

𝔼(j=1mAji(ZAj,x𝔼ZAj,x))2.\mathbb{E}\left(\sum_{j=1}^{m}A_{ji}(Z_{A_{j},x^{*}}-\mathbb{E}Z_{A_{j},x^{*}})\right)^{2}.

For any j1j2j_{1}\neq j_{2}, the cross-term is

𝔼Aj1iAj2i(ZAj1,x𝔼ZAj1,x)(ZAj2,x𝔼ZAj2,x).\mathbb{E}A_{j_{1}i}A_{j_{2}i}(Z_{A_{j_{1}},x^{*}}-\mathbb{E}Z_{A_{j_{1}},x^{*}})(Z_{A_{j_{2}},x^{*}}-\mathbb{E}Z_{A_{j_{2}},x^{*}}).

But for fixed AA, the two terms in the product are independent, and they both have mean 0, so the cross-term is 0. Thus,

𝔼(j=1mAji(ZAj,x𝔼ZAj,x))2\displaystyle\mathbb{E}\left(\sum_{j=1}^{m}A_{ji}(Z_{A_{j},x^{*}}-\mathbb{E}Z_{A_{j},x^{*}})\right)^{2} =j=1m𝔼A,yAji2(ZAj,x𝔼ZAj,x)2\displaystyle=\sum_{j=1}^{m}\mathbb{E}_{A,y}A_{ji}^{2}(Z_{A_{j},x^{*}}-\mathbb{E}Z_{A_{j},x^{*}})^{2}
=j=1m𝔼AAji2Var(ZAj,x)\displaystyle=\sum_{j=1}^{m}\mathbb{E}_{A}A_{ji}^{2}\operatorname{Var}(Z_{A_{j},x^{*}})
α1m.\displaystyle\leq\alpha^{-1}m.

Then

𝔼[AUT(𝔼[ZA,x]ZA,x)22]α1km.\mathbb{E}\left[\left\lVert A_{U}^{T}(\mathbb{E}[Z_{A,x^{*}}]-Z_{A,x^{*}})\right\rVert_{2}^{2}\right]\leq\alpha^{-1}km.

Hence with probability at least 11/logn1-1/\log n,

AUT(𝔼[ZA,x]ZA,x)22α1kmlogn\left\lVert A_{U}^{T}(\mathbb{E}[Z_{A,x^{*}}]-Z_{A,x^{*}})\right\rVert_{2}^{2}\leq\alpha^{-1}km\log n

by Markov’s inequality. ∎

J.2 Proof of Lemma A.2

We can write

μt=Sxe(xt)2/2𝑑xSe(xt)2/2𝑑x.\mu_{t}=\frac{\int_{S}xe^{-(x-t)^{2}/2}\,dx}{\int_{S}e^{-(x-t)^{2}/2}\,dx}.

By the quotient rule,

ddtμt\displaystyle\frac{d}{dt}\mu_{t} =Sx(tx)e(xt)2/2𝑑xSe(xt)2/2𝑑x+(Sxe(xt)2/2𝑑x)(S(tx)e(xt)2/2𝑑x)(Se(xt)2/2𝑑x)2\displaystyle=-\frac{\int_{S}x(t-x)e^{-(x-t)^{2}/2}\,dx}{\int_{S}e^{-(x-t)^{2}/2}\,dx}+\frac{\left(\int_{S}xe^{-(x-t)^{2}/2}\,dx\right)\left(\int_{S}(t-x)e^{-(x-t)^{2}/2}\,dx\right)}{\left(\int_{S}e^{-(x-t)^{2}/2}\,dx\right)^{2}}
=𝔼[Zt(tZt)]+𝔼[Zt]𝔼[tZt]\displaystyle=-\mathbb{E}[Z_{t}(t-Z_{t})]+\mathbb{E}[Z_{t}]\mathbb{E}[t-Z_{t}]
=Var(Zt)\displaystyle=\operatorname{Var}(Z_{t})

as desired.
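The identity \frac{d}{dt}\mu_{t}=\operatorname{Var}(Z_{t}) is easy to check numerically. The sketch below compares a centered finite difference of \mu_{t} with \operatorname{Var}(Z_{t}) for a truncation interval S=[a,b], computing the truncated moments by numerical integration; the particular values of a, b, t are arbitrary.

import numpy as np

def truncated_moments(t, a, b, grid=200_001):
    # Mean and variance of N(t, 1) conditioned on S = [a, b], by numerical integration.
    x = np.linspace(a, b, grid)
    w = np.exp(-(x - t) ** 2 / 2)
    mass = np.trapz(w, x)
    mean = np.trapz(x * w, x) / mass
    var = np.trapz((x - mean) ** 2 * w, x) / mass
    return mean, var

a, b, t, h = -1.0, 0.5, 0.3, 1e-4
mu_plus, _ = truncated_moments(t + h, a, b)
mu_minus, _ = truncated_moments(t - h, a, b)
_, var_t = truncated_moments(t, a, b)
print((mu_plus - mu_minus) / (2 * h), var_t)   # the two values should agree closely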

J.3 Proof of Lemma A.3

The fact that sign(μtμt)=sign(tt)\operatorname{sign}(\mu_{t}-\mu_{t^{*}})=\operatorname{sign}(t-t^{*}) follows immediately from the fact that ddtμt=Var(Zt)0\frac{d}{dt}\mu_{t}=\operatorname{Var}(Z_{t})\geq 0 (Lemma A.2).

We now prove the second claim of the lemma. Suppose t<tt^{*}<t; the other case is symmetric. Then we have

μtμt=ttVar(Zr)𝑑rCttγS(r)2𝑑rCβ20tter22𝑑r\mu_{t}-\mu_{t^{*}}=\int_{t^{*}}^{t}\operatorname{Var}(Z_{r})\,dr\geq C\int_{t^{*}}^{t}\gamma_{S}(r)^{2}\,dr\geq C\beta^{2}\int_{0}^{t-t^{*}}e^{-r^{2}-2}\,dr

by Lemmas A.2, I.1 and I.2 respectively. But we can lower bound

0tter22𝑑r\displaystyle\int_{0}^{t-t^{*}}e^{-r^{2}-2}\,dr 0min(1,tt)er22𝑑r\displaystyle\geq\int_{0}^{\min(1,t-t^{*})}e^{-r^{2}-2}\,dr
e3min(1,tt).\displaystyle\geq e^{-3}\min(1,t-t^{*}).

This bound has the desired form.

J.4 Proof of Lemma A.4

Since 𝔼aN(0,1)kγS(aTx)α\mathbb{E}_{a\sim N(0,1)^{k}}\gamma_{S}(a^{T}x^{*})\geq\alpha and γS(aTx)\gamma_{S}(a^{T}x^{*}) is always at most 11, we have Pr[γS(aTx)α/2]1α/2\Pr[\gamma_{S}(a^{T}x^{*})\leq\alpha/2]\leq 1-\alpha/2. Since the samples are rejection sampled on γS(aTx)\gamma_{S}(a^{T}x^{*}), it follows that Pr[γS(Ajx)α/2]1α/2\Pr[\gamma_{S}(A_{j}x^{*})\leq\alpha/2]\leq 1-\alpha/2 as well. So by a Chernoff bound, with high probability, the number of j[m]j\in[m] such that γS(Ajx)α/2\gamma_{S}(A_{j}x^{*})\leq\alpha/2 is at most (1α/3)m(1-\alpha/3)m.

The condition that |AjxAjx˘|2(6/(αm))AxAx˘2|A_{j}x^{*}-A_{j}\breve{x}|^{2}\geq(6/(\alpha m))\left\lVert Ax^{*}-A\breve{x}\right\rVert^{2} is clearly satisfied by at most (α/6)m(\alpha/6)m indices.

J.5 Proof of Lemma A.5

By Lemma A.4 and Theorem G.1, with high probability AIgood,UA_{I_{\text{good}},U} and AUA_{U} both have singular values bounded between τm\sqrt{\tau m} and Tm\sqrt{Tm} for some positive constants τ=τ(α)\tau=\tau(\alpha) and T=T(α)T=T(\alpha). In this event, we have

A(xx˘)Igood2τmxx˘2τTA(xx˘)2\left\lVert A(x^{*}-\breve{x})\right\rVert_{I_{\text{good}}}^{2}\geq\tau m\left\lVert x^{*}-\breve{x}\right\rVert^{2}\geq\frac{\tau}{T}\left\lVert A(x^{*}-\breve{x})\right\rVert^{2}

which proves the claim.

J.6 Proof of Theorems A.6 and A.7

Proof of Theorem A.6.

Let a=A(x˘x)a=A(\breve{x}-x^{*}) and let b=μAx˘μAxb=\mu_{A\breve{x}}-\mu_{Ax^{*}}. Our aim is to show that if a2\left\lVert a\right\rVert_{2} is large, then AUTb2\left\lVert A_{U}^{T}b\right\rVert_{2} is large, which would contradict Equation 4. Since AUTA_{U}^{T} is not an isometry, we can’t simply show that b2\left\lVert b\right\rVert_{2} is large. Instead, we write an orthogonal decomposition b=v+AUub=v+A_{U}u for some uku\in\mathbb{R}^{k} and vmv\in\mathbb{R}^{m} with AUTv=0A_{U}^{T}v=0. We’ll show that AUu2\left\lVert A_{U}u\right\rVert_{2} is large. Since AUTb=AUTAUuA_{U}^{T}b=A_{U}^{T}A_{U}u, and AUTA_{U}^{T} is an isometry on the row space of AUA_{U}, this suffices.

For every jIgoodj\in I_{\text{good}} with |aj|>0|a_{j}|>0, we have by Lemma A.3 that

|bj|Cmin(1,|aj|)=C|aj|min(1/|aj|,1)|b_{j}|\geq C\min(1,|a_{j}|)=C|a_{j}|\min(1/|a_{j}|,1)

where CC is the constant which makes Lemma A.3 work for indices jj with γS(Ajx)α/2\gamma_{S}(A_{j}x^{*})\geq\alpha/2. Take C=6/αC^{\prime}=\sqrt{6/\alpha}, and suppose that the theorem’s conclusion is false, i.e. a2>Cm\left\lVert a\right\rVert_{2}>C^{\prime}\sqrt{m}. Also suppose that the events of Lemmas A.1 and A.5 hold.

Then by the bound |aj|2(6/(αm))a22|a_{j}|^{2}\leq(6/(\alpha m))\left\lVert a\right\rVert_{2}^{2} for jIgoodj\in I_{\text{good}} we get

|bj|C|aj|min(α/6ma2,1)=cma2|aj||b_{j}|\geq C|a_{j}|\min\left(\frac{\sqrt{\alpha/6}\sqrt{m}}{\left\lVert a\right\rVert_{2}},1\right)=\frac{c\sqrt{m}}{\left\lVert a\right\rVert_{2}}|a_{j}| (6)

where c=Cα/6c=C\sqrt{\alpha/6}. We assumed earlier that |aj|>0|a_{j}|>0 but Equation 6 certainly also holds when |aj|=0|a_{j}|=0.

By Lemma A.3, aja_{j} and bjb_{j} have the same sign for all j[m]j\in[m]. So ajbj0a_{j}b_{j}\geq 0 for all j[m]j\in[m]. Moreover, together with Equation 6, the sign constraint implies that for jIgoodj\in I_{\text{good}},

ajbjcmaaj2.a_{j}b_{j}\geq\frac{c\sqrt{m}}{\left\lVert a\right\rVert}a_{j}^{2}.

Summing over jIgoodj\in I_{\text{good}} we get

jIgoodaj2a2cmjIgoodajbja2cma,b=a2cma,AUu.\sum_{j\in I_{\text{good}}}a_{j}^{2}\leq\frac{\left\lVert a\right\rVert_{2}}{c\sqrt{m}}\sum_{j\in I_{\text{good}}}a_{j}b_{j}\leq\frac{\left\lVert a\right\rVert_{2}}{c\sqrt{m}}\langle a,b\rangle=\frac{\left\lVert a\right\rVert_{2}}{c\sqrt{m}}\langle a,A_{U}u\rangle.

By Lemma A.5 on the LHS and Cauchy-Schwarz on the RHS, we get

ϵa22a22cmAUu2.\epsilon\left\lVert a\right\rVert_{2}^{2}\leq\frac{\left\lVert a\right\rVert_{2}^{2}}{c\sqrt{m}}\left\lVert A_{U}u\right\rVert_{2}.

Hence AUu2ϵcm\left\lVert A_{U}u\right\rVert_{2}\geq\epsilon c\sqrt{m}. But then AUTb2=AUTAUu2(τ2/T)ϵcm\left\lVert A_{U}^{T}b\right\rVert_{2}=\left\lVert A_{U}^{T}A_{U}u\right\rVert_{2}\geq(\tau^{2}/T)\epsilon cm. On the other hand, Equation 4 implies that

1mAUTb2λk+1mAUT(𝔼[ZA,x]y)2creg+α1(klogn)/m\frac{1}{m}\left\lVert A_{U}^{T}b\right\rVert_{2}\leq\lambda\sqrt{k}+\frac{1}{m}\left\lVert A_{U}^{T}(\mathbb{E}[Z_{A,x^{*}}]-y)\right\rVert_{2}\leq c^{\prime}_{\text{reg}}+\sqrt{\alpha^{-1}(k\log n)/m}

since event (2) holds. This is a contradiction for MM^{\prime} sufficiently large and cregc^{\prime}_{\text{reg}} sufficiently small. So either the assumption a2>Cm\left\lVert a\right\rVert_{2}>C^{\prime}\sqrt{m} is false, or the events of Lemma A.1 or A.5 fail. But the latter two events fail with probability o(1)o(1). So a2Cm\left\lVert a\right\rVert_{2}\leq C^{\prime}\sqrt{m} with high probability.

Now that we know that \left\lVert x^{*}-\breve{x}\right\rVert_{2}\leq O(1), we can bootstrap to show that \left\lVert x^{*}-\breve{x}\right\rVert_{2}\leq O(\sqrt{(k\log n)/m}). While the previous proof relied on the constant regime of the lower bound provided by Lemma A.3, the following proof relies on the linear regime.

Proof of Theorem A.7.

As before, let a=A(\breve{x}-x^{*}) and b=\mu_{A\breve{x}}-\mu_{Ax^{*}}. Suppose that the conclusion of Theorem A.6 holds, i.e. \left\lVert a\right\rVert_{2}\leq C^{\prime}\sqrt{m}. Also suppose that the events stated in Lemmas A.1 and A.5 hold. We can make these assumptions with high probability. For j\in I_{\text{good}}, we now know that |a_{j}|^{2}\leq(6/(\alpha m))\left\lVert a\right\rVert_{2}^{2}=O(1). Thus,

|bj|C|aj|min(1/|aj|,1)δ|aj||b_{j}|\geq C|a_{j}|\cdot\min(1/|a_{j}|,1)\geq\delta|a_{j}|

where δ=Cmin(1,α/6/C)\delta=C\min(1,\sqrt{\alpha/6}/C^{\prime}). By the same argument as in the proof of Theorem A.6, except replacing (cm)/a2(c\sqrt{m})/\left\lVert a\right\rVert_{2} by δ\delta, we get that

ϵa22δ1a2AUu2.\epsilon\left\lVert a\right\rVert_{2}^{2}\leq\delta^{-1}\left\lVert a\right\rVert_{2}\cdot\left\lVert A_{U}u\right\rVert_{2}.

Thus, a2ϵ1δ1AUu2.\left\lVert a\right\rVert_{2}\leq\epsilon^{-1}\delta^{-1}\left\lVert A_{U}u\right\rVert_{2}. By the isometry property of AUTA_{U}^{T} on its row space (Corollary G.5), we get

\left\lVert a\right\rVert_{2}\leq\frac{T}{\tau^{2}\epsilon\delta\sqrt{m}}\left\lVert A_{U}^{T}A_{U}u\right\rVert_{2}=\frac{c^{\prime}}{\sqrt{m}}\left\lVert A_{U}^{T}b\right\rVert_{2}

for an appropriate constant cc^{\prime}. Since a=A(x˘x)a=A(\breve{x}-x^{*}) and AUA_{U} is a m\sqrt{m}-isometry up to constants (Theorem G.1), we get

\left\lVert\breve{x}-x^{*}\right\rVert_{2}\leq\frac{\left\lVert a\right\rVert_{2}}{\tau\sqrt{m}}\leq\frac{c^{\prime}}{\tau m}\left\lVert A_{U}^{T}b\right\rVert_{2}.

By Equation 4 and bounds on the other terms of Equation 4, the RHS of this inequality is O(λk+(klogn)/m)O(\lambda\sqrt{k}+\sqrt{(k\log n)/m}). ∎

J.7 Proof of Lemma B.1

For 1jm1\leq j\leq m we have by Lemma I.3 that

(𝔼RAj,x˘)2𝔼[RAj,x˘2]2log1γS(Ajx˘)+4.(\mathbb{E}R_{A_{j},\breve{x}})^{2}\leq\mathbb{E}[R_{A_{j},\breve{x}}^{2}]\leq 2\log\frac{1}{\gamma_{S}(A_{j}\breve{x})}+4.

By Lemma I.2, we have

log1γS(Ajx˘)2log1γS(Ajx)+|Ajx˘Ajx|2+2.\log\frac{1}{\gamma_{S}(A_{j}\breve{x})}\leq 2\log\frac{1}{\gamma_{S}(A_{j}x^{*})}+|A_{j}\breve{x}-A_{j}x^{*}|^{2}+2.

Therefore summing over all j[m]j\in[m],

𝔼RA,x˘224j=1mlog1γS(Ajx)+2A(x˘x)22+8m.\left\lVert\mathbb{E}R_{A,\breve{x}}\right\rVert_{2}^{2}\leq 4\sum_{j=1}^{m}\log\frac{1}{\gamma_{S}(A_{j}x^{*})}+2\left\lVert A(\breve{x}-x^{*})\right\rVert_{2}^{2}+8m.

Lemma I.4 bounds the first term. Theorems A.7 and G.1 bound the second: with high probability,

A(x˘x)22T(λkm+C′′klogn).\left\lVert A(\breve{x}-x^{*})\right\rVert_{2}\leq 2T(\lambda\sqrt{km}+C^{\prime\prime}\sqrt{k\log n}).

Thus,

𝔼[RA,x˘]228mlog(1/α)+8m+8T2λ2km+8(TC′′)2klogn\left\lVert\mathbb{E}[R_{A,\breve{x}}]\right\rVert_{2}^{2}\leq 8m\log(1/\alpha)+8m+8T^{2}\lambda^{2}km+8(TC^{\prime\prime})^{2}k\log n

with high probability. Under the assumptions λcreg′′/k\lambda\leq c^{\prime\prime}_{\text{reg}}/\sqrt{k} and mM′′klognm\geq M^{\prime\prime}k\log n, this quantity is O(m)O(m).

J.8 Proof of Lemma B.2

Draw m samples from the distribution R_{A_{j},x^{*}} as follows: pick a\sim N(0,1)^{n} and \eta\sim N(0,1). Keep the sample \eta if a^{T}x^{*}+\eta\in S; otherwise reject. We want to bound \eta_{1}^{2}+\dots+\eta_{m}^{2}. Now consider the following revised process: keep all the samples, but stop only once m samples satisfy a^{T}x^{*}+\eta\in S. Let t be the (random) stopping point; then the random variable \eta_{1}^{2}+\dots+\eta_{t}^{2} defined by the new process stochastically dominates the random variable \eta_{1}^{2}+\dots+\eta_{m}^{2} defined by the original process.

But in the new process, each ηi\eta_{i} is Gaussian and independent of η1,,ηi1\eta_{1},\dots,\eta_{i-1}. With high probability, t2m/αt\leq 2m/\alpha by a Chernoff bound. And if η1,,η2m/αN(0,1)\eta^{\prime}_{1},\dots,\eta^{\prime}_{2m/\alpha}\sim N(0,1) are independent then

η12++η2m/α24m/α{\eta^{\prime}}_{1}^{2}+\dots+{\eta^{\prime}}_{2m/\alpha}^{2}\leq 4m/\alpha

with high probability, by concentration of norms of Gaussian vectors. Therefore η12++ηt24m/α\eta_{1}^{2}+\dots+\eta_{t}^{2}\leq 4m/\alpha with high probability as well.

J.9 Proof of Theorem B.3

Set σ=4(c+cy)\sigma=4(\sqrt{c}+\sqrt{c_{y}}), where cc and cyc_{y} are the constants in Lemmas B.1 and B.2. Set M=max(16T2C′′2(σ+1)2/σ2,σ2/c′′reg2)M=\max(16T^{2}C^{\prime\prime 2}(\sigma+1)^{2}/\sigma^{2},\sigma^{2}/{c^{\prime\prime}}_{\text{reg}}^{2}). Note that MM is chosen sufficiently large that λ=σ(logn)/mcreg′′/k\lambda=\sigma\sqrt{(\log n)/m}\leq c^{\prime\prime}_{\text{reg}}/\sqrt{k}.

By Theorem A.7, we have with high probability that the following event holds, which we call EcloseE_{\text{close}}:

xx˘2C′′(λk+(klogn)/m)=C′′(σ+1)klognm.\left\lVert x^{*}-\breve{x}\right\rVert_{2}\leq C^{\prime\prime}(\lambda\sqrt{k}+\sqrt{(k\log n)/m})=C^{\prime\prime}(\sigma+1)\sqrt{\frac{k\log n}{m}}.

Now notice that

𝔼ZA,x˘y=A(x˘x)+𝔼RA,x˘(yAx).\mathbb{E}Z_{A,\breve{x}}-y=A(\breve{x}-x^{*})+\mathbb{E}R_{A,\breve{x}}-(y-Ax^{*}).

If EcloseE_{\text{close}} holds, then by Theorem G.1, we get A(x˘x)2TC′′(σ+1)klogn\left\lVert A(\breve{x}-x^{*})\right\rVert_{2}\leq TC^{\prime\prime}(\sigma+1)\sqrt{k\log n}. By Lemma B.1, with high probability 𝔼RA,x˘2cm\left\lVert\mathbb{E}R_{A,\breve{x}}\right\rVert_{2}\leq\sqrt{cm}. And by Lemma B.2, with high probability yAx2cym\left\lVert y-Ax^{*}\right\rVert_{2}\leq\sqrt{c_{y}m}. Therefore

𝔼ZA,x˘y2TC′′(σ+1)klogn+cym+cmσ2m\left\lVert\mathbb{E}Z_{A,\breve{x}}-y\right\rVert_{2}\leq TC^{\prime\prime}(\sigma+1)\sqrt{k\log n}+\sqrt{c_{y}m}+\sqrt{cm}\leq\frac{\sigma}{2}\sqrt{m}

where the last inequality is by choice of MM and σ\sigma. Thus, the event

E:𝔼ZA,x˘y2σ2mE:\left\lVert\mathbb{E}Z_{A,\breve{x}}-y\right\rVert_{2}\leq\frac{\sigma}{2}\sqrt{m}

occurs with high probability.

Suppose that event E occurs. Now note that A_{U^{c}}^{T} has independent Gaussian entries. Fix any i\in U^{c}; since (A^{T})_{i} is independent of A_{U}, \breve{x}, and y, the dot product

(AT)i(𝔼ZA,x˘y)(A^{T})_{i}(\mathbb{E}Z_{A,\breve{x}}-y)

is Gaussian with variance 𝔼ZA,x˘y22σ2m/4\left\lVert\mathbb{E}Z_{A,\breve{x}}-y\right\rVert_{2}^{2}\leq\sigma^{2}m/4. Hence, z˘i=1λm(AT)i(𝔼ZA,x˘y)\breve{z}_{i}=\frac{1}{\lambda m}(A^{T})_{i}(\mathbb{E}Z_{A,\breve{x}}-y) is Gaussian with variance at most (σ2m/4)/(λm)2=1/(4logn)(\sigma^{2}m/4)/(\lambda m)^{2}=1/(4\log n). So

Pr[|z˘i|1]2e2logn2n2.\Pr[|\breve{z}_{i}|\geq 1]\leq 2e^{-2\log n}\leq\frac{2}{n^{2}}.

By a union bound,

Pr[z˘Uc1]2n.\Pr[\left\lVert\breve{z}_{U^{c}}\right\rVert_{\infty}\geq 1]\leq\frac{2}{n}.

So the event \left\lVert\breve{z}_{U^{c}}\right\rVert_{\infty}<1 holds with high probability.

J.10 Proof of Theorem B.4

We know from Theorem B.3 that 1mAUcT(𝔼[ZA,x˘]y)λ/3\left\lVert\frac{1}{m}A_{U^{c}}^{T}(\mathbb{E}[Z_{A,\breve{x}}]-y)\right\rVert_{\infty}\leq\lambda/3. So it suffices to show that

1mAUcT(𝔼[ZA,X]y)AUcT(𝔼[ZA,x˘]y)λ6.\frac{1}{m}\left\lVert A_{U^{c}}^{T}(\mathbb{E}[Z_{A,X}]-y)-A_{U^{c}}^{T}(\mathbb{E}[Z_{A,\breve{x}}]-y)\right\rVert_{\infty}\leq\frac{\lambda}{6}.

Thus, we need to show that

1m|(AT)i(𝔼[ZA,X]𝔼[ZA,x˘])|λ6\frac{1}{m}|(A^{T})_{i}(\mathbb{E}[Z_{A,X}]-\mathbb{E}[Z_{A,\breve{x}}])|\leq\frac{\lambda}{6}

for all iUci\in U^{c}. Fix one such ii. Then by Lemma A.2,

\displaystyle\left\lVert\mathbb{E}[Z_{A,X}]-\mathbb{E}[Z_{A,\breve{x}}]\right\rVert_{2}^{2}=\sum_{j=1}^{m}(\mu_{A_{j}X}-\mu_{A_{j}\breve{x}})^{2}
\displaystyle=\sum_{j=1}^{m}\left(\int_{A_{j}X}^{A_{j}\breve{x}}\operatorname{Var}(Z_{t})\,dt\right)^{2}
\displaystyle\leq\sum_{j=1}^{m}(A_{j}X-A_{j}\breve{x})^{2}\cdot\sup_{t\in[A_{j}X,A_{j}\breve{x}]}\operatorname{Var}(Z_{t}).

By Lemma I.4, we have

j=1mlog1γS(Ajx)2mlog(1/α)\sum_{j=1}^{m}\log\frac{1}{\gamma_{S}(A_{j}x^{*})}\leq 2m\log(1/\alpha)

with high probability over A. Assume that this inequality holds, and assume that \left\lVert X-\breve{x}\right\rVert_{2}\leq 1/m and \left\lVert\breve{x}-x^{*}\right\rVert_{2}\leq 1, so that \left\lVert X-x^{*}\right\rVert_{2}\leq 2. Then by Theorem G.1, \left\lVert A(X-x^{*})\right\rVert_{2}\leq 2T\sqrt{m}. By Lemma I.2, for every j\in[m] and every t\in[A_{j}X,A_{j}\breve{x}],

log1γS(t)2log1γS(Ajx)+|tAjx|2+2cm\log\frac{1}{\gamma_{S}(t)}\leq 2\log\frac{1}{\gamma_{S}(A_{j}x^{*})}+|t-A_{j}x^{*}|^{2}+2\leq cm

for a constant cc. Hence, by Lemma I.3,

Var(Zt)𝔼[(Ztt)2]2log1γS(t)+42cm+4.\operatorname{Var}(Z_{t})\leq\mathbb{E}[(Z_{t}-t)^{2}]\leq 2\log\frac{1}{\gamma_{S}(t)}+4\leq 2cm+4.

We conclude that

𝔼[ZA,X]𝔼[ZA,x˘]22(2cm+4)AXAx˘22(2cm+4)T2mm2O(1).\left\lVert\mathbb{E}[Z_{A,X}]-\mathbb{E}[Z_{A,\breve{x}}]\right\rVert_{2}^{2}\leq(2cm+4)\left\lVert AX-A\breve{x}\right\rVert_{2}^{2}\leq(2cm+4)\frac{T^{2}m}{m^{2}}\leq O(1).

Additionally, (AT)i2Tm\left\lVert(A^{T})_{i}\right\rVert_{2}\leq T\sqrt{m} with high probability. Thus, Cauchy-Schwarz entails that

1m|(AT)i(𝔼[ZA,X]𝔼[ZA,x˘])|1mTmO(1)λ6\frac{1}{m}|(A^{T})_{i}(\mathbb{E}[Z_{A,X}]-\mathbb{E}[Z_{A,\breve{x}}])|\leq\frac{1}{m}T\sqrt{m}\cdot O(1)\leq\frac{\lambda}{6}

for large nn.

J.11 Proof of Lemma 5.4

Note that

Ax˘y2A(x˘x)2+Axy2.\left\lVert A\breve{x}-y\right\rVert_{2}\leq\left\lVert A(\breve{x}-x^{*})\right\rVert_{2}+\left\lVert Ax^{*}-y\right\rVert_{2}.

With high probability, x˘x21\left\lVert\breve{x}-x^{*}\right\rVert_{2}\leq 1. Theorem G.1 gives that A(x˘x)2Tm\left\lVert A(\breve{x}-x^{*})\right\rVert_{2}\leq T\sqrt{m}. Furthermore, Axy22m/α\left\lVert Ax^{*}-y\right\rVert_{2}\leq 2\sqrt{m/\alpha} by Lemma B.2.

J.12 Proof of Lemma 5.5

With high probability x˘r\breve{x}\in\mathscr{E}_{r} by the above lemma. Note that x(0)1x˘1\left\lVert x^{(0)}\right\rVert_{1}\leq\left\lVert\breve{x}\right\rVert_{1}, and A(x(0)x˘)22rm\left\lVert A(x^{(0)}-\breve{x})\right\rVert_{2}\leq 2r\sqrt{m}. Set ρ=min(τ/(4T),1/3)\rho=\min(\tau/(4T),1/3) and s=k(1+1/ρ2)s=k(1+1/\rho^{2}). If mMklognm\geq Mk\log n for a sufficiently large constant MM, then by Corollary G.6, A/mA/\sqrt{m} with high probability satisfies the ss-Restricted Isometry Property. Then by Theorem D.1 (due to [6], but reproduced here for completeness), it follows that x(0)x˘2O(1)\left\lVert x^{(0)}-\breve{x}\right\rVert_{2}\leq O(1).

J.13 Proof of Lemma 5.6

Note that sign(x(t))\operatorname{sign}(x^{(t)}) is a subgradient for x1\left\lVert x\right\rVert_{1} at x=x(t)x=x^{(t)}. Furthermore, for fixed AA,

𝔼[Aj(z(t)yj)|x(t)]=1mj=1mAj(𝔼ZAj,x(t)yj)=nll(x(t);A,y).\mathbb{E}\left[A_{j}(z^{(t)}-y_{j})\middle|x^{(t)}\right]=\frac{1}{m}\sum_{j^{\prime}=1}^{m}A_{j^{\prime}}(\mathbb{E}Z_{A_{j^{\prime}},x^{(t)}}-y_{j^{\prime}})=\nabla\operatorname{nll}(x^{(t)};A,y).

It follows that

\mathbb{E}[v^{(t)}|x^{(t)}]=\mathbb{E}\left[A_{j}(z^{(t)}-y_{j})\middle|x^{(t)}\right]+\lambda\cdot\operatorname{sign}(x^{(t)})

is a subgradient for f(x)f(x) at x=x(t)x=x^{(t)}.

We proceed to bounding 𝔼[v(t)22|x(t)]\mathbb{E}[\left\lVert v^{(t)}\right\rVert_{2}^{2}|x^{(t)}]. By definition of v(t)v^{(t)},

v(t)222Aj(z(t)yj)22+2λsign(x(t))22\left\lVert v^{(t)}\right\rVert_{2}^{2}\leq 2\left\lVert A_{j}(z^{(t)}-y_{j})\right\rVert_{2}^{2}+2\left\lVert\lambda\cdot\operatorname{sign}(x^{(t)})\right\rVert_{2}^{2}

where j[m]j\in[m] is uniformly random, and z(t)|x(t)ZAj,x(t)z^{(t)}|x^{(t)}\sim Z_{A_{j},x^{(t)}}. Since λsign(x(t))22=o(n)\left\lVert\lambda\cdot\operatorname{sign}(x^{(t)})\right\rVert_{2}^{2}=o(n) it remains to bound the other term. We have that

𝔼[Aj(z(t)yj)22|x(t)]=1mj=1m𝔼[Aj(ZAj,x(t)yj)22].\mathbb{E}[\left\lVert A_{j}(z^{(t)}-y_{j})\right\rVert_{2}^{2}|x^{(t)}]=\frac{1}{m}\sum_{j^{\prime}=1}^{m}\mathbb{E}[\left\lVert A_{j^{\prime}}(Z_{A_{j^{\prime}},x^{(t)}}-y_{j^{\prime}})\right\rVert_{2}^{2}].

With high probability, Ai222n\left\lVert A_{i}\right\rVert_{2}^{2}\leq 2n for all i[m]i\in[m]. Thus,

𝔼[Aj(z(t)yj)22|x(t)]nmj=1m𝔼[(ZAj,x(t)yj)2].\mathbb{E}[\left\lVert A_{j}(z^{(t)}-y_{j})\right\rVert_{2}^{2}|x^{(t)}]\leq\frac{n}{m}\sum_{j^{\prime}=1}^{m}\mathbb{E}[(Z_{A_{j^{\prime}},x^{(t)}}-y_{j^{\prime}})^{2}].

Now

i=1m𝔼[(ZAi,x(t)yi)2]2i=1m(Aix(t)yi)2+2i=1m𝔼[(Aix(t)ZAi,x(t))2].\sum_{i=1}^{m}\mathbb{E}[(Z_{A_{i},x^{(t)}}-y_{i})^{2}]\leq 2\sum_{i=1}^{m}(A_{i}x^{(t)}-y_{i})^{2}+2\sum_{i=1}^{m}\mathbb{E}[(A_{i}x^{(t)}-Z_{A_{i},x^{(t)}})^{2}].

The first term is bounded by 2r2m2r^{2}m since x(t)rx^{(t)}\in\mathscr{E}_{r}. Additionally,

A(x(t)x)2Ax(t)y2+Axy22rm\left\lVert A(x^{(t)}-x^{*})\right\rVert_{2}\leq\left\lVert Ax^{(t)}-y\right\rVert_{2}+\left\lVert Ax^{*}-y\right\rVert_{2}\leq 2r\sqrt{m}

since x(t),xrx^{(t)},x^{*}\in\mathscr{E}_{r}. Therefore the second term is bounded as

2i=1m𝔼[RAi,x(t)2]\displaystyle 2\sum_{i=1}^{m}\mathbb{E}[R_{A_{i},x^{(t)}}^{2}] 4i=1mlog(1γS(Aix(t)))+8m\displaystyle\leq 4\sum_{i=1}^{m}\log\left(\frac{1}{\gamma_{S}(A_{i}x^{(t)})}\right)+8m
8i=1mlog(1γS(Aix))+4A(x(t)x)22+16m\displaystyle\leq 8\sum_{i=1}^{m}\log\left(\frac{1}{\gamma_{S}(A_{i}x^{*})}\right)+4\left\lVert A(x^{(t)}-x^{*})\right\rVert_{2}^{2}+16m
64log(1/α)m+16r2m+80m.\displaystyle\leq 64\log(1/\alpha)m+16r^{2}m+80m.

where the first and second inequalities are by Lemmas I.3 and I.2 respectively, and the third inequality is by Lemma I.4. Putting together the two bounds, we get

i=1m𝔼[(ZAi,x(t)yi)2]O(m),\sum_{i=1}^{m}\mathbb{E}[(Z_{A_{i},x^{(t)}}-y_{i})^{2}]\leq O(m),

from which we conclude that 𝔼[v(t)22|x(t)]O(n)\mathbb{E}[\left\lVert v^{(t)}\right\rVert_{2}^{2}|x^{(t)}]\leq O(n). The law of total expectation implies that 𝔼[v(t)22]O(n)\mathbb{E}[\left\lVert v^{(t)}\right\rVert_{2}^{2}]\leq O(n) as well.

J.14 Proof of Lemma 5.7

We need to show that fUf_{\mathbb{R}^{U}} is ζ\zeta-strongly convex near x˘\breve{x}. Since x1\left\lVert x\right\rVert_{1} is convex, it suffices to show that nll(x;A,y)U\operatorname{nll}(x;A,y)_{\mathbb{R}^{U}} is ζ\zeta-strongly convex near x˘\breve{x}. The Hessian of nll(x;A,y)|U\operatorname{nll}(x;A,y)|_{\mathbb{R}^{U}} is

HU(x;A,y)=1mj=1mAj,UTAj,UVar(ZAj,x).H_{U}(x;A,y)=\frac{1}{m}\sum_{j=1}^{m}A_{j,U}^{T}A_{j,U}\operatorname{Var}(Z_{A_{j},x}).

Hence, it suffices to show that

1mj=1mAj,UTAj,UVar(ZAj,x)ζI\frac{1}{m}\sum_{j=1}^{m}A_{j,U}^{T}A_{j,U}\operatorname{Var}(Z_{A_{j},x})\succeq\zeta I

for all xnx\in\mathbb{R}^{n} with supp(x)U\operatorname{supp}(x)\subseteq U and xx˘21\left\lVert x-\breve{x}\right\rVert_{2}\leq 1. Call this region \mathscr{B}. With high probability over AA we can deduce the following.

(i) By Theorem A.7, we have \left\lVert\breve{x}-x^{*}\right\rVert_{2}\leq d\sqrt{(k\log n)/m}. As \left\lVert x-\breve{x}\right\rVert_{2}\leq 1 for all x\in\mathscr{B}, we get \left\lVert x-x^{*}\right\rVert_{2}\leq d+1 and hence \left\lVert A(x-x^{*})\right\rVert_{2}^{2}\leq T^{2}(d+1)^{2}m for all x\in\mathscr{B}.

(ii) By the proof of Lemma A.4, the number of j[m]j\in[m] such that γS(Ajx)α/2\gamma_{S}(A_{j}x^{*})\leq\alpha/2 is at most (1α/3)m(1-\alpha/3)m.

Fix xx\in\mathscr{B}, and define Jx[m]J_{x}\subseteq[m] to be the set of indices

J_{x}=\{j\in[m]:\gamma_{S}(A_{j}x^{*})\geq\alpha/2\land|A_{j}(x-x^{*})|^{2}\leq(6/\alpha)T^{2}(d+1)^{2}\}.

For any jJxj\in J_{x},

\log\frac{1}{\gamma_{S}(A_{j}x)}\leq 2\log\frac{1}{\gamma_{S}(A_{j}x^{*})}+|A_{j}(x-x^{*})|^{2}+2\leq 2\log(2/\alpha)+(6/\alpha)T^{2}(d+1)^{2}+2.

Thus,

\operatorname{Var}(Z_{A_{j},x})\geq C\gamma_{S}(A_{j}x)^{2}\geq Ce^{-4\log(2/\alpha)-(12/\alpha)T^{2}(d+1)^{2}-4}=\Omega(1).

Let δ\delta denote this lower bound—a positive constant. By (i) and (ii), |Jx|(α/6)m|J_{x}|\geq(\alpha/6)m, so by Theorem G.1,

H_{U}(x;A,y)=\frac{1}{m}\sum_{j=1}^{m}A_{j,U}^{T}A_{j,U}\operatorname{Var}(Z_{A_{j},x})\succeq\frac{\delta}{m}A_{J_{x},U}^{T}A_{J_{x},U}\succeq\delta\tau^{2}I

as desired.

J.15 Proof of Lemma 5.8

Let t=(xx˘)U2t=\left\lVert(x-\breve{x})_{U}\right\rVert_{2}. Define w=x˘+(xx˘)min(t1/m,1)w=\breve{x}+(x-\breve{x})\min(t^{-1}/m,1). Also define w=[wU;0Uc]nw^{\prime}=[w_{U};0_{U^{c}}]\in\mathbb{R}^{n}. Then (wx˘)U21/m\left\lVert(w-\breve{x})_{U}\right\rVert_{2}\leq 1/m, so

(nll(w;A,y))Ucλ2.\left\lVert(\nabla\operatorname{nll}(w^{\prime};A,y))_{U^{c}}\right\rVert_{\infty}\leq\frac{\lambda}{2}.

Therefore w_{i}\cdot(\nabla\operatorname{nll}(w^{\prime};A,y))_{i}\geq-(\lambda/2)|w_{i}| for all i\in U^{c}, so

f(w)f(w)\displaystyle f(w)-f(w^{\prime}) =(nll(w;A,y)nll(w;A,y))+λ(w1w1)\displaystyle=(\operatorname{nll}(w;A,y)-\operatorname{nll}(w^{\prime};A,y))+\lambda(\left\lVert w\right\rVert_{1}-\left\lVert w^{\prime}\right\rVert_{1})
(ww)nll(w;A,y)+λwUc1\displaystyle\geq(w-w^{\prime})\cdot\nabla\operatorname{nll}(w^{\prime};A,y)+\lambda\left\lVert w_{U^{c}}\right\rVert_{1}
λ2wUc1.\displaystyle\geq\frac{\lambda}{2}\left\lVert w_{U^{c}}\right\rVert_{1}.

Additionally, since wx˘21\left\lVert w^{\prime}-\breve{x}\right\rVert_{2}\leq 1 and supp(w)U\operatorname{supp}(w^{\prime})\subseteq U, we know that

f(w)f(x˘)ζ2wx˘22.f(w^{\prime})-f(\breve{x})\geq\frac{\zeta}{2}\left\lVert w^{\prime}-\breve{x}\right\rVert_{2}^{2}.

Adding the second inequality to the square of the first inequality, and lower bounding the 1\ell_{1} norm by 2\ell_{2} norm,

12(f(w)f(x˘))2+12(f(w)f(x˘))\displaystyle\frac{1}{2}(f(w)-f(\breve{x}))^{2}+\frac{1}{2}(f(w)-f(\breve{x})) 12(f(w)f(w))2+12(f(w)f(x˘))\displaystyle\geq\frac{1}{2}(f(w)-f(w^{\prime}))^{2}+\frac{1}{2}(f(w^{\prime})-f(\breve{x}))
λ28wUc22+ζ4wx˘22\displaystyle\geq\frac{\lambda^{2}}{8}\left\lVert w_{U^{c}}\right\rVert_{2}^{2}+\frac{\zeta}{4}\left\lVert w^{\prime}-\breve{x}\right\rVert_{2}^{2}
λ28(wx˘)Uc22+ζ4(wx˘)U22\displaystyle\geq\frac{\lambda^{2}}{8}\left\lVert(w-\breve{x})_{U^{c}}\right\rVert_{2}^{2}+\frac{\zeta}{4}\left\lVert(w-\breve{x})_{U}\right\rVert_{2}^{2}
min(λ28,ζ4)wx˘22\displaystyle\geq\min\left(\frac{\lambda^{2}}{8},\frac{\zeta}{4}\right)\left\lVert w-\breve{x}\right\rVert_{2}^{2}

Since f(x)f(x˘)1f(x)-f(\breve{x})\leq 1, by convexity f(w)f(x˘)1f(w)-f(\breve{x})\leq 1 as well. Hence,

f(w)f(x˘)12(f(w)f(x˘))2+12(f(w)f(x˘))min(λ28,ζ4)wx˘22.f(w)-f(\breve{x})\geq\frac{1}{2}(f(w)-f(\breve{x}))^{2}+\frac{1}{2}(f(w)-f(\breve{x}))\geq\min\left(\frac{\lambda^{2}}{8},\frac{\zeta}{4}\right)\left\lVert w-\breve{x}\right\rVert_{2}^{2}. (7)

We distinguish two cases:

  1. 1.

    If t1/mt\leq 1/m, then w=xw=x, and it follows from Equation 7 that

    f(x)f(x˘)min(λ28,ζ4)xx˘22f(x)-f(\breve{x})\geq\min\left(\frac{\lambda^{2}}{8},\frac{\zeta}{4}\right)\left\lVert x-\breve{x}\right\rVert_{2}^{2}

    as desired.

  2. 2.

    If t1/mt\geq 1/m, then (wx˘)U2=1/m\left\lVert(w-\breve{x})_{U}\right\rVert_{2}=1/m, and thus wx˘21/m\left\lVert w-\breve{x}\right\rVert_{2}\geq 1/m. By convexity and this bound,

    f(x)f(x˘)f(w)f(x˘)min(λ28,ζ4)1m2,f(x)-f(\breve{x})\geq f(w)-f(\breve{x})\geq\min\left(\frac{\lambda^{2}}{8},\frac{\zeta}{4}\right)\frac{1}{m^{2}},

    which contradicts the lemma’s assumption for a sufficiently small constant cf>0c_{f}>0.

J.16 Proof of Theorem 5.9

By Lemmas 5.4, 5.5, and 5.6, we are guaranteed that x˘r\breve{x}\in\mathscr{E}_{r}, x(0)x˘22O(1)\left\lVert x^{(0)}-\breve{x}\right\rVert_{2}^{2}\leq O(1), and 𝔼[v(t)22]O(n)\mathbb{E}[\left\lVert v^{(t)}\right\rVert_{2}^{2}]\leq O(n) for all tt. Thus, applying Theorem H.1 with projection set r\mathscr{E}_{r}, step count T=m6nT=m^{6}n, and step size η=1/Tn\eta=1/\sqrt{Tn} gives 𝔼[f(x¯)]f(x˘)O(1/m3)\mathbb{E}[f(\bar{x})]-f(\breve{x})\leq O(1/m^{3}). Since f(x¯)f(x˘)f(\bar{x})-f(\breve{x}) is nonnegative, Markov’s inequality gives

Pr[f(x¯)f(x˘)cf(logn)/m3]1o(1).\Pr[f(\bar{x})-f(\breve{x})\leq c_{f}(\log n)/m^{3}]\geq 1-o(1).

From Lemma 5.8 we conclude that \left\lVert\bar{x}-\breve{x}\right\rVert_{2}\leq O(1/m) with high probability.

Appendix K Efficient sampling for union of intervals

In this section, in Lemma K.4, we see that when S=i=1r[ai,bi]S=\cup_{i=1}^{r}[a_{i},b_{i}], with ai,bia_{i},b_{i}\in\mathbb{R}, then Assumption II holds with T(γS(t))=poly(log(1/γS(t)),r)T(\gamma_{S}(t))=\operatorname{poly}(\log(1/\gamma_{S}(t)),r). The only difference is that instead of exact sampling we have approximate sampling, but the approximation error is exponentially small in total variation distance and hence it cannot affect any algorithm that runs in polynomial time.

Definition K.1 (Evaluation Oracle).

Let f:f:\mathbb{R}\to\mathbb{R} be an arbitrary real function. We define the evaluation oracle f\mathscr{E}_{f} of ff as an oracle that given a number xx\in\mathbb{R} and a target accuracy η\eta computes an η\eta-approximate value of f(x)f(x), that is |f(x)f(x)|η\left|\mathscr{E}_{f}(x)-f(x)\right|\leq\eta.

Lemma K.2.

Let f:\mathbb{R}\to\mathbb{R}_{+} be a real increasing and differentiable function and \mathscr{E}_{f} an evaluation oracle of f. Let \ell\leq f^{\prime}(x)\leq L for some \ell,L\in\mathbb{R}_{+}. Then we can construct an algorithm that implements the evaluation oracle of f^{-1}, i.e. \mathscr{E}_{f^{-1}}. This implementation, on input y\in\mathbb{R}_{+} and target accuracy \eta, runs in time T and uses at most T calls to the evaluation oracle \mathscr{E}_{f}, with inputs x of representation length T and input accuracy \eta^{\prime}=\ell\cdot\eta, where T=\operatorname{poly}\log(\max\{|f(0)/y|,|y/f(0)|\},L,1/\ell,1/\eta).

Proof of Lemma K.2.

Given a value y\in\mathbb{R}_{+}, our goal is to find an x\in\mathbb{R} such that f(x)=y. Using doubling we can find two numbers a,b such that f(a)\leq y-\eta^{\prime} and f(b)\geq y+\eta^{\prime}, for some \eta^{\prime} to be determined later. Because of the lower bound \ell on the derivative of f, this step takes at most \log((1/\ell)\cdot\max\{|f(0)/y|,|y/f(0)|\}) steps. We then run binary search on the interval [a,b], making one call to the oracle \mathscr{E}_{f} with accuracy \eta^{\prime} in each step, and find a point \hat{x} such that \left|f(x)-f(\hat{x})\right|\leq\eta^{\prime}. Because of the upper bound on the derivative of f, the function f is L-Lipschitz, so the binary search needs \log(L/\eta^{\prime}) steps and oracle calls. Finally, by the mean value theorem there is some \xi between x and \hat{x} with \left|f(x)-f(\hat{x})\right|=|f^{\prime}(\xi)|\left|x-\hat{x}\right|, which implies \left|x-\hat{x}\right|\leq\eta^{\prime}/\ell; setting \eta^{\prime}=\ell\cdot\eta, the lemma follows. ∎
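A minimal sketch of this doubling-plus-bisection inversion, written for an exact evaluation oracle so that the \eta^{\prime}-accuracy bookkeeping of the lemma is omitted:

def invert_increasing(f, y, lo=0.0, hi=1.0, tol=1e-12):
    # Find x with f(x) close to y, for an increasing f whose range contains y
    # (sketch: an exact oracle for f is assumed).
    # Doubling phase: expand [lo, hi] geometrically until it brackets y.
    while f(lo) > y:
        lo -= (hi - lo)
    while f(hi) < y:
        hi += (hi - lo)
    # Bisection phase: each step halves the interval, so log((hi-lo)/tol) steps suffice.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2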

Using Lemma K.2 and Proposition 3 of [7], it is easy to prove the following lemma.

Lemma K.3.

Let [a,b] be a closed interval and \mu\in\mathbb{R} such that \gamma_{[a,b]}(\mu)=\alpha. Then there exists an algorithm that runs in time \operatorname{poly}\log(1/\alpha,1/\zeta) and returns a sample from a distribution \mathscr{D} such that d_{\mathrm{TV}}(\mathscr{D},N(\mu,1;[a,b]))\leq\zeta.

Proof Sketch.

The sampling algorithm follows these steps: (1) let F be the cumulative distribution function of N(\mu,1;[a,b]), viewed as a map from [a,b] to [0,1]; (2) sample a number y uniformly in [0,1]; and (3) using an evaluation oracle for the error function, as per Proposition 3 in [7], together with Lemma K.2, compute the value F^{-1}(y) with exponential accuracy and return it as the desired sample. ∎
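A sketch of this inverse-CDF sampler for a single interval. It uses the standard normal CDF and its inverse from the Python standard library in place of the erf oracle of [7], so the exponential-accuracy analysis is not reproduced; for intervals of very small mass, higher-precision CDF evaluation is needed, exactly as in the lemma.

import random
from statistics import NormalDist

def sample_truncated_interval(mu, a, b):
    # One sample from N(mu, 1) conditioned on [a, b], by inverting the CDF (sketch).
    phi = NormalDist()                      # standard normal
    Fa, Fb = phi.cdf(a - mu), phi.cdf(b - mu)
    u = random.random()                     # uniform in [0, 1)
    return mu + phi.inv_cdf(Fa + u * (Fb - Fa))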

Finally using again Proposition 3 in [7] and Lemma K.3 we can get the following lemma.

Lemma K.4.

Let [a_{1},b_{1}], [a_{2},b_{2}], \dots, [a_{r},b_{r}] be closed intervals and \mu\in\mathbb{R} such that \gamma_{\cup_{i=1}^{r}[a_{i},b_{i}]}(\mu)=\alpha. Then there exists an algorithm that runs in time \operatorname{poly}(\log(1/\alpha),\log(1/\zeta),r) and returns a sample from a distribution \mathscr{D} such that d_{\mathrm{TV}}(\mathscr{D},N(\mu,1;\cup_{i=1}^{r}[a_{i},b_{i}]))\leq\zeta.

Proof Sketch.

Using Proposition 3 in [7], we can compute estimates \hat{\alpha}_{i} that approximate, with exponential accuracy, the mass \alpha_{i}=\gamma_{[a_{i},b_{i}]}(\mu) of every interval [a_{i},b_{i}]. If \hat{\alpha}_{i}/\alpha\leq\zeta/(3r), we do not consider interval i in the next step. From the remaining intervals we choose one with probability proportional to \hat{\alpha}_{i}. Because of the high accuracy of the estimates \hat{\alpha}_{i}, this is \zeta/3-close in total variation distance to choosing an interval with probability proportional to \alpha_{i}. Finally, after choosing an interval i, we run the algorithm of Lemma K.3 with accuracy \zeta/3, and hence the overall total variation distance from N(\mu,1;\cup_{i=1}^{r}[a_{i},b_{i}]) is at most \zeta. ∎
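Combining the previous two sketches, here is a sketch of the union-of-intervals sampler, with the same standard-library CDF standing in for the erf oracle and with the pruning of negligible intervals kept; sample_truncated_interval is the function from the sketch after Lemma K.3.

import random
from statistics import NormalDist

def sample_truncated_union(mu, intervals, zeta=1e-9):
    # One sample from N(mu, 1) conditioned on a union of intervals (sketch).
    phi = NormalDist()
    masses = [phi.cdf(b - mu) - phi.cdf(a - mu) for a, b in intervals]
    total = sum(masses)
    r = len(intervals)
    # Drop intervals of negligible relative mass, as in the proof sketch above.
    kept = [(w, ab) for w, ab in zip(masses, intervals) if w / total > zeta / (3 * r)]
    weights = [w for w, _ in kept]
    a, b = random.choices([ab for _, ab in kept], weights=weights, k=1)[0]
    return sample_truncated_interval(mu, a, b)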