

Quantizing Heavy-tailed Data in Statistical Estimation: (Near) Minimax Rates, Covariate Quantization, and Uniform Recovery

Junren Chen,   Michael K. Ng,   Di Wang
Abstract

Modern datasets often exhibit heavy-tailed behaviour, while quantization is inevitable in digital signal processing and many machine learning problems. This paper studies the quantization of heavy-tailed data in several fundamental statistical estimation problems where the underlying distributions have bounded moments of some order (no greater than 4). We propose to truncate and properly dither the data prior to a uniform quantization. Our main standpoint is that (near) minimax rates of estimation error can be achieved by computationally tractable estimators based on the quantized data produced by the proposed scheme. In particular, concrete results are worked out for covariance estimation, compressed sensing (also interpreted as sparse linear regression), and matrix completion, all agreeing that the quantization only slightly worsens the multiplicative factor. Additionally, while prior results focused on quantization of responses (i.e., measurements), we study compressed sensing where the covariates (i.e., sensing vectors) are also quantized; in this case, our recovery program is non-convex because the covariance matrix estimator lacks positive semi-definiteness, but all local minimizers are proved to enjoy a near-optimal error bound. Moreover, by a concentration inequality for product processes and a covering argument, we establish a near-minimax uniform recovery guarantee for quantized compressed sensing with heavy-tailed noise. Finally, numerical simulations are provided to corroborate our theoretical results.

1 Introduction

Heavy-tailed distributions are ubiquitous in modern datasets, especially those arising in economics, finance, imaging, and biology; see [93, 84, 45, 5, 88, 57] for instance. In the recent literature, heavy-tailedness is often captured by a bounded l-th moment, where l is some fixed small scalar; this is essentially weaker than the sub-Gaussian assumption, and thus outliers and extreme values appear much more frequently in data from heavy-tailed distributions (referred to as heavy-tailed data), which poses challenges for statistical analysis. In fact, many standard statistical procedures developed for sub-Gaussian data suffer from performance degradation in the heavy-tailed regime. Fortunately, the past decade has witnessed considerable progress on statistical estimation methods that are robust to heavy-tailedness; see [18, 34, 73, 68, 10, 44, 35, 97, 65] for instance.

Departing momentarily from heavy-tailed data, quantization is an inevitable process in the era of digital signal processing, which maps signals to bitstreams so that they can be stored, processed and transmitted. In particular, the resolution of quantization should be selected to achieve a trade-off between accuracy and various data processing costs, and in some applications a relatively low resolution is preferable. For instance, in a distributed learning setting or a MIMO system, the frequent information transmission among multiple parties often results in prohibitive communication costs [56, 69], and quantizing signals or data to fairly low resolution (while preserving satisfactory utility) is an effective approach to reduce the cost [96, 43]. Against this backdrop, in recent years there has been a rapidly growing literature on high-dimensional signal recovery from quantized data (see, e.g., [8, 26, 21, 89, 50, 32, 31] for 1-bit quantization, [89, 50, 94, 46] for multi-bit uniform quantization), trying to understand the interplay between quantization and signal reconstruction (or parameter learning) in some fundamental estimation problems.

Independently, a set of robustifying techniques has been developed to overcome the challenges posed by heavy-tailed data, and uniform data quantization under a uniform dither was shown to cost very little in some recovery problems. Considering the ubiquity of heavy-tailed behaviour and data quantization, a natural question is how to design a quantization scheme for heavy-tailed data that only incurs minor information loss. For instance, when applied to statistical estimation problems with heavy-tailed data, an appropriate quantization scheme should enable at least one faithful estimator from the quantized data, and ideally an estimator nearly achieving the optimal error rate. Despite the vast literature in this field, prior results that simultaneously take heavy-tailed data and quantization into account are surprisingly rare; to the best of our knowledge, the only ones are those presented in [32] and our earlier work [21] regarding the dithered 1-bit quantizer. These results remain incomplete and exhibit some downsides. Specifically, [32] considered a computationally intractable program for quantized compressed sensing and used techniques that are hard to generalize to other problems, while the error rates in [21] are inferior to the corresponding minimax ones (under unquantized sub-Gaussian data), as will be discussed in Section 1.1.3. In a nutshell, a quantization scheme for heavy-tailed data arising in statistical estimation problems that allows for computationally tractable near-minimax estimators is still lacking.

This paper aims to provide a solution to the above question and narrow the gap between heavy-tailed data and data quantization in the literature. In particular, we propose a unified quantization scheme for heavy-tailed data which, when applied to the canonical estimation problems of (sparse) covariance matrix estimation, compressed sensing (or sparse linear regression) and matrix completion, allows for (near) minimax estimators that are either in closed-form or solved from convex programs. Additionally, we present novel developments concerning covariate (or sensing vector) quantization and uniform signal recovery in quantized compressed sensing with heavy-tailed data.

1.1 Related Works and Our Contributions

This section is devoted to a review of the most relevant works. Before that, we note that a heavy-tailed random variable in this work is formulated by the moment constraint \mathbbm{E}|X|^{l}\leq M, where M is oftentimes regarded as an absolute constant and l is some fixed small scalar (specifically, l\leq 4 in the present paper).

1.1.1 Statistical Estimation under Heavy-Tailed Data

Compared to sub-Gaussian data, heavy-tailed data may contain many outliers and extreme values that are overly influential on traditional estimation methods. Hence, developing estimation methods that are robust to heavy-tailedness has become a recent focus in the statistics literature, where heavy-tailed distributions are often only assumed to have bounded moments of some small order. In particular, significant efforts have been devoted to the fundamental problem of mean estimation for heavy-tailed distributions. For instance, effective techniques available in the literature include Catoni's mean estimator [18, 34], median of means [73, 68], and the trimmed mean [66, 28]. While the strategies to achieve robustness are different, these methods indeed share the same core spirit of making the outliers less influential. To this end, the trimmed method (also referred to as truncation or shrinkage) may be the most intuitive: it truncates overlarge data to some threshold so that they are more benign for the estimation procedure. For more in-depth discussions we refer to the recent survey [65]. Furthermore, these robust methods for estimating the mean have been applied to empirical risk minimization [10, 44] and various high-dimensional estimation problems [35, 97], achieving near optimal guarantees. For instance, by invoking M-estimators with truncated data, (near) minimax rates can be achieved in high-dimensional sparse linear regression, matrix completion, and covariance estimation [35]. In fact, techniques similar to truncation have proven effective in some related problems; e.g., in non-convex algorithms for phase retrieval, truncated Wirtinger flow, which uses a more selective spectral initialization and a carefully trimmed gradient [23], improves on Wirtinger flow [16] in terms of sample size.

Recall that we capture heavy-tailedness by a bounded moment of some small order. Regarding statistical estimation beyond sub-Gaussianity, there has been a line of works considering sub-exponential or, more generally, sub-Weibull distributions [80, 85, 58, 39], which have heavier tails than sub-Gaussian ones but still possess finite moments of arbitrary order. Specifically, without truncation and quantization, sparse linear regression was studied under sub-exponential data in [85] and under sub-Weibull data in [58], and the obtained error rates match the ones in the sub-Gaussian case up to logarithmic factors. Additionally, under sub-exponential measurement matrix and noise, [80] established a uniform guarantee for 1-bit generative compressed sensing, while [39] analysed the generalized Lasso for a general nonlinear model. Because the tail assumptions in these works are substantially stronger than ours (one exception is Theorem 4.6 in [58] for sparse linear regression with sub-Weibull covariate and heavy-tailed noise with bounded second moment, but still this result does not involve a quantization procedure), we will not make special comparisons with their results later; instead, we simply note here two key distinctions: 1) these works do not utilize special treatments for heavy-tailed data like the truncation in the present paper; 2) most of them (with the single exception of [80] studying 1-bit quantization) do not focus on quantization, while this work concentrates on quantization of heavy-tailed data.

1.1.2 Statistical Estimation from Quantized Data

Quantized Compressed Sensing. Due to the prominent role of quantization in signal processing and machine learning, quantized compressed sensing that studies the interplay between sparse signal reconstruction and data quantization has become an active research branch in the field. In this work, we focus on memoryless quantization scheme222This means that the quantization for different measurements are independent. For other quantization schemes we refer to the recent survey [29]. that embraces simple hardware design. An important model is 1-bit compressed sensing where only the sign of the measurement is retained [8, 48, 75, 76], and more precisely this model concerns the recovery of sparse 𝜽d\bm{\theta^{\star}}\in\mathbb{R}^{d} from sgn(𝑿𝜽)\operatorname{sgn}(\bm{X\theta^{\star}}) with the sensing matrix 𝑿n×d\bm{X}\in\mathbb{R}^{n\times d}. However, 1-bit compressed sensing associated with the direct sgn()\operatorname{sgn}(\cdot) quantization suffers from some frustrating limitations, e.g., the loss of signal norm information, and the identifiability issue under some regular sensing matrix (e.g., under Bernoulli sensing matrix, see [32]).333In fact, almost all existing guarantees using the 1-bit observations sgn(𝑿𝜽)\operatorname{sgn}(\bm{X\theta^{\star}}) are restricted to standard Gaussian sensing matrix consisting of i.i.d. 𝒩(0,1)\mathcal{N}(0,1) entries, with the exceptions of [1] for sub-Gaussian sensing matrix and [30, 86] for partial Gaussian circulant matrix. Fortunately, these limitations can be overcome by random dithering prior to the quantization, under which the 1-bit measurements read as sgn(𝑿𝜽+𝝉)\operatorname{sgn}(\bm{X\theta^{\star}}+\bm{\tau}) for some suitably chosen random dither 𝝉n\bm{\tau}\in\mathbb{R}^{n}. Specifically, under Gaussian dither 𝝉𝒩(0,𝑰n)\bm{\tau}\sim\mathcal{N}(0,\bm{I}_{n}) and standard Gaussian sensing matrix 𝑿\bm{X}, full reconstruction with norm information could be achieved, for which the key idea is sgn(𝑿𝜽+𝝉)=sgn([𝑿𝝉][(𝜽),1])\operatorname{sgn}(\bm{X\theta^{\star}}+\bm{\tau})=\operatorname{sgn}([\bm{X}~{}\bm{\tau}][(\bm{\theta^{\star}})^{\top},1]^{\top}), thus reducing the dithered model to the undithered model for the sparse signal [(𝜽),1][(\bm{\theta^{\star}})^{\top},1]^{\top} whose last entry is known beforehand [54].444We note that this idea has been recently extended in [20] to the related problem of phase-only compressed sensing, also see [47, 36, 7] for prior developments. More surprisingly, under a uniform random dither, recovery with norm can be achieved under rather general sub-Gaussian sensing matrix [32, 50, 89, 21] even with near optimal error rate.

Besides the 1-bit quantizer that retains the sign, the uniform quantizer maps aa\in\mathbb{R} to 𝒬Δ(a)=Δ(aΔ+12)\mathcal{Q}_{\Delta}(a)=\Delta\big{(}\lfloor\frac{a}{\Delta}\rfloor+\frac{1}{2}\big{)} for some pre-specified Δ>0\Delta>0; here and hereafter, we refer to Δ\Delta as the quantization level, and note that smaller Δ\Delta represents higher resolution. While recovering 𝜽\bm{\theta^{\star}} from 𝒬Δ(𝑿𝜽)\mathcal{Q}_{\Delta}(\bm{X\theta^{\star}}) encounters identifiability issue,555For instance, if 𝑿{1,1}n×d\bm{X}\in\{-1,1\}^{n\times d} (typical example is the Bernoulli design where entries of 𝑿\bm{X} are i.i.d. zero-mean) and Δ=1\Delta=1, then 𝜽1:=1.1𝒆1\bm{\theta}_{1}:=1.1\bm{e}_{1} and 𝜽2:=1.2𝒆1+0.1𝒆2\bm{\theta}_{2}:=1.2\bm{e}_{1}+0.1\bm{e}_{2} can never be distinguished because 𝒬1(𝑿𝜽1)=𝒬1(𝑿𝜽2)\mathcal{Q}_{1}(\bm{X\theta}_{1})=\mathcal{Q}_{1}(\bm{X\theta}_{2}) always holds. it is again beneficial to use random dithering to obtain the measurements 𝒬Δ(𝑿𝜽+𝝉)\mathcal{Q}_{\Delta}(\bm{X\theta^{\star}}+\bm{\tau}). More specifically, by using uniform dither the Lasso estimator [89, 87] and projected back projection (PBP) method [94] achieve minimax rate in certain cases, and the derived error bounds for these estimators demonstrate that the dithered uniform quantization does not affect the scaling law but only slightly worsens the multiplicative factor. Although the aforementioned progresses regarding compressed sensing under dithered quantization were recently made, the technique of dithering in quantization indeed has a long history and (at least) dates back to some early engineering work (e.g., [83]), see [41] for a brief introduction.

Other Estimation Problems with Quantized Data. Beyond compressed sensing, or its more statistical counterpart of sparse linear regression, some other statistical estimation problems have also been investigated under dithered 1-bit quantization. Specifically, [24] studied a general signal estimation problem under dithered 1-bit quantization in a traditional setting where the sample size tends to infinity, showing that only a logarithmic rate loss is incurred by the quantization. Inspired by the potential application in reducing power consumption in a large-scale massive MIMO system, [31] proposed to collect 2 bits per entry from each sub-Gaussian sample and developed an estimator that is near minimax optimal in certain cases. Their estimator from coarsely quantized samples was extended to the high-dimensional sparse case in [21]. Then, considering the ubiquity of binary observations in many recommendation systems, the authors of [26] first approached the 1-bit matrix completion problem by maximum likelihood estimation with a nuclear norm constraint. Their method was further developed in a series of follow-up works using different regularizers/constraints to encourage low-rankness, or considering multi-bit quantization on a finite alphabet [14, 59, 52, 17, 3]. Quantizing the observed entries by a dithered 1-bit quantizer, the 1-bit matrix completion result in [21] essentially departs from the standard likelihood approach and can tolerate pre-quantization noise with unknown distribution.

1.1.3 Quantization of Heavy-Tailed Data in Statistical Estimation

From now on, we turn to existing results more closely related to this work and explain our contributions. Note that the results we just reviewed are for estimation problems from either unquantized heavy-tailed data (Section 1.1.1) or quantized sub-Gaussian data (Section 1.1.2). While quantization of heavy-tailed data (from distributions assumed to have bounded moments of some small order) is a natural question of significant practical value, prior investigations turn out to be surprisingly rare, and the only results we are aware of were presented in [21, 32] concerning dithered 1-bit quantization. Specifically, [32, Thm. 1.11] considered heavy-tailed noise and possibly heavy-tailed covariates, implying that a sharp uniform error rate is achievable (see their Example 1.10). However, their result is for a computationally intractable program (Hamming distance minimization) and hence of limited practical value. Another limitation is that their techniques are based on random hyperplane tessellations, which are specialized to 1-bit compressed sensing and do not generalize to other estimation problems. In contrast, [21] proposed a unified quantization scheme that first truncates the data and then invokes a dithered 1-bit quantizer. Although this quantization scheme can (at least) be applied to sparse covariance matrix estimation, compressed sensing, and matrix completion while still enabling practical estimators, the main drawback is that the convergence rates of the estimation errors are essentially slower than the corresponding minimax optimal ones (e.g., \tilde{O}\big(\frac{\sqrt{s}}{n^{1/3}}\big) for 1-bit compressed sensing under heavy-tailed noise [21, Thm. 10]), and in certain cases the rates cannot be improved without changing the quantization process (e.g., [21, Thm. 11] complements [21, Thm. 10] with a nearly matching lower bound). Beyond that, the 1-bit compressed sensing results in [21] are non-uniform. In a nutshell, [32] proved a sharp rate for 1-bit compressed sensing but used a highly intractable program and techniques not extendable to other estimation regimes, while the more widely applicable scheme and practical estimators in [21] suffer from slow error rates (when compared to the ones achieved from unquantized sub-Gaussian data).

Our Main Contributions: (Near) Minimax Rates. We propose a unified quantization scheme for heavy-tailed data consisting of three steps: 1) truncation that shrinks data to some threshold, 2) dithering that adds suitable random noise to the truncated data, and 3) uniform quantization. For sub-Gaussian or sub-exponential data the truncation step is inessential, and we simply set the threshold as \infty in this case. Careful readers may notice that we merely replace the 1-bit quantizer in our prior work [21] with the less extreme (multi-bit) uniform quantizer \mathcal{Q}_{\Delta}(\cdot), but the gain turns out to be significant: we are now able to derive (near) optimal rates that are essentially faster than the ones in [21], see Theorems 2-8. Compared to [32], besides the different quantizers, the other major distinctions are that: 1) we utilize an additional truncation step, 2) our estimators are computationally feasible, and 3) we investigate multiple estimation problems with the possibility of extensions to further ones. Concerning the effect of quantization, our error rates suggest a unified conclusion that dithered uniform quantization does not affect the scaling law but only slightly worsens the multiplicative factor, which generalizes similar findings for quantized compressed sensing in [89, 87, 94] in two directions, i.e., to the case where heavy-tailed data are present and to some other estimation problems (i.e., covariance matrix estimation, matrix completion). As an example, for quantized compressed sensing with sub-Gaussian sensing vector \bm{x}_{k} but heavy-tailed measurement y_{k} satisfying \mathbbm{E}|y_{k}|^{2+\nu}\leq M for some \nu>0, we derive the \ell_{2}-norm error rate O\big(\mathscr{L}\sqrt{\frac{s\log d}{n}}\big) where \mathscr{L}=M^{1/(2l)}+\Delta (Theorem 5; s, d, n are respectively the sparsity, signal dimension, and number of measurements), which is reminiscent of the rates in [94, 89] in terms of the position of \Delta. From a technical side, many of our analyses of the dithered quantizer are much cleaner than prior works because we make full use of the nice statistical properties of the quantization error and quantization noise (Theorem 1), see Section 2.2 (prior works that did not fully leverage these properties may incur extra technical complications, e.g., the symmetrization and contraction in [89, Lem. A.2]). Also, the property of the quantization noise motivates us to use a triangular dither when covariance estimation is necessary, which departs from the uniform dither commonly adopted in prior works (e.g., [89, 94, 31, 21]) and is novel to the literature of quantized compressed sensing. In our subsequent work [22], a clean analysis of quantized low-rank multivariate regression with possibly quantized covariates is provided, and we believe the innovations of this work will prove useful in other problems.

1.1.4 Covariate Quantization in Quantized Compressed Sensing

From now on we concentrate on quantized compressed sensing, i.e., the recovery of a sparse signal 𝜽d\bm{\theta^{\star}}\in\mathbb{R}^{d} from the quantized version of (𝒙k,yk:=𝒙k𝜽+ϵk)k=1n(\bm{x}_{k},y_{k}:=\bm{x}_{k}^{\top}\bm{\theta^{\star}}+\epsilon_{k})_{k=1}^{n} where 𝒙k,yk,ϵk\bm{x}_{k},y_{k},\epsilon_{k} are the sensing vector, measurement and noise, respectively. Let us first clarify some terminology issue before proceeding. Note that this formulation also models the sparse linear regression problem (e.g., [72, 81]) where one wants to learn a sparse parameter 𝜽d\bm{\theta^{\star}}\in\mathbb{R}^{d} from the given data (𝒙k,yk)k=1n(\bm{x}_{k},y_{k})_{k=1}^{n} that are believed to follow the linear model yk=𝒙k𝜽+ϵky_{k}=\bm{x}_{k}^{\top}\bm{\theta^{\star}}+\epsilon_{k}, and in this regression problem 𝒙k,yk\bm{x}_{k},y_{k} are commonly referred to as covariate and response, respectively. We are interested in both settings in this work (as further explained in Section 3.2), but for clearer presentation, we simply refer to the problem as quantized compressed sensing, while calling 𝒙k,yk\bm{x}_{k},y_{k} covariate and response, respectively.

Despite a large volume of results in quantized compressed sensing, almost all of them are restricted to response quantization, thus the question of covariate quantization that allows for accurate subsequent recovery remains unsolved. Note that this question is meaningful especially when the problem is interpreted as sparse linear regression — working with low-precision data in some (distributed) learning systems could significantly reduce communication cost and power consumption [96, 43], which we will further demonstrate in Section 4.1. Therefore, it is of interest to understand how covariate quantization affects the learning of 𝜽\bm{\theta^{\star}}. To the best of our knowledge, the only existing rigorous guarantees for quantized compressed sensing involving covariate quantization were obtained in [21, Thms. 7-8]. However, these results require 𝔼(𝒙k𝒙k)\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}) to be sparse [21, Assumption 3] (in order to employ their sparse covariance matrix estimator), and this assumption is non-standard in sparse linear regression and compressed sensing.777In fact, although isotropic sensing vector (i.e., 𝔼(𝒙k𝒙k)=𝑰d\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top})=\bm{I}_{d}) has been conventional in compressed sensing, many results in the literature can be extended to sensing vector with general unknown covariance matrix and hence do not really rely on the sparsity of 𝔼(𝒙k𝒙k)\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}).

Our Contribution. Besides the above main contributions, we establish the estimation guarantees for quantized compressed sensing under covariate quantization that are free of the non-standard assumption on the sparsity of 𝔼(𝒙k𝒙k)\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}). Like [21], our estimation methods are built upon the quantized covariance matrix estimator developed in Section 3.1; but unlike [21] that relies on the sparsity of 𝔼(𝒙k𝒙k)\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}) to ensure convexity, we instead deal with the non-convex program with an additional 1\ell_{1}-norm constraint, under which we prove that all local minimizers deliver near minimax estimation errors (Theorems 9-10). Our analysis bears resemblance to a line of works on non-convex M-estimator [63, 62, 64] but also exhibits some essential differences (Remark 5). Further, we extract our techniques as a deterministic framework (Proposition 1) and then use it to establish guarantees under dithered 1-bit quantization and covariate quantization as byproducts (Theorems 11-12), which are comparable to [21, Thms. 7-8] but free of sparsity on 𝔼(𝒙k𝒙k)\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}).

1.1.5 Uniform Signal Recovery in Quantized Compressed Sensing

It is standard in compressed sensing to leverage a random sensing matrix, so a recovery guarantee can be uniform or non-uniform. More precisely, a uniform guarantee ensures the recovery of all structured signals of interest with a single draw of the sensing ensemble, while a non-uniform guarantee is only valid for a structured signal fixed before drawing the random ensemble, with the implication that a new realization of the sensing matrix is required for sensing a new signal. Uniformity is a highly desired property in compressed sensing, since in applications the measurement ensemble is typically fixed and is expected to work for all signals [40]. Besides, the derivation of a uniform guarantee is often significantly harder than a non-uniform one, making uniformity an interesting theoretical problem in its own right.

A classical fact in linear compressed sensing is that the restricted isometry property (RIP) of the sensing matrix implies uniform recovery of all sparse signals (e.g., [38]), but this is not the case when it comes to nonlinear compressed sensing models, for which uniform recovery guarantees are still being actively pursued. For instance, in the specific quantization models involving 1-bit/uniform quantization with/without dithering, or the more general single index model y_{k}=f\big(\bm{x}_{k}^{\top}\bm{\theta^{\star}}\big) with possibly unknown f(\cdot), the most representative results are non-uniform (e.g., [75, 79, 78, 89, 94, 87]). We refer to [75, 32, 94, 50] for concrete uniform guarantees, some of which remain (near) optimal (e.g., [94, Sect. 7.2A]), while others suffer from essential degradation compared to the non-uniform ones (e.g., [75, Thm. 1.3]). It is worth noting the interesting recent work [40], which provided a unified approach to uniform guarantees in a series of non-linear models; however, without the aid of some non-trivial embedding result, their uniform guarantees typically exhibit a decaying rate of O(n^{-1/4}) that is slower than the non-uniform one of O(n^{-1/2}) (Section 4 therein). Turning back to our focus of compressed sensing from quantized heavy-tailed data, the results in [21, Sect. III] are non-uniform, while [32, Thm. 1.11] presents a sharp uniform guarantee for the intractable program of Hamming distance minimization under dithered 1-bit quantization.

Our Contribution. We additionally contribute to the literature a uniform guarantee for constrained Lasso under dithered uniform quantization of heavy-tailed response. Specifically, we upgrade our non-uniform Theorem 5 to its uniform version Theorem 13, which states that using a single realization of the sub-Gaussian sensing matrix, heavy-tailed noise and uniform dither, all ss-sparse signals within an 2\ell_{2}-ball can be uniformly recovered up to an 2\ell_{2}-norm error of O~(sn)\tilde{O}\big{(}\sqrt{\frac{s}{n}}\big{)}, thus matching the near minimax non-uniform rate in Theorem 5 up to logarithmic factors. The proof relies on a concentration inequality for product process [67] and a careful covering argument inspired by [94]. Due to the heavy-tailed noise, new treatment is needed before invoking the concentration result from [67].

1.2 Outline

The remainder of this paper is structured as follows. We provide the notation and preliminaries in Section 2. We present the first set of main results (concerning the (near) optimal guarantees for three estimation problems under quantized heavy-tailed data) in Section 3. Our second set of results (concerning covariate quantization and uniform recovery in quantized compressed sensing) is then presented in Section 4. To corroborate our theory, numerical results on synthetic data are reported in Section 5. We give some remarks to conclude the paper in Section 6. All the proofs are postponed to the Appendices.

2 Preliminaries

We adopt the following conventions throughout the paper:

1) We use boldface symbols (e.g., \bm{A}, \bm{x}) to denote matrices and vectors, and regular letters (e.g., a, x) for scalars. We write [m]=\{1,...,m\} for a positive integer m. We denote the imaginary unit by \mathsf{i}. The i-th entry of a vector \bm{x} (likewise, \bm{y}, \bm{\tau}) is denoted by x_{i} (likewise, y_{i}, \tau_{i}).

2) Notation with \star as superscript denotes the desired underlying parameter or signal, e.g., 𝚺\bm{\Sigma}^{\star}, 𝜽\bm{\theta}^{\star}. Moreover, notation marked by a tilde (e.g., 𝒙~\bm{\widetilde{x}}) and a dot (e.g., 𝒙˙\bm{\dot{\bm{x}}}) stands for the truncated data and quantized data, respectively.

3) We reserve dd, nn for the problem dimension and sample size, respectively. In many cases 𝚼^\bm{\widehat{\Upsilon}} denotes the estimation error, e.g., 𝚼^=𝜽^𝜽\bm{\widehat{\Upsilon}}=\bm{\widehat{\theta}}-\bm{\theta^{\star}} if 𝜽^\bm{\widehat{\theta}} is the estimator for the desired signal 𝜽\bm{\theta^{\star}}. We use Σs\Sigma_{s} to denote the set of dd-dimensional ss-sparse signals.

4) For vector 𝒙d\bm{x}\in\mathbb{R}^{d}, we work with its transpose 𝒙\bm{x}^{\top}, p\ell_{p}-norm 𝒙p=(i[d]|xi|p)1/p\|\bm{x}\|_{p}=(\sum_{i\in[d]}|x_{i}|^{p})^{1/p} (p1p\geq 1), max norm 𝒙=maxi[d]|xi|\|\bm{x}\|_{\infty}=\max_{i\in[d]}|x_{i}|. We define the standard Euclidean sphere as 𝕊d1={𝒙d:𝒙2=1}\mathbb{S}^{d-1}=\{\bm{x}\in\mathbb{R}^{d}:\|\bm{x}\|_{2}=1\}.

5) For matrix 𝑨=[aij]m×n\bm{A}=[a_{ij}]\in\mathbb{R}^{m\times n} with singular values σ1σ2σmin{m,n}\sigma_{1}\geq\sigma_{2}\geq...\geq\sigma_{\min\{m,n\}}, recall the operator norm 𝑨op=sup𝒗𝕊n1𝑨𝒗2=σ1\|\bm{A}\|_{op}=\sup_{\bm{v}\in\mathbb{S}^{n-1}}\|\bm{Av}\|_{2}=\sigma_{1}, Frobenius norm 𝑨F=(i,jaij2)1/2\|\bm{A}\|_{F}=(\sum_{i,j}a_{ij}^{2})^{1/2}, nuclear norm 𝑨nu=k=1min{m,n}σk\|\bm{A}\|_{nu}=\sum_{k=1}^{\min\{m,n\}}\sigma_{k}, and max norm 𝑨=maxi,j|aij|\|\bm{A}\|_{\infty}=\max_{i,j}|a_{ij}|. λmin(𝑨)\lambda_{\min}(\bm{A}) (resp. λmax(𝑨)\lambda_{\max}(\bm{A})) stands for the minimum eigenvalue (resp. maximum eigenvalue) of a symmetric 𝑨\bm{A}.

6) We denote universal constants by CC, cc, CiC_{i} and cic_{i}, whose value may vary from line to line. We write T1T2T_{1}\lesssim T_{2} or T1=O(T2)T_{1}=O(T_{2}) if T1CT2T_{1}\leq CT_{2}. Conversely, if T1CT2T_{1}\geq CT_{2} we write T1T2T_{1}\gtrsim T_{2} or T1=Ω(T2)T_{1}=\Omega(T_{2}). Also, we write T1T2T_{1}\asymp T_{2} if T1=O(T2)T_{1}=O(T_{2}) and T2=Ω(T1)T_{2}=\Omega(T_{1}) simultaneously hold.

7) We use 𝒰(Ω)\mathscr{U}(\Omega) to denote the uniform distribution over ΩN\Omega\subset\mathbb{R}^{N}, 𝒩(𝝁,𝚺)\mathcal{N}(\bm{\mu},\bm{\Sigma}) to denote Gaussian distribution with mean 𝝁\bm{\mu} and covariance 𝚺\bm{\Sigma}, 𝗍(ν)\mathsf{t}(\nu) to denote student’s t distribution with degrees of freedom ν\nu.

8) Our technique to handle heavy-tailedness is a data truncation step, for which we introduce the operator \mathscr{T}_{\zeta}(\cdot) for some threshold \zeta>0. It is defined as \mathscr{T}_{\zeta}(a)=\operatorname{sgn}(a)\min\{|a|,\zeta\} for a\in\mathbb{R}. To truncate a vector, we apply \mathscr{T}_{\zeta}(\cdot) entry-wise in most cases, with the exception of covariance matrix estimation under operator norm error (Theorem 3).

9) \mathcal{Q}_{\Delta}(\cdot) is the uniform quantizer with quantization level \Delta>0. It applies to a scalar a by \mathcal{Q}_{\Delta}(a)=\Delta\big(\big\lfloor\frac{a}{\Delta}\big\rfloor+\frac{1}{2}\big), and we set \mathcal{Q}_{0}(a)=a. Given a threshold \mu, the hard thresholding of a scalar a is \mathcal{T}_{\mu}(a)=a\cdot\mathbbm{1}(|a|\geq\mu). Both functions apply element-wise to vectors and matrices; a small code sketch of these operators is given right after this list.
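For concreteness, the following is a minimal NumPy sketch of the three element-wise operators just defined (the truncation \mathscr{T}_{\zeta}, the uniform quantizer \mathcal{Q}_{\Delta}, and the hard thresholding \mathcal{T}_{\mu}); the function names are ours and chosen only for illustration.

import numpy as np

def truncate(a, zeta):
    # T_zeta(a) = sgn(a) * min(|a|, zeta), applied element-wise
    return np.sign(a) * np.minimum(np.abs(a), zeta)

def uniform_quantize(a, Delta):
    # Q_Delta(a) = Delta * (floor(a / Delta) + 1/2); Q_0(a) = a by convention
    if Delta == 0:
        return a
    return Delta * (np.floor(a / Delta) + 0.5)

def hard_threshold(a, mu):
    # T_mu(a) = a * 1(|a| >= mu), applied element-wise
    return a * (np.abs(a) >= mu)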

2.1 High-Dimensional Statistics

Let X be a real random variable. We first recall some basic facts about sub-Gaussian and sub-exponential random variables, and then precisely formulate the heavy-tailed distributions considered in this work.

1) The sub-Gaussian norm is defined as Xψ2=inf{t>0:𝔼exp(X2t2)2}\|X\|_{\psi_{2}}=\inf\{t>0:\mathbbm{E}\exp(\frac{X^{2}}{t^{2}})\leq 2\}. A random variable XX with finite Xψ2\|X\|_{\psi_{2}} is said to be sub-Gaussian. Analogously to Gaussian variable, a sub-Gaussian random variable exhibits an exponentially-decaying probability tail and satisfies a moment constraint:

\mathbbm{P}(|X|\geq t)\leq 2\exp\left(-\frac{ct^{2}}{\|X\|_{\psi_{2}}^{2}}\right); (2.1)
(\mathbbm{E}|X|^{p})^{1/p}\leq C\|X\|_{\psi_{2}}\sqrt{p},~\forall~p\geq 1. (2.2)

Note that these two properties can also define ψ2\|\cdot\|_{\psi_{2}} up to multiplicative constant, e.g., Xψ2supp1(𝔼|X|p)1/pp\|X\|_{\psi_{2}}\asymp\sup_{p\geq 1}\frac{(\mathbbm{E}|X|^{p})^{1/p}}{\sqrt{p}} (see [92, Prop. 2.5.2]). For a dd-dimensional random vector 𝒙\bm{x} we define its sub-Gaussian norm as 𝒙ψ2=sup𝒗𝕊d1𝒗𝒙ψ2\|\bm{x}\|_{\psi_{2}}=\sup_{\bm{v}\in\mathbb{S}^{d-1}}\|\bm{v}^{\top}\bm{x}\|_{\psi_{2}}.

2) The sub-exponential norm is defined as Xψ1=inf{t>0:𝔼exp(|X|t)2}\|X\|_{\psi_{1}}=\inf\{t>0:\mathbbm{E}\exp(\frac{|X|}{t})\leq 2\}, and XX is sub-exponential if Xψ1<\|X\|_{\psi_{1}}<\infty. The sub-exponential XX satisfies the following properties:

\mathbbm{P}(|X|\geq t)\leq 2\exp\left(-\frac{ct}{\|X\|_{\psi_{1}}}\right);
(\mathbbm{E}|X|^{p})^{1/p}\leq C\|X\|_{\psi_{1}}p,~\forall~p\geq 1. (2.3)

To relate \|\cdot\|_{\psi_{1}} and \|\cdot\|_{\psi_{2}}, one has \|XY\|_{\psi_{1}}\leq\|X\|_{\psi_{2}}\|Y\|_{\psi_{2}} [92, Lem. 2.7.7].

3) In contrast to the moment constraints in (2.2) and (2.3), heavy-tailed distributions in this work are only assumed to satisfy bounded moments of some small order no greater than 4, formulated for a random variable X as \mathbbm{E}|X|^{l}\leq M for some M>0 and l\in(0,4]. Following [58, Def. 2.4, 2.5], we consider the following two moment assumptions for a heavy-tailed random vector \bm{x}\in\mathbb{R}^{d} (again, M>0, l\in(0,4]):

  • Marginal Moment Constraint. The weaker assumption that constrains the moment of each coordinate, formulated as \sup_{i\in[d]}\mathbbm{E}|x_{i}|^{l}\leq M.

  • Joint Moment Constraint. The stronger assumption that constrains the moments "toward all directions \bm{v}\in\mathbb{S}^{d-1}," formulated as \sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}|\bm{v}^{\top}\bm{x}|^{l}\leq M.

2.2 Dithered Uniform Quantization

In this part, we describe the dithered uniform quantizer and its properties in detail. We also specify the choices of random dither in this work.

1) We first provide the detailed procedure of dithered quantization and its general properties. Let \bm{x}\in\mathbb{R}^{N} be the input signal with dimension N\geq 1, whose entries may be random and dependent. Independent of \bm{x}, we generate the random dither \bm{\tau}\in\mathbb{R}^{N} with i.i.d. entries from some distribution (throughout this work, we suppose that a random dither is drawn independently of anything else, particularly of the signal to be quantized and of other dithers, and that the dither has i.i.d. entries if it is a vector), and then quantize \bm{x} to \bm{\dot{x}}=\mathcal{Q}_{\Delta}(\bm{x}+\bm{\tau}). Following [41], we refer to \bm{w}:=\bm{\dot{x}}-(\bm{x}+\bm{\tau}) as the quantization error, and \bm{\xi}:=\bm{\dot{x}}-\bm{x} as the quantization noise. The principal properties of dithered quantization are provided in Theorem 1.

Theorem 1.

(Adapted from [41, Thms. 1-2]). Consider the dithered uniform quantization described above for the input signal 𝐱\bm{x}, with random dither 𝛕=[τi]\bm{\tau}=[\tau_{i}], quantization error 𝐰\bm{w} and quantization noise 𝛏=[ξi]\bm{\xi}=[\xi_{i}]. Use 𝗂\mathsf{i} to denote the imaginary unit, and let YY be the random variable having the same distribution as the random dither τi\tau_{i}.

(a) (Quantization Error). If f(u):=𝔼(exp(𝗂uY))f(u):=\mathbbm{E}(\exp(\mathsf{i}uY)) satisfies f(2πlΔ)=0f\big{(}\frac{2\pi l}{\Delta}\big{)}=0 for all non-zero integer ll, then 𝒘𝒰([Δ2,Δ2]N)\bm{w}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{N}) is independent of 𝒙\bm{x}.999Although the statement is a bit different, it can be implied by [41, Thm. 1] and the proof therein.

(b) (Quantization Noise). Assume that Z𝒰([Δ2,Δ2])Z\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) is independent of YY. Let g(u):=𝔼(exp(𝗂uY))𝔼(exp(𝗂uZ))g(u):=\mathbbm{E}(\exp(\mathsf{i}uY))\mathbbm{E}(\exp(\mathsf{i}uZ)). Given positive integer pp, if the pp-th order derivative g(p)(u)g^{(p)}(u) satisfies g(p)(2πlΔ)=0g^{(p)}\big{(}\frac{2\pi l}{\Delta}\big{)}=0 for all non-zero integer ll, then the pp-th conditional moment of ξi\xi_{i} does not depend on 𝒙\bm{x}: 𝔼[ξip|𝒙]=𝔼(Y+Z)p\mathbbm{E}[\xi_{i}^{p}|\bm{x}]=\mathbbm{E}(Y+Z)^{p}.

We note that Theorem 1 serves as the cornerstone for our analysis of the dithered uniform quantizer; for instance, (a) allows for applications of concentration inequalities in our analyses, and (b) inspires us to develop a covariance matrix estimator from quantized samples. The take-home message is that adding an appropriate dither before quantization makes the quantization error and quantization noise behave in a statistically nice manner. For example, the elementary form of Theorem 1(a) is that, under a dither \tau_{i} satisfying the condition there, the quantization error \mathcal{Q}_{\Delta}(x_{i}+\tau_{i})-(x_{i}+\tau_{i}) follows \mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) for any given scalar x_{i} [41, Lem. 1].

2) We use uniform dither for quantization of the response in compressed sensing and matrix completion. More specifically, under Δ>0\Delta>0, we adopt the uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) for the response yky_{k}\in\mathbb{R}, which is also a common choice in previous works (e.g., [89, 94, 32, 50]). For Y𝒰([Δ2,Δ2])Y\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]), it can be calculated that

\mathbbm{E}(\exp(\mathsf{i}uY))=\int_{-\Delta/2}^{\Delta/2}\frac{1}{\Delta}\big(\cos(ux)+\mathsf{i}\sin(ux)\big)\,\mathrm{d}x=\frac{2}{\Delta u}\sin\Big(\frac{\Delta u}{2}\Big), (2.4)

and hence 𝔼(exp(𝗂2πlΔY))=0\mathbbm{E}(\exp(\mathsf{i}\frac{2\pi l}{\Delta}Y))=0 holds for all non-zero integer ll. Therefore, the benefit of using τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) is that the quantization errors wk=𝒬Δ(yk+τk)(yk+τk)w_{k}=\mathcal{Q}_{\Delta}(y_{k}+\tau_{k})-(y_{k}+\tau_{k}) i.i.d. follow 𝒰([Δ2,Δ2])\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]), and are independent of {yk}\{y_{k}\}.

3) We use a triangular dither for quantization of the covariate, i.e., the sample in covariance estimation or the covariate in compressed sensing. Particularly, when considering the uniform quantizer \mathcal{Q}_{\Delta}(\cdot) for the covariate \bm{x}_{k}\in\mathbb{R}^{d}, we adopt the dither \bm{\tau}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d})+\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}), which is the sum of two independent \mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}) variables and is referred to as a triangular dither [41]. (An equivalent statement is that the entries of \bm{\tau}_{k} are i.i.d. distributed as \mathscr{U}\big([-\frac{\Delta}{2},\frac{\Delta}{2}]\big)+\mathscr{U}\big([-\frac{\Delta}{2},\frac{\Delta}{2}]\big); the equivalence can be clearly seen by comparing the joint probability density functions.) Simple calculations verify that the triangular dither respects not only the condition in Theorem 1(a), but also the one in Theorem 1(b) with p=2. Specifically, let Y=Y_{1}+Y_{2} where Y_{1} and Y_{2} are independent and follow \mathscr{U}\big([-\frac{\Delta}{2},\frac{\Delta}{2}]\big), and let Z\sim\mathscr{U}\big([-\frac{\Delta}{2},\frac{\Delta}{2}]\big) be independent of Y; then, based on (2.4), f(u)=\mathbbm{E}(\exp(\mathsf{i}uY))=\big[\frac{2}{\Delta u}\sin\frac{\Delta u}{2}\big]^{2} satisfies f(\frac{2\pi l}{\Delta})=0, and g(u)=\mathbbm{E}(\exp(\mathsf{i}uY))\mathbbm{E}(\exp(\mathsf{i}uZ))=\big[\frac{2}{\Delta u}\sin\frac{\Delta u}{2}\big]^{3} satisfies g^{\prime\prime}(\frac{2\pi l}{\Delta})=0, where l is any non-zero integer. Thus, at the cost of a dithering variance larger than that of the uniform dither, the triangular dither brings the additional nice property of a signal-independent variance for the quantization noise: \mathbbm{E}(\xi_{ki}^{2})=\frac{1}{4}\Delta^{2}, where \xi_{ki} is the i-th entry of the quantization noise \bm{\xi}_{k}=\mathcal{Q}_{\Delta}(\bm{x}_{k}+\bm{\tau}_{k})-\bm{x}_{k}.

To the best of our knowledge, the triangular dither is new to the literature of quantized compressed sensing. We will explain why it is necessary whenever covariance estimation is involved; this is also complemented by a numerical simulation (see Figure 5(a)).
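As a quick empirical sanity check (a sketch we add for illustration, not part of the paper's experiments), the following NumPy snippet verifies the two facts used above: under a uniform dither the quantization error is uniform on [-\Delta/2,\Delta/2] and uncorrelated with the input, and under a triangular dither the quantization noise has variance \Delta^{2}/4.

import numpy as np

rng = np.random.default_rng(0)
Delta, n = 0.5, 1_000_000
x = rng.standard_t(df=3, size=n)                   # a heavy-tailed input signal

def quantize(a):                                   # uniform quantizer Q_Delta
    return Delta * (np.floor(a / Delta) + 0.5)

# uniform dither: quantization error w ~ U([-Delta/2, Delta/2]), independent of x
tau_u = rng.uniform(-Delta / 2, Delta / 2, size=n)
w = quantize(x + tau_u) - (x + tau_u)
print(np.var(w), Delta**2 / 12)                    # both approximately Delta^2 / 12
print(np.corrcoef(w, x)[0, 1])                     # approximately 0

# triangular dither: quantization noise xi = Q(x + tau) - x has E[xi^2] = Delta^2 / 4
tau_t = rng.uniform(-Delta/2, Delta/2, n) + rng.uniform(-Delta/2, Delta/2, n)
xi = quantize(x + tau_t) - x
print(np.mean(xi**2), Delta**2 / 4)                # both approximately Delta^2 / 4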

3 (Near) Minimax Error Rates

In this section we derive (near) optimal error rates for several canonical statistical estimation problems. Our novelty is that by using the proposed quantization scheme for heavy-tailed data, (near) optimal error rates could be achieved by computationally feasible estimators.

3.1 Quantized Covariance Matrix Estimation

Given \mathscr{X}:=\{\bm{x}_{1},...,\bm{x}_{n}\} as i.i.d. copies of a zero-mean random vector \bm{x}\in\mathbb{R}^{d}, one often encounters the covariance matrix estimation problem, i.e., estimating \bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}\bm{x}^{\top}). This estimation problem is of fundamental importance in multivariate analysis and has attracted much research interest (e.g., [12, 11, 13, 4]). However, the practically useful setting (e.g., in a massive MIMO system [95]) where the samples undergo a certain quantization process remains under-developed; here we are only aware of the 1-bit quantization results in [31, 21]. This setting poses the problem of quantized covariance matrix estimation (QCME), in which one aims to design a quantization scheme for \bm{x}_{k} that allows for accurate estimation of \bm{\Sigma^{\star}} based only on the quantized samples. We consider heavy-tailed \bm{x}_{k} possessing bounded fourth moments either marginally or jointly, but note that our estimation methods and theoretical results appear to be new even for sub-Gaussian \bm{x}_{k} (Remark 1).

As introduced before, we overcome the heavy-tailedness of \bm{x}_{k} by a data truncation step, i.e., we first truncate \bm{x}_{k} to \bm{\widetilde{x}}_{k} in order to make the outliers less influential. Here, we defer the precise definition of \bm{\widetilde{x}}_{k} to the concrete results because it should be well suited to the error metric. After the truncation, we dither and quantize \bm{\widetilde{x}}_{k} to \bm{\dot{x}}_{k}=\mathcal{Q}_{\Delta}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k}) with the triangular dither \bm{\tau}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d})+\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}). Since this differs from the uniform dither adopted in the literature (e.g., [21, 89, 94, 32, 50]), let us first explain our choice of the triangular dither. Recall that the quantization noise and quantization error are respectively defined as \bm{\xi}_{k}:=\bm{\dot{x}}_{k}-\bm{\widetilde{x}}_{k} and \bm{w}_{k}:=\bm{\dot{x}}_{k}-\bm{\widetilde{x}}_{k}-\bm{\tau}_{k}, thus giving \bm{\xi}_{k}=\bm{\tau}_{k}+\bm{w}_{k}. Under either a uniform or a triangular dither, \bm{w}_{k} is independent of \bm{\widetilde{x}}_{k} and follows \mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}) (see Section 2.2), thus allowing us to calculate that

\mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}) = \mathbbm{E}\big((\bm{\widetilde{x}}_{k}+\bm{\xi}_{k})(\bm{\widetilde{x}}_{k}+\bm{\xi}_{k})^{\top}\big) = \mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top})+\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\xi}_{k}^{\top})+\mathbbm{E}(\bm{\xi}_{k}\bm{\widetilde{x}}_{k}^{\top})+\mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top}) \stackrel{(i)}{=} \mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top})+\mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top}). (3.1)

Note that (i)(i) is because 𝔼(𝝃k𝒙~k)=𝔼(𝝉k𝒙~k)+𝔼(𝒘k𝒙~k)=𝔼(𝝉k)𝔼(𝒙~k)+𝔼(𝒘k)𝔼(𝒙~k)=0\mathbbm{E}(\bm{\xi}_{k}\bm{\widetilde{x}}_{k}^{\top})=\mathbbm{E}(\bm{\tau}_{k}\bm{\widetilde{x}}_{k}^{\top})+\mathbbm{E}(\bm{w}_{k}\bm{\widetilde{x}}_{k}^{\top})=\mathbbm{E}(\bm{\tau}_{k})\mathbbm{E}(\bm{\widetilde{x}}_{k}^{\top})+\mathbbm{E}(\bm{w}_{k})\mathbbm{E}(\bm{\widetilde{x}}_{k}^{\top})=0, due to the previously noted fact that 𝝉k\bm{\tau}_{k} and 𝒘k\bm{w}_{k} are independent of 𝒙~k\bm{\widetilde{x}}_{k} and zero-mean. While with suitable choice of the truncation threshold 𝔼(𝒙~k𝒙~k)\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}) is expected to well approximate 𝚺\bm{\Sigma^{\star}}, the remaining 𝔼(𝝃k𝝃k)\mathbb{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top}) gives rise to constant bias. To address the issue, a straightforward idea is to remove the bias, which requires the full knowledge on 𝔼(𝝃k𝝃k)\mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top}), i.e., the covariance matrix of the quantization noise. For iji\neq j, because 𝝉k\bm{\tau}_{k}, 𝒘k𝒰([Δ2,Δ2]d)\bm{w}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}) and 𝔼(wkiτkj)=𝔼x~ki(𝔼[wkiτkj|x~ki])=0\mathbbm{E}(w_{ki}\tau_{kj})=\mathbbm{E}_{\widetilde{x}_{ki}}(\mathbbm{E}[w_{ki}\tau_{kj}|\widetilde{x}_{ki}])=0 (note that conditionally on x~ki\widetilde{x}_{ki}, wki=𝒬Δ(x~ki+τki)(x~ki+τki)w_{ki}=\mathcal{Q}_{\Delta}(\widetilde{x}_{ki}+\tau_{ki})-(\widetilde{x}_{ki}+\tau_{ki}) and τkj\tau_{kj} are independent), we have

\mathbbm{E}(\xi_{ki}\xi_{kj})=\mathbbm{E}\big((w_{ki}+\tau_{ki})(w_{kj}+\tau_{kj})\big) = \mathbbm{E}(w_{ki}w_{kj})+\mathbbm{E}(w_{ki}\tau_{kj})+\mathbbm{E}(\tau_{ki}w_{kj})+\mathbbm{E}(\tau_{ki}\tau_{kj})=0,

showing that 𝔼(𝝃k𝝃k)\mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top}) is diagonal. Moreover, under triangular dither the ii-th diagonal entry is also known as 𝔼|ξki|2=Δ24\mathbbm{E}|\xi_{ki}|^{2}=\frac{\Delta^{2}}{4}, see Section 2.2. Taken collectively, we arrive at

\mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top})=\frac{\Delta^{2}}{4}\bm{I}_{d}. (3.2)

Based on (3.1) we thus propose the following estimator

\bm{\widehat{\Sigma}}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\Delta^{2}}{4}\bm{I}_{d}, (3.3)

which is the sample covariance of the quantized samples \dot{\mathscr{X}}:=\{\bm{\dot{x}}_{1},...,\bm{\dot{x}}_{n}\} followed by a correction step. On the other hand, the reason why the standard uniform dither is not suitable for QCME becomes self-evident: the diagonal of \mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top}) remains unknown (it depends on the input signal, see [41, Page 3]), and hence there is no hope of precisely removing the bias.
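To make the construction concrete, here is a minimal NumPy sketch of the estimator (3.3) with the element-wise truncation used in Theorem 2 below; the function name is ours, and the truncation threshold zeta is left as an input to be set as in the theorems.

import numpy as np

def quantized_cov_estimator(X, Delta, zeta, rng=None):
    # X: n-by-d array whose rows are the samples x_k.
    # Truncate element-wise, add a triangular dither, uniformly quantize,
    # then form the sample covariance of the quantized samples and subtract
    # the known quantization-noise covariance (Delta^2 / 4) * I_d, as in (3.3).
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    X_trunc = np.sign(X) * np.minimum(np.abs(X), zeta)             # element-wise truncation
    tau = rng.uniform(-Delta/2, Delta/2, (n, d)) \
        + rng.uniform(-Delta/2, Delta/2, (n, d))                   # triangular dither
    X_dot = Delta * (np.floor((X_trunc + tau) / Delta) + 0.5)      # dithered uniform quantization
    return X_dot.T @ X_dot / n - (Delta**2 / 4) * np.eye(d)        # bias-corrected sample covariance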

We are now ready to present error bounds for 𝚺^\bm{\widehat{\Sigma}} under max-norm, operator norm. We will also investigate the high-dimensional setting by assuming sparse structure of 𝚺\bm{\Sigma^{\star}}, for which we propose a thresholding estimator. More concretely, our first result provides the error rate under \|\cdot\|_{\infty}, in which we assume 𝒙k\bm{x}_{k} satisfies the marginal fourth moment constraint and utilize an element-wise truncation 𝒙~k=𝒯ζ(𝒙k)\bm{\widetilde{x}}_{k}=\mathscr{T}_{\zeta}(\bm{x}_{k}).

Theorem 2.

(Element-Wise Error). Given Δ>0\Delta>0 and δ>4\delta>4, we consider the problem of QCME described above. We suppose that 𝒙k\bm{x}_{k}s are i.i.d. zero-mean and satisfy the marginal moment constraint 𝔼|xki|4M\mathbbm{E}|x_{ki}|^{4}\leq M for any i[d]i\in[d], where xkix_{ki} is the ii-th entry of 𝒙k\bm{x}_{k}. We truncate 𝒙k\bm{x}_{k} to 𝒙~k=[x~ki]=𝒯ζ(𝒙k)\bm{\widetilde{x}}_{k}=[\widetilde{x}_{ki}]=\mathscr{T}_{\zeta}(\bm{x}_{k}) with threshold ζ(nMδlogd)1/4\zeta\asymp\big{(}\frac{nM}{\delta\log d}\big{)}^{1/4}, then quantize 𝒙~k\bm{\widetilde{x}}_{k} to 𝒙˙k=𝒬Δ(𝒙~k+𝝉k)\bm{\dot{x}}_{k}=\mathcal{Q}_{\Delta}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k}) with triangular dither 𝝉k𝒰([Δ2,Δ2]d)+𝒰([Δ2,Δ2]d)\bm{\tau}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d})+\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}). If nδlogdn\gtrsim\delta\log d, then the estimator in (3.3) satisfies

\mathbbm{P}\left(\|\bm{\widehat{\Sigma}}-\bm{\Sigma^{\star}}\|_{\infty}\geq C\mathscr{L}\sqrt{\frac{\delta\log d}{n}}\right)\leq 2d^{2-\delta},

where \mathscr{L}:=\sqrt{M}+\Delta^{2}.

Notably, despite the heavy-tailedness and quantization, the estimator achieves an element-wise rate O(\sqrt{\frac{\log d}{n}}), coincident with the one for the sub-Gaussian case. The quantization level \Delta enters only through the multiplicative factor \mathscr{L}=\sqrt{M}+\Delta^{2}. Thus, the information loss incurred by quantization is inessential in that it does not affect the key scaling law but only slightly worsens the leading factor. These remarks on the (near) optimality and the information loss incurred by quantization remain valid in our subsequent theorems.
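As a usage sketch of the estimator outlined after (3.3) (the sample size, dimension, and constants below are purely illustrative, and the threshold follows Theorem 2 with the unspecified constant omitted):

import numpy as np

n, d, Delta, delta, M = 5000, 50, 0.5, 5.0, 25.0        # illustrative values; E|x_ki|^4 = 25 for t(5)
rng = np.random.default_rng(1)
X = rng.standard_t(df=5, size=(n, d))                   # heavy-tailed samples with finite 4th moment
zeta = (n * M / (delta * np.log(d))) ** 0.25            # truncation threshold of Theorem 2 (constant omitted)
Sigma_hat = quantized_cov_estimator(X, Delta, zeta, rng)
print(np.max(np.abs(Sigma_hat - (5 / 3) * np.eye(d))))  # true covariance of t(5) samples is (5/3) * I_d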

Our next result concerns the operator norm estimation error, under which we impose a stronger joint moment constraint on 𝒙k\bm{x}_{k} and truncate 𝒙k\bm{x}_{k} regarding 4\ell_{4}-norm, i.e., 𝒙ˇ𝒌=𝒙k𝒙k4min{𝒙k4,ζ}\bm{\check{x}_{k}}=\frac{\bm{x}_{k}}{\|\bm{x}_{k}\|_{4}}\min\{\|\bm{x}_{k}\|_{4},\zeta\} for some threshold ζ\zeta. After the dithered uniform quantization, we still define the estimator as (3.3).

Theorem 3.

(Operator Norm Error). Given Δ>0\Delta>0 and δ>0\delta>0, we consider the problem of QCME described above. Suppose that the i.i.d. zero-mean 𝒙k\bm{x}_{k}s satisfy 𝔼|𝒗𝒙k|4M\mathbbm{E}|\bm{v}^{\top}\bm{x}_{k}|^{4}\leq M for any 𝒗𝕊d1\bm{v}\in\mathbb{S}^{d-1}. We truncate 𝒙k\bm{x}_{k} to 𝒙ˇk=𝒙k𝒙k4min{𝒙k4,ζ}\bm{\check{x}}_{k}=\frac{\bm{x}_{k}}{\|\bm{x}_{k}\|_{4}}\min\{\|\bm{x}_{k}\|_{4},\zeta\} with threshold ζ(M1/4+Δ)(nδlogd)1/4\zeta\asymp(M^{1/4}+\Delta)\big{(}\frac{n}{\delta\log d}\big{)}^{1/4}, then quantize 𝒙ˇk\bm{\check{x}}_{k} to 𝒙˙k=𝒬Δ(𝒙ˇk+𝝉k)\bm{\dot{x}}_{k}=\mathcal{Q}_{\Delta}(\bm{\check{x}}_{k}+\bm{\tau}_{k}) with triangular dither 𝝉k𝒰([Δ2,Δ2]d)+𝒰([Δ2,Δ2]d)\bm{\tau}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d})+\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}). If nδdlogdn\gtrsim\delta d\log d, then the estimator in (3.3) satisfies

\mathbbm{P}\left(\|\bm{\widehat{\Sigma}}-\bm{\Sigma^{\star}}\|_{op}\geq C\mathscr{L}\sqrt{\frac{\delta d\log d}{n}}\right)\leq 2d^{-\delta},

with \mathscr{L}:=\sqrt{M}+\Delta^{2}.

The operator norm error rate in Theorem 3 is near minimax optimal, e.g., compared to the lower bound in [35, Thm. 7], which states that for any estimator 𝚺^\bm{\widehat{\Sigma}} of the positive semi-definite matrix 𝚺\bm{\Sigma^{\star}} based on i.i.d. zero-mean {𝒙k}k=1n\{\bm{x}_{k}\}_{k=1}^{n} with covariance matrix 𝚺\bm{\Sigma^{\star}}, there exists some 𝒗0𝕊d1\bm{v}_{0}\in\mathbb{S}^{d-1} such that (𝚺^𝚺op1486dn)13\mathbbm{P}\big{(}\|\bm{\widehat{\Sigma}}-\bm{\Sigma^{\star}}\|_{op}\geq\frac{1}{48}\sqrt{\frac{6d}{n}}\big{)}\geq\frac{1}{3}, where 𝚺=𝑰d+𝒗0𝒗0\bm{\Sigma^{\star}}=\bm{I}_{d}+\bm{v}_{0}\bm{v}_{0}^{\top}. Again, the quantization only affects the multiplicative factor \mathscr{L}. Nevertheless, one still needs (at least) ndn\gtrsim d to achieve small operator norm error. In fact, in a high-dimensional setting where dd may exceed nn, even the sample covariance 1nk=1n𝒙k𝒙k\frac{1}{n}\sum_{k=1}^{n}\bm{x}_{k}\bm{x}_{k}^{\top} for sub-Gaussian zero-mean 𝒙k\bm{x}_{k} may have extremely bad performance. To achieve small operator norm error in a high-dimensional regime, we resort to additional structure on 𝚺\bm{\Sigma^{\star}}, and specifically we use column-wise sparsity as an example, which corresponds to the situations where dependencies among different coordinates are weak. Based on the estimator in Theorem 2, we further invoke a thresholding regularization [4, 12] to promote sparsity.

Theorem 4.

(Sparse QCME). Under conditions and estimator 𝚺^\bm{\widehat{\Sigma}} in Theorem 2, we additionally assume that all columns of 𝚺=[σij]\bm{\Sigma}^{\star}=[\sigma^{\star}_{ij}] are ss-sparse and consider the thresholding estimator 𝚺^s:=𝒯μ(𝚺^)\bm{\widehat{\Sigma}}_{s}:=\mathcal{T}_{\mu}(\bm{\widehat{\Sigma}}) for some μ\mu (recall that 𝒯μ(a)=a𝟙(|a|μ)\mathcal{T}_{\mu}(a)=a\cdot\mathbbm{1}(|a|\geq\mu) for aa\in\mathbb{R}). If μ=C1(M+Δ2)δlogdn\mu=C_{1}(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C1C_{1}, then 𝚺^s\bm{\widehat{\Sigma}}_{s} satisfies

\mathbbm{P}\left(\|\bm{\widehat{\Sigma}}_{s}-\bm{\Sigma^{\star}}\|_{op}\leq C\mathscr{L}s\sqrt{\frac{\delta\log d}{n}}\right)\geq 1-\exp(-0.25\delta),

where \mathscr{L}:=\sqrt{M}+\Delta^{2}.

Notably, our estimator 𝚺^s\bm{\widehat{\Sigma}}_{s} achieves minimax rates O(slogdn)O\big{(}s\sqrt{\frac{\log d}{n}}\big{)} under operator norm, e.g., compared to the minimax lower bound derived in [12, Thm. 2], which states that (under some regular scaling) for any covariance estimator 𝚺es\bm{\Sigma}_{es} based on nn i.i.d. samples of 𝒩(𝝁,𝚺)\mathcal{N}(\bm{\mu},\bm{\Sigma}^{\star}) where 𝚺\bm{\Sigma}^{\star} is the true covariance matrix, there exists some covariance matrix 𝚺\bm{\Sigma^{\star}} with ss-sparse columns such that 𝔼𝚺es𝚺op2s2logdn\mathbbm{E}\|\bm{\Sigma}_{es}-\bm{\Sigma}^{\star}\|_{op}^{2}\gtrsim s^{2}\frac{\log d}{n}.

To analyse the thresholding estimator, our proof resembles the ones developed in prior works (e.g., [12]) but requires more effort, such as bounding the additional bias terms arising from the data truncation and quantization. We also point out that the results for the full-data unquantized regime immediately follow by setting \Delta=0; thus Theorems 2-3 represent a strict extension of [35, Sect. 4], and Theorem 4 complements [35] with a high-dimensional sparse setting.
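For completeness, a minimal sketch of the thresholding estimator of Theorem 4, reusing the covariance estimator sketched in Section 3.1; the threshold mu is an input to be chosen as in the theorem.

import numpy as np

def sparse_quantized_cov_estimator(X, Delta, zeta, mu, rng=None):
    # Hard-threshold the entries of the bias-corrected estimator (3.3).
    Sigma_hat = quantized_cov_estimator(X, Delta, zeta, rng)
    return Sigma_hat * (np.abs(Sigma_hat) >= mu)       # T_mu applied element-wise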

Remark 1.

(Sub-Gaussian Case). While we concentrate on quantization of heavy-tailed data in this work, our results can be readily adjusted to sub-Gaussian 𝒙k\bm{x}_{k}, for which the truncation step is inessential and can be removed (i.e., ζ=\zeta=\infty). These results are also new to the literature but will not be presented here.

3.2 Quantized Compressed Sensing

We consider the linear model

y_{k}=\bm{x}_{k}^{\top}\bm{\theta^{\star}}+\epsilon_{k},~k=1,...,n, (3.4)

where the \bm{x}_{k}s are the covariates, the y_{k}s are the responses, and \bm{\theta^{\star}} is the sparse signal in compressed sensing, or the sparse parameter vector in high-dimensional linear regression, that we want to estimate. In the quantized compressed sensing (QCS) problem, we are interested in developing a quantization scheme for the (\bm{x}_{k},y_{k})s (mainly for y_{k} in prior works) that enables accurate recovery of \bm{\theta^{\star}} based on the quantized data.

In spite of the same mathematical formulation, there are some important differences between compressed sensing and sparse linear regression that we should clarify first. Specifically, different from sensing vectors in compressed sensing that are generated by some analog measuring device and can oftentimes be designed, 𝒙k\bm{x}_{k}s in sparse linear regression represent the sample data from certain datasets that are believed to affect the responses yky_{k}s through (3.4). While the sparsity of 𝜽\bm{\theta^{\star}} is arguably the most classical signal structure for compressed sensing, due to good interpretability it is also commonly adopted to achieve dimension reduction in high-dimensional statistics. In this work, we are interested in both problem settings. Thus, we do not adopt the isotropic convention (i.e., 𝔼(𝒙k𝒙k)=𝑰d\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top})=\bm{I}_{d}) from compressed sensing but instead deal with 𝒙k\bm{x}_{k} having general unknown covariance matrix. While the study of quantization and heavy-tailed noise is meaningful in both settings, we note that some of our subsequent results are mainly of interest to the specific sensing or regression problem. For instance, the heavy-tailed covariate considered in Theorem 6 is primarily motivated by the regression setting, in which 𝒙k\bm{x}_{k} may come from a dataset that exhibits much heavier tail than sub-Gaussian data. Moreover, as will be elaborated in Section 4 when appropriate, our subsequent results on covariate quantization (resp., uniform signal recovery guarantee) may prove more useful to the regression problem (resp., compressed sensing problem).

To fix ideas, we assume that 𝒙k\bm{x}_{k}s are i.i.d. drawn from some multivariate distribution, ϵk\epsilon_{k}s are i.i.d. statistical noise independent of the 𝒙k\bm{x}_{k}s, and we truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) and then quantize it to y˙k=𝒬Δ(y~k+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). Under these statistical assumptions and dithered quantization, near optimal recovery guarantees have been established in [89, 94] for the regime where both 𝒙k\bm{x}_{k} and ϵk\epsilon_{k} are drawn from sub-Gaussian distributions (hence the truncation is not needed). In contrast, our focus is on quantization of heavy-tailed data. Particularly, we always assume that the noise ϵk\epsilon_{k}s are i.i.d. drawn from some heavy-tailed distribution, resulting in heavy-tailed responses. We will separately deal with the case of sub-Gaussian covariates and the more challenging situation where 𝒙k\bm{x}_{k}s are also heavy-tailed.
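For concreteness, the following is a minimal NumPy sketch of this truncate-then-dither pipeline for the responses. The threshold, quantization level, and noise distribution are purely illustrative, and the rounding convention used here for 𝒬_Δ (rounding to the grid ΔZ) is one common choice rather than a prescription from our analysis.

```python
import numpy as np

def truncate(v, zeta):
    """Element-wise truncation T_zeta(v): keep the sign, cap the magnitude at zeta."""
    return np.sign(v) * np.minimum(np.abs(v), zeta)

def dithered_uniform_quantize(v, Delta, rng):
    """Q_Delta(v + tau) with uniform dither tau ~ U([-Delta/2, Delta/2])."""
    tau = rng.uniform(-Delta / 2, Delta / 2, size=np.shape(v))
    return Delta * np.round((v + tau) / Delta)

rng = np.random.default_rng(0)
y = rng.standard_t(df=3, size=1000)   # heavy-tailed responses (illustrative)
zeta_y, Delta = 5.0, 0.5              # illustrative threshold and quantization level
y_dot = dithered_uniform_quantize(truncate(y, zeta_y), Delta, rng)
```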

To estimate the sparse 𝜽\bm{\theta^{\star}}, a classical approach is via the regularized M-estimator known as Lasso [90, 70, 72]

argmin𝜽12nk=1n(yk𝒙k𝜽)2+λ𝜽1,\displaystyle\mathop{\arg\min}\limits_{\bm{\theta}}~{}\frac{1}{2n}\sum_{k=1}^{n}(y_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2}+\lambda\|\bm{\theta}\|_{1},

whose objective combines the 2\ell_{2}-loss for data fidelity and 1\ell_{1}-norm that encourages sparsity. Because we can only access the quantized data (𝒙k,y˙k)(\bm{x}_{k},\dot{y}_{k}) (or even (𝒙˙k,y˙k)(\bm{\dot{x}}_{k},\dot{y}_{k}) if covariate quantization is involved, see Section 4), the main issue lies in the 2\ell_{2}-loss 12nk=1n(yk𝒙k𝜽)2\frac{1}{2n}\sum_{k=1}^{n}(y_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2} that requires the unquantized data (𝒙k,yk)(\bm{x}_{k},y_{k}). To resolve the issue, we calculate the expected 2\ell_{2}-loss:

\mathbbm{E}(y_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2}\stackrel{{\scriptstyle(i)}}{{=}}\bm{\theta}^{\top}\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top})\bm{\theta}-2\mathbbm{E}(y_{k}\bm{x}_{k})^{\top}\bm{\theta}\stackrel{{\scriptstyle(ii)}}{{=}}\bm{\theta}^{\top}\bm{\Sigma^{\star}}\bm{\theta}-2\bm{\Sigma}_{y\bm{x}}^{\top}\bm{\theta}, (3.5)

where (i)(i) holds up to an inessential constant 𝔼|yk|2\mathbbm{E}|y_{k}|^{2}, and in (ii)(ii) we let 𝚺:=𝔼(𝒙k𝒙k)\bm{\Sigma^{\star}}:=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}), 𝚺y𝒙=𝔼(yk𝒙k)\bm{\Sigma}_{y\bm{x}}=\mathbbm{E}(y_{k}\bm{x}_{k}). This inspires us to generalize the 2\ell_{2} loss to 12𝜽𝑸𝜽𝒃𝜽\frac{1}{2}\bm{\theta}^{\top}\bm{Q}\bm{\theta}-\bm{b}^{\top}\bm{\theta} and consider the following program

𝜽^=argmin𝜽𝒮12𝜽𝑸𝜽𝒃𝜽+λ𝜽1.\bm{\widehat{\theta}}=\mathop{\arg\min}\limits_{\bm{\theta}\in\mathcal{S}}~{}\frac{1}{2}\bm{\theta}^{\top}\bm{Q}\bm{\theta}-\bm{b}^{\top}\bm{\theta}+\lambda\|\bm{\theta}\|_{1}. (3.6)

Compared to (3.5), we will use (𝑸,𝒃)(\bm{Q},\bm{b}) that well approximates (𝚺,𝚺y𝒙)(\bm{\Sigma^{\star}},\bm{\Sigma}_{y\bm{x}}), and we also introduce the constraint 𝜽𝒮\bm{\theta}\in\mathcal{S} to allow more flexibility. This is our general strategy for designing estimators in the different QCS settings of this work; see more discussion in Remark 3.
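To illustrate how (3.6) can be solved when 𝑸 is positive semi-definite and 𝒮 = ℝ^d (the situation of Theorems 5-6 below), here is a minimal proximal-gradient (ISTA-style) sketch; the step size and iteration count are illustrative and not tuned.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def generalized_lasso(Q, b, lam, n_iter=500):
    """Minimize 0.5 * theta' Q theta - b' theta + lam * ||theta||_1 (Q assumed PSD)."""
    theta = np.zeros_like(b)
    step = 1.0 / np.linalg.norm(Q, 2)      # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = Q @ theta - b               # gradient of the quadratic part
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```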

The next theorem is concerned with QCS under sub-Gaussian covariate but heavy-tailed response. Note that the heavy-tailedness of yky_{k} stems from the noise distribution assumed to have bounded 2+ν2+\nu moment (ν=2(l1)>0\nu=2(l-1)>0 in the theorem statement), but following [35, 21, 97] we directly impose the moment constraint on the response.

Theorem 5.

(Sub-Gaussian Covariate, Heavy-Tailed Response). Given some δ>0,Δ>0\delta>0,\Delta>0, in (3.4) we suppose that 𝒙k\bm{x}_{k}s are i.i.d., zero-mean sub-Gaussian with 𝒙kψ2σ\|\bm{x}_{k}\|_{\psi_{2}}\leq\sigma, κ0λmin(𝚺)λmax(𝚺)κ1\kappa_{0}\leq\lambda_{\min}(\bm{\Sigma^{\star}})\leq\lambda_{\max}(\bm{\Sigma^{\star}})\leq\kappa_{1} for some κ1>κ0>0\kappa_{1}>\kappa_{0}>0 where 𝚺=𝔼(𝒙k𝒙k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}), 𝜽d\bm{\theta^{\star}}\in\mathbb{R}^{d} is ss-sparse, the noise ϵk\epsilon_{k}s are i.i.d. heavy-tailed and independent of 𝒙k\bm{x}_{k}s, and we assume 𝔼|yk|2lM\mathbbm{E}|y_{k}|^{2l}\leq M for some fixed l>1l>1. In the quantization, we truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) with threshold ζy(nM1/lδlogd)1/2\zeta_{y}\asymp\big{(}\frac{nM^{1/l}}{\delta\log d}\big{)}^{1/2}, then quantize y~k\widetilde{y}_{k} to y˙k=𝒬Δ(y~k+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). For recovery, we define the estimator 𝜽^\bm{\widehat{\theta}} as (3.6) with 𝑸=1nk=1n𝒙k𝒙k\bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{x}_{k}\bm{x}_{k}^{\top}, 𝒃=1nk=1ny˙k𝒙k\bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{x}_{k}, 𝒮=d\mathcal{S}=\mathbb{R}^{d}. We set λ=C1σ2κ0(Δ+M1/(2l))δlogdn\lambda=C_{1}\frac{\sigma^{2}}{\sqrt{\kappa_{0}}}(\Delta+M^{1/(2l)})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C1C_{1}. If nδslogdn\gtrsim\delta s\log d for some hidden constant only depending on (κ0,σ)(\kappa_{0},\sigma), then with probability at least 19d1δ1-9d^{1-\delta}, the estimation error 𝚼^=𝜽^𝜽\bm{\widehat{\Upsilon}}=\bm{\widehat{\theta}}-\bm{\theta^{\star}} satisfies

𝚼^2C3δslogdnand𝚼^1C4sδlogdn\|\bm{\widehat{\Upsilon}}\|_{2}\leq C_{3}\mathscr{L}\sqrt{\frac{\delta s\log d}{n}}~{}~{}~{}\mathrm{and}~{}~{}~{}\|\bm{\widehat{\Upsilon}}\|_{1}\leq C_{4}\mathscr{L}s\sqrt{\frac{\delta\log d}{n}}~{}

where :=σ2(Δ+M1/(2l))κ03/2\mathscr{L}:=\frac{\sigma^{2}(\Delta+M^{1/(2l)})}{\kappa_{0}^{3/2}}.

The rate O(slogdn)O\big{(}\sqrt{\frac{s\log d}{n}}\big{)} for 2\ell_{2}-norm error is minimax optimal up to logarithmic factor (e.g., compared to [81]). Note that a random noise bounded by Δ\Delta roughly contributes Δ\Delta to (𝔼|yk|2l)1/(2l)(\mathbbm{E}|y_{k}|^{2l})^{1/(2l)}, and the latter is bounded by M1/(2l)M^{1/(2l)}; because in the error bound Δ\Delta and M1/(2l)M^{1/(2l)} almost play the same role, the effect of uniform quantization can be readily interpreted as an additional bounded noise, analogously to the error rate in [87].

Next, we switch to the more challenging situation where both 𝒙k\bm{x}_{k} and yky_{k} are heavy-tailed, assuming that they both possess bounded fourth moments (a marginal moment constraint for 𝒙k\bm{x}_{k}). This setting is motivated by sparse linear regression, where the covariates 𝒙k\bm{x}_{k}s may oftentimes exhibit heavy-tailed behaviour. Specifically, we truncate 𝒙k\bm{x}_{k} element-wise to 𝒙~k\bm{\widetilde{x}}_{k} and set 𝑸:=1nk=1n𝒙~k𝒙~k\bm{Q}:=\frac{1}{n}\sum_{k=1}^{n}\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top} as a robust covariance matrix estimator, whose estimation performance under \|\cdot\|_{\infty} follows immediately from Theorem 2 by setting Δ=0\Delta=0.

Theorem 6.

(Heavy-Tailed Covariate, Heavy-Tailed Response). Given some δ>0\delta>0, Δ>0\Delta>0, in (3.4) we suppose that 𝒙k\bm{x}_{k}s are i.i.d. zero-mean satisfying a marginal fourth moment constraint supi[d]𝔼|xki|4M\sup_{i\in[d]}\mathbbm{E}|x_{ki}|^{4}\leq M, κ0λmin(𝚺)λmax(𝚺)κ1\kappa_{0}\leq\lambda_{\min}(\bm{\Sigma^{\star}})\leq\lambda_{\max}(\bm{\Sigma^{\star}})\leq\kappa_{1} for some κ1>κ0>0\kappa_{1}>\kappa_{0}>0 where 𝚺=𝔼(𝒙k𝒙k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}), 𝜽Σs\bm{\theta^{\star}}\in\Sigma_{s} satisfies 𝜽1R\|\bm{\theta^{\star}}\|_{1}\leq R, the noise ϵk\epsilon_{k}s are i.i.d. heavy-tailed and independent of 𝒙k\bm{x}_{k}s, and we assume 𝔼|yk|4M\mathbbm{E}|y_{k}|^{4}\leq M. In the quantization, we truncate 𝒙k,yk\bm{x}_{k},y_{k} respectively to 𝒙~k=[x~ki]=𝒯ζx(𝒙k),y~k:=𝒯ζy(yk)\bm{\widetilde{x}}_{k}=[\widetilde{x}_{ki}]=\mathscr{T}_{\zeta_{x}}(\bm{x}_{k}),~{}\widetilde{y}_{k}:=\mathscr{T}_{\zeta_{y}}(y_{k}) with ζx,ζy(nMδlogd)1/4\zeta_{x},\zeta_{y}\asymp\big{(}\frac{nM}{\delta\log d}\big{)}^{1/4}, then we quantize y~k\widetilde{y}_{k} to y˙k=𝒬Δ(y~k+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). For recovery, we define the estimator 𝜽^\bm{\widehat{\theta}} as (3.6) with 𝑸=1nk=1n𝒙~k𝒙~k\bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}, 𝒃=1nk=1ny˙k𝒙~k\bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\widetilde{x}}_{k}, 𝒮=d\mathcal{S}=\mathbb{R}^{d}. We set λ=C1(RM+Δ2)δlogdn\lambda=C_{1}(R\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C1C_{1}. If nδs2logdn\gtrsim\delta s^{2}\log d for some hidden constant only depending on (κ0,M)(\kappa_{0},M), then with probability at least 14d2δ1-4d^{2-\delta}, the estimation error 𝚼^:=𝜽^𝜽\bm{\widehat{\Upsilon}}:=\bm{\widehat{\theta}}-\bm{\theta^{\star}} satisfies

𝚼^2C2δslogdnand𝚼^1C3sδlogdn\|\bm{\widehat{\Upsilon}}\|_{2}\leq C_{2}\mathscr{L}\sqrt{\frac{\delta s\log d}{n}}~{}~{}~{}\mathrm{and}~{}~{}~{}\|\bm{\widehat{\Upsilon}}\|_{1}\leq C_{3}\mathscr{L}s\sqrt{\frac{\delta\log d}{n}}~{}

where :=RM+Δ2κ0\mathscr{L}:=\frac{R\sqrt{M}+\Delta^{2}}{\kappa_{0}}.

Theorem 6 generalizes [35, Thm. 2(b)] to the uniform quantization setting. Clearly, the obtained rate remains near minimax optimal if RR is of minor scaling (e.g., bounded or logarithmic factors). Nevertheless, such near optimality in Theorem 6 comes at the cost of more restricted conditions and stronger scaling, as remarked in the following.

Remark 2.

(Comparing Theorems 5-6). Compared with nslogdn\gtrsim s\log d in Theorem 5, the first downside of Theorem 6 is the sub-optimal sample complexity ns2logdn\gtrsim s^{2}\log d, and note that ns2logdn\gtrsim s^{2}\log d is also required in [35, Thm. 2(b)]. But indeed, it can be improved to nslogdn\gtrsim s\log d by explicitly adding the constraint 𝛉1R\|\bm{\theta}\|_{1}\leq R to the recovery program, as will be noted as an interesting side finding in Remark 6. Secondly, following [35] we impose an 1\ell_{1}-norm constraint 𝛉1R\|\bm{\theta^{\star}}\|_{1}\leq R that is stronger than 𝛉2M1/(2l)σ\|\bm{\theta^{\star}}\|_{2}\lesssim\frac{M^{1/(2l)}}{\sigma} used in the proof of Theorem 5. In fact, when replacing the 1\ell_{1} constraint in Theorem 6 with an 2\ell_{2}-norm bound 𝛉2R\|\bm{\theta^{\star}}\|_{2}\leq R, then our proof technique leads to an error rate 𝚼^2=O(s2logdn)\|\bm{\widehat{\Upsilon}}\|_{2}=O\big{(}\sqrt{\frac{s^{2}\log d}{n}}\big{)} that exhibits worse dependence on ss.

Remark 3.

(Modification of 2\ell_{2}-loss). Recall that we generalize the regular 2\ell_{2}-loss 12nk=1n(yk𝐱k𝛉)2\frac{1}{2n}\sum_{k=1}^{n}(y_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2} to 12𝛉𝐐𝛉𝐛𝛉\frac{1}{2}\bm{\theta}^{\top}\bm{Q\theta}-\bm{b}^{\top}\bm{\theta} as loss function in (3.6). Note that the choice of (𝐐,𝐛)(\bm{Q},\bm{b}) in Theorem 5 is tantamount to using the loss function 12nk=1n(y˙k𝐱k𝛉)2\frac{1}{2n}\sum_{k=1}^{n}(\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2} that replaces yky_{k} with the quantized response y˙k\dot{y}_{k}; this idea is analogous to the generalized Lasso investigated for single index model [78] and dithered quantized model [89], and will be used again in quantized matrix completion, see (3.8) below. However, our generalized 2\ell_{2}-loss provides more flexibility to deal with heavy-tailedness or quantization of 𝐱k\bm{x}_{k}, e.g., (𝐐,𝐛)(\bm{Q},\bm{b}) in Theorem 6 amounts to adopting 12nk=1n(y˙k𝐱~k𝛉)2\frac{1}{2n}\sum_{k=1}^{n}({\dot{y}_{k}}-\bm{\widetilde{x}}_{k}^{\top}\bm{\theta})^{2} as loss function, and under quantized covariate more delicate modifications are required in Theorems 9-12, which is beyond the range of prior works on generalized Lasso.

3.3 Quantized Matrix Completion

Completing a low-rank matrix from only a partial observation of its entries is known as the matrix completion problem, which has found many applications including recommender systems, image inpainting and quantum state tomography [19, 27, 2, 74, 42], to name just a few. Mathematically, letting 𝚯d×d\bm{\Theta^{\star}}\in\mathbb{R}^{d\times d} be the underlying matrix satisfying rank(𝚯)r\operatorname{rank}(\bm{\Theta^{\star}})\leq r, the matrix completion problem can be formulated as

yk=<𝑿k,𝚯>+ϵk,k=1,2,,n,y_{k}=\big{<}\bm{X}_{k},\bm{\Theta^{\star}}\big{>}+\epsilon_{k},~{}k=1,2,...,n, (3.7)

where 𝑿k\bm{X}_{k}s are distributed on 𝒳:={𝒆i𝒆j:i,j[d]}\mathcal{X}:=\{\bm{e}_{i}\bm{e}_{j}^{\top}:i,j\in[d]\} (𝒆i\bm{e}_{i} is the ii-th column of 𝑰d\bm{I}_{d}) and ϵk\epsilon_{k} is the observation noise. Note that for 𝑿k=𝒆i(k)𝒆j(k)\bm{X}_{k}=\bm{e}_{i(k)}\bm{e}_{j(k)}^{\top} one has <𝑿k,𝚯>=θi(k),j(k)\big{<}\bm{X}_{k},\bm{\Theta^{\star}}\big{>}=\theta^{\star}_{i(k),j(k)}, so each observation is a noisy entry. Our main interest is in quantized matrix completion (QMC), where the goal is to design a quantizer for the observation yky_{k} that allows for accurate estimation of 𝚯\bm{\Theta^{\star}} from the quantized observations.

Unlike in compressed sensing, an additional condition (besides the low-rankness) on 𝚯\bm{\Theta^{\star}} is needed to ensure the well-posedness of the matrix completion problem. More specifically, certain incoherence conditions are required if we pursue exact recovery (e.g., [15, 82]), whereas a faithful estimation can be achieved as long as the underlying matrix is not overly spiky and is sufficiently diffuse (e.g., [51, 71]). The latter condition is also known as “low spikiness” and is formulated as d𝚯𝚯Fα\frac{d\|\bm{\Theta^{\star}}\|_{\infty}}{\|\bm{\Theta^{\star}}\|_{F}}\leq\alpha [35, 71], which has been noted to be necessary for the well-posedness of the matrix completion problem [27, 71]. In subsequent works the low-spikiness condition is often formulated as the simpler max-norm constraint 𝚯α\|\bm{\Theta^{\star}}\|_{\infty}\leq\alpha [51, 19, 53, 26, 37].

In this work, we consider the uniform sampling scheme 𝑿k𝒰(𝒳)\bm{X}_{k}\sim\mathscr{U}(\mathcal{X}), but with a little bit more work it generalizes to more general sampling scheme [51]. We apply the proposed quantization scheme to possibly heavy-tailed yky_{k} — we truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) with some threshold ζy\zeta_{y}, and then quantize y~k\widetilde{y}_{k} to y˙k=𝒬Δ(y~k+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). Because we do not pursue exact recovery (which is impossible under quantization), we do not assume any incoherence condition like [82]. Instead, we only hope to accurately estimate 𝚯\bm{\Theta^{\star}}, and following [51, 19, 53, 26, 37] we impose a max-norm constraint

𝚯α.\|\bm{\Theta^{\star}}\|_{\infty}\leq\alpha.

Overall, we estimate 𝚯\bm{\Theta^{\star}} from (𝑿k,y˙k)(\bm{X}_{k},\dot{y}_{k}) by the regularized M-estimator [70, 72]

𝚯^=argmin𝚯α12nk=1n(y˙k<𝑿k,𝚯>)2+λ𝚯nu\bm{\widehat{\Theta}}=\mathop{\arg\min}\limits_{\|\bm{\Theta}\|_{\infty}\leq\alpha}~{}\frac{1}{2n}\sum_{k=1}^{n}\big{(}\dot{y}_{k}-\big{<}\bm{X}_{k},\bm{\Theta}\big{>}\big{)}^{2}+\lambda\|\bm{\Theta}\|_{nu} (3.8)

that combines an 2\ell_{2}-loss with a nuclear norm regularizer.
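As an informal illustration of (3.8), the following sketch alternates a gradient step on the squared loss over the observed entries, singular-value soft-thresholding for the nuclear norm, and entry-wise clipping for the max-norm constraint. This alternation is only a heuristic stand-in for the ADMM solver used in our experiments (Section 5), and all parameters are illustrative.

```python
import numpy as np

def svt(M, t):
    """Singular value soft-thresholding: proximal operator of t * ||.||_nu."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def qmc_estimate(rows, cols, y_dot, d, lam, alpha, step=1.0, n_iter=300):
    """Estimate Theta from quantized observations y_dot at entries (rows[k], cols[k])."""
    Theta = np.zeros((d, d))
    n = len(y_dot)
    for _ in range(n_iter):
        resid = Theta[rows, cols] - y_dot          # residuals on observed entries
        grad = np.zeros((d, d))
        np.add.at(grad, (rows, cols), resid / n)   # gradient of the averaged squared loss
        Theta = svt(Theta - step * grad, step * lam)
        Theta = np.clip(Theta, -alpha, alpha)      # enforce the max-norm constraint
    return Theta
```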

In the literature, there has been a line of works on 1-bit or multi-bit matrix completion related to the results we present here [14, 59, 52, 17, 3]. While the referenced works commonly adopted a likelihood approach, our method is an essential departure and enjoys some advantages; see a precise comparison in Remark 4. In light of this novelty, we include the result for sub-exponential ϵk\epsilon_{k} in Theorem 7, for which the truncation of yky_{k} becomes unnecessary and we simply set ζy=\zeta_{y}=\infty.

Theorem 7.

(QMC under Sub-Exponential Noise). Given some Δ>0,δ>0\Delta>0,\delta>0, in (3.7) we suppose that 𝐗k\bm{X}_{k}s are i.i.d. uniformly distributed over 𝒳={𝐞i𝐞j:i,j[d]}\mathcal{X}=\{\bm{e}_{i}\bm{e}_{j}^{\top}:i,j\in[d]\}, 𝚯d×d\bm{\Theta^{\star}}\in\mathbb{R}^{d\times d} satisfies rank(𝚯)r\operatorname{rank}(\bm{\Theta^{\star}})\leq r and 𝚯α\|\bm{\Theta^{\star}}\|_{\infty}\leq\alpha, the noise ϵk\epsilon_{k}s are i.i.d. zero-mean sub-exponential satisfying ϵkψ1σ\|\epsilon_{k}\|_{\psi_{1}}\leq\sigma, and are independent of 𝐗k\bm{X}_{k}s. In the quantization, we do not truncate yky_{k} but directly quantize it to y˙k=𝒬Δ(yk+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(y_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). We choose λ=C1(σ+Δ)δlogdnd\lambda=C_{1}(\sigma+\Delta)\sqrt{\frac{\delta\log d}{nd}} with sufficiently large C1C_{1}, and define 𝚯^\bm{\widehat{\Theta}} as (3.8). If δdlog3dnδr2d2logd\delta d\log^{3}d\lesssim n\lesssim\delta r^{2}d^{2}\log d, then with probability at least 14dδ1-4d^{-\delta}, the estimation error 𝚼^:=𝚯^𝚯\bm{\widehat{\Upsilon}}:=\bm{\widehat{\Theta}}-\bm{\Theta^{\star}} satisfies

𝚼^FdC2δrdlogdnand𝚼^nudC3rδdlogdn\frac{\|\bm{\widehat{\Upsilon}}\|_{F}}{d}\leq C_{2}\mathscr{L}\sqrt{\frac{\delta rd\log d}{n}}~{}~{}\mathrm{and}~{}~{}\frac{\|\bm{\widehat{\Upsilon}}\|_{nu}}{d}\leq C_{3}\mathscr{L}r\sqrt{\frac{\delta d\log d}{n}}

where :=α+σ+Δ\mathscr{L}:=\alpha+\sigma+\Delta.

By contrast, under heavy-tailed noise only assumed to have bounded variance, we truncate yky_{k} with a suitable threshold before the dithered quantization to achieve an optimal trade-off between bias and variance.

Theorem 8.

(QMC under Heavy-tailed Noise). Given some Δ>0,δ>0\Delta>0,\delta>0, we consider (3.7) in the setting of Theorem 7 but with the assumption ϵkψ1σ\|\epsilon_{k}\|_{\psi_{1}}\leq\sigma replaced by 𝔼|ϵk|2M\mathbbm{E}|\epsilon_{k}|^{2}\leq M. In the quantization, we truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) with ζy(M+α)nδdlogd\zeta_{y}\asymp(\sqrt{M}+\alpha)\sqrt{\frac{n}{\delta d\log d}}, and then quantize y~k\widetilde{y}_{k} to y˙k=𝒬Δ(y~k+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). We choose λ=C1(α+M+Δ)δlogdnd\lambda=C_{1}(\alpha+\sqrt{M}+\Delta)\sqrt{\frac{\delta\log d}{nd}} with sufficiently large C1C_{1}, and define 𝚯^\bm{\widehat{\Theta}} as (3.8). If δdlogdnδr2d2logd\delta d\log d\lesssim n\lesssim\delta r^{2}d^{2}\log d, then with probability at least 16dδ1-6d^{-\delta}, the estimation error 𝚼^:=𝚯^𝚯\bm{\widehat{\Upsilon}}:=\bm{\widehat{\Theta}}-\bm{\Theta^{\star}} satisfies

𝚼^FdC2δrdlogdnand𝚼^nudC3rδdlogdn\frac{\|\bm{\widehat{\Upsilon}}\|_{F}}{d}\leq C_{2}\mathscr{L}\sqrt{\frac{\delta rd\log d}{n}}~{}~{}\mathrm{and}~{}~{}\frac{\|\bm{\widehat{\Upsilon}}\|_{nu}}{d}\leq C_{3}\mathscr{L}r\sqrt{\frac{\delta d\log d}{n}}

where :=α+M+Δ\mathscr{L}:=\alpha+\sqrt{M}+\Delta.

Compared to the information-theoretic lower bounds in [71, 55], the error rates obtained in Theorems 7-8 are minimax optimal up to logarithmic factors. In particular, Theorem 8 provides a near optimal guarantee for QMC with heavy-tailed observations, which is a key standpoint of this paper. Note that the 1-bit quantization counterparts of these two theorems were derived in our previous work [21]; in sharp contrast to Theorem 8, for 1-bit QMC under heavy-tailed noise, the error rate under 𝚼^Fd\frac{\|\bm{\widehat{\Upsilon}}\|_{F}}{d} in [21, Thm. 13] reads as O((r2dlogdn)1/4)O\big{(}\big{(}\frac{r^{2}d\log d}{n}\big{)}^{1/4}\big{)} and is essentially slower; when using the 1-bit observations therein, this slow error rate is indeed nearly tight due to the lower bound in [21, Thm. 14].

To close this section, we give a remark to illustrate the novelty and advantage of our QMC method by a careful comparison with prior works.

Remark 4.

QMC with 1-bit or multi-bit quantized observations has received considerable research interest [26, 14, 59, 52, 17, 3]. Adapted to our notation, these works studied the model y˙k=𝒬(<𝐗k,𝚯>+τk)\dot{y}_{k}=\mathcal{Q}(\big{<}\bm{X}_{k},\bm{\Theta^{\star}}\big{>}+\tau_{k}) under general random dither τk\tau_{k} and quantizer 𝒬(.)\mathcal{Q}(.), and they commonly adopted regularized (or constrained) maximum likelihood estimation for estimating 𝚯\bm{\Theta^{\star}}. By contrast, with the random dither and quantizer specialized to τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) and 𝒬Δ(.)\mathcal{Q}_{\Delta}(.), our model is formulated as y˙k=𝒬Δ(𝒯ζy(<𝐗k,𝚯>+ϵk)+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\mathscr{T}_{\zeta_{y}}(\big{<}\bm{X}_{k},\bm{\Theta^{\star}}\big{>}+\epsilon_{k})+\tau_{k}). Thus, while suffering from less generality in (τk,𝒬)(\tau_{k},\mathcal{Q}), our method embraces the advantage of robustness to pre-quantization noise ϵk\epsilon_{k}, whose distribution is unknown and can even be heavy-tailed. Note that such unknown ϵk\epsilon_{k} evidently forbids the likelihood approach.

4 Covariate Quantization and Uniform Signal Recovery in Quantized Compressed Sensing

By now we have presented near optimal results in the contexts of QCME, QCS and QMC for heavy-tailed data that further undergo the proposed quantization scheme, which we position as the primary contribution of this work. In this section, we provide two additional developments that enhance our results on heavy-tailed QCS.

4.1 Covariate Quantization

In the area of QCS, almost all prior works merely focused on the quantization of the response yky_{k}, see the recent survey [29]; here, we consider a setting of “complete quantization” — meaning that the covariate 𝒙k\bm{x}_{k} is also quantized. To motivate our study of “complete quantization”, we interpret compressed sensing as sparse linear regression. Indeed, to reduce power consumption and computational cost, it is sometimes preferable to work with low-precision data in a machine learning system, e.g., the sample quantization scheme developed in [96] led to experimental success in training linear models. Also, it was shown that direct gradient quantization may not be efficient in certain distributed learning systems where the terminal nodes are connected to the server only through a very weak communication fabric and the number of parameters is extremely large; rather, quantizing and transmitting some important samples could provably reduce communication cost [43]. In fact, the process of data collection may already appeal to quantization due to certain limits of the data acquisition device (e.g., a low-resolution analog-to-digital module used in distributed signal processing [25]). Our main goal is to understand how quantization of (𝒙k,yk)(\bm{x}_{k},y_{k})s affects the subsequent recovery/learning process, particularly showing that the simple dithered uniform quantization scheme still allows for accurate estimators that may even attain near minimax error rates. To our best knowledge, the only prior rigorous estimation guarantees for QCS with covariate quantization are [21, Thms. 7-8]; these two results require a restrictive and unnatural assumption, which we will also relax later.

4.1.1 Multi-bit QCS with Quantized Covariate

Since we will also consider 1-bit quantization, we more precisely refer to QCS under the uniform quantizer as multi-bit QCS. We will generalize Theorems 5-6 to covariate quantization in the next two theorems.

Let (𝒙˙k,y˙k)(\bm{\dot{x}}_{k},\dot{y}_{k}) be the quantized covariate-response pair; we first quickly sketch the idea of our approach. Specifically, we stick to the framework of the M-estimator in (3.6), which appeals to accurate surrogates for 𝚺=𝔼(𝒙k𝒙k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}) and 𝚺y𝒙=𝔼(yk𝒙k)\bm{\Sigma}_{y\bm{x}}=\mathbbm{E}(y_{k}\bm{x}_{k}) based on (𝒙˙k,y˙k)(\bm{\dot{x}}_{k},\dot{y}_{k}), where 𝒙˙𝒌\bm{\dot{x}_{k}} represents the quantized covariate. Fortunately, the surrogates can be constructed analogously to our QCME estimator when a triangular dither is used for quantizing 𝒙k\bm{x}_{k}. Let us first state our quantization scheme as follows:

  • Response Quantization. This is the same as Theorems 5-6. We truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) with threshold ζy\zeta_{y}, and then quantize y~k\widetilde{y}_{k} to y˙k=𝒬Δ(y~k+ϕk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\phi_{k}) with uniform dither ϕk𝒰([Δ2,Δ2])\phi_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) and quantization level Δ0\Delta\geq 0.

  • Covariate Quantization. This is the same as Theorem 2. We truncate 𝒙k\bm{x}_{k} to 𝒙~k=𝒯ζx(𝒙k)\bm{\widetilde{x}}_{k}=\mathscr{T}_{\zeta_{x}}(\bm{x}_{k}) with threshold ζx\zeta_{x}, and then quantize 𝒙~k\bm{\widetilde{x}}_{k} to 𝒙˙k=𝒬Δ¯(𝒙~k+𝝉k)\bm{\dot{x}}_{k}=\mathcal{Q}_{\bar{\Delta}}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k}) with triangular dither 𝝉k𝒰([Δ¯2,Δ¯2]d)+𝒰([Δ¯2,Δ¯2]d)\bm{\tau}_{k}\sim\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d})+\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d}) and quantization level Δ¯0\bar{\Delta}\geq 0.

  • Notation. We write the quantization noise as φk=y˙ky~k\varphi_{k}=\dot{y}_{k}-\widetilde{y}_{k} and 𝝃k=𝒙˙k𝒙~k\bm{\xi}_{k}=\bm{\dot{x}}_{k}-\bm{\widetilde{x}}_{k}, the quantization error as ϑk=y˙k(y~k+ϕk)\vartheta_{k}=\dot{y}_{k}-(\widetilde{y}_{k}+\phi_{k}) and 𝒘k=𝒙˙k(𝒙~k+𝝉k)\bm{w}_{k}=\bm{\dot{x}}_{k}-(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k}).

We will adopt the above notation in subsequent developments. Based on the quantized covariate-response pairs (𝒙˙k,y˙k)(\bm{\dot{x}}_{k},\dot{y}_{k})s, we specify our estimator by setting (𝑸,𝒃)(\bm{Q},\bm{b}) in (3.6) as

𝑸=1nk=1n𝒙˙k𝒙˙kΔ¯24𝑰dand𝒃=1nk=1ny˙k𝒙˙k.\bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}~{}~{}\mathrm{and}~{}~{}\bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}. (4.1)

Note that the choice of 𝑸\bm{Q} is due to the estimator in Theorem 2, while 𝒃\bm{b} is inspired by the calculation

𝔼(y˙k𝒙˙k)=𝔼((y~k+φk)(𝒙~k+𝝃k))\displaystyle\mathbbm{E}(\dot{y}_{k}\bm{\dot{x}}_{k})=\mathbbm{E}\big{(}(\widetilde{y}_{k}+\varphi_{k})(\bm{\widetilde{x}}_{k}+\bm{\xi}_{k})\big{)}
=𝔼(y~k𝒙~k)+𝔼(y~k𝝃k)+𝔼(φk𝒙~k)+𝔼(φk𝝃k)=𝔼(y~k𝒙~k),\displaystyle=\mathbbm{E}(\widetilde{y}_{k}\bm{\widetilde{x}}_{k})+\mathbbm{E}(\widetilde{y}_{k}\bm{\xi}_{k})+\mathbbm{E}(\varphi_{k}\bm{\widetilde{x}}_{k})+\mathbbm{E}(\varphi_{k}\bm{\xi}_{k})=\mathbbm{E}(\widetilde{y}_{k}\bm{\widetilde{x}}_{k}),

where the last equality can be seen by conditioning on 𝒙~k\bm{\widetilde{x}}_{k} or y~k\widetilde{y}_{k}. However, the issue is that 𝑸\bm{Q} is not positive semi-definite, hence the resulting program is non-convex. To explain this, note that the rank of 1nk=1n𝒙˙k𝒙˙k\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top} does not exceed nn, so when d>nd>n at least dnd-n eigenvalues of 𝑸\bm{Q} are Δ¯24-\frac{\bar{\Delta}^{2}}{4}. Alternatively, the non-convexity can also be seen from the observation that setting (𝑸,𝒃)(\bm{Q},\bm{b}) as in (4.1) is tantamount to replacing the regular 2\ell_{2}-loss 12nk=1n(yk𝒙k𝜽)2\frac{1}{2n}\sum_{k=1}^{n}(y_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2} with

\frac{1}{2n}\sum_{k=1}^{n}(\dot{y}_{k}-\bm{\dot{x}}_{k}^{\top}\bm{\theta})^{2}-\frac{\bar{\Delta}^{2}}{8}\|\bm{\theta}\|_{2}^{2}.

We mention that the lack of positive semi-definiteness of 𝑸\bm{Q} is problematic from both the statistical and the optimization perspectives: 1) Statistically, Lemma 4, which is used to derive the error rates in Theorems 5-6, requires 𝑸\bm{Q} to be positive semi-definite and is hence no longer applicable here; 2) From the optimization side, it is in general unknown how to globally optimize a non-convex program.
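The following minimal sketch builds the surrogates (4.1) from triangularly dithered covariates and illustrates this loss of positive semi-definiteness numerically; the Gaussian data stand in for the (truncated) covariates and responses, and all sizes and levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, bar_Delta = 50, 100, 1.0                    # illustrative sizes with d > n
X_tilde = rng.standard_normal((n, d))             # stand-in for the (truncated) covariates
y_dot = rng.standard_normal(n)                    # stand-in for the quantized responses

# Triangular dither: sum of two independent uniform dithers on [-bar_Delta/2, bar_Delta/2].
tri = rng.uniform(-bar_Delta / 2, bar_Delta / 2, (n, d)) \
    + rng.uniform(-bar_Delta / 2, bar_Delta / 2, (n, d))
X_dot = bar_Delta * np.round((X_tilde + tri) / bar_Delta)   # dithered uniform quantization

Q = X_dot.T @ X_dot / n - (bar_Delta ** 2 / 4) * np.eye(d)  # bias-corrected surrogate of Sigma*
b = X_dot.T @ y_dot / n                                     # surrogate of Sigma_{yx}
print(np.linalg.eigvalsh(Q).min())                          # strictly negative since d > n
```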

Motivated by a line of previous works on non-convex M-estimators [63, 64, 62], we add an 1\ell_{1}-norm constraint to (3.6) by setting 𝒮={𝜽d:𝜽1R}\mathcal{S}=\{\bm{\theta}\in\mathbb{R}^{d}:\|\bm{\theta}\|_{1}\leq R\}, where RR represents a prior estimate of 𝜽1\|\bm{\theta^{\star}}\|_{1}. Let 𝜽11\partial\|\bm{\theta}_{1}\|_{1} be a subdifferential of 𝜽1\|\bm{\theta}\|_{1} at 𝜽=𝜽1\bm{\theta}=\bm{\theta}_{1}; accordingly, 𝜽~1\partial\|\bm{\widetilde{\theta}}\|_{1} in (4.2) below should be understood as “there exists one element in 𝜽~1\partial\|\bm{\widetilde{\theta}}\|_{1} such that (4.2) holds.” We consider a local minimizer of the proposed recovery program (whose existence is guaranteed by the additional 1\ell_{1}-constraint), or more generally, any 𝜽~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} that satisfies the following condition (to distinguish it from the global minimizer of (3.6), we denote by 𝜽~\bm{\widetilde{\theta}} the estimator in QCS with quantized covariate):

<𝑸𝜽~𝒃+λ𝜽~1,𝜽𝜽~>0,𝜽𝒮.\big{<}\bm{Q}\bm{\widetilde{\theta}}-\bm{b}+\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\theta}-\bm{\widetilde{\theta}}\big{>}\geq 0,~{}~{}\forall~{}\bm{\theta}\in\mathcal{S}. (4.2)

We will prove a fairly strong guarantee stating that all 𝜽~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} satisfying (4.2) (of course including all local minimizers) enjoy near minimax error rates. While this guarantee bears resemblance to the ones in [64], we point out that [64] only derived concrete results for the sub-Gaussian regime; because of the heavy-tailed data and quantization in our setting, some essentially different ingredients are required for the technical analysis (see Remark 5). As before, our results for sub-Gaussian 𝒙k\bm{x}_{k} and heavy-tailed 𝒙k\bm{x}_{k} are presented separately.

Theorem 9.

(Quantized Sub-Gaussian Covariate). Given Δ0\Delta\geq 0, Δ¯0\bar{\Delta}\geq 0, δ>0\delta>0, we consider (3.4) with the same assumptions on (𝐱k,yk,𝛉)(\bm{x}_{k},y_{k},\bm{\theta^{\star}}) as Theorem 5, and additionally assume that 𝛉2R\|\bm{\theta^{\star}}\|_{2}\leq R. The quantization of (𝐱k,yk)(\bm{x}_{k},y_{k}) is described above, and we set ζx=\zeta_{x}=\infty, ζynM1/lδlogd\zeta_{y}\asymp\sqrt{\frac{nM^{1/l}}{\delta\log d}}. For recovery, we let 𝐐=1nk=1n𝐱˙k𝐱˙kΔ¯24𝐈d\bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}, 𝐛=1nk=1ny˙k𝐱˙k\bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}, 𝒮={𝛉:𝛉1Rs}\mathcal{S}=\{\bm{\theta}:\|\bm{\theta}\|_{1}\leq R\sqrt{s}\} and set λ=C1(σ+Δ¯)2κ0(Δ+M1/(2l))δlogdn\lambda=C_{1}\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}(\Delta+M^{1/(2l)})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C1C_{1}. If nδslogdn\gtrsim\delta s\log d for some hidden constant only depending on (κ0,σ,Δ,Δ¯,M,R)(\kappa_{0},\sigma,\Delta,\bar{\Delta},M,R), with probability at least 18d1δC2exp(C3n)1-8d^{1-\delta}-C_{2}\exp(-C_{3}n), all 𝛉~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} satisfying (4.2) have estimation error 𝚼~:=𝛉~𝛉\bm{\widetilde{\Upsilon}}:=\bm{\widetilde{\theta}}-\bm{\theta^{\star}} bounded by

𝚼~2Cδslogdnand𝚼~1Csδlogdn\|\bm{\widetilde{\Upsilon}}\|_{2}\leq C\mathscr{L}\sqrt{\frac{\delta s\log d}{n}}~{}~{}\mathrm{and}~{}~{}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq C^{\prime}\mathscr{L}s\sqrt{\frac{\delta\log d}{n}}

where \mathscr{L}:=\frac{(\sigma+\bar{\Delta})^{2}(\Delta+M^{1/(2l)})}{\kappa_{0}^{3/2}}.

Similarly, the next result extends Theorem 6 to a setting involving covariate quantization.

Theorem 10.

(Quantized Heavy-Tailed Covariate). Given Δ0\Delta\geq 0, Δ¯0\bar{\Delta}\geq 0, δ>0\delta>0, we consider (3.4) with the same assumptions on (𝐱k,yk,𝛉)(\bm{x}_{k},y_{k},\bm{\theta^{\star}}) as Theorem 6. The quantization of (𝐱k,yk)(\bm{x}_{k},y_{k}) is described above, and we set ζx,ζy(nMδlogd)1/4\zeta_{x},\zeta_{y}\asymp\big{(}\frac{nM}{\delta\log d}\big{)}^{1/4}. For recovery, we let 𝐐=1nk=1n𝐱˙k𝐱˙kΔ¯24𝐈d\bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}, 𝐛=1nk=1ny˙k𝐱˙k\bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}, 𝒮={𝛉:𝛉1R}\mathcal{S}=\{\bm{\theta}:\|\bm{\theta}\|_{1}\leq R\} and set λ=C1(RM+Δ2+RΔ¯2)δlogdn\lambda=C_{1}(R\sqrt{M}+\Delta^{2}+R\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C1C_{1}. If nδslogdn\gtrsim\delta s\log d for some hidden constant only depending on (κ0,M)(\kappa_{0},M), then with probability at least 18d1δ1-8d^{1-\delta}, all 𝛉~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} satisfying (4.2) have estimation error 𝚼~:=𝛉~𝛉\bm{\widetilde{\Upsilon}}:=\bm{\widetilde{\theta}}-\bm{\theta^{\star}} bounded by

𝚼~2C3δslogdnand𝚼~1C4sδlogdn.\|\bm{\widetilde{\Upsilon}}\|_{2}\leq C_{3}\mathscr{L}\sqrt{\frac{\delta s\log d}{n}}~{}~{}\mathrm{and}~{}~{}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq C_{4}\mathscr{L}s\sqrt{\frac{\delta\log d}{n}}.

where :=RM+Δ2+RΔ¯2κ0\mathscr{L}:=\frac{R\sqrt{M}+\Delta^{2}+R\bar{\Delta}^{2}}{\kappa_{0}}.

Remark 5.

(Comparing Our Analyses with [64]). The above results are motivated by a line of works on nonconvex M-estimators [63, 64, 62], and our guarantee for the whole set of stationary points (4.2) resembles [64] most. While the main strategy for proving Theorem 9 is adjusted from [64], the proof of Theorem 10 does involve an essentially different RSC condition, see our (B.9). In particular, compared with [64, equation (4)], the leading factor of 𝚼~12\|\bm{\widetilde{\Upsilon}}\|_{1}^{2} in (B.9) degrades from O(logdn)O\big{(}\frac{\log d}{n}\big{)} to O(logdn)O\big{(}\sqrt{\frac{\log d}{n}}\big{)}. To retain a near optimal rate, we need to impose the stronger scaling 𝛉1R\|\bm{\theta^{\star}}\|_{1}\leq R with proper changes in the proof. Although Theorem 10 is presented for a concrete setting, it sheds light on extending [64] to a weaker RSC condition that can accommodate covariates with heavier tails. Such an extension is formally presented as a deterministic framework in Proposition 1.

Proposition 1.

Suppose that the ss-sparse 𝛉d\bm{\theta^{\star}}\in\mathbb{R}^{d} satisfies 𝛉1R\|\bm{\theta^{\star}}\|_{1}\leq R, and the positive definite matrix 𝚺d×d\bm{\Sigma^{\star}}\in\mathbb{R}^{d\times d} satisfies λmin(𝚺)κ0\lambda_{\min}(\bm{\Sigma^{\star}})\geq\kappa_{0}. If for some 𝐐d×d,𝐛d\bm{Q}\in\mathbb{R}^{d\times d},\bm{b}\in\mathbb{R}^{d} we have

λC1max{𝑸𝜽𝒃,R𝑸𝚺}\lambda\geq C_{1}\max\big{\{}\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty},~{}R\cdot\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\big{\}} (4.3)

holds for sufficiently large C1C_{1}, then all 𝛉~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} satisfying (4.2) with 𝒮={𝛉d:𝛉1R}\mathcal{S}=\{\bm{\theta}\in\mathbb{R}^{d}:\|\bm{\theta}\|_{1}\leq R\} have estimation error 𝚼~:=𝛉~𝛉\bm{\widetilde{\Upsilon}}:=\bm{\widetilde{\theta}}-\bm{\theta^{\star}} bounded by

𝚼~2C2sλκ0and𝚼~1C3sλκ0.\displaystyle\|\bm{\widetilde{\Upsilon}}\|_{2}\leq C_{2}\frac{\sqrt{s}\lambda}{\kappa_{0}}~{}~{}\mathrm{and}~{}~{}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq C_{3}\frac{s\lambda}{\kappa_{0}}~{}.

By extracting the ingredients that guarantee (4.2) to be accurate, interestingly, Proposition 1 is now independent of the model assumption (3.4). In particular, we could set 𝚺=𝔼[𝒙k𝒙k]\bm{\Sigma^{\star}}=\mathbbm{E}[\bm{x}_{k}\bm{x}_{k}^{\top}] when we apply Proposition 1 to (3.4). Compared with the framework of [64, Thm. 1], the key strength of Proposition 1 is that it does not explicitly assume an RSC condition on the loss function, which is hard to verify without assuming sub-Gaussian covariates. Instead, the role of the RSC assumption in [64] is now played by λR𝑸𝚺\lambda\gtrsim R\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}, which immediately yields a kind of RSC condition by a simple argument as in (B.10). Although this RSC condition is often essentially weaker than the conventional one in terms of the leading factor of 𝚼~12\|\bm{\widetilde{\Upsilon}}\|_{1}^{2} (see Remark 5), along this line one can still derive non-trivial (or even near optimal) error rates. The gain of replacing the RSC assumption with λR𝑸𝚺\lambda\gtrsim R\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty} is that the latter amounts to constructing an element-wise estimator for 𝚺\bm{\Sigma^{\star}}, which is often much easier for heavy-tailed covariates (e.g., thanks to many existing robust covariance estimators).

We conclude this part with an interesting side observation.

Remark 6.

By setting Δ¯=0\bar{\Delta}=0, Theorem 10 produces a result (with convex program) for the setting of Theorem 6. Interestingly, with the additional 1\ell_{1}-constraint, a notable improvement is that the sub-optimal ns2logdn\gtrsim s^{2}\log d in Theorem 6 is sharpened to the near optimal one in Theorem 10. More concretely, this is because (ii) in (A.12) can be tightened to (ii)(ii) of (B.10). Going back to the full-data unquantized regime, Theorem 10 with Δ=Δ¯=0\Delta=\bar{\Delta}=0 recovers [35, Theorem 2(b)] with improved sample complexity requirement.

4.1.2 1-bit QCS with Quantized Covariate

Our consideration of covariate quantization in QCS seems fairly new to the literature. To the best of our knowledge, the only related results are [21, Thms. 7-8] for QCS with 1-bit quantized covariate and response. The assumption there, however, is quite restrictive. Specifically, it is assumed that 𝚺=𝔼(𝒙k𝒙k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}) has sparse columns (see [21, Assumption 3]), which is non-standard in both compressed sensing and sparse linear regression. Departing momentarily from our focus of dithered uniform quantization, we consider QCS under dithered 1-bit quantization and will apply Proposition 1 to derive results comparable to [21, Thms. 7-8] without resorting to the sparsity of 𝚺\bm{\Sigma^{\star}}.

We first review the 1-bit quantization scheme developed in [21]:

  • Response Quantization. We truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) with some threshold ζy\zeta_{y}, and then quantize y~k\widetilde{y}_{k} to y˙k=sgn(y~k+ϕk)\dot{y}_{k}=\operatorname{sgn}(\widetilde{y}_{k}+\phi_{k}) with uniform dither ϕk𝒰([γy,γy])\phi_{k}\sim\mathscr{U}([-\gamma_{y},\gamma_{y}]).

  • Covariate Quantization. We truncate 𝒙k\bm{x}_{k} to 𝒙~k=𝒯ζx(𝒙k)\bm{\widetilde{x}}_{k}=\mathscr{T}_{\zeta_{x}}(\bm{x}_{k}) with some threshold ζx\zeta_{x}, and then quantize 𝒙~k\bm{\widetilde{x}}_{k} to 𝒙˙k1=sgn(𝒙~k+𝝉k1)\bm{\dot{x}}_{k1}=\operatorname{sgn}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k1}) and 𝒙˙k2=sgn(𝒙~k+𝝉k2)\bm{\dot{x}}_{k2}=\operatorname{sgn}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k2}), where 𝝉k1,𝝉k2𝒰([γx,γx]d)\bm{\tau}_{k1},\bm{\tau}_{k2}\sim\mathscr{U}([-\gamma_{x},\gamma_{x}]^{d}) are independent uniform dithers. (Note that we collect 2 bits per entry).
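A minimal sketch of the surrogates (𝑸,𝒃) used in Theorems 11-12, built from the 1-bit scheme above, is as follows; the Gaussian stand-in data and the dither ranges are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, gamma_x, gamma_y = 500, 20, 3.0, 3.0   # illustrative sizes and dither ranges
X_tilde = rng.standard_normal((n, d))        # stand-in for the (truncated) covariates
y_tilde = rng.standard_normal(n)             # stand-in for the (truncated) responses

# Two independent uniform dithers give 2 bits per covariate entry.
X_dot1 = np.sign(X_tilde + rng.uniform(-gamma_x, gamma_x, (n, d)))
X_dot2 = np.sign(X_tilde + rng.uniform(-gamma_x, gamma_x, (n, d)))
y_dot = np.sign(y_tilde + rng.uniform(-gamma_y, gamma_y, n))

Q = gamma_x ** 2 / (2 * n) * (X_dot1.T @ X_dot2 + X_dot2.T @ X_dot1)  # surrogate of Sigma*
b = gamma_x * gamma_y / n * (X_dot1.T @ y_dot)                        # surrogate of Sigma_{yx}
```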

The following two results refine [21, Thms. 7-8] by deriving comparable error rates without using sparsity of 𝚺\bm{\Sigma^{\star}}.

Theorem 11.

(1-bit Quantized Sub-Gaussian Covariate). Given δ>0\delta>0, we consider (3.4) where the ss-sparse 𝛉\bm{\theta^{\star}} satisfies 𝛉1R\|\bm{\theta^{\star}}\|_{1}\leq R, 𝐱k\bm{x}_{k}s are i.i.d. zero-mean sub-Gaussian with 𝐱kψ2σ\|\bm{x}_{k}\|_{\psi_{2}}\leq\sigma, and 𝚺=𝔼(𝐱k𝐱k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}) satisfies λmin(𝚺)κ0\lambda_{\min}\big{(}\bm{\Sigma^{\star}}\big{)}\geq\kappa_{0} for some κ0>0\kappa_{0}>0, the noise ϵk{\epsilon}_{k}s are independent of 𝐱k\bm{x}_{k}s and i.i.d. sub-Gaussian, while for simplicity we assume ykψ2σ\|y_{k}\|_{\psi_{2}}\leq\sigma. In the quantization of (𝐱k,yk)(\bm{x}_{k},y_{k}) described above, we set ζx=ζy=\zeta_{x}=\zeta_{y}=\infty and γx,γyσlog(n2δlogd)\gamma_{x},\gamma_{y}\asymp\sigma\sqrt{\log\big{(}\frac{n}{2\delta\log d}\big{)}}. For recovery we let 𝐐:=γx22nk=1n(𝐱˙k1𝐱˙k2+𝐱˙k2𝐱˙k1)\bm{Q}:=\frac{\gamma_{x}^{2}}{2n}\sum_{k=1}^{n}\big{(}\bm{\dot{x}}_{k1}\bm{\dot{x}}_{k2}^{\top}+\bm{\dot{x}}_{k2}\bm{\dot{x}}_{k1}^{\top}\big{)}, 𝐛:=γxγynk=1ny˙k𝐱˙k1\bm{b}:=\frac{\gamma_{x}\gamma_{y}}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k1}, 𝒮={𝛉:𝛉1R}\mathcal{S}=\{\bm{\theta}:\|\bm{\theta}\|_{1}\leq R\} and set λ=C1σ2Rδlogd(logn)2n\lambda=C_{1}\sigma^{2}R\sqrt{\frac{\delta\log d(\log n)^{2}}{n}} with sufficiently large C1C_{1}. If nδslogd(logn)2n\gtrsim\delta s\log d(\log n)^{2}, then with probability at least 14d2δ1-4d^{2-\delta}, all 𝛉~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} satisfying (4.2) have estimation error 𝚼~:=𝛉~𝛉\bm{\widetilde{\Upsilon}}:=\bm{\widetilde{\theta}}-\bm{\theta^{\star}} bounded by

𝚼~2C2σ2κ0Rδslogd(logn)2nand𝚼~1C3σ2κ0Rsδlogd(logn)2n.\|\bm{\widetilde{\Upsilon}}\|_{2}\leq C_{2}\frac{\sigma^{2}}{\kappa_{0}}\cdot R\sqrt{\frac{\delta s\log d(\log n)^{2}}{n}}~{}~{}\mathrm{and}~{}~{}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq C_{3}\frac{\sigma^{2}}{\kappa_{0}}\cdot Rs\sqrt{\frac{\delta\log d(\log n)^{2}}{n}}.

Theorem 12.

(1-bit Quantized Heavy-Tailed Covariate). Given δ>0\delta>0, we consider (3.4) where the ss-sparse 𝛉\bm{\theta^{\star}} satisfies 𝛉1R\|\bm{\theta^{\star}}\|_{1}\leq R, 𝐱k\bm{x}_{k}s are i.i.d. zero-mean heavy-tailed satisfying the joint fourth moment constraint sup𝐯𝕊d1𝔼|𝐯𝐱k|4M\sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}|\bm{v}^{\top}\bm{x}_{k}|^{4}\leq M, and 𝚺=𝔼(𝐱k𝐱k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}) satisfies λmin(𝚺)κ0\lambda_{\min}\big{(}\bm{\Sigma^{\star}}\big{)}\geq\kappa_{0} for some κ0>0\kappa_{0}>0, the noise ϵk\epsilon_{k}s are independent of 𝐱k\bm{x}_{k}s and i.i.d. heavy-tailed with bounded fourth moment, while for simplicity we assume 𝔼|yk|4M\mathbbm{E}|y_{k}|^{4}\leq M. In the quantization of (𝐱k,yk)(\bm{x}_{k},y_{k}) described above, we set ζx,ζy,γx,γy(nM2δlogd)1/8\zeta_{x},\zeta_{y},\gamma_{x},\gamma_{y}\asymp\big{(}\frac{nM^{2}}{\delta\log d}\big{)}^{1/8} and enforce ζx<γx\zeta_{x}<\gamma_{x}, ζy<γy\zeta_{y}<\gamma_{y}. For recovery we let 𝐐:=γx22nk=1n(𝐱˙k1𝐱˙k2+𝐱˙k2𝐱˙k1)\bm{Q}:=\frac{\gamma_{x}^{2}}{2n}\sum_{k=1}^{n}\big{(}\bm{\dot{x}}_{k1}\bm{\dot{x}}_{k2}^{\top}+\bm{\dot{x}}_{k2}\bm{\dot{x}}_{k1}^{\top}\big{)}, 𝐛:=γxγynk=1ny˙k𝐱˙k1\bm{b}:=\frac{\gamma_{x}\gamma_{y}}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k1}, 𝒮={𝛉:𝛉1R}\mathcal{S}=\{\bm{\theta}:\|\bm{\theta}\|_{1}\leq R\} and set λ=C1MR(δlogdn)1/4\lambda=C_{1}\sqrt{M}R\big{(}\frac{\delta\log d}{n}\big{)}^{1/4} with sufficiently large C1C_{1}. If nδs2logdn\gtrsim\delta s^{2}\log d, then with probability at least 14d2δ1-4d^{2-\delta}, all 𝛉~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} satisfying (4.2) have estimation error 𝚼~:=𝛉~𝛉\bm{\widetilde{\Upsilon}}:=\bm{\widetilde{\theta}}-\bm{\theta^{\star}} bounded by

𝚼~2C2Mκ0R(δs2logdn)1/4and𝚼~1C3Mκ0Rs(δlogdn)1/4.\|\bm{\widetilde{\Upsilon}}\|_{2}\leq C_{2}\frac{\sqrt{M}}{\kappa_{0}}\cdot R\Big{(}\frac{\delta s^{2}\log d}{n}\Big{)}^{1/4}~{}~{}\mathrm{and}~{}~{}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq C_{3}\frac{\sqrt{M}}{\kappa_{0}}\cdot Rs\Big{(}\frac{\delta\log d}{n}\Big{)}^{1/4}.

4.2 Uniform Recovery Guarantee

Uniformity is a highly desired property for a compressed sensing guarantee because it allows one to use a fixed (possibly randomly drawn) measurement ensemble for all sparse signals. Unfortunately, like many other results for nonlinear compressed sensing in the literature, our earlier recovery guarantees are non-uniform and only ensure the accurate recovery of a sparse signal fixed before drawing the random measurement ensemble.

In this part, we provide a second additional development for QCS. Specifically, we establish a uniform recovery guarantee which, despite the heavy-tailed noise and nonlinear quantization scheme, notably retains a near minimax error rate. This is done by upgrading Theorem 5 to be uniform over all sparse 𝜽\bm{\theta^{\star}} via more in-depth technical tools and a careful covering argument. Part of the techniques is inspired by the prior works [40, 94], but certain technical innovations are required:

1) Like the recent work [40], one crucial technical tool in our proof is a powerful concentration inequality for product process due to Mendelson [67], as adapted in the present Lemma 9. However, [40] only studied sub-Gaussian distribution, and the results produced by their unified approach typically exhibit a decaying rate of O(n1/4)O(n^{-1/4}) [40, Sect. 4]. By contrast, our problem involves heavy-tailed noise only having bounded (2+ν)(2+\nu)-th moment (ν>0\nu>0), and we aim to establish a near minimax uniform error bound — cautiousness and new treatment are thus needed in the application of Lemma 9. More specifically, in the proof we need to bound

I1=sup𝜽Σs,R0sup𝒗𝒱k=1n(y~k𝒙k𝒗𝔼[y~k𝒙k𝒗]),I_{1}=\sup_{\bm{\theta}\in\Sigma_{s,R_{0}}}\sup_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}\big{(}\widetilde{y}_{k}\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\widetilde{y}_{k}\bm{x}_{k}^{\top}\bm{v}]\big{)},

where 𝒱={𝒗:𝒗2=1,𝒗12s}\mathscr{V}=\{\bm{v}:\|\bm{v}\|_{2}=1,\|\bm{v}\|_{1}\leq 2\sqrt{s}\}, and Σs,R0=Σs{𝜽d:𝜽2R0}\Sigma_{s,R_{0}}=\Sigma_{s}\cap\{\bm{\theta}\in\mathbb{R}^{d}:\|\bm{\theta}\|_{2}\leq R_{0}\} is the signal space of interest, and recall that y~k=𝒯ζy(𝒙k𝜽+ϵk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}+\epsilon_{k}) with sub-Gaussian 𝒙k\bm{x}_{k}. It is natural to invoke Lemma 9 to bound I1I_{1} straightforwardly, but the issue is on lack of good bound for y~kψ2\|\widetilde{y}_{k}\|_{\psi_{2}} due to the heavy-tailedness of ϵk\epsilon_{k}; indeed, one only has the trivial estimate as y~kψ2=O(ζy)\|\widetilde{y}_{k}\|_{\psi_{2}}=O(\zeta_{y}), which is much worse than an O(1)O(1) bound since ζynδlogd\zeta_{y}\asymp\sqrt{\frac{n}{\delta\log d}}, and using Lemma 9 with this estimate leads to a loose bound for I1I_{1} and finally a sub-optimal error rate. To address the issue, our main idea is to introduce the truncated heavy-tailed noise 𝒯ζy(ϵk)\mathscr{T}_{\zeta_{y}}(\epsilon_{k}) and define z~k=y~k𝒯ζy(ϵk)\widetilde{z}_{k}=\widetilde{y}_{k}-\mathscr{T}_{\zeta_{y}}(\epsilon_{k}), which enables us to decompose I1I_{1} as

I1sup𝜽Σs,R0sup𝒗𝒱k=1n(z~k𝒙k𝒗𝔼[z~k𝒙k𝒗]):=I11+sup𝒗𝒱k=1n(𝒯ζy(ϵk)𝒙k𝒗𝔼[𝒯ζy(ϵk)𝒙k𝒗]):=I12.I_{1}\leq\underbrace{\sup_{\bm{\theta}\in\Sigma_{s,R_{0}}}\sup_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}\big{(}\widetilde{z}_{k}\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\widetilde{z}_{k}\bm{x}_{k}^{\top}\bm{v}]\big{)}}_{:=I_{11}}+\underbrace{\sup_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}\big{(}\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}^{\top}\bm{v}]\big{)}}_{:=I_{12}}.

Now, the benefits of working with I11,I12I_{11},I_{12} are that: i) We can directly invoke Lemma 9 to bound I11I_{11} since we have a good sub-Gaussian norm estimate z~kψ2𝒙k𝜽ψ2𝒙kψ2R0\|\widetilde{z}_{k}\|_{\psi_{2}}\leq\|\bm{x}_{k}^{\top}\bm{\theta}\|_{\psi_{2}}\lesssim\|\bm{x}_{k}\|_{\psi_{2}}R_{0}, see Step 2.1.1 in the proof; ii) I12I_{12} becomes the supremum of a process that is independent of 𝜽\bm{\theta} and only indexed by 𝒗\bm{v}, hence Bernstein’s inequality suffices for bounding I12I_{12} (Step 2.1.2 in the proof), analogously to the proof of the non-uniform guarantee (Theorem 5).

2) Like [94, Prop. 6.1], we invoke a covering argument with similar techniques to bound I0=sup𝜽Σs,R0sup𝒗𝒱k=1nξk𝒙k𝒗I_{0}=\sup_{\bm{\theta}\in\Sigma_{s,R_{0}}}\sup_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}^{\top}\bm{v}, where ξk=𝒬Δ(y~k+τk)y~k\xi_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k})-\widetilde{y}_{k} is the quantization noise. Nevertheless, our Lasso estimator is different from their projected back projection estimator, and it turns out that we need to directly handle sup𝒗𝒱\sup_{\bm{v}\in\mathscr{V}} by Lemma 10, unlike [94, Prop. 6.2] that again used a covering argument for this purpose. See more discussions in Step 2.4 of the proof.

We are in a position to present our uniform recovery guarantee. We follow most assumptions in Theorem 5 but specify the signal space as 𝜽Σs,R0\bm{\theta^{\star}}\in\Sigma_{s,R_{0}} and impose the (2l)(2l)-th moment constraint on ϵk\epsilon_{k}. Following prior works on QCS (e.g., [40, 89]), we consider constrained Lasso that utilizes an 1\ell_{1}-constraint 𝜽1𝜽1\|\bm{\theta}\|_{1}\leq\|\bm{\theta}^{\star}\|_{1} (rather than (3.6)) to pursue uniform recovery.

Theorem 13.

(Uniform Version of Theorem 5). Given some δ>0,Δ>0\delta>0,\Delta>0, in (3.4) we suppose that 𝐱k\bm{x}_{k}s are i.i.d., zero-mean sub-Gaussian with 𝐱kψ2σ\|\bm{x}_{k}\|_{\psi_{2}}\leq\sigma, κ0λmin(𝚺)λmax(𝚺)κ1\kappa_{0}\leq\lambda_{\min}(\bm{\Sigma^{\star}})\leq\lambda_{\max}(\bm{\Sigma^{\star}})\leq\kappa_{1} for some κ1κ0>0\kappa_{1}\geq\kappa_{0}>0 where 𝚺=𝔼(𝐱k𝐱k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}), 𝛉Σs,R0:=Σs{𝛉:𝛉2R0}\bm{\theta^{\star}}\in\Sigma_{s,R_{0}}:=\Sigma_{s}\cap\{\bm{\theta}:\|\bm{\theta}\|_{2}\leq R_{0}\} for some absolute constant R0R_{0}, ϵk\epsilon_{k}s are i.i.d. noise that are independent of 𝐱k\bm{x}_{k}s and satisfy 𝔼|ϵk|2lM\mathbbm{E}|\epsilon_{k}|^{2l}\leq M for some fixed l>1l>1. In quantization, we truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) with threshold ζy(n(M1/l+σ2)δlogd)1/2\zeta_{y}\asymp\big{(}\frac{n(M^{1/l}+\sigma^{2})}{\delta\log d}\big{)}^{1/2}, then quantize y~k\widetilde{y}_{k} to y˙k=𝒬Δ(y~k+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). For recovery, we define the estimator 𝛉^\bm{\widehat{\theta}} as the solution to constrained Lasso

𝜽^=argmin𝜽1𝜽112nk=1n(y˙k𝒙k𝜽)2\bm{\widehat{\theta}}=\mathop{\arg\min}\limits_{\|\bm{\theta}\|_{1}\leq\|\bm{\theta^{\star}}\|_{1}}~{}\frac{1}{2n}\sum_{k=1}^{n}(\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2}

If nδslog𝒲n\gtrsim\delta s\log\mathscr{W} for 𝒲=κ1d2n3Δ2s5δ3\mathscr{W}=\frac{\kappa_{1}d^{2}n^{3}}{\Delta^{2}s^{5}\delta^{3}} and some hidden constant depending on (κ0,σ)(\kappa_{0},\sigma), then with probability at least 1Cd1δ1-Cd^{1-\delta} on a single random draw of (𝐱k,ϵk,τk)k=1n(\bm{x}_{k},\epsilon_{k},\tau_{k})_{k=1}^{n}, it holds uniformly for all 𝛉Σs,R0\bm{\theta^{\star}}\in\Sigma_{s,R_{0}} that the estimation error 𝚼^:=𝛉^𝛉\bm{\widehat{\Upsilon}}:=\bm{\widehat{\theta}}-\bm{\theta^{\star}} satisfies

𝚼^2C3σ(σ+M12l)κ0δslogdn+C3σΔκ0δslog𝒲n,\displaystyle\|\bm{\widehat{\Upsilon}}\|_{2}\leq\frac{C_{3}\sigma(\sigma+M^{\frac{1}{2l}})}{\kappa_{0}}\sqrt{\frac{\delta s\log d}{n}}+\frac{C_{3}\sigma\Delta}{\kappa_{0}}\sqrt{\frac{\delta s\log\mathscr{W}}{n}},
𝚼^1C4σ(σ+M12l)κ0sδlogdn+C4σΔκ0sδlog𝒲n.\displaystyle\|\bm{\widehat{\Upsilon}}\|_{1}\leq\frac{C_{4}\sigma(\sigma+M^{\frac{1}{2l}})}{\kappa_{0}}s\sqrt{\frac{\delta\log d}{n}}+\frac{C_{4}\sigma\Delta}{\kappa_{0}}s\sqrt{\frac{\delta\log\mathscr{W}}{n}}.

Notably, our uniform guarantee is still minimax optimal up to some additional logarithmic factors (i.e., log𝒲\sqrt{\log\mathscr{W}}) arising from the covering argument (Step 2.4 of the proof), whose main aim is to show that one uniform dither 𝝉=[τk]\bm{\tau}=[\tau_{k}] suffices for all signals. Naturally, log𝒲\sqrt{\log\mathscr{W}} thus appears with the quantization level Δ\Delta as its multiplicative factor, meaning that the logarithmic gap between uniform recovery and non-uniform recovery closes as Δ0\Delta\to 0. In particular, Theorem 13 implies a uniform error rate matching the non-uniform one in Theorem 5 (up to some multiplicative factors) when Δ\Delta is small enough or in the unquantized case.

To the best of our knowledge, the only existing uniform guarantee for heavy-tailed QCS is [32, Thm. 1.11], but the following distinctions make it impossible to closely compare their result with our Theorem 13: 1) [32, Thm. 1.11] is for dithered 1-bit quantization, while ours is for the dithered uniform quantizer; 2) We handle heavy-tailedness by truncation, while [32, Thm. 1.11] does not involve this kind of special treatment; 3) [32, Thm. 1.11] considers a highly intractable program with the Hamming distance as objective and 𝜽Σs\bm{\theta}\in\Sigma_{s} as constraint (when specialized to sparse signals), while our Theorem 13 is for the convex Lasso program; 4) Their analysis is based on an in-depth result on random hyperplane tessellations (see also [33, 77]), while our proof follows the more standard strategy (i.e., upgrading each piece of a non-uniform proof to be uniform) and requires certain technical innovations (e.g., the treatment of the truncation step). It is possible to use such a standard strategy to upgrade Theorem 6 to a uniform result, whose error rate may exhibit worse dependence on ss due to the covering argument.

5 Numerical Simulations

In this section we provide two sets of experimental results to support and demonstrate our theoretical developments. The first set of simulations is devoted to validating our major standpoint that near minimax rates are achievable in quantized heavy-tailed settings. The second set of results is then presented to illustrate the crucial role played by the appropriate dither (i.e., triangular dither for the covariate, uniform dither for the response) before uniform quantization. For the importance of data truncation we refer to [35, Sect. 5], which covers the three estimation problems in this work and contrasts the estimations with and without data truncation.

5.1 (Near) Minimax Error Rates

Each data point in our results is set to be the mean value of 5050 or 100100 independent trials.

5.1.1 Quantized Covariance Matrix Estimation

We start with covariance matrix estimation; specifically, we verify the element-wise rate 1:=O(logdn)\mathscr{B}_{1}:=O\big{(}\mathscr{L}\sqrt{\frac{\log d}{n}}\big{)} and the operator norm rate 2:=O(dlogdn)\mathscr{B}_{2}:=O\big{(}\mathscr{L}\sqrt{\frac{d\log d}{n}}\big{)} in Theorems 2-3.

For the estimator in Theorem 2, we draw 𝒙k=(xki)\bm{x}_{k}=(x_{ki}) such that the first two coordinates are independently drawn from 𝗍(4.5)\mathsf{t}(4.5), (xki)i=3,4(x_{ki})_{i=3,4} are from 𝗍(6)\mathsf{t}(6) with covariance 𝔼(xk3xk4)=1.2\mathbbm{E}(x_{k3}x_{k4})=1.2, and the remaining d4d-4 coordinates are i.i.d. following 𝗍(6)\mathsf{t}(6). We test different choices of (d,Δ)(d,\Delta) under n=80:20:220n=80:20:220, and the log-log plots are shown in Figure 1(a). Clearly, for each (d,Δ)(d,\Delta) the experimental points roughly exhibit a straight line that is well aligned with the dashed line representing the n1/2n^{-1/2} rate. As predicted by the factor =M+Δ2\mathscr{L}=\sqrt{M}+\Delta^{2}, the curves with larger Δ\Delta are higher, but note that the error decay rates remain unchanged. In addition, the curves of (d,Δ)=(100,1),(120,1)(d,\Delta)=(100,1),(120,1) are extremely close, which is consistent with the logarithmic dependence of 1\mathscr{B}_{1} on dd.
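For reproducibility, a simplified sketch of one trial behind Figure 1(a) is given below; it uses i.i.d. unit-variance 𝗍(6) coordinates instead of the exact covariance structure described above, and the truncation threshold and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, Delta = 200, 100, 1.0
X = rng.standard_t(df=6, size=(n, d)) / np.sqrt(1.5)   # unit-variance t(6) coordinates
Sigma_star = np.eye(d)

zeta = (n / np.log(d)) ** 0.25                          # truncation threshold ~ (n / log d)^{1/4}
X_trunc = np.sign(X) * np.minimum(np.abs(X), zeta)
tri = rng.uniform(-Delta / 2, Delta / 2, (n, d)) + rng.uniform(-Delta / 2, Delta / 2, (n, d))
X_dot = Delta * np.round((X_trunc + tri) / Delta)       # triangular dither + uniform quantizer

Sigma_hat = X_dot.T @ X_dot / n - (Delta ** 2 / 4) * np.eye(d)
print(np.abs(Sigma_hat - Sigma_star).max())             # element-wise (max norm) error
```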

For the error bound 2\mathscr{B}_{2}, the coordinates of 𝒙k\bm{x}_{k} are independently drawn from a scaled version of 𝗍(4.5)\mathsf{t}(4.5) such that 𝚺=diag(2,2,1,,1)\bm{\Sigma^{\star}}=\mathrm{diag}(2,2,1,...,1), and we test different settings of (d,Δ)(d,\Delta) under n=200:100:1000n=200:100:1000. As shown in Figure 1(b), the operator norm error decreases with nn in the optimal rate n1/2n^{-1/2}, and using a coarser dithered quantizer (i.e., larger Δ\Delta) only slightly lifts the curves. Indeed, the effect seems consistent with \mathscr{L}’s quadratic dependence on Δ\Delta. To validate the relative scaling of nn and dd, in addition to the setting (d,Δ)=(100,1)(d,\Delta)=(100,1) under n=200:100:1000n=200:100:1000, we try (d,Δ)=(150,1)(d,\Delta)=(150,1) under 1.51.5 times the original sample size n=1.5×(200:100:1000)n=1.5\times(200:100:1000) (but in Figure 1(b) we still plot the curve according to the sample size of 200:100:1000200:100:1000 without the multiplicative factor of 1.51.5), and surprisingly the obtained curve coincides with the one for (d,Δ)=(100,1)(d,\Delta)=(100,1). Thus, ignoring the logarithmic factor logd\log d, the operator norm error can be characterized by 2\mathscr{B}_{2} fairly well.

Additionally, we want to compare 1\mathscr{B}_{1} and 2\mathscr{B}_{2} regarding the dependence on dd more clearly. Specifically, we generate the samples 𝒙k\bm{x}_{k}s as in Figure 1(a) and test the fixed sample size n=180n=180 and varying dimension d=80:20:260d=80:20:260. The max norm estimation errors of 𝚺^\bm{\widehat{\Sigma}} in Theorem 2 and the operator norm errors (under d=80:20:180d=80:20:180 to ensure ndn\geq d) of the estimator in Theorem 3 are reported in Figure 1(c). It is clear that the max norm error increases with dd rather slowly, while the operator norm error increases much more significantly under larger dd. This is consistent with the logarithmic dependence of 1\mathscr{B}_{1} on dd and the more essential dependence of 2\mathscr{B}_{2} on dd.

Figure 1: (a): Element-wise error (Theorem 2); (b): operator norm error (Theorem 3); (c): the dependence on dd of both error metrics.

5.1.2 Quantized Compressed Sensing

We now switch to QCS with unquantized covariate and aim to verify the 2\ell_{2}-norm error rate 3=O(slogdn)\mathscr{B}_{3}=O\big{(}\mathscr{L}\sqrt{\frac{s\log d}{n}}\big{)} obtained in Theorems 5-6. We let the support of the ss-sparse 𝜽d\bm{\theta^{\star}}\in\mathbb{R}^{d} be [s][s], and then draw the non-zero entries from a uniform distribution over 𝕊s1\mathbb{S}^{s-1} (hence 𝜽2=1\|\bm{\theta^{\star}}\|_{2}=1). For the setting of Theorem 5 we adopt 𝒙k𝒩(0,𝑰d)\bm{x}_{k}\sim\mathcal{N}(0,\bm{I}_{d}) and ϵk16𝗍(3)\epsilon_{k}\sim\frac{1}{\sqrt{6}}\mathsf{t}(3), while 𝒙kiiid53𝗍(4.5)\bm{x}_{ki}\stackrel{{\scriptstyle iid}}{{\sim}}\frac{\sqrt{5}}{3}\mathsf{t}(4.5) and ϵk13𝗍(4.5)\epsilon_{k}\sim\frac{1}{\sqrt{3}}\mathsf{t}(4.5) for Theorem 6. We simulate different choices of (d,s,Δ)(d,s,\Delta) under n=100:100:1000n=100:100:1000, and the proposed convex program (3.6) is solved with the framework of ADMM (we refer to the review [9]). Experimental results are shown as log-log plots in Figure 2. Consistent with the theoretical bound 3\mathscr{B}_{3}, the errors in both cases decrease in a rate of n1/2n^{-1/2}, whereas the effect of uniform quantization is merely on the multiplicative factor \mathscr{L}. Interestingly, it seems that the gaps between Δ=0,0.5\Delta=0,0.5 and Δ=0.5,1\Delta=0.5,1 are in agreement with the explicit form of \mathscr{L}, i.e., M1/(2l)+Δ\mathscr{L}\asymp M^{1/(2l)}+\Delta for Theorem 5, and M+Δ2\mathscr{L}\asymp\sqrt{M}+\Delta^{2} for Theorem 6. In addition, note that the curves of (d,s)=(150,5),(180,5)(d,s)=(150,5),(180,5) are close, whereas increasing s=8s=8 suffers from significantly larger error. This is consistent with the scaling law of (n,d,s)(n,d,s) in 3\mathscr{B}_{3}.


Figure 2: (a): QCS in Theorem 5; (b): QCS in Theorem 6.

Then, we simulate the complete quantization setting where both covariate and response are quantized (Theorems 9-10). The simulation details are the same as before, except that 𝒙k\bm{x}_{k} is also quantized, with the same quantization level as yky_{k}. We supply the sharpest 1\ell_{1}-norm constraint for recovery, i.e., 𝒮:={𝜽:𝜽1𝜽1}\mathcal{S}:=\{\bm{\theta}:\|\bm{\theta}\|_{1}\leq\|\bm{\theta^{\star}}\|_{1}\}, and composite gradient descent [63, 64] is invoked to handle the non-convex estimation program. The log-log plots are shown in Figure 3. These results carry implications similar to Figure 2 in terms of the n1/2n^{-1/2} rate, the effect of quantization, and the relative scaling of (n,d,s)(n,d,s).
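To give a flavor of how the non-convex program can be handled in practice, here is a minimal Python sketch of a projected-gradient variant over the 1\ell_{1}-constraint 𝒮\mathcal{S}; it is a simplification of the composite gradient descent of [63, 64], and the step size, iteration count, and the projection routine are our illustrative choices, not the exact algorithm used in our experiments. With covariate quantization, Q below would be the quantized covariance surrogate, which may fail to be positive semi-definite; the theory states that all local minimizers of such programs enjoy near optimal error.

import numpy as np

def project_l1_ball(v, radius):
    # Euclidean projection onto {x : ||x||_1 <= radius}
    if np.sum(np.abs(v)) <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u * np.arange(1, v.size + 1) > (css - radius))[0][-1]
    lam = (css[idx] - radius) / (idx + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def projected_gd(Q, b, radius, step=0.05, iters=1000):
    # minimize f(theta) = 0.5 * theta' Q theta - b' theta over the l1-ball;
    # with a non-PSD Q the objective is non-convex, yet the iterates stay feasible
    theta = np.zeros(Q.shape[0])
    for _ in range(iters):
        theta = project_l1_ball(theta - step * (Q @ theta - b), radius)
    return theta

# usage with the surrogate pair (Q, b) built from quantized data and the oracle radius
# theta_hat = projected_gd(Q, b, np.sum(np.abs(theta_star)))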


Figure 3: (a): QCS in Theorem 9; (b): QCS in Theorem 10.

5.1.3 Quantized Matrix Completion

Finally, we simulate QMC and demonstrate the error bound 4=O(rdlogdn)\mathscr{B}_{4}=O\big{(}\mathscr{L}\sqrt{\frac{rd\log d}{n}}\big{)} for 𝚼^F/d\|\bm{\widehat{\Upsilon}}\|_{F}/d in Theorems 7-8. We generate the rank-rr 𝚯d×d\bm{\Theta^{\star}}\in\mathbb{R}^{d\times d} as follows: we first generate 𝚯0d×r\bm{\Theta}_{0}\in\mathbb{R}^{d\times r} with i.i.d. standard Gaussian entries to obtain the rank-rr 𝚯1:=𝚯0𝚯0\bm{\Theta}_{1}:=\bm{\Theta}_{0}\bm{\Theta}_{0}^{\top}, then we rescale it to 𝚯:=k1𝚯1\bm{\Theta^{\star}}:=k_{1}\bm{\Theta}_{1} such that 𝚯F=d\|\bm{\Theta^{\star}}\|_{F}=d. We use ϵk𝒩(0,14)\epsilon_{k}\sim\mathcal{N}(0,\frac{1}{4}) to simulate the sub-exponential noise in Theorem 7, while ϵk16𝗍(3)\epsilon_{k}\sim\frac{1}{\sqrt{6}}\mathsf{t}(3) for Theorem 8. The convex program (3.8) is fed with α=𝚯\alpha=\|\bm{\Theta^{\star}}\|_{\infty} and solved by ADMM. We test different choices of (d,r,Δ)(d,r,\Delta) under n=2000:1000:8000n=2000:1000:8000, with the log-log error plots displayed in Figure 4. Firstly, the experimental curves align well with the dashed line that represents the optimal n1/2n^{-1/2} rate. Secondly, comparing the results for Δ=0,0.5,1\Delta=0,0.5,1, we conclude that quantization only affects the multiplicative factor \mathscr{L} in the estimation error. It should also be noted that increasing either dd or rr leads to significantly larger error, which is consistent with the polynomial dependence of 4\mathscr{B}_{4} on dd and rr.
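For completeness, a minimal Python sketch of the QMC data generation and response quantization is given below; entries observed uniformly at random (with replacement) and a uniform dither on the responses are our assumptions for illustration, and the constants are not those of the reported experiments. The quantized responses and observed indices, together with α\alpha, are then passed to program (3.8).

import numpy as np

rng = np.random.default_rng(1)
d, r, Delta, n = 30, 5, 0.5, 5000
Theta0 = rng.standard_normal((d, r))
Theta1 = Theta0 @ Theta0.T                                   # rank-r, positive semi-definite
Theta_star = Theta1 * (d / np.linalg.norm(Theta1, 'fro'))    # rescale so ||Theta*||_F = d
rows = rng.integers(0, d, size=n)                            # uniformly sampled entries (assumption)
cols = rng.integers(0, d, size=n)
eps = rng.normal(0.0, 0.5, size=n)                           # N(0, 1/4) noise as in Theorem 7
y = Theta_star[rows, cols] + eps
tau = rng.uniform(-Delta / 2, Delta / 2, size=n)             # uniform dither on the responses
y_dot = Delta * np.round((y + tau) / Delta)                  # dithered uniform quantization
# (y_dot, rows, cols) with alpha = max|Theta*_{ij}| are then fed into program (3.8)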


Figure 4: (a): QMC in Theorem 7; (b): QMC in Theorem 8.

5.2 Importance of Appropriate Dithering

To demonstrate the crucial role played by a suitable dither, we provide a second set of simulations. To observe the phenomena more clearly and draw firm conclusions, we consider rather simple estimation problems with large sample sizes under coarse quantization (i.e., large Δ\Delta).

Specifically, for covariance matrix estimation we set d=1d=1 and draw X1,,XnX_{1},...,X_{n} i.i.d. from 𝒩(0,1)\mathcal{N}(0,1). Thus, the problem boils down to estimating 𝔼|Xk|2\mathbbm{E}|X_{k}|^{2}, for which the estimators in Theorems 2-3 coincide. Since XkX_{k} is sub-Gaussian, we do not perform data truncation before the dithered quantization. Besides our estimator 𝚺^=1nk=1nX˙k2Δ24\bm{\widehat{\Sigma}}=\frac{1}{n}\sum_{k=1}^{n}\dot{X}_{k}^{2}-\frac{\Delta^{2}}{4}, where X˙k=𝒬Δ(Xk+τk)\dot{X}_{k}=\mathcal{Q}_{\Delta}(X_{k}+\tau_{k}) and τk\tau_{k} is a triangular dither, we include the following competitors:

  • 𝚺^no=1nk=1n(X˙k)2\bm{\widehat{\Sigma}}_{no}=\frac{1}{n}\sum_{k=1}^{n}(\dot{X}^{\prime}_{k})^{2}, where X˙k=𝒬Δ(Xk)\dot{X}^{\prime}_{k}=\mathcal{Q}_{\Delta}(X_{k}) is the direct quantization without dithering;

  • 𝚺^uΔ26\bm{\widehat{\Sigma}}_{u}-\frac{\Delta^{2}}{6} and 𝚺^u\bm{\widehat{\Sigma}}_{u}, where 𝚺^u=1nk=1n(X˙k′′)2\bm{\widehat{\Sigma}}_{u}=\frac{1}{n}\sum_{k=1}^{n}(\dot{X}^{\prime\prime}_{k})^{2}, and X˙k′′=𝒬Δ(Xk+τk′′)\dot{X}^{\prime\prime}_{k}=\mathcal{Q}_{\Delta}(X_{k}+\tau^{\prime\prime}_{k}) is quantized under uniform dither τk′′𝒰([Δ2,Δ2])\tau^{\prime\prime}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]).

To illustrate the choice of 𝚺^uΔ26\bm{\widehat{\Sigma}}_{u}-\frac{\Delta^{2}}{6} and 𝚺^u\bm{\widehat{\Sigma}}_{u}, we write X˙k′′=Xk+τk′′+wk=Xk+ξk\dot{X}^{\prime\prime}_{k}=X_{k}+\tau^{\prime\prime}_{k}+w_{k}=X_{k}+\xi_{k} with quantization error wk𝒰([Δ2,Δ2])w_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) (due to Theorem 1(a)) and quantization noise ξk=τk′′+wk\xi_{k}=\tau^{\prime\prime}_{k}+w_{k}; then (3.1) gives 𝔼(X˙k′′)2=𝔼|Xk|2+𝔼|ξk|2\mathbbm{E}(\dot{X}^{\prime\prime}_{k})^{2}=\mathbbm{E}|X_{k}|^{2}+\mathbbm{E}|\xi_{k}|^{2}, while 𝔼|ξk|2\mathbbm{E}|\xi_{k}|^{2} remains unknown. Thus, we consider 𝚺^uΔ26\bm{\widehat{\Sigma}}_{u}-\frac{\Delta^{2}}{6} based on the heuristic (and unjustified) guess 𝔼|ξk|2𝔼|τk′′|2+𝔼|wk|2=Δ26\mathbbm{E}|\xi_{k}|^{2}\approx\mathbbm{E}|\tau^{\prime\prime}_{k}|^{2}+\mathbbm{E}|w_{k}|^{2}=\frac{\Delta^{2}}{6}, while 𝚺^u\bm{\widehat{\Sigma}}_{u} simply forgoes the correction of 𝔼|ξk|2\mathbbm{E}|\xi_{k}|^{2}. We test Δ=3\Delta=3 under n=(2:2:20)103n=(2:2:20)\cdot 10^{3}. As shown in Figure 5(a), the proposed estimator based on quantized data under triangular dither achieves the lowest estimation errors and the optimal rate of n1/2n^{-1/2}, whereas the other competitors are not consistent, i.e., they all plateau at some error floor as the sample size grows.
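The comparison just described can be reproduced in a few lines. The sketch below computes the four estimators of 𝔼|Xk|2\mathbbm{E}|X_{k}|^{2} from the same Gaussian samples; the round-to-nearest quantizer and the sample size are illustrative choices on our part.

import numpy as np

rng = np.random.default_rng(2)
n, Delta = 20000, 3.0
X = rng.standard_normal(n)
quantize = lambda z: Delta * np.round(z / Delta)             # round-to-nearest quantizer (illustrative)
tri = rng.uniform(-Delta / 2, Delta / 2, n) + rng.uniform(-Delta / 2, Delta / 2, n)  # triangular dither
uni = rng.uniform(-Delta / 2, Delta / 2, n)                  # uniform dither
est_triangular = np.mean(quantize(X + tri) ** 2) - Delta ** 2 / 4   # proposed estimator
est_no_dither  = np.mean(quantize(X) ** 2)                          # direct quantization, no dither
est_uniform    = np.mean(quantize(X + uni) ** 2)                    # uniform dither, no correction
est_uniform_c  = est_uniform - Delta ** 2 / 6                       # uniform dither, heuristic correction
# only est_triangular is consistent for E|X_k|^2 = 1; the others plateau at error floors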

For the two remaining signal recovery problems, we simply focus on the quantization of the response yky_{k}. In particular, we simulate QCS in the setting of Theorem 5, with (d,s,Δ)=(50,3,2)(d,s,\Delta)=(50,3,2) under n:=(2:2:20)103n:=(2:2:20)\cdot 10^{3}. Other experimental details are as previously stated. We compare our estimator 𝜽^\bm{\widehat{\theta}} with its counterpart 𝜽^\bm{\widehat{\theta}}^{\prime} defined by (3.6) with the same 𝑸,𝒮\bm{Q},\mathcal{S} but 𝒃=1nk=1ny˙k𝒙k\bm{b}^{\prime}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}^{\prime}_{k}\bm{x}_{k}, where y˙k=𝒬Δ(y~k)\dot{y}_{k}^{\prime}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}) is a direct uniform quantization without dither. The simulation results in Figure 5(b) confirm that applying a uniform dither significantly reduces the recovery errors. Although the results without dithering still exhibit the n1/2n^{-1/2} decreasing rate under Gaussian covariates, an identifiability issue unavoidably arises under Bernoulli covariates; in that case, the curve without dithering evidently deviates from the n1/2n^{-1/2} rate, see [87, Figure 1] for instance.

Analogously, we simulate QMC (Theorem 7) with data generated as in the previous experiments; specifically, we try (d,r,Δ)=(30,5,1.5)(d,r,\Delta)=(30,5,1.5) under n=(5:5:25)103n=(5:5:25)\cdot 10^{3}. While our estimator 𝚯^\bm{\widehat{\Theta}} is defined in (3.8) with y˙k\dot{y}_{k} produced by a dithered quantizer, we also simulate its counterpart without dithering, i.e., 𝚯^\bm{\widehat{\Theta}}^{\prime} defined in (3.8) with y˙k\dot{y}_{k} replaced by y˙k=𝒬Δ(yk)\dot{y}_{k}^{\prime}=\mathcal{Q}_{\Delta}(y_{k}). From the experimental results displayed in Figure 5(c), one can clearly see that 𝚯^\bm{\widehat{\Theta}} performs much better in terms of both the n1/2n^{-1/2} decreasing rate and the estimation error, while the curve without dithering does not even decrease.


Figure 5: (a): covariance matrix estimation; (b): QCS in Theorem 5; (c): QMC in Theorem 7.

6 Concluding Remarks

In digital signal processing and many distributed machine learning systems, data quantization is an indispensable process. On the other hand, many modern datasets exhibit heavy-tailedness, and the past decade has witnessed increasing interest in statistical estimation methods robust to heavy-tailed data. In this work we bridge these two developments by studying the quantization of heavy-tailed data. We propose to truncate the heavy-tailed data before feeding them to a uniform quantizer equipped with a random dither well suited to the problem at hand. Applying our quantization scheme to covariance matrix estimation, compressed sensing, and matrix completion, we proposed (near) optimal estimators based on quantized data, all of which are computationally feasible. These results suggest a unified conclusion that dithered quantization does not affect the key scaling law in the error rate but only slightly worsens the multiplicative factor, a conclusion corroborated by our numerical simulations. Further, we presented additional developments for quantized compressed sensing in two respects. Firstly, we studied a novel setting that involves covariate quantization. Because our quantized covariance matrix estimator is not positive semi-definite, the proposed recovery program is non-convex, but we proved that all local minimizers enjoy a near minimax error rate. At a higher level, this development extends a line of works on non-convex M-estimators [63, 64, 62] to accommodate heavy-tailed covariates; see the deterministic framework in Proposition 1. As applications, we derived results for (dithered) 1-bit compressed sensing as byproducts. Secondly, we established a near minimax uniform recovery guarantee for QCS under heavy-tailed noise, which states that all sparse signals within an 2\ell_{2}-ball can be uniformly recovered up to a near optimal 2\ell_{2}-norm error using a single realization of the measurement ensemble. We believe the developments presented in this work will prove useful in many other estimation problems; for instance, the triangular dither and the quantization scheme apply to multi-task learning, as shown by the subsequent works [22, 60].

References

  • [1] Albert Ai, Alex Lapanowski, Yaniv Plan, and Roman Vershynin. One-bit compressed sensing with non-gaussian measurements. Linear Algebra and its Applications, 441:222–239, 2014.
  • [2] James Bennett, Stan Lanning, et al. The netflix prize. In Proceedings of KDD cup and workshop, volume 2007, page 35. New York, 2007.
  • [3] Sonia A Bhaskar. Probabilistic low-rank matrix completion from quantized measurements. The Journal of Machine Learning Research, 17(1):2131–2164, 2016.
  • [4] Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of statistics, 36(6):2577–2604, 2008.
  • [5] Atanu Biswas, Sujay Datta, Jason P Fine, and Mark R Segal. Statistical advances in the biomedical sciences: clinical trials, epidemiology, survival analysis, and bioinformatics. John Wiley & Sons, 2007.
  • [6] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
  • [7] Petros T Boufounos. Sparse signal reconstruction from phase-only measurements. In Proc. Int. Conf. Sampling Theory and Applications (SampTA), 2013.
  • [8] Petros T Boufounos and Richard G Baraniuk. 1-bit compressive sensing. In 2008 42nd Annual Conference on Information Sciences and Systems, pages 16–21. IEEE, 2008.
  • [9] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
  • [10] Christian Brownlees, Emilien Joly, and Gábor Lugosi. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.
  • [11] T Tony Cai, Cun-Hui Zhang, and Harrison H Zhou. Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38(4):2118–2144, 2010.
  • [12] T Tony Cai and Harrison H Zhou. Optimal rates of convergence for sparse covariance matrix estimation. The Annals of Statistics, pages 2389–2420, 2012.
  • [13] Tony Cai and Weidong Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672–684, 2011.
  • [14] Tony Cai and Wen-Xin Zhou. A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res., 14(1):3619–3647, 2013.
  • [15] Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.
  • [16] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
  • [17] Yang Cao and Yao Xie. Categorical matrix completion. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 369–372. IEEE, 2015.
  • [18] Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’IHP Probabilités et statistiques, volume 48, pages 1148–1185, 2012.
  • [19] Junren Chen and Michael K Ng. Color image inpainting via robust pure quaternion matrix completion: Error bound and weighted loss. SIAM Journal on Imaging Sciences, 15(3):1469–1498, 2022.
  • [20] Junren Chen and Michael K Ng. Uniform exact reconstruction of sparse signals and low-rank matrices from phase-only measurements. IEEE Transactions on Information Theory, 2023.
  • [21] Junren Chen, Cheng-Long Wang, Michael K Ng, and Di Wang. High dimensional statistical estimation under one-bit quantization. IEEE Transactions on Information Theory, 2023.
  • [22] Junren Chen, Yueqi Wang, and Michael K Ng. Quantized low-rank multivariate regression with random dithering. arXiv preprint arXiv:2302.11197, 2023.
  • [23] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems of equations is nearly as easy as solving linear systems. Advances in Neural Information Processing Systems, 28, 2015.
  • [24] Onkar Dabeer and Aditya Karnik. Signal parameter estimation using 1-bit dithered quantization. IEEE Transactions on Information Theory, 52(12):5389–5405, 2006.
  • [25] Alireza Danaee, Rodrigo C de Lamare, and Vitor Heloiz Nascimento. Distributed quantization-aware rls learning with bias compensation and coarsely quantized signals. IEEE Transactions on Signal Processing, 70:3441–3455, 2022.
  • [26] Mark A Davenport, Yaniv Plan, Ewout Van Den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189–223, 2014.
  • [27] Mark A Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622, 2016.
  • [28] Luc Devroye, Matthieu Lerasle, Gabor Lugosi, and Roberto I Oliveira. Sub-gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725, 2016.
  • [29] Sjoerd Dirksen. Quantized compressed sensing: a survey. In Compressed Sensing and Its Applications, pages 67–95. Springer, 2019.
  • [30] Sjoerd Dirksen, Hans Christian Jung, and Holger Rauhut. One-bit compressed sensing with partial gaussian circulant matrices. Information and Inference: A Journal of the IMA, 9(3):601–626, 2020.
  • [31] Sjoerd Dirksen, Johannes Maly, and Holger Rauhut. Covariance estimation under one-bit quantization. The Annals of Statistics, 50(6):3538–3562, 2022.
  • [32] Sjoerd Dirksen and Shahar Mendelson. Non-gaussian hyperplane tessellations and robust one-bit compressed sensing. Journal of the European Mathematical Society, 23(9):2913–2947, 2021.
  • [33] Sjoerd Dirksen, Shahar Mendelson, and Alexander Stollenwerk. Sharp estimates on random hyperplane tessellations. SIAM Journal on Mathematics of Data Science, 4(4):1396–1419, 2022.
  • [34] Jianqing Fan, Quefeng Li, and Yuyan Wang. Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):247–265, 2017.
  • [35] Jianqing Fan, Weichen Wang, and Ziwei Zhu. A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. Annals of statistics, 49(3):1239, 2021.
  • [36] Thomas Feuillen, Mike E Davies, Luc Vandendorpe, and Laurent Jacques. (\ell_{1},\ell_{2})-RIP and projected back-projection reconstruction for phase-only measurements. IEEE Signal Processing Letters, 27:396–400, 2020.
  • [37] Simon Foucart, Deanna Needell, Reese Pathak, Yaniv Plan, and Mary Wootters. Weighted matrix completion from non-random, non-uniform sampling patterns. IEEE Transactions on Information Theory, 67(2):1264–1290, 2020.
  • [38] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Springer New York, New York, NY, 2013.
  • [39] Martin Genzel and Christian Kipp. Generic error bounds for the generalized lasso with sub-exponential data. Sampling Theory, Signal Processing, and Data Analysis, 20(2):15, 2022.
  • [40] Martin Genzel and Alexander Stollenwerk. A unified approach to uniform signal recovery from nonlinear observations. Foundations of Computational Mathematics, pages 1–74, 2022.
  • [41] Robert M Gray and Thomas G Stockham. Dithered quantizers. IEEE Transactions on Information Theory, 39(3):805–812, 1993.
  • [42] David Gross, Yi-Kai Liu, Steven T Flammia, Stephen Becker, and Jens Eisert. Quantum state tomography via compressed sensing. Physical review letters, 105(15):150401, 2010.
  • [43] Osama A Hanna, Yahya H Ezzeldin, Christina Fragouli, and Suhas Diggavi. Quantization of distributed data for learning. IEEE Journal on Selected Areas in Information Theory, 2(3):987–1001, 2021.
  • [44] Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.
  • [45] Marat Ibragimov, Rustam Ibragimov, and Johan Walden. Heavy-tailed distributions and robustness in economics and finance, volume 214. Springer, 2015.
  • [46] Laurent Jacques and Valerio Cambareri. Time for dithering: fast and quantized random embeddings via the restricted isometry property. Information and Inference: A Journal of the IMA, 6(4):441–476, 2017.
  • [47] Laurent Jacques and Thomas Feuillen. The importance of phase in complex compressive sensing. IEEE Transactions on Information Theory, 67(6):4150–4161, 2021.
  • [48] Laurent Jacques, Jason N Laska, Petros T Boufounos, and Richard G Baraniuk. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE transactions on information theory, 59(4):2082–2102, 2013.
  • [49] Halyun Jeong, Xiaowei Li, Yaniv Plan, and Ozgur Yilmaz. Sub-gaussian matrices on sets: Optimal tail dependence and applications. Communications on Pure and Applied Mathematics, 75(8):1713–1754, 2022.
  • [50] Hans Christian Jung, Johannes Maly, Lars Palzer, and Alexander Stollenwerk. Quantized compressed sensing by rectified linear units. IEEE transactions on information theory, 67(6):4125–4149, 2021.
  • [51] Olga Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1):282–303, 2014.
  • [52] Olga Klopp, Jean Lafond, Éric Moulines, and Joseph Salmon. Adaptive multinomial matrix completion. Electronic Journal of Statistics, 9(2):2950–2975, 2015.
  • [53] Olga Klopp, Karim Lounici, and Alexandre B Tsybakov. Robust matrix completion. Probability Theory and Related Fields, 169(1):523–564, 2017.
  • [54] Karin Knudson, Rayan Saab, and Rachel Ward. One-bit compressive sensing with norm estimation. IEEE Transactions on Information Theory, 62(5):2748–2758, 2016.
  • [55] Vladimir Koltchinskii, Karim Lounici, and Alexandre B Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329, 2011.
  • [56] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
  • [57] Piotr Kruczek, Radosław Zimroz, and Agnieszka Wyłomańska. How to detect the cyclostationarity in heavy-tailed distributed signals. Signal Processing, 172:107514, 2020.
  • [58] Arun Kumar Kuchibhotla and Abhishek Chakrabortty. Moving beyond sub-gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. Information and Inference: A Journal of the IMA, 11(4):1389–1456, 2022.
  • [59] Jean Lafond, Olga Klopp, Eric Moulines, and Joseph Salmon. Probabilistic low-rank matrix completion on finite alphabets. Advances in Neural Information Processing Systems, 27, 2014.
  • [60] Kangqiang Li and Yuxuan Wang. Two results on low-rank heavy-tailed multiresponse regressions. arXiv preprint arXiv:2305.13897, 2023.
  • [61] Christopher Liaw, Abbas Mehrabian, Yaniv Plan, and Roman Vershynin. A simple tool for bounding the deviation of random matrices on geometric sets. In Geometric aspects of functional analysis, pages 277–299. Springer, 2017.
  • [62] Po-Ling Loh. Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45(2):866–896, 2017.
  • [63] Po-Ling Loh and Martin J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of statistics, 40(3):1637–1664, 2012.
  • [64] Po-Ling Loh and Martin J. Wainwright. Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16(19):559–616, 2015.
  • [65] Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019.
  • [66] Gabor Lugosi and Shahar Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean. The Annals of Statistics, 49(1):393–410, 2021.
  • [67] Shahar Mendelson. Upper bounds on product and multiplier empirical processes. Stochastic Processes and their Applications, 126(12):3652–3680, 2016.
  • [68] Stanislav Minsker. Geometric median and robust estimation in banach spaces. Bernoulli, 21(4):2308–2335, 2015.
  • [69] Jianhua Mo and Robert W Heath. Limited feedback in single and multi-user mimo systems with finite-bit adcs. IEEE Transactions on Wireless Communications, 17(5):3284–3297, 2018.
  • [70] Sahand Negahban and Martin J Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.
  • [71] Sahand Negahban and Martin J Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13(1):1665–1697, 2012.
  • [72] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical science, 27(4):538–557, 2012.
  • [73] Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
  • [74] Luong Trung Nguyen, Junhan Kim, and Byonghyo Shim. Low-rank matrix completion: A contemporary survey. IEEE Access, 7:94215–94237, 2019.
  • [75] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2012.
  • [76] Yaniv Plan and Roman Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013.
  • [77] Yaniv Plan and Roman Vershynin. Dimension reduction by random hyperplane tessellations. Discrete & Computational Geometry, 51(2):438–461, 2014.
  • [78] Yaniv Plan and Roman Vershynin. The generalized lasso with non-linear observations. IEEE Transactions on information theory, 62(3):1528–1537, 2016.
  • [79] Yaniv Plan, Roman Vershynin, and Elena Yudovina. High-dimensional estimation with geometric constraints. Information and Inference: A Journal of the IMA, 6(1):1–40, 2017.
  • [80] Shuang Qiu, Xiaohan Wei, and Zhuoran Yang. Robust one-bit recovery via relu generative networks: Near-optimal statistical rate and global landscape analysis. In International Conference on Machine Learning, pages 7857–7866. PMLR, 2020.
  • [81] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over \ell_{q}-balls. IEEE transactions on information theory, 57(10):6976–6994, 2011.
  • [82] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(12), 2011.
  • [83] Lawrence Roberts. Picture coding using pseudo-random noise. IRE Transactions on Information Theory, 8(2):145–154, 1962.
  • [84] Sima Sahu, Harsh Vikram Singh, Basant Kumar, and Amit Kumar Singh. De-noising of ultrasound image using bayesian approached heavy-tailed cauchy distribution. Multimedia Tools and Applications, 78(4):4089–4106, 2019.
  • [85] Vidyashankar Sivakumar, Arindam Banerjee, and Pradeep K Ravikumar. Beyond sub-gaussian measurements: High-dimensional structured estimation with sub-exponential designs. Advances in neural information processing systems, 28, 2015.
  • [86] Alexander Stollenwerk. One-bit compressed sensing and fast binary embeddings. PhD thesis, Dissertation, RWTH Aachen University, 2019, 2019.
  • [87] Zhongxing Sun, Wei Cui, and Yulong Liu. Quantized corrupted sensing with random dithering. IEEE Transactions on Signal Processing, 70:600–615, 2022.
  • [88] Ananthram Swami and Brian M Sadler. On some detection and estimation problems in heavy-tailed noise. Signal Processing, 82(12):1829–1846, 2002.
  • [89] Christos Thrampoulidis and Ankit Singh Rawat. The generalized lasso for sub-gaussian measurements with dithered quantization. IEEE Transactions on Information Theory, 66(4):2487–2500, 2020.
  • [90] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [91] Joel A Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.
  • [92] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • [93] Robert F Woolson and William R Clarke. Statistical methods for the analysis of biomedical data. John Wiley & Sons, 2011.
  • [94] Chunlei Xu and Laurent Jacques. Quantized compressive sensing with rip matrices: The benefit of dithering. Information and Inference: A Journal of the IMA, 9(3):543–586, 2020.
  • [95] Tianyu Yang, Johannes Maly, Sjoerd Dirksen, and Giuseppe Caire. Plug-in channel estimation with dithered quantized signals in spatially non-stationary massive mimo systems. arXiv preprint arXiv:2301.04641, 2023.
  • [96] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, pages 4035–4043. PMLR, 2017.
  • [97] Ziwei Zhu and Wenjing Zhou. Taming heavy-tailed features by shrinkage. In International Conference on Artificial Intelligence and Statistics, pages 3268–3276. PMLR, 2021.

Appendix A Proofs in Section 3

A.1 Quantized Covariance Matrix Estimation

We first present Bernstein’s inequality, which recurs throughout our proofs. In applications we will choose the more convenient of the two forms (A.1) and (A.2).

Lemma 1.

(Bernstein’s inequality, [6, Thm. 2.10, Coro. 2.11]). Let X1,,XnX_{1},...,X_{n} be independent random variables, and assume that there exist positive numbers vv and cc such that i=1n𝔼[Xi2]v\sum_{i=1}^{n}\mathbbm{E}[X_{i}^{2}]\leq v and

i=1n𝔼|Xi|qq!2vcq2 for all integers q3,\sum_{i=1}^{n}\mathbbm{E}|X_{i}|^{q}\leq\frac{q!}{2}vc^{q-2}\text{ for all integers }q\geq 3,

then for any t>0t>0 we have

(|i=1n(Xi𝔼Xi)|2vt+ct)2exp(t)\displaystyle\mathbbm{P}\left(\Big{|}\sum_{i=1}^{n}(X_{i}-\mathbbm{E}X_{i})\Big{|}\geq\sqrt{2vt}+ct\right)\leq 2\exp\big{(}-t\big{)} (A.1)
(|i=1n(Xi𝔼Xi)|t)2exp(t22(v+ct))\displaystyle\mathbbm{P}\left(\Big{|}\sum_{i=1}^{n}(X_{i}-\mathbbm{E}X_{i})\Big{|}\geq t\right)\leq 2\exp\left(-\frac{t^{2}}{2(v+ct)}\right) (A.2)

We will also use the Matrix Bernstein’s inequality.

Lemma 2.

(Matrix Bernstein, [91, Thm. 6.1.1]). Let 𝐒1,,𝐒n\bm{S}_{1},...,\bm{S}_{n} be independent zero-mean random matrices with common dimension d1×d2d_{1}\times d_{2}. We assume that 𝐒kopL\|\bm{S}_{k}\|_{op}\leq L for k[n]k\in[n] and introduce the matrix variance statistic

ν=max{k=1n𝔼(𝑺k𝑺k)op,k=1n𝔼(𝑺k𝑺k)op}.\nu=\max\left\{\Big{\|}\sum_{k=1}^{n}\mathbbm{E}(\bm{S}_{k}\bm{S}_{k}^{\top})\Big{\|}_{op},\Big{\|}\sum_{k=1}^{n}\mathbbm{E}(\bm{S}_{k}^{\top}\bm{S}_{k})\Big{\|}_{op}\right\}.

Then for any t0t\geq 0, we have

(k=1n𝑺kopt)(d1+d2)exp(12t2ν+Lt3).\mathbbm{P}\left(\Big{\|}\sum_{k=1}^{n}\bm{S}_{k}\Big{\|}_{op}\geq t\right)\leq(d_{1}+d_{2})\exp\left(\frac{-\frac{1}{2}t^{2}}{\nu+\frac{Lt}{3}}\right).

A.1.1 Proof of Theorem 2

Proof.

Recall that 𝝃k=𝒙˙k𝒙~k\bm{\xi}_{k}=\bm{\dot{x}}_{k}-\bm{\widetilde{x}}_{k} is the quantization noise, which is uncorrelated with 𝒙~k\bm{\widetilde{x}}_{k} and satisfies 𝔼(𝝃k𝝃k)=Δ24𝑰d\mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top})=\frac{\Delta^{2}}{4}\bm{I}_{d}; hence, as in (3.1), subtracting this known correction term in the definition of 𝚺^\bm{\widehat{\Sigma}} implies 𝔼𝚺^=𝔼(𝒙~k𝒙~k)\mathbbm{E}\bm{\widehat{\Sigma}}=\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}). Thus, by using the triangle inequality we obtain

𝚺^𝚺𝚺^𝔼𝚺^+𝔼(𝒙~k𝒙~k𝒙k𝒙k):=I1+I2.\|\bm{\widehat{\Sigma}}-\bm{\Sigma^{\star}}\|_{\infty}\leq\|\bm{\widehat{\Sigma}}-\mathbbm{E}\bm{\widehat{\Sigma}}\|_{\infty}+\|\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\|_{\infty}:=I_{1}+I_{2}.

Step 1. Bounding I1I_{1}.

Note that 𝚺^𝔼𝚺^=1nk=1n𝒙˙k𝒙˙k𝔼(𝒙˙k𝒙˙k)\|\bm{\widehat{\Sigma}}-\mathbbm{E}\bm{\widehat{\Sigma}}\|_{\infty}=\|\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top})\|_{\infty}, so for any (i,j)[d]×[d](i,j)\in[d]\times[d] we aim to bound the (i,j)(i,j)-th entry error

|σ^ij𝔼σ^ij|=|1nk=1n(x˙kix˙kj𝔼[x˙kix˙kj])|.|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|=\left|\frac{1}{n}\sum_{k=1}^{n}\big{(}\dot{x}_{ki}\dot{x}_{kj}-\mathbbm{E}[\dot{x}_{ki}\dot{x}_{kj}]\big{)}\right|.

Observe that the quantization noise is bounded as follows

𝝃k𝒬Δ(𝒙~k+𝝉k)(𝒙~k+𝝉k)+𝝉k3Δ2,\|\bm{\xi}_{k}\|_{\infty}\leq\|\mathcal{Q}_{\Delta}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k})-(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k})\|_{\infty}+\|\bm{\tau}_{k}\|_{\infty}\leq\frac{3\Delta}{2},

which implies 𝔼|ξki|4(3Δ2)4\mathbbm{E}|\xi_{ki}|^{4}\leq(\frac{3\Delta}{2})^{4} and 𝒙˙k𝒙~k+𝝃kζ+3Δ2\|\bm{\dot{x}}_{k}\|_{\infty}\leq\|\bm{\widetilde{x}}_{k}\|_{\infty}+\|\bm{\xi}_{k}\|_{\infty}\leq\zeta+\frac{3\Delta}{2}. By the moment constraint on xkix_{ki} we have 𝔼|x~ki|4𝔼|xki|4M\mathbbm{E}|\widetilde{x}_{ki}|^{4}\leq\mathbbm{E}|x_{ki}|^{4}\leq M. Thus, for any integer q2q\geq 2 we have the following bound

k=1n𝔼|x˙kix˙kjn|q\displaystyle\sum_{k=1}^{n}\mathbbm{E}\Big{|}\frac{\dot{x}_{ki}\dot{x}_{kj}}{n}\Big{|}^{q} (ζ+32Δ)2(q2)nqk=1n𝔼(x˙kix˙kj)2\displaystyle\leq\frac{(\zeta+\frac{3}{2}\Delta)^{2(q-2)}}{n^{q}}\sum_{k=1}^{n}\mathbbm{E}(\dot{x}_{ki}\dot{x}_{kj})^{2} (A.3)
(ζ+32Δ)2(q2)2nqk=1n(𝔼|x˙ki|4+𝔼|x˙kj|4)\displaystyle\leq\frac{(\zeta+\frac{3}{2}\Delta)^{2(q-2)}}{2n^{q}}\sum_{k=1}^{n}\big{(}\mathbbm{E}|\dot{x}_{ki}|^{4}+\mathbbm{E}|\dot{x}_{kj}|^{4}\big{)}
4(ζ+32Δ)2(q2)nqk=1n(𝔼|x~ki|4+𝔼|x~kj|4+𝔼|ξki|4+𝔼|ξkj|4)\displaystyle\leq\frac{4(\zeta+\frac{3}{2}\Delta)^{2(q-2)}}{n^{q}}\sum_{k=1}^{n}\big{(}\mathbbm{E}|\widetilde{x}_{ki}|^{4}+\mathbbm{E}|\widetilde{x}_{kj}|^{4}+\mathbbm{E}|\xi_{ki}|^{4}+\mathbbm{E}|\xi_{kj}|^{4}\big{)}
q!2v0c0q2,\displaystyle\leq\frac{q!}{2}v_{0}c_{0}^{q-2},

for some v0=O(M+Δ4n)v_{0}=O\big{(}\frac{{M}+\Delta^{4}}{n}\big{)}, c0=O((ζ+Δ)2n)c_{0}=O\big{(}\frac{(\zeta+\Delta)^{2}}{n}\big{)}. With these preparations, we can invoke Bernstein’s inequality (Lemma 1) to obtain that, for any t0t\geq 0, with probability at least 12exp(t)1-2\exp(-t),

|σ^ij𝔼σ^ij|C1((M+Δ4)tn+(ζ2+Δ2)tn).|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|\leq C_{1}\Big{(}\sqrt{\frac{(M+\Delta^{4})t}{n}}+\frac{(\zeta^{2}+\Delta^{2})t}{n}\Big{)}.

Taking t=δlogdt=\delta\log d and using the choice ζ(nMδlogd)1/4\zeta\asymp\big{(}\frac{nM}{\delta\log d}\big{)}^{1/4}, then applying a union bound over (i,j)[d]×[d](i,j)\in[d]\times[d], under the scaling nδlogdn\gtrsim\delta\log d, we obtain that I1(M+Δ2)δlogdnI_{1}\lesssim(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta\log d}{n}} holds with probability at least 12d2δ1-2d^{2-\delta}.

Step 2. Bounding I2I_{2}.

We aim to bound |𝔼(x~kix~kjxkixkj)||\mathbbm{E}(\widetilde{x}_{ki}\widetilde{x}_{kj}-x_{ki}x_{kj})| for any (i,j)[d]×[d](i,j)\in[d]\times[d]. First by the definition of truncation we have

|𝔼(x~kix~kjxkixkj)|𝔼[|xkixkj|(𝟙(|xki|ζ)+𝟙(|xkj|ζ))];\big{|}\mathbbm{E}(\widetilde{x}_{ki}\widetilde{x}_{kj}-x_{ki}x_{kj})\big{|}\leq\mathbbm{E}\big{[}|x_{ki}x_{kj}|(\mathbbm{1}(|x_{ki}|\geq\zeta)+\mathbbm{1}(|x_{kj}|\geq\zeta))\big{]};

then applying Cauchy-Schwarz to 𝔼[|xkixkj|𝟙(|xki|ζ)]\mathbbm{E}\big{[}|x_{ki}x_{kj}|\mathbbm{1}(|x_{ki}|\geq\zeta)\big{]}, we obtain

𝔼[|xkixkj|𝟙(|xki|ζ)][𝔼|xkixkj|2]1/2[(|xki|ζ)]1/2MMζ4=Mζ2,\mathbbm{E}\big{[}|x_{ki}x_{kj}|\mathbbm{1}(|x_{ki}|\geq\zeta)\big{]}\leq\big{[}\mathbbm{E}|x_{ki}x_{kj}|^{2}\big{]}^{1/2}\big{[}\mathbbm{P}(|x_{ki}|\geq\zeta)\big{]}^{1/2}\leq\sqrt{M}\sqrt{\frac{M}{\zeta^{4}}}=\frac{M}{\zeta^{2}},

where the second inequality is due to Markov’s inequality. Note that this bound remains valid for 𝔼[|xkixkj|𝟙(|xkj|ζ)]\mathbbm{E}\big{[}|x_{ki}x_{kj}|\mathbbm{1}(|x_{kj}|\geq\zeta)\big{]}. Since this holds for any (i,j)[d]×[d](i,j)\in[d]\times[d], combining with ζ(nMδlogd)1/4\zeta\asymp\big{(}\frac{nM}{\delta\log d}\big{)}^{1/4}, we obtain 𝔼(𝒙~k𝒙~k𝒙k𝒙k)2Mζ2Mδlogdn\|\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\|_{\infty}\leq\frac{2M}{\zeta^{2}}\lesssim\sqrt{M}\sqrt{\frac{\delta\log d}{n}}.

By putting pieces together, we have 𝚺^𝚺(M+Δ2)δlogdn\|\bm{\widehat{\Sigma}}-\bm{\Sigma^{\star}}\|_{\infty}\lesssim(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta\log d}{n}} with probability at least 12d2δ1-2d^{2-\delta}, as claimed. ∎

A.1.2 Proof of Theorem 3

Proof.

Note that the calculations in (3.1) and (3.2) remain valid (but the truncated samples are denoted by 𝒙ˇk\bm{\check{x}}_{k} rather than 𝒙~k\bm{\widetilde{x}}_{k}), so we have 𝔼𝚺^=𝔼(𝒙ˇk𝒙ˇk)\mathbbm{E}\bm{\widehat{\Sigma}}=\mathbbm{E}(\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}). Using triangle inequality we first decompose the error as

𝚺^𝚺op𝚺^𝔼𝚺^op+𝔼(𝒙ˇk𝒙ˇk𝒙k𝒙k)op:=I1+I2.\|\bm{\widehat{\Sigma}}-\bm{\Sigma^{\star}}\|_{op}\leq\|\bm{\widehat{\Sigma}}-\mathbbm{E}\bm{\widehat{\Sigma}}\|_{op}+\|\mathbbm{E}(\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\|_{op}:=I_{1}+I_{2}.

Step 1. Bounding I1I_{1}.

We first write that

𝚺^𝔼𝚺^=1nk=1n𝑺k where 𝑺k=𝒙˙k𝒙˙k𝔼(𝒙˙k𝒙˙k).\bm{\widehat{\Sigma}}-\mathbbm{E}\bm{\widehat{\Sigma}}=\frac{1}{n}\sum_{k=1}^{n}\bm{S}_{k}\text{ where }\bm{S}_{k}=\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}).

Recall that we define quantization error as 𝒘k=𝒙˙k𝒙ˇk𝝉k\bm{w}_{k}=\bm{\dot{x}}_{k}-\bm{\check{x}}_{k}-\bm{\tau}_{k} and quantization noise as 𝝃k=𝒙˙k𝒙ˇk\bm{\xi}_{k}=\bm{\dot{x}}_{k}-\bm{\check{x}}_{k}, and observe that the quantization noise is bounded 𝝃k=𝒙˙k𝒙ˇk=𝝉k+𝒘k32Δ\|\bm{\xi}_{k}\|_{\infty}=\|\bm{\dot{x}}_{k}-\bm{\check{x}}_{k}\|_{\infty}=\|\bm{\tau}_{k}+\bm{w}_{k}\|_{\infty}\leq\frac{3}{2}\Delta. Thus, by 𝒂22d𝒂42\|\bm{a}\|^{2}_{2}\leq\sqrt{d}\|\bm{a}\|_{4}^{2} that holds for any 𝒂d\bm{a}\in\mathbb{R}^{d}, we obtain

𝒙˙k𝒙˙kop\displaystyle\|\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\|_{op} =𝒙˙k22=𝒙ˇk+𝝃k222𝒙ˇk22+2𝝃k22\displaystyle=\|\bm{\dot{x}}_{k}\|_{2}^{2}=\|\bm{\check{x}}_{k}+\bm{\xi}_{k}\|^{2}_{2}\leq 2\|\bm{\check{x}}_{k}\|^{2}_{2}+2\|\bm{\xi}_{k}\|^{2}_{2}
2d𝒙ˇk42+2d(3Δ2)22dζ2+92dΔ2,\displaystyle\leq 2\sqrt{d}\cdot\|\bm{\check{x}}_{k}\|_{4}^{2}+2d\cdot\Big{(}\frac{3\Delta}{2}\Big{)}^{2}\leq 2\sqrt{d}\zeta^{2}+\frac{9}{2}d\Delta^{2},

which implies 𝑺kop𝒙˙k𝒙˙kop+𝔼𝒙˙k𝒙˙kop4dζ2+9dΔ2\|\bm{S}_{k}\|_{op}\leq\|\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\|_{op}+\mathbbm{E}\|\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\|_{op}\leq 4\sqrt{d}\zeta^{2}+9d\Delta^{2}. Moreover, we estimate the matrix variance statistic. Since 𝑺k\bm{S}_{k} is symmetric, we simply deal with 𝔼𝑺k2op\|\mathbbm{E}\bm{S}_{k}^{2}\|_{op} and some algebra gives 𝔼𝑺k2=𝔼[𝒙˙k22𝒙˙k𝒙˙k](𝔼[𝒙˙k𝒙˙k])2\mathbbm{E}\bm{S}_{k}^{2}=\mathbbm{E}\big{[}\|\bm{\dot{x}}_{k}\|_{2}^{2}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{]}-\big{(}\mathbbm{E}\big{[}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{]}\big{)}^{2}. First let us note that

(𝔼[𝒙˙k𝒙˙k])2op\displaystyle\Big{\|}\big{(}\mathbbm{E}\big{[}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{]}\big{)}^{2}\Big{\|}_{op} =𝔼[𝒙˙k𝒙˙k]op2=𝔼[𝒙ˇk𝒙ˇk]+Δ24𝑰dop2\displaystyle=\Big{\|}\mathbbm{E}\big{[}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{]}\Big{\|}^{2}_{op}=\Big{\|}\mathbbm{E}\big{[}\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}\big{]}+\frac{\Delta^{2}}{4}\bm{I}_{d}\Big{\|}^{2}_{op}
(𝔼[𝒙ˇk𝒙ˇk]op+Δ24)22𝔼[𝒙ˇk𝒙ˇk]op2+Δ48.\displaystyle\leq\Big{(}\Big{\|}\mathbbm{E}\big{[}\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}\big{]}\Big{\|}_{op}+\frac{\Delta^{2}}{4}\Big{)}^{2}\leq 2\Big{\|}\mathbbm{E}\big{[}\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}\big{]}\Big{\|}_{op}^{2}+\frac{\Delta^{4}}{8}.

Combining with the observation that

𝔼[𝒙ˇk𝒙ˇk]op=sup𝒗𝕊d1𝔼(𝒗𝒙ˇk)2sup𝒗𝕊d1𝔼(𝒗𝒙k)4M,\Big{\|}\mathbbm{E}\big{[}\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}\big{]}\Big{\|}_{op}=\sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}(\bm{v}^{\top}\bm{\check{x}}_{k})^{2}\leq\sup_{\bm{v}\in\mathbb{S}^{d-1}}\sqrt{\mathbbm{E}(\bm{v}^{\top}\bm{x}_{k})^{4}}\leq\sqrt{M},

we obtain (𝔼[𝒙˙k𝒙˙k])2op=O(M+Δ4)\big{\|}(\mathbbm{E}[\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}])^{2}\big{\|}_{op}=O(M+\Delta^{4}). Then we turn to the operator norm of 𝔼[𝒙˙k22𝒙˙k𝒙˙k]\mathbbm{E}[\|\bm{\dot{x}}_{k}\|_{2}^{2}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}]. We apply Cauchy-Schwarz to estimate

𝔼(𝒙˙k22𝒙˙k𝒙˙k)op=sup𝒗𝕊d1𝔼(𝒙˙k22(𝒗𝒙˙k)2)𝔼𝒙˙k24sup𝒗𝕊d1𝔼(𝒗𝒙˙k)4.\displaystyle\big{\|}\mathbbm{E}\big{(}\|\bm{\dot{x}}_{k}\|_{2}^{2}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{)}\big{\|}_{op}=\sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}\big{(}\|\bm{\dot{x}}_{k}\|_{2}^{2}(\bm{v}^{\top}\bm{\dot{x}}_{k})^{2}\big{)}{\leq}\sqrt{\mathbbm{E}\|\bm{\dot{x}}_{k}\|_{2}^{4}}\sup_{\bm{v}\in\mathbb{S}^{d-1}}\sqrt{\mathbbm{E}(\bm{v}^{\top}\bm{\dot{x}}_{k})^{4}}. (A.4)

By 𝒂22d𝒂42\|\bm{a}\|^{2}_{2}\leq\sqrt{d}\|\bm{a}\|_{4}^{2} that holds for any 𝒂d\bm{a}\in\mathbb{R}^{d}, 𝔼|xˇki|4𝔼|xki|4M\mathbbm{E}|\check{x}_{ki}|^{4}\leq\mathbbm{E}|x_{ki}|^{4}\leq M, 𝒙˙k=𝒙ˇk+𝝃k\bm{\dot{x}}_{k}=\bm{\check{x}}_{k}+\bm{\xi}_{k} and 𝝃k3Δ2\|\bm{\xi}_{k}\|_{\infty}\leq\frac{3\Delta}{2}, we obtain

𝔼𝒙˙k24\displaystyle\mathbbm{E}\|\bm{\dot{x}}_{k}\|^{4}_{2} 𝔼(𝒙ˇk2+𝝃k2)4𝔼(𝒙ˇk24+𝝃k24)\displaystyle\leq\mathbbm{E}(\|\bm{\check{x}}_{k}\|_{2}+\|\bm{\xi}_{k}\|_{2})^{4}\lesssim\mathbbm{E}(\|\bm{\check{x}}_{k}\|_{2}^{4}+\|\bm{\xi}_{k}\|^{4}_{2}) (A.5)
d𝔼(𝒙ˇk44+𝝃k44)d2(M+Δ4).\displaystyle\leq d\mathbbm{E}(\|\bm{\check{x}}_{k}\|^{4}_{4}+\|\bm{\xi}_{k}\|^{4}_{4})\lesssim d^{2}(M+\Delta^{4}).

For any 𝒗𝕊d1\bm{v}\in\mathbb{S}^{d-1}, we write 𝒙˙k=𝒙ˇk+𝝉k+𝒘k\bm{\dot{x}}_{k}=\bm{\check{x}}_{k}+\bm{\tau}_{k}+\bm{w}_{k} and then have the bound

𝔼(𝒗𝒙˙k)4𝔼(𝒗𝒙ˇk)4+𝔼(𝒗𝝉k)4+𝔼(𝒗𝒘k)4(i)M+Δ4,\displaystyle\mathbbm{E}(\bm{v}^{\top}\bm{\dot{x}}_{k})^{4}\lesssim\mathbbm{E}(\bm{v}^{\top}\bm{\check{x}}_{k})^{4}+\mathbbm{E}(\bm{v}^{\top}\bm{\tau}_{k})^{4}+\mathbbm{E}(\bm{v}^{\top}\bm{w}_{k})^{4}\stackrel{{\scriptstyle(i)}}{{\lesssim}}M+\Delta^{4}, (A.6)

where (i)(i) is because 𝔼(𝒗𝒙ˇk)4𝔼(𝒗𝒙k)4M\mathbbm{E}(\bm{v}^{\top}\bm{\check{x}}_{k})^{4}\leq\mathbbm{E}(\bm{v}^{\top}\bm{x}_{k})^{4}\leq M, 𝝉k𝒰([Δ2,Δ2]d)+𝒰([Δ2,Δ2]d)\bm{\tau}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d})+\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}), and the quantization error 𝒘k\bm{w}_{k} follows 𝒰([Δ2,Δ2]d)\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}); in more detail, 𝒗𝝉kψ2,𝒗𝒘kψ2=O(Δ)\|\bm{v}^{\top}\bm{\tau}_{k}\|_{\psi_{2}},\|\bm{v}^{\top}\bm{w}_{k}\|_{\psi_{2}}=O(\Delta) and then the moment property of sub-Gaussian random variables implies 𝔼(𝒗𝝉k)4=O(Δ4)\mathbbm{E}(\bm{v}^{\top}\bm{\tau}_{k})^{4}=O(\Delta^{4}) and 𝔼(𝒗𝒘k)4=O(Δ4)\mathbbm{E}(\bm{v}^{\top}\bm{w}_{k})^{4}=O(\Delta^{4}). From (A.4), (A.5) and (A.6), we obtain 𝔼(𝒙˙k22𝒙˙k𝒙˙k)op=O(d(Δ4+M))\big{\|}\mathbbm{E}\big{(}\|\bm{\dot{x}}_{k}\|_{2}^{2}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{)}\big{\|}_{op}=O\big{(}d(\Delta^{4}+M)\big{)}. Further combining with 𝔼𝑺k2=𝔼[𝒙˙k22𝒙˙k𝒙˙k](𝔼[𝒙˙k𝒙˙k])2\mathbbm{E}\bm{S}_{k}^{2}=\mathbbm{E}\big{[}\|\bm{\dot{x}}_{k}\|_{2}^{2}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{]}-\big{(}\mathbbm{E}\big{[}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{]}\big{)}^{2} and (𝔼[𝒙˙k𝒙˙k])2op=O(M+Δ4)\big{\|}(\mathbbm{E}[\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}])^{2}\big{\|}_{op}=O(M+\Delta^{4}), we arrive at 𝔼𝑺k2opd(Δ4+M)\|\mathbbm{E}\bm{S}_{k}^{2}\|_{op}\lesssim d(\Delta^{4}+M) and hence k=1n𝔼𝑺k2opnd(Δ4+M)\big{\|}\sum_{k=1}^{n}\mathbbm{E}\bm{S}_{k}^{2}\big{\|}_{op}\lesssim nd(\Delta^{4}+M). With these preparations, Matrix Bernstein’s inequality (Lemma 2) yields the following inequality that holds for any t0t\geq 0

(𝚺^𝔼𝚺^opt)2dexp(C1nt2(M+Δ4)d+(dζ2+dΔ2)t).\mathbbm{P}\Big{(}\|\bm{\widehat{\Sigma}}-\mathbbm{E}\bm{\widehat{\Sigma}}\|_{op}\geq t\Big{)}\leq 2d\exp\left(-\frac{C_{1}nt^{2}}{(M+\Delta^{4})d+(\sqrt{d}\zeta^{2}+d\Delta^{2})t}\right).

Setting t=C2(M+Δ2)δdlogdnt=C_{2}(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta d\log d}{n}} with sufficiently large C2C_{2}, under the scaling of nδdlogdn\gtrsim\delta d\log d and the threshold ζ(M1/4+Δ)(nδlogd)1/4\zeta\asymp(M^{1/4}+\Delta)\big{(}\frac{n}{\delta\log d}\big{)}^{1/4}, we obtain that I1=𝚺^𝔼𝚺^opC2(M+Δ2)δdlogdnI_{1}=\|\bm{\widehat{\Sigma}}-\mathbbm{E}\bm{\widehat{\Sigma}}\|_{op}\leq C_{2}(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta d\log d}{n}} holds with probability at least 12d1δ1-2d^{1-\delta}.

Step 2. Bounding I2I_{2}.

Having bounded the concentration term I1I_{1}, we now switch to the bias term

I2=sup𝒗𝕊d1|𝒗𝔼(𝒙ˇk𝒙ˇk𝒙k𝒙k)𝒗|.I_{2}=\sup_{\bm{v}\in\mathbb{S}^{d-1}}\big{|}\bm{v}^{\top}\mathbbm{E}(\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\bm{v}\big{|}.

For any 𝒗𝕊d1\bm{v}\in\mathbb{S}^{d-1}, because 𝒙ˇk\bm{\check{x}}_{k} is obtained from truncating 𝒙k\bm{x}_{k} with respect to the 4\ell_{4}-norm, we have

|𝒗𝔼(𝒙ˇk𝒙ˇk𝒙k𝒙k)𝒗|\displaystyle\big{|}\bm{v}^{\top}\mathbbm{E}(\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\bm{v}\big{|} =|𝔼[((𝒗𝒙ˇk)2(𝒗𝒙k)2)𝟙(𝒙k4ζ)]|\displaystyle=\Big{|}\mathbbm{E}\Big{[}\big{(}(\bm{v}^{\top}\bm{\check{x}}_{k})^{2}-(\bm{v}^{\top}\bm{x}_{k})^{2}\big{)}\mathbbm{1}(\|\bm{x}_{k}\|_{4}\geq\zeta)\Big{]}\Big{|}
𝔼[(𝒗𝒙k)2𝟙(𝒙k4ζ)]\displaystyle\leq\mathbbm{E}\big{[}(\bm{v}^{\top}\bm{x}_{k})^{2}\mathbbm{1}(\|\bm{x}_{k}\|_{4}\geq\zeta)\big{]}
(i)𝔼(𝒗𝒙k)4(𝒙k44ζ4)\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\sqrt{\mathbbm{E}(\bm{v}^{\top}\bm{x}_{k})^{4}}\sqrt{\mathbbm{P}(\|\bm{x}_{k}\|^{4}_{4}\geq\zeta^{4})}
(ii)M𝔼𝒙k44ζ4(iii)Mδdlogdn,\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}\sqrt{M\frac{\mathbbm{E}\|\bm{x}_{k}\|_{4}^{4}}{\zeta^{4}}}\stackrel{{\scriptstyle(iii)}}{{\lesssim}}\sqrt{\frac{M\delta d\log d}{n}},

where (i)(i) and (ii)(ii) are respectively by the Cauchy-Schwarz and Markov inequalities, and in (iii)(iii) we use ζ(M1/4+Δ)(nδlogd)1/4\zeta\asymp(M^{1/4}+\Delta)\big{(}\frac{n}{\delta\log d}\big{)}^{1/4}. This leads to the bound I2Mδdlogdn.I_{2}\lesssim\sqrt{\frac{M\delta d\log d}{n}}. Combining the bounds of I1,I2I_{1},I_{2} completes the proof.∎

A.1.3 Proof of Theorem 4

This small appendix is devoted to the proof of Theorem 4, for which we need a lemma concerning the element-wise error rate of 𝚺^s\bm{\widehat{\Sigma}}_{s}, i.e., |σ˘ijσij||\breve{\sigma}_{ij}-\sigma^{\star}_{ij}| where we write 𝚺^s=[σ˘ij]\bm{\widehat{\Sigma}}_{s}=[\breve{\sigma}_{ij}], 𝚺=𝔼(𝒙k𝒙k)=[σij]\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top})=[\sigma_{ij}^{\star}]. Recalling that 𝚺^s=𝒯μ(𝚺^)\bm{\widehat{\Sigma}}_{s}=\mathcal{T}_{\mu}(\bm{\widehat{\Sigma}}), the key message from Lemma 3 is that, due to the thresholding operator 𝒯μ()\mathcal{T}_{\mu}(\cdot), 𝚺^s\bm{\widehat{\Sigma}}_{s} respects an element-wise bound tighter than the O(δlogdn)O\big{(}\mathscr{L}\sqrt{\frac{\delta\log d}{n}}\big{)} rate in Theorem 2, as can be seen from the additional branch |σij||\sigma^{\star}_{ij}| in (A.7).

Lemma 3.

(Element-wise Error Rate of 𝚺^s\bm{\widehat{\Sigma}}_{s}). For any i,j[d]i,j\in[d], the thresholding estimator 𝚺^s=[σ˘ij]\bm{\widehat{\Sigma}}_{s}=[\breve{\sigma}_{ij}] in Theorem 4 satisfies for some CC that

(|σ˘ijσij|Cmin{|σij|,δlogdn})12dδ\mathbbm{P}\left(|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\leq C\min\Big{\{}|\sigma^{\star}_{ij}|,\mathscr{L}\sqrt{\frac{\delta\log d}{n}}\Big{\}}\right)\geq 1-2d^{-\delta} (A.7)

where :=M+Δ2\mathscr{L}:=\sqrt{M}+\Delta^{2}.

Proof.

Recall that 𝚺^s=[σ˘ij]=𝒯μ(𝚺^)=𝒯μ([σ^ij])\bm{\widehat{\Sigma}}_{s}=[\breve{\sigma}_{ij}]=\mathcal{T}_{\mu}(\bm{\widehat{\Sigma}})=\mathcal{T}_{\mu}\big{(}[\widehat{\sigma}_{ij}]\big{)} and hence σ˘ij=𝒯μ(σ^ij)\breve{\sigma}_{ij}=\mathcal{T}_{\mu}(\widehat{\sigma}_{ij}). Given (i,j)(i,j), the proof of Theorem 2 delivers |σ^ijσij|C1δlogdn|\widehat{\sigma}_{ij}-\sigma_{ij}^{\star}|\leq C_{1}\mathscr{L}\sqrt{\frac{\delta\log d}{n}} with probability at least 12dδ1-2d^{-\delta}. Assume we are on this event in the following analysis. As stated in Theorem 4, we set μ=C2δlogdn\mu=C_{2}\mathscr{L}\sqrt{\frac{\delta\log d}{n}} with C2>C1C_{2}>C_{1}, =M+Δ2\mathscr{L}=\sqrt{M}+\Delta^{2}. Since σ˘ij=𝒯μ(σ^ij)\breve{\sigma}_{ij}=\mathcal{T}_{\mu}(\widehat{\sigma}_{ij}), we discuss whether |σ^ij|μ|\widehat{\sigma}_{ij}|\geq\mu holds.

Case 1. when |σ^ij|<μ|\widehat{\sigma}_{ij}|<\mu holds.

In this case we have σ˘ij=0\breve{\sigma}_{ij}=0, thus |σ˘ijσij||σij||\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\leq|\sigma^{\star}_{ij}|. Further, |σij||σijσ^ij|+|σ^ij|C1δlogdn+μδlogdn|\sigma^{\star}_{ij}|\leq|\sigma^{\star}_{ij}-\widehat{\sigma}_{ij}|+|\widehat{\sigma}_{ij}|\leq C_{1}\mathscr{L}\sqrt{\frac{\delta\log d}{n}}+\mu\lesssim\mathscr{L}\sqrt{\frac{\delta\log d}{n}}, so we also have |σ˘ijσij|δlogdn|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\lesssim\mathscr{L}\sqrt{\frac{\delta\log d}{n}}.

Case 2. when |σ^ij|μ|\widehat{\sigma}_{ij}|\geq\mu holds.

In this case, |σ^ij|μ|\widehat{\sigma}_{ij}|\geq\mu implies σ˘ij=σ^ij\breve{\sigma}_{ij}=\widehat{\sigma}_{ij}, hence |σ˘ijσij|=|σ^ijσij|C1δlogdn|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|=|\widehat{\sigma}_{ij}-\sigma^{\star}_{ij}|\leq C_{1}\mathscr{L}\sqrt{\frac{\delta\log d}{n}}. Moreover, |σij||σ^ij||σ^ijσij|μ|σ^ijσij|(C2C1)δlogdn|\sigma^{\star}_{ij}|\geq|\widehat{\sigma}_{ij}|-|\widehat{\sigma}_{ij}-\sigma^{\star}_{ij}|\geq\mu-|\widehat{\sigma}_{ij}-\sigma^{\star}_{ij}|\geq(C_{2}-C_{1})\mathscr{L}\sqrt{\frac{\delta\log d}{n}}, so we also have |σ˘ijσij|=O(|σij|)|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|=O(|\sigma^{\star}_{ij}|).

Therefore, in both cases we have proved that |σ˘ijσij|min{|σij|,δlogdn}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\lesssim\min\big{\{}|\sigma^{\star}_{ij}|,\mathscr{L}\sqrt{\frac{\delta\log d}{n}}\big{\}}, which completes the proof. ∎

We are now in a position to present the proof.

Proof of Theorem 4. We let p=δ41p=\frac{\delta}{4}\geq 1 (just assume δ4\delta\geq 4) and use B0:=δlogdnB_{0}:=\mathscr{L}\sqrt{\frac{\delta\log d}{n}} as shorthand. For (i,j)[d]×[d](i,j)\in[d]\times[d] we define the event 𝒜ij\mathscr{A}_{ij} as

𝒜ij={|σ˘ijσij|C1min{|σij|,B0}}.\mathscr{A}_{ij}=\Big{\{}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\leq C_{1}\min\big{\{}|\sigma^{\star}_{ij}|,B_{0}\big{\}}\Big{\}}.

By Lemma 3 we can choose C1C_{1} to be sufficiently large such that C1B0>3μC_{1}B_{0}>3\mu and (𝒜ij)2dδ\mathbbm{P}(\mathscr{A}_{ij}^{\complement})\leq 2d^{-\delta}; here, by convention we let 𝒜ij\mathscr{A}_{ij}^{\complement} be the complement of 𝒜ij\mathscr{A}_{ij}. Our proof strategy is to first bound the pp-th order moment 𝔼𝚺^s𝚺opp\mathbbm{E}\|\bm{\widehat{\Sigma}}_{s}-\bm{\Sigma^{\star}}\|_{op}^{p}, and then invoke Markov’s inequality to derive a high probability bound. We start with a simple estimate

𝔼𝚺^s𝚺opp(i)𝔼(supj[d]i=1d|σ˘ijσij|𝟙(𝒜ij)+supj[d]i=1d|σ˘ijσij|𝟙(𝒜ij))p\displaystyle\mathbbm{E}\|\bm{\widehat{\Sigma}}_{s}-\bm{\Sigma^{\star}}\|^{p}_{op}\stackrel{{\scriptstyle(i)}}{{\leq}}\mathbbm{E}\Big{(}\sup_{j\in[d]}\sum_{i=1}^{d}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij})+\sup_{j\in[d]}\sum_{i=1}^{d}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\Big{)}^{p}
(ii)2p𝔼supj[d](i=1d|σ˘ijσij|𝟙(𝒜ij))p+2p𝔼supj[d](i=1d|σ˘ijσij|𝟙(𝒜ij))p:=I1+I2\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}2^{p}\mathbbm{E}\sup_{j\in[d]}\Big{(}\sum_{i=1}^{d}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij})\Big{)}^{p}+2^{p}\mathbbm{E}\sup_{j\in[d]}\Big{(}\sum_{i=1}^{d}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}^{\complement}_{ij})\Big{)}^{p}:=I_{1}+I_{2}

where (i)(i) and (ii)(ii) are due to 𝑨opsupj[d]i[d]|aij|\|\bm{A}\|_{op}\leq\sup_{j\in[d]}\sum_{i\in[d]}|a_{ij}| for symmetric 𝑨\bm{A} and (a+b)p(2a)p+(2b)p(a+b)^{p}\leq(2a)^{p}+(2b)^{p}. In this proof, the ranges of indices in summation or supremum, if omitted, are [d][d].

Step 1. Bounding I1I_{1}.

By the definition of 𝒜ij\mathscr{A}_{ij}, |σ˘ijσij|=0|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|=0 if |σij|=0|\sigma^{\star}_{ij}|=0. Because the columns of 𝚺\bm{\Sigma^{\star}} are ss-sparse, we can straightforwardly bound I1I_{1} as follows:

I1=2p𝔼supj(i:|σij|>0|σ˘ijσij|𝟙(𝒜ij))p(2C1sB0)p.\displaystyle I_{1}=2^{p}\mathbbm{E}\sup_{j}\Big{(}\sum_{i:|\sigma^{\star}_{ij}|>0}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij})\Big{)}^{p}\leq\big{(}2C_{1}sB_{0}\big{)}^{p}. (A.8)

Step 2. Bounding I2I_{2}.

We first write I2=2p𝔼supjWjI_{2}=2^{p}\mathbbm{E}\sup_{j}W_{j} with Wj:=(i|σ˘ijσij|𝟙(𝒜ij))pW_{j}:=\big{(}\sum_{i}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\big{)}^{p}, then start from

Wj(i)(i=1d|σij|𝟙(𝒜ij)𝟙(|σ^ij|<μ)+i=1d|σ^ij𝔼σ^ij|𝟙(𝒜ij)+i=1d|σ~ijσij|𝟙(𝒜ij))p\displaystyle W_{j}\stackrel{{\scriptstyle(i)}}{{\leq}}\Big{(}\sum_{i=1}^{d}|\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\mathbbm{1}(|\widehat{\sigma}_{ij}|<\mu)+\sum_{i=1}^{d}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|\mathbbm{1}(\mathscr{A}_{ij}^{\complement})+\sum_{i=1}^{d}|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\Big{)}^{p}
(3d)p1(i=1d|σij|p𝟙(𝒜ij)𝟙(|σ^ij|<μ)+i=1d|σ^ij𝔼σ^ij|p𝟙(𝒜ij)+i=1d|σ~ijσij|p𝟙(𝒜ij)),\displaystyle\leq(3d)^{p-1}\Big{(}\sum_{i=1}^{d}|\sigma^{\star}_{ij}|^{p}\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\mathbbm{1}(|\widehat{\sigma}_{ij}|<\mu)+\sum_{i=1}^{d}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|^{p}\mathbbm{1}(\mathscr{A}_{ij}^{\complement})+\sum_{i=1}^{d}|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|^{p}\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\Big{)},

note that in (i)(i) we define

𝔼σ^ij=𝔼(x~kix~kj):=σ~ij.\mathbbm{E}\widehat{\sigma}_{ij}=\mathbbm{E}(\widetilde{x}_{ki}\widetilde{x}_{kj}):=\widetilde{\sigma}_{ij}.

By replacing supj\sup_{j} with j\sum_{j}, this further gives

I26pdp1(i,j|σij|p𝔼[𝟙(𝒜ij)𝟙(|σ^ij|<μ)]+i,j𝔼[|σ^ij𝔼σ^ij|p𝟙(𝒜ij)]\displaystyle I_{2}\leq 6^{p}d^{p-1}\Big{(}{\sum_{i,j}|\sigma^{\star}_{ij}|^{p}\mathbbm{E}\big{[}\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\mathbbm{1}(|\widehat{\sigma}_{ij}|<\mu)\big{]}}+{\sum_{i,j}\mathbbm{E}\big{[}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|^{p}\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\big{]}} (A.9)
+i,j|σ~ijσij|p(𝒜ij)):=6pdp1(I21+I22+I23).\displaystyle+\sum_{i,j}|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|^{p}\mathbbm{P}(\mathscr{A}_{ij}^{\complement})\Big{)}:=6^{p}d^{p-1}\big{(}I_{21}+I_{22}+I_{23}\big{)}.

Step 2.1. Bounding I21I_{21}.

Note that 𝒜ij\mathscr{A}_{ij}^{\complement} means |σ˘ijσij|>C1min{|σij|,B0}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|>C_{1}\min\{|\sigma^{\star}_{ij}|,B_{0}\}, while |σ^ij|<μ|\widehat{\sigma}_{ij}|<\mu implies σ˘ij=0\breve{\sigma}_{ij}=0; their combination thus allows us to proceed as in the following (i)(i) and (iii)(iii):

|σij|>(i)C1B0>(ii)3μ>(iii)3|σ^ij|3|σij|3|σ^ijσij|,|\sigma^{\star}_{ij}|\stackrel{{\scriptstyle(i)}}{{>}}C_{1}B_{0}\stackrel{{\scriptstyle(ii)}}{{>}}3\mu\stackrel{{\scriptstyle(iii)}}{{>}}3|\widehat{\sigma}_{ij}|\geq 3|\sigma^{\star}_{ij}|-3|\widehat{\sigma}_{ij}-\sigma^{\star}_{ij}|,

where (ii)(ii) is due to our choice of C1C_{1}. Thus, 𝒜ij{|σ^ij|<μ}\mathscr{A}_{ij}^{\complement}\cap\{|\widehat{\sigma}_{ij}|<\mu\} implies |σ^ijσij|>23|σij||\widehat{\sigma}_{ij}-\sigma^{\star}_{ij}|>\frac{2}{3}|\sigma^{\star}_{ij}| and |σij|>3μ|\sigma^{\star}_{ij}|>3\mu. Note that Step 2 in the proof of Theorem 2 gives |σ~ijσij|=O(B0)|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|=O\big{(}B_{0}\big{)}, and hence we can assume μ>|σ~ijσij|\mu>|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}| and so |σij|>3|σ~ijσij||\sigma^{\star}_{ij}|>3|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|. Using these relations and triangle inequality, we obtain

23|σij|<|σ^ijσij||σ^ij𝔼σ^ij|+|σ~ijσij|<|σ^ij𝔼σ^ij|+13|σij|,\frac{2}{3}|\sigma^{\star}_{ij}|<|\widehat{\sigma}_{ij}-\sigma^{\star}_{ij}|\leq|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|+|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|<|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|+\frac{1}{3}|\sigma^{\star}_{ij}|,

which implies |σ^ij𝔼σ^ij|>13|σij||\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|>\frac{1}{3}|\sigma^{\star}_{ij}|. Now we conclude that, 𝒜ij{|σ^ij|<μ}\mathscr{A}_{ij}^{\complement}\cap\{|\widehat{\sigma}_{ij}|<\mu\} implies |σ^ij𝔼σ^ij|>13|σij||\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|>\frac{1}{3}|\sigma^{\star}_{ij}| and |σij|>3μ|\sigma^{\star}_{ij}|>3\mu, which allows us to bound I21I_{21} as

I21i,j|σij|p𝟙(|σij|>3μ)(|σ^ij𝔼σ^ij|>13|σij|).\displaystyle I_{21}\leq\sum_{i,j}|\sigma^{\star}_{ij}|^{p}\mathbbm{1}(|\sigma_{ij}^{\star}|>3\mu)\mathbbm{P}\Big{(}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|>\frac{1}{3}|\sigma^{\star}_{ij}|\Big{)}. (A.10)

Analogously to the proof of Theorem 2, we can apply Bernstein’s inequality to (|σ^ij𝔼σ^ij|>13|σij|)\mathbbm{P}\big{(}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|>\frac{1}{3}|\sigma^{\star}_{ij}|\big{)}. More specifically, by preparations as in (A.3), we can use (A.2) in Lemma 1 with v=O(M+Δ4n)v=O\big{(}\frac{M+\Delta^{4}}{n}\big{)}, c=O(ζ2+Δ2n)=O(Δ2n+Mnδlogd)c=O\big{(}\frac{\zeta^{2}+\Delta^{2}}{n})=O(\frac{\Delta^{2}}{n}+\sqrt{\frac{M}{n\delta\log d}}\big{)} (recall that ζ(nMδlogd)1/4\zeta\asymp\big{(}\frac{nM}{\delta\log d}\big{)}^{1/4}). For some absolute constants C2,C3C_{2},C_{3}, it gives

(|σ^ij𝔼σ^ij|>13|σij|)2exp(|σij|2C2{M+Δ4n+Δ2|σij|n+Mnδlogd|σij|})\displaystyle\mathbbm{P}\Big{(}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|>\frac{1}{3}|\sigma^{\star}_{ij}|\Big{)}\leq 2\exp\left(-\frac{|\sigma^{\star}_{ij}|^{2}}{C_{2}\big{\{}\frac{M+\Delta^{4}}{n}+\frac{\Delta^{2}|\sigma^{\star}_{ij}|}{n}+\sqrt{\frac{M}{n\delta\log d}}|\sigma^{\star}_{ij}|\big{\}}}\right) (A.11)
2exp(3n|σij|C2min{|σij|M+Δ4,1Δ2,δlogdnM})(i)2exp(C3|σij|nδlogdM+Δ2),\displaystyle{\leq}2\exp\left(-\frac{3n|\sigma^{\star}_{ij}|}{C_{2}}\min\Big{\{}\frac{|\sigma^{\star}_{ij}|}{M+\Delta^{4}},\frac{1}{\Delta^{2}},\sqrt{\frac{\delta\log d}{nM}}\Big{\}}\right)\stackrel{{\scriptstyle(i)}}{{\leq}}2\exp\left(-\frac{C_{3}|\sigma^{\star}_{ij}|\sqrt{n\delta\log d}}{\sqrt{M}+\Delta^{2}}\right),

and in (i)(i) we use min{|σij|M+Δ4,Δ2,δlogdnM}1M+Δ2δlogdn\min\big{\{}\frac{|\sigma^{\star}_{ij}|}{M+\Delta^{4}},\Delta^{-2},\sqrt{\frac{\delta\log d}{nM}}\big{\}}\gtrsim\frac{1}{\sqrt{M}+\Delta^{2}}\sqrt{\frac{\delta\log d}{n}} that holds because |σij|>3μ|\sigma^{\star}_{ij}|>3\mu and nδlogdn\gtrsim\delta\log d. We substitute (A.11) into (A.10) and perform some estimates

I212i,j|σij|p𝟙(|σij|>3μ)exp(C3|σij|nδlogdM+Δ2)\displaystyle I_{21}\leq 2\sum_{i,j}|\sigma^{\star}_{ij}|^{p}\mathbbm{1}\big{(}|\sigma^{\star}_{ij}|>3\mu\big{)}\exp\left(-\frac{C_{3}|\sigma^{\star}_{ij}|\sqrt{n\delta\log d}}{\sqrt{M}+\Delta^{2}}\right)
=2i,j(M+Δ2C3nδlogd)p(C3|σij|nδlogdM+Δ2)pexp(0.5C3|σij|nδlogdM+Δ2)\displaystyle=2\sum_{i,j}\left(\frac{\sqrt{M}+\Delta^{2}}{C_{3}\sqrt{n\delta\log d}}\right)^{p}\cdot\left(\frac{C_{3}|\sigma^{\star}_{ij}|\sqrt{n\delta\log d}}{\sqrt{M}+\Delta^{2}}\right)^{p}\exp\left(-\frac{0.5C_{3}|\sigma^{\star}_{ij}|\sqrt{n\delta\log d}}{\sqrt{M}+\Delta^{2}}\right)
exp(0.5C3|σij|nδlogdM+Δ2)𝟙(|σij|>3μ)\displaystyle\cdot\exp\left(-\frac{0.5C_{3}|\sigma^{\star}_{ij}|\sqrt{n\delta\log d}}{\sqrt{M}+\Delta^{2}}\right)\mathbbm{1}\big{(}|\sigma^{\star}_{ij}|>3\mu\big{)}
(i)2i,j(M+Δ2C3nδlogd)p(supt0tpexp(t2))exp(3C32nδlogdμM+Δ2)\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}2\sum_{i,j}\left(\frac{\sqrt{M}+\Delta^{2}}{C_{3}\sqrt{n\delta\log d}}\right)^{p}\cdot\left(\sup_{t\geq 0}~{}t^{p}\exp\Big{(}-\frac{t}{2}\Big{)}\right)\cdot\exp\left(-\frac{3C_{3}}{2}\frac{\sqrt{n\delta\log d}\cdot\mu}{\sqrt{M}+\Delta^{2}}\right)
(ii)2d210δ(M+Δ2C3δnlogd)p2d210δ(C31B0)p,\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}2d^{2-10\delta}\left(\frac{\sqrt{M}+\Delta^{2}}{C_{3}}\sqrt{\frac{\delta}{n\log d}}\right)^{p}\leq 2d^{2-10\delta}(C_{3}^{-1}B_{0})^{p},

where in (i) we substitute |\sigma^{\star}_{ij}|>3\mu from the indicator function into the exponent, and (ii) is because \sup_{t\geq 0}t^{p}\exp\big(-\frac{t}{2}\big)\leq p^{p}, p=\frac{\delta}{4}, and we consider \mu=C_{4}(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta\log d}{n}} with C_{4} large enough.
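
For completeness, we record a short verification of the elementary bound \sup_{t\geq 0}t^{p}\exp(-\frac{t}{2})\leq p^{p} used in (ii); this is a standard calculus fact, as the maximizer of t^{p}e^{-t/2} over t\geq 0 is t=2p, so that

\sup_{t\geq 0}t^{p}\exp\Big(-\frac{t}{2}\Big)=(2p)^{p}e^{-p}=\Big(\frac{2p}{e}\Big)^{p}\leq p^{p},\quad\text{since } 2/e<1.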

Step 2.2. Bounding I_{22}.

Then, we deal with I_{22} by Cauchy-Schwarz:

I_{22}\leq\sum_{i,j}\sqrt{\mathbbm{E}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|^{2p}}\sqrt{\mathbbm{P}(\mathscr{A}_{ij}^{\complement})}.

As in (A.11), we can use (A.2) in Lemma 1 with v=O\big(\frac{M+\Delta^{4}}{n}\big) and c=O\big(\frac{\Delta^{2}}{n}+\sqrt{\frac{M}{n\delta\log d}}\big), yielding that for any t\geq 0, \mathbbm{P}\big(|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|\geq t\big)\leq 2\exp\big(-\frac{t^{2}}{2(v+ct)}\big)\leq 2\exp\big(-\frac{t^{2}}{4v}\big)+2\exp\big(-\frac{t}{4c}\big). Based on this tail bound, we can bound the moment via integration as follows:

\mathbbm{E}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|^{2p}=2p\int_{0}^{\infty}t^{2p-1}\mathbbm{P}(|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|>t)~\mathrm{d}t
\leq 4p\int_{0}^{\infty}t^{2p-1}\Big(\exp\Big(-\frac{t^{2}}{4v}\Big)+\exp\Big(-\frac{t}{4c}\Big)\Big)~\mathrm{d}t
=2\big[(4v)^{p}\Gamma(p+1)+(4c)^{2p}\Gamma(2p+1)\big]\stackrel{(i)}{\leq}2\big[(4vp)^{p}+(8cp)^{2p}\big],

where we use \Gamma(p+1)\leq p^{p} and \Gamma(2p+1)\leq(2p)^{2p} in (i) under suitably large p. Thus, it follows that

I_{22}\leq\sum_{i,j}2d^{-\frac{\delta}{2}}\sqrt{(4vp)^{p}+(8cp)^{2p}}\leq 2d^{2-\frac{\delta}{2}}\big[(2\sqrt{pv})^{p}+(8cp)^{p}\big]\stackrel{(i)}{\leq}2d^{2-\frac{\delta}{2}}(C_{4}B_{0})^{p},

where (i) is due to 2\sqrt{pv}\leq(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta}{n}} and 8cp=\frac{2\Delta^{2}\delta}{n}+2\sqrt{\frac{\delta M}{n\log d}} (recall that p=\frac{\delta}{4}).
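
To make the step (i) above fully explicit, recall p=\frac{\delta}{4} and v=O\big(\frac{M+\Delta^{4}}{n}\big); then, up to the absolute constant hidden in O(\cdot),

2\sqrt{pv}\lesssim 2\sqrt{\frac{\delta}{4}\cdot\frac{M+\Delta^{4}}{n}}=\sqrt{\frac{\delta(M+\Delta^{4})}{n}}\leq(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta}{n}},

by \sqrt{a+b}\leq\sqrt{a}+\sqrt{b}; the displayed expression for 8cp follows by directly multiplying c=O\big(\frac{\Delta^{2}}{n}+\sqrt{\frac{M}{n\delta\log d}}\big) by 8p=2\delta.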

Step 2.3. Bounding I_{23}.

From Step 2 in the proof of Theorem 2 we have |\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|\leq C_{5}B_{0}. This directly leads to

I_{23}\leq d^{2}\cdot 2d^{-\delta}\cdot(C_{5}B_{0})^{p}=2d^{2-\delta}(C_{5}B_{0})^{p}.

We are in a position to combine everything and conclude the proof. Putting all pieces into the decomposition of I_{2}, it follows that I_{2}\leq d^{1-\frac{\delta}{4}}(C_{6}B_{0})^{p}. Assuming \delta\geq 4, this upper bound is dominated by the bound (A.8) on I_{1}, and hence we can conclude that \mathbbm{E}\|\bm{\widehat{\Sigma}}_{s}-\bm{\Sigma^{\star}}\|_{op}^{p}\leq(C_{6}sB_{0})^{p}. Therefore, by Markov's inequality,

\mathbbm{P}\big(\|\bm{\widehat{\Sigma}}_{s}-\bm{\Sigma^{\star}}\|_{op}\geq C_{6}esB_{0}\big)\leq\frac{\mathbbm{E}\|\bm{\widehat{\Sigma}}_{s}-\bm{\Sigma^{\star}}\|_{op}^{p}}{(C_{6}esB_{0})^{p}}\leq\exp(-p)=\exp\Big(-\frac{\delta}{4}\Big),

which completes the proof. \square

A.2 Quantized Compressed Sensing

Note that our estimation procedures in QCS and QMC fall into the framework of regularized M-estimators; see [70, 35, 21] for instance. In particular, we introduce the following deterministic result for analysing the estimator (3.6).

Lemma 4.

(Adapted from [21, Coro. 2]). Consider (3.4) and the estimator \bm{\widehat{\theta}} defined in (3.6), and let \bm{\widehat{\Upsilon}}:=\bm{\widehat{\theta}}-\bm{\theta^{\star}} be the estimation error. If \bm{Q} is positive semi-definite and \lambda\geq 2\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}, then it holds that \|\bm{\widehat{\Upsilon}}\|_{1}\leq 10\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2}. Moreover, if for some \kappa>0 we have the restricted strong convexity (RSC) \bm{\widehat{\Upsilon}}^{\top}\bm{Q}\bm{\widehat{\Upsilon}}\geq\kappa\|\bm{\widehat{\Upsilon}}\|_{2}^{2}, then we have the error bounds \|\bm{\widehat{\Upsilon}}\|_{2}\leq 30\sqrt{s}\big(\frac{\lambda}{\kappa}\big) and \|\bm{\widehat{\Upsilon}}\|_{1}\leq 300s\big(\frac{\lambda}{\kappa}\big). (Footnote 15: We do not optimize the constants in Lemmas 4, 6 for easy reference.)

To establish the RSC condition, a convenient way is to use the matrix deviation inequality. The following Lemma is adapted from [61], by combining Theorem 3 and Remark 1 therein. (Footnote 16: The dependence on K can be further refined [49], while this is not pursued in the present paper.)

Lemma 5.

(Adapted from [61, Thm. 3]). Assume \bm{A}\in\mathbb{R}^{n\times d} has independent zero-mean sub-Gaussian rows \bm{\alpha}_{k}^{\top} satisfying \|\bm{\alpha}_{k}\|_{\psi_{2}}\leq K, and the eigenvalues of \bm{\Sigma}:=\mathbbm{E}(\bm{\alpha}_{k}\bm{\alpha}_{k}^{\top}) are between [\kappa_{0},\kappa_{1}] for some \kappa_{1}\geq\kappa_{0}>0. For \mathcal{T}\subset\mathbb{R}^{d} we let \mathrm{rad}(\mathcal{T})=\sup_{\bm{x}\in\mathcal{T}}\|\bm{x}\|_{2} be its radius. Then with probability at least 1-\exp(-u^{2}), it holds that

\sup_{\bm{x}\in\mathcal{T}}\Big|\|\bm{Ax}\|_{2}-\sqrt{n}\|\sqrt{\bm{\Sigma}}\bm{x}\|_{2}\Big|\leq\frac{C\sqrt{\kappa_{1}}K^{2}}{\kappa_{0}}\Big(\omega(\mathcal{T})+u\cdot\mathrm{rad}(\mathcal{T})\Big),

where \omega(\mathcal{T})=\mathbbm{E}\sup_{\bm{v}\in\mathcal{T}}[\bm{g}^{\top}\bm{v}] with \bm{g}\sim\mathcal{N}(0,\bm{I}_{d}) is the Gaussian width of \mathcal{T}.
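
As a concrete instance that is used later (in Step 2 of the proof of Theorem 9), for the \ell_{1}-sphere \mathcal{T}=\{\bm{v}\in\mathbb{R}^{d}:\|\bm{v}\|_{1}=1\} one has the standard estimate

\omega(\mathcal{T})=\mathbbm{E}\sup_{\|\bm{v}\|_{1}=1}\bm{g}^{\top}\bm{v}=\mathbbm{E}\|\bm{g}\|_{\infty}\leq\sqrt{2\log(2d)}\lesssim\sqrt{\log d},

which is the Gaussian width bound of [92, Example 7.5.9] invoked in Appendix B.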

Based on Lemma 4, the proofs of Theorems 5-6 are divided into two steps, i.e., showing \lambda\geq 2\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty} and verifying the RSC. While we still have the full \bm{x}_{k} in Theorems 5-6, we will study the more challenging settings where the covariates \bm{x}_{k} are also quantized via \mathcal{Q}_{\bar{\Delta}}(\cdot) in Theorems 9-10, in which we can take \bar{\Delta}=0 to recover the settings of Theorems 5-6. Using this perspective, for most technical ingredients (e.g., the verification of \lambda\geq 2\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}) in the proofs of Theorems 5-6 we can simply refer to the counterparts established in the proofs of Theorems 9-10. This avoids repetition and will be explained more clearly in the proofs.

A.2.1 Proof of Theorem 5

Proof. We divide the proof into two steps.

Step 1. Proving \lambda\geq 2\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}

Recall that we choose \bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{x}_{k}\bm{x}_{k}^{\top} and \bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{x}_{k}. In the setting of Theorem 9, the process of obtaining \dot{y}_{k} remains the same, while the covariates \bm{x}_{k} are further quantized to \bm{\dot{x}}_{k}=\mathcal{Q}_{\bar{\Delta}}(\bm{x}_{k}+\bm{\tau}_{k}) for some \bar{\Delta}>0 under the triangular dither \bm{\tau}_{k}\sim\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d})+\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d}), and there we choose \bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d} and \bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}. As a result, by taking \bar{\Delta}=0, Step 1 in the proof of Theorem 9 implies that under the choice \lambda=C_{1}\frac{\sigma^{2}}{\sqrt{\kappa_{0}}}(\Delta+M^{1/(2l)})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C_{1}, \lambda\geq 2\|\frac{1}{n}\sum_{k=1}^{n}\bm{x}_{k}\bm{x}_{k}^{\top}\bm{\theta^{\star}}-\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{x}_{k}\|_{\infty} holds with probability at least 1-8d^{1-\delta}. Then, by using Lemma 4 we obtain \|\bm{\widehat{\Upsilon}}\|_{1}\leq 10\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2}.

Step 2. Verifying the RSC \bm{\widehat{\Upsilon}}^{\top}\bm{Q}\bm{\widehat{\Upsilon}}\geq\kappa\|\bm{\widehat{\Upsilon}}\|_{2}^{2}

We refer to Step 2 in the proof of Theorem 9. In particular, with the choices \bar{\Delta}=0 and \bm{v}=\bm{\widehat{\Upsilon}} in (B.5), combined with \|\bm{\widehat{\Upsilon}}\|_{1}\leq 10\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2}, we obtain

\frac{1}{\sqrt{n}}\|\bm{X\widehat{\Upsilon}}\|_{2}\geq\sqrt{\kappa_{0}}\|\bm{\widehat{\Upsilon}}\|_{2}-\frac{C_{2}\sqrt{\kappa_{1}}\sigma^{2}}{\kappa_{0}}\sqrt{\frac{\delta s\log d}{n}}\|\bm{\widehat{\Upsilon}}\|_{2}\geq\frac{1}{2}\sqrt{\kappa_{0}}\|\bm{\widehat{\Upsilon}}\|_{2},

where the last inequality is due to the assumed scaling n\gtrsim\delta s\log d. With these preparations, a direct application of Lemma 4 completes the proof. \square

A.2.2 Proof of Theorem 6

Proof. The proof is similarly based on Lemma 4.

Step 1. Proving \lambda\geq 2\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}

Recall that we choose \bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top} and \bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\widetilde{x}}_{k}. In the setting of Theorem 10, the process of obtaining \dot{y}_{k} remains the same, while the truncated covariates \bm{\widetilde{x}}_{k} are further quantized to \bm{\dot{x}}_{k}=\mathcal{Q}_{\bar{\Delta}}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k}) for some \bar{\Delta}\geq 0 under the triangular dither \bm{\tau}_{k}\sim\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d})+\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d}), and there we choose \bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d} and \bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}. As a result, by taking \bar{\Delta}=0, Step 1 in the proof of Theorem 10 implies that our choice \lambda=C_{1}(R\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C_{1} ensures \lambda\geq 2\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty} with the promised probability. By Lemma 4 we obtain \|\bm{\widehat{\Upsilon}}\|_{1}\leq 10\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2}.

Step 2. Verifying the RSC \bm{\widehat{\Upsilon}}^{\top}\bm{Q}\bm{\widehat{\Upsilon}}\geq\kappa\|\bm{\widehat{\Upsilon}}\|_{2}^{2}

Unlike the case of sub-Gaussian covariates, which is based on the matrix deviation inequality (Lemma 5), here we establish a lower bound for \bm{\widehat{\Upsilon}}^{\top}\bm{Q}\bm{\widehat{\Upsilon}} using the bound on \|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty} (Theorem 2). Specifically, setting \Delta=0 in Theorem 2 yields that \|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\lesssim\sqrt{\frac{\delta M\log d}{n}} holds with probability at least 1-2d^{2-\delta}, which allows us to proceed as follows:

\bm{\widehat{\Upsilon}}^{\top}\bm{Q}\bm{\widehat{\Upsilon}}=\bm{\widehat{\Upsilon}}^{\top}\bm{\Sigma^{\star}}\bm{\widehat{\Upsilon}}-\bm{\widehat{\Upsilon}}^{\top}(\bm{\Sigma^{\star}}-\bm{Q})\bm{\widehat{\Upsilon}} (A.12)
\stackrel{(i)}{\geq}\kappa_{0}\|\bm{\widehat{\Upsilon}}\|_{2}^{2}-\sqrt{\frac{\delta M\log d}{n}}\|\bm{\widehat{\Upsilon}}\|_{1}^{2}
\stackrel{(ii)}{\geq}\Big(\kappa_{0}-C_{6}s\sqrt{\frac{\delta M\log d}{n}}\Big)\|\bm{\widehat{\Upsilon}}\|^{2}_{2}\stackrel{(iii)}{\geq}\frac{\kappa_{0}}{2}\|\bm{\widehat{\Upsilon}}\|^{2}_{2},

where (i) is because \bm{\widehat{\Upsilon}}^{\top}(\bm{\Sigma^{\star}}-\bm{Q})\bm{\widehat{\Upsilon}}\leq\|\bm{\widehat{\Upsilon}}\|_{1}^{2}\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}, (ii) is due to \|\bm{\widehat{\Upsilon}}\|_{1}\leq 10\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2}, and (iii) is due to the assumed scaling n\gtrsim\delta s^{2}\log d. Now the desired results follow immediately from Lemma 4. \square

A.3 Quantized Matrix Completion

Under the observation model (3.7), we first provide a deterministic framework for analysing the estimator (3.8).

Lemma 6.

(Adapted from [21, Coro. 3]). Let \bm{\widehat{\Upsilon}}:=\bm{\widehat{\Theta}}-\bm{\Theta^{\star}}. If

\lambda\geq 2\left\|\frac{1}{n}\sum_{k=1}^{n}\big(\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle-\dot{y}_{k}\big)\bm{X}_{k}\right\|_{op}, (A.13)

then it holds that \|\bm{\widehat{\Upsilon}}\|_{nu}\leq 10\sqrt{r}\|\bm{\widehat{\Upsilon}}\|_{F}. Moreover, if for some \kappa>0 we have the restricted strong convexity (RSC) \frac{1}{n}\sum_{k=1}^{n}\big|\big\langle\bm{X}_{k},\bm{\widehat{\Upsilon}}\big\rangle\big|^{2}\geq\kappa\|\bm{\widehat{\Upsilon}}\|_{F}^{2}, then we have the error bounds \|\bm{\widehat{\Upsilon}}\|_{F}\leq 30\sqrt{r}\big(\frac{\lambda}{\kappa}\big) and \|\bm{\widehat{\Upsilon}}\|_{nu}\leq 300r\big(\frac{\lambda}{\kappa}\big).

Clearly, to derive the statistical error rate of \bm{\widehat{\Theta}} from Lemma 6, the key ingredients are (A.13) and the RSC. Specialized to the covariates \bm{X}_{k}\sim\mathscr{U}\big(\{\bm{e}_{i}\bm{e}_{j}^{\top}:i,j\in[d]\}\big) in matrix completion, we will use the following lemma to establish the RSC.

Lemma 7.

(Adapted from [21, Lem. 4] with q=0). Given some \alpha>0,\delta>0, we define the constraint set \mathcal{C}(\psi), with sufficiently large \psi, as

\mathcal{C}(\psi)=\Big\{\bm{\Theta}\in\mathbb{R}^{d\times d}:\|\bm{\Theta}\|_{\infty}\leq 2\alpha,~\|\bm{\Theta}\|_{nu}\leq 10\sqrt{r}\|\bm{\Theta}\|_{F},~\|\bm{\Theta}\|_{F}^{2}\geq(\alpha d)^{2}\sqrt{\frac{\psi\delta\log d}{n}}\Big\}. (A.14)

Let \bm{X}_{1},...,\bm{X}_{n} be i.i.d. uniformly distributed on \{\bm{e}_{i}\bm{e}_{j}^{\top}:i,j\in[d]\}. Then there exist absolute constants \kappa\in(0,1) and C such that, with probability at least 1-d^{-\delta}, we have

\frac{1}{n}\sum_{k=1}^{n}\big|\big\langle\bm{X}_{k},\bm{\Theta}\big\rangle\big|^{2}\geq\frac{\kappa\|\bm{\Theta}\|_{F}^{2}}{d^{2}}-\frac{C\alpha^{2}rd\log d}{n},~\forall~\bm{\Theta}\in\mathcal{C}(\psi). (A.15)

Matrix completion with sub-exponential noise was studied in [51], and we make use of the following Lemma in the sub-exponential case.

Lemma 8.

(Adapted from [51, Lem. 5]). Given some \delta>0, let \bm{X}_{1},...,\bm{X}_{n} be i.i.d. uniformly distributed on \{\bm{e}_{i}\bm{e}_{j}^{\top}:i,j\in[d]\}, and let \epsilon_{1},...,\epsilon_{n} be i.i.d. zero-mean random variables, independent of the \bm{X}_{k}, satisfying \|\epsilon_{k}\|_{\psi_{1}}\leq\sigma. If n\gtrsim\delta d\log^{3}d, then with probability at least 1-d^{-\delta} we have

\Big\|\frac{1}{n}\sum_{k=1}^{n}\epsilon_{k}\bm{X}_{k}\Big\|_{op}\leq\sigma\sqrt{\frac{\delta\log d}{nd}}.

A.3.1 Proof of Theorem 7

Proof. We divide the proof into two steps.

Step 1. Proving (A.13)

Defining w_{k}:=\dot{y}_{k}-y_{k}-\tau_{k} as the quantization error, from Theorem 1(a) we know that the w_{k} are independent of \bm{X}_{k} and i.i.d. uniformly distributed on [-\frac{\Delta}{2},\frac{\Delta}{2}]. Thus, we can further write that

\dot{y}_{k}=y_{k}+\tau_{k}+w_{k}=\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle+\epsilon_{k}+\tau_{k}+w_{k},

which allows us to decompose I:=\|\frac{1}{n}\sum_{k=1}^{n}(\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle-\dot{y}_{k})\bm{X}_{k}\|_{op} (the quantity to be bounded in (A.13)) into

I\leq\Big\|\frac{1}{n}\sum_{k=1}^{n}\epsilon_{k}\bm{X}_{k}\Big\|_{op}+\Big\|\frac{1}{n}\sum_{k=1}^{n}\tau_{k}\bm{X}_{k}\Big\|_{op}+\Big\|\frac{1}{n}\sum_{k=1}^{n}w_{k}\bm{X}_{k}\Big\|_{op}=I_{1}+I_{2}+I_{3}.

Because the \epsilon_{k} are independent of the \bm{X}_{k} and are i.i.d. sub-exponential noise satisfying \|\epsilon_{k}\|_{\psi_{1}}\leq\sigma, under the scaling n\gtrsim\delta d\log^{3}d, Lemma 8 implies that I_{1}\lesssim\sigma\sqrt{\frac{\delta\log d}{nd}} holds with probability at least 1-d^{-\delta}. Analogously, the \tau_{k} and w_{k} are independent of \{\bm{X}_{k}:k\in[n]\} and are i.i.d. uniformly distributed on [-\frac{\Delta}{2},\frac{\Delta}{2}], so Lemma 8 also applies to I_{2} and I_{3}, yielding that with the promised probability I_{2}+I_{3}\lesssim\Delta\sqrt{\frac{\delta\log d}{nd}}. Taken collectively, I\lesssim(\sigma+\Delta)\sqrt{\frac{\delta\log d}{nd}}, so setting \lambda=C_{1}(\sigma+\Delta)\sqrt{\frac{\delta\log d}{nd}} with sufficiently large C_{1} ensures \lambda\geq 2I with probability at least 1-3d^{-\delta}. Further, Lemma 6 gives \|\bm{\widehat{\Upsilon}}\|_{nu}\leq 10\sqrt{r}\|\bm{\widehat{\Upsilon}}\|_{F}.

Step 2. Verifying RSC

First note that \|\bm{\widehat{\Upsilon}}\|_{\infty}\leq\|\bm{\widehat{\Theta}}\|_{\infty}+\|\bm{\Theta^{\star}}\|_{\infty}\leq 2\alpha; and as proved before, \|\bm{\widehat{\Upsilon}}\|_{nu}\leq 10\sqrt{r}\|\bm{\widehat{\Upsilon}}\|_{F}. To proceed, we define the constraint set \mathcal{C}(\psi) as in (A.14) with some properly chosen constant \psi. Then, using Lemma 7, (A.15) holds with probability at least 1-d^{-\delta} for some absolute constants \kappa,C. We now discuss several cases.

1) If \bm{\widehat{\Upsilon}}\notin\mathcal{C}(\psi), because \bm{\widehat{\Upsilon}} satisfies the first two constraints in the definition of \mathcal{C}(\psi), it must violate the third constraint and satisfy \|\bm{\widehat{\Upsilon}}\|_{F}^{2}\leq(\alpha d)^{2}\sqrt{\frac{\psi\delta\log d}{n}}, which gives \|\bm{\widehat{\Upsilon}}\|_{F}\lesssim\alpha d\big(\frac{\delta\log d}{n}\big)^{1/4}\stackrel{(i)}{\lesssim}\alpha d\sqrt{\frac{\delta rd\log d}{n}}, as desired. Note that (i) is due to the scaling n\lesssim\delta r^{2}d^{2}\log d.

2) If \bm{\widehat{\Upsilon}}\in\mathcal{C}(\psi), (A.15) implies that \frac{1}{n}\sum_{k=1}^{n}\big|\big\langle\bm{X}_{k},\bm{\widehat{\Upsilon}}\big\rangle\big|^{2}\geq\frac{\kappa\|\bm{\widehat{\Upsilon}}\|_{F}^{2}}{d^{2}}-C\frac{\alpha^{2}rd\log d}{n}, and we further consider the following two cases.

2.1) If C\frac{\alpha^{2}rd\log d}{n}\geq\frac{\kappa\|\bm{\widehat{\Upsilon}}\|_{F}^{2}}{2d^{2}}, we have \|\bm{\widehat{\Upsilon}}\|_{F}\lesssim\alpha d\sqrt{\frac{rd\log d}{n}}, as desired.

2.2) If C\frac{\alpha^{2}rd\log d}{n}<\frac{\kappa\|\bm{\widehat{\Upsilon}}\|_{F}^{2}}{2d^{2}}, then the RSC condition holds: \frac{1}{n}\sum_{k=1}^{n}\big|\big\langle\bm{X}_{k},\bm{\widehat{\Upsilon}}\big\rangle\big|^{2}\geq\frac{\kappa\|\bm{\widehat{\Upsilon}}\|^{2}_{F}}{2d^{2}}. This allows us to apply Lemma 6 to obtain \|\bm{\widehat{\Upsilon}}\|_{F}\lesssim(\sigma+\Delta)d\sqrt{\frac{\delta rd\log d}{n}}.

Thus, in any case, we have shown \|\bm{\widehat{\Upsilon}}\|_{F}=O\big((\alpha+\sigma+\Delta)d\sqrt{\frac{\delta rd\log d}{n}}\big). The bound on \|\bm{\widehat{\Upsilon}}\|_{nu} follows immediately from \|\bm{\widehat{\Upsilon}}\|_{nu}\leq 10\sqrt{r}\|\bm{\widehat{\Upsilon}}\|_{F}. The proof is complete. \square

A.3.2 Proof of Theorem 8

Proof. The proof is based on Lemma 6 and divided into two steps.

Step 1. Proving (A.13)

Recall that the quantization error w_{k}:=\dot{y}_{k}-\widetilde{y}_{k}-\tau_{k} is zero-mean and independent of \bm{X}_{k} (Theorem 1(a)); thus we have \mathbbm{E}(\dot{y}_{k}\bm{X}_{k})=\mathbbm{E}(\widetilde{y}_{k}\bm{X}_{k})+\mathbbm{E}(\tau_{k}\bm{X}_{k})+\mathbbm{E}(w_{k}\bm{X}_{k})=\mathbbm{E}(\widetilde{y}_{k}\bm{X}_{k}). Combining this with \mathbbm{E}\big(\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\bm{X}_{k}\big)=\mathbbm{E}(y_{k}\bm{X}_{k}), the triangle inequality first decomposes the target term into

\Big\|\frac{1}{n}\sum_{k=1}^{n}\big(\dot{y}_{k}-\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\big)\bm{X}_{k}\Big\|_{op}\leq\Big\|\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{X}_{k}-\mathbbm{E}(\dot{y}_{k}\bm{X}_{k})\Big\|_{op}+\Big\|\mathbbm{E}\big(y_{k}\bm{X}_{k}-\widetilde{y}_{k}\bm{X}_{k}\big)\Big\|_{op}
+\Big\|\frac{1}{n}\sum_{k=1}^{n}\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\bm{X}_{k}-\mathbbm{E}\big(\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\bm{X}_{k}\big)\Big\|_{op}:=I_{1}+I_{2}+I_{3}.

Step 1.1. Bounding I_{1} and I_{3}

We write I_{1}=\|\sum_{k=1}^{n}\bm{S}_{k}\|_{op} and I_{3}=\|\sum_{k=1}^{n}\bm{W}_{k}\|_{op} by defining

\bm{S}_{k}=\frac{1}{n}\big(\dot{y}_{k}\bm{X}_{k}-\mathbbm{E}(\dot{y}_{k}\bm{X}_{k})\big),\quad\bm{W}_{k}=\frac{1}{n}\big(\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\bm{X}_{k}-\mathbbm{E}\big[\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\bm{X}_{k}\big]\big).

By |\dot{y}_{k}|\leq|\widetilde{y}_{k}|+|\tau_{k}|+|w_{k}|\leq\zeta_{y}+\Delta we have

\|\bm{S}_{k}\|_{op}\leq\frac{1}{n}\|\dot{y}_{k}\bm{X}_{k}\|_{op}+\frac{1}{n}\|\mathbbm{E}(\dot{y}_{k}\bm{X}_{k})\|_{op}\leq\frac{1}{n}\|\dot{y}_{k}\bm{X}_{k}\|_{op}+\frac{1}{n}\mathbbm{E}\|\dot{y}_{k}\bm{X}_{k}\|_{op}\leq\frac{2(\zeta_{y}+\Delta)}{n}.

Analogously, we have \|\bm{W}_{k}\|_{op}\leq\frac{2\alpha}{n} since |\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle|\leq\|\bm{\Theta^{\star}}\|_{\infty}\leq\alpha. In addition, by \|\mathbbm{E}\{(\bm{A}-\mathbbm{E}\bm{A})^{\top}(\bm{A}-\mathbbm{E}\bm{A})\}\|_{op}\leq\|\mathbbm{E}(\bm{A}^{\top}\bm{A})\|_{op} (\forall\bm{A}) and the simple fact \mathbbm{E}(\bm{X}_{k}\bm{X}_{k}^{\top})=\mathbbm{E}(\bm{X}_{k}^{\top}\bm{X}_{k})=\bm{I}_{d}/d, we estimate the matrix variance statistic as follows

\Big\|\sum_{k=1}^{n}\mathbbm{E}(\bm{S}_{k}\bm{S}_{k}^{\top})\Big\|_{op}=n\big\|\mathbbm{E}(\bm{S}_{k}\bm{S}_{k}^{\top})\big\|_{op}\leq\frac{1}{n}\big\|\mathbbm{E}\big(\dot{y}_{k}^{2}\bm{X}_{k}\bm{X}_{k}^{\top}\big)\big\|_{op}
=\frac{1}{n}\sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}\big(\dot{y}_{k}^{2}\cdot\|\bm{X}_{k}^{\top}\bm{v}\|_{2}^{2}\big)=\frac{1}{n}\sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}_{\bm{X}_{k}}\Big(\big[\mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(\dot{y}_{k}^{2})\big]\|\bm{X}_{k}^{\top}\bm{v}\|_{2}^{2}\Big)
\stackrel{(i)}{\leq}\frac{4}{n}(\alpha^{2}+M+\Delta^{2})\sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}_{\bm{X}_{k}}\|\bm{X}_{k}^{\top}\bm{v}\|^{2}_{2}\leq\frac{4(\alpha^{2}+M+\Delta^{2})}{nd},

where (i) is because given \bm{X}_{k} we can estimate \mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(\dot{y}_{k}^{2})\leq 2\big(\mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(\widetilde{y}_{k}^{2})+\Delta^{2}\big) since |\dot{y}_{k}-\widetilde{y}_{k}|\leq\Delta, and moreover \mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(\widetilde{y}_{k}^{2})\leq\mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(y_{k}^{2})\leq 2\big(\mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle^{2})+\mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(\epsilon_{k}^{2})\big)\leq 2(\alpha^{2}+M). It is not hard to see that this bound remains valid for \|\sum_{k=1}^{n}\mathbbm{E}(\bm{S}_{k}^{\top}\bm{S}_{k})\|_{op}. Also, by similar arguments one can prove

\max\left\{\Big\|\sum_{k=1}^{n}\mathbbm{E}(\bm{W}_{k}^{\top}\bm{W}_{k})\Big\|_{op},\Big\|\sum_{k=1}^{n}\mathbbm{E}(\bm{W}_{k}\bm{W}_{k}^{\top})\Big\|_{op}\right\}\leq\frac{\alpha^{2}}{nd}.

Thus, Matrix Bernstein’s inequality (Lemma 2) gives

\mathbbm{P}\big(I_{1}\geq t\big)\leq 2d\cdot\exp\Big(-\frac{C_{4}ndt^{2}}{(\alpha^{2}+M+\Delta^{2})+(\zeta_{y}+\Delta)dt}\Big),\qquad\mathbbm{P}\big(I_{3}\geq t\big)\leq 2d\cdot\exp\Big(-\frac{C_{5}ndt^{2}}{\alpha^{2}+\alpha dt}\Big).

Thus, setting t=C_{6}(\alpha+\sqrt{M}+\Delta)\sqrt{\frac{\delta\log d}{nd}} in the two inequalities above with sufficiently large C_{6}, combined with the scaling \sqrt{\frac{\delta d\log d}{n}}=O(1), we obtain that I_{1}+I_{3}\lesssim(\alpha+\sqrt{M}+\Delta)\sqrt{\frac{\delta\log d}{nd}} holds with probability at least 1-4d^{1-\delta}.

Step 1.2. Bounding I_{2}

Let us first bound \|\mathbbm{E}\big((y_{k}-\widetilde{y}_{k})\bm{X}_{k}\big)\|_{\infty}. Write the (i,j)-th entry of \bm{X}_{k} as x_{k,ij}; then for a given (i,j), \mathbbm{P}(x_{k,ij}=1)=d^{-2} and x_{k,ij}=0 otherwise. We can thus proceed by the following estimates:

\big|\mathbbm{E}\big((y_{k}-\widetilde{y}_{k})x_{k,ij}\big)\big|=\big|\mathbbm{E}\big((y_{k}-\widetilde{y}_{k})x_{k,ij}\mathbbm{1}(|y_{k}|\geq\zeta_{y})\big)\big|
\leq\mathbbm{E}\big(|y_{k}|x_{k,ij}\mathbbm{1}(|y_{k}|\geq\zeta_{y})\big)
=\mathbbm{E}_{x_{k,ij}}\big(\big\{\mathbbm{E}_{y_{k}|x_{k,ij}}|y_{k}|\mathbbm{1}(|y_{k}|\geq\zeta_{y})\big\}x_{k,ij}\big)
=d^{-2}\mathbbm{E}_{y_{k}|x_{k,ij}=1}\big(|y_{k}|\mathbbm{1}(|y_{k}|\geq\zeta_{y})\big)
\stackrel{(i)}{\leq}d^{-2}\sqrt{\mathbbm{E}_{y_{k}\sim\theta^{\star}_{ij}+\epsilon_{k}}(y_{k}^{2})}\sqrt{\mathbbm{P}_{y_{k}\sim\theta^{\star}_{ij}+\epsilon_{k}}(y_{k}^{2}\geq\zeta_{y}^{2})}
\stackrel{(ii)}{\leq}d^{-2}\frac{\alpha^{2}+M}{\zeta_{y}}\lesssim\frac{\alpha+\sqrt{M}}{d^{2}}\sqrt{\frac{\delta d\log d}{n}},

where (i) and (ii) are by Cauchy-Schwarz and Markov's inequality, respectively. Since this holds for any (i,j), we obtain \|\mathbbm{E}\big((y_{k}-\widetilde{y}_{k})\bm{X}_{k}\big)\|_{\infty}=O\big((\alpha+\sqrt{M})d^{-2}\sqrt{\frac{\delta d\log d}{n}}\big), which further gives I_{2}=O\big((\alpha+\sqrt{M})\sqrt{\frac{\delta\log d}{nd}}\big) by using \|\bm{A}\|_{op}\leq d\|\bm{A}\|_{\infty} (\forall\bm{A}\in\mathbb{R}^{d\times d}). Putting pieces together, with probability at least 1-4d^{1-\delta} we have \|\frac{1}{n}\sum_{k}\big(\dot{y}_{k}-\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\big)\bm{X}_{k}\|_{op}\lesssim(\alpha+\sqrt{M}+\Delta)\sqrt{\frac{\delta\log d}{nd}}, hence \lambda=C_{1}(\alpha+\sqrt{M}+\Delta)\sqrt{\frac{\delta\log d}{nd}} ensures (A.13) with the same probability. Further, Lemma 6 gives \|\bm{\widehat{\Upsilon}}\|_{nu}\leq 10\sqrt{r}\|\bm{\widehat{\Upsilon}}\|_{F}.
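
For completeness, the elementary fact \|\bm{A}\|_{op}\leq d\|\bm{A}\|_{\infty} used above (with \|\bm{A}\|_{\infty} denoting the entrywise maximum) can be checked via the Frobenius norm:

\|\bm{A}\|_{op}\leq\|\bm{A}\|_{F}=\Big(\sum_{i,j\in[d]}|a_{ij}|^{2}\Big)^{1/2}\leq\big(d^{2}\|\bm{A}\|_{\infty}^{2}\big)^{1/2}=d\|\bm{A}\|_{\infty}.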

Step 2. Verifying RSC

The remaining part is almost the same as Step 2 in the proof of Theorem 7: we define the constraint set \mathcal{C}(\psi) as in (A.14) and then discuss several cases based on whether \bm{\widehat{\Upsilon}}\in\mathcal{C}(\psi) holds. Thus, we conclude the proof without providing the details. \square

Appendix B Proofs in Section 4

This appendix collects the proofs in Section 4 concerning covariate quantization and uniform signal recovery in QCS.

B.1 Covariate Quantization

Because of the non-convexity, the proofs in this part can no longer be based on Lemma 4. Indeed, bounding the estimation errors of the \bm{\widetilde{\theta}} satisfying (4.2) requires more tedious manipulations, essentially due to the additional \ell_{1} constraint (induced by the constraint \mathcal{S} in (4.2)).

B.1.1 Proof of Theorem 9

Proof. The proof is divided into three steps — the first two steps resemble the previous proofs that are based on Lemma 4, while we bound the estimation errors in the last step.

Step 1. Proving \lambda\geq\beta\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty} for some pre-specified \beta>2

Recall that (\bm{Q},\bm{b}) are constructed from the quantized data as \bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d} and \bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}. We will show that \lambda=C_{1}\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}(\Delta+M^{1/(2l)})\sqrt{\frac{\delta\log d}{n}} guarantees \lambda\geq\beta\|\frac{1}{n}\sum_{k=1}^{n}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d})\bm{\theta^{\star}}-\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}\|_{\infty} with the promised probability, where \beta>2 is any pre-specified constant. Recall the notation we introduced: \dot{y}_{k}=\widetilde{y}_{k}+\phi_{k}+\vartheta_{k} with the quantization error \vartheta_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) being independent of \widetilde{y}_{k}, and \bm{\dot{x}}_{k}=\bm{x}_{k}+\bm{\tau}_{k}+\bm{w}_{k} with the quantization error \bm{w}_{k}\sim\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d}) being independent of \bm{x}_{k}. Combining these with the assumptions that the dithers are independent of (\bm{x}_{k},y_{k}) and that the \phi_{k} and \bm{\tau}_{k} are independent, we have

\mathbbm{E}(\dot{y}_{k}\bm{\dot{x}}_{k})=\mathbbm{E}\big((\widetilde{y}_{k}+\phi_{k}+\vartheta_{k})(\bm{x}_{k}+\bm{\tau}_{k}+\bm{w}_{k})\big)=\mathbbm{E}(\widetilde{y}_{k}\bm{x}_{k}),\qquad\mathbbm{E}\Big(\Big[\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}\Big]\bm{\theta^{\star}}\Big)=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}\bm{\theta^{\star}})=\mathbbm{E}(y_{k}\bm{x}_{k}), (B.1)

which allows us to decompose the target term as two concentration terms (I_{1},I_{3}) and a bias term (I_{2}):

\Big\|\frac{1}{n}\sum_{k=1}^{n}\Big[\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}\Big]\bm{\theta^{\star}}-\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}\Big\|_{\infty}\leq\Big\|\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}-\mathbbm{E}(\dot{y}_{k}\bm{\dot{x}}_{k})\Big\|_{\infty}+\Big\|\mathbbm{E}\big(y_{k}\bm{x}_{k}-\widetilde{y}_{k}\bm{x}_{k}\big)\Big\|_{\infty}+\Big\|\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}}-\mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}})\Big\|_{\infty}:=I_{1}+I_{2}+I_{3}.

Step 1.1. Bounding I_{1}

Denote the i-th entry of \bm{x}_{k},\bm{\dot{x}}_{k},\bm{\tau}_{k},\bm{w}_{k} by x_{ki},\dot{x}_{ki},\tau_{ki},w_{ki}, respectively. For I_{1}, the i-th entry reads \frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\dot{x}_{ki}-\mathbbm{E}(\dot{y}_{k}\dot{x}_{ki}). By using the relations |\dot{y}_{k}|\leq|\widetilde{y}_{k}|+|\phi_{k}|+|\vartheta_{k}|\leq\zeta_{y}+\Delta, \|\bm{\dot{x}}_{k}\|_{\psi_{2}}\leq\|\bm{x}_{k}\|_{\psi_{2}}+\|\bm{\tau}_{k}\|_{\psi_{2}}+\|\bm{w}_{k}\|_{\psi_{2}}\lesssim\sigma+\bar{\Delta} and \mathbbm{E}|\dot{y}_{k}|^{2l}\lesssim\mathbbm{E}|\widetilde{y}_{k}|^{2l}+\mathbbm{E}|\phi_{k}+\vartheta_{k}|^{2l}\lesssim M+\Delta^{2l}, for any integer q\geq 2 we can bound that

\sum_{k=1}^{n}\mathbbm{E}\Big|\frac{\dot{y}_{k}\dot{x}_{ki}}{n}\Big|^{q}\leq\frac{(\zeta_{y}+\Delta)^{q-2}}{n^{q}}\sum_{k=1}^{n}\mathbbm{E}|\dot{y}_{k}^{2}\dot{x}_{ki}^{q}| (B.2)
\stackrel{(i)}{\leq}\frac{(\zeta_{y}+\Delta)^{q-2}}{n^{q}}\sum_{k=1}^{n}\big\{\mathbbm{E}|\dot{y}_{k}|^{2l}\big\}^{\frac{1}{l}}\big\{\mathbbm{E}|\dot{x}_{ki}|^{\frac{lq}{l-1}}\big\}^{1-\frac{1}{l}}
\stackrel{(ii)}{\lesssim}\Big(\frac{(\sigma+\bar{\Delta})(\zeta_{y}+\Delta)}{n}\Big)^{q-2}\Big(\frac{(\sigma+\bar{\Delta})^{2}(M^{\frac{1}{l}}+\Delta^{2})}{n}\Big)\Big(\sqrt{\frac{lq}{l-1}}\Big)^{q};

combining with Stirling's approximation and treating l as an absolute constant, this provides

\sum_{k=1}^{n}\mathbbm{E}\Big|\frac{\dot{y}_{k}\dot{x}_{ki}}{n}\Big|^{q}\leq\frac{q!}{2}v_{0}c_{0}^{q-2}\text{ where }v_{0}=O\Big(\frac{(\sigma+\bar{\Delta})^{2}(M^{{1}/{l}}+\Delta^{2})}{n}\Big),~c_{0}=O\Big(\frac{(\sigma+\bar{\Delta})(\zeta_{y}+\Delta)}{n}\Big).

In (B.2), (i) is due to Hölder's inequality, and in (ii) we use the moment constraint of sub-Gaussian variables (2.2). With these preparations, we invoke Bernstein's inequality (Lemma 1) and then a union bound over i\in[d] to obtain

\mathbbm{P}\Big(I_{1}\lesssim(\sigma+\bar{\Delta})(M^{\frac{1}{2l}}+\Delta)\sqrt{\frac{t}{n}}+\frac{(\sigma+\bar{\Delta})(\zeta_{y}+\Delta)t}{n}\Big)\geq 1-2d\cdot\exp(-t).

Thus, taking t=\delta\log d and plugging in \zeta_{y}\asymp\sqrt{\frac{nM^{1/l}}{\delta\log d}}, we obtain

\mathbbm{P}\Big(I_{1}\lesssim(\sigma+\bar{\Delta})(M^{1/(2l)}+\Delta)\sqrt{\frac{\delta\log d}{n}}\Big)\geq 1-2d^{1-\delta}.
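
To spell out how the last display follows, note that with t=\delta\log d and \zeta_{y}\asymp\sqrt{\frac{nM^{1/l}}{\delta\log d}}, the second term in the preceding probability bound satisfies

\frac{(\sigma+\bar{\Delta})(\zeta_{y}+\Delta)\delta\log d}{n}\lesssim(\sigma+\bar{\Delta})\Big(M^{\frac{1}{2l}}\sqrt{\frac{\delta\log d}{n}}+\frac{\Delta\delta\log d}{n}\Big)\lesssim(\sigma+\bar{\Delta})(M^{1/(2l)}+\Delta)\sqrt{\frac{\delta\log d}{n}},

where the last step uses \delta\log d\lesssim n.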

Step 1.2. Bounding I_{2}

Moreover, we estimate the i-th entry of I_{2} by

|\mathbbm{E}\big((y_{k}-\widetilde{y}_{k})x_{ki}\big)|\leq\mathbbm{E}|y_{k}x_{ki}\mathbbm{1}(|y_{k}|\geq\zeta_{y})| (B.3)
\stackrel{(i)}{\leq}\big(\mathbbm{E}|y_{k}|^{\frac{2l}{2l-1}}|x_{ki}|^{\frac{2l}{2l-1}}\big)^{1-\frac{1}{2l}}\big(\mathbbm{P}(|y_{k}|\geq\zeta_{y})\big)^{\frac{1}{2l}}
\stackrel{(ii)}{\leq}\Big(\big[\mathbbm{E}|y_{k}|^{2l}\big]^{\frac{1}{2l-1}}\big[\mathbbm{E}|x_{ki}|^{\frac{l}{l-1}}\big]^{\frac{2l-2}{2l-1}}\Big)^{1-\frac{1}{2l}}\Big(\mathbbm{P}(|y_{k}|^{2l}\geq\zeta_{y}^{2l})\Big)^{\frac{1}{2l}}
\stackrel{(iii)}{\leq}\frac{\sigma M^{1/l}}{\zeta_{y}}\lesssim\sigma M^{\frac{1}{2l}}\sqrt{\frac{\delta\log d}{n}},

where (i) and (ii) are due to Hölder's inequality, and (iii) is due to Markov's inequality. Since this holds for every i\in[d], it gives I_{2}\lesssim\sigma M^{1/(2l)}\sqrt{\frac{\delta\log d}{n}}.
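
For clarity, the last step in (B.3) only plugs in the truncation level \zeta_{y}\asymp\sqrt{\frac{nM^{1/l}}{\delta\log d}}:

\frac{\sigma M^{1/l}}{\zeta_{y}}\asymp\sigma M^{1/l}\sqrt{\frac{\delta\log d}{nM^{1/l}}}=\sigma M^{\frac{1}{2l}}\sqrt{\frac{\delta\log d}{n}}.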

Step 1.3. Bounding I_{3}

We first derive a bound for \|\bm{\theta^{\star}}\|_{2} that is implicitly implied by other conditions:

M^{1/l}\geq\mathbbm{E}|y_{k}|^{2}\geq\mathbbm{E}(\bm{x}_{k}^{\top}\bm{\theta^{\star}})^{2}=(\bm{\theta^{\star}})^{\top}\bm{\Sigma^{\star}}\bm{\theta^{\star}}\geq\kappa_{0}\|\bm{\theta^{\star}}\|_{2}^{2}\Longrightarrow\|\bm{\theta^{\star}}\|_{2}=O\Big(\frac{M^{1/(2l)}}{\sqrt{\kappa_{0}}}\Big).

Hence, we can estimate \|(\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}})\dot{x}_{ki}\|_{\psi_{1}}\leq\|\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}}\|_{\psi_{2}}\|\dot{x}_{ki}\|_{\psi_{2}}\leq\|\bm{\dot{x}}_{k}\|_{\psi_{2}}^{2}\|\bm{\theta^{\star}}\|_{2}\lesssim(\sigma+\bar{\Delta})^{2}\frac{M^{1/(2l)}}{\sqrt{\kappa_{0}}}. Then, we invoke Bernstein's inequality regarding the independent sum of sub-exponential random variables (e.g., [92, Thm. 2.8.1]) to obtain that for any t\geq 0 we have (Footnote 17: The application of Bernstein's inequality leads to the \sigma^{2} dependence (\sigma is the upper bound on \|\bm{x}_{k}\|_{\psi_{2}}) in the multiplicative factor \mathscr{L}. It is possible to refine this quadratic dependence by using a new Bernstein's inequality developed in [49, Thm. 1.3], but we do not pursue this in the present paper.)

\mathbbm{P}\left(\Big|\frac{1}{n}\sum_{k=1}^{n}(\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}})\dot{x}_{ki}-\mathbbm{E}\{(\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}})\dot{x}_{ki}\}\Big|\geq t\right)\leq 2\exp\left(-C_{5}n\min\left\{\frac{\sqrt{\kappa_{0}}t}{(\sigma+\bar{\Delta})^{2}M^{\frac{1}{2l}}},\Big(\frac{\sqrt{\kappa_{0}}t}{(\sigma+\bar{\Delta})^{2}M^{\frac{1}{2l}}}\Big)^{2}\right\}\right).

Hence, we can set t=C_{6}\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}M^{1/(2l)}\sqrt{\frac{\delta\log d}{n}} with sufficiently large C_{6} and further apply a union bound over i\in[d]; under the scaling that \frac{\delta\log d}{n} is small enough, we obtain that I_{3}\lesssim\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}M^{1/(2l)}\sqrt{\frac{\delta\log d}{n}} holds with probability at least 1-2d^{1-\delta}.

Putting pieces together, since \kappa_{0}\lesssim\sigma^{2}, it is immediate that I\lesssim\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}\big(\Delta+M^{1/(2l)}\big)\sqrt{\frac{\delta\log d}{n}} holds with probability at least 1-8d^{1-\delta}. Since \lambda=C_{1}\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}(\Delta+M^{1/(2l)})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C_{1}, \lambda\geq\beta\cdot\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty} holds w.h.p.
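
To see why the bounds on I_{1},I_{2},I_{3} can be unified under the factor \frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}, note that \kappa_{0}\lesssim\sigma^{2} gives

\sigma+\bar{\Delta}\lesssim\frac{(\sigma+\bar{\Delta})\sigma}{\sqrt{\kappa_{0}}}\leq\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}\quad\text{and}\quad\sigma\lesssim\frac{\sigma^{2}}{\sqrt{\kappa_{0}}}\leq\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}},

so the prefactors (\sigma+\bar{\Delta}) in the bound on I_{1} and \sigma in the bound on I_{2} are both dominated by \frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}.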

Step 2. Verifying RSC

We provide a lower bound for \bm{v}^{\top}\bm{Q}\bm{v}=\frac{1}{n}\|\bm{\dot{X}v}\|_{2}^{2}-\frac{\bar{\Delta}^{2}}{4}\|\bm{v}\|_{2}^{2} by using the matrix deviation inequality (Lemma 5). First note that the rows of \bm{\dot{X}} are sub-Gaussian with \|\bm{\dot{x}}_{k}\|_{\psi_{2}}\lesssim\sigma+\bar{\Delta}. Since \mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top})=\bm{\Sigma^{\star}}+\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}, all eigenvalues of \bm{\dot{\Sigma}}:=\mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}) are between [\kappa_{0}+\frac{1}{4}\bar{\Delta}^{2},\kappa_{1}+\frac{1}{4}\bar{\Delta}^{2}]. Thus, we invoke Lemma 5 for \mathcal{T}:=\{\bm{v}\in\mathbb{R}^{d}:\|\bm{v}\|_{1}=1\} with u=\sqrt{\delta\log d}; due to the well-known Gaussian width estimate \omega(\mathcal{T})\lesssim\sqrt{\log d} [92, Example 7.5.9], with probability at least 1-d^{-\delta} the following event holds:

\sup_{\|\bm{v}\|_{1}=1}\Big|\|\bm{\dot{X}v}\|_{2}-\sqrt{n}\|\bm{\dot{\Sigma}}^{1/2}\bm{v}\|_{2}\Big|\leq\frac{c_{1}\sqrt{\kappa_{1}+\frac{1}{4}\bar{\Delta}^{2}}(\sigma+\bar{\Delta})^{2}}{\kappa_{0}+\frac{1}{4}\bar{\Delta}^{2}}\sqrt{\delta\log d}:=c_{1}\mathscr{L}_{1}\sqrt{\delta\log d}.

Under the same probability, a simple rescaling then provides

\Big|\frac{1}{\sqrt{n}}\|\bm{\dot{X}v}\|_{2}-\|\bm{\dot{\Sigma}}^{1/2}\bm{v}\|_{2}\Big|\leq c_{1}\mathscr{L}_{1}\sqrt{\frac{\delta\log d}{n}}\|\bm{v}\|_{1},~\forall~\bm{v}\in\mathbb{R}^{d}, (B.4)

which implies

\frac{1}{\sqrt{n}}\|\bm{\dot{X}v}\|_{2}\geq\|\bm{\dot{\Sigma}}^{1/2}\bm{v}\|_{2}-c_{1}\mathscr{L}_{1}\Big(\frac{\delta\log d}{n}\Big)^{1/2}\|\bm{v}\|_{1}\geq\Big(\kappa_{0}+\frac{1}{4}\bar{\Delta}^{2}\Big)^{1/2}\|\bm{v}\|_{2}-c_{1}\mathscr{L}_{1}\Big(\frac{\delta\log d}{n}\Big)^{1/2}\|\bm{v}\|_{1},~\forall~\bm{v}\in\mathbb{R}^{d}. (B.5)

Based on (B.5), we let \hat{c}:=\frac{2\kappa_{0}+\bar{\Delta}^{2}}{4\kappa_{0}+\bar{\Delta}^{2}} and use the inequality (a-b)^{2}\geq\hat{c}a^{2}-\frac{\hat{c}}{1-\hat{c}}b^{2} to obtain

\bm{v}^{\top}\bm{Q}\bm{v}=\frac{1}{n}\|\bm{\dot{X}v}\|_{2}^{2}-\frac{\bar{\Delta}^{2}}{4}\|\bm{v}\|_{2}^{2}
\geq\hat{c}\big(\kappa_{0}+\frac{1}{4}\bar{\Delta}^{2}\big)\|\bm{v}\|_{2}^{2}-c_{1}^{2}\mathscr{L}_{1}^{2}\frac{\hat{c}}{1-\hat{c}}\frac{\delta\log d}{n}\|\bm{v}\|_{1}^{2}-\frac{\bar{\Delta}^{2}}{4}\|\bm{v}\|_{2}^{2}
\geq\frac{\kappa_{0}}{2}\|\bm{v}\|_{2}^{2}-c_{1}^{2}\mathscr{L}_{1}^{2}\big(1+\frac{\bar{\Delta}^{2}}{2\kappa_{0}}\big)\frac{\delta\log d}{n}\|\bm{v}\|_{1}^{2}
:=\frac{\kappa_{0}}{2}\|\bm{v}\|_{2}^{2}-c_{2}(\kappa_{0},\sigma,\bar{\Delta})\cdot\frac{\delta\log d}{n}\|\bm{v}\|_{1}^{2},

which holds for all \bm{v}\in\mathbb{R}^{d}, and \hat{c}_{2}:=c_{2}(\kappa_{0},\sigma,\bar{\Delta}) is a constant depending on \kappa_{0},\sigma,\bar{\Delta} (we remove the dependence on \kappa_{1} by \kappa_{1}\lesssim\sigma^{2}).
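
For completeness, the elementary inequality (a-b)^{2}\geq\hat{c}a^{2}-\frac{\hat{c}}{1-\hat{c}}b^{2} used above follows from Young's inequality 2ab\leq\varepsilon a^{2}+\varepsilon^{-1}b^{2} with \varepsilon:=1-\hat{c}\in(0,1):

(a-b)^{2}=a^{2}-2ab+b^{2}\geq a^{2}-(\varepsilon a^{2}+\varepsilon^{-1}b^{2})+b^{2}=(1-\varepsilon)a^{2}+\Big(1-\frac{1}{\varepsilon}\Big)b^{2}=\hat{c}a^{2}-\frac{\hat{c}}{1-\hat{c}}b^{2}.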

Step 3. Bounding the Estimation Error

We are in a position to bound the estimation error of any \bm{\widetilde{\theta}} satisfying (4.2). Note that the definition of the subgradient \partial\|\bm{\widetilde{\theta}}\|_{1} gives \lambda\|\bm{\theta^{\star}}\|_{1}-\lambda\|\bm{\widetilde{\theta}}\|_{1}\geq\big\langle\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},-\bm{\widetilde{\Upsilon}}\big\rangle. Thus, we set \bm{\theta}=\bm{\theta^{\star}} in (4.2) and proceed as follows:

0\geq\big\langle\bm{Q\widetilde{\theta}}-\bm{b}+\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big\rangle (B.6)
=\bm{\widetilde{\Upsilon}}^{\top}\bm{Q}\bm{\widetilde{\Upsilon}}+\big\langle\bm{Q\theta^{\star}}-\bm{b},\bm{\widetilde{\Upsilon}}\big\rangle+\big\langle\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big\rangle
\stackrel{(i)}{\geq}\frac{\kappa_{0}}{2}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\frac{2\hat{c}_{2}R\sqrt{s}\cdot\delta\log d}{n}\|\bm{\widetilde{\Upsilon}}\|_{1}-\frac{\lambda}{\beta}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big(\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big)
\stackrel{(ii)}{\geq}\frac{\kappa_{0}}{2}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big(\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big),

where we use \lambda\geq\beta\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}~(\beta>2) and \|\bm{\widetilde{\Upsilon}}\|_{1}\leq\|\bm{\widetilde{\theta}}\|_{1}+\|\bm{\theta^{\star}}\|_{1}\leq 2R\sqrt{s} in (i), and (ii) is due to the scaling 2\hat{c}_{2}R\delta\sqrt{s}\frac{\log d}{n}\leq(\frac{1}{2}-\frac{1}{\beta})\lambda, which holds under the assumed n\gtrsim\delta\log d with some hidden constant depending on (\kappa_{0},\sigma,\bar{\Delta},\Delta,M,R). Thus, we arrive at \frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}\geq\lambda\big(\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big).

For \bm{a}\in\mathbb{R}^{d} and \mathcal{J}\subset[d], we obtain \bm{a}_{\mathcal{J}}\in\mathbb{R}^{d} by keeping the entries of \bm{a} in \mathcal{J} while setting the others to zero. Let \mathcal{A} be the support of \bm{\theta^{\star}} and \mathcal{A}^{c}=[d]\setminus\mathcal{A}; then we have

\frac{1}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}\geq\|\bm{\theta^{\star}}+\bm{\widetilde{\Upsilon}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}=\|\bm{\theta^{\star}}+\bm{\widetilde{\Upsilon}}_{\mathcal{A}}+\bm{\widetilde{\Upsilon}}_{\mathcal{A}^{c}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\geq\|\bm{\theta^{\star}}\|_{1}+\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}^{c}}\|_{1}-\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}=\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}^{c}}\|_{1}-\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}}\|_{1}. (B.7)

Further using \frac{1}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq\frac{1}{2}\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}}\|_{1}+\frac{1}{2}\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}^{c}}\|_{1}, we obtain \|\bm{\widetilde{\Upsilon}}_{\mathcal{A}^{c}}\|_{1}\leq 3\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}}\|_{1}. Hence, we have \|\bm{\widetilde{\Upsilon}}\|_{1}\leq\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}}\|_{1}+\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}^{c}}\|_{1}\leq 4\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}}\|_{1}\leq 4\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2}. Now, we further substitute this into (B.6) and obtain

\frac{1}{2}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|^{2}_{2}\leq\frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big(\|\bm{\theta^{\star}}\|_{1}-\|\bm{\widetilde{\theta}}\|_{1}\big)\leq\frac{3\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq 6\lambda\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2}.

Thus, we arrive at the desired error bound for the \ell_{2}-norm:

\|\bm{\widetilde{\Upsilon}}\|_{2}\lesssim\mathscr{L}\sqrt{\frac{\delta s\log d}{n}},~\mathrm{with}~\mathscr{L}:=\frac{(\sigma+\bar{\Delta})^{2}(\Delta+M^{1/(2l)})}{\kappa_{0}^{3/2}}.

We simply use \|\bm{\widetilde{\Upsilon}}\|_{1}\leq 4\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2} again to establish the bound for \|\bm{\widetilde{\Upsilon}}\|_{1}. The proof is complete. \square

B.1.2 Proof of Theorem 10

Proof. The proof is divided into two steps. Compared to the last proof, due to the heavy-tailedness of \bm{x}_{k}, the step of “verifying RSC” reduces to the simpler argument in (B.9).

Step 1. Proving \lambda\geq\beta\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty} for some pre-specified \beta>2

Recall that (\bm{Q},\bm{b}) are constructed from the quantized data as \bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d} and \bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}. Thus, our main aim in this step is to prove that \lambda=C_{1}(R\sqrt{M}+\Delta^{2}+R\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} suffices to ensure \lambda\geq\beta\|\frac{1}{n}\sum_{k=1}^{n}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d})\bm{\theta^{\star}}-\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}\|_{\infty} with the promised probability and any pre-specified \beta>2. We let \bm{\dot{x}}_{k}=\bm{\widetilde{x}}_{k}+\bm{\tau}_{k}+\bm{w}_{k} and \dot{y}_{k}=\widetilde{y}_{k}+\phi_{k}+\vartheta_{k} with quantization errors \bm{w}_{k} and \vartheta_{k}. Analogously to (B.1), we have \mathbbm{E}[\dot{y}_{k}\bm{\dot{x}}_{k}]=\mathbbm{E}(\widetilde{y}_{k}\bm{\widetilde{x}}_{k}) and \mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}})=\frac{\bar{\Delta}^{2}}{4}\bm{\theta^{\star}}+\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}\bm{\theta^{\star}}). Thus, the term we want to bound can be first decomposed into two concentration terms (I_{1},I_{3}) and one bias term (I_{2}):

\Big\|\Big(\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}\Big)\bm{\theta^{\star}}-\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}\Big\|_{\infty}\leq\Big\|\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}-\mathbbm{E}(\dot{y}_{k}\bm{\dot{x}}_{k})\Big\|_{\infty}+\Big\|\mathbbm{E}\big(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}\bm{\theta^{\star}}\big)-\mathbbm{E}\big(\widetilde{y}_{k}\bm{\widetilde{x}}_{k}\big)\Big\|_{\infty}+\Big\|\Big(\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top})\Big)\bm{\theta^{\star}}\Big\|_{\infty}:=I_{1}+I_{2}+I_{3}. (B.8)

Step 1.1. Bounding I_{1}

Denote the i-th entry of \bm{x}_{k},\bm{\widetilde{x}}_{k},\bm{\dot{x}}_{k},\bm{\tau}_{k},\bm{w}_{k} by x_{ki},\widetilde{x}_{ki},\dot{x}_{ki},\tau_{ki},w_{ki}, respectively. Since |\dot{y}_{k}|\leq|\widetilde{y}_{k}|+|\phi_{k}|+|\vartheta_{k}|\leq|\widetilde{y}_{k}|+\Delta and |\dot{x}_{ki}|\leq|\widetilde{x}_{ki}|+|\tau_{ki}|+|w_{ki}|\leq|\widetilde{x}_{ki}|+\frac{3}{2}\bar{\Delta}, for any integer q\geq 2 we can bound the moments as

\sum_{k=1}^{n}\mathbbm{E}\Big|\frac{\dot{y}_{k}\dot{x}_{ki}}{n}\Big|^{q}\leq\frac{(\zeta_{x}+\frac{3}{2}\bar{\Delta})^{q-2}(\zeta_{y}+\Delta)^{q-2}}{n^{q}}\sum_{k=1}^{n}\mathbbm{E}|\dot{y}_{k}\dot{x}_{ki}|^{2}
\leq\frac{[(\zeta_{x}+\frac{3}{2}\bar{\Delta})(\zeta_{y}+\Delta)]^{q-2}}{n^{q}}\sum_{k=1}^{n}\sqrt{\mathbbm{E}|\dot{y}_{k}|^{4}\mathbbm{E}|\dot{x}_{ki}|^{4}}
\lesssim\Big(\frac{(\zeta_{x}+\frac{3}{2}\bar{\Delta})(\zeta_{y}+\Delta)}{n}\Big)^{q-2}\Big(\frac{M+\Delta^{4}+\bar{\Delta}^{4}}{n}\Big),

and in the last inequality we use \mathbbm{E}|\widetilde{y}_{k}|^{4}\leq\mathbbm{E}|y_{k}|^{4}\leq M and \mathbbm{E}|\widetilde{x}_{ki}|^{4}\leq\mathbbm{E}|x_{ki}|^{4}\leq M. With these preparations, we apply Bernstein's inequality (Lemma 1) and a union bound, yielding that

\mathbbm{P}\left(I_{1}\geq C_{5}\left\{\sqrt{\frac{(M+\Delta^{4}+\bar{\Delta}^{4})t}{n}}+\frac{(\zeta_{x}+\frac{3}{2}\bar{\Delta})(\zeta_{y}+\Delta)t}{n}\right\}\right)\leq 2d\cdot\exp(-t).

Setting t=\delta\log d, we obtain that I_{1}\lesssim(\sqrt{M}+\Delta^{2}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} holds with probability at least 1-2d^{1-\delta}.

Step 1.2. Bounding I_{2}

Noting that \mathbbm{E}(y_{k}\bm{x}_{k})=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}\bm{\theta^{\star}})+\mathbbm{E}(\epsilon_{k}\bm{x}_{k})=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}\bm{\theta^{\star}}), we could further decompose I_{2} as I_{2}\leq\|\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}\bm{\theta^{\star}})-\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}\bm{\theta^{\star}})\|_{\infty}+\|\mathbbm{E}(y_{k}\bm{x}_{k})-\mathbbm{E}(\widetilde{y}_{k}\bm{\widetilde{x}}_{k})\|_{\infty}:=I_{21}+I_{22}. To bound I_{21}, we note that the assumption and truncation procedure for \bm{x}_{k} are the same as in Theorem 2; thus, Step 2 in the proof of Theorem 2 yields that \|\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\|_{\infty}\leq\sqrt{\frac{\delta M\log d}{n}}. Thus, we have I_{21}\leq\|\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\|_{\infty}\|\bm{\theta^{\star}}\|_{1}\leq R\sqrt{M}\sqrt{\frac{\delta\log d}{n}}. To bound I_{22}, we estimate the i-th entry:

\big|\mathbbm{E}(y_{k}x_{ki}-\widetilde{y}_{k}\widetilde{x}_{ki})\big|=\big|\mathbbm{E}(y_{k}x_{ki}-\widetilde{y}_{k}\widetilde{x}_{ki})\big(\mathbbm{1}(|y_{k}|>\zeta_{y})+\mathbbm{1}(|x_{ki}|\geq\zeta_{x})\big)\big|
\leq\mathbbm{E}\big(|y_{k}x_{ki}|\mathbbm{1}(|y_{k}|>\zeta_{y})\big)+\mathbbm{E}\big(|y_{k}x_{ki}|\mathbbm{1}(|x_{ki}|\geq\zeta_{x})\big)
\stackrel{(i)}{\leq}M\Big(\frac{1}{\zeta_{x}^{2}}+\frac{1}{\zeta_{y}^{2}}\Big)\lesssim\sqrt{\frac{\delta M\log d}{n}},

where (i) is because \mathbbm{E}\big(|y_{k}x_{ki}|\mathbbm{1}(|y_{k}|>\zeta_{y})\big)\leq\big[\mathbbm{E}|y_{k}^{2}x_{ki}^{2}|\big]^{1/2}\sqrt{\mathbbm{P}(|y_{k}|>\zeta_{y})}\leq(\mathbbm{E}y_{k}^{4})^{1/4}(\mathbbm{E}x_{ki}^{4})^{1/4}\sqrt{\frac{\mathbbm{E}y_{k}^{4}}{\zeta_{y}^{4}}}\leq\frac{M}{\zeta_{y}^{2}}, and a similar treatment applies to \mathbbm{E}\big(|y_{k}x_{ki}|\mathbbm{1}(|x_{ki}|>\zeta_{x})\big). Overall, we have I_{2}\lesssim R\sqrt{M}\sqrt{\frac{\delta\log d}{n}}.

Step 1.3. Bounding I_{3}

We first note that

I_{3}=\|(\bm{Q}-\bm{\Sigma^{\star}})\bm{\theta^{\star}}\|_{\infty}\leq\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\cdot\|\bm{\theta^{\star}}\|_{1}\leq R\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}.

By Theorem 2, \|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\lesssim(\sqrt{M}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} holds with probability at least 1-2d^{2-\delta}, which leads to I_{3}\leq\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\|\bm{\theta^{\star}}\|_{1}\lesssim R(\sqrt{M}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}}. Thus, by combining everything, we obtain that \|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}\lesssim(R\sqrt{M}+\Delta^{2}+R\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} holds with probability at least 1-4d^{2-\delta}. Compared to our choice of \lambda, the claim of this step follows.

Step 2. Bounding the Estimation Error

We are now ready to bound the estimation error of any \bm{\widetilde{\theta}} satisfying (4.2). Setting \bm{\theta}=\bm{\theta^{\star}} in (4.2) gives \big\langle\bm{Q\widetilde{\theta}}-\bm{b}+\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big\rangle\leq 0. Recall that we can assume \|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\leq C_{6}(\sqrt{M}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} with probability at least 1-2d^{2-\delta}, which leads to

\bm{\widetilde{\Upsilon}}^{\top}\bm{Q}\bm{\widetilde{\Upsilon}}=\bm{\widetilde{\Upsilon}}^{\top}\bm{\Sigma^{\star}}\bm{\widetilde{\Upsilon}}-\bm{\widetilde{\Upsilon}}^{\top}(\bm{\Sigma^{\star}}-\bm{Q})\bm{\widetilde{\Upsilon}}\geq\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-C_{6}(\sqrt{M}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}}\|\bm{\widetilde{\Upsilon}}\|_{1}^{2}. (B.9)

Thus, it follows that

0\geq\big{<}\bm{Q\widetilde{\theta}}-\bm{b}+\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big{>} (B.10)
=\big{<}\bm{Q\theta^{\star}}-\bm{b},\bm{\widetilde{\Upsilon}}\big{>}+\bm{\widetilde{\Upsilon}}^{\top}\bm{Q}\bm{\widetilde{\Upsilon}}+\lambda\big{<}\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big{>}
\stackrel{(i)}{\geq}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-C_{6}(\sqrt{M}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}}\|\bm{\widetilde{\Upsilon}}\|_{1}^{2}-\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)}
\stackrel{(ii)}{\geq}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\Big{(}2C_{6}R(\sqrt{M}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}}+\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}\Big{)}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)}
\stackrel{(iii)}{\geq}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)}.

Note that (i) is due to (B.9) and \|\bm{\theta^{\star}}\|_{1}-\|\bm{\widetilde{\theta}}\|_{1}\geq\big{<}\partial\|\bm{\widetilde{\theta}}\|_{1},-\bm{\widetilde{\Upsilon}}\big{>}, in (ii) we use \|\bm{\widetilde{\Upsilon}}\|_{1}\leq\|\bm{\widetilde{\theta}}\|_{1}+\|\bm{\theta^{\star}}\|_{1}\leq 2R, and by Step 1, (iii) holds when \lambda=C_{2}(R\sqrt{M}+\Delta^{2}+R\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C_{2}. Therefore, we arrive at \frac{1}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}\geq\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}. Similarly to Step 3 in the proof of Theorem 9, we can show \|\bm{\widetilde{\Upsilon}}\|_{1}\leq 4\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2}. Applying (B.10) again yields

κ0𝚼~22λ2𝚼~1+λ𝚼~13λ2𝚼~16sλ𝚼~2.\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}\leq\frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\|\bm{\widetilde{\Upsilon}}\|_{1}\leq\frac{3\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq 6\sqrt{s}\lambda\|\bm{\widetilde{\Upsilon}}\|_{2}.

Thus, we obtain 𝚼~2δslogdn\|\bm{\widetilde{\Upsilon}}\|_{2}\lesssim\mathscr{L}\sqrt{\frac{\delta s\log d}{n}} with :=RM+Δ2+RΔ¯2κ0\mathscr{L}:=\frac{R\sqrt{M}+\Delta^{2}+R\bar{\Delta}^{2}}{\kappa_{0}}. The proof can be concluded by further applying 𝚼~14s𝚼~2\|\bm{\widetilde{\Upsilon}}\|_{1}\leq 4\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2} to derive the bound for 1\ell_{1}-norm. \square
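The guarantee above covers all stationary points, so it can be probed with very simple solvers. The following is a minimal numerical sketch, assuming the program behind (4.2) is the \ell_{1}-regularized quadratic loss \frac{1}{2}\bm{\theta}^{\top}\bm{Q\theta}-\bm{b}^{\top}\bm{\theta}+\lambda\|\bm{\theta}\|_{1} minimized over \{\|\bm{\theta}\|_{1}\leq R\}; the function names and the projected subgradient iteration are illustrative choices and not part of the paper. Since \bm{Q} may fail to be positive semi-definite, the iteration only targets stationary points, which is exactly the class of solutions covered by Theorem 10.

```python
import numpy as np

def proj_l1_ball(v, R):
    """Euclidean projection onto the l1-ball {t : ||t||_1 <= R} (sorting-based)."""
    if np.abs(v).sum() <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u * idx > css - R)[0][-1]
    tau = (css[rho] - R) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_constrained_lasso(Q, b, lam, R, step=1e-2, iters=20000):
    """Projected subgradient descent on f(t) = 0.5 t'Qt - b't + lam*||t||_1 over the
    l1-ball of radius R.  Q need not be PSD (quantized covariates), so the output is
    only a stationary point."""
    theta = np.zeros(len(b))
    for _ in range(iters):
        grad = Q @ theta - b + lam * np.sign(theta)  # one subgradient of f at theta
        theta = proj_l1_ball(theta - step * grad, R)
    return theta
```

With \bm{Q},\bm{b} formed from quantized data, comparing the output with \bm{\theta^{\star}} allows one to check the \sqrt{s\log d/n} scaling numerically.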

B.1.3 Proof of Proposition 1

Proof.

We let \bm{\theta}=\bm{\theta^{\star}} in (4.2) and then proceed as in the proof of Theorem 10:

0\displaystyle 0 <𝑸𝜽~𝒃+λ𝜽~1,𝚼~>\displaystyle\geq\big{<}\bm{Q\widetilde{\theta}}-\bm{b}+\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big{>} (B.11)
=𝚼~𝚺𝚼~+<𝑸𝜽𝒃,𝚼~>+𝚼~(𝑸𝚺)𝚼~+λ<𝜽~1,𝚼~>\displaystyle=\bm{\widetilde{\Upsilon}}^{\top}\bm{\Sigma^{\star}}\bm{\widetilde{\Upsilon}}+\big{<}\bm{Q\theta^{\star}}-\bm{b},\bm{\widetilde{\Upsilon}}\big{>}+\bm{\widetilde{\Upsilon}}^{\top}(\bm{Q}-\bm{\Sigma^{\star}})\bm{\widetilde{\Upsilon}}+\lambda\big{<}\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big{>}
(i)κ0𝚼~22𝑸𝜽𝒃𝚼~1𝑸𝚺𝚼~12+λ(𝜽~1𝜽1)\displaystyle\stackrel{{\scriptstyle(i)}}{{\geq}}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}\|\bm{\widetilde{\Upsilon}}\|_{1}-\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\|\bm{\widetilde{\Upsilon}}\|_{1}^{2}+\lambda\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)}
(ii)κ0𝚼~22(𝑸𝜽𝒃+2R𝑸𝚺)𝚼~1+λ(𝜽~1𝜽1)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\geq}}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\big{(}\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}+2R\cdot\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\big{)}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)}
(iii)κ0𝚼~22λ2𝚼~1+λ(𝜽~1𝜽1),\displaystyle\stackrel{{\scriptstyle(iii)}}{{\geq}}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)},

where in (i) we use \lambda_{\min}(\bm{\Sigma^{\star}})\geq\kappa_{0} and \|\bm{\theta^{\star}}\|_{1}-\|\bm{\widetilde{\theta}}\|_{1}\geq\big{<}\partial\|\bm{\widetilde{\theta}}\|_{1},-\bm{\widetilde{\Upsilon}}\big{>}, (ii) is by \|\bm{\widetilde{\Upsilon}}\|_{1}\leq\|\bm{\widetilde{\theta}}\|_{1}+\|\bm{\theta^{\star}}\|_{1}\leq 2R, and in (iii) we use the assumption (4.3). Thus, since \kappa_{0}\|\bm{\widetilde{\Upsilon}}\|^{2}_{2}\geq 0, we obtain 2\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)}\leq\|\bm{\widetilde{\Upsilon}}\|_{1}. Similarly to Step 3 in the proof of Theorem 9, we can show \|\bm{\widetilde{\Upsilon}}\|_{1}\leq 4\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2}. Using (B.11) again gives

κ0𝚼~22λ2𝚼~1+λ(𝜽1𝜽~1)32λ𝚼~16λs𝚼~2,\displaystyle\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|^{2}_{2}\leq\frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big{(}\|\bm{\theta^{\star}}\|_{1}-\|\bm{\widetilde{\theta}}\|_{1}\big{)}\leq\frac{3}{2}\lambda\|\bm{\widetilde{\Upsilon}}\|_{1}\leq 6\lambda\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2},

which yields \|\bm{\widetilde{\Upsilon}}\|_{2}\leq\frac{6\lambda\sqrt{s}}{\kappa_{0}}. The proof can be concluded by using \|\bm{\widetilde{\Upsilon}}\|_{1}\leq 4\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2}. ∎

B.1.4 Proof of Theorem 11

Proof.

To invoke Proposition 1, it is enough to verify (4.3). Recalling \bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}), we first invoke [21, Thm. 1] to obtain that \|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}=O\big{(}\sigma^{2}\sqrt{\frac{\delta\log d(\log n)^{2}}{n}}\big{)} holds with probability at least 1-2d^{2-\delta}. This confirms \lambda\gtrsim R\cdot\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}. It then remains to upper bound \|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}:

𝑸𝜽𝒃(𝑸𝚺)𝜽+𝚺𝜽𝒃\displaystyle\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}\leq\|(\bm{Q}-\bm{\Sigma^{\star}})\bm{\theta^{\star}}\|_{\infty}+\|\bm{\Sigma^{\star}\theta^{\star}}-\bm{b}\|_{\infty}
𝑸𝚺𝜽1+γxγynk=1ny˙k𝒙˙k1𝔼(yk𝒙k)\displaystyle\leq\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\|\bm{\theta^{\star}}\|_{1}+\Big{\|}\frac{\gamma_{x}\gamma_{y}}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k1}-\mathbbm{E}(y_{k}\bm{x}_{k})\Big{\|}_{\infty}
(i)σ2(R+1)δlogd(logn)2n,\displaystyle\stackrel{{\scriptstyle(i)}}{{\lesssim}}\sigma^{2}(R+1)\sqrt{\frac{\delta\log d(\log n)^{2}}{n}},

where in (i)(i) we use a known estimate from [21, Eq. (A.31)]:

(γxγynk=1ny˙k𝒙˙k1𝔼(yk𝒙k)σ2δlogd(logn)2n)12d1δ.\mathbbm{P}\left(\Big{\|}\frac{\gamma_{x}\gamma_{y}}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k1}-\mathbbm{E}(y_{k}\bm{x}_{k})\Big{\|}_{\infty}\lesssim\sigma^{2}\sqrt{\frac{\delta\log d(\log n)^{2}}{n}}\right)\geq 1-2d^{1-\delta}.

Thus, by setting λ=C1σ2Rδlogd(logn)2n\lambda=C_{1}\sigma^{2}R\sqrt{\frac{\delta\log d(\log n)^{2}}{n}}, (4.3) can be satisfied with probability at least 14d2δ1-4d^{2-\delta}, hence using Proposition 1 concludes the proof.∎

B.1.5 Proof of Theorem 12

Proof.

The proof is again based on Proposition 1 and some ingredients from [21]. From [21, Thm. 4], 𝑸𝚺(M2δlogdn)1/4\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\lesssim\big{(}\frac{M^{2}\delta\log d}{n}\big{)}^{1/4} holds with probability at least 12d2δ1-2d^{2-\delta}, thus confirming λR𝑸𝚺\lambda\gtrsim R\cdot\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty} with the same probability. Moreover,

𝑸𝜽𝒃(𝑸𝚺)𝜽+𝚺𝜽𝒃\displaystyle\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}\leq\|(\bm{Q}-\bm{\Sigma^{\star}})\bm{\theta^{\star}}\|_{\infty}+\|\bm{\Sigma^{\star}\theta^{\star}}-\bm{b}\|_{\infty}
𝑸𝚺𝜽1+γxγynk=1ny˙k𝒙˙k1𝔼(yk𝒙k)\displaystyle\leq\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\|\bm{\theta^{\star}}\|_{1}+\Big{\|}\frac{\gamma_{x}\gamma_{y}}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k1}-\mathbbm{E}(y_{k}\bm{x}_{k})\Big{\|}_{\infty}
(i)M(R+1)(δlogdn)1/4,\displaystyle\stackrel{{\scriptstyle(i)}}{{\lesssim}}\sqrt{M}(R+1)\Big{(}\frac{\delta\log d}{n}\Big{)}^{1/4},

where (i)(i) is due to a known estimate from [21, Eq. (A.34)]:

(γxγynk=1ny˙k𝒙˙k1𝔼(yk𝒙k)M(δlogdn)1/4)12d1δ\mathbbm{P}\left(\Big{\|}\frac{\gamma_{x}\gamma_{y}}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k1}-\mathbbm{E}(y_{k}\bm{x}_{k})\Big{\|}_{\infty}\lesssim\sqrt{M}\Big{(}\frac{\delta\log d}{n}\Big{)}^{1/4}\right)\geq 1-2d^{1-\delta}

Thus, with probability at least 14d2δ1-4d^{2-\delta}, (4.3) holds if λ=C1MR(δlogdn)1/4\lambda=C_{1}\sqrt{M}R\big{(}\frac{\delta\log d}{n}\big{)}^{1/4} with sufficiently large C1C_{1}. The proof can be concluded by invoking Proposition 1. ∎

B.2 Uniform Recovery Guarantee

We need some auxiliary results to support the proof. The first one is a concentration inequality for product processes due to Mendelson [67]; the following statement can be directly adapted from [40, Thm. 8] by specifying the pseudo-metrics as the \ell_{2}-distance.

Lemma 9.

(Concentration of Product Process). Let {g𝐚}𝐚𝒜\{g_{\bm{a}}\}_{\bm{a}\in\mathcal{A}} and {h𝐛}𝐛\{h_{\bm{b}}\}_{\bm{b}\in\mathcal{B}} be stochastic processes indexed by two sets 𝒜p\mathcal{A}\subset\mathbb{R}^{p}, q\mathcal{B}\subset\mathbb{R}^{q}, both defined on a common probability space (Ω,A,)(\Omega,\textsc{A},\mathbbm{P}). We assume that there exist K𝒜,K,r𝒜,r0K_{\mathcal{A}},K_{\mathcal{B}},r_{\mathcal{A}},r_{\mathcal{B}}\geq 0 such that

g𝒂g𝒂ψ2K𝒜𝒂𝒂2,g𝒂ψ2r𝒜,𝒂,𝒂𝒜;\displaystyle\|g_{\bm{a}}-g_{\bm{a}^{\prime}}\|_{\psi_{2}}\leq K_{\mathcal{A}}\|\bm{a}-\bm{a}^{\prime}\|_{2},~{}\|g_{\bm{a}}\|_{\psi_{2}}\leq r_{\mathcal{A}},~{}\forall~{}\bm{a},\bm{a}^{\prime}\in\mathcal{A};
h𝒃h𝒃ψ2K𝒃𝒃2,h𝒃ψ2r,𝒃,𝒃.\displaystyle\|h_{\bm{b}}-h_{\bm{b}^{\prime}}\|_{\psi_{2}}\leq{K}_{\mathcal{B}}\|\bm{b}-\bm{b}^{\prime}\|_{2},~{}\|h_{\bm{b}}\|_{\psi_{2}}\leq r_{\mathcal{B}},~{}\forall~{}\bm{b},\bm{b}^{\prime}\in\mathcal{B}.

Finally, let X_{1},...,X_{n} be independent copies of a random variable X\sim\mathbbm{P}. Then, for every u\geq 1, the following holds with probability at least 1-2\exp(-cu^{2}):

sup𝒂𝒜𝒃1n|i=1ng𝒂(Xi)h𝒃(Xi)𝔼[g𝒂(Xi)h𝒃(Xi)]|\displaystyle\sup_{\begin{subarray}{c}\bm{a}\in\mathcal{A}\\ \bm{b}\in\mathcal{B}\end{subarray}}~{}\frac{1}{n}\left|\sum_{i=1}^{n}g_{\bm{a}}(X_{i})h_{\bm{b}}(X_{i})-\mathbbm{E}\big{[}g_{\bm{a}}(X_{i})h_{\bm{b}}(X_{i})\big{]}\right|
C((K𝒜ω(𝒜)+ur𝒜)(Kω()+ur)n\displaystyle~{}~{}~{}\leq C\Big{(}\frac{(K_{\mathcal{A}}\cdot\omega(\mathcal{A})+u\cdot r_{\mathcal{A}})\cdot(K_{\mathcal{B}}\cdot\omega(\mathcal{B})+u\cdot r_{\mathcal{B}})}{n}
+r𝒜Kω()+rK𝒜ω(𝒜)+ur𝒜rn),\displaystyle~{}~{}~{}~{}~{}~{}+\frac{r_{\mathcal{A}}\cdot K_{\mathcal{B}}\cdot\omega(\mathcal{B})+r_{\mathcal{B}}\cdot K_{\mathcal{A}}\cdot\omega(\mathcal{A})+u\cdot r_{\mathcal{A}}r_{\mathcal{B}}}{\sqrt{n}}\Big{)},

where ω(𝒜)=𝔼sup𝐚𝒜(𝐠𝐚)\omega(\mathcal{A})=\mathbbm{E}\sup_{\bm{a}\in\mathcal{A}}(\bm{g}^{\top}\bm{a}) with 𝐠𝒩(0,𝐈p)\bm{g}\sim\mathcal{N}(0,\bm{I}_{p}) is the Gaussian width of 𝒜p\mathcal{A}\subset\mathbb{R}^{p}, and similarly, ω()\omega(\mathcal{B}) is the Gaussian width of \mathcal{B}.
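As a quick illustration of the Gaussian width appearing in Lemma 9 (and used repeatedly below through \omega(\Sigma_{s,R_{0}}),\omega(\mathscr{V})=O(\sqrt{s\log d})), the following Monte Carlo sketch estimates the width of the set of s-sparse unit vectors, for which \sup_{\bm{a}}\bm{g}^{\top}\bm{a} is simply the \ell_{2}-norm of the s largest-magnitude entries of \bm{g}. The dimensions are arbitrary, and using s-sparse unit vectors as a stand-in for the index sets of the later proofs is our own simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, trials = 2000, 10, 500

widths = []
for _ in range(trials):
    g = rng.standard_normal(d)
    # sup over s-sparse unit vectors of <g, a> = l2-norm of the s largest |g_i|
    widths.append(np.linalg.norm(np.sort(np.abs(g))[::-1][:s]))

print("Monte Carlo estimate of the width       :", np.mean(widths))
print("sqrt(s*log(e*d/s)) (order of the width) :", np.sqrt(s * np.log(np.e * d / s)))
```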

We will use the following result that can be found in [61, Thm. 8].

Lemma 10.

Let (X𝐮)𝐮𝒯(X_{\bm{u}})_{\bm{u}\in\mathcal{T}} be a random process indexed by points in a bounded set 𝒯n\mathcal{T}\subset\mathbb{R}^{n}. Assume that the process has sub-Gaussian increments, i.e., there exists M>0M>0 such that X𝐮X𝐯ψ2M𝐮𝐯2\|X_{\bm{u}}-X_{\bm{v}}\|_{\psi_{2}}\leq M\|\bm{u}-\bm{v}\|_{2} holds for any 𝐮,𝐯𝒯\bm{u},\bm{v}\in\mathcal{T}. Then for every t>0t>0, the event

sup𝒖,𝒗𝒯|X𝒖X𝒗|CM(ω(𝒯)+tdiam(𝒯))\sup_{\bm{u},\bm{v}\in\mathcal{T}}|X_{\bm{u}}-X_{\bm{v}}|\leq CM\cdot\big{(}\omega(\mathcal{T})+t\cdot\mathrm{diam}(\mathcal{T})\big{)}

holds with probability at least 1exp(t2)1-\exp(-t^{2}), where diam(𝒯):=sup𝐱,𝐲𝒯𝐱𝐲2\mathrm{diam}(\mathcal{T}):=\sup_{\bm{x},\bm{y}\in\mathcal{T}}\|\bm{x}-\bm{y}\|_{2} denotes the diameter of 𝒯\mathcal{T}.

B.2.1 Proof of Theorem 13

Proof.

We start from the optimality k=1n(y˙k𝒙k𝜽^)2k=1n(y˙k𝒙k𝜽)2.\sum_{k=1}^{n}(\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{\widehat{\theta}})^{2}\leq\sum_{k=1}^{n}(\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{{\theta^{\star}}})^{2}. By substituting 𝜽^=𝜽+𝚼^\bm{\widehat{\theta}}=\bm{\theta^{\star}}+\bm{\widehat{\Upsilon}} and performing some algebra, we obtain

k=1n(𝒙k𝚼^)22k=1n(y˙k𝒙k𝜽)𝒙k𝚼^.\sum_{k=1}^{n}(\bm{x}_{k}^{\top}\bm{\widehat{\Upsilon}})^{2}\leq 2\sum_{k=1}^{n}\big{(}\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{\theta^{\star}}\big{)}\bm{x}_{k}^{\top}\bm{\widehat{\Upsilon}}.

Due to the constraint we have \|\bm{\theta^{\star}}+\bm{\widehat{\Upsilon}}\|_{1}\leq\|\bm{\theta^{\star}}\|_{1}; then, similarly to the cone decomposition argument used in earlier proofs, we can show that \|\bm{\widehat{\Upsilon}}\|_{1}\leq 2\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2} holds. Thus, letting \mathscr{V}=\{\bm{v}:\|\bm{v}\|_{2}=1,\|\bm{v}\|_{1}\leq 2\sqrt{s}\}, the following holds uniformly for all \bm{\theta^{\star}}\in\Sigma_{s,R_{0}}:

𝚼^22inf𝒗𝒱k=1n(𝒙k𝒗)22𝚼^2sup𝒗𝒱k=1n(y˙k𝒙k𝜽)𝒙k𝒗\|\bm{\widehat{\Upsilon}}\|_{2}^{2}\cdot\inf_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}(\bm{x}_{k}^{\top}\bm{v})^{2}\leq 2\|\bm{\widehat{\Upsilon}}\|_{2}\cdot\sup_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}(\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{\theta^{\star}})\bm{x}_{k}^{\top}\bm{v} (B.12)

Similarly to previous developments, our strategy is to lower bound the left-hand side and upper bound the right-hand side, with the difference that both bounds must be valid uniformly for all \bm{\theta^{\star}}\in\Sigma_{s,R_{0}}.

Step 1. Bounding the Left-Hand Side From Below

Letting \bar{\Delta}=0 and restricting \bm{v} to \mathscr{V}, we invoke the two-sided deviation bound established in the proof of Theorem 9; then, for some constant c(\kappa_{0},\sigma) depending on \kappa_{0},\sigma, with probability at least 1-d^{-\delta},

inf𝒗𝒱1n[k=1n(𝒙k𝒗)2]1/2κ0c(κ0,σ)slogdn.\inf_{\bm{v}\in\mathscr{V}}\frac{1}{\sqrt{n}}\left[\sum_{k=1}^{n}(\bm{x}_{k}^{\top}\bm{v})^{2}\right]^{1/2}\geq\sqrt{\kappa_{0}}-c(\kappa_{0},\sigma)\cdot\sqrt{\frac{s\log d}{n}}.

Thus, if n4c2(κ0,σ)κ0slogdn\geq\frac{4c^{2}(\kappa_{0},\sigma)}{\kappa_{0}}s\log d, then it holds that inf𝒗𝒱k=1n(𝒙k𝒗)214κ0n\inf_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}(\bm{x}_{k}^{\top}\bm{v})^{2}\geq\frac{1}{4}\kappa_{0}n.

Step 2. Bounding the Right-Hand Side Uniformly

To pursue uniformity over \bm{\theta^{\star}}\in\Sigma_{s,R_{0}}, we replace the specific \bm{\theta^{\star}} with a supremum over \bm{\theta}\in\Sigma_{s,R_{0}} and consider an upper bound on

I:=sup𝜽Σs,R0sup𝒗𝒱k=1n(y˙k𝒙k𝜽)𝒙k𝒗,I:=\sup_{\bm{\theta}\in\Sigma_{s,R_{0}}}\sup_{\bm{v}\in\mathscr{V}}~{}\sum_{k=1}^{n}(\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{\theta})\bm{x}_{k}^{\top}\bm{v}, (B.13)

where y˙k=𝒬Δ(y~k+τk),y~k=𝒯ζy(yk):=sgn(yk)min{|yk|,ζy},yk=𝒙k𝜽+ϵk\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}),~{}\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}):=\operatorname{sgn}(y_{k})\min\{|y_{k}|,\zeta_{y}\},~{}y_{k}=\bm{x}_{k}^{\top}\bm{\theta}+\epsilon_{k}; note that y˙k,y~k,yk\dot{y}_{k},\widetilde{y}_{k},y_{k} depend on 𝜽\bm{\theta}, and we will use notation y˙𝜽,k,y~𝜽,k,y𝜽,k\dot{y}_{\bm{\theta},k},\widetilde{y}_{\bm{\theta},k},y_{\bm{\theta},k} to indicate such dependence when necessary. In this proof, the ranges of 𝜽\bm{\theta} and 𝒗\bm{v} (e.g., in supremum), if omitted, are respectively 𝜽Σs,R0\bm{\theta}\in\Sigma_{s,R_{0}} and 𝒗𝒱\bm{v}\in\mathscr{V}. Now let the quantization noise be ξk=y˙ky~k\xi_{k}=\dot{y}_{k}-\widetilde{y}_{k}, observing that 𝔼(yk𝒙k𝒗)=𝔼(𝜽𝒙k𝒙k𝒗)+𝔼(ϵk𝒙k𝒗)=𝔼(𝜽𝒙k𝒙k𝒗)\mathbbm{E}(y_{k}\bm{x}_{k}^{\top}\bm{v})=\mathbbm{E}(\bm{\theta}^{\top}\bm{x}_{k}\bm{x}_{k}^{\top}\bm{v})+\mathbbm{E}(\epsilon_{k}\bm{x}_{k}^{\top}\bm{v})=\mathbbm{E}(\bm{\theta}^{\top}\bm{x}_{k}\bm{x}_{k}^{\top}\bm{v}), then we can first decompose II as

I\displaystyle I sup𝜽,𝒗k=1nξk𝒙k𝒗+sup𝜽,𝒗k=1n(y~k𝒙k𝒗𝔼[y~k𝒙k𝒗])\displaystyle\leq\sup_{\bm{\theta},\bm{v}}~{}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}^{\top}\bm{v}+\sup_{\bm{\theta},\bm{v}}~{}\sum_{k=1}^{n}\big{(}\widetilde{y}_{k}\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\widetilde{y}_{k}\bm{x}_{k}^{\top}\bm{v}]\big{)} (B.14)
+sup𝜽,𝒗k=1n𝔼((y~kyk)𝒙k𝒗)+sup𝜽,𝒗k=1n(𝜽𝒙k𝒙k𝒗𝔼[𝜽𝒙k𝒙k𝒗])\displaystyle+\sup_{\bm{\theta},\bm{v}}~{}\sum_{k=1}^{n}\mathbbm{E}\big{(}(\widetilde{y}_{k}-y_{k})\bm{x}_{k}^{\top}\bm{v}\big{)}+\sup_{\bm{\theta},\bm{v}}~{}\sum_{k=1}^{n}\big{(}\bm{\theta}^{\top}\bm{x}_{k}\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\bm{\theta}^{\top}\bm{x}_{k}\bm{x}_{k}^{\top}\bm{v}]\big{)}
:=I0+I1+I2+I3,\displaystyle:={I}_{0}+{I}_{1}+{I}_{2}+{I}_{3},

where I_{0} is the term arising from quantization, I_{1} is the concentration term involving the truncation of heavy-tailed data, for which we develop some new machinery, I_{2} is the bias term, and I_{3} is a more regular concentration term that can be bounded via Lemma 9. In the remainder of the proof, we bound I_{1},I_{2},I_{3} separately and finally deal with I_{0}.
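Both I_{0} and the truncation terms above originate from the truncate-dither-quantize pipeline applied to y_{k}. The following small simulation sketch of that pipeline is included for concreteness; the explicit quantizer \mathcal{Q}_{\Delta}(x)=\Delta(\lfloor x/\Delta\rfloor+\frac{1}{2}) and the Student-t responses are illustrative assumptions (the paper's \mathcal{Q}_{\Delta} is defined in the main body), chosen so that the two facts used below, namely |\xi_{k}|\leq\Delta and the zero mean of the quantization noise under uniform dithering, can be checked numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
Delta, zeta_y, n = 0.5, 5.0, 200000

def truncate(y, zeta):                 # T_zeta(y) = sgn(y) * min(|y|, zeta)
    return np.sign(y) * np.minimum(np.abs(y), zeta)

def quantize(x, Delta):                # uniform quantizer with breakpoints on Delta * Z
    return Delta * (np.floor(x / Delta) + 0.5)

y = 2.0 * rng.standard_t(df=3, size=n)            # heavy-tailed responses (illustrative)
tau = rng.uniform(-Delta / 2, Delta / 2, size=n)  # uniform dither
y_tilde = truncate(y, zeta_y)                     # truncation step
y_dot = quantize(y_tilde + tau, Delta)            # dithered uniform quantization
xi = y_dot - y_tilde                              # quantization noise xi_k

print("max |xi| =", np.abs(xi).max(), "<= Delta =", Delta)
print("mean xi  =", xi.mean(), "(close to 0 thanks to the dither)")
```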

Step 2.1. Bounding I1{I}_{1}

Using \widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}+\epsilon_{k}), I_{1} concerns the concentration of the product process \big{\{}\sum_{k=1}^{n}\mathscr{T}_{\zeta_{y}}(\bm{\theta}^{\top}\bm{x}_{k}+\epsilon_{k})\bm{x}_{k}^{\top}\bm{v}\big{\}}_{\bm{\theta},\bm{v}} about its mean. It is natural to apply Lemma 9 towards this end, but we lack a good bound on \|\mathscr{T}_{\zeta_{y}}(\bm{\theta}^{\top}\bm{x}_{k}+\epsilon_{k})\|_{\psi_{2}} because of the heavy-tailedness of \epsilon_{k} (on the other hand, the trivial bound O(\zeta_{y}) is too crude to yield a sharp rate). Our strategy, already introduced in the main body, is to introduce \widetilde{z}_{k}:=\widetilde{y}_{k}-\mathscr{T}_{\zeta_{y}}(\epsilon_{k}) and decompose I_{1} as

I1sup𝒗,𝜽k=1n(z~k𝒙k𝒗𝔼[z~k𝒙k𝒗]):=I11+sup𝒗k=1n(𝒯ζy(ϵk)𝒙k𝒗𝔼[𝒯ζy(ϵk)𝒙k𝒗]):=I12.I_{1}\leq\underbrace{\sup_{\bm{v},\bm{\theta}}\sum_{k=1}^{n}\big{(}\widetilde{z}_{k}\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\widetilde{z}_{k}\bm{x}_{k}^{\top}\bm{v}]\big{)}}_{:=I_{11}}+\underbrace{\sup_{\bm{v}}\sum_{k=1}^{n}\big{(}\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}^{\top}\bm{v}]\big{)}}_{:=I_{12}}.

Thus, it suffices to bound I11I_{11} and I12I_{12}.

Step 2.1.1. Bounding I11{I}_{11}

We use Lemma 9 to bound I11I_{11}. For any 𝒗1,𝒗2𝒱\bm{v}_{1},\bm{v}_{2}\in\mathscr{V}, it is evident that we have 𝒙k𝒗ψ2𝒙kψ2σ\|\bm{x}_{k}^{\top}\bm{v}\|_{\psi_{2}}\leq\|\bm{x}_{k}\|_{\psi_{2}}\leq\sigma and 𝒙k𝒗1𝒙k𝒗2ψ2σ𝒗1𝒗22\|\bm{x}_{k}^{\top}\bm{v}_{1}-\bm{x}_{k}^{\top}\bm{v}_{2}\|_{\psi_{2}}\leq\sigma\|\bm{v}_{1}-\bm{v}_{2}\|_{2}. Regarding z~k=z~𝜽,k:=𝒯ζy(𝒙k𝜽+ϵk)𝒯ζy(ϵk)\widetilde{z}_{k}=\widetilde{z}_{\bm{\theta},k}:=\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}+\epsilon_{k})-\mathscr{T}_{\zeta_{y}}(\epsilon_{k}) indexed by 𝜽Σs,R0\bm{\theta}\in\Sigma_{s,R_{0}}, the 11-Lipschitzness of 𝒯ζy()\mathscr{T}_{\zeta_{y}}(\cdot) gives |z~k||𝒙k𝜽||\widetilde{z}_{k}|\leq|\bm{x}_{k}^{\top}\bm{\theta}|, and then the definition of sub-Gaussian norm yields z~kψ2𝒙k𝜽ψ2𝒙kψ2𝜽2R0σ\|\widetilde{z}_{k}\|_{\psi_{2}}\leq\|\bm{x}_{k}^{\top}\bm{\theta}\|_{\psi_{2}}\leq\|\bm{x}_{k}\|_{\psi_{2}}\|\bm{\theta}\|_{2}\leq R_{0}\sigma (this addresses the aforementioned issue). Further, for any 𝜽1,𝜽2Σs,R0\bm{\theta}_{1},\bm{\theta}_{2}\in\Sigma_{s,R_{0}} we verify the sub-Gaussian increments

|z~𝜽1,kz~𝜽2,k|=\displaystyle|\widetilde{z}_{\bm{\theta}_{1},k}-\widetilde{z}_{\bm{\theta}_{2},k}|= |𝒯ζy(𝒙k𝜽1+ϵk)𝒯ζy(ϵk)𝒯ζy(𝒙k𝜽2+ϵk)+𝒯ζy(ϵk)|\displaystyle\big{|}\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}_{1}+\epsilon_{k})-\mathscr{T}_{\zeta_{y}}(\epsilon_{k})-\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}_{2}+\epsilon_{k})+\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\big{|}
=\displaystyle= |𝒯ζy(𝒙k𝜽1+ϵk)𝒯ζy(𝒙k𝜽2+ϵk)||𝒙k(𝜽1𝜽2)|,\displaystyle\big{|}\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}_{1}+\epsilon_{k})-\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}_{2}+\epsilon_{k})\big{|}\leq\big{|}\bm{x}_{k}^{\top}(\bm{\theta}_{1}-\bm{\theta}_{2})\big{|},

which leads to \|\widetilde{z}_{\bm{\theta}_{1},k}-\widetilde{z}_{\bm{\theta}_{2},k}\|_{\psi_{2}}\leq\|\bm{x}_{k}^{\top}(\bm{\theta}_{1}-\bm{\theta}_{2})\|_{\psi_{2}}\leq\|\bm{x}_{k}\|_{\psi_{2}}\|\bm{\theta}_{1}-\bm{\theta}_{2}\|_{2}\leq\sigma\|\bm{\theta}_{1}-\bm{\theta}_{2}\|_{2}. With these preparations, we can invoke Lemma 9 and use the well-known estimates \omega(\Sigma_{s,R_{0}}),\omega(\mathscr{V})=O(\sqrt{s\log d}) (in fact, the tighter estimate \omega(\Sigma_{s,R_{0}}),\omega(\mathscr{V})=O\big{(}\sqrt{s\log(\frac{ed}{s})}\big{)} holds, e.g., [76], but we simply put \sqrt{s\log d} to be consistent with earlier results concerning the unconstrained Lasso) to obtain that, with probability at least 1-2\exp(-cu^{2}), we have

I11\displaystyle{I}_{11} σ2[n(ω(Σs,R0)+ω(𝒱)+u)+(ω(Σs,R0)+u)(ω(𝒱)+u)]\displaystyle\lesssim\sigma^{2}\Big{[}\sqrt{n}\big{(}\omega(\Sigma_{s,R_{0}})+\omega(\mathscr{V})+u\big{)}+\big{(}\omega(\Sigma_{s,R_{0}})+u\big{)}\cdot\big{(}\omega(\mathscr{V})+u\big{)}\Big{]} (B.15)
σ2[n(slogd+u)+(slogd+u)(slogd+u)].\displaystyle\lesssim\sigma^{2}\Big{[}\sqrt{n}\big{(}\sqrt{s\log d}+u\big{)}+\big{(}\sqrt{s\log d}+u\big{)}\cdot\big{(}\sqrt{s\log d}+u\big{)}\Big{]}.

Therefore, we can set u=\sqrt{\delta s\log d} in (B.15); under the scaling n\gtrsim s\delta\log d, this provides

(I11σ2nδslogd)12dδΩ(s).\mathbbm{P}\big{(}I_{11}\lesssim\sigma^{2}\sqrt{n\delta s\log d}\big{)}\geq 1-2d^{-\delta\Omega(s)}. (B.16)

Step 2.1.2. Bounding I12I_{12}

By \|\bm{v}\|_{1}\leq 2\sqrt{s} we have I_{12}\leq 2\sqrt{s}\|\sum_{k=1}^{n}(\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}-\mathbbm{E}[\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}])\|_{\infty}. To apply Bernstein's inequality, for integer q\geq 2 and i\in[d], analogously to the corresponding moment bounds in the proof of Theorem 9, we can bound

k=1n𝔼|𝒯ζy(ϵk)xkin|q\displaystyle\sum_{k=1}^{n}\mathbbm{E}\left|\frac{\mathscr{T}_{\zeta_{y}}(\epsilon_{k})x_{ki}}{n}\right|^{q} (ζyn)q21n2k=1n𝔼|𝒯ζy2(ϵk)xkiq|\displaystyle\leq\Big{(}\frac{\zeta_{y}}{n}\Big{)}^{q-2}\frac{1}{n^{2}}\sum_{k=1}^{n}\mathbbm{E}\big{|}\mathscr{T}^{2}_{\zeta_{y}}(\epsilon_{k})x_{ki}^{q}\big{|}
(σζyn)q2(σ2M1ln)(Cq)q2q!2v0c0q2,\displaystyle\leq\Big{(}\frac{\sigma\zeta_{y}}{n}\Big{)}^{q-2}\Big{(}\frac{\sigma^{2}M^{\frac{1}{l}}}{n}\Big{)}(Cq)^{\frac{q}{2}}\leq\frac{q!}{2}v_{0}c_{0}^{q-2},

for some v0=O(σ2M1/ln)v_{0}=O\big{(}\frac{\sigma^{2}M^{1/l}}{n}\big{)}, c0=O(σζyn)c_{0}=O\big{(}\frac{\sigma\zeta_{y}}{n}\big{)}. Then we use Lemma 1 to obtain that, with probability at least 12exp(t)1-2\exp(-t) we have

|1nk=1n(𝒯ζy(ϵk)xki𝔼[𝒯ζy(ϵk)xki])|Cσ(M12ltn+ζytn)\left|\frac{1}{n}\sum_{k=1}^{n}(\mathscr{T}_{\zeta_{y}}(\epsilon_{k})x_{ki}-\mathbbm{E}[\mathscr{T}_{\zeta_{y}}(\epsilon_{k})x_{ki}])\right|\leq C\sigma\Big{(}M^{\frac{1}{2l}}\sqrt{\frac{t}{n}}+\frac{\zeta_{y}t}{n}\Big{)}

Then we use ζy(σ+M12l)nδlogd\zeta_{y}\asymp\big{(}\sigma+M^{\frac{1}{2l}}\big{)}\sqrt{\frac{n}{\delta\log d}}, set tδlogdt\asymp\delta\log d, and take a union bound over i[d]i\in[d] to obtain that, k=1n1n(𝒯ζy(ϵk)𝒙k𝔼[𝒯ζy(ϵk)𝒙k])σ(M1/(2l)+σ)δlogdn\|\sum_{k=1}^{n}\frac{1}{n}(\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}-\mathbbm{E}[\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}])\|_{\infty}\lesssim\sigma(M^{1/(2l)}+\sigma)\sqrt{\frac{\delta\log d}{n}} holds with probability at least 12d1δ1-2d^{1-\delta}, which implies the following under the same probability

I12σ(M12l+σ)nsδlogd.I_{12}\lesssim\sigma\big{(}M^{\frac{1}{2l}}+\sigma\big{)}\sqrt{ns\delta\log d}. (B.17)

Therefore, combining (B.16) and (B.17), we obtain that I1σ(M12l+σ)nsδlogdI_{1}\lesssim\sigma\big{(}M^{\frac{1}{2l}}+\sigma\big{)}\sqrt{ns\delta\log d} with the promised probability.

Step 2.2. Bounding I2{I}_{2}

For this bias term the supremum does not make things harder. We begin with I_{2}=n\cdot\sup_{\bm{\theta},\bm{v}}\mathbbm{E}\big{(}(\widetilde{y}_{k}-y_{k})\bm{x}_{k}^{\top}\bm{v}\big{)}\leq 2n\sqrt{s}\cdot\sup_{\bm{\theta}}\|\mathbbm{E}(\widetilde{y}_{k}-y_{k})\bm{x}_{k}\|_{\infty}. Fixing any \bm{\theta}\in\Sigma_{s,R_{0}}, we have \mathbbm{E}|y_{k}|^{2l}\lesssim\mathbbm{E}|\bm{x}_{k}^{\top}\bm{\theta}|^{2l}+\mathbbm{E}|\epsilon_{k}|^{2l}\lesssim M+\sigma^{2l}. Then, following arguments similar to the truncation bias estimate in the proof of Theorem 9, we obtain

𝔼(y~kyk)𝒙kσM1/lζyσ(M1/(2l)+σ)δlogdn,\|\mathbbm{E}(\widetilde{y}_{k}-y_{k})\bm{x}_{k}\|_{\infty}\lesssim\frac{\sigma M^{1/l}}{\zeta_{y}}\lesssim\sigma(M^{1/(2l)}+\sigma)\sqrt{\frac{\delta\log d}{n}},

which implies I2σM12lnδslogdI_{2}\lesssim\sigma M^{\frac{1}{2l}}\sqrt{n\delta s\log d}.

Step 2.3. Bounding I3I_{3}

It is evident that we can apply Lemma 9 with (g𝜽(𝒙k),h𝒗(𝒙k))=(𝜽𝒙k,𝒗𝒙k)(g_{\bm{\theta}}(\bm{x}_{k}),h_{\bm{v}}(\bm{x}_{k}))=(\bm{\theta}^{\top}\bm{x}_{k},\bm{v}^{\top}\bm{x}_{k}), (𝒜,)=(Σs,R0,𝒱)(\mathcal{A},\mathcal{B})=(\Sigma_{s,R_{0}},\mathscr{V}), and hence with K𝒜,r𝒜,K,r=O(σ)K_{\mathcal{A}},r_{\mathcal{A}},K_{\mathcal{B}},r_{\mathcal{B}}=O(\sigma). Along with ω(Σs,R0),ω(𝒱)slogd\omega(\Sigma_{s,R_{0}}),\omega(\mathscr{V})\lesssim\sqrt{s\log d}, we obtain that the following holds with probability at least 12exp(cu2)1-2\exp(-cu^{2}):

I3σ2[(slogd+u)2+n(slogd+u)].I_{3}\leq\sigma^{2}\Big{[}\big{(}\sqrt{s\log d}+u\big{)}^{2}+\sqrt{n}\big{(}\sqrt{s\log d}+u\big{)}\Big{]}.

By taking usδlogdu\asymp\sqrt{s\delta\log d}, under the scaling nδslogdn\gtrsim\delta s\log d, it follows that I3σ2nδslogdI_{3}\lesssim\sigma^{2}\sqrt{n\delta s\log d} with probability at least 12dδΩ(s)1-2d^{-\delta\Omega(s)}.

Step 2.4. Bounding I0I_{0}

It remains to bound I_{0}=\sup_{\bm{\theta},\bm{v}}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}^{\top}\bm{v}. Bounding I_{0} is similar to establishing the “limited projection distortion (LPD)” property in [94], but the key distinction is that \bm{\theta} and \bm{v} in I_{0} take values in different spaces.

The main difficulty associated with \sup_{\bm{\theta}} lies in the discontinuity of the quantization noise \xi_{k}:=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k})-\widetilde{y}_{k}, which we overcome by a covering argument and some machinery developed in [94, Prop. 6.1]. The essential difference from [94] is that we use Lemma 10 to handle \sup_{\bm{v}}, whereas [94] again used a covering argument for \bm{v} to strengthen their Proposition 6.1 to their Proposition 6.2; this is insufficient in our setting because the covering number of \mathscr{V} increases rapidly as the covering radius decreases (a covering argument over \bm{v} does suffice for the analysis in [94] of a different estimator, named projected back projection).

Let us first construct a ρ\rho-net of Σs,R0\Sigma_{s,R_{0}} denoted by 𝒢={𝜽1,,𝜽N}\mathcal{G}=\{\bm{\theta}_{1},...,\bm{\theta}_{N}\}, so that for any 𝜽Σs,R0\bm{\theta}\in\Sigma_{s,R_{0}} we can pick 𝜽𝒢\bm{\theta}^{\prime}\in\mathcal{G} satisfying 𝜽𝜽2ρ\|\bm{\theta}^{\prime}-\bm{\theta}\|_{2}\leq\rho; here, the covering radius ρ\rho is to be chosen later, and we assume that N(9dρs)sN\leq\big{(}\frac{9d}{\rho s}\big{)}^{s} [76, Lemma 3.3]. As is standard in a covering argument, we first control I0I_{0} over the net 𝒢\mathcal{G} (by replacing sup𝜽Σs,R0\sup_{\bm{\theta}\in\Sigma_{s,R_{0}}} with sup𝜽𝒢\sup_{\bm{\theta}\in\mathcal{G}}), and then bound the approximation error induced by such replacement.

Step 2.4.1. Bounding I0I_{0} over 𝒢\mathcal{G}

In this step, we want to bound I_{0,\mathcal{G}}:=\sup_{\bm{\theta}\in\mathcal{G}}\sup_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}^{\top}\bm{v}. First, let us consider a fixed \bm{\theta}\in\Sigma_{s,R_{0}}. Since |\xi_{k}|\leq\Delta, we have \|\xi_{k}\bm{x}_{k}\|_{\psi_{2}}\lesssim\Delta\sigma. Because \{\xi_{k}\bm{x}_{k}:k\in[n]\} are independent and zero-mean, by [92, Prop. 2.6.1] we have \big{\|}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}\big{\|}_{\psi_{2}}\lesssim\sqrt{n}\Delta\sigma. Define \mathscr{V}^{\prime}=\mathscr{V}\cup\{0\}; then for any \bm{v}_{1},\bm{v}_{2}\in\mathscr{V}^{\prime} we have

(k=1nξk𝒙k)𝒗1(k=1nξk𝒙k)𝒗2ψ2CnΔσ𝒗1𝒗22.\left\|\Big{(}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}\Big{)}^{\top}\bm{v}_{1}-\Big{(}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}\Big{)}^{\top}\bm{v}_{2}\right\|_{\psi_{2}}\leq C\sqrt{n}\Delta\sigma\|\bm{v}_{1}-\bm{v}_{2}\|_{2}.

Thus, by Lemma 10, it holds with probability at least 1exp(u2)1-\exp(-u^{2}) that

sup𝒗𝒱(k=1nξk𝒙k)𝒗\displaystyle\sup_{\bm{v}\in\mathscr{V}}\Big{(}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}\Big{)}^{\top}\bm{v} sup𝒗,𝒗𝒱|(k=1nξk𝒙k)𝒗(k=1nξk𝒙k)𝒗|\displaystyle\leq\sup_{\bm{v},\bm{v}^{\prime}\in\mathscr{V}^{\prime}}\left|\Big{(}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}\Big{)}^{\top}\bm{v}-\Big{(}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}\Big{)}^{\top}\bm{v}^{\prime}\right|
CnΔσ(ω(𝒱)+u)C1nΔσ(slog9ds+u).\displaystyle\leq C\sqrt{n}\Delta\sigma\big{(}\omega(\mathscr{V}^{\prime})+u\big{)}\leq C_{1}\sqrt{n}\Delta\sigma\Big{(}\sqrt{s\log\frac{9d}{s}}+u\Big{)}.

Moreover, by a union bound over 𝒢\mathcal{G}, we obtain that I0,𝒢σΔn(slog9ds+u)I_{0,\mathcal{G}}\lesssim\sigma\Delta\sqrt{n}\big{(}\sqrt{s\log\frac{9d}{s}}+u\big{)} holds with probability at least 1exp(slog9dρsu2)1-\exp\big{(}s\log\frac{9d}{\rho s}-u^{2}\big{)}. We set usδlog(9dρs)u\asymp\sqrt{s\delta\log\big{(}\frac{9d}{\rho s}\big{)}} and arrive at

(I0,𝒢σΔnsδlog9dρs)1(9dρs)Ω(δs).\mathbbm{P}\left(I_{0,\mathcal{G}}\lesssim\sigma\Delta\sqrt{ns\delta\log\frac{9d}{\rho s}}\right)\geq 1-\Big{(}\frac{9d}{\rho s}\Big{)}^{-\Omega(\delta s)}. (B.18)

Step 2.4.2. Bounding the Approximation Error

From now on we indicate the dependence of ξk\xi_{k} on 𝜽\bm{\theta} by using the notation ξ𝜽,k:=𝒬Δ(y~𝜽,k+τk)y~𝜽,k\xi_{\bm{\theta},k}:=\mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta},k}+\tau_{k})-\widetilde{y}_{\bm{\theta},k} where y~𝜽,k=𝒯ζy(𝒙k𝜽+ϵk)\widetilde{y}_{\bm{\theta},k}=\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}+\epsilon_{k}). For any 𝜽Σs,R0\bm{\theta}\in\Sigma_{s,R_{0}} we pick one 𝜽𝒢\bm{\theta}^{\prime}\in\mathcal{G} such that 𝜽𝜽2ρ\|\bm{\theta}-\bm{\theta}^{\prime}\|_{2}\leq\rho; we fix such correspondence and remember that from now on every 𝜽Σs,R0\bm{\theta}\in\Sigma_{s,R_{0}} is associated with some 𝜽𝒢\bm{\theta^{\prime}}\in\mathcal{G}, (which of course depends on 𝜽\bm{\theta} but our notation omits such dependence). Thus, we can bound I0=sup𝜽,𝒗k=1nξ𝜽,k𝒙k𝒗I_{0}=\sup_{\bm{\theta},\bm{v}}\sum_{k=1}^{n}\xi_{\bm{\theta},k}\bm{x}_{k}^{\top}\bm{v} as

I0\displaystyle I_{0} sup𝜽,𝒗k=1nξ𝜽,k𝒙k𝒗+sup𝜽,𝒗k=1n(ξ𝜽,kξ𝜽,k)𝒙k𝒗I0,𝒢+I01.\displaystyle\leq\sup_{\bm{\theta},\bm{v}}\sum_{k=1}^{n}\xi_{\bm{\theta^{\prime}},k}\bm{x}_{k}^{\top}\bm{v}+\sup_{\bm{\theta},\bm{v}}\sum_{k=1}^{n}(\xi_{\bm{\theta},k}-\xi_{\bm{\theta}^{\prime},k})\bm{x}_{k}^{\top}\bm{v}\leq I_{0,\mathcal{G}}+I_{01}. (B.19)

Note that the bound on I0,𝒢I_{0,\mathcal{G}} is available in (B.18), so it remains to bound I01:=sup𝜽,𝒗k=1n(ξ𝜽,kξ𝜽,k)𝒙k𝒗I_{01}:=\sup_{\bm{\theta},\bm{v}}\sum_{k=1}^{n}(\xi_{\bm{\theta},k}-\xi_{\bm{\theta^{\prime}},k})\bm{x}_{k}^{\top}\bm{v}, which can be understood as the approximation error of the net 𝒢\mathcal{G} regarding the empirical process of interest. To facilitate the presentation we switch to the more compact notations — let 𝑿n×d\bm{X}\in\mathbb{R}^{n\times d} with rows 𝒙k\bm{x}_{k}^{\top} be the sensing matrix, 𝝃𝜽=[ξ𝜽,k]n\bm{\xi}_{\bm{\theta}}=[\xi_{\bm{\theta},k}]\in\mathbb{R}^{n} be the quantization error indexed by 𝜽\bm{\theta}, 𝝉=[τk]n\bm{\tau}=[\tau_{k}]\in\mathbb{R}^{n} be the random dither vector, ϵ=[ϵk]n\bm{\epsilon}=[\epsilon_{k}]\in\mathbb{R}^{n} be the heavy-tailed noise vector, 𝒚𝜽=[y𝜽,k]=𝑿𝜽+ϵn\bm{{y}}_{\bm{\theta}}=[y_{\bm{\theta},k}]=\bm{X\theta}+\bm{\epsilon}\in\mathbb{R}^{n} and 𝒚~𝜽=[y~𝜽,k]=𝒯ζy(𝒚𝜽)\bm{\widetilde{y}}_{\bm{\theta}}=[\widetilde{y}_{\bm{\theta},k}]=\mathscr{T}_{\zeta_{y}}(\bm{y}_{\bm{\theta}}) be the measurement vector and truncated measurement vector, respectively. With these conventions we can write I01=sup𝜽,𝒗(𝝃𝜽𝝃𝜽)𝑿𝒗I_{01}=\sup_{\bm{\theta},\bm{v}}(\bm{\xi_{\theta}}-\bm{\xi_{\theta^{\prime}}})^{\top}\bm{Xv}. Recall that a specific 𝜽\bm{\theta^{\prime}} has been specified for each 𝜽Σs,R0\bm{\theta}\in\Sigma_{s,R_{0}}, so defining 𝚿𝜽:=𝝃𝜽𝝃𝜽\bm{\Psi_{\theta}}:=\bm{\xi_{\theta}}-\bm{\xi_{\theta^{\prime}}} allows us to write I01=sup𝜽,𝒗𝚿𝜽𝑿𝒗I_{01}=\sup_{\bm{\theta},\bm{v}}~{}\bm{\Psi}_{\bm{\theta}}^{\top}\bm{Xv}. Further, we define 𝚿~𝜽:=𝒚~𝜽𝒚~𝜽,𝚿^𝜽:=𝒬Δ(𝒚~𝜽+𝝉)𝒬Δ(𝒚~𝜽+𝝉)\bm{\widetilde{\Psi}_{\theta}}:=\bm{\widetilde{y}_{\theta}}-\bm{\widetilde{y}_{\theta^{\prime}}},\bm{\widehat{\Psi}_{\theta}}:=\mathcal{Q}_{\Delta}(\bm{\widetilde{y}_{\theta}}+\bm{\tau})-\mathcal{Q}_{\Delta}(\bm{\widetilde{y}_{\theta^{\prime}}}+\bm{\tau}) and make the following observation

𝚿𝜽=𝝃𝜽𝝃𝜽=𝚿^𝜽𝚿~𝜽.\displaystyle\bm{\Psi_{\theta}}=\bm{\xi_{\theta}}-\bm{\xi_{\theta^{\prime}}}=\bm{\widehat{\Psi}_{\theta}}-\bm{\widetilde{\Psi}_{\theta}}. (B.20)

We pause to establish a property of \bm{X} that holds with high probability. Specifically, we restrict (B.4) to \bm{v}\in\mathscr{V} (recall that \bar{\Delta}=0, so \bm{\dot{X}}=\bm{X} and \bm{\dot{\Sigma}}=\bm{\Sigma^{\star}}); then, with the promised probability, it holds for some c(\kappa_{0},\sigma) that

sup𝒗𝒱|𝑿𝒗2n𝚺𝒗2|c(κ0,σ)δslogdn.\sup_{\bm{v}\in\mathscr{V}}\left|\frac{\|\bm{Xv}\|_{2}}{\sqrt{n}}-\|\bm{\sqrt{\Sigma^{\star}}v}\|_{2}\right|\leq c(\kappa_{0},\sigma)\sqrt{\frac{\delta s\log d}{n}}.

Thus, when n\gtrsim\delta s\log d with a large enough hidden constant depending on (\kappa_{0},\sigma), it holds that

sup𝒗𝒱𝑿𝒗2nsup𝒗𝒱𝚺𝒗2+κ12κ1.\sup_{\bm{v}\in\mathscr{V}}\frac{\|\bm{Xv}\|_{2}}{\sqrt{n}}\leq\sup_{\bm{v}\in\mathscr{V}}\|\sqrt{\bm{\Sigma^{\star}}}\bm{v}\|_{2}+\sqrt{\kappa_{1}}\leq 2\sqrt{\kappa_{1}}. (B.21)

We proceed by assuming we are on this event, which allows us to bound I01I_{01} as

I01=supθ,𝒗𝚿𝜽𝑿𝒗\displaystyle I_{01}=\sup_{\theta,\bm{v}}\bm{\Psi_{\theta}}^{\top}\bm{Xv} sup𝜽𝚿𝜽2sup𝒗𝒱𝑿𝒗2\displaystyle\leq\sup_{\bm{\theta}}\|\bm{\Psi_{\theta}}\|_{2}\sup_{\bm{v}\in\mathscr{V}}\|\bm{Xv}\|_{2} (B.22)
2κ1nsup𝜽𝚿𝜽2.\displaystyle\leq 2\sqrt{\kappa_{1}n}\cdot\sup_{\bm{\theta}}\|\bm{\Psi_{\theta}}\|_{2}.

To bound sup𝜽𝚿𝜽2\sup_{\bm{\theta}}\|\bm{\Psi_{\theta}}\|_{2}, motivated by (B.20), we will investigate 𝚿~𝜽\bm{\widetilde{\Psi}_{\theta}} and 𝚿^𝜽\bm{\widehat{\Psi}_{\theta}} more carefully. We pick a threshold η(0,Δ2)\eta\in(0,\frac{\Delta}{2}) (that is to be chosen later), and by the 11-Lipschitzness of 𝒯ζy()\mathscr{T}_{\zeta_{y}}(\cdot) we have

sup𝜽𝚿~𝜽2\displaystyle\sup_{\bm{\theta}}\|\bm{\widetilde{\Psi}_{\theta}}\|_{2} =sup𝜽𝒚~𝜽𝒚~𝜽2sup𝜽𝒚𝜽𝒚𝜽2\displaystyle=\sup_{\bm{\theta}}\|\bm{\widetilde{y}}_{\bm{\theta}}-\bm{\widetilde{y}}_{\bm{\theta}^{\prime}}\|_{2}\leq\sup_{\bm{\theta}}\|\bm{y}_{\bm{\theta}}-\bm{y}_{\bm{\theta}^{\prime}}\|_{2} (B.23)
=sup𝜽𝑿(𝜽𝜽)22κ1nρ,\displaystyle=\sup_{\bm{\theta}}\|\bm{X}(\bm{\theta}-\bm{\theta}^{\prime})\|_{2}\leq 2\sqrt{\kappa_{1}n}\rho,

where the last inequality is because 𝜽𝜽\bm{\theta}-\bm{\theta}^{\prime} is 2s2s-sparse, hence (B.21) implies 𝑿(𝜽𝜽)22κ1n𝜽𝜽22κ1nρ\|\bm{X}(\bm{\theta}-\bm{\theta}^{\prime})\|_{2}\leq 2\sqrt{\kappa_{1}n}\|\bm{\theta}-\bm{\theta}^{\prime}\|_{2}\leq 2\sqrt{\kappa_{1}n}\rho.

To proceed, we will define for specific 𝜽\bm{\theta} the index vectors 𝑱𝜽,1,𝑱𝜽,2{0,1}n\bm{J}_{\bm{\theta},1},\bm{J}_{\bm{\theta},2}\in\{0,1\}^{n} and use |𝑱𝜽,1||\bm{J}_{\bm{\theta},1}| to denote the number of 11s in 𝑱𝜽,1\bm{J}_{\bm{\theta},1} (similar meaning for |𝑱𝜽,2||\bm{J}_{\bm{\theta},2}|). Specifically, using the entry-wise notation 𝚿~𝜽=[Ψ~𝜽,k]\bm{\widetilde{\Psi}_{\theta}}=[\widetilde{\Psi}_{\bm{\theta},k}] we define 𝑱𝜽,1=[𝟙(|Ψ~𝜽,k|η)]\bm{J}_{\bm{\theta},1}=[\mathbbm{1}(|\widetilde{\Psi}_{\bm{\theta},k}|\geq\eta)]. Recall that (B.23) gives sup𝜽𝚿~𝜽224κ1nρ2\sup_{\bm{\theta}}\|\bm{\widetilde{\Psi}_{\theta}}\|_{2}^{2}\leq 4\kappa_{1}n\rho^{2}; combined with the simple observation sup𝜽𝚿~𝜽22sup𝜽η2|𝑱𝜽,1|\sup_{\bm{\theta}}\|\bm{\widetilde{\Psi}_{\theta}}\|_{2}^{2}\geq\sup_{\bm{\theta}}\eta^{2}|\bm{J}_{\bm{\theta},1}|, we obtain a uniform bound on |𝑱𝜽,1||\bm{J}_{\bm{\theta},1}| as

sup𝜽Σs,R0|𝑱𝜽,1|4κ1nρ2η2.\sup_{\bm{\theta}\in\Sigma_{s,R_{0}}}|\bm{J}_{\bm{\theta},1}|\leq\frac{4\kappa_{1}n\rho^{2}}{\eta^{2}}. (B.24)

Next, we define the index vector \bm{J}_{\bm{\theta},2} for \bm{\theta}\in\mathcal{G}: first let \mathscr{E}_{\bm{\theta},i}=\{\mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta},i}+\tau_{i}+t)\text{ is discontinuous in }t\in[-\eta,\eta]\}, and then define \bm{J}_{\bm{\theta},2}:=[\mathbbm{1}(\mathscr{E}_{\bm{\theta},i})]. Then, by Lemma 11 (which we prove later), we have

(sup𝜽𝒢|𝑱𝜽,2|CnηΔ)1exp(cnηΔ+slog9dρs):=1𝒫1.\mathbbm{P}\Big{(}\sup_{\bm{\theta}\in\mathcal{G}}|\bm{J}_{\bm{\theta},2}|\leq\frac{Cn\eta}{\Delta}\Big{)}\geq 1-\exp\Big{(}-\frac{cn\eta}{\Delta}+s\log\frac{9d}{\rho s}\Big{)}:=1-\mathscr{P}_{1}. (B.25)

Note that if \mathscr{E}_{\bm{\theta},i} does not happen (i.e., \mathbbm{1}(\mathscr{E}_{\bm{\theta},i})=0), then \mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta},i}+\tau_{i}+t) is continuous in t\in[-\eta,\eta]; combined with the definition of \mathcal{Q}_{\Delta}(\cdot), this is equivalent to saying that \mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta},i}+\tau_{i}+t) remains constant over t\in[-\eta,\eta]. Thus, given a fixed \bm{\theta}\in\Sigma_{s,R_{0}} and its associated \bm{\theta^{\prime}}, suppose that the i-th entry of \bm{J}_{\bm{\theta^{\prime}},2} is zero (meaning that \mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta^{\prime}},i}+\tau_{i}+t) remains constant over t\in[-\eta,\eta]); if additionally the i-th entry of \bm{J}_{\bm{\theta},1} is zero (i.e., |\widetilde{\Psi}_{\bm{\theta},i}|<\eta), then the i-th entry of \bm{\widehat{\Psi}}_{\bm{\theta}}=\mathcal{Q}_{\Delta}(\bm{\widetilde{y}_{\theta}}+\bm{\tau})-\mathcal{Q}_{\Delta}(\bm{\widetilde{y}_{\theta^{\prime}}}+\bm{\tau}) vanishes:

Ψ^𝜽,i\displaystyle\widehat{\Psi}_{\bm{\theta},i} =𝒬Δ(y~𝜽,i+τi)𝒬Δ(y~𝜽,i+τi)\displaystyle=\mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta},i}+\tau_{i})-\mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta^{\prime}},i}+\tau_{i})
=𝒬Δ(y~𝜽,i+Ψ~𝜽,i+τi)𝒬Δ(y~𝜽,i+τi)=0;\displaystyle=\mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta^{\prime}},i}+\widetilde{\Psi}_{\bm{\theta},i}+\tau_{i})-\mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta^{\prime}},i}+\tau_{i})=0;

combining with (B.20), this implies \Psi_{\bm{\theta},i}=-\widetilde{\Psi}_{\bm{\theta},i}. Recall from (B.22) that we want to bound \sup_{\bm{\theta}}\|\bm{\Psi_{\theta}}\|_{2}. Write \bm{J}_{\bm{\theta},1}^{c}=\bm{1}-\bm{J}_{\bm{\theta},1} and \bm{J}_{\bm{\theta^{\prime}},2}^{c}=\bm{1}-\bm{J}_{\bm{\theta^{\prime}},2}; then, denoting the Hadamard product by \odot and using the decomposition \bm{1}=\max\{\bm{J}_{\bm{\theta},1},\bm{J}_{\bm{\theta^{\prime}},2}\}+\min\{\bm{J}_{\bm{\theta},1}^{c},\bm{J}_{\bm{\theta^{\prime}},2}^{c}\} and (B.20), we have

sup𝜽𝚿𝜽2=sup𝜽𝚿𝜽max{𝑱𝜽,1,𝑱𝜽,2}+𝚿𝜽min{𝑱𝜽,1c,𝑱𝜽,2c}2\displaystyle\sup_{\bm{\theta}}\big{\|}\bm{\Psi_{\theta}}\big{\|}_{2}=\sup_{\bm{\theta}}\big{\|}\bm{\Psi_{\theta}}\odot\max\{\bm{J}_{\bm{\theta},1},\bm{J}_{\bm{\theta^{\prime}},2}\}+\bm{\Psi_{\theta}}\odot\min\{\bm{J}_{\bm{\theta},1}^{c},\bm{J}^{c}_{\bm{\theta^{\prime}},2}\}\big{\|}_{2} (B.26)
sup𝜽𝚿𝜽max{𝑱𝜽,1,𝑱𝜽,2}2+sup𝜽𝚿𝜽min{𝑱𝜽,1c,𝑱𝜽,2c}2\displaystyle\leq\sup_{\bm{\theta}}\big{\|}\bm{\Psi_{\theta}}\odot\max\{\bm{J}_{\bm{\theta},1},\bm{J}_{\bm{\theta^{\prime}},2}\}\big{\|}_{2}+\sup_{\bm{\theta}}\big{\|}\bm{\Psi_{\theta}}\odot\min\{\bm{J}_{\bm{\theta},1}^{c},\bm{J}^{c}_{\bm{\theta^{\prime}},2}\}\big{\|}_{2}
(i)sup𝜽𝚿𝜽|𝑱𝜽,1|+|𝑱𝜽,2|+sup𝜽𝚿~𝜽min{𝑱𝜽,1c,𝑱𝜽,2c}2\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\sup_{\bm{\theta}}\big{\|}\bm{\Psi_{\theta}}\big{\|}_{\infty}\cdot\sqrt{|\bm{J}_{\bm{\theta},1}|+|\bm{J}_{\bm{\theta^{\prime}},2}|}~{}+\sup_{\bm{\theta}}\big{\|}\bm{\widetilde{\Psi}_{\bm{\theta}}}\odot\min\{\bm{J}_{\bm{\theta},1}^{c},\bm{J}_{\bm{\theta^{\prime}},2}^{c}\}\big{\|}_{2}
(ii)C{Δn(κ1ρη+ηΔ)+ρκ1n},\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}C\Big{\{}\Delta\sqrt{n}\Big{(}\frac{\sqrt{\kappa_{1}}\rho}{\eta}+\sqrt{\frac{\eta}{\Delta}}\Big{)}+\rho\sqrt{\kappa_{1}n}\Big{\}},

where (i)(i) is because entries of 𝚿𝜽\bm{\Psi}_{\bm{\theta}} equal to those of 𝚿~𝜽-\bm{\widetilde{\Psi}}_{\bm{\theta}} if the index corresponds to min{𝑱𝜽,1c,𝑱𝜽,2c}=1\min\{\bm{J}_{\bm{\theta},1}^{c},\bm{J}_{\bm{\theta^{\prime}},2}^{c}\}=1, and in (ii)(ii) we use the simple bound 𝚿𝜽𝝃𝜽+𝝃𝜽2Δ\|\bm{\Psi_{\theta}}\|_{\infty}\leq\|\bm{\xi_{\theta}}\|_{\infty}+\|\bm{\xi_{\theta^{\prime}}}\|_{\infty}\leq 2\Delta and the derived bounds on sup𝜽|𝑱𝜽,1|,sup𝜽𝒢|𝑱𝜽,2|,sup𝜽𝚿~𝜽2\sup_{\bm{\theta}}|\bm{J}_{\bm{\theta},1}|,\sup_{\bm{\theta}\in\mathcal{G}}|\bm{J}_{\bm{\theta},2}|,\sup_{\bm{\theta}}\|\bm{\widetilde{\Psi}_{\theta}}\|_{2} in (B.24), (B.25) and (B.23), respectively.

Step 2.4.3. Concluding the Bound on I0I_{0}

We are ready to put the pieces together, specify \rho,\eta, and conclude the bound on I_{0}. Overall, with probability at least 1-\mathscr{P}_{1}-\mathscr{P}_{2}, for \mathscr{P}_{1} defined in (B.25) and some \mathscr{P}_{2} within the promised probability, combining (B.18), (B.19), (B.22) and (B.26) we obtain

I0σΔnsδlog9dρs+κ1nΔρη+nκ1ηΔ+κ1ρn.I_{0}\lesssim\sigma\Delta\sqrt{ns\delta\log\frac{9d}{\rho s}}+\frac{\kappa_{1}n\Delta\rho}{\eta}+n\sqrt{\kappa_{1}\eta\Delta}+\kappa_{1}\rho n.

Thus, we take the (near-optimal) choice of (\rho,\eta) as \rho\asymp\frac{\Delta}{\sqrt{\kappa_{1}}}\big{(}\frac{s\delta}{n}\big{)}^{3/2} and \eta\asymp\frac{\delta\Delta s}{n}\log\frac{9d}{\rho s}, under which, with the promised probability (as \mathscr{P}_{1} is also sufficiently small), we obtain the following bound on I_{0}:

I0σΔnsδlog(κ1d2n3Δ2s5δ3)I_{0}\lesssim\sigma\Delta\sqrt{ns\delta\log\Big{(}\frac{\kappa_{1}d^{2}n^{3}}{\Delta^{2}s^{5}\delta^{3}}\Big{)}} (B.27)

We can now conclude the proof. Substituting \inf_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}(\bm{x}_{k}^{\top}\bm{v})^{2}\geq\frac{1}{4}\kappa_{0}n and the definition of I in (B.13) into (B.12), we obtain \frac{1}{4}\kappa_{0}n\|\bm{\widehat{\Upsilon}}\|_{2}^{2}\leq 2\|\bm{\widehat{\Upsilon}}\|_{2}I uniformly for all \bm{\theta}\in\Sigma_{s,R_{0}}, which implies \sup_{\bm{\theta}}\|\bm{\widehat{\Upsilon}}\|_{2}\leq\frac{8I}{\kappa_{0}n}. Substituting the derived bounds on I_{1},I_{2},I_{3},I_{0} into (B.14), with the promised probability we have

Iσ(σ+M12l)nsδlogd+σΔnsδlog(κ1d2n3Δ2s5δ3),I\lesssim\sigma\big{(}\sigma+M^{\frac{1}{2l}}\big{)}\sqrt{ns\delta\log d}+\sigma\Delta\sqrt{ns\delta\log\Big{(}\frac{\kappa_{1}d^{2}n^{3}}{\Delta^{2}s^{5}\delta^{3}}\Big{)}},

so the uniform bound on 𝚼^2\|\bm{\widehat{\Upsilon}}\|_{2} follows immediately. Further using 𝚼^12s𝚼^2\|\bm{\widehat{\Upsilon}}\|_{1}\leq 2\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2} completes the proof.∎

Lemma 11.

(Bounding sup𝜽𝒢|𝑱𝜽,2|\sup_{\bm{\theta}\in\mathcal{G}}|\bm{J}_{\bm{\theta},2}|). Along the proof of Theorem 13, it holds that

(sup𝜽𝒢|𝑱𝜽,2|CnηΔ)exp(cnηΔ+slog9dρs).\mathbbm{P}\Big{(}\sup_{\bm{\theta}\in\mathcal{G}}|\bm{J}_{\bm{\theta},2}|\geq\frac{Cn\eta}{\Delta}\Big{)}\leq\exp\Big{(}-\frac{cn\eta}{\Delta}+s\log\frac{9d}{\rho s}\Big{)}. (B.28)
Proof.

Notation and details from the proof of Theorem 13 will be used. We first consider a fixed \bm{\theta}\in\mathcal{G}; by a simple shift, \mathscr{E}_{\bm{\theta},i} happens if and only if \mathcal{Q}_{\Delta}(\cdot) is discontinuous in [\widetilde{y}_{\bm{\theta},i}+\tau_{i}-\eta,\widetilde{y}_{\bm{\theta},i}+\tau_{i}+\eta], which is equivalent to [\widetilde{y}_{\bm{\theta},i}+\tau_{i}-\eta,\widetilde{y}_{\bm{\theta},i}+\tau_{i}+\eta]\cap(\Delta\cdot\mathbb{Z})\neq\varnothing. Because \tau_{i}\sim\mathscr{U}\big{(}[-\frac{\Delta}{2},\frac{\Delta}{2}]\big{)} and \eta<\frac{\Delta}{2}, we have \mathbbm{P}(\mathscr{E}_{\bm{\theta},i}\mid\bm{X},\bm{\epsilon})=\frac{2\eta}{\Delta}, regardless of the location of [\widetilde{y}_{\bm{\theta},i}-\eta,\widetilde{y}_{\bm{\theta},i}+\eta]. Thus, for fixed \bm{\theta}, conditioning on (\bm{X},\bm{\epsilon}), |\bm{J}_{\bm{\theta},2}| follows the binomial distribution with n trials and success probability p:=\frac{2\eta}{\Delta}. This allows us to write |\bm{J}_{\bm{\theta},2}|=\sum_{k=1}^{n}J_{k} with J_{k} i.i.d. Bernoulli with \mathbbm{E}J_{k}=p. Then, for any integer q\geq 2 we have \sum_{k=1}^{n}\mathbbm{E}|J_{k}-\mathbbm{E}J_{k}|^{q}\leq\sum_{k=1}^{n}\mathbbm{E}|J_{k}-\mathbbm{E}J_{k}|^{2}\leq np(1-p)\leq\frac{q!}{2}np. Now we invoke Bernstein's inequality (Lemma 1) to obtain that, for any t>0,

(|𝑱𝜽,2|np2npt+t)exp(t).\mathbbm{P}\big{(}|\bm{J}_{\bm{\theta},2}|-np\geq\sqrt{2npt}+t\big{)}\leq\exp(-t).

We let t=cnp and take a union bound over \bm{\theta}\in\mathcal{G}; this yields the desired claim since |\mathcal{G}|\leq\big{(}\frac{9d}{\rho s}\big{)}^{s}. ∎
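As a sanity check on the key probability computation in this proof, the following simulation sketch verifies that, with \tau_{i}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) and \eta<\frac{\Delta}{2}, the window [\widetilde{y}_{\bm{\theta},i}+\tau_{i}-\eta,\widetilde{y}_{\bm{\theta},i}+\tau_{i}+\eta] contains a point of \Delta\cdot\mathbb{Z} with probability 2\eta/\Delta, independently of \widetilde{y}_{\bm{\theta},i}. The sample size and the Gaussian placeholder for the truncated measurements are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
Delta, eta, n = 1.0, 0.1, 500000          # requires eta < Delta / 2

y_tilde = 3.0 * rng.standard_normal(n)               # placeholder truncated measurements
tau = rng.uniform(-Delta / 2, Delta / 2, size=n)     # uniform dither

# The window [y_tilde + tau - eta, y_tilde + tau + eta] meets Delta*Z exactly when
# (y_tilde + tau) mod Delta lies within eta of 0 or of Delta.
frac = np.mod(y_tilde + tau, Delta)
hit = (frac <= eta) | (frac >= Delta - eta)

print("empirical P(E_theta_i)  :", hit.mean())
print("theoretical 2*eta/Delta :", 2 * eta / Delta)
```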