

Quantizing Heavy-tailed Data in Statistical Estimation: (Near) Minimax Rates, Covariate Quantization, and Uniform Recovery

Junren Chen,   Michael K. Ng,   Di Wang
Abstract

Modern datasets often exhibit heavy-tailed behaviour, while quantization is inevitable in digital signal processing and many machine learning problems. This paper studies the quantization of heavy-tailed data in several fundamental statistical estimation problems where the underlying distributions have bounded moments of some order (no greater than 4). We propose to truncate and properly dither the data prior to a uniform quantization. Our main standpoint is that (near) minimax rates of estimation error can be achieved by computationally tractable estimators based on the quantized data produced by the proposed scheme. In particular, concrete results are worked out for covariance estimation, compressed sensing (also interpreted as sparse linear regression), and matrix completion, all agreeing that the quantization only slightly worsens the multiplicative factor. Additionally, while prior results focused on quantization of responses (i.e., measurements), we study compressed sensing where the covariates (i.e., sensing vectors) are also quantized; in this case, our recovery program is non-convex because the covariance matrix estimator lacks positive semi-definiteness, but all local minimizers are proved to enjoy a near-optimal error bound. Moreover, by a concentration inequality for product processes and a covering argument, we establish a near-minimax uniform recovery guarantee for quantized compressed sensing with heavy-tailed noise. Finally, numerical simulations are provided to corroborate our theoretical results.

1 Introduction

Heavy-tailed distributions are ubiquitous in modern datasets, especially those arising in economics, finance, imaging, and biology; see [93, 84, 45, 5, 88, 57] for instance. In the recent literature, heavy-tailedness is often captured by a bounded l-th moment, where l is some fixed small scalar; this is essentially weaker than the sub-Gaussian assumption, and thus outliers and extreme values appear much more frequently in data from heavy-tailed distributions (referred to as heavy-tailed data), which poses challenges for statistical analysis. In fact, many standard statistical procedures developed for sub-Gaussian data suffer from performance degradation in the heavy-tailed regime. Fortunately, the past decade has witnessed considerable progress on statistical estimation methods that are robust to heavy-tailedness; see [18, 34, 73, 68, 10, 44, 35, 97, 65] for instance.

Departing momentarily from heavy-tailed data, quantization is an inevitable process in the era of digital signal processing, which maps signals to bitstreams so that they can be stored, processed and transmitted. In particular, the resolution of quantization should be selected to achieve a trade-off between accuracy and various data processing costs, and in some applications a relatively low resolution is preferable. For instance, in a distributed learning setting or a MIMO system, the frequent information transmission among multiple parties often results in prohibitive communication costs [56, 69], and quantizing signals or data to fairly low resolution (while preserving satisfactory utility) is an effective approach to reduce the cost [96, 43]. Against this backdrop, in recent years there has been a rapidly growing literature on high-dimensional signal recovery from quantized data (see, e.g., [8, 26, 21, 89, 50, 32, 31] for 1-bit quantization, [89, 50, 94, 46] for multi-bit uniform quantization), trying to understand the interplay between quantization and signal reconstruction (or parameter learning) in some fundamental estimation problems.

Independently, a set of robustifying techniques has been developed to overcome the challenges posed by heavy-tailed data, and uniform data quantization under a uniform dither was shown to cost very little in some recovery problems. Considering the ubiquity of heavy-tailed behaviour and data quantization, a natural question is how to design a quantization scheme for heavy-tailed data that only incurs minor information loss. For instance, when applied to statistical estimation problems with heavy-tailed data, an appropriate quantization scheme should enable at least one faithful estimator from the quantized data, and ideally an estimator nearly achieving the optimal error rate. Despite the vast literature in this field, prior results that simultaneously take heavy-tailed data and quantization into account are surprisingly rare; to the best of our knowledge, the only ones are those presented in [32] and our earlier work [21] regarding the dithered 1-bit quantizer. These results remain incomplete and exhibit some downsides. Specifically, [32] considered a computationally intractable program for quantized compressed sensing and used techniques that are hard to generalize to other problems, while the error rates in [21] are inferior to the corresponding minimax ones (under unquantized sub-Gaussian data), as will be discussed in Section 1.1.3. In a nutshell, a quantization scheme for heavy-tailed data arising in statistical estimation problems that allows for computationally tractable near-minimax estimators is still lacking.

This paper aims to provide a solution to the above question and narrow the gap between heavy-tailed data and data quantization in the literature. In particular, we propose a unified quantization scheme for heavy-tailed data which, when applied to the canonical estimation problems of (sparse) covariance matrix estimation, compressed sensing (or sparse linear regression) and matrix completion, allows for (near) minimax estimators that are either in closed-form or solved from convex programs. Additionally, we present novel developments concerning covariate (or sensing vector) quantization and uniform signal recovery in quantized compressed sensing with heavy-tailed data.

1.1 Related Works and Our Contributions

This section is devoted to a review of the most relevant works. Before that, we note that a heavy-tailed random variable in this work is formulated by the moment constraint \mathbbm{E}|X|^{l}\leq M, where M is oftentimes regarded as an absolute constant and l is some fixed small scalar (specifically, l\leq 4 in the present paper).

1.1.1 Statistical Estimation under Heavy-Tailed Data

Compared to sub-Gaussian data, heavy-tailed data may contain many outliers and extreme values that are overly influential on traditional estimation methods. Hence, developing estimation methods that are robust to heavy-tailedness has become a recent focus in the statistics literature, where heavy-tailed distributions are often only assumed to have bounded moments of some small order. In particular, significant efforts have been devoted to the fundamental problem of mean estimation for heavy-tailed distributions. For instance, effective techniques available in the literature include Catoni's mean estimator [18, 34], median of means [73, 68], and the trimmed mean [66, 28]. While the strategies to achieve robustness are different, these methods indeed share the same core spirit of making the outliers less influential. To this end, the trimmed method (also referred to as truncation or shrinkage) may be the most intuitive: it truncates overlarge data to some threshold so that they are more benign for the estimation procedure. For more in-depth discussions we refer to the recent survey [65]. Furthermore, these robust methods for estimating the mean have been applied to empirical risk minimization [10, 44] and various high-dimensional estimation problems [35, 97], achieving near optimal guarantees. For instance, by invoking M-estimators with truncated data, (near) minimax rates can be achieved in high-dimensional sparse linear regression, matrix completion, and covariance estimation [35]. In fact, techniques similar to truncation have proven effective in some related problems; e.g., in non-convex algorithms for phase retrieval, truncated Wirtinger flow, which uses a more selective spectral initialization and a carefully trimmed gradient [23], improves on Wirtinger flow [16] in terms of sample size.

Recall that we capture heavy-tailedness by a bounded moment of some small order. Regarding statistical estimation beyond sub-Gaussianity, there has been a line of works considering sub-exponential or, more generally, sub-Weibull distributions [80, 85, 58, 39], which have heavier tails than sub-Gaussian ones but still possess finite moments of arbitrary order. Specifically, without truncation and quantization, sparse linear regression was studied under sub-exponential data in [85] and under sub-Weibull data in [58], and the obtained error rates match the ones in the sub-Gaussian case up to logarithmic factors. Additionally, under sub-exponential measurement matrix and noise, [80] established a uniform guarantee for 1-bit generative compressed sensing, while [39] analysed the generalized Lasso for a general nonlinear model. Because the tail assumptions in these works are substantially stronger than ours (one exception is Theorem 4.6 in [58] for sparse linear regression with sub-Weibull covariate and heavy-tailed noise with bounded second moment, but still this result does not involve a quantization procedure), we will not make special comparisons with their results later; instead, we simply note here two key distinctions: 1) these works do not utilize special treatments for heavy-tailed data like the truncation in the present paper; 2) most of them (with the single exception of [80] studying 1-bit quantization) do not focus on quantization, while this work concentrates on quantization of heavy-tailed data.

1.1.2 Statistical Estimation from Quantized Data

Quantized Compressed Sensing. Due to the prominent role of quantization in signal processing and machine learning, quantized compressed sensing that studies the interplay between sparse signal reconstruction and data quantization has become an active research branch in the field. In this work, we focus on memoryless quantization scheme222This means that the quantization for different measurements are independent. For other quantization schemes we refer to the recent survey [29]. that embraces simple hardware design. An important model is 1-bit compressed sensing where only the sign of the measurement is retained [8, 48, 75, 76], and more precisely this model concerns the recovery of sparse 𝜽d\bm{\theta^{\star}}\in\mathbb{R}^{d} from sgn(𝑿𝜽)\operatorname{sgn}(\bm{X\theta^{\star}}) with the sensing matrix 𝑿n×d\bm{X}\in\mathbb{R}^{n\times d}. However, 1-bit compressed sensing associated with the direct sgn()\operatorname{sgn}(\cdot) quantization suffers from some frustrating limitations, e.g., the loss of signal norm information, and the identifiability issue under some regular sensing matrix (e.g., under Bernoulli sensing matrix, see [32]).333In fact, almost all existing guarantees using the 1-bit observations sgn(𝑿𝜽)\operatorname{sgn}(\bm{X\theta^{\star}}) are restricted to standard Gaussian sensing matrix consisting of i.i.d. 𝒩(0,1)\mathcal{N}(0,1) entries, with the exceptions of [1] for sub-Gaussian sensing matrix and [30, 86] for partial Gaussian circulant matrix. Fortunately, these limitations can be overcome by random dithering prior to the quantization, under which the 1-bit measurements read as sgn(𝑿𝜽+𝝉)\operatorname{sgn}(\bm{X\theta^{\star}}+\bm{\tau}) for some suitably chosen random dither 𝝉n\bm{\tau}\in\mathbb{R}^{n}. Specifically, under Gaussian dither 𝝉𝒩(0,𝑰n)\bm{\tau}\sim\mathcal{N}(0,\bm{I}_{n}) and standard Gaussian sensing matrix 𝑿\bm{X}, full reconstruction with norm information could be achieved, for which the key idea is sgn(𝑿𝜽+𝝉)=sgn([𝑿𝝉][(𝜽),1])\operatorname{sgn}(\bm{X\theta^{\star}}+\bm{\tau})=\operatorname{sgn}([\bm{X}~{}\bm{\tau}][(\bm{\theta^{\star}})^{\top},1]^{\top}), thus reducing the dithered model to the undithered model for the sparse signal [(𝜽),1][(\bm{\theta^{\star}})^{\top},1]^{\top} whose last entry is known beforehand [54].444We note that this idea has been recently extended in [20] to the related problem of phase-only compressed sensing, also see [47, 36, 7] for prior developments. More surprisingly, under a uniform random dither, recovery with norm can be achieved under rather general sub-Gaussian sensing matrix [32, 50, 89, 21] even with near optimal error rate.

Besides the 1-bit quantizer that retains the sign, the uniform quantizer maps aa\in\mathbb{R} to 𝒬Δ(a)=Δ(aΔ+12)\mathcal{Q}_{\Delta}(a)=\Delta\big{(}\lfloor\frac{a}{\Delta}\rfloor+\frac{1}{2}\big{)} for some pre-specified Δ>0\Delta>0; here and hereafter, we refer to Δ\Delta as the quantization level, and note that smaller Δ\Delta represents higher resolution. While recovering 𝜽\bm{\theta^{\star}} from 𝒬Δ(𝑿𝜽)\mathcal{Q}_{\Delta}(\bm{X\theta^{\star}}) encounters identifiability issue,555For instance, if 𝑿{1,1}n×d\bm{X}\in\{-1,1\}^{n\times d} (typical example is the Bernoulli design where entries of 𝑿\bm{X} are i.i.d. zero-mean) and Δ=1\Delta=1, then 𝜽1:=1.1𝒆1\bm{\theta}_{1}:=1.1\bm{e}_{1} and 𝜽2:=1.2𝒆1+0.1𝒆2\bm{\theta}_{2}:=1.2\bm{e}_{1}+0.1\bm{e}_{2} can never be distinguished because 𝒬1(𝑿𝜽1)=𝒬1(𝑿𝜽2)\mathcal{Q}_{1}(\bm{X\theta}_{1})=\mathcal{Q}_{1}(\bm{X\theta}_{2}) always holds. it is again beneficial to use random dithering to obtain the measurements 𝒬Δ(𝑿𝜽+𝝉)\mathcal{Q}_{\Delta}(\bm{X\theta^{\star}}+\bm{\tau}). More specifically, by using uniform dither the Lasso estimator [89, 87] and projected back projection (PBP) method [94] achieve minimax rate in certain cases, and the derived error bounds for these estimators demonstrate that the dithered uniform quantization does not affect the scaling law but only slightly worsens the multiplicative factor. Although the aforementioned progresses regarding compressed sensing under dithered quantization were recently made, the technique of dithering in quantization indeed has a long history and (at least) dates back to some early engineering work (e.g., [83]), see [41] for a brief introduction.

Other Estimation Problems with Quantized Data. Beyond compressed sensing, or its more statistical counterpart of sparse linear regression, some other statistical estimation problems have also been investigated under dithered 1-bit quantization. Specifically, [24] studied a general signal estimation problem under dithered 1-bit quantization in a traditional setting where the sample size tends to infinity, showing that only a logarithmic rate loss is incurred by the quantization. Inspired by the potential application in reducing power consumption in a large-scale massive MIMO system, [31] proposed to collect 2 bits per entry from each sub-Gaussian sample and developed an estimator that is near minimax optimal in certain cases. Their estimator from coarsely quantized samples was extended to the high-dimensional sparse case in [21]. Then, considering the ubiquity of binary observations in many recommendation systems, the authors of [26] first approached the 1-bit matrix completion problem by maximum likelihood estimation with a nuclear norm constraint. Their method was further developed in a series of follow-up works using different regularizers/constraints to encourage low-rankness, or considering multi-bit quantization on a finite alphabet [14, 59, 52, 17, 3]. Quantizing the observed entries by a dithered 1-bit quantizer, the 1-bit matrix completion result in [21] essentially departs from the standard likelihood approach and can tolerate pre-quantization noise with unknown distribution.

1.1.3 Quantization of Heavy-Tailed Data in Statistical Estimation

From now on, we turn to existing results more closely related to this work and explain our contributions. Note that the results we just reviewed are for estimation problems from either unquantized heavy-tailed data (Section 1.1.1) or quantized sub-Gaussian data (Section 1.1.2). While quantization of heavy-tailed data (from distributions assumed to have bounded moments of some small order) is a natural question of significant practical value, prior investigations turn out to be surprisingly rare, and the only results we are aware of were presented in [21, 32] concerning dithered 1-bit quantization. Specifically, [32, Thm. 1.11] considered heavy-tailed noise and possibly heavy-tailed covariates, implying that a sharp uniform error rate is achievable (see their Example 1.10). However, their result is for a computationally intractable program (Hamming distance minimization) and hence of limited practical value. Another limitation is that their techniques are based on random hyperplane tessellations, which are specialized to 1-bit compressed sensing and do not generalize to other estimation problems. In contrast, [21] proposed a unified quantization scheme that first truncates the data and then invokes a dithered 1-bit quantizer. Although this quantization scheme can (at least) be applied to sparse covariance matrix estimation, compressed sensing, and matrix completion while still enabling practical estimators, the main drawback is that the convergence rates of the estimation errors are essentially slower than the corresponding minimax optimal ones (e.g., \tilde{O}\big(\frac{\sqrt{s}}{n^{1/3}}\big) for 1-bit compressed sensing under heavy-tailed noise [21, Thm. 10]), and in certain cases the rates cannot be improved without changing the quantization process (e.g., [21, Thm. 11] complements [21, Thm. 10] with a nearly matching lower bound). Beyond that, the 1-bit compressed sensing results in [21] are non-uniform. In a nutshell, [32] proved a sharp rate for 1-bit compressed sensing but used a highly intractable program and techniques not extendable to other estimation regimes, while the more widely applicable scheme and practical estimators in [21] suffer from slow error rates (when compared to the ones achieved from unquantized sub-Gaussian data).

Our Main Contributions: (Near) Minimax Rates. We propose a unified quantization scheme for heavy-tailed data consisting of three steps: 1) truncation that shrinks data to some threshold, 2) dithering that adds suitable random noise to the truncated data, and 3) uniform quantization. For sub-Gaussian or sub-exponential data the truncation step is inessential, and we simply set the threshold as \infty in this case. Careful readers may notice that we merely replace the 1-bit quantizer in our prior work [21] with the less extreme (multi-bit) uniform quantizer \mathcal{Q}_{\Delta}(\cdot), but the gain turns out to be significant: we are now able to derive (near) optimal rates that are essentially faster than the ones in [21], see Theorems 2-8. Compared to [32], besides the different quantizers, the other major distinctions are that: 1) we utilize an additional truncation step, 2) our estimators are computationally feasible, and 3) we investigate multiple estimation problems with the possibility of extensions to further ones. Concerning the effect of quantization, our error rates suggest a unified conclusion that dithered uniform quantization does not affect the scaling law but only slightly worsens the multiplicative factor, which generalizes similar findings for quantized compressed sensing in [89, 87, 94] in two directions, i.e., to the case where heavy-tailed data are present and to some other estimation problems (i.e., covariance matrix estimation, matrix completion). As an example, for quantized compressed sensing with sub-Gaussian sensing vector \bm{x}_{k} but heavy-tailed measurement y_{k} satisfying \mathbbm{E}|y_{k}|^{2+\nu}\leq M for some \nu>0, we derive the \ell_{2}-norm error rate O\big(\mathscr{L}\sqrt{\frac{s\log d}{n}}\big) where \mathscr{L}=M^{1/(2l)}+\Delta (Theorem 5; s, d, n are respectively the sparsity, signal dimension, and number of measurements), which is reminiscent of the rates in [94, 89] in terms of the position of \Delta. From a technical side, many of our analyses of the dithered quantizer are much cleaner than prior works because we make full use of the nice statistical properties of the quantization error and quantization noise (Theorem 1), see Section 2.2 (prior works that did not fully leverage these properties may incur extra technical complications, e.g., the symmetrization and contraction in [89, Lem. A.2]). Also, the property of the quantization noise motivates us to use a triangular dither when covariance estimation is necessary, which departs from the uniform dither commonly adopted in prior works (e.g., [89, 94, 31, 21]) and is novel to the literature of quantized compressed sensing. In our subsequent work [22], a clean analysis of quantized low-rank multivariate regression with possibly quantized covariates is provided, and we believe the innovations of this work will prove useful in other problems.

1.1.4 Covariate Quantization in Quantized Compressed Sensing

From now on we concentrate on quantized compressed sensing, i.e., the recovery of a sparse signal 𝜽d\bm{\theta^{\star}}\in\mathbb{R}^{d} from the quantized version of (𝒙k,yk:=𝒙k𝜽+ϵk)k=1n(\bm{x}_{k},y_{k}:=\bm{x}_{k}^{\top}\bm{\theta^{\star}}+\epsilon_{k})_{k=1}^{n} where 𝒙k,yk,ϵk\bm{x}_{k},y_{k},\epsilon_{k} are the sensing vector, measurement and noise, respectively. Let us first clarify some terminology issue before proceeding. Note that this formulation also models the sparse linear regression problem (e.g., [72, 81]) where one wants to learn a sparse parameter 𝜽d\bm{\theta^{\star}}\in\mathbb{R}^{d} from the given data (𝒙k,yk)k=1n(\bm{x}_{k},y_{k})_{k=1}^{n} that are believed to follow the linear model yk=𝒙k𝜽+ϵky_{k}=\bm{x}_{k}^{\top}\bm{\theta^{\star}}+\epsilon_{k}, and in this regression problem 𝒙k,yk\bm{x}_{k},y_{k} are commonly referred to as covariate and response, respectively. We are interested in both settings in this work (as further explained in Section 3.2), but for clearer presentation, we simply refer to the problem as quantized compressed sensing, while calling 𝒙k,yk\bm{x}_{k},y_{k} covariate and response, respectively.

Despite a large volume of results in quantized compressed sensing, almost all of them are restricted to response quantization, thus the question of covariate quantization that allows for accurate subsequent recovery remains unsolved. Note that this question is meaningful especially when the problem is interpreted as sparse linear regression — working with low-precision data in some (distributed) learning systems could significantly reduce communication cost and power consumption [96, 43], which we will further demonstrate in Section 4.1. Therefore, it is of interest to understand how covariate quantization affects the learning of 𝜽\bm{\theta^{\star}}. To the best of our knowledge, the only existing rigorous guarantees for quantized compressed sensing involving covariate quantization were obtained in [21, Thms. 7-8]. However, these results require 𝔼(𝒙k𝒙k)\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}) to be sparse [21, Assumption 3] (in order to employ their sparse covariance matrix estimator), and this assumption is non-standard in sparse linear regression and compressed sensing.777In fact, although isotropic sensing vector (i.e., 𝔼(𝒙k𝒙k)=𝑰d\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top})=\bm{I}_{d}) has been conventional in compressed sensing, many results in the literature can be extended to sensing vector with general unknown covariance matrix and hence do not really rely on the sparsity of 𝔼(𝒙k𝒙k)\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}).

Our Contribution. Besides the above main contributions, we establish the estimation guarantees for quantized compressed sensing under covariate quantization that are free of the non-standard assumption on the sparsity of 𝔼(𝒙k𝒙k)\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}). Like [21], our estimation methods are built upon the quantized covariance matrix estimator developed in Section 3.1; but unlike [21] that relies on the sparsity of 𝔼(𝒙k𝒙k)\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}) to ensure convexity, we instead deal with the non-convex program with an additional 1\ell_{1}-norm constraint, under which we prove that all local minimizers deliver near minimax estimation errors (Theorems 9-10). Our analysis bears resemblance to a line of works on non-convex M-estimator [63, 62, 64] but also exhibits some essential differences (Remark 5). Further, we extract our techniques as a deterministic framework (Proposition 1) and then use it to establish guarantees under dithered 1-bit quantization and covariate quantization as byproducts (Theorems 11-12), which are comparable to [21, Thms. 7-8] but free of sparsity on 𝔼(𝒙k𝒙k)\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}).

1.1.5 Uniform Signal Recovery in Quantized Compressed Sensing

It is standard in compressed sensing to leverage a random sensing matrix, so a recovery guarantee can be uniform or non-uniform. More precisely, a uniform guarantee ensures the recovery of all structured signals of interest with a single draw of the sensing ensemble, while a non-uniform guarantee is only valid for a structured signal fixed before drawing the random ensemble, with the implication that a new realization of the sensing matrix is required for sensing a new signal. Uniformity is a highly desired property in compressed sensing, since in applications the measurement ensemble is typically fixed and is expected to work for all signals [40]. Besides, the derivation of a uniform guarantee is often significantly harder than a non-uniform one, making uniformity an interesting theoretical problem in its own right.

A classical fact in linear compressed sensing is that the restricted isometry property (RIP) of the sensing matrix implies uniform recovery of all sparse signals (e.g., [38]), but this is not the case when it comes to nonlinear compressed sensing models, for which uniform recovery guarantees are still being actively pursued. For instance, in the specific quantization models involving 1-bit/uniform quantization with/without dithering, or the more general single index model y_{k}=f\big(\bm{x}_{k}^{\top}\bm{\theta^{\star}}\big) with possibly unknown f(\cdot), the most representative results are non-uniform (e.g., [75, 79, 78, 89, 94, 87]). We refer to [75, 32, 94, 50] for concrete uniform guarantees, some of which remain (near) optimal (e.g., [94, Sect. 7.2A]), while others suffer from essential degradation compared to the non-uniform ones (e.g., [75, Thm. 1.3]). It is worth noting the interesting recent work [40], which provided a unified approach to uniform guarantees in a series of non-linear models; however, without the aid of some non-trivial embedding result, their uniform guarantees typically exhibit a decaying rate of O(n^{-1/4}) that is slower than the non-uniform one of O(n^{-1/2}) (Section 4 therein). Turning back to our focus of compressed sensing from quantized heavy-tailed data, the results in [21, Sect. III] are non-uniform, while [32, Thm. 1.11] presents a sharp uniform guarantee for the intractable program of Hamming distance minimization under dithered 1-bit quantization.

Our Contribution. We additionally contribute to the literature a uniform guarantee for constrained Lasso under dithered uniform quantization of heavy-tailed response. Specifically, we upgrade our non-uniform Theorem 5 to its uniform version Theorem 13, which states that using a single realization of the sub-Gaussian sensing matrix, heavy-tailed noise and uniform dither, all ss-sparse signals within an 2\ell_{2}-ball can be uniformly recovered up to an 2\ell_{2}-norm error of O~(sn)\tilde{O}\big{(}\sqrt{\frac{s}{n}}\big{)}, thus matching the near minimax non-uniform rate in Theorem 5 up to logarithmic factors. The proof relies on a concentration inequality for product process [67] and a careful covering argument inspired by [94]. Due to the heavy-tailed noise, new treatment is needed before invoking the concentration result from [67].

1.2 Outline

The remainder of this paper is structured as follows. We provide the notation and preliminaries in Section 2. We present the first set of main results (concerning the (near) optimal guarantees for three estimation problems under quantized heavy-tailed data) in Section 3. Our second set of results (concerning covariate quantization and uniform recovery in quantized compressed sensing) is then presented in Section 4. To corroborate our theory, numerical results on synthetic data are reported in Section 5. We give some remarks to conclude the paper in Section 6. All the proofs are postponed to the Appendices.

2 Preliminaries

We adopt the following conventions throughout the paper:

1) We use boldface symbols (e.g., \bm{A}, \bm{x}) to denote matrices and vectors, and regular letters (e.g., a, x) for scalars. We write [m]=\{1,...,m\} for a positive integer m. We denote the imaginary unit by \mathsf{i}. The i-th entry of a vector \bm{x} (likewise, \bm{y}, \bm{\tau}) is denoted by x_{i} (likewise, y_{i}, \tau_{i}).

2) Notation with \star as superscript denotes the desired underlying parameter or signal, e.g., 𝚺\bm{\Sigma}^{\star}, 𝜽\bm{\theta}^{\star}. Moreover, notation marked by a tilde (e.g., 𝒙~\bm{\widetilde{x}}) and a dot (e.g., 𝒙˙\bm{\dot{\bm{x}}}) stands for the truncated data and quantized data, respectively.

3) We reserve dd, nn for the problem dimension and sample size, respectively. In many cases 𝚼^\bm{\widehat{\Upsilon}} denotes the estimation error, e.g., 𝚼^=𝜽^𝜽\bm{\widehat{\Upsilon}}=\bm{\widehat{\theta}}-\bm{\theta^{\star}} if 𝜽^\bm{\widehat{\theta}} is the estimator for the desired signal 𝜽\bm{\theta^{\star}}. We use Σs\Sigma_{s} to denote the set of dd-dimensional ss-sparse signals.

4) For vector 𝒙d\bm{x}\in\mathbb{R}^{d}, we work with its transpose 𝒙\bm{x}^{\top}, p\ell_{p}-norm 𝒙p=(i[d]|xi|p)1/p\|\bm{x}\|_{p}=(\sum_{i\in[d]}|x_{i}|^{p})^{1/p} (p1p\geq 1), max norm 𝒙=maxi[d]|xi|\|\bm{x}\|_{\infty}=\max_{i\in[d]}|x_{i}|. We define the standard Euclidean sphere as 𝕊d1={𝒙d:𝒙2=1}\mathbb{S}^{d-1}=\{\bm{x}\in\mathbb{R}^{d}:\|\bm{x}\|_{2}=1\}.

5) For matrix 𝑨=[aij]m×n\bm{A}=[a_{ij}]\in\mathbb{R}^{m\times n} with singular values σ1σ2σmin{m,n}\sigma_{1}\geq\sigma_{2}\geq...\geq\sigma_{\min\{m,n\}}, recall the operator norm 𝑨op=sup𝒗𝕊n1𝑨𝒗2=σ1\|\bm{A}\|_{op}=\sup_{\bm{v}\in\mathbb{S}^{n-1}}\|\bm{Av}\|_{2}=\sigma_{1}, Frobenius norm 𝑨F=(i,jaij2)1/2\|\bm{A}\|_{F}=(\sum_{i,j}a_{ij}^{2})^{1/2}, nuclear norm 𝑨nu=k=1min{m,n}σk\|\bm{A}\|_{nu}=\sum_{k=1}^{\min\{m,n\}}\sigma_{k}, and max norm 𝑨=maxi,j|aij|\|\bm{A}\|_{\infty}=\max_{i,j}|a_{ij}|. λmin(𝑨)\lambda_{\min}(\bm{A}) (resp. λmax(𝑨)\lambda_{\max}(\bm{A})) stands for the minimum eigenvalue (resp. maximum eigenvalue) of a symmetric 𝑨\bm{A}.

6) We denote universal constants by CC, cc, CiC_{i} and cic_{i}, whose value may vary from line to line. We write T1T2T_{1}\lesssim T_{2} or T1=O(T2)T_{1}=O(T_{2}) if T1CT2T_{1}\leq CT_{2}. Conversely, if T1CT2T_{1}\geq CT_{2} we write T1T2T_{1}\gtrsim T_{2} or T1=Ω(T2)T_{1}=\Omega(T_{2}). Also, we write T1T2T_{1}\asymp T_{2} if T1=O(T2)T_{1}=O(T_{2}) and T2=Ω(T1)T_{2}=\Omega(T_{1}) simultaneously hold.

7) We use 𝒰(Ω)\mathscr{U}(\Omega) to denote the uniform distribution over ΩN\Omega\subset\mathbb{R}^{N}, 𝒩(𝝁,𝚺)\mathcal{N}(\bm{\mu},\bm{\Sigma}) to denote Gaussian distribution with mean 𝝁\bm{\mu} and covariance 𝚺\bm{\Sigma}, 𝗍(ν)\mathsf{t}(\nu) to denote student’s t distribution with degrees of freedom ν\nu.

8) Our technique to handle heavy-tailedness is a data truncation step, for which we introduce the operator \mathscr{T}_{\zeta}(\cdot) for some threshold \zeta>0. It is defined as \mathscr{T}_{\zeta}(a)=\operatorname{sgn}(a)\min\{|a|,\zeta\} for a\in\mathbb{R}. To truncate a vector, we apply \mathscr{T}_{\zeta}(\cdot) entry-wise in most cases, with the exception of covariance matrix estimation under operator norm error (Theorem 3).

9) \mathcal{Q}_{\Delta}(\cdot) is the uniform quantizer with quantization level \Delta>0. It applies to a scalar a by \mathcal{Q}_{\Delta}(a)=\Delta\big(\big\lfloor\frac{a}{\Delta}\big\rfloor+\frac{1}{2}\big), and we set \mathcal{Q}_{0}(a)=a. Given a threshold \mu, the hard thresholding of a scalar a is \mathcal{T}_{\mu}(a)=a\cdot\mathbbm{1}(|a|\geq\mu). Both functions apply element-wise to vectors and matrices; a small code sketch of these operators is given right after this list.
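For concreteness, the following is a minimal NumPy sketch of the three element-wise operators just defined (the truncation \mathscr{T}_{\zeta}, the uniform quantizer \mathcal{Q}_{\Delta}, and the hard thresholding \mathcal{T}_{\mu}); the function names are ours and chosen only for illustration.

import numpy as np

def truncate(a, zeta):
    # T_zeta(a) = sgn(a) * min(|a|, zeta), applied element-wise
    return np.sign(a) * np.minimum(np.abs(a), zeta)

def uniform_quantize(a, Delta):
    # Q_Delta(a) = Delta * (floor(a / Delta) + 1/2); Q_0(a) = a by convention
    if Delta == 0:
        return a
    return Delta * (np.floor(a / Delta) + 0.5)

def hard_threshold(a, mu):
    # T_mu(a) = a * 1(|a| >= mu), applied element-wise
    return a * (np.abs(a) >= mu)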

2.1 High-Dimensional Statistics

Let X be a real random variable. We first recall some basic facts about sub-Gaussian and sub-exponential random variables, and then precisely formulate the heavy-tailed distributions considered in this work.

1) The sub-Gaussian norm is defined as Xψ2=inf{t>0:𝔼exp(X2t2)2}\|X\|_{\psi_{2}}=\inf\{t>0:\mathbbm{E}\exp(\frac{X^{2}}{t^{2}})\leq 2\}. A random variable XX with finite Xψ2\|X\|_{\psi_{2}} is said to be sub-Gaussian. Analogously to Gaussian variable, a sub-Gaussian random variable exhibits an exponentially-decaying probability tail and satisfies a moment constraint:

\mathbbm{P}(|X|\geq t)\leq 2\exp\left(-\frac{ct^{2}}{\|X\|_{\psi_{2}}^{2}}\right); (2.1)
(\mathbbm{E}|X|^{p})^{1/p}\leq C\|X\|_{\psi_{2}}\sqrt{p},~\forall~p\geq 1. (2.2)

Note that these two properties can also define ψ2\|\cdot\|_{\psi_{2}} up to multiplicative constant, e.g., Xψ2supp1(𝔼|X|p)1/pp\|X\|_{\psi_{2}}\asymp\sup_{p\geq 1}\frac{(\mathbbm{E}|X|^{p})^{1/p}}{\sqrt{p}} (see [92, Prop. 2.5.2]). For a dd-dimensional random vector 𝒙\bm{x} we define its sub-Gaussian norm as 𝒙ψ2=sup𝒗𝕊d1𝒗𝒙ψ2\|\bm{x}\|_{\psi_{2}}=\sup_{\bm{v}\in\mathbb{S}^{d-1}}\|\bm{v}^{\top}\bm{x}\|_{\psi_{2}}.

2) The sub-exponential norm is defined as Xψ1=inf{t>0:𝔼exp(|X|t)2}\|X\|_{\psi_{1}}=\inf\{t>0:\mathbbm{E}\exp(\frac{|X|}{t})\leq 2\}, and XX is sub-exponential if Xψ1<\|X\|_{\psi_{1}}<\infty. The sub-exponential XX satisfies the following properties:

\mathbbm{P}(|X|\geq t)\leq 2\exp\left(-\frac{ct}{\|X\|_{\psi_{1}}}\right);
(\mathbbm{E}|X|^{p})^{1/p}\leq C\|X\|_{\psi_{1}}p,~\forall~p\geq 1. (2.3)

To relate \|\cdot\|_{\psi_{1}} and \|\cdot\|_{\psi_{2}}, one has \|XY\|_{\psi_{1}}\leq\|X\|_{\psi_{2}}\|Y\|_{\psi_{2}} [92, Lem. 2.7.7].

3) In contrast to the moment constraints in (2.2) and (2.3), heavy-tailed distributions in this work are only assumed to satisfy bounded moments of some small order no greater than 4, formulated for a random variable X as \mathbbm{E}|X|^{l}\leq M for some M>0 and l\in(0,4]. Following [58, Def. 2.4, 2.5], we consider the following two moment assumptions for a heavy-tailed random vector \bm{x}\in\mathbb{R}^{d} (again, M>0, l\in(0,4]):

  • Marginal Moment Constraint. The weaker assumption that constrains the moment of each coordinate, formulated as \sup_{i\in[d]}\mathbbm{E}|x_{i}|^{l}\leq M.

  • Joint Moment Constraint. The stronger assumption that constrains the moments "toward all directions \bm{v}\in\mathbb{S}^{d-1}," formulated as \sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}|\bm{v}^{\top}\bm{x}|^{l}\leq M.

2.2 Dithered Uniform Quantization

In this part, we describe the dithered uniform quantizer and its properties in detail. We also specify the choices of random dither in this work.

1) We first provide the detailed procedure of dithered quantization and its general properties. Let \bm{x}\in\mathbb{R}^{N} be the input signal with dimension N\geq 1, whose entries may be random and dependent. Independent of \bm{x}, we generate the random dither \bm{\tau}\in\mathbb{R}^{N} with i.i.d. entries from some distribution (throughout this work, we suppose that a random dither is drawn independently of anything else, particularly of the signal to be quantized and of other dithers, and that the dither has i.i.d. entries if it is a vector), and then quantize \bm{x} to \bm{\dot{x}}=\mathcal{Q}_{\Delta}(\bm{x}+\bm{\tau}). Following [41], we refer to \bm{w}:=\bm{\dot{x}}-(\bm{x}+\bm{\tau}) as the quantization error, and \bm{\xi}:=\bm{\dot{x}}-\bm{x} as the quantization noise. The principal properties of dithered quantization are provided in Theorem 1.

Theorem 1.

(Adapted from [41, Thms. 1-2]). Consider the dithered uniform quantization described above for the input signal 𝐱\bm{x}, with random dither 𝛕=[τi]\bm{\tau}=[\tau_{i}], quantization error 𝐰\bm{w} and quantization noise 𝛏=[ξi]\bm{\xi}=[\xi_{i}]. Use 𝗂\mathsf{i} to denote the imaginary unit, and let YY be the random variable having the same distribution as the random dither τi\tau_{i}.

(a) (Quantization Error). If f(u):=𝔼(exp(𝗂uY))f(u):=\mathbbm{E}(\exp(\mathsf{i}uY)) satisfies f(2πlΔ)=0f\big{(}\frac{2\pi l}{\Delta}\big{)}=0 for all non-zero integer ll, then 𝒘𝒰([Δ2,Δ2]N)\bm{w}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{N}) is independent of 𝒙\bm{x}.999Although the statement is a bit different, it can be implied by [41, Thm. 1] and the proof therein.

(b) (Quantization Noise). Assume that Z𝒰([Δ2,Δ2])Z\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) is independent of YY. Let g(u):=𝔼(exp(𝗂uY))𝔼(exp(𝗂uZ))g(u):=\mathbbm{E}(\exp(\mathsf{i}uY))\mathbbm{E}(\exp(\mathsf{i}uZ)). Given positive integer pp, if the pp-th order derivative g(p)(u)g^{(p)}(u) satisfies g(p)(2πlΔ)=0g^{(p)}\big{(}\frac{2\pi l}{\Delta}\big{)}=0 for all non-zero integer ll, then the pp-th conditional moment of ξi\xi_{i} does not depend on 𝒙\bm{x}: 𝔼[ξip|𝒙]=𝔼(Y+Z)p\mathbbm{E}[\xi_{i}^{p}|\bm{x}]=\mathbbm{E}(Y+Z)^{p}.

We note that Theorem 1 serves as the cornerstone for our analysis of the dithered uniform quantizer; for instance, (a) allows for applications of concentration inequalities in our analyses, and (b) inspires us to develop a covariance matrix estimator from quantized samples. The take-home message is that adding an appropriate dither before quantization makes the quantization error and quantization noise behave in a statistically nice manner. For example, the elementary form of Theorem 1(a) is that, under a dither \tau_{i} satisfying the condition there, the quantization error \mathcal{Q}_{\Delta}(x_{i}+\tau_{i})-(x_{i}+\tau_{i}) follows \mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) for any given scalar x_{i} [41, Lem. 1].

2) We use uniform dither for quantization of the response in compressed sensing and matrix completion. More specifically, under Δ>0\Delta>0, we adopt the uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) for the response yky_{k}\in\mathbb{R}, which is also a common choice in previous works (e.g., [89, 94, 32, 50]). For Y𝒰([Δ2,Δ2])Y\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]), it can be calculated that

\mathbbm{E}(\exp(\mathsf{i}uY))=\int_{-\Delta/2}^{\Delta/2}\frac{1}{\Delta}\big(\cos(ux)+\mathsf{i}\sin(ux)\big)\,\mathrm{d}x=\frac{2}{\Delta u}\sin\Big(\frac{\Delta u}{2}\Big), (2.4)

and hence 𝔼(exp(𝗂2πlΔY))=0\mathbbm{E}(\exp(\mathsf{i}\frac{2\pi l}{\Delta}Y))=0 holds for all non-zero integer ll. Therefore, the benefit of using τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) is that the quantization errors wk=𝒬Δ(yk+τk)(yk+τk)w_{k}=\mathcal{Q}_{\Delta}(y_{k}+\tau_{k})-(y_{k}+\tau_{k}) i.i.d. follow 𝒰([Δ2,Δ2])\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]), and are independent of {yk}\{y_{k}\}.

3) We use a triangular dither for quantization of the covariate, i.e., the sample in covariance estimation or the covariate in compressed sensing. Particularly, when considering the uniform quantizer \mathcal{Q}_{\Delta}(\cdot) for the covariate \bm{x}_{k}\in\mathbb{R}^{d}, we adopt the dither \bm{\tau}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d})+\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}), which is the sum of two independent \mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}) variables and is referred to as a triangular dither [41]. (An equivalent statement is that the entries of \bm{\tau}_{k} are i.i.d. distributed as \mathscr{U}\big([-\frac{\Delta}{2},\frac{\Delta}{2}]\big)+\mathscr{U}\big([-\frac{\Delta}{2},\frac{\Delta}{2}]\big); the equivalence can be clearly seen by comparing the joint probability density functions.) Simple calculations verify that the triangular dither respects not only the condition in Theorem 1(a), but also the one in Theorem 1(b) with p=2. Specifically, let Y=Y_{1}+Y_{2} where Y_{1} and Y_{2} are independent and follow \mathscr{U}\big([-\frac{\Delta}{2},\frac{\Delta}{2}]\big), and let Z\sim\mathscr{U}\big([-\frac{\Delta}{2},\frac{\Delta}{2}]\big) be independent of Y; then, based on (2.4), f(u)=\mathbbm{E}(\exp(\mathsf{i}uY))=\big[\frac{2}{\Delta u}\sin\frac{\Delta u}{2}\big]^{2} satisfies f(\frac{2\pi l}{\Delta})=0, and g(u)=\mathbbm{E}(\exp(\mathsf{i}uY))\mathbbm{E}(\exp(\mathsf{i}uZ))=\big[\frac{2}{\Delta u}\sin\frac{\Delta u}{2}\big]^{3} satisfies g^{\prime\prime}(\frac{2\pi l}{\Delta})=0, where l is any non-zero integer. Thus, at the cost of a dithering variance larger than that of the uniform dither, the triangular dither brings the additional nice property of a signal-independent variance for the quantization noise: \mathbbm{E}(\xi_{ki}^{2})=\frac{1}{4}\Delta^{2}, where \xi_{ki} is the i-th entry of the quantization noise \bm{\xi}_{k}=\mathcal{Q}_{\Delta}(\bm{x}_{k}+\bm{\tau}_{k})-\bm{x}_{k}.

To the best of our knowledge, the triangular dither is new to the literature of quantized compressed sensing. We will explain why it is necessary whenever covariance estimation is involved; this is also complemented by a numerical simulation (see Figure 5(a)).
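As a quick empirical sanity check (a sketch we add for illustration, not part of the paper's experiments), the following NumPy snippet verifies the two facts used above: under a uniform dither the quantization error is uniform on [-\Delta/2,\Delta/2] and uncorrelated with the input, and under a triangular dither the quantization noise has variance \Delta^{2}/4.

import numpy as np

rng = np.random.default_rng(0)
Delta, n = 0.5, 1_000_000
x = rng.standard_t(df=3, size=n)                   # a heavy-tailed input signal

def quantize(a):                                   # uniform quantizer Q_Delta
    return Delta * (np.floor(a / Delta) + 0.5)

# uniform dither: quantization error w ~ U([-Delta/2, Delta/2]), independent of x
tau_u = rng.uniform(-Delta / 2, Delta / 2, size=n)
w = quantize(x + tau_u) - (x + tau_u)
print(np.var(w), Delta**2 / 12)                    # both approximately Delta^2 / 12
print(np.corrcoef(w, x)[0, 1])                     # approximately 0

# triangular dither: quantization noise xi = Q(x + tau) - x has E[xi^2] = Delta^2 / 4
tau_t = rng.uniform(-Delta/2, Delta/2, n) + rng.uniform(-Delta/2, Delta/2, n)
xi = quantize(x + tau_t) - x
print(np.mean(xi**2), Delta**2 / 4)                # both approximately Delta^2 / 4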

3 (Near) Minimax Error Rates

In this section we derive (near) optimal error rates for several canonical statistical estimation problems. Our novelty is that by using the proposed quantization scheme for heavy-tailed data, (near) optimal error rates could be achieved by computationally feasible estimators.

3.1 Quantized Covariance Matrix Estimation

Given \mathscr{X}:=\{\bm{x}_{1},...,\bm{x}_{n}\} as i.i.d. copies of a zero-mean random vector \bm{x}\in\mathbb{R}^{d}, one often encounters the covariance matrix estimation problem, i.e., estimating \bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}\bm{x}^{\top}). This estimation problem is of fundamental importance in multivariate analysis and has attracted much research interest (e.g., [12, 11, 13, 4]). However, the practically useful setting (e.g., in a massive MIMO system [95]) where the samples undergo a certain quantization process remains under-developed; here we are only aware of the 1-bit quantization results in [31, 21]. This setting poses the problem of quantized covariance matrix estimation (QCME), in which one aims to design a quantization scheme for \bm{x}_{k} that allows for accurate estimation of \bm{\Sigma^{\star}} based only on the quantized samples. We consider heavy-tailed \bm{x}_{k} possessing bounded fourth moments either marginally or jointly, but note that our estimation methods and theoretical results appear to be new even for sub-Gaussian \bm{x}_{k} (Remark 1).

As introduced before, we overcome the heavy-tailedness of \bm{x}_{k} by a data truncation step, i.e., we first truncate \bm{x}_{k} to \bm{\widetilde{x}}_{k} in order to make the outliers less influential. Here, we defer the precise definition of \bm{\widetilde{x}}_{k} to the concrete results because it should be well suited to the error metric. After the truncation, we dither and quantize \bm{\widetilde{x}}_{k} to \bm{\dot{x}}_{k}=\mathcal{Q}_{\Delta}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k}) with the triangular dither \bm{\tau}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d})+\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}). Since this differs from the uniform dither adopted in the literature (e.g., [21, 89, 94, 32, 50]), let us first explain our choice of the triangular dither. Recall that the quantization noise and quantization error are respectively defined as \bm{\xi}_{k}:=\bm{\dot{x}}_{k}-\bm{\widetilde{x}}_{k} and \bm{w}_{k}:=\bm{\dot{x}}_{k}-\bm{\widetilde{x}}_{k}-\bm{\tau}_{k}, thus giving \bm{\xi}_{k}=\bm{\tau}_{k}+\bm{w}_{k}. Under either a uniform or a triangular dither, \bm{w}_{k} is independent of \bm{\widetilde{x}}_{k} and follows \mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}) (see Section 2.2), thus allowing us to calculate that

\mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}) = \mathbbm{E}\big((\bm{\widetilde{x}}_{k}+\bm{\xi}_{k})(\bm{\widetilde{x}}_{k}+\bm{\xi}_{k})^{\top}\big) = \mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top})+\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\xi}_{k}^{\top})+\mathbbm{E}(\bm{\xi}_{k}\bm{\widetilde{x}}_{k}^{\top})+\mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top}) \stackrel{(i)}{=} \mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top})+\mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top}). (3.1)

Note that (i)(i) is because 𝔼(𝝃k𝒙~k)=𝔼(𝝉k𝒙~k)+𝔼(𝒘k𝒙~k)=𝔼(𝝉k)𝔼(𝒙~k)+𝔼(𝒘k)𝔼(𝒙~k)=0\mathbbm{E}(\bm{\xi}_{k}\bm{\widetilde{x}}_{k}^{\top})=\mathbbm{E}(\bm{\tau}_{k}\bm{\widetilde{x}}_{k}^{\top})+\mathbbm{E}(\bm{w}_{k}\bm{\widetilde{x}}_{k}^{\top})=\mathbbm{E}(\bm{\tau}_{k})\mathbbm{E}(\bm{\widetilde{x}}_{k}^{\top})+\mathbbm{E}(\bm{w}_{k})\mathbbm{E}(\bm{\widetilde{x}}_{k}^{\top})=0, due to the previously noted fact that 𝝉k\bm{\tau}_{k} and 𝒘k\bm{w}_{k} are independent of 𝒙~k\bm{\widetilde{x}}_{k} and zero-mean. While with suitable choice of the truncation threshold 𝔼(𝒙~k𝒙~k)\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}) is expected to well approximate 𝚺\bm{\Sigma^{\star}}, the remaining 𝔼(𝝃k𝝃k)\mathbb{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top}) gives rise to constant bias. To address the issue, a straightforward idea is to remove the bias, which requires the full knowledge on 𝔼(𝝃k𝝃k)\mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top}), i.e., the covariance matrix of the quantization noise. For iji\neq j, because 𝝉k\bm{\tau}_{k}, 𝒘k𝒰([Δ2,Δ2]d)\bm{w}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}) and 𝔼(wkiτkj)=𝔼x~ki(𝔼[wkiτkj|x~ki])=0\mathbbm{E}(w_{ki}\tau_{kj})=\mathbbm{E}_{\widetilde{x}_{ki}}(\mathbbm{E}[w_{ki}\tau_{kj}|\widetilde{x}_{ki}])=0 (note that conditionally on x~ki\widetilde{x}_{ki}, wki=𝒬Δ(x~ki+τki)(x~ki+τki)w_{ki}=\mathcal{Q}_{\Delta}(\widetilde{x}_{ki}+\tau_{ki})-(\widetilde{x}_{ki}+\tau_{ki}) and τkj\tau_{kj} are independent), we have

\mathbbm{E}(\xi_{ki}\xi_{kj})=\mathbbm{E}\big((w_{ki}+\tau_{ki})(w_{kj}+\tau_{kj})\big) = \mathbbm{E}(w_{ki}w_{kj})+\mathbbm{E}(w_{ki}\tau_{kj})+\mathbbm{E}(\tau_{ki}w_{kj})+\mathbbm{E}(\tau_{ki}\tau_{kj})=0,

showing that 𝔼(𝝃k𝝃k)\mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top}) is diagonal. Moreover, under triangular dither the ii-th diagonal entry is also known as 𝔼|ξki|2=Δ24\mathbbm{E}|\xi_{ki}|^{2}=\frac{\Delta^{2}}{4}, see Section 2.2. Taken collectively, we arrive at

\mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top})=\frac{\Delta^{2}}{4}\bm{I}_{d}. (3.2)

Based on (3.1) we thus propose the following estimator

\bm{\widehat{\Sigma}}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\Delta^{2}}{4}\bm{I}_{d}, (3.3)

which is the sample covariance of the quantized samples \dot{\mathscr{X}}:=\{\bm{\dot{x}}_{1},...,\bm{\dot{x}}_{n}\} followed by a correction step. On the other hand, the reason why the standard uniform dither is not suitable for QCME becomes self-evident: the diagonal of \mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top}) remains unknown (it depends on the input signal, see [41, Page 3]), and hence there is no hope of precisely removing the bias.
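To make the construction concrete, here is a minimal NumPy sketch of the estimator (3.3) with the element-wise truncation used in Theorem 2 below; the function name is ours, and the truncation threshold zeta is left as an input to be set as in the theorems.

import numpy as np

def quantized_cov_estimator(X, Delta, zeta, rng=None):
    # X: n-by-d array whose rows are the samples x_k.
    # Truncate element-wise, add a triangular dither, uniformly quantize,
    # then form the sample covariance of the quantized samples and subtract
    # the known quantization-noise covariance (Delta^2 / 4) * I_d, as in (3.3).
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    X_trunc = np.sign(X) * np.minimum(np.abs(X), zeta)             # element-wise truncation
    tau = rng.uniform(-Delta/2, Delta/2, (n, d)) \
        + rng.uniform(-Delta/2, Delta/2, (n, d))                   # triangular dither
    X_dot = Delta * (np.floor((X_trunc + tau) / Delta) + 0.5)      # dithered uniform quantization
    return X_dot.T @ X_dot / n - (Delta**2 / 4) * np.eye(d)        # bias-corrected sample covariance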

We are now ready to present error bounds for 𝚺^\bm{\widehat{\Sigma}} under max-norm, operator norm. We will also investigate the high-dimensional setting by assuming sparse structure of 𝚺\bm{\Sigma^{\star}}, for which we propose a thresholding estimator. More concretely, our first result provides the error rate under \|\cdot\|_{\infty}, in which we assume 𝒙k\bm{x}_{k} satisfies the marginal fourth moment constraint and utilize an element-wise truncation 𝒙~k=𝒯ζ(𝒙k)\bm{\widetilde{x}}_{k}=\mathscr{T}_{\zeta}(\bm{x}_{k}).

Theorem 2.

(Element-Wise Error). Given Δ>0\Delta>0 and δ>4\delta>4, we consider the problem of QCME described above. We suppose that 𝒙k\bm{x}_{k}s are i.i.d. zero-mean and satisfy the marginal moment constraint 𝔼|xki|4M\mathbbm{E}|x_{ki}|^{4}\leq M for any i[d]i\in[d], where xkix_{ki} is the ii-th entry of 𝒙k\bm{x}_{k}. We truncate 𝒙k\bm{x}_{k} to 𝒙~k=[x~ki]=𝒯ζ(𝒙k)\bm{\widetilde{x}}_{k}=[\widetilde{x}_{ki}]=\mathscr{T}_{\zeta}(\bm{x}_{k}) with threshold ζ(nMδlogd)1/4\zeta\asymp\big{(}\frac{nM}{\delta\log d}\big{)}^{1/4}, then quantize 𝒙~k\bm{\widetilde{x}}_{k} to 𝒙˙k=𝒬Δ(𝒙~k+𝝉k)\bm{\dot{x}}_{k}=\mathcal{Q}_{\Delta}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k}) with triangular dither 𝝉k𝒰([Δ2,Δ2]d)+𝒰([Δ2,Δ2]d)\bm{\tau}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d})+\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}). If nδlogdn\gtrsim\delta\log d, then the estimator in (3.3) satisfies

\mathbbm{P}\left(\|\bm{\widehat{\Sigma}}-\bm{\Sigma^{\star}}\|_{\infty}\geq C\mathscr{L}\sqrt{\frac{\delta\log d}{n}}\right)\leq 2d^{2-\delta},

where \mathscr{L}:=\sqrt{M}+\Delta^{2}.

Notably, despite the heavy-tailedness and quantization, the estimator achieves an element-wise rate O(\sqrt{\frac{\log d}{n}}), coincident with the one for the sub-Gaussian case. The quantization level \Delta enters only through the multiplicative factor \mathscr{L}=\sqrt{M}+\Delta^{2}. Thus, the information loss incurred by quantization is inessential in that it does not affect the key scaling law but only slightly worsens the leading factor. These remarks on the (near) optimality and the information loss incurred by quantization remain valid in our subsequent theorems.
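As a usage sketch of the estimator outlined after (3.3) (the sample size, dimension, and constants below are purely illustrative, and the threshold follows Theorem 2 with the unspecified constant omitted):

import numpy as np

n, d, Delta, delta, M = 5000, 50, 0.5, 5.0, 25.0        # illustrative values; E|x_ki|^4 = 25 for t(5)
rng = np.random.default_rng(1)
X = rng.standard_t(df=5, size=(n, d))                   # heavy-tailed samples with finite 4th moment
zeta = (n * M / (delta * np.log(d))) ** 0.25            # truncation threshold of Theorem 2 (constant omitted)
Sigma_hat = quantized_cov_estimator(X, Delta, zeta, rng)
print(np.max(np.abs(Sigma_hat - (5 / 3) * np.eye(d))))  # true covariance of t(5) samples is (5/3) * I_d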

Our next result concerns the operator norm estimation error, under which we impose a stronger joint moment constraint on 𝒙k\bm{x}_{k} and truncate 𝒙k\bm{x}_{k} regarding 4\ell_{4}-norm, i.e., 𝒙ˇ𝒌=𝒙k𝒙k4min{𝒙k4,ζ}\bm{\check{x}_{k}}=\frac{\bm{x}_{k}}{\|\bm{x}_{k}\|_{4}}\min\{\|\bm{x}_{k}\|_{4},\zeta\} for some threshold ζ\zeta. After the dithered uniform quantization, we still define the estimator as (3.3).

Theorem 3.

(Operator Norm Error). Given Δ>0\Delta>0 and δ>0\delta>0, we consider the problem of QCME described above. Suppose that the i.i.d. zero-mean 𝒙k\bm{x}_{k}s satisfy 𝔼|𝒗𝒙k|4M\mathbbm{E}|\bm{v}^{\top}\bm{x}_{k}|^{4}\leq M for any 𝒗𝕊d1\bm{v}\in\mathbb{S}^{d-1}. We truncate 𝒙k\bm{x}_{k} to 𝒙ˇk=𝒙k𝒙k4min{𝒙k4,ζ}\bm{\check{x}}_{k}=\frac{\bm{x}_{k}}{\|\bm{x}_{k}\|_{4}}\min\{\|\bm{x}_{k}\|_{4},\zeta\} with threshold ζ(M1/4+Δ)(nδlogd)1/4\zeta\asymp(M^{1/4}+\Delta)\big{(}\frac{n}{\delta\log d}\big{)}^{1/4}, then quantize 𝒙ˇk\bm{\check{x}}_{k} to 𝒙˙k=𝒬Δ(𝒙ˇk+𝝉k)\bm{\dot{x}}_{k}=\mathcal{Q}_{\Delta}(\bm{\check{x}}_{k}+\bm{\tau}_{k}) with triangular dither 𝝉k𝒰([Δ2,Δ2]d)+𝒰([Δ2,Δ2]d)\bm{\tau}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d})+\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}). If nδdlogdn\gtrsim\delta d\log d, then the estimator in (3.3) satisfies

\mathbbm{P}\left(\|\bm{\widehat{\Sigma}}-\bm{\Sigma^{\star}}\|_{op}\geq C\mathscr{L}\sqrt{\frac{\delta d\log d}{n}}\right)\leq 2d^{-\delta},

with \mathscr{L}:=\sqrt{M}+\Delta^{2}.

The operator norm error rate in Theorem 3 is near minimax optimal, e.g., compared to the lower bound in [35, Thm. 7], which states that for any estimator 𝚺^\bm{\widehat{\Sigma}} of the positive semi-definite matrix 𝚺\bm{\Sigma^{\star}} based on i.i.d. zero-mean {𝒙k}k=1n\{\bm{x}_{k}\}_{k=1}^{n} with covariance matrix 𝚺\bm{\Sigma^{\star}}, there exists some 𝒗0𝕊d1\bm{v}_{0}\in\mathbb{S}^{d-1} such that (𝚺^𝚺op1486dn)13\mathbbm{P}\big{(}\|\bm{\widehat{\Sigma}}-\bm{\Sigma^{\star}}\|_{op}\geq\frac{1}{48}\sqrt{\frac{6d}{n}}\big{)}\geq\frac{1}{3}, where 𝚺=𝑰d+𝒗0𝒗0\bm{\Sigma^{\star}}=\bm{I}_{d}+\bm{v}_{0}\bm{v}_{0}^{\top}. Again, the quantization only affects the multiplicative factor \mathscr{L}. Nevertheless, one still needs (at least) ndn\gtrsim d to achieve small operator norm error. In fact, in a high-dimensional setting where dd may exceed nn, even the sample covariance 1nk=1n𝒙k𝒙k\frac{1}{n}\sum_{k=1}^{n}\bm{x}_{k}\bm{x}_{k}^{\top} for sub-Gaussian zero-mean 𝒙k\bm{x}_{k} may have extremely bad performance. To achieve small operator norm error in a high-dimensional regime, we resort to additional structure on 𝚺\bm{\Sigma^{\star}}, and specifically we use column-wise sparsity as an example, which corresponds to the situations where dependencies among different coordinates are weak. Based on the estimator in Theorem 2, we further invoke a thresholding regularization [4, 12] to promote sparsity.

Theorem 4.

(Sparse QCME). Under conditions and estimator 𝚺^\bm{\widehat{\Sigma}} in Theorem 2, we additionally assume that all columns of 𝚺=[σij]\bm{\Sigma}^{\star}=[\sigma^{\star}_{ij}] are ss-sparse and consider the thresholding estimator 𝚺^s:=𝒯μ(𝚺^)\bm{\widehat{\Sigma}}_{s}:=\mathcal{T}_{\mu}(\bm{\widehat{\Sigma}}) for some μ\mu (recall that 𝒯μ(a)=a𝟙(|a|μ)\mathcal{T}_{\mu}(a)=a\cdot\mathbbm{1}(|a|\geq\mu) for aa\in\mathbb{R}). If μ=C1(M+Δ2)δlogdn\mu=C_{1}(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C1C_{1}, then 𝚺^s\bm{\widehat{\Sigma}}_{s} satisfies

\mathbbm{P}\left(\|\bm{\widehat{\Sigma}}_{s}-\bm{\Sigma^{\star}}\|_{op}\leq C\mathscr{L}s\sqrt{\frac{\delta\log d}{n}}\right)\geq 1-\exp(-0.25\delta),

where \mathscr{L}:=\sqrt{M}+\Delta^{2}.

Notably, our estimator 𝚺^s\bm{\widehat{\Sigma}}_{s} achieves minimax rates O(slogdn)O\big{(}s\sqrt{\frac{\log d}{n}}\big{)} under operator norm, e.g., compared to the minimax lower bound derived in [12, Thm. 2], which states that (under some regular scaling) for any covariance estimator 𝚺es\bm{\Sigma}_{es} based on nn i.i.d. samples of 𝒩(𝝁,𝚺)\mathcal{N}(\bm{\mu},\bm{\Sigma}^{\star}) where 𝚺\bm{\Sigma}^{\star} is the true covariance matrix, there exists some covariance matrix 𝚺\bm{\Sigma^{\star}} with ss-sparse columns such that 𝔼𝚺es𝚺op2s2logdn\mathbbm{E}\|\bm{\Sigma}_{es}-\bm{\Sigma}^{\star}\|_{op}^{2}\gtrsim s^{2}\frac{\log d}{n}.

To analyse the thresholding estimator, our proof resembles the ones developed in prior works (e.g., [12]) but requires more effort, such as bounding the additional bias terms arising from the data truncation and quantization. We also point out that the results for the full-data unquantized regime immediately follow by setting \Delta=0; thus Theorems 2-3 represent a strict extension of [35, Sect. 4], and Theorem 4 complements [35] with a high-dimensional sparse setting.
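For completeness, a minimal sketch of the thresholding estimator of Theorem 4, reusing the covariance estimator sketched in Section 3.1; the threshold mu is an input to be chosen as in the theorem.

import numpy as np

def sparse_quantized_cov_estimator(X, Delta, zeta, mu, rng=None):
    # Hard-threshold the entries of the bias-corrected estimator (3.3).
    Sigma_hat = quantized_cov_estimator(X, Delta, zeta, rng)
    return Sigma_hat * (np.abs(Sigma_hat) >= mu)       # T_mu applied element-wise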

Remark 1.

(Sub-Gaussian Case). While we concentrate on quantization of heavy-tailed data in this work, our results can be readily adjusted to sub-Gaussian 𝒙k\bm{x}_{k}, for which the truncation step is inessential and can be removed (i.e., ζ=\zeta=\infty). These results are also new to the literature but will not be presented here.

3.2 Quantized Compressed Sensing

We consider the linear model

y_{k}=\bm{x}_{k}^{\top}\bm{\theta^{\star}}+\epsilon_{k},~k=1,...,n, (3.4)

where the \bm{x}_{k}s are the covariates, the y_{k}s are the responses, and \bm{\theta^{\star}} is the sparse signal in compressed sensing, or the sparse parameter vector in high-dimensional linear regression, that we want to estimate. In the quantized compressed sensing (QCS) problem, we are interested in developing a quantization scheme for the (\bm{x}_{k},y_{k})s (mainly for y_{k} in prior works) that enables accurate recovery of \bm{\theta^{\star}} based on the quantized data.

In spite of the same mathematical formulation, there are some important differences between compressed sensing and sparse linear regression that we should clarify first. Specifically, different from sensing vectors in compressed sensing that are generated by some analog measuring device and can oftentimes be designed, 𝒙k\bm{x}_{k}s in sparse linear regression represent the sample data from certain datasets that are believed to affect the responses yky_{k}s through (3.4). While the sparsity of 𝜽\bm{\theta^{\star}} is arguably the most classical signal structure for compressed sensing, due to good interpretability it is also commonly adopted to achieve dimension reduction in high-dimensional statistics. In this work, we are interested in both problem settings. Thus, we do not adopt the isotropic convention (i.e., 𝔼(𝒙k𝒙k)=𝑰d\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top})=\bm{I}_{d}) from compressed sensing but instead deal with 𝒙k\bm{x}_{k} having general unknown covariance matrix. While the study of quantization and heavy-tailed noise is meaningful in both settings, we note that some of our subsequent results are mainly of interest to the specific sensing or regression problem. For instance, the heavy-tailed covariate considered in Theorem 6 is primarily motivated by the regression setting, in which 𝒙k\bm{x}_{k} may come from a dataset that exhibits much heavier tail than sub-Gaussian data. Moreover, as will be elaborated in Section 4 when appropriate, our subsequent results on covariate quantization (resp., uniform signal recovery guarantee) may prove more useful to the regression problem (resp., compressed sensing problem).

To fix ideas, we assume that 𝒙k\bm{x}_{k}s are i.i.d. drawn from some multivariate distribution, ϵk\epsilon_{k}s are i.i.d. statistical noise independent of the 𝒙k\bm{x}_{k}s, and we truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) and then quantize it to y˙k=𝒬Δ(y~k+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). Under these statistical assumptions and dithered quantization, near optimal recovery guarantees have been established in [89, 94] for the regime where both 𝒙k\bm{x}_{k} and ϵk\epsilon_{k} are drawn from sub-Gaussian distributions (hence the truncation is not needed). In contrast, our focus is on quantization of heavy-tailed data. Particularly, we always assume that the noise ϵk\epsilon_{k}s are i.i.d. drawn from some heavy-tailed distribution, resulting in heavy-tailed responses. We will separately deal with the case of sub-Gaussian covariates and the more challenging situation where 𝒙k\bm{x}_{k}s are also heavy-tailed.
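For concreteness, the following is a minimal NumPy sketch of this truncate-then-dither pipeline for the responses. The threshold, quantization level, and noise distribution are purely illustrative, and the rounding convention used here for 𝒬_Δ (rounding to the grid ΔZ) is one common choice rather than a prescription from our analysis.

```python
import numpy as np

def truncate(v, zeta):
    """Element-wise truncation T_zeta(v): keep the sign, cap the magnitude at zeta."""
    return np.sign(v) * np.minimum(np.abs(v), zeta)

def dithered_uniform_quantize(v, Delta, rng):
    """Q_Delta(v + tau) with uniform dither tau ~ U([-Delta/2, Delta/2])."""
    tau = rng.uniform(-Delta / 2, Delta / 2, size=np.shape(v))
    return Delta * np.round((v + tau) / Delta)

rng = np.random.default_rng(0)
y = rng.standard_t(df=3, size=1000)   # heavy-tailed responses (illustrative)
zeta_y, Delta = 5.0, 0.5              # illustrative threshold and quantization level
y_dot = dithered_uniform_quantize(truncate(y, zeta_y), Delta, rng)
```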

To estimate the sparse 𝜽\bm{\theta^{\star}}, a classical approach is via the regularized M-estimator known as Lasso [90, 70, 72]

argmin𝜽12nk=1n(yk𝒙k𝜽)2+λ𝜽1,\displaystyle\mathop{\arg\min}\limits_{\bm{\theta}}~{}\frac{1}{2n}\sum_{k=1}^{n}(y_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2}+\lambda\|\bm{\theta}\|_{1},

whose objective combines the 2\ell_{2}-loss for data fidelity and 1\ell_{1}-norm that encourages sparsity. Because we can only access the quantized data (𝒙k,y˙k)(\bm{x}_{k},\dot{y}_{k}) (or even (𝒙˙k,y˙k)(\bm{\dot{x}}_{k},\dot{y}_{k}) if covariate quantization is involved, see Section 4), the main issue lies in the 2\ell_{2}-loss 12nk=1n(yk𝒙k𝜽)2\frac{1}{2n}\sum_{k=1}^{n}(y_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2} that requires the unquantized data (𝒙k,yk)(\bm{x}_{k},y_{k}). To resolve the issue, we calculate the expected 2\ell_{2}-loss:

\mathbbm{E}(y_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2}\stackrel{{\scriptstyle(i)}}{{=}}\bm{\theta}^{\top}\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top})\bm{\theta}-2\mathbbm{E}(y_{k}\bm{x}_{k})^{\top}\bm{\theta}\stackrel{{\scriptstyle(ii)}}{{=}}\bm{\theta}^{\top}\bm{\Sigma^{\star}}\bm{\theta}-2\bm{\Sigma}_{y\bm{x}}^{\top}\bm{\theta}, (3.5)

where (i)(i) holds up to an inessential constant 𝔼|yk|2\mathbbm{E}|y_{k}|^{2}, and in (ii)(ii) we let 𝚺:=𝔼(𝒙k𝒙k)\bm{\Sigma^{\star}}:=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}), 𝚺y𝒙=𝔼(yk𝒙k)\bm{\Sigma}_{y\bm{x}}=\mathbbm{E}(y_{k}\bm{x}_{k}). This inspires us to generalize the 2\ell_{2} loss to 12𝜽𝑸𝜽𝒃𝜽\frac{1}{2}\bm{\theta}^{\top}\bm{Q}\bm{\theta}-\bm{b}^{\top}\bm{\theta} and consider the following program

𝜽^=argmin𝜽𝒮12𝜽𝑸𝜽𝒃𝜽+λ𝜽1.\bm{\widehat{\theta}}=\mathop{\arg\min}\limits_{\bm{\theta}\in\mathcal{S}}~{}\frac{1}{2}\bm{\theta}^{\top}\bm{Q}\bm{\theta}-\bm{b}^{\top}\bm{\theta}+\lambda\|\bm{\theta}\|_{1}. (3.6)

Compared to (3.5), we will use (𝑸,𝒃)(\bm{Q},\bm{b}) that well approximates (𝚺,𝚺y𝒙)(\bm{\Sigma^{\star}},\bm{\Sigma}_{y\bm{x}}), and we also introduce the constraint 𝜽𝒮\bm{\theta}\in\mathcal{S} to allow more flexibility. This is our general strategy for designing estimators in the different QCS settings of this work; see more discussion in Remark 3.
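To illustrate how (3.6) can be solved when 𝑸 is positive semi-definite and 𝒮 = ℝ^d (the situation of Theorems 5-6 below), here is a minimal proximal-gradient (ISTA-style) sketch; the step size and iteration count are illustrative and not tuned.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def generalized_lasso(Q, b, lam, n_iter=500):
    """Minimize 0.5 * theta' Q theta - b' theta + lam * ||theta||_1 (Q assumed PSD)."""
    theta = np.zeros_like(b)
    step = 1.0 / np.linalg.norm(Q, 2)      # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = Q @ theta - b               # gradient of the quadratic part
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```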

The next theorem is concerned with QCS under sub-Gaussian covariate but heavy-tailed response. Note that the heavy-tailedness of yky_{k} stems from the noise distribution assumed to have bounded 2+ν2+\nu moment (ν=2(l1)>0\nu=2(l-1)>0 in the theorem statement), but following [35, 21, 97] we directly impose the moment constraint on the response.

Theorem 5.

(Sub-Gaussian Covariate, Heavy-Tailed Response). Given some δ>0,Δ>0\delta>0,\Delta>0, in (3.4) we suppose that 𝒙k\bm{x}_{k}s are i.i.d., zero-mean sub-Gaussian with 𝒙kψ2σ\|\bm{x}_{k}\|_{\psi_{2}}\leq\sigma, κ0λmin(𝚺)λmax(𝚺)κ1\kappa_{0}\leq\lambda_{\min}(\bm{\Sigma^{\star}})\leq\lambda_{\max}(\bm{\Sigma^{\star}})\leq\kappa_{1} for some κ1>κ0>0\kappa_{1}>\kappa_{0}>0 where 𝚺=𝔼(𝒙k𝒙k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}), 𝜽d\bm{\theta^{\star}}\in\mathbb{R}^{d} is ss-sparse, the noise ϵk\epsilon_{k}s are i.i.d. heavy-tailed and independent of 𝒙k\bm{x}_{k}s, and we assume 𝔼|yk|2lM\mathbbm{E}|y_{k}|^{2l}\leq M for some fixed l>1l>1. In the quantization, we truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) with threshold ζy(nM1/lδlogd)1/2\zeta_{y}\asymp\big{(}\frac{nM^{1/l}}{\delta\log d}\big{)}^{1/2}, then quantize y~k\widetilde{y}_{k} to y˙k=𝒬Δ(y~k+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). For recovery, we define the estimator 𝜽^\bm{\widehat{\theta}} as (3.6) with 𝑸=1nk=1n𝒙k𝒙k\bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{x}_{k}\bm{x}_{k}^{\top}, 𝒃=1nk=1ny˙k𝒙k\bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{x}_{k}, 𝒮=d\mathcal{S}=\mathbb{R}^{d}. We set λ=C1σ2κ0(Δ+M1/(2l))δlogdn\lambda=C_{1}\frac{\sigma^{2}}{\sqrt{\kappa_{0}}}(\Delta+M^{1/(2l)})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C1C_{1}. If nδslogdn\gtrsim\delta s\log d for some hidden constant only depending on (κ0,σ)(\kappa_{0},\sigma), then with probability at least 19d1δ1-9d^{1-\delta}, the estimation error 𝚼^=𝜽^𝜽\bm{\widehat{\Upsilon}}=\bm{\widehat{\theta}}-\bm{\theta^{\star}} satisfies

𝚼^2C3δslogdnand𝚼^1C4sδlogdn\|\bm{\widehat{\Upsilon}}\|_{2}\leq C_{3}\mathscr{L}\sqrt{\frac{\delta s\log d}{n}}~{}~{}~{}\mathrm{and}~{}~{}~{}\|\bm{\widehat{\Upsilon}}\|_{1}\leq C_{4}\mathscr{L}s\sqrt{\frac{\delta\log d}{n}}~{}

where :=σ2(Δ+M1/(2l))κ03/2\mathscr{L}:=\frac{\sigma^{2}(\Delta+M^{1/(2l)})}{\kappa_{0}^{3/2}}.

The rate O(slogdn)O\big{(}\sqrt{\frac{s\log d}{n}}\big{)} for 2\ell_{2}-norm error is minimax optimal up to logarithmic factor (e.g., compared to [81]). Note that a random noise bounded by Δ\Delta roughly contributes Δ\Delta to (𝔼|yk|2l)1/(2l)(\mathbbm{E}|y_{k}|^{2l})^{1/(2l)}, and the latter is bounded by M1/(2l)M^{1/(2l)}; because in the error bound Δ\Delta and M1/(2l)M^{1/(2l)} almost play the same role, the effect of uniform quantization can be readily interpreted as an additional bounded noise, analogously to the error rate in [87].

Next, we switch to the more challenging situation where both 𝒙k\bm{x}_{k} and yky_{k} are heavy-tailed, assuming that they both possess bounded fourth moments (a marginal moment constraint for 𝒙k\bm{x}_{k}). This setting is motivated by sparse linear regression, where the covariates 𝒙k\bm{x}_{k}s may oftentimes exhibit heavy-tailed behaviour. Specifically, we truncate 𝒙k\bm{x}_{k} element-wise to 𝒙~k\bm{\widetilde{x}}_{k} and set 𝑸:=1nk=1n𝒙~k𝒙~k\bm{Q}:=\frac{1}{n}\sum_{k=1}^{n}\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top} as a robust covariance matrix estimator, whose estimation performance under \|\cdot\|_{\infty} follows immediately from Theorem 2 by setting Δ=0\Delta=0.

Theorem 6.

(Heavy-Tailed Covariate, Heavy-Tailed Response). Given some δ>0\delta>0, Δ>0\Delta>0, in (3.4) we suppose that 𝒙k\bm{x}_{k}s are i.i.d. zero-mean satisfying a marginal fourth moment constraint supi[d]𝔼|xki|4M\sup_{i\in[d]}\mathbbm{E}|x_{ki}|^{4}\leq M, κ0λmin(𝚺)λmax(𝚺)κ1\kappa_{0}\leq\lambda_{\min}(\bm{\Sigma^{\star}})\leq\lambda_{\max}(\bm{\Sigma^{\star}})\leq\kappa_{1} for some κ1>κ0>0\kappa_{1}>\kappa_{0}>0 where 𝚺=𝔼(𝒙k𝒙k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}), 𝜽Σs\bm{\theta^{\star}}\in\Sigma_{s} satisfies 𝜽1R\|\bm{\theta^{\star}}\|_{1}\leq R, the noise ϵk\epsilon_{k}s are i.i.d. heavy-tailed and independent of 𝒙k\bm{x}_{k}s, and we assume 𝔼|yk|4M\mathbbm{E}|y_{k}|^{4}\leq M. In the quantization, we truncate 𝒙k,yk\bm{x}_{k},y_{k} respectively to 𝒙~k=[x~ki]=𝒯ζx(𝒙k),y~k:=𝒯ζy(yk)\bm{\widetilde{x}}_{k}=[\widetilde{x}_{ki}]=\mathscr{T}_{\zeta_{x}}(\bm{x}_{k}),~{}\widetilde{y}_{k}:=\mathscr{T}_{\zeta_{y}}(y_{k}) with ζx,ζy(nMδlogd)1/4\zeta_{x},\zeta_{y}\asymp\big{(}\frac{nM}{\delta\log d}\big{)}^{1/4}, then we quantize y~k\widetilde{y}_{k} to y˙k=𝒬Δ(y~k+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). For recovery, we define the estimator 𝜽^\bm{\widehat{\theta}} as (3.6) with 𝑸=1nk=1n𝒙~k𝒙~k\bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}, 𝒃=1nk=1ny˙k𝒙~k\bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\widetilde{x}}_{k}, 𝒮=d\mathcal{S}=\mathbb{R}^{d}. We set λ=C1(RM+Δ2)δlogdn\lambda=C_{1}(R\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C1C_{1}. If nδs2logdn\gtrsim\delta s^{2}\log d for some hidden constant only depending on (κ0,M)(\kappa_{0},M), then with probability at least 14d2δ1-4d^{2-\delta}, the estimation error 𝚼^:=𝜽^𝜽\bm{\widehat{\Upsilon}}:=\bm{\widehat{\theta}}-\bm{\theta^{\star}} satisfies

𝚼^2C2δslogdnand𝚼^1C3sδlogdn\|\bm{\widehat{\Upsilon}}\|_{2}\leq C_{2}\mathscr{L}\sqrt{\frac{\delta s\log d}{n}}~{}~{}~{}\mathrm{and}~{}~{}~{}\|\bm{\widehat{\Upsilon}}\|_{1}\leq C_{3}\mathscr{L}s\sqrt{\frac{\delta\log d}{n}}~{}

where :=RM+Δ2κ0\mathscr{L}:=\frac{R\sqrt{M}+\Delta^{2}}{\kappa_{0}}.

Theorem 6 generalizes [35, Thm. 2(b)] to the uniform quantization setting. Clearly, the obtained rate remains near minimax optimal if RR is of minor scaling (e.g., bounded or logarithmic factors). Nevertheless, such near optimality in Theorem 6 comes at the cost of more restricted conditions and stronger scaling, as remarked in the following.

Remark 2.

(Comparing Theorems 5-6). Compared with nslogdn\gtrsim s\log d in Theorem 5, the first downside of Theorem 6 is the sub-optimal sample complexity ns2logdn\gtrsim s^{2}\log d, and note that ns2logdn\gtrsim s^{2}\log d is also required in [35, Thm. 2(b)]. But indeed, it can be improved to nslogdn\gtrsim s\log d by explicitly adding the constraint 𝛉1R\|\bm{\theta}\|_{1}\leq R to the recovery program, as will be noted as an interesting side finding in Remark 6. Secondly, following [35] we impose an 1\ell_{1}-norm constraint 𝛉1R\|\bm{\theta^{\star}}\|_{1}\leq R that is stronger than 𝛉2M1/(2l)σ\|\bm{\theta^{\star}}\|_{2}\lesssim\frac{M^{1/(2l)}}{\sigma} used in the proof of Theorem 5. In fact, when replacing the 1\ell_{1} constraint in Theorem 6 with an 2\ell_{2}-norm bound 𝛉2R\|\bm{\theta^{\star}}\|_{2}\leq R, then our proof technique leads to an error rate 𝚼^2=O(s2logdn)\|\bm{\widehat{\Upsilon}}\|_{2}=O\big{(}\sqrt{\frac{s^{2}\log d}{n}}\big{)} that exhibits worse dependence on ss.

Remark 3.

(Modification of 2\ell_{2}-loss). Recall that we generalize the regular 2\ell_{2}-loss 12nk=1n(yk𝐱k𝛉)2\frac{1}{2n}\sum_{k=1}^{n}(y_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2} to 12𝛉𝐐𝛉𝐛𝛉\frac{1}{2}\bm{\theta}^{\top}\bm{Q\theta}-\bm{b}^{\top}\bm{\theta} as loss function in (3.6). Note that the choice of (𝐐,𝐛)(\bm{Q},\bm{b}) in Theorem 5 is tantamount to using the loss function 12nk=1n(y˙k𝐱k𝛉)2\frac{1}{2n}\sum_{k=1}^{n}(\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2} that replaces yky_{k} with the quantized response y˙k\dot{y}_{k}; this idea is analogous to the generalized Lasso investigated for single index model [78] and dithered quantized model [89], and will be used again in quantized matrix completion, see (3.8) below. However, our generalized 2\ell_{2}-loss provides more flexibility to deal with heavy-tailedness or quantization of 𝐱k\bm{x}_{k}, e.g., (𝐐,𝐛)(\bm{Q},\bm{b}) in Theorem 6 amounts to adopting 12nk=1n(y˙k𝐱~k𝛉)2\frac{1}{2n}\sum_{k=1}^{n}({\dot{y}_{k}}-\bm{\widetilde{x}}_{k}^{\top}\bm{\theta})^{2} as loss function, and under quantized covariate more delicate modifications are required in Theorems 9-12, which is beyond the range of prior works on generalized Lasso.

3.3 Quantized Matrix Completion

Completing a low-rank matrix from only a partial observation of its entries is known as the matrix completion problem, which has found many applications including recommender systems, image inpainting and quantum state tomography [19, 27, 2, 74, 42], to name just a few. Mathematically, letting 𝚯d×d\bm{\Theta^{\star}}\in\mathbb{R}^{d\times d} be the underlying matrix satisfying rank(𝚯)r\operatorname{rank}(\bm{\Theta^{\star}})\leq r, the matrix completion problem can be formulated as

yk=<𝑿k,𝚯>+ϵk,k=1,2,,n,y_{k}=\big{<}\bm{X}_{k},\bm{\Theta^{\star}}\big{>}+\epsilon_{k},~{}k=1,2,...,n, (3.7)

where 𝑿k\bm{X}_{k}s are distributed on 𝒳:={𝒆i𝒆j:i,j[d]}\mathcal{X}:=\{\bm{e}_{i}\bm{e}_{j}^{\top}:i,j\in[d]\} (𝒆i\bm{e}_{i} is the ii-th column of 𝑰d\bm{I}_{d}) and ϵk\epsilon_{k} is the observation noise. Note that for 𝑿k=𝒆i(k)𝒆j(k)\bm{X}_{k}=\bm{e}_{i(k)}\bm{e}_{j(k)}^{\top} one has <𝑿k,𝚯>=θi(k),j(k)\big{<}\bm{X}_{k},\bm{\Theta^{\star}}\big{>}=\theta^{\star}_{i(k),j(k)}, so each observation is a noisy entry. Our main interest is in quantized matrix completion (QMC), where the goal is to design a quantizer for the observation yky_{k} that allows for accurate estimation of 𝚯\bm{\Theta^{\star}} from the quantized observations.

Unlike in compressed sensing, an additional condition (besides the low-rankness) on 𝚯\bm{\Theta^{\star}} is needed to ensure the well-posedness of the matrix completion problem. More specifically, certain incoherence conditions are required if we pursue exact recovery (e.g., [15, 82]), whereas a faithful estimation can be achieved as long as the underlying matrix is not overly spiky and is sufficiently diffuse (e.g., [51, 71]). The latter condition is also known as “low spikiness” and is formulated as d𝚯𝚯Fα\frac{d\|\bm{\Theta^{\star}}\|_{\infty}}{\|\bm{\Theta^{\star}}\|_{F}}\leq\alpha [35, 71], which has been noted to be necessary for the well-posedness of the matrix completion problem [27, 71]. In subsequent works the low-spikiness condition is often formulated as the simpler max-norm constraint 𝚯α\|\bm{\Theta^{\star}}\|_{\infty}\leq\alpha [51, 19, 53, 26, 37].

In this work, we consider the uniform sampling scheme 𝑿k𝒰(𝒳)\bm{X}_{k}\sim\mathscr{U}(\mathcal{X}), but with a little bit more work it generalizes to more general sampling scheme [51]. We apply the proposed quantization scheme to possibly heavy-tailed yky_{k} — we truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) with some threshold ζy\zeta_{y}, and then quantize y~k\widetilde{y}_{k} to y˙k=𝒬Δ(y~k+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). Because we do not pursue exact recovery (which is impossible under quantization), we do not assume any incoherence condition like [82]. Instead, we only hope to accurately estimate 𝚯\bm{\Theta^{\star}}, and following [51, 19, 53, 26, 37] we impose a max-norm constraint

𝚯α.\|\bm{\Theta^{\star}}\|_{\infty}\leq\alpha.

Overall, we estimate 𝚯\bm{\Theta^{\star}} from (𝑿k,y˙k)(\bm{X}_{k},\dot{y}_{k}) by the regularized M-estimator [70, 72]

𝚯^=argmin𝚯α12nk=1n(y˙k<𝑿k,𝚯>)2+λ𝚯nu\bm{\widehat{\Theta}}=\mathop{\arg\min}\limits_{\|\bm{\Theta}\|_{\infty}\leq\alpha}~{}\frac{1}{2n}\sum_{k=1}^{n}\big{(}\dot{y}_{k}-\big{<}\bm{X}_{k},\bm{\Theta}\big{>}\big{)}^{2}+\lambda\|\bm{\Theta}\|_{nu} (3.8)

that combines an 2\ell_{2}-loss with a nuclear norm regularizer.
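As an informal illustration of (3.8), the following sketch alternates a gradient step on the squared loss over the observed entries, singular-value soft-thresholding for the nuclear norm, and entry-wise clipping for the max-norm constraint. This alternation is only a heuristic stand-in for the ADMM solver used in our experiments (Section 5), and all parameters are illustrative.

```python
import numpy as np

def svt(M, t):
    """Singular value soft-thresholding: proximal operator of t * ||.||_nu."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def qmc_estimate(rows, cols, y_dot, d, lam, alpha, step=1.0, n_iter=300):
    """Estimate Theta from quantized observations y_dot at entries (rows[k], cols[k])."""
    Theta = np.zeros((d, d))
    n = len(y_dot)
    for _ in range(n_iter):
        resid = Theta[rows, cols] - y_dot          # residuals on observed entries
        grad = np.zeros((d, d))
        np.add.at(grad, (rows, cols), resid / n)   # gradient of the averaged squared loss
        Theta = svt(Theta - step * grad, step * lam)
        Theta = np.clip(Theta, -alpha, alpha)      # enforce the max-norm constraint
    return Theta
```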

In the literature, there has been a line of works on 1-bit or multi-bit matrix completion related to the results we present here [14, 59, 52, 17, 3]. While the referenced works commonly adopted a likelihood approach, our method is an essential departure and enjoys some advantages; see a precise comparison in Remark 4. In light of this novelty, we include the result for sub-exponential ϵk\epsilon_{k} in Theorem 7, for which the truncation of yky_{k} becomes unnecessary and we simply set ζy=\zeta_{y}=\infty.

Theorem 7.

(QMC under Sub-Exponential Noise). Given some Δ>0,δ>0\Delta>0,\delta>0, in (3.7) we suppose that 𝐗k\bm{X}_{k}s are i.i.d. uniformly distributed over 𝒳={𝐞i𝐞j:i,j[d]}\mathcal{X}=\{\bm{e}_{i}\bm{e}_{j}^{\top}:i,j\in[d]\}, 𝚯d×d\bm{\Theta^{\star}}\in\mathbb{R}^{d\times d} satisfies rank(𝚯)r\operatorname{rank}(\bm{\Theta^{\star}})\leq r and 𝚯α\|\bm{\Theta^{\star}}\|_{\infty}\leq\alpha, the noise ϵk\epsilon_{k}s are i.i.d. zero-mean sub-exponential satisfying ϵkψ1σ\|\epsilon_{k}\|_{\psi_{1}}\leq\sigma, and are independent of 𝐗k\bm{X}_{k}s. In the quantization, we do not truncate yky_{k} but directly quantize it to y˙k=𝒬Δ(yk+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(y_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). We choose λ=C1(σ+Δ)δlogdnd\lambda=C_{1}(\sigma+\Delta)\sqrt{\frac{\delta\log d}{nd}} with sufficiently large C1C_{1}, and define 𝚯^\bm{\widehat{\Theta}} as (3.8). If δdlog3dnδr2d2logd\delta d\log^{3}d\lesssim n\lesssim\delta r^{2}d^{2}\log d, then with probability at least 14dδ1-4d^{-\delta}, the estimation error 𝚼^:=𝚯^𝚯\bm{\widehat{\Upsilon}}:=\bm{\widehat{\Theta}}-\bm{\Theta^{\star}} satisfies

𝚼^FdC2δrdlogdnand𝚼^nudC3rδdlogdn\frac{\|\bm{\widehat{\Upsilon}}\|_{F}}{d}\leq C_{2}\mathscr{L}\sqrt{\frac{\delta rd\log d}{n}}~{}~{}\mathrm{and}~{}~{}\frac{\|\bm{\widehat{\Upsilon}}\|_{nu}}{d}\leq C_{3}\mathscr{L}r\sqrt{\frac{\delta d\log d}{n}}

where :=α+σ+Δ\mathscr{L}:=\alpha+\sigma+\Delta.

By contrast, under heavy-tailed noise only assumed to have bounded variance, we truncate yky_{k} with a suitable threshold before the dithered quantization to achieve an optimal trade-off between bias and variance.

Theorem 8.

(QMC under Heavy-tailed Noise). Given some Δ>0,δ>0\Delta>0,\delta>0, we consider (3.7) in the setting of Theorem 7 but with the assumption ϵkψ1σ\|\epsilon_{k}\|_{\psi_{1}}\leq\sigma replaced by 𝔼|ϵk|2M\mathbbm{E}|\epsilon_{k}|^{2}\leq M. In the quantization, we truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) with ζy(M+α)nδdlogd\zeta_{y}\asymp(\sqrt{M}+\alpha)\sqrt{\frac{n}{\delta d\log d}}, and then quantize y~k\widetilde{y}_{k} to y˙k=𝒬Δ(y~k+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). We choose λ=C1(α+M+Δ)δlogdnd\lambda=C_{1}(\alpha+\sqrt{M}+\Delta)\sqrt{\frac{\delta\log d}{nd}} with sufficiently large C1C_{1}, and define 𝚯^\bm{\widehat{\Theta}} as (3.8). If δdlogdnδr2d2logd\delta d\log d\lesssim n\lesssim\delta r^{2}d^{2}\log d, then with probability at least 16dδ1-6d^{-\delta}, the estimation error 𝚼^:=𝚯^𝚯\bm{\widehat{\Upsilon}}:=\bm{\widehat{\Theta}}-\bm{\Theta^{\star}} satisfies

𝚼^FdC2δrdlogdnand𝚼^nudC3rδdlogdn\frac{\|\bm{\widehat{\Upsilon}}\|_{F}}{d}\leq C_{2}\mathscr{L}\sqrt{\frac{\delta rd\log d}{n}}~{}~{}\mathrm{and}~{}~{}\frac{\|\bm{\widehat{\Upsilon}}\|_{nu}}{d}\leq C_{3}\mathscr{L}r\sqrt{\frac{\delta d\log d}{n}}

where :=α+M+Δ\mathscr{L}:=\alpha+\sqrt{M}+\Delta.

Compared to the information-theoretic lower bounds in [71, 55], the error rates obtained in Theorems 7-8 are minimax optimal up to logarithmic factors. In particular, Theorem 8 provides a near optimal guarantee for QMC with heavy-tailed observations, which is a key standpoint of this paper. Note that the 1-bit quantization counterparts of these two theorems were derived in our previous work [21]; in sharp contrast to Theorem 8, for 1-bit QMC under heavy-tailed noise, the error rate under 𝚼^Fd\frac{\|\bm{\widehat{\Upsilon}}\|_{F}}{d} in [21, Thm. 13] reads as O((r2dlogdn)1/4)O\big{(}\big{(}\frac{r^{2}d\log d}{n}\big{)}^{1/4}\big{)} and is essentially slower; when using the 1-bit observations therein, this slow error rate is indeed nearly tight due to the lower bound in [21, Thm. 14].

To close this section, we give a remark to illustrate the novelty and advantage of our QMC method by a careful comparison with prior works.

Remark 4.

QMC with 1-bit or multi-bit quantized observations has received considerable research interest [26, 14, 59, 52, 17, 3]. Adapted to our notation, these works studied the model y˙k=𝒬(<𝐗k,𝚯>+τk)\dot{y}_{k}=\mathcal{Q}(\big{<}\bm{X}_{k},\bm{\Theta^{\star}}\big{>}+\tau_{k}) under general random dither τk\tau_{k} and quantizer 𝒬(.)\mathcal{Q}(.), and they commonly adopted regularized (or constrained) maximum likelihood estimation for estimating 𝚯\bm{\Theta^{\star}}. By contrast, with the random dither and quantizer specialized to τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) and 𝒬Δ(.)\mathcal{Q}_{\Delta}(.), our model is formulated as y˙k=𝒬Δ(𝒯ζy(<𝐗k,𝚯>+ϵk)+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\mathscr{T}_{\zeta_{y}}(\big{<}\bm{X}_{k},\bm{\Theta^{\star}}\big{>}+\epsilon_{k})+\tau_{k}). Thus, while suffering from less generality in (τk,𝒬)(\tau_{k},\mathcal{Q}), our method embraces the advantage of robustness to pre-quantization noise ϵk\epsilon_{k}, whose distribution is unknown and can even be heavy-tailed. Note that such unknown ϵk\epsilon_{k} evidently forbids the likelihood approach.

4 Covariate Quantization and Uniform Signal Recovery in Quantized Compressed Sensing

By now we have presented near optimal results in the contexts of QCME, QCS and QMC for heavy-tailed data that further undergo the proposed quantization scheme, which we position as the primary contribution of this work. In this section, we provide two additional developments that enhance our results on heavy-tailed QCS.

4.1 Covariate Quantization

In the area of QCS, almost all prior works merely focused on the quantization of the response yky_{k}, see the recent survey [29]; here, we consider a setting of “complete quantization” — meaning that the covariate 𝒙k\bm{x}_{k} is also quantized. To motivate our study of “complete quantization”, we interpret compressed sensing as sparse linear regression. Indeed, to reduce power consumption and computational cost, it is sometimes preferable to work with low-precision data in a machine learning system, e.g., the sample quantization scheme developed in [96] led to experimental success in training linear models. Also, it was shown that direct gradient quantization may not be efficient in certain distributed learning systems where the terminal nodes are connected to the server only through a very weak communication fabric and the number of parameters is extremely large; rather, quantizing and transmitting some important samples could provably reduce communication cost [43]. In fact, the process of data collection may already appeal to quantization due to certain limits of the data acquisition device (e.g., a low-resolution analog-to-digital module used in distributed signal processing [25]). Our main goal is to understand how quantization of (𝒙k,yk)(\bm{x}_{k},y_{k})s affects the subsequent recovery/learning process, particularly showing that the simple dithered uniform quantization scheme still allows for accurate estimators that may even attain near minimax error rates. To our best knowledge, the only prior rigorous estimation guarantees for QCS with covariate quantization are [21, Thms. 7-8]; these two results require a restrictive and unnatural assumption, which we will also relax later.

4.1.1 Multi-bit QCS with Quantized Covariate

Since we will also consider 1-bit quantization, we more precisely refer to QCS under the uniform quantizer as multi-bit QCS. We will generalize Theorems 5-6 to covariate quantization in the next two theorems.

Let (𝒙˙k,y˙k)(\bm{\dot{x}}_{k},\dot{y}_{k}) be the quantized covariate-response pair; we first quickly sketch the idea of our approach. Specifically, we stick to the framework of the M-estimator in (3.6), which appeals to accurate surrogates for 𝚺=𝔼(𝒙k𝒙k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}) and 𝚺y𝒙=𝔼(yk𝒙k)\bm{\Sigma}_{y\bm{x}}=\mathbbm{E}(y_{k}\bm{x}_{k}) based on (𝒙˙k,y˙k)(\bm{\dot{x}}_{k},\dot{y}_{k}), where 𝒙˙𝒌\bm{\dot{x}_{k}} represents the quantized covariate. Fortunately, the surrogates can be constructed analogously to our QCME estimator when a triangular dither is used for quantizing 𝒙k\bm{x}_{k}. Let us first state our quantization scheme as follows:

  • Response Quantization. This is the same as Theorems 5-6. We truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) with threshold ζy\zeta_{y}, and then quantize y~k\widetilde{y}_{k} to y˙k=𝒬Δ(y~k+ϕk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\phi_{k}) with uniform dither ϕk𝒰([Δ2,Δ2])\phi_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) and quantization level Δ0\Delta\geq 0.

  • Covariate Quantization. This is the same as Theorem 2. We truncate 𝒙k\bm{x}_{k} to 𝒙~k=𝒯ζx(𝒙k)\bm{\widetilde{x}}_{k}=\mathscr{T}_{\zeta_{x}}(\bm{x}_{k}) with threshold ζx\zeta_{x}, and then quantize 𝒙~k\bm{\widetilde{x}}_{k} to 𝒙˙k=𝒬Δ¯(𝒙~k+𝝉k)\bm{\dot{x}}_{k}=\mathcal{Q}_{\bar{\Delta}}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k}) with triangular dither 𝝉k𝒰([Δ¯2,Δ¯2]d)+𝒰([Δ¯2,Δ¯2]d)\bm{\tau}_{k}\sim\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d})+\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d}) and quantization level Δ¯0\bar{\Delta}\geq 0.

  • Notation. We write the quantization noise as φk=y˙ky~k\varphi_{k}=\dot{y}_{k}-\widetilde{y}_{k} and 𝝃k=𝒙˙k𝒙~k\bm{\xi}_{k}=\bm{\dot{x}}_{k}-\bm{\widetilde{x}}_{k}, the quantization error as ϑk=y˙k(y~k+ϕk)\vartheta_{k}=\dot{y}_{k}-(\widetilde{y}_{k}+\phi_{k}) and 𝒘k=𝒙˙k(𝒙~k+𝝉k)\bm{w}_{k}=\bm{\dot{x}}_{k}-(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k}).

We will adopt the above notation in subsequent developments. Based on the quantized covariate-response pairs (𝒙˙k,y˙k)(\bm{\dot{x}}_{k},\dot{y}_{k})s, we specify our estimator by setting (𝑸,𝒃)(\bm{Q},\bm{b}) in (3.6) as

𝑸=1nk=1n𝒙˙k𝒙˙kΔ¯24𝑰dand𝒃=1nk=1ny˙k𝒙˙k.\bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}~{}~{}\mathrm{and}~{}~{}\bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}. (4.1)

Note that the choice of 𝑸\bm{Q} is due to the estimator in Theorem 2, while 𝒃\bm{b} is inspired by the calculation

𝔼(y˙k𝒙˙k)=𝔼((y~k+φk)(𝒙~k+𝝃k))\displaystyle\mathbbm{E}(\dot{y}_{k}\bm{\dot{x}}_{k})=\mathbbm{E}\big{(}(\widetilde{y}_{k}+\varphi_{k})(\bm{\widetilde{x}}_{k}+\bm{\xi}_{k})\big{)}
=𝔼(y~k𝒙~k)+𝔼(y~k𝝃k)+𝔼(φk𝒙~k)+𝔼(φk𝝃k)=𝔼(y~k𝒙~k),\displaystyle=\mathbbm{E}(\widetilde{y}_{k}\bm{\widetilde{x}}_{k})+\mathbbm{E}(\widetilde{y}_{k}\bm{\xi}_{k})+\mathbbm{E}(\varphi_{k}\bm{\widetilde{x}}_{k})+\mathbbm{E}(\varphi_{k}\bm{\xi}_{k})=\mathbbm{E}(\widetilde{y}_{k}\bm{\widetilde{x}}_{k}),

where the last equality can be seen by conditioning on 𝒙~k\bm{\widetilde{x}}_{k} or y~k\widetilde{y}_{k}. However, the issue is that 𝑸\bm{Q} is not positive semi-definite, hence the resulting program is non-convex. To explain this, note that the rank of 1nk=1n𝒙˙k𝒙˙k\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top} does not exceed nn, so when d>nd>n at least dnd-n eigenvalues of 𝑸\bm{Q} are Δ¯24-\frac{\bar{\Delta}^{2}}{4}. Alternatively, the non-convexity can also be seen from the observation that setting (𝑸,𝒃)(\bm{Q},\bm{b}) as in (4.1) is tantamount to replacing the regular 2\ell_{2}-loss 12nk=1n(yk𝒙k𝜽)2\frac{1}{2n}\sum_{k=1}^{n}(y_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2} with

\frac{1}{2n}\sum_{k=1}^{n}(\dot{y}_{k}-\bm{\dot{x}}_{k}^{\top}\bm{\theta})^{2}-\frac{\bar{\Delta}^{2}}{8}\|\bm{\theta}\|_{2}^{2}.

We mention that the lack of positive semi-definiteness of 𝑸\bm{Q} is problematic from both the statistical and the optimization perspectives: 1) Statistically, Lemma 4, which is used to derive the error rates in Theorems 5-6, requires 𝑸\bm{Q} to be positive semi-definite and is hence no longer applicable here; 2) From the optimization side, it is in general unknown how to globally optimize a non-convex program.
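The following minimal sketch builds the surrogates (4.1) from triangularly dithered covariates and illustrates this loss of positive semi-definiteness numerically; the Gaussian data stand in for the (truncated) covariates and responses, and all sizes and levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, bar_Delta = 50, 100, 1.0                    # illustrative sizes with d > n
X_tilde = rng.standard_normal((n, d))             # stand-in for the (truncated) covariates
y_dot = rng.standard_normal(n)                    # stand-in for the quantized responses

# Triangular dither: sum of two independent uniform dithers on [-bar_Delta/2, bar_Delta/2].
tri = rng.uniform(-bar_Delta / 2, bar_Delta / 2, (n, d)) \
    + rng.uniform(-bar_Delta / 2, bar_Delta / 2, (n, d))
X_dot = bar_Delta * np.round((X_tilde + tri) / bar_Delta)   # dithered uniform quantization

Q = X_dot.T @ X_dot / n - (bar_Delta ** 2 / 4) * np.eye(d)  # bias-corrected surrogate of Sigma*
b = X_dot.T @ y_dot / n                                     # surrogate of Sigma_{yx}
print(np.linalg.eigvalsh(Q).min())                          # strictly negative since d > n
```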

Motivated by a line of previous works on non-convex M-estimators [63, 64, 62], we add an 1\ell_{1}-norm constraint to (3.6) by setting 𝒮={𝜽d:𝜽1R}\mathcal{S}=\{\bm{\theta}\in\mathbb{R}^{d}:\|\bm{\theta}\|_{1}\leq R\}, where RR represents a prior estimate of 𝜽1\|\bm{\theta^{\star}}\|_{1}. Let 𝜽11\partial\|\bm{\theta}_{1}\|_{1} be a subdifferential of 𝜽1\|\bm{\theta}\|_{1} at 𝜽=𝜽1\bm{\theta}=\bm{\theta}_{1}; accordingly, 𝜽~1\partial\|\bm{\widetilde{\theta}}\|_{1} in (4.2) below should be understood as “there exists one element in 𝜽~1\partial\|\bm{\widetilde{\theta}}\|_{1} such that (4.2) holds.” We consider a local minimizer of the proposed recovery program (whose existence is guaranteed by the additional 1\ell_{1}-constraint), or more generally, any 𝜽~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} that satisfies the following condition (to distinguish it from the global minimizer of (3.6), we denote by 𝜽~\bm{\widetilde{\theta}} the estimator in QCS with quantized covariate):

<𝑸𝜽~𝒃+λ𝜽~1,𝜽𝜽~>0,𝜽𝒮.\big{<}\bm{Q}\bm{\widetilde{\theta}}-\bm{b}+\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\theta}-\bm{\widetilde{\theta}}\big{>}\geq 0,~{}~{}\forall~{}\bm{\theta}\in\mathcal{S}. (4.2)

We will prove a fairly strong guarantee stating that all 𝜽~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} satisfying (4.2) (of course including all local minimizers) enjoy near minimax error rates. While this guarantee bears resemblance to the ones in [64], we point out that [64] only derived concrete results for the sub-Gaussian regime; because of the heavy-tailed data and quantization in our setting, some essentially different ingredients are required for the technical analysis (see Remark 5). As before, our results for sub-Gaussian 𝒙k\bm{x}_{k} and heavy-tailed 𝒙k\bm{x}_{k} are presented separately.

Theorem 9.

(Quantized Sub-Gaussian Covariate). Given Δ0\Delta\geq 0, Δ¯0\bar{\Delta}\geq 0, δ>0\delta>0, we consider (3.4) with the same assumptions on (𝐱k,yk,𝛉)(\bm{x}_{k},y_{k},\bm{\theta^{\star}}) as Theorem 5, and additionally assume that 𝛉2R\|\bm{\theta^{\star}}\|_{2}\leq R. The quantization of (𝐱k,yk)(\bm{x}_{k},y_{k}) is described above, and we set ζx=\zeta_{x}=\infty, ζynM1/lδlogd\zeta_{y}\asymp\sqrt{\frac{nM^{1/l}}{\delta\log d}}. For recovery, we let 𝐐=1nk=1n𝐱˙k𝐱˙kΔ¯24𝐈d\bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}, 𝐛=1nk=1ny˙k𝐱˙k\bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}, 𝒮={𝛉:𝛉1Rs}\mathcal{S}=\{\bm{\theta}:\|\bm{\theta}\|_{1}\leq R\sqrt{s}\} and set λ=C1(σ+Δ¯)2κ0(Δ+M1/(2l))δlogdn\lambda=C_{1}\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}(\Delta+M^{1/(2l)})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C1C_{1}. If nδslogdn\gtrsim\delta s\log d for some hidden constant only depending on (κ0,σ,Δ,Δ¯,M,R)(\kappa_{0},\sigma,\Delta,\bar{\Delta},M,R), with probability at least 18d1δC2exp(C3n)1-8d^{1-\delta}-C_{2}\exp(-C_{3}n), all 𝛉~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} satisfying (4.2) have estimation error 𝚼~:=𝛉~𝛉\bm{\widetilde{\Upsilon}}:=\bm{\widetilde{\theta}}-\bm{\theta^{\star}} bounded by

𝚼~2Cδslogdnand𝚼~1Csδlogdn\|\bm{\widetilde{\Upsilon}}\|_{2}\leq C\mathscr{L}\sqrt{\frac{\delta s\log d}{n}}~{}~{}\mathrm{and}~{}~{}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq C^{\prime}\mathscr{L}s\sqrt{\frac{\delta\log d}{n}}

where \mathscr{L}:=\frac{(\sigma+\bar{\Delta})^{2}(\Delta+M^{1/(2l)})}{\kappa_{0}^{3/2}}.

Similarly, the next result extends Theorem 6 to a setting involving covariate quantization.

Theorem 10.

(Quantized Heavy-Tailed Covariate). Given Δ0\Delta\geq 0, Δ¯0\bar{\Delta}\geq 0, δ>0\delta>0, we consider (3.4) with the same assumptions on (𝐱k,yk,𝛉)(\bm{x}_{k},y_{k},\bm{\theta^{\star}}) as Theorem 6. The quantization of (𝐱k,yk)(\bm{x}_{k},y_{k}) is described above, and we set ζx,ζy(nMδlogd)1/4\zeta_{x},\zeta_{y}\asymp\big{(}\frac{nM}{\delta\log d}\big{)}^{1/4}. For recovery, we let 𝐐=1nk=1n𝐱˙k𝐱˙kΔ¯24𝐈d\bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}, 𝐛=1nk=1ny˙k𝐱˙k\bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}, 𝒮={𝛉:𝛉1R}\mathcal{S}=\{\bm{\theta}:\|\bm{\theta}\|_{1}\leq R\} and set λ=C1(RM+Δ2+RΔ¯2)δlogdn\lambda=C_{1}(R\sqrt{M}+\Delta^{2}+R\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C1C_{1}. If nδslogdn\gtrsim\delta s\log d for some hidden constant only depending on (κ0,M)(\kappa_{0},M), then with probability at least 18d1δ1-8d^{1-\delta}, all 𝛉~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} satisfying (4.2) have estimation error 𝚼~:=𝛉~𝛉\bm{\widetilde{\Upsilon}}:=\bm{\widetilde{\theta}}-\bm{\theta^{\star}} bounded by

𝚼~2C3δslogdnand𝚼~1C4sδlogdn.\|\bm{\widetilde{\Upsilon}}\|_{2}\leq C_{3}\mathscr{L}\sqrt{\frac{\delta s\log d}{n}}~{}~{}\mathrm{and}~{}~{}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq C_{4}\mathscr{L}s\sqrt{\frac{\delta\log d}{n}}.

where :=RM+Δ2+RΔ¯2κ0\mathscr{L}:=\frac{R\sqrt{M}+\Delta^{2}+R\bar{\Delta}^{2}}{\kappa_{0}}.

Remark 5.

(Comparing Our Analyses with [64]). The above results are motivated by a line of works on nonconvex M-estimators [63, 64, 62], and our guarantee for the whole set of stationary points (4.2) resembles [64] most. While the main strategy for proving Theorem 9 is adjusted from [64], the proof of Theorem 10 does involve an essentially different RSC condition, see our (B.9). In particular, compared with [64, equation (4)], the leading factor of 𝚼~12\|\bm{\widetilde{\Upsilon}}\|_{1}^{2} in (B.9) degrades from O(logdn)O\big{(}\frac{\log d}{n}\big{)} to O(logdn)O\big{(}\sqrt{\frac{\log d}{n}}\big{)}. To retain a near optimal rate, we need to impose the stronger scaling 𝛉1R\|\bm{\theta^{\star}}\|_{1}\leq R with proper changes in the proof. Although Theorem 10 is presented for a concrete setting, it sheds light on extending [64] to a weaker RSC condition that can accommodate covariates with heavier tails. Such an extension is formally presented as a deterministic framework in Proposition 1.

Proposition 1.

Suppose that the ss-sparse 𝛉d\bm{\theta^{\star}}\in\mathbb{R}^{d} satisfies 𝛉1R\|\bm{\theta^{\star}}\|_{1}\leq R, and the positive definite matrix 𝚺d×d\bm{\Sigma^{\star}}\in\mathbb{R}^{d\times d} satisfies λmin(𝚺)κ0\lambda_{\min}(\bm{\Sigma^{\star}})\geq\kappa_{0}. If for some 𝐐d×d,𝐛d\bm{Q}\in\mathbb{R}^{d\times d},\bm{b}\in\mathbb{R}^{d} we have

λC1max{𝑸𝜽𝒃,R𝑸𝚺}\lambda\geq C_{1}\max\big{\{}\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty},~{}R\cdot\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\big{\}} (4.3)

holds for sufficiently large C1C_{1}, then all 𝛉~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} satisfying (4.2) with 𝒮={𝛉d:𝛉1R}\mathcal{S}=\{\bm{\theta}\in\mathbb{R}^{d}:\|\bm{\theta}\|_{1}\leq R\} have estimation error 𝚼~:=𝛉~𝛉\bm{\widetilde{\Upsilon}}:=\bm{\widetilde{\theta}}-\bm{\theta^{\star}} bounded by

𝚼~2C2sλκ0and𝚼~1C3sλκ0.\displaystyle\|\bm{\widetilde{\Upsilon}}\|_{2}\leq C_{2}\frac{\sqrt{s}\lambda}{\kappa_{0}}~{}~{}\mathrm{and}~{}~{}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq C_{3}\frac{s\lambda}{\kappa_{0}}~{}.

By extracting the ingredients that guarantee (4.2) to be accurate, interestingly, Proposition 1 is now independent of the model assumption (3.4). In particular, we could set 𝚺=𝔼[𝒙k𝒙k]\bm{\Sigma^{\star}}=\mathbbm{E}[\bm{x}_{k}\bm{x}_{k}^{\top}] when we apply Proposition 1 to (3.4). Compared with the framework of [64, Thm. 1], the key strength of Proposition 1 is that it does not explicitly assume an RSC condition on the loss function, which is hard to verify without assuming sub-Gaussian covariates. Instead, the role of the RSC assumption in [64] is now played by λR𝑸𝚺\lambda\gtrsim R\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}, which immediately yields a kind of RSC condition by a simple argument as in (B.10). Although this RSC condition is often essentially weaker than the conventional one in terms of the leading factor of 𝚼~12\|\bm{\widetilde{\Upsilon}}\|_{1}^{2} (see Remark 5), along this line one can still derive non-trivial (or even near optimal) error rates. The gain of replacing the RSC assumption with λR𝑸𝚺\lambda\gtrsim R\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty} is that the latter amounts to constructing an element-wise estimator for 𝚺\bm{\Sigma^{\star}}, which is often much easier for heavy-tailed covariates (e.g., thanks to many existing robust covariance estimators).

We conclude this part with an interesting side observation.

Remark 6.

By setting Δ¯=0\bar{\Delta}=0, Theorem 10 produces a result (with convex program) for the setting of Theorem 6. Interestingly, with the additional 1\ell_{1}-constraint, a notable improvement is that the sub-optimal ns2logdn\gtrsim s^{2}\log d in Theorem 6 is sharpened to the near optimal one in Theorem 10. More concretely, this is because (ii) in (A.12) can be tightened to (ii)(ii) of (B.10). Going back to the full-data unquantized regime, Theorem 10 with Δ=Δ¯=0\Delta=\bar{\Delta}=0 recovers [35, Theorem 2(b)] with improved sample complexity requirement.

4.1.2 1-bit QCS with Quantized Covariate

Our consideration of covariate quantization in QCS seems fairly new to the literature. To the best of our knowledge, the only related results are [21, Thms. 7-8] for QCS with 1-bit quantized covariate and response. The assumption there, however, is quite restrictive. Specifically, it is assumed that 𝚺=𝔼(𝒙k𝒙k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}) has sparse columns (see [21, Assumption 3]), which is non-standard in both compressed sensing and sparse linear regression. Departing momentarily from our focus of dithered uniform quantization, we consider QCS under dithered 1-bit quantization and will apply Proposition 1 to derive results comparable to [21, Thms. 7-8] without resorting to the sparsity of 𝚺\bm{\Sigma^{\star}}.

We first review the 1-bit quantization scheme developed in [21]:

  • Response Quantization. We truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) with some threshold ζy\zeta_{y}, and then quantize y~k\widetilde{y}_{k} to y˙k=sgn(y~k+ϕk)\dot{y}_{k}=\operatorname{sgn}(\widetilde{y}_{k}+\phi_{k}) with uniform dither ϕk𝒰([γy,γy])\phi_{k}\sim\mathscr{U}([-\gamma_{y},\gamma_{y}]).

  • Covariate Quantization. We truncate 𝒙k\bm{x}_{k} to 𝒙~k=𝒯ζx(𝒙k)\bm{\widetilde{x}}_{k}=\mathscr{T}_{\zeta_{x}}(\bm{x}_{k}) with some threshold ζx\zeta_{x}, and then quantize 𝒙~k\bm{\widetilde{x}}_{k} to 𝒙˙k1=sgn(𝒙~k+𝝉k1)\bm{\dot{x}}_{k1}=\operatorname{sgn}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k1}) and 𝒙˙k2=sgn(𝒙~k+𝝉k2)\bm{\dot{x}}_{k2}=\operatorname{sgn}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k2}), where 𝝉k1,𝝉k2𝒰([γx,γx]d)\bm{\tau}_{k1},\bm{\tau}_{k2}\sim\mathscr{U}([-\gamma_{x},\gamma_{x}]^{d}) are independent uniform dithers. (Note that we collect 2 bits per entry).
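A minimal sketch of the surrogates (𝑸,𝒃) used in Theorems 11-12, built from the 1-bit scheme above, is as follows; the Gaussian stand-in data and the dither ranges are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, gamma_x, gamma_y = 500, 20, 3.0, 3.0   # illustrative sizes and dither ranges
X_tilde = rng.standard_normal((n, d))        # stand-in for the (truncated) covariates
y_tilde = rng.standard_normal(n)             # stand-in for the (truncated) responses

# Two independent uniform dithers give 2 bits per covariate entry.
X_dot1 = np.sign(X_tilde + rng.uniform(-gamma_x, gamma_x, (n, d)))
X_dot2 = np.sign(X_tilde + rng.uniform(-gamma_x, gamma_x, (n, d)))
y_dot = np.sign(y_tilde + rng.uniform(-gamma_y, gamma_y, n))

Q = gamma_x ** 2 / (2 * n) * (X_dot1.T @ X_dot2 + X_dot2.T @ X_dot1)  # surrogate of Sigma*
b = gamma_x * gamma_y / n * (X_dot1.T @ y_dot)                        # surrogate of Sigma_{yx}
```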

The following two results refine [21, Thms. 7-8] by deriving comparable error rates without using sparsity of 𝚺\bm{\Sigma^{\star}}.

Theorem 11.

(1-bit Quantized Sub-Gaussian Covariate). Given δ>0\delta>0, we consider (3.4) where the ss-sparse 𝛉\bm{\theta^{\star}} satisfies 𝛉1R\|\bm{\theta^{\star}}\|_{1}\leq R, 𝐱k\bm{x}_{k}s are i.i.d. zero-mean sub-Gaussian with 𝐱kψ2σ\|\bm{x}_{k}\|_{\psi_{2}}\leq\sigma, and 𝚺=𝔼(𝐱k𝐱k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}) satisfies λmin(𝚺)κ0\lambda_{\min}\big{(}\bm{\Sigma^{\star}}\big{)}\geq\kappa_{0} for some κ0>0\kappa_{0}>0, the noise ϵk{\epsilon}_{k}s are independent of 𝐱k\bm{x}_{k}s and i.i.d. sub-Gaussian, while for simplicity we assume ykψ2σ\|y_{k}\|_{\psi_{2}}\leq\sigma. In the quantization of (𝐱k,yk)(\bm{x}_{k},y_{k}) described above, we set ζx=ζy=\zeta_{x}=\zeta_{y}=\infty and γx,γyσlog(n2δlogd)\gamma_{x},\gamma_{y}\asymp\sigma\sqrt{\log\big{(}\frac{n}{2\delta\log d}\big{)}}. For recovery we let 𝐐:=γx22nk=1n(𝐱˙k1𝐱˙k2+𝐱˙k2𝐱˙k1)\bm{Q}:=\frac{\gamma_{x}^{2}}{2n}\sum_{k=1}^{n}\big{(}\bm{\dot{x}}_{k1}\bm{\dot{x}}_{k2}^{\top}+\bm{\dot{x}}_{k2}\bm{\dot{x}}_{k1}^{\top}\big{)}, 𝐛:=γxγynk=1ny˙k𝐱˙k1\bm{b}:=\frac{\gamma_{x}\gamma_{y}}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k1}, 𝒮={𝛉:𝛉1R}\mathcal{S}=\{\bm{\theta}:\|\bm{\theta}\|_{1}\leq R\} and set λ=C1σ2Rδlogd(logn)2n\lambda=C_{1}\sigma^{2}R\sqrt{\frac{\delta\log d(\log n)^{2}}{n}} with sufficiently large C1C_{1}. If nδslogd(logn)2n\gtrsim\delta s\log d(\log n)^{2}, then with probability at least 14d2δ1-4d^{2-\delta}, all 𝛉~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} satisfying (4.2) have estimation error 𝚼~:=𝛉~𝛉\bm{\widetilde{\Upsilon}}:=\bm{\widetilde{\theta}}-\bm{\theta^{\star}} bounded by

𝚼~2C2σ2κ0Rδslogd(logn)2nand𝚼~1C3σ2κ0Rsδlogd(logn)2n.\|\bm{\widetilde{\Upsilon}}\|_{2}\leq C_{2}\frac{\sigma^{2}}{\kappa_{0}}\cdot R\sqrt{\frac{\delta s\log d(\log n)^{2}}{n}}~{}~{}\mathrm{and}~{}~{}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq C_{3}\frac{\sigma^{2}}{\kappa_{0}}\cdot Rs\sqrt{\frac{\delta\log d(\log n)^{2}}{n}}.

Theorem 12.

(1-bit Quantized Heavy-Tailed Covariate). Given δ>0\delta>0, we consider (3.4) where the ss-sparse 𝛉\bm{\theta^{\star}} satisfies 𝛉1R\|\bm{\theta^{\star}}\|_{1}\leq R, 𝐱k\bm{x}_{k}s are i.i.d. zero-mean heavy-tailed satisfying the joint fourth moment constraint sup𝐯𝕊d1𝔼|𝐯𝐱k|4M\sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}|\bm{v}^{\top}\bm{x}_{k}|^{4}\leq M, and 𝚺=𝔼(𝐱k𝐱k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}) satisfies λmin(𝚺)κ0\lambda_{\min}\big{(}\bm{\Sigma^{\star}}\big{)}\geq\kappa_{0} for some κ0>0\kappa_{0}>0, the noise ϵk\epsilon_{k}s are independent of 𝐱k\bm{x}_{k}s and i.i.d. heavy-tailed with bounded fourth moment, while for simplicity we assume 𝔼|yk|4M\mathbbm{E}|y_{k}|^{4}\leq M. In the quantization of (𝐱k,yk)(\bm{x}_{k},y_{k}) described above, we set ζx,ζy,γx,γy(nM2δlogd)1/8\zeta_{x},\zeta_{y},\gamma_{x},\gamma_{y}\asymp\big{(}\frac{nM^{2}}{\delta\log d}\big{)}^{1/8} and enforce ζx<γx\zeta_{x}<\gamma_{x}, ζy<γy\zeta_{y}<\gamma_{y}. For recovery we let 𝐐:=γx22nk=1n(𝐱˙k1𝐱˙k2+𝐱˙k2𝐱˙k1)\bm{Q}:=\frac{\gamma_{x}^{2}}{2n}\sum_{k=1}^{n}\big{(}\bm{\dot{x}}_{k1}\bm{\dot{x}}_{k2}^{\top}+\bm{\dot{x}}_{k2}\bm{\dot{x}}_{k1}^{\top}\big{)}, 𝐛:=γxγynk=1ny˙k𝐱˙k1\bm{b}:=\frac{\gamma_{x}\gamma_{y}}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k1}, 𝒮={𝛉:𝛉1R}\mathcal{S}=\{\bm{\theta}:\|\bm{\theta}\|_{1}\leq R\} and set λ=C1MR(δlogdn)1/4\lambda=C_{1}\sqrt{M}R\big{(}\frac{\delta\log d}{n}\big{)}^{1/4} with sufficiently large C1C_{1}. If nδs2logdn\gtrsim\delta s^{2}\log d, then with probability at least 14d2δ1-4d^{2-\delta}, all 𝛉~𝒮\bm{\widetilde{\theta}}\in\mathcal{S} satisfying (4.2) have estimation error 𝚼~:=𝛉~𝛉\bm{\widetilde{\Upsilon}}:=\bm{\widetilde{\theta}}-\bm{\theta^{\star}} bounded by

𝚼~2C2Mκ0R(δs2logdn)1/4and𝚼~1C3Mκ0Rs(δlogdn)1/4.\|\bm{\widetilde{\Upsilon}}\|_{2}\leq C_{2}\frac{\sqrt{M}}{\kappa_{0}}\cdot R\Big{(}\frac{\delta s^{2}\log d}{n}\Big{)}^{1/4}~{}~{}\mathrm{and}~{}~{}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq C_{3}\frac{\sqrt{M}}{\kappa_{0}}\cdot Rs\Big{(}\frac{\delta\log d}{n}\Big{)}^{1/4}.

4.2 Uniform Recovery Guarantee

Uniformity is a highly desired property for a compressed sensing guarantee because it allows one to use a fixed (possibly randomly drawn) measurement ensemble for all sparse signals. Unfortunately, like many other results for nonlinear compressed sensing in the literature, our earlier recovery guarantees are non-uniform and only ensure the accurate recovery of a sparse signal fixed before drawing the random measurement ensemble.

In this part, we provide a second additional development for QCS. Specifically, we establish a uniform recovery guarantee which, despite the heavy-tailed noise and nonlinear quantization scheme, notably retains a near minimax error rate. This is done by upgrading Theorem 5 to be uniform over all sparse 𝜽\bm{\theta^{\star}} via more in-depth technical tools and a careful covering argument. Part of the techniques is inspired by the prior works [40, 94], but certain technical innovations are required:

1) Like the recent work [40], one crucial technical tool in our proof is a powerful concentration inequality for product process due to Mendelson [67], as adapted in the present Lemma 9. However, [40] only studied sub-Gaussian distribution, and the results produced by their unified approach typically exhibit a decaying rate of O(n1/4)O(n^{-1/4}) [40, Sect. 4]. By contrast, our problem involves heavy-tailed noise only having bounded (2+ν)(2+\nu)-th moment (ν>0\nu>0), and we aim to establish a near minimax uniform error bound — cautiousness and new treatment are thus needed in the application of Lemma 9. More specifically, in the proof we need to bound

I1=sup𝜽Σs,R0sup𝒗𝒱k=1n(y~k𝒙k𝒗𝔼[y~k𝒙k𝒗]),I_{1}=\sup_{\bm{\theta}\in\Sigma_{s,R_{0}}}\sup_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}\big{(}\widetilde{y}_{k}\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\widetilde{y}_{k}\bm{x}_{k}^{\top}\bm{v}]\big{)},

where 𝒱={𝒗:𝒗2=1,𝒗12s}\mathscr{V}=\{\bm{v}:\|\bm{v}\|_{2}=1,\|\bm{v}\|_{1}\leq 2\sqrt{s}\}, and Σs,R0=Σs{𝜽d:𝜽2R0}\Sigma_{s,R_{0}}=\Sigma_{s}\cap\{\bm{\theta}\in\mathbb{R}^{d}:\|\bm{\theta}\|_{2}\leq R_{0}\} is the signal space of interest, and recall that y~k=𝒯ζy(𝒙k𝜽+ϵk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}+\epsilon_{k}) with sub-Gaussian 𝒙k\bm{x}_{k}. It is natural to invoke Lemma 9 to bound I1I_{1} straightforwardly, but the issue is on lack of good bound for y~kψ2\|\widetilde{y}_{k}\|_{\psi_{2}} due to the heavy-tailedness of ϵk\epsilon_{k}; indeed, one only has the trivial estimate as y~kψ2=O(ζy)\|\widetilde{y}_{k}\|_{\psi_{2}}=O(\zeta_{y}), which is much worse than an O(1)O(1) bound since ζynδlogd\zeta_{y}\asymp\sqrt{\frac{n}{\delta\log d}}, and using Lemma 9 with this estimate leads to a loose bound for I1I_{1} and finally a sub-optimal error rate. To address the issue, our main idea is to introduce the truncated heavy-tailed noise 𝒯ζy(ϵk)\mathscr{T}_{\zeta_{y}}(\epsilon_{k}) and define z~k=y~k𝒯ζy(ϵk)\widetilde{z}_{k}=\widetilde{y}_{k}-\mathscr{T}_{\zeta_{y}}(\epsilon_{k}), which enables us to decompose I1I_{1} as

I1sup𝜽Σs,R0sup𝒗𝒱k=1n(z~k𝒙k𝒗𝔼[z~k𝒙k𝒗]):=I11+sup𝒗𝒱k=1n(𝒯ζy(ϵk)𝒙k𝒗𝔼[𝒯ζy(ϵk)𝒙k𝒗]):=I12.I_{1}\leq\underbrace{\sup_{\bm{\theta}\in\Sigma_{s,R_{0}}}\sup_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}\big{(}\widetilde{z}_{k}\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\widetilde{z}_{k}\bm{x}_{k}^{\top}\bm{v}]\big{)}}_{:=I_{11}}+\underbrace{\sup_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}\big{(}\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}^{\top}\bm{v}]\big{)}}_{:=I_{12}}.

Now, the benefits of working with I11,I12I_{11},I_{12} are that: i) We can directly invoke Lemma 9 to bound I11I_{11} since we have a good sub-Gaussian norm estimate z~kψ2𝒙k𝜽ψ2𝒙kψ2R0\|\widetilde{z}_{k}\|_{\psi_{2}}\leq\|\bm{x}_{k}^{\top}\bm{\theta}\|_{\psi_{2}}\lesssim\|\bm{x}_{k}\|_{\psi_{2}}R_{0}, see Step 2.1.1 in the proof; ii) I12I_{12} becomes the supremum of a process that is independent of 𝜽\bm{\theta} and only indexed by 𝒗\bm{v}, hence Bernstein’s inequality suffices for bounding I12I_{12} (Step 2.1.2 in the proof), analogously to the proof of the non-uniform guarantee (Theorem 5).

2) Like [94, Prop. 6.1], we invoke a covering argument with similar techniques to bound I0=sup𝜽Σs,R0sup𝒗𝒱k=1nξk𝒙k𝒗I_{0}=\sup_{\bm{\theta}\in\Sigma_{s,R_{0}}}\sup_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}^{\top}\bm{v}, where ξk=𝒬Δ(y~k+τk)y~k\xi_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k})-\widetilde{y}_{k} is the quantization noise. Nevertheless, our Lasso estimator is different from their projected back projection estimator, and it turns out that we need to directly handle sup𝒗𝒱\sup_{\bm{v}\in\mathscr{V}} by Lemma 10, unlike [94, Prop. 6.2] that again used a covering argument for this purpose. See more discussions in Step 2.4 of the proof.

We are in a position to present our uniform recovery guarantee. We follow most assumptions in Theorem 5 but specify the signal space as 𝜽Σs,R0\bm{\theta^{\star}}\in\Sigma_{s,R_{0}} and impose the (2l)(2l)-th moment constraint on ϵk\epsilon_{k}. Following prior works on QCS (e.g., [40, 89]), we consider constrained Lasso that utilizes an 1\ell_{1}-constraint 𝜽1𝜽1\|\bm{\theta}\|_{1}\leq\|\bm{\theta}^{\star}\|_{1} (rather than (3.6)) to pursue uniform recovery.

Theorem 13.

(Uniform Version of Theorem 5). Given some δ>0,Δ>0\delta>0,\Delta>0, in (3.4) we suppose that 𝐱k\bm{x}_{k}s are i.i.d., zero-mean sub-Gaussian with 𝐱kψ2σ\|\bm{x}_{k}\|_{\psi_{2}}\leq\sigma, κ0λmin(𝚺)λmax(𝚺)κ1\kappa_{0}\leq\lambda_{\min}(\bm{\Sigma^{\star}})\leq\lambda_{\max}(\bm{\Sigma^{\star}})\leq\kappa_{1} for some κ1κ0>0\kappa_{1}\geq\kappa_{0}>0 where 𝚺=𝔼(𝐱k𝐱k)\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}), 𝛉Σs,R0:=Σs{𝛉:𝛉2R0}\bm{\theta^{\star}}\in\Sigma_{s,R_{0}}:=\Sigma_{s}\cap\{\bm{\theta}:\|\bm{\theta}\|_{2}\leq R_{0}\} for some absolute constant R0R_{0}, ϵk\epsilon_{k}s are i.i.d. noise that are independent of 𝐱k\bm{x}_{k}s and satisfy 𝔼|ϵk|2lM\mathbbm{E}|\epsilon_{k}|^{2l}\leq M for some fixed l>1l>1. In quantization, we truncate yky_{k} to y~k=𝒯ζy(yk)\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}) with threshold ζy(n(M1/l+σ2)δlogd)1/2\zeta_{y}\asymp\big{(}\frac{n(M^{1/l}+\sigma^{2})}{\delta\log d}\big{)}^{1/2}, then quantize y~k\widetilde{y}_{k} to y˙k=𝒬Δ(y~k+τk)\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}) with uniform dither τk𝒰([Δ2,Δ2])\tau_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]). For recovery, we define the estimator 𝛉^\bm{\widehat{\theta}} as the solution to constrained Lasso

𝜽^=argmin𝜽1𝜽112nk=1n(y˙k𝒙k𝜽)2\bm{\widehat{\theta}}=\mathop{\arg\min}\limits_{\|\bm{\theta}\|_{1}\leq\|\bm{\theta^{\star}}\|_{1}}~{}\frac{1}{2n}\sum_{k=1}^{n}(\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{\theta})^{2}

If nδslog𝒲n\gtrsim\delta s\log\mathscr{W} for 𝒲=κ1d2n3Δ2s5δ3\mathscr{W}=\frac{\kappa_{1}d^{2}n^{3}}{\Delta^{2}s^{5}\delta^{3}} and some hidden constant depending on (κ0,σ)(\kappa_{0},\sigma), then with probability at least 1Cd1δ1-Cd^{1-\delta} on a single random draw of (𝐱k,ϵk,τk)k=1n(\bm{x}_{k},\epsilon_{k},\tau_{k})_{k=1}^{n}, it holds uniformly for all 𝛉Σs,R0\bm{\theta^{\star}}\in\Sigma_{s,R_{0}} that the estimation error 𝚼^:=𝛉^𝛉\bm{\widehat{\Upsilon}}:=\bm{\widehat{\theta}}-\bm{\theta^{\star}} satisfies

𝚼^2C3σ(σ+M12l)κ0δslogdn+C3σΔκ0δslog𝒲n,\displaystyle\|\bm{\widehat{\Upsilon}}\|_{2}\leq\frac{C_{3}\sigma(\sigma+M^{\frac{1}{2l}})}{\kappa_{0}}\sqrt{\frac{\delta s\log d}{n}}+\frac{C_{3}\sigma\Delta}{\kappa_{0}}\sqrt{\frac{\delta s\log\mathscr{W}}{n}},
𝚼^1C4σ(σ+M12l)κ0sδlogdn+C4σΔκ0sδlog𝒲n.\displaystyle\|\bm{\widehat{\Upsilon}}\|_{1}\leq\frac{C_{4}\sigma(\sigma+M^{\frac{1}{2l}})}{\kappa_{0}}s\sqrt{\frac{\delta\log d}{n}}+\frac{C_{4}\sigma\Delta}{\kappa_{0}}s\sqrt{\frac{\delta\log\mathscr{W}}{n}}.

Notably, our uniform guarantee is still minimax optimal up to some additional logarithmic factors (i.e., log𝒲\sqrt{\log\mathscr{W}}) arising from the covering argument (Step 2.4 of the proof), whose main aim is to show that one uniform dither 𝝉=[τk]\bm{\tau}=[\tau_{k}] suffices for all signals. Naturally, log𝒲\sqrt{\log\mathscr{W}} thus appears with the quantization level Δ\Delta as its multiplicative factor, meaning that the logarithmic gap between uniform recovery and non-uniform recovery closes as Δ0\Delta\to 0. In particular, Theorem 13 implies a uniform error rate matching the non-uniform one in Theorem 5 (up to some multiplicative factors) when Δ\Delta is small enough or in the unquantized case.

To the best of our knowledge, the only existing uniform guarantee for heavy-tailed QCS is [32, Thm. 1.11], but the following distinctions make it impossible to closely compare their result with our Theorem 13: 1) [32, Thm. 1.11] is for dithered 1-bit quantization, while ours is for the dithered uniform quantizer; 2) We handle heavy-tailedness by truncation, while [32, Thm. 1.11] does not involve this kind of special treatment; 3) [32, Thm. 1.11] considers a highly intractable program with the Hamming distance as objective and 𝜽Σs\bm{\theta}\in\Sigma_{s} as constraint (when specialized to sparse signals), while our Theorem 13 is for the convex Lasso program; 4) Their analysis is based on an in-depth result on random hyperplane tessellations (see also [33, 77]), while our proof follows the more standard strategy (i.e., upgrading each piece of a non-uniform proof to be uniform) and requires certain technical innovations (e.g., the treatment of the truncation step). It is possible to use such a standard strategy to upgrade Theorem 6 to a uniform result, whose error rate may exhibit worse dependence on ss due to the covering argument.

5 Numerical Simulations

In this section we provide two sets of experimental results to support and demonstrate our theoretical developments. The first set of simulations is devoted to validating our major standpoint that near minimax rates are achievable in quantized heavy-tailed settings. The second set of results is then presented to illustrate the crucial role played by the appropriate dither (i.e., triangular dither for the covariate, uniform dither for the response) before uniform quantization. For the importance of data truncation we refer to [35, Sect. 5], which covers the three estimation problems in this work and contrasts the estimations with and without data truncation.

5.1 (Near) Minimax Error Rates

Each data point in our results is set to be the mean value of 5050 or 100100 independent trials.

5.1.1 Quantized Covariance Matrix Estimation

We start with covariance matrix estimation; specifically, we verify the element-wise rate 1:=O(logdn)\mathscr{B}_{1}:=O\big{(}\mathscr{L}\sqrt{\frac{\log d}{n}}\big{)} and the operator norm rate 2:=O(dlogdn)\mathscr{B}_{2}:=O\big{(}\mathscr{L}\sqrt{\frac{d\log d}{n}}\big{)} in Theorems 2-3.

For the estimator in Theorem 2, we draw 𝒙k=(xki)\bm{x}_{k}=(x_{ki}) such that the first two coordinates are independently drawn from 𝗍(4.5)\mathsf{t}(4.5), (xki)i=3,4(x_{ki})_{i=3,4} are from 𝗍(6)\mathsf{t}(6) with covariance 𝔼(xk3xk4)=1.2\mathbbm{E}(x_{k3}x_{k4})=1.2, and the remaining d4d-4 coordinates are i.i.d. following 𝗍(6)\mathsf{t}(6). We test different choices of (d,Δ)(d,\Delta) under n=80:20:220n=80:20:220, and the log-log plots are shown in Figure 1(a). Clearly, for each (d,Δ)(d,\Delta) the experimental points roughly exhibit a straight line that is well aligned with the dashed line representing the n1/2n^{-1/2} rate. As predicted by the factor =M+Δ2\mathscr{L}=\sqrt{M}+\Delta^{2}, the curves with larger Δ\Delta are higher, but note that the error decay rates remain unchanged. In addition, the curves of (d,Δ)=(100,1),(120,1)(d,\Delta)=(100,1),(120,1) are extremely close, which is consistent with the logarithmic dependence of 1\mathscr{B}_{1} on dd.
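For reproducibility, a simplified sketch of one trial behind Figure 1(a) is given below; it uses i.i.d. unit-variance 𝗍(6) coordinates instead of the exact covariance structure described above, and the truncation threshold and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, Delta = 200, 100, 1.0
X = rng.standard_t(df=6, size=(n, d)) / np.sqrt(1.5)   # unit-variance t(6) coordinates
Sigma_star = np.eye(d)

zeta = (n / np.log(d)) ** 0.25                          # truncation threshold ~ (n / log d)^{1/4}
X_trunc = np.sign(X) * np.minimum(np.abs(X), zeta)
tri = rng.uniform(-Delta / 2, Delta / 2, (n, d)) + rng.uniform(-Delta / 2, Delta / 2, (n, d))
X_dot = Delta * np.round((X_trunc + tri) / Delta)       # triangular dither + uniform quantizer

Sigma_hat = X_dot.T @ X_dot / n - (Delta ** 2 / 4) * np.eye(d)
print(np.abs(Sigma_hat - Sigma_star).max())             # element-wise (max norm) error
```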

For the error bound 2\mathscr{B}_{2}, the coordinates of 𝒙k\bm{x}_{k} are independently drawn from a scaled version of 𝗍(4.5)\mathsf{t}(4.5) such that 𝚺=diag(2,2,1,,1)\bm{\Sigma^{\star}}=\mathrm{diag}(2,2,1,...,1), and we test different settings of (d,Δ)(d,\Delta) under n=200:100:1000n=200:100:1000. As shown in Figure 1(b), the operator norm error decreases with nn in the optimal rate n1/2n^{-1/2}, and using a coarser dithered quantizer (i.e., larger Δ\Delta) only slightly lifts the curves. Indeed, the effect seems consistent with \mathscr{L}’s quadratic dependence on Δ\Delta. To validate the relative scaling of nn and dd, in addition to the setting (d,Δ)=(100,1)(d,\Delta)=(100,1) under n=200:100:1000n=200:100:1000, we try (d,Δ)=(150,1)(d,\Delta)=(150,1) under 1.51.5 times the original sample size n=1.5×(200:100:1000)n=1.5\times(200:100:1000) (but in Figure 1(b) we still plot the curve according to the sample size of 200:100:1000200:100:1000 without the multiplicative factor of 1.51.5), and surprisingly the obtained curve coincides with the one for (d,Δ)=(100,1)(d,\Delta)=(100,1). Thus, ignoring the logarithmic factor logd\log d, the operator norm error can be characterized by 2\mathscr{B}_{2} fairly well.

Additionally, we want to compare 1\mathscr{B}_{1} and 2\mathscr{B}_{2} regarding the dependence on dd more clearly. Specifically, we generate the samples 𝒙k\bm{x}_{k}s as in Figure 1(a) and test the fixed sample size n=180n=180 and varying dimension d=80:20:260d=80:20:260. The max norm estimation errors of 𝚺^\bm{\widehat{\Sigma}} in Theorem 2 and the operator norm errors (under d=80:20:180d=80:20:180 to ensure ndn\geq d) of the estimator in Theorem 3 are reported in Figure 1(c). It is clear that the max norm error increases with dd rather slowly, while the operator norm error increases much more significantly under larger dd. This is consistent with the logarithmic dependence of 1\mathscr{B}_{1} on dd and the more essential dependence of 2\mathscr{B}_{2} on dd.

Figure 1: (a): Element-wise error (Theorem 2); (b): operator norm error (Theorem 3); (c): the dependence on dd of both error metrics.

5.1.2 Quantized Compressed Sensing

We now switch to QCS with unquantized covariate and aim to verify the 2\ell_{2}-norm error rate 3=O(slogdn)\mathscr{B}_{3}=O\big{(}\mathscr{L}\sqrt{\frac{s\log d}{n}}\big{)} obtained in Theorems 5-6. We let the support of the ss-sparse 𝜽d\bm{\theta^{\star}}\in\mathbb{R}^{d} be [s][s], and then draw the non-zero entries from a uniform distribution over 𝕊s1\mathbb{S}^{s-1} (hence 𝜽2=1\|\bm{\theta^{\star}}\|_{2}=1). For the setting of Theorem 5 we adopt 𝒙k𝒩(0,𝑰d)\bm{x}_{k}\sim\mathcal{N}(0,\bm{I}_{d}) and ϵk16𝗍(3)\epsilon_{k}\sim\frac{1}{\sqrt{6}}\mathsf{t}(3), while 𝒙kiiid53𝗍(4.5)\bm{x}_{ki}\stackrel{{\scriptstyle iid}}{{\sim}}\frac{\sqrt{5}}{3}\mathsf{t}(4.5) and ϵk13𝗍(4.5)\epsilon_{k}\sim\frac{1}{\sqrt{3}}\mathsf{t}(4.5) for Theorem 6. We simulate different choices of (d,s,Δ)(d,s,\Delta) under n=100:100:1000n=100:100:1000, and the proposed convex program (3.6) is solved with the framework of ADMM (we refer to the review [9]). Experimental results are shown as log-log plots in Figure 2. Consistent with the theoretical bound 3\mathscr{B}_{3}, the errors in both cases decrease in a rate of n1/2n^{-1/2}, whereas the effect of uniform quantization is merely on the multiplicative factor \mathscr{L}. Interestingly, it seems that the gaps between Δ=0,0.5\Delta=0,0.5 and Δ=0.5,1\Delta=0.5,1 are in agreement with the explicit form of \mathscr{L}, i.e., M1/(2l)+Δ\mathscr{L}\asymp M^{1/(2l)}+\Delta for Theorem 5, and M+Δ2\mathscr{L}\asymp\sqrt{M}+\Delta^{2} for Theorem 6. In addition, note that the curves of (d,s)=(150,5),(180,5)(d,s)=(150,5),(180,5) are close, whereas increasing s=8s=8 suffers from significantly larger error. This is consistent with the scaling law of (n,d,s)(n,d,s) in 3\mathscr{B}_{3}.


Figure 2: (a): QCS in Theorem 5; (b): QCS in Theorem 6.

Then, we simulate the complete quantization setting where both covariate and response are quantized (Theorems 9-10). The simulation details are the same as before, except that 𝒙k\bm{x}_{k} is also quantized, with the same quantization level as yky_{k}. We supply the sharpest 1\ell_{1}-norm constraint for recovery, i.e., 𝒮:={𝜽:𝜽1𝜽1}\mathcal{S}:=\{\bm{\theta}:\|\bm{\theta}\|_{1}\leq\|\bm{\theta^{\star}}\|_{1}\}, and composite gradient descent [63, 64] is invoked to handle the non-convex estimation program. The log-log plots are shown in Figure 3. These results carry implications similar to Figure 2 in terms of the n1/2n^{-1/2} rate, the effect of quantization, and the relative scaling of (n,d,s)(n,d,s).
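To give a flavor of how the non-convex program can be handled in practice, here is a minimal Python sketch of a projected-gradient variant over the 1\ell_{1}-constraint 𝒮\mathcal{S}; it is a simplification of the composite gradient descent of [63, 64], and the step size, iteration count, and the projection routine are our illustrative choices, not the exact algorithm used in our experiments. With covariate quantization, Q below would be the quantized covariance surrogate, which may fail to be positive semi-definite; the theory states that all local minimizers of such programs enjoy near optimal error.

import numpy as np

def project_l1_ball(v, radius):
    # Euclidean projection onto {x : ||x||_1 <= radius}
    if np.sum(np.abs(v)) <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u * np.arange(1, v.size + 1) > (css - radius))[0][-1]
    lam = (css[idx] - radius) / (idx + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def projected_gd(Q, b, radius, step=0.05, iters=1000):
    # minimize f(theta) = 0.5 * theta' Q theta - b' theta over the l1-ball;
    # with a non-PSD Q the objective is non-convex, yet the iterates stay feasible
    theta = np.zeros(Q.shape[0])
    for _ in range(iters):
        theta = project_l1_ball(theta - step * (Q @ theta - b), radius)
    return theta

# usage with the surrogate pair (Q, b) built from quantized data and the oracle radius
# theta_hat = projected_gd(Q, b, np.sum(np.abs(theta_star)))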


Figure 3: (a): QCS in Theorem 9; (b): QCS in Theorem 10.

5.1.3 Quantized Matrix Completion

Finally, we simulate QMC and demonstrate the error bound 4=O(rdlogdn)\mathscr{B}_{4}=O\big{(}\mathscr{L}\sqrt{\frac{rd\log d}{n}}\big{)} for 𝚼^F/d\|\bm{\widehat{\Upsilon}}\|_{F}/d in Theorems 7-8. We generate the rank-rr 𝚯d×d\bm{\Theta^{\star}}\in\mathbb{R}^{d\times d} as follows: we first generate 𝚯0d×r\bm{\Theta}_{0}\in\mathbb{R}^{d\times r} with i.i.d. standard Gaussian entries to obtain the rank-rr 𝚯1:=𝚯0𝚯0\bm{\Theta}_{1}:=\bm{\Theta}_{0}\bm{\Theta}_{0}^{\top}, then we rescale it to 𝚯:=k1𝚯1\bm{\Theta^{\star}}:=k_{1}\bm{\Theta}_{1} such that 𝚯F=d\|\bm{\Theta^{\star}}\|_{F}=d. We use ϵk𝒩(0,14)\epsilon_{k}\sim\mathcal{N}(0,\frac{1}{4}) to simulate the sub-exponential noise in Theorem 7, while ϵk16𝗍(3)\epsilon_{k}\sim\frac{1}{\sqrt{6}}\mathsf{t}(3) for Theorem 8. The convex program (3.8) is fed with α=𝚯\alpha=\|\bm{\Theta^{\star}}\|_{\infty} and solved by ADMM. We test different choices of (d,r,Δ)(d,r,\Delta) under n=2000:1000:8000n=2000:1000:8000, with the log-log error plots displayed in Figure 4. Firstly, the experimental curves align well with the dashed line that represents the optimal n1/2n^{-1/2} rate. Secondly, comparing the results for Δ=0,0.5,1\Delta=0,0.5,1, we conclude that quantization only affects the multiplicative factor \mathscr{L} in the estimation error. It should also be noted that increasing either dd or rr leads to significantly larger error, which is consistent with the polynomial dependence of 4\mathscr{B}_{4} on dd and rr.
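For completeness, a minimal Python sketch of the QMC data generation and response quantization is given below; entries observed uniformly at random (with replacement) and a uniform dither on the responses are our assumptions for illustration, and the constants are not those of the reported experiments. The quantized responses and observed indices, together with α\alpha, are then passed to program (3.8).

import numpy as np

rng = np.random.default_rng(1)
d, r, Delta, n = 30, 5, 0.5, 5000
Theta0 = rng.standard_normal((d, r))
Theta1 = Theta0 @ Theta0.T                                   # rank-r, positive semi-definite
Theta_star = Theta1 * (d / np.linalg.norm(Theta1, 'fro'))    # rescale so ||Theta*||_F = d
rows = rng.integers(0, d, size=n)                            # uniformly sampled entries (assumption)
cols = rng.integers(0, d, size=n)
eps = rng.normal(0.0, 0.5, size=n)                           # N(0, 1/4) noise as in Theorem 7
y = Theta_star[rows, cols] + eps
tau = rng.uniform(-Delta / 2, Delta / 2, size=n)             # uniform dither on the responses
y_dot = Delta * np.round((y + tau) / Delta)                  # dithered uniform quantization
# (y_dot, rows, cols) with alpha = max|Theta*_{ij}| are then fed into program (3.8)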


Figure 4: (a): QMC in Theorem 7; (b): QMC in Theorem 8.

5.2 Importance of Appropriate Dithering

To demonstrate the crucial role played by a suitable dither, we provide a second set of simulations. To observe the phenomena more clearly and draw firm conclusions, we consider rather simple estimation problems with large sample sizes under coarse quantization (i.e., large Δ\Delta).

Specifically, for covariance matrix estimation we set d=1d=1 and draw X1,,XnX_{1},...,X_{n} i.i.d. from 𝒩(0,1)\mathcal{N}(0,1). Thus, the problem boils down to estimating 𝔼|Xk|2\mathbbm{E}|X_{k}|^{2}, for which the estimators in Theorems 2-3 coincide. Since XkX_{k} is sub-Gaussian, we do not perform data truncation before the dithered quantization. Besides our estimator 𝚺^=1nk=1nX˙k2Δ24\bm{\widehat{\Sigma}}=\frac{1}{n}\sum_{k=1}^{n}\dot{X}_{k}^{2}-\frac{\Delta^{2}}{4}, where X˙k=𝒬Δ(Xk+τk)\dot{X}_{k}=\mathcal{Q}_{\Delta}(X_{k}+\tau_{k}) and τk\tau_{k} is a triangular dither, we include the following competitors:

  • 𝚺^no=1nk=1n(X˙k)2\bm{\widehat{\Sigma}}_{no}=\frac{1}{n}\sum_{k=1}^{n}(\dot{X}^{\prime}_{k})^{2}, where X˙k=𝒬Δ(Xk)\dot{X}^{\prime}_{k}=\mathcal{Q}_{\Delta}(X_{k}) is the direct quantization without dithering;

  • 𝚺^uΔ26\bm{\widehat{\Sigma}}_{u}-\frac{\Delta^{2}}{6} and 𝚺^u\bm{\widehat{\Sigma}}_{u}, where 𝚺^u=1nk=1n(X˙k′′)2\bm{\widehat{\Sigma}}_{u}=\frac{1}{n}\sum_{k=1}^{n}(\dot{X}^{\prime\prime}_{k})^{2}, and X˙k′′=𝒬Δ(Xk+τk′′)\dot{X}^{\prime\prime}_{k}=\mathcal{Q}_{\Delta}(X_{k}+\tau^{\prime\prime}_{k}) is quantized under uniform dither τk′′𝒰([Δ2,Δ2])\tau^{\prime\prime}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]).

To illustrate the choice of 𝚺^uΔ26\bm{\widehat{\Sigma}}_{u}-\frac{\Delta^{2}}{6} and 𝚺^u\bm{\widehat{\Sigma}}_{u}, we write X˙k′′=Xk+τk′′+wk=Xk+ξk\dot{X}^{\prime\prime}_{k}=X_{k}+\tau^{\prime\prime}_{k}+w_{k}=X_{k}+\xi_{k} with quantization error wk𝒰([Δ2,Δ2])w_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) (due to Theorem 1(a)) and quantization noise ξk=τk′′+wk\xi_{k}=\tau^{\prime\prime}_{k}+w_{k}; then (3.1) gives 𝔼(X˙k′′)2=𝔼|Xk|2+𝔼|ξk|2\mathbbm{E}(\dot{X}^{\prime\prime}_{k})^{2}=\mathbbm{E}|X_{k}|^{2}+\mathbbm{E}|\xi_{k}|^{2}, while 𝔼|ξk|2\mathbbm{E}|\xi_{k}|^{2} remains unknown. Thus, we consider 𝚺^uΔ26\bm{\widehat{\Sigma}}_{u}-\frac{\Delta^{2}}{6} based on the heuristic (and unjustified) guess 𝔼|ξk|2𝔼|τk′′|2+𝔼|wk|2=Δ26\mathbbm{E}|\xi_{k}|^{2}\approx\mathbbm{E}|\tau^{\prime\prime}_{k}|^{2}+\mathbbm{E}|w_{k}|^{2}=\frac{\Delta^{2}}{6}, while 𝚺^u\bm{\widehat{\Sigma}}_{u} simply forgoes the correction of 𝔼|ξk|2\mathbbm{E}|\xi_{k}|^{2}. We test Δ=3\Delta=3 under n=(2:2:20)103n=(2:2:20)\cdot 10^{3}. As shown in Figure 5(a), the proposed estimator based on quantized data under triangular dither achieves the lowest estimation errors and the optimal rate of n1/2n^{-1/2}, whereas the other competitors are not consistent, i.e., they all plateau at some error floor as the sample size grows.
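The comparison just described can be reproduced in a few lines. The sketch below computes the four estimators of 𝔼|Xk|2\mathbbm{E}|X_{k}|^{2} from the same Gaussian samples; the round-to-nearest quantizer and the sample size are illustrative choices on our part.

import numpy as np

rng = np.random.default_rng(2)
n, Delta = 20000, 3.0
X = rng.standard_normal(n)
quantize = lambda z: Delta * np.round(z / Delta)             # round-to-nearest quantizer (illustrative)
tri = rng.uniform(-Delta / 2, Delta / 2, n) + rng.uniform(-Delta / 2, Delta / 2, n)  # triangular dither
uni = rng.uniform(-Delta / 2, Delta / 2, n)                  # uniform dither
est_triangular = np.mean(quantize(X + tri) ** 2) - Delta ** 2 / 4   # proposed estimator
est_no_dither  = np.mean(quantize(X) ** 2)                          # direct quantization, no dither
est_uniform    = np.mean(quantize(X + uni) ** 2)                    # uniform dither, no correction
est_uniform_c  = est_uniform - Delta ** 2 / 6                       # uniform dither, heuristic correction
# only est_triangular is consistent for E|X_k|^2 = 1; the others plateau at error floors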

For the two remaining signal recovery problems, we simply focus on the quantization of the response yky_{k}. In particular, we simulate QCS in the setting of Theorem 5, with (d,s,Δ)=(50,3,2)(d,s,\Delta)=(50,3,2) under n:=(2:2:20)103n:=(2:2:20)\cdot 10^{3}. Other experimental details are as previously stated. We compare our estimator 𝜽^\bm{\widehat{\theta}} with its counterpart 𝜽^\bm{\widehat{\theta}}^{\prime} defined by (3.6) with the same 𝑸,𝒮\bm{Q},\mathcal{S} but 𝒃=1nk=1ny˙k𝒙k\bm{b}^{\prime}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}^{\prime}_{k}\bm{x}_{k}, where y˙k=𝒬Δ(y~k)\dot{y}_{k}^{\prime}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}) is a direct uniform quantization without dither. The simulation results in Figure 5(b) confirm that applying a uniform dither significantly reduces the recovery errors. Although the results without dithering still exhibit the n1/2n^{-1/2} decreasing rate under Gaussian covariates, an identifiability issue unavoidably arises under Bernoulli covariates; in that case, the curve without dithering evidently deviates from the n1/2n^{-1/2} rate, see [87, Figure 1] for instance.

Analogously, we simulate QMC (Theorem 7) with data generated as in the previous experiments; specifically, we try (d,r,Δ)=(30,5,1.5)(d,r,\Delta)=(30,5,1.5) under n=(5:5:25)103n=(5:5:25)\cdot 10^{3}. While our estimator 𝚯^\bm{\widehat{\Theta}} is defined in (3.8) with y˙k\dot{y}_{k} produced by a dithered quantizer, we also simulate its counterpart without dithering, i.e., 𝚯^\bm{\widehat{\Theta}}^{\prime} defined in (3.8) with y˙k\dot{y}_{k} replaced by y˙k=𝒬Δ(yk)\dot{y}_{k}^{\prime}=\mathcal{Q}_{\Delta}(y_{k}). From the experimental results displayed in Figure 5(c), one can clearly see that 𝚯^\bm{\widehat{\Theta}} performs much better in terms of both the n1/2n^{-1/2} decreasing rate and the estimation error, while the curve without dithering does not even decrease.


Figure 5: (a): covariance matrix estimation; (b): QCS in Theorem 5; (c): QMC in Theorem 7.

6 Concluding Remarks

In digital signal processing and many distributed machine learning systems, data quantization is an indispensable process. On the other hand, many modern datasets exhibit heavy-tailedness, and the past decade has witnessed increasing interest in statistical estimation methods robust to heavy-tailed data. In this work we bridge these two developments by studying the quantization of heavy-tailed data. We propose to truncate the heavy-tailed data before feeding them to a uniform quantizer equipped with a random dither well suited to the problem at hand. Applying our quantization scheme to covariance matrix estimation, compressed sensing, and matrix completion, we proposed (near) optimal estimators based on quantized data, all of which are computationally feasible. These results suggest a unified conclusion that dithered quantization does not affect the key scaling law in the error rate but only slightly worsens the multiplicative factor, a conclusion corroborated by our numerical simulations. Further, we presented additional developments for quantized compressed sensing in two respects. Firstly, we studied a novel setting that involves covariate quantization. Because our quantized covariance matrix estimator is not positive semi-definite, the proposed recovery program is non-convex, but we proved that all local minimizers enjoy a near minimax error rate. At a higher level, this development extends a line of works on non-convex M-estimators [63, 64, 62] to accommodate heavy-tailed covariates; see the deterministic framework in Proposition 1. As applications, we derived results for (dithered) 1-bit compressed sensing as byproducts. Secondly, we established a near minimax uniform recovery guarantee for QCS under heavy-tailed noise, which states that all sparse signals within an 2\ell_{2}-ball can be uniformly recovered up to a near optimal 2\ell_{2}-norm error using a single realization of the measurement ensemble. We believe the developments presented in this work will prove useful in many other estimation problems; for instance, the triangular dither and the quantization scheme apply to multi-task learning, as shown by the subsequent works [22, 60].

References

  • [1] Albert Ai, Alex Lapanowski, Yaniv Plan, and Roman Vershynin. One-bit compressed sensing with non-gaussian measurements. Linear Algebra and its Applications, 441:222–239, 2014.
  • [2] James Bennett, Stan Lanning, et al. The netflix prize. In Proceedings of KDD cup and workshop, volume 2007, page 35. New York, 2007.
  • [3] Sonia A Bhaskar. Probabilistic low-rank matrix completion from quantized measurements. The Journal of Machine Learning Research, 17(1):2131–2164, 2016.
  • [4] Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of statistics, 36(6):2577–2604, 2008.
  • [5] Atanu Biswas, Sujay Datta, Jason P Fine, and Mark R Segal. Statistical advances in the biomedical sciences: clinical trials, epidemiology, survival analysis, and bioinformatics. John Wiley & Sons, 2007.
  • [6] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
  • [7] Petros T Boufounos. Sparse signal reconstruction from phase-only measurements. In Proc. Int. Conf. Sampling Theory and Applications (SampTA), 2013.
  • [8] Petros T Boufounos and Richard G Baraniuk. 1-bit compressive sensing. In 2008 42nd Annual Conference on Information Sciences and Systems, pages 16–21. IEEE, 2008.
  • [9] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
  • [10] Christian Brownlees, Emilien Joly, and Gábor Lugosi. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.
  • [11] T Tony Cai, Cun-Hui Zhang, and Harrison H Zhou. Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38(4):2118–2144, 2010.
  • [12] T Tony Cai and Harrison H Zhou. Optimal rates of convergence for sparse covariance matrix estimation. The Annals of Statistics, pages 2389–2420, 2012.
  • [13] Tony Cai and Weidong Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672–684, 2011.
  • [14] Tony Cai and Wen-Xin Zhou. A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res., 14(1):3619–3647, 2013.
  • [15] Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.
  • [16] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
  • [17] Yang Cao and Yao Xie. Categorical matrix completion. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 369–372. IEEE, 2015.
  • [18] Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’IHP Probabilités et statistiques, volume 48, pages 1148–1185, 2012.
  • [19] Junren Chen and Michael K Ng. Color image inpainting via robust pure quaternion matrix completion: Error bound and weighted loss. SIAM Journal on Imaging Sciences, 15(3):1469–1498, 2022.
  • [20] Junren Chen and Michael K Ng. Uniform exact reconstruction of sparse signals and low-rank matrices from phase-only measurements. IEEE Transactions on Information Theory, 2023.
  • [21] Junren Chen, Cheng-Long Wang, Michael K Ng, and Di Wang. High dimensional statistical estimation under one-bit quantization. IEEE Transactions on Information Theory, 2023.
  • [22] Junren Chen, Yueqi Wang, and Michael K Ng. Quantized low-rank multivariate regression with random dithering. arXiv preprint arXiv:2302.11197, 2023.
  • [23] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems of equations is nearly as easy as solving linear systems. Advances in Neural Information Processing Systems, 28, 2015.
  • [24] Onkar Dabeer and Aditya Karnik. Signal parameter estimation using 1-bit dithered quantization. IEEE Transactions on Information Theory, 52(12):5389–5405, 2006.
  • [25] Alireza Danaee, Rodrigo C de Lamare, and Vitor Heloiz Nascimento. Distributed quantization-aware rls learning with bias compensation and coarsely quantized signals. IEEE Transactions on Signal Processing, 70:3441–3455, 2022.
  • [26] Mark A Davenport, Yaniv Plan, Ewout Van Den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189–223, 2014.
  • [27] Mark A Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622, 2016.
  • [28] Luc Devroye, Matthieu Lerasle, Gabor Lugosi, and Roberto I Oliveira. Sub-gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725, 2016.
  • [29] Sjoerd Dirksen. Quantized compressed sensing: a survey. In Compressed Sensing and Its Applications, pages 67–95. Springer, 2019.
  • [30] Sjoerd Dirksen, Hans Christian Jung, and Holger Rauhut. One-bit compressed sensing with partial gaussian circulant matrices. Information and Inference: A Journal of the IMA, 9(3):601–626, 2020.
  • [31] Sjoerd Dirksen, Johannes Maly, and Holger Rauhut. Covariance estimation under one-bit quantization. The Annals of Statistics, 50(6):3538–3562, 2022.
  • [32] Sjoerd Dirksen and Shahar Mendelson. Non-gaussian hyperplane tessellations and robust one-bit compressed sensing. Journal of the European Mathematical Society, 23(9):2913–2947, 2021.
  • [33] Sjoerd Dirksen, Shahar Mendelson, and Alexander Stollenwerk. Sharp estimates on random hyperplane tessellations. SIAM Journal on Mathematics of Data Science, 4(4):1396–1419, 2022.
  • [34] Jianqing Fan, Quefeng Li, and Yuyan Wang. Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):247–265, 2017.
  • [35] Jianqing Fan, Weichen Wang, and Ziwei Zhu. A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. Annals of statistics, 49(3):1239, 2021.
  • [36] Thomas Feuillen, Mike E Davies, Luc Vandendorpe, and Laurent Jacques. (\ell_{1},\ell_{2})-RIP and projected back-projection reconstruction for phase-only measurements. IEEE Signal Processing Letters, 27:396–400, 2020.
  • [37] Simon Foucart, Deanna Needell, Reese Pathak, Yaniv Plan, and Mary Wootters. Weighted matrix completion from non-random, non-uniform sampling patterns. IEEE Transactions on Information Theory, 67(2):1264–1290, 2020.
  • [38] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Springer New York, New York, NY, 2013.
  • [39] Martin Genzel and Christian Kipp. Generic error bounds for the generalized lasso with sub-exponential data. Sampling Theory, Signal Processing, and Data Analysis, 20(2):15, 2022.
  • [40] Martin Genzel and Alexander Stollenwerk. A unified approach to uniform signal recovery from nonlinear observations. Foundations of Computational Mathematics, pages 1–74, 2022.
  • [41] Robert M Gray and Thomas G Stockham. Dithered quantizers. IEEE Transactions on Information Theory, 39(3):805–812, 1993.
  • [42] David Gross, Yi-Kai Liu, Steven T Flammia, Stephen Becker, and Jens Eisert. Quantum state tomography via compressed sensing. Physical review letters, 105(15):150401, 2010.
  • [43] Osama A Hanna, Yahya H Ezzeldin, Christina Fragouli, and Suhas Diggavi. Quantization of distributed data for learning. IEEE Journal on Selected Areas in Information Theory, 2(3):987–1001, 2021.
  • [44] Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.
  • [45] Marat Ibragimov, Rustam Ibragimov, and Johan Walden. Heavy-tailed distributions and robustness in economics and finance, volume 214. Springer, 2015.
  • [46] Laurent Jacques and Valerio Cambareri. Time for dithering: fast and quantized random embeddings via the restricted isometry property. Information and Inference: A Journal of the IMA, 6(4):441–476, 2017.
  • [47] Laurent Jacques and Thomas Feuillen. The importance of phase in complex compressive sensing. IEEE Transactions on Information Theory, 67(6):4150–4161, 2021.
  • [48] Laurent Jacques, Jason N Laska, Petros T Boufounos, and Richard G Baraniuk. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE transactions on information theory, 59(4):2082–2102, 2013.
  • [49] Halyun Jeong, Xiaowei Li, Yaniv Plan, and Ozgur Yilmaz. Sub-gaussian matrices on sets: Optimal tail dependence and applications. Communications on Pure and Applied Mathematics, 75(8):1713–1754, 2022.
  • [50] Hans Christian Jung, Johannes Maly, Lars Palzer, and Alexander Stollenwerk. Quantized compressed sensing by rectified linear units. IEEE transactions on information theory, 67(6):4125–4149, 2021.
  • [51] Olga Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1):282–303, 2014.
  • [52] Olga Klopp, Jean Lafond, Éric Moulines, and Joseph Salmon. Adaptive multinomial matrix completion. Electronic Journal of Statistics, 9(2):2950–2975, 2015.
  • [53] Olga Klopp, Karim Lounici, and Alexandre B Tsybakov. Robust matrix completion. Probability Theory and Related Fields, 169(1):523–564, 2017.
  • [54] Karin Knudson, Rayan Saab, and Rachel Ward. One-bit compressive sensing with norm estimation. IEEE Transactions on Information Theory, 62(5):2748–2758, 2016.
  • [55] Vladimir Koltchinskii, Karim Lounici, and Alexandre B Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329, 2011.
  • [56] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
  • [57] Piotr Kruczek, Radosław Zimroz, and Agnieszka Wyłomańska. How to detect the cyclostationarity in heavy-tailed distributed signals. Signal Processing, 172:107514, 2020.
  • [58] Arun Kumar Kuchibhotla and Abhishek Chakrabortty. Moving beyond sub-gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. Information and Inference: A Journal of the IMA, 11(4):1389–1456, 2022.
  • [59] Jean Lafond, Olga Klopp, Eric Moulines, and Joseph Salmon. Probabilistic low-rank matrix completion on finite alphabets. Advances in Neural Information Processing Systems, 27, 2014.
  • [60] Kangqiang Li and Yuxuan Wang. Two results on low-rank heavy-tailed multiresponse regressions. arXiv preprint arXiv:2305.13897, 2023.
  • [61] Christopher Liaw, Abbas Mehrabian, Yaniv Plan, and Roman Vershynin. A simple tool for bounding the deviation of random matrices on geometric sets. In Geometric aspects of functional analysis, pages 277–299. Springer, 2017.
  • [62] Po-Ling Loh. Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45(2):866–896, 2017.
  • [63] Po-Ling Loh and Martin J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of statistics, 40(3):1637–1664, 2012.
  • [64] Po-Ling Loh and Martin J. Wainwright. Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16(19):559–616, 2015.
  • [65] Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019.
  • [66] Gabor Lugosi and Shahar Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean. The Annals of Statistics, 49(1):393–410, 2021.
  • [67] Shahar Mendelson. Upper bounds on product and multiplier empirical processes. Stochastic Processes and their Applications, 126(12):3652–3680, 2016.
  • [68] Stanislav Minsker. Geometric median and robust estimation in banach spaces. Bernoulli, 21(4):2308–2335, 2015.
  • [69] Jianhua Mo and Robert W Heath. Limited feedback in single and multi-user mimo systems with finite-bit adcs. IEEE Transactions on Wireless Communications, 17(5):3284–3297, 2018.
  • [70] Sahand Negahban and Martin J Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.
  • [71] Sahand Negahban and Martin J Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13(1):1665–1697, 2012.
  • [72] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical science, 27(4):538–557, 2012.
  • [73] Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
  • [74] Luong Trung Nguyen, Junhan Kim, and Byonghyo Shim. Low-rank matrix completion: A contemporary survey. IEEE Access, 7:94215–94237, 2019.
  • [75] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2012.
  • [76] Yaniv Plan and Roman Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013.
  • [77] Yaniv Plan and Roman Vershynin. Dimension reduction by random hyperplane tessellations. Discrete & Computational Geometry, 51(2):438–461, 2014.
  • [78] Yaniv Plan and Roman Vershynin. The generalized lasso with non-linear observations. IEEE Transactions on information theory, 62(3):1528–1537, 2016.
  • [79] Yaniv Plan, Roman Vershynin, and Elena Yudovina. High-dimensional estimation with geometric constraints. Information and Inference: A Journal of the IMA, 6(1):1–40, 2017.
  • [80] Shuang Qiu, Xiaohan Wei, and Zhuoran Yang. Robust one-bit recovery via relu generative networks: Near-optimal statistical rate and global landscape analysis. In International Conference on Machine Learning, pages 7857–7866. PMLR, 2020.
  • [81] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over \ell_{q}-balls. IEEE transactions on information theory, 57(10):6976–6994, 2011.
  • [82] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(12), 2011.
  • [83] Lawrence Roberts. Picture coding using pseudo-random noise. IRE Transactions on Information Theory, 8(2):145–154, 1962.
  • [84] Sima Sahu, Harsh Vikram Singh, Basant Kumar, and Amit Kumar Singh. De-noising of ultrasound image using bayesian approached heavy-tailed cauchy distribution. Multimedia Tools and Applications, 78(4):4089–4106, 2019.
  • [85] Vidyashankar Sivakumar, Arindam Banerjee, and Pradeep K Ravikumar. Beyond sub-gaussian measurements: High-dimensional structured estimation with sub-exponential designs. Advances in neural information processing systems, 28, 2015.
  • [86] Alexander Stollenwerk. One-bit compressed sensing and fast binary embeddings. PhD thesis, Dissertation, RWTH Aachen University, 2019, 2019.
  • [87] Zhongxing Sun, Wei Cui, and Yulong Liu. Quantized corrupted sensing with random dithering. IEEE Transactions on Signal Processing, 70:600–615, 2022.
  • [88] Ananthram Swami and Brian M Sadler. On some detection and estimation problems in heavy-tailed noise. Signal Processing, 82(12):1829–1846, 2002.
  • [89] Christos Thrampoulidis and Ankit Singh Rawat. The generalized lasso for sub-gaussian measurements with dithered quantization. IEEE Transactions on Information Theory, 66(4):2487–2500, 2020.
  • [90] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [91] Joel A Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.
  • [92] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • [93] Robert F Woolson and William R Clarke. Statistical methods for the analysis of biomedical data. John Wiley & Sons, 2011.
  • [94] Chunlei Xu and Laurent Jacques. Quantized compressive sensing with rip matrices: The benefit of dithering. Information and Inference: A Journal of the IMA, 9(3):543–586, 2020.
  • [95] Tianyu Yang, Johannes Maly, Sjoerd Dirksen, and Giuseppe Caire. Plug-in channel estimation with dithered quantized signals in spatially non-stationary massive mimo systems. arXiv preprint arXiv:2301.04641, 2023.
  • [96] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, pages 4035–4043. PMLR, 2017.
  • [97] Ziwei Zhu and Wenjing Zhou. Taming heavy-tailed features by shrinkage. In International Conference on Artificial Intelligence and Statistics, pages 3268–3276. PMLR, 2021.

Appendix A Proofs in Section 3

A.1 Quantized Covariance Matrix Estimation

We first present Bernstein’s inequality, which recurs throughout our proofs. In applications we will choose the more convenient of the two forms (A.1) and (A.2).

Lemma 1.

(Bernstein’s inequality, [6, Thm. 2.10, Coro. 2.11]). Let X1,,XnX_{1},...,X_{n} be independent random variables, and assume that there exist positive numbers vv and cc such that i=1n𝔼[Xi2]v\sum_{i=1}^{n}\mathbbm{E}[X_{i}^{2}]\leq v and

i=1n𝔼|Xi|qq!2vcq2 for all integers q3,\sum_{i=1}^{n}\mathbbm{E}|X_{i}|^{q}\leq\frac{q!}{2}vc^{q-2}\text{ for all integers }q\geq 3,

then for any t>0t>0 we have

(|i=1n(Xi𝔼Xi)|2vt+ct)2exp(t)\displaystyle\mathbbm{P}\left(\Big{|}\sum_{i=1}^{n}(X_{i}-\mathbbm{E}X_{i})\Big{|}\geq\sqrt{2vt}+ct\right)\leq 2\exp\big{(}-t\big{)} (A.1)
(|i=1n(Xi𝔼Xi)|t)2exp(t22(v+ct))\displaystyle\mathbbm{P}\left(\Big{|}\sum_{i=1}^{n}(X_{i}-\mathbbm{E}X_{i})\Big{|}\geq t\right)\leq 2\exp\left(-\frac{t^{2}}{2(v+ct)}\right) (A.2)

We will also use the Matrix Bernstein’s inequality.

Lemma 2.

(Matrix Bernstein, [91, Thm. 6.1.1]). Let 𝐒1,,𝐒n\bm{S}_{1},...,\bm{S}_{n} be independent zero-mean random matrices with common dimension d1×d2d_{1}\times d_{2}. We assume that 𝐒kopL\|\bm{S}_{k}\|_{op}\leq L for k[n]k\in[n] and introduce the matrix variance statistic

ν=max{k=1n𝔼(𝑺k𝑺k)op,k=1n𝔼(𝑺k𝑺k)op}.\nu=\max\left\{\Big{\|}\sum_{k=1}^{n}\mathbbm{E}(\bm{S}_{k}\bm{S}_{k}^{\top})\Big{\|}_{op},\Big{\|}\sum_{k=1}^{n}\mathbbm{E}(\bm{S}_{k}^{\top}\bm{S}_{k})\Big{\|}_{op}\right\}.

Then for any t0t\geq 0, we have

(k=1n𝑺kopt)(d1+d2)exp(12t2ν+Lt3).\mathbbm{P}\left(\Big{\|}\sum_{k=1}^{n}\bm{S}_{k}\Big{\|}_{op}\geq t\right)\leq(d_{1}+d_{2})\exp\left(\frac{-\frac{1}{2}t^{2}}{\nu+\frac{Lt}{3}}\right).

A.1.1 Proof of Theorem 2

Proof.

Recall that 𝝃k=𝒙˙k𝒙~k\bm{\xi}_{k}=\bm{\dot{x}}_{k}-\bm{\widetilde{x}}_{k} is the quantization noise, which is uncorrelated with 𝒙~k\bm{\widetilde{x}}_{k} and satisfies 𝔼(𝝃k𝝃k)=Δ24𝑰d\mathbbm{E}(\bm{\xi}_{k}\bm{\xi}_{k}^{\top})=\frac{\Delta^{2}}{4}\bm{I}_{d}; hence, as in (3.1), subtracting this known correction term in the definition of 𝚺^\bm{\widehat{\Sigma}} implies 𝔼𝚺^=𝔼(𝒙~k𝒙~k)\mathbbm{E}\bm{\widehat{\Sigma}}=\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}). Thus, by using the triangle inequality we obtain

𝚺^𝚺𝚺^𝔼𝚺^+𝔼(𝒙~k𝒙~k𝒙k𝒙k):=I1+I2.\|\bm{\widehat{\Sigma}}-\bm{\Sigma^{\star}}\|_{\infty}\leq\|\bm{\widehat{\Sigma}}-\mathbbm{E}\bm{\widehat{\Sigma}}\|_{\infty}+\|\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\|_{\infty}:=I_{1}+I_{2}.

Step 1. Bounding I1I_{1}.

Note that 𝚺^𝔼𝚺^=1nk=1n𝒙˙k𝒙˙k𝔼(𝒙˙k𝒙˙k)\|\bm{\widehat{\Sigma}}-\mathbbm{E}\bm{\widehat{\Sigma}}\|_{\infty}=\|\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top})\|_{\infty}, so for any (i,j)[d]×[d](i,j)\in[d]\times[d] we aim to bound the (i,j)(i,j)-th entry error

|σ^ij𝔼σ^ij|=|1nk=1n(x˙kix˙kj𝔼[x˙kix˙kj])|.|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|=\left|\frac{1}{n}\sum_{k=1}^{n}\big{(}\dot{x}_{ki}\dot{x}_{kj}-\mathbbm{E}[\dot{x}_{ki}\dot{x}_{kj}]\big{)}\right|.

Observe that the quantization noise is bounded as follows

𝝃k𝒬Δ(𝒙~k+𝝉k)(𝒙~k+𝝉k)+𝝉k3Δ2,\|\bm{\xi}_{k}\|_{\infty}\leq\|\mathcal{Q}_{\Delta}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k})-(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k})\|_{\infty}+\|\bm{\tau}_{k}\|_{\infty}\leq\frac{3\Delta}{2},

which implies 𝔼|ξki|4(3Δ2)4\mathbbm{E}|\xi_{ki}|^{4}\leq(\frac{3\Delta}{2})^{4} and 𝒙˙k𝒙~k+𝝃kζ+3Δ2\|\bm{\dot{x}}_{k}\|_{\infty}\leq\|\bm{\widetilde{x}}_{k}\|_{\infty}+\|\bm{\xi}_{k}\|_{\infty}\leq\zeta+\frac{3\Delta}{2}. By the moment constraint on xkix_{ki} we have 𝔼|x~ki|4𝔼|xki|4M\mathbbm{E}|\widetilde{x}_{ki}|^{4}\leq\mathbbm{E}|x_{ki}|^{4}\leq M. Thus, for any integer q2q\geq 2 we have the following bound

k=1n𝔼|x˙kix˙kjn|q\displaystyle\sum_{k=1}^{n}\mathbbm{E}\Big{|}\frac{\dot{x}_{ki}\dot{x}_{kj}}{n}\Big{|}^{q} (ζ+32Δ)2(q2)nqk=1n𝔼(x˙kix˙kj)2\displaystyle\leq\frac{(\zeta+\frac{3}{2}\Delta)^{2(q-2)}}{n^{q}}\sum_{k=1}^{n}\mathbbm{E}(\dot{x}_{ki}\dot{x}_{kj})^{2} (A.3)
(ζ+32Δ)2(q2)2nqk=1n(𝔼|x˙ki|4+𝔼|x˙kj|4)\displaystyle\leq\frac{(\zeta+\frac{3}{2}\Delta)^{2(q-2)}}{2n^{q}}\sum_{k=1}^{n}\big{(}\mathbbm{E}|\dot{x}_{ki}|^{4}+\mathbbm{E}|\dot{x}_{kj}|^{4}\big{)}
4(ζ+32Δ)2(q2)nqk=1n(𝔼|x~ki|4+𝔼|x~kj|4+𝔼|ξki|4+𝔼|ξkj|4)\displaystyle\leq\frac{4(\zeta+\frac{3}{2}\Delta)^{2(q-2)}}{n^{q}}\sum_{k=1}^{n}\big{(}\mathbbm{E}|\widetilde{x}_{ki}|^{4}+\mathbbm{E}|\widetilde{x}_{kj}|^{4}+\mathbbm{E}|\xi_{ki}|^{4}+\mathbbm{E}|\xi_{kj}|^{4}\big{)}
q!2v0c0q2,\displaystyle\leq\frac{q!}{2}v_{0}c_{0}^{q-2},

for some v0=O(M+Δ4n)v_{0}=O\big{(}\frac{{M}+\Delta^{4}}{n}\big{)}, c0=O((ζ+Δ)2n)c_{0}=O\big{(}\frac{(\zeta+\Delta)^{2}}{n}\big{)}. With these preparations, we can invoke Bernstein’s inequality (Lemma 1) to obtain that, for any t0t\geq 0, with probability at least 12exp(t)1-2\exp(-t),

|σ^ij𝔼σ^ij|C1((M+Δ4)tn+(ζ2+Δ2)tn).|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|\leq C_{1}\Big{(}\sqrt{\frac{(M+\Delta^{4})t}{n}}+\frac{(\zeta^{2}+\Delta^{2})t}{n}\Big{)}.

Taking t=δlogdt=\delta\log d and using the choice ζ(nMδlogd)1/4\zeta\asymp\big{(}\frac{nM}{\delta\log d}\big{)}^{1/4}, then applying a union bound over (i,j)[d]×[d](i,j)\in[d]\times[d], under the scaling nδlogdn\gtrsim\delta\log d, we obtain that I1(M+Δ2)δlogdnI_{1}\lesssim(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta\log d}{n}} holds with probability at least 12d2δ1-2d^{2-\delta}.

Step 2. Bounding I2I_{2}.

We aim to bound |𝔼(x~kix~kjxkixkj)||\mathbbm{E}(\widetilde{x}_{ki}\widetilde{x}_{kj}-x_{ki}x_{kj})| for any (i,j)[d]×[d](i,j)\in[d]\times[d]. First by the definition of truncation we have

|𝔼(x~kix~kjxkixkj)|𝔼[|xkixkj|(𝟙(|xki|ζ)+𝟙(|xkj|ζ))];\big{|}\mathbbm{E}(\widetilde{x}_{ki}\widetilde{x}_{kj}-x_{ki}x_{kj})\big{|}\leq\mathbbm{E}\big{[}|x_{ki}x_{kj}|(\mathbbm{1}(|x_{ki}|\geq\zeta)+\mathbbm{1}(|x_{kj}|\geq\zeta))\big{]};

then applying Cauchy-Schwarz to 𝔼[|xkixkj|𝟙(|xki|ζ)]\mathbbm{E}\big{[}|x_{ki}x_{kj}|\mathbbm{1}(|x_{ki}|\geq\zeta)\big{]}, we obtain

𝔼[|xkixkj|𝟙(|xki|ζ)][𝔼|xkixkj|2]1/2[(|xki|ζ)]1/2MMζ4=Mζ2,\mathbbm{E}\big{[}|x_{ki}x_{kj}|\mathbbm{1}(|x_{ki}|\geq\zeta)\big{]}\leq\big{[}\mathbbm{E}|x_{ki}x_{kj}|^{2}\big{]}^{1/2}\big{[}\mathbbm{P}(|x_{ki}|\geq\zeta)\big{]}^{1/2}\leq\sqrt{M}\sqrt{\frac{M}{\zeta^{4}}}=\frac{M}{\zeta^{2}},

where the second inequality is due to Markov’s inequality. Note that this bound remains valid for 𝔼[|xkixkj|𝟙(|xkj|ζ)]\mathbbm{E}\big{[}|x_{ki}x_{kj}|\mathbbm{1}(|x_{kj}|\geq\zeta)\big{]}. Since this holds for any (i,j)[d]×[d](i,j)\in[d]\times[d], combining with ζ(nMδlogd)1/4\zeta\asymp\big{(}\frac{nM}{\delta\log d}\big{)}^{1/4}, we obtain 𝔼(𝒙~k𝒙~k𝒙k𝒙k)2Mζ2Mδlogdn\|\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\|_{\infty}\leq\frac{2M}{\zeta^{2}}\lesssim\sqrt{M}\sqrt{\frac{\delta\log d}{n}}.

By putting pieces together, we have 𝚺^𝚺(M+Δ2)δlogdn\|\bm{\widehat{\Sigma}}-\bm{\Sigma^{\star}}\|_{\infty}\lesssim(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta\log d}{n}} with probability at least 12d2δ1-2d^{2-\delta}, as claimed. ∎

A.1.2 Proof of Theorem 3

Proof.

Note that the calculations in (3.1) and (3.2) remain valid (but the truncated samples are denoted by 𝒙ˇk\bm{\check{x}}_{k} rather than 𝒙~k\bm{\widetilde{x}}_{k}), so we have 𝔼𝚺^=𝔼(𝒙ˇk𝒙ˇk)\mathbbm{E}\bm{\widehat{\Sigma}}=\mathbbm{E}(\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}). Using triangle inequality we first decompose the error as

𝚺^𝚺op𝚺^𝔼𝚺^op+𝔼(𝒙ˇk𝒙ˇk𝒙k𝒙k)op:=I1+I2.\|\bm{\widehat{\Sigma}}-\bm{\Sigma^{\star}}\|_{op}\leq\|\bm{\widehat{\Sigma}}-\mathbbm{E}\bm{\widehat{\Sigma}}\|_{op}+\|\mathbbm{E}(\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\|_{op}:=I_{1}+I_{2}.

Step 1. Bounding I1I_{1}.

We first write that

𝚺^𝔼𝚺^=1nk=1n𝑺k where 𝑺k=𝒙˙k𝒙˙k𝔼(𝒙˙k𝒙˙k).\bm{\widehat{\Sigma}}-\mathbbm{E}\bm{\widehat{\Sigma}}=\frac{1}{n}\sum_{k=1}^{n}\bm{S}_{k}\text{ where }\bm{S}_{k}=\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}).

Recall that we define quantization error as 𝒘k=𝒙˙k𝒙ˇk𝝉k\bm{w}_{k}=\bm{\dot{x}}_{k}-\bm{\check{x}}_{k}-\bm{\tau}_{k} and quantization noise as 𝝃k=𝒙˙k𝒙ˇk\bm{\xi}_{k}=\bm{\dot{x}}_{k}-\bm{\check{x}}_{k}, and observe that the quantization noise is bounded 𝝃k=𝒙˙k𝒙ˇk=𝝉k+𝒘k32Δ\|\bm{\xi}_{k}\|_{\infty}=\|\bm{\dot{x}}_{k}-\bm{\check{x}}_{k}\|_{\infty}=\|\bm{\tau}_{k}+\bm{w}_{k}\|_{\infty}\leq\frac{3}{2}\Delta. Thus, by 𝒂22d𝒂42\|\bm{a}\|^{2}_{2}\leq\sqrt{d}\|\bm{a}\|_{4}^{2} that holds for any 𝒂d\bm{a}\in\mathbb{R}^{d}, we obtain

𝒙˙k𝒙˙kop\displaystyle\|\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\|_{op} =𝒙˙k22=𝒙ˇk+𝝃k222𝒙ˇk22+2𝝃k22\displaystyle=\|\bm{\dot{x}}_{k}\|_{2}^{2}=\|\bm{\check{x}}_{k}+\bm{\xi}_{k}\|^{2}_{2}\leq 2\|\bm{\check{x}}_{k}\|^{2}_{2}+2\|\bm{\xi}_{k}\|^{2}_{2}
2d𝒙ˇk42+2d(3Δ2)22dζ2+92dΔ2,\displaystyle\leq 2\sqrt{d}\cdot\|\bm{\check{x}}_{k}\|_{4}^{2}+2d\cdot\Big{(}\frac{3\Delta}{2}\Big{)}^{2}\leq 2\sqrt{d}\zeta^{2}+\frac{9}{2}d\Delta^{2},

which implies 𝑺kop𝒙˙k𝒙˙kop+𝔼𝒙˙k𝒙˙kop4dζ2+9dΔ2\|\bm{S}_{k}\|_{op}\leq\|\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\|_{op}+\mathbbm{E}\|\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\|_{op}\leq 4\sqrt{d}\zeta^{2}+9d\Delta^{2}. Moreover, we estimate the matrix variance statistic. Since 𝑺k\bm{S}_{k} is symmetric, we simply deal with 𝔼𝑺k2op\|\mathbbm{E}\bm{S}_{k}^{2}\|_{op} and some algebra gives 𝔼𝑺k2=𝔼[𝒙˙k22𝒙˙k𝒙˙k](𝔼[𝒙˙k𝒙˙k])2\mathbbm{E}\bm{S}_{k}^{2}=\mathbbm{E}\big{[}\|\bm{\dot{x}}_{k}\|_{2}^{2}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{]}-\big{(}\mathbbm{E}\big{[}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{]}\big{)}^{2}. First let us note that

(𝔼[𝒙˙k𝒙˙k])2op\displaystyle\Big{\|}\big{(}\mathbbm{E}\big{[}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{]}\big{)}^{2}\Big{\|}_{op} =𝔼[𝒙˙k𝒙˙k]op2=𝔼[𝒙ˇk𝒙ˇk]+Δ24𝑰dop2\displaystyle=\Big{\|}\mathbbm{E}\big{[}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{]}\Big{\|}^{2}_{op}=\Big{\|}\mathbbm{E}\big{[}\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}\big{]}+\frac{\Delta^{2}}{4}\bm{I}_{d}\Big{\|}^{2}_{op}
(𝔼[𝒙ˇk𝒙ˇk]op+Δ24)22𝔼[𝒙ˇk𝒙ˇk]op2+Δ48.\displaystyle\leq\Big{(}\Big{\|}\mathbbm{E}\big{[}\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}\big{]}\Big{\|}_{op}+\frac{\Delta^{2}}{4}\Big{)}^{2}\leq 2\Big{\|}\mathbbm{E}\big{[}\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}\big{]}\Big{\|}_{op}^{2}+\frac{\Delta^{4}}{8}.

Combining with the observation that

𝔼[𝒙ˇk𝒙ˇk]op=sup𝒗𝕊d1𝔼(𝒗𝒙ˇk)2sup𝒗𝕊d1𝔼(𝒗𝒙k)4M,\Big{\|}\mathbbm{E}\big{[}\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}\big{]}\Big{\|}_{op}=\sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}(\bm{v}^{\top}\bm{\check{x}}_{k})^{2}\leq\sup_{\bm{v}\in\mathbb{S}^{d-1}}\sqrt{\mathbbm{E}(\bm{v}^{\top}\bm{x}_{k})^{4}}\leq\sqrt{M},

we obtain (𝔼[𝒙˙k𝒙˙k])2op=O(M+Δ4)\big{\|}(\mathbbm{E}[\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}])^{2}\big{\|}_{op}=O(M+\Delta^{4}). Then we turn to the operator norm of 𝔼[𝒙˙k22𝒙˙k𝒙˙k]\mathbbm{E}[\|\bm{\dot{x}}_{k}\|_{2}^{2}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}]. We apply Cauchy-Schwarz to estimate

𝔼(𝒙˙k22𝒙˙k𝒙˙k)op=sup𝒗𝕊d1𝔼(𝒙˙k22(𝒗𝒙˙k)2)𝔼𝒙˙k24sup𝒗𝕊d1𝔼(𝒗𝒙˙k)4.\displaystyle\big{\|}\mathbbm{E}\big{(}\|\bm{\dot{x}}_{k}\|_{2}^{2}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{)}\big{\|}_{op}=\sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}\big{(}\|\bm{\dot{x}}_{k}\|_{2}^{2}(\bm{v}^{\top}\bm{\dot{x}}_{k})^{2}\big{)}{\leq}\sqrt{\mathbbm{E}\|\bm{\dot{x}}_{k}\|_{2}^{4}}\sup_{\bm{v}\in\mathbb{S}^{d-1}}\sqrt{\mathbbm{E}(\bm{v}^{\top}\bm{\dot{x}}_{k})^{4}}. (A.4)

By 𝒂22d𝒂42\|\bm{a}\|^{2}_{2}\leq\sqrt{d}\|\bm{a}\|_{4}^{2} that holds for any 𝒂d\bm{a}\in\mathbb{R}^{d}, 𝔼|xˇki|4𝔼|xki|4M\mathbbm{E}|\check{x}_{ki}|^{4}\leq\mathbbm{E}|x_{ki}|^{4}\leq M, 𝒙˙k=𝒙ˇk+𝝃k\bm{\dot{x}}_{k}=\bm{\check{x}}_{k}+\bm{\xi}_{k} and 𝝃k3Δ2\|\bm{\xi}_{k}\|_{\infty}\leq\frac{3\Delta}{2}, we obtain

𝔼𝒙˙k24\displaystyle\mathbbm{E}\|\bm{\dot{x}}_{k}\|^{4}_{2} 𝔼(𝒙ˇk2+𝝃k2)4𝔼(𝒙ˇk24+𝝃k24)\displaystyle\leq\mathbbm{E}(\|\bm{\check{x}}_{k}\|_{2}+\|\bm{\xi}_{k}\|_{2})^{4}\lesssim\mathbbm{E}(\|\bm{\check{x}}_{k}\|_{2}^{4}+\|\bm{\xi}_{k}\|^{4}_{2}) (A.5)
d𝔼(𝒙ˇk44+𝝃k44)d2(M+Δ4).\displaystyle\leq d\mathbbm{E}(\|\bm{\check{x}}_{k}\|^{4}_{4}+\|\bm{\xi}_{k}\|^{4}_{4})\lesssim d^{2}(M+\Delta^{4}).

For any 𝒗𝕊d1\bm{v}\in\mathbb{S}^{d-1}, we write 𝒙˙k=𝒙ˇk+𝝉k+𝒘k\bm{\dot{x}}_{k}=\bm{\check{x}}_{k}+\bm{\tau}_{k}+\bm{w}_{k} and then have the bound

𝔼(𝒗𝒙˙k)4𝔼(𝒗𝒙ˇk)4+𝔼(𝒗𝝉k)4+𝔼(𝒗𝒘k)4(i)M+Δ4,\displaystyle\mathbbm{E}(\bm{v}^{\top}\bm{\dot{x}}_{k})^{4}\lesssim\mathbbm{E}(\bm{v}^{\top}\bm{\check{x}}_{k})^{4}+\mathbbm{E}(\bm{v}^{\top}\bm{\tau}_{k})^{4}+\mathbbm{E}(\bm{v}^{\top}\bm{w}_{k})^{4}\stackrel{{\scriptstyle(i)}}{{\lesssim}}M+\Delta^{4}, (A.6)

where (i)(i) is because 𝔼(𝒗𝒙ˇk)4𝔼(𝒗𝒙k)4M\mathbbm{E}(\bm{v}^{\top}\bm{\check{x}}_{k})^{4}\leq\mathbbm{E}(\bm{v}^{\top}\bm{x}_{k})^{4}\leq M, 𝝉k𝒰([Δ2,Δ2]d)+𝒰([Δ2,Δ2]d)\bm{\tau}_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d})+\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}), and the quantization error 𝒘k\bm{w}_{k} follows 𝒰([Δ2,Δ2]d)\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]^{d}); in more detail, 𝒗𝝉kψ2,𝒗𝒘kψ2=O(Δ)\|\bm{v}^{\top}\bm{\tau}_{k}\|_{\psi_{2}},\|\bm{v}^{\top}\bm{w}_{k}\|_{\psi_{2}}=O(\Delta) and then the moment property of sub-Gaussian random variables implies 𝔼(𝒗𝝉k)4=O(Δ4)\mathbbm{E}(\bm{v}^{\top}\bm{\tau}_{k})^{4}=O(\Delta^{4}) and 𝔼(𝒗𝒘k)4=O(Δ4)\mathbbm{E}(\bm{v}^{\top}\bm{w}_{k})^{4}=O(\Delta^{4}). From (A.4), (A.5) and (A.6), we obtain 𝔼(𝒙˙k22𝒙˙k𝒙˙k)op=O(d(Δ4+M))\big{\|}\mathbbm{E}\big{(}\|\bm{\dot{x}}_{k}\|_{2}^{2}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{)}\big{\|}_{op}=O\big{(}d(\Delta^{4}+M)\big{)}. Further combining with 𝔼𝑺k2=𝔼[𝒙˙k22𝒙˙k𝒙˙k](𝔼[𝒙˙k𝒙˙k])2\mathbbm{E}\bm{S}_{k}^{2}=\mathbbm{E}\big{[}\|\bm{\dot{x}}_{k}\|_{2}^{2}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{]}-\big{(}\mathbbm{E}\big{[}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\big{]}\big{)}^{2} and (𝔼[𝒙˙k𝒙˙k])2op=O(M+Δ4)\big{\|}(\mathbbm{E}[\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}])^{2}\big{\|}_{op}=O(M+\Delta^{4}), we arrive at 𝔼𝑺k2opd(Δ4+M)\|\mathbbm{E}\bm{S}_{k}^{2}\|_{op}\lesssim d(\Delta^{4}+M) and hence k=1n𝔼𝑺k2opnd(Δ4+M)\big{\|}\sum_{k=1}^{n}\mathbbm{E}\bm{S}_{k}^{2}\big{\|}_{op}\lesssim nd(\Delta^{4}+M). With these preparations, Matrix Bernstein’s inequality (Lemma 2) yields the following inequality that holds for any t0t\geq 0

(𝚺^𝔼𝚺^opt)2dexp(C1nt2(M+Δ4)d+(dζ2+dΔ2)t).\mathbbm{P}\Big{(}\|\bm{\widehat{\Sigma}}-\mathbbm{E}\bm{\widehat{\Sigma}}\|_{op}\geq t\Big{)}\leq 2d\exp\left(-\frac{C_{1}nt^{2}}{(M+\Delta^{4})d+(\sqrt{d}\zeta^{2}+d\Delta^{2})t}\right).

Setting t=C2(M+Δ2)δdlogdnt=C_{2}(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta d\log d}{n}} with sufficiently large C2C_{2}, under the scaling of nδdlogdn\gtrsim\delta d\log d and the threshold ζ(M1/4+Δ)(nδlogd)1/4\zeta\asymp(M^{1/4}+\Delta)\big{(}\frac{n}{\delta\log d}\big{)}^{1/4}, we obtain that I1=𝚺^𝔼𝚺^opC2(M+Δ2)δdlogdnI_{1}=\|\bm{\widehat{\Sigma}}-\mathbbm{E}\bm{\widehat{\Sigma}}\|_{op}\leq C_{2}(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta d\log d}{n}} holds with probability at least 12d1δ1-2d^{1-\delta}.

Step 2. Bounding I2I_{2}.

Having bounded the concentration term I1I_{1}, we now switch to the bias term

I2=sup𝒗𝕊d1|𝒗𝔼(𝒙ˇk𝒙ˇk𝒙k𝒙k)𝒗|.I_{2}=\sup_{\bm{v}\in\mathbb{S}^{d-1}}\big{|}\bm{v}^{\top}\mathbbm{E}(\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\bm{v}\big{|}.

For any 𝒗𝕊d1\bm{v}\in\mathbb{S}^{d-1}, because 𝒙ˇk\bm{\check{x}}_{k} is obtained from truncating 𝒙k\bm{x}_{k} with respect to the 4\ell_{4}-norm, we have

|𝒗𝔼(𝒙ˇk𝒙ˇk𝒙k𝒙k)𝒗|\displaystyle\big{|}\bm{v}^{\top}\mathbbm{E}(\bm{\check{x}}_{k}\bm{\check{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\bm{v}\big{|} =|𝔼[((𝒗𝒙ˇk)2(𝒗𝒙k)2)𝟙(𝒙k4ζ)]|\displaystyle=\Big{|}\mathbbm{E}\Big{[}\big{(}(\bm{v}^{\top}\bm{\check{x}}_{k})^{2}-(\bm{v}^{\top}\bm{x}_{k})^{2}\big{)}\mathbbm{1}(\|\bm{x}_{k}\|_{4}\geq\zeta)\Big{]}\Big{|}
𝔼[(𝒗𝒙k)2𝟙(𝒙k4ζ)]\displaystyle\leq\mathbbm{E}\big{[}(\bm{v}^{\top}\bm{x}_{k})^{2}\mathbbm{1}(\|\bm{x}_{k}\|_{4}\geq\zeta)\big{]}
(i)𝔼(𝒗𝒙k)4(𝒙k44ζ4)\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\sqrt{\mathbbm{E}(\bm{v}^{\top}\bm{x}_{k})^{4}}\sqrt{\mathbbm{P}(\|\bm{x}_{k}\|^{4}_{4}\geq\zeta^{4})}
(ii)M𝔼𝒙k44ζ4(iii)Mδdlogdn,\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}\sqrt{M\frac{\mathbbm{E}\|\bm{x}_{k}\|_{4}^{4}}{\zeta^{4}}}\stackrel{{\scriptstyle(iii)}}{{\lesssim}}\sqrt{\frac{M\delta d\log d}{n}},

where (i)(i) and (ii)(ii) are respectively by the Cauchy-Schwarz and Markov inequalities, and in (iii)(iii) we use ζ(M1/4+Δ)(nδlogd)1/4\zeta\asymp(M^{1/4}+\Delta)\big{(}\frac{n}{\delta\log d}\big{)}^{1/4}. This leads to the bound I2Mδdlogdn.I_{2}\lesssim\sqrt{\frac{M\delta d\log d}{n}}. Combining the bounds of I1,I2I_{1},I_{2} completes the proof.∎

A.1.3 Proof of Theorem 4

This small appendix is devoted to the proof of Theorem 4, for which we need a lemma concerning the element-wise error rate of 𝚺^s\bm{\widehat{\Sigma}}_{s}, i.e., |σ˘ijσij||\breve{\sigma}_{ij}-\sigma^{\star}_{ij}| where we write 𝚺^s=[σ˘ij]\bm{\widehat{\Sigma}}_{s}=[\breve{\sigma}_{ij}], 𝚺=𝔼(𝒙k𝒙k)=[σij]\bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top})=[\sigma_{ij}^{\star}]. Recalling that 𝚺^s=𝒯μ(𝚺^)\bm{\widehat{\Sigma}}_{s}=\mathcal{T}_{\mu}(\bm{\widehat{\Sigma}}), the key message from Lemma 3 is that, due to the thresholding operator 𝒯μ()\mathcal{T}_{\mu}(\cdot), 𝚺^s\bm{\widehat{\Sigma}}_{s} respects an element-wise bound tighter than the O(δlogdn)O\big{(}\mathscr{L}\sqrt{\frac{\delta\log d}{n}}\big{)} rate in Theorem 2, as can be seen from the additional branch |σij||\sigma^{\star}_{ij}| in (A.7).

Lemma 3.

(Element-wise Error Rate of 𝚺^s\bm{\widehat{\Sigma}}_{s}). For any i,j[d]i,j\in[d], the thresholding estimator 𝚺^s=[σ˘ij]\bm{\widehat{\Sigma}}_{s}=[\breve{\sigma}_{ij}] in Theorem 4 satisfies for some CC that

(|σ˘ijσij|Cmin{|σij|,δlogdn})12dδ\mathbbm{P}\left(|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\leq C\min\Big{\{}|\sigma^{\star}_{ij}|,\mathscr{L}\sqrt{\frac{\delta\log d}{n}}\Big{\}}\right)\geq 1-2d^{-\delta} (A.7)

where :=M+Δ2\mathscr{L}:=\sqrt{M}+\Delta^{2}.

Proof.

Recall that 𝚺^s=[σ˘ij]=𝒯μ(𝚺^)=𝒯μ([σ^ij])\bm{\widehat{\Sigma}}_{s}=[\breve{\sigma}_{ij}]=\mathcal{T}_{\mu}(\bm{\widehat{\Sigma}})=\mathcal{T}_{\mu}\big{(}[\widehat{\sigma}_{ij}]\big{)} and hence σ˘ij=𝒯μ(σ^ij)\breve{\sigma}_{ij}=\mathcal{T}_{\mu}(\widehat{\sigma}_{ij}). Given (i,j)(i,j), the proof of Theorem 2 delivers |σ^ijσij|C1δlogdn|\widehat{\sigma}_{ij}-\sigma_{ij}^{\star}|\leq C_{1}\mathscr{L}\sqrt{\frac{\delta\log d}{n}} with probability at least 12dδ1-2d^{-\delta}. Assume we are on this event in the following analysis. As stated in Theorem 4, we set μ=C2δlogdn\mu=C_{2}\mathscr{L}\sqrt{\frac{\delta\log d}{n}} with C2>C1C_{2}>C_{1}, =M+Δ2\mathscr{L}=\sqrt{M}+\Delta^{2}. Since σ˘ij=𝒯μ(σ^ij)\breve{\sigma}_{ij}=\mathcal{T}_{\mu}(\widehat{\sigma}_{ij}), we discuss whether |σ^ij|μ|\widehat{\sigma}_{ij}|\geq\mu holds.

Case 1. when |σ^ij|<μ|\widehat{\sigma}_{ij}|<\mu holds.

In this case we have σ˘ij=0\breve{\sigma}_{ij}=0, thus |σ˘ijσij||σij||\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\leq|\sigma^{\star}_{ij}|. Further, |σij||σijσ^ij|+|σ^ij|C1δlogdn+μδlogdn|\sigma^{\star}_{ij}|\leq|\sigma^{\star}_{ij}-\widehat{\sigma}_{ij}|+|\widehat{\sigma}_{ij}|\leq C_{1}\mathscr{L}\sqrt{\frac{\delta\log d}{n}}+\mu\lesssim\mathscr{L}\sqrt{\frac{\delta\log d}{n}}, so we also have |σ˘ijσij|δlogdn|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\lesssim\mathscr{L}\sqrt{\frac{\delta\log d}{n}}.

Case 2. when |σ^ij|μ|\widehat{\sigma}_{ij}|\geq\mu holds.

In this case, |σ^ij|μ|\widehat{\sigma}_{ij}|\geq\mu implies σ˘ij=σ^ij\breve{\sigma}_{ij}=\widehat{\sigma}_{ij}, hence |σ˘ijσij|=|σ^ijσij|C1δlogdn|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|=|\widehat{\sigma}_{ij}-\sigma^{\star}_{ij}|\leq C_{1}\mathscr{L}\sqrt{\frac{\delta\log d}{n}}. Moreover, |σij||σ^ij||σ^ijσij|μ|σ^ijσij|(C2C1)δlogdn|\sigma^{\star}_{ij}|\geq|\widehat{\sigma}_{ij}|-|\widehat{\sigma}_{ij}-\sigma^{\star}_{ij}|\geq\mu-|\widehat{\sigma}_{ij}-\sigma^{\star}_{ij}|\geq(C_{2}-C_{1})\mathscr{L}\sqrt{\frac{\delta\log d}{n}}, so we also have |σ˘ijσij|=O(|σij|)|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|=O(|\sigma^{\star}_{ij}|).

Therefore, in both cases we have proved that |σ˘ijσij|min{|σij|,δlogdn}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\lesssim\min\big{\{}|\sigma^{\star}_{ij}|,\mathscr{L}\sqrt{\frac{\delta\log d}{n}}\big{\}}, which completes the proof. ∎

We are now in a position to present the proof.

Proof of Theorem 4. We let p=δ41p=\frac{\delta}{4}\geq 1 (just assume δ4\delta\geq 4) and use B0:=δlogdnB_{0}:=\mathscr{L}\sqrt{\frac{\delta\log d}{n}} as shorthand. For (i,j)[d]×[d](i,j)\in[d]\times[d] we define the event 𝒜ij\mathscr{A}_{ij} as

𝒜ij={|σ˘ijσij|C1min{|σij|,B0}}.\mathscr{A}_{ij}=\Big{\{}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\leq C_{1}\min\big{\{}|\sigma^{\star}_{ij}|,B_{0}\big{\}}\Big{\}}.

By Lemma 3 we can choose C1C_{1} to be sufficiently large such that C1B0>3μC_{1}B_{0}>3\mu and (𝒜ij)2dδ\mathbbm{P}(\mathscr{A}_{ij}^{\complement})\leq 2d^{-\delta}; here, by convention we let 𝒜ij\mathscr{A}_{ij}^{\complement} be the complement of 𝒜ij\mathscr{A}_{ij}. Our proof strategy is to first bound the pp-th order moment 𝔼𝚺^s𝚺opp\mathbbm{E}\|\bm{\widehat{\Sigma}}_{s}-\bm{\Sigma^{\star}}\|_{op}^{p}, and then invoke Markov’s inequality to derive a high probability bound. We start with a simple estimate

𝔼𝚺^s𝚺opp(i)𝔼(supj[d]i=1d|σ˘ijσij|𝟙(𝒜ij)+supj[d]i=1d|σ˘ijσij|𝟙(𝒜ij))p\displaystyle\mathbbm{E}\|\bm{\widehat{\Sigma}}_{s}-\bm{\Sigma^{\star}}\|^{p}_{op}\stackrel{{\scriptstyle(i)}}{{\leq}}\mathbbm{E}\Big{(}\sup_{j\in[d]}\sum_{i=1}^{d}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij})+\sup_{j\in[d]}\sum_{i=1}^{d}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\Big{)}^{p}
(ii)2p𝔼supj[d](i=1d|σ˘ijσij|𝟙(𝒜ij))p+2p𝔼supj[d](i=1d|σ˘ijσij|𝟙(𝒜ij))p:=I1+I2\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}2^{p}\mathbbm{E}\sup_{j\in[d]}\Big{(}\sum_{i=1}^{d}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij})\Big{)}^{p}+2^{p}\mathbbm{E}\sup_{j\in[d]}\Big{(}\sum_{i=1}^{d}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}^{\complement}_{ij})\Big{)}^{p}:=I_{1}+I_{2}

where (i)(i) and (ii)(ii) are due to 𝑨opsupj[d]i[d]|aij|\|\bm{A}\|_{op}\leq\sup_{j\in[d]}\sum_{i\in[d]}|a_{ij}| for symmetric 𝑨\bm{A} and (a+b)p(2a)p+(2b)p(a+b)^{p}\leq(2a)^{p}+(2b)^{p}. In this proof, the ranges of indices in summation or supremum, if omitted, are [d][d].

Step 1. Bounding I1I_{1}.

By the definition of 𝒜ij\mathscr{A}_{ij}, |σ˘ijσij|=0|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|=0 if |σij|=0|\sigma^{\star}_{ij}|=0. Because the columns of 𝚺\bm{\Sigma^{\star}} are ss-sparse, we can straightforwardly bound I1I_{1} as follows:

I1=2p𝔼supj(i:|σij|>0|σ˘ijσij|𝟙(𝒜ij))p(2C1sB0)p.\displaystyle I_{1}=2^{p}\mathbbm{E}\sup_{j}\Big{(}\sum_{i:|\sigma^{\star}_{ij}|>0}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij})\Big{)}^{p}\leq\big{(}2C_{1}sB_{0}\big{)}^{p}. (A.8)

Step 2. Bounding I2I_{2}.

We first write I2=2p𝔼supjWjI_{2}=2^{p}\mathbbm{E}\sup_{j}W_{j} with Wj:=(i|σ˘ijσij|𝟙(𝒜ij))pW_{j}:=\big{(}\sum_{i}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\big{)}^{p}, then start from

Wj(i)(i=1d|σij|𝟙(𝒜ij)𝟙(|σ^ij|<μ)+i=1d|σ^ij𝔼σ^ij|𝟙(𝒜ij)+i=1d|σ~ijσij|𝟙(𝒜ij))p\displaystyle W_{j}\stackrel{{\scriptstyle(i)}}{{\leq}}\Big{(}\sum_{i=1}^{d}|\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\mathbbm{1}(|\widehat{\sigma}_{ij}|<\mu)+\sum_{i=1}^{d}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|\mathbbm{1}(\mathscr{A}_{ij}^{\complement})+\sum_{i=1}^{d}|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\Big{)}^{p}
(3d)p1(i=1d|σij|p𝟙(𝒜ij)𝟙(|σ^ij|<μ)+i=1d|σ^ij𝔼σ^ij|p𝟙(𝒜ij)+i=1d|σ~ijσij|p𝟙(𝒜ij)),\displaystyle\leq(3d)^{p-1}\Big{(}\sum_{i=1}^{d}|\sigma^{\star}_{ij}|^{p}\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\mathbbm{1}(|\widehat{\sigma}_{ij}|<\mu)+\sum_{i=1}^{d}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|^{p}\mathbbm{1}(\mathscr{A}_{ij}^{\complement})+\sum_{i=1}^{d}|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|^{p}\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\Big{)},

note that in (i)(i) we define

𝔼σ^ij=𝔼(x~kix~kj):=σ~ij.\mathbbm{E}\widehat{\sigma}_{ij}=\mathbbm{E}(\widetilde{x}_{ki}\widetilde{x}_{kj}):=\widetilde{\sigma}_{ij}.

By replacing supj\sup_{j} with j\sum_{j}, this further gives

I26pdp1(i,j|σij|p𝔼[𝟙(𝒜ij)𝟙(|σ^ij|<μ)]+i,j𝔼[|σ^ij𝔼σ^ij|p𝟙(𝒜ij)]\displaystyle I_{2}\leq 6^{p}d^{p-1}\Big{(}{\sum_{i,j}|\sigma^{\star}_{ij}|^{p}\mathbbm{E}\big{[}\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\mathbbm{1}(|\widehat{\sigma}_{ij}|<\mu)\big{]}}+{\sum_{i,j}\mathbbm{E}\big{[}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|^{p}\mathbbm{1}(\mathscr{A}_{ij}^{\complement})\big{]}} (A.9)
+i,j|σ~ijσij|p(𝒜ij)):=6pdp1(I21+I22+I23).\displaystyle+\sum_{i,j}|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|^{p}\mathbbm{P}(\mathscr{A}_{ij}^{\complement})\Big{)}:=6^{p}d^{p-1}\big{(}I_{21}+I_{22}+I_{23}\big{)}.

Step 2.1. Bounding I21I_{21}.

Note that 𝒜ij\mathscr{A}_{ij}^{\complement} means |σ˘ijσij|>C1min{|σij|,B0}|\breve{\sigma}_{ij}-\sigma^{\star}_{ij}|>C_{1}\min\{|\sigma^{\star}_{ij}|,B_{0}\}, while |σ^ij|<μ|\widehat{\sigma}_{ij}|<\mu implies σ˘ij=0\breve{\sigma}_{ij}=0; their combination thus allows us to proceed as in the following (i)(i) and (iii)(iii):

|σij|>(i)C1B0>(ii)3μ>(iii)3|σ^ij|3|σij|3|σ^ijσij|,|\sigma^{\star}_{ij}|\stackrel{{\scriptstyle(i)}}{{>}}C_{1}B_{0}\stackrel{{\scriptstyle(ii)}}{{>}}3\mu\stackrel{{\scriptstyle(iii)}}{{>}}3|\widehat{\sigma}_{ij}|\geq 3|\sigma^{\star}_{ij}|-3|\widehat{\sigma}_{ij}-\sigma^{\star}_{ij}|,

where (ii)(ii) is due to our choice of C1C_{1}. Thus, 𝒜ij{|σ^ij|<μ}\mathscr{A}_{ij}^{\complement}\cap\{|\widehat{\sigma}_{ij}|<\mu\} implies |σ^ijσij|>23|σij||\widehat{\sigma}_{ij}-\sigma^{\star}_{ij}|>\frac{2}{3}|\sigma^{\star}_{ij}| and |σij|>3μ|\sigma^{\star}_{ij}|>3\mu. Note that Step 2 in the proof of Theorem 2 gives |σ~ijσij|=O(B0)|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|=O\big{(}B_{0}\big{)}, and hence we can assume μ>|σ~ijσij|\mu>|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}| and so |σij|>3|σ~ijσij||\sigma^{\star}_{ij}|>3|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|. Using these relations and triangle inequality, we obtain

23|σij|<|σ^ijσij||σ^ij𝔼σ^ij|+|σ~ijσij|<|σ^ij𝔼σ^ij|+13|σij|,\frac{2}{3}|\sigma^{\star}_{ij}|<|\widehat{\sigma}_{ij}-\sigma^{\star}_{ij}|\leq|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|+|\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|<|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|+\frac{1}{3}|\sigma^{\star}_{ij}|,

which implies |σ^ij𝔼σ^ij|>13|σij||\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|>\frac{1}{3}|\sigma^{\star}_{ij}|. Now we conclude that, 𝒜ij{|σ^ij|<μ}\mathscr{A}_{ij}^{\complement}\cap\{|\widehat{\sigma}_{ij}|<\mu\} implies |σ^ij𝔼σ^ij|>13|σij||\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|>\frac{1}{3}|\sigma^{\star}_{ij}| and |σij|>3μ|\sigma^{\star}_{ij}|>3\mu, which allows us to bound I21I_{21} as

I21i,j|σij|p𝟙(|σij|>3μ)(|σ^ij𝔼σ^ij|>13|σij|).\displaystyle I_{21}\leq\sum_{i,j}|\sigma^{\star}_{ij}|^{p}\mathbbm{1}(|\sigma_{ij}^{\star}|>3\mu)\mathbbm{P}\Big{(}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|>\frac{1}{3}|\sigma^{\star}_{ij}|\Big{)}. (A.10)

Analogously to the proof of Theorem 2, we can apply Bernstein’s inequality to (|σ^ij𝔼σ^ij|>13|σij|)\mathbbm{P}\big{(}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|>\frac{1}{3}|\sigma^{\star}_{ij}|\big{)}. More specifically, by preparations as in (A.3), we can use (A.2) in Lemma 1 with v=O(M+Δ4n)v=O\big{(}\frac{M+\Delta^{4}}{n}\big{)}, c=O(ζ2+Δ2n)=O(Δ2n+Mnδlogd)c=O\big{(}\frac{\zeta^{2}+\Delta^{2}}{n})=O(\frac{\Delta^{2}}{n}+\sqrt{\frac{M}{n\delta\log d}}\big{)} (recall that ζ(nMδlogd)1/4\zeta\asymp\big{(}\frac{nM}{\delta\log d}\big{)}^{1/4}). For some absolute constants C2,C3C_{2},C_{3}, it gives

(|σ^ij𝔼σ^ij|>13|σij|)2exp(|σij|2C2{M+Δ4n+Δ2|σij|n+Mnδlogd|σij|})\displaystyle\mathbbm{P}\Big{(}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|>\frac{1}{3}|\sigma^{\star}_{ij}|\Big{)}\leq 2\exp\left(-\frac{|\sigma^{\star}_{ij}|^{2}}{C_{2}\big{\{}\frac{M+\Delta^{4}}{n}+\frac{\Delta^{2}|\sigma^{\star}_{ij}|}{n}+\sqrt{\frac{M}{n\delta\log d}}|\sigma^{\star}_{ij}|\big{\}}}\right) (A.11)
2exp(3n|σij|C2min{|σij|M+Δ4,1Δ2,δlogdnM})(i)2exp(C3|σij|nδlogdM+Δ2),\displaystyle{\leq}2\exp\left(-\frac{3n|\sigma^{\star}_{ij}|}{C_{2}}\min\Big{\{}\frac{|\sigma^{\star}_{ij}|}{M+\Delta^{4}},\frac{1}{\Delta^{2}},\sqrt{\frac{\delta\log d}{nM}}\Big{\}}\right)\stackrel{{\scriptstyle(i)}}{{\leq}}2\exp\left(-\frac{C_{3}|\sigma^{\star}_{ij}|\sqrt{n\delta\log d}}{\sqrt{M}+\Delta^{2}}\right),

and in (i)(i) we use min{|σij|M+Δ4,Δ2,δlogdnM}1M+Δ2δlogdn\min\big{\{}\frac{|\sigma^{\star}_{ij}|}{M+\Delta^{4}},\Delta^{-2},\sqrt{\frac{\delta\log d}{nM}}\big{\}}\gtrsim\frac{1}{\sqrt{M}+\Delta^{2}}\sqrt{\frac{\delta\log d}{n}} that holds because |σij|>3μ|\sigma^{\star}_{ij}|>3\mu and nδlogdn\gtrsim\delta\log d. We substitute (A.11) into (A.10) and perform some estimates

I212i,j|σij|p𝟙(|σij|>3μ)exp(C3|σij|nδlogdM+Δ2)\displaystyle I_{21}\leq 2\sum_{i,j}|\sigma^{\star}_{ij}|^{p}\mathbbm{1}\big{(}|\sigma^{\star}_{ij}|>3\mu\big{)}\exp\left(-\frac{C_{3}|\sigma^{\star}_{ij}|\sqrt{n\delta\log d}}{\sqrt{M}+\Delta^{2}}\right)
=2i,j(M+Δ2C3nδlogd)p(C3|σij|nδlogdM+Δ2)pexp(0.5C3|σij|nδlogdM+Δ2)\displaystyle=2\sum_{i,j}\left(\frac{\sqrt{M}+\Delta^{2}}{C_{3}\sqrt{n\delta\log d}}\right)^{p}\cdot\left(\frac{C_{3}|\sigma^{\star}_{ij}|\sqrt{n\delta\log d}}{\sqrt{M}+\Delta^{2}}\right)^{p}\exp\left(-\frac{0.5C_{3}|\sigma^{\star}_{ij}|\sqrt{n\delta\log d}}{\sqrt{M}+\Delta^{2}}\right)
exp(0.5C3|σij|nδlogdM+Δ2)𝟙(|σij|>3μ)\displaystyle\cdot\exp\left(-\frac{0.5C_{3}|\sigma^{\star}_{ij}|\sqrt{n\delta\log d}}{\sqrt{M}+\Delta^{2}}\right)\mathbbm{1}\big{(}|\sigma^{\star}_{ij}|>3\mu\big{)}
(i)2i,j(M+Δ2C3nδlogd)p(supt0tpexp(t2))exp(3C32nδlogdμM+Δ2)\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}2\sum_{i,j}\left(\frac{\sqrt{M}+\Delta^{2}}{C_{3}\sqrt{n\delta\log d}}\right)^{p}\cdot\left(\sup_{t\geq 0}~{}t^{p}\exp\Big{(}-\frac{t}{2}\Big{)}\right)\cdot\exp\left(-\frac{3C_{3}}{2}\frac{\sqrt{n\delta\log d}\cdot\mu}{\sqrt{M}+\Delta^{2}}\right)
(ii)2d210δ(M+Δ2C3δnlogd)p2d210δ(C31B0)p,\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}2d^{2-10\delta}\left(\frac{\sqrt{M}+\Delta^{2}}{C_{3}}\sqrt{\frac{\delta}{n\log d}}\right)^{p}\leq 2d^{2-10\delta}(C_{3}^{-1}B_{0})^{p},

where in (i) we substitute |\sigma^{\star}_{ij}|>3\mu from the indicator function into the exponent, and (ii) is because \sup_{t\geq 0}t^{p}\exp\big(-\frac{t}{2}\big)\leq p^{p}, p=\frac{\delta}{4}, and we consider \mu=C_{4}(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta\log d}{n}} with C_{4} large enough.
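
For completeness, we record a short verification of the elementary bound \sup_{t\geq 0}t^{p}\exp(-\frac{t}{2})\leq p^{p} used in (ii); this is a standard calculus fact, as the maximizer of t^{p}e^{-t/2} over t\geq 0 is t=2p, so that

\sup_{t\geq 0}t^{p}\exp\Big(-\frac{t}{2}\Big)=(2p)^{p}e^{-p}=\Big(\frac{2p}{e}\Big)^{p}\leq p^{p},\quad\text{since } 2/e<1.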

Step 2.2. Bounding I_{22}.

Then, we deal with I_{22} by Cauchy-Schwarz:

I_{22}\leq\sum_{i,j}\sqrt{\mathbbm{E}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|^{2p}}\sqrt{\mathbbm{P}(\mathscr{A}_{ij}^{\complement})}.

As in (A.11), we can use (A.2) in Lemma 1 with v=O\big(\frac{M+\Delta^{4}}{n}\big) and c=O\big(\frac{\Delta^{2}}{n}+\sqrt{\frac{M}{n\delta\log d}}\big), yielding that for any t\geq 0, \mathbbm{P}\big(|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|\geq t\big)\leq 2\exp\big(-\frac{t^{2}}{2(v+ct)}\big)\leq 2\exp\big(-\frac{t^{2}}{4v}\big)+2\exp\big(-\frac{t}{4c}\big). Based on this tail bound, we can bound the moment via integration as follows:

\mathbbm{E}|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|^{2p}=2p\int_{0}^{\infty}t^{2p-1}\mathbbm{P}(|\widehat{\sigma}_{ij}-\mathbbm{E}\widehat{\sigma}_{ij}|>t)~\mathrm{d}t
\leq 4p\int_{0}^{\infty}t^{2p-1}\Big(\exp\Big(-\frac{t^{2}}{4v}\Big)+\exp\Big(-\frac{t}{4c}\Big)\Big)~\mathrm{d}t
=2\big[(4v)^{p}\Gamma(p+1)+(4c)^{2p}\Gamma(2p+1)\big]\stackrel{(i)}{\leq}2\big[(4vp)^{p}+(8cp)^{2p}\big],

where we use \Gamma(p+1)\leq p^{p} and \Gamma(2p+1)\leq(2p)^{2p} in (i) under suitably large p. Thus, it follows that

I_{22}\leq\sum_{i,j}2d^{-\frac{\delta}{2}}\sqrt{(4vp)^{p}+(8cp)^{2p}}\leq 2d^{2-\frac{\delta}{2}}\big[(2\sqrt{pv})^{p}+(8cp)^{p}\big]\stackrel{(i)}{\leq}2d^{2-\frac{\delta}{2}}(C_{4}B_{0})^{p},

where (i) is due to 2\sqrt{pv}\leq(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta}{n}} and 8cp=\frac{2\Delta^{2}\delta}{n}+2\sqrt{\frac{\delta M}{n\log d}} (recall that p=\frac{\delta}{4}).
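
To make the step (i) above fully explicit, recall p=\frac{\delta}{4} and v=O\big(\frac{M+\Delta^{4}}{n}\big); then, up to the absolute constant hidden in O(\cdot),

2\sqrt{pv}\lesssim 2\sqrt{\frac{\delta}{4}\cdot\frac{M+\Delta^{4}}{n}}=\sqrt{\frac{\delta(M+\Delta^{4})}{n}}\leq(\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta}{n}},

by \sqrt{a+b}\leq\sqrt{a}+\sqrt{b}; the displayed expression for 8cp follows by directly multiplying c=O\big(\frac{\Delta^{2}}{n}+\sqrt{\frac{M}{n\delta\log d}}\big) by 8p=2\delta.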

Step 2.3. Bounding I_{23}.

From Step 2 in the proof of Theorem 2 we have |\widetilde{\sigma}_{ij}-\sigma^{\star}_{ij}|\leq C_{5}B_{0}. This directly leads to

I_{23}\leq d^{2}\cdot 2d^{-\delta}\cdot(C_{5}B_{0})^{p}=2d^{2-\delta}(C_{5}B_{0})^{p}.

We are in a position to combine everything and conclude the proof. Putting all pieces into the decomposition of I_{2}, it follows that I_{2}\leq d^{1-\frac{\delta}{4}}(C_{6}B_{0})^{p}. Assuming \delta\geq 4, this upper bound is dominated by the bound (A.8) on I_{1}, and hence we can conclude that \mathbbm{E}\|\bm{\widehat{\Sigma}}_{s}-\bm{\Sigma^{\star}}\|_{op}^{p}\leq(C_{6}sB_{0})^{p}. Therefore, by Markov's inequality,

\mathbbm{P}\big(\|\bm{\widehat{\Sigma}}_{s}-\bm{\Sigma^{\star}}\|_{op}\geq C_{6}esB_{0}\big)\leq\frac{\mathbbm{E}\|\bm{\widehat{\Sigma}}_{s}-\bm{\Sigma^{\star}}\|_{op}^{p}}{(C_{6}esB_{0})^{p}}\leq\exp(-p)=\exp\Big(-\frac{\delta}{4}\Big),

which completes the proof. \square

A.2 Quantized Compressed Sensing

Note that our estimation procedures in QCS and QMC fall into the framework of regularized M-estimators; see [70, 35, 21] for instance. In particular, we introduce the following deterministic result for analysing the estimator (3.6).

Lemma 4.

(Adapted from [21, Coro. 2]). Consider (3.4) and the estimator \bm{\widehat{\theta}} defined in (3.6), and let \bm{\widehat{\Upsilon}}:=\bm{\widehat{\theta}}-\bm{\theta^{\star}} be the estimation error. If \bm{Q} is positive semi-definite and \lambda\geq 2\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}, then it holds that \|\bm{\widehat{\Upsilon}}\|_{1}\leq 10\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2}. Moreover, if for some \kappa>0 we have the restricted strong convexity (RSC) \bm{\widehat{\Upsilon}}^{\top}\bm{Q}\bm{\widehat{\Upsilon}}\geq\kappa\|\bm{\widehat{\Upsilon}}\|_{2}^{2}, then we have the error bounds \|\bm{\widehat{\Upsilon}}\|_{2}\leq 30\sqrt{s}\big(\frac{\lambda}{\kappa}\big) and \|\bm{\widehat{\Upsilon}}\|_{1}\leq 300s\big(\frac{\lambda}{\kappa}\big). (Footnote 15: We do not optimize the constants in Lemmas 4, 6 for easy reference.)

To establish the RSC condition, a convenient way is to use the matrix deviation inequality. The following Lemma is adapted from [61], by combining Theorem 3 and Remark 1 therein. (Footnote 16: The dependence on K can be further refined [49], while this is not pursued in the present paper.)

Lemma 5.

(Adapted from [61, Thm. 3]). Assume \bm{A}\in\mathbb{R}^{n\times d} has independent zero-mean sub-Gaussian rows \bm{\alpha}_{k}^{\top} satisfying \|\bm{\alpha}_{k}\|_{\psi_{2}}\leq K, and the eigenvalues of \bm{\Sigma}:=\mathbbm{E}(\bm{\alpha}_{k}\bm{\alpha}_{k}^{\top}) are between [\kappa_{0},\kappa_{1}] for some \kappa_{1}\geq\kappa_{0}>0. For \mathcal{T}\subset\mathbb{R}^{d} we let \mathrm{rad}(\mathcal{T})=\sup_{\bm{x}\in\mathcal{T}}\|\bm{x}\|_{2} be its radius. Then with probability at least 1-\exp(-u^{2}), it holds that

\sup_{\bm{x}\in\mathcal{T}}\Big|\|\bm{Ax}\|_{2}-\sqrt{n}\|\sqrt{\bm{\Sigma}}\bm{x}\|_{2}\Big|\leq\frac{C\sqrt{\kappa_{1}}K^{2}}{\kappa_{0}}\Big(\omega(\mathcal{T})+u\cdot\mathrm{rad}(\mathcal{T})\Big),

where \omega(\mathcal{T})=\mathbbm{E}\sup_{\bm{v}\in\mathcal{T}}[\bm{g}^{\top}\bm{v}] with \bm{g}\sim\mathcal{N}(0,\bm{I}_{d}) is the Gaussian width of \mathcal{T}.
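
As a concrete instance that is used later (in Step 2 of the proof of Theorem 9), for the \ell_{1}-sphere \mathcal{T}=\{\bm{v}\in\mathbb{R}^{d}:\|\bm{v}\|_{1}=1\} one has the standard estimate

\omega(\mathcal{T})=\mathbbm{E}\sup_{\|\bm{v}\|_{1}=1}\bm{g}^{\top}\bm{v}=\mathbbm{E}\|\bm{g}\|_{\infty}\leq\sqrt{2\log(2d)}\lesssim\sqrt{\log d},

which is the Gaussian width bound of [92, Example 7.5.9] invoked in Appendix B.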

Based on Lemma 4, the proofs of Theorems 5-6 are divided into two steps, i.e., showing \lambda\geq 2\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty} and verifying the RSC. While we still have the full \bm{x}_{k} in Theorems 5-6, we will study the more challenging settings where the covariates \bm{x}_{k} are also quantized via \mathcal{Q}_{\bar{\Delta}}(\cdot) in Theorems 9-10, in which we can take \bar{\Delta}=0 to recover the settings of Theorems 5-6. Using this perspective, for most technical ingredients (e.g., the verification of \lambda\geq 2\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}) in the proofs of Theorems 5-6 we can simply refer to the counterparts established in the proofs of Theorems 9-10. This avoids repetition and will be explained more clearly in the proofs.

A.2.1 Proof of Theorem 5

Proof. We divide the proof into two steps.

Step 1. Proving \lambda\geq 2\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}

Recall that we choose \bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{x}_{k}\bm{x}_{k}^{\top} and \bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{x}_{k}. In the setting of Theorem 9, the process of obtaining \dot{y}_{k} remains the same, while the covariates \bm{x}_{k} are further quantized to \bm{\dot{x}}_{k}=\mathcal{Q}_{\bar{\Delta}}(\bm{x}_{k}+\bm{\tau}_{k}) for some \bar{\Delta}>0 under the triangular dither \bm{\tau}_{k}\sim\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d})+\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d}), and there we choose \bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d} and \bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}. As a result, by taking \bar{\Delta}=0, Step 1 in the proof of Theorem 9 implies that under the choice \lambda=C_{1}\frac{\sigma^{2}}{\sqrt{\kappa_{0}}}(\Delta+M^{1/(2l)})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C_{1}, \lambda\geq 2\|\frac{1}{n}\sum_{k=1}^{n}\bm{x}_{k}\bm{x}_{k}^{\top}\bm{\theta^{\star}}-\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{x}_{k}\|_{\infty} holds with probability at least 1-8d^{1-\delta}. Then, by using Lemma 4 we obtain \|\bm{\widehat{\Upsilon}}\|_{1}\leq 10\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2}.

Step 2. Verifying the RSC \bm{\widehat{\Upsilon}}^{\top}\bm{Q}\bm{\widehat{\Upsilon}}\geq\kappa\|\bm{\widehat{\Upsilon}}\|_{2}^{2}

We refer to Step 2 in the proof of Theorem 9. In particular, with the choices \bar{\Delta}=0 and \bm{v}=\bm{\widehat{\Upsilon}} in (B.5), combined with \|\bm{\widehat{\Upsilon}}\|_{1}\leq 10\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2}, we obtain

\frac{1}{\sqrt{n}}\|\bm{X\widehat{\Upsilon}}\|_{2}\geq\sqrt{\kappa_{0}}\|\bm{\widehat{\Upsilon}}\|_{2}-\frac{C_{2}\sqrt{\kappa_{1}}\sigma^{2}}{\kappa_{0}}\sqrt{\frac{\delta s\log d}{n}}\|\bm{\widehat{\Upsilon}}\|_{2}\geq\frac{1}{2}\sqrt{\kappa_{0}}\|\bm{\widehat{\Upsilon}}\|_{2},

where the last inequality is due to the assumed scaling n\gtrsim\delta s\log d. With these preparations, a direct application of Lemma 4 completes the proof. \square

A.2.2 Proof of Theorem 6

Proof. The proof is similarly based on Lemma 4.

Step 1. Proving \lambda\geq 2\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}

Recall that we choose \bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top} and \bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\widetilde{x}}_{k}. In the setting of Theorem 10, the process of obtaining \dot{y}_{k} remains the same, while the truncated covariates \bm{\widetilde{x}}_{k} are further quantized to \bm{\dot{x}}_{k}=\mathcal{Q}_{\bar{\Delta}}(\bm{\widetilde{x}}_{k}+\bm{\tau}_{k}) for some \bar{\Delta}\geq 0 under the triangular dither \bm{\tau}_{k}\sim\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d})+\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d}), and there we choose \bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d} and \bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}. As a result, by taking \bar{\Delta}=0, Step 1 in the proof of Theorem 10 implies that our choice \lambda=C_{1}(R\sqrt{M}+\Delta^{2})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C_{1} ensures \lambda\geq 2\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty} with the promised probability. By Lemma 4 we obtain \|\bm{\widehat{\Upsilon}}\|_{1}\leq 10\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2}.

Step 2. Verifying the RSC \bm{\widehat{\Upsilon}}^{\top}\bm{Q}\bm{\widehat{\Upsilon}}\geq\kappa\|\bm{\widehat{\Upsilon}}\|_{2}^{2}

Unlike the case of sub-Gaussian covariates, which is based on the matrix deviation inequality (Lemma 5), here we establish a lower bound for \bm{\widehat{\Upsilon}}^{\top}\bm{Q}\bm{\widehat{\Upsilon}} using the bound on \|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty} (Theorem 2). Specifically, setting \Delta=0 in Theorem 2 yields that \|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\lesssim\sqrt{\frac{\delta M\log d}{n}} holds with probability at least 1-2d^{2-\delta}, which allows us to proceed as follows:

\bm{\widehat{\Upsilon}}^{\top}\bm{Q}\bm{\widehat{\Upsilon}}=\bm{\widehat{\Upsilon}}^{\top}\bm{\Sigma^{\star}}\bm{\widehat{\Upsilon}}-\bm{\widehat{\Upsilon}}^{\top}(\bm{\Sigma^{\star}}-\bm{Q})\bm{\widehat{\Upsilon}} (A.12)
\stackrel{(i)}{\geq}\kappa_{0}\|\bm{\widehat{\Upsilon}}\|_{2}^{2}-\sqrt{\frac{\delta M\log d}{n}}\|\bm{\widehat{\Upsilon}}\|_{1}^{2}
\stackrel{(ii)}{\geq}\Big(\kappa_{0}-C_{6}s\sqrt{\frac{\delta M\log d}{n}}\Big)\|\bm{\widehat{\Upsilon}}\|^{2}_{2}\stackrel{(iii)}{\geq}\frac{\kappa_{0}}{2}\|\bm{\widehat{\Upsilon}}\|^{2}_{2},

where (i) is because \bm{\widehat{\Upsilon}}^{\top}(\bm{\Sigma^{\star}}-\bm{Q})\bm{\widehat{\Upsilon}}\leq\|\bm{\widehat{\Upsilon}}\|_{1}^{2}\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}, (ii) is due to \|\bm{\widehat{\Upsilon}}\|_{1}\leq 10\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2}, and (iii) is due to the assumed scaling n\gtrsim\delta s^{2}\log d. Now the desired results follow immediately from Lemma 4. \square

A.3 Quantized Matrix Completion

Under the observation model (3.7), we first provide a deterministic framework for analysing the estimator (3.8).

Lemma 6.

(Adapted from [21, Coro. 3]). Let \bm{\widehat{\Upsilon}}:=\bm{\widehat{\Theta}}-\bm{\Theta^{\star}}. If

\lambda\geq 2\left\|\frac{1}{n}\sum_{k=1}^{n}\big(\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle-\dot{y}_{k}\big)\bm{X}_{k}\right\|_{op}, (A.13)

then it holds that \|\bm{\widehat{\Upsilon}}\|_{nu}\leq 10\sqrt{r}\|\bm{\widehat{\Upsilon}}\|_{F}. Moreover, if for some \kappa>0 we have the restricted strong convexity (RSC) \frac{1}{n}\sum_{k=1}^{n}\big|\big\langle\bm{X}_{k},\bm{\widehat{\Upsilon}}\big\rangle\big|^{2}\geq\kappa\|\bm{\widehat{\Upsilon}}\|_{F}^{2}, then we have the error bounds \|\bm{\widehat{\Upsilon}}\|_{F}\leq 30\sqrt{r}\big(\frac{\lambda}{\kappa}\big) and \|\bm{\widehat{\Upsilon}}\|_{nu}\leq 300r\big(\frac{\lambda}{\kappa}\big).

Clearly, to derive the statistical error rate of \bm{\widehat{\Theta}} from Lemma 6, the key ingredients are (A.13) and the RSC. Specialized to the covariates \bm{X}_{k}\sim\mathscr{U}\big(\{\bm{e}_{i}\bm{e}_{j}^{\top}:i,j\in[d]\}\big) in matrix completion, we will use the following lemma to establish the RSC.

Lemma 7.

(Adapted from [21, Lem. 4] with q=0). Given some \alpha>0,\delta>0, we define the constraint set \mathcal{C}(\psi), with sufficiently large \psi, as

\mathcal{C}(\psi)=\Big\{\bm{\Theta}\in\mathbb{R}^{d\times d}:\|\bm{\Theta}\|_{\infty}\leq 2\alpha,~\|\bm{\Theta}\|_{nu}\leq 10\sqrt{r}\|\bm{\Theta}\|_{F},~\|\bm{\Theta}\|_{F}^{2}\geq(\alpha d)^{2}\sqrt{\frac{\psi\delta\log d}{n}}\Big\}. (A.14)

Let \bm{X}_{1},...,\bm{X}_{n} be i.i.d. uniformly distributed on \{\bm{e}_{i}\bm{e}_{j}^{\top}:i,j\in[d]\}. Then there exist absolute constants \kappa\in(0,1) and C such that, with probability at least 1-d^{-\delta}, we have

\frac{1}{n}\sum_{k=1}^{n}\big|\big\langle\bm{X}_{k},\bm{\Theta}\big\rangle\big|^{2}\geq\frac{\kappa\|\bm{\Theta}\|_{F}^{2}}{d^{2}}-\frac{C\alpha^{2}rd\log d}{n},~\forall~\bm{\Theta}\in\mathcal{C}(\psi). (A.15)

Matrix completion with sub-exponential noise was studied in [51], and we make use of the following Lemma in the sub-exponential case.

Lemma 8.

(Adapted from [51, Lem. 5]). Given some \delta>0, let \bm{X}_{1},...,\bm{X}_{n} be i.i.d. uniformly distributed on \{\bm{e}_{i}\bm{e}_{j}^{\top}:i,j\in[d]\}, and let \epsilon_{1},...,\epsilon_{n} be i.i.d. zero-mean random variables, independent of the \bm{X}_{k}, satisfying \|\epsilon_{k}\|_{\psi_{1}}\leq\sigma. If n\gtrsim\delta d\log^{3}d, then with probability at least 1-d^{-\delta} we have

\Big\|\frac{1}{n}\sum_{k=1}^{n}\epsilon_{k}\bm{X}_{k}\Big\|_{op}\leq\sigma\sqrt{\frac{\delta\log d}{nd}}.

A.3.1 Proof of Theorem 7

Proof. We divide the proof into two steps.

Step 1. Proving (A.13)

Defining w_{k}:=\dot{y}_{k}-y_{k}-\tau_{k} as the quantization error, from Theorem 1(a) we know that the w_{k} are independent of \bm{X}_{k} and i.i.d. uniformly distributed on [-\frac{\Delta}{2},\frac{\Delta}{2}]. Thus, we can further write that

\dot{y}_{k}=y_{k}+\tau_{k}+w_{k}=\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle+\epsilon_{k}+\tau_{k}+w_{k},

which allows us to decompose I:=\|\frac{1}{n}\sum_{k=1}^{n}(\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle-\dot{y}_{k})\bm{X}_{k}\|_{op} (the quantity to be bounded in (A.13)) into

I\leq\Big\|\frac{1}{n}\sum_{k=1}^{n}\epsilon_{k}\bm{X}_{k}\Big\|_{op}+\Big\|\frac{1}{n}\sum_{k=1}^{n}\tau_{k}\bm{X}_{k}\Big\|_{op}+\Big\|\frac{1}{n}\sum_{k=1}^{n}w_{k}\bm{X}_{k}\Big\|_{op}=I_{1}+I_{2}+I_{3}.

Because the \epsilon_{k} are independent of the \bm{X}_{k} and are i.i.d. sub-exponential noise satisfying \|\epsilon_{k}\|_{\psi_{1}}\leq\sigma, under the scaling n\gtrsim\delta d\log^{3}d, Lemma 8 implies that I_{1}\lesssim\sigma\sqrt{\frac{\delta\log d}{nd}} holds with probability at least 1-d^{-\delta}. Analogously, the \tau_{k} and w_{k} are independent of \{\bm{X}_{k}:k\in[n]\} and are i.i.d. uniformly distributed on [-\frac{\Delta}{2},\frac{\Delta}{2}], so Lemma 8 also applies to I_{2} and I_{3}, yielding that with the promised probability I_{2}+I_{3}\lesssim\Delta\sqrt{\frac{\delta\log d}{nd}}. Taken collectively, I\lesssim(\sigma+\Delta)\sqrt{\frac{\delta\log d}{nd}}, so setting \lambda=C_{1}(\sigma+\Delta)\sqrt{\frac{\delta\log d}{nd}} with sufficiently large C_{1} ensures \lambda\geq 2I with probability at least 1-3d^{-\delta}. Further, Lemma 6 gives \|\bm{\widehat{\Upsilon}}\|_{nu}\leq 10\sqrt{r}\|\bm{\widehat{\Upsilon}}\|_{F}.

Step 2. Verifying RSC

First note that \|\bm{\widehat{\Upsilon}}\|_{\infty}\leq\|\bm{\widehat{\Theta}}\|_{\infty}+\|\bm{\Theta^{\star}}\|_{\infty}\leq 2\alpha; and as proved before, \|\bm{\widehat{\Upsilon}}\|_{nu}\leq 10\sqrt{r}\|\bm{\widehat{\Upsilon}}\|_{F}. To proceed, we define the constraint set \mathcal{C}(\psi) as in (A.14) with some properly chosen constant \psi. Then, using Lemma 7, (A.15) holds with probability at least 1-d^{-\delta} for some absolute constants \kappa,C. We now discuss several cases.

1) If \bm{\widehat{\Upsilon}}\notin\mathcal{C}(\psi), because \bm{\widehat{\Upsilon}} satisfies the first two constraints in the definition of \mathcal{C}(\psi), it must violate the third constraint and satisfy \|\bm{\widehat{\Upsilon}}\|_{F}^{2}\leq(\alpha d)^{2}\sqrt{\frac{\psi\delta\log d}{n}}, which gives \|\bm{\widehat{\Upsilon}}\|_{F}\lesssim\alpha d\big(\frac{\delta\log d}{n}\big)^{1/4}\stackrel{(i)}{\lesssim}\alpha d\sqrt{\frac{\delta rd\log d}{n}}, as desired. Note that (i) is due to the scaling n\lesssim\delta r^{2}d^{2}\log d.

2) If \bm{\widehat{\Upsilon}}\in\mathcal{C}(\psi), (A.15) implies that \frac{1}{n}\sum_{k=1}^{n}\big|\big\langle\bm{X}_{k},\bm{\widehat{\Upsilon}}\big\rangle\big|^{2}\geq\frac{\kappa\|\bm{\widehat{\Upsilon}}\|_{F}^{2}}{d^{2}}-C\frac{\alpha^{2}rd\log d}{n}, and we further consider the following two cases.

2.1) If C\frac{\alpha^{2}rd\log d}{n}\geq\frac{\kappa\|\bm{\widehat{\Upsilon}}\|_{F}^{2}}{2d^{2}}, we have \|\bm{\widehat{\Upsilon}}\|_{F}\lesssim\alpha d\sqrt{\frac{rd\log d}{n}}, as desired.

2.2) If C\frac{\alpha^{2}rd\log d}{n}<\frac{\kappa\|\bm{\widehat{\Upsilon}}\|_{F}^{2}}{2d^{2}}, then the RSC condition holds: \frac{1}{n}\sum_{k=1}^{n}\big|\big\langle\bm{X}_{k},\bm{\widehat{\Upsilon}}\big\rangle\big|^{2}\geq\frac{\kappa\|\bm{\widehat{\Upsilon}}\|^{2}_{F}}{2d^{2}}. This allows us to apply Lemma 6 to obtain \|\bm{\widehat{\Upsilon}}\|_{F}\lesssim(\sigma+\Delta)d\sqrt{\frac{\delta rd\log d}{n}}.

Thus, in any case, we have shown \|\bm{\widehat{\Upsilon}}\|_{F}=O\big((\alpha+\sigma+\Delta)d\sqrt{\frac{\delta rd\log d}{n}}\big). The bound on \|\bm{\widehat{\Upsilon}}\|_{nu} follows immediately from \|\bm{\widehat{\Upsilon}}\|_{nu}\leq 10\sqrt{r}\|\bm{\widehat{\Upsilon}}\|_{F}. The proof is complete. \square

A.3.2 Proof of Theorem 8

Proof. The proof is based on Lemma 6 and divided into two steps.

Step 1. Proving (A.13)

Recall that the quantization error w_{k}:=\dot{y}_{k}-\widetilde{y}_{k}-\tau_{k} is zero-mean and independent of \bm{X}_{k} (Theorem 1(a)); thus we have \mathbbm{E}(\dot{y}_{k}\bm{X}_{k})=\mathbbm{E}(\widetilde{y}_{k}\bm{X}_{k})+\mathbbm{E}(\tau_{k}\bm{X}_{k})+\mathbbm{E}(w_{k}\bm{X}_{k})=\mathbbm{E}(\widetilde{y}_{k}\bm{X}_{k}). Combining this with \mathbbm{E}\big(\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\bm{X}_{k}\big)=\mathbbm{E}(y_{k}\bm{X}_{k}), the triangle inequality first decomposes the target term into

\Big\|\frac{1}{n}\sum_{k=1}^{n}\big(\dot{y}_{k}-\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\big)\bm{X}_{k}\Big\|_{op}\leq\Big\|\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{X}_{k}-\mathbbm{E}(\dot{y}_{k}\bm{X}_{k})\Big\|_{op}+\Big\|\mathbbm{E}\big(y_{k}\bm{X}_{k}-\widetilde{y}_{k}\bm{X}_{k}\big)\Big\|_{op}
+\Big\|\frac{1}{n}\sum_{k=1}^{n}\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\bm{X}_{k}-\mathbbm{E}\big(\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\bm{X}_{k}\big)\Big\|_{op}:=I_{1}+I_{2}+I_{3}.

Step 1.1. Bounding I_{1} and I_{3}

We write I_{1}=\|\sum_{k=1}^{n}\bm{S}_{k}\|_{op} and I_{3}=\|\sum_{k=1}^{n}\bm{W}_{k}\|_{op} by defining

\bm{S}_{k}=\frac{1}{n}\big(\dot{y}_{k}\bm{X}_{k}-\mathbbm{E}(\dot{y}_{k}\bm{X}_{k})\big),\quad\bm{W}_{k}=\frac{1}{n}\big(\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\bm{X}_{k}-\mathbbm{E}\big[\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\bm{X}_{k}\big]\big).

By |\dot{y}_{k}|\leq|\widetilde{y}_{k}|+|\tau_{k}|+|w_{k}|\leq\zeta_{y}+\Delta we have

\|\bm{S}_{k}\|_{op}\leq\frac{1}{n}\|\dot{y}_{k}\bm{X}_{k}\|_{op}+\frac{1}{n}\|\mathbbm{E}(\dot{y}_{k}\bm{X}_{k})\|_{op}\leq\frac{1}{n}\|\dot{y}_{k}\bm{X}_{k}\|_{op}+\frac{1}{n}\mathbbm{E}\|\dot{y}_{k}\bm{X}_{k}\|_{op}\leq\frac{2(\zeta_{y}+\Delta)}{n}.

Analogously, we have \|\bm{W}_{k}\|_{op}\leq\frac{2\alpha}{n} since |\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle|\leq\|\bm{\Theta^{\star}}\|_{\infty}\leq\alpha. In addition, by \|\mathbbm{E}\{(\bm{A}-\mathbbm{E}\bm{A})^{\top}(\bm{A}-\mathbbm{E}\bm{A})\}\|_{op}\leq\|\mathbbm{E}(\bm{A}^{\top}\bm{A})\|_{op} (\forall\bm{A}) and the simple fact \mathbbm{E}(\bm{X}_{k}\bm{X}_{k}^{\top})=\mathbbm{E}(\bm{X}_{k}^{\top}\bm{X}_{k})=\bm{I}_{d}/d, we estimate the matrix variance statistic as follows

\Big\|\sum_{k=1}^{n}\mathbbm{E}(\bm{S}_{k}\bm{S}_{k}^{\top})\Big\|_{op}=n\big\|\mathbbm{E}(\bm{S}_{k}\bm{S}_{k}^{\top})\big\|_{op}\leq\frac{1}{n}\big\|\mathbbm{E}\big(\dot{y}_{k}^{2}\bm{X}_{k}\bm{X}_{k}^{\top}\big)\big\|_{op}
=\frac{1}{n}\sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}\big(\dot{y}_{k}^{2}\cdot\|\bm{X}_{k}^{\top}\bm{v}\|_{2}^{2}\big)=\frac{1}{n}\sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}_{\bm{X}_{k}}\Big(\big[\mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(\dot{y}_{k}^{2})\big]\|\bm{X}_{k}^{\top}\bm{v}\|_{2}^{2}\Big)
\stackrel{(i)}{\leq}\frac{4}{n}(\alpha^{2}+M+\Delta^{2})\sup_{\bm{v}\in\mathbb{S}^{d-1}}\mathbbm{E}_{\bm{X}_{k}}\|\bm{X}_{k}^{\top}\bm{v}\|^{2}_{2}\leq\frac{4(\alpha^{2}+M+\Delta^{2})}{nd},

where (i) is because given \bm{X}_{k} we can estimate \mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(\dot{y}_{k}^{2})\leq 2\big(\mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(\widetilde{y}_{k}^{2})+\Delta^{2}\big) since |\dot{y}_{k}-\widetilde{y}_{k}|\leq\Delta, and moreover \mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(\widetilde{y}_{k}^{2})\leq\mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(y_{k}^{2})\leq 2\big(\mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle^{2})+\mathbbm{E}_{\dot{y}_{k}|\bm{X}_{k}}(\epsilon_{k}^{2})\big)\leq 2(\alpha^{2}+M). It is not hard to see that this bound remains valid for \|\sum_{k=1}^{n}\mathbbm{E}(\bm{S}_{k}^{\top}\bm{S}_{k})\|_{op}. Also, by similar arguments one can prove

\max\left\{\Big\|\sum_{k=1}^{n}\mathbbm{E}(\bm{W}_{k}^{\top}\bm{W}_{k})\Big\|_{op},\Big\|\sum_{k=1}^{n}\mathbbm{E}(\bm{W}_{k}\bm{W}_{k}^{\top})\Big\|_{op}\right\}\leq\frac{\alpha^{2}}{nd}.

Thus, Matrix Bernstein’s inequality (Lemma 2) gives

\mathbbm{P}\big(I_{1}\geq t\big)\leq 2d\cdot\exp\Big(-\frac{C_{4}ndt^{2}}{(\alpha^{2}+M+\Delta^{2})+(\zeta_{y}+\Delta)dt}\Big),\qquad\mathbbm{P}\big(I_{3}\geq t\big)\leq 2d\cdot\exp\Big(-\frac{C_{5}ndt^{2}}{\alpha^{2}+\alpha dt}\Big).

Thus, setting t=C_{6}(\alpha+\sqrt{M}+\Delta)\sqrt{\frac{\delta\log d}{nd}} in the two inequalities above with sufficiently large C_{6}, combined with the scaling \sqrt{\frac{\delta d\log d}{n}}=O(1), we obtain that I_{1}+I_{3}\lesssim(\alpha+\sqrt{M}+\Delta)\sqrt{\frac{\delta\log d}{nd}} holds with probability at least 1-4d^{1-\delta}.

Step 1.2. Bounding I_{2}

Let us first bound \|\mathbbm{E}\big((y_{k}-\widetilde{y}_{k})\bm{X}_{k}\big)\|_{\infty}. Write the (i,j)-th entry of \bm{X}_{k} as x_{k,ij}; then for a given (i,j), \mathbbm{P}(x_{k,ij}=1)=d^{-2} and x_{k,ij}=0 otherwise. We can thus proceed by the following estimates:

\big|\mathbbm{E}\big((y_{k}-\widetilde{y}_{k})x_{k,ij}\big)\big|=\big|\mathbbm{E}\big((y_{k}-\widetilde{y}_{k})x_{k,ij}\mathbbm{1}(|y_{k}|\geq\zeta_{y})\big)\big|
\leq\mathbbm{E}\big(|y_{k}|x_{k,ij}\mathbbm{1}(|y_{k}|\geq\zeta_{y})\big)
=\mathbbm{E}_{x_{k,ij}}\big(\big\{\mathbbm{E}_{y_{k}|x_{k,ij}}|y_{k}|\mathbbm{1}(|y_{k}|\geq\zeta_{y})\big\}x_{k,ij}\big)
=d^{-2}\mathbbm{E}_{y_{k}|x_{k,ij}=1}\big(|y_{k}|\mathbbm{1}(|y_{k}|\geq\zeta_{y})\big)
\stackrel{(i)}{\leq}d^{-2}\sqrt{\mathbbm{E}_{y_{k}\sim\theta^{\star}_{ij}+\epsilon_{k}}(y_{k}^{2})}\sqrt{\mathbbm{P}_{y_{k}\sim\theta^{\star}_{ij}+\epsilon_{k}}(y_{k}^{2}\geq\zeta_{y}^{2})}
\stackrel{(ii)}{\leq}d^{-2}\frac{\alpha^{2}+M}{\zeta_{y}}\lesssim\frac{\alpha+\sqrt{M}}{d^{2}}\sqrt{\frac{\delta d\log d}{n}},

where (i) and (ii) are by Cauchy-Schwarz and Markov's inequality, respectively. Since this holds for any (i,j), we obtain \|\mathbbm{E}\big((y_{k}-\widetilde{y}_{k})\bm{X}_{k}\big)\|_{\infty}=O\big((\alpha+\sqrt{M})d^{-2}\sqrt{\frac{\delta d\log d}{n}}\big), which further gives I_{2}=O\big((\alpha+\sqrt{M})\sqrt{\frac{\delta\log d}{nd}}\big) by using \|\bm{A}\|_{op}\leq d\|\bm{A}\|_{\infty} (\forall\bm{A}\in\mathbb{R}^{d\times d}). Putting pieces together, with probability at least 1-4d^{1-\delta} we have \|\frac{1}{n}\sum_{k}\big(\dot{y}_{k}-\big\langle\bm{X}_{k},\bm{\Theta^{\star}}\big\rangle\big)\bm{X}_{k}\|_{op}\lesssim(\alpha+\sqrt{M}+\Delta)\sqrt{\frac{\delta\log d}{nd}}, hence \lambda=C_{1}(\alpha+\sqrt{M}+\Delta)\sqrt{\frac{\delta\log d}{nd}} ensures (A.13) with the same probability. Further, Lemma 6 gives \|\bm{\widehat{\Upsilon}}\|_{nu}\leq 10\sqrt{r}\|\bm{\widehat{\Upsilon}}\|_{F}.
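
For completeness, the elementary fact \|\bm{A}\|_{op}\leq d\|\bm{A}\|_{\infty} used above (with \|\bm{A}\|_{\infty} denoting the entrywise maximum) can be checked via the Frobenius norm:

\|\bm{A}\|_{op}\leq\|\bm{A}\|_{F}=\Big(\sum_{i,j\in[d]}|a_{ij}|^{2}\Big)^{1/2}\leq\big(d^{2}\|\bm{A}\|_{\infty}^{2}\big)^{1/2}=d\|\bm{A}\|_{\infty}.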

Step 2. Verifying RSC

The remaining part is almost the same as Step 2 in the proof of Theorem 7: we define the constraint set \mathcal{C}(\psi) as in (A.14) and then discuss several cases based on whether \bm{\widehat{\Upsilon}}\in\mathcal{C}(\psi) holds. Thus, we conclude the proof without providing the details. \square

Appendix B Proofs in Section 4

This appendix collects the proofs in Section 4 concerning covariate quantization and uniform signal recovery in QCS.

B.1 Covariate Quantization

Because of the non-convexity, the proofs in this part can no longer be based on Lemma 4. Indeed, bounding the estimation errors of the \bm{\widetilde{\theta}} satisfying (4.2) requires more tedious manipulations, essentially due to the additional \ell_{1} constraint (induced by the constraint \mathcal{S} in (4.2)).

B.1.1 Proof of Theorem 9

Proof. The proof is divided into three steps — the first two steps resemble the previous proofs that are based on Lemma 4, while we bound the estimation errors in the last step.

Step 1. Proving \lambda\geq\beta\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty} for some pre-specified \beta>2

Recall that (\bm{Q},\bm{b}) are constructed from the quantized data as \bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d} and \bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}. We will show that \lambda=C_{1}\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}(\Delta+M^{1/(2l)})\sqrt{\frac{\delta\log d}{n}} guarantees \lambda\geq\beta\|\frac{1}{n}\sum_{k=1}^{n}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d})\bm{\theta^{\star}}-\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}\|_{\infty} with the promised probability, where \beta>2 is any pre-specified constant. Recall the notation we introduced: \dot{y}_{k}=\widetilde{y}_{k}+\phi_{k}+\vartheta_{k} with the quantization error \vartheta_{k}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) being independent of \widetilde{y}_{k}, and \bm{\dot{x}}_{k}=\bm{x}_{k}+\bm{\tau}_{k}+\bm{w}_{k} with the quantization error \bm{w}_{k}\sim\mathscr{U}([-\frac{\bar{\Delta}}{2},\frac{\bar{\Delta}}{2}]^{d}) being independent of \bm{x}_{k}. Combining these with the assumptions that the dithers are independent of (\bm{x}_{k},y_{k}) and that the \phi_{k} and \bm{\tau}_{k} are independent, we have

\mathbbm{E}(\dot{y}_{k}\bm{\dot{x}}_{k})=\mathbbm{E}\big((\widetilde{y}_{k}+\phi_{k}+\vartheta_{k})(\bm{x}_{k}+\bm{\tau}_{k}+\bm{w}_{k})\big)=\mathbbm{E}(\widetilde{y}_{k}\bm{x}_{k}),\qquad\mathbbm{E}\Big(\Big[\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}\Big]\bm{\theta^{\star}}\Big)=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}\bm{\theta^{\star}})=\mathbbm{E}(y_{k}\bm{x}_{k}), (B.1)

which allows us to decompose the target term as two concentration terms (I_{1},I_{3}) and a bias term (I_{2}):

\Big\|\frac{1}{n}\sum_{k=1}^{n}\Big[\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}\Big]\bm{\theta^{\star}}-\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}\Big\|_{\infty}\leq\Big\|\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}-\mathbbm{E}(\dot{y}_{k}\bm{\dot{x}}_{k})\Big\|_{\infty}+\Big\|\mathbbm{E}\big(y_{k}\bm{x}_{k}-\widetilde{y}_{k}\bm{x}_{k}\big)\Big\|_{\infty}+\Big\|\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}}-\mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}})\Big\|_{\infty}:=I_{1}+I_{2}+I_{3}.

Step 1.1. Bounding I_{1}

Denote the i-th entry of \bm{x}_{k},\bm{\dot{x}}_{k},\bm{\tau}_{k},\bm{w}_{k} by x_{ki},\dot{x}_{ki},\tau_{ki},w_{ki}, respectively. For I_{1}, the i-th entry reads \frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\dot{x}_{ki}-\mathbbm{E}(\dot{y}_{k}\dot{x}_{ki}). By using the relations |\dot{y}_{k}|\leq|\widetilde{y}_{k}|+|\phi_{k}|+|\vartheta_{k}|\leq\zeta_{y}+\Delta, \|\bm{\dot{x}}_{k}\|_{\psi_{2}}\leq\|\bm{x}_{k}\|_{\psi_{2}}+\|\bm{\tau}_{k}\|_{\psi_{2}}+\|\bm{w}_{k}\|_{\psi_{2}}\lesssim\sigma+\bar{\Delta} and \mathbbm{E}|\dot{y}_{k}|^{2l}\lesssim\mathbbm{E}|\widetilde{y}_{k}|^{2l}+\mathbbm{E}|\phi_{k}+\vartheta_{k}|^{2l}\lesssim M+\Delta^{2l}, for any integer q\geq 2 we can bound that

\sum_{k=1}^{n}\mathbbm{E}\Big|\frac{\dot{y}_{k}\dot{x}_{ki}}{n}\Big|^{q}\leq\frac{(\zeta_{y}+\Delta)^{q-2}}{n^{q}}\sum_{k=1}^{n}\mathbbm{E}|\dot{y}_{k}^{2}\dot{x}_{ki}^{q}| (B.2)
\stackrel{(i)}{\leq}\frac{(\zeta_{y}+\Delta)^{q-2}}{n^{q}}\sum_{k=1}^{n}\big\{\mathbbm{E}|\dot{y}_{k}|^{2l}\big\}^{\frac{1}{l}}\big\{\mathbbm{E}|\dot{x}_{ki}|^{\frac{lq}{l-1}}\big\}^{1-\frac{1}{l}}
\stackrel{(ii)}{\lesssim}\Big(\frac{(\sigma+\bar{\Delta})(\zeta_{y}+\Delta)}{n}\Big)^{q-2}\Big(\frac{(\sigma+\bar{\Delta})^{2}(M^{\frac{1}{l}}+\Delta^{2})}{n}\Big)\Big(\sqrt{\frac{lq}{l-1}}\Big)^{q};

combining with Stirling's approximation and treating l as an absolute constant, this provides

\sum_{k=1}^{n}\mathbbm{E}\Big|\frac{\dot{y}_{k}\dot{x}_{ki}}{n}\Big|^{q}\leq\frac{q!}{2}v_{0}c_{0}^{q-2}\text{ where }v_{0}=O\Big(\frac{(\sigma+\bar{\Delta})^{2}(M^{{1}/{l}}+\Delta^{2})}{n}\Big),~c_{0}=O\Big(\frac{(\sigma+\bar{\Delta})(\zeta_{y}+\Delta)}{n}\Big).

In (B.2), (i) is due to Hölder's inequality, and in (ii) we use the moment constraint of sub-Gaussian variables (2.2). With these preparations, we invoke Bernstein's inequality (Lemma 1) and then a union bound over i\in[d] to obtain

\mathbbm{P}\Big(I_{1}\lesssim(\sigma+\bar{\Delta})(M^{\frac{1}{2l}}+\Delta)\sqrt{\frac{t}{n}}+\frac{(\sigma+\bar{\Delta})(\zeta_{y}+\Delta)t}{n}\Big)\geq 1-2d\cdot\exp(-t).

Thus, taking t=\delta\log d and plugging in \zeta_{y}\asymp\sqrt{\frac{nM^{1/l}}{\delta\log d}}, we obtain

\mathbbm{P}\Big(I_{1}\lesssim(\sigma+\bar{\Delta})(M^{1/(2l)}+\Delta)\sqrt{\frac{\delta\log d}{n}}\Big)\geq 1-2d^{1-\delta}.
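
To spell out how the last display follows, note that with t=\delta\log d and \zeta_{y}\asymp\sqrt{\frac{nM^{1/l}}{\delta\log d}}, the second term in the preceding probability bound satisfies

\frac{(\sigma+\bar{\Delta})(\zeta_{y}+\Delta)\delta\log d}{n}\lesssim(\sigma+\bar{\Delta})\Big(M^{\frac{1}{2l}}\sqrt{\frac{\delta\log d}{n}}+\frac{\Delta\delta\log d}{n}\Big)\lesssim(\sigma+\bar{\Delta})(M^{1/(2l)}+\Delta)\sqrt{\frac{\delta\log d}{n}},

where the last step uses \delta\log d\lesssim n.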

Step 1.2. Bounding I_{2}

Moreover, we estimate the i-th entry of I_{2} by

|\mathbbm{E}\big((y_{k}-\widetilde{y}_{k})x_{ki}\big)|\leq\mathbbm{E}|y_{k}x_{ki}\mathbbm{1}(|y_{k}|\geq\zeta_{y})| (B.3)
\stackrel{(i)}{\leq}\big(\mathbbm{E}|y_{k}|^{\frac{2l}{2l-1}}|x_{ki}|^{\frac{2l}{2l-1}}\big)^{1-\frac{1}{2l}}\big(\mathbbm{P}(|y_{k}|\geq\zeta_{y})\big)^{\frac{1}{2l}}
\stackrel{(ii)}{\leq}\Big(\big[\mathbbm{E}|y_{k}|^{2l}\big]^{\frac{1}{2l-1}}\big[\mathbbm{E}|x_{ki}|^{\frac{l}{l-1}}\big]^{\frac{2l-2}{2l-1}}\Big)^{1-\frac{1}{2l}}\Big(\mathbbm{P}(|y_{k}|^{2l}\geq\zeta_{y}^{2l})\Big)^{\frac{1}{2l}}
\stackrel{(iii)}{\leq}\frac{\sigma M^{1/l}}{\zeta_{y}}\lesssim\sigma M^{\frac{1}{2l}}\sqrt{\frac{\delta\log d}{n}},

where (i) and (ii) are due to Hölder's inequality, and (iii) is due to Markov's inequality. Since this holds for every i\in[d], it gives I_{2}\lesssim\sigma M^{1/(2l)}\sqrt{\frac{\delta\log d}{n}}.
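
For clarity, the last step in (B.3) only plugs in the truncation level \zeta_{y}\asymp\sqrt{\frac{nM^{1/l}}{\delta\log d}}:

\frac{\sigma M^{1/l}}{\zeta_{y}}\asymp\sigma M^{1/l}\sqrt{\frac{\delta\log d}{nM^{1/l}}}=\sigma M^{\frac{1}{2l}}\sqrt{\frac{\delta\log d}{n}}.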

Step 1.3. Bounding I_{3}

We first derive a bound for \|\bm{\theta^{\star}}\|_{2} that is implicitly implied by other conditions:

M^{1/l}\geq\mathbbm{E}|y_{k}|^{2}\geq\mathbbm{E}(\bm{x}_{k}^{\top}\bm{\theta^{\star}})^{2}=(\bm{\theta^{\star}})^{\top}\bm{\Sigma^{\star}}\bm{\theta^{\star}}\geq\kappa_{0}\|\bm{\theta^{\star}}\|_{2}^{2}\Longrightarrow\|\bm{\theta^{\star}}\|_{2}=O\Big(\frac{M^{1/(2l)}}{\sqrt{\kappa_{0}}}\Big).

Hence, we can estimate \|(\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}})\dot{x}_{ki}\|_{\psi_{1}}\leq\|\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}}\|_{\psi_{2}}\|\dot{x}_{ki}\|_{\psi_{2}}\leq\|\bm{\dot{x}}_{k}\|_{\psi_{2}}^{2}\|\bm{\theta^{\star}}\|_{2}\lesssim(\sigma+\bar{\Delta})^{2}\frac{M^{1/(2l)}}{\sqrt{\kappa_{0}}}. Then, we invoke Bernstein's inequality regarding the independent sum of sub-exponential random variables (e.g., [92, Thm. 2.8.1]) to obtain that for any t\geq 0 we have (Footnote 17: The application of Bernstein's inequality leads to the \sigma^{2} dependence (\sigma is the upper bound on \|\bm{x}_{k}\|_{\psi_{2}}) in the multiplicative factor \mathscr{L}. It is possible to refine this quadratic dependence by using a new Bernstein's inequality developed in [49, Thm. 1.3], but we do not pursue this in the present paper.)

\mathbbm{P}\left(\Big|\frac{1}{n}\sum_{k=1}^{n}(\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}})\dot{x}_{ki}-\mathbbm{E}\{(\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}})\dot{x}_{ki}\}\Big|\geq t\right)\leq 2\exp\left(-C_{5}n\min\left\{\frac{\sqrt{\kappa_{0}}t}{(\sigma+\bar{\Delta})^{2}M^{\frac{1}{2l}}},\Big(\frac{\sqrt{\kappa_{0}}t}{(\sigma+\bar{\Delta})^{2}M^{\frac{1}{2l}}}\Big)^{2}\right\}\right).

Hence, we can set t=C_{6}\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}M^{1/(2l)}\sqrt{\frac{\delta\log d}{n}} with sufficiently large C_{6} and further apply a union bound over i\in[d]; under the scaling that \frac{\delta\log d}{n} is small enough, we obtain that I_{3}\lesssim\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}M^{1/(2l)}\sqrt{\frac{\delta\log d}{n}} holds with probability at least 1-2d^{1-\delta}.

Putting pieces together, since \kappa_{0}\lesssim\sigma^{2}, it is immediate that I\lesssim\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}\big(\Delta+M^{1/(2l)}\big)\sqrt{\frac{\delta\log d}{n}} holds with probability at least 1-8d^{1-\delta}. Since \lambda=C_{1}\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}(\Delta+M^{1/(2l)})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C_{1}, \lambda\geq\beta\cdot\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty} holds w.h.p.
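
To see why the bounds on I_{1},I_{2},I_{3} can be unified under the factor \frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}, note that \kappa_{0}\lesssim\sigma^{2} gives

\sigma+\bar{\Delta}\lesssim\frac{(\sigma+\bar{\Delta})\sigma}{\sqrt{\kappa_{0}}}\leq\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}\quad\text{and}\quad\sigma\lesssim\frac{\sigma^{2}}{\sqrt{\kappa_{0}}}\leq\frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}},

so the prefactors (\sigma+\bar{\Delta}) in the bound on I_{1} and \sigma in the bound on I_{2} are both dominated by \frac{(\sigma+\bar{\Delta})^{2}}{\sqrt{\kappa_{0}}}.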

Step 2. Verifying RSC

We provide a lower bound for \bm{v}^{\top}\bm{Q}\bm{v}=\frac{1}{n}\|\bm{\dot{X}v}\|_{2}^{2}-\frac{\bar{\Delta}^{2}}{4}\|\bm{v}\|_{2}^{2} by using the matrix deviation inequality (Lemma 5). First note that the rows of \bm{\dot{X}} are sub-Gaussian with \|\bm{\dot{x}}_{k}\|_{\psi_{2}}\lesssim\sigma+\bar{\Delta}. Since \mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top})=\bm{\Sigma^{\star}}+\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}, all eigenvalues of \bm{\dot{\Sigma}}:=\mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}) are between [\kappa_{0}+\frac{1}{4}\bar{\Delta}^{2},\kappa_{1}+\frac{1}{4}\bar{\Delta}^{2}]. Thus, we invoke Lemma 5 for \mathcal{T}:=\{\bm{v}\in\mathbb{R}^{d}:\|\bm{v}\|_{1}=1\} with u=\sqrt{\delta\log d}; due to the well-known Gaussian width estimate \omega(\mathcal{T})\lesssim\sqrt{\log d} [92, Example 7.5.9], with probability at least 1-d^{-\delta} the following event holds:

\sup_{\|\bm{v}\|_{1}=1}\Big|\|\bm{\dot{X}v}\|_{2}-\sqrt{n}\|\bm{\dot{\Sigma}}^{1/2}\bm{v}\|_{2}\Big|\leq\frac{c_{1}\sqrt{\kappa_{1}+\frac{1}{4}\bar{\Delta}^{2}}(\sigma+\bar{\Delta})^{2}}{\kappa_{0}+\frac{1}{4}\bar{\Delta}^{2}}\sqrt{\delta\log d}:=c_{1}\mathscr{L}_{1}\sqrt{\delta\log d}.

Under the same probability, a simple rescaling then provides

\Big|\frac{1}{\sqrt{n}}\|\bm{\dot{X}v}\|_{2}-\|\bm{\dot{\Sigma}}^{1/2}\bm{v}\|_{2}\Big|\leq c_{1}\mathscr{L}_{1}\sqrt{\frac{\delta\log d}{n}}\|\bm{v}\|_{1},~\forall~\bm{v}\in\mathbb{R}^{d}, (B.4)

which implies

\frac{1}{\sqrt{n}}\|\bm{\dot{X}v}\|_{2}\geq\|\bm{\dot{\Sigma}}^{1/2}\bm{v}\|_{2}-c_{1}\mathscr{L}_{1}\Big(\frac{\delta\log d}{n}\Big)^{1/2}\|\bm{v}\|_{1}\geq\Big(\kappa_{0}+\frac{1}{4}\bar{\Delta}^{2}\Big)^{1/2}\|\bm{v}\|_{2}-c_{1}\mathscr{L}_{1}\Big(\frac{\delta\log d}{n}\Big)^{1/2}\|\bm{v}\|_{1},~\forall~\bm{v}\in\mathbb{R}^{d}. (B.5)

Based on (B.5), we let \hat{c}:=\frac{2\kappa_{0}+\bar{\Delta}^{2}}{4\kappa_{0}+\bar{\Delta}^{2}} and use the inequality (a-b)^{2}\geq\hat{c}a^{2}-\frac{\hat{c}}{1-\hat{c}}b^{2} to obtain

\bm{v}^{\top}\bm{Q}\bm{v}=\frac{1}{n}\|\bm{\dot{X}v}\|_{2}^{2}-\frac{\bar{\Delta}^{2}}{4}\|\bm{v}\|_{2}^{2}
\geq\hat{c}\big(\kappa_{0}+\frac{1}{4}\bar{\Delta}^{2}\big)\|\bm{v}\|_{2}^{2}-c_{1}^{2}\mathscr{L}_{1}^{2}\frac{\hat{c}}{1-\hat{c}}\frac{\delta\log d}{n}\|\bm{v}\|_{1}^{2}-\frac{\bar{\Delta}^{2}}{4}\|\bm{v}\|_{2}^{2}
\geq\frac{\kappa_{0}}{2}\|\bm{v}\|_{2}^{2}-c_{1}^{2}\mathscr{L}_{1}^{2}\big(1+\frac{\bar{\Delta}^{2}}{2\kappa_{0}}\big)\frac{\delta\log d}{n}\|\bm{v}\|_{1}^{2}
:=\frac{\kappa_{0}}{2}\|\bm{v}\|_{2}^{2}-c_{2}(\kappa_{0},\sigma,\bar{\Delta})\cdot\frac{\delta\log d}{n}\|\bm{v}\|_{1}^{2},

which holds for all \bm{v}\in\mathbb{R}^{d}, and \hat{c}_{2}:=c_{2}(\kappa_{0},\sigma,\bar{\Delta}) is a constant depending on \kappa_{0},\sigma,\bar{\Delta} (we remove the dependence on \kappa_{1} by \kappa_{1}\lesssim\sigma^{2}).
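
For completeness, the elementary inequality (a-b)^{2}\geq\hat{c}a^{2}-\frac{\hat{c}}{1-\hat{c}}b^{2} used above follows from Young's inequality 2ab\leq\varepsilon a^{2}+\varepsilon^{-1}b^{2} with \varepsilon:=1-\hat{c}\in(0,1):

(a-b)^{2}=a^{2}-2ab+b^{2}\geq a^{2}-(\varepsilon a^{2}+\varepsilon^{-1}b^{2})+b^{2}=(1-\varepsilon)a^{2}+\Big(1-\frac{1}{\varepsilon}\Big)b^{2}=\hat{c}a^{2}-\frac{\hat{c}}{1-\hat{c}}b^{2}.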

Step 3. Bounding the Estimation Error

We are in a position to bound the estimation error of any \bm{\widetilde{\theta}} satisfying (4.2). Note that the definition of the subgradient \partial\|\bm{\widetilde{\theta}}\|_{1} gives \lambda\|\bm{\theta^{\star}}\|_{1}-\lambda\|\bm{\widetilde{\theta}}\|_{1}\geq\big\langle\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},-\bm{\widetilde{\Upsilon}}\big\rangle. Thus, we set \bm{\theta}=\bm{\theta^{\star}} in (4.2) and proceed as follows:

0\geq\big\langle\bm{Q\widetilde{\theta}}-\bm{b}+\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big\rangle (B.6)
=\bm{\widetilde{\Upsilon}}^{\top}\bm{Q}\bm{\widetilde{\Upsilon}}+\big\langle\bm{Q\theta^{\star}}-\bm{b},\bm{\widetilde{\Upsilon}}\big\rangle+\big\langle\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big\rangle
\stackrel{(i)}{\geq}\frac{\kappa_{0}}{2}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\frac{2\hat{c}_{2}R\sqrt{s}\cdot\delta\log d}{n}\|\bm{\widetilde{\Upsilon}}\|_{1}-\frac{\lambda}{\beta}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big(\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big)
\stackrel{(ii)}{\geq}\frac{\kappa_{0}}{2}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big(\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big),

where we use \lambda\geq\beta\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}~(\beta>2) and \|\bm{\widetilde{\Upsilon}}\|_{1}\leq\|\bm{\widetilde{\theta}}\|_{1}+\|\bm{\theta^{\star}}\|_{1}\leq 2R\sqrt{s} in (i), and (ii) is due to the scaling 2\hat{c}_{2}R\delta\sqrt{s}\frac{\log d}{n}\leq(\frac{1}{2}-\frac{1}{\beta})\lambda, which holds under the assumed n\gtrsim\delta\log d with some hidden constant depending on (\kappa_{0},\sigma,\bar{\Delta},\Delta,M,R). Thus, we arrive at \frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}\geq\lambda\big(\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big).

For \bm{a}\in\mathbb{R}^{d} and \mathcal{J}\subset[d], we obtain \bm{a}_{\mathcal{J}}\in\mathbb{R}^{d} by keeping the entries of \bm{a} in \mathcal{J} while setting the others to zero. Let \mathcal{A} be the support of \bm{\theta^{\star}} and \mathcal{A}^{c}=[d]\setminus\mathcal{A}; then we have

\frac{1}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}\geq\|\bm{\theta^{\star}}+\bm{\widetilde{\Upsilon}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}=\|\bm{\theta^{\star}}+\bm{\widetilde{\Upsilon}}_{\mathcal{A}}+\bm{\widetilde{\Upsilon}}_{\mathcal{A}^{c}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\geq\|\bm{\theta^{\star}}\|_{1}+\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}^{c}}\|_{1}-\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}=\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}^{c}}\|_{1}-\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}}\|_{1}. (B.7)

Further using \frac{1}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq\frac{1}{2}\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}}\|_{1}+\frac{1}{2}\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}^{c}}\|_{1}, we obtain \|\bm{\widetilde{\Upsilon}}_{\mathcal{A}^{c}}\|_{1}\leq 3\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}}\|_{1}. Hence, we have \|\bm{\widetilde{\Upsilon}}\|_{1}\leq\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}}\|_{1}+\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}^{c}}\|_{1}\leq 4\|\bm{\widetilde{\Upsilon}}_{\mathcal{A}}\|_{1}\leq 4\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2}. Now, we further substitute this into (B.6) and obtain

\frac{1}{2}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|^{2}_{2}\leq\frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big(\|\bm{\theta^{\star}}\|_{1}-\|\bm{\widetilde{\theta}}\|_{1}\big)\leq\frac{3\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq 6\lambda\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2}.

Thus, we arrive at the desired error bound for the \ell_{2}-norm:

\|\bm{\widetilde{\Upsilon}}\|_{2}\lesssim\mathscr{L}\sqrt{\frac{\delta s\log d}{n}},~\mathrm{with}~\mathscr{L}:=\frac{(\sigma+\bar{\Delta})^{2}(\Delta+M^{1/(2l)})}{\kappa_{0}^{3/2}}.

We simply use \|\bm{\widetilde{\Upsilon}}\|_{1}\leq 4\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2} again to establish the bound for \|\bm{\widetilde{\Upsilon}}\|_{1}. The proof is complete. \square

B.1.2 Proof of Theorem 10

Proof. The proof is divided into two steps. Compared to the last proof, due to the heavy-tailedness of \bm{x}_{k}, the step of “verifying RSC” reduces to the simpler argument in (B.9).

Step 1. Proving \lambda\geq\beta\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty} for some pre-specified \beta>2

Recall that (\bm{Q},\bm{b}) are constructed from the quantized data as \bm{Q}=\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d} and \bm{b}=\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}. Thus, our main aim in this step is to prove that \lambda=C_{1}(R\sqrt{M}+\Delta^{2}+R\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} suffices to ensure \lambda\geq\beta\|\frac{1}{n}\sum_{k=1}^{n}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d})\bm{\theta^{\star}}-\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}\|_{\infty} with the promised probability and any pre-specified \beta>2. We let \bm{\dot{x}}_{k}=\bm{\widetilde{x}}_{k}+\bm{\tau}_{k}+\bm{w}_{k} and \dot{y}_{k}=\widetilde{y}_{k}+\phi_{k}+\vartheta_{k} with quantization errors \bm{w}_{k} and \vartheta_{k}. Analogously to (B.1), we have \mathbbm{E}[\dot{y}_{k}\bm{\dot{x}}_{k}]=\mathbbm{E}(\widetilde{y}_{k}\bm{\widetilde{x}}_{k}) and \mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}\bm{\theta^{\star}})=\frac{\bar{\Delta}^{2}}{4}\bm{\theta^{\star}}+\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}\bm{\theta^{\star}}). Thus, the term we want to bound can be first decomposed into two concentration terms (I_{1},I_{3}) and one bias term (I_{2}):

\Big\|\Big(\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\frac{\bar{\Delta}^{2}}{4}\bm{I}_{d}\Big)\bm{\theta^{\star}}-\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}\Big\|_{\infty}\leq\Big\|\frac{1}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k}-\mathbbm{E}(\dot{y}_{k}\bm{\dot{x}}_{k})\Big\|_{\infty}+\Big\|\mathbbm{E}\big(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}\bm{\theta^{\star}}\big)-\mathbbm{E}\big(\widetilde{y}_{k}\bm{\widetilde{x}}_{k}\big)\Big\|_{\infty}+\Big\|\Big(\frac{1}{n}\sum_{k=1}^{n}\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top}-\mathbbm{E}(\bm{\dot{x}}_{k}\bm{\dot{x}}_{k}^{\top})\Big)\bm{\theta^{\star}}\Big\|_{\infty}:=I_{1}+I_{2}+I_{3}. (B.8)

Step 1.1. Bounding I_{1}

Denote the i-th entry of \bm{x}_{k},\bm{\widetilde{x}}_{k},\bm{\dot{x}}_{k},\bm{\tau}_{k},\bm{w}_{k} by x_{ki},\widetilde{x}_{ki},\dot{x}_{ki},\tau_{ki},w_{ki}, respectively. Since |\dot{y}_{k}|\leq|\widetilde{y}_{k}|+|\phi_{k}|+|\vartheta_{k}|\leq|\widetilde{y}_{k}|+\Delta and |\dot{x}_{ki}|\leq|\widetilde{x}_{ki}|+|\tau_{ki}|+|w_{ki}|\leq|\widetilde{x}_{ki}|+\frac{3}{2}\bar{\Delta}, for any integer q\geq 2 we can bound the moments as

\sum_{k=1}^{n}\mathbbm{E}\Big|\frac{\dot{y}_{k}\dot{x}_{ki}}{n}\Big|^{q}\leq\frac{(\zeta_{x}+\frac{3}{2}\bar{\Delta})^{q-2}(\zeta_{y}+\Delta)^{q-2}}{n^{q}}\sum_{k=1}^{n}\mathbbm{E}|\dot{y}_{k}\dot{x}_{ki}|^{2}
\leq\frac{[(\zeta_{x}+\frac{3}{2}\bar{\Delta})(\zeta_{y}+\Delta)]^{q-2}}{n^{q}}\sum_{k=1}^{n}\sqrt{\mathbbm{E}|\dot{y}_{k}|^{4}\mathbbm{E}|\dot{x}_{ki}|^{4}}
\lesssim\Big(\frac{(\zeta_{x}+\frac{3}{2}\bar{\Delta})(\zeta_{y}+\Delta)}{n}\Big)^{q-2}\Big(\frac{M+\Delta^{4}+\bar{\Delta}^{4}}{n}\Big),

and in the last inequality we use \mathbbm{E}|\widetilde{y}_{k}|^{4}\leq\mathbbm{E}|y_{k}|^{4}\leq M and \mathbbm{E}|\widetilde{x}_{ki}|^{4}\leq\mathbbm{E}|x_{ki}|^{4}\leq M. With these preparations, we apply Bernstein's inequality (Lemma 1) and a union bound, yielding that

\mathbbm{P}\left(I_{1}\geq C_{5}\left\{\sqrt{\frac{(M+\Delta^{4}+\bar{\Delta}^{4})t}{n}}+\frac{(\zeta_{x}+\frac{3}{2}\bar{\Delta})(\zeta_{y}+\Delta)t}{n}\right\}\right)\leq 2d\cdot\exp(-t).

Setting t=\delta\log d, we obtain that I_{1}\lesssim(\sqrt{M}+\Delta^{2}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} holds with probability at least 1-2d^{1-\delta}.

Step 1.2. Bounding I_{2}

Noting that \mathbbm{E}(y_{k}\bm{x}_{k})=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}\bm{\theta^{\star}})+\mathbbm{E}(\epsilon_{k}\bm{x}_{k})=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}\bm{\theta^{\star}}), we could further decompose I_{2} as I_{2}\leq\|\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}\bm{\theta^{\star}})-\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}\bm{\theta^{\star}})\|_{\infty}+\|\mathbbm{E}(y_{k}\bm{x}_{k})-\mathbbm{E}(\widetilde{y}_{k}\bm{\widetilde{x}}_{k})\|_{\infty}:=I_{21}+I_{22}. To bound I_{21}, we note that the assumption and truncation procedure for \bm{x}_{k} are the same as in Theorem 2; thus, Step 2 in the proof of Theorem 2 yields that \|\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\|_{\infty}\leq\sqrt{\frac{\delta M\log d}{n}}. Thus, we have I_{21}\leq\|\mathbbm{E}(\bm{\widetilde{x}}_{k}\bm{\widetilde{x}}_{k}^{\top}-\bm{x}_{k}\bm{x}_{k}^{\top})\|_{\infty}\|\bm{\theta^{\star}}\|_{1}\leq R\sqrt{M}\sqrt{\frac{\delta\log d}{n}}. To bound I_{22}, we estimate the i-th entry:

\big|\mathbbm{E}(y_{k}x_{ki}-\widetilde{y}_{k}\widetilde{x}_{ki})\big|=\big|\mathbbm{E}(y_{k}x_{ki}-\widetilde{y}_{k}\widetilde{x}_{ki})\big(\mathbbm{1}(|y_{k}|>\zeta_{y})+\mathbbm{1}(|x_{ki}|\geq\zeta_{x})\big)\big|
\leq\mathbbm{E}\big(|y_{k}x_{ki}|\mathbbm{1}(|y_{k}|>\zeta_{y})\big)+\mathbbm{E}\big(|y_{k}x_{ki}|\mathbbm{1}(|x_{ki}|\geq\zeta_{x})\big)
\stackrel{(i)}{\leq}M\Big(\frac{1}{\zeta_{x}^{2}}+\frac{1}{\zeta_{y}^{2}}\Big)\lesssim\sqrt{\frac{\delta M\log d}{n}},

where (i) is because \mathbbm{E}\big(|y_{k}x_{ki}|\mathbbm{1}(|y_{k}|>\zeta_{y})\big)\leq\big[\mathbbm{E}|y_{k}^{2}x_{ki}^{2}|\big]^{1/2}\sqrt{\mathbbm{P}(|y_{k}|>\zeta_{y})}\leq(\mathbbm{E}y_{k}^{4})^{1/4}(\mathbbm{E}x_{ki}^{4})^{1/4}\sqrt{\frac{\mathbbm{E}y_{k}^{4}}{\zeta_{y}^{4}}}\leq\frac{M}{\zeta_{y}^{2}}, and a similar treatment applies to \mathbbm{E}\big(|y_{k}x_{ki}|\mathbbm{1}(|x_{ki}|>\zeta_{x})\big). Overall, we have I_{2}\lesssim R\sqrt{M}\sqrt{\frac{\delta\log d}{n}}.

Step 1.3. Bounding I_{3}

We first note that

I_{3}=\|(\bm{Q}-\bm{\Sigma^{\star}})\bm{\theta^{\star}}\|_{\infty}\leq\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\cdot\|\bm{\theta^{\star}}\|_{1}\leq R\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}.

By Theorem 2, \|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\lesssim(\sqrt{M}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} holds with probability at least 1-2d^{2-\delta}, which leads to I_{3}\leq\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\|\bm{\theta^{\star}}\|_{1}\lesssim R(\sqrt{M}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}}. Thus, by combining everything, we obtain that \|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}\lesssim(R\sqrt{M}+\Delta^{2}+R\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} holds with probability at least 1-4d^{2-\delta}. Compared to our choice of \lambda, the claim of this step follows.

Step 2. Bounding the Estimation Error

We are now ready to bound the estimation error of any \bm{\widetilde{\theta}} satisfying (4.2). Setting \bm{\theta}=\bm{\theta^{\star}} in (4.2) gives \big\langle\bm{Q\widetilde{\theta}}-\bm{b}+\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big\rangle\leq 0. Recall that we can assume \|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\leq C_{6}(\sqrt{M}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} with probability at least 1-2d^{2-\delta}, which leads to

\bm{\widetilde{\Upsilon}}^{\top}\bm{Q}\bm{\widetilde{\Upsilon}}=\bm{\widetilde{\Upsilon}}^{\top}\bm{\Sigma^{\star}}\bm{\widetilde{\Upsilon}}-\bm{\widetilde{\Upsilon}}^{\top}(\bm{\Sigma^{\star}}-\bm{Q})\bm{\widetilde{\Upsilon}}\geq\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-C_{6}(\sqrt{M}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}}\|\bm{\widetilde{\Upsilon}}\|_{1}^{2}. (B.9)

Thus, it follows that

0\geq\big{<}\bm{Q\widetilde{\theta}}-\bm{b}+\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big{>} (B.10)
=\big{<}\bm{Q\theta^{\star}}-\bm{b},\bm{\widetilde{\Upsilon}}\big{>}+\bm{\widetilde{\Upsilon}}^{\top}\bm{Q}\bm{\widetilde{\Upsilon}}+\lambda\big{<}\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big{>}
\stackrel{(i)}{\geq}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-C_{6}(\sqrt{M}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}}\|\bm{\widetilde{\Upsilon}}\|_{1}^{2}-\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)}
\stackrel{(ii)}{\geq}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\Big{(}2C_{6}R(\sqrt{M}+\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}}+\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}\Big{)}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)}
\stackrel{(iii)}{\geq}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)}.

Note that (i) is due to (B.9) and \|\bm{\theta^{\star}}\|_{1}-\|\bm{\widetilde{\theta}}\|_{1}\geq\big{<}\partial\|\bm{\widetilde{\theta}}\|_{1},-\bm{\widetilde{\Upsilon}}\big{>}, in (ii) we use \|\bm{\widetilde{\Upsilon}}\|_{1}\leq\|\bm{\widetilde{\theta}}\|_{1}+\|\bm{\theta^{\star}}\|_{1}\leq 2R, and by Step 1, (iii) holds when \lambda=C_{2}(R\sqrt{M}+\Delta^{2}+R\bar{\Delta}^{2})\sqrt{\frac{\delta\log d}{n}} with sufficiently large C_{2}. Therefore, we arrive at \frac{1}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}\geq\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}. Similarly to Step 3 in the proof of Theorem 9, we can show \|\bm{\widetilde{\Upsilon}}\|_{1}\leq 4\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2}. Applying (B.10) again yields

κ0𝚼~22λ2𝚼~1+λ𝚼~13λ2𝚼~16sλ𝚼~2.\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}\leq\frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\|\bm{\widetilde{\Upsilon}}\|_{1}\leq\frac{3\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}\leq 6\sqrt{s}\lambda\|\bm{\widetilde{\Upsilon}}\|_{2}.

Thus, we obtain 𝚼~2δslogdn\|\bm{\widetilde{\Upsilon}}\|_{2}\lesssim\mathscr{L}\sqrt{\frac{\delta s\log d}{n}} with :=RM+Δ2+RΔ¯2κ0\mathscr{L}:=\frac{R\sqrt{M}+\Delta^{2}+R\bar{\Delta}^{2}}{\kappa_{0}}. The proof can be concluded by further applying 𝚼~14s𝚼~2\|\bm{\widetilde{\Upsilon}}\|_{1}\leq 4\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2} to derive the bound for 1\ell_{1}-norm. \square
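The guarantee above covers all stationary points, so it can be probed with very simple solvers. The following is a minimal numerical sketch, assuming the program behind (4.2) is the \ell_{1}-regularized quadratic loss \frac{1}{2}\bm{\theta}^{\top}\bm{Q\theta}-\bm{b}^{\top}\bm{\theta}+\lambda\|\bm{\theta}\|_{1} minimized over \{\|\bm{\theta}\|_{1}\leq R\}; the function names and the projected subgradient iteration are illustrative choices and not part of the paper. Since \bm{Q} may fail to be positive semi-definite, the iteration only targets stationary points, which is exactly the class of solutions covered by Theorem 10.

```python
import numpy as np

def proj_l1_ball(v, R):
    """Euclidean projection onto the l1-ball {t : ||t||_1 <= R} (sorting-based)."""
    if np.abs(v).sum() <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u * idx > css - R)[0][-1]
    tau = (css[rho] - R) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_constrained_lasso(Q, b, lam, R, step=1e-2, iters=20000):
    """Projected subgradient descent on f(t) = 0.5 t'Qt - b't + lam*||t||_1 over the
    l1-ball of radius R.  Q need not be PSD (quantized covariates), so the output is
    only a stationary point."""
    theta = np.zeros(len(b))
    for _ in range(iters):
        grad = Q @ theta - b + lam * np.sign(theta)  # one subgradient of f at theta
        theta = proj_l1_ball(theta - step * grad, R)
    return theta
```

With \bm{Q},\bm{b} formed from quantized data, comparing the output with \bm{\theta^{\star}} allows one to check the \sqrt{s\log d/n} scaling numerically.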

B.1.3 Proof of Proposition 1

Proof.

We let \bm{\theta}=\bm{\theta^{\star}} in (4.2) and then proceed as in the proof of Theorem 10:

0\displaystyle 0 <𝑸𝜽~𝒃+λ𝜽~1,𝚼~>\displaystyle\geq\big{<}\bm{Q\widetilde{\theta}}-\bm{b}+\lambda\cdot\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big{>} (B.11)
=𝚼~𝚺𝚼~+<𝑸𝜽𝒃,𝚼~>+𝚼~(𝑸𝚺)𝚼~+λ<𝜽~1,𝚼~>\displaystyle=\bm{\widetilde{\Upsilon}}^{\top}\bm{\Sigma^{\star}}\bm{\widetilde{\Upsilon}}+\big{<}\bm{Q\theta^{\star}}-\bm{b},\bm{\widetilde{\Upsilon}}\big{>}+\bm{\widetilde{\Upsilon}}^{\top}(\bm{Q}-\bm{\Sigma^{\star}})\bm{\widetilde{\Upsilon}}+\lambda\big{<}\partial\|\bm{\widetilde{\theta}}\|_{1},\bm{\widetilde{\Upsilon}}\big{>}
(i)κ0𝚼~22𝑸𝜽𝒃𝚼~1𝑸𝚺𝚼~12+λ(𝜽~1𝜽1)\displaystyle\stackrel{{\scriptstyle(i)}}{{\geq}}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}\|\bm{\widetilde{\Upsilon}}\|_{1}-\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\|\bm{\widetilde{\Upsilon}}\|_{1}^{2}+\lambda\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)}
(ii)κ0𝚼~22(𝑸𝜽𝒃+2R𝑸𝚺)𝚼~1+λ(𝜽~1𝜽1)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\geq}}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\big{(}\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}+2R\cdot\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\big{)}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)}
(iii)κ0𝚼~22λ2𝚼~1+λ(𝜽~1𝜽1),\displaystyle\stackrel{{\scriptstyle(iii)}}{{\geq}}\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|_{2}^{2}-\frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)},

where in (i) we use \lambda_{\min}(\bm{\Sigma^{\star}})\geq\kappa_{0} and \|\bm{\theta^{\star}}\|_{1}-\|\bm{\widetilde{\theta}}\|_{1}\geq\big{<}\partial\|\bm{\widetilde{\theta}}\|_{1},-\bm{\widetilde{\Upsilon}}\big{>}, (ii) is by \|\bm{\widetilde{\Upsilon}}\|_{1}\leq\|\bm{\widetilde{\theta}}\|_{1}+\|\bm{\theta^{\star}}\|_{1}\leq 2R, and in (iii) we use the assumption (4.3). Thus, since \kappa_{0}\|\bm{\widetilde{\Upsilon}}\|^{2}_{2}\geq 0, we obtain 2\big{(}\|\bm{\widetilde{\theta}}\|_{1}-\|\bm{\theta^{\star}}\|_{1}\big{)}\leq\|\bm{\widetilde{\Upsilon}}\|_{1}. Similarly to Step 3 in the proof of Theorem 9, we can show \|\bm{\widetilde{\Upsilon}}\|_{1}\leq 4\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2}. Using (B.11) again gives

κ0𝚼~22λ2𝚼~1+λ(𝜽1𝜽~1)32λ𝚼~16λs𝚼~2,\displaystyle\kappa_{0}\|\bm{\widetilde{\Upsilon}}\|^{2}_{2}\leq\frac{\lambda}{2}\|\bm{\widetilde{\Upsilon}}\|_{1}+\lambda\big{(}\|\bm{\theta^{\star}}\|_{1}-\|\bm{\widetilde{\theta}}\|_{1}\big{)}\leq\frac{3}{2}\lambda\|\bm{\widetilde{\Upsilon}}\|_{1}\leq 6\lambda\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2},

which yields \|\bm{\widetilde{\Upsilon}}\|_{2}\leq\frac{6\lambda\sqrt{s}}{\kappa_{0}}. The proof can be concluded by using \|\bm{\widetilde{\Upsilon}}\|_{1}\leq 4\sqrt{s}\|\bm{\widetilde{\Upsilon}}\|_{2}. ∎

B.1.4 Proof of Theorem 11

Proof.

To invoke Proposition 1, it is enough to verify (4.3). Recalling \bm{\Sigma^{\star}}=\mathbbm{E}(\bm{x}_{k}\bm{x}_{k}^{\top}), we first invoke [21, Thm. 1] to obtain that \|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}=O\big{(}\sigma^{2}\sqrt{\frac{\delta\log d(\log n)^{2}}{n}}\big{)} holds with probability at least 1-2d^{2-\delta}. This confirms \lambda\gtrsim R\cdot\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}. It then remains to upper bound \|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}:

𝑸𝜽𝒃(𝑸𝚺)𝜽+𝚺𝜽𝒃\displaystyle\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}\leq\|(\bm{Q}-\bm{\Sigma^{\star}})\bm{\theta^{\star}}\|_{\infty}+\|\bm{\Sigma^{\star}\theta^{\star}}-\bm{b}\|_{\infty}
𝑸𝚺𝜽1+γxγynk=1ny˙k𝒙˙k1𝔼(yk𝒙k)\displaystyle\leq\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\|\bm{\theta^{\star}}\|_{1}+\Big{\|}\frac{\gamma_{x}\gamma_{y}}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k1}-\mathbbm{E}(y_{k}\bm{x}_{k})\Big{\|}_{\infty}
(i)σ2(R+1)δlogd(logn)2n,\displaystyle\stackrel{{\scriptstyle(i)}}{{\lesssim}}\sigma^{2}(R+1)\sqrt{\frac{\delta\log d(\log n)^{2}}{n}},

where in (i)(i) we use a known estimate from [21, Eq. (A.31)]:

(γxγynk=1ny˙k𝒙˙k1𝔼(yk𝒙k)σ2δlogd(logn)2n)12d1δ.\mathbbm{P}\left(\Big{\|}\frac{\gamma_{x}\gamma_{y}}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k1}-\mathbbm{E}(y_{k}\bm{x}_{k})\Big{\|}_{\infty}\lesssim\sigma^{2}\sqrt{\frac{\delta\log d(\log n)^{2}}{n}}\right)\geq 1-2d^{1-\delta}.

Thus, by setting λ=C1σ2Rδlogd(logn)2n\lambda=C_{1}\sigma^{2}R\sqrt{\frac{\delta\log d(\log n)^{2}}{n}}, (4.3) can be satisfied with probability at least 14d2δ1-4d^{2-\delta}, hence using Proposition 1 concludes the proof.∎

B.1.5 Proof of Theorem 12

Proof.

The proof is again based on Proposition 1 and some ingredients from [21]. From [21, Thm. 4], 𝑸𝚺(M2δlogdn)1/4\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\lesssim\big{(}\frac{M^{2}\delta\log d}{n}\big{)}^{1/4} holds with probability at least 12d2δ1-2d^{2-\delta}, thus confirming λR𝑸𝚺\lambda\gtrsim R\cdot\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty} with the same probability. Moreover,

𝑸𝜽𝒃(𝑸𝚺)𝜽+𝚺𝜽𝒃\displaystyle\|\bm{Q\theta^{\star}}-\bm{b}\|_{\infty}\leq\|(\bm{Q}-\bm{\Sigma^{\star}})\bm{\theta^{\star}}\|_{\infty}+\|\bm{\Sigma^{\star}\theta^{\star}}-\bm{b}\|_{\infty}
𝑸𝚺𝜽1+γxγynk=1ny˙k𝒙˙k1𝔼(yk𝒙k)\displaystyle\leq\|\bm{Q}-\bm{\Sigma^{\star}}\|_{\infty}\|\bm{\theta^{\star}}\|_{1}+\Big{\|}\frac{\gamma_{x}\gamma_{y}}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k1}-\mathbbm{E}(y_{k}\bm{x}_{k})\Big{\|}_{\infty}
(i)M(R+1)(δlogdn)1/4,\displaystyle\stackrel{{\scriptstyle(i)}}{{\lesssim}}\sqrt{M}(R+1)\Big{(}\frac{\delta\log d}{n}\Big{)}^{1/4},

where (i)(i) is due to a known estimate from [21, Eq. (A.34)]:

(γxγynk=1ny˙k𝒙˙k1𝔼(yk𝒙k)M(δlogdn)1/4)12d1δ\mathbbm{P}\left(\Big{\|}\frac{\gamma_{x}\gamma_{y}}{n}\sum_{k=1}^{n}\dot{y}_{k}\bm{\dot{x}}_{k1}-\mathbbm{E}(y_{k}\bm{x}_{k})\Big{\|}_{\infty}\lesssim\sqrt{M}\Big{(}\frac{\delta\log d}{n}\Big{)}^{1/4}\right)\geq 1-2d^{1-\delta}

Thus, with probability at least 14d2δ1-4d^{2-\delta}, (4.3) holds if λ=C1MR(δlogdn)1/4\lambda=C_{1}\sqrt{M}R\big{(}\frac{\delta\log d}{n}\big{)}^{1/4} with sufficiently large C1C_{1}. The proof can be concluded by invoking Proposition 1. ∎

B.2 Uniform Recovery Guarantee

We need some auxiliary results to support the proof. The first one is a concentration inequality for product processes due to Mendelson [67]; the following statement can be directly adapted from [40, Thm. 8] by specifying the pseudo-metrics as the \ell_{2}-distance.

Lemma 9.

(Concentration of Product Process). Let {g𝐚}𝐚𝒜\{g_{\bm{a}}\}_{\bm{a}\in\mathcal{A}} and {h𝐛}𝐛\{h_{\bm{b}}\}_{\bm{b}\in\mathcal{B}} be stochastic processes indexed by two sets 𝒜p\mathcal{A}\subset\mathbb{R}^{p}, q\mathcal{B}\subset\mathbb{R}^{q}, both defined on a common probability space (Ω,A,)(\Omega,\textsc{A},\mathbbm{P}). We assume that there exist K𝒜,K,r𝒜,r0K_{\mathcal{A}},K_{\mathcal{B}},r_{\mathcal{A}},r_{\mathcal{B}}\geq 0 such that

g𝒂g𝒂ψ2K𝒜𝒂𝒂2,g𝒂ψ2r𝒜,𝒂,𝒂𝒜;\displaystyle\|g_{\bm{a}}-g_{\bm{a}^{\prime}}\|_{\psi_{2}}\leq K_{\mathcal{A}}\|\bm{a}-\bm{a}^{\prime}\|_{2},~{}\|g_{\bm{a}}\|_{\psi_{2}}\leq r_{\mathcal{A}},~{}\forall~{}\bm{a},\bm{a}^{\prime}\in\mathcal{A};
h𝒃h𝒃ψ2K𝒃𝒃2,h𝒃ψ2r,𝒃,𝒃.\displaystyle\|h_{\bm{b}}-h_{\bm{b}^{\prime}}\|_{\psi_{2}}\leq{K}_{\mathcal{B}}\|\bm{b}-\bm{b}^{\prime}\|_{2},~{}\|h_{\bm{b}}\|_{\psi_{2}}\leq r_{\mathcal{B}},~{}\forall~{}\bm{b},\bm{b}^{\prime}\in\mathcal{B}.

Finally, let X_{1},...,X_{n} be independent copies of a random variable X\sim\mathbbm{P}. Then, for every u\geq 1, the following holds with probability at least 1-2\exp(-cu^{2}):

sup𝒂𝒜𝒃1n|i=1ng𝒂(Xi)h𝒃(Xi)𝔼[g𝒂(Xi)h𝒃(Xi)]|\displaystyle\sup_{\begin{subarray}{c}\bm{a}\in\mathcal{A}\\ \bm{b}\in\mathcal{B}\end{subarray}}~{}\frac{1}{n}\left|\sum_{i=1}^{n}g_{\bm{a}}(X_{i})h_{\bm{b}}(X_{i})-\mathbbm{E}\big{[}g_{\bm{a}}(X_{i})h_{\bm{b}}(X_{i})\big{]}\right|
C((K𝒜ω(𝒜)+ur𝒜)(Kω()+ur)n\displaystyle~{}~{}~{}\leq C\Big{(}\frac{(K_{\mathcal{A}}\cdot\omega(\mathcal{A})+u\cdot r_{\mathcal{A}})\cdot(K_{\mathcal{B}}\cdot\omega(\mathcal{B})+u\cdot r_{\mathcal{B}})}{n}
+r𝒜Kω()+rK𝒜ω(𝒜)+ur𝒜rn),\displaystyle~{}~{}~{}~{}~{}~{}+\frac{r_{\mathcal{A}}\cdot K_{\mathcal{B}}\cdot\omega(\mathcal{B})+r_{\mathcal{B}}\cdot K_{\mathcal{A}}\cdot\omega(\mathcal{A})+u\cdot r_{\mathcal{A}}r_{\mathcal{B}}}{\sqrt{n}}\Big{)},

where ω(𝒜)=𝔼sup𝐚𝒜(𝐠𝐚)\omega(\mathcal{A})=\mathbbm{E}\sup_{\bm{a}\in\mathcal{A}}(\bm{g}^{\top}\bm{a}) with 𝐠𝒩(0,𝐈p)\bm{g}\sim\mathcal{N}(0,\bm{I}_{p}) is the Gaussian width of 𝒜p\mathcal{A}\subset\mathbb{R}^{p}, and similarly, ω()\omega(\mathcal{B}) is the Gaussian width of \mathcal{B}.
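As a quick illustration of the Gaussian width appearing in Lemma 9 (and used repeatedly below through \omega(\Sigma_{s,R_{0}}),\omega(\mathscr{V})=O(\sqrt{s\log d})), the following Monte Carlo sketch estimates the width of the set of s-sparse unit vectors, for which \sup_{\bm{a}}\bm{g}^{\top}\bm{a} is simply the \ell_{2}-norm of the s largest-magnitude entries of \bm{g}. The dimensions are arbitrary, and using s-sparse unit vectors as a stand-in for the index sets of the later proofs is our own simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, trials = 2000, 10, 500

widths = []
for _ in range(trials):
    g = rng.standard_normal(d)
    # sup over s-sparse unit vectors of <g, a> = l2-norm of the s largest |g_i|
    widths.append(np.linalg.norm(np.sort(np.abs(g))[::-1][:s]))

print("Monte Carlo estimate of the width       :", np.mean(widths))
print("sqrt(s*log(e*d/s)) (order of the width) :", np.sqrt(s * np.log(np.e * d / s)))
```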

We will use the following result that can be found in [61, Thm. 8].

Lemma 10.

Let (X𝐮)𝐮𝒯(X_{\bm{u}})_{\bm{u}\in\mathcal{T}} be a random process indexed by points in a bounded set 𝒯n\mathcal{T}\subset\mathbb{R}^{n}. Assume that the process has sub-Gaussian increments, i.e., there exists M>0M>0 such that X𝐮X𝐯ψ2M𝐮𝐯2\|X_{\bm{u}}-X_{\bm{v}}\|_{\psi_{2}}\leq M\|\bm{u}-\bm{v}\|_{2} holds for any 𝐮,𝐯𝒯\bm{u},\bm{v}\in\mathcal{T}. Then for every t>0t>0, the event

sup𝒖,𝒗𝒯|X𝒖X𝒗|CM(ω(𝒯)+tdiam(𝒯))\sup_{\bm{u},\bm{v}\in\mathcal{T}}|X_{\bm{u}}-X_{\bm{v}}|\leq CM\cdot\big{(}\omega(\mathcal{T})+t\cdot\mathrm{diam}(\mathcal{T})\big{)}

holds with probability at least 1exp(t2)1-\exp(-t^{2}), where diam(𝒯):=sup𝐱,𝐲𝒯𝐱𝐲2\mathrm{diam}(\mathcal{T}):=\sup_{\bm{x},\bm{y}\in\mathcal{T}}\|\bm{x}-\bm{y}\|_{2} denotes the diameter of 𝒯\mathcal{T}.

B.2.1 Proof of Theorem 13

Proof.

We start from the optimality k=1n(y˙k𝒙k𝜽^)2k=1n(y˙k𝒙k𝜽)2.\sum_{k=1}^{n}(\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{\widehat{\theta}})^{2}\leq\sum_{k=1}^{n}(\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{{\theta^{\star}}})^{2}. By substituting 𝜽^=𝜽+𝚼^\bm{\widehat{\theta}}=\bm{\theta^{\star}}+\bm{\widehat{\Upsilon}} and performing some algebra, we obtain

k=1n(𝒙k𝚼^)22k=1n(y˙k𝒙k𝜽)𝒙k𝚼^.\sum_{k=1}^{n}(\bm{x}_{k}^{\top}\bm{\widehat{\Upsilon}})^{2}\leq 2\sum_{k=1}^{n}\big{(}\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{\theta^{\star}}\big{)}\bm{x}_{k}^{\top}\bm{\widehat{\Upsilon}}.

Due to the constraint we have \|\bm{\theta^{\star}}+\bm{\widehat{\Upsilon}}\|_{1}\leq\|\bm{\theta^{\star}}\|_{1}; then, similarly to the cone decomposition argument used in earlier proofs, we can show that \|\bm{\widehat{\Upsilon}}\|_{1}\leq 2\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2} holds. Thus, letting \mathscr{V}=\{\bm{v}:\|\bm{v}\|_{2}=1,\|\bm{v}\|_{1}\leq 2\sqrt{s}\}, the following holds uniformly for all \bm{\theta^{\star}}\in\Sigma_{s,R_{0}}:

𝚼^22inf𝒗𝒱k=1n(𝒙k𝒗)22𝚼^2sup𝒗𝒱k=1n(y˙k𝒙k𝜽)𝒙k𝒗\|\bm{\widehat{\Upsilon}}\|_{2}^{2}\cdot\inf_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}(\bm{x}_{k}^{\top}\bm{v})^{2}\leq 2\|\bm{\widehat{\Upsilon}}\|_{2}\cdot\sup_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}(\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{\theta^{\star}})\bm{x}_{k}^{\top}\bm{v} (B.12)

Similarly to previous developments, our strategy is to lower bound the left-hand side and upper bound the right-hand side, with the difference that both bounds must be valid uniformly for all \bm{\theta^{\star}}\in\Sigma_{s,R_{0}}.

Step 1. Bounding the Left-Hand Side From Below

Letting \bar{\Delta}=0 and restricting \bm{v} to \mathscr{V}, we invoke the two-sided deviation bound established in the proof of Theorem 9; then, for some constant c(\kappa_{0},\sigma) depending on \kappa_{0},\sigma, with probability at least 1-d^{-\delta},

inf𝒗𝒱1n[k=1n(𝒙k𝒗)2]1/2κ0c(κ0,σ)slogdn.\inf_{\bm{v}\in\mathscr{V}}\frac{1}{\sqrt{n}}\left[\sum_{k=1}^{n}(\bm{x}_{k}^{\top}\bm{v})^{2}\right]^{1/2}\geq\sqrt{\kappa_{0}}-c(\kappa_{0},\sigma)\cdot\sqrt{\frac{s\log d}{n}}.

Thus, if n4c2(κ0,σ)κ0slogdn\geq\frac{4c^{2}(\kappa_{0},\sigma)}{\kappa_{0}}s\log d, then it holds that inf𝒗𝒱k=1n(𝒙k𝒗)214κ0n\inf_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}(\bm{x}_{k}^{\top}\bm{v})^{2}\geq\frac{1}{4}\kappa_{0}n.

Step 2. Bounding the Right-Hand Side Uniformly

To pursue uniformity over \bm{\theta^{\star}}\in\Sigma_{s,R_{0}}, we replace the specific \bm{\theta^{\star}} with a supremum over \bm{\theta}\in\Sigma_{s,R_{0}} and consider an upper bound on

I:=sup𝜽Σs,R0sup𝒗𝒱k=1n(y˙k𝒙k𝜽)𝒙k𝒗,I:=\sup_{\bm{\theta}\in\Sigma_{s,R_{0}}}\sup_{\bm{v}\in\mathscr{V}}~{}\sum_{k=1}^{n}(\dot{y}_{k}-\bm{x}_{k}^{\top}\bm{\theta})\bm{x}_{k}^{\top}\bm{v}, (B.13)

where y˙k=𝒬Δ(y~k+τk),y~k=𝒯ζy(yk):=sgn(yk)min{|yk|,ζy},yk=𝒙k𝜽+ϵk\dot{y}_{k}=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k}),~{}\widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(y_{k}):=\operatorname{sgn}(y_{k})\min\{|y_{k}|,\zeta_{y}\},~{}y_{k}=\bm{x}_{k}^{\top}\bm{\theta}+\epsilon_{k}; note that y˙k,y~k,yk\dot{y}_{k},\widetilde{y}_{k},y_{k} depend on 𝜽\bm{\theta}, and we will use notation y˙𝜽,k,y~𝜽,k,y𝜽,k\dot{y}_{\bm{\theta},k},\widetilde{y}_{\bm{\theta},k},y_{\bm{\theta},k} to indicate such dependence when necessary. In this proof, the ranges of 𝜽\bm{\theta} and 𝒗\bm{v} (e.g., in supremum), if omitted, are respectively 𝜽Σs,R0\bm{\theta}\in\Sigma_{s,R_{0}} and 𝒗𝒱\bm{v}\in\mathscr{V}. Now let the quantization noise be ξk=y˙ky~k\xi_{k}=\dot{y}_{k}-\widetilde{y}_{k}, observing that 𝔼(yk𝒙k𝒗)=𝔼(𝜽𝒙k𝒙k𝒗)+𝔼(ϵk𝒙k𝒗)=𝔼(𝜽𝒙k𝒙k𝒗)\mathbbm{E}(y_{k}\bm{x}_{k}^{\top}\bm{v})=\mathbbm{E}(\bm{\theta}^{\top}\bm{x}_{k}\bm{x}_{k}^{\top}\bm{v})+\mathbbm{E}(\epsilon_{k}\bm{x}_{k}^{\top}\bm{v})=\mathbbm{E}(\bm{\theta}^{\top}\bm{x}_{k}\bm{x}_{k}^{\top}\bm{v}), then we can first decompose II as

I\displaystyle I sup𝜽,𝒗k=1nξk𝒙k𝒗+sup𝜽,𝒗k=1n(y~k𝒙k𝒗𝔼[y~k𝒙k𝒗])\displaystyle\leq\sup_{\bm{\theta},\bm{v}}~{}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}^{\top}\bm{v}+\sup_{\bm{\theta},\bm{v}}~{}\sum_{k=1}^{n}\big{(}\widetilde{y}_{k}\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\widetilde{y}_{k}\bm{x}_{k}^{\top}\bm{v}]\big{)} (B.14)
+sup𝜽,𝒗k=1n𝔼((y~kyk)𝒙k𝒗)+sup𝜽,𝒗k=1n(𝜽𝒙k𝒙k𝒗𝔼[𝜽𝒙k𝒙k𝒗])\displaystyle+\sup_{\bm{\theta},\bm{v}}~{}\sum_{k=1}^{n}\mathbbm{E}\big{(}(\widetilde{y}_{k}-y_{k})\bm{x}_{k}^{\top}\bm{v}\big{)}+\sup_{\bm{\theta},\bm{v}}~{}\sum_{k=1}^{n}\big{(}\bm{\theta}^{\top}\bm{x}_{k}\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\bm{\theta}^{\top}\bm{x}_{k}\bm{x}_{k}^{\top}\bm{v}]\big{)}
:=I0+I1+I2+I3,\displaystyle:={I}_{0}+{I}_{1}+{I}_{2}+{I}_{3},

where I_{0} is the term arising from quantization, I_{1} is the concentration term involving the truncation of heavy-tailed data, for which we develop some new machinery, I_{2} is the bias term, and I_{3} is a more regular concentration term that can be bounded via Lemma 9. In the remainder of the proof, we bound I_{1},I_{2},I_{3} separately and finally deal with I_{0}.
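Both I_{0} and the truncation terms above originate from the truncate-dither-quantize pipeline applied to y_{k}. The following small simulation sketch of that pipeline is included for concreteness; the explicit quantizer \mathcal{Q}_{\Delta}(x)=\Delta(\lfloor x/\Delta\rfloor+\frac{1}{2}) and the Student-t responses are illustrative assumptions (the paper's \mathcal{Q}_{\Delta} is defined in the main body), chosen so that the two facts used below, namely |\xi_{k}|\leq\Delta and the zero mean of the quantization noise under uniform dithering, can be checked numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
Delta, zeta_y, n = 0.5, 5.0, 200000

def truncate(y, zeta):                 # T_zeta(y) = sgn(y) * min(|y|, zeta)
    return np.sign(y) * np.minimum(np.abs(y), zeta)

def quantize(x, Delta):                # uniform quantizer with breakpoints on Delta * Z
    return Delta * (np.floor(x / Delta) + 0.5)

y = 2.0 * rng.standard_t(df=3, size=n)            # heavy-tailed responses (illustrative)
tau = rng.uniform(-Delta / 2, Delta / 2, size=n)  # uniform dither
y_tilde = truncate(y, zeta_y)                     # truncation step
y_dot = quantize(y_tilde + tau, Delta)            # dithered uniform quantization
xi = y_dot - y_tilde                              # quantization noise xi_k

print("max |xi| =", np.abs(xi).max(), "<= Delta =", Delta)
print("mean xi  =", xi.mean(), "(close to 0 thanks to the dither)")
```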

Step 2.1. Bounding I1{I}_{1}

Using \widetilde{y}_{k}=\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}+\epsilon_{k}), I_{1} concerns the concentration of the product process \big{\{}\sum_{k=1}^{n}\mathscr{T}_{\zeta_{y}}(\bm{\theta}^{\top}\bm{x}_{k}+\epsilon_{k})\bm{x}_{k}^{\top}\bm{v}\big{\}}_{\bm{\theta},\bm{v}} about its mean. It is natural to apply Lemma 9 towards this end, but we lack a good bound on \|\mathscr{T}_{\zeta_{y}}(\bm{\theta}^{\top}\bm{x}_{k}+\epsilon_{k})\|_{\psi_{2}} because of the heavy-tailedness of \epsilon_{k} (on the other hand, the trivial bound O(\zeta_{y}) is too crude to yield a sharp rate). Our strategy, already introduced in the main body, is to introduce \widetilde{z}_{k}:=\widetilde{y}_{k}-\mathscr{T}_{\zeta_{y}}(\epsilon_{k}) and decompose I_{1} as

I1sup𝒗,𝜽k=1n(z~k𝒙k𝒗𝔼[z~k𝒙k𝒗]):=I11+sup𝒗k=1n(𝒯ζy(ϵk)𝒙k𝒗𝔼[𝒯ζy(ϵk)𝒙k𝒗]):=I12.I_{1}\leq\underbrace{\sup_{\bm{v},\bm{\theta}}\sum_{k=1}^{n}\big{(}\widetilde{z}_{k}\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\widetilde{z}_{k}\bm{x}_{k}^{\top}\bm{v}]\big{)}}_{:=I_{11}}+\underbrace{\sup_{\bm{v}}\sum_{k=1}^{n}\big{(}\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}^{\top}\bm{v}-\mathbbm{E}[\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}^{\top}\bm{v}]\big{)}}_{:=I_{12}}.

Thus, it suffices to bound I11I_{11} and I12I_{12}.

Step 2.1.1. Bounding I11{I}_{11}

We use Lemma 9 to bound I11I_{11}. For any 𝒗1,𝒗2𝒱\bm{v}_{1},\bm{v}_{2}\in\mathscr{V}, it is evident that we have 𝒙k𝒗ψ2𝒙kψ2σ\|\bm{x}_{k}^{\top}\bm{v}\|_{\psi_{2}}\leq\|\bm{x}_{k}\|_{\psi_{2}}\leq\sigma and 𝒙k𝒗1𝒙k𝒗2ψ2σ𝒗1𝒗22\|\bm{x}_{k}^{\top}\bm{v}_{1}-\bm{x}_{k}^{\top}\bm{v}_{2}\|_{\psi_{2}}\leq\sigma\|\bm{v}_{1}-\bm{v}_{2}\|_{2}. Regarding z~k=z~𝜽,k:=𝒯ζy(𝒙k𝜽+ϵk)𝒯ζy(ϵk)\widetilde{z}_{k}=\widetilde{z}_{\bm{\theta},k}:=\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}+\epsilon_{k})-\mathscr{T}_{\zeta_{y}}(\epsilon_{k}) indexed by 𝜽Σs,R0\bm{\theta}\in\Sigma_{s,R_{0}}, the 11-Lipschitzness of 𝒯ζy()\mathscr{T}_{\zeta_{y}}(\cdot) gives |z~k||𝒙k𝜽||\widetilde{z}_{k}|\leq|\bm{x}_{k}^{\top}\bm{\theta}|, and then the definition of sub-Gaussian norm yields z~kψ2𝒙k𝜽ψ2𝒙kψ2𝜽2R0σ\|\widetilde{z}_{k}\|_{\psi_{2}}\leq\|\bm{x}_{k}^{\top}\bm{\theta}\|_{\psi_{2}}\leq\|\bm{x}_{k}\|_{\psi_{2}}\|\bm{\theta}\|_{2}\leq R_{0}\sigma (this addresses the aforementioned issue). Further, for any 𝜽1,𝜽2Σs,R0\bm{\theta}_{1},\bm{\theta}_{2}\in\Sigma_{s,R_{0}} we verify the sub-Gaussian increments

|z~𝜽1,kz~𝜽2,k|=\displaystyle|\widetilde{z}_{\bm{\theta}_{1},k}-\widetilde{z}_{\bm{\theta}_{2},k}|= |𝒯ζy(𝒙k𝜽1+ϵk)𝒯ζy(ϵk)𝒯ζy(𝒙k𝜽2+ϵk)+𝒯ζy(ϵk)|\displaystyle\big{|}\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}_{1}+\epsilon_{k})-\mathscr{T}_{\zeta_{y}}(\epsilon_{k})-\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}_{2}+\epsilon_{k})+\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\big{|}
=\displaystyle= |𝒯ζy(𝒙k𝜽1+ϵk)𝒯ζy(𝒙k𝜽2+ϵk)||𝒙k(𝜽1𝜽2)|,\displaystyle\big{|}\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}_{1}+\epsilon_{k})-\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}_{2}+\epsilon_{k})\big{|}\leq\big{|}\bm{x}_{k}^{\top}(\bm{\theta}_{1}-\bm{\theta}_{2})\big{|},

which leads to \|\widetilde{z}_{\bm{\theta}_{1},k}-\widetilde{z}_{\bm{\theta}_{2},k}\|_{\psi_{2}}\leq\|\bm{x}_{k}^{\top}(\bm{\theta}_{1}-\bm{\theta}_{2})\|_{\psi_{2}}\leq\|\bm{x}_{k}\|_{\psi_{2}}\|\bm{\theta}_{1}-\bm{\theta}_{2}\|_{2}\leq\sigma\|\bm{\theta}_{1}-\bm{\theta}_{2}\|_{2}. With these preparations, we can invoke Lemma 9 and use the well-known estimates \omega(\Sigma_{s,R_{0}}),\omega(\mathscr{V})=O(\sqrt{s\log d}) (in fact, the tighter estimate \omega(\Sigma_{s,R_{0}}),\omega(\mathscr{V})=O\big{(}\sqrt{s\log(\frac{ed}{s})}\big{)} holds, e.g., [76], but we simply put \sqrt{s\log d} to be consistent with earlier results concerning the unconstrained Lasso) to obtain that, with probability at least 1-2\exp(-cu^{2}), we have

I11\displaystyle{I}_{11} σ2[n(ω(Σs,R0)+ω(𝒱)+u)+(ω(Σs,R0)+u)(ω(𝒱)+u)]\displaystyle\lesssim\sigma^{2}\Big{[}\sqrt{n}\big{(}\omega(\Sigma_{s,R_{0}})+\omega(\mathscr{V})+u\big{)}+\big{(}\omega(\Sigma_{s,R_{0}})+u\big{)}\cdot\big{(}\omega(\mathscr{V})+u\big{)}\Big{]} (B.15)
σ2[n(slogd+u)+(slogd+u)(slogd+u)].\displaystyle\lesssim\sigma^{2}\Big{[}\sqrt{n}\big{(}\sqrt{s\log d}+u\big{)}+\big{(}\sqrt{s\log d}+u\big{)}\cdot\big{(}\sqrt{s\log d}+u\big{)}\Big{]}.

Therefore, we can set u=\sqrt{\delta s\log d} in (B.15); under the scaling n\gtrsim s\delta\log d, this provides

(I11σ2nδslogd)12dδΩ(s).\mathbbm{P}\big{(}I_{11}\lesssim\sigma^{2}\sqrt{n\delta s\log d}\big{)}\geq 1-2d^{-\delta\Omega(s)}. (B.16)

Step 2.1.2. Bounding I12I_{12}

By \|\bm{v}\|_{1}\leq 2\sqrt{s} we have I_{12}\leq 2\sqrt{s}\|\sum_{k=1}^{n}(\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}-\mathbbm{E}[\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}])\|_{\infty}. To apply Bernstein's inequality, for integer q\geq 2 and i\in[d], analogously to the corresponding moment bounds in the proof of Theorem 9, we can bound

k=1n𝔼|𝒯ζy(ϵk)xkin|q\displaystyle\sum_{k=1}^{n}\mathbbm{E}\left|\frac{\mathscr{T}_{\zeta_{y}}(\epsilon_{k})x_{ki}}{n}\right|^{q} (ζyn)q21n2k=1n𝔼|𝒯ζy2(ϵk)xkiq|\displaystyle\leq\Big{(}\frac{\zeta_{y}}{n}\Big{)}^{q-2}\frac{1}{n^{2}}\sum_{k=1}^{n}\mathbbm{E}\big{|}\mathscr{T}^{2}_{\zeta_{y}}(\epsilon_{k})x_{ki}^{q}\big{|}
(σζyn)q2(σ2M1ln)(Cq)q2q!2v0c0q2,\displaystyle\leq\Big{(}\frac{\sigma\zeta_{y}}{n}\Big{)}^{q-2}\Big{(}\frac{\sigma^{2}M^{\frac{1}{l}}}{n}\Big{)}(Cq)^{\frac{q}{2}}\leq\frac{q!}{2}v_{0}c_{0}^{q-2},

for some v0=O(σ2M1/ln)v_{0}=O\big{(}\frac{\sigma^{2}M^{1/l}}{n}\big{)}, c0=O(σζyn)c_{0}=O\big{(}\frac{\sigma\zeta_{y}}{n}\big{)}. Then we use Lemma 1 to obtain that, with probability at least 12exp(t)1-2\exp(-t) we have

|1nk=1n(𝒯ζy(ϵk)xki𝔼[𝒯ζy(ϵk)xki])|Cσ(M12ltn+ζytn)\left|\frac{1}{n}\sum_{k=1}^{n}(\mathscr{T}_{\zeta_{y}}(\epsilon_{k})x_{ki}-\mathbbm{E}[\mathscr{T}_{\zeta_{y}}(\epsilon_{k})x_{ki}])\right|\leq C\sigma\Big{(}M^{\frac{1}{2l}}\sqrt{\frac{t}{n}}+\frac{\zeta_{y}t}{n}\Big{)}

Then we use ζy(σ+M12l)nδlogd\zeta_{y}\asymp\big{(}\sigma+M^{\frac{1}{2l}}\big{)}\sqrt{\frac{n}{\delta\log d}}, set tδlogdt\asymp\delta\log d, and take a union bound over i[d]i\in[d] to obtain that, k=1n1n(𝒯ζy(ϵk)𝒙k𝔼[𝒯ζy(ϵk)𝒙k])σ(M1/(2l)+σ)δlogdn\|\sum_{k=1}^{n}\frac{1}{n}(\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}-\mathbbm{E}[\mathscr{T}_{\zeta_{y}}(\epsilon_{k})\bm{x}_{k}])\|_{\infty}\lesssim\sigma(M^{1/(2l)}+\sigma)\sqrt{\frac{\delta\log d}{n}} holds with probability at least 12d1δ1-2d^{1-\delta}, which implies the following under the same probability

I12σ(M12l+σ)nsδlogd.I_{12}\lesssim\sigma\big{(}M^{\frac{1}{2l}}+\sigma\big{)}\sqrt{ns\delta\log d}. (B.17)

Therefore, combining (B.16) and (B.17), we obtain that I1σ(M12l+σ)nsδlogdI_{1}\lesssim\sigma\big{(}M^{\frac{1}{2l}}+\sigma\big{)}\sqrt{ns\delta\log d} with the promised probability.

Step 2.2. Bounding I2{I}_{2}

For this bias term the supremum does not make things harder. We begin with I_{2}=n\cdot\sup_{\bm{\theta},\bm{v}}\mathbbm{E}\big{(}(\widetilde{y}_{k}-y_{k})\bm{x}_{k}^{\top}\bm{v}\big{)}\leq 2n\sqrt{s}\cdot\sup_{\bm{\theta}}\|\mathbbm{E}(\widetilde{y}_{k}-y_{k})\bm{x}_{k}\|_{\infty}. Fixing any \bm{\theta}\in\Sigma_{s,R_{0}}, we have \mathbbm{E}|y_{k}|^{2l}\lesssim\mathbbm{E}|\bm{x}_{k}^{\top}\bm{\theta}|^{2l}+\mathbbm{E}|\epsilon_{k}|^{2l}\lesssim M+\sigma^{2l}. Then, following arguments similar to the truncation bias estimate in the proof of Theorem 9, we obtain

𝔼(y~kyk)𝒙kσM1/lζyσ(M1/(2l)+σ)δlogdn,\|\mathbbm{E}(\widetilde{y}_{k}-y_{k})\bm{x}_{k}\|_{\infty}\lesssim\frac{\sigma M^{1/l}}{\zeta_{y}}\lesssim\sigma(M^{1/(2l)}+\sigma)\sqrt{\frac{\delta\log d}{n}},

which implies I2σM12lnδslogdI_{2}\lesssim\sigma M^{\frac{1}{2l}}\sqrt{n\delta s\log d}.

Step 2.3. Bounding I3I_{3}

It is evident that we can apply Lemma 9 with (g𝜽(𝒙k),h𝒗(𝒙k))=(𝜽𝒙k,𝒗𝒙k)(g_{\bm{\theta}}(\bm{x}_{k}),h_{\bm{v}}(\bm{x}_{k}))=(\bm{\theta}^{\top}\bm{x}_{k},\bm{v}^{\top}\bm{x}_{k}), (𝒜,)=(Σs,R0,𝒱)(\mathcal{A},\mathcal{B})=(\Sigma_{s,R_{0}},\mathscr{V}), and hence with K𝒜,r𝒜,K,r=O(σ)K_{\mathcal{A}},r_{\mathcal{A}},K_{\mathcal{B}},r_{\mathcal{B}}=O(\sigma). Along with ω(Σs,R0),ω(𝒱)slogd\omega(\Sigma_{s,R_{0}}),\omega(\mathscr{V})\lesssim\sqrt{s\log d}, we obtain that the following holds with probability at least 12exp(cu2)1-2\exp(-cu^{2}):

I3σ2[(slogd+u)2+n(slogd+u)].I_{3}\leq\sigma^{2}\Big{[}\big{(}\sqrt{s\log d}+u\big{)}^{2}+\sqrt{n}\big{(}\sqrt{s\log d}+u\big{)}\Big{]}.

By taking usδlogdu\asymp\sqrt{s\delta\log d}, under the scaling nδslogdn\gtrsim\delta s\log d, it follows that I3σ2nδslogdI_{3}\lesssim\sigma^{2}\sqrt{n\delta s\log d} with probability at least 12dδΩ(s)1-2d^{-\delta\Omega(s)}.

Step 2.4. Bounding I0I_{0}

It remains to bound I_{0}=\sup_{\bm{\theta},\bm{v}}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}^{\top}\bm{v}. Bounding I_{0} is similar to establishing the “limited projection distortion (LPD)” property in [94], but the key distinction is that \bm{\theta} and \bm{v} in I_{0} take values in different spaces.

The main difficulty associated with \sup_{\bm{\theta}} lies in the discontinuity of the quantization noise \xi_{k}:=\mathcal{Q}_{\Delta}(\widetilde{y}_{k}+\tau_{k})-\widetilde{y}_{k}, which we overcome by a covering argument and some machinery developed in [94, Prop. 6.1]. The essential difference from [94] is that we use Lemma 10 to handle \sup_{\bm{v}}, whereas [94] again used a covering argument for \bm{v} to strengthen their Proposition 6.1 to their Proposition 6.2; this is insufficient in our setting because the covering number of \mathscr{V} increases rapidly as the covering radius decreases (a covering argument over \bm{v} does suffice for the analysis in [94] of a different estimator, named projected back projection).

Let us first construct a ρ\rho-net of Σs,R0\Sigma_{s,R_{0}} denoted by 𝒢={𝜽1,,𝜽N}\mathcal{G}=\{\bm{\theta}_{1},...,\bm{\theta}_{N}\}, so that for any 𝜽Σs,R0\bm{\theta}\in\Sigma_{s,R_{0}} we can pick 𝜽𝒢\bm{\theta}^{\prime}\in\mathcal{G} satisfying 𝜽𝜽2ρ\|\bm{\theta}^{\prime}-\bm{\theta}\|_{2}\leq\rho; here, the covering radius ρ\rho is to be chosen later, and we assume that N(9dρs)sN\leq\big{(}\frac{9d}{\rho s}\big{)}^{s} [76, Lemma 3.3]. As is standard in a covering argument, we first control I0I_{0} over the net 𝒢\mathcal{G} (by replacing sup𝜽Σs,R0\sup_{\bm{\theta}\in\Sigma_{s,R_{0}}} with sup𝜽𝒢\sup_{\bm{\theta}\in\mathcal{G}}), and then bound the approximation error induced by such replacement.

Step 2.4.1. Bounding I0I_{0} over 𝒢\mathcal{G}

In this step, we want to bound I_{0,\mathcal{G}}:=\sup_{\bm{\theta}\in\mathcal{G}}\sup_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}^{\top}\bm{v}. First, let us consider a fixed \bm{\theta}\in\Sigma_{s,R_{0}}. Since |\xi_{k}|\leq\Delta, we have \|\xi_{k}\bm{x}_{k}\|_{\psi_{2}}\lesssim\Delta\sigma. Because \{\xi_{k}\bm{x}_{k}:k\in[n]\} are independent and zero-mean, by [92, Prop. 2.6.1] we have \big{\|}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}\big{\|}_{\psi_{2}}\lesssim\sqrt{n}\Delta\sigma. Define \mathscr{V}^{\prime}=\mathscr{V}\cup\{0\}; then for any \bm{v}_{1},\bm{v}_{2}\in\mathscr{V}^{\prime} we have

(k=1nξk𝒙k)𝒗1(k=1nξk𝒙k)𝒗2ψ2CnΔσ𝒗1𝒗22.\left\|\Big{(}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}\Big{)}^{\top}\bm{v}_{1}-\Big{(}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}\Big{)}^{\top}\bm{v}_{2}\right\|_{\psi_{2}}\leq C\sqrt{n}\Delta\sigma\|\bm{v}_{1}-\bm{v}_{2}\|_{2}.

Thus, by Lemma 10, it holds with probability at least 1exp(u2)1-\exp(-u^{2}) that

sup𝒗𝒱(k=1nξk𝒙k)𝒗\displaystyle\sup_{\bm{v}\in\mathscr{V}}\Big{(}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}\Big{)}^{\top}\bm{v} sup𝒗,𝒗𝒱|(k=1nξk𝒙k)𝒗(k=1nξk𝒙k)𝒗|\displaystyle\leq\sup_{\bm{v},\bm{v}^{\prime}\in\mathscr{V}^{\prime}}\left|\Big{(}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}\Big{)}^{\top}\bm{v}-\Big{(}\sum_{k=1}^{n}\xi_{k}\bm{x}_{k}\Big{)}^{\top}\bm{v}^{\prime}\right|
CnΔσ(ω(𝒱)+u)C1nΔσ(slog9ds+u).\displaystyle\leq C\sqrt{n}\Delta\sigma\big{(}\omega(\mathscr{V}^{\prime})+u\big{)}\leq C_{1}\sqrt{n}\Delta\sigma\Big{(}\sqrt{s\log\frac{9d}{s}}+u\Big{)}.

Moreover, by a union bound over 𝒢\mathcal{G}, we obtain that I0,𝒢σΔn(slog9ds+u)I_{0,\mathcal{G}}\lesssim\sigma\Delta\sqrt{n}\big{(}\sqrt{s\log\frac{9d}{s}}+u\big{)} holds with probability at least 1exp(slog9dρsu2)1-\exp\big{(}s\log\frac{9d}{\rho s}-u^{2}\big{)}. We set usδlog(9dρs)u\asymp\sqrt{s\delta\log\big{(}\frac{9d}{\rho s}\big{)}} and arrive at

(I0,𝒢σΔnsδlog9dρs)1(9dρs)Ω(δs).\mathbbm{P}\left(I_{0,\mathcal{G}}\lesssim\sigma\Delta\sqrt{ns\delta\log\frac{9d}{\rho s}}\right)\geq 1-\Big{(}\frac{9d}{\rho s}\Big{)}^{-\Omega(\delta s)}. (B.18)

Step 2.4.2. Bounding the Approximation Error

From now on we indicate the dependence of ξk\xi_{k} on 𝜽\bm{\theta} by using the notation ξ𝜽,k:=𝒬Δ(y~𝜽,k+τk)y~𝜽,k\xi_{\bm{\theta},k}:=\mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta},k}+\tau_{k})-\widetilde{y}_{\bm{\theta},k} where y~𝜽,k=𝒯ζy(𝒙k𝜽+ϵk)\widetilde{y}_{\bm{\theta},k}=\mathscr{T}_{\zeta_{y}}(\bm{x}_{k}^{\top}\bm{\theta}+\epsilon_{k}). For any 𝜽Σs,R0\bm{\theta}\in\Sigma_{s,R_{0}} we pick one 𝜽𝒢\bm{\theta}^{\prime}\in\mathcal{G} such that 𝜽𝜽2ρ\|\bm{\theta}-\bm{\theta}^{\prime}\|_{2}\leq\rho; we fix such correspondence and remember that from now on every 𝜽Σs,R0\bm{\theta}\in\Sigma_{s,R_{0}} is associated with some 𝜽𝒢\bm{\theta^{\prime}}\in\mathcal{G}, (which of course depends on 𝜽\bm{\theta} but our notation omits such dependence). Thus, we can bound I0=sup𝜽,𝒗k=1nξ𝜽,k𝒙k𝒗I_{0}=\sup_{\bm{\theta},\bm{v}}\sum_{k=1}^{n}\xi_{\bm{\theta},k}\bm{x}_{k}^{\top}\bm{v} as

I0\displaystyle I_{0} sup𝜽,𝒗k=1nξ𝜽,k𝒙k𝒗+sup𝜽,𝒗k=1n(ξ𝜽,kξ𝜽,k)𝒙k𝒗I0,𝒢+I01.\displaystyle\leq\sup_{\bm{\theta},\bm{v}}\sum_{k=1}^{n}\xi_{\bm{\theta^{\prime}},k}\bm{x}_{k}^{\top}\bm{v}+\sup_{\bm{\theta},\bm{v}}\sum_{k=1}^{n}(\xi_{\bm{\theta},k}-\xi_{\bm{\theta}^{\prime},k})\bm{x}_{k}^{\top}\bm{v}\leq I_{0,\mathcal{G}}+I_{01}. (B.19)

Note that the bound on I0,𝒢I_{0,\mathcal{G}} is available in (B.18), so it remains to bound I01:=sup𝜽,𝒗k=1n(ξ𝜽,kξ𝜽,k)𝒙k𝒗I_{01}:=\sup_{\bm{\theta},\bm{v}}\sum_{k=1}^{n}(\xi_{\bm{\theta},k}-\xi_{\bm{\theta^{\prime}},k})\bm{x}_{k}^{\top}\bm{v}, which can be understood as the approximation error of the net 𝒢\mathcal{G} regarding the empirical process of interest. To facilitate the presentation we switch to the more compact notations — let 𝑿n×d\bm{X}\in\mathbb{R}^{n\times d} with rows 𝒙k\bm{x}_{k}^{\top} be the sensing matrix, 𝝃𝜽=[ξ𝜽,k]n\bm{\xi}_{\bm{\theta}}=[\xi_{\bm{\theta},k}]\in\mathbb{R}^{n} be the quantization error indexed by 𝜽\bm{\theta}, 𝝉=[τk]n\bm{\tau}=[\tau_{k}]\in\mathbb{R}^{n} be the random dither vector, ϵ=[ϵk]n\bm{\epsilon}=[\epsilon_{k}]\in\mathbb{R}^{n} be the heavy-tailed noise vector, 𝒚𝜽=[y𝜽,k]=𝑿𝜽+ϵn\bm{{y}}_{\bm{\theta}}=[y_{\bm{\theta},k}]=\bm{X\theta}+\bm{\epsilon}\in\mathbb{R}^{n} and 𝒚~𝜽=[y~𝜽,k]=𝒯ζy(𝒚𝜽)\bm{\widetilde{y}}_{\bm{\theta}}=[\widetilde{y}_{\bm{\theta},k}]=\mathscr{T}_{\zeta_{y}}(\bm{y}_{\bm{\theta}}) be the measurement vector and truncated measurement vector, respectively. With these conventions we can write I01=sup𝜽,𝒗(𝝃𝜽𝝃𝜽)𝑿𝒗I_{01}=\sup_{\bm{\theta},\bm{v}}(\bm{\xi_{\theta}}-\bm{\xi_{\theta^{\prime}}})^{\top}\bm{Xv}. Recall that a specific 𝜽\bm{\theta^{\prime}} has been specified for each 𝜽Σs,R0\bm{\theta}\in\Sigma_{s,R_{0}}, so defining 𝚿𝜽:=𝝃𝜽𝝃𝜽\bm{\Psi_{\theta}}:=\bm{\xi_{\theta}}-\bm{\xi_{\theta^{\prime}}} allows us to write I01=sup𝜽,𝒗𝚿𝜽𝑿𝒗I_{01}=\sup_{\bm{\theta},\bm{v}}~{}\bm{\Psi}_{\bm{\theta}}^{\top}\bm{Xv}. Further, we define 𝚿~𝜽:=𝒚~𝜽𝒚~𝜽,𝚿^𝜽:=𝒬Δ(𝒚~𝜽+𝝉)𝒬Δ(𝒚~𝜽+𝝉)\bm{\widetilde{\Psi}_{\theta}}:=\bm{\widetilde{y}_{\theta}}-\bm{\widetilde{y}_{\theta^{\prime}}},\bm{\widehat{\Psi}_{\theta}}:=\mathcal{Q}_{\Delta}(\bm{\widetilde{y}_{\theta}}+\bm{\tau})-\mathcal{Q}_{\Delta}(\bm{\widetilde{y}_{\theta^{\prime}}}+\bm{\tau}) and make the following observation

𝚿𝜽=𝝃𝜽𝝃𝜽=𝚿^𝜽𝚿~𝜽.\displaystyle\bm{\Psi_{\theta}}=\bm{\xi_{\theta}}-\bm{\xi_{\theta^{\prime}}}=\bm{\widehat{\Psi}_{\theta}}-\bm{\widetilde{\Psi}_{\theta}}. (B.20)

We pause to establish a property of \bm{X} that holds with high probability. Specifically, we restrict (B.4) to \bm{v}\in\mathscr{V} (recall that \bar{\Delta}=0, so \bm{\dot{X}}=\bm{X} and \bm{\dot{\Sigma}}=\bm{\Sigma^{\star}}); then, with the promised probability, it holds for some c(\kappa_{0},\sigma) that

sup𝒗𝒱|𝑿𝒗2n𝚺𝒗2|c(κ0,σ)δslogdn.\sup_{\bm{v}\in\mathscr{V}}\left|\frac{\|\bm{Xv}\|_{2}}{\sqrt{n}}-\|\bm{\sqrt{\Sigma^{\star}}v}\|_{2}\right|\leq c(\kappa_{0},\sigma)\sqrt{\frac{\delta s\log d}{n}}.

Thus, when n\gtrsim\delta s\log d with a large enough hidden constant depending on (\kappa_{0},\sigma), it holds that

sup𝒗𝒱𝑿𝒗2nsup𝒗𝒱𝚺𝒗2+κ12κ1.\sup_{\bm{v}\in\mathscr{V}}\frac{\|\bm{Xv}\|_{2}}{\sqrt{n}}\leq\sup_{\bm{v}\in\mathscr{V}}\|\sqrt{\bm{\Sigma^{\star}}}\bm{v}\|_{2}+\sqrt{\kappa_{1}}\leq 2\sqrt{\kappa_{1}}. (B.21)

We proceed by assuming we are on this event, which allows us to bound I01I_{01} as

I01=supθ,𝒗𝚿𝜽𝑿𝒗\displaystyle I_{01}=\sup_{\theta,\bm{v}}\bm{\Psi_{\theta}}^{\top}\bm{Xv} sup𝜽𝚿𝜽2sup𝒗𝒱𝑿𝒗2\displaystyle\leq\sup_{\bm{\theta}}\|\bm{\Psi_{\theta}}\|_{2}\sup_{\bm{v}\in\mathscr{V}}\|\bm{Xv}\|_{2} (B.22)
2κ1nsup𝜽𝚿𝜽2.\displaystyle\leq 2\sqrt{\kappa_{1}n}\cdot\sup_{\bm{\theta}}\|\bm{\Psi_{\theta}}\|_{2}.

To bound sup𝜽𝚿𝜽2\sup_{\bm{\theta}}\|\bm{\Psi_{\theta}}\|_{2}, motivated by (B.20), we will investigate 𝚿~𝜽\bm{\widetilde{\Psi}_{\theta}} and 𝚿^𝜽\bm{\widehat{\Psi}_{\theta}} more carefully. We pick a threshold η(0,Δ2)\eta\in(0,\frac{\Delta}{2}) (that is to be chosen later), and by the 11-Lipschitzness of 𝒯ζy()\mathscr{T}_{\zeta_{y}}(\cdot) we have

sup𝜽𝚿~𝜽2\displaystyle\sup_{\bm{\theta}}\|\bm{\widetilde{\Psi}_{\theta}}\|_{2} =sup𝜽𝒚~𝜽𝒚~𝜽2sup𝜽𝒚𝜽𝒚𝜽2\displaystyle=\sup_{\bm{\theta}}\|\bm{\widetilde{y}}_{\bm{\theta}}-\bm{\widetilde{y}}_{\bm{\theta}^{\prime}}\|_{2}\leq\sup_{\bm{\theta}}\|\bm{y}_{\bm{\theta}}-\bm{y}_{\bm{\theta}^{\prime}}\|_{2} (B.23)
=sup𝜽𝑿(𝜽𝜽)22κ1nρ,\displaystyle=\sup_{\bm{\theta}}\|\bm{X}(\bm{\theta}-\bm{\theta}^{\prime})\|_{2}\leq 2\sqrt{\kappa_{1}n}\rho,

where the last inequality is because 𝜽𝜽\bm{\theta}-\bm{\theta}^{\prime} is 2s2s-sparse, hence (B.21) implies 𝑿(𝜽𝜽)22κ1n𝜽𝜽22κ1nρ\|\bm{X}(\bm{\theta}-\bm{\theta}^{\prime})\|_{2}\leq 2\sqrt{\kappa_{1}n}\|\bm{\theta}-\bm{\theta}^{\prime}\|_{2}\leq 2\sqrt{\kappa_{1}n}\rho.

To proceed, we will define for specific 𝜽\bm{\theta} the index vectors 𝑱𝜽,1,𝑱𝜽,2{0,1}n\bm{J}_{\bm{\theta},1},\bm{J}_{\bm{\theta},2}\in\{0,1\}^{n} and use |𝑱𝜽,1||\bm{J}_{\bm{\theta},1}| to denote the number of 11s in 𝑱𝜽,1\bm{J}_{\bm{\theta},1} (similar meaning for |𝑱𝜽,2||\bm{J}_{\bm{\theta},2}|). Specifically, using the entry-wise notation 𝚿~𝜽=[Ψ~𝜽,k]\bm{\widetilde{\Psi}_{\theta}}=[\widetilde{\Psi}_{\bm{\theta},k}] we define 𝑱𝜽,1=[𝟙(|Ψ~𝜽,k|η)]\bm{J}_{\bm{\theta},1}=[\mathbbm{1}(|\widetilde{\Psi}_{\bm{\theta},k}|\geq\eta)]. Recall that (B.23) gives sup𝜽𝚿~𝜽224κ1nρ2\sup_{\bm{\theta}}\|\bm{\widetilde{\Psi}_{\theta}}\|_{2}^{2}\leq 4\kappa_{1}n\rho^{2}; combined with the simple observation sup𝜽𝚿~𝜽22sup𝜽η2|𝑱𝜽,1|\sup_{\bm{\theta}}\|\bm{\widetilde{\Psi}_{\theta}}\|_{2}^{2}\geq\sup_{\bm{\theta}}\eta^{2}|\bm{J}_{\bm{\theta},1}|, we obtain a uniform bound on |𝑱𝜽,1||\bm{J}_{\bm{\theta},1}| as

sup𝜽Σs,R0|𝑱𝜽,1|4κ1nρ2η2.\sup_{\bm{\theta}\in\Sigma_{s,R_{0}}}|\bm{J}_{\bm{\theta},1}|\leq\frac{4\kappa_{1}n\rho^{2}}{\eta^{2}}. (B.24)

Next, we define the index vector \bm{J}_{\bm{\theta},2} for \bm{\theta}\in\mathcal{G}: first let \mathscr{E}_{\bm{\theta},i}=\{\mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta},i}+\tau_{i}+t)\text{ is discontinuous in }t\in[-\eta,\eta]\}, and then define \bm{J}_{\bm{\theta},2}:=[\mathbbm{1}(\mathscr{E}_{\bm{\theta},i})]. Then, by Lemma 11 (which we prove later), we have

(sup𝜽𝒢|𝑱𝜽,2|CnηΔ)1exp(cnηΔ+slog9dρs):=1𝒫1.\mathbbm{P}\Big{(}\sup_{\bm{\theta}\in\mathcal{G}}|\bm{J}_{\bm{\theta},2}|\leq\frac{Cn\eta}{\Delta}\Big{)}\geq 1-\exp\Big{(}-\frac{cn\eta}{\Delta}+s\log\frac{9d}{\rho s}\Big{)}:=1-\mathscr{P}_{1}. (B.25)

Note that if \mathscr{E}_{\bm{\theta},i} does not happen (i.e., \mathbbm{1}(\mathscr{E}_{\bm{\theta},i})=0), then \mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta},i}+\tau_{i}+t) is continuous in t\in[-\eta,\eta]; combined with the definition of \mathcal{Q}_{\Delta}(\cdot), this is equivalent to saying that \mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta},i}+\tau_{i}+t) remains constant over t\in[-\eta,\eta]. Thus, given a fixed \bm{\theta}\in\Sigma_{s,R_{0}} and its associated \bm{\theta^{\prime}}, suppose that the i-th entry of \bm{J}_{\bm{\theta^{\prime}},2} is zero (meaning that \mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta^{\prime}},i}+\tau_{i}+t) remains constant over t\in[-\eta,\eta]); if additionally the i-th entry of \bm{J}_{\bm{\theta},1} is zero (i.e., |\widetilde{\Psi}_{\bm{\theta},i}|<\eta), then the i-th entry of \bm{\widehat{\Psi}}_{\bm{\theta}}=\mathcal{Q}_{\Delta}(\bm{\widetilde{y}_{\theta}}+\bm{\tau})-\mathcal{Q}_{\Delta}(\bm{\widetilde{y}_{\theta^{\prime}}}+\bm{\tau}) vanishes:

Ψ^𝜽,i\displaystyle\widehat{\Psi}_{\bm{\theta},i} =𝒬Δ(y~𝜽,i+τi)𝒬Δ(y~𝜽,i+τi)\displaystyle=\mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta},i}+\tau_{i})-\mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta^{\prime}},i}+\tau_{i})
=𝒬Δ(y~𝜽,i+Ψ~𝜽,i+τi)𝒬Δ(y~𝜽,i+τi)=0;\displaystyle=\mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta^{\prime}},i}+\widetilde{\Psi}_{\bm{\theta},i}+\tau_{i})-\mathcal{Q}_{\Delta}(\widetilde{y}_{\bm{\theta^{\prime}},i}+\tau_{i})=0;

combining with (B.20), this implies \Psi_{\bm{\theta},i}=-\widetilde{\Psi}_{\bm{\theta},i}. Recall from (B.22) that we want to bound \sup_{\bm{\theta}}\|\bm{\Psi_{\theta}}\|_{2}. Write \bm{J}_{\bm{\theta},1}^{c}=\bm{1}-\bm{J}_{\bm{\theta},1} and \bm{J}_{\bm{\theta^{\prime}},2}^{c}=\bm{1}-\bm{J}_{\bm{\theta^{\prime}},2}; then, denoting the Hadamard product by \odot and using the decomposition \bm{1}=\max\{\bm{J}_{\bm{\theta},1},\bm{J}_{\bm{\theta^{\prime}},2}\}+\min\{\bm{J}_{\bm{\theta},1}^{c},\bm{J}_{\bm{\theta^{\prime}},2}^{c}\} and (B.20), we have

sup𝜽𝚿𝜽2=sup𝜽𝚿𝜽max{𝑱𝜽,1,𝑱𝜽,2}+𝚿𝜽min{𝑱𝜽,1c,𝑱𝜽,2c}2\displaystyle\sup_{\bm{\theta}}\big{\|}\bm{\Psi_{\theta}}\big{\|}_{2}=\sup_{\bm{\theta}}\big{\|}\bm{\Psi_{\theta}}\odot\max\{\bm{J}_{\bm{\theta},1},\bm{J}_{\bm{\theta^{\prime}},2}\}+\bm{\Psi_{\theta}}\odot\min\{\bm{J}_{\bm{\theta},1}^{c},\bm{J}^{c}_{\bm{\theta^{\prime}},2}\}\big{\|}_{2} (B.26)
sup𝜽𝚿𝜽max{𝑱𝜽,1,𝑱𝜽,2}2+sup𝜽𝚿𝜽min{𝑱𝜽,1c,𝑱𝜽,2c}2\displaystyle\leq\sup_{\bm{\theta}}\big{\|}\bm{\Psi_{\theta}}\odot\max\{\bm{J}_{\bm{\theta},1},\bm{J}_{\bm{\theta^{\prime}},2}\}\big{\|}_{2}+\sup_{\bm{\theta}}\big{\|}\bm{\Psi_{\theta}}\odot\min\{\bm{J}_{\bm{\theta},1}^{c},\bm{J}^{c}_{\bm{\theta^{\prime}},2}\}\big{\|}_{2}
(i)sup𝜽𝚿𝜽|𝑱𝜽,1|+|𝑱𝜽,2|+sup𝜽𝚿~𝜽min{𝑱𝜽,1c,𝑱𝜽,2c}2\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\sup_{\bm{\theta}}\big{\|}\bm{\Psi_{\theta}}\big{\|}_{\infty}\cdot\sqrt{|\bm{J}_{\bm{\theta},1}|+|\bm{J}_{\bm{\theta^{\prime}},2}|}~{}+\sup_{\bm{\theta}}\big{\|}\bm{\widetilde{\Psi}_{\bm{\theta}}}\odot\min\{\bm{J}_{\bm{\theta},1}^{c},\bm{J}_{\bm{\theta^{\prime}},2}^{c}\}\big{\|}_{2}
(ii)C{Δn(κ1ρη+ηΔ)+ρκ1n},\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}C\Big{\{}\Delta\sqrt{n}\Big{(}\frac{\sqrt{\kappa_{1}}\rho}{\eta}+\sqrt{\frac{\eta}{\Delta}}\Big{)}+\rho\sqrt{\kappa_{1}n}\Big{\}},

where (i)(i) is because entries of 𝚿𝜽\bm{\Psi}_{\bm{\theta}} equal to those of 𝚿~𝜽-\bm{\widetilde{\Psi}}_{\bm{\theta}} if the index corresponds to min{𝑱𝜽,1c,𝑱𝜽,2c}=1\min\{\bm{J}_{\bm{\theta},1}^{c},\bm{J}_{\bm{\theta^{\prime}},2}^{c}\}=1, and in (ii)(ii) we use the simple bound 𝚿𝜽𝝃𝜽+𝝃𝜽2Δ\|\bm{\Psi_{\theta}}\|_{\infty}\leq\|\bm{\xi_{\theta}}\|_{\infty}+\|\bm{\xi_{\theta^{\prime}}}\|_{\infty}\leq 2\Delta and the derived bounds on sup𝜽|𝑱𝜽,1|,sup𝜽𝒢|𝑱𝜽,2|,sup𝜽𝚿~𝜽2\sup_{\bm{\theta}}|\bm{J}_{\bm{\theta},1}|,\sup_{\bm{\theta}\in\mathcal{G}}|\bm{J}_{\bm{\theta},2}|,\sup_{\bm{\theta}}\|\bm{\widetilde{\Psi}_{\theta}}\|_{2} in (B.24), (B.25) and (B.23), respectively.

Step 2.4.3. Concluding the Bound on I0I_{0}

We are ready to put the pieces together, specify \rho,\eta, and conclude the bound on I_{0}. Overall, with probability at least 1-\mathscr{P}_{1}-\mathscr{P}_{2}, for \mathscr{P}_{1} defined in (B.25) and some \mathscr{P}_{2} within the promised probability, combining (B.18), (B.19), (B.22) and (B.26) we obtain

I0σΔnsδlog9dρs+κ1nΔρη+nκ1ηΔ+κ1ρn.I_{0}\lesssim\sigma\Delta\sqrt{ns\delta\log\frac{9d}{\rho s}}+\frac{\kappa_{1}n\Delta\rho}{\eta}+n\sqrt{\kappa_{1}\eta\Delta}+\kappa_{1}\rho n.

Thus, we take the (near-optimal) choice of (\rho,\eta) as \rho\asymp\frac{\Delta}{\sqrt{\kappa_{1}}}\big{(}\frac{s\delta}{n}\big{)}^{3/2} and \eta\asymp\frac{\delta\Delta s}{n}\log\frac{9d}{\rho s}, under which, with the promised probability (as \mathscr{P}_{1} is also sufficiently small), we obtain the following bound on I_{0}:

I0σΔnsδlog(κ1d2n3Δ2s5δ3)I_{0}\lesssim\sigma\Delta\sqrt{ns\delta\log\Big{(}\frac{\kappa_{1}d^{2}n^{3}}{\Delta^{2}s^{5}\delta^{3}}\Big{)}} (B.27)

We can now conclude the proof. Substituting \inf_{\bm{v}\in\mathscr{V}}\sum_{k=1}^{n}(\bm{x}_{k}^{\top}\bm{v})^{2}\geq\frac{1}{4}\kappa_{0}n and the definition of I in (B.13) into (B.12), we obtain \frac{1}{4}\kappa_{0}n\|\bm{\widehat{\Upsilon}}\|_{2}^{2}\leq 2\|\bm{\widehat{\Upsilon}}\|_{2}I uniformly for all \bm{\theta}\in\Sigma_{s,R_{0}}, which implies \sup_{\bm{\theta}}\|\bm{\widehat{\Upsilon}}\|_{2}\leq\frac{8I}{\kappa_{0}n}. Substituting the derived bounds on I_{1},I_{2},I_{3},I_{0} into (B.14), with the promised probability we have

Iσ(σ+M12l)nsδlogd+σΔnsδlog(κ1d2n3Δ2s5δ3),I\lesssim\sigma\big{(}\sigma+M^{\frac{1}{2l}}\big{)}\sqrt{ns\delta\log d}+\sigma\Delta\sqrt{ns\delta\log\Big{(}\frac{\kappa_{1}d^{2}n^{3}}{\Delta^{2}s^{5}\delta^{3}}\Big{)}},

so the uniform bound on 𝚼^2\|\bm{\widehat{\Upsilon}}\|_{2} follows immediately. Further using 𝚼^12s𝚼^2\|\bm{\widehat{\Upsilon}}\|_{1}\leq 2\sqrt{s}\|\bm{\widehat{\Upsilon}}\|_{2} completes the proof.∎

Lemma 11.

(Bounding sup𝜽𝒢|𝑱𝜽,2|\sup_{\bm{\theta}\in\mathcal{G}}|\bm{J}_{\bm{\theta},2}|). Along the proof of Theorem 13, it holds that

(sup𝜽𝒢|𝑱𝜽,2|CnηΔ)exp(cnηΔ+slog9dρs).\mathbbm{P}\Big{(}\sup_{\bm{\theta}\in\mathcal{G}}|\bm{J}_{\bm{\theta},2}|\geq\frac{Cn\eta}{\Delta}\Big{)}\leq\exp\Big{(}-\frac{cn\eta}{\Delta}+s\log\frac{9d}{\rho s}\Big{)}. (B.28)
Proof.

Notation and details from the proof of Theorem 13 will be used. We first consider a fixed \bm{\theta}\in\mathcal{G}; by a simple shift, \mathscr{E}_{\bm{\theta},i} happens if and only if \mathcal{Q}_{\Delta}(\cdot) is discontinuous in [\widetilde{y}_{\bm{\theta},i}+\tau_{i}-\eta,\widetilde{y}_{\bm{\theta},i}+\tau_{i}+\eta], which is equivalent to [\widetilde{y}_{\bm{\theta},i}+\tau_{i}-\eta,\widetilde{y}_{\bm{\theta},i}+\tau_{i}+\eta]\cap(\Delta\cdot\mathbb{Z})\neq\varnothing. Because \tau_{i}\sim\mathscr{U}\big{(}[-\frac{\Delta}{2},\frac{\Delta}{2}]\big{)} and \eta<\frac{\Delta}{2}, we have \mathbbm{P}(\mathscr{E}_{\bm{\theta},i}\mid\bm{X},\bm{\epsilon})=\frac{2\eta}{\Delta}, regardless of the location of [\widetilde{y}_{\bm{\theta},i}-\eta,\widetilde{y}_{\bm{\theta},i}+\eta]. Thus, for fixed \bm{\theta}, conditioning on (\bm{X},\bm{\epsilon}), |\bm{J}_{\bm{\theta},2}| follows the binomial distribution with n trials and success probability p:=\frac{2\eta}{\Delta}. This allows us to write |\bm{J}_{\bm{\theta},2}|=\sum_{k=1}^{n}J_{k} with J_{k} i.i.d. Bernoulli with \mathbbm{E}J_{k}=p. Then, for any integer q\geq 2 we have \sum_{k=1}^{n}\mathbbm{E}|J_{k}-\mathbbm{E}J_{k}|^{q}\leq\sum_{k=1}^{n}\mathbbm{E}|J_{k}-\mathbbm{E}J_{k}|^{2}\leq np(1-p)\leq\frac{q!}{2}np. Now we invoke Bernstein's inequality (Lemma 1) to obtain that, for any t>0,

(|𝑱𝜽,2|np2npt+t)exp(t).\mathbbm{P}\big{(}|\bm{J}_{\bm{\theta},2}|-np\geq\sqrt{2npt}+t\big{)}\leq\exp(-t).

We let t=cnp and take a union bound over \bm{\theta}\in\mathcal{G}; this yields the desired claim since |\mathcal{G}|\leq\big{(}\frac{9d}{\rho s}\big{)}^{s}. ∎
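As a sanity check on the key probability computation in this proof, the following simulation sketch verifies that, with \tau_{i}\sim\mathscr{U}([-\frac{\Delta}{2},\frac{\Delta}{2}]) and \eta<\frac{\Delta}{2}, the window [\widetilde{y}_{\bm{\theta},i}+\tau_{i}-\eta,\widetilde{y}_{\bm{\theta},i}+\tau_{i}+\eta] contains a point of \Delta\cdot\mathbb{Z} with probability 2\eta/\Delta, independently of \widetilde{y}_{\bm{\theta},i}. The sample size and the Gaussian placeholder for the truncated measurements are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
Delta, eta, n = 1.0, 0.1, 500000          # requires eta < Delta / 2

y_tilde = 3.0 * rng.standard_normal(n)               # placeholder truncated measurements
tau = rng.uniform(-Delta / 2, Delta / 2, size=n)     # uniform dither

# The window [y_tilde + tau - eta, y_tilde + tau + eta] meets Delta*Z exactly when
# (y_tilde + tau) mod Delta lies within eta of 0 or of Delta.
frac = np.mod(y_tilde + tau, Delta)
hit = (frac <= eta) | (frac >= Delta - eta)

print("empirical P(E_theta_i)  :", hit.mean())
print("theoretical 2*eta/Delta :", 2 * eta / Delta)
```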