Quantizing Heavy-tailed Data in Statistical Estimation: (Near) Minimax Rates, Covariate Quantization, and Uniform Recovery
Abstract
Modern datasets often exhibit heavy-tailed behaviour, while quantization is inevitable in digital signal processing and many machine learning problems. This paper studies the quantization of heavy-tailed data in several fundamental statistical estimation problems where the underlying distributions have bounded moments of some order (no greater than 4). We propose to truncate and properly dither the data prior to a uniform quantization. Our major standpoint is that (near) minimax rates of estimation error can be achieved by computationally tractable estimators based on the quantized data produced by the proposed scheme. In particular, concrete results are worked out for covariance estimation, compressed sensing (also interpreted as sparse linear regression), and matrix completion, all agreeing that the quantization only slightly worsens the multiplicative factor. Additionally, while prior results focused on quantization of the responses (i.e., measurements), we study compressed sensing where the covariates (i.e., sensing vectors) are also quantized; in this case, our recovery program is non-convex because the covariance matrix estimator lacks positive semi-definiteness, but all local minimizers are proved to enjoy near optimal error bounds. Moreover, by a concentration inequality for product processes and a covering argument, we establish a near minimax uniform recovery guarantee for quantized compressed sensing with heavy-tailed noise. Finally, numerical simulations are provided to corroborate our theoretical results.
1 Introduction
Heavy-tailed distributions are ubiquitous in modern datasets, especially those arising in economics, finance, imaging, and biology; see [93, 84, 45, 5, 88, 57] for instance. In the recent literature, heavy-tailedness is often captured by a bounded -th moment, where is some fixed small scalar; this is essentially weaker than the sub-Gaussian assumption, and thus outliers and extreme values appear much more frequently in data from heavy-tailed distributions (referred to as heavy-tailed data), which poses challenges for statistical analysis. In fact, many standard statistical procedures developed for sub-Gaussian data suffer from performance degradation in the heavy-tailed regime. Fortunately, the past decade has witnessed considerable progress on statistical estimation methods that are robust to heavy-tailedness; see [18, 34, 73, 68, 10, 44, 35, 97, 65] for instance.
Departing momentarily from heavy-tailed data, quantization is an inevitable process in the era of digital signal processing: it maps signals to bitstreams so that they can be stored, processed and transmitted. In particular, the resolution of quantization should be selected to achieve a trade-off between accuracy and various data processing costs, and in some applications a relatively low resolution is preferable. For instance, in a distributed learning setting or a MIMO system, the frequent information transmission among multiple parties often results in prohibitive communication costs [56, 69], and quantizing signals or data to fairly low resolution (while preserving satisfactory utility) is an effective approach to reducing the cost [96, 43]. Against this backdrop, recent years have seen a rapidly growing literature on high-dimensional signal recovery from quantized data (see, e.g., [8, 26, 21, 89, 50, 32, 31] for 1-bit quantization and [89, 50, 94, 46] for multi-bit uniform quantization), trying to understand the interplay between quantization and signal reconstruction (or parameter learning) in some fundamental estimation problems.
Independently, a set of robustifying techniques has been developed to overcome the challenge posed by heavy-tailed data, and uniform data quantization under uniform dither was shown to cost very little in some recovery problems. Considering the ubiquity of heavy-tailed behaviour and data quantization, a natural question is how to design a quantization scheme for heavy-tailed data that incurs only minor information loss. For instance, when applied to statistical estimation problems with heavy-tailed data, an appropriate quantization scheme should enable at least one faithful estimator from the quantized data, and ideally an estimator nearly achieving the optimal error rate. Despite the vast literature in this field, prior results that simultaneously take heavy-tailed data and quantization into account are surprisingly rare: to the best of our knowledge, only the ones presented in [32] and our earlier work [21] regarding the dithered 1-bit quantizer. These results remain incomplete and exhibit some downsides. Specifically, [32] considered a computationally intractable program for quantized compressed sensing and used techniques that are hard to generalize to other problems, while the error rates in [21] are inferior to the corresponding minimax ones (under unquantized sub-Gaussian data), as will be discussed in Section 1.1.3. In a nutshell, a quantization scheme for heavy-tailed data arising in statistical estimation problems that allows for computationally tractable near-minimax estimators is still lacking.
This paper aims to provide a solution to the above question and narrow the gap between heavy-tailed data and data quantization in the literature. In particular, we propose a unified quantization scheme for heavy-tailed data which, when applied to the canonical estimation problems of (sparse) covariance matrix estimation, compressed sensing (or sparse linear regression) and matrix completion, allows for (near) minimax estimators that are either in closed-form or solved from convex programs. Additionally, we present novel developments concerning covariate (or sensing vector) quantization and uniform signal recovery in quantized compressed sensing with heavy-tailed data.
1.1 Related Works and Our Contributions
This section is devoted to a review of the most relevant works. Before that, we note that a heavy-tailed random variable in this work is formulated by the moment constraint , where is oftentimes regarded as an absolute constant and is some fixed small scalar (specifically, in the present paper).
1.1.1 Statistical Estimation under Heavy-Tailed Data
Compared to sub-Gaussian data, heavy-tailed data may contain many outliers and extreme values that are overly influential to traditional estimation methods. Hence, developing estimation methods that are robust to heavy-tailedness has become a recent focus in the statistics literature, where heavy-tailed distributions are often only assumed to have bounded moments of some small order. In particular, significant efforts have been devoted to the fundamental problem of mean estimation for heavy-tailed distributions. For instance, effective techniques available in the literature include Catoni's mean estimator [18, 34], median of means [73, 68], and the trimmed mean [66, 28]. While the strategies to achieve robustness are different, these methods indeed share the same core spirit of making the outliers less influential. To this end, the trimmed method (also referred to as truncation or shrinkage) may be the most intuitive: it truncates overlarge data to some threshold so that they become more benign for the estimation procedure. For more in-depth discussions we refer to the recent survey [65]. Furthermore, these robust methods for estimating the mean have been applied to empirical risk minimization [10, 44] and various high-dimensional estimation problems [35, 97], achieving near optimal guarantees. For instance, by invoking M-estimators with truncated data, (near) minimax rates can be achieved in high-dimensional sparse linear regression, matrix completion, and covariance estimation [35]. In fact, techniques similar to truncation have proven effective in some related problems; e.g., in non-convex algorithms for phase retrieval, truncated Wirtinger flow [23], which uses a more selective spectral initialization and carefully trimmed gradients, improves on Wirtinger flow [16] in terms of sample size.
Recall that we capture heavy-tailedness by a bounded moment of some small order. Regarding statistical estimation beyond sub-Gaussianity more broadly, there has been a line of works considering sub-exponential or, more generally, sub-Weibull distributions [80, 85, 58, 39], which have heavier tails than sub-Gaussian ones but still possess finite moments up to arbitrary order. Specifically, without truncation and quantization, sparse linear regression was studied under sub-exponential data in [85] and under sub-Weibull data in [58], and the obtained error rates match the ones in the sub-Gaussian case up to logarithmic factors. Additionally, under sub-exponential measurement matrix and noise, [80] established a uniform guarantee for 1-bit generative compressed sensing, while [39] analysed the generalized Lasso for a general nonlinear model. Because the tail assumptions in these works are substantially stronger than ours (one exception is Theorem 4.6 in [58] for sparse linear regression with sub-Weibull covariates and heavy-tailed noise with bounded second moment, but still this result does not involve a quantization procedure), we will not make special comparisons with their results later; instead, we simply note here two key distinctions: 1) these works do not utilize special treatments for heavy-tailed data like the truncation in the present paper; 2) most of them (with the single exception of [80] studying 1-bit quantization) do not focus on quantization, while this work concentrates on quantization of heavy-tailed data.
1.1.2 Statistical Estimation from Quantized Data
Quantized Compressed Sensing. Due to the prominent role of quantization in signal processing and machine learning, quantized compressed sensing, which studies the interplay between sparse signal reconstruction and data quantization, has become an active research branch in the field. In this work, we focus on memoryless quantization schemes (meaning that the quantization of different measurements is performed independently; for other quantization schemes we refer to the recent survey [29]), which embrace simple hardware design. An important model is 1-bit compressed sensing where only the sign of the measurement is retained [8, 48, 75, 76]; more precisely, this model concerns the recovery of a sparse signal from the signs of its linear measurements under a sensing matrix. However, 1-bit compressed sensing associated with the direct quantization suffers from some frustrating limitations, e.g., the loss of signal norm information, and an identifiability issue under some regular sensing matrices (e.g., under a Bernoulli sensing matrix, see [32]). (In fact, almost all existing guarantees using the direct 1-bit observations are restricted to standard Gaussian sensing matrices consisting of i.i.d. entries, with the exceptions of [1] for sub-Gaussian sensing matrices and [30, 86] for partial Gaussian circulant matrices.) Fortunately, these limitations can be overcome by random dithering prior to the quantization, under which the 1-bit measurements read as for some suitably chosen random dither . Specifically, under a Gaussian dither and a standard Gaussian sensing matrix, full reconstruction with norm information could be achieved; the key idea is to reduce the dithered model to an undithered model for an augmented sparse signal whose last entry is known beforehand [54] (this idea has recently been extended in [20] to the related problem of phase-only compressed sensing; see also [47, 36, 7] for prior developments). More surprisingly, under a uniform random dither, recovery with norm information can be achieved under rather general sub-Gaussian sensing matrices [32, 50, 89, 21], even with a near optimal error rate.
Besides the 1-bit quantizer that retains the sign, the uniform quantizer maps to for some pre-specified ; here and hereafter, we refer to as the quantization level, and note that a smaller represents a higher resolution. While recovering the signal from directly quantized measurements encounters an identifiability issue (for instance, if (a typical example is the Bernoulli design where entries of are i.i.d. zero-mean) and , then and can never be distinguished because always holds), it is again beneficial to use random dithering to obtain the measurements . More specifically, by using a uniform dither, the Lasso estimator [89, 87] and the projected back projection (PBP) method [94] achieve the minimax rate in certain cases, and the derived error bounds for these estimators demonstrate that the dithered uniform quantization does not affect the scaling law but only slightly worsens the multiplicative factor. Although the aforementioned progress on compressed sensing under dithered quantization was made only recently, the technique of dithering in quantization has a long history and (at least) dates back to some early engineering work (e.g., [83]); see [41] for a brief introduction.
Other Estimation Problems with Quantized Data. Beyond compressed sensing, or its more statistical counterpart of sparse linear regression, some other statistical estimation problems have also been investigated under dithered 1-bit quantization. Specifically, [24] studied a general signal estimation problem under dithered 1-bit quantization in a traditional setting where the sample size tends to infinity, showing that only a logarithmic rate loss is incurred by the quantization. Inspired by the potential reduction of power consumption in large-scale massive MIMO systems, [31] proposed to collect 2 bits per entry from each sub-Gaussian sample and developed an estimator that is near minimax optimal in certain cases. Their estimator from coarsely quantized samples was extended to the high-dimensional sparse case in [21]. Then, considering the ubiquitousness of binary observations in many recommendation systems, the authors of [26] first approached the 1-bit matrix completion problem by maximum likelihood estimation with a nuclear norm constraint. Their method was further developed in a series of follow-up works that use different regularizers/constraints to encourage low-rankness, or consider multi-bit quantization on a finite alphabet [14, 59, 52, 17, 3]. Quantizing the observed entries by a dithered 1-bit quantizer, the 1-bit matrix completion result in [21] essentially departs from the standard likelihood approach and can tolerate pre-quantization noise with unknown distribution.
1.1.3 Quantization of Heavy-Tailed Data in Statistical Estimation
From now on, we turn to existing results more closely related to this work and explain our contributions. Note that the results we just reviewed are for estimation problems with either unquantized heavy-tailed data (Section 1.1.1) or quantized sub-Gaussian data (Section 1.1.2). While quantization of heavy-tailed data (from distributions assumed to have bounded moments of some small order) is a natural question of significant practical value, prior investigations turn out to be surprisingly rare, and the only results we are aware of were presented in [21, 32] concerning dithered 1-bit quantization. Specifically, [32, Thm. 1.11] considered heavy-tailed noise and possibly heavy-tailed covariates, implying that a sharp uniform error rate is achievable (see their Example 1.10). However, their result is for a computationally intractable program (Hamming distance minimization) and hence of limited practical value. Another limitation is that their techniques are based on random hyperplane tessellations, which are specialized to 1-bit compressed sensing and do not generalize to other estimation problems. In contrast, [21] proposed a unified quantization scheme that first truncates the data and then invokes a dithered 1-bit quantizer. Although this quantization scheme can (at least) be applied to sparse covariance matrix estimation, compressed sensing, and matrix completion while still enabling practical estimators, the main drawback is that the convergence rates of the estimation errors are essentially slower than the corresponding minimax optimal ones (e.g., for 1-bit compressed sensing under heavy-tailed noise [21, Thm. 10]), and in certain cases the rates cannot be improved without changing the quantization process (e.g., [21, Thm. 11] complements [21, Thm. 10] with a nearly matching lower bound). Beyond that, the 1-bit compressed sensing results in [21] are non-uniform. In a nutshell, [32] proved a sharp rate for 1-bit compressed sensing but used a highly intractable program and techniques not extendable to other estimation regimes, while the more widely applicable scheme and practical estimators in [21] suffer from slow error rates (when compared to the ones achieved from unquantized sub-Gaussian data).
Our Main Contributions: (Near) Minimax Rates. We propose a unified quantization scheme for heavy-tailed data consisting of three steps: 1) truncation, which shrinks the data to some threshold; 2) dithering, which adds suitable random noise to the truncated data; and 3) uniform quantization. For sub-Gaussian or sub-exponential data the truncation step is inessential, and we simply set the threshold as in this case. Careful readers may notice that we simply replace the 1-bit quantizer in our prior work [21] with the less extreme (multi-bit) uniform quantizer, but the gain turns out to be significant: we are now able to derive (near) optimal rates that are essentially faster than the ones in [21], see Theorems 2-8. Compared to [32], besides the different quantizers, other major distinctions are that: 1) we utilize an additional truncation step, 2) our estimators are computationally feasible, and 3) we investigate multiple estimation problems with the possibility of extensions to further ones. Concerning the effect of quantization, our error rates suggest a unified conclusion that dithered uniform quantization does not affect the scaling law but only slightly worsens the multiplicative factor, which generalizes similar findings for quantized compressed sensing in [89, 87, 94] in two directions, i.e., to the case where heavy-tailed data are present and to some other estimation problems (i.e., covariance matrix estimation, matrix completion). As an example, for quantized compressed sensing with sub-Gaussian sensing vectors but heavy-tailed measurements satisfying for some , we derive the -norm error rate where (Theorem 5; are respectively the sparsity, signal dimension, and measurement number), which is reminiscent of the rates in [94, 89] in terms of the position of . From the technical side, many of our analyses of the dithered quantizer are much cleaner than prior works because we make full use of the nice statistical properties of the quantization error and quantization noise (Theorem 1); prior works that did not fully leverage these properties may incur extra technical complications, e.g., the symmetrization and contraction in [89, Lem. A.2]; see Section 2.2. Also, the properties of the quantization noise motivate us to use a triangular dither when covariance estimation is necessary, which departs from the uniform dither commonly adopted in prior works (e.g., [89, 94, 31, 21]) and is novel to the literature of quantized compressed sensing. In our subsequent work [22], a clean analysis of quantized low-rank multivariate regression with possibly quantized covariates is provided, and we believe the innovations of this work will prove useful in other problems.
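To make the proposed scheme concrete, the following minimal NumPy sketch implements one possible version of the three steps (truncation, dithering, uniform quantization); the function names, the truncation threshold `tau`, and the quantization level `delta` are illustrative choices rather than the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncate(x, tau):
    """Step 1: shrink entries of x onto [-tau, tau]."""
    return np.clip(x, -tau, tau)

def uniform_quantize(x, delta):
    """Step 3: uniform quantizer mapping x to the grid delta * Z (round to nearest)."""
    return delta * np.floor(x / delta + 0.5)

def quantize_heavy_tailed(x, tau, delta, dither="uniform"):
    """Truncate, add a random dither, then uniformly quantize a data vector."""
    x_trunc = truncate(x, tau)
    if dither == "uniform":                      # used for responses
        u = rng.uniform(-delta / 2, delta / 2, size=x.shape)
    elif dither == "triangular":                 # used for covariates (sum of two uniforms)
        u = (rng.uniform(-delta / 2, delta / 2, size=x.shape)
             + rng.uniform(-delta / 2, delta / 2, size=x.shape))
    else:
        raise ValueError("unknown dither")
    return uniform_quantize(x_trunc + u, delta)

# Toy heavy-tailed data: Student's t with 3 degrees of freedom.
x = rng.standard_t(df=3, size=10000)
x_dot = quantize_heavy_tailed(x, tau=5.0, delta=0.5)
# With a uniform dither, |quantized - truncated| <= delta (quantization error + dither).
print(np.max(np.abs(x_dot - truncate(x, 5.0))))
```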
1.1.4 Covariate Quantization in Quantized Compressed Sensing
From now on we concentrate on quantized compressed sensing, i.e., the recovery of a sparse signal from the quantized version of , where are the sensing vector, measurement and noise, respectively. Let us first clarify a terminology issue before proceeding. Note that this formulation also models the sparse linear regression problem (e.g., [72, 81]) where one wants to learn a sparse parameter from given data that are believed to follow the linear model; in this regression problem, the data are commonly referred to as the covariate and the response, respectively. We are interested in both settings in this work (as further explained in Section 3.2), but for clearer presentation, we simply refer to the problem as quantized compressed sensing, while using the terms covariate and response, respectively.
Despite a large volume of results in quantized compressed sensing, almost all of them are restricted to response quantization; thus the question of covariate quantization that allows for accurate subsequent recovery remains unsolved. Note that this question is meaningful especially when the problem is interpreted as sparse linear regression: working with low-precision data in some (distributed) learning systems could significantly reduce communication cost and power consumption [96, 43], as we will further demonstrate in Section 4.1. Therefore, it is of interest to understand how covariate quantization affects the learning of the parameter. To the best of our knowledge, the only existing rigorous guarantees for quantized compressed sensing involving covariate quantization were obtained in [21, Thms. 7-8]. However, these results require the covariance matrix of the covariate to be sparse [21, Assumption 3] (in order to employ their sparse covariance matrix estimator), and this assumption is non-standard in sparse linear regression and compressed sensing. (In fact, although isotropic sensing vectors have been conventional in compressed sensing, many results in the literature can be extended to sensing vectors with a general unknown covariance matrix and hence do not really rely on this sparsity.)
Our Contribution. Besides the above main contributions, we establish estimation guarantees for quantized compressed sensing under covariate quantization that are free of the non-standard sparsity assumption on the covariance matrix. Like [21], our estimation methods are built upon the quantized covariance matrix estimator developed in Section 3.1; but unlike [21], which relies on this sparsity to ensure convexity, we instead deal with a non-convex program with an additional -norm constraint, under which we prove that all local minimizers deliver near minimax estimation errors (Theorems 9-10). Our analysis bears resemblance to a line of works on non-convex M-estimators [63, 62, 64] but also exhibits some essential differences (Remark 5). Further, we extract our techniques as a deterministic framework (Proposition 1) and then use it to establish guarantees under dithered 1-bit quantization and covariate quantization as byproducts (Theorems 11-12), which are comparable to [21, Thms. 7-8] but free of the sparsity assumption.
1.1.5 Uniform Signal Recovery in Quantized Compressed Sensing
It is standard in compressed sensing to leverage a random sensing matrix, so a recovery guarantee can be uniform or non-uniform. More precisely, a uniform guarantee ensures the recovery of all structured signals of interest with a single draw of the sensing ensemble, while a non-uniform guarantee is only valid for a structured signal fixed before drawing the random ensemble, with the implication that a new realization of the sensing matrix is required for sensing a new signal. Uniformity is a highly desired property in compressed sensing, since in applications the measurement ensemble is typically fixed and is expected to work for all signals [40]. Besides, the derivation of a uniform guarantee is often significantly harder than a non-uniform one, making uniformity an interesting theoretical problem in its own right.
A classical fact in linear compressed sensing is that the restricted isometry property (RIP) of the sensing matrix implies uniform recovery of all sparse signals (e.g., [38]), but this is not the case when it comes to nonlinear compressed sensing models, for which uniform recovery guarantees are still being actively pursued. For instance, in the specific quantization models involving 1-bit/uniform quantization with/without dithering, or the more general single index model with a possibly unknown link function, most representative results are non-uniform (e.g., [75, 79, 78, 89, 94, 87]). We refer to [75, 32, 94, 50] for concrete uniform guarantees, some of which remain (near) optimal (e.g., [94, Sect. 7.2A]), while others suffer from essential degradation compared to the non-uniform ones (e.g., [75, Thm. 1.3]). It is worth noting the interesting recent work [40], which provided a unified approach to uniform guarantees for a series of non-linear models; however, without the aid of some non-trivial embedding result, their uniform guarantees typically exhibit a decaying rate of that is slower than the non-uniform one of (Section 4 therein). Turning back to our focus of compressed sensing from quantized heavy-tailed data, the results in [21, Sect. III] are non-uniform, while [32, Thm. 1.11] presents a sharp uniform guarantee for the intractable program of Hamming distance minimization under dithered 1-bit quantization.
Our Contribution. We additionally contribute to the literature a uniform guarantee for the constrained Lasso under dithered uniform quantization of the heavy-tailed response. Specifically, we upgrade our non-uniform Theorem 5 to its uniform version, Theorem 13, which states that using a single realization of the sub-Gaussian sensing matrix, heavy-tailed noise and uniform dither, all -sparse signals within an -ball can be uniformly recovered up to an -norm error of , thus matching the near minimax non-uniform rate in Theorem 5 up to logarithmic factors. The proof relies on a concentration inequality for product processes [67] and a careful covering argument inspired by [94]. Due to the heavy-tailed noise, a new treatment is needed before invoking the concentration result from [67].
1.2 Outline
The remainder of this paper is structured as follows. We provide the notation and preliminaries in Section 2. We present the first set of main results (concerning the (near) optimal guarantees for three estimation problems under quantized heavy-tailed data) in Section 3. Our second set of results (concerning covariate quantization and uniform recovery in quantized compressed sensing) is then presented in Section 4. To corroborate our theory, numerical results on synthetic data are reported in Section 5. We give some remarks to conclude the paper in Section 6. All the proofs are postponed to the Appendices.
2 Preliminaries
We adopt the following conventions throughout the paper:
1) We use boldface symbols (e.g., , ) to denote matrices and vectors, and regular letters (e.g., , ) for scalars. We write for a positive integer . We denote the complex unit by . The -th entry of a vector (likewise, , ) is denoted by (likewise, , ).
2) Notation with “” as superscript denotes the desired underlying parameter or signal, e.g., , . Moreover, notation marked by a tilde (e.g., ) and a dot (e.g., ) stands for the truncated data and quantized data, respectively.
3) We reserve , for the problem dimension and sample size, respectively. In many cases denotes the estimation error, e.g., if is the estimator for the desired signal . We use to denote the set of -dimensional -sparse signals.
4) For vector , we work with its transpose , -norm (), max norm . We define the standard Euclidean sphere as .
5) For matrix with singular values , recall the operator norm , Frobenius norm , nuclear norm , and max norm . (resp. ) stands for the minimum eigenvalue (resp. maximum eigenvalue) of a symmetric .
6) We denote universal constants by , , and , whose value may vary from line to line. We write or if . Conversely, if we write or . Also, we write if and simultaneously hold.
7) We use to denote the uniform distribution over , to denote the Gaussian distribution with mean and covariance , and to denote the Student's t distribution with degrees of freedom .
8) Our technique to handle heavy-tailedness is a data truncation step, for which we introduce the truncation operator with some threshold . It is defined as for . To truncate a vector we apply the operator entry-wise in most cases, with the exception of covariance matrix estimation under the operator norm error (Theorem 3).
9) is the uniform quantizer with quantization level . It applies to a scalar via , and we set . Given a threshold , the hard thresholding of a scalar is . Both functions apply element-wise to vectors and matrices.
2.1 High-Dimensional Statistics
Let be a real random variable. We first present some basic facts about sub-Gaussian and sub-exponential random variables, and then precisely formulate the heavy-tailed distributions considered in this work.
1) The sub-Gaussian norm of a random variable $X$ is defined as $\|X\|_{\psi_2} := \inf\{t>0 : \mathbb{E}\exp(X^2/t^2) \leq 2\}$. A random variable with finite $\|X\|_{\psi_2}$ is said to be sub-Gaussian. Analogously to a Gaussian variable, a sub-Gaussian random variable exhibits an exponentially-decaying probability tail and satisfies a moment constraint:
(2.1) $\mathbb{P}(|X| \geq t) \leq 2\exp\big(-\tfrac{ct^2}{\|X\|_{\psi_2}^2}\big)$ for all $t \geq 0$;
(2.2) $\big(\mathbb{E}|X|^p\big)^{1/p} \leq C\|X\|_{\psi_2}\sqrt{p}$ for all $p \geq 1$.
Note that these two properties can also be used to define $\|X\|_{\psi_2}$ up to a multiplicative constant (see [92, Prop. 2.5.2]). For a $d$-dimensional random vector $\boldsymbol{x}$ we define its sub-Gaussian norm as $\|\boldsymbol{x}\|_{\psi_2} := \sup_{\boldsymbol{v}\in\mathbb{S}^{d-1}}\|\boldsymbol{v}^\top\boldsymbol{x}\|_{\psi_2}$.
2) The sub-exponential norm is defined as $\|X\|_{\psi_1} := \inf\{t>0 : \mathbb{E}\exp(|X|/t) \leq 2\}$, and $X$ is sub-exponential if $\|X\|_{\psi_1} < \infty$. A sub-exponential $X$ satisfies the following properties:
(2.3) $\mathbb{P}(|X| \geq t) \leq 2\exp\big(-\tfrac{ct}{\|X\|_{\psi_1}}\big)$ for all $t \geq 0$, and $\big(\mathbb{E}|X|^p\big)^{1/p} \leq C\|X\|_{\psi_1}\,p$ for all $p \geq 1$.
To relate $\|\cdot\|_{\psi_1}$ and $\|\cdot\|_{\psi_2}$, one has $\|XY\|_{\psi_1} \leq \|X\|_{\psi_2}\|Y\|_{\psi_2}$ [92, Lem. 2.7.7].
3) In contrast to the moment constraints in (2.2) and (2.3), heavy-tailed distributions in this work are only assumed to have bounded moments of some small order no greater than 4, formulated for a random variable as a bound on its moment of that order. Following [58, Def. 2.4, 2.5], we consider the following two moment assumptions for a heavy-tailed random vector:
• Marginal Moment Constraint. The weaker assumption, which constrains the moment of each coordinate of the random vector.
• Joint Moment Constraint. The stronger assumption, which constrains the moments of the one-dimensional marginals "toward all directions," i.e., of the projections of the random vector onto all unit vectors.
2.2 Dithered Uniform Quantization
In this part, we describe the dithered uniform quantizer and its properties in detail. We also specify the choices of random dither in this work.
1) We first provide the detailed procedure of dithered quantization and its general properties. Let be the input signal with dimension whose entries may be random and dependent. Independently of the input, we generate the random dither with i.i.d. entries from some distribution (throughout this work, we suppose that a random dither is drawn independently of anything else, particularly the signal to be quantized and other dithers, and that the dither has i.i.d. entries if it is a vector), and then quantize the dithered signal. Following [41], we refer to as the quantization error, and as the quantization noise. The principal properties of dithered quantization are provided in Theorem 1.
Theorem 1.
(Adapted from [41, Thms. 1-2]). Consider the dithered uniform quantization described above for the input signal , with random dither , quantization error and quantization noise . Use to denote the imaginary unit, and let be the random variable having the same distribution as the random dither .
(a) (Quantization Error). If satisfies for all non-zero integer , then is independent of .999Although the statement is a bit different, it can be implied by [41, Thm. 1] and the proof therein.
(b) (Quantization Noise). Assume that is independent of . Let . Given positive integer , if the -th order derivative satisfies for all non-zero integer , then the -th conditional moment of does not depend on : .
We note that Theorem 1 serves as the cornerstone of our analysis of the dithered uniform quantizer; for instance, (a) allows for applications of concentration inequalities in our analyses, and (b) inspires us to develop a covariance matrix estimator from quantized samples. The take-home message is that adding an appropriate dither before quantization can make the quantization error and quantization noise behave in a statistically nice manner. For example, the elementary form of Theorem 1(a) is that, under a dither satisfying the condition there, the quantization error follows $\mathscr{U}[-\frac{\Delta}{2},\frac{\Delta}{2}]$ under any given scalar input [41, Lem. 1].
2) We use a uniform dither for quantization of the response in compressed sensing and matrix completion. More specifically, under quantization level $\Delta$, we adopt the uniform dither $\tau \sim \mathscr{U}[-\frac{\Delta}{2},\frac{\Delta}{2}]$ for the response, which is also a common choice in previous works (e.g., [89, 94, 32, 50]). For such $\tau$, it can be calculated that
(2.4) $f_\tau(t) := \mathbb{E}\exp(\mathrm{i}t\tau) = \frac{\sin(\Delta t/2)}{\Delta t/2},$
and hence $f_\tau\big(\frac{2\pi l}{\Delta}\big) = 0$ holds for all non-zero integers $l$. Therefore, the benefit of using the uniform dither is that the quantization errors i.i.d. follow $\mathscr{U}[-\frac{\Delta}{2},\frac{\Delta}{2}]$ and are independent of the inputs.
3) We use a triangular dither for quantization of the covariate, i.e., the sample in covariance estimation or the covariate in compressed sensing. Particularly, when considering the uniform quantizer for the covariate, we adopt a dither whose entries are the sum of two independent $\mathscr{U}[-\frac{\Delta}{2},\frac{\Delta}{2}]$ variables, referred to as a triangular dither [41] (an equivalent statement is that the entries are i.i.d. from the triangular distribution on $[-\Delta,\Delta]$; the equivalence can be clearly seen by comparing the probability density functions). Simple calculations based on (2.4) verify that the triangular dither respects not only the condition in Theorem 1(a) but also the one in Theorem 1(b) for second moments: writing the triangular dither as the sum of two independent uniform dithers, the required conditions can be checked at all non-zero integer multiples of $\frac{2\pi}{\Delta}$. Thus, at the cost of a dithering variance larger than that of the uniform dither, the triangular dither brings the additional nice property of a signal-independent variance for the quantization noise: each entry of the quantization noise has (conditional) variance $\frac{\Delta^2}{4}$.
To the best of our knowledge, the triangular dither is new to the literature of quantized compressed sensing. We will explain why it is necessary when covariance estimation is involved; this is also complemented by numerical simulation (see Figure 5(a)).
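As a quick sanity check of this property, the following minimal simulation (assuming a quantizer that rounds to the nearest grid point) shows that the empirical variance of the quantization noise under a triangular dither stays close to $\frac{\Delta^2}{4}$ regardless of the input, whereas under a uniform dither it varies with the input:

```python
import numpy as np

rng = np.random.default_rng(1)
delta = 0.5

def uniform_quantize(x, delta):
    return delta * np.floor(x / delta + 0.5)

def quantization_noise(x, delta, triangular):
    """Return xi = Q(x + dither) - x for a uniform or triangular dither."""
    u1 = rng.uniform(-delta / 2, delta / 2, size=x.shape)
    u2 = rng.uniform(-delta / 2, delta / 2, size=x.shape)
    dither = u1 + u2 if triangular else u1
    return uniform_quantize(x + dither, delta) - x

# Two very different constant inputs: with the triangular dither the noise
# variance is approximately delta**2 / 4 for both (signal-independent);
# with the uniform dither it clearly depends on the input.
for signal in (np.zeros(200_000), 0.37 * np.ones(200_000)):
    var_tri = np.var(quantization_noise(signal, delta, triangular=True))
    var_uni = np.var(quantization_noise(signal, delta, triangular=False))
    print(f"input={signal[0]:.2f}  triangular: {var_tri:.4f}  "
          f"uniform: {var_uni:.4f}  delta^2/4={delta**2 / 4:.4f}")
```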
3 (Near) Minimax Error Rates
In this section we derive (near) optimal error rates for several canonical statistical estimation problems. Our novelty is that by using the proposed quantization scheme for heavy-tailed data, (near) optimal error rates could be achieved by computationally feasible estimators.
3.1 Quantized Covariance Matrix Estimation
Given as i.i.d. copies of a zero-mean random vector , one often encounters the covariance matrix estimation problem, i.e., to estimate the covariance matrix. This estimation problem is of fundamental importance in multivariate analysis and has attracted much research interest (e.g., [12, 11, 13, 4]). However, the practically useful setting (e.g., in a massive MIMO system [95]) where the samples undergo a certain quantization process remains under-developed, for which we are only aware of the 1-bit quantization results in [31, 21]. This setting poses the problem of quantized covariance matrix estimation (QCME), in which one aims to design a quantization scheme for the samples that allows for accurate estimation of the covariance matrix based only on the quantized samples. We consider heavy-tailed samples possessing bounded fourth moments either marginally or jointly, but note that our estimation methods and theoretical results appear to be new even for sub-Gaussian samples (Remark 1).
As introduced before, we overcome the heavy-tailedness of the samples by a data truncation step, i.e., we first truncate each sample $\boldsymbol{x}_i$ to $\tilde{\boldsymbol{x}}_i$ in order to make the outliers less influential. Here, we defer the precise definition of the truncation to the concrete results because it should be well suited to the error metric. After the truncation, we dither and quantize $\tilde{\boldsymbol{x}}_i$ to $\dot{\boldsymbol{x}}_i$ with a triangular dither $\boldsymbol{\tau}_i$. Different from the uniform dither adopted in the literature (e.g., [21, 89, 94, 32, 50]), let us first explain our choice of the triangular dither. Recall that the quantization noise and quantization error are respectively defined as $\boldsymbol{\xi}_i = \dot{\boldsymbol{x}}_i - \tilde{\boldsymbol{x}}_i$ and $\boldsymbol{w}_i = \dot{\boldsymbol{x}}_i - (\tilde{\boldsymbol{x}}_i + \boldsymbol{\tau}_i)$, thus giving $\dot{\boldsymbol{x}}_i = \tilde{\boldsymbol{x}}_i + \boldsymbol{\xi}_i$ with $\boldsymbol{\xi}_i = \boldsymbol{w}_i + \boldsymbol{\tau}_i$. Under uniform or triangular dither, $\boldsymbol{w}_i$ is independent of $\tilde{\boldsymbol{x}}_i$ and follows $\mathscr{U}[-\frac{\Delta}{2},\frac{\Delta}{2}]^d$ (see Section 2.2), thus allowing us to calculate that
(3.1) $\mathbb{E}\big[\dot{\boldsymbol{x}}_i\dot{\boldsymbol{x}}_i^\top\big] = \mathbb{E}\big[(\tilde{\boldsymbol{x}}_i+\boldsymbol{\xi}_i)(\tilde{\boldsymbol{x}}_i+\boldsymbol{\xi}_i)^\top\big] \overset{(i)}{=} \mathbb{E}\big[\tilde{\boldsymbol{x}}_i\tilde{\boldsymbol{x}}_i^\top\big] + \mathbb{E}\big[\boldsymbol{\xi}_i\boldsymbol{\xi}_i^\top\big].$
Note that $(i)$ holds because the cross terms vanish, due to the previously noted facts that $\boldsymbol{w}_i$ and $\boldsymbol{\tau}_i$ are zero-mean and independent of $\tilde{\boldsymbol{x}}_i$. While $\mathbb{E}[\tilde{\boldsymbol{x}}_i\tilde{\boldsymbol{x}}_i^\top]$ with a suitable choice of the truncation threshold is expected to well approximate the covariance matrix of $\boldsymbol{x}_i$, the remaining term $\mathbb{E}[\boldsymbol{\xi}_i\boldsymbol{\xi}_i^\top]$ gives rise to a constant bias. To address the issue, a straightforward idea is to remove the bias, which requires full knowledge of $\mathbb{E}[\boldsymbol{\xi}_i\boldsymbol{\xi}_i^\top]$, i.e., the covariance matrix of the quantization noise. For the off-diagonal entries, because conditionally on $\tilde{\boldsymbol{x}}_i$ the coordinates of the quantization noise are independent and zero-mean, we have $\mathbb{E}[\xi_{ij}\xi_{ik}] = 0$ for $j\neq k$, showing that $\mathbb{E}[\boldsymbol{\xi}_i\boldsymbol{\xi}_i^\top]$ is diagonal. Moreover, under the triangular dither the diagonal entries are also known, being equal to $\frac{\Delta^2}{4}$ (see Section 2.2). Taken collectively, we arrive at
(3.2) $\mathbb{E}\big[\boldsymbol{\xi}_i\boldsymbol{\xi}_i^\top\big] = \frac{\Delta^2}{4}\,\boldsymbol{I}_d.$
Based on (3.1) and (3.2), we thus propose the following estimator
(3.3) $\hat{\boldsymbol{\Sigma}} := \frac{1}{n}\sum_{i=1}^n \dot{\boldsymbol{x}}_i\dot{\boldsymbol{x}}_i^\top - \frac{\Delta^2}{4}\,\boldsymbol{I}_d,$
which is the sample covariance of the quantized samples followed by a bias-correction step. On the other hand, the reason why the standard uniform dither is not suitable for QCME becomes self-evident: the diagonal of the quantization-noise covariance then remains unknown (it depends on the input signal, see [41, Page 3]) and hence there is no hope of precisely removing the bias.
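The following is a minimal sketch of this estimator under the reconstruction above (element-wise truncation, triangular dither, correction by $\frac{\Delta^2}{4}$ times the identity); the variable names and the toy parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def uniform_quantize(x, delta):
    return delta * np.floor(x / delta + 0.5)

def quantized_cov_estimator(X, tau, delta):
    """Covariance estimate from truncated, triangular-dithered, uniformly quantized samples.

    X: (n, d) array of raw samples; tau: element-wise truncation threshold;
    delta: quantization level.
    """
    n, d = X.shape
    X_trunc = np.clip(X, -tau, tau)                        # truncation
    dither = (rng.uniform(-delta / 2, delta / 2, (n, d))   # triangular dither
              + rng.uniform(-delta / 2, delta / 2, (n, d)))
    X_dot = uniform_quantize(X_trunc + dither, delta)      # uniform quantization
    # Sample covariance of the quantized samples, then remove the known
    # quantization-noise covariance (delta^2 / 4) * I.
    return X_dot.T @ X_dot / n - (delta ** 2 / 4) * np.eye(d)

# Toy check: heavy-tailed t(5) coordinates, true covariance (5/3) * I.
n, d = 20_000, 5
X = rng.standard_t(df=5, size=(n, d))
Sigma_hat = quantized_cov_estimator(X, tau=8.0, delta=0.5)
print(np.max(np.abs(Sigma_hat - (5 / 3) * np.eye(d))))
```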
We are now ready to present error bounds for the proposed estimator under the max-norm and the operator norm. We will also investigate the high-dimensional setting by assuming a sparse structure on the covariance matrix, for which we propose a thresholding estimator. More concretely, our first result provides the error rate under the max-norm, in which we assume the samples satisfy the marginal fourth moment constraint and utilize an element-wise truncation.
Theorem 2.
(Element-Wise Error). Given and , we consider the problem of QCME described above. We suppose that s are i.i.d. zero-mean and satisfy the marginal moment constraint for any , where is the -th entry of . We truncate to with threshold , then quantize to with triangular dither . If , then the estimator in (3.3) satisfies
where .
Notably, despite the heavy-tailedness and quantization, the estimator achieves an element-wise rate coincident with the one for the sub-Gaussian case. One can clearly see that the quantization level only enters the multiplicative factor. Thus, the information loss incurred by quantization is inessential in that it does not affect the key scaling law but only slightly worsens the leading factor. These remarks on the (near) optimality and the information loss incurred by quantization remain valid in our subsequent theorems.
Our next result concerns the operator norm estimation error, under which we impose the stronger joint moment constraint and truncate with respect to the -norm, i.e., for some threshold . After the dithered uniform quantization, we still define the estimator as in (3.3).
Theorem 3.
(Operator Norm Error). Given and , we consider the problem of QCME described above. Suppose that the i.i.d. zero-mean s satisfy for any . We truncate to with threshold , then quantize to with triangular dither . If , then the estimator in (3.3) satisfies
with .
The operator norm error rate in Theorem 3 is near minimax optimal, e.g., compared to the lower bound in [35, Thm. 7], which states that for any estimator of the positive semi-definite matrix based on i.i.d. zero-mean with covariance matrix , there exists some such that , where . Again, the quantization only affects the multiplicative factor . Nevertheless, one still needs (at least) to achieve small operator norm error. In fact, in a high-dimensional setting where may exceed , even the sample covariance for sub-Gaussian zero-mean may have extremely bad performance. To achieve small operator norm error in a high-dimensional regime, we resort to additional structure on , and specifically we use column-wise sparsity as an example, which corresponds to the situations where dependencies among different coordinates are weak. Based on the estimator in Theorem 2, we further invoke a thresholding regularization [4, 12] to promote sparsity.
Theorem 4.
(Sparse QCME). Under conditions and estimator in Theorem 2, we additionally assume that all columns of are -sparse and consider the thresholding estimator for some (recall that for ). If with sufficiently large , then satisfies
where .
Notably, our estimator achieves minimax rates under operator norm, e.g., compared to the minimax lower bound derived in [12, Thm. 2], which states that (under some regular scaling) for any covariance estimator based on i.i.d. samples of where is the true covariance matrix, there exists some covariance matrix with -sparse columns such that .
To analyse the thresholding estimator, our proof resembles the ones developed in prior works (e.g., [12]) but requires more effort, e.g., bounding the additional bias terms arising from the data truncation and quantization. We also point out that the results for the full-data unquantized regime immediately follow by setting , thus Theorems 2-3 represent a strict extension of [35, Sect. 4], and Theorem 4 complements [35] with a high-dimensional sparse setting.
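For the sparse setting of Theorem 4, the extra ingredient is an element-wise hard-thresholding step applied to the estimate from the previous sketch; a brief sketch follows (the threshold constant is an arbitrary illustrative choice).

```python
import numpy as np

def hard_threshold(Sigma_hat, mu):
    """Element-wise hard thresholding: zero out entries with magnitude below mu."""
    return Sigma_hat * (np.abs(Sigma_hat) >= mu)

# With Sigma_hat, n, d from the previous sketch, and a threshold proportional to
# sqrt(log(d) / n) (the order suggested by Theorem 4; the constant 2.0 is arbitrary):
# mu = 2.0 * np.sqrt(np.log(d) / n)
# Sigma_hat_sparse = hard_threshold(Sigma_hat, mu)
```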
Remark 1.
(Sub-Gaussian Case). While we concentrate on quantization of heavy-tailed data in this work, our results can be readily adjusted to sub-Gaussian , for which the truncation step is inessential and can be removed (i.e., ). These results are also new to the literature but will not be presented here.
3.2 Quantized Compressed Sensing
We consider the linear model
(3.4) |
where s are the covariates, s are the responses, and is the sparse signal in compressed sensing or the sparse parameter vector in high-dimensional linear regression that we want to estimate. In the quantized compressed sensing (QCS) problem, we are interested in developing quantization schemes for the data (mainly for the responses in prior works) that enable accurate recovery of the signal based on the quantized data.
In spite of the same mathematical formulation, there are some important differences between compressed sensing and sparse linear regression that we should clarify first. Specifically, different from sensing vectors in compressed sensing, which are generated by some analog measuring device and can oftentimes be designed, the covariates in sparse linear regression represent sample data from certain datasets that are believed to affect the responses through (3.4). While sparsity is arguably the most classical signal structure for compressed sensing, due to its good interpretability it is also commonly adopted to achieve dimension reduction in high-dimensional statistics. In this work, we are interested in both problem settings. Thus, we do not adopt the isotropic convention from compressed sensing but instead deal with covariates having a general unknown covariance matrix. While the study of quantization and heavy-tailed noise is meaningful in both settings, we note that some of our subsequent results are mainly of interest to the specific sensing or regression problem. For instance, the heavy-tailed covariate considered in Theorem 6 is primarily motivated by the regression setting, in which the covariate may come from a dataset that exhibits much heavier tails than sub-Gaussian data. Moreover, as will be elaborated in Section 4 when appropriate, our subsequent results on covariate quantization (resp., uniform signal recovery) may prove more useful for the regression problem (resp., compressed sensing problem).
To fix ideas, we assume that the covariates are i.i.d. drawn from some multivariate distribution, the noises are i.i.d. and independent of the covariates, and we truncate the response and then quantize it with a uniform dither. Under these statistical assumptions and dithered quantization, near optimal recovery guarantees have been established in [89, 94] for the regime where both the covariate and the noise are drawn from sub-Gaussian distributions (hence the truncation is not needed). In contrast, our focus is on quantization of heavy-tailed data. Particularly, we always assume that the noises are i.i.d. drawn from some heavy-tailed distribution, resulting in heavy-tailed responses. We will separately deal with the case of sub-Gaussian covariates and a more challenging situation where the covariates are also heavy-tailed.
To estimate the sparse signal, a classical approach is via the regularized M-estimator known as the Lasso [90, 70, 72]
$\min_{\boldsymbol{\theta}} \ \frac{1}{2n}\sum_{i=1}^n \big(y_i - \boldsymbol{x}_i^\top\boldsymbol{\theta}\big)^2 + \lambda\|\boldsymbol{\theta}\|_1,$
whose objective combines the $\ell_2$-loss for data fidelity and the $\ell_1$-norm that encourages sparsity. Because we can only access the quantized responses (or even quantized covariates if covariate quantization is involved, see Section 4), the main issue lies in the $\ell_2$-loss that requires the unquantized data. To resolve the issue, we calculate the expected $\ell_2$-loss:
(3.5) $\mathbb{E}\Big[\frac{1}{2}\big(y_i - \boldsymbol{x}_i^\top\boldsymbol{\theta}\big)^2\Big] \overset{(i)}{=} \frac{1}{2}\boldsymbol{\theta}^\top\boldsymbol{\Sigma}^\star\boldsymbol{\theta} - \big\langle \boldsymbol{b}^\star, \boldsymbol{\theta}\big\rangle,$
where $(i)$ holds up to an inessential constant $\frac{1}{2}\mathbb{E}[y_i^2]$, and in $(i)$ we let $\boldsymbol{\Sigma}^\star = \mathbb{E}[\boldsymbol{x}_i\boldsymbol{x}_i^\top]$ and $\boldsymbol{b}^\star = \mathbb{E}[y_i\boldsymbol{x}_i]$. This inspires us to generalize the loss to $\frac{1}{2}\boldsymbol{\theta}^\top\boldsymbol{Q}\boldsymbol{\theta} - \langle\boldsymbol{b},\boldsymbol{\theta}\rangle$ and consider the following program
(3.6) $\hat{\boldsymbol{\theta}} \in \mathop{\arg\min}_{\boldsymbol{\theta}\in\mathcal{S}} \ \frac{1}{2}\boldsymbol{\theta}^\top\boldsymbol{Q}\boldsymbol{\theta} - \big\langle \boldsymbol{b}, \boldsymbol{\theta}\big\rangle + \lambda\|\boldsymbol{\theta}\|_1.$
Compared to (3.5), we will use a pair $(\boldsymbol{Q},\boldsymbol{b})$ that well approximates $(\boldsymbol{\Sigma}^\star,\boldsymbol{b}^\star)$, and we also introduce the constraint $\boldsymbol{\theta}\in\mathcal{S}$ to allow more flexibility. It is important to note that this is the general strategy in this work to design estimators in different QCS settings; see more discussions in Remark 3.
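A minimal proximal-gradient (ISTA) sketch for a program of the form (3.6) is given below; the additional constraint in (3.6) is omitted here for simplicity, and the surrogate choices in the usage comment (raw covariates, quantized responses, cf. Theorem 5) as well as the step-size rule are illustrative assumptions rather than the paper's prescriptions.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def quantized_lasso(Q, b, lam, n_iter=500):
    """ISTA for: min_theta 0.5 * theta^T Q theta - <b, theta> + lam * ||theta||_1.

    Q: surrogate second-moment matrix; b: surrogate cross-moment vector.
    """
    theta = np.zeros(b.shape[0])
    # Step size from the largest eigenvalue of the (symmetrized) quadratic part.
    L = max(np.linalg.eigvalsh((Q + Q.T) / 2).max(), 1e-12)
    for _ in range(n_iter):
        grad = Q @ theta - b
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

# Example surrogates when only the responses are quantized (cf. Theorem 5):
#   Q = X.T @ X / n        # raw (or truncated) covariates
#   b = X.T @ y_dot / n    # dithered, quantized responses y_dot
# theta_hat = quantized_lasso(Q, b, lam=0.1)
```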
The next theorem is concerned with QCS under sub-Gaussian covariate but heavy-tailed response. Note that the heavy-tailedness of stems from the noise distribution assumed to have bounded moment ( in the theorem statement), but following [35, 21, 97] we directly impose the moment constraint on the response.
Theorem 5.
(Sub-Gaussian Covariate, Heavy-Tailed Response). Given some , in (3.4) we suppose that s are i.i.d., zero-mean sub-Gaussian with , for some where , is -sparse, the noise s are i.i.d. heavy-tailed and independent of s, and we assume for some fixed . In the quantization, we truncate to with threshold , then quantize to with uniform dither . For recovery, we define the estimator as (3.6) with , , . We set with sufficiently large . If for some hidden constant only depending on , then with probability at least , the estimation error satisfies
where .
The rate for -norm error is minimax optimal up to logarithmic factor (e.g., compared to [81]). Note that a random noise bounded by roughly contributes to , and the latter is bounded by ; because in the error bound and almost play the same role, the effect of uniform quantization can be readily interpreted as an additional bounded noise, analogously to the error rate in [87].
Next, we switch to the more challenging situation where both the covariate and the response are heavy-tailed, assuming that they both possess bounded fourth moments (a marginal moment constraint for the covariate). The consideration of this setting is motivated by sparse linear regression, where the covariates may oftentimes exhibit heavy-tailed behaviour. Specifically, we truncate the covariates element-wise and set $\boldsymbol{Q}$ as a robust covariance matrix estimator, whose estimation performance under the max-norm follows immediately from Theorem 2 by setting the quantization level therein to zero.
Theorem 6.
(Heavy-Tailed Covariate, Heavy-Tailed Response). Given some , , in (3.4) we suppose that s are i.i.d. zero-mean satisfying a marginal fourth moment constraint , for some where , satisfies , the noise s are i.i.d. heavy-tailed and independent of s, and we assume . In the quantization, we truncate respectively to with , then we quantize to with uniform dither . For recovery, we define the estimator as (3.6) with , , . We set with sufficiently large . If for some hidden constant only depending on , then with probability at least , the estimation error satisfies
where .
Theorem 6 generalizes [35, Thm. 2(b)] to the uniform quantization setting. Clearly, the obtained rate remains near minimax optimal if is of minor scaling (e.g., bounded or logarithmic factors). Nevertheless, such near optimality in Theorem 6 comes at the cost of more restricted conditions and stronger scaling, as remarked in the following.
Remark 2.
(Comparing Theorems 5-6). Compared with Theorem 5, the first downside of Theorem 6 is the sub-optimal sample complexity , and note that a similar condition is also required in [35, Thm. 2(b)]. Indeed, it can be improved to by explicitly adding the constraint to the recovery program, as will be noted as an interesting side finding in Remark 6. Secondly, following [35] we impose an -norm constraint that is stronger than the one used in the proof of Theorem 5. In fact, when replacing the constraint in Theorem 6 with an -norm bound , our proof technique leads to an error rate that exhibits a worse dependence on .
Remark 3.
(Modification of the -loss). Recall that we generalize the regular -loss to the quadratic loss in (3.6). Note that the choice of $\boldsymbol{b}$ in Theorem 5 is tantamount to using the loss function that replaces the response with the quantized response; this idea is analogous to the generalized Lasso investigated for the single index model [78] and the dithered quantized model [89], and will be used again in quantized matrix completion, see (3.8) below. However, our generalized -loss provides more flexibility to deal with heavy-tailedness or quantization of the covariate; e.g., the choice in Theorem 6 amounts to adopting the truncated covariates in the loss function, and under quantized covariates more delicate modifications are required in Theorems 9-12, which is beyond the scope of prior works on the generalized Lasso.
3.3 Quantized Matrix Completion
Completing a low-rank matrix from only a partial observation of its entries is known as the matrix completion problem, which has found many applications including recommendation systems, image inpainting, and quantum state tomography [19, 27, 2, 74, 42], to name just a few. Mathematically, letting be the underlying matrix satisfying , the matrix completion problem can be formulated as
(3.7) $y_i = \big\langle \boldsymbol{X}_i, \boldsymbol{\Theta}^\star \big\rangle + \epsilon_i, \quad i = 1,\dots,n,$
where the $\boldsymbol{X}_i$s are random measurement matrices distributed on the set of standard basis matrices $\{\boldsymbol{e}_j\boldsymbol{e}_k^\top\}$ ($\boldsymbol{e}_j$ being the $j$-th column of the identity matrix), and $\epsilon_i$ is the observation noise. Note that for $\boldsymbol{X}_i = \boldsymbol{e}_j\boldsymbol{e}_k^\top$ one has $\langle \boldsymbol{X}_i, \boldsymbol{\Theta}^\star\rangle = \Theta^\star_{jk}$, so each observation is a noisy entry. Our main interest is in quantized matrix completion (QMC), where the goal is to design a quantizer for the observations that allows for accurate estimation of $\boldsymbol{\Theta}^\star$ from the quantized observations.
Unlike in compressed sensing, an additional condition (besides the low-rankness) on the underlying matrix is needed to ensure the well-posedness of the matrix completion problem. More specifically, certain incoherence conditions are required if we pursue exact recovery (e.g., [15, 82]), whereas a faithful estimation can be achieved as long as the underlying matrix is not overly spiky and is sufficiently diffuse (e.g., [51, 71]). The latter condition is also known as "low spikiness" and is formulated in [35, 71]; it has been noted to be necessary for the well-posedness of the matrix completion problem [27, 71]. In subsequent works the low-spikiness condition is often formulated as the simpler max-norm constraint [51, 19, 53, 26, 37].
In this work, we consider the uniform sampling scheme, but with a little more work our results generalize to more general sampling schemes [51]. We apply the proposed quantization scheme to the possibly heavy-tailed observations: we truncate $y_i$ to $\tilde{y}_i$ with some threshold, and then quantize $\tilde{y}_i$ to $\dot{y}_i$ with a uniform dither. Because we do not pursue exact recovery (which is impossible under quantization), we do not assume any incoherence condition like [82]. Instead, we only hope to accurately estimate $\boldsymbol{\Theta}^\star$, and following [51, 19, 53, 26, 37] we impose a max-norm constraint $\|\boldsymbol{\Theta}^\star\|_{\max}\le\alpha$.
Overall, we estimate from by the regularized M-estimator [70, 72]
(3.8) $\hat{\boldsymbol{\Theta}} \in \mathop{\arg\min}_{\|\boldsymbol{\Theta}\|_{\max}\le \alpha} \ \frac{1}{2n}\sum_{i=1}^n \big(\dot{y}_i - \langle \boldsymbol{X}_i, \boldsymbol{\Theta}\rangle\big)^2 + \lambda\|\boldsymbol{\Theta}\|_{\mathrm{nu}},$
which combines an $\ell_2$-loss and a nuclear norm regularizer.
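Below is a minimal sketch of one way to solve a program of the form (3.8) by proximal gradient descent, using singular-value soft-thresholding for the nuclear norm and a simple entry-wise clip as a heuristic surrogate for the max-norm constraint; all names and parameter choices are illustrative.

```python
import numpy as np

def svd_soft_threshold(M, t):
    """Prox of the nuclear norm: soft-threshold the singular values of M by t."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def quantized_matrix_completion(rows, cols, y_dot, shape, lam, alpha,
                                step=1.0, n_iter=300):
    """Proximal gradient for an l2 loss on the observed (quantized) entries plus a
    nuclear-norm penalty; |Theta_jk| <= alpha is enforced by clipping after each step."""
    d1, d2 = shape
    n = len(y_dot)
    Theta = np.zeros((d1, d2))
    for _ in range(n_iter):
        grad = np.zeros((d1, d2))
        residual = Theta[rows, cols] - y_dot
        np.add.at(grad, (rows, cols), residual / n)   # gradient of the sampled l2 loss
        Theta = svd_soft_threshold(Theta - step * grad, step * lam)
        Theta = np.clip(Theta, -alpha, alpha)         # max-norm constraint (heuristic)
    return Theta

# Usage sketch: rows, cols index the observed entries and y_dot holds the
# corresponding dithered, quantized observations.
# Theta_hat = quantized_matrix_completion(rows, cols, y_dot, (50, 40), lam=0.05, alpha=1.0)
```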
In the literature, there has been a line of works on 1-bit or multi-bit matrix completion related to the results to be presented [14, 59, 52, 17, 3]. While the referenced works commonly adopted a likelihood approach, our method is an essential departure and embraces some advantages; see a precise comparison in Remark 4. Considering such novelty, we include the result for sub-exponential noise in Theorem 7, for which the truncation of the response becomes unnecessary.
Theorem 7.
(QMC under Sub-Exponential Noise). Given some , in (3.7) we suppose that s are i.i.d. uniformly distributed over , satisfies and , the noise s are i.i.d. zero-mean sub-exponential satisfying , and are independent of s. In the quantization, we do not truncate the response but directly quantize it to with uniform dither . We choose with sufficiently large , and define as (3.8). If , then with probability at least , the estimation error satisfies
where .
By contrast, under heavy-tailed noise only assumed to have bounded variance, we truncate with a suitable threshold before the dithered quantization to achieve an optimal trade-off between bias and variance.
Theorem 8.
(QMC under Heavy-tailed Noise). Given some , we consider (3.7) in the setting of Theorem 7 but with the assumption replaced by . In the quantization, we truncate to with , and then quantize to with uniform dither . We choose with sufficiently large , and define as (3.8). If , then with probability at least , the estimation error satisfies
where .
Compared to the information-theoretic lower bounds in [71, 55], the error rates obtained in Theorems 7-8 are minimax optimal up to logarithmic factors. Specifically, Theorem 8 derives a near optimal guarantee for QMC with heavy-tailed observations, which is the key standpoint of this paper. Note that the 1-bit quantization counterparts of these two theorems were derived in our previous work [21]; in sharp contrast to Theorem 8, for 1-bit QMC under heavy-tailed noise, the error rate in [21, Thm. 13] is essentially slower; using the 1-bit observations therein, this slow error rate is indeed nearly tight due to the lower bound in [21, Thm. 14].
To close this section, we give a remark to illustrate the novelty and advantage of our QMC method by a careful comparison with prior works.
Remark 4.
QMC with 1-bit or multi-bit quantized observations has received considerable research interest [26, 14, 59, 52, 17, 3]. Adapted to our notation, these works studied the model under a general random dither and quantizer, and they commonly adopted regularized (or constrained) maximum likelihood estimation for estimating the underlying matrix. By contrast, with the random dither and quantizer specialized as above, our model takes a simple additive form. Thus, while suffering from less generality in the quantizer, our method embraces the advantage of robustness to the pre-quantization noise, whose distribution is unknown and can even be heavy-tailed. Note that such an unknown noise distribution evidently forbids the likelihood approach.
4 Covariate Quantization and Uniform Signal Recovery in Quantized Compressed Sensing
By now we have presented near optimal results in the contexts of QCME, QCS and QMC under heavy-tailed data that further undergo the proposed quantization scheme, which we position as the primary contribution of this work. In this section, we provide two additional developments that enhance our results on heavy-tailed QCS.
4.1 Covariate Quantization
In the area of QCS, almost all prior works merely focused on the quantization of the response; see the recent survey [29]. Here, we consider a setting of "complete quantization," meaning that the covariate is also quantized. To motivate this study, we interpret compressed sensing as sparse linear regression. Indeed, to reduce power consumption and computational cost, it is sometimes preferable to work with low-precision data in a machine learning system, e.g., the sample quantization scheme developed in [96] led to experimental success in training linear models. Also, it was shown that direct gradient quantization may not be efficient in certain distributed learning systems where the terminal nodes are connected to the server only through a very weak communication fabric and the number of parameters is extremely large; rather, quantizing and transmitting some important samples could provably reduce the communication cost [43]. In fact, the process of data collection may already appeal to quantization due to certain limits of the data acquisition device (e.g., a low-resolution analog-to-digital module used in distributed signal processing [25]). Our main goal is to understand how quantization of the covariates affects the subsequent recovery/learning process, particularly showing that the simple dithered uniform quantization scheme still allows for accurate estimators that may even achieve near minimax error rates. To the best of our knowledge, the only prior rigorous estimation guarantees for QCS with covariate quantization are [21, Thms. 7-8]; these two results require a restricted and unnatural assumption, which we will also relax later.
4.1.1 Multi-bit QCS with Quantized Covariate
Since we will also consider the 1-bit quantization, we more precisely refer to the QCS under uniform quantizer as multi-bit QCS. We will generalize Theorems 5-6 to covariate quantization in the next two theorems.
Let denote the quantized covariate-response pair; we first quickly sketch the idea of our approach. Specifically, we stick to the framework of the M-estimator in (3.6), which appeals to accurate surrogates for $\boldsymbol{\Sigma}^\star$ and $\boldsymbol{b}^\star$ based on the quantized data, where the covariate is now also quantized. Fortunately, such surrogates can be constructed analogously to our QCME estimator when a triangular dither is used for quantizing the covariate. Let us first state our quantization scheme as follows:
• Response Quantization. This is the same as in Theorems 5-6: we truncate the response and then quantize it with a uniform dither.
• Covariate Quantization. This is the same as in Theorem 2. We truncate to with threshold , and then quantize to with triangular dither and quantization level .
• Notation. We write the quantization noise as and , and the quantization error as and .
We will adopt the above notation in subsequent developments. Based on the quantized covariate-response pairs, we specify our estimator by setting the surrogate pair $(\boldsymbol{Q},\boldsymbol{b})$ in (3.6) as
(4.1) $\boldsymbol{Q} = \frac{1}{n}\sum_{i=1}^n \dot{\boldsymbol{x}}_i\dot{\boldsymbol{x}}_i^\top - \frac{\Delta^2}{4}\,\boldsymbol{I}_d, \qquad \boldsymbol{b} = \frac{1}{n}\sum_{i=1}^n \dot{y}_i\,\dot{\boldsymbol{x}}_i.$
Note that the choice of $\boldsymbol{Q}$ is due to the estimator in Theorem 2, while that of $\boldsymbol{b}$ is inspired by the calculation
where the last equality can be seen by conditioning on or . However, the issue is that is not positive semi-definite, hence the resulting program is non-convex. To explain this, note that the rank of does not exceed , so when at least eigenvalues of are . Alternatively, the non-convexity can also be seen from the observation that setting as in (4.1) is tantamount to replacing the regular -loss with
We mention that the lack of positive semi-definiteness of is problematic from both the statistical and the optimization viewpoints: 1) Statistically, Lemma 4, which is used to derive the error rates in Theorems 5-6, requires to be positive semi-definite and is hence no longer applicable here; 2) From the optimization side, it is in general unknown how to globally optimize a non-convex program.
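To illustrate the construction of the surrogates in (4.1) and the rank argument above, here is a small NumPy sketch. It assumes a QCME-style surrogate that subtracts a dither correction of Δ²/4 times the identity from the second moment of the quantized covariates; this correction term and all names are assumptions for illustration, not a verbatim transcription of (4.1).

```python
import numpy as np

def surrogate_moments(Q, y_dot, delta_x):
    """Plug-in surrogates built from quantized covariates Q (n x d) and responses y_dot."""
    n, d = Q.shape
    Sigma_hat = Q.T @ Q / n - (delta_x ** 2 / 4) * np.eye(d)  # surrogate second moment
    b_hat = Q.T @ y_dot / n                                   # surrogate cross moment
    return Sigma_hat, b_hat

rng = np.random.default_rng(0)
Q = rng.normal(size=(50, 200))                    # n = 50 quantized covariates, d = 200
Sigma_hat, _ = surrogate_moments(Q, rng.normal(size=50), delta_x=1.0)
print(np.linalg.eigvalsh(Sigma_hat).min())        # strictly negative: not PSD when n < d
```

With n smaller than d, at least d - n eigenvalues of the surrogate equal minus the correction, so the quadratic loss built from it is non-convex.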
Motivated by a line of previous works on non-convex M-estimators [63, 64, 62], we add an -norm constraint to (3.6) by setting , where represents a prior estimate of . Let be a subdifferential of at (thus, in (4.2) below should be understood as “there exists one element in such that (4.2) holds”). We consider the local minimizers of the proposed recovery program, whose existence is guaranteed by the additional -constraint; to distinguish them from the global minimizer in (3.6), we denote by the estimator in QCS with quantized covariate. More generally, our guarantee covers any that satisfies
(4.2)
We will prove a fairly strong guarantee stating that all satisfying (4.2) (including, of course, all local minimizers) enjoy a near minimax error rate. While this guarantee bears resemblance to the ones in [64], we point out that [64] only derived concrete results for the sub-Gaussian regime; because of the heavy-tailed data and quantization in our setting, some essentially different ingredients are required in the technical analysis (see Remark 5). As before, our results for the sub-Gaussian and heavy-tailed cases are presented separately.
Theorem 9.
(Quantized Sub-Gaussian Covariate). Given , , , we consider (3.4) with the same assumptions on as in Theorem 5, and additionally assume that . The quantization of is as described above, and we set , . For recovery, we let , , and set with sufficiently large . If for some hidden constant depending only on , then with probability at least , all satisfying (4.2) have estimation error bounded by
where .
Similarly, the next result extends Theorem 6 to a setting involving covariate quantization.
Theorem 10.
(Quantized Heavy-Tailed Covariate). Given , , , we consider (3.4) with the same assumptions on as Theorem 6. The quantization of is described above, and we set . For recovery, we let , , and set with sufficiently large . If for some hidden constant only depending on , then with probability at least , all satisfying (4.2) have estimation error bounded by
where .
Remark 5.
(Comparing Our Analyses with [64]). The above results are motivated by a line of works on non-convex M-estimators [63, 64, 62], and our guarantee for the whole set of stationary points satisfying (4.2) most resembles that of [64]. While the main strategy for proving Theorem 9 is adapted from [64], the proof of Theorem 10 involves an essentially different RSC condition, see our (B.9). In particular, compared with [64, equation (4)], the leading factor of in (B.9) degrades from to . To retain a near optimal rate, we need to impose a stronger scaling , with corresponding changes in the proof. Although Theorem 10 is presented for a concrete setting, it sheds light on an extension of [64] to a weaker RSC condition that can accommodate covariates with heavier tails. Such an extension is formally presented as a deterministic framework in Proposition 1.
Proposition 1.
Suppose that the -sparse satisfies and that the positive definite matrix satisfies . If for some the condition
(4.3)
holds with sufficiently large , then all satisfying (4.2) with have estimation error bounded by
By extracting the ingredients that guarantee the points satisfying (4.2) to be accurate, Proposition 1 is, interestingly, independent of the model assumption (3.4). In particular, we can set when applying Proposition 1 to (3.4). Compared with the framework of [64, Thm. 1], the key strength of Proposition 1 is that it does not explicitly assume an RSC condition on the loss function, which is hard to verify without assuming sub-Gaussian covariates. Instead, the role of the RSC assumption in [64] is now played by , which immediately yields a kind of RSC condition by a simple argument as in (B.10). Although this RSC condition is often essentially weaker than the conventional one in terms of the leading factor of (see Remark 5), along this line one can still derive non-trivial (or even near optimal) error rates. The gain of replacing the RSC assumption with is that the latter amounts to constructing an element-wise estimator for , which is often much easier for heavy-tailed covariates (e.g., thanks to the many existing robust covariance estimators).
We conclude this part with an interesting side observation.
Remark 6.
By setting , Theorem 10 produces a result (with a convex program) for the setting of Theorem 6. Interestingly, with the additional -constraint, a notable improvement is that the sub-optimal in Theorem 6 is sharpened to the near optimal one in Theorem 10. More concretely, this is because (ii) in (A.12) can be tightened to of (B.10). Going back to the full-data unquantized regime, Theorem 10 with recovers [35, Theorem 2(b)] with an improved sample complexity requirement.
4.1.2 1-bit QCS with Quantized Covariate
Our consideration of covariate quantization in QCS appears to be fairly new to the literature. To the best of our knowledge, the only related results are [21, Thms. 7-8] for QCS with 1-bit quantized covariates and responses. The assumption there, however, is quite restrictive: it is assumed that has sparse columns (see [21, Assumption 3]), which is non-standard in both compressed sensing and sparse linear regression. Departing momentarily from our focus on dithered uniform quantization, we consider QCS under dithered 1-bit quantization and apply Proposition 1 to derive results comparable to [21, Thms. 7-8] without resorting to the sparsity of .
We first review the 1-bit quantization scheme developed in [21]:
- • Response Quantization. We truncate to with some threshold , and then quantize to with uniform dither .
- • Covariate Quantization. We truncate to with some threshold , and then quantize to and , where are independent uniform dithers. (Note that we collect 2 bits per entry.) A code sketch of this scheme follows the list.
The following two results refine [21, Thms. 7-8] by deriving comparable error rates without using sparsity of .
Theorem 11.
(1-bit Quantized Sub-Gaussian Covariate). Given , we consider (3.4) where the -sparse satisfies , the s are i.i.d. zero-mean sub-Gaussian with , and satisfies for some ; the noise terms s are independent of the s and i.i.d. sub-Gaussian, while for simplicity we assume . In the quantization of described above, we set and . For recovery we let , , and set with sufficiently large . If , then with probability at least , all satisfying (4.2) have estimation error bounded by
Theorem 12.
(1-bit Quantized Heavy-Tailed Covariate). Given , we consider (3.4) where the -sparse satisfies , the s are i.i.d. zero-mean heavy-tailed satisfying the joint fourth moment constraint , and satisfies for some ; the noise terms s are independent of the s and i.i.d. heavy-tailed with bounded fourth moment, while for simplicity we assume . In the quantization of described above, we set and enforce , . For recovery we let , , and set with sufficiently large . If , then with probability at least , all satisfying (4.2) have estimation error bounded by
4.2 Uniform Recovery Guarantee
Uniformity is a highly desirable property for a compressed sensing guarantee because it allows one to use a single (possibly randomly drawn) measurement ensemble for all sparse signals. Unfortunately, like many other results for nonlinear compressed sensing in the literature, our earlier recovery guarantees are non-uniform and only ensure the accurate recovery of a sparse signal fixed before the random measurement ensemble is drawn.
In this part we provide another additional development for QCS. Specifically, we establish a uniform recovery guarantee which, despite the heavy-tailed noise and the nonlinear quantization scheme, notably retains a near minimax error rate. This is done by upgrading Theorem 5 to hold uniformly over all sparse by means of more in-depth technical tools and a careful covering argument. Part of the techniques are inspired by the prior works [40, 94], but certain technical innovations are required:
1) As in the recent work [40], one crucial technical tool in our proof is a powerful concentration inequality for the product process due to Mendelson [67], adapted here as Lemma 9. However, [40] only studied the sub-Gaussian case, and the results produced by their unified approach typically exhibit a decaying rate of [40, Sect. 4]. By contrast, our problem involves heavy-tailed noise with only a bounded -th moment (), and we aim to establish a near minimax uniform error bound; care and new treatment are thus needed in the application of Lemma 9. More specifically, in the proof we need to bound
where , and is the signal space of interest, and recall that with sub-Gaussian . It is natural to invoke Lemma 9 to bound directly, but the issue is the lack of a good bound on due to the heavy-tailedness of ; indeed, one only has the trivial estimate as , which is much worse than an bound since , and using Lemma 9 with this estimate leads to a loose bound for and ultimately a sub-optimal error rate. To address this issue, our main idea is to introduce the truncated heavy-tailed noise and define , which enables us to decompose as
Now, the benefits of working with are that: i) we can directly invoke Lemma 9 to bound , since we have a good sub-Gaussian norm estimate , see Step 2.1.1 in the proof; ii) becomes the supremum of a process that is independent of and indexed only by , hence Bernstein’s inequality suffices for bounding (Step 2.1.2 in the proof), analogously to the proof of the non-uniform guarantee (Theorem 5).
2) As in [94, Prop. 6.1], we invoke a covering argument with similar techniques to bound , where is the quantization noise. Nevertheless, our Lasso estimator differs from their projected back projection estimator, and it turns out that we need to directly handle “” via Lemma 10, unlike [94, Prop. 6.2], which again used a covering argument for this purpose. See more discussions in Step 2.4 of the proof.
We are now in a position to present our uniform recovery guarantee. We follow most of the assumptions in Theorem 5 but specify the signal space as and impose the -th moment constraint on . Following prior works on QCS (e.g., [40, 89]), we consider the constrained Lasso, which utilizes an -constraint (rather than (3.6)), to pursue uniform recovery.
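A constrained Lasso of this type can be solved, for instance, by projected gradient descent with a Euclidean projection onto the l1-ball. The sketch below is a generic solver of this kind, written with a plain squared loss and illustrative parameter names; it is not necessarily the exact program analyzed in Theorem 13.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1-ball of the given radius (Duchi et al. style)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, v.size + 1) > css - radius)[0][-1]
    theta = (css[k] - radius) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def constrained_lasso(X, y, radius, iters=500):
    """Projected gradient descent for  min ||y - X b||^2 / (2n)  s.t.  ||b||_1 <= radius."""
    n, d = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # inverse Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ b - y) / n
        b = project_l1_ball(b - step * grad, radius)
    return b
```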
Theorem 13.
(Uniform Version of Theorem 5). Given some , in (3.4) we suppose that s are i.i.d., zero-mean sub-Gaussian with , for some where , for some absolute constant , s are i.i.d. noise that are independent of s and satisfy for some fixed . In quantization, we truncate to with threshold , then quantize to with uniform dither . For recovery, we define the estimator as the solution to constrained Lasso
If for and some hidden constant depending on , then with probability at least on a single random draw of , it holds uniformly for all that the estimation error satisfies
Notably, our uniform guarantee remains minimax optimal up to additional logarithmic factors (i.e., ) arising from the covering argument (Step 2.4 of the proof), whose main aim is to show that one uniform dither suffices for all signals. Naturally, is thus associated with a leading factor of the quantization level , meaning that the logarithmic gap between uniform and non-uniform recovery closes as . In particular, Theorem 13 implies a uniform error rate matching the non-uniform one in Theorem 5 (up to multiplicative factors) when is small enough or in the unquantized case.
To the best of our knowledge, the only existing uniform guarantee for heavy-tailed QCS is [32, Thm. 1.11], but the following distinctions make a close comparison with our Theorem 13 impossible: 1) [32, Thm. 1.11] is for dithered 1-bit quantization, whereas ours is for the dithered uniform quantizer; 2) we handle heavy-tailedness by truncation, while [32, Thm. 1.11] involves no such special treatment; 3) [32, Thm. 1.11] considers a highly intractable program with the Hamming distance as objective and as constraint (when specialized to sparse signals), while our Theorem 13 is for the convex Lasso program; 4) their analysis is based on an in-depth result on random hyperplane tessellations (see also [33, 77]), while our proof follows the more standard strategy (i.e., upgrading each piece of a non-uniform proof to be uniform) and requires certain technical innovations (e.g., the treatment of the truncation step). It is possible to use this standard strategy to upgrade Theorem 6 to a uniform result, whose error rate may exhibit a worse dependence on due to the covering argument.
5 Numerical Simulations
In this section we provide two sets of experimental results to support and illustrate our theoretical developments. The first set of simulations is devoted to validating our main standpoint that near minimax rates are achievable in quantized heavy-tailed settings. The second set of results illustrates the crucial role played by an appropriate dither (i.e., triangular dither for the covariate, uniform dither for the response) before uniform quantization. For the importance of data truncation we refer to [35, Sect. 5], which covers three estimation problems considered in this work and contrasts estimation with and without data truncation.
5.1 (Near) Minimax Error Rates
Each data point in our results is set to be the mean value of or independent trials.
5.1.1 Quantized Covariance Matrix Estimation
We start with covariance matrix estimation; specifically, we verify the element-wise rate and the operator norm rate in Theorems 2-3.
For the estimator in Theorem 2, we draw such that the first two coordinates are independently drawn from , are drawn from with covariance , and the remaining coordinates are i.i.d. following . We test different choices of under , and the log-log plots are shown in Figure 1(a). Clearly, for each the experimental points roughly form a straight line that is well aligned with the dashed line representing the rate. As predicted by the factor , the curves with larger are higher, but the error decay rates remain unchanged. In addition, the curves for are extremely close, which is consistent with the logarithmic dependence of on .
For the error bound , the coordinates of are independently drawn from a scaled version of such that , and we test different settings of under . As shown in Figure 1(b), the operator norm error decreases with at the optimal rate , and using a coarser dithered quantizer (i.e., larger ) only slightly lifts the curves; this effect appears consistent with the quadratic dependence of on . To validate the relative scaling of and , in addition to the setting under , we try under times the original sample size (but in Figure 1(b) we still plot the curve according to the sample size without the multiplicative factor of ), and, perhaps surprisingly, the obtained curve coincides with the one for . Thus, ignoring the logarithmic factor , the operator norm error is characterized by fairly well.
Additionally, we compare and with respect to their dependence on more explicitly. Specifically, we generate the samples s as in Figure 1(a) and test a fixed sample size with varying dimension . The max norm errors of the estimator in Theorem 2 and the operator norm errors (under to ensure ) of the estimator in Theorem 3 are reported in Figure 1(c). Clearly, the max norm error increases with rather slowly, while the operator norm error increases much more significantly for larger . This is consistent with the logarithmic dependence of on and the more essential dependence of on .

[Figure 1: (a) (b) (c)]
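A minimal end-to-end sketch of this covariance experiment (heavy-tailed samples, truncation, triangular-dither quantization, corrected sample covariance, max-norm error) is given below. The sampling distribution, the degrees of freedom, the thresholds, and the Δ²/4 correction are placeholders standing in for the exact settings described above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, delta, tau = 50, 1.0, 10.0
errors = []
for n in [2000, 4000, 8000, 16000]:
    X = rng.standard_t(df=5, size=(n, d))             # heavy-tailed samples (placeholder df)
    Sigma = (5 / 3) * np.eye(d)                       # true covariance of i.i.d. t_5 entries
    Xt = np.clip(X, -tau, tau)                        # truncation (entrywise, for illustration)
    dith = rng.uniform(-delta/2, delta/2, X.shape) + rng.uniform(-delta/2, delta/2, X.shape)
    Q = delta * (np.floor((Xt + dith) / delta) + 0.5) # triangular-dither uniform quantization
    Sigma_hat = Q.T @ Q / n - (delta**2 / 4) * np.eye(d)
    errors.append(np.abs(Sigma_hat - Sigma).max())    # max-norm (element-wise) error
print(errors)                                         # decays roughly like n**(-1/2)
```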
5.1.2 Quantized Compressed Sensing
We now switch to QCS with unquantized covariates and aim to verify the -norm error rate obtained in Theorems 5-6. We let the support of the -sparse be , and then draw the non-zero entries from a uniform distribution over (hence ). For the setting of Theorem 5 we adopt and , while and for Theorem 6. We simulate different choices of under , and the proposed convex program (3.6) is solved within the framework of ADMM (we refer to the review [9]). The experimental results are shown as log-log plots in Figure 2. Consistent with the theoretical bound , the errors in both cases decrease at a rate of , whereas the effect of uniform quantization is merely on the multiplicative factor . Interestingly, the gaps between and appear to agree with the explicit form of , i.e., for Theorem 5 and for Theorem 6. In addition, the curves for are close, whereas increasing incurs significantly larger errors. This is consistent with the scaling law of in .

[Figure 2: (a) (b)]
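The ADMM solver used for the Lasso-type program can be sketched as follows; the objective written here is the standard Lasso with unquantized covariates, and the penalty level, step parameter, and iteration count are illustrative rather than the tuned choices behind Figure 2.

```python
import numpy as np

def lasso_admm(X, y, lam, rho=1.0, iters=300):
    """ADMM for  min_b  ||y - X b||^2 / (2n) + lam * ||b||_1  (cf. the review [9])."""
    n, d = X.shape
    XtX, Xty = X.T @ X / n, X.T @ y / n
    L = np.linalg.cholesky(XtX + rho * np.eye(d))     # factor once, reuse every iteration
    z, u = np.zeros(d), np.zeros(d)
    for _ in range(iters):
        rhs = Xty + rho * (z - u)
        b = np.linalg.solve(L.T, np.linalg.solve(L, rhs))                   # b-update
        z = np.sign(b + u) * np.maximum(np.abs(b + u) - lam / rho, 0.0)     # soft-threshold
        u = u + b - z                                                       # dual update
    return z
```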
Then, we simulate the complete quantization setting where both covariate and response are quantized (Theorems 9-10). The simulation details are the same as before except that is also quantized, with the same quantization level as . We provide the best -norm constraint for recovery, i.e., . Composite gradient descent [63, 64] is then invoked to handle the non-convex estimation program; a sketch of this solver is given after Figure 3. The log-log plots are shown in Figure 3. These results have implications similar to those of Figure 2, in terms of the rate, the effect of quantization, and the relative scaling of .

[Figure 3: (a) (b)]
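A sketch of the composite gradient descent step used for the non-convex program is given below (a gradient step on the quadratic loss, soft-thresholding for the l1 penalty, then projection onto the l1-ball side constraint, reusing `project_l1_ball` from the constrained-Lasso sketch in Section 4.2). The step size and iteration count are illustrative, and the update is written in the spirit of [63, 64] rather than as their exact algorithm.

```python
import numpy as np

def composite_gd(Sigma_hat, b_hat, lam, radius, step=0.01, iters=2000):
    """Composite gradient descent for  min_b  b^T Sigma_hat b / 2 - b_hat^T b + lam ||b||_1
    subject to ||b||_1 <= radius;  Sigma_hat need not be positive semi-definite."""
    b = np.zeros(len(b_hat))
    for _ in range(iters):
        grad = Sigma_hat @ b - b_hat                               # gradient of the quadratic loss
        v = b - step * grad
        v = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)   # prox of the l1 penalty
        b = project_l1_ball(v, radius)                             # enforce the l1-ball constraint
    return b
```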
5.1.3 Quantized Matrix Completion
Finally, we simulate QMC and examine the error bound for in Theorems 7-8. We generate the rank- as follows: we first generate with i.i.d. standard Gaussian entries to obtain a rank- matrix, and then rescale it to such that . We use to simulate the sub-exponential noise in Theorem 7, and for Theorem 8. The convex program (3.8) is fed with and optimized by the ADMM algorithm. We test different choices of under , with the log-log error plots displayed in Figure 4. First, the experimental curves are well aligned with the dashed line representing the optimal rate. Second, comparing the results for , we conclude that quantization only affects the multiplicative factor in the estimation error. It should also be noted that increasing either or leads to significantly larger errors, which is consistent with the essential dependence of on and .

[Figure 4: (a) (b)]
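For the matrix completion objective, a simple alternative to ADMM is proximal gradient descent with singular value thresholding; the sketch below takes this route. The entrywise clipping used to enforce the max-norm bound is a simplification, and all parameter names and values are illustrative rather than the exact choices in (3.8).

```python
import numpy as np

def qmc_prox_grad(Y_obs, mask, lam, alpha, step=1.0, iters=200):
    """Proximal gradient for  min_T ||mask * (T - Y_obs)||_F^2 / (2m) + lam ||T||_*,
    with an entrywise bound |T_ij| <= alpha enforced by a clip after each step."""
    m = mask.sum()
    T = np.zeros_like(Y_obs, dtype=float)
    for _ in range(iters):
        grad = mask * (T - Y_obs) / m
        U, s, Vt = np.linalg.svd(T - step * grad, full_matrices=False)
        s = np.maximum(s - step * lam, 0.0)          # singular value soft-thresholding
        T = np.clip(U @ (s[:, None] * Vt), -alpha, alpha)
    return T
```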
5.2 Importance of Appropriate Dithering
To demonstrate the crucial role played by a suitable dither, we provide the second set of simulations. In order to observe more pronounced phenomena and draw clear conclusions, we test a rather simple estimation problem with large sample sizes under coarse quantization (i.e., large ).
Specifically, for covariance matrix estimation we set and draw i.i.d. from . Thus, the problem boils down to estimating , for which the estimators in Theorems 2-3 coincide. Since is sub-Gaussian, we do not perform data truncation before the dithered quantization. Besides our estimator , where and is a triangular dither, we compare with the following competitors:
- • , where is the direct quantization without dithering;
- • and , where , and is quantized under uniform dither .
To illustrate the choices of and , we write with quantization error (due to Theorem 1(a)) and quantization noise ; then (3.1) gives , while remains unknown. Thus, corresponds to the unjustified guess , while simply gives up the correction of . We test under . From the results shown in Figure 5(a), the proposed estimator based on quantized data under triangular dither achieves the lowest estimation errors and the optimal rate of , whereas the other competitors are not consistent, i.e., they all hit error floors at large sample sizes.
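The scalar essence of this comparison can be reproduced in a few lines; the Gaussian samples, the coarse level Δ = 4, and the Δ²/4 correction below are placeholders for the settings described above, and the two estimators shown correspond to dithered-with-correction versus no dither.

```python
import numpy as np

rng = np.random.default_rng(2)
delta, sigma2 = 4.0, 1.0                              # coarse quantizer, true variance
for n in [10**4, 10**5, 10**6]:
    x = rng.normal(0.0, np.sqrt(sigma2), n)
    tri = rng.uniform(-delta/2, delta/2, n) + rng.uniform(-delta/2, delta/2, n)
    q_dith = delta * (np.floor((x + tri) / delta) + 0.5)   # triangular dither
    q_raw  = delta * (np.floor(x / delta) + 0.5)           # no dither
    est_dith = np.mean(q_dith**2) - delta**2 / 4           # corrected estimator: consistent
    est_raw  = np.mean(q_raw**2)                           # hits an error floor
    print(n, abs(est_dith - sigma2), abs(est_raw - sigma2))
```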
For the two remaining signal recovery problems, we focus on the quantization of the response . In particular, we simulate QCS in the setting of Theorem 5, with under . Other experimental details are as previously stated. We compare our estimator with its counterpart defined by (3.6) with the same but , where is a direct uniform quantization with no dither. The simulation results in Figure 5(b) confirm that the application of a uniform dither significantly reduces the recovery errors. Without dithering, although our results under Gaussian covariates still exhibit a decreasing rate, an identifiability issue unavoidably arises under Bernoulli covariates; in that case, the simulation without dithering clearly deviates from the rate, see [87, Figure 1] for instance.
Analogously, we simulate QMC (Theorem 7) with data generated as in the previous experiments, and specifically we try under . While our estimator is defined in (3.8) with from a dithered quantizer, we also simulate the performance of its counterpart without dithering, i.e., defined in (3.8) with replaced by . From the experimental results displayed in Figure 5(c), one can clearly see that performs much better in terms of both the decay rate of and the estimation error; the curve without dithering does not even decrease.

[Figure 5: (a) (b) (c)]
6 Concluding Remarks
In digital signal processing and many distributed machine learning systems, data quantization is an indispensable process. On the other hand, many modern datasets exhibit heavy-tailedness, and the past decade has witnessed increasing interest in statistical estimation methods robust to heavy-tailed data. In this work we bridge these two developments by studying the quantization of heavy-tailed data. We propose to truncate the heavy-tailed data prior to a uniform quantizer with a random dither well suited to the problem at hand. Applying our quantization scheme to covariance matrix estimation, compressed sensing, and matrix completion, we proposed (near) optimal and computationally feasible estimators based on quantized data. These results suggest a unified conclusion that dithered quantization does not affect the key scaling law in the error rate but only slightly worsens the multiplicative factor, which is complemented by numerical simulations. Further, we presented two additional developments for quantized compressed sensing. First, we studied a novel setting that involves covariate quantization. Because our quantized covariance matrix estimator is not positive semi-definite, the proposed recovery program is non-convex, but we proved that all local minimizers enjoy near minimax rates. At a higher level, this development extends a line of works on non-convex M-estimators [63, 64, 62] to accommodate heavy-tailed covariates; see the deterministic framework in Proposition 1. As an application, we derived results for (dithered) 1-bit compressed sensing as byproducts. Second, we established a near minimax uniform recovery guarantee for QCS under heavy-tailed noise, which states that all sparse signals within an -ball can be uniformly recovered up to near optimal -norm error using a single realization of the measurement ensemble. We believe the developments presented in this work will prove useful in many other estimation problems; for instance, the triangular dither and the quantization scheme apply to multi-task learning, as shown by the subsequent works [22, 60].
References
- [1] Albert Ai, Alex Lapanowski, Yaniv Plan, and Roman Vershynin. One-bit compressed sensing with non-gaussian measurements. Linear Algebra and its Applications, 441:222–239, 2014.
- [2] James Bennett, Stan Lanning, et al. The netflix prize. In Proceedings of KDD cup and workshop, volume 2007, page 35. New York, 2007.
- [3] Sonia A Bhaskar. Probabilistic low-rank matrix completion from quantized measurements. The Journal of Machine Learning Research, 17(1):2131–2164, 2016.
- [4] Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of statistics, 36(6):2577–2604, 2008.
- [5] Atanu Biswas, Sujay Datta, Jason P Fine, and Mark R Segal. Statistical advances in the biomedical sciences: clinical trials, epidemiology, survival analysis, and bioinformatics. John Wiley & Sons, 2007.
- [6] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
- [7] Petros T Boufounos. Sparse signal reconstruction from phase-only measurements. In Proc. Int. Conf. Sampling Theory and Applications (SampTA), July 1-5, 2013. Citeseer, 2013.
- [8] Petros T Boufounos and Richard G Baraniuk. 1-bit compressive sensing. In 2008 42nd Annual Conference on Information Sciences and Systems, pages 16–21. IEEE, 2008.
- [9] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
- [10] Christian Brownlees, Emilien Joly, and Gábor Lugosi. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.
- [11] T Tony Cai, Cun-Hui Zhang, and Harrison H Zhou. Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38(4):2118–2144, 2010.
- [12] T Tony Cai and Harrison H Zhou. Optimal rates of convergence for sparse covariance matrix estimation. The Annals of Statistics, pages 2389–2420, 2012.
- [13] Tony Cai and Weidong Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672–684, 2011.
- [14] Tony Cai and Wen-Xin Zhou. A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res., 14(1):3619–3647, 2013.
- [15] Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.
- [16] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
- [17] Yang Cao and Yao Xie. Categorical matrix completion. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 369–372. IEEE, 2015.
- [18] Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’IHP Probabilités et statistiques, volume 48, pages 1148–1185, 2012.
- [19] Junren Chen and Michael K Ng. Color image inpainting via robust pure quaternion matrix completion: Error bound and weighted loss. SIAM Journal on Imaging Sciences, 15(3):1469–1498, 2022.
- [20] Junren Chen and Michael K Ng. Uniform exact reconstruction of sparse signals and low-rank matrices from phase-only measurements. IEEE Transactions on Information Theory, 2023.
- [21] Junren Chen, Cheng-Long Wang, Michael K Ng, and Di Wang. High dimensional statistical estimation under one-bit quantization. IEEE Transactions on Information Theory, 2023.
- [22] Junren Chen, Yueqi Wang, and Michael K Ng. Quantized low-rank multivariate regression with random dithering. arXiv preprint arXiv:2302.11197, 2023.
- [23] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems of equations is nearly as easy as solving linear systems. Advances in Neural Information Processing Systems, 28, 2015.
- [24] Onkar Dabeer and Aditya Karnik. Signal parameter estimation using 1-bit dithered quantization. IEEE Transactions on Information Theory, 52(12):5389–5405, 2006.
- [25] Alireza Danaee, Rodrigo C de Lamare, and Vitor Heloiz Nascimento. Distributed quantization-aware rls learning with bias compensation and coarsely quantized signals. IEEE Transactions on Signal Processing, 70:3441–3455, 2022.
- [26] Mark A Davenport, Yaniv Plan, Ewout Van Den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189–223, 2014.
- [27] Mark A Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622, 2016.
- [28] Luc Devroye, Matthieu Lerasle, Gabor Lugosi, and Roberto I Oliveira. Sub-gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725, 2016.
- [29] Sjoerd Dirksen. Quantized compressed sensing: a survey. In Compressed Sensing and Its Applications, pages 67–95. Springer, 2019.
- [30] Sjoerd Dirksen, Hans Christian Jung, and Holger Rauhut. One-bit compressed sensing with partial gaussian circulant matrices. Information and Inference: A Journal of the IMA, 9(3):601–626, 2020.
- [31] Sjoerd Dirksen, Johannes Maly, and Holger Rauhut. Covariance estimation under one-bit quantization. The Annals of Statistics, 50(6):3538–3562, 2022.
- [32] Sjoerd Dirksen and Shahar Mendelson. Non-gaussian hyperplane tessellations and robust one-bit compressed sensing. Journal of the European Mathematical Society, 23(9):2913–2947, 2021.
- [33] Sjoerd Dirksen, Shahar Mendelson, and Alexander Stollenwerk. Sharp estimates on random hyperplane tessellations. SIAM Journal on Mathematics of Data Science, 4(4):1396–1419, 2022.
- [34] Jianqing Fan, Quefeng Li, and Yuyan Wang. Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):247–265, 2017.
- [35] Jianqing Fan, Weichen Wang, and Ziwei Zhu. A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. Annals of statistics, 49(3):1239, 2021.
- [36] Thomas Feuillen, Mike E Davies, Luc Vandendorpe, and Laurent Jacques. (,)-rip and projected back-projection reconstruction for phase-only measurements. IEEE Signal Processing Letters, 27:396–400, 2020.
- [37] Simon Foucart, Deanna Needell, Reese Pathak, Yaniv Plan, and Mary Wootters. Weighted matrix completion from non-random, non-uniform sampling patterns. IEEE Transactions on Information Theory, 67(2):1264–1290, 2020.
- [38] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Springer New York, New York, NY, 2013.
- [39] Martin Genzel and Christian Kipp. Generic error bounds for the generalized lasso with sub-exponential data. Sampling Theory, Signal Processing, and Data Analysis, 20(2):15, 2022.
- [40] Martin Genzel and Alexander Stollenwerk. A unified approach to uniform signal recovery from nonlinear observations. Foundations of Computational Mathematics, pages 1–74, 2022.
- [41] Robert M Gray and Thomas G Stockham. Dithered quantizers. IEEE Transactions on Information Theory, 39(3):805–812, 1993.
- [42] David Gross, Yi-Kai Liu, Steven T Flammia, Stephen Becker, and Jens Eisert. Quantum state tomography via compressed sensing. Physical review letters, 105(15):150401, 2010.
- [43] Osama A Hanna, Yahya H Ezzeldin, Christina Fragouli, and Suhas Diggavi. Quantization of distributed data for learning. IEEE Journal on Selected Areas in Information Theory, 2(3):987–1001, 2021.
- [44] Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.
- [45] Marat Ibragimov, Rustam Ibragimov, and Johan Walden. Heavy-tailed distributions and robustness in economics and finance, volume 214. Springer, 2015.
- [46] Laurent Jacques and Valerio Cambareri. Time for dithering: fast and quantized random embeddings via the restricted isometry property. Information and Inference: A Journal of the IMA, 6(4):441–476, 2017.
- [47] Laurent Jacques and Thomas Feuillen. The importance of phase in complex compressive sensing. IEEE Transactions on Information Theory, 67(6):4150–4161, 2021.
- [48] Laurent Jacques, Jason N Laska, Petros T Boufounos, and Richard G Baraniuk. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE transactions on information theory, 59(4):2082–2102, 2013.
- [49] Halyun Jeong, Xiaowei Li, Yaniv Plan, and Ozgur Yilmaz. Sub-gaussian matrices on sets: Optimal tail dependence and applications. Communications on Pure and Applied Mathematics, 75(8):1713–1754, 2022.
- [50] Hans Christian Jung, Johannes Maly, Lars Palzer, and Alexander Stollenwerk. Quantized compressed sensing by rectified linear units. IEEE transactions on information theory, 67(6):4125–4149, 2021.
- [51] Olga Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1):282–303, 2014.
- [52] Olga Klopp, Jean Lafond, Éric Moulines, and Joseph Salmon. Adaptive multinomial matrix completion. Electronic Journal of Statistics, 9(2):2950–2975, 2015.
- [53] Olga Klopp, Karim Lounici, and Alexandre B Tsybakov. Robust matrix completion. Probability Theory and Related Fields, 169(1):523–564, 2017.
- [54] Karin Knudson, Rayan Saab, and Rachel Ward. One-bit compressive sensing with norm estimation. IEEE Transactions on Information Theory, 62(5):2748–2758, 2016.
- [55] Vladimir Koltchinskii, Karim Lounici, and Alexandre B Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329, 2011.
- [56] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
- [57] Piotr Kruczek, Radosław Zimroz, and Agnieszka Wyłomańska. How to detect the cyclostationarity in heavy-tailed distributed signals. Signal Processing, 172:107514, 2020.
- [58] Arun Kumar Kuchibhotla and Abhishek Chakrabortty. Moving beyond sub-gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. Information and Inference: A Journal of the IMA, 11(4):1389–1456, 2022.
- [59] Jean Lafond, Olga Klopp, Eric Moulines, and Joseph Salmon. Probabilistic low-rank matrix completion on finite alphabets. Advances in Neural Information Processing Systems, 27, 2014.
- [60] Kangqiang Li and Yuxuan Wang. Two results on low-rank heavy-tailed multiresponse regressions. arXiv preprint arXiv:2305.13897, 2023.
- [61] Christopher Liaw, Abbas Mehrabian, Yaniv Plan, and Roman Vershynin. A simple tool for bounding the deviation of random matrices on geometric sets. In Geometric aspects of functional analysis, pages 277–299. Springer, 2017.
- [62] Po-Ling Loh. Statistical consistency and asymptotic normality for high-dimensional robust -estimators. The Annals of Statistics, 45(2):866–896, 2017.
- [63] Po-Ling Loh and Martin J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of statistics, 40(3):1637–1664, 2012.
- [64] Po-Ling Loh and Martin J. Wainwright. Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16(19):559–616, 2015.
- [65] Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019.
- [66] Gabor Lugosi and Shahar Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean. The Annals of Statistics, 49(1):393–410, 2021.
- [67] Shahar Mendelson. Upper bounds on product and multiplier empirical processes. Stochastic Processes and their Applications, 126(12):3652–3680, 2016.
- [68] Stanislav Minsker. Geometric median and robust estimation in banach spaces. Bernoulli, 21(4):2308–2335, 2015.
- [69] Jianhua Mo and Robert W Heath. Limited feedback in single and multi-user mimo systems with finite-bit adcs. IEEE Transactions on Wireless Communications, 17(5):3284–3297, 2018.
- [70] Sahand Negahban and Martin J Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.
- [71] Sahand Negahban and Martin J Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13(1):1665–1697, 2012.
- [72] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of -estimators with decomposable regularizers. Statistical science, 27(4):538–557, 2012.
- [73] Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
- [74] Luong Trung Nguyen, Junhan Kim, and Byonghyo Shim. Low-rank matrix completion: A contemporary survey. IEEE Access, 7:94215–94237, 2019.
- [75] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2012.
- [76] Yaniv Plan and Roman Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013.
- [77] Yaniv Plan and Roman Vershynin. Dimension reduction by random hyperplane tessellations. Discrete & Computational Geometry, 51(2):438–461, 2014.
- [78] Yaniv Plan and Roman Vershynin. The generalized lasso with non-linear observations. IEEE Transactions on information theory, 62(3):1528–1537, 2016.
- [79] Yaniv Plan, Roman Vershynin, and Elena Yudovina. High-dimensional estimation with geometric constraints. Information and Inference: A Journal of the IMA, 6(1):1–40, 2017.
- [80] Shuang Qiu, Xiaohan Wei, and Zhuoran Yang. Robust one-bit recovery via relu generative networks: Near-optimal statistical rate and global landscape analysis. In International Conference on Machine Learning, pages 7857–7866. PMLR, 2020.
- [81] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over -balls. IEEE transactions on information theory, 57(10):6976–6994, 2011.
- [82] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(12), 2011.
- [83] Lawrence Roberts. Picture coding using pseudo-random noise. IRE Transactions on Information Theory, 8(2):145–154, 1962.
- [84] Sima Sahu, Harsh Vikram Singh, Basant Kumar, and Amit Kumar Singh. De-noising of ultrasound image using bayesian approached heavy-tailed cauchy distribution. Multimedia Tools and Applications, 78(4):4089–4106, 2019.
- [85] Vidyashankar Sivakumar, Arindam Banerjee, and Pradeep K Ravikumar. Beyond sub-gaussian measurements: High-dimensional structured estimation with sub-exponential designs. Advances in neural information processing systems, 28, 2015.
- [86] Alexander Stollenwerk. One-bit compressed sensing and fast binary embeddings. PhD thesis, RWTH Aachen University, 2019.
- [87] Zhongxing Sun, Wei Cui, and Yulong Liu. Quantized corrupted sensing with random dithering. IEEE Transactions on Signal Processing, 70:600–615, 2022.
- [88] Ananthram Swami and Brian M Sadler. On some detection and estimation problems in heavy-tailed noise. Signal Processing, 82(12):1829–1846, 2002.
- [89] Christos Thrampoulidis and Ankit Singh Rawat. The generalized lasso for sub-gaussian measurements with dithered quantization. IEEE Transactions on Information Theory, 66(4):2487–2500, 2020.
- [90] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
- [91] Joel A Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.
- [92] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- [93] Robert F Woolson and William R Clarke. Statistical methods for the analysis of biomedical data. John Wiley & Sons, 2011.
- [94] Chunlei Xu and Laurent Jacques. Quantized compressive sensing with rip matrices: The benefit of dithering. Information and Inference: A Journal of the IMA, 9(3):543–586, 2020.
- [95] Tianyu Yang, Johannes Maly, Sjoerd Dirksen, and Giuseppe Caire. Plug-in channel estimation with dithered quantized signals in spatially non-stationary massive mimo systems. arXiv preprint arXiv:2301.04641, 2023.
- [96] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, pages 4035–4043. PMLR, 2017.
- [97] Ziwei Zhu and Wenjing Zhou. Taming heavy-tailed features by shrinkage. In International Conference on Artificial Intelligence and Statistics, pages 3268–3276. PMLR, 2021.
Appendix A Proofs in Section 3
A.1 Quantized Covariance Matrix Estimation
We first state Bernstein’s inequality, which recurs in our proofs. In applications we will choose the more convenient of (A.1) and (A.2).
Lemma 1.
(Bernstein’s inequality, [6, Thm. 2.10, Coro. 2.11]). Let be independent random variables, and assume that there exist positive numbers and such that and
then for any we have
(A.1)
(A.2)
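For reference, one standard two-sided form of this inequality (cf. [6, Thm. 2.10 and Coro. 2.11]) can be written in generic notation as
\[
\mathbb{P}\Big(\Big|\sum_{i=1}^n (X_i-\mathbb{E}X_i)\Big|\ge \sqrt{2vt}+ct\Big)\le 2e^{-t}
\quad\text{and}\quad
\mathbb{P}\Big(\Big|\sum_{i=1}^n (X_i-\mathbb{E}X_i)\Big|\ge t\Big)\le 2\exp\Big(\frac{-t^2}{2(v+ct)}\Big),
\]
valid whenever \(\sum_{i=1}^n \mathbb{E}[X_i^2]\le v\) and \(\sum_{i=1}^n \mathbb{E}[|X_i|^k]\le \tfrac{k!}{2}\,v\,c^{k-2}\) for all integers \(k\ge 3\).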
We will also use the matrix Bernstein inequality.
Lemma 2.
(Matrix Bernstein, [91, Thm. 6.1.1]). Let be independent zero-mean random matrices with common dimension . We assume that for and introduce the matrix variance statistic
Then for any , we have
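In generic notation, the standard tail bound of [91, Thm. 6.1.1] for independent zero-mean random matrices \(S_1,\dots,S_n\) of common dimension \(d_1\times d_2\) reads
\[
\mathbb{P}\Big(\Big\|\sum_{i=1}^n S_i\Big\|\ge t\Big)\le (d_1+d_2)\exp\Big(\frac{-t^2/2}{v+Lt/3}\Big),\qquad t\ge 0,
\]
where \(\|S_i\|\le L\) almost surely and \(v=\max\big\{\big\|\sum_i\mathbb{E}[S_iS_i^\top]\big\|,\ \big\|\sum_i\mathbb{E}[S_i^\top S_i]\big\|\big\}\) is the matrix variance statistic.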
A.1.1 Proof of Theorem 2
Proof.
Recall that is the quantization noise and that , which implies . Thus, by the triangle inequality we obtain
Step 1. Bounding .
Note that , so for any we aim to bound the -th entry error
Observe that the quantization noise is bounded as follows
which implies and . By the moment constraint on we have . Thus, for any positive integer we have the following bound
(A.3)
for some , . With these preparations, we can invoke Bernstein’s inequality (Lemma 1) to obtain that, for any , with probability at least ,
Taking and using the choice , then applying a union bound over , under the scaling , we obtain that holds with probability at least .
Step 2. Bounding .
We aim to bound for any . First by the definition of truncation we have
then applying Cauchy-Schwarz to , we obtain
where the second inequality is due to Markov’s inequality. Note that this bound remains valid for . Since this holds for any , combining with , we obtain .
By putting pieces together, we have with probability at least , as claimed. ∎
A.1.2 Proof of Theorem 3
Proof.
Note that the calculations in (3.1) and (3.2) remain valid (but the truncated samples are denoted by rather than ), so we have . Using triangle inequality we first decompose the error as
Step 1. Bounding .
We first write that
Recall that we define the quantization error as and the quantization noise as , and observe that the quantization noise is bounded: . Thus, by the fact that holds for any , we obtain
which implies . Moreover, we estimate the matrix variance statistic. Since is symmetric, we simply deal with and some algebra gives . First let us note that
Combining with the observation that
we obtain . Then we turn to the operator norm of . We apply Cauchy-Schwarz to estimate
(A.4) |
By the fact that holds for any , , and , we obtain
(A.5)
For any , we write and then have the bound
(A.6) |
where is because , , and the quantization error follows ; in more detail, and then the moment constraint of a sub-Gaussian random variable implies and . From (A.4), (A.5) and (A.6), we obtain . Further combining with and , we arrive at and hence . With these preparations, the matrix Bernstein inequality (Lemma 2) yields the following inequality, which holds for any
Setting with sufficiently large , under the scaling of and the threshold , we obtain that holds with probability at least .
Step 2. Bounding .
Having bounded the concentration term , we now switch to the bias term
For any , because is obtained by truncating with respect to the -norm, we have
where and are due to the Cauchy-Schwarz and Markov inequalities, respectively, and in we use . This leads to the bound . Combining the bounds on completes the proof.∎
A.1.3 Proof of Theorem 4
This small appendix is devoted to the proof of Theorem 4, for which we need a lemma concerning the element-wise error rate of , i.e., , where we write , . Recalling that , the key message of Lemma 3 is that, due to the thresholding operator , satisfies an element-wise bound tighter than that in Theorem 2, as can be seen from the additional branch in (A.7).
Lemma 3.
(Element-wise Error Rate of ). For any , the thresholding estimator in Theorem 4 satisfies for some that
(A.7) |
where .
Proof.
Recall that and hence . Given , the proof of Theorem 2 delivers with probability at least . We assume this event holds in the following analysis. As stated in Theorem 4, we set with , . Since , we discuss whether holds.
Case 1. when holds.
In this case we have , thus . Further, , so we also have .
Case 2. when holds.
Then, we consider that implies . Then . Moreover, , so we also have .
Therefore, in both cases we have proved that , which completes the proof. ∎
We are now in a position to present the proof.
Proof of Theorem 4. We let (just assume ) and use as shorthand. For we define the event as
By Lemma 3 we can choose to be sufficiently large such that and ; here, by convention we let be the complement of . Our proof strategy is to first bound the -th order moment , and then invoke Markov’s inequality to derive a high probability bound. We start with a simple estimate
where and are due to for symmetric and . In this proof, the ranges of indices in summation or supremum, if omitted, are .
Step 1. Bounding .
By the definition of , if . Because the columns of are -sparse, we can straightforwardly bound as follows:
(A.8) |
Step 2. Bounding .
We first write with , then start from
note that in we define
By replacing with , this further gives
(A.9)
Step 2.1. Bounding .
Note that means , and implies ; their combination thus allows us to proceed as in the following and :
where is due to our choice of . Thus, implies and . Note that Step 2 in the proof of Theorem 2 gives , and hence we can assume and so . Using these relations and triangle inequality, we obtain
which implies . We now conclude that implies and , which allows us to bound as
(A.10) |
Analogously to the proof of Theorem 2, we can apply Bernstein’s inequality to . More specifically, by preparations as in (A.3), we can use (A.2) in Lemma 1 with , (recall that ). For some absolute constants , it gives
(A.11)
and in we use that holds because and . We substitute (A.11) into the bound obtained above and perform some estimates
where in we substitute the bound from the indicator function into the exponent, and holds because , , and because we consider with sufficiently large .
Step 2.2. Bounding .
Then, we deal with by Cauchy-Schwarz
As in (A.11), we can use (A.2) in Lemma 1 with and , yielding that for any , . Based on this probability tail bound, we can bound the moment via an integral as follows
where we use , in under suitably large . Thus, it follows that
where is due to and (recall that ).
Step 2.3. Bounding .
From Step 2 in the proof of Theorem 2 we have . This directly leads to
We are in a position to combine everything and conclude the proof. Putting all the pieces into the moment decomposition at the beginning of the proof, it follows that . Assuming , this upper bound is dominated by (A.8) for ; hence we can conclude that . Therefore, by Markov’s inequality,
which completes the proof.
A.2 Quantized Compressed Sensing
Note that our estimation procedures in QCS and QMC fall within the framework of the regularized M-estimator, see [70, 35, 21] for instance. In particular, we introduce the following deterministic result for analysing the estimator (3.6).
Lemma 4.
(Adapted from [21, Coro. 2]). Consider (3.4) and the estimator defined in (3.6), and let be the estimation error. If is positive semi-definite, and , then it holds that . Moreover, if for some we have the restricted strong convexity (RSC) , then we have the error bounds and . (For ease of reference, we do not optimize the constants in Lemmas 4 and 6.)
To establish the RSC condition, a convenient tool is the matrix deviation inequality. The following lemma is adapted from [61] by combining Theorem 3 and Remark 1 therein. (The dependence on can be further refined [49], but this is not pursued in the present paper.)
Lemma 5.
(Adapted from [61, Thm. 3]). Assume has independent zero-mean sub-Gaussian rows s satisfying , and the eigenvalues of are between for some . For we let be its radius. Then with probability at least , it holds that
where with is the Gaussian width of .
Based on Lemma 4, the proofs of Theorems 5-6 are divided into two steps, i.e., showing and verifying the RSC. While we still have the full in Theorems 5-6, we will study the more challenging settings where the covariates s are also quantized via in Theorems 9-10, in which we can take to recover the settings of Theorems 5-6. From this perspective, for most technical ingredients (e.g., the verification of ) in the proofs of Theorems 5-6 we can simply refer to the counterparts established in the proofs of Theorems 9-10. This avoids repetition and will be explained more clearly in the proofs.
A.2.1 Proof of Theorem 5
Proof. We divide the proofs into two steps.
Step 1. Proving
Recall that we choose and . In the setting of Theorem 9, the process of obtaining remains the same, while the covariates s are further quantized to for some under triangular dither , and we choose and there. As a result, by considering , it follows from Step 1 in the proof of Theorem 9 that, under the choice with sufficiently large , holds with probability at least . Then, by using Lemma 4 we obtain .
Step 2. Verifying the RSC
A.2.2 Proof of Theorem 6
Proof. The proof is similarly based on Lemma 4.
Step 1. Proving
Recall that we choose and . In the setting of Theorem 10, the process of obtaining remains the same, while the truncated covariates s are further quantized to for some under triangular dither , and we choose and there. As a result, by considering , it follows from Step 1 in the proof of Theorem 10 that our choice with sufficiently large ensures with the promised probability. By Lemma 4 we obtain .
Step 2. Verifying the RSC
Unlike the case of sub-Gaussian covariates, which is based on the matrix deviation inequality (Lemma 5), here we establish a lower bound for using the bound on (Theorem 2). Specifically, setting in Theorem 2 yields that holds with probability at least , which allows us to proceed as follows:
(A.12)
where is because , is due to , and is due to the assumed scaling . The desired results now follow immediately from Lemma 4.
A.3 Quantized Matrix Completion
Under the observation model (3.7), we first provide a deterministic framework for analysing the estimator (3.8).
Lemma 6.
(Adapted from [21, Coro. 3]). Let . If
(A.13) |
then it holds that . Moreover, if for some we have the restricted strong convexity (RSC) , then we have the error bounds and .
Clearly, to derive statistical error rate of from Lemma 6, the key ingredients are (A.13) and the RSC. Specialized to the covariate in matrix completion, we will use the following lemma to establish RSC.
Lemma 7.
(Adapted from [21, Lem. 4] with ). Given some , we define the constraint set with sufficiently large as
(A.14)
Let be i.i.d. uniformly distributed on ; then there exist absolute constants and such that with probability at least we have
(A.15) |
Matrix completion with sub-exponential noise was studied in [51], and we make use of the following Lemma in the sub-exponential case.
Lemma 8.
(Adapted from [51, Lem. 5]). Given some . Let be i.i.d. uniformly distributed on , independent of s, are i.i.d. zero-mean and satisfy . If , with probability at least we have
A.3.1 Proof of Theorem 7
Proof. We divide the proof into two steps.
Step 1. Proving (A.13)
Defining as the quantization error, from Theorem 1(a) we know that s are independent of and i.i.d. uniformly distributed on . Thus, we can further write that
which allows us to decompose into
Because the s are independent of the s and are i.i.d. sub-exponential noise satisfying , under the scaling , Lemma 8 implies that holds with probability at least . Analogously, the s and s are independent of and are i.i.d. uniformly distributed on , so Lemma 8 also applies to and , yielding with the promised probability . Taken collectively, , so setting with sufficiently large ensures , with probability at least . Further, Lemma 6 gives .
Step 2. Verifying RSC
First note that ; and as proved before, . To proceed we define the constraint set as in (A.14) with some properly chosen constant . Then using Lemma 7, for some absolute constants , (A.15) holds with probability at least . Then we discuss several cases.
1) If , because satisfies the first two constraints in the definition of , it must violate the third constraint and satisfy , which gives , as desired. Note that is due to the scaling .
2) If , (A.15) implies that , and we further consider the following two cases.
2.1) If , we have , as desired.
2.2) If , then the RSC condition holds: . This allows us to apply Lemma 6 to obtain .
Thus, in any case, we have shown . The bound on follows immediately from . The proof is complete.
A.3.2 Proof of Theorem 8
Proof. The proof is based on Lemma 6 and divided into two steps.
Step 1. Proving (A.13)
Recall that the quantization error is zero-mean and independent of (Theorem 1(a)); thus we have . Combining with , the triangle inequality first decomposes the target term into
Step 1.1. Bounding and
We write and by defining
By we have
Analogously, we have since . In addition, by () and the simple fact , we estimate the matrix variance statistic as follows
where is because, given , we can estimate since , and moreover . It is not hard to see that this bound remains valid for . Also, by similar arguments one can prove
Thus, Matrix Bernstein’s inequality (Lemma 2) gives
Thus, setting in the two inequalities above with sufficiently large , combining with the scaling that , we obtain with probability at least .
Step 1.2. Bounding
Let us bound first. Write the -th entry of as ; then for given , , and otherwise. We can thus proceed with the following estimates:
where and are by the Cauchy-Schwarz and Markov inequalities, respectively. Since this holds for any , we obtain , which further gives by (). Putting the pieces together, with probability at least we have ; hence ensures (A.13) with the same probability. Further, Lemma 6 gives .
Step 2. Verifying RSC
Appendix B Proofs in Section 4
This appendix collects the proofs in Section 4 concerning covariate quantization and uniform signal recovery in QCS.
B.1 Covariate Quantization
Because of the non-convexity, the proofs in this part can no longer be based on Lemma 4. Indeed, bounding the estimation error of the s satisfying (4.2) requires more tedious manipulations, essentially due to the additional constraint (induced by the constraint in (4.2)).
B.1.1 Proof of Theorem 9
Proof. The proof is divided into three steps: the first two steps resemble the previous proofs based on Lemma 4, while the estimation error is bounded in the last step.
Step 1. Proving for some pre-specified
Recall that are constructed from the quantized data as and . We will show that guarantees with the promised probability, where is any pre-specified constant. Recall the notation introduced earlier: with the quantization error independent of , and with the quantization error independent of . Combining this with the assumptions that the dithers are independent of and that the s and s are independent, we have
(B.1)
which allows us to decompose the target term as two concentration terms () and a bias term
Step 1.1. Bounding
Denote the -th entry of by , respectively. For , the -th entry reads . By using the relations , and , for any integer we can bound that
(B.2)
combining with Stirling’s approximation and treating as an absolute constant, this yields
In (B.2), is due to Hölder’s inequality, and in we use the moment constraint (2.2) of a sub-Gaussian variable. With these preparations, we invoke Bernstein’s inequality (Lemma 1) and then a union bound over to obtain
Thus, taking and plugging in , we obtain
Step 1.2. Bounding
Moreover, we estimate the -th entry of by
(B.3)
where and are due to Hölder’s inequality, and is due to Markov’s inequality. Since this holds for , it gives .
Step 1.3. Bounding
We first derive a bound for that is implicitly implied by other conditions:
Hence, we can estimate . Then, we invoke Bernstein’s inequality for an independent sum of sub-exponential random variables (e.g., [92, Thm. 2.8.1]; its application leads to the dependence ( being the upper bound on ) in the multiplicative factor , and it is possible to refine this quadratic dependence using the newer Bernstein inequality developed in [49, Thm. 1.3], but we do not pursue this here) to obtain that for any we have
Hence, we can set with sufficiently large and further apply a union bound over ; under the scaling that is small enough, we obtain that holds with probability at least .
Putting the pieces together, since , it is immediate that holds with probability at least . Since with sufficiently large , holds with high probability.
Step 2. Verifying RSC
We provide a lower bound for by using the matrix deviation inequality (Lemma 5). First note that the rows of are sub-Gaussian . Since , all eigenvalues of are between . Thus, we invoke Lemma 5 for with ; due to the well-known Gaussian width estimate [92, Example 7.5.9], with probability at least the following event holds
Under the same probability, a simple rescaling then provides
(B.4)
which implies
(B.5)
Based on (B.5), we let and use the inequality to obtain
which holds for all and is a constant depending on (we remove the dependence on by ).
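For the reader's convenience, the well-known Gaussian width estimate used above is of the following type: writing, in generic notation, $\Sigma_s\cap B_2^n:=\{x\in\mathbb{R}^n:\|x\|_0\le s,\ \|x\|_2\le 1\}$ for the set of $s$-sparse vectors in the unit Euclidean ball, one has
\[
w\big(\Sigma_s\cap B_2^n\big)\ \le\ C\sqrt{s\log(en/s)}
\]
for an absolute constant $C$, which is often further relaxed to $C\sqrt{s\log n}$.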
Step 3. Bounding the Estimation Error
We are in a position to bound the estimation error of any satisfying (4.2). Note that the definition of gives . Thus, we set in (4.2) and proceed as follows
(B.6)
where we use and in , and is due to the scaling that holds under the assumed for some hidden constant depending on . Thus, we arrive at .
For , we obtain by keeping entries of in while setting others to zero. Let be the support of , , then we have
(B.7)
Further using , we obtain . Hence, we have . Now, we substitute this into (B.6) and obtain
Thus, we arrive at the desired error bound for -norm
We simply use again to establish the bound for . The proof is complete.
B.1.2 Proof of Theorem 10
Proof. The proof is divided into two steps. Compared to the last proof, due to the heavy-tailedness of , the step of “verifying RSC” reduces to the simpler argument in (B.9).
Step 1. Proving for some pre-specified
Recall that are constructed from the quantized data as and . Thus, our main aim in this step is to prove that suffices to ensure with the promised probability and any pre-specified . We let , with quantization errors and . Analogously to (B.1), we have and . Thus, the term we want to bound can first be decomposed into two concentration terms () and one bias term ():
(B.8)
Step 1.1. Bounding
Denote the -th entry of by , respectively. Since , , for any integer we can bound the moments as
and in the last inequality we use and . With these preparations, we apply Bernstein’s inequality (Lemma 1) and a union bound, yielding that
Setting , holds with probability at least .
Step 1.2. Bounding
Noting that , we can further decompose as . To bound , we note that the assumption and truncation procedure for are the same as in Theorem 2; thus, Step 2 in the proof of Theorem 2 yields that , and hence . To bound , we estimate the -th entry
where is because , and a similar treatment applies to . Overall, we have .
Step 1.3. Bounding
We first note that
By Theorem 2, holds with probability at least , which leads to . Thus, by combining everything, we obtain that holds with probability at least . Compared to our choice of , the claim of this step follows.
Step 2. Bounding the Estimation Error
We are now ready to bound the estimation error of any satisfying (4.2). Setting in (4.2) gives . Recall that we can assume with probability at least , which leads to
(B.9)
Thus, it follows that
(B.10)
Note that is due to (B.9) and , in we use , and from Step 1, holds when with sufficiently large . Therefore, we arrive at . Similarly to Step 3 in the proof of Theorem 9, we can show . Applying (B.10) again yields
Thus, we obtain with . The proof can be concluded by further applying to derive the bound for -norm.
B.1.3 Proof of Proposition 1
Proof.
B.1.4 Proof of Theorem 11
Proof.
To invoke Proposition 1, it suffices to verify (4.3). Recalling , we first invoke [21, Thm. 1] to obtain that holds with probability at least . This confirms . Then, it remains to upper bound :
where in we use a known estimate from [21, Eq. (A.31)]:
Thus, by setting , (4.3) can be satisfied with probability at least , hence using Proposition 1 concludes the proof.∎
B.1.5 Proof of Theorem 12
Proof.
The proof is again based on Proposition 1 and some ingredients from [21]. From [21, Thm. 4], holds with probability at least , thus confirming with the same probability. Moreover,
where is due to a known estimate from [21, Eq. (A.34)]:
Thus, with probability at least , (4.3) holds if with sufficiently large . The proof can be concluded by invoking Proposition 1. ∎
B.2 Uniform Recovery Guarantee
We need some auxiliary results to support the proof. The first one is a concentration inequality for product processes due to Mendelson [67]; the following statement can be directly adapted from [40, Thm. 8] by specifying the pseudo-metrics as the -distance.
Lemma 9.
(Concentration of Product Process). Let and be stochastic processes indexed by two sets , , both defined on a common probability space . We assume that there exist such that
Finally, let be independent copies of a random variable , then for every the following holds with probability at least
where with is the Gaussian width of , and similarly, is the Gaussian width of .
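Here, for a bounded set $T\subset\mathbb{R}^n$, the Gaussian width is the standard quantity
\[
w(T)\ :=\ \mathbb{E}\sup_{t\in T}\,\langle g,t\rangle,\qquad g\sim\mathcal{N}(0,I_n).
\]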
We will use the following result that can be found in [61, Thm. 8].
Lemma 10.
Let be a random process indexed by points in a bounded set . Assume that the process has sub-Gaussian increments, i.e., there exists such that holds for any . Then for every , the event
holds with probability at least , where denotes the diameter of .
B.2.1 Proof of Theorem 13
Proof.
We start from the optimality condition . By substituting and performing some algebra, we obtain
Due to the constraint we have ; then, similarly to the decomposition used earlier, we can show that holds. Thus, letting , the following holds uniformly for all
(B.12)
Similarly to the previous developments, our strategy is to lower bound the left-hand side while upper bounding the right-hand side, with the difference that the bounds must be valid uniformly for all .
Step 1. Bounding the Left-Hand Side From Below
Letting and restricting to , we use the bound established in Step 2 of the proof of Theorem 9; then, for some constant depending on , with probability at least
Thus, if , then it holds that .
Step 2. Bounding the Right-Hand Side Uniformly
To pursue uniformity over , we take a supremum by replacing the specific with , and then consider an upper bound on
(B.13)
where ; note that depend on , and we will use the notation to indicate such dependence when necessary. In this proof, the ranges of and (e.g., in a supremum), if omitted, are and , respectively. Now let the quantization noise be ; observing that , we can first decompose as
(B.14)
where is the term arising from quantization, is the concentration term involving the truncation of heavy-tailed data, for which we develop some new machinery, is the bias term, and is a more regular concentration term that can be bounded via Lemma 9. In the remainder of the proof, we bound separately and finally deal with .
Step 2.1. Bounding
Using , is concerned with the concentration of the product process about its mean. It is natural to apply Lemma 9 towards this end, but we lack a good bound on because of the heavy-tailedness of (on the other hand, the bound is too crude to yield a sharp rate). Our strategy was already introduced in the main body: we introduce and decompose as
Thus, it suffices to bound and .
Step 2.1.1. Bounding
We use Lemma 9 to bound . For any , it is evident that we have and . Regarding indexed by , the -Lipschitzness of gives , and then the definition of the sub-Gaussian norm yields (this addresses the aforementioned issue). Further, for any we verify the sub-Gaussian increments
which leads to . With these preparations, we can invoke Lemma 9 and use the well-known estimates (in fact, we have the tighter estimate (e.g., [76]), but we simply put to be consistent with earlier results concerning unconstrained Lasso) to obtain that, with probability at least , we have
(B.15)
Therefore, we can set in (B.15); under the scaling of , this provides
(B.16)
Step 2.1.2. Bounding
By we have . Then, to apply Bernstein’s inequality, for integer and , analogously to (B.2) in the proof of Theorem 9, we can bound
for some , . Then we use Lemma 1 to obtain that, with probability at least we have
Then we use , set , and take a union bound over to obtain that holds with probability at least , which implies the following under the same probability
(B.17)
Therefore, combining (B.16) and (B.17), we obtain that with the promised probability.
Step 2.2. Bounding
For this bias term, the supremum does not make things harder. We begin as follows: fix any ; we have . Then, following arguments similar to (B.3), we obtain
which implies .
Step 2.3. Bounding
It is evident that we can apply Lemma 9 with , , and hence with . Along with , we obtain that the following holds with probability at least :
By taking , under the scaling , it follows that with probability at least .
Step 2.4. Bounding
It remains to bound . Bounding is similar to establishing the “limited projection distortion (LPD)” property in [94], but the key distinction is that and in take values in different spaces.
The main difficulty associated with “” lies in the discontinuity of the quantization noise , which we overcome by a covering argument and some machinery developed in [94, Prop. 6.1]. However, the essential difference from [94] is that we use Lemma 10 to handle “”, while [94] again used a covering argument for to strengthen their Proposition 6.1 to their Proposition 6.2, which is unfortunately insufficient in our setting because the covering number of increases significantly under a smaller covering radius (on the other hand, a covering argument for suffices for the analyses in [94] regarding a different estimator, named projected back projection).
Let us first construct a -net of denoted by , so that for any we can pick satisfying ; here, the covering radius is to be chosen later, and we assume that [76, Lemma 3.3]. As is standard in a covering argument, we first control over the net (by replacing “” with “”), and then bound the approximation error induced by such replacement.
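For orientation, a covering number estimate of the type given in [76, Lemma 3.3] (whose exact constants may differ) is the following: for $\Sigma_s\cap B_2^n=\{x\in\mathbb{R}^n:\|x\|_0\le s,\ \|x\|_2\le 1\}$ and any $\epsilon\in(0,1)$,
\[
\log\mathcal{N}\big(\Sigma_s\cap B_2^n,\ \epsilon\big)\ \le\ s\log\frac{en}{s}+s\log\frac{3}{\epsilon},
\]
obtained by first fixing a support of size $s$ (at most $\binom{n}{s}\le(en/s)^s$ choices) and then covering the corresponding $s$-dimensional unit ball by at most $(3/\epsilon)^s$ points.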
Step 2.4.1. Bounding over
In this step, we want to bound . First let us consider a fixed . Then since , we have . Because are independent and zero-mean, by [92, Prop. 2.6.1] we have . Define ; then for any we have
Thus, by Lemma 10, it holds with probability at least that
Moreover, by a union bound over , we obtain that holds with probability at least . We set and arrive at
(B.18)
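For reference, the property of sums of independent sub-Gaussian variables invoked above via [92, Prop. 2.6.1] is that, for independent mean-zero sub-Gaussian $X_1,\dots,X_n$,
\[
\Big\|\sum_{i=1}^n X_i\Big\|_{\psi_2}^2\ \le\ C\sum_{i=1}^n\|X_i\|_{\psi_2}^2
\]
for an absolute constant $C$.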
Step 2.4.2. Bounding the Approximation Error
From now on, we indicate the dependence of on by writing , where . For any we pick one such that ; we fix this correspondence and remember that from now on every is associated with some (which of course depends on , but our notation omits such dependence). Thus, we can bound as
(B.19)
Note that the bound on is available in (B.18), so it remains to bound , which can be understood as the approximation error of the net regarding the empirical process of interest. To facilitate the presentation we switch to more compact notation: let with rows be the sensing matrix, be the quantization error indexed by , be the random dither vector, be the heavy-tailed noise vector, and and be the measurement vector and truncated measurement vector, respectively. With these conventions we can write . Recall that a specific has been chosen for each , so defining allows us to write . Further, we define and make the following observation
(B.20)
We pause to establish a property of that holds w.h.p. Specifically, we restrict (B.4) to (recall that and so and ); then, under the promised probability, it holds for some that
Thus, when with for a large enough hidden constant depending on , it holds that
(B.21)
We proceed by assuming we are on this event, which allows us to bound as
(B.22)
To bound , motivated by (B.20), we will investigate and more carefully. We pick a threshold (to be chosen later), and by the -Lipschitzness of we have
(B.23)
where the last inequality holds because is -sparse, and hence (B.21) implies .
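The Lipschitz bound used in (B.23) is of an elementary nature; for instance, if the map in question is the truncation $\mathcal{T}_\tau(a):=\mathrm{sign}(a)\min\{|a|,\tau\}$ (an assumption on the notation here), then it is $1$-Lipschitz, being the projection onto the interval $[-\tau,\tau]$:
\[
|\mathcal{T}_\tau(a)-\mathcal{T}_\tau(b)|\ \le\ |a-b|\qquad\text{for all } a,b\in\mathbb{R}.
\]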
To proceed, for a specific we define the index vectors , and we use to denote the number of s in (with a similar meaning for ). Specifically, using entry-wise notation we define . Recall that (B.23) gives ; combined with the simple observation , we obtain a uniform bound on as
(B.24)
Next, we define the index vector for : first let , and then define . By Lemma 11, which we prove later, we have
(B.25)
Note that the event not happening (i.e., ) means that is continuous in ; combined with the definition of , this is also equivalent to the statement that “ remains constant in .” Thus, given a fixed and its associated , suppose that the -th entry of is zero (meaning that “ remains constant in ”); if additionally the -th entry of is zero (i.e., ), then the -th entry of vanishes:
combining with (B.20), this implies . Recall from (B.22) that we want to bound . Write and ; then, denoting the Hadamard product by and using the decomposition and (B.20), we have
(B.26)
where holds because the entries of equal those of whenever the index corresponds to , and in we use the simple bound and the bounds on derived in (B.24), (B.25) and (B.23), respectively.
Step 2.4.3. Concluding the Bound on
We are ready to put the pieces together, specify , and conclude the bound on . Overall, with probability at least for defined in (B.25) and some within the promised probability, combining (B.18), (B.19), (B.22) and (B.26) we obtain
Thus, we take the (near-optimal) choice of as and , under which, with the promised probability (as is also sufficiently small), we obtain the bound on as
(B.27)
We can now conclude the proof by combining all the work above. Substituting and the definition of in (B.13) into (B.12), we obtain that holds uniformly for all , which implies . Substituting the derived bounds on into (B.14), with the promised probability we have
so the uniform bound on follows immediately. Further using completes the proof.∎
Lemma 11.
(Bounding ). In the setting of the proof of Theorem 13, it holds that
(B.28)
Proof.
We use the notation and details from the proof of Theorem 13. We first consider a fixed ; by a simple shift, happens if and only if is discontinuous in , which is also equivalent to . Because and , this is valid independently of the location of . Thus, for fixed , conditioning on , follows the binomial distribution with trials and success probability . This allows us to write with i.i.d. following the Bernoulli distribution with success probability . Then, for any integer , we have . Now we invoke Bernstein’s inequality (Lemma 1) to obtain that for any ,
We let and take a union bound over ; this yields the desired claim since . ∎
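For completeness, a moment bound of the type needed before invoking Lemma 1 above follows from the elementary fact that, for $B\sim\mathrm{Bernoulli}(p)$, one has $|B-p|\le 1$ almost surely, so that for every integer $k\ge 2$,
\[
\mathbb{E}|B-p|^k\ \le\ \mathbb{E}(B-p)^2\ =\ p(1-p)\ \le\ p.
\]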