
Robust Non-adaptive Group Testing under Errors in Group Membership Specifications

Shuvayan Banerjee, Radhendushka Srivastava, James Saunderson and Ajit Rajwade. Shuvayan Banerjee is with IITB-Monash Research Academy. Radhendushka Srivastava is with the Department of Mathematics, IIT Bombay. James Saunderson is with the Department of Electrical and Computer Systems Engineering, Monash University. Ajit Rajwade is with the Department of Computer Science and Engineering, IIT Bombay.
Abstract

Given $p$ samples, each of which may or may not be defective, group testing aims to determine their defect status indirectly by performing tests on $n<p$ ‘groups’ (also called ‘pools’), where a group is formed by mixing a subset of the $p$ samples. Under the assumption that the number of defective samples is very small compared to $p$, group testing algorithms have provided excellent recovery of the status of all $p$ samples with even a small number of groups. Most existing methods, however, assume that the group memberships are accurately specified. This assumption may not hold in all applications, due to various resource constraints. For example, such errors could occur when a technician, preparing the groups in a laboratory, unknowingly mixes together an incorrect subset of samples as compared to what was specified. We develop a new group testing method, the Debiased Robust Lasso Test Method (Drlt), that handles such group membership specification errors. The proposed Drlt method is based on an approach to debias, or reduce the inherent bias in, estimates produced by Lasso, a popular and effective sparse regression technique. We also provide theoretical upper bounds on the reconstruction error produced by our estimator. Our approach is then combined with two carefully designed hypothesis tests, respectively for (i) the identification of defective samples in the presence of errors in group membership specifications, and (ii) the identification of groups with erroneous membership specifications. The Drlt approach extends the literature on bias mitigation of statistical estimators such as the Lasso to handle the important case where some of the measurements contain outliers, due to factors such as group membership specification errors. We present several numerical results which show that our approach outperforms several intuitive baselines and robust regression techniques for identification of defective samples as well as erroneously specified groups.

Index Terms:
Group testing, debiasing, Lasso, non-adaptive group testing, hypothesis testing, compressed sensing

I Introduction

Group testing is a well-studied area of data science, information theory and signal processing, dating back to the classical work of Dorfman in [14]. Consider $p$ samples, one per subject, where each sample is either defective or non-defective. In the case of defective samples, additional quantitative information regarding the extent or severity of the defect in the sample may be available. Group testing typically replaces individual testing of these $p$ samples by testing of $n<p$ ‘groups’ of samples, thereby saving on the number of tests. Each group (also called a ‘pool’) consists of a mixture of small, equal portions taken from a subset of the $p$ samples. Let the (perhaps noisy) test results on the $n$ groups be arranged in an $n$-dimensional vector $\boldsymbol{z}$. Let the true status of each of the $p$ samples be arranged in a $p$-dimensional vector $\boldsymbol{\beta^{*}}$. The aim of group testing is to infer $\boldsymbol{\beta^{*}}$ from $\boldsymbol{z}$ given accurate knowledge of the group memberships. We encode group memberships in an $n\times p$ binary matrix $\boldsymbol{B}$ (called the ‘pooling matrix’), where $B_{ij}=1$ if the $j^{\text{th}}$ sample is a member of the $i^{\text{th}}$ group, and $B_{ij}=0$ otherwise. If the overall status of a group is the sum of the status values of each of the samples that participated in the group, we have:

$$\boldsymbol{z}=\boldsymbol{B\beta^{*}}+\boldsymbol{\tilde{\eta}}, \qquad (1)$$

where $\boldsymbol{\tilde{\eta}}\in\mathbb{R}^{n}$ is a noise vector. In a large body of the literature on group testing (e.g., [5, 10, 14]), $\boldsymbol{z}$ and $\boldsymbol{\beta^{*}}$ are modeled as binary vectors, leading to the forward model $\boldsymbol{z}=\mathfrak{N}(\boldsymbol{B\beta^{*}})$, where the matrix-vector ‘multiplication’ $\boldsymbol{B\beta^{*}}$ involves binary OR and AND operations instead of sums and products, and $\mathfrak{N}$ is a noise operator that could flip some of the bits in $\boldsymbol{z}$ at random. In this work, however, we consider $\boldsymbol{z}$ and $\boldsymbol{\beta^{*}}$ to be vectors in $\mathbb{R}^{n}$ and $\mathbb{R}^{p}$ respectively, as also done in [23, 19, 47], and adopt the linear model (1). This enables encoding of quantitative information in $\boldsymbol{z}$ and $\boldsymbol{\beta^{*}}$, and $\boldsymbol{B\beta^{*}}$ now involves the usual matrix-vector multiplication.

In commonly considered situations, the number of non-zero (i.e., defective) samples, $s\triangleq\|\boldsymbol{\beta^{*}}\|_{0}$, is much less than $p$, and $\beta^{*}_{j}=0$ indicates that the $j^{\text{th}}$ sample is non-defective, where $1\leq j\leq p$. In such cases, group testing algorithms have shown excellent results for the recovery of $\boldsymbol{\beta^{*}}$ from $\boldsymbol{z}$ and $\boldsymbol{B}$. These algorithms are surveyed in detail in [3] and can be classified into two broad categories: adaptive and non-adaptive. Adaptive algorithms [14, 23, 25] process the measurements (i.e., the results of pooled tests available in $\boldsymbol{z}$) in two or more stages of testing, where the output of each stage determines the choice of pools in the subsequent testing stage. Non-adaptive algorithms [47, 19, 20, 6], on the other hand, process the measurements with only a single stage of testing. Non-adaptive algorithms are known to be more efficient in terms of time as well as the number of tests required, at the cost of somewhat higher recovery errors, as compared to adaptive algorithms [30, 19]. In this work, we focus on non-adaptive algorithms.
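The following minimal numpy sketch (with illustrative parameter values, not the paper's experimental settings) simulates the linear model (1): a sparse real-valued $\boldsymbol{\beta^{*}}$, a Bernoulli(0.5) pooling matrix $\boldsymbol{B}$, and noisy pooled measurements $\boldsymbol{z}$.

```python
# Minimal sketch of the linear group-testing model (1); sizes and noise level are assumptions.
import numpy as np

rng = np.random.default_rng(0)
p, n, s = 500, 200, 10                     # samples, pools, number of defectives
beta = np.zeros(p)
beta[rng.choice(p, s, replace=False)] = rng.uniform(50, 100, s)   # defective "viral loads"

B = rng.binomial(1, 0.5, size=(n, p))      # sample j is in pool i iff B[i, j] == 1
sigma_tilde = 0.01 * np.abs(B @ beta).mean()
z = B @ beta + sigma_tilde * rng.standard_normal(n)               # noisy pooled measurements

print("non-zero entries of beta*:", np.count_nonzero(beta))
print("first five pooled measurements:", np.round(z[:5], 2))
```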

Problem Motivation: In the recent COVID-19 pandemic, RT-PCR (reverse transcription polymerase chain reaction) has been the primary method of testing a person for this disease. Due to widespread shortage of various resources for testing, group testing algorithms were widely employed in many countries [1]. Many of these approaches used Dorfman testing [14] (an adaptive algorithm), but non-adaptive algorithms have also been recommended or used for this task [19, 47, 6]. In this application, the vectors $\boldsymbol{\beta^{*}}$ and $\boldsymbol{z}$ refer to the real-valued viral loads in the individual samples and the pools respectively, and $\boldsymbol{B}$ is again a binary pooling matrix. In a pandemic situation, there is heavy demand on testing labs. This leads to practical challenges for the technicians implementing pooling, due to factors such as (i) a heavy workload, (ii) differences in pooling protocols across different labs, and (iii) the fact that pooling is inherently more complicated than individual sample testing [50], [15, ‘Results’]. Due to this, there is the possibility of a small number of inadvertent errors in creating the pools. This causes a difference between a few entries of the pre-specified matrix $\boldsymbol{B}$ and the actual matrix $\boldsymbol{\hat{B}}$ used for pooling. Note that $\boldsymbol{B}$ is known whereas $\boldsymbol{\hat{B}}$ is unknown in practice. The sparsity of the difference between $\boldsymbol{B}$ and $\boldsymbol{\hat{B}}$ is a reasonable assumption if the technicians are competent; hence only a small number of group membership specifications contain errors. This issue of errors during pool creation is well documented in several independent sources such as [50], [15, ‘Results’], [13, Page 2], [44], [51, Sec. 3.1], [12, ‘Discussion’], [21, ‘Specific consideration related to SARS-CoV-2’] and [2, ‘Laboratory infrastructure’]. However, the vast majority of group testing algorithms — adaptive as well as non-adaptive — do not account for these errors. To the best of our knowledge, this is the first piece of work on the problem of a mismatched pooling matrix (i.e., a pooling matrix that contains errors in group membership specifications) for non-adaptive group testing with real-valued $\boldsymbol{\beta^{*}}$ and (possibly) noisy $\boldsymbol{z}$. We emphasize that besides pooled RT-PCR testing, faulty specification of pooling matrices may also naturally occur in group testing in many other scenarios, for example when applied to verification of electronic circuits [29]. Another scenario is in epidemiology [11], for identifying infected individuals who come into contact with agents who are sent to mix with individuals in the population. The health status of various individuals is inferred from the health status of the agents. However, sometimes an agent may remain uninfected even upon coming into contact with an infected individual, which can be interpreted as an error in the pooling matrix.

Related Work: We now comment on two related pieces of work which deal with group testing with errors in pooling matrices via non-adaptive techniques. The work in [11] considers probabilistic and structured errors in the pooling matrix, where an entry $b_{ij}$ with a value of 1 could flip to 0 with a probability of 0.5, but not vice versa, i.e., a genuinely zero-valued $b_{ij}$ never flips to 1. The work in [35] considers a small number of ‘pretenders’ in the unknown binary vector $\boldsymbol{\beta^{*}}$, i.e., there exist elements in $\boldsymbol{\beta^{*}}$ which flip from 1 to 0 with probability 0.5, but not vice versa. Both these techniques consider binary-valued vectors $\boldsymbol{z}$ and $\boldsymbol{\beta^{*}}$, unlike the real-valued vectors considered in this work. They also do not consider noise in $\boldsymbol{z}$ in addition to the errors in $\boldsymbol{B}$. Furthermore, we also present a method to identify the errors in $\boldsymbol{B}$, unlike the techniques in [11, 35]. Due to these differences between our work and [11, 35], a direct numerical comparison between our results and theirs would not be meaningful.

Sensing Matrix Perturbation in Compressed Sensing: There exists a close relationship between the group testing problem and the problem of signal reconstruction in compressed sensing (CS), as explored in [19, 20]. Likewise, there is literature in the CS community which deals with perturbations in sensing matrices [39, 40, 53, 27, 4, 24, 18]. However, these works either consider dense random perturbations (i.e., perturbations in every entry) [53, 27, 40, 4, 24, 18] or perturbations in specifications of Fourier frequencies [39, 26]. These perturbation models are vastly different from the sparse set of errors in binary matrices considered in this work. Furthermore, apart from [39, 26], these techniques only perform robust signal estimation, without any attempt to identify the rows of the sensing matrix which contained those errors.

Overview of contributions: In this paper, we present a robust approach for recovering $\boldsymbol{\beta^{*}}\in\mathbb{R}^{p}$ from noisy $\boldsymbol{z}\in\mathbb{R}^{n}$ when $n<p$, given a (known) pre-specified pooling matrix $\boldsymbol{B}$, but where the measurements in $\boldsymbol{z}$ have been generated with another, unknown pooling matrix $\boldsymbol{\hat{B}}$, i.e., $\boldsymbol{z}=\boldsymbol{\hat{B}\beta^{*}}+\boldsymbol{\tilde{\eta}}$. The approach, which we call the Debiased Robust Lasso Test Method or Drlt, extends existing work on ‘debiasing’ the well-known Lasso estimator in statistics [28] to also handle errors in $\boldsymbol{B}$. In this approach, we present a principled method to identify which measurements in $\boldsymbol{z}$ correspond to rows with errors in $\boldsymbol{B}$, using hypothesis testing. We also present an algorithm for direct estimation of $\boldsymbol{\beta^{*}}$ and a hypothesis test for identification of the defective samples in $\boldsymbol{\beta^{*}}$, given errors in $\boldsymbol{B}$. We establish desirable properties of these statistical tests, such as consistency. Though our approach was initially motivated by pooling errors during preparation of pools of COVID-19 samples, it is broadly applicable to any group-testing problem where the pool membership specifications contain errors.

Notations: Throughout this paper, $\boldsymbol{I_{n}}$ denotes the identity matrix of size $n\times n$. We use the notation $[n]\triangleq\{1,2,\cdots,n\}$ for $n\in\mathbb{Z}_{+}$. Given a matrix $\boldsymbol{A}$, its $i^{\text{th}}$ row is denoted by $\boldsymbol{a_{i.}}$, its $j^{\text{th}}$ column is denoted by $\boldsymbol{a_{.j}}$, and its $(i,j)^{\text{th}}$ element is denoted by $a_{ij}$. The $i^{\text{th}}$ column of the identity matrix is denoted by $\boldsymbol{e_{i}}$. For any vector $\boldsymbol{z}\in\mathbb{R}^{n}$ and index set $S\subseteq[n]$, we define $\boldsymbol{z}_{S}\in\mathbb{R}^{n}$ such that $(z_{S})_{i}=z_{i}$ for all $i\in S$ and $(z_{S})_{i}=0$ for all $i\notin S$. $S^{c}$ denotes the complement of the set $S$. We define the entrywise $\ell_{\infty}$ norm of a matrix $\boldsymbol{A}$ as $|\boldsymbol{A}|_{\infty}\triangleq\max_{i,j}|a_{ij}|$. Consider two real-valued random sequences $x_{n}$ and $r_{n}$. We say that $x_{n}$ is $o_{P}(r_{n})$ if $x_{n}/r_{n}\rightarrow 0$ in probability, i.e., $\lim_{n\rightarrow\infty}P(|x_{n}/r_{n}|\geq\epsilon)=0$ for any $\epsilon>0$. Also, we say that $x_{n}$ is $O_{P}(r_{n})$ if $x_{n}/r_{n}$ is bounded in probability, i.e., for any $\epsilon>0$ there exist $m,n_{0}>0$ such that $P(|x_{n}/r_{n}|<m)\geq 1-\epsilon$ for all $n>n_{0}$.

Organization of the paper: The noise model involving measurement noise and errors in the pooling matrix is presented in Sec. II. Our core technique, Drlt, is presented in Sec. III, with essential background literature summarized in Sec. III-B, and our key innovations presented in Sec. III-C as well as Sec. IV. Detailed experimental results are given in Sec. V. We conclude in Sec. VI. Proofs of all theorems and lemmas are provided in Appendices A, B and C.

II Problem formulation

II-A Basic Noise Model

We now formally describe the model setup used in this paper. Suppose $\boldsymbol{\beta^{*}}\in\mathbb{R}^{p}$ and that the elements of the $n\times p$ pooling matrix $\boldsymbol{B}$ in (1) are independently drawn from $\text{Bernoulli}(0.5)$. Additionally, let $\boldsymbol{\beta^{*}}$ be sparse with at most $s\ll p$ non-zero elements. Assume that the elements of the noise vector $\boldsymbol{\tilde{\eta}}\in\mathbb{R}^{n}$ in (1) are independent and identically distributed (i.i.d.) Gaussian random variables with mean 0 and variance $\tilde{\sigma}^{2}$. Note that, throughout this work, we assume $\tilde{\sigma}^{2}$ to be known. The Lasso estimator $\boldsymbol{\hat{\beta}}$, used to estimate $\boldsymbol{\beta^{*}}$, is defined as

$$\boldsymbol{\hat{\beta}}=\arg\min_{\boldsymbol{\beta}}\frac{1}{2n}\|\boldsymbol{z}-\boldsymbol{B\beta}\|^{2}_{2}+\lambda\|\boldsymbol{\beta}\|_{1}. \qquad (2)$$

Given a sufficient number of measurements, the Lasso is known to be consistent for sparse $\boldsymbol{\beta^{*}}$ [22, Chapter 11] if the penalty parameter $\lambda>0$ is chosen appropriately and if $\boldsymbol{B}$ satisfies the Restricted Eigenvalue Condition (REC).¹ Certain deterministic binary pooling matrices can also be used, as in [19, 47], to obtain a consistent estimator of $\boldsymbol{\beta^{*}}$. However, we focus on the chosen random pooling matrix in this paper.

¹Restricted Eigenvalue Condition: For some constant $\eta^{c}\geq 1$ and $S\subseteq\{1,2,\ldots,n\}$, let $C_{\eta^{c}}(S)\triangleq\{\boldsymbol{\Delta}\in\mathbb{R}^{n}:\|\boldsymbol{\Delta_{S^{c}}}\|_{1}\leq\eta^{c}\|\boldsymbol{\Delta_{S}}\|_{1}\}$. We say that an $m\times n$ matrix $\boldsymbol{B}$ satisfies the REC if, for a constant $\gamma>0$ (the RE constant), we have $\frac{1}{p}\|\boldsymbol{B\Delta}\|_{2}^{2}\geq\gamma\|\boldsymbol{\Delta}\|_{2}^{2}$ for every $\boldsymbol{\Delta}\in C_{\eta^{c}}(S)$.

It is more convenient for analysis via the REC, and more closely related to the theory in [28], if the elements of the pooling matrix have mean 0. Since the elements of $\boldsymbol{B}$ are drawn independently from $\text{Bernoulli}(0.5)$, $\boldsymbol{B}$ does not obey the mean-zero property. Hence, we transform the random binary matrix $\boldsymbol{B}$ to a random Rademacher matrix $\boldsymbol{A}\triangleq 2\boldsymbol{B}-\boldsymbol{1_{n\times p}}$, which is a simple one-to-one transformation similar to that adopted in [42] for Poisson compressive measurements. (Note that $\boldsymbol{1_{n\times p}}$ refers to a matrix of size $n\times p$ containing all ones.) We also transform the measurements in $\boldsymbol{z}$ to equivalent measurements $\boldsymbol{y}$ associated with the Rademacher matrix $\boldsymbol{A}$.

The expression for each measurement in $\boldsymbol{y}$ is now given by:

$$\forall i\in[n],\ y_{i}=\boldsymbol{a_{i.}}\boldsymbol{\beta^{*}}+\eta_{i}\ \implies\ \boldsymbol{y}=\boldsymbol{A\beta^{*}}+\boldsymbol{\eta}, \qquad (3)$$

where $\eta_{i}\sim\mathcal{N}(0,\sigma^{2})$ with $\sigma^{2}\triangleq 4\tilde{\sigma}^{2}$. We will henceforth work with $\boldsymbol{y}$ and $\boldsymbol{A}$ for the Lasso estimates in the following manner: the Lasso estimator $\boldsymbol{\hat{\beta}}$, used to estimate $\boldsymbol{\beta^{*}}$, is now defined as

$$\boldsymbol{\hat{\beta}}=\arg\min_{\boldsymbol{\beta}}\frac{1}{2n}\|\boldsymbol{y}-\boldsymbol{A\beta}\|^{2}_{2}+\lambda\|\boldsymbol{\beta}\|_{1}. \qquad (4)$$

II-B Model Mismatch Errors

Consider the measurement model defined in (3). We now examine the effect of mis-specification of samples in a pool. That is, we consider the case where, due to errors in mixing of the samples, the pools are generated using an unknown matrix $\boldsymbol{\hat{A}}$ instead of the pre-specified matrix $\boldsymbol{A}$. Note that $\boldsymbol{A}$ and $\boldsymbol{\hat{A}}$ are respectively obtained from $\boldsymbol{B}$ and $\boldsymbol{\hat{B}}$. The elements of the matrices $\boldsymbol{\hat{A}}$ and $\boldsymbol{A}$ are equal everywhere except at the misspecified samples in each pool. We refer to these errors in group membership specifications as ‘bit-flips’. For example, suppose that the $i^{\text{th}}$ pool is specified to consist of samples $j_{1},j_{2},j_{3}\in[p]$, but due to errors during pool creation, the $i^{\text{th}}$ pool is generated using samples $j_{1},j_{2},j_{5}$. In this specific instance, $a_{i,j_{3}}\neq\hat{a}_{i,j_{3}}$ and $a_{i,j_{5}}\neq\hat{a}_{i,j_{5}}$.

In practice, note that $\boldsymbol{A}$ is known whereas $\boldsymbol{\hat{A}}$ is unknown. Moreover, the locations of the bit-flips are unknown. Hence they induce signal-dependent and possibly large ‘model mismatch errors’ $\delta_{i}^{*}\triangleq(\boldsymbol{\hat{a}_{i.}}-\boldsymbol{a_{i.}})\boldsymbol{\beta^{*}}$ in the $i^{\text{th}}$ measurement. In the presence of bit-flips, the model in (3) can be expressed as:

$$y_{i}=\boldsymbol{a_{i.}\beta^{*}}+\delta_{i}^{*}+\eta_{i},\ \mbox{for}\ i\in[n],\ \implies\ \boldsymbol{y}=\boldsymbol{A\beta^{*}}+\boldsymbol{\delta^{*}}+\boldsymbol{\eta}=(\boldsymbol{A}|\boldsymbol{I_{n}})\begin{pmatrix}\boldsymbol{\beta^{*}}\\ \boldsymbol{\delta^{*}}\end{pmatrix}+\boldsymbol{\eta}. \qquad (5)$$

We assume $\boldsymbol{\delta^{*}}$, which we call the ‘model mismatch error’ (MME) vector in $\mathbb{R}^{n}$, to be sparse, with $r\triangleq\|\boldsymbol{\delta^{*}}\|_{0}\ll n$. This implies that at least $r$ rows of $\boldsymbol{\hat{A}}$ have bit-flips. The sparsity assumption is reasonable in many applications (e.g., given a competent technician performing the pooling).

Suppose that for a fixed $i\in[n]$, $\boldsymbol{\hat{a}_{i.}}$ contains a bit-flip at index $j$. If $\beta_{j}^{*}=0$, then $\delta^{*}_{i}$ remains 0 despite the presence of the bit-flip in $\boldsymbol{\hat{a}_{i.}}$; such a bit-flip has no effect on the measurements and is not identifiable from them. However, if $\beta_{j}^{*}$ is non-zero, then $\delta_{i}^{*}$ is also non-zero. Such a bit-flip adversely affects the measurement, and we henceforth refer to it as an effective bit-flip or effective MME.
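The following small numpy sketch (with assumed problem sizes) illustrates how a bit-flip at an index in the support of $\boldsymbol{\beta^{*}}$ produces a non-zero entry of the MME vector $\boldsymbol{\delta^{*}}=(\boldsymbol{\hat{A}}-\boldsymbol{A})\boldsymbol{\beta^{*}}$ appearing in (5).

```python
# Illustrative sketch: bit-flips in the pooling matrix induce a sparse MME vector delta*.
import numpy as np

rng = np.random.default_rng(1)
p, n, s, r = 500, 200, 10, 5
beta = np.zeros(p)
support = rng.choice(p, s, replace=False)
beta[support] = rng.uniform(50, 100, s)

A = 2 * rng.binomial(1, 0.5, size=(n, p)) - 1        # pre-specified Rademacher matrix
A_hat = A.copy()
flipped_rows = rng.choice(n, r, replace=False)
for i in flipped_rows:                               # one "effective" bit-flip per chosen row,
    j = rng.choice(support)                          # placed where beta*_j != 0
    A_hat[i, j] = -A_hat[i, j]

delta = (A_hat - A) @ beta                           # MME vector; non-zero only on flipped rows
print("rows with effective MMEs:", sorted(flipped_rows.tolist()))
print("non-zero entries of delta*:", np.flatnonzero(delta))
```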

III Debiased Robust Lasso Test (Drlt) Method

We now present our proposed approach, named the ‘Debiased Robust Lasso Test Method’ (Drlt), for recovering the signal $\boldsymbol{\beta^{*}}$ given measurements $\boldsymbol{y}$ obtained from the erroneous, unknown matrix $\boldsymbol{\hat{A}}$, which is different from the pre-specified, known sensing matrix $\boldsymbol{A}$. The main objectives of this work are:

  1. Aim (i):

    Estimation of $\boldsymbol{\beta^{*}}$ under model mismatch, and development of a statistical test to determine whether or not the $j^{\text{th}}$ sample, $j\in[p]$, is defective/diseased.

  2. Aim (ii):

    Development of a statistical test to determine whether or not $\boldsymbol{\hat{a}_{i.}}$, $i\in[n]$, contains effective MMEs.

A measurement containing an effective MME will appear to be an outlier in comparison to the other measurements, due to the non-zero values in $\boldsymbol{\delta^{*}}$. Therefore, identification of measurements containing MMEs is equivalent to determining the non-zero entries of $\boldsymbol{\delta^{*}}$. This idea is inspired by the concept of ‘studentized residuals’, which is widely used in the statistics literature to identify outliers in full-rank regression models [36]. Since our model operates in a compressive regime where $n<p$, the distributional properties of studentized residuals may not hold. Therefore, we develop our Drlt method, which is tailored to the compressive regime.

Our basic estimator for $\boldsymbol{\beta^{*}}$ and $\boldsymbol{\delta^{*}}$ from $\boldsymbol{y}$ and $\boldsymbol{A}$ is given as

$$\begin{pmatrix}\boldsymbol{\hat{\beta}_{\lambda_{1}}}\\ \boldsymbol{\hat{\delta}_{\lambda_{2}}}\end{pmatrix}=\arg\min_{\boldsymbol{\beta},\boldsymbol{\delta}}\frac{1}{2n}\left\|\boldsymbol{y}-\boldsymbol{A\beta}-\boldsymbol{\delta}\right\|^{2}_{2}+\lambda_{1}\|\boldsymbol{\beta}\|_{1}+\lambda_{2}\|\boldsymbol{\delta}\|_{1}, \qquad (6)$$

where $\lambda_{1},\lambda_{2}$ are appropriately chosen regularization parameters. This estimator is a robust version of Lasso regression [38]. The robust Lasso, just like the Lasso, incurs a bias due to the $\ell_{1}$ penalty terms.
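As a rough illustration, the robust Lasso in (6) can be solved with a generic convex solver. The following Python/cvxpy sketch is only indicative (the paper itself used CVX in MATLAB, and the data-generation step here is an assumption, not the paper's protocol); it uses the regularization parameters $\lambda_{1},\lambda_{2}$ prescribed in Theorem 1.

```python
# Hedged sketch of the robust Lasso (6) with cvxpy; data generation is illustrative only.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
p, n, s, r = 300, 150, 8, 4
beta_true = np.zeros(p)
beta_true[rng.choice(p, s, replace=False)] = rng.uniform(50, 100, s)
A = 2 * rng.binomial(1, 0.5, size=(n, p)) - 1
delta_true = np.zeros(n)
delta_true[rng.choice(n, r, replace=False)] = rng.uniform(-100, 100, r)
sigma = 0.01 * np.abs(A @ beta_true).mean()
y = A @ beta_true + delta_true + sigma * rng.standard_normal(n)

lam1 = 4 * sigma * np.sqrt(np.log(p) / n)       # lambda_1 as in Theorem 1
lam2 = 4 * sigma * np.sqrt(np.log(n)) / n       # lambda_2 as in Theorem 1
beta, delta = cp.Variable(p), cp.Variable(n)
obj = (cp.sum_squares(y - A @ beta - delta) / (2 * n)
       + lam1 * cp.norm1(beta) + lam2 * cp.norm1(delta))
cp.Problem(cp.Minimize(obj)).solve()
print("||beta_hat - beta*||_1 =", np.abs(beta.value - beta_true).sum())
print("||delta_hat - delta*||_1 =", np.abs(delta.value - delta_true).sum())
```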

The work in [28] provides a method to mitigate the bias in the Lasso estimate and produces a ‘debiased’ signal estimate whose distribution turns out to be approximately Gaussian with specific observable parameters in the compressive regime (for details, see [28] and Sec. III-B below). However, the work in [28] does not take into account errors in sensing matrix specification. We non-trivially adapt the techniques of [28] to our specific application which considers MMEs in the pooling matrix, and we also develop novel procedures to realize Aims (i) and (ii) mentioned above.

We first review important concepts which are used to develop our method for the specified aims, and then develop the method in the rest of this section. Before that, however, we present error bounds on the estimates $\boldsymbol{\hat{\beta}_{\lambda_{1}}}$ and $\boldsymbol{\hat{\delta}_{\lambda_{2}}}$ from (6), which are non-trivial extensions of results in [38]. These bounds are essential in developing hypothesis tests to achieve Aims (i) and (ii).

III-A Bounds on the Robust Lasso Estimate

When the sensing matrix $\boldsymbol{A}$ is i.i.d. Gaussian, upper bounds on $\|\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}_{\lambda_{1}}}\|_{2}$ and $\|\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\|_{2}$ have been presented in [38]. In our case, $\boldsymbol{A}$ is i.i.d. Rademacher, and hence some modifications to the results of [38] are required. We now state a theorem giving upper bounds on the reconstruction errors of both $\boldsymbol{\hat{\beta}_{\lambda_{1}}}$ and $\boldsymbol{\hat{\delta}_{\lambda_{2}}}$ for a random Rademacher pooling matrix $\boldsymbol{A}$. We further use the so-called ‘cone constraint’ to derive separate bounds on the estimates of both $\boldsymbol{\beta^{*}}$ and $\boldsymbol{\delta^{*}}$. These bounds will be very useful in deriving theoretical results for debiasing.

Theorem 1

Let $\boldsymbol{\hat{\beta}_{\lambda_{1}}},\boldsymbol{\hat{\delta}_{\lambda_{2}}}$ be as in (6) and set $\lambda_{1}\triangleq\frac{4\sigma\sqrt{\log p}}{\sqrt{n}}$, $\lambda_{2}\triangleq\frac{4\sigma\sqrt{\log n}}{n}$. Let $n<p$, $\mathcal{S}\triangleq\{j:\beta^{*}_{j}\neq 0\}$, $\mathcal{R}\triangleq\{i:\delta^{*}_{i}\neq 0\}$, $s\triangleq|\mathcal{S}|$ and $r\triangleq|\mathcal{R}|$. If $|\boldsymbol{A}|_{\infty}\leq 1$ and $\boldsymbol{A}$ satisfies the Extended Restricted Eigenvalue Condition (EREC) from Definition 1 with $\kappa>0$ and with respect to the cone $\mathcal{C}(\mathcal{S},\mathcal{R},(\sqrt{n}\lambda_{2})/\lambda_{1})$, then we have the following:

  1. (1)
     $$P\left(\left\|\boldsymbol{\hat{\beta}_{\lambda_{1}}}-\boldsymbol{\beta^{*}}\right\|_{1}\leq 48\kappa^{-2}(s+r)\sigma\sqrt{\frac{\log p}{n}}\right)\geq 1-\left(\frac{1}{p}+\frac{1}{n}\right). \qquad (7)$$
  2. (2)

     Additionally, if $n\log n\geq(48\kappa^{-2})^{2}(s+r)^{2}\log p$, then

     $$P\left(\left\|\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}}\right\|_{1}\leq\frac{24\sigma r\sqrt{\log n}}{n}\right)\geq 1-\left(\frac{1}{p}+\frac{2}{n}\right). \qquad (8)$$

\blacksquare

In Lemma 1 of Sec. A, we show that the chosen random Rademacher pooling matrix $\boldsymbol{A}$ satisfies the EREC with $\kappa=1/16$ if $\lambda_{1}$ and $\lambda_{2}$ are chosen as in Theorem 1. Furthermore, $|\boldsymbol{A}|_{\infty}=1$.
Remarks on Theorem 1:

  1. 1.

    From part (1), we see that $\left\|\boldsymbol{\hat{\beta}_{\lambda_{1}}}-\boldsymbol{\beta^{*}}\right\|_{1}=O_{P}\left((s+r)\sqrt{\frac{\log p}{n}}\right)$.

  2. 2.

    From part (2), we see that $\left\|\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}}\right\|_{1}=O_{P}\left(\frac{r\sqrt{\log n}}{n}\right)$.

  3. 3.

    The upper bounds on the errors given in Theorem 1 increase with $\sigma$, as well as with $s$ and $r$, which is quite intuitive. They also decrease with $n$.

III-B A note on the Debiased Lasso

Let us consider the measurement vector from (5), momentarily setting $\boldsymbol{\delta^{*}}\triangleq\boldsymbol{0}$, i.e., we have $\boldsymbol{y}=\boldsymbol{A\beta^{*}}+\boldsymbol{\eta}$. Let $\boldsymbol{\hat{\beta}_{\lambda}}$ be the minimizer of the following Lasso problem

$$\min_{\boldsymbol{\beta}}\frac{1}{2n}\|\boldsymbol{y}-\boldsymbol{A\beta}\|^{2}_{2}+\lambda\|\boldsymbol{\beta}\|_{1}, \qquad (9)$$

for a given value of $\lambda$. Though the Lasso provides excellent theoretical guarantees [22, Chapter 11], it is well known that it produces biased estimates, i.e., $E(\boldsymbol{\hat{\beta}_{\lambda}})\neq\boldsymbol{\beta^{*}}$, where the expectation is over different instances of $\boldsymbol{\eta}$. The work in [28] replaces $\boldsymbol{\hat{\beta}_{\lambda}}$ by a ‘debiased’ estimate $\boldsymbol{\hat{\beta}_{d}}$ given by:

$$\boldsymbol{\hat{\beta}_{d}}=\boldsymbol{\hat{\beta}_{\lambda}}+\dfrac{1}{n}\boldsymbol{M}\boldsymbol{A}^{\top}(\boldsymbol{y}-\boldsymbol{A\hat{\beta}_{\lambda}}), \qquad (10)$$

where $\boldsymbol{M}$ is an approximate inverse (defined as in Alg. 1) of $\boldsymbol{\hat{\Sigma}}\triangleq\boldsymbol{A}^{\top}\boldsymbol{A}/n$. Substituting $\boldsymbol{y}=\boldsymbol{A\beta^{*}}+\boldsymbol{\eta}$ into (10) and treating $\frac{1}{n}\boldsymbol{MA}^{\top}\boldsymbol{A}$ as approximately equal to the identity matrix yields:

$$\boldsymbol{\hat{\beta}_{d}}=\boldsymbol{\hat{\beta}_{\lambda}}+\dfrac{1}{n}\boldsymbol{M}\boldsymbol{A}^{\top}(\boldsymbol{A\beta^{*}}+\boldsymbol{\eta}-\boldsymbol{A\hat{\beta}_{\lambda}})\approx\boldsymbol{\beta^{*}}+\dfrac{1}{n}\boldsymbol{M}\boldsymbol{A}^{\top}\boldsymbol{\eta}, \qquad (11)$$

which is referred to as a debiased estimate, since $E(\boldsymbol{\hat{\beta}_{d}})\approx\boldsymbol{\beta^{*}}$. Note that $\boldsymbol{\hat{\Sigma}}$ is not an invertible matrix since $n<p$; hence, the approximate inverse is obtained by solving a convex optimization problem as given in Alg. 1, where the minimization of the diagonal elements of $\boldsymbol{M\hat{\Sigma}M}^{\top}$ is motivated by minimizing the variance of $\boldsymbol{\hat{\beta}_{d}}$, as proved in [28, Sec. 2.1]. Furthermore, as proved in [28, Theorem 7], the convex problem in Alg. 1 is feasible with high probability if $\boldsymbol{\Sigma}\triangleq E[\boldsymbol{a_{i.}}(\boldsymbol{a_{i.}})^{\top}]$ (where the expectation is taken over the rows of $\boldsymbol{A}$) obeys some specific statistical properties (see later in this section).

Algorithm 1 Construction of $\boldsymbol{M}$ (Alg. 1 of [28])
Input: Measurement vector $\boldsymbol{y}$, design matrix $\boldsymbol{A}$, $\mu$
Output: $\boldsymbol{M}$
1:  Set $\boldsymbol{\hat{\Sigma}}=\boldsymbol{A}^{\top}\boldsymbol{A}/n$.
2:  Let $\boldsymbol{m_{i}}\in\mathbb{R}^{p}$ for $i=1,2,\ldots,p$ be a solution of:
    minimize $\quad\boldsymbol{m_{i}}^{\top}\boldsymbol{\hat{\Sigma}}\boldsymbol{m_{i}}$
    subject to $\quad\|\boldsymbol{\hat{\Sigma}m_{i}}-\boldsymbol{e_{i}}\|_{\infty}\leq\mu, \qquad (12)$
    where $\boldsymbol{e_{i}}\in\mathbb{R}^{p}$ is the $i^{\text{th}}$ column of the identity matrix $\boldsymbol{I}$ and $\mu=O\left(\sqrt{\frac{\log p}{n}}\right)$.
3:  Set $\boldsymbol{M}=(\boldsymbol{m_{1}}|\ldots|\boldsymbol{m_{p}})^{\top}$. If any of the above problems is not feasible, then set $\boldsymbol{M}=\boldsymbol{I_{p}}$.
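A minimal Python/cvxpy sketch of Alg. 1 is given below; it is an illustration under assumed problem sizes, with the quadratic objective $\boldsymbol{m_{i}}^{\top}\boldsymbol{\hat{\Sigma}}\boldsymbol{m_{i}}$ rewritten as $\|\boldsymbol{Am_{i}}\|_{2}^{2}/n$ for the solver, and a fallback to the identity row mirroring step 3.

```python
# Hedged sketch of Alg. 1 (construction of M) with cvxpy; sizes kept small for speed.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n, p = 40, 60
A = 2 * rng.binomial(1, 0.5, size=(n, p)) - 1
Sigma_hat = A.T @ A / n                        # step 1
mu = 2 * np.sqrt(np.log(p) / n)                # mu = O(sqrt(log p / n)); constant is an assumption

rows = []
for i in range(p):
    e_i = np.zeros(p); e_i[i] = 1.0
    m = cp.Variable(p)
    # m^T Sigma_hat m == ||A m||_2^2 / n, an equivalent form that is easy for the solver
    prob = cp.Problem(cp.Minimize(cp.sum_squares(A @ m) / n),
                      [cp.norm(Sigma_hat @ m - e_i, "inf") <= mu])
    prob.solve()
    rows.append(m.value if m.value is not None else e_i)   # step 3 fallback: identity row
M = np.vstack(rows)                                        # M = (m_1 | ... | m_p)^T
print("max |M Sigma_hat - I| =", np.abs(M @ Sigma_hat - np.eye(p)).max())
```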

The debiased estimate $\boldsymbol{\hat{\beta}_{d}}$ in (10), obtained via an approximate inverse $\boldsymbol{M}$ of $\boldsymbol{\hat{\Sigma}}$ using $\mu=O\left(\sqrt{(\log p)/n}\right)$, has the following statistical property [28, Theorem 8]:

$$\sqrt{n}(\boldsymbol{\hat{\beta}_{d}}-\boldsymbol{\beta^{*}})=\boldsymbol{MA}^{\top}\boldsymbol{\eta}/\sqrt{n}+\sqrt{n}(\boldsymbol{M\hat{\Sigma}}-\boldsymbol{I_{p}})(\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}_{\lambda}}). \qquad (13)$$

Here the second term on the RHS is referred to as the bias vector. Moreover, it is proved in [28, Theorem 8] that for sufficiently large $n$, an appropriate choice of $\lambda$ in (9), and appropriate statistical assumptions on $\boldsymbol{\Sigma}$, the maximum absolute value of the bias vector is $O_{P}\left(\frac{\sigma s\log p}{\sqrt{n}}\right)$. Thus, if $n>O((s\log p)^{2})$, the largest absolute value of the bias vector will be negligible, and the debiasing effect is achieved since $E(\boldsymbol{\hat{\beta}_{d}})\approx\boldsymbol{\beta^{*}}$.

Our debiasing approach is motivated along similar lines as in Alg. 1, but with MMEs in the sensing matrix which the earlier method cannot handle. Moreover, we demonstrate via simulations that ignoring MMEs may lead to larger estimation errors – see Table I of Sec. V-A.

III-C Debiasing in the Presence of MMEs

In the presence of MMEs, the design matrix $(\boldsymbol{A}|\boldsymbol{I_{n}})$ from (5) plays the role of $\boldsymbol{A}$ in Alg. 1. However, $(\boldsymbol{A}|\boldsymbol{I_{n}})$ is partly random and partly deterministic, whereas the theory in [28] applies to either purely random (Theorem 8 of [28]) or purely deterministic (Theorem 6 of [28]) matrices, but not a combination of both. Hence, the theoretical results of [28] do not apply to the approximate inverse of $\frac{1}{n}(\boldsymbol{A}|\boldsymbol{I_{n}})^{\top}(\boldsymbol{A}|\boldsymbol{I_{n}})$ obtained using Alg. 1. Numerical results showing the poor performance of ‘debiasing’ (as in (10)) with such an approximate inverse are presented in Sec. V-A.

To produce a debiased estimate of $\boldsymbol{\beta^{*}}$ in the presence of MMEs in the pooling matrix, we adopt a different approach from the one in [28]. We form a linear combination, via a carefully chosen set of weights, of the residual error vector obtained by running the robust Lasso estimator from (6), in order to debias the robust Lasso estimates $\boldsymbol{\hat{\beta}_{\lambda_{1}}},\boldsymbol{\hat{\delta}_{\lambda_{2}}}$. The weights of the linear combination are represented in the form of an appropriately designed matrix $\boldsymbol{W}\in\mathbb{R}^{n\times p}$ for debiasing $\boldsymbol{\hat{\beta}_{\lambda_{1}}}$, and a derived weight matrix $\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{WA}^{\top}\right)$ for debiasing $\boldsymbol{\hat{\delta}_{\lambda_{2}}}$. We later provide a procedure to design an optimal $\boldsymbol{W}$, as given in Alg. 2.

Given a weight matrix $\boldsymbol{W}$, we define a set of modified debiased Lasso estimates for $\boldsymbol{\beta^{*}}$ and $\boldsymbol{\delta^{*}}$ as follows:

$$\boldsymbol{\hat{\beta}_{W}}\triangleq\boldsymbol{\hat{\beta}_{\lambda_{1}}}+\frac{1}{n}\boldsymbol{W}^{\top}(\boldsymbol{y}-\boldsymbol{A\hat{\beta}_{\lambda_{1}}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}), \qquad (14)$$
$$\boldsymbol{\hat{\delta}_{W}}\triangleq\boldsymbol{y}-\boldsymbol{A}\boldsymbol{\hat{\beta}_{W}}=\boldsymbol{\hat{\delta}_{\lambda_{2}}}+\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{WA}^{\top}\right)(\boldsymbol{y}-\boldsymbol{A\hat{\beta}_{\lambda_{1}}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}). \qquad (15)$$

In our work, the matrix $\boldsymbol{W}$ does not play the role of $\boldsymbol{M}$ from Alg. 1, but instead plays the role of $\boldsymbol{AM}^{\top}$ (compare (14) with (10)). In Theorem 2 below, we show that these estimates are debiased in nature for the choice $\boldsymbol{W}\triangleq\boldsymbol{A}$. Thereafter, in Sec. IV and Theorem 5, using a different choice of $\boldsymbol{W}$ obtained via Alg. 2, we show that the resulting tests are superior to those with $\boldsymbol{W}=\boldsymbol{A}$.
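For concreteness, the debiased estimates (14) and (15) amount to a few lines of numpy, as in the following sketch (here beta_hat and delta_hat denote the robust Lasso estimates from (6), obtained e.g. via the cvxpy sketch given earlier):

```python
# Sketch of the debiased estimates (14)-(15); the helper name is illustrative.
import numpy as np

def debiased_estimates(y, A, W, beta_hat, delta_hat):
    """Return (beta_W, delta_W) as in (14) and (15)."""
    n = A.shape[0]
    resid = y - A @ beta_hat - delta_hat       # residual of the robust Lasso fit (6)
    beta_W = beta_hat + (W.T @ resid) / n      # (14)
    delta_W = y - A @ beta_W                   # (15); equals delta_hat + (I - W A^T/n) resid
    return beta_W, delta_W

# For the simple choice analyzed in Theorem 2, pass W = A:
# beta_W, delta_W = debiased_estimates(y, A, A, beta_hat, delta_hat)
```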

Theorem 2

Let $\boldsymbol{\hat{\beta}_{\lambda_{1}}},\boldsymbol{\hat{\delta}_{\lambda_{2}}}$ be as in (6), let $\boldsymbol{\hat{\beta}_{W}}$, $\boldsymbol{\hat{\delta}_{W}}$ be as in (14), (15) respectively, and set $\lambda_{1}\triangleq\frac{4\sigma\sqrt{\log p}}{\sqrt{n}}$, $\lambda_{2}\triangleq\frac{4\sigma\sqrt{\log n}}{n}$. If $n$ is $\omega[((s+r)\log p)^{2}]$, then given that $\boldsymbol{A}$ is a random Rademacher matrix and using $\boldsymbol{W}\triangleq\boldsymbol{A}$, we have the following:

  1. (1)

     Additionally, if $n<p$, then for any $j\in[p]$,

     $$\sqrt{n}(\hat{\beta}_{Wj}-\beta^{*}_{j})\xrightarrow{\mathcal{L}}N\left(0,\sigma^{2}\right)\mbox{ as }p,n\to\infty. \qquad (16)$$
  2. (2)

     Additionally, if $n\log n$ is $o(p)$, then for any $i\in[n]$,

     $$\frac{\hat{\delta}_{Wi}-\delta^{*}_{i}}{\sqrt{\Sigma_{A_{ii}}}}\xrightarrow{\mathcal{L}}N\left(0,\sigma^{2}\right)\mbox{ as }p,n\to\infty, \qquad (17)$$

     where $\boldsymbol{\Sigma_{A}}$ is defined as follows:

     $$\boldsymbol{\Sigma_{A}}\triangleq\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{AA}^{\top}\right)\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{AA}^{\top}\right)^{\top}. \qquad (18)$$

Here \xrightarrow{\mathcal{L}} denotes the convergence in law/distribution. \blacksquare

Remarks on Theorem 2

  1. 1.

    The asymptotic distributions of the LHS terms in (16) and (17) do not depend on $\boldsymbol{A}$. These distributions are asymptotically Gaussian because the noise vector $\boldsymbol{\eta}$ is normally distributed.

  2. 2.

    Theorem 2 provides the key result to develop a testing procedure corresponding to Aims (i) and (ii).

  3. 3.

    If $n$ is $\omega[((s+r)\log p)^{2}]$, then Lemma 1 implies that the Rademacher matrix $\boldsymbol{A}$ satisfies the EREC.

  4. 4.

    The condition $n<p$ in Result (1) emerges from (142) and (143), which are based on probabilistic bounds on the singular values of random Rademacher matrices [37]. For the special case where $n=p$ (which is no longer a compressive regime), these bounds are no longer applicable, and instead results such as [45, Thm. 1.2] can be used.

Drlt for $\boldsymbol{\beta^{*}}$: In Aim (i), we intended to develop a statistical test to determine whether or not a sample is defective. Given the significance level $\alpha\in[0,1]$, for each $j\in[p]$, we reject the null hypothesis $\mathsf{G_{0,j}}:\beta^{*}_{j}=0$ in favor of $\mathsf{G_{1,j}}:\beta^{*}_{j}\neq 0$ when

$$\sqrt{n}|\hat{\beta}_{Wj}|/\sigma>z_{\alpha/2}, \qquad (19)$$

where $z_{\alpha/2}$ is the upper $(\alpha/2)^{\text{th}}$ quantile of a standard normal random variable.
Drlt for $\boldsymbol{\delta^{*}}$: In Aim (ii), we intended to develop a statistical test to determine whether or not a pooled measurement is affected by MMEs. Given the significance level $\alpha\in[0,1]$, for each $i\in[n]$, we reject the null hypothesis $\mathsf{H_{0,i}}:\delta^{*}_{i}=0$ in favor of $\mathsf{H_{1,i}}:\delta^{*}_{i}\neq 0$ when

$$|\hat{\delta}_{Wi}|/\left(\sigma\sqrt{\Sigma_{A_{ii}}}\right)>z_{\alpha/2}. \qquad (20)$$

A desirable property of a statistical test is that the probability of rejecting the null hypothesis when the alternative is true converges to 1 (such a test is referred to as consistent). Theorem 2 ensures that the proposed Drlts are consistent. In other words, the sensitivity and specificity (as described in Sec. V) of both these tests approach 1 as $n,p\rightarrow\infty$.
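A short numpy/scipy sketch of the two decision rules (19) and (20) is given below; it assumes the debiased estimates $\boldsymbol{\hat{\beta}_{W}},\boldsymbol{\hat{\delta}_{W}}$ (with $\boldsymbol{W}=\boldsymbol{A}$) have already been computed as in (14)-(15), and the function name is illustrative.

```python
# Sketch of the Drlt decision rules (19) and (20) for the choice W = A.
import numpy as np
from scipy.stats import norm

def drlt_tests(beta_W, delta_W, A, sigma, alpha=0.01):
    n = A.shape[0]
    z_crit = norm.ppf(1 - alpha / 2)                     # z_{alpha/2}
    # (19): declare sample j defective if sqrt(n)|beta_W_j| / sigma exceeds z_{alpha/2}
    defective = np.sqrt(n) * np.abs(beta_W) / sigma > z_crit
    # (20): declare measurement i MME-affected, standardized by the diagonal of Sigma_A in (18)
    R = np.eye(n) - A @ A.T / n
    Sigma_A = R @ R.T
    mme_affected = np.abs(delta_W) / (sigma * np.sqrt(np.diag(Sigma_A))) > z_crit
    return defective, mme_affected
```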

IV Optimal Debiased Lasso Test

It is well known that a statistical test based on a statistic with smaller variance is generally more powerful than one based on a statistic with higher variance [9]. Therefore, we wish to design a weight matrix $\boldsymbol{W}$ that reduces the variance of the debiased robust Lasso estimate, in order to construct an ‘optimal’ debiased robust Lasso test. Hence we choose $\boldsymbol{W}$ so as to minimize the sum of the asymptotic variances of the debiased estimates $\{\hat{\beta}_{Wj}\}_{j=1}^{p}$. The procedure for the design of $\boldsymbol{W}$ is presented in Alg. 2.

Rearranging (14) and (15) followed by some algebra, we obtain the following expressions:

$$\boldsymbol{\hat{\beta}_{W}}-\boldsymbol{\beta^{*}}=\frac{1}{n}\boldsymbol{W}^{\top}\boldsymbol{\eta}+\left(\boldsymbol{I_{p}}-\frac{1}{n}\boldsymbol{W}^{\top}\boldsymbol{A}\right)(\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}_{\lambda_{1}}})+\frac{1}{n}\boldsymbol{W}^{\top}\left(\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\right), \qquad (21)$$
$$\boldsymbol{\hat{\delta}_{W}}-\boldsymbol{\delta^{*}}=\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{\eta}+\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{A}(\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}_{\lambda_{1}}})-\frac{1}{n}\boldsymbol{WA}^{\top}\left(\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\right). \qquad (22)$$

Note that the first term on the RHS of both (21) and (22) is zero-mean Gaussian. The remaining two terms in both equations are bias terms. In order to develop an optimal hypothesis test for the debiased robust Lasso, we show that (i) the variances of the first terms on the RHS of (21) and (22) are bounded with appropriate scaling as $n,p\to\infty$; and (ii) the two bias terms in (21) and (22) go to 0 in probability as $n,p\to\infty$. In such a situation, the sum of the asymptotic variances of the elements of $\boldsymbol{\hat{\beta}_{W}}$ will be $\frac{\sigma^{2}}{n^{2}}\sum_{j=1}^{p}\boldsymbol{w_{.j}}^{\top}\boldsymbol{w_{.j}}$.

Algorithm 2 Design of $\boldsymbol{W}$
Input: $\boldsymbol{A}$, $\mu_{1}$, $\mu_{2}$ and $\mu_{3}$
Output: $\boldsymbol{W}$
1:  We solve the following optimisation problem:
    minimize over $\boldsymbol{W}$: $\quad\sum_{j=1}^{p}\boldsymbol{w_{.j}}^{\top}\boldsymbol{w_{.j}}$
    subject to
    $\mathsf{C0}:\ \boldsymbol{w_{.j}}^{\top}\boldsymbol{w_{.j}}/n\leq 1\ \forall\ j\in[p]$,
    $\mathsf{C1}:\ \left|\boldsymbol{I_{p}}-\frac{1}{n}\boldsymbol{W^{\top}A}\right|_{\infty}\leq\mu_{1}$,
    $\mathsf{C2}:\ \left|\frac{1}{p}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{A}\right|_{\infty}\leq\mu_{2}$,
    $\mathsf{C3}:\ \frac{n}{p\sqrt{1-\frac{n}{p}}}\left|\frac{\boldsymbol{WA^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right|_{\infty}\leq\mu_{3}$,
    where $\mu_{1}\triangleq 2\sqrt{\frac{2\log p}{n}}$, $\mu_{2}\triangleq 2\sqrt{\frac{\log(2np)}{np}}+\frac{1}{n}$ and $\mu_{3}\triangleq\frac{2}{\sqrt{1-n/p}}\sqrt{\frac{2\log n}{p}}$.
2:  If the above problem is not feasible, then set $\boldsymbol{W}=\boldsymbol{A}$.
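A Python/cvxpy sketch of Alg. 2 is shown below (the paper used CVX in MATLAB); the problem sizes are deliberately small, entrywise $\ell_{\infty}$ norms are expressed via elementwise absolute values, and the fallback to $\boldsymbol{W}=\boldsymbol{A}$ mirrors step 2. It is an illustration, not the paper's exact implementation.

```python
# Hedged cvxpy sketch of Alg. 2 (design of W); sizes are assumptions chosen for speed.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n, p = 30, 60
A = 2 * rng.binomial(1, 0.5, size=(n, p)) - 1
mu1 = 2 * np.sqrt(2 * np.log(p) / n)
mu2 = 2 * np.sqrt(np.log(2 * n * p) / (n * p)) + 1.0 / n
mu3 = (2 / np.sqrt(1 - n / p)) * np.sqrt(2 * np.log(n) / p)

W = cp.Variable((n, p))
constraints = [
    cp.max(cp.sum(cp.square(W), axis=0)) / n <= 1,                              # C0
    cp.max(cp.abs(np.eye(p) - W.T @ A / n)) <= mu1,                             # C1
    cp.max(cp.abs((np.eye(n) - W @ A.T / n) @ A)) / p <= mu2,                   # C2
    (n / (p * np.sqrt(1 - n / p)))
    * cp.max(cp.abs(W @ A.T / n - (p / n) * np.eye(n))) <= mu3,                 # C3
]
prob = cp.Problem(cp.Minimize(cp.sum_squares(W)), constraints)
prob.solve()
W_opt = W.value if prob.status in ("optimal", "optimal_inaccurate") else A      # step 2 fallback
```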

Theorem 3 (given below) establishes that the second and third terms on the RHS of both (21) and (22) go to 0 in probability. We design $\boldsymbol{W}$ to minimize the expression $\sum_{j=1}^{p}\boldsymbol{w_{.j}^{\top}w_{.j}}$ subject to constraints $\mathsf{C0},\mathsf{C1},\mathsf{C2},\mathsf{C3}$ on $\boldsymbol{W}$, which are summarized in Alg. 2. The values of $\mu_{1},\mu_{2},\mu_{3}$ are selected in such a way that each of the constraints $\mathsf{C1},\mathsf{C2},\mathsf{C3}$ in Alg. 2 holds with high probability for the choice $\boldsymbol{W}\triangleq\boldsymbol{A}$, as will be formally established in Lemma 5. These constraints are derived from Theorem 3 and ensure that the bias terms go to 0. In particular, the constraint $\mathsf{C1}$ (via $\mu_{1}$) controls the rate of convergence of the bias terms on the RHS of (21), whereas the constraint $\mathsf{C2}$ (via $\mu_{2}$) controls the rate of convergence of the bias terms on the RHS of (22). Furthermore, the constraint $\mathsf{C3}$ ensures that the asymptotic variance of the first term on the RHS of (22) converges. Essentially, the choice $\boldsymbol{W}\triangleq\boldsymbol{A}$ helps us establish that the set of all possible $\boldsymbol{W}$ matrices satisfying the constraints in Alg. 2 is non-empty with high probability. Finally, Theorem 4 establishes that the variances of the first terms on the RHS of (21) and (22) converge. These theorems play a vital role in deriving Theorem 5, which leads to the optimal debiased robust Lasso tests.

Theorem 3

Let $\boldsymbol{\hat{\beta}_{\lambda_{1}}},\boldsymbol{\hat{\delta}_{\lambda_{2}}}$ be as in (6), let $\boldsymbol{\hat{\beta}_{W}}$, $\boldsymbol{\hat{\delta}_{W}}$ be as in (14), (15) respectively, and set $\lambda_{1}\triangleq\frac{4\sigma\sqrt{\log p}}{\sqrt{n}}$, $\lambda_{2}\triangleq\frac{4\sigma\sqrt{\log n}}{n}$. Let $\boldsymbol{A}$ be a random Rademacher matrix and let $\boldsymbol{W}$ be obtained from Alg. 2. Then, if $n$ is $o(p)$ and $n$ is $\omega[((s+r)\log p)^{2}]$, as $p,n\to\infty$ we have:

  1. 1.
     $$\left\|\sqrt{n}\left(\boldsymbol{I_{p}}-\frac{1}{n}\boldsymbol{W^{\top}}\boldsymbol{A}\right)(\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}_{\lambda_{1}}})\right\|_{\infty}=o_{P}(1). \qquad (23)$$
  2. 2.
     $$\left\|\frac{1}{\sqrt{n}}\boldsymbol{W}^{\top}\left(\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\right)\right\|_{\infty}=o_{P}(1). \qquad (24)$$
  3. 3.
     $$\left\|\frac{n}{p\sqrt{1-n/p}}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{A}(\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}_{\lambda_{1}}})\right\|_{\infty}=o_{P}(1). \qquad (25)$$
  4. 4.
     $$\left\|\frac{n}{p\sqrt{1-n/p}}\cdot\frac{1}{n}\boldsymbol{WA}^{\top}\left(\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\right)\right\|_{\infty}=o_{P}(1). \qquad (26)$$

\blacksquare

Now, note that the variance-covariance matrices of the first terms on the RHS of (21) and (22), respectively, are

$$\boldsymbol{\Sigma_{\beta}}\triangleq\mathrm{Var}\left(\frac{1}{n}\boldsymbol{W}^{\top}\boldsymbol{\eta}\right)=\sigma^{2}\frac{1}{n}\boldsymbol{W^{\top}W}, \qquad (27)$$
$$\boldsymbol{\Sigma_{\delta}}\triangleq\mathrm{Var}\left(\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{\eta}\right)=\sigma^{2}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{WA}^{\top}\right)\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{WA}^{\top}\right)^{\top}. \qquad (28)$$

Theorem 4 shows that when $\boldsymbol{W}$ is chosen as per Alg. 2, the element-wise variances of the first term on the RHS of (21) (the diagonal elements of $\boldsymbol{\Sigma_{\beta}}$) approach $\sigma^{2}$ in probability. The constraints $\mathsf{C0}$ and $\mathsf{C1}$ of Alg. 2 are mainly used to establish this theorem. Further, for the optimal choice of $\boldsymbol{W}$ as in Alg. 2, we show that the element-wise variances of the first term on the RHS of (22) (the diagonal elements of $\boldsymbol{\Sigma_{\delta}}$, appropriately scaled) also approach $\sigma^{2}$ in probability. To establish this, we use the constraint $\mathsf{C3}$ of Alg. 2.
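For reference, the diagonal entries of $\boldsymbol{\Sigma_{\beta}}$ in (27) and $\boldsymbol{\Sigma_{\delta}}$ in (28), which standardize the test statistics below, can be computed as in the following small numpy helper (illustrative; the function name is not from the paper):

```python
# Diagonal entries of Sigma_beta (27) and Sigma_delta (28), used to standardize the tests.
import numpy as np

def debiasing_variances(W, A, sigma):
    n = A.shape[0]
    Sigma_beta_diag = sigma**2 * np.sum(W * W, axis=0) / n      # diag of sigma^2 W^T W / n
    R = np.eye(n) - W @ A.T / n
    Sigma_delta_diag = sigma**2 * np.sum(R * R, axis=1)         # diag of sigma^2 R R^T
    return Sigma_beta_diag, Sigma_delta_diag
```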

Theorem 4

Let $\boldsymbol{A}$ be a Rademacher matrix. Suppose $\boldsymbol{W}$ is obtained from Alg. 2 and $\boldsymbol{\Sigma_{\beta}}$ and $\boldsymbol{\Sigma_{\delta}}$ are defined as in (27) and (28), respectively. If $n\log n$ is $o(p)$ and $n$ is $\omega[((s+r)\log p)^{2}]$, then as $n,p\to\infty$ we have the following:

  1. (1)

     For $j\in[p]$,

     $$\Sigma_{\beta_{jj}}\overset{P}{\to}\sigma^{2}. \qquad (29)$$
  2. (2)

     For $i\in[n]$,

     $$\frac{n^{2}}{p^{2}(1-n/p)}\Sigma_{\delta_{ii}}\overset{P}{\to}\sigma^{2}. \qquad (30)$$

    \blacksquare

When we choose an optimal $\boldsymbol{W}$ as per Alg. 2, equations (21) and (22), along with Theorems 3 and 4, can be used to derive the asymptotic distributions of $\boldsymbol{\hat{\beta}_{W}}$ and $\boldsymbol{\hat{\delta}_{W}}$. This is accomplished in Theorem 5, which can be viewed as a non-trivial extension of Theorem 2 to such an optimal choice of $\boldsymbol{W}$.

Theorem 5

Let $\boldsymbol{\hat{\beta}_{\lambda_{1}}},\boldsymbol{\hat{\delta}_{\lambda_{2}}}$ be as in (6), let $\boldsymbol{\hat{\beta}_{W}}$, $\boldsymbol{\hat{\delta}_{W}}$ be as in (14), (15) respectively, and set $\lambda_{1}\triangleq\frac{4\sigma\sqrt{\log p}}{\sqrt{n}}$, $\lambda_{2}\triangleq\frac{4\sigma\sqrt{\log n}}{n}$. Let $\boldsymbol{A}$ be a random Rademacher matrix and let $\boldsymbol{W}$ be the debiasing matrix obtained from Alg. 2. If $n$ is $\omega[((s+r)\log p)^{2}]$ and $n\log n$ is $o(p)$, then we have:

  1. (1)

     For fixed $j\in[p]$,

     $$\frac{\sqrt{n}(\hat{\beta}_{Wj}-\beta^{*}_{j})}{\sqrt{\Sigma_{\beta_{jj}}}}\xrightarrow{\mathcal{L}}N\left(0,1\right)\mbox{ as }p,n\to\infty. \qquad (31)$$
  2. (2)

     For fixed $i\in[n]$,

     $$\frac{\hat{\delta}_{Wi}-\delta^{*}_{i}}{\sqrt{\Sigma_{\delta_{ii}}}}\xrightarrow{\mathcal{L}}N\left(0,1\right)\mbox{ as }p,n\to\infty, \qquad (32)$$

     where $\Sigma_{\beta_{jj}}$ and $\Sigma_{\delta_{ii}}$ are the $j^{\text{th}}$ and $i^{\text{th}}$ diagonal elements of the matrices $\boldsymbol{\Sigma_{\beta}}$ (as in (27)) and $\boldsymbol{\Sigma_{\delta}}$ (as in (28)), respectively.

\blacksquare

Theorem 5 paves the way to develop an optimal Drlt for Aims (i) and (ii) of this work, along similar lines to the Drlt.
Optimal Drlt for $\boldsymbol{\beta^{*}}$: As in the Drlt for $\boldsymbol{\beta^{*}}$, we now present a hypothesis testing procedure, for an optimally designed $\boldsymbol{W}$, to determine defective samples based on Theorem 5. As before, given $\alpha>0$, we reject the null hypothesis $\mathsf{G_{0,j}}:\beta^{*}_{j}=0$ in favor of $\mathsf{G_{1,j}}:\beta^{*}_{j}\neq 0$, for each $j\in[p]$, when

$$\sqrt{n}|\hat{\beta}_{Wj}|\big{/}\sqrt{\Sigma_{\beta_{jj}}}>z_{\alpha/2}, \qquad (33)$$

where $z_{\alpha/2}$ is as in (19).

Optimal Drlt for $\boldsymbol{\delta^{*}}$: As in the Drlt for $\boldsymbol{\delta^{*}}$, we develop a hypothesis testing procedure, corresponding to the optimal $\boldsymbol{W}$, to determine whether or not a measurement in $\boldsymbol{y}$ is affected by effective MMEs, based on Theorem 5. As before, given $\alpha>0$, for $i\in[n]$, we reject the null hypothesis $\mathsf{H_{0,i}}:\delta^{*}_{i}=0$ in favor of $\mathsf{H_{1,i}}:\delta^{*}_{i}\neq 0$ when

$$|\hat{\delta}_{Wi}|\big{/}\sqrt{\Sigma_{\delta_{ii}}}>z_{\alpha/2}. \qquad (34)$$

Similar to the Drlts in (19) and (20), Theorem 5 ensures that the proposed Optimal Drlts are consistent size-$\alpha$ tests. This implies that the sensitivity and specificity of both these tests approach 1 as $n,p\rightarrow\infty$. In Section V, we show that the Optimal Drlt is superior to the Drlt in finite-sample scenarios.

V Experimental Results

Data Generation: We now describe the method of data generation for our simulation study. We synthetically generated signals (i.e., $\boldsymbol{\beta^{*}}$) with $p=500$ elements each. Of the non-zero values of $\boldsymbol{\beta^{*}}$, 40% were drawn i.i.d. from $U(50,100)$ and the remaining 60% were drawn i.i.d. from $U(500,10^{3})$, and these were placed at randomly chosen indices. The elements of the pooling matrix $\boldsymbol{B}$ were drawn i.i.d. from $\textrm{Bernoulli}(0.5)$, thereby producing $\boldsymbol{A}\triangleq 2\boldsymbol{B}-\boldsymbol{1_{n}}\boldsymbol{1_{p}}^{\top}$, where $\boldsymbol{A}$ is Rademacher distributed. In order to generate effective bit-flips, sign changes were induced in an adversarial manner in randomly chosen rows of $\boldsymbol{A}$, at column indices corresponding to the non-zero locations of $\boldsymbol{\beta^{*}}$. This yielded the perturbed matrix $\boldsymbol{\hat{A}}$, produced via an adversarial form of the model mismatch error (MME) for bit-flips, which is described in the following paragraph. Define the fractions $f_{sp}\triangleq s/p$ and $f_{adv}\triangleq r/n$. We chose the noise standard deviation $\sigma$ to be a fraction of the mean absolute value of the noiseless measurements, i.e., we set $\sigma\triangleq f_{\sigma}\sum_{i=1}^{n}|\boldsymbol{a_{i.}\beta^{*}}|/n$ where $0<f_{\sigma}<1$. For different simulation scenarios, different values of $s=\|\boldsymbol{\beta^{*}}\|_{0}$ (via $f_{sp}$), $r=\|\boldsymbol{\delta^{*}}\|_{0}$ (via $f_{adv}$), noise standard deviation $\sigma$ (via $f_{\sigma}$) and number of measurements $n$ were chosen, as described in the following paragraphs.
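A hedged numpy sketch of this data-generation protocol is given below; the specific values of $f_{sp},f_{adv},f_{\sigma}$ and $n$ used here are merely illustrative, not the paper's reported settings.

```python
# Illustrative sketch of the data generation described above (parameter values are assumptions).
import numpy as np

rng = np.random.default_rng(5)
p, n = 500, 300
f_sp, f_adv, f_sigma = 0.02, 0.02, 0.01
s, r = int(f_sp * p), int(f_adv * n)

beta = np.zeros(p)
support = rng.choice(p, s, replace=False)
n_low = int(0.4 * s)                                   # 40% small-magnitude, 60% large-magnitude defectives
beta[support[:n_low]] = rng.uniform(50, 100, n_low)
beta[support[n_low:]] = rng.uniform(500, 1000, s - n_low)

B = rng.binomial(1, 0.5, size=(n, p))
A = 2 * B - 1
A_hat = A.copy()
for i in rng.choice(n, r, replace=False):              # adversarial bit-flips, only at support indices
    j = rng.choice(support)
    A_hat[i, j] = -A_hat[i, j]

sigma = f_sigma * np.abs(A @ beta).mean()
y = A_hat @ beta + sigma * rng.standard_normal(n)      # measurements come from the *unknown* A_hat
```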

Choice of Model Mismatch Error: In our work, all MMEs were generated in the following manner. A bit-flipped pool (measurement, as described in (5)) contains exactly one bit-flip at a randomly chosen index. Suppose that the $i^{\text{th}}$ pool (measurement) contains a bit-flip. Then exactly one of the following two events can happen: (1) some $j^{\text{th}}$ sample that was intended to be in the pool (as defined in $\boldsymbol{A}$) is excluded, or (2) some $j^{\text{th}}$ sample that was not intended to be part of the pool (as defined in $\boldsymbol{A}$) is included. These two cases lead to the following changes in the $i^{\text{th}}$ row of $\boldsymbol{\hat{A}}$ (as compared to the $i^{\text{th}}$ row of $\boldsymbol{A}$), and in both cases the choice of $j\in[p]$ is uniformly random: Case 1: $\hat{a}_{ij}=-1$ but $a_{ij}=1$; Case 2: $\hat{a}_{ij}=1$ but $a_{ij}=-1$. Note that under this scheme, the generated bit-flips may not be effective. Hence MMEs need to be applied in an adversarial setting by inducing bit-flips only at those entries in any row of $\boldsymbol{\hat{A}}$ corresponding to column indices with non-zero values of $\boldsymbol{\beta^{*}}$.

Choice of Regularization Parameters: The regularization parameters $\lambda_{1}$ and $\lambda_{2}$ were chosen such that $\log(\lambda_{1})$ and $\log(\lambda_{2})$ were from the range $[1:0.25:7]$, in the following manner. We first identified values of $\lambda_{1}$ and $\lambda_{2}$ such that the Lilliefors test [32] confirmed the Gaussian distribution of both $\sqrt{n}\hat{\beta}_{Wj}/\sqrt{\Sigma_{\beta_{jj}}}$ and $\hat{\delta}_{Wi}/\sqrt{\Sigma_{\delta_{ii}}}$ (see the Odrlt tests in (33) and (34)) at the 1% significance level, for at least 70% of $j\in[p]$ (coordinates of $\boldsymbol{\beta^{*}}$) and $i\in[n]$ (coordinates of $\boldsymbol{\delta^{*}}$). Out of these chosen values, we determined the values of $\lambda_{1},\lambda_{2}$ that minimized the average cross-validation error over 10 folds. In each fold, 90% of the $n$ measurements (denoted by a sub-vector $\boldsymbol{y_{r}}$ corresponding to the sub-matrix $\boldsymbol{A_{r}}$) were used to obtain ($\boldsymbol{\hat{\beta}_{\lambda_{1}}},\boldsymbol{\hat{\delta}_{\lambda_{2}}}$) via the robust Lasso, and the remaining 10% of the measurements (denoted by a sub-vector $\boldsymbol{y_{cv}}$ corresponding to measurements generated by the sub-matrix $\boldsymbol{A_{cv}}$) were used to estimate the cross-validation error $\|\boldsymbol{y_{cv}}-\boldsymbol{A_{cv}}\boldsymbol{\hat{\beta}_{\lambda_{1}}}-\boldsymbol{I_{cv}}\boldsymbol{\hat{\delta}_{\lambda_{2}}}\|^{2}_{2}$. Note that $\boldsymbol{I_{cv}}$ is a sub-matrix of the identity matrix which samples only some elements of $\boldsymbol{y}$, and hence of $\boldsymbol{\hat{\delta}_{\lambda_{2}}}$. The cross-validation error is known to be an estimable quantity for the mean-squared error [52], which justifies its choice as a method for parameter selection.
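A simplified sketch of the cross-validation step is given below; the Lilliefors screening of candidate $(\lambda_{1},\lambda_{2})$ values is omitted for brevity, robust_lasso is an assumed helper that solves (6) (e.g., the cvxpy sketch given after (6)), and the held-out mismatch term $\boldsymbol{I_{cv}}\boldsymbol{\hat{\delta}_{\lambda_{2}}}$ is dropped in this simplification.

```python
# Simplified 10-fold cross-validation for one (lambda_1, lambda_2) pair; robust_lasso is assumed.
import numpy as np

def cv_error(y, A, lam1, lam2, robust_lasso, n_folds=10, seed=0):
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    err = 0.0
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        beta_hat, _ = robust_lasso(y[train], A[train], lam1, lam2)   # fit (6) on ~90% of rows
        # The paper's error additionally subtracts the held-out mismatch estimates I_cv * delta_hat;
        # that term is omitted in this simplified sketch.
        err += np.sum((y[held_out] - A[held_out] @ beta_hat) ** 2)
    return err / n_folds
```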

Evaluation Measures of Hypothesis Tests: Many different variants of the Lasso estimator were compared empirically against each other, as described in subsequent subsections. Each of them was modeled and solved using the CVX package in MATLAB. Results for the hypothesis tests (given in (19), (20), (33) and (34)) are reported in terms of sensitivity and specificity (defined below). The significance level of these tests was chosen to be 1%. Consider a binary signal $\boldsymbol{\hat{b}_{\beta}}$ with $p$ elements. In our simulations, a sample at index $j$ in $\boldsymbol{\hat{\beta}_{W}}$ is declared to be defective if the hypothesis $\mathsf{G_{0,j}}$ is rejected, in which case we set $\hat{b}_{\beta,j}=1$. In all other cases, we set $\hat{b}_{\beta,j}=0$. We declare an element to be a true defective if $\beta^{*}_{j}\neq 0$ and $\hat{b}_{\beta,j}\neq 0$, and a false defective if $\beta^{*}_{j}=0$ but $\hat{b}_{\beta,j}\neq 0$. We declare it to be a false non-defective if $\beta^{*}_{j}\neq 0$ but $\hat{b}_{\beta,j}=0$, and a true non-defective if $\beta^{*}_{j}=0$ and $\hat{b}_{\beta,j}=0$. The sensitivity for $\boldsymbol{\beta^{*}}$ is defined as (# true defectives)/(# true defectives + # false non-defectives), and the specificity for $\boldsymbol{\beta^{*}}$ is defined as (# true non-defectives)/(# true non-defectives + # false defectives). We report the results of testing for the debiased tests using: (i) $\boldsymbol{W}\triangleq\boldsymbol{A}$, corresponding to Drlt (see (19) and (20)), and (ii) the optimal $\boldsymbol{W}$ from Alg. 2, corresponding to Odrlt (see (33) and (34)).
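The sensitivity and specificity defined above can be computed as in the following small numpy helper (illustrative; the function name is not from the paper):

```python
# Sensitivity and specificity of a recovered defect-status vector against the ground truth.
import numpy as np

def sens_spec(beta_true, b_hat):
    defective = beta_true != 0
    declared = b_hat != 0
    tp = np.sum(defective & declared)          # true defectives
    fn = np.sum(defective & ~declared)         # false non-defectives
    tn = np.sum(~defective & ~declared)        # true non-defectives
    fp = np.sum(~defective & declared)         # false defectives
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity
```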

V-A Results with Baseline Debiasing Techniques in the Presence of MMEs

We now describe the results of an experiment to show the impact of Odrlt (33) in the presence of MMEs induced in 𝑨\boldsymbol{A}. We compare Odrlt with the baseline hypothesis test for 𝜷\boldsymbol{\beta^{*}} defined in [28], which is equivalent to ignoring MMEs (i.e., setting 𝜹=0\boldsymbol{\delta^{*}}=0 in (5)). To account for the presence of MMEs, we further compare Odrlt with the baseline test of [28] applied with the approximate inverse of the augmented sensing matrix (𝑨|𝑰𝒏)(\boldsymbol{A}|\boldsymbol{I_{n}}) as obtained from Alg. 1 with μ=2log(n+p)n\mu=2\sqrt{\frac{\log(n+p)}{n}}. Note, moreover, that the theoretical results established in [28] hold for completely random or purely deterministic sensing matrices, whereas the sensing matrix corresponding to the MME model, i.e., (𝑨|𝑰𝒏)(\boldsymbol{A}|\boldsymbol{I_{n}}), is partly random and partly deterministic. We now describe these two chosen baseline hypothesis tests for 𝜷\boldsymbol{\beta^{*}} in more detail.

1. Baseline ignoring MMEs (Baseline-1): This approach computes the following ‘debiased’ estimate of 𝜷\boldsymbol{\beta^{*}}, as given in Equation (5) of [28]:

    𝜷^𝒃𝜷^𝝀,𝒃+1n𝑴𝑨(𝒚𝑨𝜷^𝝀,𝒃),\boldsymbol{\hat{\beta}_{b}}\triangleq\boldsymbol{\hat{\beta}_{\lambda,b}}+\frac{1}{n}\boldsymbol{MA}^{\top}(\boldsymbol{y}-\boldsymbol{A\hat{\beta}_{\lambda,b}}), (35)

    where 𝜷^𝝀,𝒃argmin𝜷𝒚𝑨𝜷22+λ𝜷1\boldsymbol{\hat{\beta}_{\lambda,b}}\triangleq\text{argmin}_{\boldsymbol{\beta}}\|\boldsymbol{y}-\boldsymbol{A\beta}\|^{2}_{2}+\lambda\|\boldsymbol{\beta}\|_{1}, and 𝑴\boldsymbol{M} is the approximate inverse of 𝑨\boldsymbol{A} obtained from Alg. 1. In this baseline approach, we reject the null hypothesis 𝖦𝟢,𝗃:βj=0\mathsf{G_{0,j}}:\beta^{*}_{j}=0 in favor of 𝖦𝟣,𝗃:βj0\mathsf{G_{1,j}}:\beta^{*}_{j}\neq 0, for each j[p]j\in[p] when n𝜷^𝒃𝒋/σ2[𝑴𝑨𝑨𝑴]jj>zα/2\sqrt{n}\boldsymbol{\hat{\beta}_{bj}}/\sqrt{\sigma^{2}[\boldsymbol{MA^{\top}AM^{\top}}]_{jj}}>z_{\alpha/2}.

2. Baseline considering MMEs (Baseline-2): In this approach, we account for MMEs by treating (𝑨|𝑰𝒏)(\boldsymbol{A}|\boldsymbol{I_{n}}) as the sensing matrix and 𝒙=(𝜷,𝜹)\boldsymbol{x^{*}}=(\boldsymbol{\beta^{*}}^{\top},\boldsymbol{\delta^{*}}^{\top})^{\top} as the signal vector. The ‘debiased’ estimate of 𝒙\boldsymbol{x^{*}} in this approach is given as:

    𝒙~𝒃𝒙~𝝀+1n𝑴~(𝑨|𝑰𝒏)(𝒚(𝑨|𝑰𝒏)𝒙~𝝀),\boldsymbol{\tilde{x}_{b}}\triangleq\boldsymbol{\tilde{x}_{\lambda}}+\frac{1}{n}\boldsymbol{\tilde{M}}(\boldsymbol{A}|\boldsymbol{I_{n}})^{\top}(\boldsymbol{y}-(\boldsymbol{A}|\boldsymbol{I_{n}})\boldsymbol{\tilde{x}_{\lambda}}), (36)

where 𝒙~𝝀argmin𝒙𝒚(𝑨|𝑰𝒏)𝒙22+λ𝒙1\boldsymbol{\tilde{x}_{\lambda}}\triangleq\text{argmin}_{\boldsymbol{x}}\|\boldsymbol{y}-(\boldsymbol{A}|\boldsymbol{I_{n}})\boldsymbol{x}\|^{2}_{2}+\lambda\|\boldsymbol{x}\|_{1} and 𝑴~\boldsymbol{\tilde{M}} is the approximate inverse of (𝑨|𝑰𝒏)(\boldsymbol{A}|\boldsymbol{I_{n}}) obtained from Alg. 1. Then 𝜷~𝒃\boldsymbol{\tilde{\beta}_{b}} is obtained by extracting the first pp elements of 𝒙~𝒃\boldsymbol{\tilde{x}_{b}}. In this approach, we reject the null hypothesis 𝖦𝟢,𝗃:βj=0\mathsf{G_{0,j}}:\beta^{*}_{j}=0 in favor of 𝖦𝟣,𝗃:βj0\mathsf{G_{1,j}}:\beta^{*}_{j}\neq 0, for each j[p]j\in[p] when n𝜷~𝒃𝒋/σ2[𝑴~(𝑨|𝑰𝒏)(𝑨|𝑰𝒏)𝑴~]jj>zα/2\sqrt{n}\boldsymbol{\tilde{\beta}_{bj}}/\sqrt{\sigma^{2}[\boldsymbol{\tilde{M}(A|I_{n})^{\top}(A|I_{n})\tilde{M}^{\top}}]_{jj}}>z_{\alpha/2}.
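Both baselines (and, with the appropriate covariance matrices, the Drlt/Odrlt tests in (19), (20), (33) and (34)) reduce to comparing a studentized debiased coordinate against a standard normal quantile. The sketch below is our own illustration of this rejection rule, not code from the paper; we use the absolute value of the statistic, i.e., the two-sided rule matching the zα/2z_{\alpha/2} threshold, and the inputs are assumed to be available from the debiasing step.

```python
import numpy as np
from scipy.stats import norm

def debiased_coordinate_test(beta_debiased, var_diag, n, alpha=0.01):
    """Reject G_{0,j}: beta_j = 0 when sqrt(n)*|beta_debiased_j| / sqrt(var_diag_j) > z_{alpha/2}.

    var_diag : per-coordinate variance proxy, e.g. sigma^2 * diag(M A^T A M^T) for Baseline-1.
    """
    stat = np.sqrt(n) * np.abs(beta_debiased) / np.sqrt(var_diag)
    return stat > norm.ppf(1.0 - alpha / 2.0)   # boolean vector of rejections (declared defectives)
```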

For these baseline approaches, the regularization parameter λ\lambda was chosen using cross validation. We chose the λ\lambda value which minimized the validation error with 90%90\% of the measurements used for reconstruction and the remaining 10%10\% used for cross-validation.

 n  | Sensitivity Baseline-1 | Sensitivity Baseline-2 | Sensitivity Odrlt | Specificity Baseline-1 | Specificity Baseline-2 | Specificity Odrlt
100 | 0.522 | 0.602 | 0.647 | 0.678 | 0.702 | 0.771
200 | 0.597 | 0.682 | 0.704 | 0.832 | 0.895 | 0.931
300 | 0.698 | 0.802 | 0.878 | 0.884 | 0.915 | 0.963
400 | 0.791 | 0.834 | 0.951 | 0.902 | 0.927 | 0.999
500 | 0.858 | 0.894 | 0.984 | 0.923 | 0.956 | 1

TABLE I: Comparison of average Sensitivity and Specificity (based on 100 independent noise runs) for the tests Baseline-1, Baseline-2 and Odrlt for determining defectives in 𝜷\boldsymbol{\beta^{*}} from their respective debiased estimates in the presence of MMEs induced in 𝑨\boldsymbol{{A}} (See Sec. V-A for detailed definitions).

In Table I, we compare the average values (over 100100 instances of measurement noise) of Sensitivity and Specificity of Baseline-1, Baseline-2 and Odrlt for different values of nn varying in {100,200,300,400,500}\{100,200,300,400,500\} and p=500p=500. It is clear from Table I that, for all values of nn, the Sensitivity and Specificity values of Odrlt are higher than those of Baseline-1 and Baseline-2. The performance of Baseline-2 dominates that of Baseline-1, which indicates that ignoring MMEs may lead to misleading inferences in small-sample scenarios. Furthermore, the Sensitivity and Specificity of Odrlt approach 1 as nn increases. This highlights the superiority of our proposed technique and its associated hypothesis tests over two carefully chosen baselines. Note that there is no prior literature on debiasing in the presence of MMEs, and hence these two baselines are the only possible competitors for our technique.

V-B Empirical verification of asymptotic results of Theorem 5

In this subsection, we compare the empirical distributions of TG,jn(β^Wjβj)/[𝚺𝜷]jjT_{G,j}\triangleq\sqrt{n}(\hat{\beta}_{Wj}-\beta^{*}_{j})/\sqrt{[\boldsymbol{\Sigma_{\beta}}]_{jj}} and TH,i(δ^Wiδi)/[𝚺𝜹]iiT_{H,i}\triangleq(\hat{\delta}_{Wi}-\delta^{*}_{i})/\sqrt{[\boldsymbol{\Sigma_{\delta}}]_{{ii}}}, for the optimal weight matrix 𝑾\boldsymbol{W}, with their asymptotic distributions as derived in Theorem 5. We chose p=500,n=400p=500,n=400, fadv=0.01f_{adv}=0.01, fsp=0.01f_{sp}=0.01 and fσ=0.01f_{\sigma}=0.01. The measurement vector 𝒚\boldsymbol{y} was generated with a perturbed matrix 𝑨^\boldsymbol{\hat{A}} containing effective bit-flips generated using MMEs as described above. Here, TG,jT_{G,j} and TH,iT_{H,i} were computed over 100100 runs across different noise instances in 𝜼\boldsymbol{\eta}.

The left sub-figure of Fig. 1 shows plots of the quantiles of a standard normal random variable versus the quantiles of TG,jT_{G,j} based on 100100 runs for each j[p]j\in[p] in different colors. A 45o45^{o} straight line passing through the origin is also plotted (black solid line) as a reference. These pp different quantile-quantile (QQ) plots corresponding to j[p]j\in[p], all super-imposed on one another, indicate that the quantiles of the TG,jT_{G,j} are close to those of the standard normal distribution in the range [2,2][-2,2] (thus covering about 95%95\% of the area under the standard bell curve) for defective as well as non-defective samples. This confirms that the distributions of the TG,jT_{G,j} are each approximately N(0,1)N(0,1) even in this chosen finite-sample scenario. Similarly, the right sub-figure of Fig. 1 shows the QQ-plot corresponding to TH,iT_{H,i} for each i[n]i\in[n] in different colors. As before, these nn different QQ-plots, one for each i[n]i\in[n], all super-imposed on one another, indicate that the TH,iT_{H,i} are also each approximately standard normal, with or without MMEs.
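A minimal sketch (our own helper, not the paper's plotting code) of how such QQ points can be computed for a vector of statistics is:

```python
import numpy as np
from scipy.stats import norm

def qq_points(t_samples):
    """Return (standard-normal quantiles, empirical quantiles) for the statistics in
    t_samples (e.g. T_{G,j} over the 100 noise runs), ready to be plotted as in Fig. 1."""
    t_sorted = np.sort(np.asarray(t_samples))
    probs = (np.arange(1, t_sorted.size + 1) - 0.5) / t_sorted.size   # plotting positions
    return norm.ppf(probs), t_sorted
```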

Figure 1: Left: Quantile-Quantile plots of 𝒩(0,1)\mathcal{N}(0,1) vs. TG,jT_{G,j} (defined at the beginning of Sec. V-B) using 100 independent noise runs for all j[p]j\in[p] (one plot per index jj with different colors). Right: Quantile-Quantile plots of 𝒩(0,1)\mathcal{N}(0,1) vs. TH,iT_{H,i} (defined at the beginning of Sec. V-B) using 100 independent noise runs for all i[n]i\in[n] (one plot per index ii with different colors). For both plots, the pooling matrix contained MMEs.

V-C Sensitivity and Specificity of Odrlt and Drlt for 𝛅\boldsymbol{\delta^{*}}

We performed experiments to study the sensitivity and specificity of Drlt and Odrlt for 𝜹\boldsymbol{\delta^{*}}. In experimental setup E1, we varied fadv{0.01,,0.1}f_{adv}\in\{0.01,...,0.1\} with fixed values n=400,fsp=0.01,fσ=0.1n=400,f_{sp}=0.01,f_{\sigma}=0.1. In E2, we varied nn from 200 to 500 in steps of 50 with fadv=0.01,fsp=0.01,fσ=0.1f_{adv}=0.01,f_{sp}=0.01,f_{\sigma}=0.1. In E3, we varied fσ{0,0.05,,0.5}f_{\sigma}\in\{0,0.05,...,0.5\} with n=400,fadv=0.01,fsp=0.01n=400,f_{adv}=0.01,f_{sp}=0.01. In E4, we varied fsp{0.01,,0.1}f_{sp}\in\{0.01,...,0.1\} with n=400,fadv=0.01,fσ=0.1n=400,f_{adv}=0.01,f_{\sigma}=0.1. The experiments were run 100 times across different noise instances in 𝜼\boldsymbol{\eta}, for the same signal 𝜷\boldsymbol{\beta^{*}} (in E1, E2 and E3) and sensing matrix 𝑨\boldsymbol{A} (in E1, E3 and E4). In E4, the sparsity of the signal varies; therefore, the signal vector 𝜷\boldsymbol{\beta^{*}} also varies. Similarly, in E2, as nn varies, the sensing matrix 𝑨\boldsymbol{A} also varies.

The empirical sensitivity and specificity of a test were computed as follows. The estimate 𝜹^𝑾\boldsymbol{\hat{\delta}_{W}} was binarized to create a vector 𝒃^𝑾,𝜹\boldsymbol{\hat{b}_{W,\delta}} such that for all i[n]i\in[n], the value of b^W,δ(i)\hat{b}_{W,\delta}(i) was set to 1 if Drlt or Odrlt rejected the hypothesis 𝖧𝟢,𝗂\mathsf{H_{0,i}}, and b^W,δ(i)\hat{b}_{W,\delta}(i) was set to 0 otherwise. Likewise, a ground truth binary vector 𝒃𝜹\boldsymbol{b_{\delta}^{*}} was created which satisfied bδ(i)=1b^{*}_{\delta}(i)=1 at all locations ii where δi0\delta^{*}_{i}\neq 0 and bδ(i)=0b^{*}_{\delta}(i)=0 otherwise. Sensitivity and specificity values were computed by comparing corresponding entries of 𝒃𝜹\boldsymbol{b^{*}_{\delta}} and 𝒃^𝑾,𝜹\boldsymbol{\hat{b}_{W,\delta}}. The sensitivity of the Drlt and Odrlt tests for 𝜹\boldsymbol{\delta^{*}}, averaged over 100 runs of different 𝜼\boldsymbol{\eta} instances, is reported in Fig. 2 for the different experimental settings E1, E2, E3, E4. Under setup E2, the sensitivity plot indicates that the sensitivity of Drlt and Odrlt increases as nn increases. Under setups E1, E3, and E4, the sensitivity of both Drlt and Odrlt is reasonable even with larger values of fadvf_{adv}, fσf_{\sigma}, and fspf_{sp} (which are difficult regimes). In Fig. 2, we compare the sensitivity of Drlt and Odrlt to that of the Robust Lasso from (6) without any debiasing step, which is abbreviated as Rl. To determine defectives and non-defectives for the Rl method, we adopted a thresholding strategy where an estimated element was considered defective (resp. non-defective) if its value was greater than or equal to (resp. less than) a threshold τss\tau_{ss}. The optimal value of τss\tau_{ss} was chosen clairvoyantly (i.e., assuming knowledge of the ground truth) using Youden’s index, i.e., the value that maximises Sensitivity+Specificity1\text{Sensitivity}+\text{Specificity}-1. Furthermore, Fig. 2 indicates that the sensitivity of Odrlt is superior to that of Rl and Drlt, with Drlt also slightly better than Rl. Note that, in practice, a choice of the threshold τss\tau_{ss} for Rl would be challenging and require a representative training set, whereas Drlt and Odrlt do not require any training set for the choice of such a threshold.
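A minimal sketch of this clairvoyant threshold selection (our own helper; the paper only specifies the Youden criterion) is given below. Here we threshold the magnitudes of the Rl estimate, and truth_support is the ground-truth support of the quantity being detected (here, the non-zero entries of 𝜹\boldsymbol{\delta^{*}}).

```python
import numpy as np

def youden_threshold(estimate, truth_support):
    """Scan candidate thresholds on |estimate| and return the one maximizing
    Youden's index, i.e. Sensitivity + Specificity - 1."""
    best_tau, best_index = 0.0, -np.inf
    for tau in np.unique(np.abs(estimate)):
        declared = np.abs(estimate) >= tau
        tp = np.sum(truth_support & declared)
        fn = np.sum(truth_support & ~declared)
        tn = np.sum(~truth_support & ~declared)
        fp = np.sum(~truth_support & declared)
        sens = tp / max(tp + fn, 1)
        spec = tn / max(tn + fp, 1)
        if sens + spec - 1 > best_index:
            best_index, best_tau = sens + spec - 1, tau
    return best_tau
```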

Figure 2: Average Sensitivity and Specificity plots (over 100 independent noise runs) for detecting measurements containing MMEs (i.e. detecting non-zero values of 𝜹\boldsymbol{\delta^{*}}) using Drlt, Odrlt and Robust Lasso (Rl). The experimental parameters are p=500,fσ=0.1,fadv=0.01,fsp=0.1,n=400p=500,f_{\sigma}=0.1,f_{adv}=0.01,f_{sp}=0.1,n=400. Left to right, top to bottom: results for experiments E1, E2, E3, E4 (see Sec. V-C for details).

V-D Identification of Defective Samples in 𝛃\boldsymbol{\beta^{*}}

In the next set of experimental results, we first examined the effectiveness of Drlt and Odrlt to detect defective samples in 𝜷\boldsymbol{\beta^{*}} in the presence of bit-flips in 𝑨\boldsymbol{A} induced as per adversarial MMEs. We compared the performance of Drlt and Odrlt to two other closely related algorithms to enable performance calibration: (1) Robust Lasso (Rl) from (6) without debiasing; (2) A hypothesis testing mechanism on a pooling matrix without model mismatch, which we refer to as Baseline-3. In Baseline-3, we generated measurements with the correct pooling matrix 𝑨\boldsymbol{A} (i.e., 𝜹=0\boldsymbol{\delta^{*}}=0) and obtained a debiased Lasso estimate as given by (11). (Note that Baseline-3 is very different from Baseline-1 and Baseline-2 from Sec. V-A as in this approach 𝜹=0\boldsymbol{\delta^{*}}=0.) Using this debiased estimate, we obtained a hypothesis test similar to Odrlt. In the case of Rl, the decision regarding whether a sample is defective or not was taken based on a threshold τss\tau_{ss} that was chosen as the Youden’s index on a training set of signals from the same distribution. The regularization parameters λ1,λ2\lambda_{1},\lambda_{2} were chosen separately for every choice of parameters fadv,fσ,fspf_{adv},f_{\sigma},f_{sp} and nn.

We examined the variation in sensitivity and specificity with regard to change in the following parameters, keeping all other parameters fixed: (EA) the number of bit-flips in the matrix 𝑨\boldsymbol{A} expressed as a fraction fadv[0,1]f_{adv}\in[0,1] of nn; (EB) the number of pools nn; (EC) the noise standard deviation σ\sigma expressed as a fraction fσ[0,1]f_{\sigma}\in[0,1] of the quantity y¯\bar{y} defined in Sec. V-B; (ED) the sparsity (0\ell_{0} norm) ss of vector 𝜷\boldsymbol{\beta^{*}} expressed as a fraction fsp[0,1]f_{sp}\in[0,1] of the signal dimension pp. For the bit-flips experiment, i.e., (EA), fadvf_{adv} was varied in {0.01,0.02,,0.1}\{0.01,0.02,\ldots,0.1\} with n=400,fsp=0.01,fσ=0.1n=400,f_{sp}=0.01,f_{\sigma}=0.1. For the measurements experiment, i.e., (EB), nn was varied over {200,250,,500}\{200,250,\ldots,500\} with fsp=0.01,fadv=0.01,fσ=0.1f_{sp}=0.01,f_{adv}=0.01,f_{\sigma}=0.1. For the noise experiment, i.e., (EC), we varied fσf_{\sigma} in {0,0.05,,0.5}\{0,0.05,\ldots,0.5\} with n=400,fsp=0.01,fadv=0.01n=400,f_{sp}=0.01,f_{adv}=0.01. For the sparsity experiment, i.e., (ED), fspf_{sp} was varied in {0.01,0.02,,0.1}\{0.01,0.02,\ldots,0.1\} with n=400,fadv=0.01,fσ=0.1n=400,f_{adv}=0.01,f_{\sigma}=0.1.

The Sensitivity and Specificity values, averaged over 100 noise instances, for all four experiments are plotted in Fig. 3. The plots demonstrate the superior performance of Odrlt over Rl and Drlt. Furthermore, the performance of Drlt is also superior to that of Rl. In all regimes, Baseline-3 performs best as it uses an error-free sensing matrix. We also see that for higher nn, lower fσf_{\sigma} and lower fspf_{sp}, the sensitivity and specificity of Odrlt come very close to those of Baseline-3.

Figure 3: Average Sensitivity and Specificity plots (over 100 independent noise runs) for detecting defective samples (i.e., non-zero values of 𝜷\boldsymbol{\beta^{*}}) using Drlt, Odrlt, Robust Lasso and Baseline-3. Left to right, top to bottom: results for experiments (EA), (EB), (EC), (ED). The experimental parameters are p=500,fσ=0.1,fadv=0.01,fsp=0.1,n=400p=500,f_{\sigma}=0.1,f_{adv}=0.01,f_{sp}=0.1,n=400. See Sec. V-D for more details.

V-E RRMSE Comparison of Debiased Robust Lasso Techniques to Baseline Algorithms

Figure 4: Average RRMSE comparison (over 100 independent noise runs) using Odrl, Drl, L1 (Lasso with 1\ell_{1} data fidelity), L2 (standard Lasso), Rl1 (L1 with Ransac), Rl2 (L2 with Ransac), and robust Lasso (Rl) w.r.t. variation in the following parameters, keeping others fixed: bit-flip proportion fadvf_{adv} (top left), number of measurements nn (top right), noise level fσf_{\sigma} (bottom left) and sparsity fspf_{sp} (bottom right). The fixed parameters are p=500,fσ=0.1,fadv=0.01,fsp=0.01,n=400p=500,f_{\sigma}=0.1,f_{adv}=0.01,f_{sp}=0.01,n=400. See Sec. V-E for more details.

We computed estimates of 𝜷\boldsymbol{\beta^{*}} using the debiased robust Lasso technique in two ways: (i) with the weights matrix 𝑾𝑨\boldsymbol{W}\triangleq\boldsymbol{A}, and (ii) the optimal 𝑾\boldsymbol{W} as obtained using Alg. 2. We henceforth refer to these estimators as Debiased Robust Lasso (Drl) and Optimal Debiased Robust Lasso (Odrl) respectively.

We computed the relative root mean squared error (RRMSE) for Drl and Odrl as follows: First, the pooled measurements with MMEs were identified as described in Sec. V-C and then discarded. From the remaining measurements, an estimate of 𝜷\boldsymbol{\beta^{*}} was obtained using robust Lasso with the optimal λ1,λ2\lambda_{1},\lambda_{2} chosen by cross-validation. Given the resultant estimate 𝜷^\boldsymbol{\hat{\beta}}, the RRMSE was computed as 𝜷𝜷^2/𝜷2\|\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}}\|_{2}/\|\boldsymbol{\beta^{*}}\|_{2}.
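A minimal sketch of this RRMSE pipeline (our own helper; solve_robust_lasso can be any solver of (6), e.g. the cvxpy sketch given earlier with cross-validated λ1,λ2\lambda_{1},\lambda_{2}) is:

```python
import numpy as np

def rrmse_after_discarding(y, A, flagged, beta_true, solve_robust_lasso):
    """Discard measurements flagged as MME-affected (Sec. V-C), re-estimate beta from
    the remaining measurements, and return ||beta* - beta_hat||_2 / ||beta*||_2."""
    keep = ~flagged                                   # boolean mask over the n measurements
    beta_hat, _ = solve_robust_lasso(y[keep], A[keep])
    return np.linalg.norm(beta_true - beta_hat) / np.linalg.norm(beta_true)
```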

We compared the RRMSE of Drl and Odrl to that of the following algorithms:

1. Robust Lasso or Rl from (6).

2. Lasso (referred to as L2) based on minimizing 𝒚𝑨𝜷22+λ𝜷1\|\boldsymbol{y}-\boldsymbol{A\beta}\|^{2}_{2}+\lambda\|\boldsymbol{\beta}\|_{1} with respect to 𝜷\boldsymbol{\beta}. Note that this ignores MMEs.

3. An inherently outlier-resistant version of Lasso which uses the 1\ell_{1} data fidelity (referred to as L1), based on minimizing 𝒚𝑨𝜷1+λ𝜷1\|\boldsymbol{y}-\boldsymbol{A\beta}\|_{1}+\lambda\|\boldsymbol{\beta}\|_{1} with respect to 𝜷\boldsymbol{\beta}.

4. Variants of L1 and L2 combined with the well-known Ransac (Random Sample Consensus) framework [16] (described below in more detail). The combined estimators are referred to as Rl1 and Rl2 respectively.

Ransac is a popular randomized robust regression algorithm, widely used in computer vision [17, Chap. 10]. We apply it here to the signal reconstruction problem considered in this paper. In Ransac, multiple small subsets of measurements from 𝒚\boldsymbol{y} are randomly chosen. Let the total number of subsets be NSN_{S}. Let the set of the chosen subsets be denoted by {𝒵i}i=1NS\{\mathcal{Z}_{i}\}_{i=1}^{N_{S}}. From each subset 𝒵i\mathcal{Z}_{i}, the vector 𝜷^(i)\boldsymbol{\hat{\beta}}^{(i)} is estimated, using either L2 or L1. Every measurement is made to ‘cast a vote’ for one of the models from the set {𝜷^(i)}i=1NS\{\boldsymbol{\hat{\beta}}^{(i)}\}_{i=1}^{N_{S}}. We say that measurement yly_{l} (where l[n]l\in[n]) casts a vote for model 𝜷^(j)\boldsymbol{\hat{\beta}}^{(j)} (where j[NS]j\in[N_{S}]) if |yl𝒂𝒍.𝜷^(j)||yl𝒂𝒍.𝜷^(k)||y_{l}-\boldsymbol{a_{l.}}\boldsymbol{\hat{\beta}}^{(j)}|\leq|y_{l}-\boldsymbol{a_{l.}}\boldsymbol{\hat{\beta}}^{(k)}| for all k[NS],kjk\in[N_{S}],k\neq j. Let the model which garners the largest number of votes be denoted by 𝜷^j\boldsymbol{\hat{\beta}}^{j_{*}}, where j[NS]j_{*}\in[N_{S}]. The set of measurements which voted for this model is called the consensus set. Ransac when combined with L2 and L1 is respectively called Rl2 and Rl1. In Rl2, the estimator L2 is used to determine 𝜷\boldsymbol{\beta^{*}} using measurements only from the consensus set. Likewise, in Rl1, the estimator L1 is used to determine 𝜷\boldsymbol{\beta^{*}} using measurements only from the consensus set.
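Below is a minimal sketch of the Rl2 variant (Ransac wrapped around the ℓ2-data-fidelity Lasso); the Rl1 variant would replace the inner solver with an ℓ1-fidelity fit. The helper name, the scikit-learn solver and the regularization level alpha are our own illustrative assumptions; the values NS=500 subsets of size 0.9n follow the setting reported below.

```python
import numpy as np
from sklearn.linear_model import Lasso

def ransac_lasso(y, A, n_subsets=500, subset_frac=0.9, alpha=0.1, seed=0):
    """Ransac around an l2-fidelity Lasso (the Rl2 variant sketched in the text):
    fit a model on each random measurement subset, let every measurement vote for
    the model with the smallest residual on it, and re-fit on the consensus set."""
    rng = np.random.default_rng(seed)
    n = len(y)
    m = int(subset_frac * n)
    models = []
    for _ in range(n_subsets):
        idx = rng.choice(n, size=m, replace=False)
        est = Lasso(alpha=alpha, fit_intercept=False).fit(A[idx], y[idx])
        models.append(est.coef_)
    residuals = np.abs(y[:, None] - A @ np.array(models).T)       # shape (n, n_subsets)
    votes_of = residuals.argmin(axis=1)                            # model each measurement votes for
    winner = np.bincount(votes_of, minlength=n_subsets).argmax()   # model with the most votes
    consensus = votes_of == winner                                 # its consensus set
    return Lasso(alpha=alpha, fit_intercept=False).fit(A[consensus], y[consensus]).coef_
```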

Our experiments in this section were performed for signal and sensing matrix settings identical to those described in Sec. V-D. The performance in all experiments was measured using RRMSE, averaged over reconstructions from 100 independent noise runs. For all techniques, the regularization parameters were chosen using cross-validation following the procedure in [52]. The maximum number of subsets for finding the consensus set in Ransac was set to NS=500N_{S}=500 with 0.9n0.9n measurements in each subset. RRMSE plots for all competing algorithms are presented in Fig. 4, where we see that Odrl and Drl outperformed all other algorithms for all parameter ranges considered here. We also observe that Odrl produces lower RRMSE than Drl, particularly in the regime involving higher fadvf_{adv}.

VI Conclusion

We have presented a technique for determining the sparse vector 𝜷\boldsymbol{\beta^{*}} of health status values from noisy pooled measurements in 𝒚\boldsymbol{y}, with the additional feature that our technique is designed to handle bit-flip errors in the pooling matrix. These bit-flip errors can occur at a small number of unknown locations, due to which the pre-specified matrix 𝑨\boldsymbol{A} (known) and the actual pooling matrix 𝑨^\boldsymbol{\hat{A}} (unknown) via which pooled measurements are acquired, differ from each other. We use the theory of Lasso debiasing as our basic scaffolding to identify the defective samples in 𝜷\boldsymbol{\beta^{*}}, but with extensive and non-trivial theoretical and algorithmic innovations to (i) make the debiasing robust to model mismatch errors (MMEs), and also to (ii) enable identification of the pooled measurements that were affected by the MMEs. Our approach is also validated by an extensive set of simulation results, where the proposed method outperforms intuitive baseline techniques. To our best knowledge, there is no prior literature on using Lasso debiasing to identify measurements with MMEs.

There are several interesting avenues for future work:

1.

    Currently the optimal weights matrix 𝑾\boldsymbol{W} is designed to minimize the variance of the debiased estimates of 𝜷\boldsymbol{\beta^{*}} and not necessarily those of 𝜹\boldsymbol{\delta^{*}}. Our technique could in principle be extended to derive another weights matrix to minimize the variance of the debiased estimates of 𝜹\boldsymbol{\delta^{*}}.

2.

    A specific form of MMEs consists of unknown permutations in the pooling matrix (also called ‘permutation noise’ [51]) where the pooled results are swapped with one another. The techniques in this paper can be extended to identify pooled measurements that suffer from permutation noise, and potentially correct them. These results, which will be reported elsewhere, are an interesting contribution to the growing sub-area of ‘unlabelled sensing’ [41].

3. Our technique is based on work in [28], which requires that 𝜷0<n/logp\|\boldsymbol{\beta^{*}}\|_{0}<\sqrt{n}/\log p. This is a stronger condition than 𝜷0<n/logp\|\boldsymbol{\beta^{*}}\|_{0}<n/\log p, which is common in sparse regression. However, in the literature on Lasso debiasing, there exist techniques such as [7] which relax the condition on 𝜷0\|\boldsymbol{\beta^{*}}\|_{0} to allow for n/logp<𝜷0n/logp\sqrt{n}/\log p<\|\boldsymbol{\beta^{*}}\|_{0}\leq n/\log p, with the caveat that a priori knowledge of 𝜷0\|\boldsymbol{\beta^{*}}\|_{0} is (provably) essential. Incorporating these results within the current framework is another avenue for future research. It is also of interest to combine our results with those on in situ estimation of 𝜷0\|\boldsymbol{\beta^{*}}\|_{0} from pooled or compressed measurements as in [43, 33].

Appendix A Proofs of Theorems and Lemmas on Robust Lasso

We extend Theorem 1 and Lemma 1 of [38] to prove our Theorem 1. We re-parameterize model (5) and the robust Lasso optimisation problem (6) to match those in [38], i.e.,

𝒚=(𝑨|n𝑰𝒏)(𝜷𝜹/n)+𝜼=𝑨𝜷+n𝒆+𝜼,\boldsymbol{y}=\left(\boldsymbol{A}\ |\ \sqrt{n}\boldsymbol{I_{n}}\right)\begin{pmatrix}\boldsymbol{\beta^{*}}\\ \boldsymbol{\delta^{*}}/\sqrt{n}\end{pmatrix}+\boldsymbol{{\eta}}=\boldsymbol{A\beta^{*}}+\sqrt{n}\boldsymbol{e^{*}}+\boldsymbol{\eta}, (37)

where 𝒆𝜹/n\boldsymbol{e^{*}}\triangleq\boldsymbol{\delta^{*}}/\sqrt{n}. Note that the optimization problem (6) is

(𝜷^𝝀𝟏𝜹^𝝀~𝟐)\displaystyle\begin{pmatrix}\boldsymbol{\hat{\beta}_{\lambda_{1}}}\\ \boldsymbol{\hat{\delta}_{\tilde{\lambda}_{2}}}\end{pmatrix} =\displaystyle= argmin𝜷,𝜹12n𝒚(𝑨|n𝑰𝒏)(𝜷𝜹/n)22+λ1𝜷1+λ~2𝜹n1,\displaystyle\arg\min_{\boldsymbol{\beta},\boldsymbol{\delta}}{\frac{1}{2n}}\left\|\boldsymbol{y}-(\boldsymbol{A}|\sqrt{n}\boldsymbol{I_{n}})\begin{pmatrix}\boldsymbol{\beta}\\ \boldsymbol{\delta}/\sqrt{n}\end{pmatrix}\right\|^{2}_{2}+\lambda_{1}\|\boldsymbol{\beta}\|_{1}+{\tilde{\lambda}_{2}}\left\|\frac{\boldsymbol{\delta}}{\sqrt{n}}\right\|_{1},

where λ~2=nλ2\tilde{\lambda}_{2}=\sqrt{n}\lambda_{2}. The equivalent robust Lasso optimization problem for the model (37) is given by:

(𝜷^𝝀𝟏𝒆^𝝀~𝟐)\displaystyle\begin{pmatrix}\boldsymbol{\hat{\beta}_{\lambda_{1}}}\\ \boldsymbol{\hat{e}_{\tilde{\lambda}_{2}}}\end{pmatrix} =\displaystyle= argmin𝜷,𝒆12n𝒚(𝑨|n𝑰𝒏)(𝜷𝒆)22+λ1𝜷1+λ~2𝒆1,\displaystyle\arg\min_{\boldsymbol{\beta},\boldsymbol{e}}{\frac{1}{2n}}\left\|\boldsymbol{y}-(\boldsymbol{A}|\sqrt{n}\boldsymbol{I_{n}})\begin{pmatrix}\boldsymbol{\beta}\\ \boldsymbol{e}\end{pmatrix}\right\|^{2}_{2}+\lambda_{1}\|\boldsymbol{\beta}\|_{1}+{\tilde{\lambda}_{2}}\left\|\boldsymbol{e}\right\|_{1}, (38)

where 𝒆^𝝀𝟐=𝜹^𝝀𝟐/n\boldsymbol{\hat{e}_{\lambda_{2}}}=\boldsymbol{\hat{\delta}_{\lambda_{2}}}/\sqrt{n}. In order to prove Theorem 1, we first recall the Extended Restricted Eigenvalue Condition (EREC) for a sensing matrix from [38]. Given 𝜷\boldsymbol{\beta}^{*} and 𝜹\boldsymbol{\delta}^{*}, let us define sets

𝒮{j:βj0},{i:δi0}.\mathcal{S}\triangleq\{j:\beta^{*}_{j}\neq 0\}\ ,\ \mathcal{R}\triangleq\{i:\delta^{*}_{i}\neq 0\}. (39)

Note that s|𝒮|,r||s\triangleq|\mathcal{S}|,r\triangleq|\mathcal{R}|.

Definition 1

Extended Restricted Eigenvalue Condition (EREC) [38]: Given 𝒮,\mathcal{S},\mathcal{R} as defined in (39), and λ1,λ~2>0\lambda_{1},\tilde{\lambda}_{2}>0, an n×pn\times p matrix 𝐀\boldsymbol{A} is said to satisfy the EREC if there exists a κ>0\kappa>0 such that

1n𝑨𝒉𝜷+n𝒉𝜹2κ(𝒉𝜷2+𝒉𝜹2),\frac{1}{\sqrt{n}}\|\boldsymbol{Ah_{\beta}}+\sqrt{n}\boldsymbol{h_{\delta}}\|_{2}\geq\kappa(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2}), (40)

for all (𝐡𝛃,𝐡𝛅)𝒞(𝒮,,λ)(\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}(\mathcal{S},\mathcal{R},\lambda) with λ:=λ~2/λ1\lambda:=\tilde{\lambda}_{2}/\lambda_{1} and where 𝒞\mathcal{C} is defined as follows:

𝒞(𝒮,,λ)\displaystyle\mathcal{C}(\mathcal{S},\mathcal{R},\lambda) \displaystyle\triangleq {(𝒉𝜷,𝒉𝜹)p×n:(𝒉𝜷)𝒮c1+λ(𝒉𝜹)c13((𝒉𝜷)𝒮1+λ(𝒉𝜹)1)}.\displaystyle\{(\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathbb{R}^{p}\times\mathbb{R}^{n}:\|(\boldsymbol{h_{\beta}})_{\mathcal{S}^{c}}\|_{1}+\lambda\|(\boldsymbol{h_{\delta}})_{\mathcal{R}^{c}}\|_{1}\leq 3(\|(\boldsymbol{h_{\beta}})_{\mathcal{S}}\|_{1}+\lambda\|(\boldsymbol{h_{\delta}})_{\mathcal{R}}\|_{1})\}. (41)

Here, (𝐡𝛃)𝒮(\boldsymbol{h_{\beta}})_{\mathcal{S}} and (𝐡𝛅)(\boldsymbol{h_{\delta}})_{\mathcal{R}} are ss and rr dimensional vectors extracted from 𝐡𝛃\boldsymbol{h_{\beta}} and 𝐡𝛅\boldsymbol{h_{\delta}} respectively, restricted to the set 𝒮\mathcal{S} and \mathcal{R} as defined in (39). \blacksquare

In Lemma 1, we extend Lemma 1 from [38] to random Rademacher matrices. In this lemma we show that a random Rademacher matrix 𝑨\boldsymbol{A} satisfies the EREC with high probability for κ=1/16\kappa=1/16.

Lemma 1

Let 𝐀\boldsymbol{A} be an n×pn\times p matrix with i.i.d. Rademacher entries. There exist positive constants C1,C2,c3,c4C_{1},C_{2},c_{3},c_{4} such that if nC1slogpn\geq C_{1}s\log p and rmin{C2nlogn,slogplogn}r\leq\textrm{min}\{C_{2}\frac{n}{\log n},\frac{s\log p}{\log n}\} then

P((𝒉𝜷,𝒉𝜹)𝒞(𝒮,,λ),1n𝑨𝒉𝜷+n𝒉𝜹2116(𝒉𝜷2+𝒉𝜹2))1c3exp{c4n},P\left(\forall\ (\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}\left(\mathcal{S},\mathcal{R},\lambda\right),\ \frac{1}{\sqrt{n}}\|\boldsymbol{Ah_{\beta}}+\sqrt{n}\boldsymbol{h_{\delta}}\|_{2}\geq\frac{1}{16}(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2})\right)\geq 1-c_{3}\exp{\{-c_{4}n\}},

where λ=lognlogp\lambda=\sqrt{\frac{\log n}{\log p}} and 𝒞\mathcal{C} as in (41). \blacksquare

Proof of Lemma 1: Using a similar line of argument as in the proof of Lemma 1 of [38], it is enough to show the following two properties of the sensing matrix 𝑨\boldsymbol{A} to complete the proof.

1.

    Lower bound on 1nAhβ22+hδ22\frac{1}{n}\|\boldsymbol{Ah_{\beta}}\|_{2}^{2}+\|\boldsymbol{h_{\delta}}\|_{2}^{2}. For some κ1>0\kappa_{1}>0 with high probability,

    1n𝑨𝒉𝜷22+𝒉𝜹22κ1(𝒉𝜷2+𝒉𝜹2)2(𝒉𝜷,𝒉𝜹)𝒞(𝒮,,lognlogp).\frac{1}{n}\|\boldsymbol{Ah_{\beta}}\|_{2}^{2}+\|\boldsymbol{h_{\delta}}\|_{2}^{2}\geq\kappa_{1}\left(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2}\right)^{2}\ \qquad\forall\ (\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}\left(\mathcal{S},\mathcal{R},\sqrt{\frac{\log n}{\log p}}\right). (42)
2.

    Mutual Incoherence: The column space of the matrix 𝑨\boldsymbol{A} is incoherent with the column space of the identity matrix. For some κ2>0\kappa_{2}>0 with high probability,

    1n|𝑨𝒉𝜷,𝒉𝜹|κ2(𝒉𝜷2+𝒉𝜹2)2(𝒉𝜷,𝒉𝜹)𝒞(𝒮,,lognlogp).\frac{1}{\sqrt{n}}|\langle\boldsymbol{Ah_{\beta}},\boldsymbol{h_{\delta}}\rangle|\leq\kappa_{2}(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2})^{2}\ \qquad\forall\ (\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}\left(\mathcal{S},\mathcal{R},\sqrt{\frac{\log n}{\log p}}\right). (43)

By using (42) and (43), we have, with high probability,

1n𝑨𝒉𝜷+𝒏𝒉𝜹22\displaystyle\frac{1}{n}\|\boldsymbol{Ah_{\beta}+\sqrt{n}h_{\delta}}\|_{2}^{2} =\displaystyle= 1n𝑨𝒉𝜷22+𝒉𝜹22+2n𝑨𝒉𝜷,𝒉𝜹κ1(𝒉𝜷2+𝒉𝜹2)22κ2(𝒉𝜷2+𝒉𝜹2)2\displaystyle\frac{1}{n}\left\|\boldsymbol{Ah_{\beta}}\right\|_{2}^{2}+\|\boldsymbol{h_{\delta}}\|_{2}^{2}+\frac{2}{\sqrt{n}}\langle\boldsymbol{Ah_{\beta}},\boldsymbol{h_{\delta}}\rangle\geq\kappa_{1}\left(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2}\right)^{2}-2\kappa_{2}(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2})^{2}
=\displaystyle= (κ12κ2)(𝒉𝜷2+𝒉𝜹2)2\displaystyle(\kappa_{1}-2\kappa_{2})(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2})^{2}

The proof is completed if κ1>2κ2\kappa_{1}>2\kappa_{2}. We now show that (42) and (43) hold together with κ1>2κ2\kappa_{1}>2\kappa_{2} for a Rademacher sensing matrix 𝑨\boldsymbol{A}.
We first state two important facts on the Rademacher matrix 𝑨\boldsymbol{A}, which will be used in proving (42) and (43) respectively.

(1)

    We use a result following Lemma 1 [31] (see the equation immediately following Lemma 1 in [31], and set D¯\bar{D} in that equation to the identity matrix, since we are concerned with signals that are sparse in the canonical basis). Using this result, there exist positive constants c2,c3,c4c_{2},c^{\prime}_{3},c^{\prime}_{4}, such that with probability at least 1c3exp{c4n}1-c^{\prime}_{3}\exp{\{-c^{\prime}_{4}n\}}:

    1n𝑨𝒉𝜷2𝒉𝜷24c2logpn𝒉𝜷1𝒉𝜷p.\frac{1}{\sqrt{n}}\|\boldsymbol{Ah_{\beta}}\|_{2}\geq\frac{\|\boldsymbol{h_{\beta}}\|_{2}}{4}-c_{2}\sqrt{\frac{\log p}{n}}\|\boldsymbol{h_{\beta}}\|_{1}\ \forall\ \boldsymbol{h_{\beta}}\in\mathbb{R}^{p}. (44)
(2)

    From Theorem 4.4.5 of [49], for a s×rs\times r^{\prime} dimensional Rademacher matrix 𝑨𝓡𝒊𝓢𝒋\boldsymbol{A_{\mathcal{R}_{i}\mathcal{S}_{j}}}, there exists a constant c1>0c_{1}>0 such that, for any τ>0\tau^{\prime}>0, with probability at least 12exp{nτ2}1-2\exp{\{-n\tau^{\prime 2}\}} we have

    1n𝑨𝓡𝒊𝓢𝒋2=1nσmax(𝑨𝓡𝒊𝓢𝒋)c1(sn+rn+τ).\frac{1}{\sqrt{n}}\|\boldsymbol{A_{\mathcal{R}_{i}\mathcal{S}_{j}}}\|_{2}=\frac{1}{\sqrt{n}}\sigma_{max}({\boldsymbol{A_{\mathcal{R}_{i}\mathcal{S}_{j}}}})\leq c_{1}\left(\sqrt{\frac{s}{n}}+\sqrt{\frac{r^{\prime}}{n}}+\tau^{\prime}\right). (45)

Throughout this proof, we take the constants C1max{322c22,4(51200c1)2}C_{1}\triangleq\max\{32^{2}c_{2}^{2},4(51200c_{1})^{2}\} and C272(24c2)2C_{2}\triangleq\frac{7^{2}}{(24c_{2})^{2}}, where c2,c1c_{2},c_{1} are as defined in (44) and (45) respectively.

Proof of (42): We first obtain a lower bound on 1n𝑨𝒉𝜷22\frac{1}{n}\|\boldsymbol{Ah_{\beta}}\|_{2}^{2} using (44). For every (𝒉𝜷,𝒉𝜹)𝒞(𝒮,,lognlogp)(\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}\left(\mathcal{S},\mathcal{R},\sqrt{\frac{\log n}{\log p}}\right), we have:

𝒉𝜷14(𝒉𝜷)𝓢1+3lognlogp(𝒉𝜹)𝓡14s𝒉𝜷2+3lognlogpr𝒉𝜹2.\|\boldsymbol{h_{\beta}}\|_{1}\leq 4\|\boldsymbol{(h_{\beta})_{\mathcal{S}}}\|_{1}+3\sqrt{\frac{\log n}{\log p}}\|\boldsymbol{(h_{\delta})_{\mathcal{R}}}\|_{1}\leq 4\sqrt{s}\|\boldsymbol{h_{\beta}}\|_{2}+3\sqrt{\frac{\log n}{\log p}}\sqrt{r}\|\boldsymbol{h_{\delta}}\|_{2}. (46)

Substituting (46) in (44), we obtain that, with probability at least 1c3exp{c4n}1-c^{\prime}_{3}\exp{\{-c^{\prime}_{4}n\}}, for every (𝒉𝜷,𝒉𝜹)𝒞(𝒮,,lognlogp)(\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}\left(\mathcal{S},\mathcal{R},\sqrt{\frac{\log n}{\log p}}\right):

1n𝑨𝒉𝜷2\displaystyle\frac{1}{\sqrt{n}}\|\boldsymbol{Ah_{\beta}}\|_{2} \displaystyle\geq (144c2slogpn)𝒉𝜷23c2lognlogprlogpn𝒉𝜹2,\displaystyle\left(\frac{1}{4}-4c_{2}\sqrt{\frac{s\log p}{n}}\right)\|\boldsymbol{h_{\beta}}\|_{2}-3c_{2}\sqrt{\frac{\log n}{\log p}}\sqrt{\frac{r\log p}{n}}\|\boldsymbol{h_{\delta}}\|_{2},
1n𝑨𝒉𝜷2+𝒉𝜹2\displaystyle\therefore\frac{1}{\sqrt{n}}\|\boldsymbol{Ah_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2} \displaystyle\geq (144c2slogpn)𝒉𝜷2+(13c2lognlogprlogpn)𝒉𝜹2.\displaystyle\left(\frac{1}{4}-4c_{2}\sqrt{\frac{s\log p}{n}}\right)\|\boldsymbol{h_{\beta}}\|_{2}+\left(1-3c_{2}\sqrt{\frac{\log n}{\log p}}\sqrt{\frac{r\log p}{n}}\right)\|\boldsymbol{h_{\delta}}\|_{2}. (47)

Under the assumption nC1slogpn\geq C_{1}s\log p, the first term in the brackets of (47) is greater than 18\frac{1}{8}. Again, under the assumption rC2nlognr\leq C_{2}\frac{n}{\log n}, the second term is greater than 18\frac{1}{8}. Thus we have, 1n𝑨𝒉𝜷2+𝒉𝜹218(𝒉𝜷2+𝒉𝜹2)\frac{1}{\sqrt{n}}\|\boldsymbol{Ah_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2}\geq\frac{1}{8}(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2}). Squaring both sides, we have, 1n𝑨𝒉𝜷22+𝒉𝜹22+2n𝑨𝒉𝜷2𝒉𝜹2164(𝒉𝜷2+𝒉𝜹2)2\frac{1}{n}\|\boldsymbol{Ah_{\beta}}\|_{2}^{2}+\|\boldsymbol{h_{\delta}}\|_{2}^{2}+\frac{2}{\sqrt{n}}\|\boldsymbol{Ah_{\beta}}\|_{2}\|\boldsymbol{h_{\delta}}\|_{2}\geq\frac{1}{64}\left(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2}\right)^{2}. Using the fact that 𝒂22+𝒃222𝒂2𝒃2\|\boldsymbol{a}\|^{2}_{2}+\|\boldsymbol{b}\|^{2}_{2}\geq 2\|\boldsymbol{a}\|_{2}\|\boldsymbol{b}\|_{2} for any vectors 𝒂,𝒃\boldsymbol{a},\boldsymbol{b}, we have, 2(1n𝑨𝒉𝜷22+𝒉𝜹22)1n𝑨𝒉𝜷22+𝒉𝜹22+2n𝑨𝒉𝜷2𝒉𝜹2164(𝒉𝜷2+𝒉𝜹2)22\left(\frac{1}{n}\|\boldsymbol{Ah_{\beta}}\|_{2}^{2}+\|\boldsymbol{h_{\delta}}\|_{2}^{2}\right)\geq\frac{1}{n}\|\boldsymbol{Ah_{\beta}}\|_{2}^{2}+\|\boldsymbol{h_{\delta}}\|_{2}^{2}+\frac{2}{\sqrt{n}}\|\boldsymbol{Ah_{\beta}}\|_{2}\|\boldsymbol{h_{\delta}}\|_{2}\geq\frac{1}{64}\left(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2}\right)^{2}. Hence we have with probability at least 1c3exp{c4n}1-c^{\prime}_{3}\exp{\{-c^{\prime}_{4}n\}}, for every (𝒉𝜷,𝒉𝜹)𝒞(𝒮,,lognlogp)(\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}\left(\mathcal{S},\mathcal{R},\sqrt{\frac{\log n}{\log p}}\right)

1n𝑨𝒉𝜷22+𝒉𝜹221128(𝒉𝜷2+𝒉𝜹2)2.\frac{1}{n}\|\boldsymbol{Ah_{\beta}}\|_{2}^{2}+\|\boldsymbol{h_{\delta}}\|_{2}^{2}\geq\frac{1}{128}(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2})^{2}. (48)

Therefore, we have κ1=1/128\kappa_{1}=1/128 completing the proof of (42).

Proof of (43): This part of the proof directly follows the proof of Lemma 2 in [38], with a few minor differences in constant factors. Nevertheless, we are including it here to make the paper self-contained.

Divide the set {1,2,,p}\{1,2,\ldots,p\} into subsets 𝒮1,𝒮2,,𝒮q\mathcal{S}_{1},\mathcal{S}_{2},\ldots,\mathcal{S}_{q} of size ss each, such that the first set 𝒮1\mathcal{S}_{1} contains the ss largest absolute value entries of 𝒉𝜷\boldsymbol{h_{\beta}} indexed by 𝒮\mathcal{S}, the set 𝒮2\mathcal{S}_{2} contains the ss largest absolute value entries of the vector (𝒉𝜷)𝓢𝒄\boldsymbol{({h_{\beta}})_{\mathcal{S}^{c}}}, 𝒮3\mathcal{S}_{3} contains the second largest ss absolute value entries of (𝒉𝜷)𝓢𝒄\boldsymbol{({h_{\beta}})_{\mathcal{S}^{c}}}, and so on. By the same strategy, we also divide the set {1,2,,n}\{1,2,\ldots,n\} into subsets 1,2,,k\mathcal{R}_{1},\mathcal{R}_{2},\ldots,\mathcal{R}_{k} such that the first set 1\mathcal{R}_{1} contains the rr entries of 𝒉𝜹\boldsymbol{h_{\delta}} indexed by \mathcal{R} and the sets 2,3,\mathcal{R}_{2},\mathcal{R}_{3},\ldots are of size rrr^{\prime}\geq r. We have for every (𝒉𝜷,𝒉𝜹)𝒞(𝒮,,lognlogp)(\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}\left(\mathcal{S},\mathcal{R},\sqrt{\frac{\log n}{\log p}}\right),

1n|𝑨𝒉𝜷,𝒉𝜹|i,j1n|𝑨𝓡𝒊𝓢𝒋(𝒉𝜷)𝒮j,(𝒉𝜹)i|\displaystyle\frac{1}{\sqrt{n}}|\langle\boldsymbol{Ah_{\beta}},\boldsymbol{h_{\delta}}\rangle|\leq\sum_{i,j}\frac{1}{\sqrt{n}}|\langle\boldsymbol{A_{\mathcal{R}_{i}\mathcal{S}_{j}}}(\boldsymbol{h_{\beta}})_{\mathcal{S}_{j}},(\boldsymbol{h_{\delta}})_{\mathcal{R}_{i}}\rangle| maxi,j1n𝑨𝓡𝒊𝓢𝒋2i,j(𝒉𝜷)𝒮j2(𝒉𝜹)i2\displaystyle\leq\max_{i,j}\frac{1}{\sqrt{n}}\|\boldsymbol{A_{\mathcal{R}_{i}\mathcal{S}_{j}}}\|_{2}\sum_{i^{\prime},j^{\prime}}\|(\boldsymbol{h_{\beta}})_{\mathcal{S}_{j^{\prime}}}\|_{2}\|(\boldsymbol{h_{\delta}})_{\mathcal{R}_{i^{\prime}}}\|_{2} (49)
=maxi,j1n𝑨𝓡𝒊𝓢𝒋2(j(𝒉𝜷)𝒮j2)(i(𝒉𝜹)i2).\displaystyle=\max_{i,j}\frac{1}{\sqrt{n}}\|\boldsymbol{A_{\mathcal{R}_{i}\mathcal{S}_{j}}}\|_{2}\left(\sum_{j^{\prime}}\|(\boldsymbol{h_{\beta}})_{\mathcal{S}_{j^{\prime}}}\|_{2}\right)\left(\sum_{i^{\prime}}\|(\boldsymbol{h_{\delta}})_{\mathcal{R}_{i^{\prime}}}\|_{2}\right). (50)

Note that 𝑨𝓡𝒊𝓢𝒋\boldsymbol{A_{\mathcal{R}_{i}\mathcal{S}_{j}}} (a submatrix of 𝑨\boldsymbol{A} containing rows belonging to i\mathcal{R}_{i} and columns belonging to 𝒮j\mathcal{S}_{j}) is itself a Rademacher matrix with i.i.d. entries. Taking the union bound over all possible values of 𝒮j\mathcal{S}_{j} and i\mathcal{R}_{i}, we have that the inequality in (45) holds with probability at least 12(nr)(ps)exp(nτ2)1-2\binom{n}{r^{\prime}}\binom{p}{s}\exp{(-n\tau^{\prime 2})}. If n4τ2slog(p)n\geq 4{\tau^{\prime}}^{-2}s\log(p) we obtain, (ps)psexp(τ2n/4)\binom{p}{s}\leq p^{s}\leq\exp({\tau^{\prime}}^{2}n/4). Furthermore, if we assume, n4τ2rlog(n)n\geq 4{\tau^{\prime}}^{-2}r^{\prime}\log(n), we have (nr)nrexp(τ2n/4)\binom{n}{r^{\prime}}\leq n^{r^{\prime}}\leq\exp({\tau^{\prime}}^{2}n/4). Later we will give a choice of τ\tau^{\prime} which ensures that these conditions are satisfied. Therefore, we obtain with probability at least 12exp{nτ2/2}1-2\exp{\{-n\tau^{\prime 2}/2\}},

maxi,j1n𝑨𝓡𝒊𝓢𝒋2c1(sn+rn+τ).\textrm{max}_{i,j}\frac{1}{\sqrt{n}}\|\boldsymbol{A_{\mathcal{R}_{i}\mathcal{S}_{j}}}\|_{2}\leq c_{1}\left(\sqrt{\frac{s}{n}}+\sqrt{\frac{r^{\prime}}{n}}+\tau^{\prime}\right). (51)

Using the first inequality in the last equation of Section 2.1 of [8] we obtain i=3q(𝒉𝜷)𝓢𝒊21s(𝒉𝜷)𝓢𝒄1\sum_{i=3}^{q}\|\boldsymbol{(h_{\beta})_{\mathcal{S}_{i}}}\|_{2}\leq\frac{1}{\sqrt{s}}\|\boldsymbol{(h_{\beta})_{\mathcal{S}^{c}}}\|_{1}. Furthermore, for every (𝒉𝜷,𝒉𝜹)𝒞(𝒮,,lognlogp)(\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}\left(\mathcal{S},\mathcal{R},\sqrt{\frac{\log n}{\log p}}\right), we have (𝒉𝜷)𝓢𝒄13s𝒉𝜷2+3lognlogpr𝒉𝜹2\|\boldsymbol{(h_{\beta})_{\mathcal{S}^{c}}}\|_{1}\leq 3\sqrt{s}\|\boldsymbol{h_{\beta}}\|_{2}+3\sqrt{\frac{\log n}{\log p}}\sqrt{r}\|\boldsymbol{h_{\delta}}\|_{2}. Hence,

i=1q(𝒉𝜷)𝓢𝒊2=(𝒉𝜷)𝓢𝟏2+(𝒉𝜷)𝓢𝟐2+i=3q(𝒉𝜷)𝓢𝒊22𝒉𝜷2+i=3q(𝒉𝜷)𝓢𝒊25𝒉𝜷2+3lognlogprs𝒉𝜹2.\sum_{i=1}^{q}\|\boldsymbol{(h_{\beta})_{\mathcal{S}_{i}}}\|_{2}=\|\boldsymbol{(h_{\beta})_{\mathcal{S}_{1}}}\|_{2}+\|\boldsymbol{(h_{\beta})_{\mathcal{S}_{2}}}\|_{2}+\sum_{i=3}^{q}\|\boldsymbol{(h_{\beta})_{\mathcal{S}_{i}}}\|_{2}\leq 2\|\boldsymbol{h_{\beta}}\|_{2}+\sum_{i=3}^{q}\|\boldsymbol{(h_{\beta})_{\mathcal{S}_{i}}}\|_{2}\leq 5\|\boldsymbol{h_{\beta}}\|_{2}+3\sqrt{\frac{\log n}{\log p}}\sqrt{\frac{r}{s}}\|\boldsymbol{h_{\delta}}\|_{2}.

Following a similar process we obtain i=3k(𝒉𝜹)𝓡𝒊21r(𝒉𝜹)𝓡𝒄1\sum_{i=3}^{k}\|\boldsymbol{(h_{\delta})_{\mathcal{R}_{i}}}\|_{2}\leq\frac{1}{\sqrt{r^{\prime}}}\|\boldsymbol{(h_{\delta})_{\mathcal{R}^{c}}}\|_{1}. Furthermore, for every (𝒉𝜷,𝒉𝜹)𝒞(𝒮,,lognlogp)(\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}\left(\mathcal{S},\mathcal{R},\sqrt{\frac{\log n}{\log p}}\right), we have 1r(𝒉𝜹)𝓡𝒄13sr1lognlogp𝒉𝜷2+3rr𝒉𝜹2\frac{1}{\sqrt{r^{\prime}}}\|\boldsymbol{(h_{\delta})_{\mathcal{R}^{c}}}\|_{1}\leq 3\sqrt{\frac{s}{r^{\prime}}}\frac{1}{\sqrt{\frac{\log n}{\log p}}}\|\boldsymbol{h_{\beta}}\|_{2}+3\sqrt{\frac{r}{r^{\prime}}}\|\boldsymbol{h_{\delta}}\|_{2}. Since rrr^{\prime}\geq r,

i=1k(𝒉𝜹)𝓡𝒊2=(𝒉𝜹)𝓡𝟏2+(𝒉𝜹)𝓡𝟐2+i=3k(𝒉𝜹)𝓡𝒊22𝒉𝜹2+i=3k(𝒉𝜹)𝓡𝒊25𝒉𝜹2+3lognlogpsr𝒉𝜷2.\sum_{i=1}^{k}\|\boldsymbol{(h_{\delta})_{\mathcal{R}_{i}}}\|_{2}=\|\boldsymbol{(h_{\delta})_{\mathcal{R}_{1}}}\|_{2}+\|\boldsymbol{(h_{\delta})_{\mathcal{R}_{2}}}\|_{2}+\sum_{i=3}^{k}\|\boldsymbol{(h_{\delta})_{\mathcal{R}_{i}}}\|_{2}\leq 2\|\boldsymbol{h_{\delta}}\|_{2}+\sum_{i=3}^{k}\|\boldsymbol{(h_{\delta})_{\mathcal{R}_{i}}}\|_{2}\leq 5\|\boldsymbol{h_{\delta}}\|_{2}+\frac{3}{\sqrt{\frac{\log n}{\log p}}}\sqrt{\frac{s}{r^{\prime}}}\|\boldsymbol{h_{\beta}}\|_{2}.

Hence, substituting (51) and the two bounds above into (49), we obtain, with probability at least 12exp{nτ2/2}1-2\exp{\{-n\tau^{\prime 2}/2\}}, for every (𝒉𝜷,𝒉𝜹)𝒞(𝒮,,lognlogp)(\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}\left(\mathcal{S},\mathcal{R},\sqrt{\frac{\log n}{\log p}}\right),

1n|𝑨𝒉𝜷,𝒉𝜹|c1(sn+rn+τ)×(5𝒉𝜷2+3lognlogprs𝒉𝜹2)×(5𝒉𝜹2+3lognlogpsr𝒉𝜷2).\displaystyle\frac{1}{\sqrt{n}}|\langle\boldsymbol{Ah_{\beta}},\boldsymbol{h_{\delta}}\rangle|\leq c_{1}\left(\sqrt{\frac{s}{n}}+\sqrt{\frac{r^{\prime}}{n}}+\tau^{\prime}\right)\times\left(5\|\boldsymbol{h_{\beta}}\|_{2}+3\sqrt{\frac{\log n}{\log p}}\sqrt{\frac{r}{s}}\|\boldsymbol{h_{\delta}}\|_{2}\right)\times\left(5\|\boldsymbol{h_{\delta}}\|_{2}+\frac{3}{\sqrt{\frac{\log n}{\log p}}}\sqrt{\frac{s}{r^{\prime}}}\|\boldsymbol{h_{\beta}}\|_{2}\right). (52)

Recall that rslogplognr\leq\frac{s\log p}{\log n}, by assumption. Taking r=slogplognr^{\prime}=\frac{s\log p}{\log n} leads to lognlogprslognlogprs=1\sqrt{\frac{\log n}{\log p}}\sqrt{\frac{r}{s}}\leq\sqrt{\frac{\log n}{\log p}}\sqrt{\frac{r^{\prime}}{s}}=1 and 1lognlogpsr=1\frac{1}{\sqrt{\frac{\log n}{\log p}}}\sqrt{\frac{s}{r^{\prime}}}=1. Thus, we obtain with probability at least 12exp(nτ2/2)1-2\exp({-n{\tau^{\prime 2}}/2}) for every (𝒉𝜷,𝒉𝜹)𝒞(𝒮,,lognlogp)(\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}\left(\mathcal{S},\mathcal{R},\sqrt{\frac{\log n}{\log p}}\right),

1n|𝑨𝒉𝜷,𝒉𝜹|25c1(sn+rn+τ)×(𝒉𝜷2+𝒉𝜹2)2\frac{1}{\sqrt{n}}|\langle\boldsymbol{Ah_{\beta}},\boldsymbol{h_{\delta}}\rangle|\leq 25c_{1}\left(\sqrt{\frac{s}{n}}+\sqrt{\frac{r^{\prime}}{n}}+\tau^{\prime}\right)\times(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2})^{2} (53)

Let τ1/(51200c1)\tau^{\prime}\triangleq 1/(51200c_{1}). Recall that, C1max{322c22,4(51200c1)2}C_{1}\triangleq\max\{32^{2}c_{2}^{2},4(51200c_{1})^{2}\}. Then nC1slogpn\geq C_{1}s\log p implies n4τ2slogp=4τ2rlognn\geq 4{\tau^{\prime}}^{-2}s\log p=4{\tau^{\prime}}^{-2}r^{\prime}\log n. Furthermore,

snrn=slogpnlognτ/2.\sqrt{\frac{s}{n}}\leq\sqrt{\frac{r^{\prime}}{n}}=\sqrt{\frac{s\log p}{n\log n}}\leq\tau^{\prime}/2. (54)

Therefore, we have with probability at least 12exp(nτ2/2)1-2\exp({-n{\tau^{\prime 2}}/2}), for every (𝒉𝜷,𝒉𝜹)𝒞(𝒮,,lognlogp)(\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}\left(\mathcal{S},\mathcal{R},\sqrt{\frac{\log n}{\log p}}\right)

1n|𝑨𝒉𝜷,𝒉𝜹|25c1×2τ(𝒉𝜷2+𝒉𝜹2)21512(𝒉𝜷2+𝒉𝜹2)2\frac{1}{\sqrt{n}}|\langle\boldsymbol{Ah_{\beta}},\boldsymbol{h_{\delta}}\rangle|\leq 25c_{1}\times 2\tau^{\prime}(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2})^{2}\leq\frac{1}{512}(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2})^{2} (55)

This completes the proof of (43).

Now, from (48) and (55), using a union bound, we obtain with probability at least 1(c3exp(c4n)+2exp(nτ2/2))1-(c^{\prime}_{3}\exp(-c^{\prime}_{4}n)+2\exp(-n{\tau^{\prime 2}}/2)),

1n𝑨𝒉𝜷+𝒏𝒉𝜹22(κ12κ2)(𝒉𝜷2+𝒉𝜹2)2=κ2(𝒉𝜷2+𝒉𝜹2)2\frac{1}{n}\|\boldsymbol{Ah_{\beta}+\sqrt{n}h_{\delta}}\|_{2}^{2}\geq(\kappa_{1}-2\kappa_{2})(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2})^{2}=\kappa^{2}(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2})^{2} (56)

Taking c3=c3+2c_{3}=c^{\prime}_{3}+2 and c4=min{c4,τ2/2}c_{4}=\min\{c^{\prime}_{4},{\tau^{\prime 2}}/2\}, we have 1(c3exp(c4n)+2exp(τ2n/2))1c3exp(c4n)1-(c^{\prime}_{3}\exp(-c^{\prime}_{4}n)+2\exp(-{\tau^{\prime 2}}n/2))\geq 1-c_{3}\exp(-c_{4}n).

Note that κ=κ12κ2=1/16\kappa=\sqrt{\kappa_{1}-2\kappa_{2}}=1/16. Taking the square root of both sides of (56), we obtain, with probability at least 1c3exp(c4n)1-c_{3}\exp(-c_{4}n),

1n𝑨𝒉𝜷+𝒏𝒉𝜹2116(𝒉𝜷2+𝒉𝜹2)(𝒉𝜷,𝒉𝜹)𝒞(𝒮,,lognlogp).\frac{1}{\sqrt{n}}\|\boldsymbol{Ah_{\beta}+\sqrt{n}h_{\delta}}\|_{2}\geq\frac{1}{16}(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2})\ \forall\ (\boldsymbol{h_{\beta}},\boldsymbol{h_{\delta}})\in\mathcal{C}\left(\mathcal{S},\mathcal{R},\sqrt{\frac{\log n}{\log p}}\right).

This completes the proof of the lemma. \blacksquare

A-A Proof of Theorem 1

Proof of (7): We now derive a bound on the 1\ell_{1} norm of the error 𝜷^𝝀𝟏𝜷\boldsymbol{\hat{\beta}_{\lambda_{1}}}-\boldsymbol{\beta^{*}} of the robust Lasso estimate given by the optimisation problem (6). Recall that λ1=4σlogpn\lambda_{1}=\frac{4\sigma\sqrt{\log p}}{\sqrt{n}} and λ2=4σlognn\lambda_{2}=\frac{4\sigma\sqrt{\log n}}{{n}}. We choose λ~2nλ2=4σlognn\tilde{\lambda}_{2}\triangleq\sqrt{n}\lambda_{2}=\frac{4\sigma\sqrt{\log n}}{\sqrt{n}}. We use λ~2\tilde{\lambda}_{2} to define the cone constraint in (41). Note that, in the proof of Theorem 1 of [38], it is shown that 𝒉𝜷𝜷^𝝀𝟏𝜷\boldsymbol{h_{\beta}}\triangleq\boldsymbol{\hat{\beta}_{\lambda_{1}}-\beta^{*}} and 𝒉𝜹1n(𝜹^𝝀𝟐𝜹)\boldsymbol{h_{\delta}}\triangleq\frac{1}{\sqrt{n}}(\boldsymbol{\hat{\delta}_{\lambda_{2}}-\delta^{*}}) satisfy the cone constraint given in (41). Therefore, we have

(𝒉𝜷)𝒮c1+λ~2λ1(𝒉𝜹)c13((𝒉𝜷)𝒮1+λ~2λ1(𝒉𝜹)1).\|(\boldsymbol{h_{\beta}})_{\mathcal{S}^{c}}\|_{1}+\frac{\tilde{\lambda}_{2}}{\lambda_{1}}\|(\boldsymbol{h_{\delta}})_{\mathcal{R}^{c}}\|_{1}\leq 3(\|(\boldsymbol{h_{\beta}})_{\mathcal{S}}\|_{1}+\frac{\tilde{\lambda}_{2}}{\lambda_{1}}\|(\boldsymbol{h_{\delta}})_{\mathcal{R}}\|_{1}). (57)

Now by using Eqn. (57), we have

𝒉𝜷1\displaystyle\|\boldsymbol{h_{\beta}}\|_{1} =\displaystyle= (𝒉𝜷)𝓢1+(𝒉𝜷)𝓢𝒄14(𝒉𝜷)𝓢1+3λ~2λ1(𝒉𝜹)𝓡14s𝒉𝜷2+3rλ~2λ1𝒉𝜹2.\displaystyle\|\boldsymbol{(h_{\beta})_{\mathcal{S}}}\|_{1}+\|\boldsymbol{(h_{\beta})_{\mathcal{S}^{c}}}\|_{1}\leq 4\|\boldsymbol{(h_{\beta})_{\mathcal{S}}}\|_{1}+3\frac{\tilde{\lambda}_{2}}{\lambda_{1}}\|\boldsymbol{(h_{\delta})_{\mathcal{R}}}\|_{1}\leq 4\sqrt{s}\|\boldsymbol{h_{\beta}}\|_{2}+3\sqrt{r}\frac{\tilde{\lambda}_{2}}{\lambda_{1}}\|\boldsymbol{h_{\delta}}\|_{2}. (58)

Here, the last inequality of Eqn.(58) holds since (𝒉𝜷)𝓢1s𝒉𝜷2\left\|\boldsymbol{(h_{\beta})_{\mathcal{S}}}\right\|_{1}\leq\sqrt{s}\|\boldsymbol{h_{\beta}}\|_{2} and (𝒉𝜹)𝓡1r𝒉𝜹2\left\|\boldsymbol{(h_{\delta})_{\mathcal{R}}}\right\|_{1}\leq\sqrt{r}\|\boldsymbol{h_{\delta}}\|_{2}. Note that, max{s,r}s+r\max\{\sqrt{s},\sqrt{r}\}\leq\sqrt{s+r}. Based on the values of λ1,λ~2\lambda_{1},\tilde{\lambda}_{2}, we have λ~2<λ1\tilde{\lambda}_{2}<\lambda_{1} since n<pn<p. Hence, by using Eqn.(58), we have

𝒉𝜷14s𝒉𝜷2+3r𝒉𝜹24s+r(𝒉𝜷2+𝒉𝜹2).\displaystyle\|\boldsymbol{h_{\beta}}\|_{1}\leq 4\sqrt{s}\|\boldsymbol{h_{\beta}}\|_{2}+3\sqrt{r}\|\boldsymbol{h_{\delta}}\|_{2}\leq 4\sqrt{s+r}\left(\|\boldsymbol{h_{\beta}}\|_{2}+\|\boldsymbol{h_{\delta}}\|_{2}\right). (59)

Recall that, 𝒆=𝜹n\boldsymbol{e^{*}}=\frac{\boldsymbol{\delta^{*}}}{\sqrt{n}} and 𝒆^=𝜹^𝝀~𝟐n\boldsymbol{\hat{e}}=\frac{\boldsymbol{\hat{\delta}_{\tilde{\lambda}_{2}}}}{\sqrt{n}} in Theorem 1 of [38]. Therefore, by the equivalence of the model given in (37) and the optimisation problem in (6) with that of [38], we have

𝜷^𝝀𝟏𝜷2+1n(𝜹^𝝀~𝟐𝜹)23κ2max{λ1s,λ~2r},\|\boldsymbol{\hat{\beta}_{\lambda_{1}}-\beta^{*}}\|_{2}+\left\|\frac{1}{\sqrt{n}}(\boldsymbol{\hat{\delta}_{\tilde{\lambda}_{2}}-\delta^{*}})\right\|_{2}\leq 3\kappa^{-2}\textrm{max}\{\lambda_{1}\sqrt{s},\tilde{\lambda}_{2}\sqrt{r}\}, (60)

as long as

2𝑨𝜼nλ1,and2𝜼nλ~2.\dfrac{2\|\boldsymbol{A}^{\top}\boldsymbol{\eta}\|_{\infty}}{n}\leq\lambda_{1},\ \text{and}\ \dfrac{2\|\boldsymbol{\eta}\|_{\infty}}{\sqrt{n}}\leq\tilde{\lambda}_{2}. (61)

Therefore when (61) holds, then by using (59) (recall 𝒉𝜷=𝜷^𝝀𝟏𝜷\boldsymbol{h_{\beta}}=\boldsymbol{\hat{\beta}_{\lambda_{1}}-\beta^{*}}) and (60), we have

𝜷^𝝀𝟏𝜷14s+r(3κ2max{λ1s,λ~2r})12κ2(s+r)max{λ1,λ~2}12κ2(s+r)λ1.\displaystyle\|\boldsymbol{\hat{\beta}_{\lambda_{1}}-\beta^{*}}\|_{1}\leq 4\sqrt{s+r}\left(3\kappa^{-2}\textrm{max}\{\lambda_{1}\sqrt{s},\tilde{\lambda}_{2}\sqrt{r}\}\right)\leq 12\kappa^{-2}(s+r)\textrm{max}\{\lambda_{1},\tilde{\lambda}_{2}\}\leq 12\kappa^{-2}(s+r)\lambda_{1}. (62)

We will now bound the probability that 2𝑨𝜼nλ1\dfrac{2\|\boldsymbol{A}^{\top}\boldsymbol{\eta}\|_{\infty}}{n}\leq\lambda_{1} and 2𝜼nλ~2\dfrac{2\|\boldsymbol{\eta}\|_{\infty}}{\sqrt{n}}\leq\tilde{\lambda}_{2} using the fact that 𝜼\boldsymbol{\eta} is Gaussian. By using Lemma 7 in Appendix C which describes the tail bounds of the Gaussian vector, we have

P(2𝜼/n4σlog(n)/n)11n,\displaystyle P(2\|\boldsymbol{\eta}\|_{\infty}/\sqrt{n}\leq 4\sigma\sqrt{\log(n)/n})\geq 1-\frac{1}{n}, (63)
P(21n𝑨𝜼/n4σlog(p)/n)11p.\displaystyle P\left(2\left\|\frac{1}{\sqrt{n}}\boldsymbol{A}^{\top}\boldsymbol{\eta}\right\|_{\infty}/\sqrt{n}\leq 4\sigma\sqrt{\log(p)/n}\right)\geq 1-\frac{1}{p}. (64)

Using (63),(64) with Bonferroni’s inequality in (62), we have:

P(𝜷^𝝀𝟏𝜷148κ2(s+r)σlog(p)n)11n1p.\displaystyle P\left(\|\boldsymbol{\hat{\beta}_{\lambda_{1}}}-\boldsymbol{\beta^{*}}\|_{1}\leq 48\kappa^{-2}(s+r)\sigma\sqrt{\frac{{\log(p)}}{{n}}}\right)\geq 1-\frac{1}{n}-\frac{1}{p}. (65)

This completes the proof of (7).
Proof of (8): We now derive an upper bound on 𝜹^𝝀𝟐𝜹1\|\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}}\|_{1}. We approach this by showing that, given the optimal estimate 𝜷^λ1\boldsymbol{\hat{\beta}}_{\lambda_{1}}, we can obtain a unique estimate 𝜹^λ2\boldsymbol{\hat{\delta}}_{\lambda_{2}} using (67). We then derive the upper bound on 𝜹^𝝀𝟐𝜹1\|\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}}\|_{1} using the Lasso bounds given in [22]. Expanding the terms in the objective of (6), we obtain:

min𝜷,𝜹12n𝒚𝑨𝜷𝜹22+λ1𝜷1+λ2𝜹1=min𝜷{λ1𝜷1+i=1nminδi{12n((yi𝒂𝒊.𝜷)δi)2+λ2|δi|}}.\displaystyle{\min}_{\boldsymbol{\beta},\boldsymbol{\delta}}{\frac{1}{2n}}\left\|\boldsymbol{y}-\boldsymbol{A\beta}-\boldsymbol{\delta}\right\|^{2}_{2}+\lambda_{1}\|\boldsymbol{\beta}\|_{1}+\lambda_{2}\left\|\boldsymbol{\delta}\right\|_{1}={\min}_{\boldsymbol{\beta}}\left\{\lambda_{1}\|\boldsymbol{\beta}\|_{1}+\sum_{i=1}^{n}{\min}_{\delta_{i}}\left\{\frac{1}{2n}((y_{i}-\boldsymbol{a_{i.}\beta})-\delta_{i})^{2}+{\lambda_{2}}|\delta_{i}|\right\}\right\}. (66)

Given the optimal solutions 𝜷^𝝀𝟏\boldsymbol{\hat{\beta}_{\lambda_{1}}} and 𝜹^𝝀𝟐\boldsymbol{\hat{\delta}_{\lambda_{2}}} of (6), 𝜹^𝝀𝟐\boldsymbol{\hat{\delta}_{\lambda_{2}}} can also be viewed as

𝜹^𝝀𝟐=argmin𝜹12ni=1n{(yi𝒂𝒊.𝜷^𝝀𝟏δi)2}+λ2𝜹1\displaystyle\boldsymbol{\hat{\delta}_{\lambda_{2}}}=\textrm{argmin}_{\boldsymbol{\delta}}\frac{1}{2n}\sum_{i=1}^{n}\{(y_{i}-\boldsymbol{a_{i.}}\boldsymbol{\hat{\beta}_{\lambda_{1}}}-\delta_{i})^{2}\}+\lambda_{2}\|\boldsymbol{\delta}\|_{1} (67)
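Since the minimization in (67) separates across coordinates i[n]i\in[n], each coordinate admits the standard Lasso soft-thresholding closed form, stated here only to make the separability explicit (it is not needed in the argument below):

\hat{\delta}_{\lambda_{2},i}=\operatorname{sign}(z_{i})\,\max\left(|z_{i}|-n\lambda_{2},\,0\right),\qquad\text{where }z_{i}\triangleq y_{i}-\boldsymbol{a_{i.}}\boldsymbol{\hat{\beta}_{\lambda_{1}}}.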

Thus (67) can also be viewed as the Lasso estimator for the model 𝒛=𝑰𝒏𝜹+ϱ\boldsymbol{z}=\boldsymbol{I_{n}\delta^{*}}+\boldsymbol{\varrho}, where 𝒛𝒚𝑨𝜷^𝝀𝟏\boldsymbol{z}\triangleq\boldsymbol{y}-\boldsymbol{A}\boldsymbol{\hat{\beta}_{\lambda_{1}}} and ϱ𝑨(𝜷𝜷^𝝀𝟏)+𝜼\boldsymbol{\varrho}\triangleq\boldsymbol{A}(\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}_{\lambda_{1}})+\boldsymbol{\eta}} with 𝜹\boldsymbol{\delta^{*}} being rr-sparse. By using Theorem 11.1(b) of [22], we have that if λ22ϱn\lambda_{2}\geq 2\frac{\|\boldsymbol{\varrho}\|_{\infty}}{n},

𝜹^𝝀𝟐𝜹23rλ2γr,\left\|\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}}\right\|_{2}\leq\dfrac{3\sqrt{r}\lambda_{2}}{\gamma_{r}}, (68)

where γr\gamma_{r} is the Restricted eigenvalue constant of order rr which equals one for 𝑰𝒏\boldsymbol{I_{n}}. Now using the result in Lemma 11.1 of [22], when λ22ϱn\lambda_{2}\geq 2\frac{\|\boldsymbol{\varrho}\|_{\infty}}{n}, then

(𝜹^𝝀𝟐𝜹)c13(𝜹^𝝀𝟐𝜹)1.\left\|(\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}})_{\mathcal{R}^{c}}\right\|_{1}\leq 3\left\|(\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}})_{\mathcal{R}}\right\|_{1}.

Therefore by using (68) when λ22ϱn\lambda_{2}\geq 2\frac{\|\boldsymbol{\varrho}\|_{\infty}}{n}, we have

𝜹^𝝀𝟐𝜹1\displaystyle\|\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}}\|_{1} =\displaystyle= (𝜹^𝝀𝟐𝜹)c1+(𝜹^𝝀𝟐𝜹)14(𝜹^𝝀𝟐𝜹)14r(𝜹^𝝀𝟐𝜹)24r𝜹^𝝀𝟐𝜹2\displaystyle\left\|(\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}})_{\mathcal{R}^{c}}\right\|_{1}+\left\|(\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}})_{\mathcal{R}}\right\|_{1}\leq 4\left\|(\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}})_{\mathcal{R}}\right\|_{1}\leq 4\sqrt{r}\left\|(\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}})_{\mathcal{R}}\right\|_{2}\leq 4\sqrt{r}\|\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}}\|_{2} (69)
\displaystyle\leq 12rλ2.\displaystyle 12r\lambda_{2}.

Therefore we now show that λ2(=4σlognn)2ϱn\lambda_{2}\left(=4\sigma\frac{\sqrt{\log n}}{n}\right)\geq 2\frac{\|\boldsymbol{\varrho}\|_{\infty}}{n} (i.e., ϱ2σlogn\|\boldsymbol{\varrho}\|_{\infty}\leq 2\sigma\sqrt{\log n}) holds with high probability. Now, by Lemma 6 and the triangle inequality, we have

ϱ=𝑨(𝜷𝜷^𝝀𝟏)+𝜼|𝑨|𝜷𝜷^𝝀𝟏1+𝜼.\|\boldsymbol{\varrho}\|_{\infty}=\|\boldsymbol{A}(\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}_{\lambda_{1}}})+\boldsymbol{\eta}\|_{\infty}\leq|\boldsymbol{A}|_{\infty}\|\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}_{\lambda_{1}}}\|_{1}+\|\boldsymbol{\eta}\|_{\infty}.

By using Lemma 7 in Appendix C, we have the following:

P(𝜼σlog(n))1n.P(\|\boldsymbol{\eta}\|_{\infty}\geq\sigma\sqrt{\log(n)})\leq\frac{1}{n}. (70)

Since |𝑨|1|\boldsymbol{A}|_{\infty}\leq 1, by using (65), we have

P(ϱ48κ2(s+r)σlog(p)n+σlog(n))1(2n+1p).\displaystyle P\left(\|\boldsymbol{\varrho}\|_{\infty}\leq 48\kappa^{-2}(s+r)\sigma\sqrt{\frac{\log(p)}{n}}+\sigma\sqrt{\log(n)}\right)\geq 1-\left(\frac{2}{n}+\frac{1}{p}\right). (71)

Since nlogn(48κ2)2(s+r)2logpn\log n\geq(48\kappa^{-2})^{2}(s+r)^{2}\log p, we have 48κ2(s+r)σlog(p)n<σlog(n)48\kappa^{-2}(s+r)\sigma\sqrt{\frac{\log(p)}{n}}<\sigma\sqrt{\log(n)}. Thus

P(ϱ2σlogn)1(2n+1p).P(\|\boldsymbol{\varrho}\|_{\infty}\leq 2\sigma\sqrt{\log n})\geq 1-\left(\frac{2}{n}+\frac{1}{p}\right). (72)

We now put (72) in (69) to obtain:

P(𝜹^𝝀𝟐𝜹124σrlognn)1(2n+1p).\displaystyle P\left(\|\boldsymbol{\hat{\delta}_{\lambda_{2}}}-\boldsymbol{\delta^{*}}\|_{1}\leq\frac{24\sigma r\sqrt{\log n}}{n}\right)\geq 1-\left(\frac{2}{n}+\frac{1}{p}\right). (73)

This completes the proof. \blacksquare

Appendix B Proofs of Theorems and Lemmas on Debiased Lasso

B-A Proof of Theorem 2

Note that we have chosen 𝑾=𝑨\boldsymbol{W}=\boldsymbol{A}. Now, recalling the expression for 𝜷^𝑾\boldsymbol{\hat{\beta}_{W}} from (14) and the model given in (5), we have

𝜷^𝑾𝜷\displaystyle\boldsymbol{\hat{\beta}_{W}}-\boldsymbol{\beta^{*}} =\displaystyle= 1n𝑨𝜼+(𝑰𝒑1n𝑨𝑨)(𝜷^𝝀𝟏𝜷)+1n𝑨(𝜹𝜹^𝝀𝟐).\displaystyle\frac{1}{n}\boldsymbol{A}^{\top}\boldsymbol{{\eta}}+\left(\boldsymbol{I_{p}}-\frac{1}{n}\boldsymbol{A^{\top}}\boldsymbol{A}\right)(\boldsymbol{\hat{\beta}_{\lambda_{1}}}-\boldsymbol{\beta^{*}})+\frac{1}{n}\boldsymbol{A}^{\top}\left(\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\right). (74)

Results (78) and (79) of Lemma 2 show that the second and third terms on the RHS of (74) are negligible in probability as nn and pp increase. Therefore, in view of Lemma 2, we have

n(β^Wjβj)\displaystyle\sqrt{n}(\hat{\beta}_{Wj}-\beta^{*}_{j}) =\displaystyle= 1n𝒂.j𝜼+oP(1),\displaystyle\frac{1}{\sqrt{n}}\boldsymbol{a}_{.j}^{\top}\boldsymbol{{\eta}}+o_{P}(1), (75)

where 𝒂.j\boldsymbol{a}_{.j} denotes the jthj^{\text{th}} column of matrix 𝑨\boldsymbol{A}. Given 𝒂.𝒋\boldsymbol{a_{.j}}, by using the Gaussianity of 𝜼\boldsymbol{\eta}, the first term on the RHS of (75) is a Gaussian random variable with mean 0 and variance σ2𝒂.j𝒂.jn\sigma^{2}\frac{\boldsymbol{a}_{.j}^{\top}\boldsymbol{a}_{.j}}{n}. Since 𝒂.j𝒂.jn=1\frac{\boldsymbol{a}_{.j}^{\top}\boldsymbol{a}_{.j}}{n}=1, the first term on the RHS is N(0,σ2)N(0,\sigma^{2}). This completes the proof of result (1) of the theorem.

We now turn to result (2) of the theorem. By using a similar decomposition argument as in the case of 𝜷^𝑾\boldsymbol{\hat{\beta}_{W}} in (74) and using the expression of 𝜹^𝑾\boldsymbol{\hat{\delta}_{W}} in (15), we have

𝜹^𝑾𝜹=(𝑰𝒏1n𝑨𝑨)𝜼+(𝑰𝒏1n𝑨𝑨)𝑨(𝜷𝜷^𝝀𝟏)1n𝑨𝑨(𝜹𝜹^𝝀𝟐).\displaystyle\boldsymbol{\hat{\delta}_{W}}-\boldsymbol{\delta^{*}}=\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{A}\boldsymbol{A}^{\top}\right)\boldsymbol{{\eta}}+\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{A}\boldsymbol{A}^{\top}\right)\boldsymbol{A}(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})-\frac{1}{n}\boldsymbol{AA}^{\top}\big{(}\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\big{)}. (76)

We have 𝚺𝑨=(𝑰𝒏1n𝑨𝑨)(𝑰𝒏1n𝑨𝑨)\boldsymbol{\Sigma_{A}}=\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{AA}^{\top}\right)\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{AA}^{\top}\right)^{\top}. From (80) and (81) of Lemma 2, the second and third terms on the RHS of (76) are both oP(p1n/pn)o_{P}\left(\frac{p\sqrt{1-n/p}}{n}\right). Therefore, using Lemma 2, we have for any i[n]i\in[n]

(δ^Wiδi)ΣAii=(𝑰𝒏1n𝑨𝑨)i.𝜼ΣAii+oP(1ΣAiip1n/pn).\displaystyle\frac{\left({\hat{\delta}_{Wi}}-{\delta^{*}_{i}}\right)}{\sqrt{\Sigma_{A_{ii}}}}=\frac{\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{AA}^{\top}\right)^{\top}_{i.}\boldsymbol{{\eta}}}{\sqrt{\Sigma_{A_{ii}}}}+o_{P}\left(\frac{1}{\sqrt{\Sigma_{A_{ii}}}}\frac{p\sqrt{1-n/p}}{n}\right). (77)

As \boldsymbol{\eta} is Gaussian, the first term on the RHS of (77) is a Gaussian random variable with mean 0 and variance \sigma^{2}. In Lemma 3, we show that \frac{n^{2}}{p(p-n)}\Sigma_{A_{ii}} converges to 1 in probability if \boldsymbol{A} is a Rademacher matrix; consequently \frac{1}{\sqrt{\Sigma_{A_{ii}}}}\frac{p\sqrt{1-n/p}}{n} converges to 1 in probability, and the second term on the RHS of (77) is o_{P}(1). This completes the proof of result (2). \blacksquare

Lemma 2

Let \boldsymbol{\hat{\beta}_{\lambda_{1}}},\boldsymbol{\hat{\delta}_{\lambda_{2}}} be as in (6) and set \lambda_{1}\triangleq\frac{4\sigma\sqrt{\log p}}{\sqrt{n}}, \lambda_{2}\triangleq\frac{4\sigma\sqrt{\log n}}{n}. Given that \boldsymbol{A} is a Rademacher matrix, if n is o(p) and n is \omega[((s+r)\log p)^{2}], then as n,p\to\infty we have the following:

n(𝑰𝒑1n𝑨𝑨)(𝜷𝜷^𝝀𝟏)\displaystyle\left\|\sqrt{n}\left(\boldsymbol{I_{p}}-\frac{1}{n}\boldsymbol{A^{\top}}\boldsymbol{A}\right)(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty} =\displaystyle= oP(1)\displaystyle o_{P}(1) (78)
1n𝑨(𝜹𝜹^𝝀𝟐)\displaystyle\left\|\frac{1}{\sqrt{n}}\boldsymbol{A}^{\top}\left(\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\right)\right\|_{\infty} =\displaystyle= oP(1)\displaystyle o_{P}(1) (79)
np1n/p(𝑰𝒏1n𝑨𝑨)𝑨(𝜷𝜷^𝝀𝟏)\displaystyle\left\|\frac{n}{p\sqrt{1-n/p}}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{A}\boldsymbol{A}^{\top}\right)\boldsymbol{A}(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty} =\displaystyle= oP(1)\displaystyle o_{P}(1) (80)
np1n/p1n𝑨𝑨(𝜹𝜹^𝝀𝟐)\displaystyle\left\|\frac{n}{p\sqrt{1-n/p}}\frac{1}{n}\boldsymbol{AA}^{\top}\big{(}\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\big{)}\right\|_{\infty} =\displaystyle= oP(1)\displaystyle o_{P}(1) (81)

\blacksquare

Proof of Lemma 2:
When nn is ω[((s+r)logp)2]\omega[((s+r)\log p)^{2}], the assumptions of Lemma 1 are satisfied. Hence, the Rademacher matrix 𝑨\boldsymbol{A} satisfies the assumptions of Theorem 1 with probability that goes to 11 as n,pn,p\rightarrow\infty. Therefore, to prove the results, it suffices to condition on the event that the conclusion of Theorem 1 holds.

Proof of (78): Using result (4) of Lemma 6, we have:

n(𝑰𝒑1n𝑨𝑨)(𝜷𝜷^𝝀𝟏)n|𝑨𝑨/n𝑰𝒑|𝜷𝜷^𝝀𝟏1.\displaystyle\left\|\sqrt{n}\left(\boldsymbol{I_{p}}-\frac{1}{n}\boldsymbol{A^{\top}}\boldsymbol{A}\right)(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty}\leq\sqrt{n}|\boldsymbol{A^{\top}A}/n-\boldsymbol{I_{p}}|_{\infty}\|\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}_{\lambda_{1}}}\|_{1}. (82)

From result (1) of Lemma 5, result (1) of Theorem 1, and result (5) of Lemma 6, we have

n(𝑰𝒑1n𝑨𝑨)(𝜷𝜷^𝝀𝟏)=OP((s+r)logpn).\left\|\sqrt{n}\left(\boldsymbol{I_{p}}-\frac{1}{n}\boldsymbol{A^{\top}}\boldsymbol{A}\right)(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty}=O_{P}\left((s+r)\frac{\log p}{\sqrt{n}}\right). (83)

Under the assumption nn is ω[((s+r)logp)2]\omega[((s+r)\log p)^{2}], we have:

n(𝑰𝒑1n𝑨𝑨)(𝜷𝜷^𝝀𝟏)=oP(1).\left\|\sqrt{n}\left(\boldsymbol{I_{p}}-\frac{1}{n}\boldsymbol{A^{\top}}\boldsymbol{A}\right)(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty}=o_{P}(1). (84)

Proof of (79): Again by using result (4) of Lemma 6, we have

1n𝑨(𝜹𝜹^𝝀𝟐)|1n𝑨|𝜹𝜹^𝝀𝟐1.\left\|\frac{1}{\sqrt{n}}\boldsymbol{A^{\top}}(\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}})\right\|_{\infty}\leq\left|\frac{1}{\sqrt{n}}\boldsymbol{A^{\top}}\right|_{\infty}\|\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\|_{1}.

Since 𝑨\boldsymbol{A} is a Rademacher matrix, we have, |1n𝑨|=1n\left|\frac{1}{\sqrt{n}}\boldsymbol{A^{\top}}\right|_{\infty}=\frac{1}{\sqrt{n}}. From result (2) of Theorem 1 and result (5) of Lemma 6, we have

1n𝑨(𝜹𝜹^𝝀𝟐)=OP(rlognn3/2).\left\|\frac{1}{\sqrt{n}}\boldsymbol{A^{\top}}(\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}})\right\|_{\infty}=O_{P}\left(\frac{r\sqrt{\log n}}{n^{3/2}}\right). (85)

As n,pn,p\to\infty, we have

1n𝑨(𝜹𝜹^𝝀𝟐)=oP(1).\left\|\frac{1}{\sqrt{n}}\boldsymbol{A^{\top}}(\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}})\right\|_{\infty}=o_{P}(1). (86)

Proof of (80): Again using result (4) of Lemma 6, we have,

np1n/p(𝑰𝒏1n𝑨𝑨)𝑨(𝜷𝜷^𝝀𝟏)n1n/p×|1p(𝑰𝒏1n𝑨𝑨)𝑨|𝜷𝜷^𝝀𝟏1.\displaystyle\left\|\frac{n}{p\sqrt{1-n/p}}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{A}\boldsymbol{A}^{\top}\right)\boldsymbol{A}(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty}\leq\frac{n}{\sqrt{1-n/p}}\times\left|\frac{1}{p}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{A}\boldsymbol{A}^{\top}\right)\boldsymbol{A}\right|_{\infty}\|\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}_{\lambda_{1}}}\|_{1}. (87)

By using result (5) of Lemma 6, result (1) of Theorem 1 and result (2) of Lemma 5, we have

np1n/p(𝑰𝒏1n𝑨𝑨)𝑨(𝜷𝜷^𝝀𝟏)\displaystyle\left\|\frac{n}{p\sqrt{1-n/p}}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{A}\boldsymbol{A}^{\top}\right)\boldsymbol{A}(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty} \displaystyle\leq OP((s+r)n1n/p(log(pn)pn+1n)logpn)\displaystyle O_{P}\left((s+r)\frac{n}{\sqrt{1-n/p}}\left(\sqrt{\frac{\log(pn)}{pn}}+\frac{1}{n}\right)\sqrt{\frac{\log p}{n}}\right) (88)
=\displaystyle= OP((s+r)n1n/plog(pn)pnlogpn+(s+r)n1n/p1nlogpn)\displaystyle O_{P}\left(\frac{(s+r)n}{\sqrt{1-n/p}}\sqrt{\frac{\log(pn)}{pn}}\sqrt{\frac{\log p}{n}}+\frac{(s+r)n}{\sqrt{1-n/p}}\frac{1}{n}\sqrt{\frac{\log p}{n}}\right)
=\displaystyle= OP((s+r)1n/plog(np)log(p)p+(s+r)1n/plogpn).\displaystyle O_{P}\left(\frac{(s+r)}{\sqrt{1-n/p}}\sqrt{\frac{\log(np)\log(p)}{p}}+\frac{(s+r)}{\sqrt{1-n/p}}\sqrt{\frac{\log p}{n}}\right).

Since n is o(p) and n is \omega[((s+r)\log p)^{2}], the bound in (88) vanishes as n,p\to\infty, so that

np1n/p(𝑰𝒏1n𝑨𝑨)𝑨(𝜷𝜷^𝝀𝟏)=oP(1).\left\|\frac{n}{p\sqrt{1-n/p}}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{A}\boldsymbol{A}^{\top}\right)\boldsymbol{A}(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty}=o_{P}(1). (89)

Proof of (81): Using result (4) of Lemma 6, we have,

np1n/p1n𝑨𝑨(𝜹𝜹^𝝀𝟐)11n/p×|1p𝑨𝑨|(𝜹𝜹^𝝀𝟐)1.\displaystyle\left\|\frac{n}{p\sqrt{1-n/p}}\frac{1}{n}\boldsymbol{AA}^{\top}\big{(}\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\big{)}\right\|_{\infty}\leq\frac{1}{\sqrt{1-n/p}}\times\left|\frac{1}{p}\boldsymbol{AA}^{\top}\right|_{\infty}\left\|\big{(}\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\big{)}\right\|_{1}. (90)

Since \boldsymbol{A} is a Rademacher matrix, the entries of \frac{1}{p}\boldsymbol{AA^{\top}} lie in [-1,1], and its diagonal entries equal 1. Therefore, \left|\frac{1}{p}\boldsymbol{AA^{\top}}\right|_{\infty}=1. By using part (5) of Lemma 6 and result (2) of Theorem 1, we have

np1n/p1n𝑨𝑨(𝜹𝜹^𝝀𝟐)=OP(r1n/plognn).\displaystyle\left\|\frac{n}{p\sqrt{1-n/p}}\frac{1}{n}\boldsymbol{AA}^{\top}\big{(}\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\big{)}\right\|_{\infty}=O_{P}\left(\frac{r}{\sqrt{1-n/p}}\frac{\sqrt{\log n}}{\sqrt{n}}\right). (91)

Since nn is o(p)o(p), we have

np1n/p1n𝑨𝑨(𝜹𝜹^𝝀𝟐)=oP(1).\left\|\frac{n}{p\sqrt{1-n/p}}\frac{1}{n}\boldsymbol{AA}^{\top}\big{(}\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\big{)}\right\|_{\infty}=o_{P}(1). (92)

This completes the proof. \blacksquare

Lemma 3

Let 𝐀\boldsymbol{A} be a Rademacher matrix and 𝚺𝐀\boldsymbol{\Sigma_{A}} be as defined in (18). If nlognn\log n is o(p)o(p), we have as n,pn,p\to\infty for any i[n]i\in[n],

n2p2(1n/p)ΣAii𝑃1.\frac{n^{2}}{p^{2}(1-n/p)}\Sigma_{A_{ii}}\overset{P}{\to}1. (93)

Proof of Lemma 3: Recall that from (18), 𝚺𝑨=(𝑰𝒏1n𝑨𝑨)(𝑰𝒏1n𝑨𝑨)\boldsymbol{\Sigma_{A}}=\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{AA}^{\top}\right)\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{AA}^{\top}\right)^{\top}. Note that for i[n]i\in[n], we have

n2p2(1n/p)ΣAii\displaystyle\frac{n^{2}}{p^{2}(1-n/p)}\Sigma_{A_{ii}} =\displaystyle= n2p2(1np)(12n𝒂𝒊.𝒂𝒊.+1n2𝒂𝒊.𝑨𝑨𝒂𝒊.)\displaystyle\frac{n^{2}}{p^{2}(1-\frac{n}{p})}\left(1-\frac{2}{n}\boldsymbol{a_{i.}}\boldsymbol{a_{i.}}^{\top}+\frac{1}{n^{2}}\boldsymbol{a_{i.}A^{\top}A}\boldsymbol{a_{i.}}^{\top}\right) (94)
=\displaystyle= (np(1np)(1𝒂𝒊.𝒂𝒊.n))2+k=1nki(np(1np)(𝒂𝒊.𝒂𝒌.n))2\displaystyle\left(\frac{n}{p\sqrt{({1-\frac{n}{p}})}}\left(1-\frac{\boldsymbol{a_{i.}a_{i.}^{\top}}}{n}\right)\right)^{2}+\underset{k\neq i}{\sum_{k=1}^{n}}\left(\frac{n}{p\sqrt{({1-\frac{n}{p}})}}\left(\frac{\boldsymbol{a_{i.}a_{k.}^{\top}}}{n}\right)\right)^{2}
=\displaystyle= 1np+k=1nki(np(1np)(𝒂𝒊.𝒂𝒌.n))2.\displaystyle{1-\frac{n}{p}}+\underset{k\neq i}{\sum_{k=1}^{n}}\left(\frac{n}{p\sqrt{({1-\frac{n}{p}})}}\left(\frac{\boldsymbol{a_{i.}a_{k.}^{\top}}}{n}\right)\right)^{2}.

In (94), since the second term is nonnegative, we have

n2p2(1n/p)ΣAii1np.\frac{n^{2}}{p^{2}(1-n/p)}\Sigma_{A_{ii}}\geq 1-\frac{n}{p}. (95)

We have from Result (3) of Lemma 5, for k[n],kik\in[n],k\neq i,

P(11n/p|𝒂𝒊.𝒂𝒌./p|21n/p2log(n)p)12n2.P\left(\frac{1}{\sqrt{1-n/p}}\left|\boldsymbol{a_{i.}a_{k.}^{\top}}/p\right|\leq\frac{2}{\sqrt{1-n/p}}\sqrt{\frac{2\log(n)}{p}}\right)\geq 1-\frac{2}{n^{2}}. (96)

By using (96), we have

P(k=1nki(np(1np)(𝒂𝒊.𝒂𝒌.n))24(n1)1n/p2log(n)p)\displaystyle P\left(\underset{k\neq i}{\sum_{k=1}^{n}}\left(\frac{n}{p\sqrt{({1-\frac{n}{p}})}}\left(\frac{\boldsymbol{a_{i.}a_{k.}^{\top}}}{n}\right)\right)^{2}\leq\frac{4(n-1)}{{1-n/p}}{\frac{2\log(n)}{p}}\right) \displaystyle\geq P(k=1nki{(np(1np)(𝒂𝒊.𝒂𝒌.n))241n/p2log(n)p})\displaystyle P\left(\underset{k\neq i}{\cap_{k=1}^{n}}\left\{\left(\frac{n}{p\sqrt{({1-\frac{n}{p}})}}\left(\frac{\boldsymbol{a_{i.}a_{k.}^{\top}}}{n}\right)\right)^{2}\leq\frac{4}{{1-n/p}}{\frac{2\log(n)}{p}}\right\}\right) (97)
=\displaystyle= P(k=1nki{np(1np)|𝒂𝒊.𝒂𝒌.n|21n/p2log(n)p})\displaystyle P\left(\underset{k\neq i}{\cap_{k=1}^{n}}\left\{\frac{n}{p\sqrt{({1-\frac{n}{p}})}}\left|\frac{\boldsymbol{a_{i.}a_{k.}^{\top}}}{n}\right|\leq\frac{2}{\sqrt{1-n/p}}{\sqrt{\frac{2\log(n)}{p}}}\right\}\right)
\displaystyle\geq 12(n1)n2.\displaystyle 1-\frac{2(n-1)}{n^{2}}.

The last inequality follows from Bonferroni's inequality, which states that P(\cap_{i=1}^{n}U_{i})\geq 1-\sum_{i=1}^{n}P(U_{i}^{c}) for any events U_{1},U_{2},\ldots,U_{n}. Therefore, by using (94) and (97), we have

P(n2p2(1n/p)ΣAii1np+4(n1)1n/p2log(n)p)12nP\left(\frac{n^{2}}{p^{2}(1-n/p)}\Sigma_{A_{ii}}\leq 1-\frac{n}{p}+\frac{4(n-1)}{{1-n/p}}{\frac{2\log(n)}{p}}\right)\geq 1-\frac{2}{n} (98)

By using (95) and (98), we have

P(1npn2p2(1n/p)ΣAii1np+4(n1)1n/p2log(n)p)12nP\left(1-\frac{n}{p}\leq\frac{n^{2}}{p^{2}(1-n/p)}\Sigma_{A_{ii}}\leq 1-\frac{n}{p}+\frac{4(n-1)}{{1-n/p}}{\frac{2\log(n)}{p}}\right)\geq 1-\frac{2}{n} (99)

Since nlognn\log n is o(p)o(p), as n,pn,p\to\infty, 1np11-\frac{n}{p}\to 1 and 4(n1)1n/p2log(n)p0\frac{4(n-1)}{{1-n/p}}{\frac{2\log(n)}{p}}\to 0. This completes the proof. \blacksquare
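The convergence claimed in Lemma 3 is easy to check empirically. The following minimal sketch (Python/NumPy) uses illustrative sizes with n\log n much smaller than p and reports the range of the scaled diagonal entries of \boldsymbol{\Sigma_{A}}, which should be close to 1.

```python
import numpy as np

# Sanity check of Lemma 3: for a Rademacher A with n*log(n) much smaller than p,
# the scaled diagonal entries n^2 / (p^2 (1 - n/p)) * Sigma_A[i, i] should be close to 1.
# The sizes below are illustrative only.
rng = np.random.default_rng(1)
n, p = 50, 20000
A = rng.choice([-1.0, 1.0], size=(n, p))
M = np.eye(n) - A @ A.T / n
Sigma_A = M @ M.T
scaled = n**2 / (p**2 * (1 - n / p)) * np.diag(Sigma_A)
print(scaled.min(), scaled.max())   # both values are close to 1 when n log n << p
```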


Now we proceed to the results involving debiasing using the optimal weights matrix \boldsymbol{W} obtained from Alg. 2. The proofs of these results largely follow the same approach as in the case \boldsymbol{W}=\boldsymbol{A} (i.e., Theorem 2). However, there is one major point of departure: the weights matrix \boldsymbol{W} designed by Alg. 2 has somewhat different properties (given in Lemma 4) from those of \boldsymbol{W}=\boldsymbol{A} (given in Lemma 5).

B-B Proof of Theorem 3

Proof of (23): Using Result (4) of Lemma 6, we have

\left\|\sqrt{n}\left(\boldsymbol{I_{p}}-\frac{1}{n}\boldsymbol{W^{\top}}\boldsymbol{A}\right)(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty}\leq\sqrt{n}\left|\boldsymbol{W^{\top}A}/n-\boldsymbol{I_{p}}\right|_{\infty}\|\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}_{\lambda_{1}}}\|_{1}. (100)

Using Result (2) of Lemma 4, Result (1) of Theorem 1 and Result (5) of Lemma 6, we have

n(𝑰𝒑1n𝑾𝑨)(𝜷𝜷^𝝀𝟏)=OP((s+r)logpn).\left\|\sqrt{n}\left(\boldsymbol{I_{p}}-\frac{1}{n}\boldsymbol{W^{\top}}\boldsymbol{A}\right)(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty}=O_{P}\left((s+r)\frac{\log p}{\sqrt{n}}\right). (101)

If nn is ω[((s+r)logp)2]\omega[((s+r)\log p)^{2}], we have:

n(𝑰𝒑1n𝑾𝑨)(𝜷𝜷^𝝀𝟏)=oP(1).\left\|\sqrt{n}\left(\boldsymbol{I_{p}}-\frac{1}{n}\boldsymbol{W^{\top}}\boldsymbol{A}\right)(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty}=o_{P}(1). (102)

Proof of (24): Using Result (4) of Lemma 6, we have

1n𝑾(𝜹𝜹^𝝀𝟐)|1n𝑾|𝜹𝜹^𝝀𝟐1.\left\|\frac{1}{\sqrt{n}}\boldsymbol{W^{\top}}(\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}})\right\|_{\infty}\leq\left|\frac{1}{\sqrt{n}}\boldsymbol{W^{\top}}\right|_{\infty}\|\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\|_{1}.

Using Result (3) of Lemma 4, Result (2) of Theorem 1 and Result (5) of Lemma 6, we have

1n𝑾(𝜹𝜹^𝝀𝟐)=OP(rlognn3/2)=oP(1).\left\|\frac{1}{\sqrt{n}}\boldsymbol{W^{\top}}(\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}})\right\|_{\infty}=O_{P}\left(\frac{r\sqrt{\log n}}{n^{3/2}}\right)=o_{P}(1). (103)

Proof of (25): Using Result (4) of Lemma 6, we have

np1n/p(𝑰𝒏1n𝑾𝑨)𝑨(𝜷𝜷^𝝀𝟏)n1n/p×|1p(𝑰𝒏1n𝑾𝑨)𝑨|𝜷𝜷^𝝀𝟏1.\displaystyle\left\|\frac{n}{p\sqrt{1-n/p}}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{A}(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty}\leq\frac{n}{\sqrt{1-n/p}}\times\left|\frac{1}{p}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{A}\right|_{\infty}\|\boldsymbol{\beta^{*}}-\boldsymbol{\hat{\beta}_{\lambda_{1}}}\|_{1}. (104)

Using Result (4) of Lemma 4, Result (1) of Theorem 1 and Result (5) of Lemma 6, we have

np1n/p(𝑰𝒏1n𝑾𝑨)𝑨(𝜷𝜷^𝝀𝟏)\displaystyle\left\|\frac{n}{p\sqrt{1-n/p}}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{A}(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty} \displaystyle\leq OP((s+r)n1n/p(log(pn)pn+1n)logpn)\displaystyle O_{P}\left(\frac{(s+r)n}{\sqrt{1-n/p}}\left(\sqrt{\frac{\log(pn)}{pn}}+\frac{1}{n}\right)\sqrt{\frac{\log p}{n}}\right) (105)
=\displaystyle= OP((s+r)1n/plog(np)log(p)p+(s+r)1n/plogpn).\displaystyle O_{P}\left(\frac{(s+r)}{\sqrt{1-n/p}}\sqrt{\frac{\log(np)\log(p)}{p}}+\frac{(s+r)}{\sqrt{1-n/p}}\sqrt{\frac{\log p}{n}}\right).

Since nn is o(p)o(p) and nn is ω[((s+r)logp)2]\omega[((s+r)\log p)^{2}], we have

np1n/p(𝑰𝒏1n𝑾𝑨)𝑨(𝜷𝜷^𝝀𝟏)=oP(1).\left\|\frac{n}{p\sqrt{1-n/p}}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{A}(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})\right\|_{\infty}=o_{P}(1). (106)

Proof of (26): Using Result (4) of Lemma 6, we have

np1n/p1n𝑾𝑨(𝜹𝜹^𝝀𝟐)11n/p×|1p𝑾𝑨|(𝜹𝜹^𝝀𝟐)1\displaystyle\left\|\frac{n}{p\sqrt{1-n/p}}\frac{1}{n}\boldsymbol{WA}^{\top}\big{(}\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\big{)}\right\|_{\infty}\leq\frac{1}{\sqrt{1-n/p}}\times\left|\frac{1}{p}\boldsymbol{WA}^{\top}\right|_{\infty}\left\|\big{(}\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\big{)}\right\|_{1} (107)

Using Result (5) of Lemma 4, Result (2) of Theorem 1 and Result (5) of Lemma 6, we have

np1n/p1n𝑾𝑨(𝜹𝜹^𝝀𝟐)\displaystyle\left\|\frac{n}{p\sqrt{1-n/p}}\frac{1}{n}\boldsymbol{WA}^{\top}\big{(}\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\big{)}\right\|_{\infty} =\displaystyle= OP(r1n/p(nlog(np)p+1)lognn3/2)\displaystyle O_{P}\left(\frac{r}{\sqrt{1-n/p}}\left(\sqrt{\frac{n\log(np)}{p}}+1\right)\frac{\sqrt{\log n}}{{n^{3/2}}}\right) (108)
=\displaystyle= OP(r1n/plog(np)plognn+r1n/plognn3/2).\displaystyle O_{P}\left(\frac{r}{\sqrt{1-n/p}}\sqrt{\frac{\log(np)}{p}}\sqrt{\frac{{\log n}}{{n}}}+\frac{r}{\sqrt{1-n/p}}\frac{\sqrt{\log n}}{{n^{3/2}}}\right).

Since nn is o(p)o(p), we have

np1n/p1n𝑾𝑨(𝜹𝜹^𝝀𝟐)=oP(1)\left\|\frac{n}{p\sqrt{1-n/p}}\frac{1}{n}\boldsymbol{WA}^{\top}\big{(}\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\big{)}\right\|_{\infty}=o_{P}(1) (109)

This completes the proof. \blacksquare

B-C Proof of Theorem 4

Result (1): Recall that \boldsymbol{W} is the output of Alg. 2, \boldsymbol{\Sigma_{\beta}}=\frac{\sigma^{2}}{n}\boldsymbol{W^{\top}W}, \boldsymbol{\Sigma_{\delta}}=\sigma^{2}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{WA}^{\top}\right)\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{WA}^{\top}\right)^{\top} and \mu_{1}=2\sqrt{\frac{2\log p}{n}}. We derive the lower bound on \Sigma_{\beta_{jj}} for all j\in[p] following the same idea as in the proof of Lemma 12 of [28]. Note that \Sigma_{\beta_{jj}}=\frac{\sigma^{2}}{n}\boldsymbol{w_{.j}^{\top}w_{.j}}. For all j\in[p], from (134) of result (2) of Lemma 4, any feasible \boldsymbol{W} satisfies the following with probability at least 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right):

1-\frac{1}{n}\boldsymbol{a_{.j}^{\top}w_{.j}}\leq\mu_{1}\ \implies\ 1-\mu_{1}\leq\frac{1}{n}\boldsymbol{a_{.j}^{\top}w_{.j}}.

For any feasible 𝑾\boldsymbol{W} of Alg.2, we have for any c>0c>0,

\frac{1}{n}\boldsymbol{w_{.j}^{\top}w_{.j}} \geq \frac{1}{n}\boldsymbol{w_{.j}^{\top}w_{.j}}+c(1-\mu_{1})-c\left(\frac{1}{n}\boldsymbol{a_{.j}^{\top}w_{.j}}\right)\geq\underset{\boldsymbol{w_{.j}}\in\mathbb{R}^{n}}{\min}\left\{\frac{1}{n}\boldsymbol{w_{.j}^{\top}w_{.j}}+c(1-\mu_{1})-c\left(\frac{1}{n}\boldsymbol{a_{.j}^{\top}w_{.j}}\right)\right\}
=\underset{\boldsymbol{w_{.j}}\in\mathbb{R}^{n}}{\min}\left\{\frac{1}{n}\left(\boldsymbol{w_{.j}}-c\boldsymbol{a_{.j}}/2\right)^{\top}\left(\boldsymbol{w_{.j}}-c\boldsymbol{a_{.j}}/2\right)\right\}+c(1-\mu_{1})-\frac{c^{2}}{4}\frac{\boldsymbol{a_{.j}^{\top}a_{.j}}}{n}\geq c(1-\mu_{1})-\frac{c^{2}}{4}\frac{\boldsymbol{a_{.j}}^{\top}\boldsymbol{a_{.j}}}{n}
=c(1-\mu_{1})-\dfrac{c^{2}}{4}.

We obtain the last inequality by dropping the nonnegative squared term, whose minimum value 0 is attained at \boldsymbol{w_{.j}}=c\boldsymbol{a_{.j}}/2. The rightmost equality is because \boldsymbol{a_{.j}}^{\top}\boldsymbol{a_{.j}}=n. The lower bound on the RHS is maximized at c=2(1-\mu_{1}), since \frac{d}{dc}\left[c(1-\mu_{1})-\frac{c^{2}}{4}\right]=(1-\mu_{1})-\frac{c}{2} vanishes there. Plugging in this value of c, we obtain the following with probability at least 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right):

1n𝒘.𝒋𝒘.𝒋(1μ1)2.\frac{1}{n}\boldsymbol{w_{.j}^{\top}w_{.j}}\geq(1-\mu_{1})^{2}.

Hence, from the above equation and (129), we obtain the lower bound on Σβjj\Sigma_{\beta_{jj}} for any j[p]j\in[p] as follows:

P(Σβjjσ2(1μ1)2)1(2p2+2n2+12np).P\left(\Sigma_{\beta_{jj}}\geq\sigma^{2}(1-\mu_{1})^{2}\right)\geq 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right). (110)

Furthermore, from Result (1) of Lemma 4, we have

P(𝒘.𝒋𝒘.𝒋/n1j[p])=1.P\left(\boldsymbol{w_{.j}^{\top}w_{.j}}/n\leq 1\ \forall j\in[p]\right)=1. (111)

We use (111) to get, for any j[p]j\in[p], P(𝒘.𝒋𝒘.𝒋/n1)=1P(\boldsymbol{w_{.j}^{\top}w_{.j}}/n\leq 1)=1. As Σβjj=σ2𝒘.𝒋𝒘.𝒋n\Sigma_{\beta_{jj}}=\sigma^{2}\frac{\boldsymbol{w_{.j}^{\top}w_{.j}}}{n}, we have for any j[p]j\in[p]:

P(Σβjjσ2)=1.P\left(\Sigma_{\beta_{jj}}\leq\sigma^{2}\right)=1. (112)

Using (112) with (110), we obtain for any j[p]j\in[p],

P(σ2(1μ1)2Σβjjσ2)1(2p2+2n2+12np).P\left(\sigma^{2}(1-\mu_{1})^{2}\leq\Sigma_{\beta_{jj}}\leq\sigma^{2}\right)\geq 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right).

Now under the assumption nn is ω[((s+r)logp)2]\omega[((s+r)\log p)^{2}], μ10\mu_{1}\to 0. Hence, we have, Σβjj𝑃σ2\Sigma_{\beta_{jj}}\overset{P}{\to}\sigma^{2}. This completes the proof of Result (1).
Result (2): Recall that \mu_{3}=\frac{2}{\sqrt{1-n/p}}\sqrt{\frac{2\log(n)}{p}}. In order to obtain upper and lower bounds on \Sigma_{\delta_{ii}} for any i\in[n], we use Result (6) of Lemma 4. We have,

P(np1np|(𝑾𝑨npn𝑰𝒏)|μ3)1(2p2+2n2+12np).P\left(\frac{n}{p\sqrt{1-\frac{n}{p}}}\left|\left(\frac{\boldsymbol{WA^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right)\right|_{\infty}\leq\mu_{3}\right)\geq 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right). (113)

We have for i[n]i\in[n],

n2p2(1np)Σδii\displaystyle\frac{n^{2}}{p^{2}(1-\frac{n}{p})}\Sigma_{\delta_{ii}} =\displaystyle= σ2n2p2(1np)(12n𝒘𝒊.𝒂𝒊.+1n2𝒘𝒊.𝑨𝑨𝒘𝒊.)\displaystyle\sigma^{2}\frac{n^{2}}{p^{2}(1-\frac{n}{p})}\left(1-\frac{2}{n}\boldsymbol{w_{i.}}\boldsymbol{a_{i.}}^{\top}+\frac{1}{n^{2}}\boldsymbol{w_{i.}A^{\top}A}\boldsymbol{w_{i.}}^{\top}\right) (114)
=\displaystyle= σ2(np(1np)(1𝒘𝒊.𝒂𝒊.n))2+σ2k=1nki(np(1np)(𝒘𝒊.𝒂𝒌.n))2.\displaystyle\sigma^{2}\left(\frac{n}{p\sqrt{({1-\frac{n}{p}})}}\left(1-\frac{\boldsymbol{w_{i.}a_{i.}^{\top}}}{n}\right)\right)^{2}+\sigma^{2}\underset{k\neq i}{\sum_{k=1}^{n}}\left(\frac{n}{p\sqrt{({1-\frac{n}{p}})}}\left(\frac{\boldsymbol{w_{i.}a_{k.}^{\top}}}{n}\right)\right)^{2}.

Let 𝑽=(𝑾𝑨npn𝑰𝒏)\boldsymbol{V}=\left(\frac{\boldsymbol{WA^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right). Note that, for i[n]i\in[n],

vii=𝒘𝒊.𝒂𝒊.npn=(𝒘𝒊.𝒂𝒊.n1)+(1pn).v_{ii}=\frac{\boldsymbol{w_{i.}a_{i.}}^{\top}}{n}-\frac{p}{n}=\left(\frac{\boldsymbol{w_{i.}a_{i.}}^{\top}}{n}-1\right)+\left(1-\frac{p}{n}\right). (115)

Hence from (115), we have

np1n/pvii\displaystyle\frac{n}{p\sqrt{1-n/p}}v_{ii} =\displaystyle= np1n/p(𝒘𝒊.𝒂𝒊.n1)+11n/p(np1)\displaystyle\frac{n}{p\sqrt{1-n/p}}\left(\frac{\boldsymbol{w_{i.}a_{i.}}^{\top}}{n}-1\right)+\frac{1}{\sqrt{1-n/p}}\left(\frac{n}{p}-1\right) (116)
=\displaystyle= np1n/p(𝒘𝒊.𝒂𝒊.n1)11n/p(1np)\displaystyle\frac{n}{p\sqrt{1-n/p}}\left(\frac{\boldsymbol{w_{i.}a_{i.}}^{\top}}{n}-1\right)-\frac{1}{\sqrt{1-n/p}}\left(1-\frac{n}{p}\right)
=\displaystyle= np1n/p(𝒘𝒊.𝒂𝒊.n1)1np.\displaystyle\frac{n}{p\sqrt{1-n/p}}\left(\frac{\boldsymbol{w_{i.}a_{i.}}^{\top}}{n}-1\right)-\sqrt{1-\frac{n}{p}}.

We have, from (113),

P(np1n/pviiμ3)1(2p2+2n2+12np).P\left(\frac{n}{p\sqrt{1-n/p}}v_{ii}\geq-\mu_{3}\right)\geq 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right).

Therefore using (116), we have

P(np1n/p(𝒘𝒊.𝒂𝒊.n1)1npμ3)\displaystyle P\left(\frac{n}{p\sqrt{1-n/p}}\left(\frac{\boldsymbol{w_{i.}a_{i.}}^{\top}}{n}-1\right)-\sqrt{1-\frac{n}{p}}\geq-\mu_{3}\right) \displaystyle\geq 1(2p2+2n2+12np)\displaystyle 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right)
P(np1n/p(𝒘𝒊.𝒂𝒊.n1)1npμ3)\displaystyle\implies P\left(\frac{n}{p\sqrt{1-n/p}}\left(\frac{\boldsymbol{w_{i.}a_{i.}}^{\top}}{n}-1\right)\geq\sqrt{1-\frac{n}{p}}-\mu_{3}\right) \displaystyle\geq 1(2p2+2n2+12np).\displaystyle 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right). (117)

Using (117) in (114) yields the following inequality with probability at least 1(2p2+2n2+12np)1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right),

n2p2(1np)Σδii\displaystyle\frac{n^{2}}{p^{2}(1-\frac{n}{p})}\Sigma_{\delta_{ii}} =\displaystyle= σ2(np(1np)(𝒘𝒊.𝒂𝒊.n1))2+σ2k=1,kin(np1n/p𝒘𝒊.𝒂𝒌.n)2\displaystyle\sigma^{2}\left(\frac{n}{p\sqrt{({1-\frac{n}{p}})}}\left(\frac{\boldsymbol{w_{i.}a_{i.}^{\top}}}{n}-1\right)\right)^{2}+\sigma^{2}\sum_{k=1,k\neq i}^{n}\left(\frac{n}{p\sqrt{1-n/p}}\frac{\boldsymbol{w_{i.}a_{k.}^{\top}}}{n}\right)^{2}
\displaystyle\geq σ2(np(1np)(𝒘𝒊.𝒂𝒊.n1))2σ2(1npμ3)2.\displaystyle\sigma^{2}\left(\frac{n}{p\sqrt{({1-\frac{n}{p}})}}\left(\frac{\boldsymbol{w_{i.}a_{i.}^{\top}}}{n}-1\right)\right)^{2}\geq\sigma^{2}\left(\sqrt{1-\frac{n}{p}}-\mu_{3}\right)^{2}.

Therefore the lower bound on Σδii\Sigma_{\delta_{ii}} is as follows:

P(n2p2(1np)Σδiiσ2(1npμ3)2)1(2p2+2n2+12np).P\left(\frac{n^{2}}{p^{2}(1-\frac{n}{p})}\Sigma_{\delta_{ii}}\geq\sigma^{2}\left(\sqrt{1-\frac{n}{p}}-\mu_{3}\right)^{2}\right)\geq 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right). (118)

We now derive an upper bound on \Sigma_{\delta_{ii}}. By the same argument as before, we have from (113)

P(np1n/pviiμ3)\displaystyle P\left(\frac{n}{p\sqrt{1-n/p}}v_{ii}\leq\mu_{3}\right) \displaystyle\geq 1(2p2+2n2+12np),\displaystyle 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right),
P(np1n/p(𝒘𝒊.𝒂𝒊.n1)1npμ3)\displaystyle\implies P\left(\frac{n}{p\sqrt{1-n/p}}\left(\frac{\boldsymbol{w_{i.}a_{i.}}^{\top}}{n}-1\right)-\sqrt{1-\frac{n}{p}}\leq\mu_{3}\right) \displaystyle\geq 1(2p2+2n2+12np),\displaystyle 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right),
P(np1n/p(𝒘𝒊.𝒂𝒊.n1)1np+μ3)\displaystyle\implies P\left(\frac{n}{p\sqrt{1-n/p}}\left(\frac{\boldsymbol{w_{i.}a_{i.}}^{\top}}{n}-1\right)\leq\sqrt{1-\frac{n}{p}}+\mu_{3}\right) \displaystyle\geq 1(2p2+2n2+12np).\displaystyle 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right).

Again, for i\in[n],k\in[n],k\neq i, v_{ik}=\frac{\boldsymbol{w_{i.}a_{k.}^{\top}}}{n}. We have from (113),

P(np1n/p|𝒘𝒊.𝒂𝒌.n|μ3)1(2p2+2n2+12np).P\left(\frac{n}{p\sqrt{1-n/p}}\left|\frac{\boldsymbol{w_{i.}a_{k.}^{\top}}}{n}\right|\leq\mu_{3}\right)\geq 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right).

Now from (114), we have

n2p2(1np)Σδii\displaystyle\frac{n^{2}}{p^{2}(1-\frac{n}{p})}\Sigma_{\delta_{ii}} =\displaystyle= σ2(np(1np)(𝒘𝒊.𝒂𝒊.n1))2+σ2k=1,kin(np1n/p𝒘𝒊.𝒂𝒌.n)2\displaystyle\sigma^{2}\left(\frac{n}{p\sqrt{({1-\frac{n}{p}})}}\left(\frac{\boldsymbol{w_{i.}a_{i.}^{\top}}}{n}-1\right)\right)^{2}+\sigma^{2}\sum_{k=1,k\neq i}^{n}\left(\frac{n}{p\sqrt{1-n/p}}\frac{\boldsymbol{w_{i.}a_{k.}^{\top}}}{n}\right)^{2}
\displaystyle\leq σ2(1np+μ3)2+σ2(n1)μ32.\displaystyle\sigma^{2}\left(\sqrt{1-\frac{n}{p}}+\mu_{3}\right)^{2}+\sigma^{2}(n-1)\mu_{3}^{2}.

The last inequality holds with probability at least 12(2p2+2n2+12np)1-2\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right). Hence, we have

P(n2p2(1np)Σδiiσ2(1np+μ3)2+σ2(n1)μ32)12(2p2+2n2+12np).P\left(\frac{n^{2}}{p^{2}(1-\frac{n}{p})}\Sigma_{\delta_{ii}}\leq\sigma^{2}\left(\sqrt{1-\frac{n}{p}}+\mu_{3}\right)^{2}+\sigma^{2}(n-1)\mu_{3}^{2}\right)\geq 1-2\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right). (119)

Using (119) with (118), we obtain the following using the union bound, for all i[n]i\in[n],

P(σ2(1npμ3)2n2p2(1np)Σδiiσ2(1np+μ3)2+σ2(n1)μ32)13(2p2+2n2+12np).P\left(\sigma^{2}\left(\sqrt{1-\frac{n}{p}}-\mu_{3}\right)^{2}\leq\frac{n^{2}}{p^{2}(1-\frac{n}{p})}\Sigma_{\delta_{ii}}\leq\sigma^{2}\left(\sqrt{1-\frac{n}{p}}+\mu_{3}\right)^{2}+\sigma^{2}(n-1)\mu_{3}^{2}\right)\geq 1-3\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right).

Therefore, under the assumption that n\log n is o(p), we have (n-1)\mu_{3}^{2}=\frac{8(n-1)\log n}{p(1-n/p)}\to 0 and \left(\sqrt{1-\frac{n}{p}}+\mu_{3}\right)^{2}\to 1. Hence, we have \frac{n^{2}}{p^{2}(1-\frac{n}{p})}\Sigma_{\delta_{ii}}\overset{P}{\to}\sigma^{2}. This completes the proof. \blacksquare

B-D Proof of Theorem 5

Let 𝑾\boldsymbol{W} be the output of Alg. 2. Using the definition of 𝜷^𝑾\boldsymbol{\hat{\beta}_{W}} from (14) and the measurement model from (5), we have

𝜷^𝑾𝜷\displaystyle\boldsymbol{\hat{\beta}_{W}}-\boldsymbol{\beta^{*}} =\displaystyle= 1n𝑾𝜼+(𝑰𝒑1n𝑾𝑨)(𝜷^𝝀𝟏𝜷)+1n𝑾(𝜹𝜹^𝝀𝟐).\displaystyle\frac{1}{n}\boldsymbol{W}^{\top}\boldsymbol{{\eta}}+\left(\boldsymbol{I_{p}}-\frac{1}{n}\boldsymbol{W^{\top}}\boldsymbol{A}\right)(\boldsymbol{\hat{\beta}_{\lambda_{1}}}-\boldsymbol{\beta^{*}})+\frac{1}{n}\boldsymbol{W}^{\top}\left(\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\right). (120)

Using Results (1) and (2) of Theorem 3, the second and third terms on the RHS of (120) are o_{P}(1/\sqrt{n}). Recall that \boldsymbol{\Sigma_{\beta}}=\frac{\sigma^{2}}{n}\boldsymbol{W^{\top}W}. Therefore, we have

n(β^Wjβj)Σβjj\displaystyle\frac{\sqrt{n}(\hat{\beta}_{Wj}-\beta^{*}_{j})}{\sqrt{\Sigma_{\beta_{jj}}}} =\displaystyle= 1n𝒘.j𝜼Σβjj+oP(1/Σβjj),\displaystyle\frac{\frac{1}{\sqrt{n}}\boldsymbol{w}_{.j}^{\top}\boldsymbol{{\eta}}}{\sqrt{\Sigma_{\beta_{jj}}}}+o_{P}\left(1/\sqrt{\Sigma_{\beta_{jj}}}\right), (121)

where 𝒘.j\boldsymbol{w}_{.j} denotes the jthj^{\text{th}} column of matrix 𝑾\boldsymbol{W}. As 𝜼\boldsymbol{\eta} is Gaussian, the first term on the RHS of (121) is a Gaussian random variable with mean 0 and variance 11. Using Result (1) of Theorem 4, Σβjj\Sigma_{\beta_{jj}} converges to σ2\sigma^{2} in probability. This completes the proof of Result (1).

Using (15) and the measurement model (5), we have

𝜹^𝑾𝜹=(𝑰𝒏1n𝑾𝑨)𝜼+(𝑰𝒏1n𝑾𝑨)𝑨(𝜷𝜷^𝝀𝟏)1n𝑾𝑨(𝜹𝜹^𝝀𝟐).\displaystyle\boldsymbol{\hat{\delta}_{W}}-\boldsymbol{\delta^{*}}=\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{{\eta}}+\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{A}(\boldsymbol{\beta^{*}-\hat{\beta}_{\lambda_{1}}})-\frac{1}{n}\boldsymbol{WA}^{\top}\big{(}\boldsymbol{\delta^{*}}-\boldsymbol{\hat{\delta}_{\lambda_{2}}}\big{)}. (122)

Using Results (3) and (4) of Theorem 3, the second and third terms on the RHS of (122) are both o_{P}\left(\frac{p\sqrt{1-n/p}}{n}\right). Recall from (28) that \boldsymbol{\Sigma_{\delta}}=\sigma^{2}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)^{\top}. Therefore, we have

(δ^Wiδi)Σδii=(𝑰𝒏1n𝑾𝑨)i.𝜼Σδii+oP(1Σδiip1n/pn).\displaystyle\frac{\left({\hat{\delta}_{Wi}}-{\delta^{*}_{i}}\right)}{\sqrt{\Sigma_{\delta_{ii}}}}=\frac{\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{WA}^{\top}\right)^{\top}_{i.}\boldsymbol{{\eta}}}{\sqrt{\Sigma_{\delta_{ii}}}}+o_{P}\left(\frac{1}{\sqrt{\Sigma_{\delta_{ii}}}}\frac{p\sqrt{1-n/p}}{n}\right). (123)

As \boldsymbol{\eta} is Gaussian, the first term on the RHS of (123) is a Gaussian random variable with mean 0 and variance 1. Using Result (2) of Theorem 4, \frac{n^{2}}{p^{2}(1-n/p)}\Sigma_{\delta_{ii}} converges to \sigma^{2} in probability, so that \frac{1}{\sqrt{\Sigma_{\delta_{ii}}}}\frac{p\sqrt{1-n/p}}{n} converges to \frac{1}{\sigma} in probability and the second term on the RHS of (123) is o_{P}(1). This completes the proof of result (2). \blacksquare

Lemma 4

Let \boldsymbol{A} be an n\times p Rademacher matrix and \boldsymbol{W} be the corresponding output of Alg. 2. If n is o(p), we have the following results:

(1) P\left(\boldsymbol{w_{.j}^{\top}w_{.j}}/n\leq 1\ \forall j\in[p]\right)=1.

(2) \left|\frac{1}{n}\boldsymbol{W^{\top}}\boldsymbol{A}-\boldsymbol{I_{p}}\right|_{\infty}=O_{P}\left(\sqrt{\frac{\log(p)}{n}}\right).

(3) \left|\frac{1}{\sqrt{n}}\boldsymbol{W^{\top}}\right|_{\infty}=O\left(1\right).

(4) \left|\frac{1}{p}\left(\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}-\boldsymbol{I_{n}}\right)\boldsymbol{A}\right|_{\infty}=O_{P}\left(\sqrt{\frac{\log(pn)}{pn}}+\frac{1}{n}\right).

(5) \left|\frac{1}{p}\boldsymbol{WA}^{\top}\right|_{\infty}=O_{P}\left(\sqrt{\frac{n\log(np)}{p}}+1\right).

(6) \frac{n}{p\sqrt{1-\frac{n}{p}}}\left|\left(\frac{\boldsymbol{WA^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right)\right|_{\infty}=O_{P}\left(\frac{1}{\sqrt{1-n/p}}\sqrt{\frac{\log(n)}{p}}\right).

Proof of Lemma. 4:
In order to prove these results, we first show that, with probability at least 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right), the matrix \boldsymbol{W}=\boldsymbol{A} satisfies all 4 constraints of Alg. 2, so that the feasible region of the optimisation problem in Alg. 2 is non-empty. Let us first define the following sets:
G_{1}(n,p)=\left\{\boldsymbol{A}\in\mathbb{R}^{n\times p}:|\boldsymbol{A^{\top}A}/n-\boldsymbol{I_{p}}|_{\infty}\leq\mu_{1}\right\}, G_{2}(n,p)=\left\{\boldsymbol{A}\in\mathbb{R}^{n\times p}:\left|\frac{1}{p}(\boldsymbol{I_{n}}-\boldsymbol{AA^{\top}}/n)\boldsymbol{A}\right|_{\infty}\leq\mu_{2}\right\},
G_{3}(n,p)=\left\{\boldsymbol{A}\in\mathbb{R}^{n\times p}:\frac{n}{p\sqrt{1-\frac{n}{p}}}\left|\frac{\boldsymbol{AA^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right|_{\infty}\leq\mu_{3}\right\}, G_{4}(n,p)=\left\{\boldsymbol{A}\in\mathbb{R}^{n\times p}:\boldsymbol{a_{.j}^{\top}a_{.j}}/n\leq 1\ \forall\ j\in[p]\right\}, where \mu_{1}=2\sqrt{\frac{2\log(p)}{n}}, \mu_{2}=2\sqrt{\frac{\log(2np)}{np}}+\frac{1}{n} and \mu_{3}=\frac{2}{\sqrt{1-n/p}}\sqrt{\frac{2\log(n)}{p}}. Here \boldsymbol{A} is an n\times p Rademacher matrix. We now state the probabilities of the aforementioned sets. From (149) of Lemma 5, we have

P(𝑨G1(n,p))12p2P\left(\boldsymbol{A}\in G_{1}(n,p)\right)\geq 1-\frac{2}{p^{2}} (124)

Again from (155) of Lemma 5, we have

P(𝑨G2(n,p))112npP\left(\boldsymbol{A}\in G_{2}(n,p)\right)\geq 1-\frac{1}{2np} (125)

Similarly from (157) of Lemma 5, we have

P(𝑨G3(n,p))12n2P\left(\boldsymbol{A}\in G_{3}(n,p)\right)\geq 1-\frac{2}{n^{2}} (126)

Lastly, since \boldsymbol{A} is a Rademacher matrix, each column satisfies \boldsymbol{a_{.j}^{\top}a_{.j}}/n=1, so that P(\boldsymbol{a_{.j}^{\top}a_{.j}}/n\leq 1\ \forall\ j\in[p])=1. Therefore, we have,

P(𝑨G4(n,p))=1P\left(\boldsymbol{A}\in G_{4}(n,p)\right)=1 (127)

Note that, P(𝑨{k=14Gk(n,p)}c)=P(𝑨{k=14(Gk(n,p))c})k=14P(𝑨(Gk(n,p))c)(2p2+2n2+12np)P\left(\boldsymbol{A}\in\{\cap_{k=1}^{4}G_{k}(n,p)\}^{c}\right)=P\left(\boldsymbol{A}\in\{\cup_{k=1}^{4}(G_{k}(n,p))^{c}\}\right)\leq\sum_{k=1}^{4}P\left(\boldsymbol{A}\in(G_{k}(n,p))^{c}\right)\leq\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right). Therefore, we obtain, P(𝑨{k=14Gk(n,p)}c)=1P(𝑨{k=14Gk(n,p)})(2p2+2n2+12np)P\left(\boldsymbol{A}\in\{\cap_{k=1}^{4}G_{k}(n,p)\}^{c}\right)=1-P\left(\boldsymbol{A}\in\{\cap_{k=1}^{4}G_{k}(n,p)\}\right)\leq\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right) . Hence,

P(𝑨{k=14Gk(n,p)})1(2p2+2n2+12np).P\left(\boldsymbol{A}\in\{\cap_{k=1}^{4}G_{k}(n,p)\}\right)\geq 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right). (128)

Therefore, \boldsymbol{A} satisfies the constraints of Alg. 2 with high probability. This implies that there exists \boldsymbol{W^{*}} that satisfies the constraints. Let

E(n,p)=\Bigg\{\boldsymbol{A}:\exists\boldsymbol{W^{*}}\ \text{s.t.}\ |\boldsymbol{{W^{*}}^{\top}A}/n-\boldsymbol{I_{p}}|_{\infty}\leq\mu_{1},\ \left|\frac{1}{p}(\boldsymbol{I_{n}}-\boldsymbol{W^{*}A^{\top}}/n)\boldsymbol{A}\right|_{\infty}\leq\mu_{2},\ \frac{n}{p\sqrt{1-\frac{n}{p}}}\left|\frac{\boldsymbol{W^{*}A^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right|_{\infty}\leq\mu_{3},\ \boldsymbol{{w^{*}}_{.j}^{\top}{w^{*}}_{.j}}/n\leq 1\ \forall\ j\in[p]\Bigg\}.

Hence, we have

P(𝑨E(n,p))P(𝑨k=14Gk(n,p))1(2p2+2n2+12np)P(\boldsymbol{A}\in E(n,p))\geq P(\boldsymbol{A}\in\cap_{k=1}^{4}G_{k}(n,p))\geq 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right) (129)

Given that the set of feasible solutions is non-empty, the optimal solution of Alg. 2, denoted by \boldsymbol{W}, satisfies the constraints of Alg. 2 with probability 1.
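Before turning to the individual results, the feasibility argument above can be sanity-checked numerically. The sketch below (Python/NumPy) draws a Rademacher \boldsymbol{A} and verifies that \boldsymbol{W}=\boldsymbol{A} satisfies the four constraints defining E(n,p), with \mu_{1},\mu_{2},\mu_{3} as defined above; the values of n and p are illustrative only.

```python
import numpy as np

# Sanity check of the feasibility argument: W = A satisfies the four constraints
# defining E(n, p) above, with mu_1, mu_2, mu_3 as in this proof.
# The sizes n and p below are illustrative only.
rng = np.random.default_rng(2)
n, p = 100, 2000
A = rng.choice([-1.0, 1.0], size=(n, p))
mu1 = 2 * np.sqrt(2 * np.log(p) / n)
mu2 = 2 * np.sqrt(np.log(2 * n * p) / (n * p)) + 1 / n
mu3 = 2 / np.sqrt(1 - n / p) * np.sqrt(2 * np.log(n) / p)

c0 = np.allclose((A * A).sum(axis=0) / n, 1.0)     # column norms: a_.j^T a_.j / n = 1
c1 = np.abs(A.T @ A / n - np.eye(p)).max() <= mu1
c2 = np.abs((np.eye(n) - A @ A.T / n) @ A / p).max() <= mu2
c3 = (n / (p * np.sqrt(1 - n / p))) * np.abs(A @ A.T / n - (p / n) * np.eye(n)).max() <= mu3
print(c0, c1, c2, c3)   # typically all True, in line with (128)
```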
Result (1): Recall that the event that there exists a point satisfying constraints 𝖢𝟢\mathsf{C0}𝖢𝟥\mathsf{C3} is E(n,p)E(n,p). We have

P(𝒘.𝒋𝒘.𝒋/n1j[p])\displaystyle P\left(\boldsymbol{w_{.j}^{\top}w_{.j}}/n\leq 1\ \forall j\in[p]\right) =\displaystyle= P(𝒘.𝒋𝒘.𝒋/n1j[p]|𝑨E(n,p))P(𝑨E(n,p))\displaystyle P\left(\boldsymbol{w_{.j}^{\top}w_{.j}}/n\leq 1\ \forall j\in[p]\bigg{|}\boldsymbol{A}\in E(n,p)\right)P\left(\boldsymbol{A}\in E(n,p)\right) (130)
+\displaystyle+ P(𝒘.𝒋𝒘.𝒋/n1j[p]|𝑨E(n,p)c)P(𝑨E(n,p)c)\displaystyle P\left(\boldsymbol{w_{.j}^{\top}w_{.j}}/n\leq 1\ \forall j\in[p]\bigg{|}\boldsymbol{A}\in E(n,p)^{c}\right)P\left(\boldsymbol{A}\in E(n,p)^{c}\right)

If there exists a feasible solution to 𝖢𝟢\mathsf{C0}𝖢𝟥\mathsf{C3} then 𝒘.j𝒘.j/n1\boldsymbol{w}_{.j}^{\top}\boldsymbol{w}_{.j}/n\leq 1. Therefore, we have

P(𝒘.𝒋𝒘.𝒋/n1j[p]|𝑨E(n,p))=1.P\left(\boldsymbol{w_{.j}^{\top}w_{.j}}/n\leq 1\ \forall j\in[p]\bigg{|}\boldsymbol{A}\in E(n,p)\right)=1. (131)

Now, from Alg. 2, if the constraints of the optimisation problem cannot be satisfied, then we choose \boldsymbol{W}=\boldsymbol{A} as the output. This event is given by \boldsymbol{A}\in E(n,p)^{c}. We know that for a Rademacher matrix \boldsymbol{A}, \boldsymbol{a_{.j}^{\top}a_{.j}}/n=1 with probability 1. Therefore, we have

P(𝒘.𝒋𝒘.𝒋/n1j[p]|𝑨E(n,p)c)=1.P\left(\boldsymbol{w_{.j}^{\top}w_{.j}}/n\leq 1\ \forall j\in[p]\bigg{|}\boldsymbol{A}\in E(n,p)^{c}\right)=1. (132)

Therefore, we have from (131),(132) and (130),

P(𝒘.𝒋𝒘.𝒋/n1j[p])=P(𝑨E(n,p))+P(𝑨E(n,p)c)=1.P\left(\boldsymbol{w_{.j}^{\top}w_{.j}}/n\leq 1\ \forall j\in[p]\right)=P\left(\boldsymbol{A}\in E(n,p)\right)+P\left(\boldsymbol{A}\in E(n,p)^{c}\right)=1. (133)

Result (2): Recall that μ1=22log(p)n\mu_{1}=2\sqrt{\frac{2\log(p)}{n}}. Note that we have for any two events F1,F2F_{1},F_{2}, P(F1)=P(F1F2)+P(F1F2c)P(F1F2)+P(F2c)P(F_{1})=P(F_{1}\cap F_{2})+P(F_{1}\cap F_{2}^{c})\leq P(F_{1}\cap F_{2})+P(F_{2}^{c}). Therefore, we have,

P(|𝑾𝑨/n𝑰𝒑|μ1)\displaystyle P\left(|\boldsymbol{W^{\top}A}/n-\boldsymbol{I_{p}}|_{\infty}\geq\mu_{1}\right) \displaystyle\leq P({|𝑾𝑨/n𝑰𝒑|μ1}E(n,p))+P({E(n,p)}c)\displaystyle P\left(\{|\boldsymbol{W^{\top}A}/n-\boldsymbol{I_{p}}|_{\infty}\geq\mu_{1}\}\cap E(n,p)\right)+P\left(\{E(n,p)\}^{c}\right)
\displaystyle\leq P({|𝑾𝑨/n𝑰𝒑|μ1}E(n,p))+(2p2+2n2+12np)\displaystyle P\left(\{|\boldsymbol{W^{\top}A}/n-\boldsymbol{I_{p}}|_{\infty}\geq\mu_{1}\}\cap E(n,p)\right)+\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right)

The last inequality comes from (129). Since 𝑾\boldsymbol{W} is a feasible solution, given 𝑨E(n,p)\boldsymbol{A}\in E(n,p), it will satisfy the second constraint of Alg. 2 with probability 1. This means that

P({|𝑾𝑨/n𝑰𝒑|μ1}E(n,p))=0.P\left(\{|\boldsymbol{W^{\top}A}/n-\boldsymbol{I_{p}}|_{\infty}\geq\mu_{1}\}\cap E(n,p)\right)=0.

Therefore we have,

P(|𝑾𝑨/n𝑰𝒑|μ1)1(2p2+2n2+12np).P\left(|\boldsymbol{W^{\top}A}/n-\boldsymbol{I_{p}}|_{\infty}\leq\mu_{1}\right)\geq 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right). (134)

Since μ1=22log(p)n\mu_{1}=2\sqrt{\frac{2\log(p)}{n}}, we have, |𝑾𝑨/n𝑰𝒑|=OP(log(p)/n)|\boldsymbol{W^{\top}A}/n-\boldsymbol{I_{p}}|_{\infty}=O_{P}(\sqrt{\log(p)/n}).
Result (3): From (133), we have that, for each j\in[p], \|\boldsymbol{w_{.j}}\|_{2}\leq\sqrt{n} with probability 1. Since \|\boldsymbol{x}\|_{\infty}\leq\|\boldsymbol{x}\|_{2} for any vector \boldsymbol{x}, we have \|\boldsymbol{w_{.j}}\|_{\infty}\leq\sqrt{n} with probability 1 for every j\in[p]. As |\boldsymbol{W}^{\top}|_{\infty}\leq\underset{j\in[p]}{\max}\|\boldsymbol{w_{.j}}\|_{\infty}\leq\sqrt{n} with probability 1, we have

|1n𝑾|=O(1).\left|\frac{1}{\sqrt{n}}\boldsymbol{W}^{\top}\right|_{\infty}=O(1). (135)

Result (4): Recall that μ2=1n+2log(2np)np\mu_{2}=\frac{1}{n}+2\sqrt{\frac{\log({2np})}{np}}. Therefore, we have

P(|1p(𝑰𝒏1n𝑾𝑨)𝑨|μ2)\displaystyle P\left(\left|\frac{1}{p}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{A}\right|_{\infty}\geq\mu_{2}\right) \displaystyle\leq P({|1p(𝑰𝒏1n𝑾𝑨)𝑨|μ2}E(n,p))+P({E(n,p)}c)\displaystyle P\left(\left\{\left|\frac{1}{p}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{A}\right|_{\infty}\geq\mu_{2}\right\}\cap E(n,p)\right)+P\left(\{E(n,p)\}^{c}\right)
\displaystyle\leq P({|1p(𝑰𝒏1n𝑾𝑨)𝑨|μ2}E(n,p))+(2p2+2n2+12np)\displaystyle P\left(\left\{\left|\frac{1}{p}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{A}\right|_{\infty}\geq\mu_{2}\right\}\cap E(n,p)\right)+\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right)

The last inequality comes from (129). Note that, since 𝑾\boldsymbol{W} is a feasible solution, given 𝑨E(n,p)\boldsymbol{A}\in E(n,p), it will satisfy the third constraint of Alg. 2 with probability 1. This implies

P({|1p(𝑰𝒏1n𝑾𝑨)𝑨|μ2}E(n,p))=0.P\left(\left\{\left|\frac{1}{p}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{A}\right|_{\infty}\geq\mu_{2}\right\}\cap E(n,p)\right)=0.

Therefore, we have,

P(|1p(𝑰𝒏1n𝑾𝑨)𝑨|μ2)1(2p2+2n2+12np)P\left(\left|\frac{1}{p}\left(\boldsymbol{I_{n}}-\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}\right)\boldsymbol{A}\right|_{\infty}\leq\mu_{2}\right)\geq 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right) (136)

Hence, we have, |1p(1n𝑾𝑨𝑰𝒏)𝑨|\left|\frac{1}{p}\left(\frac{1}{n}\boldsymbol{W}\boldsymbol{A}^{\top}-\boldsymbol{I_{n}}\right)\boldsymbol{A}\right|_{\infty}= OP(log(pn)pn+1/n)O_{P}\left(\sqrt{\frac{\log(pn)}{pn}}+1/n\right).
Result (5): From (136), we have with probability at least 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right),

|1p(𝑾𝑨/n𝑰𝒏)𝑨|μ2.\left|\frac{1}{p}(\boldsymbol{WA^{\top}}/n-\boldsymbol{I_{n}})\boldsymbol{A}\right|_{\infty}\leq\mu_{2}.

Applying the triangle inequality, we have with probability at least 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right),

|1np𝑾𝑨𝑨||1p(𝑾𝑨/n𝑰𝒏)𝑨|+1p|𝑨|μ2+1/p.\left|\frac{1}{np}\boldsymbol{WA}^{\top}\boldsymbol{A}\right|_{\infty}\leq\left|\frac{1}{p}(\boldsymbol{WA^{\top}}/n-\boldsymbol{I_{n}})\boldsymbol{A}\right|_{\infty}+\frac{1}{p}|\boldsymbol{A}|_{\infty}\leq\mu_{2}+1/p. (137)

We now present some useful results about the norms used in this proof. Let \boldsymbol{X} be an n\times p matrix and \boldsymbol{Y} be a p\times n matrix. Recall the following definitions from [48],

𝒀max𝒙n{𝟎}𝒀𝒙𝒙and𝒀22max𝒙n{𝟎}𝒀𝒙2𝒙2=σmax(𝒀).\displaystyle\|\boldsymbol{Y}\|_{\infty\rightarrow\infty}\triangleq\max_{\boldsymbol{x}\in\mathbb{R}^{n}\setminus\{\boldsymbol{0}\}}\frac{\|\boldsymbol{Yx}\|_{\infty}}{\|\boldsymbol{x}\|_{\infty}}\quad\textup{and}\ \|\boldsymbol{Y}\|_{2\rightarrow 2}\triangleq\max_{\boldsymbol{x}\in\mathbb{R}^{n}\setminus\{\boldsymbol{0}\}}\frac{\|\boldsymbol{Yx}\|_{2}}{\|\boldsymbol{x}\|_{2}}=\sigma_{max}(\boldsymbol{Y}).

Note that |\boldsymbol{XY}|_{\infty}\leq\|\boldsymbol{X}\|_{\infty\rightarrow\infty}|\boldsymbol{Y}|_{\infty}; this is because |\boldsymbol{XY}|_{\infty}=\max_{i}\|\boldsymbol{X}\boldsymbol{Y_{.i}}\|_{\infty}\leq\max_{i}\|\boldsymbol{X}\|_{\infty\rightarrow\infty}\|\boldsymbol{Y_{.i}}\|_{\infty} (by the definition of the induced norm) =\|\boldsymbol{X}\|_{\infty\rightarrow\infty}\max_{i}\|\boldsymbol{Y_{.i}}\|_{\infty}=\|\boldsymbol{X}\|_{\infty\rightarrow\infty}|\boldsymbol{Y}|_{\infty}. Since \frac{1}{\sqrt{p}}\|\boldsymbol{x}\|_{2}\leq\|\boldsymbol{x}\|_{\infty} and \|\boldsymbol{Y}^{\top}\boldsymbol{x}\|_{\infty}\leq\|\boldsymbol{Y}^{\top}\boldsymbol{x}\|_{2} for all \boldsymbol{x}\in\mathbb{R}^{p}, we have

𝒀p𝒀22.\|\boldsymbol{Y^{\top}}\|_{\infty\rightarrow\infty}\leq\sqrt{p}\|\boldsymbol{Y}^{\top}\|_{2\rightarrow 2}. (138)

Then by using (138), we have

|𝑿𝒀|=|𝒀𝑿|\displaystyle|\boldsymbol{XY}|_{\infty}=|\boldsymbol{Y}^{\top}\boldsymbol{X}^{\top}|_{\infty} 𝒀|𝑿|p𝒀22|𝑿|=pσmax(𝒀)|𝑿|.\displaystyle\leq\|\boldsymbol{Y}^{\top}\|_{\infty\rightarrow\infty}|\boldsymbol{X}^{\top}|_{\infty}\leq\sqrt{p}\|\boldsymbol{Y}^{\top}\|_{2\rightarrow 2}|\boldsymbol{X}^{\top}|_{\infty}=\sqrt{p}\sigma_{max}(\boldsymbol{Y})|\boldsymbol{X}|_{\infty}. (139)

Substituting 𝑿=1np𝑾𝑨𝑨\boldsymbol{X}=\frac{1}{np}\boldsymbol{W}\boldsymbol{A}^{\top}\boldsymbol{A} and 𝒀=𝑨𝑨(𝑨𝑨)1\boldsymbol{Y=A^{\dagger}}\triangleq\boldsymbol{A}^{\top}(\boldsymbol{AA}^{\top})^{-1}, the Moore-Penrose pseudo-inverse of 𝑨\boldsymbol{A}, in (139), we obtain:

|1np𝑾𝑨|=|1np𝑾𝑨𝑨𝑨|pnp|𝑾𝑨𝑨|𝑨22.\left|\frac{1}{np}\boldsymbol{WA^{\top}}\right|_{\infty}=\left|\frac{1}{np}\boldsymbol{WA}^{\top}\boldsymbol{AA}^{\dagger}\right|_{\infty}\leq\frac{\sqrt{p}}{np}|\boldsymbol{WA}^{\top}\boldsymbol{A}|_{\infty}\|\boldsymbol{A^{\dagger}}\|_{2\rightarrow 2}. (140)

We now derive the upper bound for the second factor of (140). We have,

𝑨22=σmax(𝑨)=1σmin(𝑨)=1σmin(𝑨).\|\boldsymbol{A^{\dagger}}\|_{2\rightarrow 2}=\sigma_{max}(\boldsymbol{A^{\dagger}})=\frac{1}{\sigma_{min}(\boldsymbol{A})}=\frac{1}{\sigma_{min}(\boldsymbol{A}^{\top})}. (141)

Note that, for an arbitrary ϵ1>0\epsilon_{1}>0, using Theorem 1.1. from [37] for the mean-zero sub-Gaussian random matrix 𝑨\boldsymbol{A}, we have the following

P(σmin(𝑨)ϵ1(pn1))(c6ϵ1)pn+1+(c5)p.P(\sigma_{min}(\boldsymbol{A}^{\top})\leq\epsilon_{1}(\sqrt{p}-\sqrt{n-1}))\leq(c_{6}\epsilon_{1})^{p-n+1}+(c_{5})^{p}. (142)

where c_{6}>0 and c_{5}\in(0,1) are constants dependent on the sub-Gaussian norm of the entries of \boldsymbol{A}^{\top}. For some small constant \psi\in(0,1), set \epsilon_{1}\triangleq\psi/c_{6}. Then we have

P(σmax(𝑨)c6ψ(pn1))1((ψ)pn+1+(c5)p).P\left(\sigma_{max}(\boldsymbol{A^{\dagger}})\leq\frac{c_{6}}{\psi(\sqrt{p}-\sqrt{n-1})}\right)\geq 1-((\psi)^{p-n+1}+(c_{5})^{p}). (143)

Equivalently, since \sqrt{p}-\sqrt{n-1}=\sqrt{p}\left(1-\frac{\sqrt{n-1}}{\sqrt{p}}\right), we have

P(σmax(𝑨)c6ψp(1n1p))1((ψ)pn+1+(c5)p).\displaystyle P\left(\sigma_{max}(\boldsymbol{A^{\dagger}})\leq\frac{c_{6}}{\psi\sqrt{p}\left(1-\frac{\sqrt{n-1}}{\sqrt{p}}\right)}\right)\geq 1-((\psi)^{p-n+1}+(c_{5})^{p}). (144)

Using Eqns. (144) and (137), we have

P\left[\sqrt{p}\frac{1}{np}|\boldsymbol{WA}^{\top}\boldsymbol{A}|_{\infty}\|\boldsymbol{A^{\dagger}}\|_{2\rightarrow 2}\leq\left(\mu_{2}+\frac{1}{p}\right)\frac{c_{6}}{\psi(1-\sqrt{n/p})}\right]\geq 1-\left(\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right)+(\psi)^{p-n+1}+(c_{5})^{p}\right).

Therefore we have by Bonferroni’s inequality,

P(|1np𝑾𝑨|(μ2+1p)c6ψ(1n/p))1((2p2+2n2+12np)+(ψ)pn+1+(c5)p).P\left(\left|\frac{1}{np}\boldsymbol{WA}^{\top}\right|_{\infty}\leq\left(\mu_{2}+\frac{1}{p}\right)\frac{c_{6}}{\psi(1-\sqrt{n/p})}\right)\geq 1-\left(\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right)+(\psi)^{p-n+1}+(c_{5})^{p}\right).

Under the condition n=o(p)n=o(p) as n,pn,p\to\infty, the probability 1(2p2+2n2+12np+(ψ)pn+1+(c5)p)11-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}+(\psi)^{p-n+1}+(c_{5})^{p}\right)\to 1. Therefore, we have,

|1p𝑾𝑨|=OP(n(log(np)np+1n+1p))=OP(nlog(np)p+1+np)=OP(nlog(np)p+1).\left|\frac{1}{p}\boldsymbol{WA}^{\top}\right|_{\infty}=O_{P}\left(n\left(\sqrt{\frac{\log(np)}{np}}+\frac{1}{n}+\frac{1}{p}\right)\right)=O_{P}\left(\sqrt{\frac{n\log(np)}{p}}+1+\frac{n}{p}\right)=O_{P}\left(\sqrt{\frac{n\log(np)}{p}}+1\right). (145)

Result (6): Recall that μ3=21n/p2log(n)p\mu_{3}=\frac{2}{\sqrt{1-n/p}}\sqrt{\frac{2\log(n)}{p}}. We have,

P(np1np|(𝑾𝑨npn𝑰𝒏)|μ3)\displaystyle P\left(\frac{n}{p\sqrt{1-\frac{n}{p}}}\left|\left(\frac{\boldsymbol{WA^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right)\right|_{\infty}\geq\mu_{3}\right)
\displaystyle\leq P({np1np|(𝑾𝑨npn𝑰𝒏)|μ3}E(n,p))+P({E(n,p)}c)\displaystyle P\left(\left\{\frac{n}{p\sqrt{1-\frac{n}{p}}}\left|\left(\frac{\boldsymbol{WA^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right)\right|_{\infty}\geq\mu_{3}\right\}\cap E(n,p)\right)+P\left(\{E(n,p)\}^{c}\right)
\displaystyle\leq P({np1np|(𝑾𝑨npn𝑰𝒏)|μ3}E(n,p))+(2p2+2n2+12np)\displaystyle P\left(\left\{\frac{n}{p\sqrt{1-\frac{n}{p}}}\left|\left(\frac{\boldsymbol{WA^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right)\right|_{\infty}\geq\mu_{3}\right\}\cap E(n,p)\right)+\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right)

The last inequality comes from (129). Note that, since \boldsymbol{W} is a feasible solution, given \boldsymbol{A}\in E(n,p), it will satisfy the fourth constraint of Alg. 2 with probability 1. This implies that:

P({np1np|(𝑾𝑨npn𝑰𝒏)|μ3}E(n,p))=0,P\left(\left\{\frac{n}{p\sqrt{1-\frac{n}{p}}}\left|\left(\frac{\boldsymbol{WA^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right)\right|_{\infty}\geq\mu_{3}\right\}\cap E(n,p)\right)=0,

which yields:

P(np1np|(𝑾𝑨npn𝑰𝒏)|μ3)1(2p2+2n2+12np).P\left(\frac{n}{p\sqrt{1-\frac{n}{p}}}\left|\left(\frac{\boldsymbol{WA^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right)\right|_{\infty}\leq\mu_{3}\right)\geq 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right). (146)

Since 1-\left(\frac{2}{p^{2}}+\frac{2}{n^{2}}+\frac{1}{2np}\right) goes to 1 as n,p\to\infty, the proof is complete. \blacksquare

Appendix C Lemma on properties of 𝑨\boldsymbol{A}

Lemma 5

Let \boldsymbol{A} be an n\times p random Rademacher matrix. Then the following holds:

1. \left|\frac{1}{n}\boldsymbol{A^{\top}}\boldsymbol{A}-\boldsymbol{I_{p}}\right|_{\infty}=O_{P}\left(\sqrt{\frac{\log(p)}{n}}\right).

2. \left|\frac{1}{p}\left(\frac{1}{n}\boldsymbol{A}\boldsymbol{A}^{\top}-\boldsymbol{I_{n}}\right)\boldsymbol{A}\right|_{\infty}=O_{P}\left(\sqrt{\frac{\log(pn)}{pn}}+\frac{1}{n}\right).

3. If n<p, then \frac{n}{p\sqrt{1-\frac{n}{p}}}\left|\left(\frac{\boldsymbol{AA^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right)\right|_{\infty}=O_{P}\left(\frac{1}{\sqrt{1-n/p}}\sqrt{\frac{\log(n)}{p}}\right).

Proof of Lemma 5, [Result (1)]: Let \boldsymbol{V}\triangleq\boldsymbol{A^{\top}A}/n-\boldsymbol{I_{p}}. Note that the elements of \boldsymbol{V} satisfy the following:

vlj={k=1n(akl)2n1=0if j=l,l[p]k=1naklakjnif jl,j,l[p]v_{lj}=\begin{cases}\sum_{k=1}^{n}\frac{(a_{kl})^{2}}{n}-1=0&\mbox{if }j=l,\ l\in[p]\\ \sum_{k=1}^{n}\frac{a_{kl}a_{kj}}{n}&\mbox{if }j\neq l,\ j,l\in[p]\end{cases} (147)

Therefore, we now consider the off-diagonal elements of \boldsymbol{V} (i.e., l\neq j) for the bound. Each summand of v_{lj} is uniformly bounded, -\frac{1}{n}\leq\frac{a_{kl}a_{kj}}{n}\leq\frac{1}{n}, since the elements of \boldsymbol{A} are \pm 1. Note that E\left[\frac{a_{kl}a_{kj}}{n}\right]=0 for all k\in[n],\ l\neq j\in[p]. Furthermore, for l\neq j\in[p], the summands of v_{lj} are mutually independent since the elements of \boldsymbol{A} are independent. Therefore, using Hoeffding's inequality (see Lemma 1 of [34]), for t>0,

P(|vlj|t)=P(|k=1naklakjn|t)2ent22,lj[p].\displaystyle P(|v_{lj}|\geq t)=P\left(\left|\sum_{k=1}^{n}\frac{a_{kl}a_{kj}}{n}\right|\geq t\right)\leq 2e^{-\frac{nt^{2}}{2}},\ l\neq j\in[p].

Therefore we have

P\left(\max_{l\neq j\in[p]}|v_{lj}|\geq t\right)=P\left(\cup_{l\neq j\in[p]}\{|v_{lj}|\geq t\}\right)\leq\sum_{l\neq j\in[p]}P(|v_{lj}|\geq t)\leq 2p(p-1)e^{-\frac{nt^{2}}{2}}<2p^{2}e^{-\frac{nt^{2}}{2}}. (148)

Putting t=22logpnt=2\sqrt{\frac{2\log p}{n}} in (148), we obtain:

P\left(\max_{l\neq j\in[p]}|v_{lj}|\geq 2\sqrt{\frac{2\log p}{n}}\right)\leq 2p^{2}e^{-4\log(p)}=2p^{-2}. (149)

Thus, we have:

|𝑽|=OP(logpn).|\boldsymbol{V}|_{\infty}=O_{P}\left(\sqrt{\frac{\log p}{n}}\right). (150)

This completes the proof of Result (1).

Result (2): Note that,

1p(1n𝑨𝑨𝑰𝒏)𝑨=1p(1n𝑨𝑨𝑨𝑨)𝑽.\frac{1}{p}\left(\frac{1}{n}\boldsymbol{A}\boldsymbol{A}^{\top}-\boldsymbol{I_{n}}\right)\boldsymbol{A}=\frac{1}{p}\left(\frac{1}{n}\boldsymbol{A}\boldsymbol{A}^{\top}\boldsymbol{A}-\boldsymbol{A}\right)\triangleq\boldsymbol{V}. (151)

Fix i[n]i\in[n] and j[p]j\in[p], and consider

vij=aijp+1npl=1pk=1nailaklakjv_{ij}=-\frac{a_{ij}}{p}+\frac{1}{np}\sum_{l=1}^{p}\sum_{k=1}^{n}a_{il}a_{kl}a_{kj}

By splitting the sum over ll into the terms where ljl\neq j and the term where l=jl=j, and simplifying by using the fact that akj2=1a_{kj}^{2}=1 for all k,jk,j, we obtain

vij\displaystyle v_{ij} =aijp+1np(l=1ljpk=1nailaklakj)+1np(k=1naijakj2)\displaystyle=-\frac{a_{ij}}{p}+\frac{1}{np}\left(\sum_{\begin{subarray}{c}l=1\\ l\neq j\end{subarray}}^{p}\sum_{k=1}^{n}a_{il}a_{kl}a_{kj}\right)+\frac{1}{np}\left(\sum_{k=1}^{n}a_{ij}a_{kj}^{2}\right)
=aijp+1np(l=1ljpk=1nailaklakj)+aijp1n(k=1n1)\displaystyle=-\frac{a_{ij}}{p}+\frac{1}{np}\left(\sum_{\begin{subarray}{c}l=1\\ l\neq j\end{subarray}}^{p}\sum_{k=1}^{n}a_{il}a_{kl}a_{kj}\right)+\frac{a_{ij}}{p}\frac{1}{n}\left(\sum_{k=1}^{n}1\right)
=1np(l=1ljpk=1nailaklakj).\displaystyle=\frac{1}{np}\left(\sum_{\begin{subarray}{c}l=1\\ l\neq j\end{subarray}}^{p}\sum_{k=1}^{n}a_{il}a_{kl}a_{kj}\right).

Next we split the sum over kk into the terms where kik\neq i and the term where k=ik=i to obtain

vij=1np(l=1ljpk=1nailaklakj)=1np(l=1ljpk=1kinailaklakj)+1npl=1ljpail2aij=1np(l=1ljpk=1kinailaklakj)+p1npaij.v_{ij}=\frac{1}{np}\left(\sum_{\begin{subarray}{c}l=1\\ l\neq j\end{subarray}}^{p}\sum_{k=1}^{n}a_{il}a_{kl}a_{kj}\right)=\frac{1}{np}\left(\sum_{\begin{subarray}{c}l=1\\ l\neq j\end{subarray}}^{p}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{n}a_{il}a_{kl}a_{kj}\right)+\frac{1}{np}\sum_{\begin{subarray}{c}l=1\\ l\neq j\end{subarray}}^{p}a_{il}^{2}a_{ij}=\frac{1}{np}\left(\sum_{\begin{subarray}{c}l=1\\ l\neq j\end{subarray}}^{p}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{n}a_{il}a_{kl}a_{kj}\right)+\frac{p-1}{np}a_{ij}. (152)

If we condition on 𝒂i.\boldsymbol{a}_{i.} and 𝒂.j\boldsymbol{a}_{.j}, the (n1)(p1)(n-1)(p-1) random variables 1npailaklakj\frac{1}{np}a_{il}a_{kl}a_{kj} for k[n]{i}k\in[n]\setminus\{i\} and l[p]{j}l\in[p]\setminus\{j\} are independent, have mean zero, and are bounded between 1np-\frac{1}{np} and 1np\frac{1}{np}. Therefore, by Hoeffding’s inequality, for t>0t>0, we have

P(|1npl=1ljpk=1kinailaklakj|t|𝒂i.,𝒂.j)2en2p2t22(n1)(p1)2enpt2/2.P\left(\left.\left|\frac{1}{np}\sum_{\begin{subarray}{c}l=1\\ l\neq j\end{subarray}}^{p}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{n}a_{il}a_{kl}a_{kj}\right|\geq t\;\right|\;\boldsymbol{a}_{i.},\boldsymbol{a}_{.j}\right)\leq 2e^{-\frac{n^{2}p^{2}t^{2}}{2(n-1)(p-1)}}\leq 2e^{-npt^{2}/2}. (153)

Since the RHS of (153) does not depend on \boldsymbol{a}_{i.} and \boldsymbol{a}_{.j}, the bound also holds for the unconditional probability, i.e., we have

P(|1npl=1ljpk=1kinailaklakj|t)2en2p2t22(n1)(p1)2enpt2/2.P\left(\left|\frac{1}{np}\sum_{\begin{subarray}{c}l=1\\ l\neq j\end{subarray}}^{p}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{n}a_{il}a_{kl}a_{kj}\right|\geq t\right)\leq 2e^{-\frac{n^{2}p^{2}t^{2}}{2(n-1)(p-1)}}\leq 2e^{-npt^{2}/2}. (154)

Since aija_{ij} is Rademacher, p1np|aij|<1n\frac{p-1}{np}|a_{ij}|<\frac{1}{n} with probability 11. Choosing t=2log(2np)npt=2\sqrt{\frac{\log(2np)}{np}} and using the triangle inequality, we have from (154),

P(|vij|1n+2log(2np)np)\displaystyle P\left(|v_{ij}|\geq\frac{1}{n}+2\sqrt{\frac{\log(2np)}{np}}\right) P(|1npl=1ljpk=1kinailaklakj|+p1np|aij|2log(2np)np+1n)\displaystyle\leq P\left(\left|\frac{1}{np}\sum_{\begin{subarray}{c}l=1\\ l\neq j\end{subarray}}^{p}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{n}a_{il}a_{kl}a_{kj}\right|+\frac{p-1}{np}|a_{ij}|\geq 2\sqrt{\frac{\log(2np)}{np}}+\frac{1}{n}\right)
P(|1npl=1ljpk=1kinailaklakj|2log(2np)np)12n2p2.\displaystyle\leq P\left(\left|\frac{1}{np}\sum_{\begin{subarray}{c}l=1\\ l\neq j\end{subarray}}^{p}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{n}a_{il}a_{kl}a_{kj}\right|\geq 2\sqrt{\frac{\log(2np)}{np}}\right)\leq\frac{1}{2n^{2}p^{2}}.

Then, by a union bound over npnp events,

P(maxi,j|vij|1n+2log(2np)np)np12n2p2=12np.P\left(\max_{i,j}|v_{ij}|\geq\frac{1}{n}+2\sqrt{\frac{\log(2np)}{np}}\right)\leq np\,\frac{1}{2n^{2}p^{2}}=\frac{1}{2np}. (155)

This completes the proof of Result (2).
Result (3): Reversing the roles of n and p in Result (1) of this lemma (see (147) and (149)), we have

P(|𝑨𝑨p𝑰𝒏|22log(n)p)12n2.P\left(\left|\frac{\boldsymbol{AA^{\top}}}{p}-\boldsymbol{I_{n}}\right|_{\infty}\leq 2\sqrt{\frac{2\log(n)}{p}}\right)\geq 1-\frac{2}{n^{2}}. (156)

Now, noting that \frac{n}{p}\left|\frac{\boldsymbol{AA^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right|_{\infty}=\left|\frac{\boldsymbol{AA^{\top}}}{p}-\boldsymbol{I_{n}}\right|_{\infty} and multiplying by \frac{1}{\sqrt{1-n/p}}, we get

P(np1np|(𝑨𝑨npn𝑰𝒏)|21n/p2log(n)p)12n2.P\left(\frac{n}{p\sqrt{1-\frac{n}{p}}}\left|\left(\frac{\boldsymbol{AA^{\top}}}{n}-\frac{p}{n}\boldsymbol{I_{n}}\right)\right|_{\infty}\leq\frac{2}{\sqrt{1-n/p}}\sqrt{\frac{2\log(n)}{p}}\right)\geq 1-\frac{2}{n^{2}}. (157)

This completes the proof. \blacksquare
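The three rates in Lemma 5 can also be illustrated empirically. In the following minimal sketch (Python/NumPy), each sup-norm quantity is divided by its claimed rate; the resulting ratios stay of order one as n and p grow. The sizes are illustrative only, and the diagonal entries are zeroed directly because for a Rademacher matrix they equal their population values exactly.

```python
import numpy as np

# Empirical illustration of the rates in Lemma 5 for Rademacher matrices.
# Each quantity is divided by its claimed rate; the ratios stay of order one.
rng = np.random.default_rng(3)
for n, p in [(50, 1000), (100, 2000), (150, 3000)]:
    A = rng.choice([-1.0, 1.0], size=(n, p))

    G = A.T @ A / n                       # p x p Gram matrix; its diagonal equals 1 exactly
    np.fill_diagonal(G, 0.0)              # so A^T A / n - I_p has zero diagonal
    r1 = np.abs(G).max() / np.sqrt(np.log(p) / n)

    M = (A @ A.T / n - np.eye(n)) @ A / p
    r2 = np.abs(M).max() / (np.sqrt(np.log(p * n) / (p * n)) + 1 / n)

    H = A @ A.T / n                       # n x n; its diagonal equals p/n exactly
    np.fill_diagonal(H, 0.0)              # so A A^T / n - (p/n) I_n has zero diagonal
    r3 = (n / p) * np.abs(H).max() / np.sqrt(np.log(n) / p)   # the 1/sqrt(1-n/p) factors cancel

    print(n, p, round(r1, 2), round(r2, 2), round(r3, 2))
```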

Appendix D Some useful lemmas

Lemma 6

Let 𝐔\boldsymbol{U} and 𝐕\boldsymbol{V} be two n×pn\times p random matrices. Let ϑ\vartheta\in\mathbb{R} and 𝐰n\boldsymbol{w}\in\mathbb{R}^{n}. Then,

  1. $|\vartheta\boldsymbol{U}|_{\infty}=|\vartheta|\,|\boldsymbol{U}|_{\infty}$.

  2. $|\boldsymbol{U}+\boldsymbol{V}|_{\infty}\leq|\boldsymbol{U}|_{\infty}+|\boldsymbol{V}|_{\infty}$.

  3. If $|\boldsymbol{U}|_{\infty}=O_{P}(h_{1}(n,p))$ and $|\boldsymbol{V}|_{\infty}=O_{P}(h_{2}(n,p))$, then $|\boldsymbol{U}+\boldsymbol{V}|_{\infty}=O_{P}(\max\{h_{1}(n,p),h_{2}(n,p)\})$.

  4. $\|\boldsymbol{w^{\top}V}\|_{\infty}\leq|\boldsymbol{V}|_{\infty}\|\boldsymbol{w}\|_{1}$.

  5. If $|\boldsymbol{V}|_{\infty}=O_{P}(h_{1}(n,p))$ and $\|\boldsymbol{w}\|_{1}=O_{P}(h_{\boldsymbol{w}}(n,p))$, then $\|\boldsymbol{w^{\top}V}\|_{\infty}=O_{P}(h_{1}(n,p)\,h_{\boldsymbol{w}}(n,p))$. $\blacksquare$

Proof:
Part (1): We have, |ϑ𝑼|=maxi[n],j[p]|ϑuij|=maxi[n],j[p]|ϑ||uij|=|ϑ||𝑼||\vartheta\boldsymbol{U}|_{\infty}=\underset{i\in[n],j\in[p]}{\max}|\vartheta u_{ij}|=\underset{i\in[n],j\in[p]}{\max}|\vartheta||u_{ij}|=|\vartheta||\boldsymbol{U}|_{\infty}.

Part (2): We have, |𝑼+𝑽|=maxi[n],j[p]|uij+vij|maxi[n],j[p]{|uij|+|vij|}maxi[n],j[p]|uij|+maxi[n],j[p]|vij|=|𝑼|+|𝑽||\boldsymbol{U}+\boldsymbol{V}|_{\infty}=\underset{i\in[n],j\in[p]}{\max}|u_{ij}+v_{ij}|\leq\underset{i\in[n],j\in[p]}{\max}\{|u_{ij}|+|v_{ij}|\}\leq\underset{i\in[n],j\in[p]}{\max}|u_{ij}|+\underset{i\in[n],j\in[p]}{\max}|v_{ij}|=|\boldsymbol{U}|_{\infty}+|\boldsymbol{V}|_{\infty}.

Part (3): Suppose $|\boldsymbol{U}|_{\infty}=O_{P}(h_{1}(n,p))$ and $|\boldsymbol{V}|_{\infty}=O_{P}(h_{2}(n,p))$. Then, from Part (2),

$|\boldsymbol{U}+\boldsymbol{V}|_{\infty}\leq|\boldsymbol{U}|_{\infty}+|\boldsymbol{V}|_{\infty}=O_{P}(h_{1}(n,p))+O_{P}(h_{2}(n,p))=O_{P}(h_{1}(n,p)+h_{2}(n,p))=O_{P}(\max\{h_{1}(n,p),h_{2}(n,p)\})$, since $h_{1}(n,p)+h_{2}(n,p)\leq 2\max\{h_{1}(n,p),h_{2}(n,p)\}$.

Part (4): For any j[p]j\in[p], we have

|(𝒘𝑽)j|=|i=1nvijwi|i=1n|vij||wi|i=1nmaxj[p]|vij||wi|maxi[n],j[p]|vij|i=1n|wi|=|𝑽|𝒘1.|(\boldsymbol{w^{\top}V})_{j}|=|\sum_{i=1}^{n}{v}_{ij}w_{i}|\leq\sum_{i=1}^{n}|{v}_{ij}||w_{i}|\leq\sum_{i=1}^{n}\max_{j\in[p]}|{v}_{ij}||w_{i}|\leq\max_{i\in[n],j\in[p]}|{v}_{ij}|\sum_{i=1}^{n}|w_{i}|=|\boldsymbol{V}|_{\infty}\|\boldsymbol{w}\|_{1}.

Part (5): From Part (4), $\|\boldsymbol{w^{\top}V}\|_{\infty}\leq|\boldsymbol{V}|_{\infty}\|\boldsymbol{w}\|_{1}=O_{P}(h_{1}(n,p))\,O_{P}(h_{\boldsymbol{w}}(n,p))=O_{P}(h_{1}(n,p)\,h_{\boldsymbol{w}}(n,p))$. $\blacksquare$
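The inequalities in Lemma 6 are elementary but used repeatedly; the following small, self-contained Python snippet (with arbitrary dimensions and random draws, chosen only for illustration) checks Parts (2) and (4) numerically.

```python
# Illustrative check of Lemma 6, Parts (2) and (4); dimensions are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 80
U = rng.standard_normal((n, p))
V = rng.standard_normal((n, p))
w = rng.standard_normal(n)

def max_norm(M):
    # Entrywise maximum absolute value, written |.|_inf in the paper.
    return np.max(np.abs(M))

# Part (2): triangle inequality for the entrywise max norm.
assert max_norm(U + V) <= max_norm(U) + max_norm(V)

# Part (4): ||w^T V||_inf <= |V|_inf * ||w||_1.
assert np.max(np.abs(w @ V)) <= max_norm(V) * np.sum(np.abs(w))

print("Lemma 6, Parts (2) and (4) hold on this random draw.")
```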

Lemma 7

Let XiX_{i}, for i=1,2,,ki=1,2,\ldots,k, be Gaussian random variables with mean 0 and variance σ2\sigma^{2}. Then, we have

P[maxi[k]|Xi|2σlog(k)]1/k.P\left[\max_{i\in[k]}|X_{i}|\geq 2\sigma\sqrt{\log(k)}\right]\leq 1/k. (158)

Note that Lemma 7 does not require independence. For a proof see, e.g., [46].
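Since Lemma 7 does not require independence, a quick way to build intuition is to simulate strongly correlated Gaussians and compare the empirical tail probability of the maximum with the $1/k$ bound. The sketch below uses an equicorrelated covariance purely for illustration; the correlation level, $k$, $\sigma$, and the number of trials are arbitrary choices.

```python
# Illustrative check of Lemma 7 with correlated zero-mean Gaussians.
import numpy as np

rng = np.random.default_rng(3)
k, sigma, trials = 200, 1.5, 20000                   # arbitrary illustrative values
# Equicorrelated covariance: variance sigma^2, pairwise correlation 0.5.
cov = sigma**2 * (0.5 * np.ones((k, k)) + 0.5 * np.eye(k))
X = rng.multivariate_normal(np.zeros(k), cov, size=trials)

t = 2 * sigma * np.sqrt(np.log(k))                   # threshold from (158)
rate = np.mean(np.max(np.abs(X), axis=1) >= t)
print(f"empirical P(max_i |X_i| >= 2*sigma*sqrt(log k)) = {rate:.4f}, bound 1/k = {1 / k:.4f}")
```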

References

  • [1] List of countries implementing pool testing strategy against COVID-19. https://en.wikipedia.org/wiki/List_of_countries_implementing_pool_testing_strategy_against_COVID-19. Last retrieved October 2021.
  • [2] M. Aldridge and D. Ellis. Pooled Testing and Its Applications in the COVID-19 Pandemic, pages 217–249. 2022.
  • [3] M. Aldridge, O. Johnson, and J. Scarlett. Group testing: An information theory perspective. Found. Trends Commun. Inf. Theory, 15:196–392, 2019.
  • [4] A. Aldroubi, X. Chen, and A.M. Powell. Perturbations of measurement matrices and dictionaries in compressed sensing. Appl. Comput. Harmon. Anal., 33(2), 2012.
  • [5] G. Atia and V. Saligrama. Boolean compressed sensing and noisy group testing. IEEE Trans. Inf. Theory, 58(3), 2012.
  • [6] S. H. Bharadwaja and C. R. Murthy. Recovery algorithms for pooled RT-qPCR based COVID-19 screening. IEEE Trans. Signal Process., 70:4353–4368, 2022.
  • [7] T.T. Cai and Z. Guo. Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. Ann. Stat., 45(2), 2017.
  • [8] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207–1223, 2006.
  • [9] G. Casella and R. Berger. Statistical Inference. Thomson Learning, 2002.
  • [10] C. L. Chan, P. H. Che, S. Jaggi, and V. Saligrama. Non-adaptive probabilistic group testing with noisy measurements: Near-optimal bounds with efficient algorithms. In ACCC, pages 1832–1839, 2011.
  • [11] M. Cheraghchi, A. Hormati, A. Karbasi, and M. Vetterli. Group testing with probabilistic tests: Theory, design and application. IEEE Transactions on Information Theory, 57(10), 2011.
  • [12] A. Christoff et al. Swab pooling: A new method for large-scale RT-qPCR screening of SARS-CoV-2 avoiding sample dilution. PLOS ONE, 16(2):1–12, 02 2021.
  • [13] S. Comess, H. Wang, S. Holmes, and C. Donnat. Statistical modeling for practical pooled testing during the COVID-19 pandemic. Statistical Science, 37(2):229–250, 2022.
  • [14] R. Dorfman. The detection of defective members of large populations. Ann. Math. Stat., 14(4):436–440, 1943.
  • [15] E. Fenichel, R. Koch, A. Gilbert, G. Gonsalves, and A. Wyllie. Understanding the barriers to pooled SARS-CoV-2 testing in the United States. Microbiology Spectrum, 2021.
  • [16] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 1981.
  • [17] D. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Pearson, 2012.
  • [18] S. Fosson, V. Cerone, and D. Regruto. Sparse linear regression from perturbed data. Automatica, 122, 2020.
  • [19] Sabyasachi Ghosh, Rishi Agarwal, Mohammad Ali Rehan, Shreya Pathak, Pratyush Agarwal, Yash Gupta, Sarthak Consul, Nimay Gupta, Ritesh Goenka, Ajit Rajwade, and Manoj Gopalkrishnan. A compressed sensing approach to pooled RT-PCR testing for COVID-19 detection. IEEE Open Journal of Signal Processing, 2:248–264, 2021.
  • [20] A. C. Gilbert, M. A. Iwen, and M. J. Strauss. Group testing and sparse signal recovery. In 42nd Asilomar Conf. Signals, Syst. and Comput., pages 1059–1063, 2008.
  • [21] N. Grobe, A. Cherif, X. Wang, Z. Dong, and P. Kotanko. Sample pooling: burden or solution? Clin. Microbiol. Infect., 27(9):1212–1220, 2021.
  • [22] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The LASSO and Generalizations. CRC Press, 2015.
  • [23] A. Heidarzadeh and K. Narayanan. Two-stage adaptive pooling with RT-qPCR for COVID-19 screening. In ICASSP, 2021.
  • [24] M.A. Herman and T. Strohmer. General deviants: an analysis of perturbations in compressed sensing. IEEE Journal on Sel. Topics Signal Process., 4(2), 2010.
  • [25] F. Hwang. A method for detecting all defective members in a population by group testing. J Am Stat Assoc, 67(339):605–608, 1972.
  • [26] J.D. Ianni and W.A. Grissom. Trajectory auto-corrected image reconstruction. Magnetic Resonance in Medicine, 76(3), 2016.
  • [27] T. Ince and A. Nacaroglu. On the perturbation of measurement matrix in non-convex compressed sensing. Signal Process., 98:143–149, 2014.
  • [28] A. Javanmard and A. Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. J Mach Learn Res, 2014.
  • [29] A. Kahng and S. Reda. New and improved BIST diagnosis methods from combinatorial group testing theory. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, 25(3), 2006.
  • [30] D. B. Larremore, B. Wilder, E. Lester, S. Shehata, J. M. Burke, J. A. Hay, M. Tambe, M. Mina, and R. Parker. Test sensitivity is secondary to frequency and turnaround time for COVID-19 screening. Science Advances, 7(1), 2021.
  • [31] Y. Li and G. Raskutti. Minimax optimal convex methods for Poisson inverse problems under q\ell_{q} -ball sparsity. IEEE Trans. Inf. Theory, 64(8):5498–5512, 2018.
  • [32] Hubert W. Lilliefors. On the Kolmogorov–Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62(318):399–402, 1967.
  • [33] Miles E Lopes. Unknown sparsity in compressed sensing: Denoising and inference. IEEE Transactions on Information Theory, 62(9):5145–5166, 2016.
  • [34] G. Lugosi. Concentration-of-measure inequalities, 2009. https://www.econ.upf.edu/~lugosi/anu.pdf.
  • [35] A. Mazumdar and S. Mohajer. Group testing with unreliable elements. In ACCC, 2014.
  • [36] D. C. Montgomery, E. Peck, and G. Vining. Introduction to Linear Regression Analysis. Wiley, 2021.
  • [37] M. Rudelson and R. Vershynin. Smallest singular value of a random rectangular matrix. Comm. Pure Appl. Math., 2009.
  • [38] Nam H. Nguyen and Trac D. Tran. Robust LASSO with missing and grossly corrupted observations. IEEE Trans. Inf. Theory, 59(4):2036–2058, 2013.
  • [39] H. Pandotra, E. Malhotra, A. Rajwade, and K. S. Gurumoorthy. Dealing with frequency perturbations in compressive reconstructions with Fourier sensing matrices. Signal Process., 165:57–71, 2019.
  • [40] J. Parker, V. Cevher, and P. Schniter. Compressive sensing under matrix uncertainties: An approximate message passing approach. In Asilomar Conference on Signals, Systems and Computers, pages 804–808, 2011.
  • [41] Liangzu Peng, Boshi Wang, and Manolis Tsakiris. Homomorphic sensing: Sparsity and noise. In International Conference on Machine Learning, pages 8464–8475. PMLR, 2021.
  • [42] M. Raginsky, R. Willett, Z. Harmany, and R. Marcia. Compressed sensing performance bounds under Poisson noise. IEEE Trans. Signal Process., 58(8):3990–4002, 2010.
  • [43] Chiara Ravazzi, Sophie Fosson, Tiziano Bianchi, and Enrico Magli. Sparsity estimation from compressive projections via sparse random matrices. EURASIP Journal on Advances in Signal Processing, 2018:1–18, 2018.
  • [44] R. Rohde. COVID-19 pool testing: Is it time to jump in? https://asm.org/Articles/2020/July/COVID-19-Pool-Testing-Is-It-Time-to-Jump-In, 2020.
  • [45] Mark Rudelson and Roman Vershynin. The Littlewood–Offord problem and invertibility of random matrices. Advances in Mathematics, 218(2):600–633, 2008.
  • [46] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
  • [47] N. Shental et al. Efficient high throughput SARS-CoV-2 testing to detect asymptomatic carriers. Sci. Adv., 6(37), September 2020.
  • [48] J. Todd. Induced Norms, pages 19–28. Birkhäuser Basel, Basel, 1977.
  • [49] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.
  • [50] Katherine J. Wu. Why pooled testing for the coronavirus isn’t working in America. https://www.nytimes.com/2020/08/18/health/coronavirus-pool-testing.html. Last retrieved October 2021.
  • [51] Hooman Zabeti, Nick Dexter, Ivan Lau, Leonhardt Unruh, Ben Adcock, and Leonid Chindelevitch. Group testing large populations for SARS-CoV-2. medRxiv, pages 2021–06, 2021.
  • [52] J. Zhang, L. Chen, P. Boufounos, and Y. Gu. On the theoretical analysis of cross validation in compressive sensing. In ICASSP, 2014.
  • [53] H. Zhu, G. Leus, and G. Giannakis. Sparsity-cognizant total least-squares for perturbed compressive sampling. IEEE Trans. Signal Process., 59(11), 2011.