
Phase Transitions in Recovery of Structured Signals from Corrupted Measurements

Zhongxing Sun1, Wei Cui1, and Yulong Liu2 Corresponding author: Yulong Liu. Email: [email protected]. 1School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China 2School of Physics, Beijing Institute of Technology, Beijing 100081, China
Abstract

This paper is concerned with the problem of recovering a structured signal from a relatively small number of corrupted random measurements. Sharp phase transitions have been observed numerically when different convex programming procedures are used to solve this problem. This paper is devoted to presenting theoretical explanations for these phenomena by employing some basic tools from Gaussian process theory. Specifically, we identify the precise locations of the phase transitions for both constrained and penalized recovery procedures. Our theoretical results show that these phase transitions are determined by some geometric measures of structure, e.g., the spherical Gaussian width of a tangent cone and the Gaussian (squared) distance to a scaled subdifferential. By utilizing the established phase transition theory, we further investigate the relationship between these two kinds of recovery procedures, which also reveals an optimal strategy (in the sense of Lagrange theory) for choosing the tradeoff parameter in the penalized recovery procedure. Numerical experiments are provided to verify our theoretical results.

Index Terms:
Phase transition, corrupted sensing, signal separation, signal demixing, compressed sensing, structured signals, corruption, Gaussian process.

I Introduction

This paper studies the problem of recovering a structured signal from a relatively small number of corrupted measurements

\bm{y}=\bm{\Phi}\bm{x}^{\star}+\bm{v}^{\star}+\bm{z}, (1)

where $\bm{\Phi}\in\mathbb{R}^{m\times n}$ is the sensing matrix, $\bm{x}^{\star}\in\mathbb{R}^{n}$ denotes the structured signal to be estimated, $\bm{v}^{\star}\in\mathbb{R}^{m}$ stands for the structured corruption, and $\bm{z}\in\mathbb{R}^{m}$ represents the unstructured observation noise. The objective is to estimate $\bm{x}^{\star}$ and $\bm{v}^{\star}$ from the given knowledge of $\bm{y}$ and $\bm{\Phi}$. If $\bm{v}^{\star}$ contains some useful information, then the model (1) can be regarded as a signal separation (or demixing) problem. In particular, if there is no corruption ($\bm{v}^{\star}=\bm{0}$), then the model (1) reduces to the standard compressed sensing problem.

This problem arises in many practical applications of interest, such as face recognition [1], subspace clustering [2], sensor networks [3], latent variable modeling [4], principal component analysis [5], source separation [6], and so on. The theoretical aspects of this problem have also been studied under different scenarios in the literature; important examples include sparse signal recovery from sparse corruption [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], low-rank matrix recovery from sparse corruption [4, 5, 19, 20, 21, 22], and structured signal recovery from structured corruption [23, 24, 25, 26, 27, 28, 29].

Since this problem is ill-posed in general, tractable recovery is possible when both signal and corruption are suitably structured. Typical examples of structured signals (or corruptions) include sparse vectors and low-rank matrices. Let $f(\cdot)$ and $g(\cdot)$ be suitable proper convex functions which promote the structures of signal and corruption, respectively. There are three popular convex optimization approaches to reconstruct signal and corruption when different kinds of prior information are available. Specifically, when we have access to prior knowledge of either the signal, $f(\bm{x}^{\star})$, or the corruption, $g(\bm{v}^{\star})$, together with the noise level $\delta$ (in terms of the $\ell_{2}$ norm), it is natural to consider the following constrained convex recovery procedures

\min_{\bm{x},\bm{v}}~f(\bm{x}), \quad \text{s.t.}~ g(\bm{v})\leq g(\bm{v}^{\star}),~ \|\bm{y}-\bm{\Phi}\bm{x}-\bm{v}\|_{2}\leq\delta (2)

and

\min_{\bm{x},\bm{v}}~g(\bm{v}), \quad \text{s.t.}~ f(\bm{x})\leq f(\bm{x}^{\star}),~ \|\bm{y}-\bm{\Phi}\bm{x}-\bm{v}\|_{2}\leq\delta. (3)

When only the noise level $\delta$ is known, it is convenient to employ the partially penalized convex recovery procedure

\min_{\bm{x},\bm{v}}~f(\bm{x})+\lambda\cdot g(\bm{v}), \quad \text{s.t.}~ \|\bm{y}-\bm{\Phi}\bm{x}-\bm{v}\|_{2}\leq\delta, (4)

where $\lambda>0$ is a tradeoff parameter. When there is no prior knowledge available, it is practical to use the fully penalized convex recovery procedure

\min_{\bm{x},\bm{v}}~\frac{1}{2}\|\bm{y}-\bm{\Phi}\bm{x}-\bm{v}\|_{2}^{2}+\tau_{1}\cdot f(\bm{x})+\tau_{2}\cdot g(\bm{v}), (5)

where $\tau_{1},\tau_{2}>0$ are tradeoff parameters.
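To make these programs concrete, here is a minimal Python sketch of the fully penalized procedure (5) using the CVXPY package (the experiments in Section IV use the CVX Matlab package instead; the data, sizes, and choice $f=g=\ell_{1}$ below are hypothetical):

```python
import cvxpy as cp
import numpy as np

# Hypothetical problem data with f = g = the l1 norm.
m, n = 128, 256
rng = np.random.default_rng(0)
Phi = rng.standard_normal((m, n))
y = rng.standard_normal(m)
tau1, tau2 = 1.0, 1.0  # tradeoff parameters

# Fully penalized recovery procedure (5).
x, v = cp.Variable(n), cp.Variable(m)
objective = (0.5 * cp.sum_squares(y - Phi @ x - v)
             + tau1 * cp.norm1(x) + tau2 * cp.norm1(v))
cp.Problem(cp.Minimize(objective)).solve()
print(x.value[:5], v.value[:5])
```

The constrained procedures (2) and (3) are posed analogously, with the structure-promoting terms moved into the constraint set.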

A large number of numerical results in the literature suggest that phase transitions emerge in all three of the above recovery procedures (under random measurements); see, e.g., [10, 11, 13, 15, 16, 17, 23, 24, 25, 26, 27]. Concretely, for a specific recovery procedure, when the number of measurements exceeds a threshold, the procedure faithfully reconstructs both signal and corruption with high probability; when the number of measurements falls below the threshold, the procedure fails with high probability. A fundamental question then is:

Q1: How to determine the locations of these phase transitions accurately?

In addition, in partially and fully penalized recovery procedures, the optimization problems also rely on some tradeoff parameters. Another important question is:

Q2: How to choose these tradeoff parameters to achieve the best possible performance?

I-A Model Assumptions and Contributions

This paper provides answers to the above two questions in the absence of unstructured noise ($\bm{z}=\bm{0}$). In this scenario, the observation model becomes

\bm{y}=\bm{\Phi}\bm{x}^{\star}+\sqrt{m}\bm{v}^{\star}. (6)

Here we assume that $\bm{\Phi}$ is a Gaussian sensing matrix with i.i.d. entries ($\bm{\Phi}_{ij}\sim\mathcal{N}(0,1)$), and the factor $\sqrt{m}$ in (6) makes the columns of $\bm{\Phi}$ and $\sqrt{m}\bm{I}_{m}$ have the same scale, which makes our theoretical results more interpretable. Accordingly, the constrained convex recovery procedures become

\min_{\bm{x},\bm{v}}~f(\bm{x}), \quad \text{s.t.}~ \bm{y}=\bm{\Phi}\bm{x}+\sqrt{m}\bm{v},~ g(\bm{v})\leq g(\bm{v}^{\star}) (7)

and

\min_{\bm{x},\bm{v}}~g(\bm{v}), \quad \text{s.t.}~ \bm{y}=\bm{\Phi}\bm{x}+\sqrt{m}\bm{v},~ f(\bm{x})\leq f(\bm{x}^{\star}), (8)

and the partially and fully penalized recovery procedures reduce to

\min_{\bm{x},\bm{v}}~f(\bm{x})+\lambda\cdot g(\bm{v}), \quad \text{s.t.}~ \bm{y}=\bm{\Phi}\bm{x}+\sqrt{m}\bm{v}, (9)

where $\lambda>0$ is a tradeoff parameter. We declare that a recovery procedure succeeds if it has a unique solution $(\hat{\bm{x}},\hat{\bm{v}})$ satisfying $\hat{\bm{x}}=\bm{x}^{\star}$ and $\hat{\bm{v}}=\bm{v}^{\star}$; otherwise, it fails.

Under the above model settings, the contribution of this paper is twofold:

  • First, we develop a new analytical framework which allows us to establish the phase transition theory of both constrained and penalized recovery procedures in a unified way. Specifically, for the constrained recovery procedures (7) and (8), our analysis shows that their phase transitions are located at

    \mathscr{C}_{p}=\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}),

    where $\mathcal{T}_{f}(\bm{x}^{\star})$ (or $\mathcal{T}_{g}(\bm{v}^{\star})$) is the tangent cone induced by $f$ (or $g$) at the true signal $\bm{x}^{\star}$ (or corruption $\bm{v}^{\star}$), and $\omega(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})$ (or $\omega(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})$) is the spherical Gaussian width of this cone, defined in Section II. For the penalized recovery procedure (9), our results indicate that its critical point is located at

    \mathscr{C}_{p}(\lambda)=\min_{\alpha\leq t\leq\beta}2\cdot\zeta\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\eta^{2}\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1,

    where $\partial f(\bm{x}^{\star})$ (or $\partial g(\bm{v}^{\star})$) is the subdifferential of $f$ (or $g$) at the true signal $\bm{x}^{\star}$ (or corruption $\bm{v}^{\star}$), $\zeta(\SS)$ and $\eta^{2}(\SS)$ denote the Gaussian distance and the Gaussian squared distance to a set $\SS$, respectively, also defined in Section II, and $\alpha=\min_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}$ and $\beta=\max_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}$.

  • Second, we investigate the relationship between these two kinds of recovery procedures by utilizing the established critical points $\mathscr{C}_{p}$ and $\mathscr{C}_{p}(\lambda)$, which also reveals an optimal parameter selection strategy for $\lambda$ (in the sense of Lagrange theory):

    \lambda^{\star}=\arg\min_{\lambda>0}\mathscr{C}_{p}(\lambda).

    More precisely, under mild conditions, if the penalized procedure (9) is likely to succeed, then the constrained procedures (7) and (8) succeed with high probability; namely, if $m\geq\mathscr{C}_{p}(\lambda)$, then we have $m\geq\mathscr{C}_{p}-1$. Conversely, if the constrained procedures (7) and (8) are likely to succeed, then we can choose the tradeoff parameter $\lambda$ as $\lambda^{\star}$ such that the penalized procedure (9) succeeds with high probability; namely, if $m\geq\mathscr{C}_{p}$, then we have $m\geq\mathscr{C}_{p}(\lambda^{\star})-1$.

I-B Related Works

During the past few decades, there have been abundant works investigating phase transition phenomena in random convex optimization problems. Most of these works fall within the framework of compressed sensing ($\bm{v}^{\star}=\bm{0}$). In this paper, we focus on the scenario in which the random measurements are contaminated by some structured corruption. We review the works related to these two aspects in detail.

I-B1 Related Works in Compressed Sensing

The works in the context of compressed sensing can be roughly divided into four groups according to their analytical tools.

The early works study phase transitions in the context of sparse signal recovery via polytope angle calculations [30]. Under Gaussian measurements, Donoho [31] analyzes the $\ell_{1}$-minimization method in the asymptotic regime and establishes an empirically tight lower bound on the number of measurements required for successful recovery. In contrast to [31], Donoho and Tanner [32] prove the existence of sharp phase transitions in the asymptotic regime when using $\ell_{1}$-minimization to reconstruct sparse signals from random projections. These results are later extended to other related $\ell_{1}$-minimization problems. For instance, Donoho and Tanner [33, 34] identify a precise phase transition for the sparse signal recovery problem with an additional nonnegativity constraint; Khajehnejad et al. [35] introduce a nonuniform sparse model and analyze the performance of weighted $\ell_{1}$-minimization over that model; Xu and Hassibi [36] present sharp performance bounds on the number of measurements required for recovering approximately sparse signals from noisy measurements via $\ell_{1}$-minimization.

In [37, Fact 10.1], Amelunxen et al. explore the relationship between polytope angle theory and conical integral geometry in detail. Their results show that conical integral geometry can overcome many inherent limitations of polytope angle theory, such as dealing with the nuclear norm regularizer in low-rank matrix recovery problems, enabling non-asymptotic analysis, and establishing phase transitions from absolute success to absolute failure. In summary, [37] provides the first comprehensive analysis that explains phase transition phenomena in some random convex optimization problems. Other authors further use conical integral geometry to analyze convex optimization problems with random data. For example, Amelunxen and Bürgisser [38, 39] apply conical integral geometry to study conic optimization problems; Goldstein et al. [40] show that the sequence of conic intrinsic volumes can be approximated by a suitable Gaussian distribution in the high-dimensional limit, which provides more precise probabilities for successful and failed recovery.

Whereas the above works involve combinatorial geometry, some others use minimax decision theory to analyze phase transition problems. Several papers [41, 42, 43] have observed a close agreement between the asymptotic mean square error (MSE) and the location of the phase transition in linear inverse problems. Donoho et al. [44, 45] then show that the minimax MSE for denoising empirically predicts the locations of phase transitions in both sparse and low-rank recovery problems. Recently, Oymak and Hassibi [46] prove that the minimax MSE risk in structured signal denoising problems is almost the same as the statistical dimension. Combined with the results in [37], their results provide a theoretical explanation for using the minimax risk to describe the location of the phase transition in regularized linear inverse problems.

The last line of works studies the compressed sensing problem by utilizing some tools from Gaussian process theory. The key technique is a sharp comparison inequality for Gaussian processes due to Gordon [47]. Rudelson and Vershynin [48] first use Gordon's inequality to study the $\ell_{1}$-minimization problem. Stojnic [49] refines this method and obtains an empirically sharp success recovery condition under Gaussian measurements. Stojnic's calculation is then extended to more general settings. Oymak and Hassibi [50] use it to study the nuclear norm minimization problem. Chandrasekaran et al. [51] consider a more general case in which the regularizer can be any convex function. Stojnic [52, 53] has also investigated the error behaviors of $\ell_{1}$-minimization and its variants in random optimization problems; these works have been extended in a series of studies [54, 55] by Oymak, Thrampoulidis, and Hassibi. Although the aforementioned works provide detailed discussions of using Gaussian process theory to analyze random convex optimization problems, few of them consider the failure case for recovery. In a recent work [56], Oymak and Tropp demonstrate a universality property for randomized dimension reduction, which also proves the phase transition in the recovery of structured signals from a large class of measurement models.

I-B2 Related Works in Corrupted Sensing

There are several works in the literature studying the phase transition theory of corrupted sensing problems. For instance, in [37], Amelunxen et al. study the demixing problem $\bm{z}_{0}=\bm{x}_{0}+\bm{U}\bm{y}_{0}$, where $\bm{U}\in\mathbb{R}^{n\times n}$ is a random orthogonal matrix. They establish sharp phase transition results when the constrained convex program $\min_{\bm{x},\bm{y}}~f(\bm{x}),~\text{s.t.}~\bm{z}_{0}=\bm{x}+\bm{U}\bm{y},~g(\bm{y})\leq g(\bm{y}_{0})$ is used to solve this demixing problem. In [56], Oymak and Tropp consider a more general demixing model $\bm{y}=\bm{\Phi}_{0}\bm{x}_{0}+\bm{\Phi}_{1}\bm{x}_{1}$, where $\bm{\Phi}_{0},\bm{\Phi}_{1}\in\mathbb{R}^{m\times n}$ are two random transformation matrices drawn from a wide class of distributions. They reconstruct the original signal pair by solving $\min_{\bm{z}_{0},\bm{z}_{1}}\max\{f_{0}(\bm{z}_{0}),f_{1}(\bm{z}_{1})\},~\text{s.t.}~\bm{y}=\bm{\Phi}_{0}\bm{z}_{0}+\bm{\Phi}_{1}\bm{z}_{1}$ and establish the related phase transition theory. More related to this work, Foygel and Mackey [24] consider the corrupted sensing problem (1) and analyze both the constrained recovery procedures (2) and (3) and the partially penalized recovery procedure (4). In each case, they provide sufficient conditions for stable signal recovery from structured corruption with added unstructured noise under Gaussian measurements. Very recently, Chen and Liu [27] develop an extended matrix deviation inequality and use it to analyze all of the convex procedures (2)-(5) in a unified way under sub-Gaussian measurements. In terms of the failure case of corrupted sensing, Zhang, Liu, and Lei [25] establish a sharp threshold below which the constrained convex procedures (7) and (8) fail to recover both signal and corruption under Gaussian measurements. Together with the work in [24], their results provide a theoretical explanation for the phase transition when the constrained procedures are used to solve corrupted sensing problems.

I-C Organization

The remainder of the paper is organized as follows. We start by reviewing some preliminaries that are necessary for our subsequent analysis in Section II. Section III is devoted to presenting the main theoretical results of this paper. In Section IV, we present a series of numerical experiments to verify our theoretical results. We conclude the paper in Section V. All proofs of our main results are included in the appendices.

II Preliminaries

In this section, we introduce some notations and facts that underlie our analysis. Throughout the paper, $\mathbb{S}^{n-1}$ and $\mathbb{B}_{2}^{n}$ represent the unit sphere and unit ball in $\mathbb{R}^{n}$ under the $\ell_{2}$ norm, respectively.

II-A Convex Geometry

II-A1 Subdifferential

The subdifferential of a convex function $f$ at $\bm{x}$ is the set of vectors

\partial f(\bm{x})=\{\bm{u}\in\mathbb{R}^{n}:f(\bm{x}+\bm{d})-f(\bm{x})\geq\left\langle\bm{u},\bm{d}\right\rangle~\text{for all}~\bm{d}\in\mathbb{R}^{n}\}.

If $f$ is convex and $\bm{x}\in\operatorname{int}\operatorname{dom}(f)$, then $\partial f(\bm{x})$ is a nonempty, compact, convex set. For any number $t\geq 0$, we denote the scaled subdifferential by $t\cdot\partial f(\bm{x})=\{t\cdot\bm{u}:\bm{u}\in\partial f(\bm{x})\}$.

II-A2 Cone and Polar Cone

A subset $\mathcal{C}\subset\mathbb{R}^{n}$ is called a cone if for every $\bm{x}\in\mathcal{C}$ and $t\geq 0$, we have $t\cdot\bm{x}\in\mathcal{C}$. For a cone $\mathcal{C}\subset\mathbb{R}^{n}$, the polar cone of $\mathcal{C}$ is defined as

\mathcal{C}^{\circ}=\{\bm{u}\in\mathbb{R}^{n}:\left\langle\bm{u},\bm{x}\right\rangle\leq 0~\text{for all}~\bm{x}\in\mathcal{C}\}.

The polar cone $\mathcal{C}^{\circ}$ is always closed and convex. A subset $\SS\subset\mathbb{S}^{n-1}$ is called spherically convex if $\SS$ is the intersection of a convex cone with the unit sphere.

II-A3 Tangent Cone and Normal Cone

The tangent cone of a convex function $f$ at $\bm{x}$ is defined as the set of descent directions of $f$ at $\bm{x}$:

\mathcal{T}_{f}(\bm{x})=\{\bm{u}\in\mathbb{R}^{n}:~f(\bm{x}+t\cdot\bm{u})\leq f(\bm{x})~\text{for some}~t>0\}.

The tangent cone of a proper convex function is always convex, but it may not be closed.

The normal cone of a convex function $f$ at $\bm{x}$ is the polar of the tangent cone

\mathcal{N}_{f}(\bm{x})=\mathcal{T}_{f}^{\circ}(\bm{x})=\{\bm{u}\in\mathbb{R}^{n}:\left\langle\bm{u},\bm{d}\right\rangle\leq 0~\text{for all}~\bm{d}\in\mathcal{T}_{f}(\bm{x})\}.

Suppose that $\bm{0}\notin\partial f(\bm{x})$; then the normal cone can also be written as the cone hull of the subdifferential [57, Theorem 23.7]

\mathcal{N}_{f}(\bm{x})=\operatorname{cone}\{\partial f(\bm{x})\}=\{\bm{u}\in\mathbb{R}^{n}:~\bm{u}\in t\cdot\partial f(\bm{x})~\text{for some}~t\geq 0\}.

II-B Geometric Measures

II-B1 Gaussian Width

For any $\SS\subset\mathbb{R}^{n}$, a popular way to quantify the “size” of $\SS$ is through its Gaussian width

\omega(\SS):=\operatorname{\mathbb{E}}\sup_{\bm{x}\in\SS}\langle\bm{g},\bm{x}\rangle,~~\text{where}~~\bm{g}\sim\mathcal{N}(0,\bm{I}_{n}).
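As a quick illustration, for the unit $\ell_{1}$-ball the supremum equals $\|\bm{g}\|_{\infty}$ (by duality of the $\ell_{1}$ and $\ell_{\infty}$ norms), so its Gaussian width can be approximated by an empirical average; a minimal Python sketch with a hypothetical dimension:

```python
import numpy as np

# Monte Carlo estimate of the Gaussian width of the unit l1-ball:
# sup_{||x||_1 <= 1} <g, x> = ||g||_inf, so omega = E ||g||_inf.
n, trials = 128, 10000
g = np.random.default_rng(0).standard_normal((trials, n))
width = np.abs(g).max(axis=1).mean()
print(width)  # close to sqrt(2 * log(2 * n)) for large n
```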

II-B2 Gaussian Distance and Gaussian Squared Distance

Recall that the Euclidean distance to a set $\SS\subset\mathbb{R}^{n}$ is defined as

\operatorname{dist}(\bm{x},\SS):=\inf_{\bm{y}\in\SS}\|\bm{x}-\bm{y}\|_{2}.

We define the Gaussian distance to a set $\SS\subset\mathbb{R}^{n}$ as

\zeta(\SS):=\operatorname{\mathbb{E}}\operatorname{dist}(\bm{g},\SS)=\operatorname{\mathbb{E}}\inf_{\bm{y}\in\SS}\|\bm{g}-\bm{y}\|_{2},~~\text{where}~~\bm{g}\sim\mathcal{N}(0,\bm{I}_{n}).

Similarly, the Gaussian squared distance to a set $\SS\subset\mathbb{R}^{n}$ is defined as

\eta^{2}(\SS):=\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{g},\SS)=\operatorname{\mathbb{E}}\inf_{\bm{y}\in\SS}\|\bm{g}-\bm{y}\|_{2}^{2},~~\text{where}~~\bm{g}\sim\mathcal{N}(0,\bm{I}_{n}).

These two quantities are closely related (the lower and upper bounds in (10) follow from Fact 4 in Appendix E and Jensen's inequality, respectively):

\sqrt{\eta^{2}(\SS)-1}\leq\zeta(\SS)\leq\sqrt{\eta^{2}(\SS)}. (10)
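When the distance to $\SS$ has a closed form, both quantities are easy to estimate by Monte Carlo. For example, for $\SS=t\cdot\partial\|\bm{x}^{\star}\|_{1}$ with $\bm{x}^{\star}$ $s$-sparse, the squared distance splits into on-support terms and off-support soft-thresholding terms; a minimal sketch checking the sandwich bound (10), with hypothetical sizes:

```python
import numpy as np

def dist_l1_subdiff(g, sign_supp, t):
    """Distance from g to t * (subdifferential of the l1 norm), where
    sign_supp holds sign(x_i) on the support of x and 0 elsewhere."""
    on = sign_supp != 0
    d2 = np.sum((g[on] - t * sign_supp[on]) ** 2)         # support terms
    d2 += np.sum(np.maximum(np.abs(g[~on]) - t, 0) ** 2)  # shrinkage terms
    return np.sqrt(d2)

n, s, t, trials = 128, 10, 1.5, 5000
rng = np.random.default_rng(1)
sign_supp = np.zeros(n)
sign_supp[:s] = 1.0
d = np.array([dist_l1_subdiff(rng.standard_normal(n), sign_supp, t)
              for _ in range(trials)])
zeta, eta2 = d.mean(), (d ** 2).mean()
print(np.sqrt(eta2 - 1), zeta, np.sqrt(eta2))  # increasing, as in (10)
```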

II-C Tools from Gaussian Analysis

Our analysis makes heavy use of two well-known results in Gaussian analysis. The first one is a comparison principle for Gaussian processes due to Gordon [47, Theorem 1.1]. This result provides a convenient way to bound the probability of an event from below by that of another one. It is worth noting that the original lemma can be naturally extended from discrete index sets to compact index sets, see e.g., [54, Lemma C.1].

Fact 1 (Gordon’s Lemma).

[47, Theorem 1.1] Let $\{X_{ij}\}$ and $\{Y_{ij}\}$, $1\leq i\leq n$, $1\leq j\leq m$, be two centered Gaussian processes. If $X_{ij}$ and $Y_{ij}$ satisfy the following inequalities:

1. $\operatorname{\mathbb{E}}[X_{ij}^{2}]=\operatorname{\mathbb{E}}[Y_{ij}^{2}]$;
2. $\operatorname{\mathbb{E}}[X_{ij}X_{ik}]\leq\operatorname{\mathbb{E}}[Y_{ij}Y_{ik}]$;
3. $\operatorname{\mathbb{E}}[X_{ij}X_{lk}]\geq\operatorname{\mathbb{E}}[Y_{ij}Y_{lk}]$ for all $i\neq l$;

then we have

\mathbb{P}\left\{\cap_{i}\cup_{j}[X_{ij}\geq\tau_{ij}]\right\}\geq\mathbb{P}\left\{\cap_{i}\cup_{j}[Y_{ij}\geq\tau_{ij}]\right\}

for all choices of $\tau_{ij}\in\mathbb{R}$.

The second one is the Gaussian concentration inequality, which allows us to establish tail bounds for different kinds of Gaussian Lipschitz functions. Recall that a function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is $L$-Lipschitz with respect to the Euclidean norm $\|\cdot\|_{2}$ if

|f(\bm{x})-f(\bm{y})|\leq L\|\bm{x}-\bm{y}\|_{2}~~\text{for all}~~\bm{x},\bm{y}\in\mathbb{R}^{n}.

Then the Gaussian concentration inequality reads as

Fact 2 (Gaussian concentration inequality).

[58, Theorem 1.7.6] Let $\bm{x}\sim\mathcal{N}(0,\bm{I}_{n})$, and let $f:\mathbb{R}^{n}\to\mathbb{R}$ be $L$-Lipschitz with respect to the Euclidean metric. Then for any $\epsilon\geq 0$, we have

\mathbb{P}\left\{f(\bm{x})\geq\operatorname{\mathbb{E}}f(\bm{x})+\epsilon\right\}\leq\exp\left(\frac{-\epsilon^{2}}{2L^{2}}\right),

and

\mathbb{P}\left\{f(\bm{x})\leq\operatorname{\mathbb{E}}f(\bm{x})-\epsilon\right\}\leq\exp\left(\frac{-\epsilon^{2}}{2L^{2}}\right).

III Main Results

In this section, we present our main results. Section III-A is devoted to analyzing the phase transition of the constrained recovery procedures (7) and (8). The phase transition of the penalized recovery procedure (9) is established in Section III-B. Section III-C explores the relationship between these two kinds of recovery procedures and illustrates how to choose the optimal tradeoff parameter $\lambda$. The proofs are included in the appendices.

III-A Phase Transition of the Constrained Recovery Procedures

We start by analyzing the phase transition of the constrained recovery procedures (7) and (8). Recall that a recovery procedure succeeds if it has a unique optimal solution which coincides with the true value; otherwise it fails. First of all, it is necessary to specify some analytic conditions under which the constrained procedures (7) and (8) succeed or fail to recover the original signal and corruption. To this end, we have the following lemma.

Lemma 1 (Sufficient conditions for successful and failed recovery).

Suppose $\mathcal{T}_{f}(\bm{x}^{\star})$ and $\mathcal{T}_{g}(\bm{v}^{\star})$ are nonempty and closed. If

\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}(\bm{x}^{\star})\times\mathcal{T}_{g}(\bm{v}^{\star})\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}>0, (11)

then the constrained procedures (7) and (8) succeed. If

\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}(\bm{x}^{\star})\times\mathcal{T}_{g}(\bm{v}^{\star})\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}=0, (12)

then the constrained procedures (7) and (8) fail. Furthermore, a sufficient condition for (12) to hold is

\min_{\bm{r}\in\mathbb{S}^{m-1}}\min_{\bm{s}\in(\mathcal{T}_{f}(\bm{x}^{\star})\times\mathcal{T}_{g}(\bm{v}^{\star}))^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}>0, (13)

where $(\mathcal{T}_{f}(\bm{x}^{\star})\times\mathcal{T}_{g}(\bm{v}^{\star}))^{\circ}$ denotes the polar cone of $\mathcal{T}_{f}(\bm{x}^{\star})\times\mathcal{T}_{g}(\bm{v}^{\star})$, $\bm{A}=[\bm{\Phi},\sqrt{m}\bm{I}_{m}]$, and $\bm{I}_{m}$ is the $m$-dimensional identity matrix.

Armed with this lemma, our first theorem shows that the phase transition of the constrained recovery procedures (7) and (8) occurs around the sum of the squared spherical Gaussian widths of $\mathcal{T}_{f}(\bm{x}^{\star})$ and $\mathcal{T}_{g}(\bm{v}^{\star})$. This result ensures that the recovery is likely to succeed when the number of measurements exceeds the critical point; conversely, the recovery is likely to fail when the number of measurements is smaller than the critical point.

Theorem 1 (Phase transition of constrained recovery procedures).

Consider the corrupted sensing model (6) with Gaussian measurements. Assume $\mathcal{T}_{f}(\bm{x}^{\star})$ and $\mathcal{T}_{g}(\bm{v}^{\star})$ are nonempty and closed. Define $\mathscr{C}_{p}:=\omega^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1}\right)+\omega^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)$. If the number of measurements satisfies

\sqrt{m}\geq\sqrt{\mathscr{C}_{p}}+\sqrt{2}+\epsilon, (14)

then the constrained procedures (7) and (8) succeed with probability at least $1-2\exp\left(\frac{-\epsilon^{2}}{4}\right)$. If the number of measurements satisfies

\sqrt{m}\leq\sqrt{\mathscr{C}_{p}}-\epsilon, (15)

then the constrained procedures (7) and (8) fail with probability at least $1-2\exp\left(\frac{-\epsilon^{2}}{4}\right)$.

Remark 1 (Relation to existing results).

In [24, Theorem 1], Foygel and Mackey have shown that when

\sqrt{m}\geq\sqrt{\gamma^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{B}_{2}^{n}\right)+\gamma^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{B}_{2}^{m}\right)}+\frac{1}{\sqrt{2}}+\frac{1}{\sqrt{2\pi}}+\epsilon,

the constrained procedures (7) and (8) succeed with probability at least $1-\exp\left(-\epsilon^{2}/2\right)$. Here, the Gaussian squared complexity of a set $\SS\subset\mathbb{R}^{n}$ is defined as $\gamma^{2}(\SS):=\operatorname{\mathbb{E}}\left[\left(\sup_{\bm{x}\in\SS}\langle\bm{g},\bm{x}\rangle\right)_{+}^{2}\right]$ with $\bm{g}\sim\mathcal{N}(0,\bm{I}_{n})$ and $(a)_{+}=\max\{a,0\}$. On the other hand, the third author and his coauthors [25, Theorem 1] have demonstrated that when

\sqrt{m}\leq\sqrt{\omega^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1}\right)+\omega^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)}-\epsilon,

the constrained procedures (7) and (8) fail with probability at least $1-\exp\left(-\epsilon^{2}/2\right)$. Since $\gamma^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{B}_{2}^{n}\right)$ (or $\gamma^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{B}_{2}^{m}\right)$) is very close to $\omega^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1}\right)$ (or $\omega^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)$), the above two results have essentially established the phase transition theory of the constrained procedures (7) and (8).

However, in this paper, we have developed a new analytical framework which allows us to unify the results in both success and failure cases in terms of Gaussian width, which makes the phase transition theory of the constrained recovery procedures more natural. More importantly, this framework can be easily applied to establish the phase transition theory of the penalized recovery procedure.

Remark 2 (Related works).

In [37], Amelunxen et al. consider the following demixing problem

\bm{y}=\bm{U}\bm{x}^{\star}+\bm{v}^{\star},

where $\bm{x}^{\star},\bm{v}^{\star}\in\mathbb{R}^{n}$ are unknown structured signals and $\bm{U}\in\mathbb{R}^{n\times n}$ is a known orthogonal matrix. They have shown that the phase transition occurs around $\delta(\mathcal{T}_{f}(\bm{x}^{\star}))+\delta(\mathcal{T}_{g}(\bm{v}^{\star}))$ when the constrained recovery procedures are employed to solve this problem. Here, the statistical dimension of a convex cone $\mathcal{C}\subset\mathbb{R}^{n}$ is defined as $\delta(\mathcal{C}):=\operatorname{\mathbb{E}}\left[\left(\sup_{\bm{x}\in\mathcal{C}\cap\mathbb{B}_{2}^{n}}\langle\bm{g},\bm{x}\rangle\right)^{2}\right]$ with $\bm{g}\sim\mathcal{N}(0,\bm{I}_{n})$. Although the model assumptions of this demixing problem are different from ours, the results in the two cases are essentially consistent, since we have $\delta(\mathcal{T}_{f}(\bm{x}^{\star}))+\delta(\mathcal{T}_{g}(\bm{v}^{\star}))\approx\mathscr{C}_{p}$ (by Fact 7 in Appendix E).

Recently, Oymak and Tropp [56] consider a more general demixing model

\bm{y}=\bm{\Phi}_{0}\bm{x}^{\star}+\bm{\Phi}_{1}\bm{v}^{\star},

where $\bm{x}^{\star},\bm{v}^{\star}\in\mathbb{R}^{n}$ are unknown structured signals and $\bm{\Phi}_{0},\bm{\Phi}_{1}\in\mathbb{R}^{m\times n}$ are random matrices. They have demonstrated that the critical point of the constrained recovery procedures is located near $\delta(\mathcal{T}_{f}(\bm{x}^{\star}))+\delta(\mathcal{T}_{g}(\bm{v}^{\star}))$ for a large class of random matrices drawn from certain models. Their model assumptions are also different from ours, because $\bm{\Phi}_{1}$ is a deterministic matrix in our case, which makes our analysis different from that of [56].

III-A1 How to evaluate the critical point $\mathscr{C}_{p}$?

Theorem 1 demonstrates that the phase transition of the constrained recovery procedures occurs around

\mathscr{C}_{p}=\omega^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1}\right)+\omega^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right).

A natural question then is how to determine the value of this critical point. To this end, it suffices to estimate $\omega^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1}\right)$ and $\omega^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)$. It is now well known that there are standard recipes to estimate these two quantities; see, e.g., [51, 37, 24]. Actually, Facts 7 and 10 indicate that $\mathscr{C}_{p}$ can be accurately approximated by

\min_{t\geq 0}\eta^{2}(t\cdot\partial f(\bm{x}^{\star}))+\min_{t\geq 0}\eta^{2}(t\cdot\partial g(\bm{v}^{\star})). (16)

To illustrate this result (16), we consider two typical examples: sparse signal recovery from sparse corruption and low-rank matrix recovery from sparse corruption. In the first example, we assume $\bm{x}^{\star}\in\mathbb{R}^{n}$ and $\bm{v}^{\star}\in\mathbb{R}^{m}$ are $s$-sparse and $k$-sparse vectors, respectively. Direct calculations (see Appendix D) lead to

\min_{t\geq 0}\eta^{2}(t\cdot\partial\|\bm{x}^{\star}\|_{1})=\min_{t\geq 0}\left\{s(1+t^{2})+\frac{2(n-s)}{\sqrt{2\pi}}\left((1+t^{2})\int_{t}^{\infty}e^{-x^{2}/2}dx-te^{-t^{2}/2}\right)\right\}

and

\min_{t\geq 0}\eta^{2}(t\cdot\partial\|\bm{v}^{\star}\|_{1})=\min_{t\geq 0}\left\{k(1+t^{2})+\frac{2(m-k)}{\sqrt{2\pi}}\left((1+t^{2})\int_{t}^{\infty}e^{-x^{2}/2}dx-te^{-t^{2}/2}\right)\right\}.
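Each of these is a smooth one-dimensional minimization, so the approximation (16) of $\mathscr{C}_{p}$ is easy to evaluate numerically; a minimal SciPy sketch (the sizes $n$, $m$, $s$, $k$ below are hypothetical, and the integrals are expressed through the standard normal tail norm.sf and density norm.pdf):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def eta2_l1(t, dim, s):
    """Closed-form Gaussian squared distance eta^2(t * d||x||_1) at an
    s-sparse point in R^dim (the bracketed expression above)."""
    return (s * (1 + t**2)
            + 2 * (dim - s) * ((1 + t**2) * norm.sf(t) - t * norm.pdf(t)))

def min_eta2_l1(dim, s):
    # One-dimensional smooth minimization over t >= 0.
    return minimize_scalar(eta2_l1, bounds=(0, 20), args=(dim, s),
                           method="bounded").fun

n = m = 128
s, k = 10, 10
C_p = min_eta2_l1(n, s) + min_eta2_l1(m, k)  # approximation (16) of C_p
print(C_p)
```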

In the case of low-rank matrix recovery from sparse corruption, suppose $\bm{X}^{\star}\in\mathbb{R}^{n\times n}$ is an $r$-rank matrix. The Gaussian squared distance of the signal in (16) is given by

\min_{t\geq 0}\eta^{2}(t\cdot\partial\|\bm{X}^{\star}\|_{*})=\min_{t\geq 0}\left\{r(2n-r+t^{2})+\operatorname{\mathbb{E}}\sum_{i=1}^{n-r}\text{shrink}\left(\sigma_{i}(\bm{G}_{2}),t\right)^{2}\right\},

where $\bm{G}_{2}$ is an $(n-r)\times(n-r)$ standard Gaussian matrix, $\sigma_{i}(\bm{G}_{2})$ is the $i$-th largest singular value of $\bm{G}_{2}$, and $\text{shrink}(a,t)=\operatorname{sign}(a)\cdot\max\{|a|-t,0\}$ is the soft-thresholding operator. The Gaussian squared distance of the corruption can be evaluated as in the first example. In addition, we should mention that it is also possible to estimate the Gaussian width in $\mathscr{C}_{p}$ numerically by approximating the expectation in its definition with an empirical average.
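The expectation over the singular values of $\bm{G}_{2}$ has no simple closed form, but the empirical-average approach just mentioned applies directly; a minimal sketch (with hypothetical $n$, $r$ and a coarse grid over $t$):

```python
import numpy as np

def eta2_nuclear(t, n, r, trials=200, seed=0):
    """Monte Carlo estimate of eta^2(t * d||X||_*) at a rank-r n x n
    matrix, following the formula above."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        G2 = rng.standard_normal((n - r, n - r))
        sigma = np.linalg.svd(G2, compute_uv=False)
        total += np.sum(np.maximum(sigma - t, 0.0) ** 2)  # shrink terms
    return r * (2 * n - r + t**2) + total / trials

# Coarse grid search for the minimum over t (hypothetical n, r).
n, r = 20, 2
ts = np.linspace(0.1, 3 * np.sqrt(n), 50)
print(min(eta2_nuclear(t, n, r) for t in ts))
```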

III-B Phase Transition of the Penalized Recovery Procedure

We now study the phase transition theory of the penalized recovery procedure (9). First, we need to establish sufficient conditions under which the penalized recovery procedure (9) succeeds or fails to recover the original signal and corruption.

Lemma 2 (Sufficient conditions for successful and failed recovery).

The penalized problem (9) succeeds if

\bm{0}\in\bm{\Phi}^{T}\cdot\partial g(\bm{v}^{\star})-\frac{\sqrt{m}}{\lambda}\partial f(\bm{x}^{\star}). (17)

The penalized problem (9) fails if

\min_{\bm{a}\in\partial f(\bm{x}^{\star}),\bm{b}\in\partial g(\bm{v}^{\star})}\left\|\bm{\Phi}^{T}\bm{b}-\frac{\sqrt{m}}{\lambda}\bm{a}\right\|_{2}>0. (18)

Define the joint cone $\mathcal{T}_{J}=\{t\cdot(\bm{a}^{T},\bm{b}^{T})^{T}\in\mathbb{R}^{n}\times\mathbb{R}^{m}:t\geq 0,~\bm{a}\in\partial f(\bm{x}^{\star}),~\bm{b}\in\partial g(\bm{v}^{\star})\}$. Then, a sufficient condition for (17) to hold is

\min_{\bm{r}\in\mathbb{S}^{n-1}}\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\|\bm{s}-\bm{M}^{T}\bm{r}\|_{2}>0, (19)

where $\mathcal{T}_{J}^{\circ}$ denotes the polar cone of $\mathcal{T}_{J}$, $\bm{M}=[-\frac{\sqrt{m}}{\lambda}\bm{I}_{n},\bm{\Phi}^{T}]$, and $\bm{I}_{n}$ is the $n$-dimensional identity matrix.

Our second theorem shows that the critical point of the penalized recovery procedure is located near $\mathscr{C}_{p}(\lambda)$, which is determined by two Gaussian distances to scaled subdifferentials. This result asserts that the recovery succeeds with high probability when the number of measurements is larger than the critical point; on the other hand, the recovery fails with high probability when the number of measurements is below the critical point. In addition, the critical point $\mathscr{C}_{p}(\lambda)$ is influenced by the tradeoff parameter $\lambda$.

Theorem 2 (Phase transition of penalized recovery procedure).

Consider the corrupted sensing model (6) with Gaussian measurements. Suppose that the subdifferential $\partial g(\bm{v}^{\star})$ does not contain the origin. Let $\alpha=\min_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}$ and $\beta=\max_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}$. Define

\mathscr{C}_{p}(\lambda):=\min_{\alpha\leq t\leq\beta}2\cdot\zeta\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\eta^{2}\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1.

If the number of measurements satisfies

m\geq\mathscr{C}_{p}(\lambda)+\epsilon, (20)

then the penalized problem (9) succeeds with probability at least $1-2\exp\left(\frac{-\epsilon^{2}}{16}\right)$. If the number of measurements satisfies

m\leq\mathscr{C}_{p}(\lambda)-\epsilon, (21)

then the penalized problem (9) fails with probability at least $1-2\exp\left(\frac{-\epsilon^{2}}{16}\right)$.

Remark 3 (Related works).

In [24], Foygel and Mackey have shown that, under Gaussian measurements, when

\sqrt{m}\geq 2\cdot\sqrt{\eta^{2}(\lambda_{1}\cdot\partial f(\bm{x}^{\star}))}+\sqrt{\eta^{2}(\lambda_{2}\cdot\partial g(\bm{v}^{\star}))}+3\sqrt{2\pi}+\frac{1}{\sqrt{2}}+\frac{1}{\sqrt{2\pi}}+\epsilon,

the penalized problem (9) succeeds with probability at least $1-\exp(-\epsilon^{2}/2)$. Here $\lambda=\lambda_{2}/\lambda_{1}$. Very recently, Chen and Liu [27] have shown that, under sub-Gaussian measurements, when

m\geq CK^{4}\left[\eta^{2}(\lambda_{1}\cdot\partial f(\bm{x}^{\star}))+\eta^{2}(\lambda_{2}\cdot\partial g(\bm{v}^{\star}))\right],

the penalized problem (9) succeeds with high probability. Here $C$ is an absolute constant and $K$ is an upper bound on the sub-Gaussian norms of the rows of the sensing matrix. Their numerical experiments demonstrate that these two sufficient conditions for successful recovery are not sharp. To the best of our knowledge, the present results (Theorem 2) are the first to establish the complete phase transition theory of the penalized recovery procedure (9), which closes an important open problem in the literature; see, e.g., [23], [24], and [27].

III-B1 How to evaluate the critical point $\mathscr{C}_{p}(\lambda)$?

Theorem 2 suggests that the phase transition of the penalized recovery procedure occurs around

\mathscr{C}_{p}(\lambda)=\min_{\alpha\leq t\leq\beta}2\cdot\zeta\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\eta^{2}\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1.

The next important question is how to calculate $\mathscr{C}_{p}(\lambda)$ accurately. To this end, we have the following lemma.

Lemma 3.

The quantity $\mathscr{C}_{p}(\lambda)$ can be bounded as

\min_{\alpha\leq t\leq\beta}\left[2\cdot\sqrt{\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)-1}-2\cdot\omega\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)+m\right]\leq\mathscr{C}_{p}(\lambda)\leq\min_{\alpha\leq t\leq\beta}\left[2\cdot\sqrt{\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)}-2\cdot\omega\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)+m\right].
Proof.

Note that

\mathscr{C}_{p}(\lambda)=\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1\right]
=\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\min_{\bm{b}\in\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}}\|\bm{h}-\bm{b}\|_{2}^{2}-1\right]
=\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)-2\max_{\bm{b}\in\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}}\left\langle\bm{h},\bm{b}\right\rangle+m\right]. (22)

It follows from (10) that

\sqrt{\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)-1}\leq\zeta\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)\leq\sqrt{\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)}.

Substituting the above bound into (22) completes the proof. ∎

Lemma 3 demonstrates that $\mathscr{C}_{p}(\lambda)$ can be accurately estimated by

\min_{\alpha\leq t\leq\beta}\left[2\cdot\sqrt{\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)}-2\cdot\omega\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)+m\right]. (23)

Thus it is sufficient to estimate $\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)$ and $\omega\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)$. There are also some standard methods to estimate these two quantities; see, e.g., [51, 37, 24]. To illustrate this result (23), we again consider the two typical examples: sparse signal recovery from sparse corruption and low-rank matrix recovery from sparse corruption. In the first example, we assume the signal $\bm{x}^{\star}\in\mathbb{R}^{n}$ and the corruption $\bm{v}^{\star}\in\mathbb{R}^{m}$ are $s$-sparse and $k$-sparse vectors, respectively; then we can obtain (see Appendix D for details)

\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial\|\bm{x}^{\star}\|_{1}\right)=s\left(1+\frac{m}{\lambda^{2}t^{2}}\right)+\frac{2(n-s)}{\sqrt{2\pi}}\left(\left(1+\frac{m}{\lambda^{2}t^{2}}\right)\int_{\frac{\sqrt{m}}{\lambda t}}^{\infty}e^{-x^{2}/2}dx-\frac{\sqrt{m}}{\lambda t}e^{-\frac{m}{2\lambda^{2}t^{2}}}\right)

and

\omega\left(\frac{1}{t}\partial\|\bm{v}^{\star}\|_{1}\cap\mathbb{S}^{m-1}\right)=\sqrt{\frac{2}{\pi}(m-k)\left(1-\frac{k}{t^{2}}\right)}.

The parameter $t$ in (23) ranges from $\alpha=\sqrt{k}$ to $\beta=\sqrt{m}$. In the example of low-rank matrix recovery from sparse corruption, where the signal $\bm{X}^{\star}\in\mathbb{R}^{n\times n}$ is an $r$-rank matrix, the first Gaussian squared distance in (23) can be calculated as

\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial\|\bm{X}^{\star}\|_{*}\right)=r\left(2n-r+\frac{m}{\lambda^{2}t^{2}}\right)+\operatorname{\mathbb{E}}\sum_{i=1}^{n-r}\text{shrink}\left(\sigma_{i}(\bm{G}_{2}),\frac{\sqrt{m}}{\lambda t}\right)^{2},

where $\bm{G}_{2}$ is an $(n-r)\times(n-r)$ standard Gaussian matrix and $\sigma_{i}(\bm{G}_{2})$ is the $i$-th largest singular value of $\bm{G}_{2}$. The calculations of the Gaussian width of the corruption and the range of the parameter $t$ are similar to the first example. In addition, it is possible to estimate the Gaussian distance and Gaussian width numerically by approximating the expectations in their definitions with empirical averages.
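Combining these closed forms, the estimate (23) of $\mathscr{C}_{p}(\lambda)$ in the sparse-sparse case reduces to a one-dimensional search over $t\in[\sqrt{k},\sqrt{m}]$; a minimal sketch (all sizes hypothetical):

```python
import numpy as np
from scipy.stats import norm

def eta2_l1(u, dim, s):
    # Closed form from Appendix D, via the normal tail/density functions.
    return (s * (1 + u**2)
            + 2 * (dim - s) * ((1 + u**2) * norm.sf(u) - u * norm.pdf(u)))

def C_p_lambda(lam, m, n, s, k, grid=2000):
    """Estimate (23) of the penalized critical point for l1 signal
    structure and l1 corruption structure."""
    ts = np.linspace(np.sqrt(k) + 1e-9, np.sqrt(m), grid)
    u = np.sqrt(m) / (lam * ts)
    omega = np.sqrt(2 / np.pi * (m - k) * (1 - k / ts**2))
    return np.min(2 * np.sqrt(eta2_l1(u, n, s)) - 2 * omega + m)

print(C_p_lambda(lam=1.0, m=128, n=128, s=10, k=10))
```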

Figure 1: We assume $\bm{x}^{\star}\in\mathbb{R}^{128}$ is an $s$-sparse vector and $\bm{v}^{\star}\in\mathbb{R}^{128}$ is a $k$-sparse vector. $f(\cdot)$ and $g(\cdot)$ are set to be the $\ell_{1}$-norm. Fix the sample size $m=128$. The solid red line corresponds to the phase transition threshold of the constrained procedures, $m=\mathscr{C}_{p}$; the dashed blue lines correspond to the phase transition thresholds of the penalized recovery procedure with different values of $\lambda$, $m=\mathscr{C}_{p}(\lambda)-\epsilon$. The recovery of both signal and corruption is likely to succeed when the sparsity pair $(s,k)$ lies below the phase transition thresholds; the recovery is likely to fail when $(s,k)$ lies above them. It is not hard to see that: (I) the successful areas of the penalized recovery procedure with different values of $\lambda$ are always smaller than that of the constrained methods; (II) even for the critical points (e.g., $(s,k)=(s_{0},k_{0}),(s_{1},k_{1}),(s_{2},k_{2})$) that lie on the phase transition threshold of the constrained recovery procedures (which might represent the reconstruction limit of the constrained procedures), we can still choose corresponding tradeoff parameters (e.g., $\lambda=0.75,1.50,3.50$) such that the penalized problem succeeds too (with similar probability). Thus, the successful area of the constrained procedures can be regarded as the union of those of the penalized one with different values of $\lambda$.

III-C Relationship between Constrained and Penalized Recovery Procedures and Optimal Choice of $\lambda$

The theory of Lagrange multipliers [57, Section 28] asserts that solving the constrained recovery procedures is essentially equivalent to solving the penalized problem with a best choice of the tradeoff parameter $\lambda$. More precisely, this equivalence consists of the following two aspects [23, Appendix A]. On the one hand,

  • (I).

    Suppose the penalized procedure (9) succeeds for some value $\lambda>0$. Then the constrained ones (7) and (8) succeed.

On the other hand, as a partial converse to (I), one has

  • (II).

    Suppose that the subdifferentials $\partial f(\bm{x}^{\star})$ and $\partial g(\bm{v}^{\star})$ do not contain the origin. If the constrained procedures (7) and (8) succeed, then there exists a parameter $\lambda>0$ such that $(\bm{x}^{\star},\bm{v}^{\star})$ is an optimal point for the penalized one (9).

The above relations indicate that the performance of the constrained procedures can be interpreted as the best possible performance of the penalized problem. However, the main difficulty in these results lies in how to select a suitable tradeoff parameter $\lambda$ that leads to this equivalence. Since we have identified the precise phase transitions of both constrained and penalized recovery procedures, we are able to explore the relationship between these two kinds of approaches in a quantitative way, which in turn implies an explicit strategy for choosing the optimal $\lambda$.

Theorem 3 (Relationship between constrained and penalized recovery procedures and optimal choice of $\lambda$).

Assume that $\mathcal{T}_{f}(\bm{x}^{\star})$ and $\mathcal{T}_{g}(\bm{v}^{\star})$ are nonempty and closed, and that the subdifferentials $\partial f(\bm{x}^{\star})$ and $\partial g(\bm{v}^{\star})$ do not contain the origin. If $m\geq\mathscr{C}_{p}(\lambda)$, then we have

m\geq\mathscr{C}_{p}-1.

On the other hand, if $m\geq\mathscr{C}_{p}$, then we can choose the tradeoff parameter as

\lambda^{\star}=\arg\min_{\lambda>0}\mathscr{C}_{p}(\lambda), (24)

such that

m\geq\mathscr{C}_{p}(\lambda^{\star})-5.

(As shown in the proof of Theorem 3, the gap 5 can easily be reduced by introducing an extra condition: if we further assume $\omega(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})\geq 4$, then we have $m\geq\mathscr{C}_{p}(\lambda^{\star})-1$. This condition is easy to satisfy in practical applications.)

Combining the phase transition results in Theorems 1 and 2, the first part of Theorem 3 implies that if the penalized procedure (9) is likely to succeed, then the constrained procedures (7) and (8) succeed with high probability. Similarly, the second part of Theorem 3 shows that if the constrained procedures (7) and (8) are likely to succeed, then we can choose the tradeoff parameter $\lambda$ as in (24) such that the penalized procedure (9) succeeds with high probability. Thus our results provide a quantitative characterization of the relations (I) and (II).

The results in Theorem 3 also enjoy a geometric interpretation in the phase transition diagram: the first part implies that the successful area of the penalized recovery procedure should be smaller than that of the constrained procedures; the second part indicates that for any point in the successful area of the constrained recovery procedures, we can find at least one $\lambda$ such that this point also belongs to the successful area of the corresponding penalized recovery procedure. In other words, the successful area of the constrained procedures can be regarded as the union of those of the penalized one (with different values of $\lambda$). Fig. 1 illustrates this relationship in the case of sparse signal recovery from sparse corruption.

Moreover, Theorem 3 suggests an explicit way to choose the best parameter $\lambda$ predicted by the Lagrange theory, i.e.,

\lambda^{\star}=\arg\min_{\lambda>0}\mathscr{C}_{p}(\lambda),

which is equivalent to

\lambda^{\star}=\arg\min_{\lambda>0}\zeta\left(\frac{\sqrt{m}}{\lambda t^{\star}}\partial f(\bm{x}^{\star})\right)~~\text{with}~~t^{\star}=\arg\min_{\alpha\leq t\leq\beta}\eta^{2}\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right). (25)

We provide some insights into this parameter selection strategy. Recall that Theorem 2 demonstrates that the penalized procedure succeeds with high probability if the number of measurements exceeds the critical point $\mathscr{C}_{p}(\lambda)$. The strategy (24) thus implies that we should pick the $\lambda$ which makes the number of observations required for successful recovery of the penalized procedure as small as possible. Another explanation comes from the relationship between the two kinds of recovery procedures. For a given corrupted sensing problem (with fixed $\bm{x}^{\star}$ and $\bm{v}^{\star}$), the first part of Theorem 3 indicates that the phase transition threshold of the penalized procedure is always bounded from below by that of the constrained ones; it is therefore natural to choose the $\lambda$ that achieves the smallest possible gap between these two thresholds, i.e., $\lambda^{\star}=\arg\min_{\lambda>0}(\mathscr{C}_{p}(\lambda)-\mathscr{C}_{p})$, which also leads to the strategy (24).
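For the sparse-sparse example, the two one-dimensional problems in (25) can be solved numerically from the closed forms of Section III-B1; a minimal sketch (approximating $\zeta$ by $\sqrt{\eta^{2}}$ via the sandwich bound (10), and using the identity $\eta^{2}(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1})=m+1-2\omega(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1})$ from the derivation of (22); all sizes hypothetical):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def eta2_l1(u, dim, s):  # closed form from Appendix D
    return (s * (1 + u**2)
            + 2 * (dim - s) * ((1 + u**2) * norm.sf(u) - u * norm.pdf(u)))

m = n = 128
s, k = 10, 10

# t* in (25): minimizing eta^2((1/t) dg cap S^{m-1}) = m + 1 - 2*omega(t)
# amounts to maximizing omega(t) over t in [sqrt(k), sqrt(m)].
ts = np.linspace(np.sqrt(k) + 1e-9, np.sqrt(m), 2000)
omega = np.sqrt(2 / np.pi * (m - k) * (1 - k / ts**2))
t_star = ts[np.argmax(omega)]

# lambda* in (25): minimize zeta(u * df) over the scale u = sqrt(m)/(lam t*),
# with zeta approximated by sqrt(eta^2) via the bound (10).
u_star = minimize_scalar(eta2_l1, bounds=(1e-6, 20), args=(n, s),
                         method="bounded").x
lam_star = np.sqrt(m) / (u_star * t_star)
print(lam_star)
```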

Remark 4 (Related works).

In [24] and [27], the authors also provide an explicit way to select the tradeoff parameter $\lambda$. Specifically, their results show that $\mathcal{O}\left(\eta^{2}(\lambda_{1}\cdot\partial f(\bm{x}^{\star}))+\eta^{2}(\lambda_{2}\cdot\partial g(\bm{v}^{\star}))\right)$ measurements are sufficient to guarantee the success of the penalized procedure. In order to achieve the smallest number of measurements, it is natural to choose $\lambda$ as follows:

\lambda_{1}^{\ast}=\arg\min_{\lambda_{1}>0}\eta^{2}(\lambda_{1}\cdot\partial f(\bm{x}^{\star})),~~\lambda_{2}^{\ast}=\arg\min_{\lambda_{2}>0}\eta^{2}(\lambda_{2}\cdot\partial g(\bm{v}^{\star})),~~\text{and}~~\lambda^{\ast}=\lambda_{2}^{\ast}/\lambda_{1}^{\ast}. (26)

However, a visible mismatch between the penalized program with the strategy (26) and the constrained ones has been observed in their numerical experiments, which suggests that the choice (26) might not be optimal in the sense of the Lagrange theory. As shown in our simulations (Section IV), the empirical performance of the penalized procedure (9) with our optimal choice of the tradeoff parameter is nearly the same as that of the constrained convex procedures. Thus, our strategy (24) solves another significant open problem in [23].

IV Numerical Simulations

In this section, we perform a series of numerical experiments to verify our theoretical results. We consider two typical structured signal recovery problems: sparse signal recovery from sparse corruption and low-rank matrix recovery from sparse corruption. In each case, we employ both constrained and penalized recovery procedures to reconstruct the original signal and corruption. Throughout these experiments, the related convex optimization problems are solved by the CVX Matlab package [59, 60]. In addition to Gaussian measurements, we also consider sub-Gaussian measurements. (In fact, we have tested other distributions of $\bm{\Phi}$, such as the sparse Rademacher distribution and Student's $t$ distribution; the results are quite similar, so we omit them here.)

IV-A Phase Transition of the Constrained Recovery Procedures

We first consider the empirical behavior of the constrained recovery procedures in the following two structured signal recovery problems.

Figure 2: Phase transitions of the constrained recovery procedures under Gaussian and Bernoulli measurements in both sparse signal recovery from sparse corruption and low-rank matrix recovery from sparse corruption. The red curves plot the phase transition thresholds predicted by (27).

IV-A1 Sparse Signal Recovery from Sparse Corruption

In this example, both signal and corruption are sparse, and we use the $\ell_{1}$-norm to promote their structures, i.e., $f(\bm{x}^{\star})=\|\bm{x}^{\star}\|_{1}$ and $g(\bm{v}^{\star})=\|\bm{v}^{\star}\|_{1}$. Suppose the $\ell_{1}$-norm of the true signal, $\|\bm{x}^{\star}\|_{1}$, is known beforehand. We fix the sample size and ambient signal dimension $m=n=128$. For each signal sparsity $s=1,2,\ldots,128$ and each corruption sparsity $k=1,2,\ldots,128$, we repeat the following experiment 20 times:

  • (1)

    Generate a signal vector $\bm{x}^{\star}\in\mathbb{R}^{n}$ with $s$ non-zero entries and set the other $n-s$ entries to 0. The locations of the non-zero entries are uniformly selected among all possible supports, and the non-zero entries are independently sampled from the normal distribution.

  • (2)

    Similarly, generate a corruption vector $\bm{v}^{\star}\in\mathbb{R}^{m}$ with $k$ non-zero entries and set the other $m-k$ entries to 0.

  • (3)

    For Gaussian measurements, we draw the sensing matrix $\bm{\Phi}\in\mathbb{R}^{m\times n}$ with i.i.d. standard normal entries. For sub-Gaussian measurements, we draw the sensing matrix $\bm{\Phi}\in\mathbb{R}^{m\times n}$ with i.i.d. symmetric Bernoulli entries.

  • (4)

    Solve the following constrained optimization problem (8) (a Python sketch of this step appears after the list):

    (\hat{\bm{x}},\hat{\bm{v}})=\arg\min_{\bm{x},\bm{v}}~\|\bm{v}\|_{1},\quad\text{s.t.}~\bm{y}=\bm{\Phi}\bm{x}+\sqrt{m}\bm{v},~\|\bm{x}\|_{1}\leq\|\bm{x}^{\star}\|_{1}.
  • (5)

    Set $tol=10^{-3}$. Declare success if $\|\hat{\bm{x}}-\bm{x}^{\star}\|_{2}/\|\bm{x}^{\star}\|_{2}\leq tol$.
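For reference, a minimal CVXPY sketch of steps (1)-(5) (the paper's experiments use the CVX Matlab package; the setup below, including sizes and seeds, is hypothetical):

```python
import cvxpy as cp
import numpy as np

# Steps (1)-(3): hypothetical s-sparse signal, k-sparse corruption, and a
# Gaussian sensing matrix.
m = n = 128
s, k = 10, 10
rng = np.random.default_rng(0)
x_true = np.zeros(n)
x_true[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
v_true = np.zeros(m)
v_true[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)
Phi = rng.standard_normal((m, n))
y = Phi @ x_true + np.sqrt(m) * v_true

# Step (4): the constrained procedure (8).
x, v = cp.Variable(n), cp.Variable(m)
prob = cp.Problem(cp.Minimize(cp.norm1(v)),
                  [y == Phi @ x + np.sqrt(m) * v,
                   cp.norm1(x) <= np.linalg.norm(x_true, 1)])
prob.solve()

# Step (5): declare success at relative tolerance 1e-3.
tol = 1e-3
print(np.linalg.norm(x.value - x_true) <= tol * np.linalg.norm(x_true))
```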

Figure 3: Phase transitions of the penalized recovery procedure with different values of $\lambda$ under Gaussian and Bernoulli measurements in sparse signal recovery from sparse corruption. The red curves plot the phase transition thresholds predicted by (28).

IV-A2 Low-rank Matrix Recovery from Sparse Corruption

In this case, the desired signal is a rank-r matrix and the corruption is a ρ-sparse vector. We use the nuclear norm f(𝑿⋆)=∥𝑿⋆∥_* to promote the structure of the signal. Suppose the nuclear norm of the true signal, ∥𝑿⋆∥_*, is known exactly. Let n=20, consider n×n signal matrices, and set the sample size m=n². For each rank r=1,2,…,20 and each corruption sparsity ρ=1,6,11,16,…,396, we repeat the following experiment 20 times (a code sketch of one trial is given after the list):

  • (1) Generate a rank-r matrix 𝑿⋆=𝑼₁𝑼₂ᵀ, where 𝑼₁ and 𝑼₂ are independent n×r matrices with orthonormal columns.

  • (2) Generate a corruption vector 𝒗⋆∈ℝᵐ with ρ non-zero entries and set the other m−ρ entries to 0.

  • (3) For Gaussian measurements, we draw the sensing matrix 𝚽∈ℝ^{m×n²} with i.i.d. standard normal entries. For sub-Gaussian measurements, we draw 𝚽 with i.i.d. symmetric Bernoulli entries.

  • (4) Solve the following constrained problem (8):

    (𝑿^,𝒗^)=argmin𝑿,𝒗𝒗1,s.t.\displaystyle(\hat{\bm{X}},\hat{\bm{v}})=\arg\min_{\bm{X},\bm{v}}~{}\|\bm{v}\|_{1},\quad\text{s.t.~{}} 𝒚=𝚽vec(𝑿)+m𝒗,𝑿𝑿.\displaystyle\bm{y}=\bm{\Phi}\cdot\textrm{vec}(\bm{X})+\sqrt{m}\bm{v},~{}~{}\|\bm{X}\|_{*}\leq\|\bm{X}^{\star}\|_{*}.
  • (5) Set tol=10⁻³ and declare success if ∥𝑿̂−𝑿⋆∥_F/∥𝑿⋆∥_F ≤ tol.
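A corresponding one-trial sketch for the low-rank experiment, again with cvxpy in place of CVX and an illustrative (rank, sparsity) pair; note that cvxpy's vec operator flattens in column-major order, which we match when vectorizing 𝑿⋆.

import numpy as np
import cvxpy as cp

n, m = 20, 400
r, rho = 2, 20  # one (rank, corruption sparsity) pair from the grid
rng = np.random.default_rng(0)

# Step (1): X* = U1 @ U2.T with orthonormal columns via QR factorizations.
U1, _ = np.linalg.qr(rng.standard_normal((n, r)))
U2, _ = np.linalg.qr(rng.standard_normal((n, r)))
X_star = U1 @ U2.T

# Steps (2)-(3): rho-sparse corruption and a Gaussian sensing matrix.
v_star = np.zeros(m)
v_star[rng.choice(m, size=rho, replace=False)] = rng.standard_normal(rho)
Phi = rng.standard_normal((m, n * n))

# cvxpy's vec flattens column-major, so we vectorize X* the same way.
y = Phi @ X_star.flatten(order="F") + np.sqrt(m) * v_star

# Step (4): the nuclear-norm constrained program (8).
X, v = cp.Variable((n, n)), cp.Variable(m)
problem = cp.Problem(cp.Minimize(cp.norm1(v)),
                     [y == Phi @ cp.vec(X) + np.sqrt(m) * v,
                      cp.normNuc(X) <= np.linalg.norm(X_star, "nuc")])
problem.solve()

# Step (5): declare success at relative Frobenius error 1e-3.
rel_err = np.linalg.norm(X.value - X_star, "fro") / np.linalg.norm(X_star, "fro")
print(rel_err <= 1e-3)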

Figure 4: Phase transitions of the penalized recovery procedure with different values of λ under Gaussian and Bernoulli measurements in low-rank matrix recovery from sparse corruption. The red curves plot the phase transition thresholds predicted by (28).

To compare the empirical behavior with our theoretical results, we overlay the phase transition curve predicted by Theorem 1:

ω2(𝒯f(𝒙)𝕊n1)+ω2(𝒯g(𝒗)𝕊m1).\displaystyle\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}). (27)

Fig. 2 reports the empirical probability of success for the constrained procedures in these two typical structured recovery problems. Our theoretical predictions align sharply with the empirical phase transitions under both Gaussian and Bernoulli measurements.
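As an aside, the threshold (27) itself is straightforward to estimate by Monte Carlo in the ℓ₁/ℓ₁ case. The sketch below relies on two standard facts from the conic-geometry literature: the polar of the descent cone 𝒯_f at 𝒙⋆≠𝟎 is the cone generated by ∂f(𝒙⋆), so that ω²(𝒯_f∩𝕊^{n−1}) agrees with 𝔼 min_{τ≥0} dist²(𝒈, τ∂f(𝒙⋆)) up to an additive constant of at most one; and for the ℓ₁ norm this distance has the closed form coded below. The trial count, sparsity levels, and bracketing interval are illustrative.

import numpy as np
from scipy.optimize import minimize_scalar

def delta_l1(dim, sparsity, trials=200, seed=0):
    """Monte Carlo estimate of E min_{tau >= 0} dist^2(g, tau * subdifferential
    of the l1 norm), taking the support signs to be +1 (valid by symmetry of g)."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        g = rng.standard_normal(dim)

        def sq_dist(tau):
            on = np.sum((g[:sparsity] - tau) ** 2)                   # support entries
            off = np.sum(np.maximum(np.abs(g[sparsity:]) - tau, 0) ** 2)
            return on + off

        vals.append(minimize_scalar(sq_dist, bounds=(0.0, 10.0),
                                    method="bounded").fun)
    return float(np.mean(vals))

# Predicted transition location for (27) at one grid point (s, k):
m = n = 128
s, k = 10, 10
print(delta_l1(n, s) + delta_l1(m, k), "vs. m =", m)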

IV-B Phase Transition of the Penalized Recovery Procedure

We next consider the empirical phase transition of the penalized procedure in these two examples.

IV-B1 Sparse Signal Recovery from Sparse Corruption

The experimental settings are almost the same as in the constrained case, except that we require knowledge of neither f(𝒙⋆) nor g(𝒗⋆), and we solve the following penalized procedure instead of the constrained one in step (4):

(𝒙^,𝒗^)=argmin𝒙,𝒗𝒙1+λ𝒗1,s.t.\displaystyle(\hat{\bm{x}},\hat{\bm{v}})=\arg\min_{\bm{x},\bm{v}}~{}\|\bm{x}\|_{1}+\lambda\|\bm{v}\|_{1},\quad\text{s.t.~{}} 𝒚=𝚽𝒙+m𝒗.\displaystyle\bm{y}=\bm{\Phi}\bm{x}+\sqrt{m}\bm{v}.

Here we test two tradeoff parameters, λ=1 and λ=2. To compare the empirical behavior with our theoretical results, we overlay the phase transition curve predicted by Theorem 2:

minαtβ2ζ(mλtf(𝒙))+η2(1tg(𝒗)𝕊m1)1.\min_{\alpha\leq t\leq\beta}2\cdot\zeta\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\eta^{2}\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1. (28)

Fig. 3 displays the average empirical probability of success for the penalized problem in sparse signal recovery from sparse corruption. The theoretical threshold (28) accurately predicts the empirical phase transition for each tested value of the tradeoff parameter λ.
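Relative to the one-trial sketch in Section IV-A1, only step (4) changes; assuming the variables (Phi, y, x, v, m) defined there, the penalized program reads in cvxpy:

lam = 1.0  # tradeoff parameter; here we test lam = 1 and lam = 2
problem = cp.Problem(cp.Minimize(cp.norm1(x) + lam * cp.norm1(v)),
                     [y == Phi @ x + np.sqrt(m) * v])
problem.solve()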

IV-B2 Low-rank Matrix Recovery from Sparse Corruption

Similarly, the experimental settings are nearly the same as in the constrained case, except that we recover the original signal and corruption via the following penalized procedure in step (4):

(𝑿^,𝒗^)=argmin𝑿,𝒗𝑿+λ𝒗1,s.t.\displaystyle(\hat{\bm{X}},\hat{\bm{v}})=\arg\min_{\bm{X},\bm{v}}~{}\|\bm{X}\|_{*}+\lambda\|\bm{v}\|_{1},\quad\text{s.t.~{}} 𝒚=𝚽vec(𝑿)+m𝒗.\displaystyle\bm{y}=\bm{\Phi}\cdot\textrm{vec}(\bm{X})+\sqrt{m}\bm{v}.

The tradeoff parameter is set to λ=1/4 or λ=1/2. To compare our theory with the empirical results, we overlay the theoretical threshold (28). Fig. 4 shows the empirical probability of success for the penalized problem in low-rank matrix recovery from sparse corruption. The theoretical threshold (28) predicts the empirical phase transition quite well for each tested value of the tradeoff parameter λ.

Figure 5: Phase transitions of the penalized recovery procedure with the optimal λ under Gaussian and Bernoulli measurements in both sparse signal recovery from sparse corruption and low-rank matrix recovery from sparse corruption. The red curves plot the phase transition thresholds of the constrained procedures, predicted by (27).

IV-C Optimal Choice of the Tradeoff Parameter λ\lambda

In this subsection, we explore the empirical phase transition of the penalized procedure with the optimal tradeoff parameter λ, again in the same two typical examples.

IV-C1 Sparse Signal Recovery from Sparse Corruption

The experimental settings are the same as in the penalized case, except that we solve the following penalized procedure in step (4):

(𝒙^,𝒗^)=argmin𝒙,𝒗𝒙1+λ𝒗1,s.t.\displaystyle(\hat{\bm{x}},\hat{\bm{v}})=\arg\min_{\bm{x},\bm{v}}~{}\|\bm{x}\|_{1}+\lambda\|\bm{v}\|_{1},\quad\text{s.t.~{}} 𝒚=𝚽𝒙+m𝒗\displaystyle\bm{y}=\bm{\Phi}\bm{x}+\sqrt{m}\bm{v}

with the optimal parameter selection strategy λ=λ\lambda=\lambda^{\star} as in (24).

IV-C2 Low-rank Matrix Recovery from Sparse Corruption

We carry out experiments similar to the penalized case, except that we solve the following penalized procedure in step (4):

(𝑿^,𝒗^)=argmin𝑿,𝒗𝑿+λ𝒗1,s.t.\displaystyle(\hat{\bm{X}},\hat{\bm{v}})=\arg\min_{\bm{X},\bm{v}}~{}\|\bm{X}\|_{*}+\lambda\|\bm{v}\|_{1},\quad\text{s.t.~{}} 𝒚=𝚽vec(𝑿)+m𝒗.\displaystyle\bm{y}=\bm{\Phi}\cdot\textrm{vec}(\bm{X})+\sqrt{m}\bm{v}.

The tradeoff parameter is set to λ=λ\lambda=\lambda^{\star} according to (24).

To compare our theory with the empirical results, we overlay the theoretical threshold (27) of the constrained procedures. Fig. 5 shows the empirical probability of success for the penalized procedure with the optimal parameter λ in both examples. The theoretical threshold of the constrained problems predicts the empirical phase transition of the penalized problems perfectly under both Gaussian and Bernoulli measurements, which indicates that our strategy for choosing λ is optimal in the sense of Lagrange theory.

V Conclusion and Future Directions

This paper has developed a unified framework for establishing the phase transition theory of both constrained and penalized recovery procedures used to solve corrupted sensing problems under different scenarios. The analysis is based only on some well-known results from Gaussian process theory. Our theoretical results have shown that the phase transitions of these two recovery procedures are determined by certain geometric measures, e.g., the spherical Gaussian width of a tangent cone and the Gaussian (squared) distance to a scaled subdifferential. We have also explored the relationship between these two procedures from a quantitative perspective, which in turn indicates how to pick the optimal tradeoff parameter in the penalized recovery procedure. The numerical experiments have demonstrated a close agreement between our theoretical results and the empirical phase transitions. For future work, we list two promising directions:

  • Universality: Under Gaussian measurements, our results provide a thorough explanation of the phase transition phenomenon in corrupted sensing. The Gaussian assumption is critical to the establishment of our main results. However, the extensive numerical examples in Section IV suggest that the phase transition results for corrupted sensing are universal. Thus, an important open question is to establish a phase transition theory for corrupted sensing beyond Gaussian measurements.

  • Noisy phase transition: Throughout the paper, we analyze the phase transition of corrupted sensing in the noiseless setting. It would be interesting to consider the noisy measurements 𝒚=𝚽𝒙⋆+𝒗⋆+𝒛 and to provide a precise error analysis for different convex recovery procedures. In [44], Donoho et al. considered the noisy compressed sensing problem 𝒚=𝚽𝒙⋆+𝒛 with 𝒛∼𝒩(0,σ𝑰_m), and used the penalized ℓ₁-minimization 𝒙̂=argmin{∥𝒚−𝚽𝒙∥₂²/2+λ∥𝒙∥₁} to recover the original signal. They showed that the normalized MSE 𝔼∥𝒙̂−𝒙⋆∥₂²/σ² is bounded throughout one asymptotic region and unbounded throughout the complementary region, and that the boundary between the two regions is identical to the previously known phase transition for the noiseless problem. We may expect a non-asymptotic characterization of the normalized MSE for the noisy corrupted sensing problem, which would offer a new perspective on the phase transition results in the noiseless case. (A minimal sketch of this penalized estimator is given after the list.)
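For readers who wish to experiment with this noisy setting, the following is a minimal sketch of the penalized ℓ₁ estimator discussed above, solved by plain ISTA (proximal-gradient) iterations; the step size and iteration count are our illustrative choices, not prescriptions from [44].

import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def penalized_l1(Phi, y, lam, iters=2000):
    """Minimize ||y - Phi @ x||_2^2 / 2 + lam * ||x||_1 via ISTA."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(Phi.shape[1])
    for _ in range(iters):
        grad = Phi.T @ (Phi @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x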

Appendix A Proofs of Lemma 1 and Theorem 1

In this appendix, we present a detailed proof of the phase transition result for the constrained recovery procedures. For brevity, we denote 𝒯_f(𝒙⋆) and 𝒯_g(𝒗⋆) by 𝒯_f and 𝒯_g, respectively. Some auxiliary lemmas and facts used in the proofs are included in Appendix E.

A-A Proof of Lemma 1

Proof.

It follows from the optimality condition for linear inverse problems [48, Section 4] or [51, Proposition 2.1] that (𝒙̂,𝒗̂)=(𝒙⋆,𝒗⋆) is the unique optimal solution of (7) or (8) if and only if null([𝚽,√m 𝑰_m])∩(𝒯_f×𝒯_g)={(𝟎,𝟎)}, which is equivalent to null([𝚽,√m 𝑰_m])∩((𝒯_f×𝒯_g)∩𝕊^{n+m−1})=∅. Therefore, if null([𝚽,√m 𝑰_m])∩((𝒯_f×𝒯_g)∩𝕊^{n+m−1})=∅, i.e.,

min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝚽𝒂+m𝒃2>0,\displaystyle\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}>0,

then the constrained procedures (7) and (8) succeed. If null([𝚽,m𝑰m])((𝒯f×𝒯g)𝕊n+m1)\text{null}\left([\bm{\Phi},\sqrt{m}\bm{I}_{m}]\right)\cap\left((\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}\right)\neq\emptyset, i.e.,

min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝚽𝒂+m𝒃2=0,\displaystyle\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}=0, (29)

then the constrained procedures (7) and (8) fail.

Obviously, (29) holds if 𝟎∈𝑨((𝒯_f×𝒯_g)∩𝕊^{n+m−1}), where 𝑨=[𝚽,√m 𝑰_m]. Since f and g are proper convex functions, 𝒯_f and 𝒯_g are convex, and hence (𝒯_f×𝒯_g)∩𝕊^{n+m−1} is spherically convex. Since, by assumption, 𝒯_f and 𝒯_g are also nonempty and closed, the desired sufficient condition (13) follows by directly applying the polarity principle (Fact 3).

A-B Proof of Theorem 1

Proof.

Success case: Lemma 1 indicates that the constrained procedures (7) and (8) succeed if

min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝚽𝒂+m𝒃2>0.\displaystyle\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}>0.

Our goal then reduces to showing that if the number of measurements satisfies (14), then the above inequality holds with high probability. For clarity, the proof is divided into three steps.

Step 1: Problem reduction. We first apply Gordon’s Lemma to convert the probability of the targeted event to a surrogate which is convenient to handle. Observe that

min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝚽𝒂+m𝒃2=min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝚽𝒂,𝒖+m𝒃,𝒖.\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}=\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\left\langle\bm{\Phi}\bm{a},\bm{u}\right\rangle+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle. (30)

For any (𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1} and 𝒖𝕊m1\bm{u}\in\mathbb{S}^{m-1}, define the following two Gaussian processes

X(𝒂,𝒃),𝒖:=𝚽𝒂,𝒖+𝒂2gX_{(\bm{a},\bm{b}),\bm{u}}:=\left\langle\bm{\Phi}\bm{a},\bm{u}\right\rangle+\|\bm{a}\|_{2}\cdot g~{}~{}~{}

and

Y(𝒂,𝒃),𝒖:=𝒂2𝒉,𝒖+𝒈,𝒂,Y_{(\bm{a},\bm{b}),\bm{u}}:=\|\bm{a}\|_{2}\left\langle\bm{h},\bm{u}\right\rangle+\left\langle\bm{g},\bm{a}\right\rangle,

where g𝒩(0,1)g\sim\mathcal{N}(0,1), 𝒉𝒩(𝟎,𝑰m)\bm{h}\sim\mathcal{N}(\bm{0},\bm{I}_{m}), and 𝒈𝒩(𝟎,𝑰n)\bm{g}\sim\mathcal{N}(\bm{0},\bm{I}_{n}) are independent of each other. It is not hard to check that the above Gaussian processes satisfy the conditions of Gordon’s Lemma, i.e.,

𝔼X(𝒂,𝒃),𝒖2\displaystyle\operatorname{\mathbb{E}}X_{(\bm{a},\bm{b}),\bm{u}}^{2} =2𝒂22=𝔼Y(𝒂,𝒃),𝒖2,\displaystyle=2\|\bm{a}\|_{2}^{2}=\operatorname{\mathbb{E}}Y_{(\bm{a},\bm{b}),\bm{u}}^{2},
𝔼[X(𝒂,𝒃),𝒖X(𝒂,𝒃),𝒖]𝔼[Y(𝒂,𝒃),𝒖Y(𝒂,𝒃),𝒖]\displaystyle\operatorname{\mathbb{E}}[X_{(\bm{a},\bm{b}),\bm{u}}X_{(\bm{a}^{\prime},\bm{b}^{\prime}),\bm{u}^{\prime}}]-\operatorname{\mathbb{E}}[Y_{(\bm{a},\bm{b}),\bm{u}}Y_{(\bm{a}^{\prime},\bm{b}^{\prime}),\bm{u}^{\prime}}] =𝒖,𝒖𝒂,𝒂+𝒂2𝒂2𝒂,𝒂𝒂2𝒂2𝒖,𝒖\displaystyle=\left\langle\bm{u},\bm{u}^{\prime}\right\rangle\left\langle\bm{a},\bm{a}^{\prime}\right\rangle+\|\bm{a}\|_{2}\|\bm{a}^{\prime}\|_{2}-\left\langle\bm{a},\bm{a}^{\prime}\right\rangle-\|\bm{a}\|_{2}\|\bm{a}^{\prime}\|_{2}\left\langle\bm{u},\bm{u}^{\prime}\right\rangle
=(1𝒖,𝒖)(𝒂2𝒂2𝒂,𝒂)\displaystyle=\left(1-\left\langle\bm{u},\bm{u}^{\prime}\right\rangle\right)\left(\|\bm{a}\|_{2}\|\bm{a}^{\prime}\|_{2}-\left\langle\bm{a},\bm{a}^{\prime}\right\rangle\right)
0,\displaystyle\geq 0,

where, in the last line, the equality holds when 𝒂=𝒂\bm{a}=\bm{a}^{\prime}. It then follows from Gordon’s Lemma (Fact 1) that (by setting τ(𝒂,𝒃),𝒖=m𝒃,𝒖+0+\tau_{(\bm{a},\bm{b}),\bm{u}}=-\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle+0_{+})

{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1Y(𝒂,𝒃),𝒖τ(𝒂,𝒃),𝒖}={min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝒂2𝒉,𝒖+𝒈,𝒂+m𝒃,𝒖>0}\displaystyle\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}Y_{(\bm{a},\bm{b}),\bm{u}}\geq\tau_{(\bm{a},\bm{b}),\bm{u}}\rule{0.0pt}{8.53581pt}\right\}=\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\|\bm{a}\|_{2}\left\langle\bm{h},\bm{u}\right\rangle+\left\langle\bm{g},\bm{a}\right\rangle+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1X(𝒂,𝒃),𝒖τ(𝒂,𝒃),𝒖}\displaystyle\hskip 150.0pt\leq\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}X_{(\bm{a},\bm{b}),\bm{u}}\geq\tau_{(\bm{a},\bm{b}),\bm{u}}\rule{0.0pt}{8.53581pt}\right\}
={min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝚽𝒂,𝒖+𝒂2g+m𝒃,𝒖>0}\displaystyle\hskip 150.0pt=\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\left\langle\bm{\Phi}\bm{a},\bm{u}\right\rangle+\|\bm{a}\|_{2}\cdot g+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
12+12{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝚽𝒂,𝒖+𝒂2g+m𝒃,𝒖>0|g0}\displaystyle\hskip 150.0pt\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\left\langle\bm{\Phi}\bm{a},\bm{u}\right\rangle+\|\bm{a}\|_{2}\cdot g+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle>0~{}\bigg{|}~{}g\leq 0\rule{0.0pt}{8.53581pt}\right\}
12+12{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝚽𝒂,𝒖+m𝒃,𝒖>0},\displaystyle\hskip 150.0pt\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\left\langle\bm{\Phi}\bm{a},\bm{u}\right\rangle+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\},

where the second inequality is due to the law of total probability and the third inequality holds by noting 𝒂2g0-\|\bm{a}\|_{2}\cdot g\geq 0 when g0g\leq 0. Rearranging the above inequality leads to

{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝚽𝒂,𝒖+m𝒃,𝒖>0}\displaystyle\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\left\langle\bm{\Phi}\bm{a},\bm{u}\right\rangle+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
2{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝒂2𝒉,𝒖+𝒈,𝒂+m𝒃,𝒖:=1>0}1.\displaystyle\hskip 135.0pt\geq 2\mathbb{P}\left\{\underbrace{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\|\bm{a}\|_{2}\left\langle\bm{h},\bm{u}\right\rangle+\left\langle\bm{g},\bm{a}\right\rangle+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle}_{:=\mathscr{E}_{1}}>0\rule{0.0pt}{8.53581pt}\right\}-1. (31)

Moreover, 1\mathscr{E}_{1} can be rewritten as

1=min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝒖,𝒂2𝒉+m𝒃+𝒈,𝒂=min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝒂2𝒉+m𝒃2+𝒈,𝒂=mint[0,1]min𝒂𝒯f𝕊n1𝒃𝒯g𝕊m1t𝒉+m(1t2)𝒃2+t𝒈,𝒂.\begin{split}\mathscr{E}_{1}&=\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\left\langle\bm{u},\|\bm{a}\|_{2}\bm{h}+\sqrt{m}\bm{b}\right\rangle+\left\langle\bm{g},\bm{a}\right\rangle\\ &=\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\bigg{\|}\|\bm{a}\|_{2}\bm{h}+\sqrt{m}\bm{b}\bigg{\|}_{2}+\left\langle\bm{g},\bm{a}\right\rangle\\ &=\min_{t\in[0,1]}\min_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{b}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\bigg{\|}t\bm{h}+\sqrt{m(1-t^{2})}\bm{b}^{\prime}\bigg{\|}_{2}+t\left\langle\bm{g},\bm{a}^{\prime}\right\rangle.\end{split} (32)

In the last line, we have let ∥𝒂∥₂=t, ∥𝒃∥₂=√(1−t²), 𝒂′=𝒂/∥𝒂∥₂, and 𝒃′=𝒃/∥𝒃∥₂.

Define

U(𝒈,𝒉,t):=min𝒂𝒯f𝕊n1𝒃𝒯g𝕊m1𝒉+m(1t21)𝒃2+𝒈,𝒂.\begin{split}U(\bm{g},\bm{h},t):=\min_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{b}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\bigg{\|}\bm{h}+\sqrt{m(\frac{1}{t^{2}}-1)}\cdot\bm{b}^{\prime}\bigg{\|}_{2}+\left\langle\bm{g},\bm{a}^{\prime}\right\rangle.\end{split}

Let

t1argmint[0,1]{min𝒂𝒯f𝕊n1𝒃𝒯g𝕊m1t𝒉+m(1t2)𝒃2+t𝒈,𝒂}.\displaystyle t_{1}\in\arg\min_{t\in[0,1]}\left\{\min_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{b}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\bigg{\|}t\bm{h}+\sqrt{m(1-t^{2})}\bm{b}^{\prime}\bigg{\|}_{2}+t\left\langle\bm{g},\bm{a}^{\prime}\right\rangle\right\}.

If t10t_{1}\neq 0, then we have

{mint[0,1]min𝒂𝒯f𝕊n1𝒃𝒯g𝕊m1t𝒉+m(1t2)𝒃2+t𝒈,𝒂>0}\displaystyle\mathbb{P}\left\{\min_{t\in[0,1]}\min_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{b}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\bigg{\|}t\bm{h}+\sqrt{m(1-t^{2})}\bm{b}^{\prime}\bigg{\|}_{2}+t\left\langle\bm{g},\bm{a}^{\prime}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\} ={t1U(𝒈,𝒉,t1)>0}\displaystyle=\mathbb{P}\left\{t_{1}\cdot U(\bm{g},\bm{h},t_{1})>0\rule{0.0pt}{8.53581pt}\right\}
={U(𝒈,𝒉,t1)>0}\displaystyle=\mathbb{P}\left\{U(\bm{g},\bm{h},t_{1})>0\rule{0.0pt}{8.53581pt}\right\}
{mint(0,1]U(𝒈,𝒉,t)>0}.\displaystyle\geq\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}.

If t₁=0, then ℰ₁=√m (the objective reduces to ∥√m 𝒃′∥₂=√m), which implies

{1>0}=1.\mathbb{P}\left\{\mathscr{E}_{1}>0\rule{0.0pt}{8.53581pt}\right\}=1.

Thus we have

{1>0}{mint(0,1]U(𝒈,𝒉,t)>0}.\displaystyle\mathbb{P}\left\{\mathscr{E}_{1}>0\rule{0.0pt}{8.53581pt}\right\}\geq\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}. (33)

Combining (30), (31), and (33) yields

{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝚽𝒂+m𝒃2>0}2{mint(0,1]U(𝒈,𝒉,t)>0}1.\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}\geq 2\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}-1. (34)

Therefore, it is sufficient to establish the lower bound for {mint(0,1]U(𝒈,𝒉,t)>0}\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}.

Step 2: Establish the lower bound for ℙ{min_{t∈(0,1]}U(𝒈,𝒉,t)>0}. We now apply the Gaussian concentration inequality to bound this probability from below.

Note that min_{t∈(0,1]}U(𝒈,𝒉,t) can be reformulated as

mint(0,1]U(𝒈,𝒉,t)=mint(0,1]min𝒂𝒯f𝕊n1𝒃𝒯g𝕊m1𝒉+m(1t21)𝒃2+𝒈,𝒂=min𝒂𝒯f𝕊n1mint0𝒃𝒯g𝕊m1𝒉+t𝒃2+𝒈,𝒂=min𝒙𝒯g𝒂𝒯f𝕊n1𝒉+𝒙2+𝒈,𝒂,\begin{split}\min_{t\in(0,1]}U(\bm{g},\bm{h},t)&=\min_{t\in(0,1]}\min_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{b}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\bigg{\|}\bm{h}+\sqrt{m(\frac{1}{t^{2}}-1)}\cdot\bm{b}^{\prime}\bigg{\|}_{2}+\left\langle\bm{g},\bm{a}^{\prime}\right\rangle\\ &=\min_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\min_{t^{\prime}\geq 0\atop\bm{b}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\|\bm{h}+t^{\prime}\bm{b}^{\prime}\|_{2}+\left\langle\bm{g},\bm{a}^{\prime}\right\rangle\\ &=\min_{\bm{x}\in\mathcal{T}_{g}\atop\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\|\bm{h}+\bm{x}\|_{2}+\left\langle\bm{g},\bm{a}^{\prime}\right\rangle,\end{split}

where, in the second line, we have let t=m(1t21)0t^{\prime}=\sqrt{m(\frac{1}{t^{2}}-1)}\geq 0. It then follows from Lemma 4 that the function mint(0,1]U(𝒈,𝒉,t)\min_{t\in(0,1]}U(\bm{g},\bm{h},t) is a 2\sqrt{2}-Lipschitz function. To apply the Gaussian concentration inequality, it suffices to bound the expectation of mint(0,1]U(𝒈,𝒉,t)\min_{t\in(0,1]}U(\bm{g},\bm{h},t). To this end,

\begin{split}
\operatorname{\mathbb{E}}\min_{t\in(0,1]}U(\bm{g},\bm{h},t)&=\operatorname{\mathbb{E}}\left\{\min_{\bm{x}\in\mathcal{T}_{g}}\|\bm{h}+\bm{x}\|_{2}-\max_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\left\langle-\bm{g},\bm{a}^{\prime}\right\rangle\right\}\\
&=\operatorname{\mathbb{E}}\left\{\operatorname{dist}(-\bm{h},\mathcal{T}_{g})-\max_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\left\langle-\bm{g},\bm{a}^{\prime}\right\rangle\right\}\\
&=\operatorname{\mathbb{E}}\operatorname{dist}(\bm{h},\mathcal{T}_{g})-\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})\\
&\geq\sqrt{\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{h},\mathcal{T}_{g})-1}-\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})\\
&=\sqrt{m-\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{h},\mathcal{T}_{g}^{\circ})-1}-\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})\\
&=\sqrt{m-\operatorname{\mathbb{E}}\Big(\max_{\bm{x}\in\mathcal{T}_{g}\cap\mathbb{B}_{2}^{m}}\left\langle\bm{h},\bm{x}\right\rangle\Big)^{2}-1}-\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})\\
&\geq\sqrt{m-\omega^{2}(\mathcal{T}_{g}\cap\mathbb{S}^{m-1})-2}-\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})\\
&\geq\sqrt{m}-\sqrt{\omega^{2}(\mathcal{T}_{g}\cap\mathbb{S}^{m-1})+\omega^{2}(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})+2}\\
&\geq\epsilon.
\end{split}  (35)

The third line is due to the fact that −𝒉 (or −𝒈) and 𝒉 (or 𝒈) share the same distribution. The first inequality uses Fact 4, i.e., Var(dist(𝒉,𝒯_g))=𝔼dist²(𝒉,𝒯_g)−(𝔼dist(𝒉,𝒯_g))²≤1. The fifth line holds because of Moreau's decomposition theorem (Fact 5). The next two lines follow from Facts 6 and 7, respectively. The last two inequalities are due to the measurement condition (14).
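The distance identities invoked here are easy to verify numerically. The following check (our addition) takes the simplest closed convex cone, the non-negative orthant, whose projection is coordinatewise clipping, and confirms the pointwise Moreau identity dist²(𝒉,C)+dist²(𝒉,C°)=∥𝒉∥₂², which yields 𝔼 dist²(𝒉,𝒯_g)=m−𝔼 dist²(𝒉,𝒯_g°) after taking expectations.

import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(1000)

# C = non-negative orthant, C° = non-positive orthant; the projections are
# coordinatewise clipping, so the squared distances are:
dist2_cone = np.sum(np.minimum(h, 0.0) ** 2)   # ||h - max(h, 0)||_2^2
dist2_polar = np.sum(np.maximum(h, 0.0) ** 2)  # ||h - min(h, 0)||_2^2

assert np.isclose(dist2_cone + dist2_polar, np.sum(h ** 2))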

Now using the Gaussian concentration inequality (Fact 2) yields

{mint(0,1]U(𝒈,𝒉,t)𝔼mint(0,1]U(𝒈,𝒉,t)ϵ}exp(ϵ24),\displaystyle\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)-\operatorname{\mathbb{E}}\min_{t\in(0,1]}U(\bm{g},\bm{h},t)\leq-\epsilon\rule{0.0pt}{8.53581pt}\right\}\leq\exp\left(\frac{-\epsilon^{2}}{4}\right),

which in turn implies that

{mint(0,1]U(𝒈,𝒉,t)>0}{mint(0,1]U(𝒈,𝒉,t)>𝔼mint(0,1]U(𝒈,𝒉,t)ϵ}1exp(ϵ24).\begin{split}\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}&\geq\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>\operatorname{\mathbb{E}}\min_{t\in(0,1]}U(\bm{g},\bm{h},t)-\epsilon\rule{0.0pt}{8.53581pt}\right\}\\ &\geq 1-\exp\left(\frac{-\epsilon^{2}}{4}\right).\end{split} (36)

Step 3: Complete the proof.

Combining the results in Steps 1 and 2 ((34) and (36)), we have

{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝚽𝒂+m𝒃2>0}2{mint(0,1]U(𝒈,𝒉,t)>0}112exp(ϵ24).\begin{split}\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}-1\\ &\geq 1-2\exp\left(\frac{-\epsilon^{2}}{4}\right).\end{split}

Therefore, we have established that when mω2(𝒯f𝕊n1)+ω2(𝒯g𝕊m1)+2+ϵ\sqrt{m}\geq\sqrt{\omega^{2}\left(\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\right)+\omega^{2}\left(\mathcal{T}_{g}\cap\mathbb{S}^{m-1}\right)}+\sqrt{2}+\epsilon, the constrained procedures (7) and (8) succeed with probability at least 12exp(ϵ2/4)1-2\exp\left(-\epsilon^{2}/4\right).

Failure case: According to Lemma 1, the constrained procedures (7) and (8) fail if

min𝒓𝕊m1min𝒔(𝒯f×𝒯g)𝒔𝑨T𝒓2>0,\min_{\bm{r}\in\mathbb{S}^{m-1}}\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}>0,

where 𝑨=[𝚽,m𝑰m]\bm{A}=[\bm{\Phi},\sqrt{m}\bm{I}_{m}]. So it suffices to show that if the number of measurements satisfies (15), then the above inequality holds with high probability. For clarity, the proof is similarly divided into three steps.

Step 1: Problem reduction. In this step, we also use Gordon’s Lemma to convert the probability of the targeted event to another one which is easy to handle.

Let 𝒔=[𝒔1T,𝒔2T]T\bm{s}=[\bm{s}_{1}^{T},\bm{s}_{2}^{T}]^{T} and 𝒖=[𝒖1T,𝒖2T]T\bm{u}=[\bm{u}_{1}^{T},\bm{u}_{2}^{T}]^{T}. Note first that for any 𝒓𝕊m1\bm{r}\in\mathbb{S}^{m-1}, we have

min𝒔(𝒯f×𝒯g)𝒔𝑨T𝒓2=min𝒔(𝒯f×𝒯g)[𝒔1𝚽T𝒓𝒔2m𝒓]2=min𝒔(𝒯f×𝒯g)max𝒖𝕊n+m1𝚽T𝒓𝒔1,𝒖1+m𝒓𝒔2,𝒖2=min𝒔(𝒯f×𝒯g)max𝒖𝕊n+m1𝚽T𝒓,𝒖1+m𝒓,𝒖2𝒔,𝒖max𝒖𝕊n+m1[𝚽T𝒓,𝒖1+m𝒓,𝒖2max𝒔(𝒯f×𝒯g)𝒔,𝒖]=max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝚽T𝒓,𝒖1+m𝒓,𝒖2.\begin{split}\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}&=\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\left\|\begin{bmatrix}\bm{s}_{1}-\bm{\Phi}^{T}\bm{r}\\ \bm{s}_{2}-\sqrt{m}\bm{r}\end{bmatrix}\right\|_{2}\\ &=\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\max_{\bm{u}\in\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r}-\bm{s}_{1},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r}-\bm{s}_{2},\bm{u}_{2}\right\rangle\\ &=\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\max_{\bm{u}\in\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\bm{s},\bm{u}\right\rangle\\ &\geq\max_{\bm{u}\in\mathbb{S}^{n+m-1}}\left[\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle-\max_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\left\langle\bm{s},\bm{u}\right\rangle\right]\\ &=\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle.\end{split}

The inequality is due to the max-min inequality. The last line has used the fact that max𝒔(𝒯f×𝒯g)𝒔,𝒖=0\max_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\left\langle\bm{s},\bm{u}\right\rangle=0 when 𝒖𝒯f×𝒯g\bm{u}\in\mathcal{T}_{f}\times\mathcal{T}_{g}, otherwise it equals \infty. Thus we obtain

min𝒓𝕊m1min𝒔(𝒯f×𝒯g)𝒔𝑨T𝒓2min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝚽T𝒓,𝒖1+m𝒓,𝒖2.\displaystyle\min_{\bm{r}\in\mathbb{S}^{m-1}}\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}\geq\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle. (37)

We then use Gordon’s Lemma to bound the probability of the targeted event from below. To this end, for any 𝒓𝕊m1\bm{r}\in\mathbb{S}^{m-1} and (𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1(\bm{u}_{1},\bm{u}_{2})\in(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}, define the following two Gaussian processes

X𝒓,(𝒖1,𝒖2):=𝚽T𝒓,𝒖1+𝒖12gX_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}:=\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\|\bm{u}_{1}\|_{2}\cdot g~{}

and

Y𝒓,(𝒖1,𝒖2):=𝒈,𝒖1+𝒖12𝒉,𝒓,Y_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}:=\left\langle\bm{g},\bm{u}_{1}\right\rangle+\|\bm{u}_{1}\|_{2}\left\langle\bm{h},\bm{r}\right\rangle,

where g𝒩(0,1)g\sim\mathcal{N}(0,1), 𝒈𝒩(𝟎,𝑰n)\bm{g}\sim\mathcal{N}(\bm{0},\bm{I}_{n}), and 𝒉𝒩(𝟎,𝑰m)\bm{h}\sim\mathcal{N}(\bm{0},\bm{I}_{m}) are independent of each other. It can be easily checked that these two Gaussian processes satisfy the conditions in Gordon’s Lemma:

𝔼X𝒓,(𝒖1,𝒖2)2\displaystyle\operatorname{\mathbb{E}}X_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}^{2} =2𝒖122=𝔼Y𝒓,(𝒖1,𝒖2)2,\displaystyle=2\|\bm{u}_{1}\|_{2}^{2}=\operatorname{\mathbb{E}}Y_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}^{2},
𝔼[X𝒓,(𝒖1,𝒖2)X𝒓,(𝒖1,𝒖2)]𝔼[Y𝒓,(𝒖1,𝒖2)Y𝒓,(𝒖1,𝒖2)]\displaystyle\operatorname{\mathbb{E}}[X_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}X_{\bm{r}^{\prime},(\bm{u}_{1}^{\prime},\bm{u}_{2}^{\prime})}]-\operatorname{\mathbb{E}}[Y_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}Y_{\bm{r}^{\prime},(\bm{u}_{1}^{\prime},\bm{u}_{2}^{\prime})}] =𝒓,𝒓𝒖1,𝒖1+𝒖12𝒖12𝒖1,𝒖1𝒖12𝒖12𝒓,𝒓\displaystyle=\left\langle\bm{r},\bm{r}^{\prime}\right\rangle\left\langle\bm{u}_{1},\bm{u}_{1}^{\prime}\right\rangle+\|\bm{u}_{1}\|_{2}\|\bm{u}_{1}^{\prime}\|_{2}-\left\langle\bm{u}_{1},\bm{u}_{1}^{\prime}\right\rangle-\|\bm{u}_{1}\|_{2}\|\bm{u}_{1}^{\prime}\|_{2}\left\langle\bm{r},\bm{r}^{\prime}\right\rangle
=(1𝒓,𝒓)(𝒖12𝒖12𝒖1,𝒖1)\displaystyle=\left(1-\left\langle\bm{r},\bm{r}^{\prime}\right\rangle\right)\left(\|\bm{u}_{1}\|_{2}\|\bm{u}_{1}^{\prime}\|_{2}-\left\langle\bm{u}_{1},\bm{u}_{1}^{\prime}\right\rangle\right)
0.\displaystyle\geq 0.

Here, in the last line, the equality holds when 𝒓=𝒓\bm{r}=\bm{r}^{\prime}. It follows from Gordon’s Lemma (Fact 1) that (by setting τ𝒓,(𝒖1,𝒖2)=m𝒓,𝒖2+0+\tau_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}=-\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle+0_{+})

{min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1Y𝒓,(𝒖1,𝒖2)τ𝒓,(𝒖1,𝒖2)}={min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝒈,𝒖1+𝒖12𝒉,𝒓+m𝒓,𝒖2>0}\displaystyle\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}Y_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}\geq\tau_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}\rule{0.0pt}{8.53581pt}\right\}=\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{g},\bm{u}_{1}\right\rangle+\|\bm{u}_{1}\|_{2}\left\langle\bm{h},\bm{r}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
{min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1X𝒓,(𝒖1,𝒖2)τ𝒓,(𝒖1,𝒖2)}\displaystyle\hskip 150.0pt\leq\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}X_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}\geq\tau_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}\rule{0.0pt}{8.53581pt}\right\}
={min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝚽T𝒓,𝒖1+𝒖12g+m𝒓,𝒖2>0}\displaystyle\hskip 150.0pt=\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\|\bm{u}_{1}\|_{2}\cdot g+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
12+12{min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝚽T𝒓,𝒖1+𝒖12g+m𝒓,𝒖2>0|g0}\displaystyle\hskip 150.0pt\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\|\bm{u}_{1}\|_{2}\cdot g+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle>0\Big{|}g\leq 0\rule{0.0pt}{8.53581pt}\right\}
12+12{min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝚽T𝒓,𝒖1+m𝒓,𝒖2>0},\displaystyle\hskip 150.0pt\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\},

which implies

{min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝚽T𝒓,𝒖1+m𝒓,𝒖2>0}\displaystyle\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
2{min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝒈,𝒖1+𝒖12𝒉,𝒓+m𝒓,𝒖2:=2>0}1.\displaystyle\hskip 135.0pt\geq 2\mathbb{P}\left\{\underbrace{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{g},\bm{u}_{1}\right\rangle+\|\bm{u}_{1}\|_{2}\left\langle\bm{h},\bm{r}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle}_{:=\mathscr{E}_{2}}>0\rule{0.0pt}{8.53581pt}\right\}-1. (38)

Moreover, ℰ₂ can be bounded from below as follows:

2=min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝒈,𝒖1+𝒓,𝒖12𝒉+m𝒖2max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1min𝒓𝕊m1𝒈,𝒖1+𝒓,𝒖12𝒉+m𝒖2=max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝒈,𝒖1𝒖12𝒉+m𝒖22=maxt[0,1]max𝒖1𝒯f𝕊n1𝒖2𝒯g𝕊m1t𝒈,𝒖1t𝒉+m(1t2)𝒖22.\begin{split}\mathscr{E}_{2}&=\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\left\langle\bm{g},\bm{u}_{1}\right\rangle+\left\langle\bm{r},\|\bm{u}_{1}\|_{2}\bm{h}+\sqrt{m}\bm{u}_{2}\right\rangle\\ &\geq\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\min_{\bm{r}\in\mathbb{S}^{m-1}}\left\langle\bm{g},\bm{u}_{1}\right\rangle+\left\langle\bm{r},\|\bm{u}_{1}\|_{2}\bm{h}+\sqrt{m}\bm{u}_{2}\right\rangle\\ &=\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\left\langle\bm{g},\bm{u}_{1}\right\rangle-\bigg{\|}\|\bm{u}_{1}\|_{2}\bm{h}+\sqrt{m}\bm{u}_{2}\bigg{\|}_{2}\\ &=\max_{t\in[0,1]}\max_{\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{u}_{2}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}t\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\bigg{\|}t\bm{h}+\sqrt{m(1-t^{2})}\bm{u}_{2}^{\prime}\bigg{\|}_{2}.\end{split}

The inequality holds because of the max-min inequality. In the last line, we have let 𝒖12=t,𝒖22=1t2\|\bm{u}_{1}\|_{2}=t,~{}\|\bm{u}_{2}\|_{2}=\sqrt{1-t^{2}}, 𝒖1=𝒖1/𝒖12,\bm{u}_{1}^{\prime}={\bm{u}_{1}}/{\|\bm{u}_{1}\|_{2}}, and 𝒖2=𝒖2/𝒖22\bm{u}_{2}^{\prime}={\bm{u}_{2}}/{\|\bm{u}_{2}\|_{2}}.

Define

W(𝒈,𝒉,t):=max𝒖1𝒯f𝕊n1𝒖2𝒯g𝕊m1𝒈,𝒖1𝒉+m(1t21)𝒖22.\begin{split}W(\bm{g},\bm{h},t):=\max_{\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{u}_{2}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\bigg{\|}\bm{h}+\sqrt{m(\frac{1}{t^{2}}-1)}\cdot\bm{u}_{2}^{\prime}\bigg{\|}_{2}.\end{split}

Let

t_{2}\in\arg\max_{t\in[0,1]}W(\bm{g},\bm{h},t).

Clearly, t₂≠0, since W(𝒈,𝒉,t)→−∞ as t→0₊. (Recall that the effective domain of an extended real-valued function k:X→ℝ̄ is defined as {x∈X | k(x)∈ℝ∪{−∞}}.) Then we have

\begin{split}
&\mathbb{P}\left\{\max_{t\in[0,1]}\max_{\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1},\,\bm{u}_{2}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}t\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\Big\|t\bm{h}+\sqrt{m(1-t^{2})}\bm{u}_{2}^{\prime}\Big\|_{2}>0\right\}=\mathbb{P}\left\{\max_{t\in[0,1]}t\cdot W(\bm{g},\bm{h},t)>0\right\}\\
&\qquad\geq\mathbb{P}\left\{t_{2}\cdot W(\bm{g},\bm{h},t_{2})>0\right\}=\mathbb{P}\left\{W(\bm{g},\bm{h},t_{2})>0\right\}=\mathbb{P}\left\{\max_{t\in(0,1]}W(\bm{g},\bm{h},t)>0\right\}.
\end{split}  (39)

Combining (37), (38), and (39), we obtain

{min𝒓𝕊m1min𝒔(𝒯f×𝒯g)𝒔𝑨T𝒓2>0}2{maxt(0,1]W(𝒈,𝒉,t)>0}1.\begin{split}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{\max_{t\in(0,1]}W(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}-1.\end{split} (40)

Therefore, our goal reduces to establishing a lower bound for ℙ{max_{t∈(0,1]}W(𝒈,𝒉,t)>0}.

Step 2: Establish the lower bound for ℙ{max_{t∈(0,1]}W(𝒈,𝒉,t)>0}. In this step, we again use the Gaussian concentration inequality.

Similar to the success case, maxt(0,1]W(𝒈,𝒉,t)\max_{t\in(0,1]}W(\bm{g},\bm{h},t) can be rewritten as

maxt(0,1]W(𝒈,𝒉,t)=maxt(0,1]max𝒖1𝒯f𝕊n1𝒖2𝒯g𝕊m1𝒈,𝒖1𝒉+m(1t21)𝒖22=max𝒖1𝒯f𝕊n1maxt0𝒖2𝒯g𝕊m1𝒈,𝒖1𝒉+t𝒖22=max𝒙𝒯g𝒖1𝒯f𝕊n1𝒈,𝒖1𝒉+𝒙2.\begin{split}\max_{t\in(0,1]}W(\bm{g},\bm{h},t)&=\max_{t\in(0,1]}\max_{\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{u}_{2}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\bigg{\|}\bm{h}+\sqrt{m(\frac{1}{t^{2}}-1)}\cdot\bm{u}_{2}^{\prime}\bigg{\|}_{2}\\ &=\max_{\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\max_{t^{\prime}\geq 0\atop\bm{u}_{2}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\bigg{\|}\bm{h}+t^{\prime}\cdot\bm{u}_{2}^{\prime}\bigg{\|}_{2}\\ &=\max_{\bm{x}\in\mathcal{T}_{g}\atop\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\|\bm{h}+\bm{x}\|_{2}.\\ \end{split}

It then follows from Lemma 4 that maxt(0,1]W(𝒈,𝒉,t)\max_{t\in(0,1]}W(\bm{g},\bm{h},t) is a 2\sqrt{2}-Lipschitz function. Moreover, its expectation can be bounded from below:

\begin{split}
\operatorname{\mathbb{E}}\max_{t\in(0,1]}W(\bm{g},\bm{h},t)&=\operatorname{\mathbb{E}}\left(\max_{\bm{x}\in\mathcal{T}_{g},\,\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\|\bm{h}+\bm{x}\|_{2}\right)\\
&=\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})-\operatorname{\mathbb{E}}\operatorname{dist}(\bm{h},\mathcal{T}_{g})\\
&\geq\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})-\sqrt{\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{h},\mathcal{T}_{g})}\\
&=\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})-\sqrt{m-\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{h},\mathcal{T}_{g}^{\circ})}\\
&=\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})-\sqrt{m-\operatorname{\mathbb{E}}\Big(\max_{\bm{x}\in\mathcal{T}_{g}\cap\mathbb{B}_{2}^{m}}\left\langle\bm{h},\bm{x}\right\rangle\Big)^{2}}\\
&\geq\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})-\sqrt{m-\omega^{2}(\mathcal{T}_{g}\cap\mathbb{S}^{m-1})}\\
&\geq\sqrt{\omega^{2}(\mathcal{T}_{g}\cap\mathbb{S}^{m-1})+\omega^{2}(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})}-\sqrt{m}\\
&\geq\epsilon.
\end{split}  (41)

The second line holds because −𝒉 and 𝒉 have the same distribution. The first inequality is due to Jensen's inequality. The fourth line has used Moreau's decomposition theorem (Fact 5). The next two lines follow from Facts 6 and 7, respectively. The last two lines hold because of the measurement condition (15).

Now applying the Gaussian concentration inequality (Fact 2) yields

{maxt(0,1]W(𝒈,𝒉,t)𝔼maxt(0,1]W(𝒈,𝒉,t)ϵ}exp(ϵ24),\displaystyle\mathbb{P}\left\{\max_{t\in(0,1]}W(\bm{g},\bm{h},t)-\operatorname{\mathbb{E}}\max_{t\in(0,1]}W(\bm{g},\bm{h},t)\leq-\epsilon\rule{0.0pt}{8.53581pt}\right\}\leq\exp\left(\frac{-\epsilon^{2}}{4}\right),

which in turn implies that

{maxt(0,1]W(𝒈,𝒉,t)>0}{maxt(0,1]W(𝒈,𝒉,t)>𝔼maxt(0,1]W(𝒈,𝒉,t)ϵ}1exp(ϵ24).\begin{split}\mathbb{P}\left\{\max_{t\in(0,1]}W(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}&\geq\mathbb{P}\left\{\max_{t\in(0,1]}W(\bm{g},\bm{h},t)>\operatorname{\mathbb{E}}\max_{t\in(0,1]}W(\bm{g},\bm{h},t)-\epsilon\rule{0.0pt}{8.53581pt}\right\}\\ &\geq 1-\exp\left(\frac{-\epsilon^{2}}{4}\right).\end{split} (42)

Step 3: Complete the proof. Putting (40) and (42) together, we have

{min𝒓𝕊m1min𝒔(𝒯f×𝒯g)𝒔𝑨T𝒓2>0}2{maxt(0,1]W(𝒈,𝒉,t)>0}112exp(ϵ24).\begin{split}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{\max_{t\in(0,1]}W(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}-1\\ &\geq 1-2\exp\left(\frac{-\epsilon^{2}}{4}\right).\end{split}

Thus we have established that when mω2(𝒯f𝕊n1)+ω2(𝒯g𝕊m1)ϵ\sqrt{m}\leq\sqrt{\omega^{2}\left(\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\right)+\omega^{2}\left(\mathcal{T}_{g}\cap\mathbb{S}^{m-1}\right)}-\epsilon, the constrained procedures (7) and (8) fail with probability at least 12exp(ϵ2/4)1-2\exp\left(-\epsilon^{2}/4\right). This completes the proof.

Appendix B Proofs of Lemma 2 and Theorem 2

In this appendix, we prove the phase transition result for the penalized recovery procedure. Some auxiliary lemmas and facts used in the proofs are included in Appendix E.

B-A Proof of Lemma 2

Proof.

The penalized recovery procedure (9) can be reformulated as the following unconstrained form

min𝒙f(𝒙)+λg(1m(𝒚𝚽𝒙)).\displaystyle\min_{\bm{x}}~{}f(\bm{x})+\lambda\cdot g\left(\frac{1}{\sqrt{m}}(\bm{y}-\bm{\Phi}\bm{x})\right).

Define F(𝒙)=f(𝒙)+λg(1m(𝒚𝚽𝒙))F(\bm{x})=f(\bm{x})+\lambda\cdot g\left(\frac{1}{\sqrt{m}}(\bm{y}-\bm{\Phi}\bm{x})\right). Clearly, F(𝒙)F(\bm{x}) is a proper convex function. It follows from [57, Theorems 23.8 and 23.9] that the subdifferential of FF at 𝒙\bm{x}^{\star} is given by

F(𝒙)=f(𝒙)λm𝚽Tg(𝒗).\displaystyle\partial F(\bm{x}^{\star})=\partial f(\bm{x}^{\star})-\frac{\lambda}{\sqrt{m}}\bm{\Phi}^{T}\cdot\partial g(\bm{v}^{\star}).

Moreover, F(𝒙) attains its minimum at 𝒙⋆ if and only if 𝟎∈∂F(𝒙⋆) [57, Theorem 27.1]. Therefore, if

𝟎𝚽Tg(𝒗)mλf(𝒙),\displaystyle\bm{0}\in\bm{\Phi}^{T}\cdot\partial g(\bm{v}^{\star})-\frac{\sqrt{m}}{\lambda}\partial f(\bm{x}^{\star}), (43)

then the penalized problem (9) succeeds. If 𝟎F(𝒙)\bm{0}\notin\partial F(\bm{x}^{\star}), i.e.,

\min_{\bm{a}\in\partial f(\bm{x}^{\star}),\,\bm{b}\in\partial g(\bm{v}^{\star})}\left\|\bm{\Phi}^{T}\bm{b}-\frac{\sqrt{m}}{\lambda}\bm{a}\right\|_{2}>0,

then the penalized problem (9) fails.

Clearly, (43) holds if 𝟎∈𝑴(𝒯_J∩𝕊^{n+m−1}), where 𝑴=[−(√m/λ)𝑰_n, 𝚽ᵀ]. Since f and g are proper convex functions, ∂f(𝒙⋆) and ∂g(𝒗⋆) are nonempty, closed convex sets, and hence 𝒯_J∩𝕊^{n+m−1} is nonempty, closed, and spherically convex. A direct application of the polarity principle (Fact 3) yields the desired sufficient condition (19).
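In the ℓ₁/ℓ₁ setting, the success condition (43) can be checked directly for a given instance: it asks for some 𝒃∈∂∥𝒗⋆∥₁ with (λ/√m)𝚽ᵀ𝒃∈∂∥𝒙⋆∥₁, and membership in an ℓ₁ subdifferential amounts to sign equalities on the support plus an ℓ∞ box constraint, i.e., a linear feasibility problem. The sketch below poses it in cvxpy; the function name and interface are our own.

import numpy as np
import cvxpy as cp

def certificate_exists(Phi, x_star, v_star, lam):
    """Check condition (43) for f = g = l1: does some b in the subdifferential
    of ||v*||_1 satisfy (lam / sqrt(m)) * Phi^T b in the subdifferential of
    ||x*||_1?  Returns True when the linear feasibility problem is solvable."""
    m = Phi.shape[0]
    Sx, Sv = np.flatnonzero(x_star), np.flatnonzero(v_star)
    b = cp.Variable(m)
    a = (lam / np.sqrt(m)) * (Phi.T @ b)
    constraints = [b[Sv] == np.sign(v_star[Sv]), cp.norm_inf(b) <= 1,
                   a[Sx] == np.sign(x_star[Sx]), cp.norm_inf(a) <= 1]
    problem = cp.Problem(cp.Minimize(0), constraints)
    problem.solve()
    return problem.status == cp.OPTIMAL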

B-B Proof of Theorem 2

Proof.

Success case: By Lemma 2, the penalized problem (9) succeeds if

min𝒓𝕊n1min𝒔𝒯J𝒔𝑴T𝒓2>0,\displaystyle\min_{\bm{r}\in\mathbb{S}^{n-1}}\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\|\bm{s}-\bm{M}^{T}\bm{r}\|_{2}>0,

where 𝑴=[mλ𝑰n,𝚽T]\bm{M}=[-\frac{\sqrt{m}}{\lambda}\bm{I}_{n},\bm{\Phi}^{T}]. So it is sufficient to show that if the number of measurements satisfies (20), then the above inequality holds with high probability. The proof is also divided into three steps.

Step 1: Problem reduction. In this step, we similarly use Gordon’s Lemma to convert the probability of the targeted event to another one which can be handled easily.

Let 𝒔=[𝒔1T,𝒔2T]T\bm{s}=[\bm{s}_{1}^{T},\bm{s}_{2}^{T}]^{T} and 𝒖=[𝒖1T,𝒖2T]T\bm{u}=[\bm{u}_{1}^{T},\bm{u}_{2}^{T}]^{T}. Note first that for any 𝒓𝕊n1\bm{r}\in\mathbb{S}^{n-1}, we have

min𝒔𝒯J𝒔𝑴T𝒓2=min𝒔𝒯J[𝒔1+mλ𝒓𝒔2𝚽𝒓]2=min𝒔𝒯Jmax𝒖𝕊n+m1𝒔1+mλ𝒓,𝒖1+𝒔2𝚽𝒓,𝒖2=min𝒔𝒯Jmax𝒖𝕊n+m1𝚽𝒓,𝒖2mλ𝒓,𝒖1𝒔,𝒖max𝒖𝕊n+m1[𝚽𝒓,𝒖2mλ𝒓,𝒖1max𝒔𝒯J𝒔,𝒖]=max𝒖𝒯J𝕊n+m1𝚽𝒓,𝒖2mλ𝒓,𝒖1.\begin{split}\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\|\bm{s}-\bm{M}^{T}\bm{r}\|_{2}&=\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\left\|\begin{bmatrix}\bm{s}_{1}+\frac{\sqrt{m}}{\lambda}\bm{r}\\ \bm{s}_{2}-\bm{\Phi}\bm{r}\end{bmatrix}\right\|_{2}\\ &=\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\max_{\bm{u}\in\mathbb{S}^{n+m-1}}\left\langle\bm{s}_{1}+\frac{\sqrt{m}}{\lambda}\bm{r},-\bm{u}_{1}\right\rangle+\left\langle\bm{s}_{2}-\bm{\Phi}\bm{r},-\bm{u}_{2}\right\rangle\\ &=\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\max_{\bm{u}\in\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle-\left\langle\bm{s},\bm{u}\right\rangle\\ &\geq\max_{\bm{u}\in\mathbb{S}^{n+m-1}}\left[\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle-\max_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\left\langle\bm{s},\bm{u}\right\rangle\right]\\ &=\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle.\end{split}

The inequality is due to the max-min inequality. The last step has used the fact that max𝒔𝒯J𝒔,𝒖=0\max_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\left\langle\bm{s},\bm{u}\right\rangle=0 when 𝒖𝒯J\bm{u}\in\mathcal{T}_{J}, otherwise it equals \infty. Thus we obtain

min𝒓𝕊n1min𝒔𝒯J𝒔𝑴T𝒓2min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝚽𝒓,𝒖2mλ𝒓,𝒖1.\displaystyle\min_{\bm{r}\in\mathbb{S}^{n-1}}\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\|\bm{s}-\bm{M}^{T}\bm{r}\|_{2}\quad\geq\quad\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle. (44)

We then use Gordon’s Lemma to establish a lower bound for the probability of the targeted event. To this end, for any 𝒓𝕊n1\bm{r}\in\mathbb{S}^{n-1} and 𝒖𝒯J𝕊n+m1\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}, define the following two Gaussian processes

X𝒓,𝒖:=𝚽𝒓,𝒖2+𝒖22gX_{\bm{r},\bm{u}}:=\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle+\|\bm{u}_{2}\|_{2}\cdot g~{}~{}~{}

and

Y𝒓,𝒖:=𝒉,𝒖2+𝒖22𝒈,𝒓Y_{\bm{r},\bm{u}}:=\left\langle\bm{h},\bm{u}_{2}\right\rangle+\|\bm{u}_{2}\|_{2}\left\langle\bm{g},\bm{r}\right\rangle

where g𝒩(0,1)g\sim\mathcal{N}(0,1), 𝒉𝒩(𝟎,𝑰m)\bm{h}\sim\mathcal{N}(\bm{0},\bm{I}_{m}), and 𝒈𝒩(𝟎,𝑰n)\bm{g}\sim\mathcal{N}(\bm{0},\bm{I}_{n}) are independent of each other. Direct calculations show that these two defined Gaussian processes satisfy the conditions in Gordon’s Lemma:

𝔼X𝒓,𝒖2\displaystyle\operatorname{\mathbb{E}}X_{\bm{r},\bm{u}}^{2} =2𝒖222=𝔼Y𝒓,𝒖2,\displaystyle=2\|\bm{u}_{2}\|_{2}^{2}=\operatorname{\mathbb{E}}Y_{\bm{r},\bm{u}}^{2},
𝔼[X𝒓,𝒖X𝒓,𝒖]𝔼[Y𝒓,𝒖Y𝒓,𝒖]\displaystyle\operatorname{\mathbb{E}}[X_{\bm{r},\bm{u}}X_{\bm{r}^{\prime},\bm{u}^{\prime}}]-\operatorname{\mathbb{E}}[Y_{\bm{r},\bm{u}}Y_{\bm{r}^{\prime},\bm{u}^{\prime}}] =𝒓,𝒓𝒖2,𝒖2+𝒖22𝒖22𝒖2,𝒖2𝒖22𝒖22𝒓,𝒓\displaystyle=\left\langle\bm{r},\bm{r}^{\prime}\right\rangle\left\langle\bm{u}_{2},\bm{u}_{2}^{\prime}\right\rangle+\|\bm{u}_{2}\|_{2}\|\bm{u}_{2}^{\prime}\|_{2}-\left\langle\bm{u}_{2},\bm{u}_{2}^{\prime}\right\rangle-\|\bm{u}_{2}\|_{2}\|\bm{u}_{2}^{\prime}\|_{2}\left\langle\bm{r},\bm{r}^{\prime}\right\rangle
=(1𝒓,𝒓)(𝒖22𝒖22𝒖2,𝒖2)\displaystyle=\left(1-\left\langle\bm{r},\bm{r}^{\prime}\right\rangle\right)\left(\|\bm{u}_{2}\|_{2}\|\bm{u}_{2}^{\prime}\|_{2}-\left\langle\bm{u}_{2},\bm{u}_{2}^{\prime}\right\rangle\right)
0.\displaystyle\geq 0.

In the last line, the equality holds when 𝒓=𝒓\bm{r}=\bm{r}^{\prime}. It follows from Gordon’s lemma (Fact 1) that (by setting τ𝒓,𝒖=mλ𝒖1T𝒓+0+\tau_{\bm{r},\bm{u}}=\frac{\sqrt{m}}{\lambda}\bm{u}_{1}^{T}\bm{r}+0_{+})

{min𝒓𝕊n1max𝒖𝒯J𝕊n+m1Y𝒓,𝒖τ𝒓,𝒖}\displaystyle\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}Y_{\bm{r},\bm{u}}\geq\tau_{\bm{r},\bm{u}}\rule{0.0pt}{8.53581pt}\right\} ={min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝒉,𝒖2+𝒖22𝒈,𝒓mλ𝒓,𝒖1>0}\displaystyle=\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{h},\bm{u}_{2}\right\rangle+\|\bm{u}_{2}\|_{2}\left\langle\bm{g},\bm{r}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
{min𝒓𝕊n1max𝒖𝒯J𝕊n+m1X𝒓,𝒖τ𝒓,𝒖}\displaystyle\leq\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}X_{\bm{r},\bm{u}}\geq\tau_{\bm{r},\bm{u}}\rule{0.0pt}{8.53581pt}\right\}
={min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝚽𝒓,𝒖2+𝒖22gmλ𝒓,𝒖1>0}\displaystyle=\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle+\|\bm{u}_{2}\|_{2}\cdot g-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
12+12{min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝚽𝒓,𝒖2+𝒖22gmλ𝒓,𝒖1>0|g0}\displaystyle\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle+\|\bm{u}_{2}\|_{2}\cdot g-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle>0~{}\big{|}~{}g\leq 0\rule{0.0pt}{8.53581pt}\right\}
12+12{min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝚽𝒓,𝒖2mλ𝒓,𝒖1>0},\displaystyle\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\},

which implies

{min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝚽𝒓,𝒖2mλ𝒓,𝒖1>0}\displaystyle\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
2{min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝒉,𝒖2+𝒖22𝒈,𝒓mλ𝒓,𝒖1:=3>0}1.\displaystyle\hskip 135.0pt\geq 2\mathbb{P}\left\{\underbrace{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{h},\bm{u}_{2}\right\rangle+\|\bm{u}_{2}\|_{2}\left\langle\bm{g},\bm{r}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle}_{:=~{}\mathscr{E}_{3}}>0\rule{0.0pt}{8.53581pt}\right\}-1. (45)

Moreover, 3\mathscr{E}_{3} can be bounded from below as follows

3=min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝒉,𝒖2+𝒓,𝒖22𝒈mλ𝒖1max𝒖𝒯J𝕊n+m1min𝒓𝕊n1𝒉,𝒖2+𝒓,𝒖22𝒈mλ𝒖1=max𝒖𝒯J𝕊n+m1𝒉,𝒖2𝒖22𝒈mλ𝒖12=max𝒂f(𝒙)𝒃g(𝒗)𝒃2𝒂22+𝒃22:=c(𝒂,𝒃)[𝒉,𝒃𝒃2𝒈mλ𝒃2𝒂2]\begin{split}\mathscr{E}_{3}&=\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{h},\bm{u}_{2}\right\rangle+\left\langle\bm{r},\|\bm{u}_{2}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{u}_{1}\right\rangle\\ &\geq\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\min_{\bm{r}\in\mathbb{S}^{n-1}}\left\langle\bm{h},\bm{u}_{2}\right\rangle+\left\langle\bm{r},\|\bm{u}_{2}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{u}_{1}\right\rangle\\ &=\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{h},\bm{u}_{2}\right\rangle-\left\|\|\bm{u}_{2}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{u}_{1}\right\|_{2}\\ &=\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\underbrace{\frac{\|\bm{b}\|_{2}}{\sqrt{\|\bm{a}\|_{2}^{2}+\|\bm{b}\|_{2}^{2}}}}_{:=c(\bm{a},\bm{b})}\cdot\left[\left\langle\bm{h},\frac{\bm{b}}{\|\bm{b}\|_{2}}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda\|\bm{b}\|_{2}}\bm{a}\right\|_{2}\right]\\ \end{split}

The first inequality holds because of the max-min inequality. In the last line, recall that the joint cone is defined as 𝒯J={t(𝒂,𝒃):t0,𝒂f(𝒙),𝒃g(𝒗)}\mathcal{T}_{J}=\{t\cdot(\bm{a},\bm{b}):t\geq 0,~{}\bm{a}\in\partial f(\bm{x}^{\star}),~{}\bm{b}\in\partial g(\bm{v}^{\star})\}, so we have let 𝒖1=𝒂𝒂22+𝒃22\bm{u}_{1}=\frac{\bm{a}}{\sqrt{\|\bm{a}\|_{2}^{2}+\|\bm{b}\|_{2}^{2}}} and 𝒖2=𝒃𝒂22+𝒃22\bm{u}_{2}=\frac{\bm{b}}{\sqrt{\|\bm{a}\|_{2}^{2}+\|\bm{b}\|_{2}^{2}}} for 𝒂f(𝒙),𝒃g(𝒗)\bm{a}\in\partial f(\bm{x}^{\star}),~{}\bm{b}\in\partial g(\bm{v}^{\star}).

Since f(𝒙)\partial f(\bm{x}^{\star}) and g(𝒗)\partial g(\bm{v}^{\star}) are nonempty and closed, we choose (𝒂0,𝒃0)(\bm{a}_{0},\bm{b}_{0}) such that

(𝒂0,𝒃0)argmax𝒂f(𝒙)𝒃g(𝒗)𝒉,𝒃𝒃2𝒈mλ𝒃2𝒂2,(\bm{a}_{0},\bm{b}_{0})\in\arg\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\left\langle\bm{h},\frac{\bm{b}}{\|\bm{b}\|_{2}}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda\|\bm{b}\|_{2}}\bm{a}\right\|_{2},

which leads to

3c(𝒂0,𝒃0)[𝒉,𝒃0𝒃02𝒈mλ𝒃02𝒂02]=c(𝒂0,𝒃0)max𝒂f(𝒙)𝒃g(𝒗)𝒉,𝒃𝒃2𝒈mλ𝒃2𝒂2=c(𝒂0,𝒃0)maxαtβmax𝒂f(𝒙)𝒃g(𝒗)t𝕊m1𝒉,𝒃t𝒈mλt𝒂2\begin{split}\mathscr{E}_{3}&\geq c(\bm{a}_{0},\bm{b}_{0})\cdot\left[\left\langle\bm{h},\frac{\bm{b}_{0}}{\|\bm{b}_{0}\|_{2}}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda\|\bm{b}_{0}\|_{2}}\bm{a}_{0}\right\|_{2}\right]\\ &=c(\bm{a}_{0},\bm{b}_{0})\cdot\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\left\langle\bm{h},\frac{\bm{b}}{\|\bm{b}\|_{2}}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda\|\bm{b}\|_{2}}\bm{a}\right\|_{2}\\ &=c(\bm{a}_{0},\bm{b}_{0})\cdot\max_{\alpha\leq t\leq\beta}\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t\mathbb{S}^{m-1}}\left\langle\bm{h},\frac{\bm{b}}{t}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t}\bm{a}\right\|_{2}\\ \end{split}

In the last line, we have let 𝒃2=t\|\bm{b}\|_{2}=t, α=min𝒃g(𝒗)𝒃2\alpha=\min_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}, and β=max𝒃g(𝒗)𝒃2\beta=\max_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}.

Define

L(𝒈,𝒉,t):=max𝒂f(𝒙)𝒃g(𝒗)t𝕊m1𝒉,𝒃t𝒈mλt𝒂2,L(\bm{g},\bm{h},t):=\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t\mathbb{S}^{m-1}}\left\langle\bm{h},\frac{\bm{b}}{t}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t}\bm{a}\right\|_{2},

and choose t3t_{3} such that

t3argminαtβ𝔼\displaystyle t_{3}\in\arg\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}} [2dist(𝒈,mλtf(𝒙))+dist2(𝒉,1tg(𝒗)𝕊m1)1].\displaystyle\left[2\cdot\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1\right].

Then we have

3c(𝒂0,𝒃0)maxαtβL(𝒈,𝒉,t)c(𝒂0,𝒃0)L(𝒈,𝒉,t3).\begin{split}\mathscr{E}_{3}&\geq c(\bm{a}_{0},\bm{b}_{0})\cdot\max_{\alpha\leq t\leq\beta}L(\bm{g},\bm{h},t)\\ &\geq c(\bm{a}_{0},\bm{b}_{0})\cdot L(\bm{g},\bm{h},t_{3}).\end{split} (46)

Combining (44), (B-B), and (46) yields

{min𝒓𝕊n1min𝒔𝒯J𝒔𝑴T𝒓2>0}2{c(𝒂0,𝒃0)L(𝒈,𝒉,t3)>0}1=2{L(𝒈,𝒉,t3)>0}1,\begin{split}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\|\bm{s}-\bm{M}^{T}\bm{r}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{c(\bm{a}_{0},\bm{b}_{0})\cdot L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}-1\\ &=2\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}-1,\end{split} (47)

where the last line holds because 0g(𝒗)0\notin\partial g(\bm{v}^{\star}), and hence c(𝒂0,𝒃0)>0c(\bm{a}_{0},\bm{b}_{0})>0. Thus it suffices to establish the lower bound for {L(𝒈,𝒉,t3)>0}\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}.

Step 2: Establish the lower bound for {L(g,h,t3)>0}\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}. In this step, we apply the Gaussian concentration inequality to establish the lower bound for {L(𝒈,𝒉,t3)>0}\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}.

It follows from Lemma 4 that the function L(𝒈,𝒉,t3)L(\bm{g},\bm{h},t_{3}) is a 2\sqrt{2}-Lipschitz function. Its expectation can be bounded from below as follows:

𝔼L(𝒈,𝒉,t3)=𝔼max𝒂f(𝒙)𝒃g(𝒗)t3𝕊m1𝒉,𝒃t3𝒈mλt3𝒂2=𝔼max𝒂f(𝒙)𝒃g(𝒗)t3𝕊m11+𝒉22𝒉𝒃t3222𝒈mλt3𝒂2=m2𝔼min𝒂f(𝒙)𝒃g(𝒗)t3𝕊m1(𝒈mλt3𝒂2+12𝒉𝒃t32212)=m2𝔼[dist(𝒈,mλt3f(𝒙))+12dist2(𝒉,1t3g(𝒗)𝕊m1)12]=m2minαtβ𝔼[dist(𝒈,mλtf(𝒙))+12dist2(𝒉,1tg(𝒗)𝕊m1)12]ϵ/2.\begin{split}\operatorname{\mathbb{E}}L(\bm{g},\bm{h},t_{3})&=\operatorname{\mathbb{E}}\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{3}\mathbb{S}^{m-1}}\left\langle\bm{h},\frac{\bm{b}}{t_{3}}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{3}}\bm{a}\right\|_{2}\\ &=\operatorname{\mathbb{E}}\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{3}\mathbb{S}^{m-1}}\frac{1+\|\bm{h}\|_{2}^{2}-\|\bm{h}-\frac{\bm{b}}{t_{3}}\|_{2}^{2}}{2}-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{3}}\bm{a}\right\|_{2}\\ &=\frac{m}{2}-\operatorname{\mathbb{E}}\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{3}\mathbb{S}^{m-1}}\left(\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{3}}\bm{a}\right\|_{2}+\frac{1}{2}\|\bm{h}-\frac{\bm{b}}{t_{3}}\|_{2}^{2}-\frac{1}{2}\right)\\ &=\frac{m}{2}-\operatorname{\mathbb{E}}\left[\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t_{3}}\partial f(\bm{x}^{\star})\right)+\frac{1}{2}\cdot\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t_{3}}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-\frac{1}{2}\right]\\ &=\frac{m}{2}-\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\frac{1}{2}\cdot\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-\frac{1}{2}\right]\\ &\geq\epsilon/2.\end{split}

The last line is due to the measurement condition (20).

Now using the Gaussian concentration inequality (Fact 2) yields

{L(𝒈,𝒉,t3)𝔼L(𝒈,𝒉,t3)ϵ/2}exp(ϵ216),\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})-\operatorname{\mathbb{E}}L(\bm{g},\bm{h},t_{3})\leq-\epsilon/2\rule{0.0pt}{8.53581pt}\right\}\leq\exp\left(\frac{-\epsilon^{2}}{16}\right),

which in turn implies

{L(𝒈,𝒉,t3)>0}{L(𝒈,𝒉,t3)>𝔼L(𝒈,𝒉,t3)ϵ/2}1exp(ϵ216).\begin{split}\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}&\geq\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>\operatorname{\mathbb{E}}L(\bm{g},\bm{h},t_{3})-\epsilon/2\rule{0.0pt}{8.53581pt}\right\}\\ &\geq 1-\exp\left(\frac{-\epsilon^{2}}{16}\right).\end{split} (48)

Step 3: Complete the proof.

Combining (47) and (48), we have

{min𝒓𝕊n1min𝒔𝒯J𝒔𝑴T𝒓2>0}2{L(𝒈,𝒉,t3)>0}112exp(ϵ216).\begin{split}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\|\bm{s}-\bm{M}^{T}\bm{r}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}-1\\ &\geq 1-2\exp\left(\frac{-\epsilon^{2}}{16}\right).\end{split}

This means that when m𝒞p(λ)+ϵm\geq\mathscr{C}_{p}(\lambda)+\epsilon, the penalized problem (9) succeeds with probability at least 12exp(ϵ216)1-2\exp\left(\frac{-\epsilon^{2}}{16}\right).

Failure case: According to Lemma 2, the penalized problem (9) fails if

min𝒂f(𝒙)𝒃g(𝒗)𝚽T𝒃mλ𝒂2>0.\displaystyle\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{\Phi}^{T}\bm{b}-\frac{\sqrt{m}}{\lambda}\bm{a}\|_{2}>0.

So it is enough to show that if the number of measurements satisfies (21), then the above inequality holds with high probability. For clarity, the proof is similarly divided into three steps.

Step 1: Problem reduction. In this step, we employ Gordon’s Lemma to convert the probability of the targeted event to another one which is easier to handle.

Note that

min𝒂f(𝒙)𝒃g(𝒗)𝚽T𝒃mλ𝒂2=min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝚽T𝒃,𝒖mλ𝒂,𝒖.\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{\Phi}^{T}\bm{b}-\frac{\sqrt{m}}{\lambda}\bm{a}\|_{2}=\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\left\langle\bm{\Phi}^{T}\bm{b},\bm{u}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle. (49)

We then use Gordon’s Lemma to establish a lower bound for the probability of the targeted event. To this end, for any (𝒂,𝒃)f(𝒙)×g(𝒗)(\bm{a},\bm{b})\in\partial f(\bm{x}^{\star})\times\partial g(\bm{v}^{\star}) and 𝒖𝕊n1\bm{u}\in\mathbb{S}^{n-1}, define the following two Gaussian processes

X(𝒂,𝒃),𝒖:=𝚽T𝒃,𝒖+𝒃2gX_{(\bm{a},\bm{b}),\bm{u}}:=\left\langle\bm{\Phi}^{T}\bm{b},\bm{u}\right\rangle+\|\bm{b}\|_{2}\cdot g

and

Y(𝒂,𝒃),𝒖:=𝒃2𝒈,𝒖+𝒉,𝒃,Y_{(\bm{a},\bm{b}),\bm{u}}:=\|\bm{b}\|_{2}\left\langle\bm{g},\bm{u}\right\rangle+\left\langle\bm{h},\bm{b}\right\rangle,

where g𝒩(0,1)g\sim\mathcal{N}(0,1), 𝒈𝒩(𝟎,𝑰n)\bm{g}\sim\mathcal{N}(\bm{0},\bm{I}_{n}), and 𝒉𝒩(𝟎,𝑰m)\bm{h}\sim\mathcal{N}(\bm{0},\bm{I}_{m}) are independent of each other. It is not hard to check that these two processes satisfy the conditions of Gordon’s Lemma:

𝔼X(𝒂,𝒃),𝒖2\displaystyle\operatorname{\mathbb{E}}X_{(\bm{a},\bm{b}),\bm{u}}^{2} =2𝒃22=𝔼Y(𝒂,𝒃),𝒖2,\displaystyle=2\|\bm{b}\|_{2}^{2}=\operatorname{\mathbb{E}}Y_{(\bm{a},\bm{b}),\bm{u}}^{2},
𝔼[X(𝒂,𝒃),𝒖X(𝒂,𝒃),𝒖]𝔼[Y(𝒂,𝒃),𝒖Y(𝒂,𝒃),𝒖]\displaystyle\operatorname{\mathbb{E}}[X_{(\bm{a},\bm{b}),\bm{u}}X_{(\bm{a}^{\prime},\bm{b}^{\prime}),\bm{u}^{\prime}}]-\operatorname{\mathbb{E}}[Y_{(\bm{a},\bm{b}),\bm{u}}Y_{(\bm{a}^{\prime},\bm{b}^{\prime}),\bm{u}^{\prime}}] =𝒖,𝒖𝒃,𝒃+𝒃2𝒃2𝒃,𝒃𝒃2𝒃2𝒖,𝒖\displaystyle=\left\langle\bm{u},\bm{u}^{\prime}\right\rangle\left\langle\bm{b},\bm{b}^{\prime}\right\rangle+\|\bm{b}\|_{2}\|\bm{b}^{\prime}\|_{2}-\left\langle\bm{b},\bm{b}^{\prime}\right\rangle-\|\bm{b}\|_{2}\|\bm{b}^{\prime}\|_{2}\left\langle\bm{u},\bm{u}^{\prime}\right\rangle
=(1𝒖,𝒖)(𝒃2𝒃2𝒃,𝒃)\displaystyle=\left(1-\left\langle\bm{u},\bm{u}^{\prime}\right\rangle\right)\left(\|\bm{b}\|_{2}\|\bm{b}^{\prime}\|_{2}-\left\langle\bm{b},\bm{b}^{\prime}\right\rangle\right)
0.\displaystyle\geq 0.

In the last line, the equality holds when 𝒃=𝒃\bm{b}=\bm{b}^{\prime}. It follows from Gordon’s Lemma (Fact 1) that (by setting τ(𝒂,𝒃),𝒖=mλ𝒂T𝒖+0+\tau_{(\bm{a},\bm{b}),\bm{u}}=\frac{\sqrt{m}}{\lambda}\bm{a}^{T}\bm{u}+0_{+})

{min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1Y(𝒂,𝒃),𝒖τ(𝒂,𝒃),𝒖}\displaystyle\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}Y_{(\bm{a},\bm{b}),\bm{u}}\geq\tau_{(\bm{a},\bm{b}),\bm{u}}\rule{0.0pt}{8.53581pt}\right\} ={min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝒃2𝒈,𝒖+𝒉,𝒃mλ𝒂,𝒖>0}\displaystyle=\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\|\bm{b}\|_{2}\left\langle\bm{g},\bm{u}\right\rangle+\left\langle\bm{h},\bm{b}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
{min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1X(𝒂,𝒃),𝒖τ(𝒂,𝒃),𝒖}\displaystyle\leq\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}X_{(\bm{a},\bm{b}),\bm{u}}\geq\tau_{(\bm{a},\bm{b}),\bm{u}}\rule{0.0pt}{8.53581pt}\right\}
={min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝚽T𝒃,𝒖+𝒃2gmλ𝒂,𝒖>0}\displaystyle=\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\left\langle\bm{\Phi}^{T}\bm{b},\bm{u}\right\rangle+\|\bm{b}\|_{2}\cdot g-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
12+12{min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝚽T𝒃,𝒖+𝒃2gmλ𝒂,𝒖>0|g0}\displaystyle\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\left\langle\bm{\Phi}^{T}\bm{b},\bm{u}\right\rangle+\|\bm{b}\|_{2}\cdot g-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle>0\Big{|}g\leq 0\rule{0.0pt}{8.53581pt}\right\}
12+12{min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝚽T𝒃,𝒖mλ𝒂,𝒖>0},\displaystyle\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\left\langle\bm{\Phi}^{T}\bm{b},\bm{u}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\},

which implies

{min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝚽T𝒃,𝒖mλ𝒂,𝒖>0}\displaystyle\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\left\langle\bm{\Phi}^{T}\bm{b},\bm{u}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
2{min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝒃2𝒈,𝒖+𝒉,𝒃mλ𝒂,𝒖:=4>0}1.\displaystyle\hskip 135.0pt\geq 2\mathbb{P}\left\{\underbrace{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\|\bm{b}\|_{2}\left\langle\bm{g},\bm{u}\right\rangle+\left\langle\bm{h},\bm{b}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle}_{:=\mathscr{E}_{4}}>0\rule{0.0pt}{8.53581pt}\right\}-1. (50)

Moreover, 4\mathscr{E}_{4} can be bounded from below as follows:

4=min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝒖,𝒃2𝒈mλ𝒂+𝒉,𝒃=min𝒂f(𝒙)𝒃g(𝒗)𝒃2𝒈mλ𝒂2+𝒉,𝒃=𝒃12𝒈mλ𝒂12+𝒉,𝒃1𝒃12min𝒂f(𝒙)𝒃g(𝒗)𝒈mλ𝒃2𝒂2+𝒉,𝒃𝒃2=𝒃12minαtβmin𝒂f(𝒙)𝒃g(𝒗)t𝕊m1𝒈mλt𝒂2+𝒉,𝒃t\begin{split}\mathscr{E}_{4}&=\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\left\langle\bm{u},\|\bm{b}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{a}\right\rangle+\left\langle\bm{h},\bm{b}\right\rangle\\ &=\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\left\|\|\bm{b}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{a}\right\|_{2}+\left\langle\bm{h},\bm{b}\right\rangle\\ &=\left\|\|\bm{b}_{1}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{a}_{1}\right\|_{2}+\left\langle\bm{h},\bm{b}_{1}\right\rangle\\ &\geq\|\bm{b}_{1}\|_{2}\cdot\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\left\|\bm{g}-\frac{\sqrt{m}}{\lambda\|\bm{b}\|_{2}}\bm{a}\right\|_{2}+\left\langle\bm{h},\frac{\bm{b}}{\|\bm{b}\|_{2}}\right\rangle\\ &=\|\bm{b}_{1}\|_{2}\cdot\min_{\alpha\leq t\leq\beta}\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t\mathbb{S}^{m-1}}\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t}\bm{a}\right\|_{2}+\left\langle\bm{h},\frac{\bm{b}}{t}\right\rangle\\ \end{split}

In the third line, we have chosen (𝒂1,𝒃1)(\bm{a}_{1},\bm{b}_{1}) such that

(𝒂1,𝒃1)argmin𝒂f(𝒙)𝒃g(𝒗)𝒃2𝒈mλ𝒂2+𝒉,𝒃.(\bm{a}_{1},\bm{b}_{1})\in\arg\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\left\|\|\bm{b}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{a}\right\|_{2}+\left\langle\bm{h},\bm{b}\right\rangle.

In the last line, we have let 𝒃2=t\|\bm{b}\|_{2}=t, α=min𝒃g(𝒗)𝒃2\alpha=\min_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}, and β=max𝒃g(𝒗)𝒃2\beta=\max_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}.

Define

V(𝒈,𝒉,t)=min𝒂f(𝒙)𝒃g(𝒗)t𝕊m1𝒈mλt𝒂2+𝒉,𝒃t,V(\bm{g},\bm{h},t)=\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t\mathbb{S}^{m-1}}\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t}\bm{a}\right\|_{2}+\left\langle\bm{h},\frac{\bm{b}}{t}\right\rangle,

and choose t4t_{4} such that

t4argminαtβV(𝒈,𝒉,t).\displaystyle t_{4}\in\arg\min_{\alpha\leq t\leq\beta}V(\bm{g},\bm{h},t).

Then we have

4𝒃12minαtβV(𝒈,𝒉,t)=𝒃12V(𝒈,𝒉,t4).\begin{split}\mathscr{E}_{4}&\geq\|\bm{b}_{1}\|_{2}\cdot\min_{\alpha\leq t\leq\beta}V(\bm{g},\bm{h},t)\\ &=\|\bm{b}_{1}\|_{2}\cdot V(\bm{g},\bm{h},t_{4}).\end{split} (51)

Combining (49), (B-B), and (51) yields

{min𝒂f(𝒙)𝒃g(𝒗)𝚽T𝒃mλ𝒂2>0}2{𝒃12V(𝒈,𝒉,t4)>0}1=2{V(𝒈,𝒉,t4)>0}1,\begin{split}\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{\Phi}^{T}\bm{b}-\frac{\sqrt{m}}{\lambda}\bm{a}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{\|\bm{b}_{1}\|_{2}\cdot V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}-1\\ &=2\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}-1,\end{split} (52)

where we have used the fact that 0g(𝒗)0\notin\partial g(\bm{v}^{\star}) and hence 𝒃12>0\|\bm{b}_{1}\|_{2}>0. Thus it is enough to establish the lower bound for {V(𝒈,𝒉,t4)>0}\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}.

Step 2: Establish the lower bound for {V(g,h,t4)>0}\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}. In this step, we use the Gaussian concentration inequality to establish the lower bound for {V(𝒈,𝒉,t4)>0}\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}.

By Lemma 4, the function V(𝒈,𝒉,t4)V(\bm{g},\bm{h},t_{4}) is a 2\sqrt{2}-Lipschitz function. Its expectation can be bounded from below as follows:

𝔼V(𝒈,𝒉,t4)=𝔼min𝒂f(𝒙)𝒃g(𝒗)t4𝕊m1𝒈mλt4𝒂2+𝒉,𝒃t4=𝔼min𝒂f(𝒙)𝒃g(𝒗)t4𝕊m1𝒈mλt4𝒂2+𝒉,𝒃t4=𝔼min𝒂f(𝒙)𝒃g(𝒗)t4𝕊m1𝒈mλt4𝒂2+𝒉𝒃t4221𝒉222=𝔼[min𝒂f(𝒙)𝒃g(𝒗)t4𝕊m1(𝒈mλt4𝒂2+12𝒉𝒃t42212)]m2minαtβ𝔼[min𝒂f(𝒙)𝒃g(𝒗)t𝕊m1(𝒈mλt𝒂2+12𝒉𝒃t2212)]m2=minαtβ𝔼[dist(𝒈,mλtf(𝒙))+12dist2(𝒉,1tg(𝒗)𝕊m1)12]m2ϵ/2.\begin{split}\operatorname{\mathbb{E}}V(\bm{g},\bm{h},t_{4})&=\operatorname{\mathbb{E}}\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{4}\mathbb{S}^{m-1}}\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{4}}\bm{a}\right\|_{2}+\left\langle\bm{h},\frac{\bm{b}}{t_{4}}\right\rangle\\ &=\operatorname{\mathbb{E}}\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{4}\mathbb{S}^{m-1}}\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{4}}\bm{a}\right\|_{2}+\left\langle-\bm{h},\frac{\bm{b}}{t_{4}}\right\rangle\\ &=\operatorname{\mathbb{E}}\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{4}\mathbb{S}^{m-1}}\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{4}}\bm{a}\right\|_{2}+\frac{\|\bm{h}-\frac{\bm{b}}{t_{4}}\|_{2}^{2}-1-\|\bm{h}\|_{2}^{2}}{2}\\ &=\operatorname{\mathbb{E}}\left[\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{4}\mathbb{S}^{m-1}}\left(\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{4}}\bm{a}\right\|_{2}+\frac{1}{2}\|\bm{h}-\frac{\bm{b}}{t_{4}}\|_{2}^{2}-\frac{1}{2}\right)\right]-\frac{m}{2}\\ &\geq\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t\mathbb{S}^{m-1}}\left(\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t}\bm{a}\right\|_{2}+\frac{1}{2}\|\bm{h}-\frac{\bm{b}}{t}\|_{2}^{2}-\frac{1}{2}\right)\right]-\frac{m}{2}\\ &=\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\frac{1}{2}\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-\frac{1}{2}\right]-\frac{m}{2}\\ &\geq\epsilon/2.\end{split} (53)

The second line is due to the fact that 𝒉-\bm{h} and 𝒉\bm{h} have the same distribution. The last inequality is due to the measurement condition (21).

Now it follows from the Gaussian concentration inequality (Fact 2) that

{V(𝒈,𝒉,t4)𝔼V(𝒈,𝒉,t4)ϵ/2}exp(ϵ216),\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})-\operatorname{\mathbb{E}}V(\bm{g},\bm{h},t_{4})\leq-\epsilon/2\rule{0.0pt}{8.53581pt}\right\}\leq\exp\left(\frac{-\epsilon^{2}}{16}\right),

which in turn implies

{V(𝒈,𝒉,t4)>0}{V(𝒈,𝒉,t4)>𝔼V(𝒈,𝒉,t4)ϵ/2}1exp(ϵ216).\begin{split}\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}&\geq\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>\operatorname{\mathbb{E}}V(\bm{g},\bm{h},t_{4})-\epsilon/2\rule{0.0pt}{8.53581pt}\right\}\\ &\geq 1-\exp\left(\frac{-\epsilon^{2}}{16}\right).\end{split} (54)

Step 3: Complete the proof.

Putting (52) and (54) together, we have

{min𝒂f(𝒙)𝒃g(𝒗)𝚽T𝒃mλ𝒂2>0}2{V(𝒈,𝒉,t4)>0}112exp(ϵ216).\begin{split}\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{\Phi}^{T}\bm{b}-\frac{\sqrt{m}}{\lambda}\bm{a}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}-1\\ &\geq 1-2\exp\left(\frac{-\epsilon^{2}}{16}\right).\end{split}

Thus we have shown that when m𝒞p(λ)ϵm\leq\mathscr{C}_{p}(\lambda)-\epsilon, the penalized problem (9) fails with probability at least 12exp(ϵ216)1-2\exp\left(\frac{-\epsilon^{2}}{16}\right). This completes the proof.

Appendix C Proof of Theorem 3

In this appendix, we prove our last theorem, which establishes the relationship between the constrained and penalized recovery procedures and illustrates how to select the optimal parameter λ\lambda for the penalized method. The auxiliary lemma and facts used in the proof are included in Appendix E.

Figure 6: Illustration of the distances of a vector to the scaled subdifferential tf(𝒙)t\cdot\partial f(\bm{x}^{\star}) and to the cone of the subdifferential cone(f(𝒙))\operatorname{cone}(\partial f(\bm{x}^{\star})).
Proof.

The core ingredient in the proof of Theorem 3 is the fact that the distance of a vector to the scaled subdifferential tf(𝒙)t\cdot\partial f(\bm{x}^{\star}) can always be bounded from below by that of this vector to the cone of the subdifferential cone(f(𝒙))\operatorname{cone}(\partial f(\bm{x}^{\star})) (see Fig. 6). With this observation in mind, we have

𝒞p(λ)\displaystyle\mathscr{C}_{p}(\lambda) =minαtβ𝔼[2dist(𝒈,mλtf(𝒙))+dist2(𝒉,1tg(𝒗)𝕊m1)1]\displaystyle=\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1\right]
𝔼[2dist(𝒈,cone(f(𝒙)))+dist2(𝒉,cone(g(𝒗))𝕊m1)1]\displaystyle\geq\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\operatorname{cone}(\partial f(\bm{x}^{\star}))\right)+\operatorname{dist}^{2}\left(\bm{h},\operatorname{cone}(\partial g(\bm{v}^{\star}))\cap\mathbb{S}^{m-1}\right)-1\right]
=𝔼[2dist(𝒈,cone(f(𝒙)))+min𝒃cone(g(𝒗))𝕊m1𝒉𝒃221]\displaystyle=\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\operatorname{cone}(\partial f(\bm{x}^{\star}))\right)+\min_{\bm{b}\in\operatorname{cone}(\partial g(\bm{v}^{\star}))\cap\mathbb{S}^{m-1}}\|\bm{h}-\bm{b}\|_{2}^{2}-1\right]
=𝔼[2dist(𝒈,cone(f(𝒙)))2max𝒃cone(g(𝒗))𝕊m1𝒉,𝒃+m].\displaystyle=\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\operatorname{cone}(\partial f(\bm{x}^{\star}))\right)-2\max_{\bm{b}\in\operatorname{cone}(\partial g(\bm{v}^{\star}))\cap\mathbb{S}^{m-1}}\left\langle\bm{h},\bm{b}\right\rangle+m\right]. (55)

Since, by assumption, 𝟎f(𝒙)\bm{0}\notin\partial f(\bm{x}^{\star}) and 𝟎g(𝒗)\bm{0}\notin\partial g(\bm{v}^{\star}), we obtain

cone(f(𝒙))=𝒩f(𝒙)andcone(g(𝒗))=𝒩g(𝒗).\operatorname{cone}(\partial f(\bm{x}^{\star}))=\mathcal{N}_{f}(\bm{x}^{\star})~{}~{}~{}\textrm{and}~{}\operatorname{cone}(\partial g(\bm{v}^{\star}))=\mathcal{N}_{g}(\bm{v}^{\star}).

Substituting the above equalities into (C) and rearranging yields

𝒞p(λ)m2𝔼[dist(𝒈,𝒩f(𝒙))max𝒃𝒩g(𝒗)𝕊m1𝒉,𝒃]=𝔼dist(𝒈,𝒩f(𝒙))ω(𝒩g(𝒗)𝕊m1).\displaystyle\frac{\mathscr{C}_{p}(\lambda)-m}{2}\geq\operatorname{\mathbb{E}}\left[\operatorname{dist}\left(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star})\right)-\max_{\bm{b}\in\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}}\left\langle\bm{h},\bm{b}\right\rangle\right]=\operatorname{\mathbb{E}}\operatorname{dist}(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star}))-\omega(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}). (56)

We are now ready to establish the bounds in Theorem 3. First consider the case in which m𝒞p(λ)m\geq\mathscr{C}_{p}(\lambda), then (56) implies that

0\displaystyle 0 𝔼2dist(𝒈,𝒩f(𝒙))ω2(𝒩g(𝒗)𝕊m1)\displaystyle\geq\operatorname{\mathbb{E}}^{2}\operatorname{dist}(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star}))-\omega^{2}(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})
𝔼dist2(𝒈,𝒩f(𝒙))1ω2(𝒩g(𝒗)𝕊m1)\displaystyle\geq\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star}))-1-\omega^{2}(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})
=𝔼(max𝒙𝒯f(𝒙)𝔹2n𝒙,𝒈)21ω2(𝒩g(𝒗)𝕊m1)\displaystyle=\operatorname{\mathbb{E}}\left(\max_{\bm{x}\in\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{B}_{2}^{n}}\left\langle\bm{x},\bm{g}\right\rangle\right)^{2}-1-\omega^{2}(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})
ω2(𝒯f(𝒙)𝕊n1)1ω2(𝒩g(𝒗)𝕊m1)\displaystyle\geq\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})-1-\omega^{2}(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})
ω2(𝒯f(𝒙)𝕊n1)1[mω2(𝒯g(𝒗)𝕊m1)]\displaystyle\geq\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})-1-[m-\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})]
=𝒞pm1.\displaystyle=\mathscr{C}_{p}-m-1.

The second inequality is due to the relation (10). The third line has used Fact 6 and the assumption that 𝒯f(𝒙)\mathcal{T}_{f}(\bm{x}^{\star}) is closed (thus 𝒩f(𝒙)=𝒯f(𝒙)\mathcal{N}_{f}^{\circ}(\bm{x}^{\star})=\mathcal{T}_{f}(\bm{x}^{\star})). The next two lines follow from Facts 7 and 8, respectively. Thus we have shown that if m𝒞p(λ)m\geq\mathscr{C}_{p}(\lambda), then m𝒞p1m\geq\mathscr{C}_{p}-1.

We next consider the case where m𝒞pm\geq\mathscr{C}_{p}. It is not hard to find that for αtβ\alpha\leq t\leq\beta, the two Gaussian distances in 𝒞p(λ)\mathscr{C}_{p}(\lambda) have the following lower bounds:

𝔼dist2(𝒉,1tg(𝒗)𝕊m1)𝔼dist2(𝒉,cone(g(𝒗))𝕊m1)\operatorname{\mathbb{E}}\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)\geq\operatorname{\mathbb{E}}\operatorname{dist}^{2}\left(\bm{h},\operatorname{cone}(\partial g(\bm{v}^{\star}))\cap\mathbb{S}^{m-1}\right)

and

𝔼dist(𝒈,mλtf(𝒙))𝔼dist(𝒈,cone(f(𝒙))).\operatorname{\mathbb{E}}\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)\geq\operatorname{\mathbb{E}}\operatorname{dist}\left(\bm{g},\operatorname{cone}(\partial f(\bm{x}^{\star}))\right).

Then we can choose

t=argminαtβη2(1tg(𝒗)𝕊m1)andλ=argminλ>0ζ(mλtf(𝒙))t^{\star}=\arg\min_{\alpha\leq t\leq\beta}\eta^{2}\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)~{}~{}\textrm{and}~{}~{}\lambda^{\star}=\arg\min_{\lambda>0}\zeta\left(\frac{\sqrt{m}}{\lambda t^{\star}}\partial f(\bm{x}^{\star})\right)

such that the above two Gaussian distances attain their lower bounds simultaneously. The second lower bound is achievable because m/λt{\sqrt{m}}/{\lambda t^{\star}} can take any positive value as λ\lambda ranges over (0,)(0,\infty). Thus 𝒞p(λ)\mathscr{C}_{p}(\lambda) attains its minimum at λ\lambda^{\star}, i.e.,

𝒞p(λ)\displaystyle\mathscr{C}_{p}(\lambda^{\star}) =𝔼[2dist(𝒈,cone(f(𝒙)))+dist2(𝒉,cone(g(𝒗))𝕊m1)1]\displaystyle=\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\operatorname{cone}(\partial f(\bm{x}^{\star}))\right)+\operatorname{dist}^{2}\left(\bm{h},\operatorname{cone}(\partial g(\bm{v}^{\star}))\cap\mathbb{S}^{m-1}\right)-1\right]
=𝔼[2dist(𝒈,𝒩f(𝒙))2max𝒃𝒩g(𝒗)𝕊m1𝒉,𝒃+m].\displaystyle=\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star})\right)-2\max_{\bm{b}\in\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}}\left\langle\bm{h},\bm{b}\right\rangle+m\right].

Further, we have the following upper bound

𝒞p(λ)m2\displaystyle\frac{\mathscr{C}_{p}(\lambda^{\star})-m}{2} =𝔼dist(𝒈,𝒩f(𝒙))ω(𝒩g(𝒗)𝕊m1)\displaystyle=\operatorname{\mathbb{E}}\operatorname{dist}(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star}))-\omega(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}) (57)
𝔼dist2(𝒈,𝒩f(𝒙))ω(𝒩g(𝒗)𝕊m1)\displaystyle\leq\sqrt{\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star}))}-\omega(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})
=δ(𝒯f(𝒙))ω(𝒩g(𝒗)𝕊m1)\displaystyle=\sqrt{\delta(\mathcal{T}_{f}(\bm{x}^{\star}))}-\omega(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})
ω2(𝒯f(𝒙)𝕊n1)+1δ(𝒩g(𝒗))1\displaystyle\leq\sqrt{\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+1}-\sqrt{\delta(\mathcal{N}_{g}(\bm{v}^{\star}))-1}
=ω2(𝒯f(𝒙)𝕊n1)+1mδ(𝒯g(𝒗))1\displaystyle=\sqrt{\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+1}-\sqrt{m-\delta(\mathcal{T}_{g}(\bm{v}^{\star}))-1}
ω2(𝒯f(𝒙)𝕊n1)+1mω2(𝒯g(𝒗)𝕊m1)2\displaystyle\leq\sqrt{\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+1}-\sqrt{m-\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})-2}
ω(𝒯f(𝒙)𝕊n1)+1mω2(𝒯g(𝒗)𝕊m1)+2\displaystyle\leq\omega(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+1-\sqrt{m-\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})}+\sqrt{2}
𝒞pm+1+2\displaystyle\leq\sqrt{\mathscr{C}_{p}}-\sqrt{m}+1+\sqrt{2}
1+2.\displaystyle\leq 1+\sqrt{2}.

Here, the first inequality is due to Jensen’s inequality. The next two lines have used Facts 6 and 7, respectively. The third equality holds because of Moreau’s decomposition theorem (Fact 5). The next inequality has used Fact 7 again. The next inequality follows from the elementary bounds \sqrt{a^{2}+1}\leq a+1 and \sqrt{b^{2}-2}\geq b-\sqrt{2}. The last two inequalities follow from the condition m𝒞pm\geq\mathscr{C}_{p}. Rearranging completes the proof.

It is worth noting that if we impose an extra condition on ω(𝒯f(𝒙)𝕊n1)\omega(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1}) (e.g., ω(𝒯f(𝒙)𝕊n1)4\omega(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})\geq 4, which can be easily satisfied in practical applications), then we can obtain a sharper upper bound than (57):

𝒞p(λ)m2\displaystyle\frac{\mathscr{C}_{p}(\lambda^{\star})-m}{2} ω2(𝒯f(𝒙)𝕊n1)+1mω2(𝒯g(𝒗)𝕊m1)2\displaystyle\leq\sqrt{\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+1}-\sqrt{m-\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})-2}
ω(𝒯f(𝒙)𝕊n1)+(1716)mω2(𝒯g(𝒗)𝕊m1)+(1614)\displaystyle\leq\omega(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+(\sqrt{17}-\sqrt{16})-\sqrt{m-\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})}+(\sqrt{16}-\sqrt{14})
𝒞pm+1714\displaystyle\leq\sqrt{\mathscr{C}_{p}}-\sqrt{m}+\sqrt{17}-\sqrt{14}
12.\displaystyle\leq\frac{1}{2}.

The second inequality is due to the facts that a2+1a+(1716)\sqrt{a^{2}+1}\leq a+(\sqrt{17}-\sqrt{16}) for a4a\geq 4 and b22b(1614)\sqrt{b^{2}-2}\geq b-(\sqrt{16}-\sqrt{14}) for b=mω2(𝒯g(𝒗)𝕊m1)4b=\sqrt{m-\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})}\geq 4. The last two inequalities have used the condition m𝒞pm\geq\mathscr{C}_{p}. Rearranging yields m𝒞p(λ)1m\geq\mathscr{C}_{p}(\lambda^{\star})-1, which leads to a smaller gap than the 2(1+\sqrt{2})\approx 5 implied by (57).
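For the record, both scalar bounds used above follow by rationalizing the difference of square roots; each holds with equality at a=b=4a=b=4, and the gap only shrinks for larger arguments:

\sqrt{a^{2}+1}-a=\frac{1}{\sqrt{a^{2}+1}+a}\leq\frac{1}{\sqrt{17}+4}=\sqrt{17}-\sqrt{16}~{}~{}(a\geq 4),\qquad b-\sqrt{b^{2}-2}=\frac{2}{\sqrt{b^{2}-2}+b}\leq\frac{2}{\sqrt{14}+4}=\sqrt{16}-\sqrt{14}~{}~{}(b\geq 4).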

Appendix D Evaluate 𝒞p\mathscr{C}_{p} and 𝒞p(λ)\mathscr{C}_{p}(\lambda) for Typical Structured Signal and Corruption

In Section III, we have shown that 𝒞p\mathscr{C}_{p} and 𝒞p(λ)\mathscr{C}_{p}(\lambda) can be accurately estimated by equations (16) and (23), respectively. Thus it is sufficient to evaluate two related functionals: the Gaussian squared distance to a scaled subdifferential and the spherical Gaussian width of a scaled subdifferential. There exist standard recipes for calculating these two quantities in the literature; see, e.g., [51, 37, 24]. For completeness, we calculate these two functionals for sparse vectors and low-rank matrices in this appendix.

D-A Calculation for Sparse Vectors

Let 𝒙\bm{x} be an ss-sparse vector in n\mathbb{R}^{n}, and let SS denote its support. We use the 1\ell_{1}-norm to promote the structure of sparse vectors. The scaled subdifferential of 𝒙1\|\bm{x}\|_{1} is given by

t𝒙1={𝒛n:𝒛i=tsgn(𝒙i)foriS,|𝒛i|tforiSc},t\cdot\partial\|\bm{x}\|_{1}=\{\bm{z}\in\mathbb{R}^{n}:\bm{z}_{i}=t\cdot\textrm{sgn}(\bm{x}_{i})~{}\textrm{for}~{}i\in S,~{}|\bm{z}_{i}|\leq t~{}\textrm{for}~{}i\in S^{c}\},

where ScS^{c} represents the complement of SS. Then the Gaussian squared distance to a scaled subdifferential can be calculated as

η2(t𝒙1)\displaystyle\eta^{2}(t\cdot\partial\|\bm{x}\|_{1}) =𝔼inf𝒛t𝒙1𝒈𝒛22=𝔼inf𝒛t𝒙1iS(𝒈i𝒛i)2+iSc(𝒈i𝒛i)2\displaystyle=\operatorname{\mathbb{E}}\inf_{\bm{z}\in t\cdot\partial\|\bm{x}\|_{1}}\|\bm{g}-\bm{z}\|_{2}^{2}=\operatorname{\mathbb{E}}\inf_{\bm{z}\in t\cdot\partial\|\bm{x}\|_{1}}\sum_{i\in S}(\bm{g}_{i}-\bm{z}_{i})^{2}+\sum_{i\in S^{c}}(\bm{g}_{i}-\bm{z}_{i})^{2}
=𝔼iS(𝒈itsgn(xi))2+iScshrink(𝒈i,t)2\displaystyle=\operatorname{\mathbb{E}}~{}\sum_{i\in S}(\bm{g}_{i}-t\cdot\textrm{sgn}(x_{i}))^{2}+\sum_{i\in S^{c}}\textrm{shrink}(\bm{g}_{i},t)^{2}
=s(1+t2)+2(ns)2π((1+t2)tex2/2𝑑xtet2/2).\displaystyle=s(1+t^{2})+\frac{2(n-s)}{\sqrt{2\pi}}\left((1+t^{2})\int_{t}^{\infty}e^{-x^{2}/2}dx-te^{-t^{2}/2}\right).

Here shrink(𝒈i,t)\textrm{shrink}(\bm{g}_{i},t) is the soft thresholding operator defined as:

shrink(𝒈i,t)={𝒈i+tif𝒈i<t,0ift𝒈it,𝒈itif𝒈i>t.\textrm{shrink}(\bm{g}_{i},t)=\left\{\begin{array}[]{ll}\bm{g}_{i}+t&\textrm{if}~{}~{}\bm{g}_{i}<-t,\\ 0&\textrm{if}~{}~{}-t\leq\bm{g}_{i}\leq t,\\ \bm{g}_{i}-t&\textrm{if}~{}~{}\bm{g}_{i}>t.\end{array}\right.
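The closed-form expression above is easy to check numerically. The following minimal Python sketch (the function names are ours and purely illustrative; the support is taken to be the first ss coordinates, which is without loss of generality by symmetry) compares the formula with a direct Monte Carlo estimate of the squared distance:

import math
import numpy as np

def eta_sq_l1_closed_form(n, s, t):
    # closed form derived above: s(1+t^2) + 2(n-s)*[(1+t^2)(1-Phi(t)) - t*phi(t)]
    tail = 0.5 * math.erfc(t / math.sqrt(2.0))            # Gaussian tail probability 1 - Phi(t)
    pdf = math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    return s * (1.0 + t * t) + 2.0 * (n - s) * ((1.0 + t * t) * tail - t * pdf)

def eta_sq_l1_monte_carlo(n, s, t, trials=20000, seed=0):
    # dist^2(g, t * subdifferential): on the support the closest point is z_i = t*sgn(x_i);
    # off the support the residual is the soft-thresholding (shrink) operator
    g = np.random.default_rng(seed).standard_normal((trials, n))
    on_sup = (g[:, :s] - t) ** 2                          # sgn(x_i) is irrelevant since g_i is symmetric
    off_sup = np.maximum(np.abs(g[:, s:]) - t, 0.0) ** 2
    return (on_sup.sum(axis=1) + off_sup.sum(axis=1)).mean()

# the two printed values should agree up to Monte Carlo fluctuation
print(eta_sq_l1_closed_form(1000, 50, 1.5), eta_sq_l1_monte_carlo(1000, 50, 1.5))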

Let 𝒗\bm{v} be a kk-sparse vector in m\mathbb{R}^{m}, and let SS denote the support of 𝒗\bm{v}. The scaled spherical part of the subdifferential 𝒗1\partial\|\bm{v}\|_{1} is given by

𝒗1t𝕊m1={𝒛m:𝒛i=sgn(𝒗i)foriS,|𝒛i|1foriSc,𝒛2=t}.\partial\|\bm{v}\|_{1}\cap t\mathbb{S}^{m-1}=\{\bm{z}\in\mathbb{R}^{m}:\bm{z}_{i}=\textrm{sgn}(\bm{v}_{i})~{}\textrm{for}~{}i\in S,~{}|\bm{z}_{i}|\leq 1~{}\textrm{for}~{}i\in S^{c},~{}\|\bm{z}\|_{2}=t\}.

Then the spherical Gaussian width of a scaled subdifferential is

ω(1t𝒗1𝕊m1)\displaystyle\omega\left(\frac{1}{t}\partial\|\bm{v}\|_{1}\cap\mathbb{S}^{m-1}\right) =1tω(𝒗1t𝕊m1)=1t𝔼sup𝒛𝒗1t𝕊m1𝒈,𝒛\displaystyle=\frac{1}{t}\cdot\omega(\partial\|\bm{v}\|_{1}\cap t\mathbb{S}^{m-1})=\frac{1}{t}\operatorname{\mathbb{E}}\sup_{\bm{z}\in\partial\|\bm{v}\|_{1}\cap t\mathbb{S}^{m-1}}\left\langle\bm{g},\bm{z}\right\rangle
=1t𝔼sup𝒛𝒗1t𝕊m1iS𝒈i𝒛i+iSc𝒈i𝒛i\displaystyle=\frac{1}{t}\operatorname{\mathbb{E}}\sup_{\bm{z}\in\partial\|\bm{v}\|_{1}\cap t\mathbb{S}^{m-1}}\sum_{i\in S}\bm{g}_{i}\bm{z}_{i}+\sum_{i\in S^{c}}\bm{g}_{i}\bm{z}_{i}
=1t𝔼sup𝒛𝒗1t𝕊m1iSc𝒈i𝒛i\displaystyle=\frac{1}{t}\operatorname{\mathbb{E}}\sup_{\bm{z}\in\partial\|\bm{v}\|_{1}\cap t\mathbb{S}^{m-1}}\sum_{i\in S^{c}}\bm{g}_{i}\bm{z}_{i}
=1tsup𝒛𝒗1t𝕊m1iSc|𝒛i|𝔼|𝒈i|\displaystyle=\frac{1}{t}\sup_{\bm{z}\in\partial\|\bm{v}\|_{1}\cap t\mathbb{S}^{m-1}}\sum_{i\in S^{c}}|\bm{z}_{i}|\cdot\operatorname{\mathbb{E}}|\bm{g}_{i}|
=1tmkt2k2π=2π(mk)(1kt2).\displaystyle=\frac{1}{t}\sqrt{m-k}\sqrt{t^{2}-k}\sqrt{\frac{2}{\pi}}=\sqrt{\frac{2}{\pi}(m-k)\left(1-\frac{k}{t^{2}}\right)}.

The last line is due to the Cauchy-Schwarz inequality, i.e., iSc|𝒛i|mkiSc𝒛i2=mkt2k\sum_{i\in S^{c}}|\bm{z}_{i}|\leq\sqrt{m-k}\cdot\sqrt{\sum_{i\in S^{c}}\bm{z}_{i}^{2}}=\sqrt{m-k}\cdot\sqrt{t^{2}-k}. The equality holds when |𝒛i|=t2kmk|\bm{z}_{i}|=\sqrt{\frac{t^{2}-k}{m-k}} for iSci\in S^{c}.
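Putting the two closed-form functionals together gives a direct numerical handle on the penalized threshold for the sparse signal/sparse corruption model. The Python sketch below is our own illustration, not code from the paper: it assumes the Jensen-type approximation E dist ≈ (E dist²)^{1/2} = η underlying the estimate (23), uses the identity E dist²(𝒉, (1/t)∂g(𝒗⋆) ∩ 𝕊^{m−1}) = m + 1 − 2ω((1/t)∂g(𝒗⋆) ∩ 𝕊^{m−1}) established in the proofs above, and notes that for the ℓ₁ subdifferential of a kk-sparse corruption α = √k and β = √m:

import math
import numpy as np

def eta_l1(n, s, tau):
    # eta(tau * d||x||_1): square root of the Gaussian squared distance computed above
    tail = 0.5 * math.erfc(tau / math.sqrt(2.0))
    pdf = math.exp(-tau * tau / 2.0) / math.sqrt(2.0 * math.pi)
    return math.sqrt(s * (1.0 + tau * tau)
                     + 2.0 * (n - s) * ((1.0 + tau * tau) * tail - tau * pdf))

def width_l1(m, k, t):
    # spherical Gaussian width omega((1/t) d||v||_1 cap S^{m-1}) from the display above
    return math.sqrt(2.0 / math.pi * (m - k) * (1.0 - k / (t * t)))

def C_p(lam, m, n, s, k, grid=400):
    # C_p(lambda) = min_{alpha <= t <= beta} E[2*dist + dist^2 - 1]
    #            ~= min_t [ 2*eta(sqrt(m)/(lam*t)) + m - 2*omega(1/t) ]   (approximation)
    ts = np.linspace(math.sqrt(k) * (1.0 + 1e-6), math.sqrt(m), grid)
    return min(2.0 * eta_l1(n, s, math.sqrt(m) / (lam * t))
               + m - 2.0 * width_l1(m, k, t) for t in ts)

# scan lambda for the minimizer lambda* discussed in Appendix C (sizes are illustrative)
lams = np.linspace(0.2, 5.0, 40)
vals = [C_p(lam, m=400, n=1000, s=20, k=40) for lam in lams]
print(lams[int(np.argmin(vals))], min(vals))

Since the inner minimization over tt is one-dimensional on [α,β][\alpha,\beta], a dense grid suffices; comparing mm against the resulting 𝒞p(λ)\mathscr{C}_{p}(\lambda) locates the predicted phase transition for each λ\lambda.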

D-B Calculation for Low-rank Matrices

Let 𝑿n1×n2\bm{X}\in\mathbb{R}^{n_{1}\times n_{2}} be an rr-rank matrix with n1n2n_{1}\leq n_{2}. We use the nuclear norm 𝑿\|\bm{X}\|_{*}, which is the sum of the singular values of 𝑿\bm{X}, to promote the structure of low-rank matrices. Since the nuclear norm and the Gaussian distance are both unitarily invariant, we can assume without loss of generality that 𝑿\bm{X} takes the form

𝑿=[𝚺𝟎𝟎𝟎],\bm{X}=\begin{bmatrix}\bm{\Sigma}&\bm{0}\\ \bm{0}&\bm{0}\end{bmatrix},

where 𝚺=diag(σ1,σ2,,σr)\bm{\Sigma}=\textrm{diag}(\sigma_{1},\sigma_{2},...,\sigma_{r}). The scaled subdifferential of 𝑿\|\bm{X}\|_{*} is

t𝑿={[t𝑰r𝟎𝟎t𝑾]:𝑾1}.t\cdot\partial\|\bm{X}\|_{*}=\left\{\begin{bmatrix}t\bm{I}_{r}&\bm{0}\\ \bm{0}&t\bm{W}\end{bmatrix}:~{}\|\bm{W}\|\leq 1\right\}.

Here 𝑾\|\bm{W}\| is the spectral norm, which is equal to the maximum singular value of 𝑾\bm{W}. Let

𝑮=[𝑮1𝑮1𝑮2𝑮2]\bm{G}=\begin{bmatrix}\bm{G}_{1}&\bm{G}_{1}^{\prime}\\ \bm{G}_{2}^{\prime}&\bm{G}_{2}\end{bmatrix}

be a partition of Gaussian matrix 𝑮n1×n2\bm{G}\in\mathbb{R}^{n_{1}\times n_{2}} with 𝑮1r×r\bm{G}_{1}\in\mathbb{R}^{r\times r} and 𝑮2(n1r)×(n2r)\bm{G}_{2}\in\mathbb{R}^{(n_{1}-r)\times(n_{2}-r)}. Then the Gaussian squared distance to a scaled subdifferential is

η2(t𝑿)\displaystyle\eta^{2}(t\cdot\partial\|\bm{X}\|_{*}) =𝔼inf𝒁t𝑿𝑮𝒁F2=𝔼{[𝑮1t𝑰r𝑮1𝑮2𝟎]F2+inf𝑾1𝑮2t𝑾F2}\displaystyle=\operatorname{\mathbb{E}}\inf_{\bm{Z}\in t\cdot\partial\|\bm{X}\|_{*}}\|\bm{G}-\bm{Z}\|_{F}^{2}=\operatorname{\mathbb{E}}\left\{\left\|\begin{bmatrix}\bm{G}_{1}-t\bm{I}_{r}&\bm{G}_{1}^{\prime}\\ \bm{G}_{2}^{\prime}&\bm{0}\end{bmatrix}\right\|_{F}^{2}+\inf_{\|\bm{W}\|\leq 1}\|\bm{G}_{2}-t\bm{W}\|_{F}^{2}\right\}
=r(n1+n2r+t2)+𝔼inf𝑾1i=1n1r(σi(𝑮2)tσi(𝑾))2\displaystyle=r(n_{1}+n_{2}-r+t^{2})+\operatorname{\mathbb{E}}\inf_{\|\bm{W}\|\leq 1}\sum_{i=1}^{n_{1}-r}\left(\sigma_{i}(\bm{G}_{2})-t\sigma_{i}(\bm{W})\right)^{2}
=r(n1+n2r+t2)+𝔼i=1n1rshrink(σi(𝑮2),t)2.\displaystyle=r(n_{1}+n_{2}-r+t^{2})+\operatorname{\mathbb{E}}\sum_{i=1}^{n_{1}-r}\textrm{shrink}\left(\sigma_{i}(\bm{G}_{2}),t\right)^{2}.

Here σi()\sigma_{i}(\cdot) denotes the ii-th largest singular value. The expectation term involves the distribution of the singular values of the Gaussian matrix 𝑮2\bm{G}_{2}, and it seems challenging to obtain an exact formula for it. However, there exist some asymptotic results in the literature; see, e.g., [37, 54]. In our simulations, we use the Monte Carlo method to calculate this expectation term.
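A minimal Python sketch of such a Monte Carlo estimate (the helper names are ours; the matrix sizes, the threshold tt, and the trial count are illustrative):

import numpy as np

def shrink_sq_sv_term(n1, n2, r, t, trials=200, seed=0):
    # Monte Carlo estimate of E sum_i shrink(sigma_i(G2), t)^2 for a Gaussian
    # (n1 - r) x (n2 - r) matrix G2; singular values are nonnegative, so
    # shrink(sigma, t)^2 reduces to max(sigma - t, 0)^2
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        sv = np.linalg.svd(rng.standard_normal((n1 - r, n2 - r)), compute_uv=False)
        total += np.sum(np.maximum(sv - t, 0.0) ** 2)
    return total / trials

def eta_sq_nuclear(n1, n2, r, t, **kw):
    # full Gaussian squared distance eta^2(t * d||X||_*) from the display above
    return r * (n1 + n2 - r + t * t) + shrink_sq_sv_term(n1, n2, r, t, **kw)

print(eta_sq_nuclear(40, 50, 3, t=9.0))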

Let 𝑽m1×m2\bm{V}\in\mathbb{R}^{m_{1}\times m_{2}} be a ρ\rho-rank matrix. Since the Gaussian width is also unitarily invariant, we may assume that 𝑽\bm{V} takes the same form as 𝑿\bm{X}. The scaled spherical part of the subdifferential 𝑽\partial\|\bm{V}\|_{*} is

𝑽t𝕊m1m21={[𝑰ρ𝟎𝟎𝑾]:𝑾1,𝑾F2=t2ρ}.\partial\|\bm{V}\|_{*}\cap t\mathbb{S}^{m_{1}m_{2}-1}=\left\{\begin{bmatrix}\bm{I}_{\rho}&\bm{0}\\ \bm{0}&\bm{W}\end{bmatrix}:~{}\|\bm{W}\|\leq 1,~{}\|\bm{W}\|_{F}^{2}=t^{2}-\rho\right\}.

Then spherical Gaussian width of a scaled subdifferential is given by

ω(1t𝑽𝕊m1m21)\displaystyle\omega\left(\frac{1}{t}\partial\|\bm{V}\|_{*}\cap\mathbb{S}^{m_{1}m_{2}-1}\right) =1tω(𝑽t𝕊m1m21)\displaystyle=\frac{1}{t}\cdot\omega(\partial\|\bm{V}\|_{*}\cap t\mathbb{S}^{m_{1}m_{2}-1})
=1t𝔼sup𝒁𝑽t𝕊m1m21𝑮,𝒁\displaystyle=\frac{1}{t}\operatorname{\mathbb{E}}\sup_{\bm{Z}\in\partial\|\bm{V}\|_{*}\cap t\mathbb{S}^{m_{1}m_{2}-1}}\left\langle\bm{G},\bm{Z}\right\rangle
=1t𝔼sup𝑾1,𝑾F2=t2ρ𝑮2,𝑾\displaystyle=\frac{1}{t}\operatorname{\mathbb{E}}\sup_{\|\bm{W}\|\leq 1,~{}\|\bm{W}\|_{F}^{2}=t^{2}-\rho}\left\langle\bm{G}_{2},\bm{W}\right\rangle
=1tt2ρ𝔼𝑮2F=1ρt2μ(m1ρ)(m2ρ).\displaystyle=\frac{1}{t}\sqrt{t^{2}-\rho}\cdot\operatorname{\mathbb{E}}\|\bm{G}_{2}\|_{F}=\sqrt{1-\frac{\rho}{t^{2}}}\cdot\mu_{(m_{1}-\rho)(m_{2}-\rho)}.

The last line is due to the Cauchy-Schwarz inequality, i.e., 𝑮2,𝑾𝑮2F𝑾F=t2ρ𝑮2F\left\langle\bm{G}_{2},\bm{W}\right\rangle\leq\|\bm{G}_{2}\|_{F}\cdot\|\bm{W}\|_{F}=\sqrt{t^{2}-\rho}\cdot\|\bm{G}_{2}\|_{F}. The notation μn\mu_{n} denotes the expected length of an nn-dimensional vector with independent standard normal entries; explicitly, \mu_{n}=\sqrt{2}\,\Gamma((n+1)/2)/\Gamma(n/2), which is well approximated by n\sqrt{n} for large nn.

Appendix E Auxiliary Lemma and Facts

In this appendix, we present an additional auxiliary lemma and some facts that are used in the proofs of our main results.

Lemma 4.

Let 𝒯1n\mathcal{T}_{1}\subset\mathbb{R}^{n} and 𝒯2𝕊m1\mathcal{T}_{2}\subset\mathbb{S}^{m-1} be two subsets, and let cc\in\mathbb{R} be a constant. Then the functions

F(𝒈,𝒉)=min𝒂𝒯1𝒃𝒯2𝒈c𝒂2+𝒉,𝒃F(\bm{g},\bm{h})=\min_{\bm{a}\in\mathcal{T}_{1}\atop\bm{b}\in\mathcal{T}_{2}}\|\bm{g}-c\cdot\bm{a}\|_{2}+\left\langle\bm{h},\bm{b}\right\rangle

and

G(𝒈,𝒉)=max𝒂𝒯1𝒃𝒯2𝒉,𝒃𝒈c𝒂2G(\bm{g},\bm{h})=\max_{\bm{a}\in\mathcal{T}_{1}\atop\bm{b}\in\mathcal{T}_{2}}\left\langle\bm{h},\bm{b}\right\rangle-\|\bm{g}-c\cdot\bm{a}\|_{2}

are 2\sqrt{2}-Lipschitz functions.

Proof.

To prove the first part, it suffices to show that for any (𝒈1,𝒉1),(𝒈2,𝒉2)(\bm{g}_{1},\bm{h}_{1}),~{}(\bm{g}_{2},\bm{h}_{2}), we have

|F(𝒈1,𝒉1)F(𝒈2,𝒉2)|2𝒈1𝒈222+𝒉1𝒉222.\displaystyle\big{|}F(\bm{g}_{1},\bm{h}_{1})-F(\bm{g}_{2},\bm{h}_{2})\big{|}\leq\sqrt{2}\sqrt{\|\bm{g}_{1}-\bm{g}_{2}\|_{2}^{2}+\|\bm{h}_{1}-\bm{h}_{2}\|_{2}^{2}}.

To this end, let

(𝒂¯,𝒃¯)argmin𝒂𝒯1𝒃𝒯2𝒈2c𝒂2+𝒉2,𝒃.(\bar{\bm{a}},~{}\bar{\bm{b}})\in\arg\min_{\bm{a}\in\mathcal{T}_{1}\atop\bm{b}\in\mathcal{T}_{2}}\|\bm{g}_{2}-c\cdot\bm{a}\|_{2}+\left\langle\bm{h}_{2},\bm{b}\right\rangle.

Then we have

F(𝒈1,𝒉1)\displaystyle F(\bm{g}_{1},\bm{h}_{1}) =min𝒂𝒯1𝒃𝒯2𝒈1c𝒂2+𝒉1,𝒃\displaystyle=\min_{\bm{a}\in\mathcal{T}_{1}\atop\bm{b}\in\mathcal{T}_{2}}\|\bm{g}_{1}-c\cdot\bm{a}\|_{2}+\left\langle\bm{h}_{1},\bm{b}\right\rangle
𝒈1c𝒂¯2+𝒉1,𝒃¯.\displaystyle\leq\|\bm{g}_{1}-c\cdot\bar{\bm{a}}\|_{2}+\left\langle\bm{h}_{1},\bar{\bm{b}}\right\rangle.

Therefore,

F(𝒈1,𝒉1)F(𝒈2,𝒉2)\displaystyle F(\bm{g}_{1},\bm{h}_{1})-F(\bm{g}_{2},\bm{h}_{2}) (𝒈1c𝒂¯2+𝒉1,𝒃¯)(𝒈2c𝒂¯2+𝒉2,𝒃¯)\displaystyle\leq\left(\|\bm{g}_{1}-c\cdot\bar{\bm{a}}\|_{2}+\left\langle\bm{h}_{1},\bar{\bm{b}}\right\rangle\right)-\left(\|\bm{g}_{2}-c\cdot\bar{\bm{a}}\|_{2}+\left\langle\bm{h}_{2},\bar{\bm{b}}\right\rangle\right)
=(𝒈1c𝒂¯2𝒈2c𝒂¯2)+(𝒉1,𝒃¯𝒉2,𝒃¯)\displaystyle=\left(\|\bm{g}_{1}-c\cdot\bar{\bm{a}}\|_{2}-\|\bm{g}_{2}-c\cdot\bar{\bm{a}}\|_{2}\right)+\left(\left\langle\bm{h}_{1},\bar{\bm{b}}\right\rangle-\left\langle\bm{h}_{2},\bar{\bm{b}}\right\rangle\right)
𝒈1𝒈22+𝒉1𝒉22\displaystyle\leq\|\bm{g}_{1}-\bm{g}_{2}\|_{2}+\|\bm{h}_{1}-\bm{h}_{2}\|_{2}
2𝒈1𝒈222+𝒉1𝒉222.\displaystyle\leq\sqrt{2}\sqrt{\|\bm{g}_{1}-\bm{g}_{2}\|_{2}^{2}+\|\bm{h}_{1}-\bm{h}_{2}\|_{2}^{2}}. (58)

Here the second inequality follows from the reverse triangle inequality together with the Cauchy-Schwarz inequality (note that 𝒃¯2=1\|\bar{\bm{b}}\|_{2}=1 since 𝒯2𝕊m1\mathcal{T}_{2}\subset\mathbb{S}^{m-1}), and the last step uses the elementary bound x+y2x2+y2x+y\leq\sqrt{2}\sqrt{x^{2}+y^{2}}. Similarly, we have

F(𝒈2,𝒉2)F(𝒈1,𝒉1)2𝒈1𝒈222+𝒉1𝒉222.F(\bm{g}_{2},\bm{h}_{2})-F(\bm{g}_{1},\bm{h}_{1})\leq\sqrt{2}\sqrt{\|\bm{g}_{1}-\bm{g}_{2}\|_{2}^{2}+\|\bm{h}_{1}-\bm{h}_{2}\|_{2}^{2}}. (59)

Combining (58) and (59) yields the first conclusion. The second part for G(𝒈,𝒉)G(\bm{g},\bm{h}) can be proven similarly. ∎
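The constant 2\sqrt{2} is easy to probe numerically. The following quick sanity check (entirely our own illustration, using random finite sets) samples pairs of inputs and verifies that the empirical Lipschitz ratio of FF never exceeds 2\sqrt{2}:

import math
import numpy as np

rng = np.random.default_rng(1)
n, m, c = 8, 6, 2.7
T1 = rng.standard_normal((30, n))                      # finite T1 in R^n
T2 = rng.standard_normal((30, m))
T2 /= np.linalg.norm(T2, axis=1, keepdims=True)        # finite T2 on the unit sphere S^{m-1}

def F(g, h):
    # min over (a, b) of ||g - c*a||_2 + <h, b>; the minimum decouples over a and b
    return np.min(np.linalg.norm(g - c * T1, axis=1)) + np.min(T2 @ h)

worst = 0.0
for _ in range(2000):
    g1, h1 = rng.standard_normal(n), rng.standard_normal(m)
    g2, h2 = rng.standard_normal(n), rng.standard_normal(m)
    num = abs(F(g1, h1) - F(g2, h2))
    den = math.sqrt(np.linalg.norm(g1 - g2) ** 2 + np.linalg.norm(h1 - h2) ** 2)
    worst = max(worst, num / den)
print(worst, "<=", math.sqrt(2))                       # the ratio stays below sqrt(2)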

Fact 3 (Polarity principle).

[56, Proposition 3.8] Let 𝒮\mathcal{S} be a non-empty, closed, spherically convex subset of the unit sphere 𝕊n1\mathbb{S}^{n-1}, and let 𝑨:nm\bm{A}:\mathbb{R}^{n}\to\mathbb{R}^{m} be a linear map. If cone(𝒮)\operatorname{cone}(\mathcal{S}) is not a subspace, then

\min_{\|\bm{r}\|_{2}=1}\min_{\bm{s}\in(\operatorname{cone}(\mathcal{S}))^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}>0~{}~{}\textrm{implies}~{}~{}\bm{0}\in\bm{A}(\mathcal{S}),

where 𝑨(𝒮)\bm{A}(\mathcal{S}) denotes the image of 𝒮\mathcal{S} under 𝑨\bm{A}.

Fact 4 (Variance of Gaussian Lipschitz functions).

[58, Theorem 1.6.4] Consider a random vector 𝐱𝒩(0,𝐈n)\bm{x}\sim\mathcal{N}(0,\bm{I}_{n}) and a Lipschitz function f:nf:~{}\mathbb{R}^{n}\to\mathbb{R} with Lipschitz norm fLip\|f\|_{\textrm{Lip}} (with respect to the Euclidean metric). Then,

Var(f(𝒙))fLip2.\textrm{Var}(f(\bm{x}))\leq\|f\|_{\textrm{Lip}}^{2}.
Fact 5 (Moreau’s decomposition theorem).

[61, Theorem 6.30] Let 𝒞\mathcal{C} be a nonempty closed convex cone in n\mathbb{R}^{n} and let 𝐱n\bm{x}\in\mathbb{R}^{n}. Then

𝒙22=dist2(𝒙,𝒞)+dist2(𝒙,𝒞).\displaystyle\|\bm{x}\|_{2}^{2}=\operatorname{dist}^{2}(\bm{x},\mathcal{C})+\operatorname{dist}^{2}(\bm{x},\mathcal{C}^{\circ}).
Fact 6.

[51, Appendix A] Let 𝒞n\mathcal{C}\subset\mathbb{R}^{n} be a non-empty convex cone, we have

sup𝒛𝒞𝔹2n𝒙,𝒛=dist(𝒙,𝒞),\sup_{\bm{z}\in\mathcal{C}\cap\mathbb{B}_{2}^{n}}\left\langle\bm{x},\bm{z}\right\rangle=\operatorname{dist}(\bm{x},\mathcal{C}^{\circ}),

where 𝐱n\bm{x}\in\mathbb{R}^{n}.

Fact 7.

[37, Proposition 10.2] Let 𝒞\mathcal{C} be a convex cone. The Gaussian width and the statistical dimension are closely related:

δ(𝒞)1ω2(𝒞𝕊n1)δ(𝒞).\delta(\mathcal{C})-1\leq\omega^{2}(\mathcal{C}\cap\mathbb{S}^{n-1})\leq\delta(\mathcal{C}).
Fact 8.

[51, Lemma 3.7] Let 𝒞n\mathcal{C}\subset\mathbb{R}^{n} be a non-empty closed, convex cone. Then we have that

ω2(𝒞𝕊n1)+ω2(𝒞𝕊n1)n.\omega^{2}(\mathcal{C}\cap\mathbb{S}^{n-1})+\omega^{2}(\mathcal{C}^{\circ}\cap\mathbb{S}^{n-1})\leq n.
Fact 9 (Max–min inequality).

[57, Lemma 36.1] For any function F:n×mF:~{}\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R} and any 𝒲n\mathcal{W}\subseteq\mathbb{R}^{n}, 𝒵m\mathcal{Z}\subseteq\mathbb{R}^{m}, we have

sup𝒛𝒵inf𝒘𝒲F(𝒘,𝒛)inf𝒘𝒲sup𝒛𝒵F(𝒘,𝒛).\sup_{\bm{z}\in\mathcal{Z}}\inf_{\bm{w}\in\mathcal{W}}F(\bm{w},\bm{z})\leq\inf_{\bm{w}\in\mathcal{W}}\sup_{\bm{z}\in\mathcal{Z}}F(\bm{w},\bm{z}).
Fact 10.

[37, Theorem 4.3] Let ff be a norm on n\mathbb{R}^{n}, and fix a non-zero point 𝐱n\bm{x}\in\mathbb{R}^{n}. Then

0\displaystyle 0\leq (mint0η2(tf(𝒙)))δ(𝒯f)2max{𝒂2:𝒂f(𝒙)}f(𝒙/𝒙2).\displaystyle\bigg{(}\min_{t\geq 0}\eta^{2}(t\cdot\partial f(\bm{x}))\bigg{)}-\delta(\mathcal{T}_{f})\leq\frac{2\max\{\|\bm{a}\|_{2}:\bm{a}\in\partial f(\bm{x})\}}{f(\bm{x}/\|\bm{x}\|_{2})}.

References

  • [1] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, 2009.
  • [2] E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Miami, FL, USA, 2009, pp. 2790–2797.
  • [3] J. Haupt, W. U. Bajwa, M. Rabbat, and R. Nowak, “Compressed sensing for networked data,” IEEE Signal Process. Mag., vol. 25, no. 2, pp. 92–101, 2008.
  • [4] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-sparsity incoherence for matrix decomposition,” SIAM J. Optim., vol. 21, no. 2, pp. 572–596, 2011.
  • [5] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” J. ACM, vol. 58, no. 3, pp. 1–37, 2011.
  • [6] M. Elad, J.-L. Starck, P. Querre, and D. L. Donoho, “Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA),” Appl. Comp. Harmonic Anal., vol. 19, no. 3, pp. 340–358, 2005.
  • [7] J. N. Laska, M. A. Davenport, and R. G. Baraniuk, “Exact signal recovery from sparsely corrupted measurements through the pursuit of justice,” in Proc. 43rd Asilomar Conf. Signals, Syst. Comput., Pacific Grove, CA, USA, 2009, pp. 1556–1560.
  • [8] J. Wright and Y. Ma, “Dense error correction via 1\ell_{1}-minimization,” IEEE Trans. Inf. Theory, vol. 56, no. 7, pp. 3540–3560, 2010.
  • [9] X. Li, “Compressed sensing and matrix completion with constant proportion of corruptions,” Construct. Approximation, vol. 37, no. 1, pp. 73–99, 2013.
  • [10] N. H. Nguyen and T. D. Tran, “Exact recoverability from dense corrupted observations via 1\ell_{1}-minimization,” IEEE Trans. Inf. Theory, vol. 59, no. 4, pp. 2017–2035, 2013.
  • [11] ——, “Robust lasso with missing and grossly corrupted observations,” IEEE Trans. Inf. Theory, vol. 59, no. 4, pp. 2036–2058, 2013.
  • [12] P. Kuppinger, G. Durisi, and H. Bolcskei, “Uncertainty relations and sparse signal recovery for pairs of general signal sets,” IEEE Trans. Inf. Theory, vol. 58, no. 1, pp. 263–277, 2012.
  • [13] C. Studer, P. Kuppinger, G. Pope, and H. Bolcskei, “Recovery of sparsely corrupted signals,” IEEE Trans. Inf. Theory, vol. 58, no. 5, pp. 3115–3130, 2012.
  • [14] G. Pope, A. Bracher, and C. Studer, “Probabilistic recovery guarantees for sparsely corrupted signals,” IEEE Trans. Inf. Theory, vol. 59, no. 5, pp. 3104–3116, 2013.
  • [15] C. Studer and R. G. Baraniuk, “Stable restoration and separation of approximately sparse signals,” Appl. Comp. Harmonic Anal., vol. 37, no. 1, pp. 12–35, 2014.
  • [16] D. Su, “Data recovery from corrupted observations via l1 minimization,” 2016, [Online]. Available: https://arxiv.org/abs/1601.06011.
  • [17] B. Adcock, A. Bao, J. D. Jakeman, and A. Narayan, “Compressed sensing with sparse corruptions: Fault-tolerant sparse collocation approximations,” SIAM/ASA J. Uncertain. Quantif., vol. 6, no. 4, pp. 1424–1453, 2018.
  • [18] B. Adcock, A. Bao, and S. Brugiapaglia, “Correcting for unknown errors in sparse high-dimensional function approximation,” Numer. Math., vol. 142, no. 3, pp. 667–711, 2019.
  • [19] H. Xu, C. Caramanis, and S. Sanghavi, “Robust PCA via outlier pursuit,” IEEE Trans. Inf. Theory, vol. 58, no. 5, pp. 3047–3064, 2012.
  • [20] H. Xu, C. Caramanis, and S. Mannor, “Outlier-robust PCA: the high-dimensional case,” IEEE Trans. Inf. Theory, vol. 59, no. 1, pp. 546–572, 2013.
  • [21] J. Wright, A. Ganesh, K. Min, and Y. Ma, “Compressive principal component pursuit,” Inf. Inference, J. IMA, vol. 2, no. 1, pp. 32–68, 2013.
  • [22] Y. Chen, A. Jalali, S. Sanghavi, and C. Caramanis, “Low-rank matrix recovery from errors and erasures,” IEEE Trans. Inf. Theory, vol. 59, no. 7, pp. 4324–4337, 2013.
  • [23] M. B. McCoy and J. A. Tropp, “Sharp recovery bounds for convex demixing, with applications,” Found. Comput. Math., vol. 14, no. 3, pp. 503–567, 2014.
  • [24] R. Foygel and L. Mackey, “Corrupted sensing: Novel guarantees for separating structured signals,” IEEE Trans. Inf. Theory, vol. 60, no. 2, pp. 1223–1247, 2014.
  • [25] H. Zhang, Y. Liu, and L. Hong, “On the phase transition of corrupted sensing,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Aachen, Germany, 2017, pp. 521–525.
  • [26] J. Chen and Y. Liu, “Corrupted sensing with sub-Gaussian measurements,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT).   Aachen, Germany: IEEE, 2017, pp. 516–520.
  • [27] J. Chen and Y. Liu, “Stable recovery of structured signals from corrupted sub-Gaussian measurements,” IEEE Trans. Inf. Theory, vol. 65, no. 5, pp. 2976–2994, 2019.
  • [28] Z. Sun, W. Cui, and Y. Liu, “Recovery of structured signals from corrupted non-linear measurements,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Paris, France, 2019, pp. 2084–2088.
  • [29] ——, “Quantized corrupted sensing with random dithering,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Los Angeles, USA, 2020, pp. 1397–1402.
  • [30] D. L. Donoho, “Neighborly polytopes and sparse solutions of underdetermined linear equations,” Technical Report, Department of Statistics, Stanford University, 2005.
  • [31] ——, “High-dimensional centrally symmetric polytopes with neighborliness proportional to dimension,” Discrete Comput. Geom., vol. 35, no. 4, pp. 617–652, 2006.
  • [32] D. Donoho and J. Tanner, “Counting faces of randomly projected polytopes when the projection radically lowers dimension,” J. Amer. Math. Soc., vol. 22, no. 1, pp. 1–53, 2009.
  • [33] D. L. Donoho and J. Tanner, “Neighborliness of randomly projected simplices in high dimensions,” Proc. Natl Acad. Sci. USA, vol. 102, no. 27, pp. 9452–9457, 2005.
  • [34] ——, “Counting the faces of randomly-projected hypercubes and orthants, with applications,” Discrete Comput. Geom., vol. 43, no. 3, pp. 522–541, 2010.
  • [35] M. A. Khajehnejad, W. Xu, A. S. Avestimehr, and B. Hassibi, “Analyzing weighted 1\ell_{1} minimization for sparse recovery with nonuniform sparse models,” IEEE Trans. Signal Process., vol. 59, no. 5, pp. 1985–2001, 2011.
  • [36] W. Xu and B. Hassibi, “Precise stability phase transitions for 1\ell_{1} minimization: A unified geometric framework,” IEEE Trans. Inf. Theory, vol. 57, no. 10, pp. 6894–6919, 2011.
  • [37] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp, “Living on the edge: Phase transitions in convex programs with random data,” Inf. Inference, J. IMA, vol. 3, no. 3, pp. 224–294, 2014.
  • [38] D. Amelunxen and P. Bürgisser, “Intrinsic volumes of symmetric cones and applications in convex programming,” Math. Program., vol. 149, no. 1-2, pp. 105–130, 2015.
  • [39] ——, “Probabilistic analysis of the Grassmann condition number,” Found. Comput. Math., vol. 15, no. 1, pp. 3–51, 2015.
  • [40] L. Goldstein, I. Nourdin, and G. Peccati, “Gaussian phase transitions and conic intrinsic volumes: Steining the Steiner formula,” Ann. Appl. Probab., vol. 27, no. 1, pp. 1–47, 2017.
  • [41] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Natl Acad. Sci. USA, vol. 106, no. 45, pp. 18 914–18 919, 2009.
  • [42] M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs, with applications to compressed sensing,” IEEE Trans. Inf. Theory, vol. 57, no. 2, pp. 764–785, 2011.
  • [43] ——, “The LASSO risk for Gaussian matrices,” IEEE Trans. Inf. Theory, vol. 58, no. 4, pp. 1997–2017, 2011.
  • [44] D. L. Donoho, A. Maleki, and A. Montanari, “The noise-sensitivity phase transition in compressed sensing,” IEEE Trans. Inf. Theory, vol. 57, no. 10, pp. 6920–6941, 2011.
  • [45] D. L. Donoho, M. Gavish, and A. Montanari, “The phase transition of matrix recovery from Gaussian measurements matches the minimax MSE of matrix denoising,” Proc. Natl Acad. Sci. USA, vol. 110, no. 21, pp. 8405–8410, 2013.
  • [46] S. Oymak and B. Hassibi, “Sharp MSE bounds for proximal denoising,” Found. Comput. Math., vol. 16, no. 4, pp. 965–1029, 2016.
  • [47] Y. Gordon, “Some inequalities for Gaussian processes and applications,” Isr. J. Math., vol. 50, no. 4, pp. 265–289, 1985.
  • [48] M. Rudelson and R. Vershynin, “On sparse reconstruction from Fourier and Gaussian measurements,” Comm. Pure Appl. Math., vol. 61, no. 8, pp. 1025–1045, 2008.
  • [49] M. Stojnic, “Various thresholds for 1\ell_{1}-optimization in compressed sensing,” 2009, [Online]. Available: https://arxiv.org/abs/0907.3666.
  • [50] S. Oymak and B. Hassibi, “New null space results and recovery thresholds for matrix rank minimization,” 2010, [Online]. Available: https://arxiv.org/abs/1011.6326.
  • [51] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of linear inverse problems,” Found. Comput. Math., vol. 12, no. 6, pp. 805–849, 2012.
  • [52] M. Stojnic, “A framework to characterize performance of lasso algorithms,” 2013, [Online]. Available: https://arxiv.org/abs/1303.7291.
  • [53] ——, “Regularly random duality,” 2013, [Online]. Available: https://arxiv.org/abs/1303.7295.
  • [54] S. Oymak, C. Thrampoulidis, and B. Hassibi, “The squared-error of generalized LASSO: A precise analysis,” 2013, [Online]. Available: https://arxiv.org/abs/1311.0830.
  • [55] C. Thrampoulidis, S. Oymak, and B. Hassibi, “The Gaussian min-max theorem in the presence of convexity,” 2014, [Online]. Available: https://arxiv.org/abs/1408.4837.
  • [56] S. Oymak and J. A. Tropp, “Universality laws for randomized dimension reduction, with applications,” Inf. Inference, J. IMA, vol. 7, no. 3, pp. 337–446, 2018.
  • [57] R. T. Rockafellar, Convex analysis.   Princeton, NJ, USA: Princeton Univ. Press, 2015.
  • [58] V. I. Bogachev, Gaussian measures.   Providence, RI, USA: American Mathematical Soc., 1998, no. 62.
  • [59] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 2.1,” 2017, [Online]. Available: http://cvxr.com/cvx/.
  • [60] ——, “Graph implementations for nonsmooth convex programs,” in Recent Advances in Learning and Control, ser. Lecture Notes in Control and Information Sciences, V. Blondel, S. Boyd, and H. Kimura, Eds.   London, U.K.: Springer-Verlag, 2008, pp. 95–110.
  • [61] H. H. Bauschke, P. L. Combettes et al., Convex analysis and monotone operator theory in Hilbert spaces.   Cham, Switzerland: Springer, 2011, vol. 408.