
Phase Transitions in Recovery of Structured Signals from Corrupted Measurements

Zhongxing Sun1, Wei Cui1, and Yulong Liu2 Corresponding author: Yulong Liu. Email: [email protected]. 1School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China 2School of Physics, Beijing Institute of Technology, Beijing 100081, China
Abstract

This paper is concerned with the problem of recovering a structured signal from a relatively small number of corrupted random measurements. Sharp phase transitions have been observed numerically when different convex programming procedures are used to solve this problem. This paper is devoted to presenting theoretical explanations for these phenomena by employing some basic tools from Gaussian process theory. Specifically, we identify the precise locations of the phase transitions for both constrained and penalized recovery procedures. Our theoretical results show that these phase transitions are determined by some geometric measures of structure, e.g., the spherical Gaussian width of a tangent cone and the Gaussian (squared) distance to a scaled subdifferential. By utilizing the established phase transition theory, we further investigate the relationship between these two kinds of recovery procedures, which also reveals an optimal strategy (in the sense of Lagrange theory) for choosing the tradeoff parameter in the penalized recovery procedure. Numerical experiments are provided to verify our theoretical results.

Index Terms:
Phase transition, corrupted sensing, signal separation, signal demixing, compressed sensing, structured signals, corruption, Gaussian process.

I Introduction

This paper studies the problem of recovering a structured signal from a relatively small number of corrupted measurements

\bm{y}=\bm{\Phi}\bm{x}^{\star}+\bm{v}^{\star}+\bm{z}, (1)

where $\bm{\Phi}\in\mathbb{R}^{m\times n}$ is the sensing matrix, $\bm{x}^{\star}\in\mathbb{R}^{n}$ denotes the structured signal to be estimated, $\bm{v}^{\star}\in\mathbb{R}^{m}$ stands for the structured corruption, and $\bm{z}\in\mathbb{R}^{m}$ represents the unstructured observation noise. The objective is to estimate $\bm{x}^{\star}$ and $\bm{v}^{\star}$ from the given knowledge of $\bm{y}$ and $\bm{\Phi}$. If $\bm{v}^{\star}$ contains some useful information, then the model (1) can be regarded as a signal separation (or demixing) problem. In particular, if there is no corruption ($\bm{v}^{\star}=\bm{0}$), then the model (1) reduces to the standard compressed sensing problem.

This problem arises in many practical applications of interest, such as face recognition [1], subspace clustering [2], sensor networks [3], latent variable modeling [4], principal component analysis [5], source separation [6], and so on. The theoretical aspects of this problem have also been studied under different scenarios in the literature; important examples include sparse signal recovery from sparse corruption [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], low-rank matrix recovery from sparse corruption [4, 5, 19, 20, 21, 22], and structured signal recovery from structured corruption [23, 24, 25, 26, 27, 28, 29].

Since this problem is ill-posed in general, tractable recovery is possible when both signal and corruption are suitably structured. Typical examples of structured signals (or corruptions) include sparse vectors and low-rank matrices. Let $f(\cdot)$ and $g(\cdot)$ be suitable proper convex functions which promote the structures of signal and corruption, respectively. There are three popular convex optimization approaches to reconstruct signal and corruption when different kinds of prior information are available. Specifically, when we have access to prior knowledge of either the signal, $f(\bm{x}^{\star})$, or the corruption, $g(\bm{v}^{\star})$, together with the noise level $\delta$ (in terms of the $\ell_{2}$ norm), it is natural to consider the following constrained convex recovery procedures

\min_{\bm{x},\bm{v}}~f(\bm{x}), \quad \text{s.t.}~ g(\bm{v})\leq g(\bm{v}^{\star}),~ \|\bm{y}-\bm{\Phi}\bm{x}-\bm{v}\|_{2}\leq\delta (2)

and

\min_{\bm{x},\bm{v}}~g(\bm{v}), \quad \text{s.t.}~ f(\bm{x})\leq f(\bm{x}^{\star}),~ \|\bm{y}-\bm{\Phi}\bm{x}-\bm{v}\|_{2}\leq\delta. (3)

When only the noise level $\delta$ is known, it is convenient to employ the partially penalized convex recovery procedure

\min_{\bm{x},\bm{v}}~f(\bm{x})+\lambda\cdot g(\bm{v}), \quad \text{s.t.}~ \|\bm{y}-\bm{\Phi}\bm{x}-\bm{v}\|_{2}\leq\delta, (4)

where $\lambda>0$ is a tradeoff parameter. When there is no prior knowledge available, it is practical to use the fully penalized convex recovery procedure

\min_{\bm{x},\bm{v}}~\frac{1}{2}\|\bm{y}-\bm{\Phi}\bm{x}-\bm{v}\|_{2}^{2}+\tau_{1}\cdot f(\bm{x})+\tau_{2}\cdot g(\bm{v}), (5)

where $\tau_{1},\tau_{2}>0$ are tradeoff parameters.
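To make these programs concrete, here is a minimal Python sketch of the fully penalized procedure (5) using the CVXPY package (the experiments in Section IV use the CVX Matlab package instead; the data, sizes, and choice $f=g=\ell_{1}$ below are hypothetical):

```python
import cvxpy as cp
import numpy as np

# Hypothetical problem data with f = g = the l1 norm.
m, n = 128, 256
rng = np.random.default_rng(0)
Phi = rng.standard_normal((m, n))
y = rng.standard_normal(m)
tau1, tau2 = 1.0, 1.0  # tradeoff parameters

# Fully penalized recovery procedure (5).
x, v = cp.Variable(n), cp.Variable(m)
objective = (0.5 * cp.sum_squares(y - Phi @ x - v)
             + tau1 * cp.norm1(x) + tau2 * cp.norm1(v))
cp.Problem(cp.Minimize(objective)).solve()
print(x.value[:5], v.value[:5])
```

The constrained procedures (2) and (3) are posed analogously, with the structure-promoting terms moved into the constraint set.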

A large number of numerical results in the literature suggest that phase transitions emerge in all three of the above recovery procedures (under random measurements); see, e.g., [10, 11, 13, 15, 16, 17, 23, 24, 25, 26, 27]. Concretely, for a specific recovery procedure, when the number of measurements exceeds a threshold, the procedure faithfully reconstructs both signal and corruption with high probability; when the number of measurements falls below the threshold, the procedure fails with high probability. A fundamental question then is:

Q1: How to determine the locations of these phase transitions accurately?

In addition, in partially and fully penalized recovery procedures, the optimization problems also rely on some tradeoff parameters. Another important question is:

Q2: How to choose these tradeoff parameters to achieve the best possible performance?

I-A Model Assumptions and Contributions

This paper provides answers to the above two questions in the absence of unstructured noise ($\bm{z}=\bm{0}$). In this scenario, the observation model becomes

\bm{y}=\bm{\Phi}\bm{x}^{\star}+\sqrt{m}\bm{v}^{\star}. (6)

Here we assume that $\bm{\Phi}$ is a Gaussian sensing matrix with i.i.d. entries ($\bm{\Phi}_{ij}\sim\mathcal{N}(0,1)$), and the factor $\sqrt{m}$ in (6) makes the columns of $\bm{\Phi}$ and $\sqrt{m}\bm{I}_{m}$ have the same scale, which makes our theoretical results more interpretable. Accordingly, the constrained convex recovery procedures become

\min_{\bm{x},\bm{v}}~f(\bm{x}), \quad \text{s.t.}~ \bm{y}=\bm{\Phi}\bm{x}+\sqrt{m}\bm{v},~ g(\bm{v})\leq g(\bm{v}^{\star}) (7)

and

\min_{\bm{x},\bm{v}}~g(\bm{v}), \quad \text{s.t.}~ \bm{y}=\bm{\Phi}\bm{x}+\sqrt{m}\bm{v},~ f(\bm{x})\leq f(\bm{x}^{\star}), (8)

and the partially and fully penalized recovery procedures reduce to

\min_{\bm{x},\bm{v}}~f(\bm{x})+\lambda\cdot g(\bm{v}), \quad \text{s.t.}~ \bm{y}=\bm{\Phi}\bm{x}+\sqrt{m}\bm{v}, (9)

where $\lambda>0$ is a tradeoff parameter. We declare that a recovery procedure succeeds if it has a unique solution $(\hat{\bm{x}},\hat{\bm{v}})$ satisfying $\hat{\bm{x}}=\bm{x}^{\star}$ and $\hat{\bm{v}}=\bm{v}^{\star}$; otherwise, it fails.

Under the above model settings, the contribution of this paper is twofold:

  • First, we develop a new analytical framework which allows us to establish the phase transition theory of both constrained and penalized recovery procedures in a unified way. Specifically, for the constrained recovery procedures (7) and (8), our analysis shows that their phase transitions are located at

    \mathscr{C}_{p}=\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}),

    where $\mathcal{T}_{f}(\bm{x}^{\star})$ (or $\mathcal{T}_{g}(\bm{v}^{\star})$) is the tangent cone induced by $f$ (or $g$) at the true signal $\bm{x}^{\star}$ (or corruption $\bm{v}^{\star}$), and $\omega(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})$ (or $\omega(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})$) is the spherical Gaussian width of this cone, defined in Section II. For the penalized recovery procedure (9), our results indicate that its critical point is located at

    \mathscr{C}_{p}(\lambda)=\min_{\alpha\leq t\leq\beta}2\cdot\zeta\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\eta^{2}\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1,

    where $\partial f(\bm{x}^{\star})$ (or $\partial g(\bm{v}^{\star})$) is the subdifferential of $f$ (or $g$) at the true signal $\bm{x}^{\star}$ (or corruption $\bm{v}^{\star}$), $\zeta(\SS)$ and $\eta^{2}(\SS)$ denote the Gaussian distance and the Gaussian squared distance to a set $\SS$, respectively, also defined in Section II, and $\alpha=\min_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}$ and $\beta=\max_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}$.

  • Second, we investigate the relationship between these two kinds of recovery procedures by utilizing the established critical points $\mathscr{C}_{p}$ and $\mathscr{C}_{p}(\lambda)$, which also reveals an optimal parameter selection strategy for $\lambda$ (in the sense of Lagrange theory):

    \lambda^{\star}=\arg\min_{\lambda>0}\mathscr{C}_{p}(\lambda).

    More precisely, under mild conditions, if the penalized procedure (9) is likely to succeed, then the constrained procedures (7) and (8) succeed with high probability; namely, if $m\geq\mathscr{C}_{p}(\lambda)$, then we have $m\geq\mathscr{C}_{p}-1$. Conversely, if the constrained procedures (7) and (8) are likely to succeed, then we can choose the tradeoff parameter $\lambda$ as $\lambda^{\star}$ such that the penalized procedure (9) succeeds with high probability; namely, if $m\geq\mathscr{C}_{p}$, then we have $m\geq\mathscr{C}_{p}(\lambda^{\star})-1$.

I-B Related Works

During the past few decades, there have been abundant works investigating phase transition phenomena in random convex optimization problems. Most of these works fall within the framework of compressed sensing ($\bm{v}^{\star}=\bm{0}$). In this paper, we focus on the scenario in which the random measurements are contaminated by some structured corruption. We review the works related to these two aspects in detail.

I-B1 Related Works in Compressed Sensing

The works in the context of compressed sensing can be roughly divided into four groups according to their analytical tools.

The early works study phase transitions in the context of sparse signal recovery via polytope angle calculations [30]. Under Gaussian measurements, Donoho [31] analyzes the $\ell_{1}$-minimization method in the asymptotic regime and establishes an empirically tight lower bound on the number of measurements required for successful recovery. In contrast to [31], Donoho and Tanner [32] prove the existence of sharp phase transitions in the asymptotic regime when using $\ell_{1}$-minimization to reconstruct sparse signals from random projections. These results are later extended to other related $\ell_{1}$-minimization problems. For instance, Donoho and Tanner [33, 34] identify a precise phase transition for the sparse signal recovery problem with an additional nonnegativity constraint; Khajehnejad et al. [35] introduce a nonuniform sparse model and analyze the performance of weighted $\ell_{1}$-minimization over that model; Xu and Hassibi [36] present sharp performance bounds on the number of measurements required for recovering approximately sparse signals from noisy measurements via $\ell_{1}$-minimization.

In [37, Fact 10.1], Amelunxen et al. explore the relationship between polytope angle theory and conical integral geometry in detail. Their results show that conical integral geometry can overcome many inherent limitations of polytope angle theory, such as dealing with the nuclear norm regularizer in low-rank matrix recovery problems, enabling non-asymptotic analysis, and establishing phase transitions from absolute success to absolute failure. In summary, [37] provides the first comprehensive analysis that explains phase transition phenomena in some random convex optimization problems. Other authors further use conical integral geometry to analyze convex optimization problems with random data. For example, Amelunxen and Bürgisser [38, 39] apply conical integral geometry to study conic optimization problems; Goldstein et al. [40] show that the sequence of conic intrinsic volumes can be approximated by a suitable Gaussian distribution in the high-dimensional limit, which provides more precise probabilities for successful and failed recovery.

Whereas the above works involve combinatorial geometry, some others use minimax decision theory to analyze phase transition problems. Several papers [41, 42, 43] have observed a close agreement between the asymptotic mean square error (MSE) and the location of the phase transition in linear inverse problems. Donoho et al. [44, 45] then show that the minimax MSE for denoising empirically predicts the locations of phase transitions in both sparse and low-rank recovery problems. Recently, Oymak and Hassibi [46] prove that the minimax MSE risk in structured signal denoising problems is almost the same as the statistical dimension. Combined with the results in [37], their results provide a theoretical explanation for using the minimax risk to describe the location of the phase transition in regularized linear inverse problems.

The last line of works studies the compressed sensing problem by utilizing some tools from Gaussian process theory. The key technique is a sharp comparison inequality for Gaussian processes due to Gordon [47]. Rudelson and Vershynin [48] first use Gordon's inequality to study the $\ell_{1}$-minimization problem. Stojnic [49] refines this method and obtains an empirically sharp success recovery condition under Gaussian measurements. Stojnic's calculation is then extended to more general settings. Oymak and Hassibi [50] use it to study the nuclear norm minimization problem. Chandrasekaran et al. [51] consider a more general case in which the regularizer can be any convex function. Stojnic [52, 53] has also investigated the error behaviors of $\ell_{1}$-minimization and its variants in random optimization problems; these works have been extended in a series of studies [54, 55] by Oymak, Thrampoulidis, and Hassibi. Although the aforementioned works provide detailed discussions of using Gaussian process theory to analyze random convex optimization problems, few of them consider the failure case for recovery. In a recent work [56], Oymak and Tropp demonstrate a universality property for randomized dimension reduction, which also proves the phase transition in the recovery of structured signals from a large class of measurement models.

I-B2 Related Works in Corrupted Sensing

There are several works in the literature studying the phase transition theory of corrupted sensing problems. For instance, in [37], Amelunxen et al. study the demixing problem $\bm{z}_{0}=\bm{x}_{0}+\bm{U}\bm{y}_{0}$, where $\bm{U}\in\mathbb{R}^{n\times n}$ is a random orthogonal matrix. They establish sharp phase transition results when the constrained convex program $\min_{\bm{x},\bm{y}}~f(\bm{x}),~\text{s.t.}~\bm{z}_{0}=\bm{x}+\bm{U}\bm{y},~g(\bm{y})\leq g(\bm{y}_{0})$ is used to solve this demixing problem. In [56], Oymak and Tropp consider a more general demixing model $\bm{y}=\bm{\Phi}_{0}\bm{x}_{0}+\bm{\Phi}_{1}\bm{x}_{1}$, where $\bm{\Phi}_{0},\bm{\Phi}_{1}\in\mathbb{R}^{m\times n}$ are two random transformation matrices drawn from a wide class of distributions. They reconstruct the original signal pair by solving $\min_{\bm{z}_{0},\bm{z}_{1}}\max\{f_{0}(\bm{z}_{0}),f_{1}(\bm{z}_{1})\},~\text{s.t.}~\bm{y}=\bm{\Phi}_{0}\bm{z}_{0}+\bm{\Phi}_{1}\bm{z}_{1}$ and establish the related phase transition theory. More related to this work, Foygel and Mackey [24] consider the corrupted sensing problem (1) and analyze both the constrained recovery procedures (2) and (3) and the partially penalized recovery procedure (4). In each case, they provide sufficient conditions for stable signal recovery from structured corruption with added unstructured noise under Gaussian measurements. Very recently, Chen and Liu [27] develop an extended matrix deviation inequality and use it to analyze all of the convex procedures (2)-(5) in a unified way under sub-Gaussian measurements. In terms of the failure case of corrupted sensing, Zhang, Liu, and Lei [25] establish a sharp threshold below which the constrained convex procedures (7) and (8) fail to recover both signal and corruption under Gaussian measurements. Together with the work in [24], their results provide a theoretical explanation for the phase transition when the constrained procedures are used to solve corrupted sensing problems.

I-C Organization

The remainder of the paper is organized as follows. We start by reviewing some preliminaries that are necessary for our subsequent analysis in Section II. Section III is devoted to presenting the main theoretical results of this paper. In Section IV, we present a series of numerical experiments to verify our theoretical results. We conclude the paper in Section V. All proofs of our main results are included in the appendices.

II Preliminaries

In this section, we introduce some notations and facts that underlie our analysis. Throughout the paper, $\mathbb{S}^{n-1}$ and $\mathbb{B}_{2}^{n}$ represent the unit sphere and unit ball in $\mathbb{R}^{n}$ under the $\ell_{2}$ norm, respectively.

II-A Convex Geometry

II-A1 Subdifferential

The subdifferential of a convex function $f$ at $\bm{x}$ is the set of vectors

\partial f(\bm{x})=\{\bm{u}\in\mathbb{R}^{n}:f(\bm{x}+\bm{d})-f(\bm{x})\geq\left\langle\bm{u},\bm{d}\right\rangle~\text{for all}~\bm{d}\in\mathbb{R}^{n}\}.

If $f$ is convex and $\bm{x}\in\operatorname{int}\operatorname{dom}(f)$, then $\partial f(\bm{x})$ is a nonempty, compact, convex set. For any number $t\geq 0$, we denote the scaled subdifferential by $t\cdot\partial f(\bm{x})=\{t\cdot\bm{u}:\bm{u}\in\partial f(\bm{x})\}$.

II-A2 Cone and Polar Cone

A subset $\mathcal{C}\subset\mathbb{R}^{n}$ is called a cone if for every $\bm{x}\in\mathcal{C}$ and $t\geq 0$, we have $t\cdot\bm{x}\in\mathcal{C}$. For a cone $\mathcal{C}\subset\mathbb{R}^{n}$, the polar cone of $\mathcal{C}$ is defined as

\mathcal{C}^{\circ}=\{\bm{u}\in\mathbb{R}^{n}:\left\langle\bm{u},\bm{x}\right\rangle\leq 0~\text{for all}~\bm{x}\in\mathcal{C}\}.

The polar cone $\mathcal{C}^{\circ}$ is always closed and convex. A subset $\SS\subset\mathbb{S}^{n-1}$ is called spherically convex if $\SS$ is the intersection of a convex cone with the unit sphere.

II-A3 Tangent Cone and Normal Cone

The tangent cone of a convex function $f$ at $\bm{x}$ is defined as the set of descent directions of $f$ at $\bm{x}$:

\mathcal{T}_{f}(\bm{x})=\{\bm{u}\in\mathbb{R}^{n}:~f(\bm{x}+t\cdot\bm{u})\leq f(\bm{x})~\text{for some}~t>0\}.

The tangent cone of a proper convex function is always convex, but it may not be closed.

The normal cone of a convex function $f$ at $\bm{x}$ is the polar of the tangent cone

\mathcal{N}_{f}(\bm{x})=\mathcal{T}_{f}^{\circ}(\bm{x})=\{\bm{u}\in\mathbb{R}^{n}:\left\langle\bm{u},\bm{d}\right\rangle\leq 0~\text{for all}~\bm{d}\in\mathcal{T}_{f}(\bm{x})\}.

Suppose that $\bm{0}\notin\partial f(\bm{x})$; then the normal cone can also be written as the cone hull of the subdifferential [57, Theorem 23.7]

\mathcal{N}_{f}(\bm{x})=\operatorname{cone}\{\partial f(\bm{x})\}=\{\bm{u}\in\mathbb{R}^{n}:~\bm{u}\in t\cdot\partial f(\bm{x})~\text{for some}~t\geq 0\}.

II-B Geometric Measures

II-B1 Gaussian Width

For any $\SS\subset\mathbb{R}^{n}$, a popular way to quantify the “size” of $\SS$ is through its Gaussian width

\omega(\SS):=\operatorname{\mathbb{E}}\sup_{\bm{x}\in\SS}\langle\bm{g},\bm{x}\rangle,~~\text{where}~~\bm{g}\sim\mathcal{N}(0,\bm{I}_{n}).
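As a quick illustration, for the unit $\ell_{1}$-ball the supremum equals $\|\bm{g}\|_{\infty}$ (by duality of the $\ell_{1}$ and $\ell_{\infty}$ norms), so its Gaussian width can be approximated by an empirical average; a minimal Python sketch with a hypothetical dimension:

```python
import numpy as np

# Monte Carlo estimate of the Gaussian width of the unit l1-ball:
# sup_{||x||_1 <= 1} <g, x> = ||g||_inf, so omega = E ||g||_inf.
n, trials = 128, 10000
g = np.random.default_rng(0).standard_normal((trials, n))
width = np.abs(g).max(axis=1).mean()
print(width)  # close to sqrt(2 * log(2 * n)) for large n
```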

II-B2 Gaussian Distance and Gaussian Squared Distance

Recall that the Euclidean distance to a set $\SS\subset\mathbb{R}^{n}$ is defined as

\operatorname{dist}(\bm{x},\SS):=\inf_{\bm{y}\in\SS}\|\bm{x}-\bm{y}\|_{2}.

We define the Gaussian distance to a set $\SS\subset\mathbb{R}^{n}$ as

\zeta(\SS):=\operatorname{\mathbb{E}}\operatorname{dist}(\bm{g},\SS)=\operatorname{\mathbb{E}}\inf_{\bm{y}\in\SS}\|\bm{g}-\bm{y}\|_{2},~~\text{where}~~\bm{g}\sim\mathcal{N}(0,\bm{I}_{n}).

Similarly, the Gaussian squared distance to a set $\SS\subset\mathbb{R}^{n}$ is defined as

\eta^{2}(\SS):=\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{g},\SS)=\operatorname{\mathbb{E}}\inf_{\bm{y}\in\SS}\|\bm{g}-\bm{y}\|_{2}^{2},~~\text{where}~~\bm{g}\sim\mathcal{N}(0,\bm{I}_{n}).

These two quantities are closely related (the lower and upper bounds in (10) follow from Fact 4 in Appendix E and Jensen's inequality, respectively):

\sqrt{\eta^{2}(\SS)-1}\leq\zeta(\SS)\leq\sqrt{\eta^{2}(\SS)}. (10)
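When the distance to $\SS$ has a closed form, both quantities are easy to estimate by Monte Carlo. For example, for $\SS=t\cdot\partial\|\bm{x}^{\star}\|_{1}$ with $\bm{x}^{\star}$ $s$-sparse, the squared distance splits into on-support terms and off-support soft-thresholding terms; a minimal sketch checking the sandwich bound (10), with hypothetical sizes:

```python
import numpy as np

def dist_l1_subdiff(g, sign_supp, t):
    """Distance from g to t * (subdifferential of the l1 norm), where
    sign_supp holds sign(x_i) on the support of x and 0 elsewhere."""
    on = sign_supp != 0
    d2 = np.sum((g[on] - t * sign_supp[on]) ** 2)         # support terms
    d2 += np.sum(np.maximum(np.abs(g[~on]) - t, 0) ** 2)  # shrinkage terms
    return np.sqrt(d2)

n, s, t, trials = 128, 10, 1.5, 5000
rng = np.random.default_rng(1)
sign_supp = np.zeros(n)
sign_supp[:s] = 1.0
d = np.array([dist_l1_subdiff(rng.standard_normal(n), sign_supp, t)
              for _ in range(trials)])
zeta, eta2 = d.mean(), (d ** 2).mean()
print(np.sqrt(eta2 - 1), zeta, np.sqrt(eta2))  # increasing, as in (10)
```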

II-C Tools from Gaussian Analysis

Our analysis makes heavy use of two well-known results in Gaussian analysis. The first one is a comparison principle for Gaussian processes due to Gordon [47, Theorem 1.1]. This result provides a convenient way to bound the probability of an event from below by that of another one. It is worth noting that the original lemma can be naturally extended from discrete index sets to compact index sets, see e.g., [54, Lemma C.1].

Fact 1 (Gordon’s Lemma).

[47, Theorem 1.1] Let $\{X_{ij}\}$ and $\{Y_{ij}\}$, $1\leq i\leq n$, $1\leq j\leq m$, be two centered Gaussian processes. If $X_{ij}$ and $Y_{ij}$ satisfy the following inequalities:

1. $\operatorname{\mathbb{E}}[X_{ij}^{2}]=\operatorname{\mathbb{E}}[Y_{ij}^{2}]$;
2. $\operatorname{\mathbb{E}}[X_{ij}X_{ik}]\leq\operatorname{\mathbb{E}}[Y_{ij}Y_{ik}]$;
3. $\operatorname{\mathbb{E}}[X_{ij}X_{lk}]\geq\operatorname{\mathbb{E}}[Y_{ij}Y_{lk}]$ for all $i\neq l$;

then we have

\mathbb{P}\left\{\cap_{i}\cup_{j}[X_{ij}\geq\tau_{ij}]\right\}\geq\mathbb{P}\left\{\cap_{i}\cup_{j}[Y_{ij}\geq\tau_{ij}]\right\}

for all choices of $\tau_{ij}\in\mathbb{R}$.

The second one is the Gaussian concentration inequality, which allows us to establish tail bounds for different kinds of Gaussian Lipschitz functions. Recall that a function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is $L$-Lipschitz with respect to the Euclidean norm $\|\cdot\|_{2}$ if

|f(\bm{x})-f(\bm{y})|\leq L\|\bm{x}-\bm{y}\|_{2}~~\text{for all}~~\bm{x},\bm{y}\in\mathbb{R}^{n}.

Then the Gaussian concentration inequality reads as

Fact 2 (Gaussian concentration inequality).

[58, Theorem 1.7.6] Let $\bm{x}\sim\mathcal{N}(0,\bm{I}_{n})$, and let $f:\mathbb{R}^{n}\to\mathbb{R}$ be $L$-Lipschitz with respect to the Euclidean metric. Then for any $\epsilon\geq 0$, we have

\mathbb{P}\left\{f(\bm{x})\geq\operatorname{\mathbb{E}}f(\bm{x})+\epsilon\right\}\leq\exp\left(\frac{-\epsilon^{2}}{2L^{2}}\right),

and

\mathbb{P}\left\{f(\bm{x})\leq\operatorname{\mathbb{E}}f(\bm{x})-\epsilon\right\}\leq\exp\left(\frac{-\epsilon^{2}}{2L^{2}}\right).

III Main Results

In this section, we present our main results. Section III-A is devoted to analyzing the phase transition of the constrained recovery procedures (7) and (8). The phase transition of the penalized recovery procedure (9) is established in Section III-B. Section III-C explores the relationship between these two kinds of recovery procedures and illustrates how to choose the optimal tradeoff parameter $\lambda$. The proofs are included in the appendices.

III-A Phase Transition of the Constrained Recovery Procedures

We start by analyzing the phase transition of the constrained recovery procedures (7) and (8). Recall that a recovery procedure succeeds if it has a unique optimal solution which coincides with the true value; otherwise it fails. First of all, it is necessary to specify some analytic conditions under which the constrained procedures (7) and (8) succeed or fail to recover the original signal and corruption. To this end, we have the following lemma.

Lemma 1 (Sufficient conditions for successful and failed recovery).

Suppose $\mathcal{T}_{f}(\bm{x}^{\star})$ and $\mathcal{T}_{g}(\bm{v}^{\star})$ are nonempty and closed. If

\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}(\bm{x}^{\star})\times\mathcal{T}_{g}(\bm{v}^{\star})\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}>0, (11)

then the constrained procedures (7) and (8) succeed. If

\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}(\bm{x}^{\star})\times\mathcal{T}_{g}(\bm{v}^{\star})\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}=0, (12)

then the constrained procedures (7) and (8) fail. Furthermore, a sufficient condition for (12) to hold is

\min_{\bm{r}\in\mathbb{S}^{m-1}}\min_{\bm{s}\in(\mathcal{T}_{f}(\bm{x}^{\star})\times\mathcal{T}_{g}(\bm{v}^{\star}))^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}>0, (13)

where $(\mathcal{T}_{f}(\bm{x}^{\star})\times\mathcal{T}_{g}(\bm{v}^{\star}))^{\circ}$ denotes the polar cone of $\mathcal{T}_{f}(\bm{x}^{\star})\times\mathcal{T}_{g}(\bm{v}^{\star})$, $\bm{A}=[\bm{\Phi},\sqrt{m}\bm{I}_{m}]$, and $\bm{I}_{m}$ is the $m$-dimensional identity matrix.

Armed with this lemma, our first theorem shows that the phase transition of the constrained recovery procedures (7) and (8) occurs around the sum of the squared spherical Gaussian widths of $\mathcal{T}_{f}(\bm{x}^{\star})$ and $\mathcal{T}_{g}(\bm{v}^{\star})$. This result ensures that the recovery is likely to succeed when the number of measurements exceeds the critical point; conversely, the recovery is likely to fail when the number of measurements is smaller than the critical point.

Theorem 1 (Phase transition of constrained recovery procedures).

Consider the corrupted sensing model (6) with Gaussian measurements. Assume $\mathcal{T}_{f}(\bm{x}^{\star})$ and $\mathcal{T}_{g}(\bm{v}^{\star})$ are nonempty and closed. Define $\mathscr{C}_{p}:=\omega^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1}\right)+\omega^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)$. If the number of measurements satisfies

\sqrt{m}\geq\sqrt{\mathscr{C}_{p}}+\sqrt{2}+\epsilon, (14)

then the constrained procedures (7) and (8) succeed with probability at least $1-2\exp\left(\frac{-\epsilon^{2}}{4}\right)$. If the number of measurements satisfies

\sqrt{m}\leq\sqrt{\mathscr{C}_{p}}-\epsilon, (15)

then the constrained procedures (7) and (8) fail with probability at least $1-2\exp\left(\frac{-\epsilon^{2}}{4}\right)$.

Remark 1 (Relation to existing results).

In [24, Theorem 1], Foygel and Mackey have shown that when

\sqrt{m}\geq\sqrt{\gamma^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{B}_{2}^{n}\right)+\gamma^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{B}_{2}^{m}\right)}+\frac{1}{\sqrt{2}}+\frac{1}{\sqrt{2\pi}}+\epsilon,

the constrained procedures (7) and (8) succeed with probability at least $1-\exp\left(-\epsilon^{2}/2\right)$. Here, the Gaussian squared complexity of a set $\SS\subset\mathbb{R}^{n}$ is defined as $\gamma^{2}(\SS):=\operatorname{\mathbb{E}}\left[\left(\sup_{\bm{x}\in\SS}\langle\bm{g},\bm{x}\rangle\right)_{+}^{2}\right]$ with $\bm{g}\sim\mathcal{N}(0,\bm{I}_{n})$ and $(a)_{+}=\max\{a,0\}$. On the other hand, the third author and his coauthors [25, Theorem 1] have demonstrated that when

\sqrt{m}\leq\sqrt{\omega^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1}\right)+\omega^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)}-\epsilon,

the constrained procedures (7) and (8) fail with probability at least $1-\exp\left(-\epsilon^{2}/2\right)$. Since $\gamma^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{B}_{2}^{n}\right)$ (or $\gamma^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{B}_{2}^{m}\right)$) is very close to $\omega^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1}\right)$ (or $\omega^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)$), the above two results have essentially established the phase transition theory of the constrained procedures (7) and (8).

However, in this paper, we have developed a new analytical framework which allows us to unify the results in both success and failure cases in terms of Gaussian width, which makes the phase transition theory of the constrained recovery procedures more natural. More importantly, this framework can be easily applied to establish the phase transition theory of the penalized recovery procedure.

Remark 2 (Related works).

In [37], Amelunxen et al. consider the following demixing problem

\bm{y}=\bm{U}\bm{x}^{\star}+\bm{v}^{\star},

where $\bm{x}^{\star},\bm{v}^{\star}\in\mathbb{R}^{n}$ are unknown structured signals and $\bm{U}\in\mathbb{R}^{n\times n}$ is a known orthogonal matrix. They have shown that the phase transition occurs around $\delta(\mathcal{T}_{f}(\bm{x}^{\star}))+\delta(\mathcal{T}_{g}(\bm{v}^{\star}))$ when the constrained recovery procedures are employed to solve this problem. Here, the statistical dimension of a convex cone $\mathcal{C}\subset\mathbb{R}^{n}$ is defined as $\delta(\mathcal{C}):=\operatorname{\mathbb{E}}\left[\left(\sup_{\bm{x}\in\mathcal{C}\cap\mathbb{B}_{2}^{n}}\langle\bm{g},\bm{x}\rangle\right)^{2}\right]$ with $\bm{g}\sim\mathcal{N}(0,\bm{I}_{n})$. Although the model assumptions of this demixing problem are different from ours, the results in the two cases are essentially consistent, since we have $\delta(\mathcal{T}_{f}(\bm{x}^{\star}))+\delta(\mathcal{T}_{g}(\bm{v}^{\star}))\approx\mathscr{C}_{p}$ (by Fact 7 in Appendix E).

Recently, Oymak and Tropp [56] consider a more general demixing model

\bm{y}=\bm{\Phi}_{0}\bm{x}^{\star}+\bm{\Phi}_{1}\bm{v}^{\star},

where $\bm{x}^{\star},\bm{v}^{\star}\in\mathbb{R}^{n}$ are unknown structured signals and $\bm{\Phi}_{0},\bm{\Phi}_{1}\in\mathbb{R}^{m\times n}$ are random matrices. They have demonstrated that the critical point of the constrained recovery procedures is located near $\delta(\mathcal{T}_{f}(\bm{x}^{\star}))+\delta(\mathcal{T}_{g}(\bm{v}^{\star}))$ for a large class of random matrices drawn from certain models. Their model assumptions are also different from ours, because $\bm{\Phi}_{1}$ is a deterministic matrix in our case, which makes our analysis different from that of [56].

III-A1 How to evaluate the critical point $\mathscr{C}_{p}$?

Theorem 1 demonstrates that the phase transition of the constrained recovery procedures occurs around

\mathscr{C}_{p}=\omega^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1}\right)+\omega^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right).

A natural question then is how to determine the value of this critical point. To this end, it suffices to estimate $\omega^{2}\left(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1}\right)$ and $\omega^{2}\left(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)$. It is now well known that there are standard recipes to estimate these two quantities; see, e.g., [51, 37, 24]. Actually, Facts 7 and 10 indicate that $\mathscr{C}_{p}$ can be accurately approximated by

\min_{t\geq 0}\eta^{2}(t\cdot\partial f(\bm{x}^{\star}))+\min_{t\geq 0}\eta^{2}(t\cdot\partial g(\bm{v}^{\star})). (16)

To illustrate this result (16), we consider two typical examples: sparse signal recovery from sparse corruption and low-rank matrix recovery from sparse corruption. In the first example, we assume $\bm{x}^{\star}\in\mathbb{R}^{n}$ and $\bm{v}^{\star}\in\mathbb{R}^{m}$ are $s$-sparse and $k$-sparse vectors, respectively. Direct calculations (see Appendix D) lead to

\min_{t\geq 0}\eta^{2}(t\cdot\partial\|\bm{x}^{\star}\|_{1})=\min_{t\geq 0}\left\{s(1+t^{2})+\frac{2(n-s)}{\sqrt{2\pi}}\left((1+t^{2})\int_{t}^{\infty}e^{-x^{2}/2}dx-te^{-t^{2}/2}\right)\right\}

and

\min_{t\geq 0}\eta^{2}(t\cdot\partial\|\bm{v}^{\star}\|_{1})=\min_{t\geq 0}\left\{k(1+t^{2})+\frac{2(m-k)}{\sqrt{2\pi}}\left((1+t^{2})\int_{t}^{\infty}e^{-x^{2}/2}dx-te^{-t^{2}/2}\right)\right\}.
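Each of these is a smooth one-dimensional minimization, so the approximation (16) of $\mathscr{C}_{p}$ is easy to evaluate numerically; a minimal SciPy sketch (the sizes $n$, $m$, $s$, $k$ below are hypothetical, and the integrals are expressed through the standard normal tail norm.sf and density norm.pdf):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def eta2_l1(t, dim, s):
    """Closed-form Gaussian squared distance eta^2(t * d||x||_1) at an
    s-sparse point in R^dim (the bracketed expression above)."""
    return (s * (1 + t**2)
            + 2 * (dim - s) * ((1 + t**2) * norm.sf(t) - t * norm.pdf(t)))

def min_eta2_l1(dim, s):
    # One-dimensional smooth minimization over t >= 0.
    return minimize_scalar(eta2_l1, bounds=(0, 20), args=(dim, s),
                           method="bounded").fun

n = m = 128
s, k = 10, 10
C_p = min_eta2_l1(n, s) + min_eta2_l1(m, k)  # approximation (16) of C_p
print(C_p)
```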

In the case of low-rank matrix recovery from sparse corruption, suppose $\bm{X}^{\star}\in\mathbb{R}^{n\times n}$ is an $r$-rank matrix. The Gaussian squared distance of the signal in (16) is given by

\min_{t\geq 0}\eta^{2}(t\cdot\partial\|\bm{X}^{\star}\|_{*})=\min_{t\geq 0}\left\{r(2n-r+t^{2})+\operatorname{\mathbb{E}}\sum_{i=1}^{n-r}\text{shrink}\left(\sigma_{i}(\bm{G}_{2}),t\right)^{2}\right\},

where $\bm{G}_{2}$ is an $(n-r)\times(n-r)$ standard Gaussian matrix, $\sigma_{i}(\bm{G}_{2})$ is the $i$-th largest singular value of $\bm{G}_{2}$, and $\text{shrink}(a,t)=\operatorname{sign}(a)\cdot\max\{|a|-t,0\}$ is the soft-thresholding operator. The Gaussian squared distance of the corruption can be evaluated as in the first example. In addition, we should mention that it is also possible to estimate the Gaussian width in $\mathscr{C}_{p}$ numerically by approximating the expectation in its definition with an empirical average.
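The expectation over the singular values of $\bm{G}_{2}$ has no simple closed form, but the empirical-average approach just mentioned applies directly; a minimal sketch (with hypothetical $n$, $r$ and a coarse grid over $t$):

```python
import numpy as np

def eta2_nuclear(t, n, r, trials=200, seed=0):
    """Monte Carlo estimate of eta^2(t * d||X||_*) at a rank-r n x n
    matrix, following the formula above."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        G2 = rng.standard_normal((n - r, n - r))
        sigma = np.linalg.svd(G2, compute_uv=False)
        total += np.sum(np.maximum(sigma - t, 0.0) ** 2)  # shrink terms
    return r * (2 * n - r + t**2) + total / trials

# Coarse grid search for the minimum over t (hypothetical n, r).
n, r = 20, 2
ts = np.linspace(0.1, 3 * np.sqrt(n), 50)
print(min(eta2_nuclear(t, n, r) for t in ts))
```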

III-B Phase Transition of the Penalized Recovery Procedure

We now study the phase transition theory of the penalized recovery procedure (9). First, we need to establish sufficient conditions under which the penalized recovery procedure (9) succeeds or fails to recover the original signal and corruption.

Lemma 2 (Sufficient conditions for successful and failed recovery).

The penalized problem (9) succeeds if

\bm{0}\in\bm{\Phi}^{T}\cdot\partial g(\bm{v}^{\star})-\frac{\sqrt{m}}{\lambda}\partial f(\bm{x}^{\star}). (17)

The penalized problem (9) fails if

\min_{\bm{a}\in\partial f(\bm{x}^{\star}),\bm{b}\in\partial g(\bm{v}^{\star})}\left\|\bm{\Phi}^{T}\bm{b}-\frac{\sqrt{m}}{\lambda}\bm{a}\right\|_{2}>0. (18)

Define the joint cone $\mathcal{T}_{J}=\{t\cdot(\bm{a}^{T},\bm{b}^{T})^{T}\in\mathbb{R}^{n}\times\mathbb{R}^{m}:t\geq 0,~\bm{a}\in\partial f(\bm{x}^{\star}),~\bm{b}\in\partial g(\bm{v}^{\star})\}$. Then, a sufficient condition for (17) to hold is

\min_{\bm{r}\in\mathbb{S}^{n-1}}\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\|\bm{s}-\bm{M}^{T}\bm{r}\|_{2}>0, (19)

where $\mathcal{T}_{J}^{\circ}$ denotes the polar cone of $\mathcal{T}_{J}$, $\bm{M}=[-\frac{\sqrt{m}}{\lambda}\bm{I}_{n},\bm{\Phi}^{T}]$, and $\bm{I}_{n}$ is the $n$-dimensional identity matrix.

Our second theorem shows that the critical point of the penalized recovery procedure is located near $\mathscr{C}_{p}(\lambda)$, which is determined by two Gaussian distances to scaled subdifferentials. This result asserts that the recovery succeeds with high probability when the number of measurements is larger than the critical point; on the other hand, the recovery fails with high probability when the number of measurements is below the critical point. In addition, the critical point $\mathscr{C}_{p}(\lambda)$ is influenced by the tradeoff parameter $\lambda$.

Theorem 2 (Phase transition of penalized recovery procedure).

Consider the corrupted sensing model (6) with Gaussian measurements. Suppose that the subdifferential $\partial g(\bm{v}^{\star})$ does not contain the origin. Let $\alpha=\min_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}$ and $\beta=\max_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}$. Define

\mathscr{C}_{p}(\lambda):=\min_{\alpha\leq t\leq\beta}2\cdot\zeta\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\eta^{2}\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1.

If the number of measurements satisfies

m\geq\mathscr{C}_{p}(\lambda)+\epsilon, (20)

then the penalized problem (9) succeeds with probability at least $1-2\exp\left(\frac{-\epsilon^{2}}{16}\right)$. If the number of measurements satisfies

m\leq\mathscr{C}_{p}(\lambda)-\epsilon, (21)

then the penalized problem (9) fails with probability at least $1-2\exp\left(\frac{-\epsilon^{2}}{16}\right)$.

Remark 3 (Related works).

In [24], Foygel and Mackey have shown that, under Gaussian measurements, when

\sqrt{m}\geq 2\cdot\sqrt{\eta^{2}(\lambda_{1}\cdot\partial f(\bm{x}^{\star}))}+\sqrt{\eta^{2}(\lambda_{2}\cdot\partial g(\bm{v}^{\star}))}+3\sqrt{2\pi}+\frac{1}{\sqrt{2}}+\frac{1}{\sqrt{2\pi}}+\epsilon,

the penalized problem (9) succeeds with probability at least $1-\exp(-\epsilon^{2}/2)$. Here $\lambda=\lambda_{2}/\lambda_{1}$. Very recently, Chen and Liu [27] have shown that, under sub-Gaussian measurements, when

m\geq CK^{4}\left[\eta^{2}(\lambda_{1}\cdot\partial f(\bm{x}^{\star}))+\eta^{2}(\lambda_{2}\cdot\partial g(\bm{v}^{\star}))\right],

the penalized problem (9) succeeds with high probability. Here $C$ is an absolute constant and $K$ is an upper bound on the sub-Gaussian norms of the rows of the sensing matrix. Their numerical experiments demonstrate that these two sufficient conditions for successful recovery are not sharp. To the best of our knowledge, the present results (Theorem 2) are the first to establish the complete phase transition theory of the penalized recovery procedure (9), which closes an important open problem in the literature; see, e.g., [23], [24], and [27].

III-B1 How to evaluate the critical point $\mathscr{C}_{p}(\lambda)$?

Theorem 2 suggests that the phase transition of the penalized recovery procedure occurs around

\mathscr{C}_{p}(\lambda)=\min_{\alpha\leq t\leq\beta}2\cdot\zeta\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\eta^{2}\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1.

The next important question is how to calculate $\mathscr{C}_{p}(\lambda)$ accurately. To this end, we have the following lemma.

Lemma 3.

The quantity $\mathscr{C}_{p}(\lambda)$ can be bounded as

\min_{\alpha\leq t\leq\beta}\left[2\cdot\sqrt{\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)-1}-2\cdot\omega\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)+m\right]\leq\mathscr{C}_{p}(\lambda)\leq\min_{\alpha\leq t\leq\beta}\left[2\cdot\sqrt{\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)}-2\cdot\omega\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)+m\right].
Proof.

Note that

\mathscr{C}_{p}(\lambda)=\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1\right]
=\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\min_{\bm{b}\in\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}}\|\bm{h}-\bm{b}\|_{2}^{2}-1\right]
=\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)-2\max_{\bm{b}\in\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}}\left\langle\bm{h},\bm{b}\right\rangle+m\right]. (22)

It follows from (10) that

\sqrt{\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)-1}\leq\zeta\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)\leq\sqrt{\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)}.

Substituting the above bound into (22) completes the proof. ∎

Lemma 3 demonstrates that $\mathscr{C}_{p}(\lambda)$ can be accurately estimated by

\min_{\alpha\leq t\leq\beta}\left[2\cdot\sqrt{\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)}-2\cdot\omega\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)+m\right]. (23)

Thus it is sufficient to estimate $\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)$ and $\omega\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)$. There are also some standard methods to estimate these two quantities; see, e.g., [51, 37, 24]. To illustrate this result (23), we again consider the two typical examples: sparse signal recovery from sparse corruption and low-rank matrix recovery from sparse corruption. In the first example, we assume the signal $\bm{x}^{\star}\in\mathbb{R}^{n}$ and the corruption $\bm{v}^{\star}\in\mathbb{R}^{m}$ are $s$-sparse and $k$-sparse vectors, respectively; then we can obtain (see Appendix D for details)

\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial\|\bm{x}^{\star}\|_{1}\right)=s\left(1+\frac{m}{\lambda^{2}t^{2}}\right)+\frac{2(n-s)}{\sqrt{2\pi}}\left(\left(1+\frac{m}{\lambda^{2}t^{2}}\right)\int_{\frac{\sqrt{m}}{\lambda t}}^{\infty}e^{-x^{2}/2}dx-\frac{\sqrt{m}}{\lambda t}e^{-\frac{m}{2\lambda^{2}t^{2}}}\right)

and

\omega\left(\frac{1}{t}\partial\|\bm{v}^{\star}\|_{1}\cap\mathbb{S}^{m-1}\right)=\sqrt{\frac{2}{\pi}(m-k)\left(1-\frac{k}{t^{2}}\right)}.

The parameter $t$ in (23) ranges from $\alpha=\sqrt{k}$ to $\beta=\sqrt{m}$. In the example of low-rank matrix recovery from sparse corruption, where the signal $\bm{X}^{\star}\in\mathbb{R}^{n\times n}$ is an $r$-rank matrix, the first Gaussian squared distance in (23) can be calculated as

\eta^{2}\left(\frac{\sqrt{m}}{\lambda t}\partial\|\bm{X}^{\star}\|_{*}\right)=r\left(2n-r+\frac{m}{\lambda^{2}t^{2}}\right)+\operatorname{\mathbb{E}}\sum_{i=1}^{n-r}\text{shrink}\left(\sigma_{i}(\bm{G}_{2}),\frac{\sqrt{m}}{\lambda t}\right)^{2},

where $\bm{G}_{2}$ is an $(n-r)\times(n-r)$ standard Gaussian matrix and $\sigma_{i}(\bm{G}_{2})$ is the $i$-th largest singular value of $\bm{G}_{2}$. The calculations of the Gaussian width of the corruption and the range of the parameter $t$ are similar to the first example. In addition, it is possible to estimate the Gaussian distance and Gaussian width numerically by approximating the expectations in their definitions with empirical averages.
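Combining these closed forms, the estimate (23) of $\mathscr{C}_{p}(\lambda)$ in the sparse-sparse case reduces to a one-dimensional search over $t\in[\sqrt{k},\sqrt{m}]$; a minimal sketch (all sizes hypothetical):

```python
import numpy as np
from scipy.stats import norm

def eta2_l1(u, dim, s):
    # Closed form from Appendix D, via the normal tail/density functions.
    return (s * (1 + u**2)
            + 2 * (dim - s) * ((1 + u**2) * norm.sf(u) - u * norm.pdf(u)))

def C_p_lambda(lam, m, n, s, k, grid=2000):
    """Estimate (23) of the penalized critical point for l1 signal
    structure and l1 corruption structure."""
    ts = np.linspace(np.sqrt(k) + 1e-9, np.sqrt(m), grid)
    u = np.sqrt(m) / (lam * ts)
    omega = np.sqrt(2 / np.pi * (m - k) * (1 - k / ts**2))
    return np.min(2 * np.sqrt(eta2_l1(u, n, s)) - 2 * omega + m)

print(C_p_lambda(lam=1.0, m=128, n=128, s=10, k=10))
```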

Figure 1: We assume $\bm{x}^{\star}\in\mathbb{R}^{128}$ is an $s$-sparse vector and $\bm{v}^{\star}\in\mathbb{R}^{128}$ is a $k$-sparse vector. $f(\cdot)$ and $g(\cdot)$ are set to be the $\ell_{1}$-norm. Fix the sample size $m=128$. The solid red line corresponds to the phase transition threshold of the constrained procedures, $m=\mathscr{C}_{p}$; the dashed blue lines correspond to the phase transition thresholds of the penalized recovery procedure with different values of $\lambda$, $m=\mathscr{C}_{p}(\lambda)-\epsilon$. The recovery of both signal and corruption is likely to succeed when the sparsity pair $(s,k)$ lies below the phase transition thresholds; the recovery is likely to fail when $(s,k)$ lies above them. It is not hard to see that: (I) the successful areas of the penalized recovery procedure with different values of $\lambda$ are always smaller than that of the constrained methods; (II) even for the critical points (e.g., $(s,k)=(s_{0},k_{0}),(s_{1},k_{1}),(s_{2},k_{2})$) that lie on the phase transition threshold of the constrained recovery procedures (which might represent the reconstruction limit of the constrained procedures), we can still choose corresponding tradeoff parameters (e.g., $\lambda=0.75,1.50,3.50$) such that the penalized problem succeeds too (with similar probability). Thus, the successful area of the constrained procedures can be regarded as the union of those of the penalized one with different values of $\lambda$.

III-C Relationship between Constrained and Penalized Recovery Procedures and Optimal Choice of $\lambda$

The theory of Lagrange multipliers [57, Section 28] asserts that solving the constrained recovery procedures is essentially equivalent to solving the penalized problem with a best choice of the tradeoff parameter $\lambda$. More precisely, this equivalence consists of the following two aspects [23, Appendix A]. On the one hand,

  • (I).

    Suppose the penalized procedure (9) succeeds for some value $\lambda>0$. Then the constrained ones (7) and (8) succeed.

On the other hand, as a partial converse to (I), one has

  • (II).

    Suppose that the subdifferentials $\partial f(\bm{x}^{\star})$ and $\partial g(\bm{v}^{\star})$ do not contain the origin. If the constrained procedures (7) and (8) succeed, then there exists a parameter $\lambda>0$ such that $(\bm{x}^{\star},\bm{v}^{\star})$ is an optimal point for the penalized one (9).

The above relations indicate that the performance of the constrained procedures can be interpreted as the best possible performance of the penalized problem. However, the main difficulty in these results lies in how to select a suitable tradeoff parameter $\lambda$ that leads to this equivalence. Since we have identified the precise phase transitions of both constrained and penalized recovery procedures, we are able to explore the relationship between these two kinds of approaches in a quantitative way, which in turn implies an explicit strategy for choosing the optimal $\lambda$.

Theorem 3 (Relationship between constrained and penalized recovery procedures and optimal choice of $\lambda$).

Assume that $\mathcal{T}_{f}(\bm{x}^{\star})$ and $\mathcal{T}_{g}(\bm{v}^{\star})$ are nonempty and closed, and that the subdifferentials $\partial f(\bm{x}^{\star})$ and $\partial g(\bm{v}^{\star})$ do not contain the origin. If $m\geq\mathscr{C}_{p}(\lambda)$, then we have

m\geq\mathscr{C}_{p}-1.

On the other hand, if $m\geq\mathscr{C}_{p}$, then we can choose the tradeoff parameter as

\lambda^{\star}=\arg\min_{\lambda>0}\mathscr{C}_{p}(\lambda), (24)

such that

m\geq\mathscr{C}_{p}(\lambda^{\star})-5.

(As shown in the proof of Theorem 3, the gap 5 can easily be reduced by introducing an extra condition: if we further assume $\omega(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})\geq 4$, then we have $m\geq\mathscr{C}_{p}(\lambda^{\star})-1$. This condition is easy to satisfy in practical applications.)

Combining the phase transition results in Theorems 1 and 2, the first part of Theorem 3 implies that if the penalized procedure (9) is likely to succeed, then the constrained procedures (7) and (8) succeed with high probability. Similarly, the second part of Theorem 3 shows that if the constrained procedures (7) and (8) are likely to succeed, then we can choose the tradeoff parameter $\lambda$ as in (24) such that the penalized procedure (9) succeeds with high probability. Thus our results provide a quantitative characterization of the relations (I) and (II).

The results in Theorem 3 also enjoy a geometric interpretation in the phase transition diagram: the first part implies that the successful area of the penalized recovery procedure should be smaller than that of the constrained procedures; the second part indicates that for any point in the successful area of the constrained recovery procedures, we can find at least one $\lambda$ such that this point also belongs to the successful area of the corresponding penalized recovery procedure. In other words, the successful area of the constrained procedures can be regarded as the union of those of the penalized one (with different values of $\lambda$). Fig. 1 illustrates this relationship in the case of sparse signal recovery from sparse corruption.

Moreover, Theorem 3 suggests an explicit way to choose the best parameter $\lambda$ predicted by the Lagrange theory, i.e.,

\lambda^{\star}=\arg\min_{\lambda>0}\mathscr{C}_{p}(\lambda),

which is equivalent to

\lambda^{\star}=\arg\min_{\lambda>0}\zeta\left(\frac{\sqrt{m}}{\lambda t^{\star}}\partial f(\bm{x}^{\star})\right)~~\text{with}~~t^{\star}=\arg\min_{\alpha\leq t\leq\beta}\eta^{2}\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right). (25)

We provide some insights into this parameter selection strategy. Recall that Theorem 2 demonstrates that the penalized procedure succeeds with high probability if the number of measurements exceeds the critical point $\mathscr{C}_{p}(\lambda)$. The strategy (24) thus implies that we should pick the $\lambda$ which makes the number of observations required for successful recovery of the penalized procedure as small as possible. Another explanation comes from the relationship between the two kinds of recovery procedures. For a given corrupted sensing problem (with fixed $\bm{x}^{\star}$ and $\bm{v}^{\star}$), the first part of Theorem 3 indicates that the phase transition threshold of the penalized procedure is always bounded from below by that of the constrained ones; it is therefore natural to choose the $\lambda$ that achieves the smallest possible gap between these two thresholds, i.e., $\lambda^{\star}=\arg\min_{\lambda>0}(\mathscr{C}_{p}(\lambda)-\mathscr{C}_{p})$, which also leads to the strategy (24).
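For the sparse-sparse example, the two one-dimensional problems in (25) can be solved numerically from the closed forms of Section III-B1; a minimal sketch (approximating $\zeta$ by $\sqrt{\eta^{2}}$ via the sandwich bound (10), and using the identity $\eta^{2}(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1})=m+1-2\omega(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1})$ from the derivation of (22); all sizes hypothetical):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def eta2_l1(u, dim, s):  # closed form from Appendix D
    return (s * (1 + u**2)
            + 2 * (dim - s) * ((1 + u**2) * norm.sf(u) - u * norm.pdf(u)))

m = n = 128
s, k = 10, 10

# t* in (25): minimizing eta^2((1/t) dg cap S^{m-1}) = m + 1 - 2*omega(t)
# amounts to maximizing omega(t) over t in [sqrt(k), sqrt(m)].
ts = np.linspace(np.sqrt(k) + 1e-9, np.sqrt(m), 2000)
omega = np.sqrt(2 / np.pi * (m - k) * (1 - k / ts**2))
t_star = ts[np.argmax(omega)]

# lambda* in (25): minimize zeta(u * df) over the scale u = sqrt(m)/(lam t*),
# with zeta approximated by sqrt(eta^2) via the bound (10).
u_star = minimize_scalar(eta2_l1, bounds=(1e-6, 20), args=(n, s),
                         method="bounded").x
lam_star = np.sqrt(m) / (u_star * t_star)
print(lam_star)
```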

Remark 4 (Related works).

In [24] and [27], the authors also provide an explicit way to select the tradeoff parameter $\lambda$. Specifically, their results show that $\mathcal{O}\left(\eta^{2}(\lambda_{1}\cdot\partial f(\bm{x}^{\star}))+\eta^{2}(\lambda_{2}\cdot\partial g(\bm{v}^{\star}))\right)$ measurements are sufficient to guarantee the success of the penalized procedure. In order to achieve the smallest number of measurements, it is natural to choose $\lambda$ as follows:

\lambda_{1}^{\ast}=\arg\min_{\lambda_{1}>0}\eta^{2}(\lambda_{1}\cdot\partial f(\bm{x}^{\star})),~~\lambda_{2}^{\ast}=\arg\min_{\lambda_{2}>0}\eta^{2}(\lambda_{2}\cdot\partial g(\bm{v}^{\star})),~~\text{and}~~\lambda^{\ast}=\lambda_{2}^{\ast}/\lambda_{1}^{\ast}. (26)

However, a visible mismatch between the penalized program with the strategy (26) and the constrained ones has been observed in their numerical experiments, which suggests that the choice (26) might not be optimal in the sense of the Lagrange theory. As shown in our simulations (Section IV), the empirical performance of the penalized procedure (9) with our optimal choice of the tradeoff parameter is nearly the same as that of the constrained convex procedures. Thus, our strategy (24) solves another significant open problem in [23].

IV Numerical Simulations

In this section, we perform a series of numerical experiments to verify our theoretical results. We consider two typical structured signal recovery problems: sparse signal recovery from sparse corruption and low-rank matrix recovery from sparse corruption. In each case, we employ both constrained and penalized recovery procedures to reconstruct the original signal and corruption. Throughout these experiments, the related convex optimization problems are solved by the CVX Matlab package [59, 60]. In addition to Gaussian measurements, we also consider sub-Gaussian measurements. (In fact, we have tested other distributions of $\bm{\Phi}$, such as the sparse Rademacher distribution and Student's $t$ distribution; the results are quite similar, so we omit them here.)

IV-A Phase Transition of the Constrained Recovery Procedures

We first consider the empirical behavior of the constrained recovery procedures in the following two structured signal recovery problems.

Figure 2: Phase transitions of the constrained recovery procedures under Gaussian and Bernoulli measurements in both sparse signal recovery from sparse corruption and low-rank matrix recovery from sparse corruption. The red curves plot the phase transition thresholds predicted by (27).

IV-A1 Sparse Signal Recovery from Sparse Corruption

In this example, both signal and corruption are sparse, and we use the $\ell_{1}$-norm to promote their structures, i.e., $f(\bm{x}^{\star})=\|\bm{x}^{\star}\|_{1}$ and $g(\bm{v}^{\star})=\|\bm{v}^{\star}\|_{1}$. Suppose the $\ell_{1}$-norm of the true signal, $\|\bm{x}^{\star}\|_{1}$, is known beforehand. We fix the sample size and ambient signal dimension $m=n=128$. For each signal sparsity $s=1,2,\ldots,128$ and each corruption sparsity $k=1,2,\ldots,128$, we repeat the following experiment 20 times:

  • (1)

    Generate a signal vector $\bm{x}^{\star}\in\mathbb{R}^{n}$ with $s$ non-zero entries and set the other $n-s$ entries to 0. The locations of the non-zero entries are uniformly selected among all possible supports, and the non-zero entries are independently sampled from the normal distribution.

  • (2)

    Similarly, generate a corruption vector $\bm{v}^{\star}\in\mathbb{R}^{m}$ with $k$ non-zero entries and set the other $m-k$ entries to 0.

  • (3)

    For Gaussian measurements, we draw the sensing matrix $\bm{\Phi}\in\mathbb{R}^{m\times n}$ with i.i.d. standard normal entries. For sub-Gaussian measurements, we draw the sensing matrix $\bm{\Phi}\in\mathbb{R}^{m\times n}$ with i.i.d. symmetric Bernoulli entries.

  • (4)

    Solve the following constrained optimization problem (8) (a Python sketch of this step appears after the list):

    (\hat{\bm{x}},\hat{\bm{v}})=\arg\min_{\bm{x},\bm{v}}~\|\bm{v}\|_{1},\quad\text{s.t.}~\bm{y}=\bm{\Phi}\bm{x}+\sqrt{m}\bm{v},~\|\bm{x}\|_{1}\leq\|\bm{x}^{\star}\|_{1}.
  • (5)

    Set $tol=10^{-3}$. Declare success if $\|\hat{\bm{x}}-\bm{x}^{\star}\|_{2}/\|\bm{x}^{\star}\|_{2}\leq tol$.
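For reference, a minimal CVXPY sketch of steps (1)-(5) (the paper's experiments use the CVX Matlab package; the setup below, including sizes and seeds, is hypothetical):

```python
import cvxpy as cp
import numpy as np

# Steps (1)-(3): hypothetical s-sparse signal, k-sparse corruption, and a
# Gaussian sensing matrix.
m = n = 128
s, k = 10, 10
rng = np.random.default_rng(0)
x_true = np.zeros(n)
x_true[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
v_true = np.zeros(m)
v_true[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)
Phi = rng.standard_normal((m, n))
y = Phi @ x_true + np.sqrt(m) * v_true

# Step (4): the constrained procedure (8).
x, v = cp.Variable(n), cp.Variable(m)
prob = cp.Problem(cp.Minimize(cp.norm1(v)),
                  [y == Phi @ x + np.sqrt(m) * v,
                   cp.norm1(x) <= np.linalg.norm(x_true, 1)])
prob.solve()

# Step (5): declare success at relative tolerance 1e-3.
tol = 1e-3
print(np.linalg.norm(x.value - x_true) <= tol * np.linalg.norm(x_true))
```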

Figure 3: Phase transitions of the penalized recovery procedure with different values of $\lambda$ under Gaussian and Bernoulli measurements in sparse signal recovery from sparse corruption. The red curves plot the phase transition thresholds predicted by (28).

IV-A2 Low-rank Matrix Recovery from Sparse Corruption

In this case, the desired signal is a rank-r matrix and the corruption is a ρ-sparse vector. We use the nuclear norm f(𝑿⋆)=∥𝑿⋆∥_* to promote the structure of the signal. Suppose the nuclear norm of the true signal, ∥𝑿⋆∥_*, is known exactly. Let n=20, consider n×n signal matrices, and set the sample size m=n². For each rank r=1,2,…,20 and each corruption sparsity ρ=1,6,11,16,…,396, we repeat the following experiment 20 times (a code sketch of one trial is given after the list):

  • (1) Generate a rank-r matrix 𝑿⋆=𝑼₁𝑼₂ᵀ, where 𝑼₁ and 𝑼₂ are independent n×r matrices with orthonormal columns.

  • (2) Generate a corruption vector 𝒗⋆∈ℝᵐ with ρ non-zero entries and set the other m−ρ entries to 0.

  • (3) For Gaussian measurements, we draw the sensing matrix 𝚽∈ℝ^{m×n²} with i.i.d. standard normal entries. For sub-Gaussian measurements, we draw 𝚽 with i.i.d. symmetric Bernoulli entries.

  • (4) Solve the following constrained problem (8):

    (𝑿^,𝒗^)=argmin𝑿,𝒗𝒗1,s.t.\displaystyle(\hat{\bm{X}},\hat{\bm{v}})=\arg\min_{\bm{X},\bm{v}}~{}\|\bm{v}\|_{1},\quad\text{s.t.~{}} 𝒚=𝚽vec(𝑿)+m𝒗,𝑿𝑿.\displaystyle\bm{y}=\bm{\Phi}\cdot\textrm{vec}(\bm{X})+\sqrt{m}\bm{v},~{}~{}\|\bm{X}\|_{*}\leq\|\bm{X}^{\star}\|_{*}.
  • (5) Set tol=10⁻³ and declare success if ∥𝑿̂−𝑿⋆∥_F/∥𝑿⋆∥_F ≤ tol.
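A corresponding one-trial sketch for the low-rank experiment, again with cvxpy in place of CVX and an illustrative (rank, sparsity) pair; note that cvxpy's vec operator flattens in column-major order, which we match when vectorizing 𝑿⋆.

import numpy as np
import cvxpy as cp

n, m = 20, 400
r, rho = 2, 20  # one (rank, corruption sparsity) pair from the grid
rng = np.random.default_rng(0)

# Step (1): X* = U1 @ U2.T with orthonormal columns via QR factorizations.
U1, _ = np.linalg.qr(rng.standard_normal((n, r)))
U2, _ = np.linalg.qr(rng.standard_normal((n, r)))
X_star = U1 @ U2.T

# Steps (2)-(3): rho-sparse corruption and a Gaussian sensing matrix.
v_star = np.zeros(m)
v_star[rng.choice(m, size=rho, replace=False)] = rng.standard_normal(rho)
Phi = rng.standard_normal((m, n * n))

# cvxpy's vec flattens column-major, so we vectorize X* the same way.
y = Phi @ X_star.flatten(order="F") + np.sqrt(m) * v_star

# Step (4): the nuclear-norm constrained program (8).
X, v = cp.Variable((n, n)), cp.Variable(m)
problem = cp.Problem(cp.Minimize(cp.norm1(v)),
                     [y == Phi @ cp.vec(X) + np.sqrt(m) * v,
                      cp.normNuc(X) <= np.linalg.norm(X_star, "nuc")])
problem.solve()

# Step (5): declare success at relative Frobenius error 1e-3.
rel_err = np.linalg.norm(X.value - X_star, "fro") / np.linalg.norm(X_star, "fro")
print(rel_err <= 1e-3)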

Figure 4: Phase transitions of the penalized recovery procedure with different values of λ under Gaussian and Bernoulli measurements in low-rank matrix recovery from sparse corruption. The red curves plot the phase transition thresholds predicted by (28).

To compare the empirical behavior with our theoretical results, we overlay the phase transition curve predicted by Theorem 1:

ω2(𝒯f(𝒙)𝕊n1)+ω2(𝒯g(𝒗)𝕊m1).\displaystyle\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}). (27)

Fig. 2 reports the empirical probability of success for the constrained procedures in these two typical structured recovery problems. Our theoretical predictions align sharply with the empirical phase transitions under both Gaussian and Bernoulli measurements.
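As an aside, the threshold (27) itself is straightforward to estimate by Monte Carlo in the ℓ₁/ℓ₁ case. The sketch below relies on two standard facts from the conic-geometry literature: the polar of the descent cone 𝒯_f at 𝒙⋆≠𝟎 is the cone generated by ∂f(𝒙⋆), so that ω²(𝒯_f∩𝕊^{n−1}) agrees with 𝔼 min_{τ≥0} dist²(𝒈, τ∂f(𝒙⋆)) up to an additive constant of at most one; and for the ℓ₁ norm this distance has the closed form coded below. The trial count, sparsity levels, and bracketing interval are illustrative.

import numpy as np
from scipy.optimize import minimize_scalar

def delta_l1(dim, sparsity, trials=200, seed=0):
    """Monte Carlo estimate of E min_{tau >= 0} dist^2(g, tau * subdifferential
    of the l1 norm), taking the support signs to be +1 (valid by symmetry of g)."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        g = rng.standard_normal(dim)

        def sq_dist(tau):
            on = np.sum((g[:sparsity] - tau) ** 2)                   # support entries
            off = np.sum(np.maximum(np.abs(g[sparsity:]) - tau, 0) ** 2)
            return on + off

        vals.append(minimize_scalar(sq_dist, bounds=(0.0, 10.0),
                                    method="bounded").fun)
    return float(np.mean(vals))

# Predicted transition location for (27) at one grid point (s, k):
m = n = 128
s, k = 10, 10
print(delta_l1(n, s) + delta_l1(m, k), "vs. m =", m)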

IV-B Phase Transition of the Penalized Recovery Procedure

We next consider the empirical phase transition of the penalized procedure in these two examples.

IV-B1 Sparse Signal Recovery from Sparse Corruption

The experimental settings are almost the same as in the constrained case, except that we require knowledge of neither f(𝒙⋆) nor g(𝒗⋆), and we solve the following penalized procedure instead of the constrained one in step (4):

(𝒙^,𝒗^)=argmin𝒙,𝒗𝒙1+λ𝒗1,s.t.\displaystyle(\hat{\bm{x}},\hat{\bm{v}})=\arg\min_{\bm{x},\bm{v}}~{}\|\bm{x}\|_{1}+\lambda\|\bm{v}\|_{1},\quad\text{s.t.~{}} 𝒚=𝚽𝒙+m𝒗.\displaystyle\bm{y}=\bm{\Phi}\bm{x}+\sqrt{m}\bm{v}.

Here we test two tradeoff parameters, λ=1 and λ=2. To compare the empirical behavior with our theoretical results, we overlay the phase transition curve predicted by Theorem 2:

minαtβ2ζ(mλtf(𝒙))+η2(1tg(𝒗)𝕊m1)1.\min_{\alpha\leq t\leq\beta}2\cdot\zeta\left(\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\eta^{2}\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1. (28)

Fig. 3 displays the average empirical probability of success for the penalized problem in sparse signal recovery from sparse corruption. The theoretical threshold (28) accurately predicts the empirical phase transition for each tested value of the tradeoff parameter λ.
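Relative to the one-trial sketch in Section IV-A1, only step (4) changes; assuming the variables (Phi, y, x, v, m) defined there, the penalized program reads in cvxpy:

lam = 1.0  # tradeoff parameter; here we test lam = 1 and lam = 2
problem = cp.Problem(cp.Minimize(cp.norm1(x) + lam * cp.norm1(v)),
                     [y == Phi @ x + np.sqrt(m) * v])
problem.solve()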

IV-B2 Low-rank Matrix Recovery from Sparse Corruption

Similarly, the experimental settings are nearly the same as in the constrained case, except that we recover the original signal and corruption via the following penalized procedure in step (4):

(𝑿^,𝒗^)=argmin𝑿,𝒗𝑿+λ𝒗1,s.t.\displaystyle(\hat{\bm{X}},\hat{\bm{v}})=\arg\min_{\bm{X},\bm{v}}~{}\|\bm{X}\|_{*}+\lambda\|\bm{v}\|_{1},\quad\text{s.t.~{}} 𝒚=𝚽vec(𝑿)+m𝒗.\displaystyle\bm{y}=\bm{\Phi}\cdot\textrm{vec}(\bm{X})+\sqrt{m}\bm{v}.

The tradeoff parameter is set to λ=1/4 or λ=1/2. To compare our theory with the empirical results, we overlay the theoretical threshold (28). Fig. 4 shows the empirical probability of success for the penalized problem in low-rank matrix recovery from sparse corruption. The theoretical threshold (28) predicts the empirical phase transition quite well for each tested value of the tradeoff parameter λ.

Figure 5: Phase transitions of the penalized recovery procedure with the optimal λ under Gaussian and Bernoulli measurements in both sparse signal recovery from sparse corruption and low-rank matrix recovery from sparse corruption. The red curves plot the phase transition thresholds of the constrained procedures, predicted by (27).

IV-C Optimal Choice of the Tradeoff Parameter λ\lambda

In this subsection, we explore the empirical phase transition of the penalized procedure with the optimal tradeoff parameter λ, again in the same two typical examples.

IV-C1 Sparse Signal Recovery from Sparse Corruption

The experimental settings are the same as in the penalized case, except that we solve the following penalized procedure in step (4):

(𝒙^,𝒗^)=argmin𝒙,𝒗𝒙1+λ𝒗1,s.t.\displaystyle(\hat{\bm{x}},\hat{\bm{v}})=\arg\min_{\bm{x},\bm{v}}~{}\|\bm{x}\|_{1}+\lambda\|\bm{v}\|_{1},\quad\text{s.t.~{}} 𝒚=𝚽𝒙+m𝒗\displaystyle\bm{y}=\bm{\Phi}\bm{x}+\sqrt{m}\bm{v}

with the optimal parameter selection strategy λ=λ\lambda=\lambda^{\star} as in (24).

IV-C2 Low-rank Matrix Recovery from Sparse Corruption

We carry out experiments similar to the penalized case, except that we solve the following penalized procedure in step (4):

(𝑿^,𝒗^)=argmin𝑿,𝒗𝑿+λ𝒗1,s.t.\displaystyle(\hat{\bm{X}},\hat{\bm{v}})=\arg\min_{\bm{X},\bm{v}}~{}\|\bm{X}\|_{*}+\lambda\|\bm{v}\|_{1},\quad\text{s.t.~{}} 𝒚=𝚽vec(𝑿)+m𝒗.\displaystyle\bm{y}=\bm{\Phi}\cdot\textrm{vec}(\bm{X})+\sqrt{m}\bm{v}.

The tradeoff parameter is set to λ=λ\lambda=\lambda^{\star} according to (24).

To compare our theory with the empirical results, we overlay the theoretical threshold (27) of the constrained procedures. Fig. 5 shows the empirical probability of success for the penalized procedure with the optimal parameter λ in both examples. The theoretical threshold of the constrained problems predicts the empirical phase transition of the penalized problems perfectly under both Gaussian and Bernoulli measurements, which indicates that our strategy for choosing λ is optimal in the sense of Lagrange theory.

V Conclusion and Future Directions

This paper has developed a unified framework for establishing the phase transition theory of both constrained and penalized recovery procedures used to solve corrupted sensing problems under different scenarios. The analysis is based only on some well-known results from Gaussian process theory. Our theoretical results have shown that the phase transitions of these two recovery procedures are determined by certain geometric measures, e.g., the spherical Gaussian width of a tangent cone and the Gaussian (squared) distance to a scaled subdifferential. We have also explored the relationship between these two procedures from a quantitative perspective, which in turn indicates how to pick the optimal tradeoff parameter in the penalized recovery procedure. The numerical experiments have demonstrated a close agreement between our theoretical results and the empirical phase transitions. For future work, we list two promising directions:

  • Universality: Under Gaussian measurements, our results provide a thorough explanation of the phase transition phenomenon in corrupted sensing. The Gaussian assumption is critical to the establishment of our main results. However, the extensive numerical examples in Section IV suggest that the phase transition results for corrupted sensing are universal. Thus, an important open question is to establish a phase transition theory for corrupted sensing beyond Gaussian measurements.

  • Noisy phase transition: Throughout the paper, we analyze the phase transition of corrupted sensing in the noiseless setting. It would be interesting to consider the noisy measurements 𝒚=𝚽𝒙⋆+𝒗⋆+𝒛 and to provide a precise error analysis for different convex recovery procedures. In [44], Donoho et al. considered the noisy compressed sensing problem 𝒚=𝚽𝒙⋆+𝒛 with 𝒛∼𝒩(0,σ𝑰_m), and used the penalized ℓ₁-minimization 𝒙̂=argmin{∥𝒚−𝚽𝒙∥₂²/2+λ∥𝒙∥₁} to recover the original signal. They showed that the normalized MSE 𝔼∥𝒙̂−𝒙⋆∥₂²/σ² is bounded throughout one asymptotic region and unbounded throughout the complementary region, and that the boundary between the two regions is identical to the previously known phase transition for the noiseless problem. We may expect a non-asymptotic characterization of the normalized MSE for the noisy corrupted sensing problem, which would offer a new perspective on the phase transition results in the noiseless case. (A minimal sketch of this penalized estimator is given after the list.)
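For readers who wish to experiment with this noisy setting, the following is a minimal sketch of the penalized ℓ₁ estimator discussed above, solved by plain ISTA (proximal-gradient) iterations; the step size and iteration count are our illustrative choices, not prescriptions from [44].

import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def penalized_l1(Phi, y, lam, iters=2000):
    """Minimize ||y - Phi @ x||_2^2 / 2 + lam * ||x||_1 via ISTA."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(Phi.shape[1])
    for _ in range(iters):
        grad = Phi.T @ (Phi @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x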

Appendix A Proofs of Lemma 1 and Theorem 1

In this appendix, we present a detailed proof of the phase transition result for the constrained recovery procedures. For brevity, we denote 𝒯_f(𝒙⋆) and 𝒯_g(𝒗⋆) by 𝒯_f and 𝒯_g, respectively. Some auxiliary lemmas and facts used in the proofs are included in Appendix E.

A-A Proof of Lemma 1

Proof.

It follows from the optimality condition for linear inverse problems [48, Section 4] or [51, Proposition 2.1] that (𝒙̂,𝒗̂)=(𝒙⋆,𝒗⋆) is the unique optimal solution of (7) or (8) if and only if null([𝚽,√m 𝑰_m])∩(𝒯_f×𝒯_g)={(𝟎,𝟎)}, which is equivalent to null([𝚽,√m 𝑰_m])∩((𝒯_f×𝒯_g)∩𝕊^{n+m−1})=∅. Therefore, if null([𝚽,√m 𝑰_m])∩((𝒯_f×𝒯_g)∩𝕊^{n+m−1})=∅, i.e.,

min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝚽𝒂+m𝒃2>0,\displaystyle\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}>0,

then the constrained procedures (7) and (8) succeed. If null([𝚽,m𝑰m])((𝒯f×𝒯g)𝕊n+m1)\text{null}\left([\bm{\Phi},\sqrt{m}\bm{I}_{m}]\right)\cap\left((\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}\right)\neq\emptyset, i.e.,

min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝚽𝒂+m𝒃2=0,\displaystyle\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}=0, (29)

then the constrained procedures (7) and (8) fail.

Obviously, (29) holds if 𝟎∈𝑨((𝒯_f×𝒯_g)∩𝕊^{n+m−1}), where 𝑨=[𝚽,√m 𝑰_m]. Since f and g are proper convex functions, 𝒯_f and 𝒯_g are convex, and hence (𝒯_f×𝒯_g)∩𝕊^{n+m−1} is spherically convex. Since, by assumption, 𝒯_f and 𝒯_g are also nonempty and closed, the desired sufficient condition (13) follows by directly applying the polarity principle (Fact 3).

A-B Proof of Theorem 1

Proof.

Success case: Lemma 1 indicates that the constrained procedures (7) and (8) succeed if

min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝚽𝒂+m𝒃2>0.\displaystyle\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}>0.

Our goal then reduces to showing that if the number of measurements satisfies (14), then the above inequality holds with high probability. For clarity, the proof is divided into three steps.

Step 1: Problem reduction. We first apply Gordon’s Lemma to convert the probability of the targeted event to a surrogate which is convenient to handle. Observe that

min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝚽𝒂+m𝒃2=min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝚽𝒂,𝒖+m𝒃,𝒖.\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}=\min_{(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\left\langle\bm{\Phi}\bm{a},\bm{u}\right\rangle+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle. (30)

For any (𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1(\bm{a},\bm{b})\in\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1} and 𝒖𝕊m1\bm{u}\in\mathbb{S}^{m-1}, define the following two Gaussian processes

X(𝒂,𝒃),𝒖:=𝚽𝒂,𝒖+𝒂2gX_{(\bm{a},\bm{b}),\bm{u}}:=\left\langle\bm{\Phi}\bm{a},\bm{u}\right\rangle+\|\bm{a}\|_{2}\cdot g~{}~{}~{}

and

Y(𝒂,𝒃),𝒖:=𝒂2𝒉,𝒖+𝒈,𝒂,Y_{(\bm{a},\bm{b}),\bm{u}}:=\|\bm{a}\|_{2}\left\langle\bm{h},\bm{u}\right\rangle+\left\langle\bm{g},\bm{a}\right\rangle,

where g𝒩(0,1)g\sim\mathcal{N}(0,1), 𝒉𝒩(𝟎,𝑰m)\bm{h}\sim\mathcal{N}(\bm{0},\bm{I}_{m}), and 𝒈𝒩(𝟎,𝑰n)\bm{g}\sim\mathcal{N}(\bm{0},\bm{I}_{n}) are independent of each other. It is not hard to check that the above Gaussian processes satisfy the conditions of Gordon’s Lemma, i.e.,

𝔼X(𝒂,𝒃),𝒖2\displaystyle\operatorname{\mathbb{E}}X_{(\bm{a},\bm{b}),\bm{u}}^{2} =2𝒂22=𝔼Y(𝒂,𝒃),𝒖2,\displaystyle=2\|\bm{a}\|_{2}^{2}=\operatorname{\mathbb{E}}Y_{(\bm{a},\bm{b}),\bm{u}}^{2},
𝔼[X(𝒂,𝒃),𝒖X(𝒂,𝒃),𝒖]𝔼[Y(𝒂,𝒃),𝒖Y(𝒂,𝒃),𝒖]\displaystyle\operatorname{\mathbb{E}}[X_{(\bm{a},\bm{b}),\bm{u}}X_{(\bm{a}^{\prime},\bm{b}^{\prime}),\bm{u}^{\prime}}]-\operatorname{\mathbb{E}}[Y_{(\bm{a},\bm{b}),\bm{u}}Y_{(\bm{a}^{\prime},\bm{b}^{\prime}),\bm{u}^{\prime}}] =𝒖,𝒖𝒂,𝒂+𝒂2𝒂2𝒂,𝒂𝒂2𝒂2𝒖,𝒖\displaystyle=\left\langle\bm{u},\bm{u}^{\prime}\right\rangle\left\langle\bm{a},\bm{a}^{\prime}\right\rangle+\|\bm{a}\|_{2}\|\bm{a}^{\prime}\|_{2}-\left\langle\bm{a},\bm{a}^{\prime}\right\rangle-\|\bm{a}\|_{2}\|\bm{a}^{\prime}\|_{2}\left\langle\bm{u},\bm{u}^{\prime}\right\rangle
=(1𝒖,𝒖)(𝒂2𝒂2𝒂,𝒂)\displaystyle=\left(1-\left\langle\bm{u},\bm{u}^{\prime}\right\rangle\right)\left(\|\bm{a}\|_{2}\|\bm{a}^{\prime}\|_{2}-\left\langle\bm{a},\bm{a}^{\prime}\right\rangle\right)
0,\displaystyle\geq 0,

where, in the last line, the equality holds when 𝒂=𝒂\bm{a}=\bm{a}^{\prime}. It then follows from Gordon’s Lemma (Fact 1) that (by setting τ(𝒂,𝒃),𝒖=m𝒃,𝒖+0+\tau_{(\bm{a},\bm{b}),\bm{u}}=-\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle+0_{+})

{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1Y(𝒂,𝒃),𝒖τ(𝒂,𝒃),𝒖}={min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝒂2𝒉,𝒖+𝒈,𝒂+m𝒃,𝒖>0}\displaystyle\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}Y_{(\bm{a},\bm{b}),\bm{u}}\geq\tau_{(\bm{a},\bm{b}),\bm{u}}\rule{0.0pt}{8.53581pt}\right\}=\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\|\bm{a}\|_{2}\left\langle\bm{h},\bm{u}\right\rangle+\left\langle\bm{g},\bm{a}\right\rangle+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1X(𝒂,𝒃),𝒖τ(𝒂,𝒃),𝒖}\displaystyle\hskip 150.0pt\leq\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}X_{(\bm{a},\bm{b}),\bm{u}}\geq\tau_{(\bm{a},\bm{b}),\bm{u}}\rule{0.0pt}{8.53581pt}\right\}
={min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝚽𝒂,𝒖+𝒂2g+m𝒃,𝒖>0}\displaystyle\hskip 150.0pt=\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\left\langle\bm{\Phi}\bm{a},\bm{u}\right\rangle+\|\bm{a}\|_{2}\cdot g+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
12+12{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝚽𝒂,𝒖+𝒂2g+m𝒃,𝒖>0|g0}\displaystyle\hskip 150.0pt\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\left\langle\bm{\Phi}\bm{a},\bm{u}\right\rangle+\|\bm{a}\|_{2}\cdot g+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle>0~{}\bigg{|}~{}g\leq 0\rule{0.0pt}{8.53581pt}\right\}
12+12{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝚽𝒂,𝒖+m𝒃,𝒖>0},\displaystyle\hskip 150.0pt\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\left\langle\bm{\Phi}\bm{a},\bm{u}\right\rangle+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\},

where the second inequality is due to the law of total probability and the third inequality holds by noting 𝒂2g0-\|\bm{a}\|_{2}\cdot g\geq 0 when g0g\leq 0. Rearranging the above inequality leads to

{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝚽𝒂,𝒖+m𝒃,𝒖>0}\displaystyle\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\left\langle\bm{\Phi}\bm{a},\bm{u}\right\rangle+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
2{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝒂2𝒉,𝒖+𝒈,𝒂+m𝒃,𝒖:=1>0}1.\displaystyle\hskip 135.0pt\geq 2\mathbb{P}\left\{\underbrace{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\|\bm{a}\|_{2}\left\langle\bm{h},\bm{u}\right\rangle+\left\langle\bm{g},\bm{a}\right\rangle+\left\langle\sqrt{m}\bm{b},\bm{u}\right\rangle}_{:=\mathscr{E}_{1}}>0\rule{0.0pt}{8.53581pt}\right\}-1. (31)

Moreover, 1\mathscr{E}_{1} can be rewritten as

1=min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1max𝒖𝕊m1𝒖,𝒂2𝒉+m𝒃+𝒈,𝒂=min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝒂2𝒉+m𝒃2+𝒈,𝒂=mint[0,1]min𝒂𝒯f𝕊n1𝒃𝒯g𝕊m1t𝒉+m(1t2)𝒃2+t𝒈,𝒂.\begin{split}\mathscr{E}_{1}&=\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\max_{\bm{u}\in\mathbb{S}^{m-1}}\left\langle\bm{u},\|\bm{a}\|_{2}\bm{h}+\sqrt{m}\bm{b}\right\rangle+\left\langle\bm{g},\bm{a}\right\rangle\\ &=\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\bigg{\|}\|\bm{a}\|_{2}\bm{h}+\sqrt{m}\bm{b}\bigg{\|}_{2}+\left\langle\bm{g},\bm{a}\right\rangle\\ &=\min_{t\in[0,1]}\min_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{b}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\bigg{\|}t\bm{h}+\sqrt{m(1-t^{2})}\bm{b}^{\prime}\bigg{\|}_{2}+t\left\langle\bm{g},\bm{a}^{\prime}\right\rangle.\end{split} (32)

In the last line, we have let ∥𝒂∥₂=t, ∥𝒃∥₂=√(1−t²), 𝒂′=𝒂/∥𝒂∥₂, and 𝒃′=𝒃/∥𝒃∥₂.

Define

U(𝒈,𝒉,t):=min𝒂𝒯f𝕊n1𝒃𝒯g𝕊m1𝒉+m(1t21)𝒃2+𝒈,𝒂.\begin{split}U(\bm{g},\bm{h},t):=\min_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{b}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\bigg{\|}\bm{h}+\sqrt{m(\frac{1}{t^{2}}-1)}\cdot\bm{b}^{\prime}\bigg{\|}_{2}+\left\langle\bm{g},\bm{a}^{\prime}\right\rangle.\end{split}

Let

t1argmint[0,1]{min𝒂𝒯f𝕊n1𝒃𝒯g𝕊m1t𝒉+m(1t2)𝒃2+t𝒈,𝒂}.\displaystyle t_{1}\in\arg\min_{t\in[0,1]}\left\{\min_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{b}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\bigg{\|}t\bm{h}+\sqrt{m(1-t^{2})}\bm{b}^{\prime}\bigg{\|}_{2}+t\left\langle\bm{g},\bm{a}^{\prime}\right\rangle\right\}.

If t10t_{1}\neq 0, then we have

{mint[0,1]min𝒂𝒯f𝕊n1𝒃𝒯g𝕊m1t𝒉+m(1t2)𝒃2+t𝒈,𝒂>0}\displaystyle\mathbb{P}\left\{\min_{t\in[0,1]}\min_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{b}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\bigg{\|}t\bm{h}+\sqrt{m(1-t^{2})}\bm{b}^{\prime}\bigg{\|}_{2}+t\left\langle\bm{g},\bm{a}^{\prime}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\} ={t1U(𝒈,𝒉,t1)>0}\displaystyle=\mathbb{P}\left\{t_{1}\cdot U(\bm{g},\bm{h},t_{1})>0\rule{0.0pt}{8.53581pt}\right\}
={U(𝒈,𝒉,t1)>0}\displaystyle=\mathbb{P}\left\{U(\bm{g},\bm{h},t_{1})>0\rule{0.0pt}{8.53581pt}\right\}
{mint(0,1]U(𝒈,𝒉,t)>0}.\displaystyle\geq\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}.

If t₁=0, then ℰ₁=√m (the objective reduces to ∥√m 𝒃′∥₂=√m), which implies

{1>0}=1.\mathbb{P}\left\{\mathscr{E}_{1}>0\rule{0.0pt}{8.53581pt}\right\}=1.

Thus we have

{1>0}{mint(0,1]U(𝒈,𝒉,t)>0}.\displaystyle\mathbb{P}\left\{\mathscr{E}_{1}>0\rule{0.0pt}{8.53581pt}\right\}\geq\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}. (33)

Combining (30), (31), and (33) yields

{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝚽𝒂+m𝒃2>0}2{mint(0,1]U(𝒈,𝒉,t)>0}1.\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}\geq 2\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}-1. (34)

Therefore, it is sufficient to establish the lower bound for {mint(0,1]U(𝒈,𝒉,t)>0}\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}.

Step 2: Establish the lower bound for ℙ{min_{t∈(0,1]}U(𝒈,𝒉,t)>0}. We now apply the Gaussian concentration inequality to bound this probability from below.

Note that min_{t∈(0,1]}U(𝒈,𝒉,t) can be reformulated as

mint(0,1]U(𝒈,𝒉,t)=mint(0,1]min𝒂𝒯f𝕊n1𝒃𝒯g𝕊m1𝒉+m(1t21)𝒃2+𝒈,𝒂=min𝒂𝒯f𝕊n1mint0𝒃𝒯g𝕊m1𝒉+t𝒃2+𝒈,𝒂=min𝒙𝒯g𝒂𝒯f𝕊n1𝒉+𝒙2+𝒈,𝒂,\begin{split}\min_{t\in(0,1]}U(\bm{g},\bm{h},t)&=\min_{t\in(0,1]}\min_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{b}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\bigg{\|}\bm{h}+\sqrt{m(\frac{1}{t^{2}}-1)}\cdot\bm{b}^{\prime}\bigg{\|}_{2}+\left\langle\bm{g},\bm{a}^{\prime}\right\rangle\\ &=\min_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\min_{t^{\prime}\geq 0\atop\bm{b}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\|\bm{h}+t^{\prime}\bm{b}^{\prime}\|_{2}+\left\langle\bm{g},\bm{a}^{\prime}\right\rangle\\ &=\min_{\bm{x}\in\mathcal{T}_{g}\atop\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\|\bm{h}+\bm{x}\|_{2}+\left\langle\bm{g},\bm{a}^{\prime}\right\rangle,\end{split}

where, in the second line, we have let t=m(1t21)0t^{\prime}=\sqrt{m(\frac{1}{t^{2}}-1)}\geq 0. It then follows from Lemma 4 that the function mint(0,1]U(𝒈,𝒉,t)\min_{t\in(0,1]}U(\bm{g},\bm{h},t) is a 2\sqrt{2}-Lipschitz function. To apply the Gaussian concentration inequality, it suffices to bound the expectation of mint(0,1]U(𝒈,𝒉,t)\min_{t\in(0,1]}U(\bm{g},\bm{h},t). To this end,

\begin{split}
\operatorname{\mathbb{E}}\min_{t\in(0,1]}U(\bm{g},\bm{h},t)&=\operatorname{\mathbb{E}}\left\{\min_{\bm{x}\in\mathcal{T}_{g}}\|\bm{h}+\bm{x}\|_{2}-\max_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\left\langle-\bm{g},\bm{a}^{\prime}\right\rangle\right\}\\
&=\operatorname{\mathbb{E}}\left\{\operatorname{dist}(-\bm{h},\mathcal{T}_{g})-\max_{\bm{a}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\left\langle-\bm{g},\bm{a}^{\prime}\right\rangle\right\}\\
&=\operatorname{\mathbb{E}}\operatorname{dist}(\bm{h},\mathcal{T}_{g})-\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})\\
&\geq\sqrt{\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{h},\mathcal{T}_{g})-1}-\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})\\
&=\sqrt{m-\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{h},\mathcal{T}_{g}^{\circ})-1}-\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})\\
&=\sqrt{m-\operatorname{\mathbb{E}}\Big(\max_{\bm{x}\in\mathcal{T}_{g}\cap\mathbb{B}_{2}^{m}}\left\langle\bm{h},\bm{x}\right\rangle\Big)^{2}-1}-\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})\\
&\geq\sqrt{m-\omega^{2}(\mathcal{T}_{g}\cap\mathbb{S}^{m-1})-2}-\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})\\
&\geq\sqrt{m}-\sqrt{\omega^{2}(\mathcal{T}_{g}\cap\mathbb{S}^{m-1})+\omega^{2}(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})+2}\\
&\geq\epsilon.
\end{split}  (35)

The third line is due to the fact that −𝒉 (or −𝒈) and 𝒉 (or 𝒈) share the same distribution. The first inequality uses Fact 4, i.e., Var(dist(𝒉,𝒯_g))=𝔼dist²(𝒉,𝒯_g)−(𝔼dist(𝒉,𝒯_g))²≤1. The fifth line holds because of Moreau's decomposition theorem (Fact 5). The next two lines follow from Facts 6 and 7, respectively. The last two inequalities are due to the measurement condition (14).
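The distance identities invoked here are easy to verify numerically. The following check (our addition) takes the simplest closed convex cone, the non-negative orthant, whose projection is coordinatewise clipping, and confirms the pointwise Moreau identity dist²(𝒉,C)+dist²(𝒉,C°)=∥𝒉∥₂², which yields 𝔼 dist²(𝒉,𝒯_g)=m−𝔼 dist²(𝒉,𝒯_g°) after taking expectations.

import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(1000)

# C = non-negative orthant, C° = non-positive orthant; the projections are
# coordinatewise clipping, so the squared distances are:
dist2_cone = np.sum(np.minimum(h, 0.0) ** 2)   # ||h - max(h, 0)||_2^2
dist2_polar = np.sum(np.maximum(h, 0.0) ** 2)  # ||h - min(h, 0)||_2^2

assert np.isclose(dist2_cone + dist2_polar, np.sum(h ** 2))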

Now using the Gaussian concentration inequality (Fact 2) yields

{mint(0,1]U(𝒈,𝒉,t)𝔼mint(0,1]U(𝒈,𝒉,t)ϵ}exp(ϵ24),\displaystyle\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)-\operatorname{\mathbb{E}}\min_{t\in(0,1]}U(\bm{g},\bm{h},t)\leq-\epsilon\rule{0.0pt}{8.53581pt}\right\}\leq\exp\left(\frac{-\epsilon^{2}}{4}\right),

which in turn implies that

{mint(0,1]U(𝒈,𝒉,t)>0}{mint(0,1]U(𝒈,𝒉,t)>𝔼mint(0,1]U(𝒈,𝒉,t)ϵ}1exp(ϵ24).\begin{split}\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}&\geq\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>\operatorname{\mathbb{E}}\min_{t\in(0,1]}U(\bm{g},\bm{h},t)-\epsilon\rule{0.0pt}{8.53581pt}\right\}\\ &\geq 1-\exp\left(\frac{-\epsilon^{2}}{4}\right).\end{split} (36)

Step 3: Complete the proof.

Combining the results in Steps 1 and 2 ((34) and (36)), we have

{min(𝒂,𝒃)(𝒯f×𝒯g)𝕊n+m1𝚽𝒂+m𝒃2>0}2{mint(0,1]U(𝒈,𝒉,t)>0}112exp(ϵ24).\begin{split}\mathbb{P}\left\{\min_{(\bm{a},\bm{b})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\|\bm{\Phi}\bm{a}+\sqrt{m}\bm{b}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{\min_{t\in(0,1]}U(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}-1\\ &\geq 1-2\exp\left(\frac{-\epsilon^{2}}{4}\right).\end{split}

Therefore, we have established that when mω2(𝒯f𝕊n1)+ω2(𝒯g𝕊m1)+2+ϵ\sqrt{m}\geq\sqrt{\omega^{2}\left(\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\right)+\omega^{2}\left(\mathcal{T}_{g}\cap\mathbb{S}^{m-1}\right)}+\sqrt{2}+\epsilon, the constrained procedures (7) and (8) succeed with probability at least 12exp(ϵ2/4)1-2\exp\left(-\epsilon^{2}/4\right).

Failure case: According to Lemma 1, the constrained procedures (7) and (8) fail if

min𝒓𝕊m1min𝒔(𝒯f×𝒯g)𝒔𝑨T𝒓2>0,\min_{\bm{r}\in\mathbb{S}^{m-1}}\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}>0,

where 𝑨=[𝚽,m𝑰m]\bm{A}=[\bm{\Phi},\sqrt{m}\bm{I}_{m}]. So it suffices to show that if the number of measurements satisfies (15), then the above inequality holds with high probability. For clarity, the proof is similarly divided into three steps.

Step 1: Problem reduction. In this step, we also use Gordon’s Lemma to convert the probability of the targeted event to another one which is easy to handle.

Let 𝒔=[𝒔1T,𝒔2T]T\bm{s}=[\bm{s}_{1}^{T},\bm{s}_{2}^{T}]^{T} and 𝒖=[𝒖1T,𝒖2T]T\bm{u}=[\bm{u}_{1}^{T},\bm{u}_{2}^{T}]^{T}. Note first that for any 𝒓𝕊m1\bm{r}\in\mathbb{S}^{m-1}, we have

min𝒔(𝒯f×𝒯g)𝒔𝑨T𝒓2=min𝒔(𝒯f×𝒯g)[𝒔1𝚽T𝒓𝒔2m𝒓]2=min𝒔(𝒯f×𝒯g)max𝒖𝕊n+m1𝚽T𝒓𝒔1,𝒖1+m𝒓𝒔2,𝒖2=min𝒔(𝒯f×𝒯g)max𝒖𝕊n+m1𝚽T𝒓,𝒖1+m𝒓,𝒖2𝒔,𝒖max𝒖𝕊n+m1[𝚽T𝒓,𝒖1+m𝒓,𝒖2max𝒔(𝒯f×𝒯g)𝒔,𝒖]=max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝚽T𝒓,𝒖1+m𝒓,𝒖2.\begin{split}\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}&=\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\left\|\begin{bmatrix}\bm{s}_{1}-\bm{\Phi}^{T}\bm{r}\\ \bm{s}_{2}-\sqrt{m}\bm{r}\end{bmatrix}\right\|_{2}\\ &=\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\max_{\bm{u}\in\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r}-\bm{s}_{1},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r}-\bm{s}_{2},\bm{u}_{2}\right\rangle\\ &=\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\max_{\bm{u}\in\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\bm{s},\bm{u}\right\rangle\\ &\geq\max_{\bm{u}\in\mathbb{S}^{n+m-1}}\left[\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle-\max_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\left\langle\bm{s},\bm{u}\right\rangle\right]\\ &=\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle.\end{split}

The inequality is due to the max-min inequality. The last line has used the fact that max𝒔(𝒯f×𝒯g)𝒔,𝒖=0\max_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\left\langle\bm{s},\bm{u}\right\rangle=0 when 𝒖𝒯f×𝒯g\bm{u}\in\mathcal{T}_{f}\times\mathcal{T}_{g}, otherwise it equals \infty. Thus we obtain

min𝒓𝕊m1min𝒔(𝒯f×𝒯g)𝒔𝑨T𝒓2min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝚽T𝒓,𝒖1+m𝒓,𝒖2.\displaystyle\min_{\bm{r}\in\mathbb{S}^{m-1}}\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}\geq\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle. (37)

We then use Gordon’s Lemma to bound the probability of the targeted event from below. To this end, for any 𝒓𝕊m1\bm{r}\in\mathbb{S}^{m-1} and (𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1(\bm{u}_{1},\bm{u}_{2})\in(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}, define the following two Gaussian processes

X𝒓,(𝒖1,𝒖2):=𝚽T𝒓,𝒖1+𝒖12gX_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}:=\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\|\bm{u}_{1}\|_{2}\cdot g~{}

and

Y𝒓,(𝒖1,𝒖2):=𝒈,𝒖1+𝒖12𝒉,𝒓,Y_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}:=\left\langle\bm{g},\bm{u}_{1}\right\rangle+\|\bm{u}_{1}\|_{2}\left\langle\bm{h},\bm{r}\right\rangle,

where g𝒩(0,1)g\sim\mathcal{N}(0,1), 𝒈𝒩(𝟎,𝑰n)\bm{g}\sim\mathcal{N}(\bm{0},\bm{I}_{n}), and 𝒉𝒩(𝟎,𝑰m)\bm{h}\sim\mathcal{N}(\bm{0},\bm{I}_{m}) are independent of each other. It can be easily checked that these two Gaussian processes satisfy the conditions in Gordon’s Lemma:

𝔼X𝒓,(𝒖1,𝒖2)2\displaystyle\operatorname{\mathbb{E}}X_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}^{2} =2𝒖122=𝔼Y𝒓,(𝒖1,𝒖2)2,\displaystyle=2\|\bm{u}_{1}\|_{2}^{2}=\operatorname{\mathbb{E}}Y_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}^{2},
𝔼[X𝒓,(𝒖1,𝒖2)X𝒓,(𝒖1,𝒖2)]𝔼[Y𝒓,(𝒖1,𝒖2)Y𝒓,(𝒖1,𝒖2)]\displaystyle\operatorname{\mathbb{E}}[X_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}X_{\bm{r}^{\prime},(\bm{u}_{1}^{\prime},\bm{u}_{2}^{\prime})}]-\operatorname{\mathbb{E}}[Y_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}Y_{\bm{r}^{\prime},(\bm{u}_{1}^{\prime},\bm{u}_{2}^{\prime})}] =𝒓,𝒓𝒖1,𝒖1+𝒖12𝒖12𝒖1,𝒖1𝒖12𝒖12𝒓,𝒓\displaystyle=\left\langle\bm{r},\bm{r}^{\prime}\right\rangle\left\langle\bm{u}_{1},\bm{u}_{1}^{\prime}\right\rangle+\|\bm{u}_{1}\|_{2}\|\bm{u}_{1}^{\prime}\|_{2}-\left\langle\bm{u}_{1},\bm{u}_{1}^{\prime}\right\rangle-\|\bm{u}_{1}\|_{2}\|\bm{u}_{1}^{\prime}\|_{2}\left\langle\bm{r},\bm{r}^{\prime}\right\rangle
=(1𝒓,𝒓)(𝒖12𝒖12𝒖1,𝒖1)\displaystyle=\left(1-\left\langle\bm{r},\bm{r}^{\prime}\right\rangle\right)\left(\|\bm{u}_{1}\|_{2}\|\bm{u}_{1}^{\prime}\|_{2}-\left\langle\bm{u}_{1},\bm{u}_{1}^{\prime}\right\rangle\right)
0.\displaystyle\geq 0.

Here, in the last line, the equality holds when 𝒓=𝒓\bm{r}=\bm{r}^{\prime}. It follows from Gordon’s Lemma (Fact 1) that (by setting τ𝒓,(𝒖1,𝒖2)=m𝒓,𝒖2+0+\tau_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}=-\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle+0_{+})

{min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1Y𝒓,(𝒖1,𝒖2)τ𝒓,(𝒖1,𝒖2)}={min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝒈,𝒖1+𝒖12𝒉,𝒓+m𝒓,𝒖2>0}\displaystyle\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}Y_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}\geq\tau_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}\rule{0.0pt}{8.53581pt}\right\}=\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{g},\bm{u}_{1}\right\rangle+\|\bm{u}_{1}\|_{2}\left\langle\bm{h},\bm{r}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
{min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1X𝒓,(𝒖1,𝒖2)τ𝒓,(𝒖1,𝒖2)}\displaystyle\hskip 150.0pt\leq\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}X_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}\geq\tau_{\bm{r},(\bm{u}_{1},\bm{u}_{2})}\rule{0.0pt}{8.53581pt}\right\}
={min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝚽T𝒓,𝒖1+𝒖12g+m𝒓,𝒖2>0}\displaystyle\hskip 150.0pt=\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\|\bm{u}_{1}\|_{2}\cdot g+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
12+12{min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝚽T𝒓,𝒖1+𝒖12g+m𝒓,𝒖2>0|g0}\displaystyle\hskip 150.0pt\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\|\bm{u}_{1}\|_{2}\cdot g+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle>0\Big{|}g\leq 0\rule{0.0pt}{8.53581pt}\right\}
12+12{min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝚽T𝒓,𝒖1+m𝒓,𝒖2>0},\displaystyle\hskip 150.0pt\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\},

which implies

{min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝚽T𝒓,𝒖1+m𝒓,𝒖2>0}\displaystyle\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}^{T}\bm{r},\bm{u}_{1}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
2{min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝒈,𝒖1+𝒖12𝒉,𝒓+m𝒓,𝒖2:=2>0}1.\displaystyle\hskip 135.0pt\geq 2\mathbb{P}\left\{\underbrace{\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop(\mathcal{T}_{f}\times\mathcal{T}_{g})\cap\mathbb{S}^{n+m-1}}\left\langle\bm{g},\bm{u}_{1}\right\rangle+\|\bm{u}_{1}\|_{2}\left\langle\bm{h},\bm{r}\right\rangle+\left\langle\sqrt{m}\bm{r},\bm{u}_{2}\right\rangle}_{:=\mathscr{E}_{2}}>0\rule{0.0pt}{8.53581pt}\right\}-1. (38)

Moreover, ℰ₂ can be bounded from below as follows:

2=min𝒓𝕊m1max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝒈,𝒖1+𝒓,𝒖12𝒉+m𝒖2max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1min𝒓𝕊m1𝒈,𝒖1+𝒓,𝒖12𝒉+m𝒖2=max(𝒖1,𝒖2)(𝒯f×𝒯g)𝕊n+m1𝒈,𝒖1𝒖12𝒉+m𝒖22=maxt[0,1]max𝒖1𝒯f𝕊n1𝒖2𝒯g𝕊m1t𝒈,𝒖1t𝒉+m(1t2)𝒖22.\begin{split}\mathscr{E}_{2}&=\min_{\bm{r}\in\mathbb{S}^{m-1}}\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\left\langle\bm{g},\bm{u}_{1}\right\rangle+\left\langle\bm{r},\|\bm{u}_{1}\|_{2}\bm{h}+\sqrt{m}\bm{u}_{2}\right\rangle\\ &\geq\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\min_{\bm{r}\in\mathbb{S}^{m-1}}\left\langle\bm{g},\bm{u}_{1}\right\rangle+\left\langle\bm{r},\|\bm{u}_{1}\|_{2}\bm{h}+\sqrt{m}\bm{u}_{2}\right\rangle\\ &=\max_{(\bm{u}_{1},\bm{u}_{2})\in\atop\left(\mathcal{T}_{f}\times\mathcal{T}_{g}\right)\cap\mathbb{S}^{n+m-1}}\left\langle\bm{g},\bm{u}_{1}\right\rangle-\bigg{\|}\|\bm{u}_{1}\|_{2}\bm{h}+\sqrt{m}\bm{u}_{2}\bigg{\|}_{2}\\ &=\max_{t\in[0,1]}\max_{\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{u}_{2}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}t\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\bigg{\|}t\bm{h}+\sqrt{m(1-t^{2})}\bm{u}_{2}^{\prime}\bigg{\|}_{2}.\end{split}

The inequality holds because of the max-min inequality. In the last line, we have let 𝒖12=t,𝒖22=1t2\|\bm{u}_{1}\|_{2}=t,~{}\|\bm{u}_{2}\|_{2}=\sqrt{1-t^{2}}, 𝒖1=𝒖1/𝒖12,\bm{u}_{1}^{\prime}={\bm{u}_{1}}/{\|\bm{u}_{1}\|_{2}}, and 𝒖2=𝒖2/𝒖22\bm{u}_{2}^{\prime}={\bm{u}_{2}}/{\|\bm{u}_{2}\|_{2}}.

Define

W(𝒈,𝒉,t):=max𝒖1𝒯f𝕊n1𝒖2𝒯g𝕊m1𝒈,𝒖1𝒉+m(1t21)𝒖22.\begin{split}W(\bm{g},\bm{h},t):=\max_{\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{u}_{2}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\bigg{\|}\bm{h}+\sqrt{m(\frac{1}{t^{2}}-1)}\cdot\bm{u}_{2}^{\prime}\bigg{\|}_{2}.\end{split}

Let

t_{2}\in\arg\max_{t\in[0,1]}W(\bm{g},\bm{h},t).

Clearly, t₂≠0, since W(𝒈,𝒉,t)→−∞ as t→0₊. (Recall that the effective domain of an extended real-valued function k:X→ℝ̄ is defined as {x∈X | k(x)∈ℝ∪{−∞}}.) Then we have

\begin{split}
&\mathbb{P}\left\{\max_{t\in[0,1]}\max_{\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1},\,\bm{u}_{2}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}t\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\Big\|t\bm{h}+\sqrt{m(1-t^{2})}\bm{u}_{2}^{\prime}\Big\|_{2}>0\right\}=\mathbb{P}\left\{\max_{t\in[0,1]}t\cdot W(\bm{g},\bm{h},t)>0\right\}\\
&\qquad\geq\mathbb{P}\left\{t_{2}\cdot W(\bm{g},\bm{h},t_{2})>0\right\}=\mathbb{P}\left\{W(\bm{g},\bm{h},t_{2})>0\right\}=\mathbb{P}\left\{\max_{t\in(0,1]}W(\bm{g},\bm{h},t)>0\right\}.
\end{split}  (39)

Combining (37), (38), and (39), we obtain

{min𝒓𝕊m1min𝒔(𝒯f×𝒯g)𝒔𝑨T𝒓2>0}2{maxt(0,1]W(𝒈,𝒉,t)>0}1.\begin{split}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{\max_{t\in(0,1]}W(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}-1.\end{split} (40)

Therefore, our goal reduces to establishing a lower bound for ℙ{max_{t∈(0,1]}W(𝒈,𝒉,t)>0}.

Step 2: Establish the lower bound for ℙ{max_{t∈(0,1]}W(𝒈,𝒉,t)>0}. In this step, we again use the Gaussian concentration inequality.

Similar to the success case, maxt(0,1]W(𝒈,𝒉,t)\max_{t\in(0,1]}W(\bm{g},\bm{h},t) can be rewritten as

maxt(0,1]W(𝒈,𝒉,t)=maxt(0,1]max𝒖1𝒯f𝕊n1𝒖2𝒯g𝕊m1𝒈,𝒖1𝒉+m(1t21)𝒖22=max𝒖1𝒯f𝕊n1maxt0𝒖2𝒯g𝕊m1𝒈,𝒖1𝒉+t𝒖22=max𝒙𝒯g𝒖1𝒯f𝕊n1𝒈,𝒖1𝒉+𝒙2.\begin{split}\max_{t\in(0,1]}W(\bm{g},\bm{h},t)&=\max_{t\in(0,1]}\max_{\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\atop\bm{u}_{2}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\bigg{\|}\bm{h}+\sqrt{m(\frac{1}{t^{2}}-1)}\cdot\bm{u}_{2}^{\prime}\bigg{\|}_{2}\\ &=\max_{\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\max_{t^{\prime}\geq 0\atop\bm{u}_{2}^{\prime}\in\mathcal{T}_{g}\cap\mathbb{S}^{m-1}}\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\bigg{\|}\bm{h}+t^{\prime}\cdot\bm{u}_{2}^{\prime}\bigg{\|}_{2}\\ &=\max_{\bm{x}\in\mathcal{T}_{g}\atop\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\|\bm{h}+\bm{x}\|_{2}.\\ \end{split}

It then follows from Lemma 4 that maxt(0,1]W(𝒈,𝒉,t)\max_{t\in(0,1]}W(\bm{g},\bm{h},t) is a 2\sqrt{2}-Lipschitz function. Moreover, its expectation can be bounded from below:

\begin{split}
\operatorname{\mathbb{E}}\max_{t\in(0,1]}W(\bm{g},\bm{h},t)&=\operatorname{\mathbb{E}}\left(\max_{\bm{x}\in\mathcal{T}_{g},\,\bm{u}_{1}^{\prime}\in\mathcal{T}_{f}\cap\mathbb{S}^{n-1}}\left\langle\bm{g},\bm{u}_{1}^{\prime}\right\rangle-\|\bm{h}+\bm{x}\|_{2}\right)\\
&=\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})-\operatorname{\mathbb{E}}\operatorname{dist}(\bm{h},\mathcal{T}_{g})\\
&\geq\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})-\sqrt{\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{h},\mathcal{T}_{g})}\\
&=\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})-\sqrt{m-\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{h},\mathcal{T}_{g}^{\circ})}\\
&=\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})-\sqrt{m-\operatorname{\mathbb{E}}\Big(\max_{\bm{x}\in\mathcal{T}_{g}\cap\mathbb{B}_{2}^{m}}\left\langle\bm{h},\bm{x}\right\rangle\Big)^{2}}\\
&\geq\omega(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})-\sqrt{m-\omega^{2}(\mathcal{T}_{g}\cap\mathbb{S}^{m-1})}\\
&\geq\sqrt{\omega^{2}(\mathcal{T}_{g}\cap\mathbb{S}^{m-1})+\omega^{2}(\mathcal{T}_{f}\cap\mathbb{S}^{n-1})}-\sqrt{m}\\
&\geq\epsilon.
\end{split}  (41)

The second line holds because −𝒉 and 𝒉 have the same distribution. The first inequality is due to Jensen's inequality. The fourth line has used Moreau's decomposition theorem (Fact 5). The next two lines follow from Facts 6 and 7, respectively. The last two lines hold because of the measurement condition (15).

Now applying the Gaussian concentration inequality (Fact 2) yields

{maxt(0,1]W(𝒈,𝒉,t)𝔼maxt(0,1]W(𝒈,𝒉,t)ϵ}exp(ϵ24),\displaystyle\mathbb{P}\left\{\max_{t\in(0,1]}W(\bm{g},\bm{h},t)-\operatorname{\mathbb{E}}\max_{t\in(0,1]}W(\bm{g},\bm{h},t)\leq-\epsilon\rule{0.0pt}{8.53581pt}\right\}\leq\exp\left(\frac{-\epsilon^{2}}{4}\right),

which in turn implies that

{maxt(0,1]W(𝒈,𝒉,t)>0}{maxt(0,1]W(𝒈,𝒉,t)>𝔼maxt(0,1]W(𝒈,𝒉,t)ϵ}1exp(ϵ24).\begin{split}\mathbb{P}\left\{\max_{t\in(0,1]}W(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}&\geq\mathbb{P}\left\{\max_{t\in(0,1]}W(\bm{g},\bm{h},t)>\operatorname{\mathbb{E}}\max_{t\in(0,1]}W(\bm{g},\bm{h},t)-\epsilon\rule{0.0pt}{8.53581pt}\right\}\\ &\geq 1-\exp\left(\frac{-\epsilon^{2}}{4}\right).\end{split} (42)

Step 3: Complete the proof. Putting (40) and (42) together, we have

{min𝒓𝕊m1min𝒔(𝒯f×𝒯g)𝒔𝑨T𝒓2>0}2{maxt(0,1]W(𝒈,𝒉,t)>0}112exp(ϵ24).\begin{split}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{m-1}}\min_{\bm{s}\in(\mathcal{T}_{f}\times\mathcal{T}_{g})^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{\max_{t\in(0,1]}W(\bm{g},\bm{h},t)>0\rule{0.0pt}{8.53581pt}\right\}-1\\ &\geq 1-2\exp\left(\frac{-\epsilon^{2}}{4}\right).\end{split}

Thus we have established that when mω2(𝒯f𝕊n1)+ω2(𝒯g𝕊m1)ϵ\sqrt{m}\leq\sqrt{\omega^{2}\left(\mathcal{T}_{f}\cap\mathbb{S}^{n-1}\right)+\omega^{2}\left(\mathcal{T}_{g}\cap\mathbb{S}^{m-1}\right)}-\epsilon, the constrained procedures (7) and (8) fail with probability at least 12exp(ϵ2/4)1-2\exp\left(-\epsilon^{2}/4\right). This completes the proof.

Appendix B Proofs of Lemma 2 and Theorem 2

In this appendix, we prove the phase transition result for the penalized recovery procedure. Some auxiliary lemmas and facts used in the proofs are included in Appendix E.

B-A Proof of Lemma 2

Proof.

The penalized recovery procedure (9) can be reformulated as the following unconstrained form

min𝒙f(𝒙)+λg(1m(𝒚𝚽𝒙)).\displaystyle\min_{\bm{x}}~{}f(\bm{x})+\lambda\cdot g\left(\frac{1}{\sqrt{m}}(\bm{y}-\bm{\Phi}\bm{x})\right).

Define F(𝒙)=f(𝒙)+λg(1m(𝒚𝚽𝒙))F(\bm{x})=f(\bm{x})+\lambda\cdot g\left(\frac{1}{\sqrt{m}}(\bm{y}-\bm{\Phi}\bm{x})\right). Clearly, F(𝒙)F(\bm{x}) is a proper convex function. It follows from [57, Theorems 23.8 and 23.9] that the subdifferential of FF at 𝒙\bm{x}^{\star} is given by

F(𝒙)=f(𝒙)λm𝚽Tg(𝒗).\displaystyle\partial F(\bm{x}^{\star})=\partial f(\bm{x}^{\star})-\frac{\lambda}{\sqrt{m}}\bm{\Phi}^{T}\cdot\partial g(\bm{v}^{\star}).

Moreover, F(𝒙) attains its minimum at 𝒙⋆ if and only if 𝟎∈∂F(𝒙⋆) [57, Theorem 27.1]. Therefore, if

𝟎𝚽Tg(𝒗)mλf(𝒙),\displaystyle\bm{0}\in\bm{\Phi}^{T}\cdot\partial g(\bm{v}^{\star})-\frac{\sqrt{m}}{\lambda}\partial f(\bm{x}^{\star}), (43)

then the penalized problem (9) succeeds. If 𝟎F(𝒙)\bm{0}\notin\partial F(\bm{x}^{\star}), i.e.,

\min_{\bm{a}\in\partial f(\bm{x}^{\star}),\,\bm{b}\in\partial g(\bm{v}^{\star})}\left\|\bm{\Phi}^{T}\bm{b}-\frac{\sqrt{m}}{\lambda}\bm{a}\right\|_{2}>0,

then the penalized problem (9) fails.

Clearly, (43) holds if 𝟎∈𝑴(𝒯_J∩𝕊^{n+m−1}), where 𝑴=[−(√m/λ)𝑰_n, 𝚽ᵀ]. Since f and g are proper convex functions, ∂f(𝒙⋆) and ∂g(𝒗⋆) are nonempty, closed convex sets, and hence 𝒯_J∩𝕊^{n+m−1} is nonempty, closed, and spherically convex. A direct application of the polarity principle (Fact 3) yields the desired sufficient condition (19).
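In the ℓ₁/ℓ₁ setting, the success condition (43) can be checked directly for a given instance: it asks for some 𝒃∈∂∥𝒗⋆∥₁ with (λ/√m)𝚽ᵀ𝒃∈∂∥𝒙⋆∥₁, and membership in an ℓ₁ subdifferential amounts to sign equalities on the support plus an ℓ∞ box constraint, i.e., a linear feasibility problem. The sketch below poses it in cvxpy; the function name and interface are our own.

import numpy as np
import cvxpy as cp

def certificate_exists(Phi, x_star, v_star, lam):
    """Check condition (43) for f = g = l1: does some b in the subdifferential
    of ||v*||_1 satisfy (lam / sqrt(m)) * Phi^T b in the subdifferential of
    ||x*||_1?  Returns True when the linear feasibility problem is solvable."""
    m = Phi.shape[0]
    Sx, Sv = np.flatnonzero(x_star), np.flatnonzero(v_star)
    b = cp.Variable(m)
    a = (lam / np.sqrt(m)) * (Phi.T @ b)
    constraints = [b[Sv] == np.sign(v_star[Sv]), cp.norm_inf(b) <= 1,
                   a[Sx] == np.sign(x_star[Sx]), cp.norm_inf(a) <= 1]
    problem = cp.Problem(cp.Minimize(0), constraints)
    problem.solve()
    return problem.status == cp.OPTIMAL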

B-B Proof of Theorem 2

Proof.

Success case: By Lemma 2, the penalized problem (9) succeeds if

min𝒓𝕊n1min𝒔𝒯J𝒔𝑴T𝒓2>0,\displaystyle\min_{\bm{r}\in\mathbb{S}^{n-1}}\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\|\bm{s}-\bm{M}^{T}\bm{r}\|_{2}>0,

where 𝑴=[mλ𝑰n,𝚽T]\bm{M}=[-\frac{\sqrt{m}}{\lambda}\bm{I}_{n},\bm{\Phi}^{T}]. So it is sufficient to show that if the number of measurements satisfies (20), then the above inequality holds with high probability. The proof is also divided into three steps.

Step 1: Problem reduction. In this step, we similarly use Gordon’s Lemma to convert the probability of the targeted event to another one which can be handled easily.

Let 𝒔=[𝒔1T,𝒔2T]T\bm{s}=[\bm{s}_{1}^{T},\bm{s}_{2}^{T}]^{T} and 𝒖=[𝒖1T,𝒖2T]T\bm{u}=[\bm{u}_{1}^{T},\bm{u}_{2}^{T}]^{T}. Note first that for any 𝒓𝕊n1\bm{r}\in\mathbb{S}^{n-1}, we have

min𝒔𝒯J𝒔𝑴T𝒓2=min𝒔𝒯J[𝒔1+mλ𝒓𝒔2𝚽𝒓]2=min𝒔𝒯Jmax𝒖𝕊n+m1𝒔1+mλ𝒓,𝒖1+𝒔2𝚽𝒓,𝒖2=min𝒔𝒯Jmax𝒖𝕊n+m1𝚽𝒓,𝒖2mλ𝒓,𝒖1𝒔,𝒖max𝒖𝕊n+m1[𝚽𝒓,𝒖2mλ𝒓,𝒖1max𝒔𝒯J𝒔,𝒖]=max𝒖𝒯J𝕊n+m1𝚽𝒓,𝒖2mλ𝒓,𝒖1.\begin{split}\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\|\bm{s}-\bm{M}^{T}\bm{r}\|_{2}&=\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\left\|\begin{bmatrix}\bm{s}_{1}+\frac{\sqrt{m}}{\lambda}\bm{r}\\ \bm{s}_{2}-\bm{\Phi}\bm{r}\end{bmatrix}\right\|_{2}\\ &=\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\max_{\bm{u}\in\mathbb{S}^{n+m-1}}\left\langle\bm{s}_{1}+\frac{\sqrt{m}}{\lambda}\bm{r},-\bm{u}_{1}\right\rangle+\left\langle\bm{s}_{2}-\bm{\Phi}\bm{r},-\bm{u}_{2}\right\rangle\\ &=\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\max_{\bm{u}\in\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle-\left\langle\bm{s},\bm{u}\right\rangle\\ &\geq\max_{\bm{u}\in\mathbb{S}^{n+m-1}}\left[\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle-\max_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\left\langle\bm{s},\bm{u}\right\rangle\right]\\ &=\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle.\end{split}

The inequality is due to the max-min inequality. The last step has used the fact that max𝒔𝒯J𝒔,𝒖=0\max_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\left\langle\bm{s},\bm{u}\right\rangle=0 when 𝒖𝒯J\bm{u}\in\mathcal{T}_{J}, otherwise it equals \infty. Thus we obtain

min𝒓𝕊n1min𝒔𝒯J𝒔𝑴T𝒓2min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝚽𝒓,𝒖2mλ𝒓,𝒖1.\displaystyle\min_{\bm{r}\in\mathbb{S}^{n-1}}\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\|\bm{s}-\bm{M}^{T}\bm{r}\|_{2}\quad\geq\quad\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle. (44)

We then use Gordon’s Lemma to establish a lower bound for the probability of the targeted event. To this end, for any 𝒓𝕊n1\bm{r}\in\mathbb{S}^{n-1} and 𝒖𝒯J𝕊n+m1\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}, define the following two Gaussian processes

X𝒓,𝒖:=𝚽𝒓,𝒖2+𝒖22gX_{\bm{r},\bm{u}}:=\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle+\|\bm{u}_{2}\|_{2}\cdot g~{}~{}~{}

and

Y𝒓,𝒖:=𝒉,𝒖2+𝒖22𝒈,𝒓Y_{\bm{r},\bm{u}}:=\left\langle\bm{h},\bm{u}_{2}\right\rangle+\|\bm{u}_{2}\|_{2}\left\langle\bm{g},\bm{r}\right\rangle

where g𝒩(0,1)g\sim\mathcal{N}(0,1), 𝒉𝒩(𝟎,𝑰m)\bm{h}\sim\mathcal{N}(\bm{0},\bm{I}_{m}), and 𝒈𝒩(𝟎,𝑰n)\bm{g}\sim\mathcal{N}(\bm{0},\bm{I}_{n}) are independent of each other. Direct calculations show that these two defined Gaussian processes satisfy the conditions in Gordon’s Lemma:

𝔼X𝒓,𝒖2\displaystyle\operatorname{\mathbb{E}}X_{\bm{r},\bm{u}}^{2} =2𝒖222=𝔼Y𝒓,𝒖2,\displaystyle=2\|\bm{u}_{2}\|_{2}^{2}=\operatorname{\mathbb{E}}Y_{\bm{r},\bm{u}}^{2},
𝔼[X𝒓,𝒖X𝒓,𝒖]𝔼[Y𝒓,𝒖Y𝒓,𝒖]\displaystyle\operatorname{\mathbb{E}}[X_{\bm{r},\bm{u}}X_{\bm{r}^{\prime},\bm{u}^{\prime}}]-\operatorname{\mathbb{E}}[Y_{\bm{r},\bm{u}}Y_{\bm{r}^{\prime},\bm{u}^{\prime}}] =𝒓,𝒓𝒖2,𝒖2+𝒖22𝒖22𝒖2,𝒖2𝒖22𝒖22𝒓,𝒓\displaystyle=\left\langle\bm{r},\bm{r}^{\prime}\right\rangle\left\langle\bm{u}_{2},\bm{u}_{2}^{\prime}\right\rangle+\|\bm{u}_{2}\|_{2}\|\bm{u}_{2}^{\prime}\|_{2}-\left\langle\bm{u}_{2},\bm{u}_{2}^{\prime}\right\rangle-\|\bm{u}_{2}\|_{2}\|\bm{u}_{2}^{\prime}\|_{2}\left\langle\bm{r},\bm{r}^{\prime}\right\rangle
=(1𝒓,𝒓)(𝒖22𝒖22𝒖2,𝒖2)\displaystyle=\left(1-\left\langle\bm{r},\bm{r}^{\prime}\right\rangle\right)\left(\|\bm{u}_{2}\|_{2}\|\bm{u}_{2}^{\prime}\|_{2}-\left\langle\bm{u}_{2},\bm{u}_{2}^{\prime}\right\rangle\right)
0.\displaystyle\geq 0.

In the last line, the equality holds when 𝒓=𝒓\bm{r}=\bm{r}^{\prime}. It follows from Gordon’s lemma (Fact 1) that (by setting τ𝒓,𝒖=mλ𝒖1T𝒓+0+\tau_{\bm{r},\bm{u}}=\frac{\sqrt{m}}{\lambda}\bm{u}_{1}^{T}\bm{r}+0_{+})

{min𝒓𝕊n1max𝒖𝒯J𝕊n+m1Y𝒓,𝒖τ𝒓,𝒖}\displaystyle\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}Y_{\bm{r},\bm{u}}\geq\tau_{\bm{r},\bm{u}}\rule{0.0pt}{8.53581pt}\right\} ={min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝒉,𝒖2+𝒖22𝒈,𝒓mλ𝒓,𝒖1>0}\displaystyle=\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{h},\bm{u}_{2}\right\rangle+\|\bm{u}_{2}\|_{2}\left\langle\bm{g},\bm{r}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
{min𝒓𝕊n1max𝒖𝒯J𝕊n+m1X𝒓,𝒖τ𝒓,𝒖}\displaystyle\leq\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}X_{\bm{r},\bm{u}}\geq\tau_{\bm{r},\bm{u}}\rule{0.0pt}{8.53581pt}\right\}
={min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝚽𝒓,𝒖2+𝒖22gmλ𝒓,𝒖1>0}\displaystyle=\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle+\|\bm{u}_{2}\|_{2}\cdot g-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
12+12{min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝚽𝒓,𝒖2+𝒖22gmλ𝒓,𝒖1>0|g0}\displaystyle\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle+\|\bm{u}_{2}\|_{2}\cdot g-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle>0~{}\big{|}~{}g\leq 0\rule{0.0pt}{8.53581pt}\right\}
12+12{min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝚽𝒓,𝒖2mλ𝒓,𝒖1>0},\displaystyle\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\},

which implies

{min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝚽𝒓,𝒖2mλ𝒓,𝒖1>0}\displaystyle\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{\Phi}\bm{r},\bm{u}_{2}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
2{min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝒉,𝒖2+𝒖22𝒈,𝒓mλ𝒓,𝒖1:=3>0}1.\displaystyle\hskip 135.0pt\geq 2\mathbb{P}\left\{\underbrace{\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{h},\bm{u}_{2}\right\rangle+\|\bm{u}_{2}\|_{2}\left\langle\bm{g},\bm{r}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{r},\bm{u}_{1}\right\rangle}_{:=~{}\mathscr{E}_{3}}>0\rule{0.0pt}{8.53581pt}\right\}-1. (45)

Moreover, 3\mathscr{E}_{3} can be bounded from below as follows

3=min𝒓𝕊n1max𝒖𝒯J𝕊n+m1𝒉,𝒖2+𝒓,𝒖22𝒈mλ𝒖1max𝒖𝒯J𝕊n+m1min𝒓𝕊n1𝒉,𝒖2+𝒓,𝒖22𝒈mλ𝒖1=max𝒖𝒯J𝕊n+m1𝒉,𝒖2𝒖22𝒈mλ𝒖12=max𝒂f(𝒙)𝒃g(𝒗)𝒃2𝒂22+𝒃22:=c(𝒂,𝒃)[𝒉,𝒃𝒃2𝒈mλ𝒃2𝒂2]\begin{split}\mathscr{E}_{3}&=\min_{\bm{r}\in\mathbb{S}^{n-1}}\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{h},\bm{u}_{2}\right\rangle+\left\langle\bm{r},\|\bm{u}_{2}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{u}_{1}\right\rangle\\ &\geq\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\min_{\bm{r}\in\mathbb{S}^{n-1}}\left\langle\bm{h},\bm{u}_{2}\right\rangle+\left\langle\bm{r},\|\bm{u}_{2}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{u}_{1}\right\rangle\\ &=\max_{\bm{u}\in\mathcal{T}_{J}\cap\mathbb{S}^{n+m-1}}\left\langle\bm{h},\bm{u}_{2}\right\rangle-\left\|\|\bm{u}_{2}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{u}_{1}\right\|_{2}\\ &=\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\underbrace{\frac{\|\bm{b}\|_{2}}{\sqrt{\|\bm{a}\|_{2}^{2}+\|\bm{b}\|_{2}^{2}}}}_{:=c(\bm{a},\bm{b})}\cdot\left[\left\langle\bm{h},\frac{\bm{b}}{\|\bm{b}\|_{2}}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda\|\bm{b}\|_{2}}\bm{a}\right\|_{2}\right]\\ \end{split}

The first inequality holds because of the max-min inequality. In the last line, recall that the joint cone is defined as 𝒯J={t(𝒂,𝒃):t0,𝒂f(𝒙),𝒃g(𝒗)}\mathcal{T}_{J}=\{t\cdot(\bm{a},\bm{b}):t\geq 0,~{}\bm{a}\in\partial f(\bm{x}^{\star}),~{}\bm{b}\in\partial g(\bm{v}^{\star})\}, so we have let 𝒖1=𝒂𝒂22+𝒃22\bm{u}_{1}=\frac{\bm{a}}{\sqrt{\|\bm{a}\|_{2}^{2}+\|\bm{b}\|_{2}^{2}}} and 𝒖2=𝒃𝒂22+𝒃22\bm{u}_{2}=\frac{\bm{b}}{\sqrt{\|\bm{a}\|_{2}^{2}+\|\bm{b}\|_{2}^{2}}} for 𝒂f(𝒙),𝒃g(𝒗)\bm{a}\in\partial f(\bm{x}^{\star}),~{}\bm{b}\in\partial g(\bm{v}^{\star}).

Since f(𝒙)\partial f(\bm{x}^{\star}) and g(𝒗)\partial g(\bm{v}^{\star}) are nonempty and closed, we choose (𝒂0,𝒃0)(\bm{a}_{0},\bm{b}_{0}) such that

(𝒂0,𝒃0)argmax𝒂f(𝒙)𝒃g(𝒗)𝒉,𝒃𝒃2𝒈mλ𝒃2𝒂2,(\bm{a}_{0},\bm{b}_{0})\in\arg\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\left\langle\bm{h},\frac{\bm{b}}{\|\bm{b}\|_{2}}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda\|\bm{b}\|_{2}}\bm{a}\right\|_{2},

which leads to

3c(𝒂0,𝒃0)[𝒉,𝒃0𝒃02𝒈mλ𝒃02𝒂02]=c(𝒂0,𝒃0)max𝒂f(𝒙)𝒃g(𝒗)𝒉,𝒃𝒃2𝒈mλ𝒃2𝒂2=c(𝒂0,𝒃0)maxαtβmax𝒂f(𝒙)𝒃g(𝒗)t𝕊m1𝒉,𝒃t𝒈mλt𝒂2\begin{split}\mathscr{E}_{3}&\geq c(\bm{a}_{0},\bm{b}_{0})\cdot\left[\left\langle\bm{h},\frac{\bm{b}_{0}}{\|\bm{b}_{0}\|_{2}}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda\|\bm{b}_{0}\|_{2}}\bm{a}_{0}\right\|_{2}\right]\\ &=c(\bm{a}_{0},\bm{b}_{0})\cdot\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\left\langle\bm{h},\frac{\bm{b}}{\|\bm{b}\|_{2}}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda\|\bm{b}\|_{2}}\bm{a}\right\|_{2}\\ &=c(\bm{a}_{0},\bm{b}_{0})\cdot\max_{\alpha\leq t\leq\beta}\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t\mathbb{S}^{m-1}}\left\langle\bm{h},\frac{\bm{b}}{t}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t}\bm{a}\right\|_{2}\\ \end{split}

In the last line, we have let 𝒃2=t\|\bm{b}\|_{2}=t, α=min𝒃g(𝒗)𝒃2\alpha=\min_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}, and β=max𝒃g(𝒗)𝒃2\beta=\max_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}.

Define

L(𝒈,𝒉,t):=max𝒂f(𝒙)𝒃g(𝒗)t𝕊m1𝒉,𝒃t𝒈mλt𝒂2,L(\bm{g},\bm{h},t):=\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t\mathbb{S}^{m-1}}\left\langle\bm{h},\frac{\bm{b}}{t}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t}\bm{a}\right\|_{2},

and choose t3t_{3} such that

t3argminαtβ𝔼\displaystyle t_{3}\in\arg\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}} [2dist(𝒈,mλtf(𝒙))+dist2(𝒉,1tg(𝒗)𝕊m1)1].\displaystyle\left[2\cdot\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1\right].

Then we have

3c(𝒂0,𝒃0)maxαtβL(𝒈,𝒉,t)c(𝒂0,𝒃0)L(𝒈,𝒉,t3).\begin{split}\mathscr{E}_{3}&\geq c(\bm{a}_{0},\bm{b}_{0})\cdot\max_{\alpha\leq t\leq\beta}L(\bm{g},\bm{h},t)\\ &\geq c(\bm{a}_{0},\bm{b}_{0})\cdot L(\bm{g},\bm{h},t_{3}).\end{split} (46)

Combining (44), (B-B), and (46) yields

{min𝒓𝕊n1min𝒔𝒯J𝒔𝑴T𝒓2>0}2{c(𝒂0,𝒃0)L(𝒈,𝒉,t3)>0}1=2{L(𝒈,𝒉,t3)>0}1,\begin{split}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\|\bm{s}-\bm{M}^{T}\bm{r}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{c(\bm{a}_{0},\bm{b}_{0})\cdot L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}-1\\ &=2\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}-1,\end{split} (47)

where the last line holds because 0g(𝒗)0\notin\partial g(\bm{v}^{\star}), and hence c(𝒂0,𝒃0)>0c(\bm{a}_{0},\bm{b}_{0})>0. Thus it suffices to establish the lower bound for {L(𝒈,𝒉,t3)>0}\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}.

Step 2: Establish the lower bound for {L(g,h,t3)>0}\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}. In this step, we apply the Gaussian concentration inequality to establish the lower bound for {L(𝒈,𝒉,t3)>0}\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}.

It follows from Lemma 4 that the function L(𝒈,𝒉,t3)L(\bm{g},\bm{h},t_{3}) is a 2\sqrt{2}-Lipschitz function. Its expectation can be bounded from below as follows:

𝔼L(𝒈,𝒉,t3)=𝔼max𝒂f(𝒙)𝒃g(𝒗)t3𝕊m1𝒉,𝒃t3𝒈mλt3𝒂2=𝔼max𝒂f(𝒙)𝒃g(𝒗)t3𝕊m11+𝒉22𝒉𝒃t3222𝒈mλt3𝒂2=m2𝔼min𝒂f(𝒙)𝒃g(𝒗)t3𝕊m1(𝒈mλt3𝒂2+12𝒉𝒃t32212)=m2𝔼[dist(𝒈,mλt3f(𝒙))+12dist2(𝒉,1t3g(𝒗)𝕊m1)12]=m2minαtβ𝔼[dist(𝒈,mλtf(𝒙))+12dist2(𝒉,1tg(𝒗)𝕊m1)12]ϵ/2.\begin{split}\operatorname{\mathbb{E}}L(\bm{g},\bm{h},t_{3})&=\operatorname{\mathbb{E}}\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{3}\mathbb{S}^{m-1}}\left\langle\bm{h},\frac{\bm{b}}{t_{3}}\right\rangle-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{3}}\bm{a}\right\|_{2}\\ &=\operatorname{\mathbb{E}}\max_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{3}\mathbb{S}^{m-1}}\frac{1+\|\bm{h}\|_{2}^{2}-\|\bm{h}-\frac{\bm{b}}{t_{3}}\|_{2}^{2}}{2}-\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{3}}\bm{a}\right\|_{2}\\ &=\frac{m}{2}-\operatorname{\mathbb{E}}\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{3}\mathbb{S}^{m-1}}\left(\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{3}}\bm{a}\right\|_{2}+\frac{1}{2}\|\bm{h}-\frac{\bm{b}}{t_{3}}\|_{2}^{2}-\frac{1}{2}\right)\\ &=\frac{m}{2}-\operatorname{\mathbb{E}}\left[\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t_{3}}\partial f(\bm{x}^{\star})\right)+\frac{1}{2}\cdot\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t_{3}}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-\frac{1}{2}\right]\\ &=\frac{m}{2}-\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\frac{1}{2}\cdot\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-\frac{1}{2}\right]\\ &\geq\epsilon/2.\end{split}

The last line is due to the measurement condition (20).

Now using the Gaussian concentration inequality (Fact 2) yields

{L(𝒈,𝒉,t3)𝔼L(𝒈,𝒉,t3)ϵ/2}exp(ϵ216),\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})-\operatorname{\mathbb{E}}L(\bm{g},\bm{h},t_{3})\leq-\epsilon/2\rule{0.0pt}{8.53581pt}\right\}\leq\exp\left(\frac{-\epsilon^{2}}{16}\right),

which in turn implies

{L(𝒈,𝒉,t3)>0}{L(𝒈,𝒉,t3)>𝔼L(𝒈,𝒉,t3)ϵ/2}1exp(ϵ216).\begin{split}\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}&\geq\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>\operatorname{\mathbb{E}}L(\bm{g},\bm{h},t_{3})-\epsilon/2\rule{0.0pt}{8.53581pt}\right\}\\ &\geq 1-\exp\left(\frac{-\epsilon^{2}}{16}\right).\end{split} (48)

Step 3: Complete the proof.

Combining (47) and (48), we have

{min𝒓𝕊n1min𝒔𝒯J𝒔𝑴T𝒓2>0}2{L(𝒈,𝒉,t3)>0}112exp(ϵ216).\begin{split}\mathbb{P}\left\{\min_{\bm{r}\in\mathbb{S}^{n-1}}\min_{\bm{s}\in\mathcal{T}_{J}^{\circ}}\|\bm{s}-\bm{M}^{T}\bm{r}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{L(\bm{g},\bm{h},t_{3})>0\rule{0.0pt}{8.53581pt}\right\}-1\\ &\geq 1-2\exp\left(\frac{-\epsilon^{2}}{16}\right).\end{split}

This means that when m𝒞p(λ)+ϵm\geq\mathscr{C}_{p}(\lambda)+\epsilon, the penalized problem (9) succeeds with probability at least 12exp(ϵ216)1-2\exp\left(\frac{-\epsilon^{2}}{16}\right).

Failure case: According to Lemma 2, the penalized problem (9) fails if

min𝒂f(𝒙)𝒃g(𝒗)𝚽T𝒃mλ𝒂2>0.\displaystyle\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{\Phi}^{T}\bm{b}-\frac{\sqrt{m}}{\lambda}\bm{a}\|_{2}>0.

So it is enough to show that if the number of measurements satisfies (21), then the above inequality holds with high probability. For clarity, the proof is similarly divided into three steps.

Step 1: Problem reduction. In this step, we employ Gordon’s Lemma to convert the probability of the targeted event to another one which is easier to handle.

Note that

min𝒂f(𝒙)𝒃g(𝒗)𝚽T𝒃mλ𝒂2=min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝚽T𝒃,𝒖mλ𝒂,𝒖.\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{\Phi}^{T}\bm{b}-\frac{\sqrt{m}}{\lambda}\bm{a}\|_{2}=\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\left\langle\bm{\Phi}^{T}\bm{b},\bm{u}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle. (49)

We then use Gordon’s Lemma to establish a lower bound for the probability of the targeted event. To this end, for any (𝒂,𝒃)f(𝒙)×g(𝒗)(\bm{a},\bm{b})\in\partial f(\bm{x}^{\star})\times\partial g(\bm{v}^{\star}) and 𝒖𝕊n1\bm{u}\in\mathbb{S}^{n-1}, define the following two Gaussian processes

X(𝒂,𝒃),𝒖:=𝚽T𝒃,𝒖+𝒃2gX_{(\bm{a},\bm{b}),\bm{u}}:=\left\langle\bm{\Phi}^{T}\bm{b},\bm{u}\right\rangle+\|\bm{b}\|_{2}\cdot g

and

Y(𝒂,𝒃),𝒖:=𝒃2𝒈,𝒖+𝒉,𝒃,Y_{(\bm{a},\bm{b}),\bm{u}}:=\|\bm{b}\|_{2}\left\langle\bm{g},\bm{u}\right\rangle+\left\langle\bm{h},\bm{b}\right\rangle,

where g𝒩(0,1)g\sim\mathcal{N}(0,1), 𝒈𝒩(𝟎,𝑰n)\bm{g}\sim\mathcal{N}(\bm{0},\bm{I}_{n}), and 𝒉𝒩(𝟎,𝑰m)\bm{h}\sim\mathcal{N}(\bm{0},\bm{I}_{m}) are independent of each other. It is not hard to check that these two processes satisfy the conditions of Gordon’s Lemma:

𝔼X(𝒂,𝒃),𝒖2\displaystyle\operatorname{\mathbb{E}}X_{(\bm{a},\bm{b}),\bm{u}}^{2} =2𝒃22=𝔼Y(𝒂,𝒃),𝒖2,\displaystyle=2\|\bm{b}\|_{2}^{2}=\operatorname{\mathbb{E}}Y_{(\bm{a},\bm{b}),\bm{u}}^{2},
𝔼[X(𝒂,𝒃),𝒖X(𝒂,𝒃),𝒖]𝔼[Y(𝒂,𝒃),𝒖Y(𝒂,𝒃),𝒖]\displaystyle\operatorname{\mathbb{E}}[X_{(\bm{a},\bm{b}),\bm{u}}X_{(\bm{a}^{\prime},\bm{b}^{\prime}),\bm{u}^{\prime}}]-\operatorname{\mathbb{E}}[Y_{(\bm{a},\bm{b}),\bm{u}}Y_{(\bm{a}^{\prime},\bm{b}^{\prime}),\bm{u}^{\prime}}] =𝒖,𝒖𝒃,𝒃+𝒃2𝒃2𝒃,𝒃𝒃2𝒃2𝒖,𝒖\displaystyle=\left\langle\bm{u},\bm{u}^{\prime}\right\rangle\left\langle\bm{b},\bm{b}^{\prime}\right\rangle+\|\bm{b}\|_{2}\|\bm{b}^{\prime}\|_{2}-\left\langle\bm{b},\bm{b}^{\prime}\right\rangle-\|\bm{b}\|_{2}\|\bm{b}^{\prime}\|_{2}\left\langle\bm{u},\bm{u}^{\prime}\right\rangle
=(1𝒖,𝒖)(𝒃2𝒃2𝒃,𝒃)\displaystyle=\left(1-\left\langle\bm{u},\bm{u}^{\prime}\right\rangle\right)\left(\|\bm{b}\|_{2}\|\bm{b}^{\prime}\|_{2}-\left\langle\bm{b},\bm{b}^{\prime}\right\rangle\right)
0.\displaystyle\geq 0.

In the last line, the equality holds when 𝒃=𝒃\bm{b}=\bm{b}^{\prime}. It follows from Gordon’s Lemma (Fact 1) that (by setting τ(𝒂,𝒃),𝒖=mλ𝒂T𝒖+0+\tau_{(\bm{a},\bm{b}),\bm{u}}=\frac{\sqrt{m}}{\lambda}\bm{a}^{T}\bm{u}+0_{+})

{min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1Y(𝒂,𝒃),𝒖τ(𝒂,𝒃),𝒖}\displaystyle\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}Y_{(\bm{a},\bm{b}),\bm{u}}\geq\tau_{(\bm{a},\bm{b}),\bm{u}}\rule{0.0pt}{8.53581pt}\right\} ={min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝒃2𝒈,𝒖+𝒉,𝒃mλ𝒂,𝒖>0}\displaystyle=\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\|\bm{b}\|_{2}\left\langle\bm{g},\bm{u}\right\rangle+\left\langle\bm{h},\bm{b}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
{min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1X(𝒂,𝒃),𝒖τ(𝒂,𝒃),𝒖}\displaystyle\leq\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}X_{(\bm{a},\bm{b}),\bm{u}}\geq\tau_{(\bm{a},\bm{b}),\bm{u}}\rule{0.0pt}{8.53581pt}\right\}
={min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝚽T𝒃,𝒖+𝒃2gmλ𝒂,𝒖>0}\displaystyle=\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\left\langle\bm{\Phi}^{T}\bm{b},\bm{u}\right\rangle+\|\bm{b}\|_{2}\cdot g-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
12+12{min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝚽T𝒃,𝒖+𝒃2gmλ𝒂,𝒖>0|g0}\displaystyle\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\left\langle\bm{\Phi}^{T}\bm{b},\bm{u}\right\rangle+\|\bm{b}\|_{2}\cdot g-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle>0\Big{|}g\leq 0\rule{0.0pt}{8.53581pt}\right\}
12+12{min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝚽T𝒃,𝒖mλ𝒂,𝒖>0},\displaystyle\leq\frac{1}{2}+\frac{1}{2}\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\left\langle\bm{\Phi}^{T}\bm{b},\bm{u}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\},

which implies

{min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝚽T𝒃,𝒖mλ𝒂,𝒖>0}\displaystyle\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\left\langle\bm{\Phi}^{T}\bm{b},\bm{u}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle>0\rule{0.0pt}{8.53581pt}\right\}
2{min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝒃2𝒈,𝒖+𝒉,𝒃mλ𝒂,𝒖:=4>0}1.\displaystyle\hskip 135.0pt\geq 2\mathbb{P}\left\{\underbrace{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\|\bm{b}\|_{2}\left\langle\bm{g},\bm{u}\right\rangle+\left\langle\bm{h},\bm{b}\right\rangle-\left\langle\frac{\sqrt{m}}{\lambda}\bm{a},\bm{u}\right\rangle}_{:=\mathscr{E}_{4}}>0\rule{0.0pt}{8.53581pt}\right\}-1. (50)

Moreover, 4\mathscr{E}_{4} can be bounded from below as follows:

4=min𝒂f(𝒙)𝒃g(𝒗)max𝒖𝕊n1𝒖,𝒃2𝒈mλ𝒂+𝒉,𝒃=min𝒂f(𝒙)𝒃g(𝒗)𝒃2𝒈mλ𝒂2+𝒉,𝒃=𝒃12𝒈mλ𝒂12+𝒉,𝒃1𝒃12min𝒂f(𝒙)𝒃g(𝒗)𝒈mλ𝒃2𝒂2+𝒉,𝒃𝒃2=𝒃12minαtβmin𝒂f(𝒙)𝒃g(𝒗)t𝕊m1𝒈mλt𝒂2+𝒉,𝒃t\begin{split}\mathscr{E}_{4}&=\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\max_{\bm{u}\in\mathbb{S}^{n-1}}\left\langle\bm{u},\|\bm{b}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{a}\right\rangle+\left\langle\bm{h},\bm{b}\right\rangle\\ &=\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\left\|\|\bm{b}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{a}\right\|_{2}+\left\langle\bm{h},\bm{b}\right\rangle\\ &=\left\|\|\bm{b}_{1}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{a}_{1}\right\|_{2}+\left\langle\bm{h},\bm{b}_{1}\right\rangle\\ &\geq\|\bm{b}_{1}\|_{2}\cdot\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\left\|\bm{g}-\frac{\sqrt{m}}{\lambda\|\bm{b}\|_{2}}\bm{a}\right\|_{2}+\left\langle\bm{h},\frac{\bm{b}}{\|\bm{b}\|_{2}}\right\rangle\\ &=\|\bm{b}_{1}\|_{2}\cdot\min_{\alpha\leq t\leq\beta}\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t\mathbb{S}^{m-1}}\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t}\bm{a}\right\|_{2}+\left\langle\bm{h},\frac{\bm{b}}{t}\right\rangle\\ \end{split}

In the third line, we have chosen (𝒂1,𝒃1)(\bm{a}_{1},\bm{b}_{1}) such that

(𝒂1,𝒃1)argmin𝒂f(𝒙)𝒃g(𝒗)𝒃2𝒈mλ𝒂2+𝒉,𝒃.(\bm{a}_{1},\bm{b}_{1})\in\arg\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\left\|\|\bm{b}\|_{2}\bm{g}-\frac{\sqrt{m}}{\lambda}\bm{a}\right\|_{2}+\left\langle\bm{h},\bm{b}\right\rangle.

In the last line, we have let 𝒃2=t\|\bm{b}\|_{2}=t, α=min𝒃g(𝒗)𝒃2\alpha=\min_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}, and β=max𝒃g(𝒗)𝒃2\beta=\max_{\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{b}\|_{2}.

Define

V(𝒈,𝒉,t)=min𝒂f(𝒙)𝒃g(𝒗)t𝕊m1𝒈mλt𝒂2+𝒉,𝒃t,V(\bm{g},\bm{h},t)=\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t\mathbb{S}^{m-1}}\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t}\bm{a}\right\|_{2}+\left\langle\bm{h},\frac{\bm{b}}{t}\right\rangle,

and choose t4t_{4} such that

t4argminαtβV(𝒈,𝒉,t).\displaystyle t_{4}\in\arg\min_{\alpha\leq t\leq\beta}V(\bm{g},\bm{h},t).

Then we have

4𝒃12minαtβV(𝒈,𝒉,t)=𝒃12V(𝒈,𝒉,t4).\begin{split}\mathscr{E}_{4}&\geq\|\bm{b}_{1}\|_{2}\cdot\min_{\alpha\leq t\leq\beta}V(\bm{g},\bm{h},t)\\ &=\|\bm{b}_{1}\|_{2}\cdot V(\bm{g},\bm{h},t_{4}).\end{split} (51)

Combining (49), (B-B), and (51) yields

{min𝒂f(𝒙)𝒃g(𝒗)𝚽T𝒃mλ𝒂2>0}2{𝒃12V(𝒈,𝒉,t4)>0}1=2{V(𝒈,𝒉,t4)>0}1,\begin{split}\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{\Phi}^{T}\bm{b}-\frac{\sqrt{m}}{\lambda}\bm{a}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{\|\bm{b}_{1}\|_{2}\cdot V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}-1\\ &=2\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}-1,\end{split} (52)

where we have used the fact that 0g(𝒗)0\notin\partial g(\bm{v}^{\star}) and hence 𝒃12>0\|\bm{b}_{1}\|_{2}>0. Thus it is enough to establish the lower bound for {V(𝒈,𝒉,t4)>0}\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}.

Step 2: Establish the lower bound for {V(g,h,t4)>0}\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}. In this step, we use the Gaussian concentration inequality to establish the lower bound for {V(𝒈,𝒉,t4)>0}\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}.

By Lemma 4, the function V(𝒈,𝒉,t4)V(\bm{g},\bm{h},t_{4}) is a 2\sqrt{2}-Lipschitz function. Its expectation can be bounded from below as follows:

𝔼V(𝒈,𝒉,t4)=𝔼min𝒂f(𝒙)𝒃g(𝒗)t4𝕊m1𝒈mλt4𝒂2+𝒉,𝒃t4=𝔼min𝒂f(𝒙)𝒃g(𝒗)t4𝕊m1𝒈mλt4𝒂2+𝒉,𝒃t4=𝔼min𝒂f(𝒙)𝒃g(𝒗)t4𝕊m1𝒈mλt4𝒂2+𝒉𝒃t4221𝒉222=𝔼[min𝒂f(𝒙)𝒃g(𝒗)t4𝕊m1(𝒈mλt4𝒂2+12𝒉𝒃t42212)]m2minαtβ𝔼[min𝒂f(𝒙)𝒃g(𝒗)t𝕊m1(𝒈mλt𝒂2+12𝒉𝒃t2212)]m2=minαtβ𝔼[dist(𝒈,mλtf(𝒙))+12dist2(𝒉,1tg(𝒗)𝕊m1)12]m2ϵ/2.\begin{split}\operatorname{\mathbb{E}}V(\bm{g},\bm{h},t_{4})&=\operatorname{\mathbb{E}}\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{4}\mathbb{S}^{m-1}}\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{4}}\bm{a}\right\|_{2}+\left\langle\bm{h},\frac{\bm{b}}{t_{4}}\right\rangle\\ &=\operatorname{\mathbb{E}}\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{4}\mathbb{S}^{m-1}}\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{4}}\bm{a}\right\|_{2}+\left\langle-\bm{h},\frac{\bm{b}}{t_{4}}\right\rangle\\ &=\operatorname{\mathbb{E}}\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{4}\mathbb{S}^{m-1}}\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{4}}\bm{a}\right\|_{2}+\frac{\|\bm{h}-\frac{\bm{b}}{t_{4}}\|_{2}^{2}-1-\|\bm{h}\|_{2}^{2}}{2}\\ &=\operatorname{\mathbb{E}}\left[\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t_{4}\mathbb{S}^{m-1}}\left(\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t_{4}}\bm{a}\right\|_{2}+\frac{1}{2}\|\bm{h}-\frac{\bm{b}}{t_{4}}\|_{2}^{2}-\frac{1}{2}\right)\right]-\frac{m}{2}\\ &\geq\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})\cap t\mathbb{S}^{m-1}}\left(\left\|\bm{g}-\frac{\sqrt{m}}{\lambda t}\bm{a}\right\|_{2}+\frac{1}{2}\|\bm{h}-\frac{\bm{b}}{t}\|_{2}^{2}-\frac{1}{2}\right)\right]-\frac{m}{2}\\ &=\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\frac{1}{2}\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-\frac{1}{2}\right]-\frac{m}{2}\\ &\geq\epsilon/2.\end{split} (53)

The second line is due to the fact that 𝒉-\bm{h} and 𝒉\bm{h} have the same distribution. The last inequality is due to the measurement condition (21).

Now it follows from the Gaussian concentration inequality (Fact 2) that

{V(𝒈,𝒉,t4)𝔼V(𝒈,𝒉,t4)ϵ/2}exp(ϵ216),\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})-\operatorname{\mathbb{E}}V(\bm{g},\bm{h},t_{4})\leq-\epsilon/2\rule{0.0pt}{8.53581pt}\right\}\leq\exp\left(\frac{-\epsilon^{2}}{16}\right),

which in turn implies

{V(𝒈,𝒉,t4)>0}{V(𝒈,𝒉,t4)>𝔼V(𝒈,𝒉,t4)ϵ/2}1exp(ϵ216).\begin{split}\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}&\geq\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>\operatorname{\mathbb{E}}V(\bm{g},\bm{h},t_{4})-\epsilon/2\rule{0.0pt}{8.53581pt}\right\}\\ &\geq 1-\exp\left(\frac{-\epsilon^{2}}{16}\right).\end{split} (54)

Step 3: Complete the proof.

Putting (52) and (54) together, we have

{min𝒂f(𝒙)𝒃g(𝒗)𝚽T𝒃mλ𝒂2>0}2{V(𝒈,𝒉,t4)>0}112exp(ϵ216).\begin{split}\mathbb{P}\left\{\min_{\bm{a}\in\partial f(\bm{x}^{\star})\atop\bm{b}\in\partial g(\bm{v}^{\star})}\|\bm{\Phi}^{T}\bm{b}-\frac{\sqrt{m}}{\lambda}\bm{a}\|_{2}>0\rule{0.0pt}{8.53581pt}\right\}&\geq 2\mathbb{P}\left\{V(\bm{g},\bm{h},t_{4})>0\rule{0.0pt}{8.53581pt}\right\}-1\\ &\geq 1-2\exp\left(\frac{-\epsilon^{2}}{16}\right).\end{split}

Thus we have shown that when m𝒞p(λ)ϵm\leq\mathscr{C}_{p}(\lambda)-\epsilon, the penalized problem (9) fails with probability at least 12exp(ϵ216)1-2\exp\left(\frac{-\epsilon^{2}}{16}\right). This completes the proof.

Appendix C Proof of Theorem 3

In this appendix, we prove our last theorem, which establishes the relationship between the constrained and penalized recovery procedures and illustrates how to select the optimal parameter λ\lambda for the penalized method. The auxiliary lemma and facts used in the proof are included in Appendix E.

Figure 6: Illustration of the distances of a vector to the scaled subdifferential tf(𝒙)t\cdot\partial f(\bm{x}^{\star}) and to the cone of the subdifferential cone(f(𝒙))\operatorname{cone}(\partial f(\bm{x}^{\star})).
Proof.

The core ingredient in the proof of Theorem 3 is the fact that the distance of a vector to the scaled subdifferential tf(𝒙)t\cdot\partial f(\bm{x}^{\star}) can always be bounded from below by that of this vector to the cone of the subdifferential cone(f(𝒙))\operatorname{cone}(\partial f(\bm{x}^{\star})) (see Fig. 6). With this observation in mind, we have

𝒞p(λ)\displaystyle\mathscr{C}_{p}(\lambda) =minαtβ𝔼[2dist(𝒈,mλtf(𝒙))+dist2(𝒉,1tg(𝒗)𝕊m1)1]\displaystyle=\min_{\alpha\leq t\leq\beta}\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)+\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)-1\right]
𝔼[2dist(𝒈,cone(f(𝒙)))+dist2(𝒉,cone(g(𝒗))𝕊m1)1]\displaystyle\geq\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\operatorname{cone}(\partial f(\bm{x}^{\star}))\right)+\operatorname{dist}^{2}\left(\bm{h},\operatorname{cone}(\partial g(\bm{v}^{\star}))\cap\mathbb{S}^{m-1}\right)-1\right]
=𝔼[2dist(𝒈,cone(f(𝒙)))+min𝒃cone(g(𝒗))𝕊m1𝒉𝒃221]\displaystyle=\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\operatorname{cone}(\partial f(\bm{x}^{\star}))\right)+\min_{\bm{b}\in\operatorname{cone}(\partial g(\bm{v}^{\star}))\cap\mathbb{S}^{m-1}}\|\bm{h}-\bm{b}\|_{2}^{2}-1\right]
=𝔼[2dist(𝒈,cone(f(𝒙)))2max𝒃cone(g(𝒗))𝕊m1𝒉,𝒃+m].\displaystyle=\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\operatorname{cone}(\partial f(\bm{x}^{\star}))\right)-2\max_{\bm{b}\in\operatorname{cone}(\partial g(\bm{v}^{\star}))\cap\mathbb{S}^{m-1}}\left\langle\bm{h},\bm{b}\right\rangle+m\right]. (55)

Since, by assumption, 𝟎f(𝒙)\bm{0}\notin\partial f(\bm{x}^{\star}) and 𝟎g(𝒗)\bm{0}\notin\partial g(\bm{v}^{\star}), we obtain

cone(f(𝒙))=𝒩f(𝒙)andcone(g(𝒗))=𝒩g(𝒗).\operatorname{cone}(\partial f(\bm{x}^{\star}))=\mathcal{N}_{f}(\bm{x}^{\star})~{}~{}~{}\textrm{and}~{}\operatorname{cone}(\partial g(\bm{v}^{\star}))=\mathcal{N}_{g}(\bm{v}^{\star}).

Substituting the above equalities into (C) and rearranging yields

𝒞p(λ)m2𝔼[dist(𝒈,𝒩f(𝒙))max𝒃𝒩g(𝒗)𝕊m1𝒉,𝒃]=𝔼dist(𝒈,𝒩f(𝒙))ω(𝒩g(𝒗)𝕊m1).\displaystyle\frac{\mathscr{C}_{p}(\lambda)-m}{2}\geq\operatorname{\mathbb{E}}\left[\operatorname{dist}\left(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star})\right)-\max_{\bm{b}\in\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}}\left\langle\bm{h},\bm{b}\right\rangle\right]=\operatorname{\mathbb{E}}\operatorname{dist}(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star}))-\omega(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}). (56)

We are now ready to establish the bounds in Theorem 3. First consider the case in which m𝒞p(λ)m\geq\mathscr{C}_{p}(\lambda), then (56) implies that

0\displaystyle 0 𝔼2dist(𝒈,𝒩f(𝒙))ω2(𝒩g(𝒗)𝕊m1)\displaystyle\geq\operatorname{\mathbb{E}}^{2}\operatorname{dist}(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star}))-\omega^{2}(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})
𝔼dist2(𝒈,𝒩f(𝒙))1ω2(𝒩g(𝒗)𝕊m1)\displaystyle\geq\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star}))-1-\omega^{2}(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})
=𝔼(max𝒙𝒯f(𝒙)𝔹2n𝒙,𝒈)21ω2(𝒩g(𝒗)𝕊m1)\displaystyle=\operatorname{\mathbb{E}}\left(\max_{\bm{x}\in\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{B}_{2}^{n}}\left\langle\bm{x},\bm{g}\right\rangle\right)^{2}-1-\omega^{2}(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})
ω2(𝒯f(𝒙)𝕊n1)1ω2(𝒩g(𝒗)𝕊m1)\displaystyle\geq\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})-1-\omega^{2}(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})
ω2(𝒯f(𝒙)𝕊n1)1[mω2(𝒯g(𝒗)𝕊m1)]\displaystyle\geq\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})-1-[m-\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})]
=𝒞pm1.\displaystyle=\mathscr{C}_{p}-m-1.

The second inequality is due to the relation (10). The third line has used Fact 6 and the assumption that 𝒯f(𝒙)\mathcal{T}_{f}(\bm{x}^{\star}) is closed (thus 𝒩f(𝒙)=𝒯f(𝒙)\mathcal{N}_{f}^{\circ}(\bm{x}^{\star})=\mathcal{T}_{f}(\bm{x}^{\star})). The next two lines follow from Facts 7 and 8, respectively. Thus we have shown that if m𝒞p(λ)m\geq\mathscr{C}_{p}(\lambda), then m𝒞p1m\geq\mathscr{C}_{p}-1.

We next consider the case where m𝒞pm\geq\mathscr{C}_{p}. It is not hard to find that for αtβ\alpha\leq t\leq\beta, the two Gaussian distances in 𝒞p(λ)\mathscr{C}_{p}(\lambda) have the following lower bounds:

𝔼dist2(𝒉,1tg(𝒗)𝕊m1)𝔼dist2(𝒉,cone(g(𝒗))𝕊m1)\operatorname{\mathbb{E}}\operatorname{dist}^{2}\left(\bm{h},\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)\geq\operatorname{\mathbb{E}}\operatorname{dist}^{2}\left(\bm{h},\operatorname{cone}(\partial g(\bm{v}^{\star}))\cap\mathbb{S}^{m-1}\right)

and

𝔼dist(𝒈,mλtf(𝒙))𝔼dist(𝒈,cone(f(𝒙))).\operatorname{\mathbb{E}}\operatorname{dist}\left(\bm{g},\frac{\sqrt{m}}{\lambda t}\partial f(\bm{x}^{\star})\right)\geq\operatorname{\mathbb{E}}\operatorname{dist}\left(\bm{g},\operatorname{cone}(\partial f(\bm{x}^{\star}))\right).

Then we can choose

t=argminαtβη2(1tg(𝒗)𝕊m1)andλ=argminλ>0ζ(mλtf(𝒙))t^{\star}=\arg\min_{\alpha\leq t\leq\beta}\eta^{2}\left(\frac{1}{t}\partial g(\bm{v}^{\star})\cap\mathbb{S}^{m-1}\right)~{}~{}\textrm{and}~{}~{}\lambda^{\star}=\arg\min_{\lambda>0}\zeta\left(\frac{\sqrt{m}}{\lambda t^{\star}}\partial f(\bm{x}^{\star})\right)

such that the above two Gaussian distances attain their lower bounds simultaneously. The second lower bound is achievable because m/λt{\sqrt{m}}/{\lambda t^{\star}} can take any positive value as λ\lambda ranges over (0,)(0,\infty). Thus 𝒞p(λ)\mathscr{C}_{p}(\lambda) attains its minimum at λ\lambda^{\star}, i.e.,

𝒞p(λ)\displaystyle\mathscr{C}_{p}(\lambda^{\star}) =𝔼[2dist(𝒈,cone(f(𝒙)))+dist2(𝒉,cone(g(𝒗))𝕊m1)1]\displaystyle=\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\operatorname{cone}(\partial f(\bm{x}^{\star}))\right)+\operatorname{dist}^{2}\left(\bm{h},\operatorname{cone}(\partial g(\bm{v}^{\star}))\cap\mathbb{S}^{m-1}\right)-1\right]
=𝔼[2dist(𝒈,𝒩f(𝒙))2max𝒃𝒩g(𝒗)𝕊m1𝒉,𝒃+m].\displaystyle=\operatorname{\mathbb{E}}\left[2\cdot\operatorname{dist}\left(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star})\right)-2\max_{\bm{b}\in\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}}\left\langle\bm{h},\bm{b}\right\rangle+m\right].

Further, we have the following upper bound

𝒞p(λ)m2\displaystyle\frac{\mathscr{C}_{p}(\lambda^{\star})-m}{2} =𝔼dist(𝒈,𝒩f(𝒙))ω(𝒩g(𝒗)𝕊m1)\displaystyle=\operatorname{\mathbb{E}}\operatorname{dist}(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star}))-\omega(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1}) (57)
𝔼dist2(𝒈,𝒩f(𝒙))ω(𝒩g(𝒗)𝕊m1)\displaystyle\leq\sqrt{\operatorname{\mathbb{E}}\operatorname{dist}^{2}(\bm{g},\mathcal{N}_{f}(\bm{x}^{\star}))}-\omega(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})
=δ(𝒯f(𝒙))ω(𝒩g(𝒗)𝕊m1)\displaystyle=\sqrt{\delta(\mathcal{T}_{f}(\bm{x}^{\star}))}-\omega(\mathcal{N}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})
ω2(𝒯f(𝒙)𝕊n1)+1δ(𝒩g(𝒗))1\displaystyle\leq\sqrt{\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+1}-\sqrt{\delta(\mathcal{N}_{g}(\bm{v}^{\star}))-1}
=ω2(𝒯f(𝒙)𝕊n1)+1mδ(𝒯g(𝒗))1\displaystyle=\sqrt{\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+1}-\sqrt{m-\delta(\mathcal{T}_{g}(\bm{v}^{\star}))-1}
ω2(𝒯f(𝒙)𝕊n1)+1mω2(𝒯g(𝒗)𝕊m1)2\displaystyle\leq\sqrt{\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+1}-\sqrt{m-\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})-2}
ω(𝒯f(𝒙)𝕊n1)+1mω2(𝒯g(𝒗)𝕊m1)+2\displaystyle\leq\omega(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+1-\sqrt{m-\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})}+\sqrt{2}
𝒞pm+1+2\displaystyle\leq\sqrt{\mathscr{C}_{p}}-\sqrt{m}+1+\sqrt{2}
1+2.\displaystyle\leq 1+\sqrt{2}.

Here, the first inequality is due to Jensen’s inequality. The next two lines have used Facts 6 and 7, respectively. The third equality holds because of Moreau’s decomposition theorem (Fact 5). The next inequality has used Fact 7 again. The next inequality follows from the elementary bounds \sqrt{a^{2}+1}\leq a+1 and \sqrt{b^{2}-2}\geq b-\sqrt{2}. The last two inequalities follow from the condition m𝒞pm\geq\mathscr{C}_{p}. Rearranging completes the proof.

It is worth noting that if we impose an extra condition on ω(𝒯f(𝒙)𝕊n1)\omega(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1}) (e.g., ω(𝒯f(𝒙)𝕊n1)4\omega(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})\geq 4, which can be easily satisfied in practical applications), then we can obtain a sharper upper bound than (57):

𝒞p(λ)m2\displaystyle\frac{\mathscr{C}_{p}(\lambda^{\star})-m}{2} ω2(𝒯f(𝒙)𝕊n1)+1mω2(𝒯g(𝒗)𝕊m1)2\displaystyle\leq\sqrt{\omega^{2}(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+1}-\sqrt{m-\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})-2}
ω(𝒯f(𝒙)𝕊n1)+(1716)mω2(𝒯g(𝒗)𝕊m1)+(1614)\displaystyle\leq\omega(\mathcal{T}_{f}(\bm{x}^{\star})\cap\mathbb{S}^{n-1})+(\sqrt{17}-\sqrt{16})-\sqrt{m-\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})}+(\sqrt{16}-\sqrt{14})
𝒞pm+1714\displaystyle\leq\sqrt{\mathscr{C}_{p}}-\sqrt{m}+\sqrt{17}-\sqrt{14}
12.\displaystyle\leq\frac{1}{2}.

The second inequality is due to the facts that a2+1a+(1716)\sqrt{a^{2}+1}\leq a+(\sqrt{17}-\sqrt{16}) for a4a\geq 4 and b22b(1614)\sqrt{b^{2}-2}\geq b-(\sqrt{16}-\sqrt{14}) for b=mω2(𝒯g(𝒗)𝕊m1)4b=\sqrt{m-\omega^{2}(\mathcal{T}_{g}(\bm{v}^{\star})\cap\mathbb{S}^{m-1})}\geq 4. The last two inequalities have used the condition m𝒞pm\geq\mathscr{C}_{p}. Rearranging yields m𝒞p(λ)1m\geq\mathscr{C}_{p}(\lambda^{\star})-1, which leads to a smaller gap than the 2(1+\sqrt{2})\approx 5 implied by (57).
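For the record, both scalar bounds used above follow by rationalizing the difference of square roots; each holds with equality at a=b=4a=b=4, and the gap only shrinks for larger arguments:

\sqrt{a^{2}+1}-a=\frac{1}{\sqrt{a^{2}+1}+a}\leq\frac{1}{\sqrt{17}+4}=\sqrt{17}-\sqrt{16}~{}~{}(a\geq 4),\qquad b-\sqrt{b^{2}-2}=\frac{2}{\sqrt{b^{2}-2}+b}\leq\frac{2}{\sqrt{14}+4}=\sqrt{16}-\sqrt{14}~{}~{}(b\geq 4).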

Appendix D Evaluate 𝒞p\mathscr{C}_{p} and 𝒞p(λ)\mathscr{C}_{p}(\lambda) for Typical Structured Signal and Corruption

In Section III, we have shown that 𝒞p\mathscr{C}_{p} and 𝒞p(λ)\mathscr{C}_{p}(\lambda) can be accurately estimated by equations (16) and (23), respectively. Thus it is sufficient to evaluate two related functionals: the Gaussian squared distance to a scaled subdifferential and the spherical Gaussian width of a scaled subdifferential. There exist standard recipes for calculating these two quantities in the literature; see, e.g., [51, 37, 24]. For completeness, we calculate these two functionals for sparse vectors and low-rank matrices in this appendix.

D-A Calculation for Sparse Vectors

Let 𝒙\bm{x} be an ss-sparse vector in n\mathbb{R}^{n}, and let SS denote its support. We use the 1\ell_{1}-norm to promote the structure of sparse vectors. The scaled subdifferential of 𝒙1\|\bm{x}\|_{1} is given by

t𝒙1={𝒛n:𝒛i=tsgn(𝒙i)foriS,|𝒛i|tforiSc},t\cdot\partial\|\bm{x}\|_{1}=\{\bm{z}\in\mathbb{R}^{n}:\bm{z}_{i}=t\cdot\textrm{sgn}(\bm{x}_{i})~{}\textrm{for}~{}i\in S,~{}|\bm{z}_{i}|\leq t~{}\textrm{for}~{}i\in S^{c}\},

where ScS^{c} represents the complement of SS. Then the Gaussian squared distance to a scaled subdifferential can be calculated as

η2(t𝒙1)\displaystyle\eta^{2}(t\cdot\partial\|\bm{x}\|_{1}) =𝔼inf𝒛t𝒙1𝒈𝒛22=𝔼inf𝒛t𝒙1iS(𝒈i𝒛i)2+iSc(𝒈i𝒛i)2\displaystyle=\operatorname{\mathbb{E}}\inf_{\bm{z}\in t\cdot\partial\|\bm{x}\|_{1}}\|\bm{g}-\bm{z}\|_{2}^{2}=\operatorname{\mathbb{E}}\inf_{\bm{z}\in t\cdot\partial\|\bm{x}\|_{1}}\sum_{i\in S}(\bm{g}_{i}-\bm{z}_{i})^{2}+\sum_{i\in S^{c}}(\bm{g}_{i}-\bm{z}_{i})^{2}
=𝔼iS(𝒈itsgn(xi))2+iScshrink(𝒈i,t)2\displaystyle=\operatorname{\mathbb{E}}~{}\sum_{i\in S}(\bm{g}_{i}-t\cdot\textrm{sgn}(x_{i}))^{2}+\sum_{i\in S^{c}}\textrm{shrink}(\bm{g}_{i},t)^{2}
=s(1+t2)+2(ns)2π((1+t2)tex2/2𝑑xtet2/2).\displaystyle=s(1+t^{2})+\frac{2(n-s)}{\sqrt{2\pi}}\left((1+t^{2})\int_{t}^{\infty}e^{-x^{2}/2}dx-te^{-t^{2}/2}\right).

Here shrink(𝒈i,t)\textrm{shrink}(\bm{g}_{i},t) is the soft thresholding operator defined as:

shrink(𝒈i,t)={𝒈i+tif𝒈i<t,0ift𝒈it,𝒈itif𝒈i>t.\textrm{shrink}(\bm{g}_{i},t)=\left\{\begin{array}[]{ll}\bm{g}_{i}+t&\textrm{if}~{}~{}\bm{g}_{i}<-t,\\ 0&\textrm{if}~{}~{}-t\leq\bm{g}_{i}\leq t,\\ \bm{g}_{i}-t&\textrm{if}~{}~{}\bm{g}_{i}>t.\end{array}\right.
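The closed-form expression above is easy to check numerically. The following minimal Python sketch (the function names are ours and purely illustrative; the support is taken to be the first ss coordinates, which is without loss of generality by symmetry) compares the formula with a direct Monte Carlo estimate of the squared distance:

import math
import numpy as np

def eta_sq_l1_closed_form(n, s, t):
    # closed form derived above: s(1+t^2) + 2(n-s)*[(1+t^2)(1-Phi(t)) - t*phi(t)]
    tail = 0.5 * math.erfc(t / math.sqrt(2.0))            # Gaussian tail probability 1 - Phi(t)
    pdf = math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    return s * (1.0 + t * t) + 2.0 * (n - s) * ((1.0 + t * t) * tail - t * pdf)

def eta_sq_l1_monte_carlo(n, s, t, trials=20000, seed=0):
    # dist^2(g, t * subdifferential): on the support the closest point is z_i = t*sgn(x_i);
    # off the support the residual is the soft-thresholding (shrink) operator
    g = np.random.default_rng(seed).standard_normal((trials, n))
    on_sup = (g[:, :s] - t) ** 2                          # sgn(x_i) is irrelevant since g_i is symmetric
    off_sup = np.maximum(np.abs(g[:, s:]) - t, 0.0) ** 2
    return (on_sup.sum(axis=1) + off_sup.sum(axis=1)).mean()

# the two printed values should agree up to Monte Carlo fluctuation
print(eta_sq_l1_closed_form(1000, 50, 1.5), eta_sq_l1_monte_carlo(1000, 50, 1.5))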

Let 𝒗\bm{v} be a kk-sparse vector in m\mathbb{R}^{m}, and let SS denote the support of 𝒗\bm{v}. The scaled spherical part of the subdifferential 𝒗1\partial\|\bm{v}\|_{1} is given by

𝒗1t𝕊m1={𝒛m:𝒛i=sgn(𝒗i)foriS,|𝒛i|1foriSc,𝒛2=t}.\partial\|\bm{v}\|_{1}\cap t\mathbb{S}^{m-1}=\{\bm{z}\in\mathbb{R}^{m}:\bm{z}_{i}=\textrm{sgn}(\bm{v}_{i})~{}\textrm{for}~{}i\in S,~{}|\bm{z}_{i}|\leq 1~{}\textrm{for}~{}i\in S^{c},~{}\|\bm{z}\|_{2}=t\}.

Then the spherical Gaussian width of a scaled subdifferential is

ω(1t𝒗1𝕊m1)\displaystyle\omega\left(\frac{1}{t}\partial\|\bm{v}\|_{1}\cap\mathbb{S}^{m-1}\right) =1tω(𝒗1t𝕊m1)=1t𝔼sup𝒛𝒗1t𝕊m1𝒈,𝒛\displaystyle=\frac{1}{t}\cdot\omega(\partial\|\bm{v}\|_{1}\cap t\mathbb{S}^{m-1})=\frac{1}{t}\operatorname{\mathbb{E}}\sup_{\bm{z}\in\partial\|\bm{v}\|_{1}\cap t\mathbb{S}^{m-1}}\left\langle\bm{g},\bm{z}\right\rangle
=1t𝔼sup𝒛𝒗1t𝕊m1iS𝒈i𝒛i+iSc𝒈i𝒛i\displaystyle=\frac{1}{t}\operatorname{\mathbb{E}}\sup_{\bm{z}\in\partial\|\bm{v}\|_{1}\cap t\mathbb{S}^{m-1}}\sum_{i\in S}\bm{g}_{i}\bm{z}_{i}+\sum_{i\in S^{c}}\bm{g}_{i}\bm{z}_{i}
=1t𝔼sup𝒛𝒗1t𝕊m1iSc𝒈i𝒛i\displaystyle=\frac{1}{t}\operatorname{\mathbb{E}}\sup_{\bm{z}\in\partial\|\bm{v}\|_{1}\cap t\mathbb{S}^{m-1}}\sum_{i\in S^{c}}\bm{g}_{i}\bm{z}_{i}
=1tsup𝒛𝒗1t𝕊m1iSc|𝒛i|𝔼|𝒈i|\displaystyle=\frac{1}{t}\sup_{\bm{z}\in\partial\|\bm{v}\|_{1}\cap t\mathbb{S}^{m-1}}\sum_{i\in S^{c}}|\bm{z}_{i}|\cdot\operatorname{\mathbb{E}}|\bm{g}_{i}|
=1tmkt2k2π=2π(mk)(1kt2).\displaystyle=\frac{1}{t}\sqrt{m-k}\sqrt{t^{2}-k}\sqrt{\frac{2}{\pi}}=\sqrt{\frac{2}{\pi}(m-k)\left(1-\frac{k}{t^{2}}\right)}.

The last line is due to the Cauchy-Schwarz inequality, i.e., iSc|𝒛i|mkiSc𝒛i2=mkt2k\sum_{i\in S^{c}}|\bm{z}_{i}|\leq\sqrt{m-k}\cdot\sqrt{\sum_{i\in S^{c}}\bm{z}_{i}^{2}}=\sqrt{m-k}\cdot\sqrt{t^{2}-k}. The equality holds when |𝒛i|=t2kmk|\bm{z}_{i}|=\sqrt{\frac{t^{2}-k}{m-k}} for iSci\in S^{c}.
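Putting the two closed-form functionals together gives a direct numerical handle on the penalized threshold for the sparse signal/sparse corruption model. The Python sketch below is our own illustration, not code from the paper: it assumes the Jensen-type approximation E dist ≈ (E dist²)^{1/2} = η underlying the estimate (23), uses the identity E dist²(𝒉, (1/t)∂g(𝒗⋆) ∩ 𝕊^{m−1}) = m + 1 − 2ω((1/t)∂g(𝒗⋆) ∩ 𝕊^{m−1}) established in the proofs above, and notes that for the ℓ₁ subdifferential of a kk-sparse corruption α = √k and β = √m:

import math
import numpy as np

def eta_l1(n, s, tau):
    # eta(tau * d||x||_1): square root of the Gaussian squared distance computed above
    tail = 0.5 * math.erfc(tau / math.sqrt(2.0))
    pdf = math.exp(-tau * tau / 2.0) / math.sqrt(2.0 * math.pi)
    return math.sqrt(s * (1.0 + tau * tau)
                     + 2.0 * (n - s) * ((1.0 + tau * tau) * tail - tau * pdf))

def width_l1(m, k, t):
    # spherical Gaussian width omega((1/t) d||v||_1 cap S^{m-1}) from the display above
    return math.sqrt(2.0 / math.pi * (m - k) * (1.0 - k / (t * t)))

def C_p(lam, m, n, s, k, grid=400):
    # C_p(lambda) = min_{alpha <= t <= beta} E[2*dist + dist^2 - 1]
    #            ~= min_t [ 2*eta(sqrt(m)/(lam*t)) + m - 2*omega(1/t) ]   (approximation)
    ts = np.linspace(math.sqrt(k) * (1.0 + 1e-6), math.sqrt(m), grid)
    return min(2.0 * eta_l1(n, s, math.sqrt(m) / (lam * t))
               + m - 2.0 * width_l1(m, k, t) for t in ts)

# scan lambda for the minimizer lambda* discussed in Appendix C (sizes are illustrative)
lams = np.linspace(0.2, 5.0, 40)
vals = [C_p(lam, m=400, n=1000, s=20, k=40) for lam in lams]
print(lams[int(np.argmin(vals))], min(vals))

Since the inner minimization over tt is one-dimensional on [α,β][\alpha,\beta], a dense grid suffices; comparing mm against the resulting 𝒞p(λ)\mathscr{C}_{p}(\lambda) locates the predicted phase transition for each λ\lambda.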

D-B Calculation for Low-rank Matrices

Let 𝑿n1×n2\bm{X}\in\mathbb{R}^{n_{1}\times n_{2}} be an rr-rank matrix with n1n2n_{1}\leq n_{2}. We use the nuclear norm 𝑿\|\bm{X}\|_{*}, which is the sum of the singular values of 𝑿\bm{X}, to promote the structure of low-rank matrices. Since the nuclear norm and the Gaussian distance are both unitarily invariant, we can assume without loss of generality that 𝑿\bm{X} takes the form

𝑿=[𝚺𝟎𝟎𝟎],\bm{X}=\begin{bmatrix}\bm{\Sigma}&\bm{0}\\ \bm{0}&\bm{0}\end{bmatrix},

where 𝚺=diag(σ1,σ2,,σr)\bm{\Sigma}=\textrm{diag}(\sigma_{1},\sigma_{2},...,\sigma_{r}). The scaled subdifferential of 𝑿\|\bm{X}\|_{*} is

t𝑿={[t𝑰r𝟎𝟎t𝑾]:𝑾1}.t\cdot\partial\|\bm{X}\|_{*}=\left\{\begin{bmatrix}t\bm{I}_{r}&\bm{0}\\ \bm{0}&t\bm{W}\end{bmatrix}:~{}\|\bm{W}\|\leq 1\right\}.

Here 𝑾\|\bm{W}\| is the spectral norm, which is equal to the maximum singular value of 𝑾\bm{W}. Let

𝑮=[𝑮1𝑮1𝑮2𝑮2]\bm{G}=\begin{bmatrix}\bm{G}_{1}&\bm{G}_{1}^{\prime}\\ \bm{G}_{2}^{\prime}&\bm{G}_{2}\end{bmatrix}

be a partition of Gaussian matrix 𝑮n1×n2\bm{G}\in\mathbb{R}^{n_{1}\times n_{2}} with 𝑮1r×r\bm{G}_{1}\in\mathbb{R}^{r\times r} and 𝑮2(n1r)×(n2r)\bm{G}_{2}\in\mathbb{R}^{(n_{1}-r)\times(n_{2}-r)}. Then the Gaussian squared distance to a scaled subdifferential is

η2(t𝑿)\displaystyle\eta^{2}(t\cdot\partial\|\bm{X}\|_{*}) =𝔼inf𝒁t𝑿𝑮𝒁F2=𝔼{[𝑮1t𝑰r𝑮1𝑮2𝟎]F2+inf𝑾1𝑮2t𝑾F2}\displaystyle=\operatorname{\mathbb{E}}\inf_{\bm{Z}\in t\cdot\partial\|\bm{X}\|_{*}}\|\bm{G}-\bm{Z}\|_{F}^{2}=\operatorname{\mathbb{E}}\left\{\left\|\begin{bmatrix}\bm{G}_{1}-t\bm{I}_{r}&\bm{G}_{1}^{\prime}\\ \bm{G}_{2}^{\prime}&\bm{0}\end{bmatrix}\right\|_{F}^{2}+\inf_{\|\bm{W}\|\leq 1}\|\bm{G}_{2}-t\bm{W}\|_{F}^{2}\right\}
=r(n1+n2r+t2)+𝔼inf𝑾1i=1n1r(σi(𝑮2)tσi(𝑾))2\displaystyle=r(n_{1}+n_{2}-r+t^{2})+\operatorname{\mathbb{E}}\inf_{\|\bm{W}\|\leq 1}\sum_{i=1}^{n_{1}-r}\left(\sigma_{i}(\bm{G}_{2})-t\sigma_{i}(\bm{W})\right)^{2}
=r(n1+n2r+t2)+𝔼i=1n1rshrink(σi(𝑮2),t)2.\displaystyle=r(n_{1}+n_{2}-r+t^{2})+\operatorname{\mathbb{E}}\sum_{i=1}^{n_{1}-r}\textrm{shrink}\left(\sigma_{i}(\bm{G}_{2}),t\right)^{2}.

Here σi()\sigma_{i}(\cdot) denotes the ii-th largest singular value. The expectation term involves the distribution of the singular values of the Gaussian matrix 𝑮2\bm{G}_{2}, and it seems challenging to obtain an exact formula for it. However, there exist some asymptotic results in the literature; see, e.g., [37, 54]. In our simulations, we use the Monte Carlo method to calculate this expectation term.
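A minimal Python sketch of such a Monte Carlo estimate (the helper names are ours; the matrix sizes, the threshold tt, and the trial count are illustrative):

import numpy as np

def shrink_sq_sv_term(n1, n2, r, t, trials=200, seed=0):
    # Monte Carlo estimate of E sum_i shrink(sigma_i(G2), t)^2 for a Gaussian
    # (n1 - r) x (n2 - r) matrix G2; singular values are nonnegative, so
    # shrink(sigma, t)^2 reduces to max(sigma - t, 0)^2
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        sv = np.linalg.svd(rng.standard_normal((n1 - r, n2 - r)), compute_uv=False)
        total += np.sum(np.maximum(sv - t, 0.0) ** 2)
    return total / trials

def eta_sq_nuclear(n1, n2, r, t, **kw):
    # full Gaussian squared distance eta^2(t * d||X||_*) from the display above
    return r * (n1 + n2 - r + t * t) + shrink_sq_sv_term(n1, n2, r, t, **kw)

print(eta_sq_nuclear(40, 50, 3, t=9.0))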

Let 𝑽m1×m2\bm{V}\in\mathbb{R}^{m_{1}\times m_{2}} be a ρ\rho-rank matrix. Since the Gaussian width is also unitarily invariant, we may assume that 𝑽\bm{V} takes the same form as 𝑿\bm{X}. The scaled spherical part of the subdifferential 𝑽\partial\|\bm{V}\|_{*} is

𝑽t𝕊m1m21={[𝑰ρ𝟎𝟎𝑾]:𝑾1,𝑾F2=t2ρ}.\partial\|\bm{V}\|_{*}\cap t\mathbb{S}^{m_{1}m_{2}-1}=\left\{\begin{bmatrix}\bm{I}_{\rho}&\bm{0}\\ \bm{0}&\bm{W}\end{bmatrix}:~{}\|\bm{W}\|\leq 1,~{}\|\bm{W}\|_{F}^{2}=t^{2}-\rho\right\}.

Then spherical Gaussian width of a scaled subdifferential is given by

ω(1t𝑽𝕊m1m21)\displaystyle\omega\left(\frac{1}{t}\partial\|\bm{V}\|_{*}\cap\mathbb{S}^{m_{1}m_{2}-1}\right) =1tω(𝑽t𝕊m1m21)\displaystyle=\frac{1}{t}\cdot\omega(\partial\|\bm{V}\|_{*}\cap t\mathbb{S}^{m_{1}m_{2}-1})
=1t𝔼sup𝒁𝑽t𝕊m1m21𝑮,𝒁\displaystyle=\frac{1}{t}\operatorname{\mathbb{E}}\sup_{\bm{Z}\in\partial\|\bm{V}\|_{*}\cap t\mathbb{S}^{m_{1}m_{2}-1}}\left\langle\bm{G},\bm{Z}\right\rangle
=1t𝔼sup𝑾1,𝑾F2=t2ρ𝑮2,𝑾\displaystyle=\frac{1}{t}\operatorname{\mathbb{E}}\sup_{\|\bm{W}\|\leq 1,~{}\|\bm{W}\|_{F}^{2}=t^{2}-\rho}\left\langle\bm{G}_{2},\bm{W}\right\rangle
=1tt2ρ𝔼𝑮2F=1ρt2μ(m1ρ)(m2ρ).\displaystyle=\frac{1}{t}\sqrt{t^{2}-\rho}\cdot\operatorname{\mathbb{E}}\|\bm{G}_{2}\|_{F}=\sqrt{1-\frac{\rho}{t^{2}}}\cdot\mu_{(m_{1}-\rho)(m_{2}-\rho)}.

The last line is due to the Cauchy-Schwarz inequality, i.e., 𝑮2,𝑾𝑮2F𝑾F=t2ρ𝑮2F\left\langle\bm{G}_{2},\bm{W}\right\rangle\leq\|\bm{G}_{2}\|_{F}\cdot\|\bm{W}\|_{F}=\sqrt{t^{2}-\rho}\cdot\|\bm{G}_{2}\|_{F}. The notation μn\mu_{n} denotes the expected length of an nn-dimensional vector with independent standard normal entries; explicitly, \mu_{n}=\sqrt{2}\,\Gamma((n+1)/2)/\Gamma(n/2), which is well approximated by n\sqrt{n} for large nn.

Appendix E Auxiliary Lemma and Facts

In this appendix, we present an additional auxiliary lemma and some facts that are used in the proofs of our main results.

Lemma 4.

Let 𝒯1n\mathcal{T}_{1}\subset\mathbb{R}^{n} and 𝒯2𝕊m1\mathcal{T}_{2}\subset\mathbb{S}^{m-1} be two subsets, and let cc\in\mathbb{R} be a constant. Then the functions

F(𝒈,𝒉)=min𝒂𝒯1𝒃𝒯2𝒈c𝒂2+𝒉,𝒃F(\bm{g},\bm{h})=\min_{\bm{a}\in\mathcal{T}_{1}\atop\bm{b}\in\mathcal{T}_{2}}\|\bm{g}-c\cdot\bm{a}\|_{2}+\left\langle\bm{h},\bm{b}\right\rangle

and

G(𝒈,𝒉)=max𝒂𝒯1𝒃𝒯2𝒉,𝒃𝒈c𝒂2G(\bm{g},\bm{h})=\max_{\bm{a}\in\mathcal{T}_{1}\atop\bm{b}\in\mathcal{T}_{2}}\left\langle\bm{h},\bm{b}\right\rangle-\|\bm{g}-c\cdot\bm{a}\|_{2}

are 2\sqrt{2}-Lipschitz functions.

Proof.

To prove the first part, it suffices to show that for any (𝒈1,𝒉1),(𝒈2,𝒉2)(\bm{g}_{1},\bm{h}_{1}),~{}(\bm{g}_{2},\bm{h}_{2}), we have

|F(𝒈1,𝒉1)F(𝒈2,𝒉2)|2𝒈1𝒈222+𝒉1𝒉222.\displaystyle\big{|}F(\bm{g}_{1},\bm{h}_{1})-F(\bm{g}_{2},\bm{h}_{2})\big{|}\leq\sqrt{2}\sqrt{\|\bm{g}_{1}-\bm{g}_{2}\|_{2}^{2}+\|\bm{h}_{1}-\bm{h}_{2}\|_{2}^{2}}.

To this end, let

(𝒂¯,𝒃¯)argmin𝒂𝒯1𝒃𝒯2𝒈2c𝒂2+𝒉2,𝒃.(\bar{\bm{a}},~{}\bar{\bm{b}})\in\arg\min_{\bm{a}\in\mathcal{T}_{1}\atop\bm{b}\in\mathcal{T}_{2}}\|\bm{g}_{2}-c\cdot\bm{a}\|_{2}+\left\langle\bm{h}_{2},\bm{b}\right\rangle.

Then we have

F(𝒈1,𝒉1)\displaystyle F(\bm{g}_{1},\bm{h}_{1}) =min𝒂𝒯1𝒃𝒯2𝒈1c𝒂2+𝒉1,𝒃\displaystyle=\min_{\bm{a}\in\mathcal{T}_{1}\atop\bm{b}\in\mathcal{T}_{2}}\|\bm{g}_{1}-c\cdot\bm{a}\|_{2}+\left\langle\bm{h}_{1},\bm{b}\right\rangle
𝒈1c𝒂¯2+𝒉1,𝒃¯.\displaystyle\leq\|\bm{g}_{1}-c\cdot\bar{\bm{a}}\|_{2}+\left\langle\bm{h}_{1},\bar{\bm{b}}\right\rangle.

Therefore,

F(𝒈1,𝒉1)F(𝒈2,𝒉2)\displaystyle F(\bm{g}_{1},\bm{h}_{1})-F(\bm{g}_{2},\bm{h}_{2}) (𝒈1c𝒂¯2+𝒉1,𝒃¯)(𝒈2c𝒂¯2+𝒉2,𝒃¯)\displaystyle\leq\left(\|\bm{g}_{1}-c\cdot\bar{\bm{a}}\|_{2}+\left\langle\bm{h}_{1},\bar{\bm{b}}\right\rangle\right)-\left(\|\bm{g}_{2}-c\cdot\bar{\bm{a}}\|_{2}+\left\langle\bm{h}_{2},\bar{\bm{b}}\right\rangle\right)
=(𝒈1c𝒂¯2𝒈2c𝒂¯2)+(𝒉1,𝒃¯𝒉2,𝒃¯)\displaystyle=\left(\|\bm{g}_{1}-c\cdot\bar{\bm{a}}\|_{2}-\|\bm{g}_{2}-c\cdot\bar{\bm{a}}\|_{2}\right)+\left(\left\langle\bm{h}_{1},\bar{\bm{b}}\right\rangle-\left\langle\bm{h}_{2},\bar{\bm{b}}\right\rangle\right)
𝒈1𝒈22+𝒉1𝒉22\displaystyle\leq\|\bm{g}_{1}-\bm{g}_{2}\|_{2}+\|\bm{h}_{1}-\bm{h}_{2}\|_{2}
2𝒈1𝒈222+𝒉1𝒉222.\displaystyle\leq\sqrt{2}\sqrt{\|\bm{g}_{1}-\bm{g}_{2}\|_{2}^{2}+\|\bm{h}_{1}-\bm{h}_{2}\|_{2}^{2}}. (58)

Here the second inequality follows from the reverse triangle inequality together with the Cauchy-Schwarz inequality (note that 𝒃¯2=1\|\bar{\bm{b}}\|_{2}=1 since 𝒯2𝕊m1\mathcal{T}_{2}\subset\mathbb{S}^{m-1}), and the last step uses the elementary bound x+y2x2+y2x+y\leq\sqrt{2}\sqrt{x^{2}+y^{2}}. Similarly, we have

F(𝒈2,𝒉2)F(𝒈1,𝒉1)2𝒈1𝒈222+𝒉1𝒉222.F(\bm{g}_{2},\bm{h}_{2})-F(\bm{g}_{1},\bm{h}_{1})\leq\sqrt{2}\sqrt{\|\bm{g}_{1}-\bm{g}_{2}\|_{2}^{2}+\|\bm{h}_{1}-\bm{h}_{2}\|_{2}^{2}}. (59)

Combining (58) and (59) yields the first conclusion. The second part for G(𝒈,𝒉)G(\bm{g},\bm{h}) can be proven similarly. ∎
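The constant 2\sqrt{2} is easy to probe numerically. The following quick sanity check (entirely our own illustration, using random finite sets) samples pairs of inputs and verifies that the empirical Lipschitz ratio of FF never exceeds 2\sqrt{2}:

import math
import numpy as np

rng = np.random.default_rng(1)
n, m, c = 8, 6, 2.7
T1 = rng.standard_normal((30, n))                      # finite T1 in R^n
T2 = rng.standard_normal((30, m))
T2 /= np.linalg.norm(T2, axis=1, keepdims=True)        # finite T2 on the unit sphere S^{m-1}

def F(g, h):
    # min over (a, b) of ||g - c*a||_2 + <h, b>; the minimum decouples over a and b
    return np.min(np.linalg.norm(g - c * T1, axis=1)) + np.min(T2 @ h)

worst = 0.0
for _ in range(2000):
    g1, h1 = rng.standard_normal(n), rng.standard_normal(m)
    g2, h2 = rng.standard_normal(n), rng.standard_normal(m)
    num = abs(F(g1, h1) - F(g2, h2))
    den = math.sqrt(np.linalg.norm(g1 - g2) ** 2 + np.linalg.norm(h1 - h2) ** 2)
    worst = max(worst, num / den)
print(worst, "<=", math.sqrt(2))                       # the ratio stays below sqrt(2)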

Fact 3 (Polarity principle).

[56, Proposition 3.8] Let 𝒮\mathcal{S} be a non-empty, closed, spherically convex subset of the unit sphere 𝕊n1\mathbb{S}^{n-1}, and let 𝑨:nm\bm{A}:\mathbb{R}^{n}\to\mathbb{R}^{m} be a linear map. If cone(𝒮)\operatorname{cone}(\mathcal{S}) is not a subspace, then

\min_{\|\bm{r}\|_{2}=1}\min_{\bm{s}\in(\operatorname{cone}(\mathcal{S}))^{\circ}}\|\bm{s}-\bm{A}^{T}\bm{r}\|_{2}>0~{}~{}\textrm{implies}~{}~{}\bm{0}\in\bm{A}(\mathcal{S}),

where 𝑨(𝒮)\bm{A}(\mathcal{S}) denotes the image of 𝒮\mathcal{S} under 𝑨\bm{A}.

Fact 4 (Variance of Gaussian Lipschitz functions).

[58, Theorem 1.6.4] Consider a random vector 𝐱𝒩(0,𝐈n)\bm{x}\sim\mathcal{N}(0,\bm{I}_{n}) and a Lipschitz function f:nf:~{}\mathbb{R}^{n}\to\mathbb{R} with Lipschitz norm fLip\|f\|_{\textrm{Lip}} (with respect to the Euclidean metric). Then,

Var(f(𝒙))fLip2.\textrm{Var}(f(\bm{x}))\leq\|f\|_{\textrm{Lip}}^{2}.
Fact 5 (Moreau’s decomposition theorem).

[61, Theorem 6.30] Let 𝒞\mathcal{C} be a nonempty closed convex cone in n\mathbb{R}^{n} and let 𝐱n\bm{x}\in\mathbb{R}^{n}. Then

𝒙22=dist2(𝒙,𝒞)+dist2(𝒙,𝒞).\displaystyle\|\bm{x}\|_{2}^{2}=\operatorname{dist}^{2}(\bm{x},\mathcal{C})+\operatorname{dist}^{2}(\bm{x},\mathcal{C}^{\circ}).
Fact 6.

[51, Appendix A] Let 𝒞n\mathcal{C}\subset\mathbb{R}^{n} be a non-empty convex cone, we have

sup𝒛𝒞𝔹2n𝒙,𝒛=dist(𝒙,𝒞),\sup_{\bm{z}\in\mathcal{C}\cap\mathbb{B}_{2}^{n}}\left\langle\bm{x},\bm{z}\right\rangle=\operatorname{dist}(\bm{x},\mathcal{C}^{\circ}),

where 𝐱n\bm{x}\in\mathbb{R}^{n}.

Fact 7.

[37, Proposition 10.2] Let 𝒞\mathcal{C} be a convex cone. The Gaussian width and the statistical dimension are closely related:

δ(𝒞)1ω2(𝒞𝕊n1)δ(𝒞).\delta(\mathcal{C})-1\leq\omega^{2}(\mathcal{C}\cap\mathbb{S}^{n-1})\leq\delta(\mathcal{C}).
Fact 8.

[51, Lemma 3.7] Let 𝒞n\mathcal{C}\subset\mathbb{R}^{n} be a non-empty closed, convex cone. Then we have that

ω2(𝒞𝕊n1)+ω2(𝒞𝕊n1)n.\omega^{2}(\mathcal{C}\cap\mathbb{S}^{n-1})+\omega^{2}(\mathcal{C}^{\circ}\cap\mathbb{S}^{n-1})\leq n.
Fact 9 (Max–min inequality).

[57, Lemma 36.1] For any function F:n×mF:~{}\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R} and any 𝒲n\mathcal{W}\subseteq\mathbb{R}^{n}, 𝒵m\mathcal{Z}\subseteq\mathbb{R}^{m}, we have

sup𝒛𝒵inf𝒘𝒲F(𝒘,𝒛)inf𝒘𝒲sup𝒛𝒵F(𝒘,𝒛).\sup_{\bm{z}\in\mathcal{Z}}\inf_{\bm{w}\in\mathcal{W}}F(\bm{w},\bm{z})\leq\inf_{\bm{w}\in\mathcal{W}}\sup_{\bm{z}\in\mathcal{Z}}F(\bm{w},\bm{z}).
Fact 10.

[37, Theorem 4.3] Let ff be a norm on n\mathbb{R}^{n}, and fix a non-zero point 𝐱n\bm{x}\in\mathbb{R}^{n}. Then

0\displaystyle 0\leq (mint0η2(tf(𝒙)))δ(𝒯f)2max{𝒂2:𝒂f(𝒙)}f(𝒙/𝒙2).\displaystyle\bigg{(}\min_{t\geq 0}\eta^{2}(t\cdot\partial f(\bm{x}))\bigg{)}-\delta(\mathcal{T}_{f})\leq\frac{2\max\{\|\bm{a}\|_{2}:\bm{a}\in\partial f(\bm{x})\}}{f(\bm{x}/\|\bm{x}\|_{2})}.

References

  • [1] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, 2009.
  • [2] E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Miami, FL, USA, 2009, pp. 2790–2797.
  • [3] J. Haupt, W. U. Bajwa, M. Rabbat, and R. Nowak, “Compressed sensing for networked data,” IEEE Signal Process. Mag., vol. 25, no. 2, pp. 92–101, 2008.
  • [4] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-sparsity incoherence for matrix decomposition,” SIAM J. Optim., vol. 21, no. 2, pp. 572–596, 2011.
  • [5] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” J. ACM, vol. 58, no. 3, pp. 1–37, 2011.
  • [6] M. Elad, J.-L. Starck, P. Querre, and D. L. Donoho, “Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA),” Appl. Comp. Harmonic Anal., vol. 19, no. 3, pp. 340–358, 2005.
  • [7] J. N. Laska, M. A. Davenport, and R. G. Baraniuk, “Exact signal recovery from sparsely corrupted measurements through the pursuit of justice,” in Proc. 43rd Asilomar Conf. Signals, Syst. Comput., Pacific Grove, CA, USA, 2009, pp. 1556–1560.
  • [8] J. Wright and Y. Ma, “Dense error correction via 1\ell_{1}-minimization,” IEEE Trans. Inf. Theory, vol. 56, no. 7, pp. 3540–3560, 2010.
  • [9] X. Li, “Compressed sensing and matrix completion with constant proportion of corruptions,” Construct. Approximation, vol. 37, no. 1, pp. 73–99, 2013.
  • [10] N. H. Nguyen and T. D. Tran, “Exact recoverability from dense corrupted observations via 1\ell_{1}-minimization,” IEEE Trans. Inf. Theory, vol. 59, no. 4, pp. 2017–2035, 2013.
  • [11] ——, “Robust lasso with missing and grossly corrupted observations,” IEEE Trans. Inf. Theory, vol. 59, no. 4, pp. 2036–2058, 2013.
  • [12] P. Kuppinger, G. Durisi, and H. Bolcskei, “Uncertainty relations and sparse signal recovery for pairs of general signal sets,” IEEE Trans. Inf. Theory, vol. 58, no. 1, pp. 263–277, 2012.
  • [13] C. Studer, P. Kuppinger, G. Pope, and H. Bolcskei, “Recovery of sparsely corrupted signals,” IEEE Trans. Inf. Theory, vol. 58, no. 5, pp. 3115–3130, 2012.
  • [14] G. Pope, A. Bracher, and C. Studer, “Probabilistic recovery guarantees for sparsely corrupted signals,” IEEE Trans. Inf. Theory, vol. 59, no. 5, pp. 3104–3116, 2013.
  • [15] C. Studer and R. G. Baraniuk, “Stable restoration and separation of approximately sparse signals,” Appl. Comp. Harmonic Anal., vol. 37, no. 1, pp. 12–35, 2014.
  • [16] D. Su, “Data recovery from corrupted observations via l1 minimization,” 2016, [Online]. Available: https://arxiv.org/abs/1601.06011.
  • [17] B. Adcock, A. Bao, J. D. Jakeman, and A. Narayan, “Compressed sensing with sparse corruptions: Fault-tolerant sparse collocation approximations,” SIAM/ASA J. Uncertain. Quantif., vol. 6, no. 4, pp. 1424–1453, 2018.
  • [18] B. Adcock, A. Bao, and S. Brugiapaglia, “Correcting for unknown errors in sparse high-dimensional function approximation,” Numer. Math., vol. 142, no. 3, pp. 667–711, 2019.
  • [19] H. Xu, C. Caramanis, and S. Sanghavi, “Robust PCA via outlier pursuit,” IEEE Trans. Inf. Theory, vol. 58, no. 5, pp. 3047–3064, 2012.
  • [20] H. Xu, C. Caramanis, and S. Mannor, “Outlier-robust PCA: the high-dimensional case,” IEEE Trans. Inf. Theory, vol. 59, no. 1, pp. 546–572, 2013.
  • [21] J. Wright, A. Ganesh, K. Min, and Y. Ma, “Compressive principal component pursuit,” Inf. Inference, J. IMA, vol. 2, no. 1, pp. 32–68, 2013.
  • [22] Y. Chen, A. Jalali, S. Sanghavi, and C. Caramanis, “Low-rank matrix recovery from errors and erasures,” IEEE Trans. Inf. Theory, vol. 59, no. 7, pp. 4324–4337, 2013.
  • [23] M. B. McCoy and J. A. Tropp, “Sharp recovery bounds for convex demixing, with applications,” Found. Comput. Math., vol. 14, no. 3, pp. 503–567, 2014.
  • [24] R. Foygel and L. Mackey, “Corrupted sensing: Novel guarantees for separating structured signals,” IEEE Trans. Inf. Theory, vol. 60, no. 2, pp. 1223–1247, 2014.
  • [25] H. Zhang, Y. Liu, and L. Hong, “On the phase transition of corrupted sensing,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Aachen, Germany, 2017, pp. 521–525.
  • [26] J. Chen and Y. Liu, “Corrupted sensing with sub-Gaussian measurements,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT).   Aachen, Germany: IEEE, 2017, pp. 516–520.
  • [27] J. Chen and Y. Liu, “Stable recovery of structured signals from corrupted sub-Gaussian measurements,” IEEE Trans. Inf. Theory, vol. 65, no. 5, pp. 2976–2994, 2019.
  • [28] Z. Sun, W. Cui, and Y. Liu, “Recovery of structured signals from corrupted non-linear measurements,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Paris, France, 2019, pp. 2084–2088.
  • [29] ——, “Quantized corrupted sensing with random dithering,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Los Angeles, USA, 2020, pp. 1397–1402.
  • [30] D. L. Donoho, “Neighborly polytopes and sparse solutions of underdetermined linear equations,” Technical Report, Department of Statistics, Stanford University, 2005.
  • [31] ——, “High-dimensional centrally symmetric polytopes with neighborliness proportional to dimension,” Discrete Comput. Geom., vol. 35, no. 4, pp. 617–652, 2006.
  • [32] D. Donoho and J. Tanner, “Counting faces of randomly projected polytopes when the projection radically lowers dimension,” J. Amer. Math. Soc., vol. 22, no. 1, pp. 1–53, 2009.
  • [33] D. L. Donoho and J. Tanner, “Neighborliness of randomly projected simplices in high dimensions,” Proc. Natl Acad. Sci. USA, vol. 102, no. 27, pp. 9452–9457, 2005.
  • [34] ——, “Counting the faces of randomly-projected hypercubes and orthants, with applications,” Discrete Comput. Geom., vol. 43, no. 3, pp. 522–541, 2010.
  • [35] M. A. Khajehnejad, W. Xu, A. S. Avestimehr, and B. Hassibi, “Analyzing weighted 1\ell_{1} minimization for sparse recovery with nonuniform sparse models,” IEEE Trans. Signal Process., vol. 59, no. 5, pp. 1985–2001, 2011.
  • [36] W. Xu and B. Hassibi, “Precise stability phase transitions for 1\ell_{1} minimization: A unified geometric framework,” IEEE Trans. Inf. Theory, vol. 57, no. 10, pp. 6894–6919, 2011.
  • [37] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp, “Living on the edge: Phase transitions in convex programs with random data,” Inf. Inference, J. IMA, vol. 3, no. 3, pp. 224–294, 2014.
  • [38] D. Amelunxen and P. Bürgisser, “Intrinsic volumes of symmetric cones and applications in convex programming,” Math. Program., vol. 149, no. 1-2, pp. 105–130, 2015.
  • [39] ——, “Probabilistic analysis of the Grassmann condition number,” Found. Comput. Math., vol. 15, no. 1, pp. 3–51, 2015.
  • [40] L. Goldstein, I. Nourdin, and G. Peccati, “Gaussian phase transitions and conic intrinsic volumes: Steining the Steiner formula,” Ann. Appl. Probab., vol. 27, no. 1, pp. 1–47, 2017.
  • [41] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Natl Acad. Sci. USA, vol. 106, no. 45, pp. 18 914–18 919, 2009.
  • [42] M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs, with applications to compressed sensing,” IEEE Trans. Inf. Theory, vol. 57, no. 2, pp. 764–785, 2011.
  • [43] ——, “The LASSO risk for Gaussian matrices,” IEEE Trans. Inf. Theory, vol. 58, no. 4, pp. 1997–2017, 2011.
  • [44] D. L. Donoho, A. Maleki, and A. Montanari, “The noise-sensitivity phase transition in compressed sensing,” IEEE Trans. Inf. Theory, vol. 57, no. 10, pp. 6920–6941, 2011.
  • [45] D. L. Donoho, M. Gavish, and A. Montanari, “The phase transition of matrix recovery from Gaussian measurements matches the minimax MSE of matrix denoising,” Proc. Natl Acad. Sci. USA, vol. 110, no. 21, pp. 8405–8410, 2013.
  • [46] S. Oymak and B. Hassibi, “Sharp MSE bounds for proximal denoising,” Found. Comput. Math., vol. 16, no. 4, pp. 965–1029, 2016.
  • [47] Y. Gordon, “Some inequalities for Gaussian processes and applications,” Isr. J. Math., vol. 50, no. 4, pp. 265–289, 1985.
  • [48] M. Rudelson and R. Vershynin, “On sparse reconstruction from Fourier and Gaussian measurements,” Comm. Pure Appl. Math., vol. 61, no. 8, pp. 1025–1045, 2008.
  • [49] M. Stojnic, “Various thresholds for 1\ell_{1}-optimization in compressed sensing,” 2009, [Online]. Available: https://arxiv.org/abs/0907.3666.
  • [50] S. Oymak and B. Hassibi, “New null space results and recovery thresholds for matrix rank minimization,” 2010, [Online]. Available: https://arxiv.org/abs/1011.6326.
  • [51] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of linear inverse problems,” Found. Comput. Math., vol. 12, no. 6, pp. 805–849, 2012.
  • [52] M. Stojnic, “A framework to characterize performance of lasso algorithms,” 2013, [Online]. Available: https://arxiv.org/abs/1303.7291.
  • [53] ——, “Regularly random duality,” 2013, [Online]. Available: https://arxiv.org/abs/1303.7295.
  • [54] S. Oymak, C. Thrampoulidis, and B. Hassibi, “The squared-error of generalized LASSO: A precise analysis,” 2013, [Online]. Available: https://arxiv.org/abs/1311.0830.
  • [55] C. Thrampoulidis, S. Oymak, and B. Hassibi, “The Gaussian min-max theorem in the presence of convexity,” 2014, [Online]. Available: https://arxiv.org/abs/1408.4837.
  • [56] S. Oymak and J. A. Tropp, “Universality laws for randomized dimension reduction, with applications,” Inf. Inference, J. IMA, vol. 7, no. 3, pp. 337–446, 2018.
  • [57] R. T. Rockafellar, Convex analysis.   Princeton, NJ, USA: Princeton Univ. Press, 2015.
  • [58] V. I. Bogachev, Gaussian measures.   Providence, RI, USA: American Mathematical Soc., 1998, no. 62.
  • [59] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 2.1,” 2017, [Online]. Available: http://cvxr.com/cvx/.
  • [60] ——, “Graph implementations for nonsmooth convex programs,” in Recent Advances in Learning and Control, ser. Lecture Notes in Control and Information Sciences, V. Blondel, S. Boyd, and H. Kimura, Eds.   London, U.K.: Springer-Verlag, 2008, pp. 95–110.
  • [61] H. H. Bauschke, P. L. Combettes et al., Convex analysis and monotone operator theory in Hilbert spaces.   Cham, Switzerland: Springer, 2011, vol. 408.