Smoothed Robust Phase Retrieval

Zhong Zheng and Lingzhou Xue
Department of Statistics, The Pennsylvania State University
Abstract

The phase retrieval problem in the presence of noise aims to recover the signal vector of interest from a set of quadratic measurements with infrequent but arbitrary corruptions, and it plays an important role in many scientific applications. However, the essential geometric structure of nonconvex robust phase retrieval based on the $\ell_{1}$-loss, which is needed to study spurious local solutions, remains largely unknown even under the ideal noiseless setting, and its intrinsic nonsmooth nature also impairs the efficiency of optimization algorithms. This paper introduces the smoothed robust phase retrieval (SRPR) based on a family of convolution-type smoothed loss functions. Theoretically, we prove that the SRPR enjoys a benign geometric structure with high probability: (1) under the noiseless situation, the SRPR has no spurious local solutions, and the target signals are global solutions, and (2) under infrequent but arbitrary corruptions, we characterize the stationary points of the SRPR and prove its benign landscape, which is the first landscape analysis of phase retrieval with corruption in the literature. Moreover, we prove the local linear convergence rate of gradient descent for solving the SRPR under the noiseless situation. Experiments on both simulated datasets and image recovery are provided to demonstrate the numerical performance of the SRPR.

Keywords— Convolution smoothing; Landscape analysis; Nonconvex optimization.

1 Introduction

Investigating the recovery of a signal vector of interest $x\in\mathbb{R}^{p}$ from a set of sensing vectors $a_{1},\ldots,a_{n}\in\mathbb{R}^{p}$ and their linear transformations $a_{1}^{\top}x,\ldots,a_{n}^{\top}x$ in the presence of noise is a critical research question in statistics and data science. There has been considerable interest in the estimation of this signal vector utilizing magnitude-based measurements $(a_{1}^{\top}x)^{2},\ldots,(a_{n}^{\top}x)^{2}$ in the presence of potential errors and corruptions. This problem, commonly referred to as phase retrieval, plays a pivotal role in addressing the phase problem [34, 30]. It finds applications across a wide spectrum of scientific fields, including X-ray crystallography [25], optical physics [27], coherent diffractive imaging using arrays and high-power sources [6], astronomy [15], and microscopy [26].

Over the last decade, several pioneering papers have established the theoretical foundation for the noiseless phase retrieval problem, which seeks to recover the signal vectors $x_{\star}$ or $-x_{\star}$ in $\mathbb{R}^{p}$ from a collection of magnitude-based measurements without any errors or corruptions:

b_{i}=(a_{i}^{\top}x_{\star})^{2},\qquad(1)

for each $i\in[n]:=\{1,2,\ldots,n\}$. Here, $b_{1},\ldots,b_{n}$ represent non-negative measurements. It is important to note that (1) is a nonconvex problem recognized as being NP-hard [14]. To address this challenge, [5] introduced the PhaseLift method, employing a convex relaxation strategy based on semidefinite programming (SDP) to solve a trace-norm minimization problem. However, as the dimension of the signal vector increases, SDP-based methods tend to become computationally infeasible [4]. In response to these limitations, the literature has developed alternative approaches based on nonconvex optimization, including the Wirtinger flow [4], truncated Wirtinger flow [7], thresholded Wirtinger flow [3], truncated amplitude flow [35], and others. These nonconvex methods have utilized spectral initialization [4] to facilitate global convergence. Recently, the seminal paper [32] demonstrated that the least-squares formulation for recovering the signal from (1) exhibits a benign global geometry (also known as a benign function landscape), ensuring that, with high probability, the nonconvex problem has no spurious local minimizers and can be globally optimized through efficient iterative methods, even with random initialization. Further, [8] showed that vanilla gradient descent is effective and provided a comprehensive analysis of the optimization trajectory, and [2] illustrated that the smoothed amplitude flow model shares similar properties.

In real-world applications, it is crucial not to overlook the influence of noise on phase retrieval. In particular, we should consider the presence of additive errors and infrequent corruptions affecting magnitude-based measurements, which may arise from equipment limitations, measurement errors, extreme occurrences, and so on. The noisy magnitude-based measurements can be formalized as follows: for each $i\in[n]$,

b_{i}=\begin{cases}(a_{i}^{\top}x_{\star})^{2}+\tau_{i},&\text{if the $i$-th measurement is uncorrupted},\\ \xi_{i},&\text{otherwise}.\end{cases}\qquad(2)

Here, $\tau_{i}$ represents the additive errors bounded in $\mathbb{R}$, and $\xi_{i}$ signifies the infrequent yet arbitrary corruptions in $\mathbb{R}_{+}$. The set $\mathbb{I}_{O}=\{i\in[n]:\text{the $i$-th measurement is corrupted}\}$ denotes the indices of outliers, and $\mathrm{card}(\mathbb{I}_{O})$ is the cardinality of $\mathbb{I}_{O}$. We define $p_{\text{fail}}=\mathrm{card}(\mathbb{I}_{O})/n$ as the proportion of corrupted measurements. It is essential to note that $p_{\text{fail}}$ remains a relatively small quantity due to the infrequency of corruption. However, the corruptions $\xi_{i}$ can be unbounded and may behave adversarially. Furthermore, independence between $\{\xi_{i}\}_{i\in\mathbb{I}_{O}}$ and $\{a_{i}\}_{i=1}^{n}$ may not hold, introducing additional challenges to signal recovery.

Model (2) provides a general framework for signal recovery from noisy magnitude-based measurements and includes various specialized scenarios as special cases. Specifically, the noiseless phase retrieval problem, as formulated in (1), can be viewed as a particular instance of (2) with $p_{\text{fail}}=0$ and $\tau_{i}=0$ for all $i\in[n]$. Moreover, phase retrieval in the presence of bounded noise, as investigated by [37], emerges as another special case of (2) when $p_{\text{fail}}=0$, indicating the absence of outliers. Conversely, the robust phase retrieval problem [11] aligns with (2) under the situation that $\tau_{i}=0$ for all $i\in[n]$, focusing on robustness against corruptions. Additionally, the approach for nonconvex phase retrieval with outliers, as proposed by [36], resonates with (2) and accommodates both dense bounded noise and sparse arbitrary outliers: $b_{i}=(a_{i}^{\top}x_{\star})^{2}+\tau_{i}+\xi_{i}$ for each $i\in[n]$.
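To make the data-generating process concrete, the following minimal Python sketch simulates model (2) with standard Gaussian sensing vectors. It is our own illustration rather than the authors' code; in particular, the exponential corruption distribution and all function names are arbitrary choices made for this sketch.

import numpy as np

def simulate_measurements(n, p, p_fail=0.05, gamma=0.0, seed=0):
    """Simulate (a_i, b_i) from model (2): a fraction p_fail of the b_i are
    replaced by arbitrary non-negative corruptions xi_i, and the rest carry
    bounded additive errors |tau_i| <= gamma."""
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal(p)
    x_star /= np.linalg.norm(x_star)          # normalize the true signal
    A = rng.standard_normal((n, p))           # rows are the sensing vectors a_i
    b = (A @ x_star) ** 2 + rng.uniform(-gamma, gamma, n)
    outliers = rng.choice(n, int(p_fail * n), replace=False)
    b[outliers] = rng.exponential(scale=10.0, size=len(outliers))
    return A, b, x_star

A, b, x_star = simulate_measurements(n=600, p=50)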

To solve the problem of phase retrieval with corruptions in Model (2), the objective remains to recover the signal vectors $x_{\star}$ or $-x_{\star}$ using the observations $\{(a_{i},b_{i})\}_{i=1}^{n}$. In this vein, [36] extended the methodology of [7] to propose the median truncated Wirtinger flow, and [11] explored robust phase retrieval employing the $\ell_{1}$-loss, formally defined as

\min_{x\in\mathbb{R}^{p}}F(x):=\min_{x\in\mathbb{R}^{p}}\frac{1}{n}\sum_{i=1}^{n}\left|(a_{i}^{\top}x)^{2}-b_{i}\right|.\qquad(3)

[11] introduced the proximal linear (PL) algorithm to tackle this nonconvex optimization challenge, demonstrating that the $\ell_{1}$-loss enhances success rates in comparison to the median truncated Wirtinger flow [36]. Furthermore, they established the sharpness and weak convexity properties of $F(x)$, alongside the local convergence of the PL algorithm. Subsequent research on the optimization problem (3) has expanded upon these findings: [9, 10] investigated the local convergence of subgradient algorithms, and recently, [38] proposed the inexact proximal linear (IPL) algorithm and studied its local convergence.

Despite these advancements, the essential geometric structure of robust phase retrieval, needed to study spurious local solutions, remains largely unknown. This contrasts with the works of [32, 8, 2], which provided insights into the landscape of smooth objective functions in phase retrieval under noiseless conditions. Notably, even under the ideal noiseless setting, [11] did not conduct a landscape analysis to understand the stationary points of the nonconvex and nonsmooth optimization. [10] only defined and analyzed stationary points in the subgradient sense, and whether they are local minima is still unknown. Moreover, in the presence of corruption, we cannot find any existing work studying the landscape of any objective function in the literature. As a result, there is no theoretical guarantee that robust phase retrieval avoids spurious local minima under random initialization.

Another concern arises from the intrinsic nonsmooth nature of (3), which significantly impacts the efficiency of optimization algorithms, including the PL method [11] and the IPL method [38]. Both approaches necessitate iteratively solving nonsmooth subproblems using the proximal operator graph splitting (POGS) algorithm [28] or the accelerated proximal gradient algorithm, which invariably leads to diminished optimization efficiency. Moreover, as pointed out by [9, 10], the subgradient algorithms for solving (3) are sensitive to the step size selection, and there are no practically reliable methods for choosing the step size.

All these challenges stem from the non-differentiability of the $\ell_{1}$-loss employed in (3), which limits the technical tools available for analyzing function landscapes and complicates the optimization process. In this paper, we propose the smoothed robust phase retrieval (SRPR) method, which minimizes a smoothed surrogate $F_{\delta}(x)$ based on convolution-type smoothing: the convolution of the $\ell_{1}$-loss with a kernel function $K_{\delta}(\cdot)$ with bandwidth $\delta>0$. This smoothing idea was first introduced by [13] and then investigated by [16], [33], [18], and [23] to overcome the lack of smoothness in quantile regression and to improve optimization efficiency while achieving the desired convergence rate and oracle property.

We demonstrate that, with high probability, $F_{\delta}(x)$ not only satisfies a weak convexity property similar to that of $F(x)$ but also satisfies generalized sharpness in the noiseless scenario. Furthermore, this approach facilitates the use of gradient and Hessian information, providing a convenient mathematical structure. These properties are crucial for determining the convergence rates of gradient-based optimization algorithms and for conducting landscape analyses. By directly minimizing $F_{\delta}(x)$ in the absence of noise, exact signal recovery is achievable. Additionally, our methodology extends to scenarios involving corrupted measurements, illustrating its versatility and applicability. The SRPR approach makes at least the following contributions to the field, enhancing both the theoretical and practical aspects of phase retrieval when compared to existing methodologies.

Firstly, the smoothness of our approach enables the application of gradient-based algorithms to minimize the nonconvex function $F_{\delta}(x)$. We demonstrate that, under the noiseless situation, the minimization of $F_{\delta}(x)$ through general first-order methods allows SRPR to achieve local linear convergence. This marks a significant improvement over the sublinear convergence rates observed with the PL [11] and IPL [38] algorithms. The attainment of a linear rate is facilitated by the Polyak-Lojasiewicz condition [21] that holds locally, circumventing the need for local convexity. Additionally, our method benefits from the potential for acceleration and line search strategies, offering clear advantages over subgradient algorithms [9, 10], which encounter challenges in step size tuning. In numerical experiments, SRPR demonstrates superior efficiency relative to competing algorithms.

Secondly, in the context of noiseless phase retrieval, we conduct a landscape analysis to show that $F_{\delta}(x)$, with high probability, exhibits a benign landscape and has no spurious local minima. This analysis underscores the efficacy of random initialization in achieving exact recovery through the minimization of $F_{\delta}(x)$. To the best of our knowledge, this is the first instance of a benign landscape being identified for a robust objective function in the domain of noiseless phase retrieval, as the previously discussed works [11] and [10] have only considered the local behaviors of optimization algorithms and the locations of stationary points.

Thirdly, in the presence of corrupted measurements, we prove that when $p_{\text{fail}}$ is small and bounded noises are absent, with high probability, the landscape of $F_{\delta}(x)$ remains benign outside a vicinity around $\{x_{\star},-x_{\star}\}$ with a radius proportional to $\delta p_{\text{fail}}/(1-p_{\text{fail}})$. Similar results hold for the general Model (2). To the best of our knowledge, this represents the first result on benign function landscapes in the context of phase retrieval with corruptions.

The rest of this paper is organized as follows. Section 2 introduces the methodology of SRPR. Section 3 focuses on the noiseless landscape analysis. Section 4 focuses on the empirical landscapes with corruption. Section 5 presents the numerical experiments. Section 6 includes a few concluding remarks. The complete proofs are presented in the supplement.

2 Methodology

This section first presents the general framework for smoothed robust phase retrieval in Subsection 2.1 and then discusses the convolution-type smoothed loss functions in Subsection 2.2. Subsection 2.3 presents the properties of the smoothed objective function. Subsection 2.4 discusses the optimization algorithms.

Before proceeding, we introduce some useful notation. Denote by $\mathbf{1}[\cdot]$ the indicator function. Let $B_{p}(x,M)=\{y\in\mathbb{R}^{p}:\|y-x\|_{2}\leq M\}$. Define $\mathbb{E}_{s}A=\frac{1}{s}\sum_{i=1}^{s}A_{i}$, and let $I_{p}$ denote the $p\times p$ identity matrix. Let $e_{i}\in\mathbb{R}^{p}$ represent the vector whose $i$-th element is 1 with all other elements taking the value 0. In addition, throughout the paper, we use $c,c^{\prime},C,C^{\prime},\ldots$ to denote generic positive constants, though the actual values may vary on different occasions.

2.1 A General Framework

In the robust phase retrieval problem (2), we observe vectors $\{a_{i}\}$ and non-negative scalars $\{b_{i}\}$ and aim to recover the true phases $\{x_{\star},-x_{\star}\}$. (Due to the quadratic relationship in (2), both $x_{\star}$ and $-x_{\star}$ are true signal vectors.)

Assumption 1.

Let $a_{1},a_{2},\ldots,a_{n_{1}+n_{2}}$ be random vectors in $\mathbb{R}^{p}$ that are independently and identically distributed with mean zero and covariance matrix $\Sigma_{p}$, satisfying $u^{\top}\Sigma_{p}u\leq\lambda_{M}\|u\|_{2}^{2}$ for all $u\in\mathbb{R}^{p}$, where $\lambda_{M}$ is a positive constant. Here, $n_{1},n_{2}$ are non-negative integers, and $p\geq 3$ is a positive integer. The true phase $x_{\star}\in\mathbb{R}^{p}\setminus\{0\}$ and

b_{i}=[(a_{i}^{\top}x_{\star})^{2}+\tau_{i}]\mathbf{1}[i\in[n_{1}]]+\xi_{i}\mathbf{1}[i\in[n_{1}+n_{2}]-[n_{1}]],\quad\forall\ i\in[n_{1}+n_{2}].\qquad(4)

The $\tau_{i}$'s are random variables in $\mathbb{R}$ with $|\tau_{i}|\leq\gamma$ for any $i\in[n_{1}]$. The $\xi_{i}$'s are non-negative random variables without further assumptions.

In Assumption 1, $n=n_{1}+n_{2}$ represents the total number of measurements, and $p_{\text{fail}}=\frac{n_{2}}{n_{1}+n_{2}}$ denotes the proportion of corrupted measurements. The sets $\mathbb{I}_{1}=[n_{1}]$ and $\mathbb{I}_{2}=[n]-[n_{1}]$ correspond to the inliers and outliers among the measurements, respectively. In the special case where $\gamma=p_{\text{fail}}=0$, the model reverts to the noiseless phase retrieval scenario depicted in (1). Our goal is to recover the true signal vectors $x_{\star}$ or $-x_{\star}$ from the observed $\{(a_{i},b_{i})\}_{i=1}^{n}$, without prior knowledge of which measurements are corrupted. This setup ensures the mutual independence of $\{a_{i},i\in[n_{1}]\}$ and $\{a_{i},i\in[n_{1}+n_{2}]-[n_{1}]\}$. We will introduce stronger assumptions on the distribution of $a_{i}$ later. The bounded noise variables $\tau_{i}$ require no additional distributional assumptions, whereas no assumptions are made regarding the distribution of $\{\xi_{i}\}_{i=1}^{n}$ or its independence with respect to $\{a_{i}\}$, allowing for potentially heavy-tailed and adversarially distributed corruption. This assumption is more general than the model assumption $M_{2}$ of [11], which requires independence between $\{\xi_{i},i\in[n_{1}+n_{2}]-[n_{1}]\}$ and $\{a_{i},i\in[n_{1}]\}$. For simplicity, in subsequent discussions, we will use a random vector $a$ to denote the generic distribution of the $a_{i}$'s.

As shown in [11], when $\gamma=0$, exact recovery is achievable in the optimization problem (3) employing the $\ell_{1}$-loss, indicating that $\{x_{\star},-x_{\star}\}=\operatorname*{arg\,min}_{x\in\mathbb{R}^{p}}F(x)$ with high probability under certain assumptions. To refine this approach, we employ convolution-type smoothing, focusing on the problem formulated as:

\min_{x\in\mathbb{R}^{p}}F_{\delta}(x)=\frac{1}{n}\sum_{i=1}^{n}l_{\delta}\left((a_{i}^{\top}x)^{2}-b_{i}\right),\qquad(5)

where

l_{\delta}(x)=(l*K_{\delta})(x)=\int_{-\infty}^{+\infty}|y|K_{\delta}(x-y)\,dy,\qquad(6)

$K(x)$ represents a kernel function, $\delta>0$ is the bandwidth, $*$ denotes the convolution operation, and $K_{\delta}(x)=\frac{1}{\delta}K(\frac{x}{\delta})$. In this context, $l_{\delta}(x)$ serves as a smoothed loss function, substituting for $l(x)=|x|$ in $F(x)$. The adoption of convolution-type smoothing is motivated by gradient approximation: $\nabla F_{\delta}(x)$ closely mirrors the subgradient of $F(x)$ because $\lim_{\delta\rightarrow 0}l^{\prime}_{\delta}(x)=\text{sign}(x)\in\partial|x|$, suggesting that the statistical behavior of $F_{\delta}(x)$ will approximate that of $F(x)$. Moreover, the smoothness of $F_{\delta}(x)$ significantly benefits gradient-based optimization and landscape analysis. Under mild conditions on the kernel function $K(x)$, both $l^{\prime}_{\delta}(x)$ and $l^{\prime\prime}_{\delta}(x)$ are well-defined, ensuring that $F_{\delta}(x)$ is smooth and equipped with gradients and Hessians for both optimization and landscape analysis purposes.
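As a concrete illustration, the Python sketch below evaluates $F_{\delta}(x)$ in (5) and its gradient for the Gaussian kernel, for which $l^{\prime}_{\delta}(x)=2\Phi(x/\delta)-1$ (see Subsection 2.2). The function names are ours, and this is only a sketch of the construction under the stated kernel choice, not the authors' implementation.

import numpy as np
from scipy.stats import norm

def l_delta(r, delta):
    """Smoothed l1-loss (6) for the Gaussian kernel: l_delta(r) = delta * lG(r/delta)."""
    u = r / delta
    return delta * (np.sqrt(2 / np.pi) * np.exp(-u**2 / 2) + u * (1 - 2 * norm.cdf(-u)))

def F_delta(x, A, b, delta):
    """Objective (5): the average of l_delta((a_i^T x)^2 - b_i)."""
    return np.mean(l_delta((A @ x) ** 2 - b, delta))

def grad_F_delta(x, A, b, delta):
    """Gradient of (5): (1/n) * sum_i 2 * l'_delta(r_i) * (a_i^T x) * a_i,
    with l'_delta(r) = 2 * Phi(r/delta) - 1 for the Gaussian kernel."""
    Ax = A @ x
    w = 2 * (2 * norm.cdf((Ax**2 - b) / delta) - 1) * Ax
    return A.T @ w / len(b)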

2.2 Convolution-type Smoothing

In this subsection, we elaborate on the convolution-type smoothed loss function defined in (6). Additionally, we introduce the notations $\tilde{K}(x)=\int_{-\infty}^{x}K(y)\,dy$ and $\tilde{K}_{\delta}(x)=\tilde{K}(x/\delta)$ for the integrated kernel function and its scaled version, respectively. To facilitate our analysis, we adhere to the following assumption on the kernel function $K(x)$ throughout this paper:

Assumption 2.

The mapping $K:\mathbb{R}\rightarrow\mathbb{R}$ satisfies the following conditions:

(a) $K(\cdot)$ is a kernel, namely, $K(x)\geq 0$, $K(x)=K(-x)$ for all $x\in\mathbb{R}$, and $\int_{-\infty}^{\infty}K(x)\,dx=1$.

(b) $K(x)$ is bounded and Lipschitz continuous, and $K(0)>0$.

(c) Let $\bar{K}(x)=xK(x)$. Then $\bar{K}(x)$ is bounded and Lipschitz continuous, and $\int_{-\infty}^{+\infty}\left|\bar{K}(x)\right|\,dx<\infty$.

Next, we give some examples of kernels $K(x)$ that satisfy Assumption 2; a short numerical sketch of the induced smoothed losses follows the list.

• Gaussian kernel $K(x)=(2\pi)^{-1/2}e^{-x^{2}/2}$, which generates $l_{\delta}(x)=\delta l^{\mathrm{G}}(x/\delta)$ with $l^{\mathrm{G}}(x):=(2/\pi)^{1/2}e^{-x^{2}/2}+x\{1-2\Phi(-x)\}$ and $\Phi(\cdot)$ being the CDF of $N(0,1)$.

• Logistic kernel $K(x)=e^{-x}/\left(1+e^{-x}\right)^{2}$, which generates $l_{\delta}(x)=\delta l^{\mathrm{L}}(x/\delta)$ with $l^{\mathrm{L}}(x)=x+2\log\left(1+e^{-x}\right)$.

• Epanechnikov kernel $K(x)=(3/4)\left(1-x^{2}\right)\mathbf{1}(|x|\leq 1)$, which generates $l_{\delta}(x)=\delta l^{\mathrm{E}}(x/\delta)$ with $l^{\mathrm{E}}(x)=\left(3x^{2}/4-x^{4}/8+3/8\right)\mathbf{1}(|x|\leq 1)+|x|\mathbf{1}(|x|>1)$.

• Triangular kernel $K(x)=(1-|x|)\mathbf{1}(|x|\leq 1)$, which generates $l_{\delta}(x)=\delta l^{\mathrm{T}}(x/\delta)$ with $l^{\mathrm{T}}(x)=(x^{2}-|x|^{3}/3+1/3)\mathbf{1}(|x|\leq 1)+|x|\mathbf{1}(|x|>1)$.

• $K(x)=\frac{1}{2(x^{2}+1)^{3/2}}$, which generates the pseudo-Huber loss $l_{\delta}(x)=\sqrt{x^{2}+\delta^{2}}$.
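The sketch below (our own code, not from the paper) implements several of the closed-form smoothed losses above and checks that each is close to $|x|$ when $|x|$ is large relative to $\delta$.

import numpy as np
from scipy.stats import norm

def l_gaussian(x, delta):
    u = x / delta
    return delta * (np.sqrt(2 / np.pi) * np.exp(-u**2 / 2) + u * (1 - 2 * norm.cdf(-u)))

def l_logistic(x, delta):
    u = x / delta
    return delta * (u + 2 * np.logaddexp(0.0, -u))   # u + 2*log(1 + e^{-u}), overflow-safe

def l_epanechnikov(x, delta):
    u = x / delta
    return delta * np.where(np.abs(u) <= 1, 3 * u**2 / 4 - u**4 / 8 + 3 / 8, np.abs(u))

def l_pseudo_huber(x, delta):
    return np.sqrt(x**2 + delta**2)

x = np.array([2.0])
for f in (l_gaussian, l_logistic, l_epanechnikov, l_pseudo_huber):
    print(f.__name__, f(x, 0.1))   # each value is approximately |x| = 2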

In practical implementations, $\delta$ ought to be a small constant so that $l_{\delta}(x)$ approximates the $\ell_{1}$-loss, following the ideas introduced in smoothed quantile regression [16]. Under Assumption 2, we can derive several properties of $l_{\delta}(x)$, which are briefly discussed here for space considerations; more details are summarized in a lemma in the appendix of the supplement. Similar to [16], we prove that $l_{\delta}(x)$ is convex and has continuous first-order and second-order derivatives, namely, $l^{\prime}_{\delta}(x)=2\tilde{K}_{\delta}(x)-1$ and $l^{\prime\prime}_{\delta}(x)=2K_{\delta}(x)$. This ensures the existence of the gradient and Hessian of $F_{\delta}(x)$. Moreover, note that $l_{\delta}(x)-l_{\delta}(0)\geq\min\left\{\frac{C_{1}}{\delta}x^{2},C_{2}|x|\right\}$ for all $x\in\mathbb{R}$ and some positive constants $C_{1}$ and $C_{2}$. This relation suggests that $l_{\delta}(x)$ behaves like the $\ell_{2}$-loss for values of $x$ near 0 and resembles the $\ell_{1}$-loss for values of $x$ far from 0. This bound is the foundation for the generalized sharpness condition on $F_{\delta}(x)$ in the absence of noise, a topic elaborated upon in the next subsection.

2.3 Properties of the Smoothed Objective Function

In this subsection, we discuss the properties of $F_{\delta}(x)$, which indicate the similarities between $F(x)$ and $F_{\delta}(x)$ and explain how minimizing $F_{\delta}(x)$ works for robust phase retrieval. We first introduce two parallel assumptions given in [11]. Assumption 3 states that the distribution of $a_{i}$ is non-degenerate in all directions, and Assumption 4 describes the tail behavior of the distribution of $a_{i}$.

Assumption 3.

There exist two positive constants $\kappa_{st}$ and $p_{st}$ such that, for any $i\in[n]$ and $u,v\in\mathbb{R}^{p}$ with $\|u\|_{2}=\|v\|_{2}=1$, we have $P\left(\min\{|a_{i}^{\top}u|,|a_{i}^{\top}v|\}\geq\kappa_{st}\right)\geq p_{st}$.

Assumption 4.

The vectors $a_{i}$ are sub-Gaussian with parameter $c_{s}\geq 1$, which means that

\inf\left\{t>0:\mathbb{E}\exp\left((a_{i}^{\top}u)^{2}/t^{2}\right)\leq 2\right\}\leq c_{s}\sqrt{u^{\top}\Sigma_{p}u},\quad\forall\ i\in[n],\ u\in\mathbb{R}^{p}.

Under the same assumptions, we can prove two properties of $F_{\delta}(x)$ that are comparable to those of $F(x)$. The first is generalized sharpness.

Lemma 1.

(Generalized Sharpness) Under Assumptions 1, 2 and 3, when $p_{\text{fail}}=\gamma=0$ and $n\geq cp$, with probability at least $1-C\exp(-c^{\prime}n)$, we have

F_{\delta}(x)-F_{\delta}(x_{\star})\geq\lambda_{s}\min\left\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\right\},\quad\forall x\in\mathbb{R}^{p},\qquad(7)

where $\Delta(x)=\min\{\|x-x_{\star}\|_{2},\|x+x_{\star}\|_{2}\}$, and $c,c^{\prime},C,\lambda_{s}$ are constants independent of $\delta$.

In the literature, the sharpness of a function $f(x)$ bounds the difference between $f(x)$ and its global minimum from below by the distance between $x$ and its nearest global minimum point. This concept is prevalent in optimization problems with an equation-solving structure and nonsmooth objective functions; for more details, please see Section 1 of [29]. The generalized sharpness condition extends the sharpness of $F(x)$, as outlined in Corollary 3.1 of [11], which is the special case of (7) when $\Delta(x)\geq\delta$:

F(x)-F(x_{\star})\geq\lambda_{s}\Delta(x),\quad\forall x\in\mathbb{R}^{p}.\qquad(8)

[11] and [12] proved that (8) holds with high probability for the noiseless robust phase retrieval. [11], [9], [10], and [38] leveraged (8) to show the convergence rate for their algorithms.

The second property is weak convexity, which relies on Assumption 4.

Lemma 2.

(Weak Convexity) Given Assumptions 1, 2, and 4, and assuming $n\geq cp$, the following inequality is satisfied with probability at least $1-C\exp(-c^{\prime}n)$:

F_{\delta}(y)-F_{\delta}(x)\geq\nabla F_{\delta}(x)^{\top}(y-x)-\frac{\rho}{2}\|y-x\|_{2}^{2},\quad\forall x,y\in\mathbb{R}^{p}.\qquad(9)

Here, $\rho,C,c,c^{\prime}$ are positive constants independent of $\delta$, $K(x)$, and the value of $p_{\text{fail}}$.

This relationship also holds when substituting $F(x)$ for $F_{\delta}(x)$ and replacing the gradient with the subgradient. Weak convexity, a notion extensively recognized in the optimization literature, generalizes convexity: it requires that $f(x)+\frac{\rho}{2}\|x\|_{2}^{2}$ be convex for a certain positive constant $\rho$. Various definitions of weak convexity appear in the literature; the formulation in Lemma 2 aligns with that of [10]. Furthermore, Corollary 3.2 in [11] defines weak convexity by bounding the deviation between $F(x)$ and its linear approximation, requiring the existence of a positive constant $\rho>0$ such that:

\left|F(x)-\frac{1}{n}\sum_{i=1}^{n}\left|(a_{i}^{\top}y)^{2}-b_{i}+2(a_{i}^{\top}y)a_{i}^{\top}(x-y)\right|\right|\leq\frac{\rho}{2}\|y-x\|_{2}^{2},\quad\forall x,y\in\mathbb{R}^{p}.

Next, in the case $\gamma=0$, we discuss how generalized sharpness and weak convexity lead to the robustness of using $F_{\delta}(x)$. As $l_{\delta}(x)$ is an approximation of $l(x)=|x|$, we can expect approximation in function values, namely $\sup_{x\in\mathbb{R}}\left|l_{\delta}(x)-|x|\right|\leq C_{0}\delta$ for some positive constant $C_{0}$. Together with the sharpness of $F(x)$ given in (8), we can conclude that, for some positive constant $C_{1}$,

\Delta(\bar{x})\leq C_{1}\delta,\quad\forall\ \bar{x}\in\operatorname*{arg\,min}_{x\in\mathbb{R}^{p}}F_{\delta}(x).

This inequality shows the closeness of the minimizers of $F(x)$ and $F_{\delta}(x)$. Moreover, our local landscape analysis, which aims at identifying and categorizing stationary points near the true signal vectors, allows for a finer characterization of local regions potentially containing a stationary point. When $p_{\text{fail}}=0$, and under the conditions specified in (7) and (9), it is deduced that:

-\lambda_{s}\min\left\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\right\}\geq F_{\delta}(x_{\star})-F_{\delta}(x)\geq\nabla F_{\delta}(x)^{\top}(x_{\star}-x)-\frac{\rho}{2}\Delta^{2}(x).

Therefore, when $\delta$ is sufficiently small, there exist positive constants $C_{2},C_{3}$ such that

\nabla F_{\delta}(x)^{\top}(x-x_{\star})\geq C_{2}\min\left\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\right\},\quad\forall x\in\mathbb{R}^{p}\ \text{s.t.}\ \Delta(x)\leq C_{3},\ \|x-x_{\star}\|_{2}=\Delta(x).\qquad(10)

Similar conclusions hold for the symmetric region where $\Delta(z)=\|z+x_{\star}\|_{2}$. So, within $R_{0}:=\{z\in\mathbb{R}^{p}:\Delta(z)\leq C_{3}\}$, we can claim that the true signal vectors are stationary points and local minima.

In scenarios involving corruptions with $\gamma=0$, we employ the decomposition

F_{\delta}(x)=(1-p_{\text{fail}})F_{\delta,1}(x)+p_{\text{fail}}F_{\delta,2}(x),

where $F_{\delta,i}(x)$, for $i=1,2$, represents the sample mean of $l_{\delta}\left((a_{i}^{\top}x)^{2}-b_{i}\right)$ over the noiseless and corrupted measurements, respectively. This allows us to express the gradient as

\nabla F_{\delta}(x)^{\top}(x-x_{\star})=(1-p_{\text{fail}})\nabla F_{\delta,1}(x)^{\top}(x-x_{\star})+p_{\text{fail}}\nabla F_{\delta,2}(x)^{\top}(x-x_{\star}).

The first term on the right-hand side can be bounded in a manner akin to (10). For the second term, we will show in Section 4 that, with high probability, its absolute value can be uniformly bounded by $C_{4}p_{\text{fail}}\|x-x_{\star}\|_{2}$ within $R_{0}$, where $C_{4}$ is a positive constant independent of $\delta$ and $p_{\text{fail}}$. This quantifies the impact corrupted measurements have on the gradient. When $p_{\text{fail}}$ is sufficiently small, these bounds suggest defining the vicinity $R_{1}$ around the true signal vectors within $R_{0}$ as

R_{1}=\{x\in\mathbb{R}^{p}:\Delta(x)\leq r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})\},

for a positive constant $r_{0}$, such that no stationary points are found in $R_{0}\setminus R_{1}$. Therefore, if an algorithm minimizing $F_{\delta}(x)$ converges to a local minimum $x_{0,\star}$ within $R_{0}$, that minimum must be located in $R_{1}$. This illustrates the robustness of employing $F_{\delta}(x)$: the algorithm converges to a point near the true signal vectors. In a noiseless context, this point coincides with one of the true signal vectors. In cases with corruption, and when $\delta p_{\text{fail}}/(1-p_{\text{fail}})$ is small, the convergence point is close to either $x_{\star}$ or $-x_{\star}$, serving as an effective warm start for the IPL algorithm, which minimizes $F(x)$ with notable local efficiency and thereby exactly finds one of the true signal vectors.

Addressing robustness alone is not sufficient given the nonconvex nature of $F_{\delta}(x)$. We also need to demonstrate the absence of spurious local minima outside the region $R_{0}$ to ensure the efficacy of algorithms with random initialization. In Sections 3 and 4, we conduct landscape analyses to identify $R_{0}$ and to illustrate the benign landscape characteristics beyond $R_{0}$.

Remark: Inner product quantities akin to $\nabla F_{\delta}(x)^{\top}(x-x_{\star})$, which involve the interaction between the (sub)gradient and the deviation of $x$ from the true vector, are pivotal in the landscape analysis of various nonconvex formulations [10, 24, 32, 8]. Specifically, [24] leveraged such quantities in the global landscape analysis of single-index models under the essential assumption of a strictly monotone link function. Given that the squared link function in phase retrieval does not satisfy this monotonicity requirement, the usage of inner product quantities in our paper, [32], and [8] is confined to local vicinities. In these works, nonlocal landscape analyses are conducted through the examination of Hessian-related quantities. A distinctive aspect of our study is the incorporation of corruption into the analysis, setting it apart from the existing literature. The comprehensive details of this approach will be elaborated upon in Section 4.

Finally, we state the following normality assumption, which is stronger than both Assumptions 3 and 4.

Assumption 5.

For any $i\in[n]$, $a_{i}\sim N(0,I_{p})$.

Assumption 5 was used in [11, 10, 32, 8] to study initialization and landscape analysis for phase retrieval problems.

2.4 Algorithms and Rate of Convergence

In this subsection, we discuss algorithms for minimizing $F_{\delta}(x)$. We will show that, under the noiseless situation, with high probability, gradient descent on $F_{\delta}(x)$ enjoys local linear convergence. We first give a lemma establishing a Lipschitz constant $\mu_{max}/\delta$ for $\nabla F_{\delta}(x)$ with high probability.

Lemma 3.

Under Assumptions 1, 2, and 5, when $p_{\text{fail}}=\gamma=0$, $\|x_{\star}\|_{2}=1$, and $p\geq 3$, there exist positive constants $\mu_{max},c,c^{\prime},C$ such that, if $n\geq cp\log p$, then with probability at least $1-C\exp(-c^{\prime}n/\log n)-C/n$, we have

\sup_{x\in\mathbb{R}^{p}}\left\|\partial^{2}F_{\delta}(x)/\partial x\partial x^{\top}\right\|_{2}\leq\mu_{max}/\delta.\qquad(11)

Here, $\mu_{max}$ does not depend on $\delta$, but $c,c^{\prime},C$ depend on $\delta$.

Given the above lemma, the local linear convergence of gradient descent can be established, as shown in Theorem 1.

Theorem 1.

Suppose that (7), (9) and (11) hold with $\lambda_{s},\rho,\mu_{max}>0$ and $p_{\text{fail}}=\gamma=0$, and that $\{x_{k}\}_{k=0}^{\infty}\subseteq\mathbb{R}^{p}$ is generated by the gradient descent algorithm that iterates as follows:

x_{k+1}=x_{k}-t\nabla F_{\delta}(x_{k}),\quad k=0,1,2,\ldots

with $t=\delta/\mu_{max}$. If $\Delta(x_{0})\leq\min\left\{1,\sqrt{\frac{2\lambda_{s}}{\mu_{max}}}\right\}\delta$ and $0<\delta\leq\lambda_{s}/\rho$, we have

F_{\delta}(x_{k+1})-F_{\delta}(x_{\star})\leq\left(1-\lambda_{s}/(8\mu_{max})\right)\left(F_{\delta}(x_{k})-F_{\delta}(x_{\star})\right),\quad\forall k\in\mathbb{N}.\qquad(12)

In (12), if we let $k=0$ and choose $\Delta(x_{0})>0$, we have $\lambda_{s}/(8\mu_{max})\leq 1$. Thus, we can define $\kappa:=(8\mu_{max})/\lambda_{s}\geq 1$ as the condition number. It is worth pointing out that $\kappa$ is not related to $\delta$ or $p$, so the convergence rate is statistically stable.
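The iteration in Theorem 1 is plain gradient descent with the fixed step $t=\delta/\mu_{max}$. Below is a minimal sketch, assuming a Gaussian kernel and treating mu_max as a user-chosen constant (the constant in Lemma 3 is not explicit); the names are ours, and this is not the authors' implementation.

import numpy as np
from scipy.stats import norm

def grad_F_delta(x, A, b, delta):
    # Gradient of (5) for the Gaussian kernel: l'_delta(r) = 2*Phi(r/delta) - 1.
    Ax = A @ x
    return A.T @ (2 * (2 * norm.cdf((Ax**2 - b) / delta) - 1) * Ax) / len(b)

def srpr_gd(x0, A, b, delta, mu_max=10.0, n_iter=2000):
    """Theorem 1 iteration: x_{k+1} = x_k - t * grad F_delta(x_k), t = delta/mu_max."""
    x, t = x0.copy(), delta / mu_max
    for _ in range(n_iter):
        x = x - t * grad_F_delta(x, A, b, delta)
    return x

In practice, a line search can replace the fixed step; the theorem only requires the step to respect the Lipschitz bound (11).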

Next, we comment on the technical details used to reach this result. Linear convergence usually appears in strongly convex situations. When investigating the Hessian matrix $\frac{\partial^{2}F_{\delta}(x)}{\partial x\partial x^{\top}}$, $\frac{\mu_{max}}{\delta}$ provides an upper bound for $\left\|\frac{\partial^{2}F_{\delta}(x)}{\partial x\partial x^{\top}}\right\|_{2}$. However, it is hard to find a positive lower bound for the smallest eigenvalue of the Hessian matrix, even within the local vicinity of the true signal vectors. So, we turn to the Polyak-Lojasiewicz condition introduced in [21], which yields linear convergence of the gradient descent algorithm for minimizing a function $f(x)$ under the assumption $\frac{1}{2}\|\nabla f(x)\|_{2}^{2}\geq\mu\left(f(x)-\min_{x}f(x)\right)$ with $\mu>0$. We can prove that when (7), (9) and (11) hold with $\lambda_{s},\rho,\mu_{max}>0$, for any $\delta>0$ that is small enough,

\frac{1}{2}\|\nabla F_{\delta}(x)\|_{2}^{2}\geq\frac{\mu_{min}}{\delta}\left(F_{\delta}(x)-F_{\delta}(x_{\star})\right),\quad\forall x\in\{x\in\mathbb{R}^{p}:\Delta(x)\leq C\delta\},

in which $\mu_{min}$ and $C$ are positive constants that do not depend on the selection of $\delta$. With these conditions, we can obtain the convergence rate.

Linear convergence guarantees that minimizing the smoothed objective function $F_{\delta}(x)$ is not only sufficient for phase retrieval in the noiseless scenario but also more efficient than PL [11] and IPL [38]. This improvement stems from the methodologies adopted by PL, which solves its subproblem using the POGS algorithm [28], and IPL, which solves the same subproblem through accelerated proximal gradient descent tailored to the dual problem; both are characterized by sub-linear convergence rates. Furthermore, a local linear convergence rate has been established for various phase retrieval algorithms, including the Wirtinger flow [4], truncated Wirtinger flow [7], truncated amplitude flow [35], and subgradient methods [9, 10]. Smoothed robust phase retrieval thus joins this list of methodologies. In addition, Theorem 1 and (7) indicate that, for any $\varepsilon>0$, vanilla gradient descent on the smoothed robust objective function $F_{\delta}(x)$ can find an $x$ such that $\Delta(x)\leq\varepsilon$ within at most $O(\kappa\log\frac{1}{\varepsilon})$ iterations. This is better than the $O(\kappa^{2}\log\frac{1}{\varepsilon})$ iterations of the subgradient algorithms [9, 10] using the robust $\ell_{1}$-loss. In our numerical experiments in Section 5, we employ monotone accelerated gradient descent (as outlined in Algorithm 1 of [22]) combined with line search techniques to achieve better overall efficiency. These experiments demonstrate that smoothed robust phase retrieval exhibits superior numerical efficiency on noiseless datasets when compared to both PL and IPL.

2.5 Overview of Benign Landscape

In this subsection, we provide an overview of the benign landscape of $F_{\delta}(x)$. Its gradient and Hessian matrix help us find and classify the stationary points, and we discuss the different kinds of benign landscapes studied in this paper.

(a) Noiseless population landscape. In this case, $p_{\text{fail}}=\gamma=0$ and we focus on the stationary points of $\mathbb{E}\left[F_{\delta}(x)\right]$. Under some assumptions, we exactly locate all the stationary points and prove that only $\{x_{\star},-x_{\star}\}$ are local minima.

(b) Noiseless empirical landscape. In this case, $p_{\text{fail}}=\gamma=0$ and we care about the stationary points of $F_{\delta}(x)$. Under some assumptions, we show that, with high probability, the stationary points are close to those of the noiseless population landscape, and only $\{x_{\star},-x_{\star}\}$ are local minima.

(c) Empirical landscape with infrequent corruptions only. In this case, $p_{\text{fail}}>0,\gamma=0$ and we study the stationary points of $F_{\delta}(x)$. Under some assumptions, we show that the stationary points are close to those of the noiseless population landscape and all the local minima are close to $\{x_{\star},-x_{\star}\}$ with high probability. Specifically, all the local minima lie in the set $\left\{x\in\mathbb{R}^{p}:\Delta(x)\leq r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})\right\}$, in which $\Delta(x)=\mathrm{dist}(x,\{x_{\star},-x_{\star}\})$ and $r_{0}$ is a positive constant. Thus, if an optimization algorithm finds a local minimum of $F_{\delta}(x)$, its gap to $\{x_{\star},-x_{\star}\}$ is small when both $\delta$ and $p_{\text{fail}}$ are small.

(d) Empirical landscape with infrequent corruptions and bounded noises. When $p_{\text{fail}}>0,\gamma>0$, similar results can be obtained: when $\delta,\gamma,p_{\text{fail}}$ are small enough and $\delta\geq\Omega(\sqrt{\gamma})$, with high probability, all the local minima lie in the set $\{x\in\mathbb{R}^{p}:\Delta(x)\leq\tilde{r}_{0}\max\{\sqrt{\delta\gamma},\delta p_{\text{fail}}/(1-p_{\text{fail}})\}\}$, where $\tilde{r}_{0}$ is a positive constant.

Now, we describe the procedure of smoothed robust phase retrieval. We first apply a gradient-based algorithm to minimize $F_{\delta}(x)$ and find a local minimum $x_{0,\star}$. Due to the benign landscapes described above, we expect both random initialization and modified spectral initialization to work well, so that $x_{0,\star}$ is close to $x_{\star}$ or $-x_{\star}$. Second, in the noiseless situation, we can claim that $x_{0,\star}$ is already a true signal vector. In the situation with noise, the benign landscape guarantees that $x_{0,\star}$ is close to one of the true signal vectors, and it is treated as a warm start for IPL, which minimizes $F(x)$ for exact recovery.

3 Landscape Analysis in the Noiseless Case

In this section, we focus on the noiseless case with $p_{\text{fail}}=\gamma=0$ and analyze both the population and empirical landscapes. Without loss of generality, we assume throughout this section that $\|x_{\star}\|_{2}=1$.

3.1 Notations

Denote $L(x;a)=l_{\delta}((a^{\top}x)^{2}-(a^{\top}x_{\star})^{2})$ for $x,a\in\mathbb{R}^{p}$. The gradient vector is $\frac{\partial L(x;a)}{\partial x}=\left\{2l^{\prime}_{\delta}\left((a^{\top}x)^{2}-(a^{\top}x_{\star})^{2}\right)aa^{\top}\right\}x$ and the Hessian matrix is $\frac{\partial^{2}L(x;a)}{\partial x\partial x^{\top}}=2l^{\prime}_{\delta}((a^{\top}x)^{2}-(a^{\top}x_{\star})^{2})aa^{\top}+4(a^{\top}x)^{2}l^{\prime\prime}_{\delta}((a^{\top}x)^{2}-(a^{\top}x_{\star})^{2})aa^{\top}$. The empirical and population versions of the objective function can be written as $F_{\delta}(x)=\mathbb{E}_{n}\left[L(x;a)\right]$ and $\mathbb{E}\left[F_{\delta}(x)\right]=\mathbb{E}\left[L(x;a)\right]$. They have continuous gradient vectors and Hessian matrices: $\nabla F_{\delta}(x)=\mathbb{E}_{n}\left[\nabla L(x;a)\right]$, $\nabla\mathbb{E}\left[F_{\delta}(x)\right]=\mathbb{E}\left[\nabla L(x;a)\right]$, $\frac{\partial^{2}}{\partial x\partial x^{\top}}F_{\delta}(x)=\mathbb{E}_{n}\left[\frac{\partial^{2}}{\partial x\partial x^{\top}}L(x;a)\right]$, and $\frac{\partial^{2}}{\partial x\partial x^{\top}}\mathbb{E}\left[F_{\delta}(x)\right]=\mathbb{E}\left[\frac{\partial^{2}}{\partial x\partial x^{\top}}L(x;a)\right]$. We will use the Hessian matrix to classify the stationary points and establish the benign noiseless population landscape for $\mathbb{E}\left[F_{\delta}(x)\right]$ and the empirical landscape for $F_{\delta}(x)$.

Next, we introduce three quantities that are used to characterize the landscape. Define

U_{1}(x;a)=x^{\top}\frac{\partial L(x;a)}{\partial x}=2(a^{\top}x)^{2}l^{\prime}_{\delta}((a^{\top}x)^{2}-(a^{\top}x_{\star})^{2}),

U_{2}(x;a)=x_{\star}^{\top}\left[\frac{\partial^{2}L(x;a)}{\partial x\partial x^{\top}}\right]x_{\star},

and

U_{3}(x;a)=x_{\star}^{\top}\frac{\partial L(x;a)}{\partial x}=2(a^{\top}x_{\star})l^{\prime}_{\delta}((a^{\top}x)^{2}-(a^{\top}x_{\star})^{2})(a^{\top}x).

Based on Assumption 5, it is easy to see that, for any given $x_{0}\in\mathbb{R}^{p}$ satisfying $\|x_{0}\|_{2}>0$, we have $z^{\top}\frac{\partial\mathbb{E}\left[F_{\delta}\right]}{\partial x}(x_{0})=0$ for all $z\in\{x\in\mathbb{R}^{p}:x^{\top}x_{0}=x^{\top}x_{\star}=0\}$. This implies that most directional derivatives in $\mathbb{R}^{p}$ vanish, and we only need to consider the remaining two directions. To be specific,

\mathbb{E}\left[U_{1}(x;a)\right]=\mathbb{E}\left[U_{3}(x;a)\right]=0\iff\frac{\partial\mathbb{E}\left[F_{\delta}(x)\right]}{\partial x}=0.

For $U_{2}(x;a)$, it is easy to see that $\mathbb{E}\left[U_{2}(x_{0};a)\right]<0$ is sufficient for concluding that the Hessian matrix of $\mathbb{E}\left[F_{\delta}(x)\right]$ at $x_{0}$ has at least one negative eigenvalue, which indicates that $x_{0}$ is not a local minimum. Similar properties hold for the empirical counterparts:

\mathbb{E}_{n}\left[U_{1}(x;a)\right]\neq 0\ \text{ or }\ \mathbb{E}_{n}\left[U_{3}(x;a)\right]\neq 0\implies\frac{\partial\mathbb{E}_{n}\left[F_{\delta}\right]}{\partial x}(x)\neq 0.

\mathbb{E}_{n}\left[U_{2}(x;a)\right]<0\implies x\ \text{ is not a local minimum of }\ F_{\delta}(x).
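For intuition, the following sketch (our own code, assuming a Gaussian kernel and noiseless measurements $b_{i}=(a_{i}^{\top}x_{\star})^{2}$) evaluates the empirical diagnostics $\mathbb{E}_{n}[U_{1}]$, $\mathbb{E}_{n}[U_{2}]$, and $\mathbb{E}_{n}[U_{3}]$ at a candidate point, using $l^{\prime}_{\delta}(r)=2\Phi(r/\delta)-1$ and $l^{\prime\prime}_{\delta}(r)=2K_{\delta}(r)$.

import numpy as np
from scipy.stats import norm

def U_diagnostics(x, x_star, A, delta):
    """Empirical means of U_1, U_2, U_3 at x for noiseless data."""
    Ax, Axs = A @ x, A @ x_star
    r = Ax**2 - Axs**2
    lp = 2 * norm.cdf(r / delta) - 1          # l'_delta(r)
    lpp = 2 * norm.pdf(r / delta) / delta     # l''_delta(r) = 2 * K_delta(r)
    U1 = 2 * Ax**2 * lp                       # x^T (dL/dx)
    U2 = Axs**2 * (2 * lp + 4 * Ax**2 * lpp)  # x_star^T (Hessian of L) x_star
    U3 = 2 * Axs * lp * Ax                    # x_star^T (dL/dx)
    return U1.mean(), U2.mean(), U3.mean()

Evaluating these means at points orthogonal to $x_{\star}$, for example, numerically mirrors the population analysis in the next subsection.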

3.2 Population Landscape

In this subsection, we give the conclusions for the stationary points of $\mathbb{E}\left[F_{\delta}(x)\right]$. Denote

S_{0}=\{x\in\mathbb{R}^{p}:x^{\top}x_{\star}=0,\|x\|_{2}=u(\delta)\}

for some positive constant $u(\delta)$, whose rigorous definition will be given in Theorem 2 below. To aid understanding, we also provide a graphical illustration in Figure 1. We express vectors in the phase space $\mathbb{R}^{p}$ in a two-dimensional coordinate system. For each vector $x\in\mathbb{R}^{p}$, we decompose it into the direction of $x_{\star}$ and an orthogonal direction as $x=ux_{\star}+vx_{\star}^{\perp}$, where $x_{\star}^{\top}x_{\star}^{\perp}=0$, $\|x_{\star}^{\perp}\|_{2}=1$, and $u,v\in\mathbb{R}$. This decomposition exists for any $x\in\mathbb{R}^{p}$, and for a given $x$, $u$ and $|v|$ are unique. We let a point with coordinates $(u,v)$ denote the subset of $\mathbb{R}^{p}$ whose elements admit a decomposition with coefficients $u$ and $v$:

\left\{x\in\mathbb{R}^{p}:x^{\top}x_{\star}=u,\left\|x-(x^{\top}x_{\star})x_{\star}\right\|_{2}=|v|\right\}.

So, in Figure 1, if we let the first coordinates of $A,B$ be $-1$ and $1$ and the second coordinates of $C,D$ be $u(\delta)$ and $-u(\delta)$, then $A,B$ represent the true signal vectors, $C,D$ represent points in $S_{0}$, and the origin represents $0\in\mathbb{R}^{p}$. For the noiseless population landscape, we can show that when $\delta<\delta_{0}$ for some positive constant $\delta_{0}$, there exists a unique positive constant $u(\delta)$ such that the stationary point set is $S_{0}\bigcup\{0,x_{\star},-x_{\star}\}$, among which only the global minima $x_{\star}$ and $-x_{\star}$ are local minima. This conclusion corresponds to Figure 1, in which $A,B,O,C,D$ are stationary points and only $A,B$ are local minima. The following results describe the conclusion rigorously.

Figure 1: An illustration of the noiseless population landscape.
Theorem 2.

Under Assumptions 1, 2 and 5, when $p_{\text{fail}}=\gamma=0$ and $\|x_{\star}\|_{2}=1$, for any given $\delta>0$, there exists a unique $u(\delta)>0$ such that the stationary point set of $\mathbb{E}\left[F_{\delta}(x)\right]$, $\mathbf{S}:=\left\{x\in\mathbb{R}^{p}:\frac{\partial\mathbb{E}\left[F_{\delta}(x)\right]}{\partial x}=0\right\}$, becomes

\mathbf{S}=\left\{0,x_{\star},-x_{\star}\right\}\bigcup\left\{x\in\mathbb{R}^{p}:x^{\top}x_{\star}=0,\|x\|_{2}=u(\delta)\right\}.

The conclusion of Theorem 2 is similar to that of [10]. Next, we comment on some key ideas behind the proof. It is easy to see that $\{0,x_{\star},-x_{\star}\}$ are all stationary points. For $x\in\{x\in\mathbb{R}^{p}:x^{\top}x_{\star}=0,\|x\|_{2}>0\}$, it is also easy to see that $\mathbb{E}\left[U_{3}(x;a)\right]=0$, so we only need to solve the equation $\mathbb{E}\left[U_{1}(x;a)\right]=0$. Based on the isotropy of the normal distribution, we can calculate $\mathbb{E}\left[U_{1}(x;a)\right]$ by assuming $x_{\star}=e_{1}$ and $x=ue_{2}$ with $u>0$. Then $u(\delta)$ is the solution to the equation $0=\mathbb{E}\left[l^{\prime}_{\delta}(u^{2}(a^{\top}e_{2})^{2}-(a^{\top}e_{1})^{2})(a^{\top}e_{2})^{2}\right]$ in $u>0$. For the complete proof, we also need to explain why $x$ is not a stationary point when $x^{\top}x_{\star}\neq 0$ and $\Delta(x)>0$; the proof is a direct application of [10]'s work on spectral functions. Please refer to the appendix in the supplement for details.

With access to the Hessian matrix, we can also classify these stationary points. The conclusion is given in Theorem 3.

Theorem 3.

Under Assumptions 1, 2, and 5, when $p_{\text{fail}}=\gamma=0$ and $\|x_{\star}\|_{2}=1$, the following properties hold for $\mathbb{E}\left[F_{\delta}(x)\right]$.

(a) $x_{\star}$ and $-x_{\star}$ are local and global minima.

(b) $0\in\mathbb{R}^{p}$ is a local maximum.

(c) There exists a constant $\delta_{0}>0$ such that for any $\delta\in(0,\delta_{0})$, $\mathbb{E}\left[U_{2}(x;a)\right]<0$ for all $x\in S_{0}$. This means that stationary points in $S_{0}$ are not local minima.

Theorem 3 classifies all the stationary points of $\mathbb{E}\left[F_{\delta}(x)\right]$ and establishes the benign noiseless population landscape: there is no spurious local minimum for $\mathbb{E}\left[F_{\delta}(x)\right]$. Another interesting result found in the proof of Theorem 3 is that the limits $\lim_{\delta\rightarrow 0}u(\delta)$ and $\lim_{\delta\rightarrow 0}\mathbb{E}\left[U_{2}(x;a)\right]$ do not depend on the selection of the kernel $K(x)$. It shows that

\lim_{\delta\rightarrow 0}u(\delta)=u_{0}\approx 0.44,

where $u_{0}$ is the unique solution of the equation $\frac{\pi}{4}=\arctan u+\frac{u}{u^{2}+1}$, and

\lim_{\delta\rightarrow 0}\mathbb{E}\left[U_{2}(x;a)\right]=\frac{16u^{3}}{\pi(1+u^{2})^{2}}+\frac{8}{\pi}\left(\arctan(u)-\frac{\pi}{4}-\frac{u}{u^{2}+1}\right)

for any $x$ that satisfies $x^{\top}x_{\star}=0$ and $\|x\|_{2}=u>0$. When $u=u_{0}$, $\lim_{\delta\rightarrow 0}\mathbb{E}\left[U_{2}(x;a)\right]\approx-1.57$, which is negative. This implies that these limiting quantities are intrinsic to the $\ell_{1}$-loss rather than to any particular kernel.
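These limiting constants are easy to verify numerically; the short check below (our own code) solves $\pi/4=\arctan u+u/(u^{2}+1)$ for $u_{0}$ and evaluates the $U_{2}$ limit at $u=u_{0}$.

import numpy as np
from scipy.optimize import brentq

g = lambda u: np.arctan(u) + u / (u**2 + 1) - np.pi / 4
u0 = brentq(g, 0.1, 1.0)                      # root of pi/4 = arctan(u) + u/(u^2+1)
lim_U2 = 16 * u0**3 / (np.pi * (1 + u0**2)**2) \
    + (8 / np.pi) * (np.arctan(u0) - np.pi / 4 - u0 / (u0**2 + 1))
print(round(u0, 2), round(lim_U2, 2))         # prints 0.44 -1.57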

Finally, we give a lemma on the positivity of $\mathbb{E}\left[U_{1}(x;a)\right]$ when $\|x\|_{2}$ is very large.

Lemma 4.

Under Assumptions 1, 2 and 5, when $p_{\text{fail}}=\gamma=0$, there exists a positive constant $M>0$ such that

\inf_{x\in\mathbb{R}^{p},\|x\|_{2}\geq M}\ \mathbb{E}\left[U_{1}(x;a)\right]>0.

Lemma 4 further establishes the non-vanishing of the gradient when $\|x\|_{2}$ is large and will be useful in reaching conclusions about the empirical landscape.

3.3 Empirical Landscape

In this subsection, we first show the conclusions for the benign empirical landscape and then provide a sketch of the proof.

We still use visualizations to help explain the landscape. Under some conditions, we can show that, with high probability, all the stationary points lie in the set

\mathbf{S}_{1}=\{0,x_{\star},-x_{\star}\}\bigcup\left\{x\in\mathbb{R}^{p}:\mathrm{dist}(x,S_{0}\bigcup\{0\})\leq\epsilon\right\}

for any small positive constant $\epsilon$, among which only the global minima $x_{\star}$ and $-x_{\star}$ are local minima. In other words, a stationary point is either a true phase or lies in the vicinity of $S_{0}\bigcup\{0\}$. This conclusion corresponds to Figure 2, in which the possible stationary points correspond to $A,B$ and the interiors of the three circles, and only $A,B$ are local minima.

Figure 2: Noiseless empirical landscape.

Before giving the formal conclusion, we define the following events.

\mathcal{A}=\{\text{there exists a stationary point of }F_{\delta}(x)\text{ that does not belong to the set }\mathbf{S}_{1}\}.

\mathcal{B}=\{\text{there exists a local minimum of }F_{\delta}(x)\text{ that does not belong to the set }\{x_{\star},-x_{\star}\}\}.

Formally, the conclusion is as follows.

Theorem 4.

Under Assumptions 1, 2 and 5, when $p_{\text{fail}}=\gamma=0$ and $\|x_{\star}\|_{2}=1$, there exists a constant $\delta_{0}>0$ such that, for any $\delta\in(0,\delta_{0})$ and $\epsilon>0$, if $n\geq cp\log p$, we have

P(\mathcal{A}\cup\mathcal{B})\leq C\exp(-c^{\prime}n)+\frac{C}{n},

where $c,c^{\prime},C$ are positive constants that depend on $\delta$ and $\epsilon$.

Theorem 4 shows that the complementary event $\mathcal{A}^{c}\bigcap\mathcal{B}^{c}$, which corresponds to the benign noiseless empirical landscape, happens with high probability. So, if we use a gradient-based optimization algorithm with random initialization to minimize $F_{\delta}(x)$ and it finds a local minimum, the iterative sequence $\{x_{k}\}_{k=0}^{\infty}$ will converge to one of $\{x_{\star},-x_{\star}\}$. In addition, under the benign landscape, although theoretical guarantees of global convergence for first-order algorithms require additional tricks [19, 20] because of the existence of saddle points, the numerical performance of robust phase retrieval indicates that we do not need those tricks for good performance. Readers interested in saddle points can refer to [20] and the references therein. The global convergence guarantee ensures that our discussion of the local linear convergence rate in Section 2.4 is meaningful.

For the rest of this subsection, we provide a sketch of the proof. In the first step, we focus on a large bounded area $R_{2}=\{x\in\mathbb{R}^{p}:\|x\|_{2}\leq M\}$ and prove that $\mathbb{E}_{n}\left[U_{i}(x;a)\right]$, $i=1,2,3$, all uniformly concentrate around their population versions with high probability. These quantities are related to the gradient and Hessian matrix of $F_{\delta}(x)$. If we do not consider the high-dimensional situation for $p$, it is easy to apply the central limit theorem to each element of the gradient and Hessian and claim convergence. However, in the high-dimensional case, when we can only assume that $n\geq\Omega(cp\log p)$, we can only focus on selected key quantities, namely $U_{1}(x;a)$, $U_{2}(x;a)$ and $U_{3}(x;a)$. Fortunately, they are sufficient to support our claim about the empirical landscape. Next, we state our conclusions.

Lemma 5.

Under Assumptions 1, 2, and 5, for any given $\delta,M>0$ and $t>0$, when $n\geq cp\log p$, $p\geq 3$, and $\|x_{\star}\|_{2}=1$, the following relationships hold.

P\left(\sup_{x\in B_{p}(0,M)}\left|\mathbb{E}_{n}\left[U_{1}(x;a)\right]-\mathbb{E}\left[U_{1}(x;a)\right]\right|\geq t\right)\leq C\exp\left(-c^{\prime}n\right)+\frac{C}{n}.\qquad(13)

P\left(\sup_{x\in B_{p}(0,M)}\left|\mathbb{E}_{n}\left[U_{2}(x;a)\right]-\mathbb{E}\left[U_{2}(x;a)\right]\right|\geq t\right)\leq C\exp\left(-c^{\prime}n\right)+\frac{C}{n}.\qquad(14)

P\left(\sup_{x\in B_{p}(0,M)}\left|\mathbb{E}_{n}\left[U_{3}(x;a)\right]-\mathbb{E}\left[U_{3}(x;a)\right]\right|\geq t\right)\leq C\exp\left(-c^{\prime}n\right)+\frac{C}{n}.\qquad(15)

Here, $c\geq 1$, $c^{\prime}$, and $C$ are positive constants that depend on $\delta,M,t$ but not on $n$ or $p$.

Lemma 5 guarantees the uniform concentration of the selected quantities within a large bounded subset of $\mathbb{R}^{p}$ when $n$ is large. This indicates that, within this area, stationary points and local minima of $F_{\delta}(x)$ should be close to those of $\mathbb{E}\left[F_{\delta}(x)\right]$. It is also worth mentioning that, as inferred from the proof, a smaller $\delta$ requires a larger $n$ for these quantities to get close to their population versions.

In the second step, we go beyond closeness within a local area that contains the true signal vectors. Local landscape analysis strengthens the conclusion by showing that, within such a local area, $\nabla F_{\delta}(x)$ is non-zero whenever $x\notin\{x_{\star},-x_{\star}\}$, which means that there is no other stationary point in the local area. The two key properties for this part are the generalized sharpness condition given in Lemma 1 and the weak convexity condition given in Lemma 2. With them, we can establish the following lower bound on the norm of the gradient and claim the benign local landscape.

Lemma 6.

Suppose that $F_{\delta}(x)$ is weakly convex with parameter $\rho>0$ as given in Lemma 2 and also satisfies

$F_{\delta}(x)-F_{\delta}(x_{\star})\geq\lambda_{s}\min\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\},\ \forall x\in\mathbb{R}^{p}$ (16)

as given in Lemma 1 for some positive constant $\lambda_{s}$. Then we have, for all $\delta\in(0,\lambda_{s}/\rho]$ and $x\in\{x\in\mathbb{R}^{p}:\|x-x_{\star}\|_{2}\leq\lambda_{s}/\rho\}$,

$\mathbb{E}_{n}\left[U_{1}(x;a)\right]-\mathbb{E}_{n}\left[U_{3}(x;a)\right]=\nabla F_{\delta}(x)^{\top}(x-x_{\star})\geq\frac{\lambda_{s}}{2}\min\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\}.$

The symmetric statement also holds with $x_{\star}$ replaced by $-x_{\star}$, with the inner product taken against $x+x_{\star}$ instead of $x-x_{\star}$. Combining the two cases, we obtain

$\|\nabla F_{\delta}(x)\|_{2}>0,\ \forall\ x\in\{x\in\mathbb{R}^{p}:0<\Delta(x)\leq\lambda_{s}/\rho\}.$

Finally, having handled the landscape within $R_{2}$, we prove that the unbounded area $\mathbb{R}^{p}\backslash R_{2}$ contains no stationary point by showing that $\mathbb{E}_{n}[U_{1}(x;a)]$ is positive on the boundary and that $\frac{1}{u^{2}}\mathbb{E}_{n}[U_{1}(ux_{0};a)]$ is monotone increasing in $u>0$ for any $x_{0}\in\partial R_{2}$. The monotonicity is formally given in Lemma 7.

Lemma 7.

For any fixed $x_{0}\in\mathbb{R}^{p}$ with $\|x_{0}\|_{2}>0$, the function $h_{n}(u)=\frac{\mathbb{E}_{n}\left[U_{1}(ux_{0};a)\right]}{u^{2}}$ is monotone increasing on $(0,+\infty)$.

Lemma 7 follows immediately because $l^{\prime}_{\delta}(\cdot)$ is monotone increasing: writing $h_{n}(u)=2\,\mathbb{E}_{n}\left[(a^{\top}x_{0})^{2}\,l^{\prime}_{\delta}\left(u^{2}(a^{\top}x_{0})^{2}-(a^{\top}x_{\star})^{2}\right)\right]$, each term is nondecreasing in $u$. It directly yields the following result about the unbounded area.

Lemma 8.

If there exists a constant $M>0$ such that $\inf_{x\in\mathbb{R}^{p}:\|x\|_{2}=M}\mathbb{E}_{n}\left[U_{1}(x;a)\right]>0$, then $\inf_{x\in\mathbb{R}^{p}:\|x\|_{2}\geq M}\mathbb{E}_{n}\left[U_{1}(x;a)\right]>0$.

The positivity of $\mathbb{E}_{n}[U_{1}(x;a)]=x^{\top}\nabla F_{\delta}(x)$ implies that there is no stationary point in this area.

4 Empirical Landscape in the Presence of Noise

In this section, we discuss the corrupted situation and the empirical landscape of $F_{\delta}(x)$. Without loss of generality, we still assume that $\|x_{\star}\|_{2}=1$ throughout this section. We follow the statistical assumptions in Section 2 with the general corruption setting $p_{\text{fail}}>0$. Subsection 4.1 first provides the results for the case $\gamma=0$ in Assumption 1 and then extends them to the case $\gamma\geq 0$. Subsection 4.2 provides a sketch of the proof.

4.1 Benign Landscape When Sparse Corruptions Exist

In this subsection, we first discuss the situation where $p_{\text{fail}}>0$ and $\gamma=0$ in Assumption 1. We will show that, under some conditions, with high probability, all the stationary points of $F_{\delta}(x)$ lie in the set

$\mathbf{S}_{3}=\left\{x\in\mathbb{R}^{p}:\Delta(x)\leq r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})\right\}\bigcup\left\{x\in\mathbb{R}^{p}:\mbox{dist}(x,S_{0}\bigcup\{0\})\leq\epsilon\right\}$

for a positive constant $\epsilon$ related to $p_{\text{fail}}$ and a positive constant $r_{0}$; among these stationary points, the local minima lie in

$\mathbf{S}_{4}=\left\{x\in\mathbb{R}^{p}:\Delta(x)\leq r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})\right\}.$

In other words, a stationary point lies either in the vicinity of $\{x_{\star},-x_{\star}\}$, whose radius is proportional to $\delta p_{\text{fail}}/(1-p_{\text{fail}})$, or in the vicinity of $S_{0}\bigcup\{0\}$, and all local minima lie in the vicinity of the true phase. This conclusion corresponds to Figure 3, in which the possible stationary points correspond to the interiors of the five circles and only the interiors of circles $A$ and $B$ contain local minima. The rigorous statement is given in Theorem 5.

Theorem 5.

Suppose that $\|x_{\star}\|_{2}=1$, $p\geq 3$, and Assumptions 1, 2, and 5 hold with $\gamma=0$. We consider the following two events.

$\mathcal{C}=\{\textrm{there exists a stationary point of }F_{\delta}(x)\textrm{ that does not belong to the set }\mathbf{S}_{3}\}.$
$\mathcal{D}=\{\textrm{there exists a local minimum of }F_{\delta}(x)\textrm{ that does not belong to the set }\mathbf{S}_{4}\}.$

Then, for any $\epsilon>0$, there exist positive constants $\delta_{0}$, $r_{0}$, and $p_{\text{fail}}^{m}$ such that for any $\delta\in(0,\delta_{0})$ and $p_{\text{fail}}\in(0,p_{\text{fail}}^{m})$, when $n\geq cp\log p$,

$P(\mathcal{C}\bigcup\mathcal{D})\leq C\exp(-c^{\prime}n/\log n)+\frac{C}{n}.$

Here, the constants $c$, $c^{\prime}$, and $C$ depend on $\delta$, $\epsilon$, and $p_{\text{fail}}$, while $r_{0}$ does not depend on $\delta$, $\epsilon$, or $p_{\text{fail}}$.

In Theorem 5, we claim that the local minima of $F_{\delta}(x)$ only exist in the vicinity of the true phase $\{x_{\star},-x_{\star}\}$. The radius $r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})$ of the vicinity is proportional to the bandwidth and monotone increasing in the proportion of corrupted measurements. In practice, we need to choose a $\delta$. Although a smaller $\delta$ places the local minima closer to the true phase, the non-local optimization efficiency is still unclear for different choices of $\delta$, and decreasing $\delta$ also requires a larger $n$ due to the dependence of the constants in Theorem 5 on $\delta$. Next, we provide some guidance for picking $\delta$ when the covariance of $a_{i}$ is unknown and $\|x_{\star}\|_{2}$ is not necessarily 1. By homogeneity, we can let $\delta=\delta_{0}\|\mathbb{E}_{n}\left[aa^{\top}\right]\|_{2}\|x_{\star}\|_{2}^{2}$, in which $\delta_{0}$ is a small but not extremely small positive constant. Computing or roughly estimating $\|\mathbb{E}_{n}\left[aa^{\top}\right]\|_{2}$ is easy in practice, and a rough estimate of $\|x_{\star}\|_{2}$ can be obtained from prior knowledge or modified spectral initialization (Algorithm 3 in [11]). We will provide an example of image recovery in Section 5. For $\delta_{0}$, which should match the theoretical results based on the noiseless population landscape given in Theorem 3, we will conduct a sensitivity analysis in Section 5.
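To make this recipe concrete, the following minimal NumPy sketch computes such a $\delta$ from the sensing matrix and a rough norm estimate; the function and argument names (`choose_bandwidth`, `x_norm_est`, `delta0`) are ours, not from any existing package.

```python
import numpy as np

def choose_bandwidth(A, x_norm_est, delta0=0.25):
    """Heuristic bandwidth delta = delta0 * ||E_n[a a^T]||_2 * ||x_star||_2^2.

    A is the (n, p) matrix whose rows are the sensing vectors a_i, and
    x_norm_est is a rough estimate of ||x_star||_2, e.g., from prior
    knowledge or modified spectral initialization (Algorithm 3 in [11]).
    """
    n = A.shape[0]
    # ||E_n[a a^T]||_2 = ||A^T A||_2 / n = (largest singular value of A)^2 / n.
    cov_norm = np.linalg.norm(A, ord=2) ** 2 / n
    return delta0 * cov_norm * x_norm_est ** 2

# Example with standard Gaussian designs, where ||E[a a^T]||_2 = 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((4000, 500))
delta = choose_bandwidth(A, x_norm_est=1.0)
```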

Figure 3: Empirical Landscape with Corruptions

Next, Theorem 6 extends the benign landscape to the general situation where $\gamma\geq 0$.

Theorem 6.

Suppose that $\|x_{\star}\|_{2}=1$ and $p\geq 3$. Under Assumptions 1, 2, and 5, there exist positive constants $\delta_{0}$, $\tilde{r}_{0}$, $\gamma_{0}$, $c^{\prime\prime}$, and $p_{\text{fail}}^{m}$ such that when $n\geq cp\log p$, for any $\delta\in(0,\delta_{0})$, $\gamma\in(0,\gamma_{0})$ with $\delta\geq c^{\prime\prime}\sqrt{\gamma}$, and $p_{\text{fail}}\in(0,p_{\text{fail}}^{m})$, the probability that a local minimum of $F_{\delta}(x)$ exists and does not belong to the set

$\{x\in\mathbb{R}^{p}:\Delta(x)\leq\tilde{r}_{0}\max\{\sqrt{\delta\gamma},\delta p_{\text{fail}}/(1-p_{\text{fail}})\}\}$ (17)

is at most $C\exp(-c^{\prime}n/\log n)+\frac{C}{n}$. Here, $c$, $c^{\prime}$, and $C$ are positive constants that depend on $\delta$ and $p_{\text{fail}}$, while $\tilde{r}_{0}$ and $c^{\prime\prime}$ are positive constants that do not depend on $\delta$, $\gamma$, or $p_{\text{fail}}$.

Note that Theorem 5 is a special case of Theorem 6 with $\gamma=0$.

4.2 Sketch of Proof

We first provide a sketch of the proof of Theorem 5. Before we start, we define $L(x;a,b)=l_{\delta}((a^{\top}x)^{2}-b)$ for $x\in\mathbb{R}^{p}$, so that $F_{\delta}(x)$ in (5) can be written as $F_{\delta}(x)=\mathbb{E}_{n}[L(x;a,b)]$. A direct calculation gives $\frac{\partial L(x;a,b)}{\partial x}=\left\{2l^{\prime}_{\delta}\left((a^{\top}x)^{2}-b\right)aa^{\top}\right\}x$ and $\frac{\partial^{2}}{\partial x\partial x^{\top}}L(x;a,b)=2\left\{l^{\prime}_{\delta}((a^{\top}x)^{2}-b)aa^{\top}\right\}+4\left\{(a^{\top}x)^{2}l^{\prime\prime}_{\delta}((a^{\top}x)^{2}-b)aa^{\top}\right\}$. Similar to Section 3, we also define

$U_{1}(x;a,b)=x^{\top}\frac{\partial L(x;a,b)}{\partial x}=2(a^{\top}x)^{2}l^{\prime}_{\delta}((a^{\top}x)^{2}-b),$
$U_{2}(x;a,b)=x_{\star}^{\top}\left[\frac{\partial^{2}L(x;a,b)}{\partial x\partial x^{\top}}\right]x_{\star},$

and

$U_{3}(x;a,b)=x_{\star}^{\top}\frac{\partial L(x;a,b)}{\partial x}=2(a^{\top}x_{\star})l^{\prime}_{\delta}((a^{\top}x)^{2}-b)(a^{\top}x).$
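As an illustration, the following NumPy sketch evaluates these empirical quantities, taking the pseudo-Huber loss $l_{\delta}(x)=\sqrt{x^{2}+\delta^{2}}$ from Section 5 as a concrete choice of $l_{\delta}$; the helper names `pseudo_huber` and `U_quantities` are ours. The final lines numerically check the monotonicity of $h_{n}(u)$ asserted in Lemma 7 (and Lemma 11 below).

```python
import numpy as np

def pseudo_huber(r, delta):
    """l_delta(r) = sqrt(r^2 + delta^2) and its first two derivatives."""
    s = np.sqrt(r**2 + delta**2)
    return s, r / s, delta**2 / s**3

def U_quantities(x, x_star, A, b, delta):
    """Empirical means E_n[U_1], E_n[U_2], E_n[U_3] at x, vectorized over rows of A."""
    ax, ax_star = A @ x, A @ x_star
    _, l1, l2 = pseudo_huber(ax**2 - b, delta)
    U1 = np.mean(2 * ax**2 * l1)                          # x^T (dL/dx)
    U2 = np.mean((2 * l1 + 4 * ax**2 * l2) * ax_star**2)  # x_star^T Hessian x_star
    U3 = np.mean(2 * ax_star * l1 * ax)                   # x_star^T (dL/dx)
    return U1, U2, U3

# Sanity check of the monotonicity of h_n(u) = E_n[U_1(u x_0)] / u^2.
rng = np.random.default_rng(0)
n, p, delta = 2000, 50, 0.25
x_star = rng.standard_normal(p)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((n, p))
b = (A @ x_star) ** 2                                     # noiseless measurements
x0 = rng.standard_normal(p)
h = [U_quantities(u * x0, x_star, A, b, delta)[0] / u**2 for u in (0.5, 1.0, 2.0)]
assert h[0] <= h[1] <= h[2]
```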

Now we provide the proof sketch. First, we focus on a large bounded area $R_{2}=\{x\in\mathbb{R}^{p}:\|x\|_{2}\leq M\}$. The sample-mean operators satisfy $\mathbb{E}_{n}=(1-p_{\text{fail}})\mathbb{E}_{n_{1}}+p_{\text{fail}}\mathbb{E}_{n_{2}}$, where $\mathbb{E}_{n_{1}}$ takes the sample mean over indices from 1 to $n_{1}$ and $\mathbb{E}_{n_{2}}$ over indices from $n_{1}+1$ to $n_{1}+n_{2}$. When $p_{\text{fail}}$ is small, we can expect that, within $R_{2}$, the quantities defined above concentrate as usual under $(1-p_{\text{fail}})\mathbb{E}_{n_{1}}$, while their counterparts under $p_{\text{fail}}\mathbb{E}_{n_{2}}$ are bounded by small positive constants. The rigorous statement is as follows.

Lemma 9.

Under Assumptions 1, 2, and 5 with $\gamma=0$, $p\geq 3$, and $\|x_{\star}\|_{2}=1$, the following results hold.

  (a)

    For any given $M>0$, $t>0$, and $\delta>0$, when $n_{1}\geq c_{1}p\log p$, we have, for $i=1,2,3$,

    $P(\sup_{x\in B_{p}(0,M)}\left|\mathbb{E}_{n_{1}}\left[U_{i}(x;a,b)\right]-\mathbb{E}\left[U_{i}(x;a)\right]\right|\geq t)\leq C_{1}\exp\left(-c^{\prime}_{1}n_{1}\right)+\frac{C_{1}}{n_{1}}.$ (18)

    Here, $c_{1}$, $c^{\prime}_{1}$, and $C_{1}$ are positive constants that depend on $\delta$, $M$, and $t$.

  (b)

    For any $M>0$, $t>0$, and $\delta>0$, when $n\geq c_{2}p\log p$, we have, for $i=1,2,3$,

    $P(\sup_{x\in B_{p}(0,M)}\left|p_{\text{fail}}\mathbb{E}_{n_{2}}\left[U_{i}(x;a,b)\right]\right|\geq C^{\prime}_{2}p_{\text{fail}}+t)\leq C_{2}\exp\left(-nc^{\prime}_{2}/\log n\right)+\frac{C_{2}}{n}.$ (19)

    Here, $c_{2}$, $c^{\prime}_{2}$, $C_{2}$, and $C^{\prime}_{2}$ are all positive constants. $C^{\prime}_{2}$ depends on $M$ and $\delta$ but not on $t$ or $p_{\text{fail}}$; $c_{2}$, $c^{\prime}_{2}$, and $C_{2}$ depend on $M$, $t$, and $\delta$ but not on $p_{\text{fail}}$.

  (c)

    For any $M>0$, $t>0$, and $\delta>0$, when $n\geq c_{3}p\log p$, we have

    $P(\sup_{x\in B_{p}(0,M)}\left|p_{\text{fail}}\mathbb{E}_{n_{2}}\left[U_{1}(x;a,b)-U_{3}(x;a,b)\right]\right|\geq\left(C^{\prime}_{3}p_{\text{fail}}+t\right)\|x-x_{\star}\|_{2})\leq C_{3}\exp\left(-c^{\prime}_{3}n\right)+\frac{C_{3}}{n}.$ (20)

    Here, $c_{3}$, $c^{\prime}_{3}$, $C_{3}$, and $C^{\prime}_{3}$ are all positive constants. $C^{\prime}_{3}$ depends on $M$ but not on $\delta$, $t$, or $p_{\text{fail}}$; $c_{3}$, $c^{\prime}_{3}$, and $C_{3}$ depend on $M$, $t$, and $\delta$ but not on $p_{\text{fail}}$.

All the above conclusions also hold with $x_{\star}$ replaced by $-x_{\star}$.

Lemma 9(a) is a direct consequence of the noiseless case. For the others, observe that, when $n$ is large, $t$ can be taken very small, so the corruptions affect each of $U_{1}(x;a,b)$, $U_{2}(x;a,b)$, and $U_{3}(x;a,b)$ by at most a constant multiple of $p_{\text{fail}}$. This indicates that, as long as $p_{\text{fail}}$ is smaller than some constant, the corruption cannot overwhelm the contribution of the clean measurements for $x$ outside the vicinity of $\{x_{\star},-x_{\star}\}$. The last inequality will be used in the local landscape analysis to quantify the vicinity.

Second, we quantify the vicinity. So far, within $R_{2}$, we know that all the local minima are close to the true signal vectors. The local landscape analysis strengthens this conclusion by showing that, within a local area containing the true signal vectors, $\nabla F_{\delta}(x)$ is non-zero when $x\notin\{x\in\mathbb{R}^{p}:\Delta(x)\leq r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})\}$, which quantifies the distance of the stationary points in the local area from the true signal vectors. The key ingredient is the noiseless local landscape result in Lemma 6, which implies that, with high probability,

$\mathbb{E}_{n_{1}}\left[U_{1}(x;a,b)-U_{3}(x;a,b)\right]\geq\frac{\lambda_{s}}{2}\min\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\},\quad\forall\ x\in\{x\in\mathbb{R}^{p}:\|x-x_{\star}\|_{2}\leq\lambda_{s}/\rho\}.$ (21)

Together with the third conclusion in Lemma 9, we obtain the following result for the local landscape, where we write $U_{13}(x;a,b):=U_{1}(x;a,b)-U_{3}(x;a,b)$. Lemma 10 quantifies the radius of the vicinity as $r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})$.

Lemma 10.

When Assumptions 1, 2, and 5 hold with $\|x_{\star}\|_{2}=1$, $p\geq 3$, and $\gamma=0$, there exist positive constants $p_{\text{fail}}^{m}$ and $\delta_{0}$ such that when $n\geq cp\log p$, $0<p_{\text{fail}}\leq p_{\text{fail}}^{m}$, and $\delta<\delta_{0}$, the following relationship holds with probability at least $1-C\exp(-c^{\prime}n/\log n)-C/n$.

$\mathbb{E}_{n}\left[U_{13}(x;a,b)\right]\geq C^{(1)}\min\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\},\quad\forall\ x\in\{x\in\mathbb{R}^{p}:r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})\leq\|x-x_{\star}\|_{2}\leq C^{(2)}\}.$ (22)

Here, $C^{(1)}$, $r_{0}$, and $C^{(2)}$ are positive constants that do not depend on the choice of $\delta$ and $p_{\text{fail}}$, while the positive constants $c$, $c^{\prime}$, and $C$ do depend on $\delta$ and $p_{\text{fail}}$. This conclusion also holds with $x_{\star}$ replaced by $-x_{\star}$.

Last, we prove that the unbounded area $\mathbb{R}^{p}\backslash R_{2}$ contains no stationary point by showing that $\mathbb{E}_{n}[U_{1}(x;a,b)]$ is positive on the boundary and that $\frac{1}{u^{2}}\mathbb{E}_{n}[U_{1}(ux_{0};a,b)]$ is monotone increasing in $u>0$ for any $x_{0}\in\partial R_{2}$. The following two lemmas establish the monotonicity and positivity in parallel to Lemmas 7 and 8.

Lemma 11.

For any fixed $x_{0}\in\mathbb{R}^{p}$ with $\|x_{0}\|_{2}>0$, the function $h_{n}(u)=\frac{\mathbb{E}_{n}\left[U_{1}(ux_{0};a,b)\right]}{u^{2}}$ is monotone increasing on $(0,+\infty)$.

Lemma 12.

If there exists a constant $M>0$ such that $\inf_{x\in\mathbb{R}^{p}:\|x\|_{2}=M}\mathbb{E}_{n}\left[U_{1}(x;a,b)\right]>0$, then $\inf_{x\in\mathbb{R}^{p}:\|x\|_{2}\geq M}\mathbb{E}_{n}\left[U_{1}(x;a,b)\right]>0$.

Finally, we discuss some key points for generalizing the proof of Theorem 5 for $\gamma=0$ to Theorem 6 for $\gamma>0$. First, for $\mathbb{E}_{n_{1}}[U_{i}(x;a,b)]$ to be close to $\mathbb{E}_{n_{1}}[U_{i}(x;a)]$, $i=1,2,3$, we need $\gamma$ to be small and $\delta\geq\Omega(\sqrt{\gamma})$. Next, similar to the previous discussion of the proof sketch, we need $(1-p_{\text{fail}})\mathbb{E}_{n_{1}}[U_{13}(x;a,b)]$ to dominate $p_{\text{fail}}\mathbb{E}_{n_{2}}[U_{13}(x;a,b)]$ in some bounded area excluding the vicinity of $x_{\star}$. The estimate $|p_{\text{fail}}\mathbb{E}_{n_{2}}[U_{13}(x;a,b)]|=O(p_{\text{fail}}\|x-x_{\star}\|_{2})$ with high probability remains unchanged. As for $(1-p_{\text{fail}})\mathbb{E}_{n_{1}}[U_{13}(x;a,b)]$, due to $\gamma>0$, the relationship $\mathbb{E}_{n_{1}}[U_{13}(x;a,b)]\geq\Omega(\min\{\Delta(x),\Delta^{2}(x)/\delta\})$ only holds when $\|x-x_{\star}\|_{2}\geq\Omega(\sqrt{\gamma\delta})$. Comparing these bounds yields the set given in (17).

5 Numerical Experiments

This section compares the numerical performance of PL [11], IPL [38], and SRPR. Subsection 5.1 describes the three algorithms. Subsections 5.2 and 5.3 present the numerical results on simulated datasets and an image recovery task, respectively.

5.1 Description of Different Algorithms

To illustrate the efficacy of the benign landscape for smoothed robust phase retrieval, we evaluate both random initialization (RI) and modified spectral initialization (SI) as described in Algorithm 3 of [11]. The SI method applies spectral initialization to the subset of samples with small $b_{i}$ values, aiming to circumvent the influence of heavy-tailed corrupted measurements within $\{b_{i},i=1,2,\ldots,n\}$. [11] established that SI can, with high probability, generate an initial point $x_{0}$ satisfying $\Delta(x_{0})\leq O(p_{\text{fail}})$, provided the ratio $n/p$ is sufficiently large and $p_{\text{fail}}$ is sufficiently small.
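The following sketch conveys the idea behind SI; it is our simplified illustration, not Algorithm 3 of [11], whose exact selection rule and scaling differ, and `keep_frac` and `norm_est` are hypothetical tuning inputs.

```python
import numpy as np

def modified_spectral_init(A, b, keep_frac=0.5, norm_est=None):
    """Spectral initialization on the samples with small b_i (illustrative only)."""
    keep = b <= np.quantile(b, keep_frac)              # drop heavy-tailed b_i
    Y = (A[keep].T * b[keep]) @ A[keep] / keep.sum()   # mean of b_i a_i a_i^T
    eigvals, eigvecs = np.linalg.eigh(Y)
    direction = eigvecs[:, -1]                         # top eigenvector
    # For Gaussian designs, E[b] = ||x_star||_2^2, giving a rough scale.
    scale = norm_est if norm_est is not None else np.sqrt(max(b.mean(), 0.0))
    return scale * direction
```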

Next, we discuss the proximal linear (PL) algorithm of [11] and the inexact proximal linear (IPL) algorithm of [38]. These algorithms are selected for comparison due to PL's pioneering role in minimizing $F(x)$ and IPL's improvements in numerical efficiency over PL and the subgradient methods detailed in [9]. Thus, they set a benchmark for robust phase retrieval [38]. Both PL and IPL proceed by solving a sequence of convex subproblems to minimize $F(x)$.
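In our notation, each PL iteration linearizes the inner map $c_{i}(x)=(a_{i}^{\top}x)^{2}-b_{i}$ at the current iterate $x_{k}$ and solves a convex model problem of the form

$x_{k+1}=\operatorname*{argmin}_{x\in\mathbb{R}^{p}}\ \mathbb{E}_{n}\left[\left|(a^{\top}x_{k})^{2}+2(a^{\top}x_{k})a^{\top}(x-x_{k})-b\right|\right]+\frac{1}{2\alpha_{k}}\|x-x_{k}\|_{2}^{2},$

with a step-size parameter $\alpha_{k}>0$. This display is a sketch of the prox-linear step rather than the authors' exact formulation; IPL solves the same type of subproblem only inexactly, with adaptive stopping criteria, and we refer to [11, 38] for the precise versions.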

We then introduce the smoothed robust phase retrieval (SRPR) methodology. In the noiseless scenario, SRPR minimizes $F_{\delta}(x)$, using either RI or SI for initialization. In cases involving corruption, SRPR first minimizes $F_{\delta}(x)$ using RI or SI and then employs the resulting minimizer as a warm start for IPL. This two-step approach is necessary because direct minimization of $F_{\delta}(x)$ does not yield exact recovery of the true signal vectors. For kernel selection across all comparisons with proximal linear algorithms, we employ the kernel $K(x)=\frac{1}{2(x^{2}+1)^{3/2}}$, which generates the pseudo-Huber loss function $l_{\delta}(x)=\sqrt{x^{2}+\delta^{2}}$. The choice of $\delta$ will be elaborated upon subsequently. The monotone accelerated proximal gradient (MAPG) algorithm with line search, as outlined in Algorithm 1 of [22], is utilized for optimization. This algorithm, specifically designed for nonconvex optimization, incorporates a reference update mechanism based on direct gradient descent steps to ensure sufficient decrease; it effectively guarantees performance at least on par with simple gradient descent. Subsequent numerical results will illustrate the linear convergence achieved by this algorithm in the noiseless setting.
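As a reproducible illustration of the noiseless behavior, the sketch below minimizes $F_{\delta}(x)$ with plain gradient descent plus Armijo backtracking, a simple stand-in for the MAPG algorithm of [22]; the function names are ours.

```python
import numpy as np

def srpr_value_grad(x, A, b, delta):
    """F_delta(x) = E_n[sqrt(((a^T x)^2 - b)^2 + delta^2)] and its gradient."""
    ax = A @ x
    r = ax ** 2 - b
    s = np.sqrt(r ** 2 + delta ** 2)
    grad = A.T @ (2.0 * (r / s) * ax) / len(b)
    return s.mean(), grad

def minimize_srpr(x0, A, b, delta, step=1.0, max_iter=5000, tol=1e-12):
    """Gradient descent with backtracking line search on F_delta."""
    x = x0.copy()
    f, g = srpr_value_grad(x, A, b, delta)
    for _ in range(max_iter):
        t = step
        while True:                        # backtrack until sufficient decrease
            x_new = x - t * g
            f_new, g_new = srpr_value_grad(x_new, A, b, delta)
            if f_new <= f - 0.5 * t * (g @ g) or t < 1e-14:
                break
            t *= 0.5
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x, f, g = x_new, f_new, g_new
    return x
```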

We consider six candidate algorithms in the comparison: the proximal linear algorithm with SI or RI (PL-SI and PL-RI), the inexact proximal linear algorithm with SI or RI (IPL-SI and IPL-RI), and smoothed robust phase retrieval with SI or RI (SRPR-SI and SRPR-RI).

In what follows, we examine their performance on both simulated datasets and image recovery. For an experiment with a specific dataset and initialization, let $x_{K}$ denote the final solution obtained by an algorithm. For simulated datasets, we consider the phase retrieval successful if the relative error, defined as $\mbox{dist}(x_{K},\{x_{\star},-x_{\star}\})/\|x_{\star}\|_{2}$, does not exceed $10^{-6}$ for the IPL and SRPR algorithms; for PL, a less stringent threshold of $10^{-4}$ is adopted. For image recovery, the success threshold for PL is relaxed to $10^{-1}$, whereas for IPL and SRPR it remains unchanged. This differentiation in accuracy thresholds reflects the comparatively lower optimization efficiency of PL relative to IPL and SRPR. Next, we explain the CPU time associated with each algorithm: it encompasses the aggregate time spent by all components of an algorithm. For instance, the CPU time of SRPR-SI in the presence of corruptions includes the time for modified spectral initialization, the minimization of $F_{\delta}(x)$ via MAPG, and the execution of IPL for exact recovery. These experiments are conducted on a server equipped with an Intel Xeon E5-2650v4 (2.2GHz) processor. To ensure robustness in our findings, each experimental setup, defined by a unique set of hyper-parameters related to the dataset, is replicated 50 times with varied random samples for all components involved. This procedure enables us to report both the success rate and the median CPU time for each experimental scenario. Consistent with the observations of [11], we note that an increase in $k=n/p$ and a decrease in $p_{\text{fail}}$ generally lead to higher success rates; however, the impact on CPU time does not lend itself to straightforward conclusions. The subsequent subsections present numerical experiments that furnish further insights.
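For concreteness, the success criterion can be computed as follows (a small helper of ours, not code from [11] or [38]):

```python
import numpy as np

def relative_error(x_k, x_star):
    """dist(x_K, {x_star, -x_star}) / ||x_star||_2, the criterion defined above."""
    d = min(np.linalg.norm(x_k - x_star), np.linalg.norm(x_k + x_star))
    return d / np.linalg.norm(x_star)

def is_success(x_k, x_star, tol=1e-6):
    # tol = 1e-6 for IPL/SRPR on simulated data; 1e-4 (or 1e-1 for images) for PL.
    return relative_error(x_k, x_star) <= tol
```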

5.2 Simulated Dataset

This subsection follows [11] to conduct experiments on simulated datasets. We generate datasets in which $a_{i}$, $i=1,2,\ldots,n$, independently follow $N(0,I_{p})$ and $x_{\star}\in\frac{1}{\sqrt{p}}\{-1,1\}^{p}$ has entries drawn independently from the discrete uniform distribution, so that $\|x_{\star}\|_{2}=1$. The corruptions of the measurements are independent, and the corruption type is either zeroing (i.e., letting $b_{i}=0$ for all $i\in[n]\backslash[n_{1}]$) or Cauchy-distributed. A comprehensive description of the related hyper-parameters is as follows; a data-generation sketch in code follows the list.

  • Signal size for recovery: p{500,1500}p\in\{500,1500\}.

  • Oversampling ratio k=n/p{2+0.25s|s[0,24],sN}k=n/p\in\{2+0.25s|s\in[0,24],s\in N\}.

  • Corruption type $r\in\{0,1\}$. $b_{j}=(a_{j}^{\top}x_{\star})^{2}$ for $j\in[n_{1}]$. The corrupted responses satisfy, for all $j\in[n]\backslash[n_{1}]$, $b_{j}=r\tan(\frac{\pi}{2}U_{j})\times\mbox{median}\left(\{(a_{i}^{\top}x_{\star})^{2}:i\in\{1,2,\ldots,n\}\}\right)$ with $U_{j}\sim U(0,1)$, independently across measurements. When $r=0$, the corrupted values are all equal to zero; this diminishes the effectiveness of the modified spectral initialization, which relies on the $b_{i}$'s with small magnitudes. Conversely, when $r=1$, the corrupted values are drawn from a positive Cauchy distribution, which has an infinite expected value. This distribution is chosen for its heavy tails, since the capacity to handle heavy-tailed corruption is one of the benefits of the robust model.

  • pfail{0.025s|s[0,12],sN}p_{\text{fail}}\in\{0.025s|s\in[0,12],s\in N\}.

  • Random initialization x0=x~r~x_{0}=\tilde{x}\tilde{r}, in which x~U[Bp(0,1)]\tilde{x}\sim U[\partial B_{p}(0,1)] and r~U[0,4]\tilde{r}\sim U[0,4] are independent random variables.

  • $\delta=0.25$ for SRPR in the comparisons with PL and IPL; $\delta$ can be selected easily here because $\|x_{\star}\|_{2}$ is known.
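The following sketch generates one simulated dataset under the hyper-parameters above (our illustrative helper; the name `simulate_dataset` is hypothetical):

```python
import numpy as np

def simulate_dataset(p=500, k=8.0, p_fail=0.1, r=1, seed=0):
    """Gaussian designs, sign signal with ||x_star||_2 = 1, and corrupted tail.

    r = 0 zeroes the corrupted b_i; r = 1 draws them from a scaled
    positive Cauchy distribution, following the description above.
    """
    rng = np.random.default_rng(seed)
    n = int(k * p)
    n1 = n - int(p_fail * n)                               # clean measurements
    x_star = rng.choice([-1.0, 1.0], size=p) / np.sqrt(p)  # ||x_star||_2 = 1
    A = rng.standard_normal((n, p))
    b = (A @ x_star) ** 2
    if n1 < n:
        med = np.median(b)                                 # median over all i
        u = rng.uniform(size=n - n1)
        b[n1:] = r * np.tan(np.pi / 2 * u) * med           # zeroing when r = 0
    return A, b, x_star

A, b, x_star = simulate_dataset()
```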

Next, we present the performance of the candidate algorithms. Beginning with an evaluation under noiseless conditions, Figure 4 compares CPU time and success rate for these algorithms. The figure is divided into two panels: the left panel displays the success rate, while the right panel provides the median CPU time (after logarithmic transformation) for successful trials; an absence of data points indicates a success rate of zero. It is observed that algorithms incorporating SI exhibit superior success rates compared to those employing random initialization, with SRPR-SI emerging as the most effective among them. For algorithms utilizing random initialization, SRPR achieves a $100\%$ success rate for large $n/p$ ratios, benefiting from the benign landscape; in contrast, PL and IPL fail to achieve similar success rates under these conditions. Regarding CPU time efficiency, SRPR stands out due to its local linear convergence property. This efficiency is further evidenced in Figure 5, which delineates the decline of $F_{\delta}(x_{k})$ over iterations for a trial with random initialization (where $p=500$ and $k=8$) using MAPG. The figure reveals a pronounced linear trend, with no discernible sub-linear or non-local stages.

Figure 4: Comparison of success rate and CPU time for the noiseless simulated datasets
Figure 5: Decreasing pattern of $F_{\delta}(x)$ with random initialization for the noiseless simulated dataset

Second, we focus on simulated datasets with corruptions. The conclusion of the comparison is consistent across all combinations of hyper-parameters. We pick the combination $r=1$, $p=500$, $p_{\text{fail}}=0.1$ and compare the success rate and CPU time for different choices of $n/p$.

Figure 6: Comparison of success rate and CPU time for corrupted simulated datasets

Figure 6 shows the performance under these hyper-parameters and supports the same conclusion as Figure 4 in terms of success rate. In terms of CPU time, SRPR holds only a slight advantage over IPL-SI, as SRPR eventually reverts to minimizing $F(x)$ for exact recovery.

In the sensitivity analysis for the parameter $\delta$, we focus on the relative error after minimizing $F_{\delta}(x)$ across various settings of $\delta$ and $p_{\text{fail}}$. Employing the hyper-parameters $p=1500$, $r=1$, and $k=8$, along with SI, Figure 7 presents the outcomes from two perspectives: the first plots $p_{\text{fail}}$ on the $x$-axis against the relative error on the $y$-axis, while the second uses $\delta$ as the $x$-axis. Each point represents the median relative error over 50 replicates. The analysis clearly shows that lower values of $\delta$ and $p_{\text{fail}}$ are associated with smaller relative errors. This monotone dependence on $\delta$ and $p_{\text{fail}}$ aligns with the theoretical results in Theorem 5. Notably, the precision achieved by $\min_{x\in\mathbb{R}^{p}}F_{\delta}(x)$ surpasses that of the SI, whose relative error ranges from 0.2 to 1.2.

Next, we provide numerical results showing the robustness of the model in the presence of bounded noise. We let $b_{j}=(a_{j}^{\top}x_{\star})^{2}+\tau_{j}$ for all $j\in[n_{1}]$, in which the $\tau_{j}$ are independent of $\{a_{i}\}_{i=1}^{n}$ and independently follow $U(-0.05,0.05)$. We set the threshold for success to 0.05. Figure 8 shows the experimental results, which exhibit patterns similar to Figure 6. This means that SRPR still enjoys a benign landscape and retains its advantages over PL and IPL.

Figure 7: Relative error after minimizing $F_{\delta}(x)$ for simulated datasets
Figure 8: Comparison for corrupted simulated datasets with bounded noise

5.3 Image Recovery

Now, we study the task of image recovery within the field of computational biology, leveraging the prevalent application of phase retrieval techniques [17, 31]. Consider an image array $X_{\star}\in\mathbb{R}^{p_{1}\times p_{2}\times p_{3}}$, where each element takes a value in the interval $[0,1]$. Here, $p_{1}\times p_{2}$ represents the image dimensions, and $p_{3}=3$ corresponds to the RGB color channels. We define the signal as $x_{\star}=[\mbox{vec}(X_{\star});0]\in\mathbb{R}^{p}$, where $p=\inf\{2^{s}:s\in\mathbb{N},2^{s}\geq p_{1}p_{2}p_{3}\}$, and $k=n/p\in\mathbb{N}$. The Hadamard matrix is denoted by $H_{p}\in\frac{1}{\sqrt{p}}\{-1,1\}^{p\times p}$, and we let $S_{j}\in\mbox{diag}(\{-1,1\}^{p})$, for $j=1,2,\ldots,k$, be random diagonal matrices whose diagonal elements follow independent discrete uniform distributions. The measurement matrix is defined as $A=[a_{1}^{\top};a_{2}^{\top};\ldots;a_{n}^{\top}]=\frac{1}{\sqrt{k}}[H_{p}S_{1};H_{p}S_{2};\ldots;H_{p}S_{k}]$, where the semicolons mean that the matrix blocks are stacked vertically. This configuration admits an efficient computational complexity of $O(n\log p)$ for the transformation $Ay$ for $y\in\mathbb{R}^{p}$, paralleling the complexity of the discrete Fourier transform commonly encountered in phase retrieval with complex transformation matrices; see the sketch following this description.

We use a real RNA nanoparticle image (https://visualsonline.cancer.gov/details.cfm?imageid=11167), which is also used in [11], with an original image size of $2133\times 2133\times 3$. We follow the code of [11] (https://stanford.edu/~jduchi/projects/phase-retrieval-code.tgz) to conduct experiments on a colored sub-image and pick $p=2^{18}$ or $p=2^{22}$. The related hyper-parameters are as follows: the signal size $p\in\{2^{18},2^{22}\}$, oversampling ratio $k=n/p\in\{3,6\}$, corruption type $r\in\{0,1\}$, and $p_{\text{fail}}\in\{0.025s\,|\,s\in[0,8],s\in\mathbb{N}\}$. The corrupted responses satisfy, for $j\in[n]\backslash[n_{1}]$, $b_{j}=r\tan(\frac{\pi}{2}U_{j})\times\mbox{median}\left(\{(a_{i}^{\top}x_{\star})^{2}:i\in\{1,2,\ldots,n\}\}\right)$ with $U_{j}\sim U(0,1)$, independently across measurements. The random initialization is $x=\frac{\sqrt{p}}{2}\tilde{x}$, in which $\tilde{x}\sim U[\partial B_{p}(0,1)]$.

Next, we provide the choice of $\delta$. We can find that $\mathbb{E}_{n}\left[aa^{\top}\right]=\frac{1}{n}A^{\top}A=I_{p}/n$. Based on the properties of an RGB image, we assume that $\|x_{\star}\|_{2}=c_{0}\sqrt{p}$ with $c_{0}\in(0,1)$, where $c_{0}$ is usually small when there is a large proportion of black pixels. So, $\delta$ can be selected as $\delta=\delta_{0}\left\|\frac{1}{n}A^{\top}A\right\|_{2}\|x_{\star}\|_{2}^{2}=\frac{\delta_{0}c_{0}^{2}}{k}$ for some small positive constant $\delta_{0}$. As a result, we choose $\delta=0.01/6$, since $k\leq 6$.
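The $O(n\log p)$ complexity can be realized with a fast Walsh-Hadamard transform, as in the minimal NumPy sketch below (our illustration of the construction above; `fwht` and `apply_A` are hypothetical helper names):

```python
import numpy as np

def fwht(y):
    """Fast Walsh-Hadamard transform in O(p log p); assumes len(y) is a power of 2.

    Unnormalized: the orthonormal H_p in the text corresponds to fwht(y) / sqrt(p).
    """
    y = y.copy()
    h, p = 1, len(y)
    while h < p:
        z = y.reshape(-1, 2 * h)              # blocks of 2h consecutive entries
        top, bot = z[:, :h].copy(), z[:, h:].copy()
        z[:, :h], z[:, h:] = top + bot, top - bot
        h *= 2
    return y

def apply_A(y, signs):
    """Compute A y = (1/sqrt(k)) [H_p S_1 y; ...; H_p S_k y] without forming A.

    signs is a (k, p) array of +/-1 diagonals for S_1, ..., S_k.
    """
    k, p = signs.shape
    return np.concatenate([fwht(s * y) / np.sqrt(p) for s in signs]) / np.sqrt(k)

# Example: n = k * p measurements b_i = (a_i^T y)^2 before corruption.
rng = np.random.default_rng(0)
k, p = 3, 2**10
signs = rng.choice([-1.0, 1.0], size=(k, p))
y = rng.standard_normal(p)
b = apply_A(y, signs) ** 2
```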

In our experiments on image recovery, we observe consistent results across different settings of $r$; for illustration, we select $r=0$. The success rates are reported in Table 1, highlighting that SRPR is the only method whose success does not depend on modified spectral initialization for large values of $n/p$. Regarding the median CPU time when $p_{\text{fail}}>0$, the performances of SRPR-RI, SRPR-SI, and IPL-SI are comparable: the CPU time spans from 200 to 600 seconds for $p=2^{18}$ and from 6000 to 15000 seconds for $p=2^{22}$, while the computational times of the other algorithms are significantly longer. Interestingly, when $p_{\text{fail}}=0$, IPL-SI is markedly more efficient than in the corrupted cases and surpasses both SRPR-RI and SRPR-SI in CPU time.

| Success Rate | $p_{\text{fail}}\leq 0.15$, $k=3$ | $p_{\text{fail}}\leq 0.15$, $k=6$ | $p_{\text{fail}}>0.15$, $k=3$ | $p_{\text{fail}}>0.15$, $k=6$ |
| PL-RI | 0 | 0 | 0 | 0 |
| IPL-RI | 0 | 0 | 0 | 0 |
| SRPR-RI | 0 | 1 | 0 | 1 |
| PL-SI | 1 | 1 | 0 | 1 |
| IPL-SI | 1 | 1 | 0 | 1 |
| SRPR-SI | 1 | 1 | 0 | 1 |
Table 1: Comparison of Success Rate when $p=2^{18}$ for Image Recoveries

6 Conclusion

In this paper, we fill an important gap in the literature by analyzing the essential geometric structure of nonconvex robust phase retrieval. We propose the SRPR based on a family of convolution-type smoothed loss functions as a promising alternative to the RPR based on the $\ell_{1}$-loss. We prove the benign geometric structure of the SRPR with high probability: our result under the noiseless setting matches the best available result for the least-squares formulations, and our result under infrequent but arbitrary corruptions provides the first landscape analysis of phase retrieval with corruption in the literature. We also prove the local linear convergence rate of gradient descent for solving the SRPR under the noiseless setting. Numerical experiments on both simulated datasets and an image recovery task demonstrate the performance of the SRPR.

References

  • [1]
  • Cai et al. [2022] Cai, J.-F., Huang, M., Li, D. and Wang, Y. [2022], ‘Solving phase retrieval with random initial guess is nearly as good as by spectral initialization’, Applied and Computational Harmonic Analysis 58, 60–84.
  • Cai et al. [2016] Cai, T. T., Li, X. and Ma, Z. [2016], ‘Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow’, The Annals of Statistics 44(5), 2221–2251.
  • Candes et al. [2015] Candes, E. J., Li, X. and Soltanolkotabi, M. [2015], ‘Phase retrieval via wirtinger flow: Theory and algorithms’, IEEE Transactions on Information Theory 61(4), 1985–2007.
  • Candes et al. [2013] Candes, E. J., Strohmer, T. and Voroninski, V. [2013], ‘Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming’, Communications on Pure and Applied Mathematics 66(8), 1241–1274.
  • Chai et al. [2010] Chai, A., Moscoso, M. and Papanicolaou, G. [2010], ‘Array imaging using intensity-only measurements’, Inverse Problems 27(1), 015005.
  • Chen and Candès [2017] Chen, Y. and Candès, E. J. [2017], ‘Solving random quadratic systems of equations is nearly as easy as solving linear systems’, Communications on Pure and Applied Mathematics 70(5), 822–883.
  • Chen et al. [2019] Chen, Y., Chi, Y., Fan, J. and Ma, C. [2019], ‘Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval’, Mathematical Programming 176(1), 5–37.
  • Davis et al. [2018] Davis, D., Drusvyatskiy, D., MacPhee, K. J. and Paquette, C. [2018], ‘Subgradient methods for sharp weakly convex functions’, Journal of Optimization Theory and Applications 179(3), 962–982.
  • Davis et al. [2020] Davis, D., Drusvyatskiy, D. and Paquette, C. [2020], ‘The nonsmooth landscape of phase retrieval’, IMA Journal of Numerical Analysis 40(4), 2652–2695.
  • Duchi and Ruan [2019] Duchi, J. C. and Ruan, F. [2019], ‘Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval’, Information and Inference: A Journal of the IMA 8(3), 471–529.
  • Eldar and Mendelson [2014] Eldar, Y. C. and Mendelson, S. [2014], ‘Phase retrieval: Stability and recovery guarantees’, Applied and Computational Harmonic Analysis 36(3), 473–494.
  • Fernandes et al. [2021] Fernandes, M., Guerre, E. and Horta, E. [2021], ‘Smoothing quantile regressions’, Journal of Business & Economic Statistics 39(1), 338–357.
  • Fickus et al. [2014] Fickus, M., Mixon, D. G., Nelson, A. A. and Wang, Y. [2014], ‘Phase retrieval from very few measurements’, Linear Algebra and its Applications 449, 475–499.
  • Fienup and Dainty [1987] Fienup, C. and Dainty, J. [1987], ‘Phase retrieval and image reconstruction for astronomy’, Image Recovery: Theory and Application 231, 275.
  • He et al. [2023] He, X., Pan, X., Tan, K. M. and Zhou, W.-X. [2023], ‘Smoothed quantile regression with large-scale inference’, Journal of Econometrics 232(2), 367–388.
  • Huang et al. [2021] Huang, L., Liu, T., Yang, X., Luo, Y., Rivenson, Y. and Ozcan, A. [2021], ‘Holographic image reconstruction with phase recovery and autofocusing using recurrent neural networks’, ACS Photonics 8(6), 1763–1774.
  • Jiang and Yu [2021] Jiang, R. and Yu, K. [2021], ‘Smoothing quantile regression for a distributed system’, Neurocomputing 466, 311–326.
  • Jin et al. [2017] Jin, C., Ge, R., Netrapalli, P., Kakade, S. M. and Jordan, M. I. [2017], How to escape saddle points efficiently, in ‘International Conference on Machine Learning’, PMLR, pp. 1724–1732.
  • Jin et al. [2018] Jin, C., Netrapalli, P. and Jordan, M. I. [2018], Accelerated gradient descent escapes saddle points faster than gradient descent, in ‘Conference On Learning Theory’, PMLR, pp. 1042–1085.
  • Karimi et al. [2016] Karimi, H., Nutini, J. and Schmidt, M. [2016], Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition, in ‘Joint European Conference on Machine Learning and Knowledge Discovery in Databases’, Springer, pp. 795–811.
  • Li and Lin [2015] Li, H. and Lin, Z. [2015], Accelerated proximal gradient methods for nonconvex programming, in ‘Advances in Neural Information Processing Systems’, Vol. 28, Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/file/f7664060cc52bc6f3d620bcedc94a4b6-Paper.pdf
  • Man et al. [2024] Man, R., Pan, X., Tan, K. M. and Zhou, W.-X. [2024], ‘A unified algorithm for penalized convolution smoothed quantile regression’, Journal of Computational and Graphical Statistics 33(2), 625–637.
  • Mei et al. [2018] Mei, S., Bai, Y. and Montanari, A. [2018], ‘The landscape of empirical risk for nonconvex losses’, The Annals of Statistics 46(6A), 2747–2774.
  • Miao et al. [1999] Miao, J., Charalambous, P., Kirz, J. and Sayre, D. [1999], ‘Extending the methodology of x-ray crystallography to allow imaging of micrometre-sized non-crystalline specimens’, Nature 400(6742), 342–344.
  • Miao et al. [2008] Miao, J., Ishikawa, T., Shen, Q. and Earnest, T. [2008], ‘Extending x-ray crystallography to allow the imaging of noncrystalline materials, cells, and single protein complexes’, Annu. Rev. Phys. Chem. 59, 387–410.
  • Millane [1990] Millane, R. P. [1990], ‘Phase retrieval in crystallography and optics’, JOSA A 7(3), 394–411.
  • Parikh and Boyd [2014] Parikh, N. and Boyd, S. [2014], ‘Block splitting for distributed optimization’, Mathematical Programming Computation 6(1), 77–102.
  • Roulet and d'Aspremont [2017] Roulet, V. and d'Aspremont, A. [2017], Sharpness, restart and acceleration, in ‘Advances in Neural Information Processing Systems’, Vol. 30, Curran Associates, Inc.
  • Shechtman et al. [2015] Shechtman, Y., Eldar, Y. C., Cohen, O., Chapman, H. N., Miao, J. and Segev, M. [2015], ‘Phase retrieval with application to optical imaging: a contemporary overview’, IEEE Signal Processing Magazine 32(3), 87–109.
  • Stefik [1978] Stefik, M. [1978], ‘Inferring dna structures from segmentation data’, Artificial Intelligence 11(1-2), 85–114.
  • Sun et al. [2018] Sun, J., Qu, Q. and Wright, J. [2018], ‘A geometric analysis of phase retrieval’, Foundations of Computational Mathematics 18(5), 1131–1198.
  • Tan et al. [2022] Tan, K. M., Wang, L. and Zhou, W.-X. [2022], ‘High-dimensional quantile regression: Convolution smoothing and concave regularization’, Journal of the Royal Statistical Society Series B: Statistical Methodology 84(1), 205–233.
  • Taylor [2003] Taylor, G. [2003], ‘The phase problem’, Acta Crystallographica Section D 59(11), 1881–1890.
  • Wang et al. [2017] Wang, G., Giannakis, G. B. and Eldar, Y. C. [2017], ‘Solving systems of random quadratic equations via truncated amplitude flow’, IEEE Transactions on Information Theory 64(2), 773–794.
  • Zhang et al. [2016] Zhang, H., Chi, Y. and Liang, Y. [2016], Provable non-convex phase retrieval with outliers: Median truncated wirtinger flow, in ‘International Conference on Machine Learning’, PMLR, pp. 1022–1031.
  • Zhang et al. [2017] Zhang, H., You, S., Lin, Z. and Xu, C. [2017], Fast compressive phase retrieval under bounded noise, in ‘Proceedings of the AAAI Conference on Artificial Intelligence’, Vol. 31.
  • Zheng et al. [2024] Zheng, Z., Ma, S. and Xue, L. [2024], ‘A new inexact proximal linear algorithm with adaptive stopping criteria for robust phase retrieval’, IEEE Transactions on Signal Processing 72, 1081–1093.