Smoothed Robust Phase Retrieval

Zhong Zheng and Lingzhou Xue
Department of Statistics, The Pennsylvania State University
Abstract

The phase retrieval problem in the presence of noise aims to recover the signal vector of interest from a set of quadratic measurements with infrequent but arbitrary corruptions, and it plays an important role in many scientific applications. However, the essential geometric structure of nonconvex robust phase retrieval based on the $\ell_{1}$-loss, which is needed to study spurious local solutions, remains largely unknown even under the ideal noiseless setting, and its intrinsic nonsmooth nature also impairs the efficiency of optimization algorithms. This paper introduces the smoothed robust phase retrieval (SRPR) based on a family of convolution-type smoothed loss functions. Theoretically, we prove that the SRPR enjoys a benign geometric structure with high probability: (1) under the noiseless situation, the SRPR has no spurious local solutions, and the target signals are global solutions, and (2) under infrequent but arbitrary corruptions, we characterize the stationary points of the SRPR and prove its benign landscape, which is the first landscape analysis of phase retrieval with corruption in the literature. Moreover, we prove the local linear convergence rate of gradient descent for solving the SRPR under the noiseless situation. Experiments on both simulated datasets and image recovery are provided to demonstrate the numerical performance of the SRPR.

Keywords— Convolution smoothing; Landscape analysis; Nonconvex optimization.

1 Introduction

Investigating the recovery of a signal vector of interest $x\in\mathbb{R}^{p}$ from a set of sensing vectors $a_{1},\ldots,a_{n}\in\mathbb{R}^{p}$ and their linear transformations $a_{1}^{\top}x,\ldots,a_{n}^{\top}x$ in the presence of noise is a critical research question in statistics and data science. There has been considerable interest in the estimation of this signal vector utilizing magnitude-based measurements $(a_{1}^{\top}x)^{2},\ldots,(a_{n}^{\top}x)^{2}$ in the presence of potential errors and corruptions. This problem, commonly referred to as phase retrieval, plays a pivotal role in addressing the phase problem [34, 30]. It finds applications across a wide spectrum of scientific fields, including X-ray crystallography [25], optical physics [27], coherent diffractive imaging using arrays and high-power sources [6], astronomy [15], and microscopy [26].

Over the last decade, several pioneering papers have established the theoretical foundation for the noiseless phase retrieval problem, which seeks to recover the signal vectors $x_{\star}$ or $-x_{\star}$ in $\mathbb{R}^{p}$ from a collection of magnitude-based measurements without any errors or corruptions:

b_{i}=(a_{i}^{\top}x_{\star})^{2},\qquad(1)

for each $i\in[n]:=\{1,2,\ldots,n\}$. Here, $b_{1},\ldots,b_{n}$ represent non-negative measurements. It is important to note that (1) is a nonconvex problem recognized as being NP-hard [14]. To address this challenge, [5] introduced the PhaseLift method, employing a convex relaxation strategy based on semidefinite programming (SDP) to solve a trace-norm minimization problem. However, as the dimension of the signal vector increases, SDP-based methods tend to become computationally infeasible [4]. In response to these limitations, the literature has developed alternative approaches based on nonconvex optimization, including the Wirtinger flow [4], truncated Wirtinger flow [7], thresholded Wirtinger flow [3], truncated amplitude flow [35], and others. These nonconvex methods have utilized spectral initialization [4] to facilitate global convergence. Recently, the seminal paper [32] demonstrated that the least-squares formulation for recovering the signal from (1) exhibits a benign global geometry (also known as a benign function landscape), ensuring that, with high probability, the nonconvex problem has no spurious local minimizers and can be globally optimized through efficient iterative methods, even with random initialization. Further, [8] showed that vanilla gradient descent is effective and provided a comprehensive analysis of the optimization trajectory, and [2] illustrated that the smoothed amplitude flow model shares similar properties.

In real-world applications, it is crucial not to overlook the influence of noise on phase retrieval. In particular, we should consider the presence of additive errors and infrequent corruptions affecting magnitude-based measurements, which may arise from equipment limitations, measurement errors, extreme occurrences, and so on. The noisy magnitude-based measurements can be formalized as follows: for each $i\in[n]$,

b_{i}=\begin{cases}(a_{i}^{\top}x_{\star})^{2}+\tau_{i},&\text{if the $i$-th measurement is uncorrupted},\\ \xi_{i},&\text{otherwise}.\end{cases}\qquad(2)

Here, $\tau_{i}$ represents the additive errors bounded in $\mathbb{R}$, and $\xi_{i}$ signifies the infrequent yet arbitrary corruptions in $\mathbb{R}_{+}$. The set $\mathbb{I}_{O}=\{i\in[n]:\text{the $i$-th measurement is corrupted}\}$ denotes the indices of outliers, and $\mathrm{card}(\mathbb{I}_{O})$ is the cardinality of $\mathbb{I}_{O}$. We define $p_{\text{fail}}=\mathrm{card}(\mathbb{I}_{O})/n$ as the proportion of corrupted measurements. It is essential to note that $p_{\text{fail}}$ remains a relatively small quantity due to the infrequency of corruption. However, the corruptions $\xi_{i}$ can be unbounded and may behave adversarially. Furthermore, independence between $\{\xi_{i}\}_{i\in\mathbb{I}_{O}}$ and $\{a_{i}\}_{i=1}^{n}$ may not hold, introducing additional challenges to signal recovery.

Model (2) provides a general framework for signal recovery from noisy magnitude-based measurements and includes various specialized scenarios as special cases. Specifically, the noiseless phase retrieval problem, as formulated in (1), can be viewed as a particular instance of (2) with $p_{\text{fail}}=0$ and $\tau_{i}=0$ for all $i\in[n]$. Moreover, phase retrieval in the presence of bounded noise, as investigated by [37], emerges as another special case of (2) when $p_{\text{fail}}=0$, indicating the absence of outliers. Conversely, the robust phase retrieval problem [11] aligns with (2) under the situation that $\tau_{i}=0$ for all $i\in[n]$, focusing on robustness against corruptions. Additionally, the approach for nonconvex phase retrieval with outliers, as proposed by [36], resonates with (2) and accommodates both dense bounded noise and sparse arbitrary outliers: $b_{i}=(a_{i}^{\top}x_{\star})^{2}+\tau_{i}+\xi_{i}$ for each $i\in[n]$.
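To make the data-generating process concrete, the following minimal Python sketch simulates model (2) with standard Gaussian sensing vectors. It is our own illustration rather than the authors' code; in particular, the exponential corruption distribution and all function names are arbitrary choices made for this sketch.

import numpy as np

def simulate_measurements(n, p, p_fail=0.05, gamma=0.0, seed=0):
    """Simulate (a_i, b_i) from model (2): a fraction p_fail of the b_i are
    replaced by arbitrary non-negative corruptions xi_i, and the rest carry
    bounded additive errors |tau_i| <= gamma."""
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal(p)
    x_star /= np.linalg.norm(x_star)          # normalize the true signal
    A = rng.standard_normal((n, p))           # rows are the sensing vectors a_i
    b = (A @ x_star) ** 2 + rng.uniform(-gamma, gamma, n)
    outliers = rng.choice(n, int(p_fail * n), replace=False)
    b[outliers] = rng.exponential(scale=10.0, size=len(outliers))
    return A, b, x_star

A, b, x_star = simulate_measurements(n=600, p=50)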

To solve the problem of phase retrieval with corruptions in Model (2), the objective remains to recover the signal vectors $x_{\star}$ or $-x_{\star}$ using the observations $\{(a_{i},b_{i})\}_{i=1}^{n}$. In this vein, [36] extended the methodology of [7] to propose the median truncated Wirtinger flow, and [11] explored robust phase retrieval employing the $\ell_{1}$-loss, formally defined as

\min_{x\in\mathbb{R}^{p}}F(x):=\min_{x\in\mathbb{R}^{p}}\frac{1}{n}\sum_{i=1}^{n}\left|(a_{i}^{\top}x)^{2}-b_{i}\right|.\qquad(3)

[11] introduced the proximal linear (PL) algorithm to tackle this nonconvex optimization challenge, demonstrating that the $\ell_{1}$-loss enhances success rates in comparison to the median truncated Wirtinger flow [36]. Furthermore, they established the sharpness and weak convexity properties of $F(x)$, alongside the local convergence of the PL algorithm. Subsequent research on the optimization problem (3) has expanded upon these findings: [9, 10] investigated the local convergence of subgradient algorithms, and recently, [38] proposed the inexact proximal linear (IPL) algorithm and studied its local convergence.

Despite these advancements, the essential geometric structure of robust phase retrieval, needed to study spurious local solutions, remains largely unknown. This contrasts with the works of [32, 8, 2], which provided insights into the landscape of smooth objective functions in phase retrieval under noiseless conditions. Notably, even under the ideal noiseless setting, [11] did not conduct a landscape analysis to understand the stationary points of the nonconvex and nonsmooth optimization. [10] only defined and analyzed stationary points in the subgradient sense, and whether they are local minima is still unknown. Moreover, in the presence of corruption, we cannot find any existing work studying the landscape of any objective function in the literature. As a result, there is no theoretical guarantee that robust phase retrieval avoids spurious local minima under random initialization.

Another concern arises from the intrinsic nonsmooth nature of (3), which significantly impacts the efficiency of optimization algorithms, including the PL method [11] and the IPL method [38]. Both approaches necessitate iteratively solving nonsmooth subproblems using the proximal operator graph splitting (POGS) algorithm [28] or the accelerated proximal gradient algorithm, which invariably leads to diminished optimization efficiency. Moreover, as pointed out by [9, 10], the subgradient algorithms for solving (3) are sensitive to the step size selection, and there are no practically reliable methods for choosing the step size.

All these challenges stem from the non-differentiability of the $\ell_{1}$-loss employed in (3), which limits the technical tools available for analyzing function landscapes and complicates the optimization process. In this paper, we propose the smoothed robust phase retrieval (SRPR) method, which minimizes a smoothed surrogate $F_{\delta}(x)$ based on convolution-type smoothing: the convolution of the $\ell_{1}$-loss with a kernel function $K_{\delta}(\cdot)$ with bandwidth $\delta>0$. This smoothing idea was first introduced by [13] and then investigated by [16], [33], [18], and [23] to overcome the lack of smoothness in quantile regression and to improve optimization efficiency while achieving the desired convergence rate and oracle property.

We demonstrate that, with high probability, $F_{\delta}(x)$ not only satisfies a weak convexity property similar to that of $F(x)$ but also satisfies generalized sharpness in the noiseless scenario. Furthermore, this approach facilitates the use of gradient and Hessian information, providing a convenient mathematical structure. These properties are crucial for determining the convergence rates of gradient-based optimization algorithms and for conducting landscape analyses. By directly minimizing $F_{\delta}(x)$ in the absence of noise, exact signal recovery is achievable. Additionally, our methodology extends to scenarios involving corrupted measurements, illustrating its versatility and applicability. The SRPR approach makes at least the following contributions to the field, enhancing both the theoretical and practical aspects of phase retrieval when compared to existing methodologies.

Firstly, the smoothness of our approach enables the application of gradient-based algorithms to minimize the nonconvex function $F_{\delta}(x)$. We demonstrate that, under the noiseless situation, the minimization of $F_{\delta}(x)$ through general first-order methods allows SRPR to achieve local linear convergence. This marks a significant improvement over the sublinear convergence rates observed with the PL [11] and IPL [38] algorithms. The attainment of a linear rate is facilitated by the Polyak-Lojasiewicz condition [21] that holds locally, circumventing the need for local convexity. Additionally, our method benefits from the potential for acceleration and line search strategies, offering clear advantages over subgradient algorithms [9, 10], which encounter challenges in step size tuning. In numerical experiments, SRPR demonstrates superior efficiency relative to competing algorithms.

Secondly, in the context of noiseless phase retrieval, we conduct a landscape analysis to show that $F_{\delta}(x)$, with high probability, exhibits a benign landscape and has no spurious local minima. This analysis underscores the efficacy of random initialization in achieving exact recovery through the minimization of $F_{\delta}(x)$. To the best of our knowledge, this is the first instance of a benign landscape being identified for a robust objective function in the domain of noiseless phase retrieval, as the previously discussed works [11] and [10] have only considered the local behaviors of optimization algorithms and the locations of stationary points.

Thirdly, in the presence of corrupted measurements, we prove that when $p_{\text{fail}}$ is small and bounded noises are absent, with high probability, the landscape of $F_{\delta}(x)$ remains benign outside a vicinity around $\{x_{\star},-x_{\star}\}$ with a radius proportional to $\delta p_{\text{fail}}/(1-p_{\text{fail}})$. Similar results hold for the general Model (2). To the best of our knowledge, this represents the first result on benign function landscapes in the context of phase retrieval with corruptions.

The rest of this paper is organized as follows. Section 2 introduces the methodology of SRPR. Section 3 focuses on the noiseless landscape analysis. Section 4 focuses on the empirical landscapes with corruption. Section 5 presents the numerical experiments. Section 6 includes a few concluding remarks. The complete proofs are presented in the supplement.

2 Methodology

This section first presents the general framework for smoothed robust phase retrieval in Subsection 2.1 and then discusses the convolution-type smoothed loss functions in Subsection 2.2. Subsection 2.3 presents the properties of the smoothed objective function. Subsection 2.4 discusses the optimization algorithms.

Before proceeding, we introduce some useful notation. Denote by $\mathbf{1}[\cdot]$ the indicator function. Let $B_{p}(x,M)=\{y\in\mathbb{R}^{p}:\|y-x\|_{2}\leq M\}$. Define $\mathbb{E}_{s}A=\frac{1}{s}\sum_{i=1}^{s}A_{i}$, and let $I_{p}$ denote the $p\times p$ identity matrix. Let $e_{i}\in\mathbb{R}^{p}$ represent the vector whose $i$-th element is 1 with all other elements taking the value 0. In addition, throughout the paper, we use $c,c^{\prime},C,C^{\prime},\ldots$ to denote generic positive constants, though the actual values may vary on different occasions.

2.1 A General Framework

In the robust phase retrieval problem (2), we observe vectors $\{a_{i}\}$ and non-negative scalars $\{b_{i}\}$ and aim to recover the true phases $\{x_{\star},-x_{\star}\}$. (Due to the quadratic relationship in (2), both $x_{\star}$ and $-x_{\star}$ are true signal vectors.)

Assumption 1.

Let $a_{1},a_{2},\ldots,a_{n_{1}+n_{2}}$ be random vectors in $\mathbb{R}^{p}$ that are independently and identically distributed with mean zero and covariance matrix $\Sigma_{p}$, satisfying $u^{\top}\Sigma_{p}u\leq\lambda_{M}\|u\|_{2}^{2}$ for all $u\in\mathbb{R}^{p}$, where $\lambda_{M}$ is a positive constant. Here, $n_{1},n_{2}$ are non-negative integers, and $p\geq 3$ is a positive integer. The true phase $x_{\star}\in\mathbb{R}^{p}\setminus\{0\}$ and

b_{i}=[(a_{i}^{\top}x_{\star})^{2}+\tau_{i}]\mathbf{1}[i\in[n_{1}]]+\xi_{i}\mathbf{1}[i\in[n_{1}+n_{2}]-[n_{1}]],\quad\forall\ i\in[n_{1}+n_{2}].\qquad(4)

The $\tau_{i}$'s are random variables in $\mathbb{R}$ with $|\tau_{i}|\leq\gamma$ for any $i\in[n_{1}]$. The $\xi_{i}$'s are non-negative random variables without further assumptions.

In Assumption 1, $n=n_{1}+n_{2}$ represents the total number of measurements, and $p_{\text{fail}}=\frac{n_{2}}{n_{1}+n_{2}}$ denotes the proportion of corrupted measurements. The sets $\mathbb{I}_{1}=[n_{1}]$ and $\mathbb{I}_{2}=[n]-[n_{1}]$ correspond to the inliers and outliers among the measurements, respectively. In the special case where $\gamma=p_{\text{fail}}=0$, the model reverts to the noiseless phase retrieval scenario depicted in (1). Our goal is to recover the true signal vectors $x_{\star}$ or $-x_{\star}$ from the observed $\{(a_{i},b_{i})\}_{i=1}^{n}$, without prior knowledge of which measurements are corrupted. This setup ensures the mutual independence of $\{a_{i},i\in[n_{1}]\}$ and $\{a_{i},i\in[n_{1}+n_{2}]-[n_{1}]\}$. We will introduce stronger assumptions on the distribution of $a_{i}$ later. The bounded noise variables $\tau_{i}$ require no additional distributional assumptions, whereas no assumptions are made regarding the distribution of $\{\xi_{i}\}_{i=1}^{n}$ or its independence with respect to $\{a_{i}\}$, allowing for potentially heavy-tailed and adversarially distributed corruption. This assumption is more general than the model assumption $M_{2}$ of [11], which requires independence between $\{\xi_{i},i\in[n_{1}+n_{2}]-[n_{1}]\}$ and $\{a_{i},i\in[n_{1}]\}$. For simplicity, in subsequent discussions, we will use a random vector $a$ to denote the generic distribution of the $a_{i}$'s.

As shown in [11], when $\gamma=0$, exact recovery is achievable in the optimization problem (3) employing the $\ell_{1}$-loss, indicating that $\{x_{\star},-x_{\star}\}=\operatorname*{arg\,min}_{x\in\mathbb{R}^{p}}F(x)$ with high probability under certain assumptions. To refine this approach, we employ convolution-type smoothing, focusing on the problem formulated as:

\min_{x\in\mathbb{R}^{p}}F_{\delta}(x)=\frac{1}{n}\sum_{i=1}^{n}l_{\delta}\left((a_{i}^{\top}x)^{2}-b_{i}\right),\qquad(5)

where

l_{\delta}(x)=(l*K_{\delta})(x)=\int_{-\infty}^{+\infty}|y|K_{\delta}(x-y)\,dy,\qquad(6)

$K(x)$ represents a kernel function, $\delta>0$ is the bandwidth, $*$ denotes the convolution operation, and $K_{\delta}(x)=\frac{1}{\delta}K(\frac{x}{\delta})$. In this context, $l_{\delta}(x)$ serves as a smoothed loss function, substituting for $l(x)=|x|$ in $F(x)$. The adoption of convolution-type smoothing is motivated by gradient approximation: $\nabla F_{\delta}(x)$ closely mirrors the subgradient of $F(x)$ because $\lim_{\delta\rightarrow 0}l^{\prime}_{\delta}(x)=\text{sign}(x)\in\partial|x|$, suggesting that the statistical behavior of $F_{\delta}(x)$ will approximate that of $F(x)$. Moreover, the smoothness of $F_{\delta}(x)$ significantly benefits gradient-based optimization and landscape analysis. Under mild conditions on the kernel function $K(x)$, both $l^{\prime}_{\delta}(x)$ and $l^{\prime\prime}_{\delta}(x)$ are well-defined, ensuring that $F_{\delta}(x)$ is smooth and equipped with gradients and Hessians for both optimization and landscape analysis purposes.
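As a concrete illustration, the Python sketch below evaluates $F_{\delta}(x)$ in (5) and its gradient for the Gaussian kernel, for which $l^{\prime}_{\delta}(x)=2\Phi(x/\delta)-1$ (see Subsection 2.2). The function names are ours, and this is only a sketch of the construction under the stated kernel choice, not the authors' implementation.

import numpy as np
from scipy.stats import norm

def l_delta(r, delta):
    """Smoothed l1-loss (6) for the Gaussian kernel: l_delta(r) = delta * lG(r/delta)."""
    u = r / delta
    return delta * (np.sqrt(2 / np.pi) * np.exp(-u**2 / 2) + u * (1 - 2 * norm.cdf(-u)))

def F_delta(x, A, b, delta):
    """Objective (5): the average of l_delta((a_i^T x)^2 - b_i)."""
    return np.mean(l_delta((A @ x) ** 2 - b, delta))

def grad_F_delta(x, A, b, delta):
    """Gradient of (5): (1/n) * sum_i 2 * l'_delta(r_i) * (a_i^T x) * a_i,
    with l'_delta(r) = 2 * Phi(r/delta) - 1 for the Gaussian kernel."""
    Ax = A @ x
    w = 2 * (2 * norm.cdf((Ax**2 - b) / delta) - 1) * Ax
    return A.T @ w / len(b)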

2.2 Convolution-type Smoothing

In this subsection, we elaborate on the convolution-type smoothed loss function defined in (6). Additionally, we introduce the notations $\tilde{K}(x)=\int_{-\infty}^{x}K(y)\,dy$ and $\tilde{K}_{\delta}(x)=\tilde{K}(x/\delta)$ for the integrated kernel function and its scaled version, respectively. To facilitate our analysis, we adhere to the following assumption on the kernel function $K(x)$ throughout this paper:

Assumption 2.

The mapping $K:\mathbb{R}\rightarrow\mathbb{R}$ satisfies the following conditions:

(a) $K(\cdot)$ is a kernel, namely, $K(x)\geq 0$, $K(x)=K(-x)$ for all $x\in\mathbb{R}$, and $\int_{-\infty}^{\infty}K(x)\,dx=1$.

(b) $K(x)$ is bounded and Lipschitz continuous, and $K(0)>0$.

(c) Let $\bar{K}(x)=xK(x)$. Then $\bar{K}(x)$ is bounded and Lipschitz continuous, and $\int_{-\infty}^{+\infty}\left|\bar{K}(x)\right|\,dx<\infty$.

Next, we give some examples of kernels $K(x)$ that satisfy Assumption 2; a short numerical sketch of the induced smoothed losses follows the list.

• Gaussian kernel $K(x)=(2\pi)^{-1/2}e^{-x^{2}/2}$, which generates $l_{\delta}(x)=\delta l^{\mathrm{G}}(x/\delta)$ with $l^{\mathrm{G}}(x):=(2/\pi)^{1/2}e^{-x^{2}/2}+x\{1-2\Phi(-x)\}$ and $\Phi(\cdot)$ being the CDF of $N(0,1)$.

• Logistic kernel $K(x)=e^{-x}/\left(1+e^{-x}\right)^{2}$, which generates $l_{\delta}(x)=\delta l^{\mathrm{L}}(x/\delta)$ with $l^{\mathrm{L}}(x)=x+2\log\left(1+e^{-x}\right)$.

• Epanechnikov kernel $K(x)=(3/4)\left(1-x^{2}\right)\mathbf{1}(|x|\leq 1)$, which generates $l_{\delta}(x)=\delta l^{\mathrm{E}}(x/\delta)$ with $l^{\mathrm{E}}(x)=\left(3x^{2}/4-x^{4}/8+3/8\right)\mathbf{1}(|x|\leq 1)+|x|\mathbf{1}(|x|>1)$.

• Triangular kernel $K(x)=(1-|x|)\mathbf{1}(|x|\leq 1)$, which generates $l_{\delta}(x)=\delta l^{\mathrm{T}}(x/\delta)$ with $l^{\mathrm{T}}(x)=(x^{2}-|x|^{3}/3+1/3)\mathbf{1}(|x|\leq 1)+|x|\mathbf{1}(|x|>1)$.

• $K(x)=\frac{1}{2(x^{2}+1)^{3/2}}$, which generates the pseudo-Huber loss $l_{\delta}(x)=\sqrt{x^{2}+\delta^{2}}$.
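The sketch below (our own code, not from the paper) implements several of the closed-form smoothed losses above and checks that each is close to $|x|$ when $|x|$ is large relative to $\delta$.

import numpy as np
from scipy.stats import norm

def l_gaussian(x, delta):
    u = x / delta
    return delta * (np.sqrt(2 / np.pi) * np.exp(-u**2 / 2) + u * (1 - 2 * norm.cdf(-u)))

def l_logistic(x, delta):
    u = x / delta
    return delta * (u + 2 * np.logaddexp(0.0, -u))   # u + 2*log(1 + e^{-u}), overflow-safe

def l_epanechnikov(x, delta):
    u = x / delta
    return delta * np.where(np.abs(u) <= 1, 3 * u**2 / 4 - u**4 / 8 + 3 / 8, np.abs(u))

def l_pseudo_huber(x, delta):
    return np.sqrt(x**2 + delta**2)

x = np.array([2.0])
for f in (l_gaussian, l_logistic, l_epanechnikov, l_pseudo_huber):
    print(f.__name__, f(x, 0.1))   # each value is approximately |x| = 2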

In practical implementations, $\delta$ ought to be a small constant so that $l_{\delta}(x)$ approximates the $\ell_{1}$-loss, following the ideas introduced in smoothed quantile regression [16]. Under Assumption 2, we can derive several properties of $l_{\delta}(x)$, which are briefly discussed here for space considerations; more details are summarized in a lemma in the appendix of the supplement. Similar to [16], we prove that $l_{\delta}(x)$ is convex and has continuous first-order and second-order derivatives, namely, $l^{\prime}_{\delta}(x)=2\tilde{K}_{\delta}(x)-1$ and $l^{\prime\prime}_{\delta}(x)=2K_{\delta}(x)$. This ensures the existence of the gradient and Hessian of $F_{\delta}(x)$. Moreover, note that $l_{\delta}(x)-l_{\delta}(0)\geq\min\left\{\frac{C_{1}}{\delta}x^{2},C_{2}|x|\right\}$ for all $x\in\mathbb{R}$ and some positive constants $C_{1}$ and $C_{2}$. This relation suggests that $l_{\delta}(x)$ behaves like the $\ell_{2}$-loss for values of $x$ near 0 and resembles the $\ell_{1}$-loss for values of $x$ far from 0. This bound is the foundation for the generalized sharpness condition on $F_{\delta}(x)$ in the absence of noise, a topic elaborated upon in the next subsection.

2.3 Properties of the Smoothed Objective Function

In this subsection, we discuss the properties of $F_{\delta}(x)$, which indicate the similarities between $F(x)$ and $F_{\delta}(x)$ and explain how minimizing $F_{\delta}(x)$ works for robust phase retrieval. We first introduce two parallel assumptions given in [11]. Assumption 3 states that the distribution of $a_{i}$ is non-degenerate in all directions, and Assumption 4 describes the tail behavior of the distribution of $a_{i}$.

Assumption 3.

There exist two positive constants $\kappa_{st}$ and $p_{st}$ such that, for any $i\in[n]$ and $u,v\in\mathbb{R}^{p}$ with $\|u\|_{2}=\|v\|_{2}=1$, we have $P\left(\min\{|a_{i}^{\top}u|,|a_{i}^{\top}v|\}\geq\kappa_{st}\right)\geq p_{st}$.

Assumption 4.

The vectors $a_{i}$ are sub-Gaussian with parameter $c_{s}\geq 1$, which means that

\inf\left\{t>0:\mathbb{E}\exp\left((a_{i}^{\top}u)^{2}/t^{2}\right)\leq 2\right\}\leq c_{s}\sqrt{u^{\top}\Sigma_{p}u},\quad\forall\ i\in[n],\ u\in\mathbb{R}^{p}.

Under the same assumptions, we can prove two properties of $F_{\delta}(x)$ that are comparable to those of $F(x)$. The first is generalized sharpness.

Lemma 1.

(Generalized Sharpness) Under Assumptions 1, 2 and 3, when $p_{\text{fail}}=\gamma=0$ and $n\geq cp$, with probability at least $1-C\exp(-c^{\prime}n)$, we have

F_{\delta}(x)-F_{\delta}(x_{\star})\geq\lambda_{s}\min\left\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\right\},\quad\forall x\in\mathbb{R}^{p},\qquad(7)

where $\Delta(x)=\min\{\|x-x_{\star}\|_{2},\|x+x_{\star}\|_{2}\}$, and $c,c^{\prime},C,\lambda_{s}$ are constants independent of $\delta$.

In the literature, the sharpness of a function $f(x)$ bounds the difference between $f(x)$ and its global minimum from below by the distance between $x$ and its nearest global minimum point. This concept is prevalent in optimization problems with an equation-solving structure and nonsmooth objective functions; for more details, please see Section 1 of [29]. The generalized sharpness condition extends the sharpness of $F(x)$, as outlined in Corollary 3.1 of [11], which is the special case of (7) when $\Delta(x)\geq\delta$:

F(x)-F(x_{\star})\geq\lambda_{s}\Delta(x),\quad\forall x\in\mathbb{R}^{p}.\qquad(8)

[11] and [12] proved that (8) holds with high probability for the noiseless robust phase retrieval. [11], [9], [10], and [38] leveraged (8) to show the convergence rate for their algorithms.

The second property is weak convexity, which relies on Assumption 4.

Lemma 2.

(Weak Convexity) Given Assumptions 1, 2, and 4, and assuming $n\geq cp$, the following inequality is satisfied with probability at least $1-C\exp(-c^{\prime}n)$:

F_{\delta}(y)-F_{\delta}(x)\geq\nabla F_{\delta}(x)^{\top}(y-x)-\frac{\rho}{2}\|y-x\|_{2}^{2},\quad\forall x,y\in\mathbb{R}^{p}.\qquad(9)

Here, $\rho,C,c,c^{\prime}$ are positive constants independent of $\delta$, $K(x)$, and the value of $p_{\text{fail}}$.

This relationship also holds when substituting $F(x)$ for $F_{\delta}(x)$ and replacing the gradient with the subgradient. Weak convexity, a notion extensively recognized in the optimization literature, generalizes convexity: it requires that $f(x)+\frac{\rho}{2}\|x\|_{2}^{2}$ be convex for a certain positive constant $\rho$. Various definitions of weak convexity appear in the literature; the formulation in Lemma 2 aligns with that of [10]. Furthermore, Corollary 3.2 in [11] defines weak convexity by bounding the deviation between $F(x)$ and its linear approximation, requiring the existence of a positive constant $\rho>0$ such that:

\left|F(x)-\frac{1}{n}\sum_{i=1}^{n}\left|(a_{i}^{\top}y)^{2}-b_{i}+2(a_{i}^{\top}y)a_{i}^{\top}(x-y)\right|\right|\leq\frac{\rho}{2}\|y-x\|_{2}^{2},\quad\forall x,y\in\mathbb{R}^{p}.

Next, in the case $\gamma=0$, we discuss how generalized sharpness and weak convexity lead to the robustness of using $F_{\delta}(x)$. As $l_{\delta}(x)$ is an approximation of $l(x)=|x|$, we can expect approximation in function values, namely $\sup_{x\in\mathbb{R}}\left|l_{\delta}(x)-|x|\right|\leq C_{0}\delta$ for some positive constant $C_{0}$. Together with the sharpness of $F(x)$ given in (8), we can conclude that, for some positive constant $C_{1}$,

\Delta(\bar{x})\leq C_{1}\delta,\quad\forall\ \bar{x}\in\operatorname*{arg\,min}_{x\in\mathbb{R}^{p}}F_{\delta}(x).

This inequality shows the closeness of the minimizers of $F(x)$ and $F_{\delta}(x)$. Moreover, our local landscape analysis, which aims at identifying and categorizing stationary points near the true signal vectors, allows for a finer characterization of local regions potentially containing a stationary point. When $p_{\text{fail}}=0$, and under the conditions specified in (7) and (9), it is deduced that:

-\lambda_{s}\min\left\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\right\}\geq F_{\delta}(x_{\star})-F_{\delta}(x)\geq\nabla F_{\delta}(x)^{\top}(x_{\star}-x)-\frac{\rho}{2}\Delta^{2}(x).

Therefore, when $\delta$ is sufficiently small, there exist positive constants $C_{2},C_{3}$ such that

\nabla F_{\delta}(x)^{\top}(x-x_{\star})\geq C_{2}\min\left\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\right\},\quad\forall x\in\mathbb{R}^{p}\ \text{s.t.}\ \Delta(x)\leq C_{3},\ \|x-x_{\star}\|_{2}=\Delta(x).\qquad(10)

Similar conclusions hold for the symmetric region where $\Delta(z)=\|z+x_{\star}\|_{2}$. So, within $R_{0}:=\{z\in\mathbb{R}^{p}:\Delta(z)\leq C_{3}\}$, we can claim that the true signal vectors are stationary points and local minima.

In scenarios involving corruptions with $\gamma=0$, we employ the decomposition

F_{\delta}(x)=(1-p_{\text{fail}})F_{\delta,1}(x)+p_{\text{fail}}F_{\delta,2}(x),

where $F_{\delta,i}(x)$, for $i=1,2$, represents the sample mean of $l_{\delta}\left((a_{i}^{\top}x)^{2}-b_{i}\right)$ over the noiseless and corrupted measurements, respectively. This allows us to express the gradient as

\nabla F_{\delta}(x)^{\top}(x-x_{\star})=(1-p_{\text{fail}})\nabla F_{\delta,1}(x)^{\top}(x-x_{\star})+p_{\text{fail}}\nabla F_{\delta,2}(x)^{\top}(x-x_{\star}).

The first term on the right-hand side can be bounded in a manner akin to (10). For the second term, we will show in Section 4 that, with high probability, its absolute value can be uniformly bounded by $C_{4}p_{\text{fail}}\|x-x_{\star}\|_{2}$ within $R_{0}$, where $C_{4}$ is a positive constant independent of $\delta$ and $p_{\text{fail}}$. This quantifies the impact corrupted measurements have on the gradient. When $p_{\text{fail}}$ is sufficiently small, these bounds suggest defining the vicinity $R_{1}$ around the true signal vectors within $R_{0}$ as

R_{1}=\{x\in\mathbb{R}^{p}:\Delta(x)\leq r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})\},

for a positive constant $r_{0}$, such that no stationary points are found in $R_{0}\setminus R_{1}$. Therefore, if an algorithm minimizing $F_{\delta}(x)$ converges to a local minimum $x_{0,\star}$ within $R_{0}$, that minimum must be located in $R_{1}$. This illustrates the robustness of employing $F_{\delta}(x)$: the algorithm converges to a point near the true signal vectors. In a noiseless context, this point coincides with one of the true signal vectors. In cases with corruption, and when $\delta p_{\text{fail}}/(1-p_{\text{fail}})$ is small, the convergence point is close to either $x_{\star}$ or $-x_{\star}$, serving as an effective warm start for the IPL algorithm, which minimizes $F(x)$ with notable local efficiency and thereby exactly finds one of the true signal vectors.

Addressing robustness alone is not sufficient given the nonconvex nature of $F_{\delta}(x)$. We also need to demonstrate the absence of spurious local minima outside the region $R_{0}$ to ensure the efficacy of algorithms with random initialization. In Sections 3 and 4, we conduct landscape analyses to identify $R_{0}$ and to illustrate the benign landscape characteristics beyond $R_{0}$.

Remark: Inner product quantities akin to $\nabla F_{\delta}(x)^{\top}(x-x_{\star})$, which involve the interaction between the (sub)gradient and the deviation of $x$ from the true vector, are pivotal in the landscape analysis of various nonconvex formulations [10, 24, 32, 8]. Specifically, [24] leveraged such quantities in the global landscape analysis of single-index models under the essential assumption of a strictly monotone link function. Given that the squared link function in phase retrieval does not satisfy this monotonicity requirement, the usage of inner product quantities in our paper, [32], and [8] is confined to local vicinities. In these works, nonlocal landscape analyses are conducted through the examination of Hessian-related quantities. A distinctive aspect of our study is the incorporation of corruption into the analysis, setting it apart from the existing literature. The comprehensive details of this approach will be elaborated upon in Section 4.

Finally, we state the following normality assumption, which is stronger than both Assumptions 3 and 4.

Assumption 5.

For any $i\in[n]$, $a_{i}\sim N(0,I_{p})$.

Assumption 5 was used in [11, 10, 32, 8] to study initialization and landscape analysis for phase retrieval problems.

2.4 Algorithms and Rate of Convergence

In this subsection, we discuss algorithms for minimizing $F_{\delta}(x)$. We will show that, under the noiseless situation, with high probability, gradient descent on $F_{\delta}(x)$ enjoys local linear convergence. We first give a lemma establishing a Lipschitz constant $\mu_{max}/\delta$ for $\nabla F_{\delta}(x)$ with high probability.

Lemma 3.

Under Assumptions 1, 2, and 5, when $p_{\text{fail}}=\gamma=0$, $\|x_{\star}\|_{2}=1$, and $p\geq 3$, there exist positive constants $\mu_{max},c,c^{\prime},C$ such that, if $n\geq cp\log p$, then with probability at least $1-C\exp(-c^{\prime}n/\log n)-C/n$, we have

\sup_{x\in\mathbb{R}^{p}}\left\|\partial^{2}F_{\delta}(x)/\partial x\partial x^{\top}\right\|_{2}\leq\mu_{max}/\delta.\qquad(11)

Here, $\mu_{max}$ does not depend on $\delta$, but $c,c^{\prime},C$ depend on $\delta$.

Given the above lemma, the local linear convergence of gradient descent can be established, as shown in Theorem 1.

Theorem 1.

Suppose that (7), (9) and (11) hold with $\lambda_{s},\rho,\mu_{max}>0$ and $p_{\text{fail}}=\gamma=0$, and that $\{x_{k}\}_{k=0}^{\infty}\subseteq\mathbb{R}^{p}$ is generated by the gradient descent algorithm that iterates as follows:

x_{k+1}=x_{k}-t\nabla F_{\delta}(x_{k}),\quad k=0,1,2,\ldots

with $t=\delta/\mu_{max}$. If $\Delta(x_{0})\leq\min\left\{1,\sqrt{\frac{2\lambda_{s}}{\mu_{max}}}\right\}\delta$ and $0<\delta\leq\lambda_{s}/\rho$, we have

F_{\delta}(x_{k+1})-F_{\delta}(x_{\star})\leq\left(1-\lambda_{s}/(8\mu_{max})\right)\left(F_{\delta}(x_{k})-F_{\delta}(x_{\star})\right),\quad\forall k\in\mathbb{N}.\qquad(12)

In (12), if we let $k=0$ and choose $\Delta(x_{0})>0$, we have $\lambda_{s}/(8\mu_{max})\leq 1$. Thus, we can define $\kappa:=(8\mu_{max})/\lambda_{s}\geq 1$ as the condition number. It is worth pointing out that $\kappa$ is not related to $\delta$ or $p$, so the convergence rate is statistically stable.
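The iteration in Theorem 1 is plain gradient descent with the fixed step $t=\delta/\mu_{max}$. Below is a minimal sketch, assuming a Gaussian kernel and treating mu_max as a user-chosen constant (the constant in Lemma 3 is not explicit); the names are ours, and this is not the authors' implementation.

import numpy as np
from scipy.stats import norm

def grad_F_delta(x, A, b, delta):
    # Gradient of (5) for the Gaussian kernel: l'_delta(r) = 2*Phi(r/delta) - 1.
    Ax = A @ x
    return A.T @ (2 * (2 * norm.cdf((Ax**2 - b) / delta) - 1) * Ax) / len(b)

def srpr_gd(x0, A, b, delta, mu_max=10.0, n_iter=2000):
    """Theorem 1 iteration: x_{k+1} = x_k - t * grad F_delta(x_k), t = delta/mu_max."""
    x, t = x0.copy(), delta / mu_max
    for _ in range(n_iter):
        x = x - t * grad_F_delta(x, A, b, delta)
    return x

In practice, a line search can replace the fixed step; the theorem only requires the step to respect the Lipschitz bound (11).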

Next, we comment on the technical details used to reach this result. Linear convergence usually appears in strongly convex situations. When investigating the Hessian matrix $\frac{\partial^{2}F_{\delta}(x)}{\partial x\partial x^{\top}}$, $\frac{\mu_{max}}{\delta}$ provides an upper bound for $\left\|\frac{\partial^{2}F_{\delta}(x)}{\partial x\partial x^{\top}}\right\|_{2}$. However, it is hard to find a positive lower bound for the smallest eigenvalue of the Hessian matrix, even within the local vicinity of the true signal vectors. So, we turn to the Polyak-Lojasiewicz condition introduced in [21], which yields linear convergence of the gradient descent algorithm for minimizing a function $f(x)$ under the assumption $\frac{1}{2}\|\nabla f(x)\|_{2}^{2}\geq\mu\left(f(x)-\min_{x}f(x)\right)$ with $\mu>0$. We can prove that when (7), (9) and (11) hold with $\lambda_{s},\rho,\mu_{max}>0$, for any $\delta>0$ that is small enough,

\frac{1}{2}\|\nabla F_{\delta}(x)\|_{2}^{2}\geq\frac{\mu_{min}}{\delta}\left(F_{\delta}(x)-F_{\delta}(x_{\star})\right),\quad\forall x\in\{x\in\mathbb{R}^{p}:\Delta(x)\leq C\delta\},

in which $\mu_{min}$ and $C$ are positive constants that do not depend on the selection of $\delta$. With these conditions, we can obtain the convergence rate.

Linear convergence guarantees that minimizing the smoothed objective function $F_{\delta}(x)$ is not only sufficient for phase retrieval in the noiseless scenario but also more efficient than PL [11] and IPL [38]. This improvement stems from the methodologies adopted by PL, which solves its subproblem using the POGS algorithm [28], and IPL, which solves the same subproblem through accelerated proximal gradient descent tailored to the dual problem; both are characterized by sub-linear convergence rates. Furthermore, a local linear convergence rate has been established for various phase retrieval algorithms, including the Wirtinger flow [4], truncated Wirtinger flow [7], truncated amplitude flow [35], and subgradient methods [9, 10]. Smoothed robust phase retrieval thus joins this list of methodologies. In addition, Theorem 1 and (7) indicate that, for any $\varepsilon>0$, vanilla gradient descent on the smoothed robust objective function $F_{\delta}(x)$ can find an $x$ such that $\Delta(x)\leq\varepsilon$ within at most $O(\kappa\log\frac{1}{\varepsilon})$ iterations. This is better than the $O(\kappa^{2}\log\frac{1}{\varepsilon})$ iterations of the subgradient algorithms [9, 10] using the robust $\ell_{1}$-loss. In our numerical experiments in Section 5, we employ monotone accelerated gradient descent (as outlined in Algorithm 1 of [22]) combined with line search techniques to achieve better overall efficiency. These experiments demonstrate that smoothed robust phase retrieval exhibits superior numerical efficiency on noiseless datasets when compared to both PL and IPL.

2.5 Overview of Benign Landscape

In this subsection, we provide an overview of the benign landscape of $F_{\delta}(x)$. Its gradient and Hessian matrix help us find and classify the stationary points, and we discuss the different kinds of benign landscapes studied in this paper.

(a) Noiseless population landscape. In this case, $p_{\text{fail}}=\gamma=0$ and we focus on the stationary points of $\mathbb{E}\left[F_{\delta}(x)\right]$. Under some assumptions, we exactly locate all the stationary points and prove that only $\{x_{\star},-x_{\star}\}$ are local minima.

(b) Noiseless empirical landscape. In this case, $p_{\text{fail}}=\gamma=0$ and we care about the stationary points of $F_{\delta}(x)$. Under some assumptions, we show that, with high probability, the stationary points are close to those of the noiseless population landscape, and only $\{x_{\star},-x_{\star}\}$ are local minima.

(c) Empirical landscape with infrequent corruptions only. In this case, $p_{\text{fail}}>0,\gamma=0$ and we study the stationary points of $F_{\delta}(x)$. Under some assumptions, we show that the stationary points are close to those of the noiseless population landscape and all the local minima are close to $\{x_{\star},-x_{\star}\}$ with high probability. Specifically, all the local minima lie in the set $\left\{x\in\mathbb{R}^{p}:\Delta(x)\leq r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})\right\}$, in which $\Delta(x)=\mathrm{dist}(x,\{x_{\star},-x_{\star}\})$ and $r_{0}$ is a positive constant. Thus, if an optimization algorithm finds a local minimum of $F_{\delta}(x)$, its gap to $\{x_{\star},-x_{\star}\}$ is small when both $\delta$ and $p_{\text{fail}}$ are small.

(d) Empirical landscape with infrequent corruptions and bounded noises. When $p_{\text{fail}}>0,\gamma>0$, similar results can be obtained: when $\delta,\gamma,p_{\text{fail}}$ are small enough and $\delta\geq\Omega(\sqrt{\gamma})$, with high probability, all the local minima lie in the set $\{x\in\mathbb{R}^{p}:\Delta(x)\leq\tilde{r}_{0}\max\{\sqrt{\delta\gamma},\delta p_{\text{fail}}/(1-p_{\text{fail}})\}\}$, where $\tilde{r}_{0}$ is a positive constant.

Now, we describe the procedure of smoothed robust phase retrieval. We first apply a gradient-based algorithm to minimize $F_{\delta}(x)$ and find a local minimum $x_{0,\star}$. Due to the benign landscapes described above, we expect both random initialization and modified spectral initialization to work well, so that $x_{0,\star}$ is close to $x_{\star}$ or $-x_{\star}$. Second, in the noiseless situation, we can claim that $x_{0,\star}$ is already a true signal vector. In the situation with noise, the benign landscape guarantees that $x_{0,\star}$ is close to one of the true signal vectors, and it is treated as a warm start for IPL, which minimizes $F(x)$ for exact recovery.

3 Landscape Analysis in the Noiseless Case

In this section, we focus on the noiseless case with $p_{\text{fail}}=\gamma=0$ and analyze both the population and empirical landscapes. Without loss of generality, we assume throughout this section that $\|x_{\star}\|_{2}=1$.

3.1 Notations

Denote $L(x;a)=l_{\delta}((a^{\top}x)^{2}-(a^{\top}x_{\star})^{2})$ for $x,a\in\mathbb{R}^{p}$. The gradient vector is $\frac{\partial L(x;a)}{\partial x}=\left\{2l^{\prime}_{\delta}\left((a^{\top}x)^{2}-(a^{\top}x_{\star})^{2}\right)aa^{\top}\right\}x$ and the Hessian matrix is $\frac{\partial^{2}L(x;a)}{\partial x\partial x^{\top}}=2l^{\prime}_{\delta}((a^{\top}x)^{2}-(a^{\top}x_{\star})^{2})aa^{\top}+4(a^{\top}x)^{2}l^{\prime\prime}_{\delta}((a^{\top}x)^{2}-(a^{\top}x_{\star})^{2})aa^{\top}$. The empirical and population versions of the objective function can be written as $F_{\delta}(x)=\mathbb{E}_{n}\left[L(x;a)\right]$ and $\mathbb{E}\left[F_{\delta}(x)\right]=\mathbb{E}\left[L(x;a)\right]$. They have continuous gradient vectors and Hessian matrices: $\nabla F_{\delta}(x)=\mathbb{E}_{n}\left[\nabla L(x;a)\right]$, $\nabla\mathbb{E}\left[F_{\delta}(x)\right]=\mathbb{E}\left[\nabla L(x;a)\right]$, $\frac{\partial^{2}}{\partial x\partial x^{\top}}F_{\delta}(x)=\mathbb{E}_{n}\left[\frac{\partial^{2}}{\partial x\partial x^{\top}}L(x;a)\right]$, and $\frac{\partial^{2}}{\partial x\partial x^{\top}}\mathbb{E}\left[F_{\delta}(x)\right]=\mathbb{E}\left[\frac{\partial^{2}}{\partial x\partial x^{\top}}L(x;a)\right]$. We will use the Hessian matrix to classify the stationary points and establish the benign noiseless population landscape for $\mathbb{E}\left[F_{\delta}(x)\right]$ and the empirical landscape for $F_{\delta}(x)$.

Next, we introduce three quantities that are used to characterize the landscape. Define

U_{1}(x;a)=x^{\top}\frac{\partial L(x;a)}{\partial x}=2(a^{\top}x)^{2}l^{\prime}_{\delta}((a^{\top}x)^{2}-(a^{\top}x_{\star})^{2}),

U_{2}(x;a)=x_{\star}^{\top}\left[\frac{\partial^{2}L(x;a)}{\partial x\partial x^{\top}}\right]x_{\star},

and

U_{3}(x;a)=x_{\star}^{\top}\frac{\partial L(x;a)}{\partial x}=2(a^{\top}x_{\star})l^{\prime}_{\delta}((a^{\top}x)^{2}-(a^{\top}x_{\star})^{2})(a^{\top}x).

Based on Assumption 5, it is easy to see that, for any given $x_{0}\in\mathbb{R}^{p}$ satisfying $\|x_{0}\|_{2}>0$, we have $z^{\top}\frac{\partial\mathbb{E}\left[F_{\delta}\right]}{\partial x}(x_{0})=0$ for all $z\in\{x\in\mathbb{R}^{p}:x^{\top}x_{0}=x^{\top}x_{\star}=0\}$. This implies that most directional derivatives in $\mathbb{R}^{p}$ vanish, and we only need to consider the remaining two directions. To be specific,

\mathbb{E}\left[U_{1}(x;a)\right]=\mathbb{E}\left[U_{3}(x;a)\right]=0\iff\frac{\partial\mathbb{E}\left[F_{\delta}(x)\right]}{\partial x}=0.

For $U_{2}(x;a)$, it is easy to see that $\mathbb{E}\left[U_{2}(x_{0};a)\right]<0$ is sufficient for concluding that the Hessian matrix of $\mathbb{E}\left[F_{\delta}(x)\right]$ at $x_{0}$ has at least one negative eigenvalue, which indicates that $x_{0}$ is not a local minimum. Similar properties hold for the empirical counterparts:

\mathbb{E}_{n}\left[U_{1}(x;a)\right]\neq 0\ \text{ or }\ \mathbb{E}_{n}\left[U_{3}(x;a)\right]\neq 0\implies\frac{\partial\mathbb{E}_{n}\left[F_{\delta}\right]}{\partial x}(x)\neq 0.

\mathbb{E}_{n}\left[U_{2}(x;a)\right]<0\implies x\ \text{ is not a local minimum of }\ F_{\delta}(x).
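For intuition, the following sketch (our own code, assuming a Gaussian kernel and noiseless measurements $b_{i}=(a_{i}^{\top}x_{\star})^{2}$) evaluates the empirical diagnostics $\mathbb{E}_{n}[U_{1}]$, $\mathbb{E}_{n}[U_{2}]$, and $\mathbb{E}_{n}[U_{3}]$ at a candidate point, using $l^{\prime}_{\delta}(r)=2\Phi(r/\delta)-1$ and $l^{\prime\prime}_{\delta}(r)=2K_{\delta}(r)$.

import numpy as np
from scipy.stats import norm

def U_diagnostics(x, x_star, A, delta):
    """Empirical means of U_1, U_2, U_3 at x for noiseless data."""
    Ax, Axs = A @ x, A @ x_star
    r = Ax**2 - Axs**2
    lp = 2 * norm.cdf(r / delta) - 1          # l'_delta(r)
    lpp = 2 * norm.pdf(r / delta) / delta     # l''_delta(r) = 2 * K_delta(r)
    U1 = 2 * Ax**2 * lp                       # x^T (dL/dx)
    U2 = Axs**2 * (2 * lp + 4 * Ax**2 * lpp)  # x_star^T (Hessian of L) x_star
    U3 = 2 * Axs * lp * Ax                    # x_star^T (dL/dx)
    return U1.mean(), U2.mean(), U3.mean()

Evaluating these means at points orthogonal to $x_{\star}$, for example, numerically mirrors the population analysis in the next subsection.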

3.2 Population Landscape

In this subsection, we give the conclusions for the stationary points of $\mathbb{E}\left[F_{\delta}(x)\right]$. Denote

S_{0}=\{x\in\mathbb{R}^{p}:x^{\top}x_{\star}=0,\|x\|_{2}=u(\delta)\}

for some positive constant $u(\delta)$, whose rigorous definition will be given in Theorem 2 below. To aid understanding, we also provide a graphical illustration in Figure 1. We express vectors in the phase space $\mathbb{R}^{p}$ in a two-dimensional coordinate system. For each vector $x\in\mathbb{R}^{p}$, we decompose it into the direction of $x_{\star}$ and an orthogonal direction as $x=ux_{\star}+vx_{\star}^{\perp}$, where $x_{\star}^{\top}x_{\star}^{\perp}=0$, $\|x_{\star}^{\perp}\|_{2}=1$, and $u,v\in\mathbb{R}$. This decomposition exists for any $x\in\mathbb{R}^{p}$, and for a given $x$, $u$ and $|v|$ are unique. We let a point with coordinates $(u,v)$ denote the subset of $\mathbb{R}^{p}$ whose elements admit a decomposition with coefficients $u$ and $v$:

\left\{x\in\mathbb{R}^{p}:x^{\top}x_{\star}=u,\left\|x-(x^{\top}x_{\star})x_{\star}\right\|_{2}=|v|\right\}.

So, in Figure 1, if we let the first coordinates of $A,B$ be $-1$ and $1$ and the second coordinates of $C,D$ be $u(\delta)$ and $-u(\delta)$, then $A,B$ represent the true signal vectors, $C,D$ represent points in $S_{0}$, and the origin represents $0\in\mathbb{R}^{p}$. For the noiseless population landscape, we can show that when $\delta<\delta_{0}$ for some positive constant $\delta_{0}$, there exists a unique positive constant $u(\delta)$ such that the stationary point set is $S_{0}\bigcup\{0,x_{\star},-x_{\star}\}$, among which only the global minima $x_{\star}$ and $-x_{\star}$ are local minima. This conclusion corresponds to Figure 1, in which $A,B,O,C,D$ are stationary points and only $A,B$ are local minima. The following results describe the conclusion rigorously.

Figure 1: An illustration of the noiseless population landscape.
Theorem 2.

Under Assumptions 1, 2 and 5, when $p_{\text{fail}}=\gamma=0$ and $\|x_{\star}\|_{2}=1$, for any given $\delta>0$, there exists a unique $u(\delta)>0$ such that the stationary point set of $\mathbb{E}\left[F_{\delta}(x)\right]$, $\mathbf{S}:=\left\{x\in\mathbb{R}^{p}:\frac{\partial\mathbb{E}\left[F_{\delta}(x)\right]}{\partial x}=0\right\}$, becomes

\mathbf{S}=\left\{0,x_{\star},-x_{\star}\right\}\bigcup\left\{x\in\mathbb{R}^{p}:x^{\top}x_{\star}=0,\|x\|_{2}=u(\delta)\right\}.

The conclusion of Theorem 2 is similar to that of [10]. Next, we comment on some key ideas behind the proof. It is easy to see that $\{0,x_{\star},-x_{\star}\}$ are all stationary points. For $x\in\{x\in\mathbb{R}^{p}:x^{\top}x_{\star}=0,\|x\|_{2}>0\}$, it is also easy to see that $\mathbb{E}\left[U_{3}(x;a)\right]=0$, so we only need to solve the equation $\mathbb{E}\left[U_{1}(x;a)\right]=0$. Based on the isotropy of the normal distribution, we can calculate $\mathbb{E}\left[U_{1}(x;a)\right]$ by assuming $x_{\star}=e_{1}$ and $x=ue_{2}$ with $u>0$. Then $u(\delta)$ is the solution to the equation $0=\mathbb{E}\left[l^{\prime}_{\delta}(u^{2}(a^{\top}e_{2})^{2}-(a^{\top}e_{1})^{2})(a^{\top}e_{2})^{2}\right]$ in $u>0$. For the complete proof, we also need to explain why $x$ is not a stationary point when $x^{\top}x_{\star}\neq 0$ and $\Delta(x)>0$; the proof is a direct application of [10]'s work on spectral functions. Please refer to the appendix in the supplement for details.

With access to the Hessian matrix, we can also classify these stationary points. The conclusion is given in Theorem 3.

Theorem 3.

Under Assumptions 1, 2, and 5, when $p_{\text{fail}}=\gamma=0$ and $\|x_{\star}\|_{2}=1$, the following properties hold for $\mathbb{E}\left[F_{\delta}(x)\right]$.

(a) $x_{\star}$ and $-x_{\star}$ are local and global minima.

(b) $0\in\mathbb{R}^{p}$ is a local maximum.

(c) There exists a constant $\delta_{0}>0$ such that for any $\delta\in(0,\delta_{0})$, $\mathbb{E}\left[U_{2}(x;a)\right]<0$ for all $x\in S_{0}$. This means that stationary points in $S_{0}$ are not local minima.

Theorem 3 classifies all the stationary points of $\mathbb{E}\left[F_{\delta}(x)\right]$ and establishes the benign noiseless population landscape: there is no spurious local minimum for $\mathbb{E}\left[F_{\delta}(x)\right]$. Another interesting result found in the proof of Theorem 3 is that the limits $\lim_{\delta\rightarrow 0}u(\delta)$ and $\lim_{\delta\rightarrow 0}\mathbb{E}\left[U_{2}(x;a)\right]$ do not depend on the selection of the kernel $K(x)$. It shows that

\lim_{\delta\rightarrow 0}u(\delta)=u_{0}\approx 0.44,

where $u_{0}$ is the unique solution of the equation $\frac{\pi}{4}=\arctan u+\frac{u}{u^{2}+1}$, and

\lim_{\delta\rightarrow 0}\mathbb{E}\left[U_{2}(x;a)\right]=\frac{16u^{3}}{\pi(1+u^{2})^{2}}+\frac{8}{\pi}\left(\arctan(u)-\frac{\pi}{4}-\frac{u}{u^{2}+1}\right)

for any $x$ that satisfies $x^{\top}x_{\star}=0$ and $\|x\|_{2}=u>0$. When $u=u_{0}$, $\lim_{\delta\rightarrow 0}\mathbb{E}\left[U_{2}(x;a)\right]\approx-1.57$, which is negative. This implies that these limiting quantities are intrinsic to the $\ell_{1}$-loss rather than to any particular kernel.
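These limiting constants are easy to verify numerically; the short check below (our own code) solves $\pi/4=\arctan u+u/(u^{2}+1)$ for $u_{0}$ and evaluates the $U_{2}$ limit at $u=u_{0}$.

import numpy as np
from scipy.optimize import brentq

g = lambda u: np.arctan(u) + u / (u**2 + 1) - np.pi / 4
u0 = brentq(g, 0.1, 1.0)                      # root of pi/4 = arctan(u) + u/(u^2+1)
lim_U2 = 16 * u0**3 / (np.pi * (1 + u0**2)**2) \
    + (8 / np.pi) * (np.arctan(u0) - np.pi / 4 - u0 / (u0**2 + 1))
print(round(u0, 2), round(lim_U2, 2))         # prints 0.44 -1.57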

Finally, we give a lemma on the positivity of $\mathbb{E}\left[U_{1}(x;a)\right]$ when $\|x\|_{2}$ is very large.

Lemma 4.

Under Assumptions 1, 2 and 5, when $p_{\text{fail}}=\gamma=0$, there exists a positive constant $M>0$ such that

\inf_{x\in\mathbb{R}^{p},\|x\|_{2}\geq M}\ \mathbb{E}\left[U_{1}(x;a)\right]>0.

Lemma 4 further establishes the non-vanishing of the gradient when $\|x\|_{2}$ is large and will be useful in reaching conclusions about the empirical landscape.

3.3 Empirical Landscape

In this subsection, we first show the conclusions for the benign empirical landscape and then provide a sketch of the proof.

We still use visualizations to help explain the landscape. Under some conditions, we can show that, with high probability, all the stationary points lie in the set

\mathbf{S}_{1}=\{0,x_{\star},-x_{\star}\}\bigcup\left\{x\in\mathbb{R}^{p}:\mathrm{dist}(x,S_{0}\bigcup\{0\})\leq\epsilon\right\}

for any small positive constant $\epsilon$, among which only the global minima $x_{\star}$ and $-x_{\star}$ are local minima. In other words, a stationary point is either a true phase or lies in the vicinity of $S_{0}\bigcup\{0\}$. This conclusion corresponds to Figure 2, in which the possible stationary points correspond to $A,B$ and the interiors of the three circles, and only $A,B$ are local minima.

Figure 2: Noiseless empirical landscape.

Before giving the formal conclusion, we define the following events.

\mathcal{A}=\{\text{there exists a stationary point of }F_{\delta}(x)\text{ that does not belong to the set }\mathbf{S}_{1}\}.

\mathcal{B}=\{\text{there exists a local minimum of }F_{\delta}(x)\text{ that does not belong to the set }\{x_{\star},-x_{\star}\}\}.

Formally, the conclusion is as follows.

Theorem 4.

Under Assumptions 1, 2 and 5, when $p_{\text{fail}}=\gamma=0$ and $\|x_{\star}\|_{2}=1$, there exists a constant $\delta_{0}>0$ such that, for any $\delta\in(0,\delta_{0})$ and $\epsilon>0$, if $n\geq cp\log p$, we have

P(\mathcal{A}\cup\mathcal{B})\leq C\exp(-c^{\prime}n)+\frac{C}{n},

where $c,c^{\prime},C$ are positive constants that depend on $\delta$ and $\epsilon$.

Theorem 4 shows that the complementary event $\mathcal{A}^{c}\bigcap\mathcal{B}^{c}$, which corresponds to the benign noiseless empirical landscape, happens with high probability. So, if we use a gradient-based optimization algorithm with random initialization to minimize $F_{\delta}(x)$ and it finds a local minimum, the iterative sequence $\{x_{k}\}_{k=0}^{\infty}$ will converge to one of $\{x_{\star},-x_{\star}\}$. In addition, under the benign landscape, although theoretical guarantees of global convergence for first-order algorithms require additional tricks [19, 20] because of the existence of saddle points, the numerical performance of robust phase retrieval indicates that we do not need those tricks for good performance. Readers interested in saddle points can refer to [20] and the references therein. The global convergence guarantee ensures that our discussion of the local linear convergence rate in Section 2.4 is meaningful.

For the rest of this subsection, we provide a sketch of the proof. In the first step, we focus on a large bounded area $R_{2}=\{x\in\mathbb{R}^{p}:\|x\|_{2}\leq M\}$ and prove that $\mathbb{E}_{n}\left[U_{i}(x;a)\right]$, $i=1,2,3$, all uniformly concentrate around their population versions with high probability. These quantities are related to the gradient and Hessian matrix of $F_{\delta}(x)$. If we do not consider the high-dimensional situation for $p$, it is easy to apply the central limit theorem to each element of the gradient and Hessian and claim convergence. However, in the high-dimensional case, when we can only assume that $n\geq\Omega(cp\log p)$, we can only focus on selected key quantities, namely $U_{1}(x;a)$, $U_{2}(x;a)$ and $U_{3}(x;a)$. Fortunately, they are sufficient to support our claim about the empirical landscape. Next, we state our conclusions.

Lemma 5.

Under Assumptions 1, 2, and 5, for any given $\delta,M>0$ and $t>0$, when $n\geq cp\log p$, $p\geq 3$, and $\|x_{\star}\|_{2}=1$, the following relationships hold.

P\left(\sup_{x\in B_{p}(0,M)}\left|\mathbb{E}_{n}\left[U_{1}(x;a)\right]-\mathbb{E}\left[U_{1}(x;a)\right]\right|\geq t\right)\leq C\exp\left(-c^{\prime}n\right)+\frac{C}{n}.\qquad(13)

P\left(\sup_{x\in B_{p}(0,M)}\left|\mathbb{E}_{n}\left[U_{2}(x;a)\right]-\mathbb{E}\left[U_{2}(x;a)\right]\right|\geq t\right)\leq C\exp\left(-c^{\prime}n\right)+\frac{C}{n}.\qquad(14)

P\left(\sup_{x\in B_{p}(0,M)}\left|\mathbb{E}_{n}\left[U_{3}(x;a)\right]-\mathbb{E}\left[U_{3}(x;a)\right]\right|\geq t\right)\leq C\exp\left(-c^{\prime}n\right)+\frac{C}{n}.\qquad(15)

Here, $c\geq 1$, $c^{\prime}$, and $C$ are positive constants that depend on $\delta,M,t$ but not on $n$ or $p$.

Lemma 5 guarantees the uniform concentration of the selected quantities within a large bounded subset of $\mathbb{R}^{p}$ when $n$ is large. This indicates that, within this area, stationary points and local minima of $F_{\delta}(x)$ should be close to those of $\mathbb{E}\left[F_{\delta}(x)\right]$. It is also worth mentioning that, as inferred from the proof, a smaller $\delta$ requires a larger $n$ for these quantities to get close to their population versions.

In the second step, we go beyond closeness within a local area that contains the true signal vectors. Local landscape analysis strengthens the conclusion by showing that, within such a local area, $\nabla F_{\delta}(x)$ is non-zero whenever $x\notin\{x_{\star},-x_{\star}\}$, which means that there is no other stationary point in the local area. The two key properties for this part are the generalized sharpness condition given in Lemma 1 and the weak convexity condition given in Lemma 2. With them, we can establish the following lower bound on the norm of the gradient and claim the benign local landscape.

Lemma 6.

Suppose that $F_{\delta}(x)$ is weakly convex with parameter $\rho>0$ as given in Lemma 2 and also satisfies

$F_{\delta}(x)-F_{\delta}(x_{\star})\geq\lambda_{s}\min\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\},\ \forall x\in\mathbb{R}^{p}$ (16)

as given in Lemma 1 for some positive constant $\lambda_{s}$. Then we have, for all $\delta\in(0,\lambda_{s}/\rho]$ and $x\in\{x\in\mathbb{R}^{p}:\|x-x_{\star}\|_{2}\leq\lambda_{s}/\rho\}$,

$\mathbb{E}_{n}\left[U_{1}(x;a)\right]-\mathbb{E}_{n}\left[U_{3}(x;a)\right]=\nabla F_{\delta}(x)^{\top}(x-x_{\star})\geq\frac{\lambda_{s}}{2}\min\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\}.$

The symmetric statement also holds with $x_{\star}$ replaced by $-x_{\star}$, with the inner product taken against $x+x_{\star}$ instead of $x-x_{\star}$. Combining the two cases, we obtain

$\|\nabla F_{\delta}(x)\|_{2}>0,\ \forall\ x\in\{x\in\mathbb{R}^{p}:0<\Delta(x)\leq\lambda_{s}/\rho\}.$

Finally, having handled the landscape within $R_{2}$, we prove that the unbounded area $\mathbb{R}^{p}\backslash R_{2}$ contains no stationary point by showing that $\mathbb{E}_{n}[U_{1}(x;a)]$ is positive on the boundary and that $\frac{1}{u^{2}}\mathbb{E}_{n}[U_{1}(ux_{0};a)]$ is monotone increasing in $u>0$ for any $x_{0}\in\partial R_{2}$. The monotonicity is formally given in Lemma 7.

Lemma 7.

For any fixed $x_{0}\in\mathbb{R}^{p}$ with $\|x_{0}\|_{2}>0$, the function $h_{n}(u)=\frac{\mathbb{E}_{n}\left[U_{1}(ux_{0};a)\right]}{u^{2}}$ is monotone increasing on $(0,+\infty)$.

Lemma 7 follows immediately because $l^{\prime}_{\delta}(\cdot)$ is monotone increasing: writing $h_{n}(u)=2\,\mathbb{E}_{n}\left[(a^{\top}x_{0})^{2}\,l^{\prime}_{\delta}\left(u^{2}(a^{\top}x_{0})^{2}-(a^{\top}x_{\star})^{2}\right)\right]$, each term is nondecreasing in $u$. It directly yields the following result about the unbounded area.

Lemma 8.

If there exists a constant $M>0$ such that $\inf_{x\in\mathbb{R}^{p}:\|x\|_{2}=M}\mathbb{E}_{n}\left[U_{1}(x;a)\right]>0$, then $\inf_{x\in\mathbb{R}^{p}:\|x\|_{2}\geq M}\mathbb{E}_{n}\left[U_{1}(x;a)\right]>0$.

The positivity of $\mathbb{E}_{n}[U_{1}(x;a)]=x^{\top}\nabla F_{\delta}(x)$ implies that there is no stationary point in this area.

4 Empirical Landscape in the Presence of Noise

In this section, we discuss the corrupted situation and the empirical landscape of $F_{\delta}(x)$. Without loss of generality, we still assume that $\|x_{\star}\|_{2}=1$ throughout this section. We follow the statistical assumptions in Section 2 with the general corruption setting $p_{\text{fail}}>0$. Subsection 4.1 first provides the results for the case $\gamma=0$ in Assumption 1 and then extends them to the case $\gamma\geq 0$. Subsection 4.2 provides a sketch of the proof.

4.1 Benign Landscape When Sparse Corruptions Exist

In this subsection, we first discuss the situation where $p_{\text{fail}}>0$ and $\gamma=0$ in Assumption 1. We will show that, under some conditions, with high probability, all the stationary points of $F_{\delta}(x)$ lie in the set

$\mathbf{S}_{3}=\left\{x\in\mathbb{R}^{p}:\Delta(x)\leq r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})\right\}\bigcup\left\{x\in\mathbb{R}^{p}:\mbox{dist}(x,S_{0}\bigcup\{0\})\leq\epsilon\right\}$

for a positive constant $\epsilon$ related to $p_{\text{fail}}$ and a positive constant $r_{0}$; among these stationary points, the local minima lie in

$\mathbf{S}_{4}=\left\{x\in\mathbb{R}^{p}:\Delta(x)\leq r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})\right\}.$

In other words, a stationary point lies either in the vicinity of $\{x_{\star},-x_{\star}\}$, whose radius is proportional to $\delta p_{\text{fail}}/(1-p_{\text{fail}})$, or in the vicinity of $S_{0}\bigcup\{0\}$, and all local minima lie in the vicinity of the true phase. This conclusion corresponds to Figure 3, in which the possible stationary points correspond to the interiors of the five circles and only the interiors of circles $A$ and $B$ contain local minima. The rigorous statement is given in Theorem 5.

Theorem 5.

Suppose that $\|x_{\star}\|_{2}=1$, $p\geq 3$, and Assumptions 1, 2, and 5 hold with $\gamma=0$. We consider the following two events.

$\mathcal{C}=\{\textrm{there exists a stationary point of }F_{\delta}(x)\textrm{ that does not belong to the set }\mathbf{S}_{3}\}.$
$\mathcal{D}=\{\textrm{there exists a local minimum of }F_{\delta}(x)\textrm{ that does not belong to the set }\mathbf{S}_{4}\}.$

Then, for any $\epsilon>0$, there exist positive constants $\delta_{0}$, $r_{0}$, and $p_{\text{fail}}^{m}$ such that for any $\delta\in(0,\delta_{0})$ and $p_{\text{fail}}\in(0,p_{\text{fail}}^{m})$, when $n\geq cp\log p$,

$P(\mathcal{C}\bigcup\mathcal{D})\leq C\exp(-c^{\prime}n/\log n)+\frac{C}{n}.$

Here, the constants $c$, $c^{\prime}$, and $C$ depend on $\delta$, $\epsilon$, and $p_{\text{fail}}$, while $r_{0}$ does not depend on $\delta$, $\epsilon$, or $p_{\text{fail}}$.

In Theorem 5, we claim that the local minima of $F_{\delta}(x)$ only exist in the vicinity of the true phase $\{x_{\star},-x_{\star}\}$. The radius $r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})$ of the vicinity is proportional to the bandwidth and monotone increasing in the proportion of corrupted measurements. In practice, we need to choose a $\delta$. Although a smaller $\delta$ places the local minima closer to the true phase, the non-local optimization efficiency is still unclear for different choices of $\delta$, and decreasing $\delta$ also requires a larger $n$ due to the dependence of the constants in Theorem 5 on $\delta$. Next, we provide some guidance for picking $\delta$ when the covariance of $a_{i}$ is unknown and $\|x_{\star}\|_{2}$ is not necessarily 1. By homogeneity, we can let $\delta=\delta_{0}\|\mathbb{E}_{n}\left[aa^{\top}\right]\|_{2}\|x_{\star}\|_{2}^{2}$, in which $\delta_{0}$ is a small but not extremely small positive constant. Computing or roughly estimating $\|\mathbb{E}_{n}\left[aa^{\top}\right]\|_{2}$ is easy in practice, and a rough estimate of $\|x_{\star}\|_{2}$ can be obtained from prior knowledge or modified spectral initialization (Algorithm 3 in [11]). We will provide an example of image recovery in Section 5. For $\delta_{0}$, which should match the theoretical results based on the noiseless population landscape given in Theorem 3, we will conduct a sensitivity analysis in Section 5.
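To make this recipe concrete, the following minimal NumPy sketch computes such a $\delta$ from the sensing matrix and a rough norm estimate; the function and argument names (`choose_bandwidth`, `x_norm_est`, `delta0`) are ours, not from any existing package.

```python
import numpy as np

def choose_bandwidth(A, x_norm_est, delta0=0.25):
    """Heuristic bandwidth delta = delta0 * ||E_n[a a^T]||_2 * ||x_star||_2^2.

    A is the (n, p) matrix whose rows are the sensing vectors a_i, and
    x_norm_est is a rough estimate of ||x_star||_2, e.g., from prior
    knowledge or modified spectral initialization (Algorithm 3 in [11]).
    """
    n = A.shape[0]
    # ||E_n[a a^T]||_2 = ||A^T A||_2 / n = (largest singular value of A)^2 / n.
    cov_norm = np.linalg.norm(A, ord=2) ** 2 / n
    return delta0 * cov_norm * x_norm_est ** 2

# Example with standard Gaussian designs, where ||E[a a^T]||_2 = 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((4000, 500))
delta = choose_bandwidth(A, x_norm_est=1.0)
```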

Figure 3: Empirical Landscape with Corruptions

Next, Theorem 6 extends the benign landscape to the general situation where $\gamma\geq 0$.

Theorem 6.

Suppose that $\|x_{\star}\|_{2}=1$ and $p\geq 3$. Under Assumptions 1, 2, and 5, there exist positive constants $\delta_{0}$, $\tilde{r}_{0}$, $\gamma_{0}$, $c^{\prime\prime}$, and $p_{\text{fail}}^{m}$ such that when $n\geq cp\log p$, for any $\delta\in(0,\delta_{0})$, $\gamma\in(0,\gamma_{0})$ with $\delta\geq c^{\prime\prime}\sqrt{\gamma}$, and $p_{\text{fail}}\in(0,p_{\text{fail}}^{m})$, the probability that a local minimum of $F_{\delta}(x)$ exists and does not belong to the set

$\{x\in\mathbb{R}^{p}:\Delta(x)\leq\tilde{r}_{0}\max\{\sqrt{\delta\gamma},\delta p_{\text{fail}}/(1-p_{\text{fail}})\}\}$ (17)

is at most $C\exp(-c^{\prime}n/\log n)+\frac{C}{n}$. Here, $c$, $c^{\prime}$, and $C$ are positive constants that depend on $\delta$ and $p_{\text{fail}}$, while $\tilde{r}_{0}$ and $c^{\prime\prime}$ are positive constants that do not depend on $\delta$, $\gamma$, or $p_{\text{fail}}$.

Note that Theorem 5 is a special case of Theorem 6 with $\gamma=0$.

4.2 Sketch of Proof

We first provide a sketch of the proof of Theorem 5. Before we start, we define $L(x;a,b)=l_{\delta}((a^{\top}x)^{2}-b)$ for $x\in\mathbb{R}^{p}$, so that $F_{\delta}(x)$ in (5) can be written as $F_{\delta}(x)=\mathbb{E}_{n}[L(x;a,b)]$. A direct calculation gives $\frac{\partial L(x;a,b)}{\partial x}=\left\{2l^{\prime}_{\delta}\left((a^{\top}x)^{2}-b\right)aa^{\top}\right\}x$ and $\frac{\partial^{2}}{\partial x\partial x^{\top}}L(x;a,b)=2\left\{l^{\prime}_{\delta}((a^{\top}x)^{2}-b)aa^{\top}\right\}+4\left\{(a^{\top}x)^{2}l^{\prime\prime}_{\delta}((a^{\top}x)^{2}-b)aa^{\top}\right\}$. Similar to Section 3, we also define

$U_{1}(x;a,b)=x^{\top}\frac{\partial L(x;a,b)}{\partial x}=2(a^{\top}x)^{2}l^{\prime}_{\delta}((a^{\top}x)^{2}-b),$
$U_{2}(x;a,b)=x_{\star}^{\top}\left[\frac{\partial^{2}L(x;a,b)}{\partial x\partial x^{\top}}\right]x_{\star},$

and

$U_{3}(x;a,b)=x_{\star}^{\top}\frac{\partial L(x;a,b)}{\partial x}=2(a^{\top}x_{\star})l^{\prime}_{\delta}((a^{\top}x)^{2}-b)(a^{\top}x).$
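As an illustration, the following NumPy sketch evaluates these empirical quantities, taking the pseudo-Huber loss $l_{\delta}(x)=\sqrt{x^{2}+\delta^{2}}$ from Section 5 as a concrete choice of $l_{\delta}$; the helper names `pseudo_huber` and `U_quantities` are ours. The final lines numerically check the monotonicity of $h_{n}(u)$ asserted in Lemma 7 (and Lemma 11 below).

```python
import numpy as np

def pseudo_huber(r, delta):
    """l_delta(r) = sqrt(r^2 + delta^2) and its first two derivatives."""
    s = np.sqrt(r**2 + delta**2)
    return s, r / s, delta**2 / s**3

def U_quantities(x, x_star, A, b, delta):
    """Empirical means E_n[U_1], E_n[U_2], E_n[U_3] at x, vectorized over rows of A."""
    ax, ax_star = A @ x, A @ x_star
    _, l1, l2 = pseudo_huber(ax**2 - b, delta)
    U1 = np.mean(2 * ax**2 * l1)                          # x^T (dL/dx)
    U2 = np.mean((2 * l1 + 4 * ax**2 * l2) * ax_star**2)  # x_star^T Hessian x_star
    U3 = np.mean(2 * ax_star * l1 * ax)                   # x_star^T (dL/dx)
    return U1, U2, U3

# Sanity check of the monotonicity of h_n(u) = E_n[U_1(u x_0)] / u^2.
rng = np.random.default_rng(0)
n, p, delta = 2000, 50, 0.25
x_star = rng.standard_normal(p)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((n, p))
b = (A @ x_star) ** 2                                     # noiseless measurements
x0 = rng.standard_normal(p)
h = [U_quantities(u * x0, x_star, A, b, delta)[0] / u**2 for u in (0.5, 1.0, 2.0)]
assert h[0] <= h[1] <= h[2]
```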

Now we provide the proof sketch. First, we focus on a large bounded area $R_{2}=\{x\in\mathbb{R}^{p}:\|x\|_{2}\leq M\}$. The sample-mean operators satisfy $\mathbb{E}_{n}=(1-p_{\text{fail}})\mathbb{E}_{n_{1}}+p_{\text{fail}}\mathbb{E}_{n_{2}}$, where $\mathbb{E}_{n_{1}}$ takes the sample mean over indices from 1 to $n_{1}$ and $\mathbb{E}_{n_{2}}$ over indices from $n_{1}+1$ to $n_{1}+n_{2}$. When $p_{\text{fail}}$ is small, we can expect that, within $R_{2}$, the quantities defined above concentrate as usual under $(1-p_{\text{fail}})\mathbb{E}_{n_{1}}$, while their counterparts under $p_{\text{fail}}\mathbb{E}_{n_{2}}$ are bounded by small positive constants. The rigorous statement is as follows.

Lemma 9.

Under Assumptions 1, 2, and 5 with $\gamma=0$, $p\geq 3$, and $\|x_{\star}\|_{2}=1$, the following results hold.

  (a)

    For any given $M>0$, $t>0$, and $\delta>0$, when $n_{1}\geq c_{1}p\log p$, we have, for $i=1,2,3$,

    $P(\sup_{x\in B_{p}(0,M)}\left|\mathbb{E}_{n_{1}}\left[U_{i}(x;a,b)\right]-\mathbb{E}\left[U_{i}(x;a)\right]\right|\geq t)\leq C_{1}\exp\left(-c^{\prime}_{1}n_{1}\right)+\frac{C_{1}}{n_{1}}.$ (18)

    Here, $c_{1}$, $c^{\prime}_{1}$, and $C_{1}$ are positive constants that depend on $\delta$, $M$, and $t$.

  (b)

    For any $M>0$, $t>0$, and $\delta>0$, when $n\geq c_{2}p\log p$, we have, for $i=1,2,3$,

    $P(\sup_{x\in B_{p}(0,M)}\left|p_{\text{fail}}\mathbb{E}_{n_{2}}\left[U_{i}(x;a,b)\right]\right|\geq C^{\prime}_{2}p_{\text{fail}}+t)\leq C_{2}\exp\left(-nc^{\prime}_{2}/\log n\right)+\frac{C_{2}}{n}.$ (19)

    Here, $c_{2}$, $c^{\prime}_{2}$, $C_{2}$, and $C^{\prime}_{2}$ are all positive constants. $C^{\prime}_{2}$ depends on $M$ and $\delta$ but not on $t$ or $p_{\text{fail}}$; $c_{2}$, $c^{\prime}_{2}$, and $C_{2}$ depend on $M$, $t$, and $\delta$ but not on $p_{\text{fail}}$.

  (c)

    For any $M>0$, $t>0$, and $\delta>0$, when $n\geq c_{3}p\log p$, we have

    $P(\sup_{x\in B_{p}(0,M)}\left|p_{\text{fail}}\mathbb{E}_{n_{2}}\left[U_{1}(x;a,b)-U_{3}(x;a,b)\right]\right|\geq\left(C^{\prime}_{3}p_{\text{fail}}+t\right)\|x-x_{\star}\|_{2})\leq C_{3}\exp\left(-c^{\prime}_{3}n\right)+\frac{C_{3}}{n}.$ (20)

    Here, $c_{3}$, $c^{\prime}_{3}$, $C_{3}$, and $C^{\prime}_{3}$ are all positive constants. $C^{\prime}_{3}$ depends on $M$ but not on $\delta$, $t$, or $p_{\text{fail}}$; $c_{3}$, $c^{\prime}_{3}$, and $C_{3}$ depend on $M$, $t$, and $\delta$ but not on $p_{\text{fail}}$.

All the above conclusions also hold with $x_{\star}$ replaced by $-x_{\star}$.

Lemma 9(a) is a direct consequence of the noiseless case. For the others, observe that, when $n$ is large, $t$ can be taken very small, so the corruptions affect each of $U_{1}(x;a,b)$, $U_{2}(x;a,b)$, and $U_{3}(x;a,b)$ by at most a constant multiple of $p_{\text{fail}}$. This indicates that, as long as $p_{\text{fail}}$ is smaller than some constant, the corruption cannot overwhelm the contribution of the clean measurements for $x$ outside the vicinity of $\{x_{\star},-x_{\star}\}$. The last inequality will be used in the local landscape analysis to quantify the vicinity.

Second, we quantify the vicinity. So far, within $R_{2}$, we know that all the local minima are close to the true signal vectors. The local landscape analysis strengthens this conclusion by showing that, within a local area containing the true signal vectors, $\nabla F_{\delta}(x)$ is non-zero when $x\notin\{x\in\mathbb{R}^{p}:\Delta(x)\leq r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})\}$, which quantifies the distance of the stationary points in the local area from the true signal vectors. The key ingredient is the noiseless local landscape result in Lemma 6, which implies that, with high probability,

$\mathbb{E}_{n_{1}}\left[U_{1}(x;a,b)-U_{3}(x;a,b)\right]\geq\frac{\lambda_{s}}{2}\min\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\},\quad\forall\ x\in\{x\in\mathbb{R}^{p}:\|x-x_{\star}\|_{2}\leq\lambda_{s}/\rho\}.$ (21)

Together with the third conclusion in Lemma 9, we obtain the following result for the local landscape, where we write $U_{13}(x;a,b):=U_{1}(x;a,b)-U_{3}(x;a,b)$. Lemma 10 quantifies the radius of the vicinity as $r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})$.

Lemma 10.

When Assumptions 1, 2, and 5 hold with $\|x_{\star}\|_{2}=1$, $p\geq 3$, and $\gamma=0$, there exist positive constants $p_{\text{fail}}^{m}$ and $\delta_{0}$ such that when $n\geq cp\log p$, $0<p_{\text{fail}}\leq p_{\text{fail}}^{m}$, and $\delta<\delta_{0}$, the following relationship holds with probability at least $1-C\exp(-c^{\prime}n/\log n)-C/n$.

$\mathbb{E}_{n}\left[U_{13}(x;a,b)\right]\geq C^{(1)}\min\{\Delta(x),\frac{1}{\delta}\Delta^{2}(x)\},\quad\forall\ x\in\{x\in\mathbb{R}^{p}:r_{0}\delta p_{\text{fail}}/(1-p_{\text{fail}})\leq\|x-x_{\star}\|_{2}\leq C^{(2)}\}.$ (22)

Here, $C^{(1)}$, $r_{0}$, and $C^{(2)}$ are positive constants that do not depend on the choice of $\delta$ and $p_{\text{fail}}$, while the positive constants $c$, $c^{\prime}$, and $C$ do depend on $\delta$ and $p_{\text{fail}}$. This conclusion also holds with $x_{\star}$ replaced by $-x_{\star}$.

Last, we prove that the unbounded area $\mathbb{R}^{p}\backslash R_{2}$ contains no stationary point by showing that $\mathbb{E}_{n}[U_{1}(x;a,b)]$ is positive on the boundary and that $\frac{1}{u^{2}}\mathbb{E}_{n}[U_{1}(ux_{0};a,b)]$ is monotone increasing in $u>0$ for any $x_{0}\in\partial R_{2}$. The following two lemmas establish the monotonicity and positivity in parallel to Lemmas 7 and 8.

Lemma 11.

For any fixed $x_{0}\in\mathbb{R}^{p}$ with $\|x_{0}\|_{2}>0$, the function $h_{n}(u)=\frac{\mathbb{E}_{n}\left[U_{1}(ux_{0};a,b)\right]}{u^{2}}$ is monotone increasing on $(0,+\infty)$.

Lemma 12.

If there exists a constant $M>0$ such that $\inf_{x\in\mathbb{R}^{p}:\|x\|_{2}=M}\mathbb{E}_{n}\left[U_{1}(x;a,b)\right]>0$, then $\inf_{x\in\mathbb{R}^{p}:\|x\|_{2}\geq M}\mathbb{E}_{n}\left[U_{1}(x;a,b)\right]>0$.

Finally, we discuss some key points for generalizing the proof of Theorem 5 for $\gamma=0$ to Theorem 6 for $\gamma>0$. First, for $\mathbb{E}_{n_{1}}[U_{i}(x;a,b)]$ to be close to $\mathbb{E}_{n_{1}}[U_{i}(x;a)]$, $i=1,2,3$, we need $\gamma$ to be small and $\delta\geq\Omega(\sqrt{\gamma})$. Next, similar to the previous discussion of the proof sketch, we need $(1-p_{\text{fail}})\mathbb{E}_{n_{1}}[U_{13}(x;a,b)]$ to dominate $p_{\text{fail}}\mathbb{E}_{n_{2}}[U_{13}(x;a,b)]$ in some bounded area excluding the vicinity of $x_{\star}$. The estimate $|p_{\text{fail}}\mathbb{E}_{n_{2}}[U_{13}(x;a,b)]|=O(p_{\text{fail}}\|x-x_{\star}\|_{2})$ with high probability remains unchanged. As for $(1-p_{\text{fail}})\mathbb{E}_{n_{1}}[U_{13}(x;a,b)]$, due to $\gamma>0$, the relationship $\mathbb{E}_{n_{1}}[U_{13}(x;a,b)]\geq\Omega(\min\{\Delta(x),\Delta^{2}(x)/\delta\})$ only holds when $\|x-x_{\star}\|_{2}\geq\Omega(\sqrt{\gamma\delta})$. Comparing these bounds yields the set given in (17).

5 Numerical Experiments

This section compares the numerical performance of PL [11], IPL [38], and SRPR. Subsection 5.1 describes the three algorithms. Subsections 5.2 and 5.3 present the numerical results on simulated datasets and an image recovery task, respectively.

5.1 Description of Different Algorithms

To illustrate the efficacy of the benign landscape for smoothed robust phase retrieval, we evaluate both random initialization (RI) and modified spectral initialization (SI) as described in Algorithm 3 of [11]. The SI method applies spectral initialization to the subset of samples with small $b_{i}$ values, aiming to circumvent the influence of heavy-tailed corrupted measurements within $\{b_{i},i=1,2,\ldots,n\}$. [11] established that SI can, with high probability, generate an initial point $x_{0}$ satisfying $\Delta(x_{0})\leq O(p_{\text{fail}})$, provided the ratio $n/p$ is sufficiently large and $p_{\text{fail}}$ is sufficiently small.
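The following sketch conveys the idea behind SI; it is our simplified illustration, not Algorithm 3 of [11], whose exact selection rule and scaling differ, and `keep_frac` and `norm_est` are hypothetical tuning inputs.

```python
import numpy as np

def modified_spectral_init(A, b, keep_frac=0.5, norm_est=None):
    """Spectral initialization on the samples with small b_i (illustrative only)."""
    keep = b <= np.quantile(b, keep_frac)              # drop heavy-tailed b_i
    Y = (A[keep].T * b[keep]) @ A[keep] / keep.sum()   # mean of b_i a_i a_i^T
    eigvals, eigvecs = np.linalg.eigh(Y)
    direction = eigvecs[:, -1]                         # top eigenvector
    # For Gaussian designs, E[b] = ||x_star||_2^2, giving a rough scale.
    scale = norm_est if norm_est is not None else np.sqrt(max(b.mean(), 0.0))
    return scale * direction
```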

Next, we discuss the proximal linear (PL) algorithm of [11] and the inexact proximal linear (IPL) algorithm of [38]. These algorithms are selected for comparison due to PL's pioneering role in minimizing $F(x)$ and IPL's improvements in numerical efficiency over PL and the subgradient methods detailed in [9]. Thus, they set a benchmark for robust phase retrieval [38]. Both PL and IPL proceed by solving a sequence of convex subproblems to minimize $F(x)$.
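In our notation, each PL iteration linearizes the inner map $c_{i}(x)=(a_{i}^{\top}x)^{2}-b_{i}$ at the current iterate $x_{k}$ and solves a convex model problem of the form

$x_{k+1}=\operatorname*{argmin}_{x\in\mathbb{R}^{p}}\ \mathbb{E}_{n}\left[\left|(a^{\top}x_{k})^{2}+2(a^{\top}x_{k})a^{\top}(x-x_{k})-b\right|\right]+\frac{1}{2\alpha_{k}}\|x-x_{k}\|_{2}^{2},$

with a step-size parameter $\alpha_{k}>0$. This display is a sketch of the prox-linear step rather than the authors' exact formulation; IPL solves the same type of subproblem only inexactly, with adaptive stopping criteria, and we refer to [11, 38] for the precise versions.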

We then introduce the smoothed robust phase retrieval (SRPR) methodology. In the noiseless scenario, SRPR minimizes $F_{\delta}(x)$, using either RI or SI for initialization. In cases involving corruption, SRPR first minimizes $F_{\delta}(x)$ using RI or SI and then employs the resulting minimizer as a warm start for IPL. This two-step approach is necessary because direct minimization of $F_{\delta}(x)$ does not yield exact recovery of the true signal vectors. For kernel selection across all comparisons with proximal linear algorithms, we employ the kernel $K(x)=\frac{1}{2(x^{2}+1)^{3/2}}$, which generates the pseudo-Huber loss function $l_{\delta}(x)=\sqrt{x^{2}+\delta^{2}}$. The choice of $\delta$ will be elaborated upon subsequently. The monotone accelerated proximal gradient (MAPG) algorithm with line search, as outlined in Algorithm 1 of [22], is utilized for optimization. This algorithm, specifically designed for nonconvex optimization, incorporates a reference update mechanism based on direct gradient descent steps to ensure sufficient decrease; it effectively guarantees performance at least on par with simple gradient descent. Subsequent numerical results will illustrate the linear convergence achieved by this algorithm in the noiseless setting.
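As a reproducible illustration of the noiseless behavior, the sketch below minimizes $F_{\delta}(x)$ with plain gradient descent plus Armijo backtracking, a simple stand-in for the MAPG algorithm of [22]; the function names are ours.

```python
import numpy as np

def srpr_value_grad(x, A, b, delta):
    """F_delta(x) = E_n[sqrt(((a^T x)^2 - b)^2 + delta^2)] and its gradient."""
    ax = A @ x
    r = ax ** 2 - b
    s = np.sqrt(r ** 2 + delta ** 2)
    grad = A.T @ (2.0 * (r / s) * ax) / len(b)
    return s.mean(), grad

def minimize_srpr(x0, A, b, delta, step=1.0, max_iter=5000, tol=1e-12):
    """Gradient descent with backtracking line search on F_delta."""
    x = x0.copy()
    f, g = srpr_value_grad(x, A, b, delta)
    for _ in range(max_iter):
        t = step
        while True:                        # backtrack until sufficient decrease
            x_new = x - t * g
            f_new, g_new = srpr_value_grad(x_new, A, b, delta)
            if f_new <= f - 0.5 * t * (g @ g) or t < 1e-14:
                break
            t *= 0.5
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x, f, g = x_new, f_new, g_new
    return x
```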

We consider six candidate algorithms in the comparison: the proximal linear algorithm with SI or RI (PL-SI and PL-RI), the inexact proximal linear algorithm with SI or RI (IPL-SI and IPL-RI), and smoothed robust phase retrieval with SI or RI (SRPR-SI and SRPR-RI).

In what follows, we examine their performance on both simulated datasets and image recovery. For an experiment with a specific dataset and initialization, let $x_{K}$ denote the final solution obtained by an algorithm. For simulated datasets, we consider the phase retrieval successful if the relative error, defined as $\mbox{dist}(x_{K},\{x_{\star},-x_{\star}\})/\|x_{\star}\|_{2}$, does not exceed $10^{-6}$ for the IPL and SRPR algorithms; for PL, a less stringent threshold of $10^{-4}$ is adopted. For image recovery, the success threshold for PL is relaxed to $10^{-1}$, whereas for IPL and SRPR it remains unchanged. This differentiation in accuracy thresholds reflects the comparatively lower optimization efficiency of PL relative to IPL and SRPR. Next, we explain the CPU time associated with each algorithm: it encompasses the aggregate time spent by all components of an algorithm. For instance, the CPU time of SRPR-SI in the presence of corruptions includes the time for modified spectral initialization, the minimization of $F_{\delta}(x)$ via MAPG, and the execution of IPL for exact recovery. These experiments are conducted on a server equipped with an Intel Xeon E5-2650v4 (2.2GHz) processor. To ensure robustness in our findings, each experimental setup, defined by a unique set of hyper-parameters related to the dataset, is replicated 50 times with varied random samples for all components involved. This procedure enables us to report both the success rate and the median CPU time for each experimental scenario. Consistent with the observations of [11], we note that an increase in $k=n/p$ and a decrease in $p_{\text{fail}}$ generally lead to higher success rates; however, the impact on CPU time does not lend itself to straightforward conclusions. The subsequent subsections present numerical experiments that furnish further insights.
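For concreteness, the success criterion can be computed as follows (a small helper of ours, not code from [11] or [38]):

```python
import numpy as np

def relative_error(x_k, x_star):
    """dist(x_K, {x_star, -x_star}) / ||x_star||_2, the criterion defined above."""
    d = min(np.linalg.norm(x_k - x_star), np.linalg.norm(x_k + x_star))
    return d / np.linalg.norm(x_star)

def is_success(x_k, x_star, tol=1e-6):
    # tol = 1e-6 for IPL/SRPR on simulated data; 1e-4 (or 1e-1 for images) for PL.
    return relative_error(x_k, x_star) <= tol
```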

5.2 Simulated Dataset

This subsection follows [11] to conduct experiments on simulated datasets. We generate datasets in which $a_{i}$, $i=1,2,\ldots,n$, independently follow $N(0,I_{p})$ and $x_{\star}\in\frac{1}{\sqrt{p}}\{-1,1\}^{p}$ has entries drawn independently from the discrete uniform distribution, so that $\|x_{\star}\|_{2}=1$. The corruptions of the measurements are independent, and the corruption type is either zeroing (i.e., letting $b_{i}=0$ for all $i\in[n]\backslash[n_{1}]$) or Cauchy-distributed. A comprehensive description of the related hyper-parameters is as follows; a data-generation sketch in code follows the list.

  • Signal size for recovery: p{500,1500}p\in\{500,1500\}.

  • Oversampling ratio k=n/p{2+0.25s|s[0,24],sN}k=n/p\in\{2+0.25s|s\in[0,24],s\in N\}.

  • Corruption type $r\in\{0,1\}$. $b_{j}=(a_{j}^{\top}x_{\star})^{2}$ for $j\in[n_{1}]$. The corrupted responses satisfy, for all $j\in[n]\backslash[n_{1}]$, $b_{j}=r\tan(\frac{\pi}{2}U_{j})\times\mbox{median}\left(\{(a_{i}^{\top}x_{\star})^{2}:i\in\{1,2,\ldots,n\}\}\right)$ with $U_{j}\sim U(0,1)$, independently across measurements. When $r=0$, the corrupted values are all equal to zero; this diminishes the effectiveness of the modified spectral initialization, which relies on the $b_{i}$'s with small magnitudes. Conversely, when $r=1$, the corrupted values are drawn from a positive Cauchy distribution, which has an infinite expected value. This distribution is chosen for its heavy tails, since the capacity to handle heavy-tailed corruption is one of the benefits of the robust model.

  • pfail{0.025s|s[0,12],sN}p_{\text{fail}}\in\{0.025s|s\in[0,12],s\in N\}.

  • Random initialization x0=x~r~x_{0}=\tilde{x}\tilde{r}, in which x~U[Bp(0,1)]\tilde{x}\sim U[\partial B_{p}(0,1)] and r~U[0,4]\tilde{r}\sim U[0,4] are independent random variables.

  • $\delta=0.25$ for SRPR in the comparisons with PL and IPL; $\delta$ can be selected easily here because $\|x_{\star}\|_{2}$ is known.
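The following sketch generates one simulated dataset under the hyper-parameters above (our illustrative helper; the name `simulate_dataset` is hypothetical):

```python
import numpy as np

def simulate_dataset(p=500, k=8.0, p_fail=0.1, r=1, seed=0):
    """Gaussian designs, sign signal with ||x_star||_2 = 1, and corrupted tail.

    r = 0 zeroes the corrupted b_i; r = 1 draws them from a scaled
    positive Cauchy distribution, following the description above.
    """
    rng = np.random.default_rng(seed)
    n = int(k * p)
    n1 = n - int(p_fail * n)                               # clean measurements
    x_star = rng.choice([-1.0, 1.0], size=p) / np.sqrt(p)  # ||x_star||_2 = 1
    A = rng.standard_normal((n, p))
    b = (A @ x_star) ** 2
    if n1 < n:
        med = np.median(b)                                 # median over all i
        u = rng.uniform(size=n - n1)
        b[n1:] = r * np.tan(np.pi / 2 * u) * med           # zeroing when r = 0
    return A, b, x_star

A, b, x_star = simulate_dataset()
```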

Next, we present the performance of the candidate algorithms. Beginning with an evaluation under noiseless conditions, Figure 4 compares CPU time and success rate for these algorithms. The figure is divided into two panels: the left panel displays the success rate, while the right panel provides the median CPU time (after logarithmic transformation) for successful trials; an absence of data points indicates a success rate of zero. It is observed that algorithms incorporating SI exhibit superior success rates compared to those employing random initialization, with SRPR-SI emerging as the most effective among them. For algorithms utilizing random initialization, SRPR achieves a $100\%$ success rate for large $n/p$ ratios, benefiting from the benign landscape; in contrast, PL and IPL fail to achieve similar success rates under these conditions. Regarding CPU time efficiency, SRPR stands out due to its local linear convergence property. This efficiency is further evidenced in Figure 5, which delineates the decline of $F_{\delta}(x_{k})$ over iterations for a trial with random initialization (where $p=500$ and $k=8$) using MAPG. The figure reveals a pronounced linear trend, with no discernible sub-linear or non-local stages.

Figure 4: Comparison of success rate and CPU time for the noiseless simulated datasets
Figure 5: Decreasing pattern of $F_{\delta}(x)$ with random initialization for the noiseless simulated dataset

Second, we focus on simulated datasets with corruptions. The conclusion of the comparison is consistent across all combinations of hyper-parameters. We pick the combination $r=1$, $p=500$, $p_{\text{fail}}=0.1$ and compare the success rate and CPU time for different choices of $n/p$.

Figure 6: Comparison of success rate and CPU time for corrupted simulated datasets

Figure 6 shows the performance under these hyper-parameters and supports the same conclusion as Figure 4 in terms of success rate. In terms of CPU time, SRPR holds only a slight advantage over IPL-SI, as SRPR eventually reverts to minimizing $F(x)$ for exact recovery.

In the sensitivity analysis for the parameter $\delta$, we focus on the relative error after minimizing $F_{\delta}(x)$ across various settings of $\delta$ and $p_{\text{fail}}$. Employing the hyper-parameters $p=1500$, $r=1$, and $k=8$, along with SI, Figure 7 presents the outcomes from two perspectives: the first plots $p_{\text{fail}}$ on the $x$-axis against the relative error on the $y$-axis, while the second uses $\delta$ as the $x$-axis. Each point represents the median relative error over 50 replicates. The analysis clearly shows that lower values of $\delta$ and $p_{\text{fail}}$ are associated with smaller relative errors. This monotone dependence on $\delta$ and $p_{\text{fail}}$ aligns with the theoretical results in Theorem 5. Notably, the precision achieved by $\min_{x\in\mathbb{R}^{p}}F_{\delta}(x)$ surpasses that of the SI, whose relative error ranges from 0.2 to 1.2.

Next, we provide numerical results showing the robustness of the model in the presence of bounded noise. We let $b_{j}=(a_{j}^{\top}x_{\star})^{2}+\tau_{j}$ for all $j\in[n_{1}]$, in which the $\tau_{j}$ are independent of $\{a_{i}\}_{i=1}^{n}$ and independently follow $U(-0.05,0.05)$. We set the threshold for success to 0.05. Figure 8 shows the experimental results, which exhibit patterns similar to Figure 6. This means that SRPR still enjoys a benign landscape and retains its advantages over PL and IPL.

Figure 7: Relative error after minimizing $F_{\delta}(x)$ for simulated datasets
Figure 8: Comparison for corrupted simulated datasets with bounded noise

5.3 Image Recovery

Now, we study the task of image recovery within the field of computational biology, leveraging the prevalent application of phase retrieval techniques [17, 31]. Consider an image array $X_{\star}\in\mathbb{R}^{p_{1}\times p_{2}\times p_{3}}$, where each element takes a value in the interval $[0,1]$. Here, $p_{1}\times p_{2}$ represents the image dimensions, and $p_{3}=3$ corresponds to the RGB color channels. We define the signal as $x_{\star}=[\mbox{vec}(X_{\star});0]\in\mathbb{R}^{p}$, where $p=\inf\{2^{s}:s\in\mathbb{N},2^{s}\geq p_{1}p_{2}p_{3}\}$, and $k=n/p\in\mathbb{N}$. The Hadamard matrix is denoted by $H_{p}\in\frac{1}{\sqrt{p}}\{-1,1\}^{p\times p}$, and we let $S_{j}\in\mbox{diag}(\{-1,1\}^{p})$, for $j=1,2,\ldots,k$, be random diagonal matrices whose diagonal elements follow independent discrete uniform distributions. The measurement matrix is defined as $A=[a_{1}^{\top};a_{2}^{\top};\ldots;a_{n}^{\top}]=\frac{1}{\sqrt{k}}[H_{p}S_{1};H_{p}S_{2};\ldots;H_{p}S_{k}]$, where the semicolons mean that the matrix blocks are stacked vertically. This configuration admits an efficient computational complexity of $O(n\log p)$ for the transformation $Ay$ for $y\in\mathbb{R}^{p}$, paralleling the complexity of the discrete Fourier transform commonly encountered in phase retrieval with complex transformation matrices; see the sketch following this description.

We use a real RNA nanoparticle image (https://visualsonline.cancer.gov/details.cfm?imageid=11167), which is also used in [11], with an original image size of $2133\times 2133\times 3$. We follow the code of [11] (https://stanford.edu/~jduchi/projects/phase-retrieval-code.tgz) to conduct experiments on a colored sub-image and pick $p=2^{18}$ or $p=2^{22}$. The related hyper-parameters are as follows: the signal size $p\in\{2^{18},2^{22}\}$, oversampling ratio $k=n/p\in\{3,6\}$, corruption type $r\in\{0,1\}$, and $p_{\text{fail}}\in\{0.025s\,|\,s\in[0,8],s\in\mathbb{N}\}$. The corrupted responses satisfy, for $j\in[n]\backslash[n_{1}]$, $b_{j}=r\tan(\frac{\pi}{2}U_{j})\times\mbox{median}\left(\{(a_{i}^{\top}x_{\star})^{2}:i\in\{1,2,\ldots,n\}\}\right)$ with $U_{j}\sim U(0,1)$, independently across measurements. The random initialization is $x=\frac{\sqrt{p}}{2}\tilde{x}$, in which $\tilde{x}\sim U[\partial B_{p}(0,1)]$.

Next, we provide the choice of $\delta$. We can find that $\mathbb{E}_{n}\left[aa^{\top}\right]=\frac{1}{n}A^{\top}A=I_{p}/n$. Based on the properties of an RGB image, we assume that $\|x_{\star}\|_{2}=c_{0}\sqrt{p}$ with $c_{0}\in(0,1)$, where $c_{0}$ is usually small when there is a large proportion of black pixels. So, $\delta$ can be selected as $\delta=\delta_{0}\left\|\frac{1}{n}A^{\top}A\right\|_{2}\|x_{\star}\|_{2}^{2}=\frac{\delta_{0}c_{0}^{2}}{k}$ for some small positive constant $\delta_{0}$. As a result, we choose $\delta=0.01/6$, since $k\leq 6$.
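The $O(n\log p)$ complexity can be realized with a fast Walsh-Hadamard transform, as in the minimal NumPy sketch below (our illustration of the construction above; `fwht` and `apply_A` are hypothetical helper names):

```python
import numpy as np

def fwht(y):
    """Fast Walsh-Hadamard transform in O(p log p); assumes len(y) is a power of 2.

    Unnormalized: the orthonormal H_p in the text corresponds to fwht(y) / sqrt(p).
    """
    y = y.copy()
    h, p = 1, len(y)
    while h < p:
        z = y.reshape(-1, 2 * h)              # blocks of 2h consecutive entries
        top, bot = z[:, :h].copy(), z[:, h:].copy()
        z[:, :h], z[:, h:] = top + bot, top - bot
        h *= 2
    return y

def apply_A(y, signs):
    """Compute A y = (1/sqrt(k)) [H_p S_1 y; ...; H_p S_k y] without forming A.

    signs is a (k, p) array of +/-1 diagonals for S_1, ..., S_k.
    """
    k, p = signs.shape
    return np.concatenate([fwht(s * y) / np.sqrt(p) for s in signs]) / np.sqrt(k)

# Example: n = k * p measurements b_i = (a_i^T y)^2 before corruption.
rng = np.random.default_rng(0)
k, p = 3, 2**10
signs = rng.choice([-1.0, 1.0], size=(k, p))
y = rng.standard_normal(p)
b = apply_A(y, signs) ** 2
```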

In our experiments on image recovery, we observe consistent results across different settings of $r$; for illustration, we select $r=0$. The success rates are reported in Table 1, highlighting that SRPR is the only method whose success does not depend on modified spectral initialization for large values of $n/p$. Regarding the median CPU time when $p_{\text{fail}}>0$, the performances of SRPR-RI, SRPR-SI, and IPL-SI are comparable: the CPU time spans from 200 to 600 seconds for $p=2^{18}$ and from 6000 to 15000 seconds for $p=2^{22}$, while the computational times of the other algorithms are significantly longer. Interestingly, when $p_{\text{fail}}=0$, IPL-SI is markedly more efficient than in the corrupted cases and surpasses both SRPR-RI and SRPR-SI in CPU time.

| Success Rate | $p_{\text{fail}}\leq 0.15$, $k=3$ | $p_{\text{fail}}\leq 0.15$, $k=6$ | $p_{\text{fail}}>0.15$, $k=3$ | $p_{\text{fail}}>0.15$, $k=6$ |
| PL-RI | 0 | 0 | 0 | 0 |
| IPL-RI | 0 | 0 | 0 | 0 |
| SRPR-RI | 0 | 1 | 0 | 1 |
| PL-SI | 1 | 1 | 0 | 1 |
| IPL-SI | 1 | 1 | 0 | 1 |
| SRPR-SI | 1 | 1 | 0 | 1 |
Table 1: Comparison of Success Rate when $p=2^{18}$ for Image Recoveries

6 Conclusion

In this paper, we fill an important gap in the literature by analyzing the essential geometric structure of nonconvex robust phase retrieval. We propose the SRPR based on a family of convolution-type smoothed loss functions as a promising alternative to the RPR based on the $\ell_{1}$-loss. We prove the benign geometric structure of the SRPR with high probability: our result under the noiseless setting matches the best available result for the least-squares formulations, and our result under infrequent but arbitrary corruptions provides the first landscape analysis of phase retrieval with corruption in the literature. We also prove the local linear convergence rate of gradient descent for solving the SRPR under the noiseless setting. Numerical experiments on both simulated datasets and an image recovery task demonstrate the performance of the SRPR.

References

  • [1]
  • Cai et al. [2022] Cai, J.-F., Huang, M., Li, D. and Wang, Y. [2022], ‘Solving phase retrieval with random initial guess is nearly as good as by spectral initialization’, Applied and Computational Harmonic Analysis 58, 60–84.
  • Cai et al. [2016] Cai, T. T., Li, X. and Ma, Z. [2016], ‘Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow’, The Annals of Statistics 44(5), 2221–2251.
  • Candes et al. [2015] Candes, E. J., Li, X. and Soltanolkotabi, M. [2015], ‘Phase retrieval via wirtinger flow: Theory and algorithms’, IEEE Transactions on Information Theory 61(4), 1985–2007.
  • Candes et al. [2013] Candes, E. J., Strohmer, T. and Voroninski, V. [2013], ‘Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming’, Communications on Pure and Applied Mathematics 66(8), 1241–1274.
  • Chai et al. [2010] Chai, A., Moscoso, M. and Papanicolaou, G. [2010], ‘Array imaging using intensity-only measurements’, Inverse Problems 27(1), 015005.
  • Chen and Candès [2017] Chen, Y. and Candès, E. J. [2017], ‘Solving random quadratic systems of equations is nearly as easy as solving linear systems’, Communications on Pure and Applied Mathematics 70(5), 822–883.
  • Chen et al. [2019] Chen, Y., Chi, Y., Fan, J. and Ma, C. [2019], ‘Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval’, Mathematical Programming 176(1), 5–37.
  • Davis et al. [2018] Davis, D., Drusvyatskiy, D., MacPhee, K. J. and Paquette, C. [2018], ‘Subgradient methods for sharp weakly convex functions’, Journal of Optimization Theory and Applications 179(3), 962–982.
  • Davis et al. [2020] Davis, D., Drusvyatskiy, D. and Paquette, C. [2020], ‘The nonsmooth landscape of phase retrieval’, IMA Journal of Numerical Analysis 40(4), 2652–2695.
  • Duchi and Ruan [2019] Duchi, J. C. and Ruan, F. [2019], ‘Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval’, Information and Inference: A Journal of the IMA 8(3), 471–529.
  • Eldar and Mendelson [2014] Eldar, Y. C. and Mendelson, S. [2014], ‘Phase retrieval: Stability and recovery guarantees’, Applied and Computational Harmonic Analysis 36(3), 473–494.
  • Fernandes et al. [2021] Fernandes, M., Guerre, E. and Horta, E. [2021], ‘Smoothing quantile regressions’, Journal of Business & Economic Statistics 39(1), 338–357.
  • Fickus et al. [2014] Fickus, M., Mixon, D. G., Nelson, A. A. and Wang, Y. [2014], ‘Phase retrieval from very few measurements’, Linear Algebra and its Applications 449, 475–499.
  • Fienup and Dainty [1987] Fienup, C. and Dainty, J. [1987], ‘Phase retrieval and image reconstruction for astronomy’, Image Recovery: Theory and Application 231, 275.
  • He et al. [2023] He, X., Pan, X., Tan, K. M. and Zhou, W.-X. [2023], ‘Smoothed quantile regression with large-scale inference’, Journal of Econometrics 232(2), 367–388.
  • Huang et al. [2021] Huang, L., Liu, T., Yang, X., Luo, Y., Rivenson, Y. and Ozcan, A. [2021], ‘Holographic image reconstruction with phase recovery and autofocusing using recurrent neural networks’, ACS Photonics 8(6), 1763–1774.
  • Jiang and Yu [2021] Jiang, R. and Yu, K. [2021], ‘Smoothing quantile regression for a distributed system’, Neurocomputing 466, 311–326.
  • Jin et al. [2017] Jin, C., Ge, R., Netrapalli, P., Kakade, S. M. and Jordan, M. I. [2017], How to escape saddle points efficiently, in ‘International Conference on Machine Learning’, PMLR, pp. 1724–1732.
  • Jin et al. [2018] Jin, C., Netrapalli, P. and Jordan, M. I. [2018], Accelerated gradient descent escapes saddle points faster than gradient descent, in ‘Conference On Learning Theory’, PMLR, pp. 1042–1085.
  • Karimi et al. [2016] Karimi, H., Nutini, J. and Schmidt, M. [2016], Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition, in ‘Joint European Conference on Machine Learning and Knowledge Discovery in Databases’, Springer, pp. 795–811.
  • Li and Lin [2015] Li, H. and Lin, Z. [2015], Accelerated proximal gradient methods for nonconvex programming, in ‘Advances in Neural Information Processing Systems’, Vol. 28, Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/file/f7664060cc52bc6f3d620bcedc94a4b6-Paper.pdf
  • Man et al. [2024] Man, R., Pan, X., Tan, K. M. and Zhou, W.-X. [2024], ‘A unified algorithm for penalized convolution smoothed quantile regression’, Journal of Computational and Graphical Statistics 33(2), 625–637.
  • Mei et al. [2018] Mei, S., Bai, Y. and Montanari, A. [2018], ‘The landscape of empirical risk for nonconvex losses’, The Annals of Statistics 46(6A), 2747–2774.
  • Miao et al. [1999] Miao, J., Charalambous, P., Kirz, J. and Sayre, D. [1999], ‘Extending the methodology of x-ray crystallography to allow imaging of micrometre-sized non-crystalline specimens’, Nature 400(6742), 342–344.
  • Miao et al. [2008] Miao, J., Ishikawa, T., Shen, Q. and Earnest, T. [2008], ‘Extending x-ray crystallography to allow the imaging of noncrystalline materials, cells, and single protein complexes’, Annu. Rev. Phys. Chem. 59, 387–410.
  • Millane [1990] Millane, R. P. [1990], ‘Phase retrieval in crystallography and optics’, JOSA A 7(3), 394–411.
  • Parikh and Boyd [2014] Parikh, N. and Boyd, S. [2014], ‘Block splitting for distributed optimization’, Mathematical Programming Computation 6(1), 77–102.
  • Roulet and d'Aspremont [2017] Roulet, V. and d'Aspremont, A. [2017], Sharpness, restart and acceleration, in ‘Advances in Neural Information Processing Systems’, Vol. 30, Curran Associates, Inc.
  • Shechtman et al. [2015] Shechtman, Y., Eldar, Y. C., Cohen, O., Chapman, H. N., Miao, J. and Segev, M. [2015], ‘Phase retrieval with application to optical imaging: a contemporary overview’, IEEE Signal Processing Magazine 32(3), 87–109.
  • Stefik [1978] Stefik, M. [1978], ‘Inferring dna structures from segmentation data’, Artificial Intelligence 11(1-2), 85–114.
  • Sun et al. [2018] Sun, J., Qu, Q. and Wright, J. [2018], ‘A geometric analysis of phase retrieval’, Foundations of Computational Mathematics 18(5), 1131–1198.
  • Tan et al. [2022] Tan, K. M., Wang, L. and Zhou, W.-X. [2022], ‘High-dimensional quantile regression: Convolution smoothing and concave regularization’, Journal of the Royal Statistical Society Series B: Statistical Methodology 84(1), 205–233.
  • Taylor [2003] Taylor, G. [2003], ‘The phase problem’, Acta Crystallographica Section D 59(11), 1881–1890.
  • Wang et al. [2017] Wang, G., Giannakis, G. B. and Eldar, Y. C. [2017], ‘Solving systems of random quadratic equations via truncated amplitude flow’, IEEE Transactions on Information Theory 64(2), 773–794.
  • Zhang et al. [2016] Zhang, H., Chi, Y. and Liang, Y. [2016], Provable non-convex phase retrieval with outliers: Median truncated wirtinger flow, in ‘International Conference on Machine Learning’, PMLR, pp. 1022–1031.
  • Zhang et al. [2017] Zhang, H., You, S., Lin, Z. and Xu, C. [2017], Fast compressive phase retrieval under bounded noise, in ‘Proceedings of the AAAI Conference on Artificial Intelligence’, Vol. 31.
  • Zheng et al. [2024] Zheng, Z., Ma, S. and Xue, L. [2024], ‘A new inexact proximal linear algorithm with adaptive stopping criteria for robust phase retrieval’, IEEE Transactions on Signal Processing 72, 1081–1093.