
Sharp Restricted Isometry Property Bounds for Low-rank Matrix Recovery Problems with Corrupted Measurements

Ziye Ma, Yingjie Bi, Javad Lavaei, Somayeh Sojoudi
Abstract

In this paper, we study a general low-rank matrix recovery problem with linear measurements corrupted by some noise. The objective is to understand under what conditions on the restricted isometry property (RIP) of the problem local search methods can find the ground truth with a small error. By analyzing the landscape of the non-convex problem, we first propose a global guarantee on the maximum distance between an arbitrary local minimizer and the ground truth under the assumption that the RIP constant is smaller than $1/2$. We show that this distance shrinks to zero as the intensity of the noise reduces. Our new guarantee is sharp in terms of the RIP constant and is much stronger than the existing results. We then present a local guarantee for problems with an arbitrary RIP constant, which states that any local minimizer is either considerably close to the ground truth or far away from it. Next, we prove the strict saddle property, which guarantees the global convergence of the perturbed gradient descent method in polynomial time. The developed results demonstrate how the noise intensity and the RIP constant of the problem affect the landscape of the problem.

1 Introduction

Low-rank matrix recovery problems arise in various applications, such as matrix completion (Candès and Recht 2009; Recht, Fazel, and Parrilo 2010), phase synchronization/retrieval (Singer 2011; Boumal 2016; Shechtman et al. 2015), robust PCA (Ge, Jin, and Zheng 2017), and several others (Chen and Chi 2018; Chi, Lu, and Chen 2019). In this paper, we study a class of low-rank matrix recovery problems where the goal is to recover a symmetric and positive semidefinite ground truth matrix $M^{*}$ with $\operatorname{rank}(M^{*})=r$ from certain linear measurements corrupted by noise. This problem can be formulated as the following optimization problem:

\min_{M\in\mathbb{R}^{n\times n}}\ \frac{1}{2}\lVert\mathcal{A}(M)-b+w\rVert^{2}\quad\operatorname{s.t.}\ \operatorname{rank}(M)\leq r,\quad M\succeq 0. (1)

Here, $\mathcal{A}:\mathbb{R}^{n\times n}\to\mathbb{R}^{m}$ is a linear operator whose action on a matrix $M$ is given by

\mathcal{A}(M)=[\langle A_{1},M\rangle,\dots,\langle A_{m},M\rangle]^{T},

where $A_{1},\dots,A_{m}\in\mathbb{R}^{n\times n}$ are called sensing matrices. In addition, $b=\mathcal{A}(M^{*})$ represents the perfect measurement of the ground truth $M^{*}$, and $w$ comes from an arbitrary probability distribution. Note that only the noisy measurement $b-w$ is available to the user; $b$ itself is unknown. In other words, from a problem-solving perspective, the random variable $w$ is hidden from the user and is modeled explicitly here only for the sake of analysis.
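To make the measurement model concrete, the following is a minimal NumPy sketch that instantiates problem (1). The Gaussian sensing matrices and Gaussian noise are illustrative assumptions made only for this example; the theory places no such restrictions.

import numpy as np

rng = np.random.default_rng(0)
n, r, m = 10, 2, 60          # illustrative sizes
sigma_w = 0.01               # illustrative noise level

# Ground truth: symmetric PSD with rank(M*) = r.
Z = rng.standard_normal((n, r))
M_star = Z @ Z.T

# Sensing matrices A_1, ..., A_m (Gaussian, an assumption for illustration).
A_mats = rng.standard_normal((m, n, n)) / np.sqrt(m)

def A_op(M):
    """The linear operator A(M) = [<A_1, M>, ..., <A_m, M>]^T."""
    return np.array([np.sum(Ai * M) for Ai in A_mats])

b = A_op(M_star)                      # perfect measurements (unknown in practice)
w = sigma_w * rng.standard_normal(m)  # noise, hidden from the user
b_noisy = b - w                       # the only data actually observed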

The modeling of the noise in this problem is critical for many practical applications in which the influence of the noise cannot be ignored. For example, state estimation is a major data analytics problem for the operation of power grids, which can be modeled as matrix sensing (Jin et al. 2019b). In this problem, each measurement comes from a physical device, and the noise in our formulation models not only the sensory noise but also mismatches between the true model of the system and the one used by a power operator, changes in the measurements due to cyber-attacks, mechanical faults, etc. In other words, by a noisy problem we mean a learning problem where a part of the data is wrong for various reasons, and, as in this application, it is impossible to assume that the measurements are noise-free.

Although it may be possible to solve the problem (1) via convex relaxations (Candès and Recht 2009; Recht, Fazel, and Parrilo 2010; Candès and Tao 2010), the computational complexity of solving a semidefinite program presents a major challenge for large-scale problems. A more scalable approach is to use the Burer–Monteiro factorization (Burer and Monteiro 2003) by expressing $M$ as $XX^{T}$ with $X\in\mathbb{R}^{n\times r}$, which leads to the following equivalent formulation of the original problem (1):

\min_{X\in\mathbb{R}^{n\times r}}\ f(X)=\frac{1}{2}\lVert\mathcal{A}(XX^{T})-b+w\rVert^{2}. (2)

The unconstrained problem (2) is often solved by local search methods such as gradient descent. Since the objective function $f(X)$ in (2) is non-convex, local search algorithms may converge to a spurious local minimizer, leading to a suboptimal or plainly wrong solution. Hence, it is desirable to provide guarantees on the maximum distance between these local minimizers and the ground truth $M^{*}$. This problem will be addressed in this paper.
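As an illustration of this local search approach, here is a minimal gradient descent sketch for (2), using the fact that $\nabla f(X)=\sum_{i}\rho_{i}(A_{i}+A_{i}^{T})X$ where $\rho$ is the measurement residual. The setup, step size, and iteration count are arbitrary choices for the example, not recommendations.

import numpy as np

rng = np.random.default_rng(1)
n, r, m = 10, 2, 60
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
M_star = Q @ Q.T                      # rank-r ground truth (a projection)
A_mats = rng.standard_normal((m, n, n)) / np.sqrt(m)
A_op = lambda M: np.array([np.sum(Ai * M) for Ai in A_mats])
b_noisy = A_op(M_star) - 0.01 * rng.standard_normal(m)   # b - w

def grad_f(X):
    # Gradient of f(X) = 0.5 * || A(X X^T) - (b - w) ||^2.
    res = A_op(X @ X.T) - b_noisy
    G = np.einsum('i,ijk->jk', res, A_mats)   # sum_i res_i * A_i
    return (G + G.T) @ X

X = 0.1 * rng.standard_normal((n, r))          # arbitrary small initialization
for _ in range(1500):
    X -= 0.2 * grad_f(X)                       # fixed step size, illustration only

print("recovery error:", np.linalg.norm(X @ X.T - M_star))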

Related Works

The special noiseless case of the problem (2) is obtained by setting $w=0$. In this case, any solution $Z$ with $ZZ^{T}=M^{*}$ is a global minimizer of the problem (2). Many previous works, such as Ge, Jin, and Zheng (2017); Bhojanapalli, Neyshabur, and Srebro (2016); Park et al. (2017); Zhang et al. (2018b); Zhu et al. (2018); Zhang, Sojoudi, and Lavaei (2019); Zhang and Zhang (2020); Bi and Lavaei (2020); Ha, Liu, and Barber (2020); Zhu et al. (2021); Zhang (2021), focused on proving that the problem has no spurious (non-global) local minimizers under the assumption of the restricted isometry property (RIP). Moreover, as demonstrated in previous works such as Ge, Jin, and Zheng (2017), the techniques developed under the RIP condition can be adapted to show that other low-rank matrix recovery problems, such as matrix completion under the incoherence condition and the robust PCA problem, also have a benign landscape. The RIP condition is equivalent to the restricted strong convexity and smoothness property used in Wang, Zhang, and Gu (2017); Park et al. (2018); Zhu et al. (2021), and its formal definition is given below.

Definition 1.

The linear operator $\mathcal{A}(\cdot):\mathbb{R}^{n\times n}\to\mathbb{R}^{m}$ is said to satisfy the $\delta$-$\operatorname{RIP}_{2r}$ property for some constant $\delta\in[0,1)$ if the inequality

(1-\delta)\lVert M\rVert_{F}^{2}\leq\lVert\mathcal{A}(M)\rVert^{2}\leq(1+\delta)\lVert M\rVert_{F}^{2}

holds for all $M\in\mathbb{R}^{n\times n}$ with $\operatorname{rank}(M)\leq 2r$.
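Computing the exact RIP constant of a given operator is intractable in general; the sketch below (an illustrative Monte Carlo probe written for this exposition, not part of the paper) only estimates a lower bound on $\delta$ by sampling random matrices of rank at most $2r$ and recording the worst distortion.

import numpy as np

rng = np.random.default_rng(2)
n, r, m = 10, 2, 120
A_mats = rng.standard_normal((m, n, n)) / np.sqrt(m)
A_op = lambda M: np.array([np.sum(Ai * M) for Ai in A_mats])

# Sample random matrices with rank <= 2r and record the distortion
# ||A(M)||^2 / ||M||_F^2; the deviation from 1 lower-bounds delta.
delta_lb = 0.0
for _ in range(2000):
    U = rng.standard_normal((n, 2 * r))
    V = rng.standard_normal((n, 2 * r))
    M = U @ V.T                                   # rank <= 2r
    ratio = np.sum(A_op(M) ** 2) / np.sum(M ** 2)
    delta_lb = max(delta_lb, abs(ratio - 1.0))

print("empirical lower bound on the RIP constant:", delta_lb)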

In the recent paper by Zhang (2021), the author developed a sharp bound for the absence of spurious local minima in the noiseless case of problem (2): the problem has no spurious local minima if the measurement operator $\mathcal{A}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property with $\delta<1/2$. This result is tight since there is a known counterexample (Zhang et al. 2018a) with spurious local minima under $\delta=1/2$.

For the noisy problem, the relation $X^{*}X^{*T}=M^{*}$ is unlikely to hold, where $X^{*}$ denotes a global minimizer of problem (2). However, $X^{*}X^{*T}$ should still be close to the ground truth $M^{*}$ if the noise $w$ is small. As a generalization of the above results for the noiseless problem, it is natural to ask whether all local minimizers, including the global ones, are close to the ground truth $M^{*}$ under the RIP assumption. One such result, presented in Bhojanapalli, Neyshabur, and Srebro (2016), is given below.

Theorem 1 (from Theorem 3.1 in Bhojanapalli, Neyshabur, and Srebro (2016)).

Suppose that $w\sim\mathcal{N}(0,\sigma_{w}^{2}I_{m})$ and $\mathcal{A}(\cdot)$ satisfies the $\delta$-$\operatorname{RIP}_{4r}$ property with $\delta<1/10$. Then, with probability at least $1-10/n^{2}$, any local minimizer $\hat{X}$ of problem (2) satisfies the inequality

\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq 20\sqrt{\frac{\log(n)}{m}}\,\sigma_{w}.

Theorem 31 in Ge, Jin, and Zheng (2017) further improves the above result by replacing the $\delta$-$\operatorname{RIP}_{4r}$ property with the $\delta$-$\operatorname{RIP}_{2r}$ property. Li et al. (2020) study a similar noisy low-rank matrix recovery problem with the $\ell_{1}$ norm.

As the comparison above shows, there is an evident gap between the state-of-the-art results for the noiseless and noisy problems: the result for the noiseless problem only requires the RIP constant $\delta<1/2$, while the one for the noisy problem requires $\delta<1/10$ no matter how small the noise is. This paper closes the gap by showing that a major generalization of Theorem 1 holds for the noisy problem under the same RIP assumption as in the sharp bound for the noiseless problem.

Notations

In this paper, $I_{n}$ refers to the identity matrix of size $n\times n$. The notation $M\succeq 0$ means that $M$ is a symmetric and positive semidefinite matrix. $\sigma_{i}(M)$ denotes the $i$-th largest singular value of a matrix $M$, and $\lambda_{i}(M)$ denotes the $i$-th largest eigenvalue of $M$. $\lVert v\rVert$ denotes the Euclidean norm of a vector $v$, while $\lVert M\rVert_{F}$ and $\lVert M\rVert_{2}$ denote the Frobenius norm and induced $\ell_{2}$ norm of a matrix $M$, respectively. $\langle A,B\rangle$ is defined as $\operatorname{tr}(A^{T}B)$ for two matrices $A$ and $B$ of the same size. The Kronecker product of $A$ and $B$ is denoted by $A\otimes B$. For a matrix $M$, $\operatorname{vec}(M)$ is the usual vectorization operation obtained by stacking the columns of $M$ into a vector. For a vector $v\in\mathbb{R}^{n^{2}}$, $\operatorname{mat}(v)$ converts $v$ into a square matrix and $\operatorname{mat}_{S}(v)$ converts $v$ into a symmetric matrix, i.e., $\operatorname{mat}(v)=M$ and $\operatorname{mat}_{S}(v)=(M+M^{T})/2$, where $M\in\mathbb{R}^{n\times n}$ is the unique matrix satisfying $v=\operatorname{vec}(M)$. Finally, $\mathcal{N}(\mu,\Sigma)$ refers to the multivariate Gaussian distribution with mean $\mu$ and covariance $\Sigma$.
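These vectorization conventions map directly to column-major reshapes; the following small helper sketch (illustrative, not from the paper) implements them in NumPy.

import numpy as np

def vec(M):
    """Stack the columns of M into a vector (column-major order)."""
    return M.reshape(-1, order='F')

def mat(v):
    """Inverse of vec: reshape v into a square matrix."""
    n = int(round(np.sqrt(v.size)))
    return v.reshape(n, n, order='F')

def mat_S(v):
    """Symmetric part of mat(v)."""
    M = mat(v)
    return (M + M.T) / 2

# Sanity checks: mat inverts vec, and <A, B> = vec(A) . vec(B).
rng = np.random.default_rng(3)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
assert np.allclose(mat(vec(A)), A)
assert np.isclose(np.trace(A.T @ B), vec(A) @ vec(B))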

2 Main Results

We first present the global guarantee on the local minimizers of problem (2). To simplify the notation, we use a matrix representation of the measurement operator $\mathcal{A}$ as follows:

\mathbf{A}=[\operatorname{vec}(A_{1}),\operatorname{vec}(A_{2}),\dots,\operatorname{vec}(A_{m})]^{T}\in\mathbb{R}^{m\times n^{2}}.

Then, $\mathbf{A}\operatorname{vec}(M)=\mathcal{A}(M)$ for every matrix $M\in\mathbb{R}^{n\times n}$.
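As a quick sanity check of this identity, one can build $\mathbf{A}$ row by row and compare the two sides numerically (a minimal sketch under the same illustrative Gaussian setup as before):

import numpy as np

rng = np.random.default_rng(4)
n, m = 6, 20
A_mats = rng.standard_normal((m, n, n))
A_op = lambda M: np.array([np.sum(Ai * M) for Ai in A_mats])

# Matrix representation: row i of A is vec(A_i)^T (column-major vec).
A_big = np.stack([Ai.reshape(-1, order='F') for Ai in A_mats])  # m x n^2

M = rng.standard_normal((n, n))
assert np.allclose(A_big @ M.reshape(-1, order='F'), A_op(M))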

Theorem 2.

Assume that the linear operator $\mathcal{A}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property with $\delta<1/2$. For every $\epsilon>0$, with probability at least $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$, at least one of the following two inequalities

(1-\delta)\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{2}\leq\epsilon\sqrt{r}\,\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}+4\epsilon\sqrt{r}\,\lVert M^{*}\rVert_{F}, (3a)

\frac{2(1-2\delta)}{3(1+\delta)}\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq 2\epsilon\sqrt{r}+2\sqrt{2\epsilon(1+\delta)}\left(\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{1/2}+\lVert M^{*}\rVert_{F}^{1/2}\right) (3b)

holds for every local minimizer $\hat{X}\in\mathbb{R}^{n\times r}$ of problem (2).

Note that two upper bounds on the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ can be obtained for any local minimizer $\hat{X}$ by solving the two quadratic-like inequalities (3a) and (3b); since only one of the two inequalities is guaranteed to hold, the larger of the two bounds must be used. The explicit upper bound obtained from Theorem 2 is given in Appendix A. The reason for the existence of two inequalities in Theorem 2 is that its proof splits into two cases: the first is when the smallest singular value $\sigma_{r}(\hat{X})$ of $\hat{X}$ is small, and the second is the opposite; the two cases are handled by Lemma 2 and Lemma 3, respectively.

Theorem 2 is a major extension of the existing sharp result stating that the noiseless problem has no spurious local minima under the same assumption of the $\delta$-$\operatorname{RIP}_{2r}$ property with $\delta<1/2$. The reason is that when the noise $w$ is equal to zero, one can choose an arbitrarily small $\epsilon$ in Theorem 2 and conclude from the inequalities (3a) and (3b) that $\hat{X}\hat{X}^{T}=M^{*}$ for every local minimizer $\hat{X}$. Moreover, when the RIP constant $\delta$ decreases further below $1/2$, the upper bound on $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ also decreases, which means that a local minimizer found by local search methods will be closer to the ground truth $M^{*}$. This suggests that the RIP condition is able not only to guarantee the absence of spurious local minima, as shown in the previous literature, but also to mitigate the influence of the noise in the measurements.

Compared with existing results such as Theorem 1, our new result has two advantages. First, by improving the RIP constant from $1/10$ to $1/2$, one can apply the results on the location of spurious local minima to a much broader class of problems, which can often help reduce the number of measurements. For example, when the measurements are given by random Gaussian matrices, it is proven in Candès and Recht (2009) that achieving the $\delta$-$\operatorname{RIP}_{2r}$ property requires a number of measurements on the order of $O(1/\delta^{2})$. By improving the RIP constant in the bound, we can significantly reduce the number of measurements while still keeping the benign landscape. In applications such as learning for energy networks, there is a fundamental limit on the number of measurements that can be collected due to the physics of the problem (Jin et al. 2021), and a better RIP bound helps address the number of measurements needed to reliably solve the problem. Second, Theorem 1 only addresses the probability of having all spurious solutions in a fixed ball around the ground truth of radius $O(\sigma_{w})$, rather than balls of arbitrary radii, and this fixed ball could be large depending on whether the noise level $\sigma_{w}$ is fixed or scales with the problem. In contrast, Theorem 2 considers the probability $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$ for an arbitrary value of $\epsilon$. By having a flexible $\epsilon$, our work not only improves the RIP constant but also allows computing the probability of having all spurious solutions in any given ball.

In the special case of rank $r=1$, the conditions (3a) and (3b) in Theorem 2 can be replaced by a simpler condition, as shown below. Its proof is very similar to that of Lemma 3 and is omitted for succinctness.

Theorem 3.

Consider the case $r=1$ and assume that the linear operator $\mathcal{A}$ satisfies the $\delta$-$\operatorname{RIP}_{2}$ property with $\delta<1/2$. For every $\epsilon>0$, with probability at least $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$, every local minimizer $\hat{X}\in\mathbb{R}^{n\times r}$ of problem (2) satisfies

\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\frac{3(1+\sqrt{2})\epsilon(1+\delta)}{1-2\delta}. (4)

When the RIP constant $\delta$ is not less than $1/2$, a global guarantee similar to Theorem 2 or Theorem 3 cannot be achieved, since the problem may have a spurious solution even in the noiseless case. Instead, we turn to local guarantees by showing that every local minimizer $\hat{X}$ of problem (2) is either close to the ground truth $M^{*}$ or far away from it in terms of the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$.

Theorem 4.

Assume that the linear operator $\mathcal{A}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property for some $\delta\in[0,1)$. Consider arbitrary constants $\epsilon>0$ and $\tau\in(0,2(\sqrt{2}-1))$ such that

\delta<\sqrt{1-\frac{3+2\sqrt{2}}{4}\tau^{2}}.

Every local minimizer $\hat{X}\in\mathbb{R}^{n\times r}$ of problem (2) satisfying

\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\tau\lambda_{r}(M^{*}) (5)

will also satisfy

\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\frac{\epsilon(1+\delta)\,C(\tau,M^{*})}{\sqrt{1-\frac{3+2\sqrt{2}}{4}\tau^{2}}-\delta} (6)

with probability at least $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$, where

C(\tau,M^{*})=\sqrt{\frac{2(\lambda_{1}(M^{*})+\tau\lambda_{r}(M^{*}))}{(1-\tau)\lambda_{r}(M^{*})}}.

The upper bounds in (5) and (6) define an outer ball and an inner ball centered at the ground truth $M^{*}$. Theorem 4 states that there is no local minimizer in the ring between the two balls. As $\epsilon$ approaches zero, the inner ball shrinks to the ground truth. The theorem thus shows that bad local minimizers must lie outside the outer ball. Note that the problem can be highly non-convex when $\delta$ is close to $1$, yet the theorem still certifies a benign landscape in a local neighborhood of the solution. Furthermore, all the theorems in this section apply to arbitrary noise models, since they make no explicit use of the probability distribution of the noise: the only required information is the probability $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$, which can be computed or bounded when the distribution of the noise is given, as illustrated in Section 4.
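The inner and outer radii in Theorem 4 are simple closed-form functions of $(\epsilon,\delta,\tau)$ and the extreme eigenvalues of $M^{*}$; a small sketch evaluating them is given below (the example numbers are arbitrary illustrative choices).

import numpy as np

def theorem4_radii(eps, delta, tau, lam1, lam_r):
    """Inner radius (6) and outer radius (5) of Theorem 4.

    lam1 and lam_r are the largest and r-th largest eigenvalues of M*.
    Returns None if (delta, tau) violate the assumptions of the theorem."""
    gate = np.sqrt(1 - (3 + 2 * np.sqrt(2)) / 4 * tau ** 2)
    if not (0 < tau < 2 * (np.sqrt(2) - 1) and delta < gate):
        return None
    C = np.sqrt(2 * (lam1 + tau * lam_r) / ((1 - tau) * lam_r))
    inner = eps * (1 + delta) * C / (gate - delta)
    outer = tau * lam_r
    return inner, outer

# Example (arbitrary numbers): no local minimizer X_hat can satisfy
# inner < ||X_hat X_hat^T - M*||_F <= outer, with the stated probability.
print(theorem4_radii(eps=0.01, delta=0.6, tau=0.5, lam1=1.5, lam_r=1.0))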

The results presented above concern the location of the local minimizers; they do not directly yield global convergence of local search methods with a fast convergence rate. To provide performance guarantees for local search methods, the next theorem establishes a stronger property of the landscape of the noisy problem, usually called the strict saddle property in the literature, which essentially says that all approximate second-order critical points are close to the ground truth.

Theorem 5.

Assume that the linear operator $\mathcal{A}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property with $\delta<1/2$. For every $\epsilon>0$ and $\kappa\geq 0$, with probability at least $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$, at least one of the following two inequalities

(1-\delta)\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{2}\leq\epsilon\sqrt{r}\,\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}+\frac{r^{1/4}\kappa}{2}\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{1/2}+\frac{r^{1/4}\kappa}{2}\lVert M^{*}\rVert_{F}^{1/2}+\left(4\sqrt{r}\epsilon+\frac{5\sqrt{r}\kappa}{2}\right)\lVert M^{*}\rVert_{F} (7a)

\frac{2(1-2\delta)}{3(1+\delta)}\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq(2\epsilon+\kappa)\sqrt{r}+\sqrt{2\kappa(1+\delta)}+2\sqrt{2\epsilon(1+\delta)}\left(\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{1/2}+\lVert M^{*}\rVert_{F}^{1/2}\right) (7b)

holds for every matrix $\hat{X}\in\mathbb{R}^{n\times r}$ satisfying

\lVert\nabla f(\hat{X})\rVert\leq\kappa,\quad\nabla^{2}f(\hat{X})\succeq-\kappa I_{nr}.

Note that this property was not previously established in the literature for $\delta<1/2$ even in the noiseless case, so our result generalizes the existing ones in that scenario as well. Moreover, it is proven by Jin et al. (2017) that the perturbed gradient descent method with an arbitrary initialization finds a solution $\hat{X}$ satisfying the requirements in Theorem 5 with high probability in $O(\operatorname{poly}(1/\kappa))$ iterations. By Theorem 5, $\hat{X}\hat{X}^{T}$ will then be close to the ground truth if $\epsilon$ and $\kappa$ are chosen sufficiently small.
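The following is a heavily simplified sketch in the spirit of the perturbed gradient descent of Jin et al. (2017), not their exact algorithm: whenever the gradient is small, isotropic noise is injected to escape approximate saddle points. All thresholds and step sizes here are arbitrary illustrative choices.

import numpy as np

def perturbed_gradient_descent(grad_f, X0, eta=0.05, g_thresh=1e-3,
                               radius=1e-2, iters=5000, seed=0):
    """Simplified perturbed gradient descent (after Jin et al. 2017):
    take gradient steps, and add a random perturbation whenever the
    gradient norm falls below g_thresh, to escape saddle points."""
    rng = np.random.default_rng(seed)
    X = X0.copy()
    for _ in range(iters):
        G = grad_f(X)
        if np.linalg.norm(G) < g_thresh:
            X += radius * rng.standard_normal(X.shape)  # escape attempt
        else:
            X -= eta * G
    return X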

Table 1 briefly summarizes our result compared with the existing literature.

Paper | Noise | RIP Assumption | Rank | Convergence
Bhojanapalli, Neyshabur, and Srebro (2016) | Isotropic Gaussian | $\delta<1/10$ | rank $r$ | N/A
Zhang, Sojoudi, and Lavaei (2019) | Noiseless | $\delta<1/2$ | rank $1$ | N/A
Zhang (2021) | Noiseless | $\delta<1/2$ | rank $r$ | N/A
Ours | Finite Variance | $\delta<1/2$ | rank $r$ | Polynomial

Table 1: Comparison between our result and the existing literature.

3 Proofs of Main Results

Before presenting the proofs, we first compute the gradient and the Hessian of the objective function $f$ of problem (2) at a point $\hat{X}$ (with the gradient identified with its vectorization in $\mathbb{R}^{nr}$):

\nabla f(\hat{X})=\hat{\mathbf{X}}^{T}\mathbf{A}^{T}(\mathbf{A}\mathbf{e}+w),\qquad\nabla^{2}f(\hat{X})=2I_{r}\otimes\operatorname{mat}_{S}(\mathbf{A}^{T}(\mathbf{A}\mathbf{e}+w))+\hat{\mathbf{X}}^{T}\mathbf{A}^{T}\mathbf{A}\hat{\mathbf{X}},

where

\mathbf{e}=\operatorname{vec}(\hat{X}\hat{X}^{T}-M^{*}),

and $\hat{\mathbf{X}}\in\mathbb{R}^{n^{2}\times nr}$ is the matrix satisfying

\hat{\mathbf{X}}\operatorname{vec}(U)=\operatorname{vec}(\hat{X}U^{T}+U\hat{X}^{T}),\quad\forall U\in\mathbb{R}^{n\times r}.
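As a numerical cross-check of these formulas (a sketch under the illustrative Gaussian setup used earlier, not part of the paper's proofs), one can build $\hat{\mathbf{X}}$ column by column and compare the vectorized gradient $\hat{\mathbf{X}}^{T}\mathbf{A}^{T}(\mathbf{A}\mathbf{e}+w)$ with a finite-difference approximation of $f$.

import numpy as np

rng = np.random.default_rng(5)
n, r, m = 5, 2, 30
A_big = rng.standard_normal((m, n * n)) / np.sqrt(m)   # rows are vec(A_i)^T
Z = rng.standard_normal((n, r))
vec = lambda M: M.reshape(-1, order='F')
b = A_big @ vec(Z @ Z.T)
w = 0.01 * rng.standard_normal(m)
f = lambda X: 0.5 * np.sum((A_big @ vec(X @ X.T) - b + w) ** 2)

X = rng.standard_normal((n, r))
e = vec(X @ X.T - Z @ Z.T)

# Build X_op with columns vec(X U^T + U X^T) for each basis matrix U.
X_op = np.zeros((n * n, n * r))
for j in range(n * r):
    U = np.zeros(n * r); U[j] = 1.0
    U = U.reshape(n, r, order='F')
    X_op[:, j] = vec(X @ U.T + U @ X.T)

grad = X_op.T @ A_big.T @ (A_big @ e + w)   # closed-form vectorized gradient

# Finite-difference check of one coordinate (column-major index mapping).
h, j = 1e-6, 3
Xp = X.copy(); Xp[j % n, j // n] += h
fd = (f(Xp) - f(X)) / h
assert abs(fd - grad[j]) < 1e-3 * max(1.0, abs(grad[j]))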

The first step in the proofs is to derive necessary conditions for a matrix $\hat{X}\in\mathbb{R}^{n\times r}$ to be an approximate second-order critical point. These conditions depend on the linear operator $\mathcal{A}$, the noise $w\in\mathbb{R}^{m}$, the solution $\hat{X}$, and the parameter $\kappa$ characterizing how close $\hat{X}$ is to an exact second-order critical point.

Lemma 1.

Given $\kappa\geq 0$, assume that $\hat{X}\in\mathbb{R}^{n\times r}$ satisfies

\lVert\nabla f(\hat{X})\rVert\leq\kappa,\quad\nabla^{2}f(\hat{X})\succeq-\kappa I_{nr}.

Then, it must satisfy the following inequalities:

\lVert\hat{\mathbf{X}}^{T}\mathbf{H}\mathbf{e}\rVert\leq 2\lVert\hat{X}\rVert_{2}\lVert\mathbf{A}^{T}w\rVert+\kappa, (8a)

2I_{r}\otimes\operatorname{mat}_{S}(\mathbf{H}\mathbf{e})+\hat{\mathbf{X}}^{T}\mathbf{H}\hat{\mathbf{X}}\succeq-(2\lVert\mathbf{A}^{T}w\rVert+\kappa)I_{nr}, (8b)

where $\mathbf{H}=\mathbf{A}^{T}\mathbf{A}$.

Proof.

To obtain condition (8a), notice that $\lVert\nabla f(\hat{X})\rVert\leq\kappa$ implies that

\lVert\hat{\mathbf{X}}^{T}\mathbf{H}\mathbf{e}\rVert\leq\lVert\hat{\mathbf{X}}^{T}\mathbf{A}^{T}w\rVert+\kappa\leq\lVert\hat{\mathbf{X}}\rVert_{2}\lVert\mathbf{A}^{T}w\rVert+\kappa\leq 2\lVert\hat{X}\rVert_{2}\lVert\mathbf{A}^{T}w\rVert+\kappa,

in which the last inequality is due to

\lVert\hat{\mathbf{X}}\operatorname{vec}(U)\rVert=\lVert\hat{X}U^{T}+U\hat{X}^{T}\rVert_{F}\leq 2\lVert\hat{X}\rVert_{2}\lVert U\rVert_{F}

for every $U\in\mathbb{R}^{n\times r}$. Similarly, $\nabla^{2}f(\hat{X})\succeq-\kappa I_{nr}$ implies that

2I_{r}\otimes\operatorname{mat}_{S}(\mathbf{H}\mathbf{e})+\hat{\mathbf{X}}^{T}\mathbf{H}\hat{\mathbf{X}}\succeq-2I_{r}\otimes\operatorname{mat}_{S}(\mathbf{A}^{T}w)-\kappa I_{nr}.

On the other hand, the eigenvalues of $I_{r}\otimes\operatorname{mat}_{S}(\mathbf{A}^{T}w)$ are the same as those of $\operatorname{mat}_{S}(\mathbf{A}^{T}w)$, and each eigenvalue $\lambda_{i}(\operatorname{mat}_{S}(\mathbf{A}^{T}w))$ of the latter matrix further satisfies

\lvert\lambda_{i}(\operatorname{mat}_{S}(\mathbf{A}^{T}w))\rvert\leq\lVert\operatorname{mat}_{S}(\mathbf{A}^{T}w)\rVert_{F}\leq\lVert\mathbf{A}^{T}w\rVert,

which proves condition (8b). ∎

If $\hat{X}$ is a local minimizer of problem (2), Lemma 1 shows that $\hat{X}$ satisfies the inequalities (8a) and (8b) with $\kappa=0$. Similarly, Theorem 2 can be regarded as the special case of Theorem 5 with $\kappa=0$. The proofs of these two theorems proceed by inspecting two cases. The following lemma deals with the first case, in which $\hat{X}$ is an approximate second-order critical point with $\sigma_{r}(\hat{X})$ close to zero.

Lemma 2.

Given $\hat{X}\in\mathbb{R}^{n\times r}$ and arbitrary constants $\epsilon>0$ and $\kappa\geq 0$, the inequalities

\sigma_{r}(\hat{X})\leq\sqrt{\frac{\epsilon+\kappa}{1+\delta}},\quad\lVert\nabla f(\hat{X})\rVert\leq\kappa,\quad\nabla^{2}f(\hat{X})\succeq-\kappa I_{nr}

and $\lVert\mathbf{A}^{T}w\rVert\leq\epsilon$ together imply the inequality (7a).

Proof.

Let $G=\operatorname{mat}_{S}(\mathbf{H}\mathbf{e})$ and let $u\in\mathbb{R}^{n}$ be a unit eigenvector of $G$ corresponding to its minimum eigenvalue, i.e.,

\lVert u\rVert=1,\quad Gu=\lambda_{\min}(G)u.

In addition, let $v\in\mathbb{R}^{r}$ be a singular vector of $\hat{X}$ such that

\lVert v\rVert=1,\quad\lVert\hat{X}v\rVert=\sigma_{r}(\hat{X}).

Let $\mathbf{U}=\operatorname{vec}(uv^{T})$. Then, $\lVert\mathbf{U}\rVert\leq 1$ and (8b) implies that

-2\epsilon-\kappa\leq 2\mathbf{U}^{T}(I_{r}\otimes\operatorname{mat}_{S}(\mathbf{H}\mathbf{e}))\mathbf{U}+\mathbf{U}^{T}\hat{\mathbf{X}}^{T}\mathbf{H}\hat{\mathbf{X}}\mathbf{U}\leq 2\operatorname{tr}(vu^{T}Guv^{T})+(1+\delta)\lVert\hat{X}vu^{T}+uv^{T}\hat{X}^{T}\rVert_{F}^{2}\leq 2\lambda_{\min}(G)+4(1+\delta)\sigma_{r}(\hat{X})^{2}\leq 2\lambda_{\min}(G)+4\epsilon+4\kappa. (9)

On the other hand,

(1-\delta)\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{2}\leq\mathbf{e}^{T}\mathbf{H}\mathbf{e}=\operatorname{vec}(\hat{X}\hat{X}^{T})^{T}\mathbf{H}\mathbf{e}-\operatorname{vec}(M^{*})^{T}\mathbf{H}\mathbf{e}=\frac{1}{2}\operatorname{vec}(\hat{X})^{T}\hat{\mathbf{X}}^{T}\mathbf{H}\mathbf{e}-\langle M^{*},\operatorname{mat}_{S}(\mathbf{H}\mathbf{e})\rangle\leq\frac{1}{2}\lVert\hat{X}\rVert_{2}\lVert\hat{\mathbf{X}}^{T}\mathbf{H}\mathbf{e}\rVert+\left(3\epsilon+\frac{5\kappa}{2}\right)\operatorname{tr}(M^{*})\leq\epsilon\lVert\hat{X}\rVert_{2}^{2}+\frac{\kappa}{2}\lVert\hat{X}\rVert_{2}+\left(3\epsilon+\frac{5\kappa}{2}\right)\operatorname{tr}(M^{*}),

in which the second-to-last inequality is due to (9) and the last inequality is due to (8a). Furthermore, the right-hand side of the above inequality can be relaxed as

\epsilon\lVert\hat{X}\rVert_{2}^{2}+\frac{\kappa}{2}\lVert\hat{X}\rVert_{2}+\left(3\epsilon+\frac{5\kappa}{2}\right)\operatorname{tr}(M^{*})\leq\epsilon\sqrt{r}\lVert\hat{X}\hat{X}^{T}\rVert_{F}+\frac{r^{1/4}\kappa}{2}\lVert\hat{X}\hat{X}^{T}\rVert_{F}^{1/2}+\left(3\epsilon+\frac{5\kappa}{2}\right)\sqrt{r}\lVert M^{*}\rVert_{F}\leq\epsilon\sqrt{r}\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}+\frac{r^{1/4}\kappa}{2}\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{1/2}+\left(4\sqrt{r}\epsilon+\frac{5\sqrt{r}\kappa}{2}\right)\lVert M^{*}\rVert_{F}+\frac{r^{1/4}\kappa}{2}\lVert M^{*}\rVert_{F}^{1/2},

which leads to the inequality (7a). ∎

The remaining case with

\sigma_{r}(\hat{X})>\sqrt{\frac{\epsilon+\kappa}{1+\delta}}

will be handled in the following lemma using a different method.

Lemma 3.

Assume that the linear operator $\mathcal{A}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property with $\delta<1/2$. Given $\hat{X}\in\mathbb{R}^{n\times r}$ and arbitrary constants $\epsilon>0$ and $\kappa\geq 0$, the inequalities

\sigma_{r}(\hat{X})>\sqrt{\frac{\epsilon+\kappa}{1+\delta}},\quad\lVert\nabla f(\hat{X})\rVert\leq\kappa,\quad\nabla^{2}f(\hat{X})\succeq-\kappa I_{nr}

and $\lVert\mathbf{A}^{T}w\rVert\leq\epsilon$ together imply the inequality (7b).

The proofs of both Lemma 3 and the local guarantee in Theorem 4 generalize the proof of the absence of spurious local minima for the noiseless problem in Zhang and Zhang (2020); Zhang (2021). Our innovation here is to develop new techniques for analyzing approximate optimality conditions, because, unlike in the noiseless problem, the local minimizers of the noisy problem are only approximate second-order critical points of the distance function $\lVert\mathcal{A}(XX^{T})-b\rVert^{2}$. For a fixed solution $\hat{X}$ and noise $w$, one can find an operator $\hat{\mathcal{A}}$ satisfying the $\delta$-$\operatorname{RIP}_{2r}$ property with the smallest possible $\delta$ such that $\hat{X}$ and $\hat{\mathcal{A}}$ satisfy the necessary conditions stated in Lemma 1. Let $\delta^{*}(\hat{X})$ be the RIP constant of this measurement operator $\hat{\mathcal{A}}$ in the worst-case scenario. Then, if $\hat{X}$ in Lemma 3 is a solution of the current problem with the linear operator $\mathcal{A}$ satisfying the $\delta$-$\operatorname{RIP}_{2r}$ property, it holds that $\delta\geq\delta^{*}(\hat{X})$, which can further lead to an upper bound on the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$.

To compute $\delta^{*}(\hat{X})$ defined above, let $q=\mathbf{A}^{T}w$ and solve the following optimization problem, whose optimal value is $\delta^{*}(\hat{X})$:

\min_{\delta,\hat{\mathbf{H}}}\ \delta (10)
\operatorname{s.t.}\ \lVert\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\mathbf{e}\rVert\leq 2\lVert\hat{X}\rVert_{2}\lVert q\rVert+\kappa,
2I_{r}\otimes\operatorname{mat}_{S}(\hat{\mathbf{H}}\mathbf{e})+\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\hat{\mathbf{X}}\succeq-(2\lVert q\rVert+\kappa)I_{nr},
\text{$\hat{\mathbf{H}}$ is symmetric and satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property}.

Note that a matrix $\hat{\mathbf{H}}\in\mathbb{R}^{n^{2}\times n^{2}}$ is said to satisfy the $\delta$-$\operatorname{RIP}_{2r}$ property if

(1-\delta)\lVert\mathbf{U}\rVert^{2}\leq\mathbf{U}^{T}\hat{\mathbf{H}}\mathbf{U}\leq(1+\delta)\lVert\mathbf{U}\rVert^{2}

holds for every matrix $U\in\mathbb{R}^{n\times n}$ with $\operatorname{rank}(U)\leq 2r$, where $\mathbf{U}=\operatorname{vec}(U)$. Obviously, for a linear operator $\hat{\mathcal{A}}$ with matrix representation $\hat{\mathbf{A}}$, the matrix $\hat{\mathbf{H}}=\hat{\mathbf{A}}^{T}\hat{\mathbf{A}}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property if and only if $\hat{\mathcal{A}}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property.

However, since problem (10) is non-convex due to the RIP constraint, we instead solve the following convex reformulation:

\min_{\delta,\hat{\mathbf{H}}}\ \delta (11)
\operatorname{s.t.}\ \lVert\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\mathbf{e}\rVert\leq 2\lVert\hat{X}\rVert_{2}\lVert q\rVert+\kappa,
2I_{r}\otimes\operatorname{mat}_{S}(\hat{\mathbf{H}}\mathbf{e})+\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\hat{\mathbf{X}}\succeq-(2\lVert q\rVert+\kappa)I_{nr},
(1-\delta)I_{n^{2}}\preceq\hat{\mathbf{H}}\preceq(1+\delta)I_{n^{2}}.

Lemma 14 in Bi and Lavaei (2020) proves that problems (10) and (11) have the same optimal value. The remaining step in the proof of Lemma 3 is to solve the optimization problem (11) for given $\hat{X}$, $q$ and $\kappa$. The complete proof of Lemma 3 is lengthy and deferred to Appendix B. Finally, Theorem 2 and Theorem 5 are direct consequences of Lemma 2 and Lemma 3. The proof of Theorem 3 is very similar to that of Lemma 3 and is also given in Appendix B.
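For small instances, problem (11) is a standard SDP. Below is a minimal CVXPY sketch written for this exposition (the paper solves (11) analytically in its proofs, not numerically); the dimensions, `Z`, and `q` are illustrative stand-ins, and the block structure of $I_{r}\otimes\operatorname{mat}_{S}(\hat{\mathbf{H}}\mathbf{e})$ is built explicitly to avoid variable-dependent Kronecker products.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(6)
n, r, kappa = 4, 1, 0.0
vec = lambda M: M.reshape(-1, order='F')

X_hat = rng.standard_normal((n, r))
Z = rng.standard_normal((n, r))
e = vec(X_hat @ X_hat.T - Z @ Z.T)              # e = vec(X X^T - M*)
q = 0.01 * rng.standard_normal(n * n)           # stands in for A^T w

# X_op satisfies X_op @ vec(U) = vec(X_hat U^T + U X_hat^T).
X_op = np.zeros((n * n, n * r))
for j in range(n * r):
    U = np.zeros(n * r); U[j] = 1.0
    U = U.reshape(n, r, order='F')
    X_op[:, j] = vec(X_hat @ U.T + U @ X_hat.T)

H = cp.Variable((n * n, n * n), symmetric=True)
delta = cp.Variable()
He = cp.reshape(H @ e, (n, n), order='F')
S = (He + He.T) / 2                             # mat_S(H e), affine in H
lmi = cp.bmat([[2 * S if i == j else np.zeros((n, n)) for j in range(r)]
               for i in range(r)]) + X_op.T @ H @ X_op
lmi = (lmi + lmi.T) / 2                         # enforce symmetry symbolically
rhs = 2 * np.linalg.norm(X_hat, 2) * np.linalg.norm(q) + kappa

prob = cp.Problem(cp.Minimize(delta), [
    cp.norm(X_op.T @ H @ e) <= rhs,
    lmi >> -(2 * np.linalg.norm(q) + kappa) * np.eye(n * r),
    H >> (1 - delta) * np.eye(n * n),
    (1 + delta) * np.eye(n * n) >> H,
])
prob.solve()
print("delta*(X_hat) ~", delta.value)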

Now, we turn to the proof of the local guarantee in Theorem 4. The following existing result will be useful.

Lemma 4 (from Lemma 14 in Zhang, Sojoudi, and Lavaei (2019)).

Given $a,b\in\mathbb{R}^{n}$, the rank-2 matrix $ab^{T}+ba^{T}$ has two possibly nonzero eigenvalues

\lVert a\rVert\lVert b\rVert(1+\cos\theta),\quad-\lVert a\rVert\lVert b\rVert(1-\cos\theta).

Here, $\theta$ is the angle between $a$ and $b$.
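This fact is easy to confirm numerically (a quick sketch, not part of the proof):

import numpy as np

rng = np.random.default_rng(7)
a, b = rng.standard_normal(5), rng.standard_normal(5)
cos_t = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
scale = np.linalg.norm(a) * np.linalg.norm(b)

eigs = np.linalg.eigvalsh(np.outer(a, b) + np.outer(b, a))
assert np.isclose(eigs.max(), scale * (1 + cos_t))   # largest eigenvalue
assert np.isclose(eigs.min(), -scale * (1 - cos_t))  # smallest eigenvalue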

Proof of Theorem 4.

First, we relax the optimization problem (11) by dropping the constraint corresponding to the second-order necessary optimality condition (and setting $\kappa=0$, since Theorem 4 concerns exact local minimizers). This gives rise to the optimization problem

\min_{\delta,\hat{\mathbf{H}}}\ \delta (12)
\operatorname{s.t.}\ \lVert\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\mathbf{e}\rVert\leq 2\lVert\hat{X}\rVert_{2}\lVert q\rVert,
(1-\delta)I_{n^{2}}\preceq\hat{\mathbf{H}}\preceq(1+\delta)I_{n^{2}}.

To further simplify problem (12), one can replace its decision variable $\delta$ with $\eta$ and introduce the following optimization problem:

\max_{\eta,\hat{\mathbf{H}}}\ \eta (13)
\operatorname{s.t.}\ \lVert\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\mathbf{e}\rVert\leq 2\lVert\hat{X}\rVert_{2}\lVert q\rVert,
\eta I_{n^{2}}\preceq\hat{\mathbf{H}}\preceq I_{n^{2}}.

Given any feasible solution $(\delta,\hat{\mathbf{H}})$ to (12), the tuple

\left(\frac{1-\delta}{1+\delta},\frac{1}{1+\delta}\hat{\mathbf{H}}\right)

is a feasible solution to problem (13). Therefore, if the optimal value of (12) is denoted by $\delta_{f}^{*}(\hat{X})$ and the optimal value of (13) by $\eta_{f}^{*}(\hat{X})$, then it holds that

\eta_{f}^{*}(\hat{X})\geq\frac{1-\delta_{f}^{*}(\hat{X})}{1+\delta_{f}^{*}(\hat{X})}\geq\frac{1-\delta^{*}(\hat{X})}{1+\delta^{*}(\hat{X})}\geq\frac{1-\delta}{1+\delta}, (14)

in which the last inequality is implied by $\delta\geq\delta^{*}(\hat{X})$, as shown above. To prove the inequality (6), we need to bound $\eta_{f}^{*}(\hat{X})$ from above, which can be achieved by finding a feasible solution to the dual problem of (13), given below:

\min_{U_{1},U_{2},G,\lambda,y}\ \operatorname{tr}(U_{2})+4\lVert\hat{X}\rVert^{2}_{2}\lVert q\rVert^{2}\lambda+\operatorname{tr}(G) (15)
\operatorname{s.t.}\ \operatorname{tr}(U_{1})=1,
(\hat{\mathbf{X}}y)\mathbf{e}^{T}+\mathbf{e}(\hat{\mathbf{X}}y)^{T}=U_{1}-U_{2},
\begin{bmatrix}G&-y\\-y^{T}&\lambda\end{bmatrix}\succeq 0,
U_{1}\succeq 0,\quad U_{2}\succeq 0.

For any matrix $\hat{X}\in\mathbb{R}^{n\times r}$ satisfying $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\tau\lambda_{r}(M^{*})$, we have $\hat{X}\neq 0$, and it has been shown in the proof of Lemma 19 in Bi and Lavaei (2020) that there exists $y\neq 0$ satisfying the inequalities

\lVert\hat{\mathbf{X}}y\rVert^{2}\geq 2\lambda_{r}(\hat{X}\hat{X}^{T})\lVert y\rVert^{2}, (16a)
\cos\theta\geq\sqrt{1-\frac{3+2\sqrt{2}}{4}\tau^{2}}, (16b)

where $\theta$ is the angle between $\hat{\mathbf{X}}y$ and $\mathbf{e}$. Note that (16b) holds only in the exactly parameterized regime, i.e., the case with $\operatorname{rank}(M^{*})=r$, since the derivation of (16b) relies on Lemma 5.4 of Tu et al. (2016), which fails when $\operatorname{rank}(M^{*})<r$. This is the main reason why our result cannot be directly generalized to the overparameterized case. Now, define

M=(\hat{\mathbf{X}}y)\mathbf{e}^{T}+\mathbf{e}(\hat{\mathbf{X}}y)^{T},

and decompose $M$ as $M=[M]_{+}-[M]_{-}$ with $[M]_{+}\succeq 0$ and $[M]_{-}\succeq 0$. Then, it is easy to verify that $(U_{1}^{*},U_{2}^{*},G^{*},\lambda^{*},y^{*})$ defined as

y^{*}=\frac{y}{\operatorname{tr}([M]_{+})},\quad U_{1}^{*}=\frac{[M]_{+}}{\operatorname{tr}([M]_{+})},\quad U_{2}^{*}=\frac{[M]_{-}}{\operatorname{tr}([M]_{+})},\quad G^{*}=\frac{y^{*}(y^{*})^{T}}{\lambda^{*}},\quad\lambda^{*}=\frac{\lVert y^{*}\rVert}{2\lVert\hat{X}\rVert_{2}\lVert q\rVert}

forms a feasible solution to the dual problem (15) with the objective value

\frac{\operatorname{tr}([M]_{-})+4\lVert\hat{X}\rVert_{2}\lVert q\rVert\lVert y\rVert}{\operatorname{tr}([M]_{+})}. (17)

Furthermore, $\operatorname{rank}(M^{*})=r$ implies that $\lambda_{r}(M^{*})>0$. By the Wielandt–Hoffman theorem,

\lvert\lambda_{r}(\hat{X}\hat{X}^{T})-\lambda_{r}(M^{*})\rvert\leq\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\tau\lambda_{r}(M^{*}),\quad\lvert\lambda_{1}(\hat{X}\hat{X}^{T})-\lambda_{1}(M^{*})\rvert\leq\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\tau\lambda_{r}(M^{*}).

Thus, using the above two inequalities and inequality (16a), we have

\frac{2\lVert\hat{X}\rVert_{2}\lVert y\rVert}{\lVert\hat{\mathbf{X}}y\rVert}\leq\frac{2\lVert\hat{X}\rVert_{2}}{\sqrt{2\lambda_{r}(\hat{X}\hat{X}^{T})}}\leq\sqrt{\frac{2(\lambda_{1}(M^{*})+\tau\lambda_{r}(M^{*}))}{(1-\tau)\lambda_{r}(M^{*})}}=C(\tau,M^{*}). (18)

Next, according to Lemma 4, one can write

\operatorname{tr}([M]_{+})=\lVert\hat{\mathbf{X}}y\rVert\lVert\mathbf{e}\rVert(1+\cos\theta),\quad\operatorname{tr}([M]_{-})=\lVert\hat{\mathbf{X}}y\rVert\lVert\mathbf{e}\rVert(1-\cos\theta).

Substituting the above two equations and (18) into the dual objective value (17), one can obtain

\eta_{f}^{*}(\hat{X})\leq\frac{1-\cos\theta+2C(\tau,M^{*})\lVert q\rVert/\lVert\mathbf{e}\rVert}{1+\cos\theta},

which together with (14) implies that

\lVert\mathbf{e}\rVert\leq(1+\delta)C(\tau,M^{*})\lVert q\rVert(\cos\theta-\delta)^{-1}.

The inequality (6) can then be proved by combining the above inequality with (16b) under the probabilistic event that $\lVert q\rVert\leq\epsilon$. ∎

[Figure 1: Comparison of the upper bounds given by Theorem 2 for the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ with $\hat{X}$ being an arbitrary local minimizer. Panel (a): the upper bound derived from inequality (3a); panel (b): the upper bound derived from inequality (3b).]

[Figure 2: Upper bounds given by Theorem 2 for the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ based on the explicit distribution of the noise $w$. Panel (a): the upper bound derived from inequality (3a); panel (b): the upper bound derived from inequality (3b).]

4 Numerical Illustration

Next, we empirically study the developed probabilistic guarantees: we examine the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ between an arbitrary local minimizer $\hat{X}$ and the ground truth $M^{*}$, as well as the RIP constant $\delta$ that the linear operator $\mathcal{A}$ must satisfy.

Before delving into the numerical illustration, note that the probability $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$ used in both Theorem 2 and Theorem 4 can be bounded from below by the probability $\mathbb{P}(\lVert w\rVert\leq w_{0})$ with $w_{0}=\epsilon/\lVert\mathbf{A}\rVert_{2}$. The latter probability can be easily estimated when the probability distribution of the noise $w$ is given. For example, in the simplest case when $w$ is sampled from an isotropic Gaussian distribution, i.e., $w\sim\mathcal{N}(0,\sigma^{2}I_{m})$, the random variable $\lVert w/\sigma\rVert^{2}$ follows the chi-square distribution, and one can apply the Chernoff bound to obtain

\mathbb{P}(\lVert w\rVert\leq w_{0})=1-\mathbb{P}\left(\left\lVert\frac{w}{\sigma}\right\rVert^{2}\geq\frac{w_{0}^{2}}{\sigma^{2}}\right)\geq 1-\inf_{0\leq t<1/2}(1-2t)^{-m/2}\mathrm{e}^{-tw_{0}^{2}/\sigma^{2}}.

After solving the minimization problem in the above equation, we obtain

1-\left(\frac{2m\sigma^{2}}{w_{0}^{2}}\right)^{-m/2}\mathrm{e}^{m-\frac{w_{0}^{2}}{2\sigma^{2}}}\leq\mathbb{P}(\lVert w\rVert\leq w_{0}).

More generally, if $w$ is a $(\sigma/\sqrt{m})$-sub-Gaussian vector, then applying Lemma 1 in Jin et al. (2019a) leads to

1-2\mathrm{e}^{-\frac{w_{0}^{2}}{16m\sigma^{2}}}\leq\mathbb{P}(\lVert w\rVert\leq w_{0}).
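The following sketch (illustrative, using the paper's Gaussian special case with arbitrary example numbers) compares the closed-form Chernoff lower bound above against a Monte Carlo estimate of $\mathbb{P}(\lVert w\rVert\leq w_{0})$.

import numpy as np

rng = np.random.default_rng(8)
m, sigma, w0 = 10, 0.05 / np.sqrt(10), 0.1

# Chernoff lower bound for isotropic Gaussian noise, as displayed above
# (meaningful when w0^2 > m * sigma^2).
chernoff = 1 - (2 * m * sigma**2 / w0**2) ** (-m / 2) \
             * np.exp(m - w0**2 / (2 * sigma**2))

# Monte Carlo estimate of P(||w|| <= w0).
samples = sigma * rng.standard_normal((100000, m))
mc = np.mean(np.linalg.norm(samples, axis=1) <= w0)

print(f"Chernoff lower bound: {chernoff:.4f}, Monte Carlo: {mc:.4f}")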
[Figure 3: Comparison of the maximum RIP constants $\delta$ allowed by Theorem 2 and Theorem 4 to guarantee a given maximum distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ for an arbitrary local minimizer $\hat{X}$ satisfying (5) with a given probability. Panel (a): the $\delta$ bound in Theorem 2 with $\tau=+\infty$; panels (b)-(d): the $\delta$ bound in Theorem 4 with $\tau=0.2$, $\tau=0.5$, and $\tau=0.8$, respectively.]

For the numerical illustration, assume that $n=50$, $m=10$ and $\lVert\mathbf{A}\rVert_{2}\leq 2$, while the noise $w$ is a $(0.05/\sqrt{m})$-sub-Gaussian vector. We also assume that the ground truth $M^{*}$ has rank 5 with the largest eigenvalue being 1.5 and the smallest nonzero eigenvalue being 1.

First, we use the two inequalities (3a) and (3b) in Theorem 2 to obtain two upper bounds on $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$, where $\hat{X}$ denotes an arbitrary (worst-case) local minimizer. Figure 1 gives the contour plots of the two upper bounds, which hold with the given probability on the $y$-axis and the given RIP constant $\delta$ ranging from 0 to $1/2$ on the $x$-axis. While the final bound on $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ is often determined by the inequality (3b), the inequality (3a) is needed theoretically to deal with the case when $\hat{X}$ has a singular value close to 0.

Furthermore, a tighter bound on the success probability can be derived by calculating the exact probability $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$ for an explicit distribution of $w$. Figure 2 is obtained in this way under the same assumptions as those for Figure 1, except that $w$ is isotropic Gaussian with the same parameter as in the sub-Gaussian assumption. Compared with Figure 1, the shape is similar, but the bound is tighter.

Next, we illustrate the bounds given by Theorem 2 and Theorem 4. Figure 3 shows the contour plots of the maximum RIP constant $\delta$ that guarantees, with the given probability on the $y$-axis, that every local minimizer $\hat{X}$ (satisfying the inequality (5) when Theorem 4 is applied) lies within a given neighborhood of the ground truth, measured via the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ on the $x$-axis, as implied by the respective global and local guarantees. Figure 3 clearly shows that a smaller RIP constant $\delta$ leads to a tighter bound on the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ with a higher probability. In addition, the local guarantee generally requires a looser RIP assumption, as it still holds even when $\delta>1/2$. However, as the parameter $\tau$ in Theorem 4 increases, the local bound degrades quickly, sometimes becoming worse than the global bound, as illustrated in Figure 3(d).

Moreover, in our experiments we have also tried different values of the problem parameters $m$ and $n$. They all yield similar results, which are included in Appendix C for completeness.

5 Conclusion

In this paper, we develop global and local analyses of the locations of the local minima of the low-rank matrix recovery problem with noisy linear measurements. Unlike the existing results, the probability distribution of the noise is arbitrary and the RIP constant of the problem is free to take any value. The developed results encompass the state-of-the-art results on the non-existence of spurious solutions in the noiseless case. Furthermore, we prove the strict saddle property, which guarantees the global convergence of the perturbed gradient descent method in polynomial time. Our analyses show how the value of the RIP constant and the intensity of the noise affect the landscape of the non-convex learning problem and the locations of the local minima relative to the ground truth. Future research directions include extending our results to the cases where the matrices are asymmetric, the measurements are nonlinear, or the problem is overparameterized with $\operatorname{rank}(M^{*})$ less than $r$.

References

  • Bhojanapalli, Neyshabur, and Srebro (2016) Bhojanapalli, S.; Neyshabur, B.; and Srebro, N. 2016. Global Optimality of Local Search for Low Rank Matrix Recovery. In Advances in Neural Information Processing Systems, volume 29.
  • Bi and Lavaei (2020) Bi, Y.; and Lavaei, J. 2020. Global and Local Analyses of Nonlinear Low-Rank Matrix Recovery Problems. arXiv:2010.04349.
  • Boumal (2016) Boumal, N. 2016. Nonconvex phase synchronization. SIAM Journal on Optimization, 26(4): 2355–2377.
  • Burer and Monteiro (2003) Burer, S.; and Monteiro, R. D. 2003. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2): 329–357.
  • Candès and Recht (2009) Candès, E. J.; and Recht, B. 2009. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6): 717–772.
  • Candès and Tao (2010) Candès, E. J.; and Tao, T. 2010. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5): 2053–2080.
  • Chen and Chi (2018) Chen, Y.; and Chi, Y. 2018. Harnessing structures in big data via guaranteed low-rank matrix estimation: Recent theory and fast algorithms via convex and nonconvex optimization. IEEE Signal Processing Magazine, 35(4): 14–31.
  • Chi, Lu, and Chen (2019) Chi, Y.; Lu, Y. M.; and Chen, Y. 2019. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20): 5239–5269.
  • Ge, Jin, and Zheng (2017) Ge, R.; Jin, C.; and Zheng, Y. 2017. No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 1233–1242.
  • Ha, Liu, and Barber (2020) Ha, W.; Liu, H.; and Barber, R. F. 2020. An Equivalence Between Critical Points for Rank Constraints Versus Low-Rank Factorizations. SIAM Journal on Optimization, 30(4): 2927–2955.
  • Jin et al. (2017) Jin, C.; Ge, R.; Netrapalli, P.; Kakade, S. M.; and Jordan, M. I. 2017. How to Escape Saddle Points Efficiently. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 1724–1732.
  • Jin et al. (2019a) Jin, C.; Netrapalli, P.; Ge, R.; Kakade, S. M.; and Jordan, M. I. 2019a. A short note on concentration inequalities for random vectors with subGaussian norm. arXiv:1902.03736.
  • Jin et al. (2021) Jin, M.; Lavaei, J.; Sojoudi, S.; and Baldick, R. 2021. Boundary Defense Against Cyber Threat for Power System State Estimation. IEEE Transactions on Information Forensics and Security, 16: 1752–1767.
  • Jin et al. (2019b) Jin, M.; Molybog, I.; Mohammadi-Ghazi, R.; and Lavaei, J. 2019b. Towards robust and scalable power system state estimation. In 2019 IEEE 58th Conference on Decision and Control (CDC), 3245–3252. IEEE.
  • Li et al. (2020) Li, X.; Zhu, Z.; Man-Cho So, A.; and Vidal, R. 2020. Nonconvex robust low-rank matrix recovery. SIAM Journal on Optimization, 30(1): 660–686.
  • Park et al. (2018) Park, D.; Kyrillidis, A.; Caramanis, C.; and Sanghavi, S. 2018. Finding low-rank solutions via nonconvex matrix factorization, efficiently and provably. SIAM Journal on Imaging Sciences, 11(4): 2165–2204.
  • Park et al. (2017) Park, D.; Kyrillidis, A.; Caramanis, C.; and Sanghavi, S. 2017. Non-square Matrix Sensing Without Spurious Local Minima via the Burer–Monteiro Approach. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, 65–74.
  • Recht, Fazel, and Parrilo (2010) Recht, B.; Fazel, M.; and Parrilo, P. A. 2010. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3): 471–501.
  • Shechtman et al. (2015) Shechtman, Y.; Eldar, Y. C.; Cohen, O.; Chapman, H. N.; Miao, J.; and Segev, M. 2015. Phase retrieval with application to optical imaging: A contemporary overview. IEEE Signal Processing Magazine, 32(3): 87–109.
  • Singer (2011) Singer, A. 2011. Angular synchronization by eigenvectors and semidefinite programming. Applied and Computational Harmonic Analysis, 30(1): 20–36.
  • Tu et al. (2016) Tu, S.; Boczar, R.; Simchowitz, M.; Soltanolkotabi, M.; and Recht, B. 2016. Low-Rank Solutions of Linear Matrix Equations via Procrustes Flow. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, 964–973.
  • Wang, Zhang, and Gu (2017) Wang, L.; Zhang, X.; and Gu, Q. 2017. A unified computational and statistical framework for nonconvex low-rank matrix estimation. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, 981–990.
  • Zhang and Zhang (2020) Zhang, G.; and Zhang, R. Y. 2020. How Many Samples Is a Good Initial Point Worth in Low-Rank Matrix Recovery? In Advances in Neural Information Processing Systems, volume 33, 12583–12592.
  • Zhang (2021) Zhang, R. Y. 2021. Sharp Global Guarantees for Nonconvex Low-Rank Matrix Recovery in the Overparameterized Regime. arXiv:2104.10790.
  • Zhang et al. (2018a) Zhang, R. Y.; Josz, C.; Sojoudi, S.; and Lavaei, J. 2018a. How Much Restricted Isometry Is Needed in Nonconvex Matrix Recovery? In Advances in Neural Information Processing Systems, volume 31.
  • Zhang, Sojoudi, and Lavaei (2019) Zhang, R. Y.; Sojoudi, S.; and Lavaei, J. 2019. Sharp Restricted Isometry Bounds for the Inexistence of Spurious Local Minima in Nonconvex Matrix Recovery. Journal of Machine Learning Research, 20(114): 1–34.
  • Zhang et al. (2018b) Zhang, X.; Wang, L.; Yu, Y.; and Gu, Q. 2018b. A Primal-Dual Analysis of Global Optimality in Nonconvex Low-Rank Matrix Recovery. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 5862–5871.
  • Zhu et al. (2018) Zhu, Z.; Li, Q.; Tang, G.; and Wakin, M. B. 2018. Global Optimality in Low-Rank Matrix Optimization. IEEE Transactions on Signal Processing, 66(13): 3614–3628.
  • Zhu et al. (2021) Zhu, Z.; Li, Q.; Tang, G.; and Wakin, M. B. 2021. The global optimization geometry of low-rank matrix optimization. IEEE Transactions on Information Theory, 67(2): 1308–1331.

Acknowledgments

This work was supported by grants from AFOSR, ARO, ONR, and NSF.

Appendix A Remark on Theorem 2

The two upper bounds on the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ can be obtained for any local minimizer $\hat{X}$ by solving the two quadratic-like inequalities (3a) and (3b), and the larger bound needs to be used. Explicitly,

\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\max\{T_{1},T_{2}\},

with

T_{1}=\frac{\epsilon\sqrt{r}+\sqrt{r\epsilon^{2}+16(1-\delta)\epsilon\sqrt{r}\lVert M^{*}\rVert_{F}}}{2(1-\delta)},\qquad T_{2}=\left(\frac{2\sqrt{6\epsilon(1+\delta)^{3}}+3\sqrt{8\epsilon(1+\delta)^{3}+\frac{8}{3}(1-2\delta)(1+\delta)\left(2\epsilon\sqrt{r}+2\sqrt{2\epsilon(1+\delta)}\lVert M^{*}\rVert_{F}^{1/2}\right)}}{4(1-2\delta)}\right)^{2},

where $T_{1}$ follows from solving the quadratic inequality (3a) and $T_{2}$ from solving (3b) as a quadratic in $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{1/2}$.
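A small sketch evaluating this explicit bound (a direct transcription of the formulas above into Python; the example numbers are arbitrary):

import numpy as np

def theorem2_bound(eps, delta, r, M_norm):
    """Explicit upper bound max{T1, T2} on ||X X^T - M*||_F from Theorem 2.

    M_norm is the Frobenius norm of the ground truth M*."""
    T1 = (eps * np.sqrt(r)
          + np.sqrt(r * eps**2 + 16 * (1 - delta) * eps * np.sqrt(r) * M_norm)
          ) / (2 * (1 - delta))
    inner = (8 * eps * (1 + delta)**3
             + 8 / 3 * (1 - 2 * delta) * (1 + delta)
             * (2 * eps * np.sqrt(r)
                + 2 * np.sqrt(2 * eps * (1 + delta)) * np.sqrt(M_norm)))
    T2 = ((2 * np.sqrt(6 * eps * (1 + delta)**3) + 3 * np.sqrt(inner))
          / (4 * (1 - 2 * delta))) ** 2
    return max(T1, T2)

# Example (arbitrary numbers): the bound shrinks as eps -> 0.
for eps in [1e-1, 1e-2, 1e-3]:
    print(eps, theorem2_bound(eps, delta=0.3, r=5, M_norm=2.0))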

Appendix B Proofs of Lemma 3 and Theorem 3

Proof of Lemma 3.

Let $Z\in\mathbb{R}^{n\times r}$ be a matrix satisfying $ZZ^{T}=M^{*}$. Similar to the proof of Theorem 4, we introduce the following optimization problem:

\max_{\eta,\hat{\mathbf{H}}}\ \eta (19)
\operatorname{s.t.}\ \lVert\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\mathbf{e}\rVert\leq 2\lVert\hat{X}\rVert_{2}\epsilon+\kappa,
2I_{r}\otimes\operatorname{mat}_{S}(\hat{\mathbf{H}}\mathbf{e})+\hat{\mathbf{X}}^{T}\hat{\mathbf{X}}\succeq-(2\epsilon+\kappa)I_{nr},
\eta I_{n^{2}}\preceq\hat{\mathbf{H}}\preceq I_{n^{2}},

where its optimal value $\eta^{*}(\hat{X})$ satisfies the inequality

\eta^{*}(\hat{X})\geq\frac{1-\delta^{*}(\hat{X})}{1+\delta^{*}(\hat{X})}\geq\frac{1-\delta}{1+\delta}. (20)

In the remaining part, we will prove the following upper bound on $\eta^{*}(\hat{X})$:

\eta^{*}(\hat{X})\leq\frac{1}{3}+\frac{(2\epsilon+\kappa)\sqrt{r}+\sqrt{2\kappa(1+\delta)}+2\sqrt{2\epsilon(1+\delta)}\lVert\hat{X}\rVert_{2}}{\lVert\mathbf{e}\rVert}. (21)

The inequality (7b) is a consequence of (20), (21) and the inequality

\lVert\hat{X}\rVert_{2}\leq\lVert\hat{X}\hat{X}^{T}\rVert_{F}^{1/2}\leq\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{1/2}+\lVert M^{*}\rVert_{F}^{1/2}.

The proof of the upper bound (21) can be completed by finding a feasible solution to the dual problem of (19):

\min_{U_{1},U_{2},W,G,\lambda,y}\ \operatorname{tr}(U_{2})+\langle\hat{\mathbf{X}}^{T}\hat{\mathbf{X}},W\rangle+(2\epsilon+\kappa)\operatorname{tr}(W)+(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)^{2}\lambda+\operatorname{tr}(G) (22)
\operatorname{s.t.}\ \operatorname{tr}(U_{1})=1,
(\hat{\mathbf{X}}y-w)\mathbf{e}^{T}+\mathbf{e}(\hat{\mathbf{X}}y-w)^{T}=U_{1}-U_{2},
\begin{bmatrix}G&-y\\-y^{T}&\lambda\end{bmatrix}\succeq 0,
U_{1}\succeq 0,\quad U_{2}\succeq 0,\quad W=\begin{bmatrix}W_{1,1}&\cdots&W_{r,1}^{T}\\\vdots&\ddots&\vdots\\W_{r,1}&\cdots&W_{r,r}\end{bmatrix}\succeq 0,
w=\sum_{i=1}^{r}\operatorname{vec}(W_{i,i}).

Before describing the choice of the dual feasible solution, we need to represent the error vector $\mathbf{e}$ in a different form. Let $\mathcal{P}\in\mathbb{R}^{n\times n}$ be the orthogonal projection matrix onto the range of $\hat{X}$, and let $\mathcal{P}_{\perp}\in\mathbb{R}^{n\times n}$ be the orthogonal projection matrix onto the orthogonal complement of the range of $\hat{X}$. Then, $Z$ can be decomposed as $Z=\mathcal{P}Z+\mathcal{P}_{\perp}Z$, and there exists a matrix $R\in\mathbb{R}^{r\times r}$ such that $\mathcal{P}Z=\hat{X}R$. Note that

ZZ^{T}=\mathcal{P}ZZ^{T}\mathcal{P}+\mathcal{P}ZZ^{T}\mathcal{P}_{\perp}+\mathcal{P}_{\perp}ZZ^{T}\mathcal{P}+\mathcal{P}_{\perp}ZZ^{T}\mathcal{P}_{\perp}.

Thus, if we choose

\hat{Y}=\frac{1}{2}\hat{X}-\frac{1}{2}\hat{X}RR^{T}-\mathcal{P}_{\perp}ZR^{T},\quad\hat{y}=\operatorname{vec}(\hat{Y}), (23)

then it can be verified that

\hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}-\mathcal{P}_{\perp}ZZ^{T}\mathcal{P}_{\perp}=\hat{X}\hat{X}^{T}-ZZ^{T},\quad\langle\hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T},\mathcal{P}_{\perp}ZZ^{T}\mathcal{P}_{\perp}\rangle=0.

Moreover, we have

\lVert\hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}\rVert_{F}^{2}=2\operatorname{tr}(\hat{X}^{T}\hat{X}\hat{Y}^{T}\hat{Y})+\operatorname{tr}(\hat{X}^{T}\hat{Y}\hat{X}^{T}\hat{Y})+\operatorname{tr}(\hat{Y}^{T}\hat{X}\hat{Y}^{T}\hat{X})\geq 2\operatorname{tr}(\hat{X}^{T}\hat{X}\hat{Y}^{T}\hat{Y})\geq 2\sigma_{r}(\hat{X})^{2}\lVert\hat{Y}\rVert_{F}^{2}, (24)

in which the first inequality is due to

\operatorname{tr}(\hat{X}^{T}\hat{Y}\hat{X}^{T}\hat{Y})=\frac{1}{4}\operatorname{tr}((\hat{X}^{T}\hat{X}(I_{r}-RR^{T}))^{2})=\frac{1}{4}\operatorname{tr}((\hat{X}(I_{r}-RR^{T})\hat{X}^{T})^{2})\geq 0.
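
Here, the first equality uses \hat{X}^{T}\mathcal{P}_{\perp}=0, which together with (23) gives

\hat{X}^{T}\hat{Y}=\frac{1}{2}\hat{X}^{T}\hat{X}-\frac{1}{2}\hat{X}^{T}\hat{X}RR^{T}-\hat{X}^{T}\mathcal{P}_{\perp}ZR^{T}=\frac{1}{2}\hat{X}^{T}\hat{X}(I_{r}-RR^{T}),

and the final bound holds because \hat{X}(I_{r}-RR^{T})\hat{X}^{T} is symmetric, so the trace of its square is nonnegative.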

Assume first that Z_{\perp}=\mathcal{P}_{\perp}Z\neq 0; the other case will be handled at the end of this proof. In the case when Z_{\perp}\neq 0, we also have \hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}\neq 0. To see this, suppose the contrary. Then the inequality (24) and the assumption \sigma_{r}(\hat{X})>0 imply that \hat{Y}=0. Since the two terms \frac{1}{2}\hat{X}(I_{r}-RR^{T}) and \mathcal{P}_{\perp}ZR^{T} in the definition of \hat{Y} in (23) lie in the range of \hat{X} and in its orthogonal complement, respectively, \hat{Y}=0 gives rise to

\hat{X}-\hat{X}RR^{T}=0,\quad\mathcal{P}_{\perp}ZR^{T}=0.

The first equation above implies that R is invertible since \hat{X} has full column rank; substituting this into the second equation yields \mathcal{P}_{\perp}Z=0, which contradicts Z_{\perp}\neq 0. Now, define the unit vectors

\hat{u}_{1}=\frac{\hat{\mathbf{X}}\hat{y}}{\lVert\hat{\mathbf{X}}\hat{y}\rVert},\quad\hat{u}_{2}=\frac{\operatorname{vec}(Z_{\perp}Z_{\perp}^{T})}{\lVert Z_{\perp}Z_{\perp}^{T}\rVert_{F}}.

Then, \hat{u}_{1}\perp\hat{u}_{2} by the second identity above, and

\mathbf{e}=\lVert\mathbf{e}\rVert(\sqrt{1-\alpha^{2}}\hat{u}_{1}-\alpha\hat{u}_{2}) (25)

with

\alpha=\frac{\lVert Z_{\perp}Z_{\perp}^{T}\rVert_{F}}{\lVert\hat{X}\hat{X}^{T}-ZZ^{T}\rVert_{F}}. (26)
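
Indeed, since \mathcal{P}_{\perp}ZZ^{T}\mathcal{P}_{\perp}=Z_{\perp}Z_{\perp}^{T} and \hat{\mathbf{X}}\hat{y}=\operatorname{vec}(\hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}), the first identity after (23) gives \mathbf{e}=\hat{\mathbf{X}}\hat{y}-\operatorname{vec}(Z_{\perp}Z_{\perp}^{T}), and the second identity shows that these two terms are orthogonal. Hence,

\lVert\hat{\mathbf{X}}\hat{y}\rVert^{2}=\lVert\mathbf{e}\rVert^{2}-\lVert Z_{\perp}Z_{\perp}^{T}\rVert_{F}^{2}=\lVert\mathbf{e}\rVert^{2}(1-\alpha^{2}),

which yields (25).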

We first describe our choices of the dual variables W and y (which will be scaled later). Let

\hat{X}^{T}\hat{X}=QSQ^{T},\quad Z_{\perp}Z_{\perp}^{T}=PGP^{T},

with orthogonal matrices Q,P and diagonal matrices S,G such that S_{11}=\sigma_{r}(\hat{X})^{2} and, without loss of generality, the diagonal entries of G are sorted in decreasing order, so that G_{ii}=0 for every i>r because \operatorname{rank}(Z_{\perp}Z_{\perp}^{T})\leq r. Fix a constant \gamma\in[0,1] that is to be determined and define

V_{i}=k^{1/2}G_{ii}^{1/2}PE_{i1}Q^{T},\quad\forall i=1,\dots,r,
W=\sum_{i=1}^{r}\operatorname{vec}(V_{i})\operatorname{vec}(V_{i})^{T},\quad y=l\hat{y},

with \hat{y} defined in (23) and

k=\frac{\gamma}{\lVert\mathbf{e}\rVert\lVert Z_{\perp}Z_{\perp}^{T}\rVert_{F}},\quad l=\frac{\sqrt{1-\gamma^{2}}}{\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}\hat{y}\rVert}.

Here, E_{ij} is the elementary matrix of size n\times r whose (i,j)-entry is 1. By our construction, \hat{X}^{T}V_{i}=0: either G_{ii}=0 and hence V_{i}=0, or the i-th column of P is an eigenvector of Z_{\perp}Z_{\perp}^{T} with a positive eigenvalue and thus lies in the range of Z_{\perp}, which is orthogonal to the range of \hat{X}. This implies that

\langle\hat{\mathbf{X}}^{T}\hat{\mathbf{X}},W\rangle=\sum_{i=1}^{r}\lVert\hat{X}V_{i}^{T}+V_{i}\hat{X}^{T}\rVert_{F}^{2}=2\sum_{i=1}^{r}\operatorname{tr}(\hat{X}^{T}\hat{X}V_{i}^{T}V_{i})=2k\sigma_{r}(\hat{X})^{2}\sum_{i=1}^{r}G_{ii}=2\beta\gamma, (27)

with

\beta=\frac{\sigma_{r}(\hat{X})^{2}\operatorname{tr}(Z_{\perp}Z_{\perp}^{T})}{\lVert\hat{X}\hat{X}^{T}-ZZ^{T}\rVert_{F}\lVert Z_{\perp}Z_{\perp}^{T}\rVert_{F}}. (28)
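
The third equality in (27) can be seen from V_{i}^{T}V_{i}=kG_{ii}QE_{i1}^{T}E_{i1}Q^{T}, where E_{i1}^{T}E_{i1}\in\mathbb{R}^{r\times r} has a single nonzero entry 1 in position (1,1), so that

\operatorname{tr}(\hat{X}^{T}\hat{X}V_{i}^{T}V_{i})=kG_{ii}(Q^{T}\hat{X}^{T}\hat{X}Q)_{11}=kG_{ii}S_{11}=kG_{ii}\sigma_{r}(\hat{X})^{2},

and the last equality in (27) then follows from the definition of k, the identity \sum_{i=1}^{r}G_{ii}=\operatorname{tr}(Z_{\perp}Z_{\perp}^{T}), and the definition of \beta in (28) with \lVert\mathbf{e}\rVert=\lVert\hat{X}\hat{X}^{T}-ZZ^{T}\rVert_{F}.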

In addition, since \gamma\leq 1 and \operatorname{tr}(Z_{\perp}Z_{\perp}^{T})\leq\sqrt{r}\lVert Z_{\perp}Z_{\perp}^{T}\rVert_{F} (as Z_{\perp}Z_{\perp}^{T} has rank at most r),

\operatorname{tr}(W)=\sum_{i=1}^{r}\lVert V_{i}\rVert_{F}^{2}=k\sum_{i=1}^{r}G_{ii}=k\operatorname{tr}(Z_{\perp}Z_{\perp}^{T})\leq\frac{\sqrt{r}}{\lVert\mathbf{e}\rVert}, (29)

and

w=\sum_{i=1}^{r}\operatorname{vec}(W_{i,i})=\operatorname{vec}\Bigl(\sum_{i=1}^{r}V_{i}V_{i}^{T}\Bigr)=k\operatorname{vec}(Z_{\perp}Z_{\perp}^{T}).
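
The last equality holds because V_{i}V_{i}^{T}=kG_{ii}PE_{i1}E_{i1}^{T}P^{T}=kG_{ii}p_{i}p_{i}^{T}, where p_{i} is the i-th column of P, so that

\sum_{i=1}^{r}V_{i}V_{i}^{T}=k\sum_{i=1}^{r}G_{ii}p_{i}p_{i}^{T}=kPGP^{T}=kZ_{\perp}Z_{\perp}^{T},

in which the second equality uses G_{ii}=0 for i>r.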

Therefore,

\hat{\mathbf{X}}y-w=\frac{1}{\lVert\mathbf{e}\rVert}(\sqrt{1-\gamma^{2}}\hat{u}_{1}-\gamma\hat{u}_{2}),

which together with (25) implies that

\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}y-w\rVert=1,\quad\langle\mathbf{e},\hat{\mathbf{X}}y-w\rangle=\gamma\alpha+\sqrt{1-\gamma^{2}}\sqrt{1-\alpha^{2}}=\psi(\gamma). (30)

Next, the inequality (24) and the assumption on \sigma_{r}(\hat{X}) imply that

\epsilon\lVert y\rVert\leq\frac{\sqrt{1-\gamma^{2}}\epsilon}{\sqrt{2}\sigma_{r}(\hat{X})\lVert\mathbf{e}\rVert}\leq\frac{\sqrt{1+\delta}\epsilon}{\sqrt{2(\epsilon+\kappa)}\lVert\mathbf{e}\rVert}\leq\frac{\sqrt{\epsilon(1+\delta)}}{\sqrt{2}\lVert\mathbf{e}\rVert} (31)

and similarly

\kappa\lVert y\rVert\leq\frac{\sqrt{\kappa(1+\delta)}}{\sqrt{2}\lVert\mathbf{e}\rVert}. (32)
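
The first inequality in (31) holds since \lVert\hat{y}\rVert=\lVert\hat{Y}\rVert_{F} and (24) gives \lVert\hat{\mathbf{X}}\hat{y}\rVert=\lVert\hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}\rVert_{F}\geq\sqrt{2}\sigma_{r}(\hat{X})\lVert\hat{y}\rVert, so that

\lVert y\rVert=l\lVert\hat{y}\rVert=\frac{\sqrt{1-\gamma^{2}}\lVert\hat{y}\rVert}{\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}\hat{y}\rVert}\leq\frac{\sqrt{1-\gamma^{2}}}{\sqrt{2}\sigma_{r}(\hat{X})\lVert\mathbf{e}\rVert},

while the last inequality in (31) uses \epsilon/\sqrt{\epsilon+\kappa}\leq\sqrt{\epsilon}; the bound (32) is obtained in the same way.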

Define

M=(\hat{\mathbf{X}}y-w)\mathbf{e}^{T}+\mathbf{e}(\hat{\mathbf{X}}y-w)^{T},

and decompose M as M=[M]_{+}-[M]_{-}, in which both [M]_{+}\succeq 0 and [M]_{-}\succeq 0. Let \theta be the angle between \mathbf{e} and \hat{\mathbf{X}}y-w. By Lemma 4, we have

\operatorname{tr}([M]_{+})=\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}y-w\rVert(1+\cos\theta),\quad\operatorname{tr}([M]_{-})=\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}y-w\rVert(1-\cos\theta).
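
Combining this with (30) and \cos\theta=\langle\mathbf{e},\hat{\mathbf{X}}y-w\rangle/(\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}y-w\rVert)=\psi(\gamma), we obtain

\operatorname{tr}([M]_{+})=1+\psi(\gamma),\quad\operatorname{tr}([M]_{-})=1-\psi(\gamma).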

Now, one can verify that (U_{1}^{*},U_{2}^{*},W^{*},G^{*},\lambda^{*},y^{*}) defined as

U_{1}^{*}=\frac{[M]_{+}}{\operatorname{tr}([M]_{+})},\quad U_{2}^{*}=\frac{[M]_{-}}{\operatorname{tr}([M]_{+})},\quad y^{*}=\frac{y}{\operatorname{tr}([M]_{+})},
W^{*}=\frac{W}{\operatorname{tr}([M]_{+})},\quad\lambda^{*}=\frac{\lVert y^{*}\rVert}{2\lVert\hat{X}\rVert_{2}\epsilon+\kappa},\quad G^{*}=\frac{1}{\lambda^{*}}y^{*}y^{*T}

forms a feasible solution to the dual problem (22) whose objective value is equal to

\frac{\operatorname{tr}([M]_{-})+\langle\hat{\mathbf{X}}^{T}\hat{\mathbf{X}},W\rangle+(2\epsilon+\kappa)\operatorname{tr}(W)+2(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)\lVert y\rVert}{\operatorname{tr}([M]_{+})}.
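
In particular, the semidefinite constraint on (G^{*},\lambda^{*},y^{*}) is feasible by the Schur complement since \lambda^{*}G^{*}=y^{*}y^{*T}, and the last two terms of the dual objective contribute

(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)^{2}\lambda^{*}+\operatorname{tr}(G^{*})=(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)\lVert y^{*}\rVert+\frac{\lVert y^{*}\rVert^{2}}{\lambda^{*}}=\frac{2(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)\lVert y\rVert}{\operatorname{tr}([M]_{+})}.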

Substituting (27), (29), (30), (31) and (32) into this objective value, we obtain

\eta^{*}(\hat{X})\leq\frac{2\beta\gamma+1-\psi(\gamma)+((2\epsilon+\kappa)\sqrt{r}+\sqrt{2\kappa(1+\delta)}+2\sqrt{2\epsilon(1+\delta)}\lVert\hat{X}\rVert_{2})/\lVert\mathbf{e}\rVert}{1+\psi(\gamma)}
\leq\frac{2\beta\gamma+1-\psi(\gamma)}{1+\psi(\gamma)}+\frac{(2\epsilon+\kappa)\sqrt{r}+\sqrt{2\kappa(1+\delta)}+2\sqrt{2\epsilon(1+\delta)}\lVert\hat{X}\rVert_{2}}{\lVert\mathbf{e}\rVert}.

Choosing the best value of the parameter \gamma\in[0,1] to minimize the right-hand side of the above inequality leads to

\frac{2\beta\gamma+1-\psi(\gamma)}{1+\psi(\gamma)}\leq\eta_{0}(\hat{X}),

with

\eta_{0}(\hat{X}):=\begin{dcases}\frac{1-\sqrt{1-\alpha^{2}}}{1+\sqrt{1-\alpha^{2}}},&\text{if }\beta\geq\frac{\alpha}{1+\sqrt{1-\alpha^{2}}},\\ \frac{\beta(\alpha-\beta)}{1-\beta\alpha},&\text{if }\beta\leq\frac{\alpha}{1+\sqrt{1-\alpha^{2}}}.\end{dcases}
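
As a quick check of the first case, taking \gamma=0 gives \psi(0)=\sqrt{1-\alpha^{2}}, so

\frac{2\beta\gamma+1-\psi(\gamma)}{1+\psi(\gamma)}\Big|_{\gamma=0}=\frac{1-\sqrt{1-\alpha^{2}}}{1+\sqrt{1-\alpha^{2}}},

while the second case results from optimizing over \gamma>0 when \beta is small.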

Here, \alpha and \beta are defined in (26) and (28), respectively. In the proof of Theorem 1.2 in Zhang (2021), it is shown that \eta_{0}(\hat{X})\leq 1/3 for every \hat{X} with \hat{X}\hat{X}^{T}\neq ZZ^{T}, which gives the upper bound (21).

Finally, it remains to deal with the case when \mathcal{P}_{\perp}Z=0. In this case, the first identity after (23) reduces to \hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}=\hat{X}\hat{X}^{T}-ZZ^{T}, so \hat{\mathbf{X}}\hat{y}=\mathbf{e} with \hat{y} defined in (23). Then, it is straightforward to check that (U_{1}^{*},U_{2}^{*},W^{*},G^{*},\lambda^{*},y^{*}) defined as

U_{1}^{*}=\frac{\mathbf{e}\mathbf{e}^{T}}{\lVert\mathbf{e}\rVert^{2}},\quad U_{2}^{*}=0,\quad y^{*}=\frac{\hat{y}}{2\lVert\mathbf{e}\rVert^{2}},
W^{*}=0,\quad\lambda^{*}=\frac{\lVert y^{*}\rVert}{2\lVert\hat{X}\rVert_{2}\epsilon+\kappa},\quad G^{*}=\frac{1}{\lambda^{*}}y^{*}y^{*T}

forms a feasible solution to the dual problem (22) whose objective value is 2(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)\lVert y^{*}\rVert. By the inequality (24), we have

\eta^{*}(\hat{X})\leq 2(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)\lVert y^{*}\rVert\leq\frac{\kappa/\sqrt{2}+\sqrt{2}\epsilon\lVert\hat{X}\rVert_{2}}{\sigma_{r}(\hat{X})\lVert\mathbf{e}\rVert}\leq\frac{\sqrt{\kappa(1+\delta)/2}+\sqrt{2\epsilon(1+\delta)}\lVert\hat{X}\rVert_{2}}{\lVert\mathbf{e}\rVert}.
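
Here, the second inequality uses \lVert y^{*}\rVert=\lVert\hat{y}\rVert/(2\lVert\mathbf{e}\rVert^{2}) together with (24) applied to \hat{\mathbf{X}}\hat{y}=\mathbf{e}, namely

\lVert\mathbf{e}\rVert=\lVert\hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}\rVert_{F}\geq\sqrt{2}\sigma_{r}(\hat{X})\lVert\hat{y}\rVert.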

Hence, the upper bound (21) still holds in this case. ∎

Proof of Theorem 3.

The proof of Theorem 3 is similar to the above proof of Lemma 3 in the situation with \kappa=0, and we only emphasize the differences here. In the case when \hat{X}\neq 0, after constructing the feasible solution to the dual problem (22), we have

\frac{1-\delta}{1+\delta}\leq\eta^{*}(\hat{X})\leq\frac{\operatorname{tr}([M]_{-})+\langle\hat{\mathbf{X}}^{T}\hat{\mathbf{X}},W\rangle+2\epsilon\operatorname{tr}(W)+4\lVert\hat{X}\rVert_{2}\epsilon\lVert y\rVert}{\operatorname{tr}([M]_{+})}. (33)

Note that in the rank-1 case, one can write \sigma_{r}(\hat{X})=\lVert\hat{X}\rVert_{2} and

\lVert y\rVert\leq\frac{\lVert\hat{y}\rVert}{\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}\hat{y}\rVert}\leq\frac{1}{\sqrt{2}\lVert\hat{X}\rVert_{2}\lVert\mathbf{e}\rVert},

in which the last inequality is due to (24). Substituting (27), (29), (30) and the above inequality into (33) and choosing an appropriate \gamma as shown in the proof of Lemma 3, we obtain

\frac{1-\delta}{1+\delta}\leq\eta^{*}(\hat{X})\leq\frac{2\beta\gamma+1-\psi(\gamma)+(2\epsilon+2\sqrt{2}\epsilon)/\lVert\mathbf{e}\rVert}{1+\psi(\gamma)}
\leq\frac{1}{3}+\frac{2\epsilon+2\sqrt{2}\epsilon}{\lVert\mathbf{e}\rVert},

which implies inequality (4) on the probabilistic event that \lVert q\rVert\leq\epsilon.

In the case when \hat{X}=0, the point (U_{1}^{*},U_{2}^{*},W^{*},G^{*},\lambda^{*},y^{*}) defined as

U_{1}^{*}=\frac{\mathbf{e}\mathbf{e}^{T}}{\lVert\mathbf{e}\rVert^{2}},\quad U_{2}^{*}=0,\quad y^{*}=0,
W^{*}=\frac{ZZ^{T}}{2\lVert\mathbf{e}\rVert^{2}},\quad\lambda^{*}=0,\quad G^{*}=0

forms a feasible solution to the dual problem (22), which shows that

\frac{1-\delta}{1+\delta}\leq\eta^{*}(\hat{X})\leq\frac{\epsilon}{\lVert\mathbf{e}\rVert}.
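
To verify the last bound, note that \hat{X}=0 gives \mathbf{e}=-\operatorname{vec}(ZZ^{T}), and since Z has rank one, \operatorname{tr}(ZZ^{T})=\lVert ZZ^{T}\rVert_{F}=\lVert\mathbf{e}\rVert. Hence, the dual objective reduces to

2\epsilon\operatorname{tr}(W^{*})=\frac{\epsilon\operatorname{tr}(ZZ^{T})}{\lVert\mathbf{e}\rVert^{2}}=\frac{\epsilon}{\lVert\mathbf{e}\rVert}.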

The inequality \frac{1-\delta}{1+\delta}\leq\epsilon/\lVert\mathbf{e}\rVert also implies inequality (4) on the probabilistic event that \lVert q\rVert\leq\epsilon. ∎

Appendix C Additional Numerical Illustration

(a) The upper bound derived from inequality (3a), m=20, n=60.
(b) The upper bound derived from inequality (3b), m=20, n=60.
(c) The upper bound derived from inequality (3a), m=30, n=60.
(d) The upper bound derived from inequality (3b), m=30, n=60.
(e) The upper bound derived from inequality (3a), m=20, n=90.
(f) The upper bound derived from inequality (3b), m=20, n=90.
Figure 4: Comparison of the upper bounds given by Theorem 2 for the distance \|\hat{X}\hat{X}^{T}-M^{*}\|_{F} with \hat{X} being an arbitrary local minimizer under varying values of problem parameters m and n.