
Sharp Restricted Isometry Property Bounds for Low-rank Matrix Recovery Problems with Corrupted Measurements

Ziye Ma, Yingjie Bi, Javad Lavaei, Somayeh Sojoudi
Abstract

In this paper, we study a general low-rank matrix recovery problem with linear measurements corrupted by some noise. The objective is to understand under what conditions on the restricted isometry property (RIP) of the problem local search methods can find the ground truth with a small error. By analyzing the landscape of the non-convex problem, we first propose a global guarantee on the maximum distance between an arbitrary local minimizer and the ground truth under the assumption that the RIP constant is smaller than $1/2$. We show that this distance shrinks to zero as the intensity of the noise reduces. Our new guarantee is sharp in terms of the RIP constant and is much stronger than the existing results. We then present a local guarantee for problems with an arbitrary RIP constant, which states that any local minimizer is either considerably close to the ground truth or far away from it. Next, we prove the strict saddle property, which guarantees the global convergence of the perturbed gradient descent method in polynomial time. The developed results demonstrate how the noise intensity and the RIP constant of the problem affect the landscape of the problem.

1 Introduction

Low-rank matrix recovery problems arise in various applications, such as matrix completion (Candès and Recht 2009; Recht, Fazel, and Parrilo 2010), phase synchronization/retrieval (Singer 2011; Boumal 2016; Shechtman et al. 2015), robust PCA (Ge, Jin, and Zheng 2017), and several others (Chen and Chi 2018; Chi, Lu, and Chen 2019). In this paper, we study a class of low-rank matrix recovery problems where the goal is to recover a symmetric and positive semidefinite ground truth matrix $M^{*}$ with $\operatorname{rank}(M^{*})=r$ from certain linear measurements corrupted by noise. This problem can be formulated as the following optimization problem:

\min_{M\in\mathbb{R}^{n\times n}}\ \frac{1}{2}\lVert\mathcal{A}(M)-b+w\rVert^{2}\quad\operatorname{s.t.}\ \operatorname{rank}(M)\leq r,\quad M\succeq 0. (1)

Here, $\mathcal{A}:\mathbb{R}^{n\times n}\to\mathbb{R}^{m}$ is a linear operator whose action on a matrix $M$ is given by

\mathcal{A}(M)=[\langle A_{1},M\rangle,\dots,\langle A_{m},M\rangle]^{T},

where $A_{1},\dots,A_{m}\in\mathbb{R}^{n\times n}$ are called sensing matrices. In addition, $b=\mathcal{A}(M^{*})$ represents the perfect measurement of the ground truth $M^{*}$, and $w$ comes from an arbitrary probability distribution. Note that only the noisy measurement $b-w$ is available to the user; $b$ itself is unknown. In other words, from a problem-solving perspective, the random variable $w$ is hidden from the user and is modeled explicitly here only for the sake of analysis.
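To make the measurement model concrete, the following is a minimal NumPy sketch that instantiates problem (1). The Gaussian sensing matrices and Gaussian noise are illustrative assumptions made only for this example; the theory places no such restrictions.

import numpy as np

rng = np.random.default_rng(0)
n, r, m = 10, 2, 60          # illustrative sizes
sigma_w = 0.01               # illustrative noise level

# Ground truth: symmetric PSD with rank(M*) = r.
Z = rng.standard_normal((n, r))
M_star = Z @ Z.T

# Sensing matrices A_1, ..., A_m (Gaussian, an assumption for illustration).
A_mats = rng.standard_normal((m, n, n)) / np.sqrt(m)

def A_op(M):
    """The linear operator A(M) = [<A_1, M>, ..., <A_m, M>]^T."""
    return np.array([np.sum(Ai * M) for Ai in A_mats])

b = A_op(M_star)                      # perfect measurements (unknown in practice)
w = sigma_w * rng.standard_normal(m)  # noise, hidden from the user
b_noisy = b - w                       # the only data actually observed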

The modeling of the noise in this problem is critical for many practical applications in which the influence of the noise cannot be ignored. For example, state estimation is a major data analytics problem for the operation of power grids, which can be modeled as matrix sensing (Jin et al. 2019b). In this problem, each measurement comes from a physical device, and the noise in our formulation models not only the sensory noise but also mismatches between the true model of the system and the one used by a power operator, changes in the measurements due to cyber-attacks, mechanical faults, etc. In other words, by a noisy problem we mean a learning problem where a part of the data is wrong for various reasons, and, as in this application, it is impossible to assume that the measurements are noise-free.

Although it may be possible to solve the problem (1) via convex relaxations (Candès and Recht 2009; Recht, Fazel, and Parrilo 2010; Candès and Tao 2010), the computational complexity of solving a semidefinite program presents a major challenge for large-scale problems. A more scalable approach is to use the Burer–Monteiro factorization (Burer and Monteiro 2003) by expressing $M$ as $XX^{T}$ with $X\in\mathbb{R}^{n\times r}$, which leads to the following equivalent formulation of the original problem (1):

\min_{X\in\mathbb{R}^{n\times r}}\ f(X)=\frac{1}{2}\lVert\mathcal{A}(XX^{T})-b+w\rVert^{2}. (2)

The unconstrained problem (2) is often solved by local search methods such as gradient descent. Since the objective function $f(X)$ in (2) is non-convex, local search algorithms may converge to a spurious local minimizer, leading to a suboptimal or plainly wrong solution. Hence, it is desirable to provide guarantees on the maximum distance between these local minimizers and the ground truth $M^{*}$. This problem will be addressed in this paper.
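As an illustration of this local search approach, here is a minimal gradient descent sketch for (2), using the fact that $\nabla f(X)=\sum_{i}\rho_{i}(A_{i}+A_{i}^{T})X$ where $\rho$ is the measurement residual. The setup, step size, and iteration count are arbitrary choices for the example, not recommendations.

import numpy as np

rng = np.random.default_rng(1)
n, r, m = 10, 2, 60
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
M_star = Q @ Q.T                      # rank-r ground truth (a projection)
A_mats = rng.standard_normal((m, n, n)) / np.sqrt(m)
A_op = lambda M: np.array([np.sum(Ai * M) for Ai in A_mats])
b_noisy = A_op(M_star) - 0.01 * rng.standard_normal(m)   # b - w

def grad_f(X):
    # Gradient of f(X) = 0.5 * || A(X X^T) - (b - w) ||^2.
    res = A_op(X @ X.T) - b_noisy
    G = np.einsum('i,ijk->jk', res, A_mats)   # sum_i res_i * A_i
    return (G + G.T) @ X

X = 0.1 * rng.standard_normal((n, r))          # arbitrary small initialization
for _ in range(1500):
    X -= 0.2 * grad_f(X)                       # fixed step size, illustration only

print("recovery error:", np.linalg.norm(X @ X.T - M_star))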

Related Works

The special noiseless case of the problem (2) is obtained by setting $w=0$. In this case, any solution $Z$ with $ZZ^{T}=M^{*}$ is a global minimizer of the problem (2). Many previous works, such as Ge, Jin, and Zheng (2017); Bhojanapalli, Neyshabur, and Srebro (2016); Park et al. (2017); Zhang et al. (2018b); Zhu et al. (2018); Zhang, Sojoudi, and Lavaei (2019); Zhang and Zhang (2020); Bi and Lavaei (2020); Ha, Liu, and Barber (2020); Zhu et al. (2021); Zhang (2021), focused on proving that the problem has no spurious (non-global) local minimizers under the assumption of the restricted isometry property (RIP). Moreover, as demonstrated in previous works such as Ge, Jin, and Zheng (2017), the techniques developed under the RIP condition can be adapted to show that other low-rank matrix recovery problems, such as matrix completion under the incoherence condition and the robust PCA problem, also have a benign landscape. The RIP condition is equivalent to the restricted strong convexity and smoothness property used in Wang, Zhang, and Gu (2017); Park et al. (2018); Zhu et al. (2021), and its formal definition is given below.

Definition 1.

The linear operator $\mathcal{A}(\cdot):\mathbb{R}^{n\times n}\to\mathbb{R}^{m}$ is said to satisfy the $\delta$-$\operatorname{RIP}_{2r}$ property for some constant $\delta\in[0,1)$ if the inequality

(1-\delta)\lVert M\rVert_{F}^{2}\leq\lVert\mathcal{A}(M)\rVert^{2}\leq(1+\delta)\lVert M\rVert_{F}^{2}

holds for all $M\in\mathbb{R}^{n\times n}$ with $\operatorname{rank}(M)\leq 2r$.
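Computing the exact RIP constant of a given operator is intractable in general; the sketch below (an illustrative Monte Carlo probe written for this exposition, not part of the paper) only estimates a lower bound on $\delta$ by sampling random matrices of rank at most $2r$ and recording the worst distortion.

import numpy as np

rng = np.random.default_rng(2)
n, r, m = 10, 2, 120
A_mats = rng.standard_normal((m, n, n)) / np.sqrt(m)
A_op = lambda M: np.array([np.sum(Ai * M) for Ai in A_mats])

# Sample random matrices with rank <= 2r and record the distortion
# ||A(M)||^2 / ||M||_F^2; the deviation from 1 lower-bounds delta.
delta_lb = 0.0
for _ in range(2000):
    U = rng.standard_normal((n, 2 * r))
    V = rng.standard_normal((n, 2 * r))
    M = U @ V.T                                   # rank <= 2r
    ratio = np.sum(A_op(M) ** 2) / np.sum(M ** 2)
    delta_lb = max(delta_lb, abs(ratio - 1.0))

print("empirical lower bound on the RIP constant:", delta_lb)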

In the recent paper by Zhang (2021), the author developed a sharp bound for the absence of spurious local minima in the noiseless case of problem (2): the problem has no spurious local minima if the measurement operator $\mathcal{A}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property with $\delta<1/2$. This result is tight since there is a known counterexample (Zhang et al. 2018a) with spurious local minima under $\delta=1/2$.

For the noisy problem, the relation $X^{*}X^{*T}=M^{*}$ is unlikely to hold, where $X^{*}$ denotes a global minimizer of problem (2). However, $X^{*}X^{*T}$ should still be close to the ground truth $M^{*}$ if the noise $w$ is small. As a generalization of the above results for the noiseless problem, it is natural to ask whether all local minimizers, including the global ones, are close to the ground truth $M^{*}$ under the RIP assumption. One such result, presented in Bhojanapalli, Neyshabur, and Srebro (2016), is given below.

Theorem 1 (from Theorem 3.1 in Bhojanapalli, Neyshabur, and Srebro (2016)).

Suppose that $w\sim\mathcal{N}(0,\sigma_{w}^{2}I_{m})$ and $\mathcal{A}(\cdot)$ satisfies the $\delta$-$\operatorname{RIP}_{4r}$ property with $\delta<1/10$. Then, with probability at least $1-10/n^{2}$, any local minimizer $\hat{X}$ of problem (2) satisfies the inequality

\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq 20\sqrt{\frac{\log(n)}{m}}\,\sigma_{w}.

Theorem 31 in Ge, Jin, and Zheng (2017) further improves the above result by replacing the $\delta$-$\operatorname{RIP}_{4r}$ property with the $\delta$-$\operatorname{RIP}_{2r}$ property. Li et al. (2020) study a similar noisy low-rank matrix recovery problem with the $\ell_{1}$ norm.

As the comparison above shows, there is an evident gap between the state-of-the-art results for the noiseless and noisy problems: the result for the noiseless problem only requires the RIP constant $\delta<1/2$, while the one for the noisy problem requires $\delta<1/10$ no matter how small the noise is. This paper closes the gap by showing that a major generalization of Theorem 1 holds for the noisy problem under the same RIP assumption as in the sharp bound for the noiseless problem.

Notations

In this paper, $I_{n}$ refers to the identity matrix of size $n\times n$. The notation $M\succeq 0$ means that $M$ is a symmetric and positive semidefinite matrix. $\sigma_{i}(M)$ denotes the $i$-th largest singular value of a matrix $M$, and $\lambda_{i}(M)$ denotes the $i$-th largest eigenvalue of $M$. $\lVert v\rVert$ denotes the Euclidean norm of a vector $v$, while $\lVert M\rVert_{F}$ and $\lVert M\rVert_{2}$ denote the Frobenius norm and induced $\ell_{2}$ norm of a matrix $M$, respectively. $\langle A,B\rangle$ is defined as $\operatorname{tr}(A^{T}B)$ for two matrices $A$ and $B$ of the same size. The Kronecker product of $A$ and $B$ is denoted by $A\otimes B$. For a matrix $M$, $\operatorname{vec}(M)$ is the usual vectorization operation obtained by stacking the columns of $M$ into a vector. For a vector $v\in\mathbb{R}^{n^{2}}$, $\operatorname{mat}(v)$ converts $v$ into a square matrix and $\operatorname{mat}_{S}(v)$ converts $v$ into a symmetric matrix, i.e., $\operatorname{mat}(v)=M$ and $\operatorname{mat}_{S}(v)=(M+M^{T})/2$, where $M\in\mathbb{R}^{n\times n}$ is the unique matrix satisfying $v=\operatorname{vec}(M)$. Finally, $\mathcal{N}(\mu,\Sigma)$ refers to the multivariate Gaussian distribution with mean $\mu$ and covariance $\Sigma$.
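These vectorization conventions map directly to column-major reshapes; the following small helper sketch (illustrative, not from the paper) implements them in NumPy.

import numpy as np

def vec(M):
    """Stack the columns of M into a vector (column-major order)."""
    return M.reshape(-1, order='F')

def mat(v):
    """Inverse of vec: reshape v into a square matrix."""
    n = int(round(np.sqrt(v.size)))
    return v.reshape(n, n, order='F')

def mat_S(v):
    """Symmetric part of mat(v)."""
    M = mat(v)
    return (M + M.T) / 2

# Sanity checks: mat inverts vec, and <A, B> = vec(A) . vec(B).
rng = np.random.default_rng(3)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
assert np.allclose(mat(vec(A)), A)
assert np.isclose(np.trace(A.T @ B), vec(A) @ vec(B))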

2 Main Results

We first present the global guarantee on the local minimizers of problem (2). To simplify the notation, we use a matrix representation of the measurement operator $\mathcal{A}$ as follows:

\mathbf{A}=[\operatorname{vec}(A_{1}),\operatorname{vec}(A_{2}),\dots,\operatorname{vec}(A_{m})]^{T}\in\mathbb{R}^{m\times n^{2}}.

Then, $\mathbf{A}\operatorname{vec}(M)=\mathcal{A}(M)$ for every matrix $M\in\mathbb{R}^{n\times n}$.
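As a quick sanity check of this identity, one can build $\mathbf{A}$ row by row and compare the two sides numerically (a minimal sketch under the same illustrative Gaussian setup as before):

import numpy as np

rng = np.random.default_rng(4)
n, m = 6, 20
A_mats = rng.standard_normal((m, n, n))
A_op = lambda M: np.array([np.sum(Ai * M) for Ai in A_mats])

# Matrix representation: row i of A is vec(A_i)^T (column-major vec).
A_big = np.stack([Ai.reshape(-1, order='F') for Ai in A_mats])  # m x n^2

M = rng.standard_normal((n, n))
assert np.allclose(A_big @ M.reshape(-1, order='F'), A_op(M))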

Theorem 2.

Assume that the linear operator $\mathcal{A}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property with $\delta<1/2$. For every $\epsilon>0$, with probability at least $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$, at least one of the following two inequalities

(1-\delta)\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{2}\leq\epsilon\sqrt{r}\,\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}+4\epsilon\sqrt{r}\,\lVert M^{*}\rVert_{F}, (3a)

\frac{2(1-2\delta)}{3(1+\delta)}\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq 2\epsilon\sqrt{r}+2\sqrt{2\epsilon(1+\delta)}\left(\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{1/2}+\lVert M^{*}\rVert_{F}^{1/2}\right) (3b)

holds for every local minimizer $\hat{X}\in\mathbb{R}^{n\times r}$ of problem (2).

Note that two upper bounds on the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ can be obtained for any local minimizer $\hat{X}$ by solving the two quadratic-like inequalities (3a) and (3b); since only one of the two inequalities is guaranteed to hold, the larger of the two bounds must be used. The explicit upper bound obtained from Theorem 2 is given in Appendix A. The reason for the existence of two inequalities in Theorem 2 is that its proof splits into two cases: the first is when the smallest singular value $\sigma_{r}(\hat{X})$ of $\hat{X}$ is small, and the second is the opposite; the two cases are handled by Lemma 2 and Lemma 3, respectively.

Theorem 2 is a major extension of the existing sharp result stating that the noiseless problem has no spurious local minima under the same assumption of the $\delta$-$\operatorname{RIP}_{2r}$ property with $\delta<1/2$. The reason is that when the noise $w$ is equal to zero, one can choose an arbitrarily small $\epsilon$ in Theorem 2 and conclude from the inequalities (3a) and (3b) that $\hat{X}\hat{X}^{T}=M^{*}$ for every local minimizer $\hat{X}$. Moreover, when the RIP constant $\delta$ decreases further below $1/2$, the upper bound on $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ also decreases, which means that a local minimizer found by local search methods will be closer to the ground truth $M^{*}$. This suggests that the RIP condition is able not only to guarantee the absence of spurious local minima, as shown in the previous literature, but also to mitigate the influence of the noise in the measurements.

Compared with existing results such as Theorem 1, our new result has two advantages. First, by improving the RIP constant from $1/10$ to $1/2$, one can apply the results on the location of spurious local minima to a much broader class of problems, which can often help reduce the number of measurements. For example, when the measurements are given by random Gaussian matrices, it is proven in Candès and Recht (2009) that achieving the $\delta$-$\operatorname{RIP}_{2r}$ property requires a number of measurements on the order of $O(1/\delta^{2})$. By improving the RIP constant in the bound, we can significantly reduce the number of measurements while still keeping the benign landscape. In applications such as learning for energy networks, there is a fundamental limit on the number of measurements that can be collected due to the physics of the problem (Jin et al. 2021), and a better RIP bound helps address the number of measurements needed to reliably solve the problem. Second, Theorem 1 only addresses the probability of having all spurious solutions in a fixed ball around the ground truth of radius $O(\sigma_{w})$, rather than balls of arbitrary radii, and this fixed ball could be large depending on whether the noise level $\sigma_{w}$ is fixed or scales with the problem. In contrast, Theorem 2 considers the probability $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$ for an arbitrary value of $\epsilon$. By having a flexible $\epsilon$, our work not only improves the RIP constant but also allows computing the probability of having all spurious solutions in any given ball.

In the special case of rank $r=1$, the conditions (3a) and (3b) in Theorem 2 can be replaced by a simpler condition, as shown below. Its proof is very similar to that of Lemma 3 and is omitted for succinctness.

Theorem 3.

Consider the case $r=1$ and assume that the linear operator $\mathcal{A}$ satisfies the $\delta$-$\operatorname{RIP}_{2}$ property with $\delta<1/2$. For every $\epsilon>0$, with probability at least $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$, every local minimizer $\hat{X}\in\mathbb{R}^{n\times r}$ of problem (2) satisfies

\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\frac{3(1+\sqrt{2})\epsilon(1+\delta)}{1-2\delta}. (4)

When the RIP constant $\delta$ is not less than $1/2$, a global guarantee similar to Theorem 2 or Theorem 3 cannot be achieved, since the problem may have a spurious solution even in the noiseless case. Instead, we turn to local guarantees by showing that every local minimizer $\hat{X}$ of problem (2) is either close to the ground truth $M^{*}$ or far away from it in terms of the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$.

Theorem 4.

Assume that the linear operator $\mathcal{A}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property for some $\delta\in[0,1)$. Consider arbitrary constants $\epsilon>0$ and $\tau\in(0,2(\sqrt{2}-1))$ such that

\delta<\sqrt{1-\frac{3+2\sqrt{2}}{4}\tau^{2}}.

Every local minimizer $\hat{X}\in\mathbb{R}^{n\times r}$ of problem (2) satisfying

\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\tau\lambda_{r}(M^{*}) (5)

will also satisfy

\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\frac{\epsilon(1+\delta)\,C(\tau,M^{*})}{\sqrt{1-\frac{3+2\sqrt{2}}{4}\tau^{2}}-\delta} (6)

with probability at least $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$, where

C(\tau,M^{*})=\sqrt{\frac{2(\lambda_{1}(M^{*})+\tau\lambda_{r}(M^{*}))}{(1-\tau)\lambda_{r}(M^{*})}}.

The upper bounds in (5) and (6) define an outer ball and an inner ball centered at the ground truth $M^{*}$. Theorem 4 states that there is no local minimizer in the ring between the two balls. As $\epsilon$ approaches zero, the inner ball shrinks to the ground truth. The theorem thus shows that bad local minimizers must lie outside the outer ball. Note that the problem can be highly non-convex when $\delta$ is close to $1$, yet the theorem still certifies a benign landscape in a local neighborhood of the solution. Furthermore, all the theorems in this section apply to arbitrary noise models, since they make no explicit use of the probability distribution of the noise: the only required information is the probability $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$, which can be computed or bounded when the distribution of the noise is given, as illustrated in Section 4.
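The inner and outer radii in Theorem 4 are simple closed-form functions of $(\epsilon,\delta,\tau)$ and the extreme eigenvalues of $M^{*}$; a small sketch evaluating them is given below (the example numbers are arbitrary illustrative choices).

import numpy as np

def theorem4_radii(eps, delta, tau, lam1, lam_r):
    """Inner radius (6) and outer radius (5) of Theorem 4.

    lam1 and lam_r are the largest and r-th largest eigenvalues of M*.
    Returns None if (delta, tau) violate the assumptions of the theorem."""
    gate = np.sqrt(1 - (3 + 2 * np.sqrt(2)) / 4 * tau ** 2)
    if not (0 < tau < 2 * (np.sqrt(2) - 1) and delta < gate):
        return None
    C = np.sqrt(2 * (lam1 + tau * lam_r) / ((1 - tau) * lam_r))
    inner = eps * (1 + delta) * C / (gate - delta)
    outer = tau * lam_r
    return inner, outer

# Example (arbitrary numbers): no local minimizer X_hat can satisfy
# inner < ||X_hat X_hat^T - M*||_F <= outer, with the stated probability.
print(theorem4_radii(eps=0.01, delta=0.6, tau=0.5, lam1=1.5, lam_r=1.0))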

The results presented above concern the location of the local minimizers; they do not directly yield global convergence of local search methods with a fast convergence rate. To provide performance guarantees for local search methods, the next theorem establishes a stronger property of the landscape of the noisy problem, usually called the strict saddle property in the literature, which essentially says that all approximate second-order critical points are close to the ground truth.

Theorem 5.

Assume that the linear operator $\mathcal{A}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property with $\delta<1/2$. For every $\epsilon>0$ and $\kappa\geq 0$, with probability at least $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$, at least one of the following two inequalities

(1-\delta)\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{2}\leq\epsilon\sqrt{r}\,\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}+\frac{r^{1/4}\kappa}{2}\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{1/2}+\frac{r^{1/4}\kappa}{2}\lVert M^{*}\rVert_{F}^{1/2}+\left(4\sqrt{r}\epsilon+\frac{5\sqrt{r}\kappa}{2}\right)\lVert M^{*}\rVert_{F} (7a)

\frac{2(1-2\delta)}{3(1+\delta)}\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq(2\epsilon+\kappa)\sqrt{r}+\sqrt{2\kappa(1+\delta)}+2\sqrt{2\epsilon(1+\delta)}\left(\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{1/2}+\lVert M^{*}\rVert_{F}^{1/2}\right) (7b)

holds for every matrix $\hat{X}\in\mathbb{R}^{n\times r}$ satisfying

\lVert\nabla f(\hat{X})\rVert\leq\kappa,\quad\nabla^{2}f(\hat{X})\succeq-\kappa I_{nr}.

Note that this property was not previously established in the literature for $\delta<1/2$ even in the noiseless case, so our result generalizes the existing ones in that scenario as well. Moreover, it is proven by Jin et al. (2017) that the perturbed gradient descent method with an arbitrary initialization finds a solution $\hat{X}$ satisfying the requirements in Theorem 5 with high probability in $O(\operatorname{poly}(1/\kappa))$ iterations. By Theorem 5, $\hat{X}\hat{X}^{T}$ will then be close to the ground truth if $\epsilon$ and $\kappa$ are chosen sufficiently small.
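The following is a heavily simplified sketch in the spirit of the perturbed gradient descent of Jin et al. (2017), not their exact algorithm: whenever the gradient is small, isotropic noise is injected to escape approximate saddle points. All thresholds and step sizes here are arbitrary illustrative choices.

import numpy as np

def perturbed_gradient_descent(grad_f, X0, eta=0.05, g_thresh=1e-3,
                               radius=1e-2, iters=5000, seed=0):
    """Simplified perturbed gradient descent (after Jin et al. 2017):
    take gradient steps, and add a random perturbation whenever the
    gradient norm falls below g_thresh, to escape saddle points."""
    rng = np.random.default_rng(seed)
    X = X0.copy()
    for _ in range(iters):
        G = grad_f(X)
        if np.linalg.norm(G) < g_thresh:
            X += radius * rng.standard_normal(X.shape)  # escape attempt
        else:
            X -= eta * G
    return X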

Table 1 briefly summarizes our result compared with the existing literature.

Paper | Noise | RIP Assumption | Rank | Convergence
Bhojanapalli, Neyshabur, and Srebro (2016) | Isotropic Gaussian | $\delta<1/10$ | rank $r$ | N/A
Zhang, Sojoudi, and Lavaei (2019) | Noiseless | $\delta<1/2$ | rank $1$ | N/A
Zhang (2021) | Noiseless | $\delta<1/2$ | rank $r$ | N/A
Ours | Finite Variance | $\delta<1/2$ | rank $r$ | Polynomial

Table 1: Comparison between our result and the existing literature.

3 Proofs of Main Results

Before presenting the proofs, we first compute the gradient and the Hessian of the objective function $f$ of problem (2) at a point $\hat{X}$ (with the gradient identified with its vectorization in $\mathbb{R}^{nr}$):

\nabla f(\hat{X})=\hat{\mathbf{X}}^{T}\mathbf{A}^{T}(\mathbf{A}\mathbf{e}+w),\qquad\nabla^{2}f(\hat{X})=2I_{r}\otimes\operatorname{mat}_{S}(\mathbf{A}^{T}(\mathbf{A}\mathbf{e}+w))+\hat{\mathbf{X}}^{T}\mathbf{A}^{T}\mathbf{A}\hat{\mathbf{X}},

where

\mathbf{e}=\operatorname{vec}(\hat{X}\hat{X}^{T}-M^{*}),

and $\hat{\mathbf{X}}\in\mathbb{R}^{n^{2}\times nr}$ is the matrix satisfying

\hat{\mathbf{X}}\operatorname{vec}(U)=\operatorname{vec}(\hat{X}U^{T}+U\hat{X}^{T}),\quad\forall U\in\mathbb{R}^{n\times r}.
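As a numerical cross-check of these formulas (a sketch under the illustrative Gaussian setup used earlier, not part of the paper's proofs), one can build $\hat{\mathbf{X}}$ column by column and compare the vectorized gradient $\hat{\mathbf{X}}^{T}\mathbf{A}^{T}(\mathbf{A}\mathbf{e}+w)$ with a finite-difference approximation of $f$.

import numpy as np

rng = np.random.default_rng(5)
n, r, m = 5, 2, 30
A_big = rng.standard_normal((m, n * n)) / np.sqrt(m)   # rows are vec(A_i)^T
Z = rng.standard_normal((n, r))
vec = lambda M: M.reshape(-1, order='F')
b = A_big @ vec(Z @ Z.T)
w = 0.01 * rng.standard_normal(m)
f = lambda X: 0.5 * np.sum((A_big @ vec(X @ X.T) - b + w) ** 2)

X = rng.standard_normal((n, r))
e = vec(X @ X.T - Z @ Z.T)

# Build X_op with columns vec(X U^T + U X^T) for each basis matrix U.
X_op = np.zeros((n * n, n * r))
for j in range(n * r):
    U = np.zeros(n * r); U[j] = 1.0
    U = U.reshape(n, r, order='F')
    X_op[:, j] = vec(X @ U.T + U @ X.T)

grad = X_op.T @ A_big.T @ (A_big @ e + w)   # closed-form vectorized gradient

# Finite-difference check of one coordinate (column-major index mapping).
h, j = 1e-6, 3
Xp = X.copy(); Xp[j % n, j // n] += h
fd = (f(Xp) - f(X)) / h
assert abs(fd - grad[j]) < 1e-3 * max(1.0, abs(grad[j]))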

The first step in the proofs is to derive necessary conditions for a matrix $\hat{X}\in\mathbb{R}^{n\times r}$ to be an approximate second-order critical point. These conditions depend on the linear operator $\mathcal{A}$, the noise $w\in\mathbb{R}^{m}$, the solution $\hat{X}$, and the parameter $\kappa$ characterizing how close $\hat{X}$ is to an exact second-order critical point.

Lemma 1.

Given $\kappa\geq 0$, assume that $\hat{X}\in\mathbb{R}^{n\times r}$ satisfies

\lVert\nabla f(\hat{X})\rVert\leq\kappa,\quad\nabla^{2}f(\hat{X})\succeq-\kappa I_{nr}.

Then, it must satisfy the following inequalities:

\lVert\hat{\mathbf{X}}^{T}\mathbf{H}\mathbf{e}\rVert\leq 2\lVert\hat{X}\rVert_{2}\lVert\mathbf{A}^{T}w\rVert+\kappa, (8a)

2I_{r}\otimes\operatorname{mat}_{S}(\mathbf{H}\mathbf{e})+\hat{\mathbf{X}}^{T}\mathbf{H}\hat{\mathbf{X}}\succeq-(2\lVert\mathbf{A}^{T}w\rVert+\kappa)I_{nr}, (8b)

where $\mathbf{H}=\mathbf{A}^{T}\mathbf{A}$.

Proof.

To obtain condition (8a), notice that $\lVert\nabla f(\hat{X})\rVert\leq\kappa$ implies that

\lVert\hat{\mathbf{X}}^{T}\mathbf{H}\mathbf{e}\rVert\leq\lVert\hat{\mathbf{X}}^{T}\mathbf{A}^{T}w\rVert+\kappa\leq\lVert\hat{\mathbf{X}}\rVert_{2}\lVert\mathbf{A}^{T}w\rVert+\kappa\leq 2\lVert\hat{X}\rVert_{2}\lVert\mathbf{A}^{T}w\rVert+\kappa,

in which the last inequality is due to

\lVert\hat{\mathbf{X}}\operatorname{vec}(U)\rVert=\lVert\hat{X}U^{T}+U\hat{X}^{T}\rVert_{F}\leq 2\lVert\hat{X}\rVert_{2}\lVert U\rVert_{F}

for every $U\in\mathbb{R}^{n\times r}$. Similarly, $\nabla^{2}f(\hat{X})\succeq-\kappa I_{nr}$ implies that

2I_{r}\otimes\operatorname{mat}_{S}(\mathbf{H}\mathbf{e})+\hat{\mathbf{X}}^{T}\mathbf{H}\hat{\mathbf{X}}\succeq-2I_{r}\otimes\operatorname{mat}_{S}(\mathbf{A}^{T}w)-\kappa I_{nr}.

On the other hand, the eigenvalues of $I_{r}\otimes\operatorname{mat}_{S}(\mathbf{A}^{T}w)$ are the same as those of $\operatorname{mat}_{S}(\mathbf{A}^{T}w)$, and each eigenvalue $\lambda_{i}(\operatorname{mat}_{S}(\mathbf{A}^{T}w))$ of the latter matrix further satisfies

\lvert\lambda_{i}(\operatorname{mat}_{S}(\mathbf{A}^{T}w))\rvert\leq\lVert\operatorname{mat}_{S}(\mathbf{A}^{T}w)\rVert_{F}\leq\lVert\mathbf{A}^{T}w\rVert,

which proves condition (8b). ∎

If $\hat{X}$ is a local minimizer of problem (2), Lemma 1 shows that $\hat{X}$ satisfies the inequalities (8a) and (8b) with $\kappa=0$. Similarly, Theorem 2 can be regarded as the special case of Theorem 5 with $\kappa=0$. The proofs of these two theorems proceed by inspecting two cases. The following lemma deals with the first case, in which $\hat{X}$ is an approximate second-order critical point with $\sigma_{r}(\hat{X})$ close to zero.

Lemma 2.

Given $\hat{X}\in\mathbb{R}^{n\times r}$ and arbitrary constants $\epsilon>0$ and $\kappa\geq 0$, the inequalities

\sigma_{r}(\hat{X})\leq\sqrt{\frac{\epsilon+\kappa}{1+\delta}},\quad\lVert\nabla f(\hat{X})\rVert\leq\kappa,\quad\nabla^{2}f(\hat{X})\succeq-\kappa I_{nr}

and $\lVert\mathbf{A}^{T}w\rVert\leq\epsilon$ together imply the inequality (7a).

Proof.

Let $G=\operatorname{mat}_{S}(\mathbf{H}\mathbf{e})$ and let $u\in\mathbb{R}^{n}$ be a unit eigenvector of $G$ corresponding to its minimum eigenvalue, i.e.,

\lVert u\rVert=1,\quad Gu=\lambda_{\min}(G)u.

In addition, let $v\in\mathbb{R}^{r}$ be a singular vector of $\hat{X}$ such that

\lVert v\rVert=1,\quad\lVert\hat{X}v\rVert=\sigma_{r}(\hat{X}).

Let $\mathbf{U}=\operatorname{vec}(uv^{T})$. Then, $\lVert\mathbf{U}\rVert\leq 1$ and (8b) implies that

-2\epsilon-\kappa\leq 2\mathbf{U}^{T}(I_{r}\otimes\operatorname{mat}_{S}(\mathbf{H}\mathbf{e}))\mathbf{U}+\mathbf{U}^{T}\hat{\mathbf{X}}^{T}\mathbf{H}\hat{\mathbf{X}}\mathbf{U}\leq 2\operatorname{tr}(vu^{T}Guv^{T})+(1+\delta)\lVert\hat{X}vu^{T}+uv^{T}\hat{X}^{T}\rVert_{F}^{2}\leq 2\lambda_{\min}(G)+4(1+\delta)\sigma_{r}(\hat{X})^{2}\leq 2\lambda_{\min}(G)+4\epsilon+4\kappa. (9)

On the other hand,

(1-\delta)\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{2}\leq\mathbf{e}^{T}\mathbf{H}\mathbf{e}=\operatorname{vec}(\hat{X}\hat{X}^{T})^{T}\mathbf{H}\mathbf{e}-\operatorname{vec}(M^{*})^{T}\mathbf{H}\mathbf{e}=\frac{1}{2}\operatorname{vec}(\hat{X})^{T}\hat{\mathbf{X}}^{T}\mathbf{H}\mathbf{e}-\langle M^{*},\operatorname{mat}_{S}(\mathbf{H}\mathbf{e})\rangle\leq\frac{1}{2}\lVert\hat{X}\rVert_{2}\lVert\hat{\mathbf{X}}^{T}\mathbf{H}\mathbf{e}\rVert+\left(3\epsilon+\frac{5\kappa}{2}\right)\operatorname{tr}(M^{*})\leq\epsilon\lVert\hat{X}\rVert_{2}^{2}+\frac{\kappa}{2}\lVert\hat{X}\rVert_{2}+\left(3\epsilon+\frac{5\kappa}{2}\right)\operatorname{tr}(M^{*}),

in which the second-to-last inequality is due to (9) and the last inequality is due to (8a). Furthermore, the right-hand side of the above inequality can be relaxed as

\epsilon\lVert\hat{X}\rVert_{2}^{2}+\frac{\kappa}{2}\lVert\hat{X}\rVert_{2}+\left(3\epsilon+\frac{5\kappa}{2}\right)\operatorname{tr}(M^{*})\leq\epsilon\sqrt{r}\lVert\hat{X}\hat{X}^{T}\rVert_{F}+\frac{r^{1/4}\kappa}{2}\lVert\hat{X}\hat{X}^{T}\rVert_{F}^{1/2}+\left(3\epsilon+\frac{5\kappa}{2}\right)\sqrt{r}\lVert M^{*}\rVert_{F}\leq\epsilon\sqrt{r}\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}+\frac{r^{1/4}\kappa}{2}\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{1/2}+\left(4\sqrt{r}\epsilon+\frac{5\sqrt{r}\kappa}{2}\right)\lVert M^{*}\rVert_{F}+\frac{r^{1/4}\kappa}{2}\lVert M^{*}\rVert_{F}^{1/2},

which leads to the inequality (7a). ∎

The remaining case with

\sigma_{r}(\hat{X})>\sqrt{\frac{\epsilon+\kappa}{1+\delta}}

will be handled in the following lemma using a different method.

Lemma 3.

Assume that the linear operator $\mathcal{A}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property with $\delta<1/2$. Given $\hat{X}\in\mathbb{R}^{n\times r}$ and arbitrary constants $\epsilon>0$ and $\kappa\geq 0$, the inequalities

\sigma_{r}(\hat{X})>\sqrt{\frac{\epsilon+\kappa}{1+\delta}},\quad\lVert\nabla f(\hat{X})\rVert\leq\kappa,\quad\nabla^{2}f(\hat{X})\succeq-\kappa I_{nr}

and $\lVert\mathbf{A}^{T}w\rVert\leq\epsilon$ together imply the inequality (7b).

The proofs of both Lemma 3 and the local guarantee in Theorem 4 generalize the proof of the absence of spurious local minima for the noiseless problem in Zhang and Zhang (2020); Zhang (2021). Our innovation here is to develop new techniques for analyzing approximate optimality conditions, because, unlike in the noiseless problem, the local minimizers of the noisy problem are only approximate second-order critical points of the distance function $\lVert\mathcal{A}(XX^{T})-b\rVert^{2}$. For a fixed solution $\hat{X}$ and noise $w$, one can find an operator $\hat{\mathcal{A}}$ satisfying the $\delta$-$\operatorname{RIP}_{2r}$ property with the smallest possible $\delta$ such that $\hat{X}$ and $\hat{\mathcal{A}}$ satisfy the necessary conditions stated in Lemma 1. Let $\delta^{*}(\hat{X})$ be the RIP constant of this measurement operator $\hat{\mathcal{A}}$ in the worst-case scenario. Then, if $\hat{X}$ in Lemma 3 is a solution of the current problem with the linear operator $\mathcal{A}$ satisfying the $\delta$-$\operatorname{RIP}_{2r}$ property, it holds that $\delta\geq\delta^{*}(\hat{X})$, which can further lead to an upper bound on the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$.

To compute $\delta^{*}(\hat{X})$ defined above, let $q=\mathbf{A}^{T}w$ and solve the following optimization problem, whose optimal value is $\delta^{*}(\hat{X})$:

\min_{\delta,\hat{\mathbf{H}}}\ \delta (10)
\operatorname{s.t.}\ \lVert\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\mathbf{e}\rVert\leq 2\lVert\hat{X}\rVert_{2}\lVert q\rVert+\kappa,
2I_{r}\otimes\operatorname{mat}_{S}(\hat{\mathbf{H}}\mathbf{e})+\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\hat{\mathbf{X}}\succeq-(2\lVert q\rVert+\kappa)I_{nr},
\text{$\hat{\mathbf{H}}$ is symmetric and satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property}.

Note that a matrix $\hat{\mathbf{H}}\in\mathbb{R}^{n^{2}\times n^{2}}$ is said to satisfy the $\delta$-$\operatorname{RIP}_{2r}$ property if

(1-\delta)\lVert\mathbf{U}\rVert^{2}\leq\mathbf{U}^{T}\hat{\mathbf{H}}\mathbf{U}\leq(1+\delta)\lVert\mathbf{U}\rVert^{2}

holds for every matrix $U\in\mathbb{R}^{n\times n}$ with $\operatorname{rank}(U)\leq 2r$, where $\mathbf{U}=\operatorname{vec}(U)$. Obviously, for a linear operator $\hat{\mathcal{A}}$ with matrix representation $\hat{\mathbf{A}}$, the matrix $\hat{\mathbf{H}}=\hat{\mathbf{A}}^{T}\hat{\mathbf{A}}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property if and only if $\hat{\mathcal{A}}$ satisfies the $\delta$-$\operatorname{RIP}_{2r}$ property.

However, since problem (10) is non-convex due to the RIP constraint, we instead solve the following convex reformulation:

\min_{\delta,\hat{\mathbf{H}}}\ \delta (11)
\operatorname{s.t.}\ \lVert\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\mathbf{e}\rVert\leq 2\lVert\hat{X}\rVert_{2}\lVert q\rVert+\kappa,
2I_{r}\otimes\operatorname{mat}_{S}(\hat{\mathbf{H}}\mathbf{e})+\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\hat{\mathbf{X}}\succeq-(2\lVert q\rVert+\kappa)I_{nr},
(1-\delta)I_{n^{2}}\preceq\hat{\mathbf{H}}\preceq(1+\delta)I_{n^{2}}.

Lemma 14 in Bi and Lavaei (2020) proves that problems (10) and (11) have the same optimal value. The remaining step in the proof of Lemma 3 is to solve the optimization problem (11) for given $\hat{X}$, $q$ and $\kappa$. The complete proof of Lemma 3 is lengthy and deferred to Appendix B. Finally, Theorem 2 and Theorem 5 are direct consequences of Lemma 2 and Lemma 3. The proof of Theorem 3 is very similar to that of Lemma 3 and is also given in Appendix B.
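For small instances, problem (11) is a standard SDP. Below is a minimal CVXPY sketch written for this exposition (the paper solves (11) analytically in its proofs, not numerically); the dimensions, `Z`, and `q` are illustrative stand-ins, and the block structure of $I_{r}\otimes\operatorname{mat}_{S}(\hat{\mathbf{H}}\mathbf{e})$ is built explicitly to avoid variable-dependent Kronecker products.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(6)
n, r, kappa = 4, 1, 0.0
vec = lambda M: M.reshape(-1, order='F')

X_hat = rng.standard_normal((n, r))
Z = rng.standard_normal((n, r))
e = vec(X_hat @ X_hat.T - Z @ Z.T)              # e = vec(X X^T - M*)
q = 0.01 * rng.standard_normal(n * n)           # stands in for A^T w

# X_op satisfies X_op @ vec(U) = vec(X_hat U^T + U X_hat^T).
X_op = np.zeros((n * n, n * r))
for j in range(n * r):
    U = np.zeros(n * r); U[j] = 1.0
    U = U.reshape(n, r, order='F')
    X_op[:, j] = vec(X_hat @ U.T + U @ X_hat.T)

H = cp.Variable((n * n, n * n), symmetric=True)
delta = cp.Variable()
He = cp.reshape(H @ e, (n, n), order='F')
S = (He + He.T) / 2                             # mat_S(H e), affine in H
lmi = cp.bmat([[2 * S if i == j else np.zeros((n, n)) for j in range(r)]
               for i in range(r)]) + X_op.T @ H @ X_op
lmi = (lmi + lmi.T) / 2                         # enforce symmetry symbolically
rhs = 2 * np.linalg.norm(X_hat, 2) * np.linalg.norm(q) + kappa

prob = cp.Problem(cp.Minimize(delta), [
    cp.norm(X_op.T @ H @ e) <= rhs,
    lmi >> -(2 * np.linalg.norm(q) + kappa) * np.eye(n * r),
    H >> (1 - delta) * np.eye(n * n),
    (1 + delta) * np.eye(n * n) >> H,
])
prob.solve()
print("delta*(X_hat) ~", delta.value)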

Now, we turn to the proof of the local guarantee in Theorem 4. The following existing result will be useful.

Lemma 4 (from Lemma 14 in Zhang, Sojoudi, and Lavaei (2019)).

Given $a,b\in\mathbb{R}^{n}$, the rank-2 matrix $ab^{T}+ba^{T}$ has two possibly nonzero eigenvalues

\lVert a\rVert\lVert b\rVert(1+\cos\theta),\quad-\lVert a\rVert\lVert b\rVert(1-\cos\theta).

Here, $\theta$ is the angle between $a$ and $b$.
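This fact is easy to confirm numerically (a quick sketch, not part of the proof):

import numpy as np

rng = np.random.default_rng(7)
a, b = rng.standard_normal(5), rng.standard_normal(5)
cos_t = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
scale = np.linalg.norm(a) * np.linalg.norm(b)

eigs = np.linalg.eigvalsh(np.outer(a, b) + np.outer(b, a))
assert np.isclose(eigs.max(), scale * (1 + cos_t))   # largest eigenvalue
assert np.isclose(eigs.min(), -scale * (1 - cos_t))  # smallest eigenvalue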

Proof of Theorem 4.

First, we relax the optimization problem (11) by dropping the constraint corresponding to the second-order necessary optimality condition (and setting $\kappa=0$, since Theorem 4 concerns exact local minimizers). This gives rise to the optimization problem

\min_{\delta,\hat{\mathbf{H}}}\ \delta (12)
\operatorname{s.t.}\ \lVert\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\mathbf{e}\rVert\leq 2\lVert\hat{X}\rVert_{2}\lVert q\rVert,
(1-\delta)I_{n^{2}}\preceq\hat{\mathbf{H}}\preceq(1+\delta)I_{n^{2}}.

To further simplify problem (12), one can replace its decision variable $\delta$ with $\eta$ and introduce the following optimization problem:

\max_{\eta,\hat{\mathbf{H}}}\ \eta (13)
\operatorname{s.t.}\ \lVert\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\mathbf{e}\rVert\leq 2\lVert\hat{X}\rVert_{2}\lVert q\rVert,
\eta I_{n^{2}}\preceq\hat{\mathbf{H}}\preceq I_{n^{2}}.

Given any feasible solution $(\delta,\hat{\mathbf{H}})$ to (12), the tuple

\left(\frac{1-\delta}{1+\delta},\frac{1}{1+\delta}\hat{\mathbf{H}}\right)

is a feasible solution to problem (13). Therefore, if the optimal value of (12) is denoted by $\delta_{f}^{*}(\hat{X})$ and the optimal value of (13) by $\eta_{f}^{*}(\hat{X})$, then it holds that

\eta_{f}^{*}(\hat{X})\geq\frac{1-\delta_{f}^{*}(\hat{X})}{1+\delta_{f}^{*}(\hat{X})}\geq\frac{1-\delta^{*}(\hat{X})}{1+\delta^{*}(\hat{X})}\geq\frac{1-\delta}{1+\delta}, (14)

in which the last inequality is implied by $\delta\geq\delta^{*}(\hat{X})$, as shown above. To prove the inequality (6), we need to bound $\eta_{f}^{*}(\hat{X})$ from above, which can be achieved by finding a feasible solution to the dual problem of (13), given below:

\min_{U_{1},U_{2},G,\lambda,y}\ \operatorname{tr}(U_{2})+4\lVert\hat{X}\rVert^{2}_{2}\lVert q\rVert^{2}\lambda+\operatorname{tr}(G) (15)
\operatorname{s.t.}\ \operatorname{tr}(U_{1})=1,
(\hat{\mathbf{X}}y)\mathbf{e}^{T}+\mathbf{e}(\hat{\mathbf{X}}y)^{T}=U_{1}-U_{2},
\begin{bmatrix}G&-y\\-y^{T}&\lambda\end{bmatrix}\succeq 0,
U_{1}\succeq 0,\quad U_{2}\succeq 0.

For any matrix $\hat{X}\in\mathbb{R}^{n\times r}$ satisfying $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\tau\lambda_{r}(M^{*})$, we have $\hat{X}\neq 0$, and it has been shown in the proof of Lemma 19 in Bi and Lavaei (2020) that there exists $y\neq 0$ satisfying the inequalities

\lVert\hat{\mathbf{X}}y\rVert^{2}\geq 2\lambda_{r}(\hat{X}\hat{X}^{T})\lVert y\rVert^{2}, (16a)
\cos\theta\geq\sqrt{1-\frac{3+2\sqrt{2}}{4}\tau^{2}}, (16b)

where $\theta$ is the angle between $\hat{\mathbf{X}}y$ and $\mathbf{e}$. Note that (16b) holds only in the exactly parameterized regime, i.e., the case with $\operatorname{rank}(M^{*})=r$, since the derivation of (16b) relies on Lemma 5.4 of Tu et al. (2016), which fails when $\operatorname{rank}(M^{*})<r$. This is the main reason why our result cannot be directly generalized to the overparameterized case. Now, define

M=(\hat{\mathbf{X}}y)\mathbf{e}^{T}+\mathbf{e}(\hat{\mathbf{X}}y)^{T},

and decompose $M$ as $M=[M]_{+}-[M]_{-}$ with $[M]_{+}\succeq 0$ and $[M]_{-}\succeq 0$. Then, it is easy to verify that $(U_{1}^{*},U_{2}^{*},G^{*},\lambda^{*},y^{*})$ defined as

y^{*}=\frac{y}{\operatorname{tr}([M]_{+})},\quad U_{1}^{*}=\frac{[M]_{+}}{\operatorname{tr}([M]_{+})},\quad U_{2}^{*}=\frac{[M]_{-}}{\operatorname{tr}([M]_{+})},\quad G^{*}=\frac{y^{*}(y^{*})^{T}}{\lambda^{*}},\quad\lambda^{*}=\frac{\lVert y^{*}\rVert}{2\lVert\hat{X}\rVert_{2}\lVert q\rVert}

forms a feasible solution to the dual problem (15) with the objective value

\frac{\operatorname{tr}([M]_{-})+4\lVert\hat{X}\rVert_{2}\lVert q\rVert\lVert y\rVert}{\operatorname{tr}([M]_{+})}. (17)

Furthermore, $\operatorname{rank}(M^{*})=r$ implies that $\lambda_{r}(M^{*})>0$. By the Wielandt–Hoffman theorem,

\lvert\lambda_{r}(\hat{X}\hat{X}^{T})-\lambda_{r}(M^{*})\rvert\leq\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\tau\lambda_{r}(M^{*}),\quad\lvert\lambda_{1}(\hat{X}\hat{X}^{T})-\lambda_{1}(M^{*})\rvert\leq\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\tau\lambda_{r}(M^{*}).

Thus, using the above two inequalities and inequality (16a), we have

\frac{2\lVert\hat{X}\rVert_{2}\lVert y\rVert}{\lVert\hat{\mathbf{X}}y\rVert}\leq\frac{2\lVert\hat{X}\rVert_{2}}{\sqrt{2\lambda_{r}(\hat{X}\hat{X}^{T})}}\leq\sqrt{\frac{2(\lambda_{1}(M^{*})+\tau\lambda_{r}(M^{*}))}{(1-\tau)\lambda_{r}(M^{*})}}=C(\tau,M^{*}). (18)

Next, according to Lemma 4, one can write

\operatorname{tr}([M]_{+})=\lVert\hat{\mathbf{X}}y\rVert\lVert\mathbf{e}\rVert(1+\cos\theta),\quad\operatorname{tr}([M]_{-})=\lVert\hat{\mathbf{X}}y\rVert\lVert\mathbf{e}\rVert(1-\cos\theta).

Substituting the above two equations and (18) into the dual objective value (17), one can obtain

\eta_{f}^{*}(\hat{X})\leq\frac{1-\cos\theta+2C(\tau,M^{*})\lVert q\rVert/\lVert\mathbf{e}\rVert}{1+\cos\theta},

which together with (14) implies that

\lVert\mathbf{e}\rVert\leq(1+\delta)C(\tau,M^{*})\lVert q\rVert(\cos\theta-\delta)^{-1}.

The inequality (6) can then be proved by combining the above inequality with (16b) under the probabilistic event that $\lVert q\rVert\leq\epsilon$. ∎

[Figure 1: Comparison of the upper bounds given by Theorem 2 for the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ with $\hat{X}$ being an arbitrary local minimizer. Panel (a): the upper bound derived from inequality (3a); panel (b): the upper bound derived from inequality (3b).]

[Figure 2: Upper bounds given by Theorem 2 for the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ based on the explicit distribution of the noise $w$. Panel (a): the upper bound derived from inequality (3a); panel (b): the upper bound derived from inequality (3b).]

4 Numerical Illustration

Next, we empirically study the developed probabilistic guarantees: we examine the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ between an arbitrary local minimizer $\hat{X}$ and the ground truth $M^{*}$, as well as the RIP constant $\delta$ that the linear operator $\mathcal{A}$ must satisfy.

Before delving into the numerical illustration, note that the probability $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$ used in both Theorem 2 and Theorem 4 can be bounded from below by the probability $\mathbb{P}(\lVert w\rVert\leq w_{0})$ with $w_{0}=\epsilon/\lVert\mathbf{A}\rVert_{2}$. The latter probability can be easily estimated when the probability distribution of the noise $w$ is given. For example, in the simplest case when $w$ is sampled from an isotropic Gaussian distribution, i.e., $w\sim\mathcal{N}(0,\sigma^{2}I_{m})$, the random variable $\lVert w/\sigma\rVert^{2}$ follows the chi-square distribution, and one can apply the Chernoff bound to obtain

\mathbb{P}(\lVert w\rVert\leq w_{0})=1-\mathbb{P}\left(\left\lVert\frac{w}{\sigma}\right\rVert^{2}\geq\frac{w_{0}^{2}}{\sigma^{2}}\right)\geq 1-\inf_{0\leq t<1/2}(1-2t)^{-m/2}\mathrm{e}^{-tw_{0}^{2}/\sigma^{2}}.

After solving the minimization problem in the above equation, we obtain

1-\left(\frac{2m\sigma^{2}}{w_{0}^{2}}\right)^{-m/2}\mathrm{e}^{m-\frac{w_{0}^{2}}{2\sigma^{2}}}\leq\mathbb{P}(\lVert w\rVert\leq w_{0}).

More generally, if $w$ is a $(\sigma/\sqrt{m})$-sub-Gaussian vector, then applying Lemma 1 in Jin et al. (2019a) leads to

1-2\mathrm{e}^{-\frac{w_{0}^{2}}{16m\sigma^{2}}}\leq\mathbb{P}(\lVert w\rVert\leq w_{0}).
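The following sketch (illustrative, using the paper's Gaussian special case with arbitrary example numbers) compares the closed-form Chernoff lower bound above against a Monte Carlo estimate of $\mathbb{P}(\lVert w\rVert\leq w_{0})$.

import numpy as np

rng = np.random.default_rng(8)
m, sigma, w0 = 10, 0.05 / np.sqrt(10), 0.1

# Chernoff lower bound for isotropic Gaussian noise, as displayed above
# (meaningful when w0^2 > m * sigma^2).
chernoff = 1 - (2 * m * sigma**2 / w0**2) ** (-m / 2) \
             * np.exp(m - w0**2 / (2 * sigma**2))

# Monte Carlo estimate of P(||w|| <= w0).
samples = sigma * rng.standard_normal((100000, m))
mc = np.mean(np.linalg.norm(samples, axis=1) <= w0)

print(f"Chernoff lower bound: {chernoff:.4f}, Monte Carlo: {mc:.4f}")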
[Figure 3: Comparison of the maximum RIP constants $\delta$ allowed by Theorem 2 and Theorem 4 to guarantee a given maximum distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ for an arbitrary local minimizer $\hat{X}$ satisfying (5) with a given probability. Panel (a): the $\delta$ bound in Theorem 2 with $\tau=+\infty$; panels (b)-(d): the $\delta$ bound in Theorem 4 with $\tau=0.2$, $\tau=0.5$, and $\tau=0.8$, respectively.]

For the numerical illustration, assume that $n=50$, $m=10$ and $\lVert\mathbf{A}\rVert_{2}\leq 2$, while the noise $w$ is a $(0.05/\sqrt{m})$-sub-Gaussian vector. We also assume that the ground truth $M^{*}$ has rank 5 with the largest eigenvalue being 1.5 and the smallest nonzero eigenvalue being 1.

First, we use the two inequalities (3a) and (3b) in Theorem 2 to obtain two upper bounds on $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$, where $\hat{X}$ denotes an arbitrary (worst-case) local minimizer. Figure 1 gives the contour plots of the two upper bounds, which hold with the given probability on the $y$-axis and the given RIP constant $\delta$ ranging from 0 to $1/2$ on the $x$-axis. While the final bound on $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ is often determined by the inequality (3b), the inequality (3a) is needed theoretically to deal with the case when $\hat{X}$ has a singular value close to 0.

Furthermore, a tighter bound on the success probability can be derived by calculating the exact probability $\mathbb{P}(\lVert\mathbf{A}^{T}w\rVert\leq\epsilon)$ for an explicit distribution of $w$. Figure 2 is obtained in this way under the same assumptions as those for Figure 1, except that $w$ is isotropic Gaussian with the same parameter as in the sub-Gaussian assumption. Compared with Figure 1, the shape is similar, but the bound is tighter.

Next, we illustrate the bounds given by Theorem 2 and Theorem 4. Figure 3 shows the contour plots of the maximum RIP constant $\delta$ that guarantees, with the given probability on the $y$-axis, that every local minimizer $\hat{X}$ (satisfying the inequality (5) when Theorem 4 is applied) lies within a given neighborhood of the ground truth, measured via the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ on the $x$-axis, as implied by the respective global and local guarantees. Figure 3 clearly shows that a smaller RIP constant $\delta$ leads to a tighter bound on the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ with a higher probability. In addition, the local guarantee generally requires a looser RIP assumption, as it still holds even when $\delta>1/2$. However, as the parameter $\tau$ in Theorem 4 increases, the local bound degrades quickly, sometimes becoming worse than the global bound, as illustrated in Figure 3(d).

Moreover, in our experiments we have also tried different values of the problem parameters $m$ and $n$. They all yield similar results, which are included in Appendix C for completeness.

5 Conclusion

In this paper, we develop global and local analyses of the locations of the local minima of the low-rank matrix recovery problem with noisy linear measurements. Unlike the existing results, the probability distribution of the noise is arbitrary and the RIP constant of the problem is free to take any value. The developed results encompass the state-of-the-art results on the non-existence of spurious solutions in the noiseless case. Furthermore, we prove the strict saddle property, which guarantees the global convergence of the perturbed gradient descent method in polynomial time. Our analyses show how the value of the RIP constant and the intensity of the noise affect the landscape of the non-convex learning problem and the locations of the local minima relative to the ground truth. Future research directions include extending our results to the cases where the matrices are asymmetric, the measurements are nonlinear, or the problem is overparameterized with $\operatorname{rank}(M^{*})$ less than $r$.

References

  • Bhojanapalli, Neyshabur, and Srebro (2016) Bhojanapalli, S.; Neyshabur, B.; and Srebro, N. 2016. Global Optimality of Local Search for Low Rank Matrix Recovery. In Advances in Neural Information Processing Systems, volume 29.
  • Bi and Lavaei (2020) Bi, Y.; and Lavaei, J. 2020. Global and Local Analyses of Nonlinear Low-Rank Matrix Recovery Problems. arXiv:2010.04349.
  • Boumal (2016) Boumal, N. 2016. Nonconvex phase synchronization. SIAM Journal on Optimization, 26(4): 2355–2377.
  • Burer and Monteiro (2003) Burer, S.; and Monteiro, R. D. 2003. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2): 329–357.
  • Candès and Recht (2009) Candès, E. J.; and Recht, B. 2009. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6): 717–772.
  • Candès and Tao (2010) Candès, E. J.; and Tao, T. 2010. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5): 2053–2080.
  • Chen and Chi (2018) Chen, Y.; and Chi, Y. 2018. Harnessing structures in big data via guaranteed low-rank matrix estimation: Recent theory and fast algorithms via convex and nonconvex optimization. IEEE Signal Processing Magazine, 35(4): 14–31.
  • Chi, Lu, and Chen (2019) Chi, Y.; Lu, Y. M.; and Chen, Y. 2019. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20): 5239–5269.
  • Ge, Jin, and Zheng (2017) Ge, R.; Jin, C.; and Zheng, Y. 2017. No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 1233–1242.
  • Ha, Liu, and Barber (2020) Ha, W.; Liu, H.; and Barber, R. F. 2020. An Equivalence Between Critical Points for Rank Constraints Versus Low-Rank Factorizations. SIAM Journal on Optimization, 30(4): 2927–2955.
  • Jin et al. (2017) Jin, C.; Ge, R.; Netrapalli, P.; Kakade, S. M.; and Jordan, M. I. 2017. How to Escape Saddle Points Efficiently. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 1724–1732.
  • Jin et al. (2019a) Jin, C.; Netrapalli, P.; Ge, R.; Kakade, S. M.; and Jordan, M. I. 2019a. A short note on concentration inequalities for random vectors with subGaussian norm. arXiv:1902.03736.
  • Jin et al. (2021) Jin, M.; Lavaei, J.; Sojoudi, S.; and Baldick, R. 2021. Boundary Defense Against Cyber Threat for Power System State Estimation. IEEE Transactions on Information Forensics and Security, 16: 1752–1767.
  • Jin et al. (2019b) Jin, M.; Molybog, I.; Mohammadi-Ghazi, R.; and Lavaei, J. 2019b. Towards robust and scalable power system state estimation. In 2019 IEEE 58th Conference on Decision and Control (CDC), 3245–3252. IEEE.
  • Li et al. (2020) Li, X.; Zhu, Z.; Man-Cho So, A.; and Vidal, R. 2020. Nonconvex robust low-rank matrix recovery. SIAM Journal on Optimization, 30(1): 660–686.
  • Park et al. (2018) Park, D.; Kyrillidis, A.; Caramanis, C.; and Sanghavi, S. 2018. Finding low-rank solutions via nonconvex matrix factorization, efficiently and provably. SIAM Journal on Imaging Sciences, 11(4): 2165–2204.
  • Park et al. (2017) Park, D.; Kyrillidis, A.; Caramanis, C.; and Sanghavi, S. 2017. Non-square Matrix Sensing Without Spurious Local Minima via the Burer–Monteiro Approach. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, 65–74.
  • Recht, Fazel, and Parrilo (2010) Recht, B.; Fazel, M.; and Parrilo, P. A. 2010. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3): 471–501.
  • Shechtman et al. (2015) Shechtman, Y.; Eldar, Y. C.; Cohen, O.; Chapman, H. N.; Miao, J.; and Segev, M. 2015. Phase retrieval with application to optical imaging: A contemporary overview. IEEE Signal Processing Magazine, 32(3): 87–109.
  • Singer (2011) Singer, A. 2011. Angular synchronization by eigenvectors and semidefinite programming. Applied and Computational Harmonic Analysis, 30(1): 20–36.
  • Tu et al. (2016) Tu, S.; Boczar, R.; Simchowitz, M.; Soltanolkotabi, M.; and Recht, B. 2016. Low-Rank Solutions of Linear Matrix Equations via Procrustes Flow. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, 964–973.
  • Wang, Zhang, and Gu (2017) Wang, L.; Zhang, X.; and Gu, Q. 2017. A unified computational and statistical framework for nonconvex low-rank matrix estimation. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, 981–990.
  • Zhang and Zhang (2020) Zhang, G.; and Zhang, R. Y. 2020. How Many Samples Is a Good Initial Point Worth in Low-Rank Matrix Recovery? In Advances in Neural Information Processing Systems, volume 33, 12583–12592.
  • Zhang (2021) Zhang, R. Y. 2021. Sharp Global Guarantees for Nonconvex Low-Rank Matrix Recovery in the Overparameterized Regime. arXiv:2104.10790.
  • Zhang et al. (2018a) Zhang, R. Y.; Josz, C.; Sojoudi, S.; and Lavaei, J. 2018a. How Much Restricted Isometry Is Needed in Nonconvex Matrix Recovery? In Advances in Neural Information Processing Systems, volume 31.
  • Zhang, Sojoudi, and Lavaei (2019) Zhang, R. Y.; Sojoudi, S.; and Lavaei, J. 2019. Sharp Restricted Isometry Bounds for the Inexistence of Spurious Local Minima in Nonconvex Matrix Recovery. Journal of Machine Learning Research, 20(114): 1–34.
  • Zhang et al. (2018b) Zhang, X.; Wang, L.; Yu, Y.; and Gu, Q. 2018b. A Primal-Dual Analysis of Global Optimality in Nonconvex Low-Rank Matrix Recovery. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 5862–5871.
  • Zhu et al. (2018) Zhu, Z.; Li, Q.; Tang, G.; and Wakin, M. B. 2018. Global Optimality in Low-Rank Matrix Optimization. IEEE Transactions on Signal Processing, 66(13): 3614–3628.
  • Zhu et al. (2021) Zhu, Z.; Li, Q.; Tang, G.; and Wakin, M. B. 2021. The global optimization geometry of low-rank matrix optimization. IEEE Transactions on Information Theory, 67(2): 1308–1331.

Acknowledgments

This work was supported by grants from AFOSR, ARO, ONR, and NSF.

Appendix A Remark on Theorem 2

The two upper bounds on the distance $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}$ can be obtained for any local minimizer $\hat{X}$ by solving the two quadratic-like inequalities (3a) and (3b), and the larger bound needs to be used. Explicitly,

\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}\leq\max\{T_{1},T_{2}\},

with

T_{1}=\frac{\epsilon\sqrt{r}+\sqrt{r\epsilon^{2}+16(1-\delta)\epsilon\sqrt{r}\lVert M^{*}\rVert_{F}}}{2(1-\delta)},\qquad T_{2}=\left(\frac{2\sqrt{6\epsilon(1+\delta)^{3}}+3\sqrt{8\epsilon(1+\delta)^{3}+\frac{8}{3}(1-2\delta)(1+\delta)\left(2\epsilon\sqrt{r}+2\sqrt{2\epsilon(1+\delta)}\lVert M^{*}\rVert_{F}^{1/2}\right)}}{4(1-2\delta)}\right)^{2},

where $T_{1}$ follows from solving the quadratic inequality (3a) and $T_{2}$ from solving (3b) as a quadratic in $\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{1/2}$.
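A small sketch evaluating this explicit bound (a direct transcription of the formulas above into Python; the example numbers are arbitrary):

import numpy as np

def theorem2_bound(eps, delta, r, M_norm):
    """Explicit upper bound max{T1, T2} on ||X X^T - M*||_F from Theorem 2.

    M_norm is the Frobenius norm of the ground truth M*."""
    T1 = (eps * np.sqrt(r)
          + np.sqrt(r * eps**2 + 16 * (1 - delta) * eps * np.sqrt(r) * M_norm)
          ) / (2 * (1 - delta))
    inner = (8 * eps * (1 + delta)**3
             + 8 / 3 * (1 - 2 * delta) * (1 + delta)
             * (2 * eps * np.sqrt(r)
                + 2 * np.sqrt(2 * eps * (1 + delta)) * np.sqrt(M_norm)))
    T2 = ((2 * np.sqrt(6 * eps * (1 + delta)**3) + 3 * np.sqrt(inner))
          / (4 * (1 - 2 * delta))) ** 2
    return max(T1, T2)

# Example (arbitrary numbers): the bound shrinks as eps -> 0.
for eps in [1e-1, 1e-2, 1e-3]:
    print(eps, theorem2_bound(eps, delta=0.3, r=5, M_norm=2.0))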

Appendix B Proofs of Lemma 3 and Theorem 3

Proof of Lemma 3.

Let $Z\in\mathbb{R}^{n\times r}$ be a matrix satisfying $ZZ^{T}=M^{*}$. Similar to the proof of Theorem 4, we introduce the following optimization problem:

\max_{\eta,\hat{\mathbf{H}}}\ \eta (19)
\operatorname{s.t.}\ \lVert\hat{\mathbf{X}}^{T}\hat{\mathbf{H}}\mathbf{e}\rVert\leq 2\lVert\hat{X}\rVert_{2}\epsilon+\kappa,
2I_{r}\otimes\operatorname{mat}_{S}(\hat{\mathbf{H}}\mathbf{e})+\hat{\mathbf{X}}^{T}\hat{\mathbf{X}}\succeq-(2\epsilon+\kappa)I_{nr},
\eta I_{n^{2}}\preceq\hat{\mathbf{H}}\preceq I_{n^{2}},

where its optimal value $\eta^{*}(\hat{X})$ satisfies the inequality

\eta^{*}(\hat{X})\geq\frac{1-\delta^{*}(\hat{X})}{1+\delta^{*}(\hat{X})}\geq\frac{1-\delta}{1+\delta}. (20)

In the remaining part, we will prove the following upper bound on $\eta^{*}(\hat{X})$:

\eta^{*}(\hat{X})\leq\frac{1}{3}+\frac{(2\epsilon+\kappa)\sqrt{r}+\sqrt{2\kappa(1+\delta)}+2\sqrt{2\epsilon(1+\delta)}\lVert\hat{X}\rVert_{2}}{\lVert\mathbf{e}\rVert}. (21)

The inequality (7b) is a consequence of (20), (21) and the inequality

\lVert\hat{X}\rVert_{2}\leq\lVert\hat{X}\hat{X}^{T}\rVert_{F}^{1/2}\leq\lVert\hat{X}\hat{X}^{T}-M^{*}\rVert_{F}^{1/2}+\lVert M^{*}\rVert_{F}^{1/2}.

The proof of the upper bound (21) can be completed by finding a feasible solution to the dual problem of (19):

\min_{U_{1},U_{2},W,G,\lambda,y}\ \operatorname{tr}(U_{2})+\langle\hat{\mathbf{X}}^{T}\hat{\mathbf{X}},W\rangle+(2\epsilon+\kappa)\operatorname{tr}(W)+(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)^{2}\lambda+\operatorname{tr}(G) (22)
\operatorname{s.t.}\ \operatorname{tr}(U_{1})=1,
(\hat{\mathbf{X}}y-w)\mathbf{e}^{T}+\mathbf{e}(\hat{\mathbf{X}}y-w)^{T}=U_{1}-U_{2},
\begin{bmatrix}G&-y\\-y^{T}&\lambda\end{bmatrix}\succeq 0,
U_{1}\succeq 0,\quad U_{2}\succeq 0,\quad W=\begin{bmatrix}W_{1,1}&\cdots&W_{r,1}^{T}\\\vdots&\ddots&\vdots\\W_{r,1}&\cdots&W_{r,r}\end{bmatrix}\succeq 0,
w=\sum_{i=1}^{r}\operatorname{vec}(W_{i,i}).

Before describing the choice of the dual feasible solution, we need to represent the error vector $\mathbf{e}$ in a different form. Let $\mathcal{P}\in\mathbb{R}^{n\times n}$ be the orthogonal projection matrix onto the range of $\hat{X}$, and let $\mathcal{P}_{\perp}\in\mathbb{R}^{n\times n}$ be the orthogonal projection matrix onto the orthogonal complement of the range of $\hat{X}$. Then, $Z$ can be decomposed as $Z=\mathcal{P}Z+\mathcal{P}_{\perp}Z$, and there exists a matrix $R\in\mathbb{R}^{r\times r}$ such that $\mathcal{P}Z=\hat{X}R$. Note that

ZZ^{T}=\mathcal{P}ZZ^{T}\mathcal{P}+\mathcal{P}ZZ^{T}\mathcal{P}_{\perp}+\mathcal{P}_{\perp}ZZ^{T}\mathcal{P}+\mathcal{P}_{\perp}ZZ^{T}\mathcal{P}_{\perp}.

Thus, if we choose

\hat{Y}=\frac{1}{2}\hat{X}-\frac{1}{2}\hat{X}RR^{T}-\mathcal{P}_{\perp}ZR^{T},\quad\hat{y}=\operatorname{vec}(\hat{Y}), (23)

then it can be verified that

\hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}-\mathcal{P}_{\perp}ZZ^{T}\mathcal{P}_{\perp}=\hat{X}\hat{X}^{T}-ZZ^{T},\quad\langle\hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T},\mathcal{P}_{\perp}ZZ^{T}\mathcal{P}_{\perp}\rangle=0.

Moreover, we have

\lVert\hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}\rVert_{F}^{2}=2\operatorname{tr}(\hat{X}^{T}\hat{X}\hat{Y}^{T}\hat{Y})+\operatorname{tr}(\hat{X}^{T}\hat{Y}\hat{X}^{T}\hat{Y})+\operatorname{tr}(\hat{Y}^{T}\hat{X}\hat{Y}^{T}\hat{X})\geq 2\operatorname{tr}(\hat{X}^{T}\hat{X}\hat{Y}^{T}\hat{Y})\geq 2\sigma_{r}(\hat{X})^{2}\lVert\hat{Y}\rVert_{F}^{2}, (24)

in which the first inequality is due to

\operatorname{tr}(\hat{X}^{T}\hat{Y}\hat{X}^{T}\hat{Y})=\frac{1}{4}\operatorname{tr}((\hat{X}^{T}\hat{X}(I_{r}-RR^{T}))^{2})=\frac{1}{4}\operatorname{tr}((\hat{X}(I_{r}-RR^{T})\hat{X}^{T})^{2})\geq 0.
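
Here, the first equality uses \hat{X}^{T}\mathcal{P}_{\perp}=0, which together with (23) gives

\hat{X}^{T}\hat{Y}=\frac{1}{2}\hat{X}^{T}\hat{X}-\frac{1}{2}\hat{X}^{T}\hat{X}RR^{T}-\hat{X}^{T}\mathcal{P}_{\perp}ZR^{T}=\frac{1}{2}\hat{X}^{T}\hat{X}(I_{r}-RR^{T}),

and the final bound holds because \hat{X}(I_{r}-RR^{T})\hat{X}^{T} is symmetric, so the trace of its square is nonnegative.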

Assume first that Z_{\perp}=\mathcal{P}_{\perp}Z\neq 0; the other case will be handled at the end of this proof. In the case when Z_{\perp}\neq 0, we also have \hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}\neq 0. To see this, suppose the contrary. Then the inequality (24) and the assumption \sigma_{r}(\hat{X})>0 imply that \hat{Y}=0. Since the two terms \frac{1}{2}\hat{X}(I_{r}-RR^{T}) and \mathcal{P}_{\perp}ZR^{T} in the definition of \hat{Y} in (23) lie in the range of \hat{X} and in its orthogonal complement, respectively, \hat{Y}=0 gives rise to

\hat{X}-\hat{X}RR^{T}=0,\quad\mathcal{P}_{\perp}ZR^{T}=0.

The first equation above implies that R is invertible since \hat{X} has full column rank; substituting this into the second equation yields \mathcal{P}_{\perp}Z=0, which contradicts Z_{\perp}\neq 0. Now, define the unit vectors

\hat{u}_{1}=\frac{\hat{\mathbf{X}}\hat{y}}{\lVert\hat{\mathbf{X}}\hat{y}\rVert},\quad\hat{u}_{2}=\frac{\operatorname{vec}(Z_{\perp}Z_{\perp}^{T})}{\lVert Z_{\perp}Z_{\perp}^{T}\rVert_{F}}.

Then, \hat{u}_{1}\perp\hat{u}_{2} by the second identity above, and

\mathbf{e}=\lVert\mathbf{e}\rVert(\sqrt{1-\alpha^{2}}\hat{u}_{1}-\alpha\hat{u}_{2}) (25)

with

\alpha=\frac{\lVert Z_{\perp}Z_{\perp}^{T}\rVert_{F}}{\lVert\hat{X}\hat{X}^{T}-ZZ^{T}\rVert_{F}}. (26)
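
Indeed, since \mathcal{P}_{\perp}ZZ^{T}\mathcal{P}_{\perp}=Z_{\perp}Z_{\perp}^{T} and \hat{\mathbf{X}}\hat{y}=\operatorname{vec}(\hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}), the first identity after (23) gives \mathbf{e}=\hat{\mathbf{X}}\hat{y}-\operatorname{vec}(Z_{\perp}Z_{\perp}^{T}), and the second identity shows that these two terms are orthogonal. Hence,

\lVert\hat{\mathbf{X}}\hat{y}\rVert^{2}=\lVert\mathbf{e}\rVert^{2}-\lVert Z_{\perp}Z_{\perp}^{T}\rVert_{F}^{2}=\lVert\mathbf{e}\rVert^{2}(1-\alpha^{2}),

which yields (25).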

We first describe our choices of the dual variables W and y (which will be scaled later). Let

\hat{X}^{T}\hat{X}=QSQ^{T},\quad Z_{\perp}Z_{\perp}^{T}=PGP^{T},

with orthogonal matrices Q,P and diagonal matrices S,G such that S_{11}=\sigma_{r}(\hat{X})^{2} and, without loss of generality, the diagonal entries of G are sorted in decreasing order, so that G_{ii}=0 for every i>r because \operatorname{rank}(Z_{\perp}Z_{\perp}^{T})\leq r. Fix a constant \gamma\in[0,1] that is to be determined and define

V_{i}=k^{1/2}G_{ii}^{1/2}PE_{i1}Q^{T},\quad\forall i=1,\dots,r,
W=\sum_{i=1}^{r}\operatorname{vec}(V_{i})\operatorname{vec}(V_{i})^{T},\quad y=l\hat{y},

with \hat{y} defined in (23) and

k=\frac{\gamma}{\lVert\mathbf{e}\rVert\lVert Z_{\perp}Z_{\perp}^{T}\rVert_{F}},\quad l=\frac{\sqrt{1-\gamma^{2}}}{\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}\hat{y}\rVert}.

Here, E_{ij} is the elementary matrix of size n\times r whose (i,j)-entry is 1. By our construction, \hat{X}^{T}V_{i}=0: either G_{ii}=0 and hence V_{i}=0, or the i-th column of P is an eigenvector of Z_{\perp}Z_{\perp}^{T} with a positive eigenvalue and thus lies in the range of Z_{\perp}, which is orthogonal to the range of \hat{X}. This implies that

\langle\hat{\mathbf{X}}^{T}\hat{\mathbf{X}},W\rangle=\sum_{i=1}^{r}\lVert\hat{X}V_{i}^{T}+V_{i}\hat{X}^{T}\rVert_{F}^{2}=2\sum_{i=1}^{r}\operatorname{tr}(\hat{X}^{T}\hat{X}V_{i}^{T}V_{i})=2k\sigma_{r}(\hat{X})^{2}\sum_{i=1}^{r}G_{ii}=2\beta\gamma, (27)

with

\beta=\frac{\sigma_{r}(\hat{X})^{2}\operatorname{tr}(Z_{\perp}Z_{\perp}^{T})}{\lVert\hat{X}\hat{X}^{T}-ZZ^{T}\rVert_{F}\lVert Z_{\perp}Z_{\perp}^{T}\rVert_{F}}. (28)
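
The third equality in (27) can be seen from V_{i}^{T}V_{i}=kG_{ii}QE_{i1}^{T}E_{i1}Q^{T}, where E_{i1}^{T}E_{i1}\in\mathbb{R}^{r\times r} has a single nonzero entry 1 in position (1,1), so that

\operatorname{tr}(\hat{X}^{T}\hat{X}V_{i}^{T}V_{i})=kG_{ii}(Q^{T}\hat{X}^{T}\hat{X}Q)_{11}=kG_{ii}S_{11}=kG_{ii}\sigma_{r}(\hat{X})^{2},

and the last equality in (27) then follows from the definition of k, the identity \sum_{i=1}^{r}G_{ii}=\operatorname{tr}(Z_{\perp}Z_{\perp}^{T}), and the definition of \beta in (28) with \lVert\mathbf{e}\rVert=\lVert\hat{X}\hat{X}^{T}-ZZ^{T}\rVert_{F}.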

In addition, since \gamma\leq 1 and \operatorname{tr}(Z_{\perp}Z_{\perp}^{T})\leq\sqrt{r}\lVert Z_{\perp}Z_{\perp}^{T}\rVert_{F} (as Z_{\perp}Z_{\perp}^{T} has rank at most r),

\operatorname{tr}(W)=\sum_{i=1}^{r}\lVert V_{i}\rVert_{F}^{2}=k\sum_{i=1}^{r}G_{ii}=k\operatorname{tr}(Z_{\perp}Z_{\perp}^{T})\leq\frac{\sqrt{r}}{\lVert\mathbf{e}\rVert}, (29)

and

w=\sum_{i=1}^{r}\operatorname{vec}(W_{i,i})=\operatorname{vec}\Bigl(\sum_{i=1}^{r}V_{i}V_{i}^{T}\Bigr)=k\operatorname{vec}(Z_{\perp}Z_{\perp}^{T}).
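
The last equality holds because V_{i}V_{i}^{T}=kG_{ii}PE_{i1}E_{i1}^{T}P^{T}=kG_{ii}p_{i}p_{i}^{T}, where p_{i} is the i-th column of P, so that

\sum_{i=1}^{r}V_{i}V_{i}^{T}=k\sum_{i=1}^{r}G_{ii}p_{i}p_{i}^{T}=kPGP^{T}=kZ_{\perp}Z_{\perp}^{T},

in which the second equality uses G_{ii}=0 for i>r.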

Therefore,

\hat{\mathbf{X}}y-w=\frac{1}{\lVert\mathbf{e}\rVert}(\sqrt{1-\gamma^{2}}\hat{u}_{1}-\gamma\hat{u}_{2}),

which together with (25) implies that

\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}y-w\rVert=1,\quad\langle\mathbf{e},\hat{\mathbf{X}}y-w\rangle=\gamma\alpha+\sqrt{1-\gamma^{2}}\sqrt{1-\alpha^{2}}=\psi(\gamma). (30)

Next, the inequality (24) and the assumption on \sigma_{r}(\hat{X}) imply that

\epsilon\lVert y\rVert\leq\frac{\sqrt{1-\gamma^{2}}\epsilon}{\sqrt{2}\sigma_{r}(\hat{X})\lVert\mathbf{e}\rVert}\leq\frac{\sqrt{1+\delta}\epsilon}{\sqrt{2(\epsilon+\kappa)}\lVert\mathbf{e}\rVert}\leq\frac{\sqrt{\epsilon(1+\delta)}}{\sqrt{2}\lVert\mathbf{e}\rVert} (31)

and similarly

\kappa\lVert y\rVert\leq\frac{\sqrt{\kappa(1+\delta)}}{\sqrt{2}\lVert\mathbf{e}\rVert}. (32)
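
The first inequality in (31) holds since \lVert\hat{y}\rVert=\lVert\hat{Y}\rVert_{F} and (24) gives \lVert\hat{\mathbf{X}}\hat{y}\rVert=\lVert\hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}\rVert_{F}\geq\sqrt{2}\sigma_{r}(\hat{X})\lVert\hat{y}\rVert, so that

\lVert y\rVert=l\lVert\hat{y}\rVert=\frac{\sqrt{1-\gamma^{2}}\lVert\hat{y}\rVert}{\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}\hat{y}\rVert}\leq\frac{\sqrt{1-\gamma^{2}}}{\sqrt{2}\sigma_{r}(\hat{X})\lVert\mathbf{e}\rVert},

while the last inequality in (31) uses \epsilon/\sqrt{\epsilon+\kappa}\leq\sqrt{\epsilon}; the bound (32) is obtained in the same way.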

Define

M=(\hat{\mathbf{X}}y-w)\mathbf{e}^{T}+\mathbf{e}(\hat{\mathbf{X}}y-w)^{T},

and decompose M as M=[M]_{+}-[M]_{-}, in which both [M]_{+}\succeq 0 and [M]_{-}\succeq 0. Let \theta be the angle between \mathbf{e} and \hat{\mathbf{X}}y-w. By Lemma 4, we have

\operatorname{tr}([M]_{+})=\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}y-w\rVert(1+\cos\theta),\quad\operatorname{tr}([M]_{-})=\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}y-w\rVert(1-\cos\theta).
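
Combining this with (30) and \cos\theta=\langle\mathbf{e},\hat{\mathbf{X}}y-w\rangle/(\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}y-w\rVert)=\psi(\gamma), we obtain

\operatorname{tr}([M]_{+})=1+\psi(\gamma),\quad\operatorname{tr}([M]_{-})=1-\psi(\gamma).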

Now, one can verify that (U_{1}^{*},U_{2}^{*},W^{*},G^{*},\lambda^{*},y^{*}) defined as

U_{1}^{*}=\frac{[M]_{+}}{\operatorname{tr}([M]_{+})},\quad U_{2}^{*}=\frac{[M]_{-}}{\operatorname{tr}([M]_{+})},\quad y^{*}=\frac{y}{\operatorname{tr}([M]_{+})},
W^{*}=\frac{W}{\operatorname{tr}([M]_{+})},\quad\lambda^{*}=\frac{\lVert y^{*}\rVert}{2\lVert\hat{X}\rVert_{2}\epsilon+\kappa},\quad G^{*}=\frac{1}{\lambda^{*}}y^{*}y^{*T}

forms a feasible solution to the dual problem (22) whose objective value is equal to

\frac{\operatorname{tr}([M]_{-})+\langle\hat{\mathbf{X}}^{T}\hat{\mathbf{X}},W\rangle+(2\epsilon+\kappa)\operatorname{tr}(W)+2(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)\lVert y\rVert}{\operatorname{tr}([M]_{+})}.
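
In particular, the semidefinite constraint on (G^{*},\lambda^{*},y^{*}) is feasible by the Schur complement since \lambda^{*}G^{*}=y^{*}y^{*T}, and the last two terms of the dual objective contribute

(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)^{2}\lambda^{*}+\operatorname{tr}(G^{*})=(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)\lVert y^{*}\rVert+\frac{\lVert y^{*}\rVert^{2}}{\lambda^{*}}=\frac{2(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)\lVert y\rVert}{\operatorname{tr}([M]_{+})}.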

Substituting (27), (29), (30), (31) and (32) into this objective value, we obtain

\eta^{*}(\hat{X})\leq\frac{2\beta\gamma+1-\psi(\gamma)+((2\epsilon+\kappa)\sqrt{r}+\sqrt{2\kappa(1+\delta)}+2\sqrt{2\epsilon(1+\delta)}\lVert\hat{X}\rVert_{2})/\lVert\mathbf{e}\rVert}{1+\psi(\gamma)}
\leq\frac{2\beta\gamma+1-\psi(\gamma)}{1+\psi(\gamma)}+\frac{(2\epsilon+\kappa)\sqrt{r}+\sqrt{2\kappa(1+\delta)}+2\sqrt{2\epsilon(1+\delta)}\lVert\hat{X}\rVert_{2}}{\lVert\mathbf{e}\rVert}.

Choosing the best value of the parameter \gamma\in[0,1] to minimize the right-hand side of the above inequality leads to

\frac{2\beta\gamma+1-\psi(\gamma)}{1+\psi(\gamma)}\leq\eta_{0}(\hat{X}),

with

\eta_{0}(\hat{X}):=\begin{dcases}\frac{1-\sqrt{1-\alpha^{2}}}{1+\sqrt{1-\alpha^{2}}},&\text{if }\beta\geq\frac{\alpha}{1+\sqrt{1-\alpha^{2}}},\\ \frac{\beta(\alpha-\beta)}{1-\beta\alpha},&\text{if }\beta\leq\frac{\alpha}{1+\sqrt{1-\alpha^{2}}}.\end{dcases}
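
As a quick check of the first case, taking \gamma=0 gives \psi(0)=\sqrt{1-\alpha^{2}}, so

\frac{2\beta\gamma+1-\psi(\gamma)}{1+\psi(\gamma)}\Big|_{\gamma=0}=\frac{1-\sqrt{1-\alpha^{2}}}{1+\sqrt{1-\alpha^{2}}},

while the second case results from optimizing over \gamma>0 when \beta is small.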

Here, \alpha and \beta are defined in (26) and (28), respectively. In the proof of Theorem 1.2 in Zhang (2021), it is shown that \eta_{0}(\hat{X})\leq 1/3 for every \hat{X} with \hat{X}\hat{X}^{T}\neq ZZ^{T}, which gives the upper bound (21).

Finally, it remains to deal with the case when \mathcal{P}_{\perp}Z=0. In this case, the first identity after (23) reduces to \hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}=\hat{X}\hat{X}^{T}-ZZ^{T}, so \hat{\mathbf{X}}\hat{y}=\mathbf{e} with \hat{y} defined in (23). Then, it is straightforward to check that (U_{1}^{*},U_{2}^{*},W^{*},G^{*},\lambda^{*},y^{*}) defined as

U_{1}^{*}=\frac{\mathbf{e}\mathbf{e}^{T}}{\lVert\mathbf{e}\rVert^{2}},\quad U_{2}^{*}=0,\quad y^{*}=\frac{\hat{y}}{2\lVert\mathbf{e}\rVert^{2}},
W^{*}=0,\quad\lambda^{*}=\frac{\lVert y^{*}\rVert}{2\lVert\hat{X}\rVert_{2}\epsilon+\kappa},\quad G^{*}=\frac{1}{\lambda^{*}}y^{*}y^{*T}

forms a feasible solution to the dual problem (22) whose objective value is 2(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)\lVert y^{*}\rVert. By the inequality (24), we have

\eta^{*}(\hat{X})\leq 2(2\lVert\hat{X}\rVert_{2}\epsilon+\kappa)\lVert y^{*}\rVert\leq\frac{\kappa/\sqrt{2}+\sqrt{2}\epsilon\lVert\hat{X}\rVert_{2}}{\sigma_{r}(\hat{X})\lVert\mathbf{e}\rVert}\leq\frac{\sqrt{\kappa(1+\delta)/2}+\sqrt{2\epsilon(1+\delta)}\lVert\hat{X}\rVert_{2}}{\lVert\mathbf{e}\rVert}.
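
Here, the second inequality uses \lVert y^{*}\rVert=\lVert\hat{y}\rVert/(2\lVert\mathbf{e}\rVert^{2}) together with (24) applied to \hat{\mathbf{X}}\hat{y}=\mathbf{e}, namely

\lVert\mathbf{e}\rVert=\lVert\hat{X}\hat{Y}^{T}+\hat{Y}\hat{X}^{T}\rVert_{F}\geq\sqrt{2}\sigma_{r}(\hat{X})\lVert\hat{y}\rVert.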

Hence, the upper bound (21) still holds in this case. ∎

Proof of Theorem 3.

The proof of Theorem 3 is similar to the above proof of Lemma 3 in the situation with \kappa=0, and we only emphasize the differences here. In the case when \hat{X}\neq 0, after constructing the feasible solution to the dual problem (22), we have

\frac{1-\delta}{1+\delta}\leq\eta^{*}(\hat{X})\leq\frac{\operatorname{tr}([M]_{-})+\langle\hat{\mathbf{X}}^{T}\hat{\mathbf{X}},W\rangle+2\epsilon\operatorname{tr}(W)+4\lVert\hat{X}\rVert_{2}\epsilon\lVert y\rVert}{\operatorname{tr}([M]_{+})}. (33)

Note that in the rank-1 case, one can write \sigma_{r}(\hat{X})=\lVert\hat{X}\rVert_{2} and

\lVert y\rVert\leq\frac{\lVert\hat{y}\rVert}{\lVert\mathbf{e}\rVert\lVert\hat{\mathbf{X}}\hat{y}\rVert}\leq\frac{1}{\sqrt{2}\lVert\hat{X}\rVert_{2}\lVert\mathbf{e}\rVert},

in which the last inequality is due to (24). Substituting (27), (29), (30) and the above inequality into (33) and choosing an appropriate \gamma as shown in the proof of Lemma 3, we obtain

\frac{1-\delta}{1+\delta}\leq\eta^{*}(\hat{X})\leq\frac{2\beta\gamma+1-\psi(\gamma)+(2\epsilon+2\sqrt{2}\epsilon)/\lVert\mathbf{e}\rVert}{1+\psi(\gamma)}
\leq\frac{1}{3}+\frac{2\epsilon+2\sqrt{2}\epsilon}{\lVert\mathbf{e}\rVert},

which implies inequality (4) on the probabilistic event that \lVert q\rVert\leq\epsilon.

In the case when \hat{X}=0, the point (U_{1}^{*},U_{2}^{*},W^{*},G^{*},\lambda^{*},y^{*}) defined as

U_{1}^{*}=\frac{\mathbf{e}\mathbf{e}^{T}}{\lVert\mathbf{e}\rVert^{2}},\quad U_{2}^{*}=0,\quad y^{*}=0,
W^{*}=\frac{ZZ^{T}}{2\lVert\mathbf{e}\rVert^{2}},\quad\lambda^{*}=0,\quad G^{*}=0

forms a feasible solution to the dual problem (22), which shows that

\frac{1-\delta}{1+\delta}\leq\eta^{*}(\hat{X})\leq\frac{\epsilon}{\lVert\mathbf{e}\rVert}.
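
To verify the last bound, note that \hat{X}=0 gives \mathbf{e}=-\operatorname{vec}(ZZ^{T}), and since Z has rank one, \operatorname{tr}(ZZ^{T})=\lVert ZZ^{T}\rVert_{F}=\lVert\mathbf{e}\rVert. Hence, the dual objective reduces to

2\epsilon\operatorname{tr}(W^{*})=\frac{\epsilon\operatorname{tr}(ZZ^{T})}{\lVert\mathbf{e}\rVert^{2}}=\frac{\epsilon}{\lVert\mathbf{e}\rVert}.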

The inequality \frac{1-\delta}{1+\delta}\leq\epsilon/\lVert\mathbf{e}\rVert also implies inequality (4) on the probabilistic event that \lVert q\rVert\leq\epsilon. ∎

Appendix C Additional Numerical Illustration

(a) The upper bound derived from inequality (3a), m=20, n=60.
(b) The upper bound derived from inequality (3b), m=20, n=60.
(c) The upper bound derived from inequality (3a), m=30, n=60.
(d) The upper bound derived from inequality (3b), m=30, n=60.
(e) The upper bound derived from inequality (3a), m=20, n=90.
(f) The upper bound derived from inequality (3b), m=20, n=90.
Figure 4: Comparison of the upper bounds given by Theorem 2 for the distance \|\hat{X}\hat{X}^{T}-M^{*}\|_{F} with \hat{X} being an arbitrary local minimizer under varying values of problem parameters m and n.