

How Many Samples is a Good Initial Point Worth in Low-rank Matrix Recovery?

Gavin Zhang
Department of Electrical and Computer Engineering
University of Illinois Urbana-Champaign
Champaign, IL 61820
[email protected]

Richard Y. Zhang
Department of Electrical and Computer Engineering
University of Illinois Urbana-Champaign
Champaign, IL 61820
[email protected]
Abstract

Given a sufficiently large amount of labeled data, the non-convex low-rank matrix recovery problem contains no spurious local minima, so a local optimization algorithm is guaranteed to converge to a global minimum starting from any initial guess. However, the actual amount of data needed by this theoretical guarantee is very pessimistic, as it must prevent spurious local minima from existing anywhere, including at adversarial locations. In contrast, prior work based on good initial guesses has more realistic data requirements, because it allows spurious local minima to exist outside of a neighborhood of the solution. In this paper, we quantify the relationship between the quality of the initial guess and the corresponding reduction in data requirements. Using the restricted isometry constant as a surrogate for sample complexity, we compute a sharp “threshold” number of samples needed to prevent each specific point on the optimization landscape from becoming a spurious local minimum. Optimizing the threshold over regions of the landscape, we see that for initial points around the ground truth, a linear improvement in the quality of the initial guess amounts to a constant factor improvement in the sample complexity.

1 Introduction

A perennial challenge in non-convex optimization is the possible existence of bad or spurious critical points and local minima, which can cause a local optimization algorithm like gradient descent to slow down or get stuck. Several recent lines of work showed that the effects of non-convexity can be tamed through a large amount of diverse and high quality training data [17, 1, 9, 3, 18, 12]. Concretely, these authors showed that, for classes of problems based on random sampling, spurious critical points and local minima become progressively less likely to exist with the addition of each new sample. After a sufficiently large number of samples, all spurious local minima are eliminated, so any local optimization algorithm is guaranteed to converge to the globally optimal solution starting from an arbitrary, possibly random initial guess.

This notion of a global guarantee—one that is valid starting from any initial point—is considerably stronger than what is needed for empirical success to be observed [8]. For example, the existence of a spurious local minimum may not pose an issue if gradient descent does not converge towards it. However, a theoretical guarantee is no longer possible, as starting the algorithm from the spurious local minimum would result in failure [22]. As a consequence, these global guarantees tend to be pessimistic, because the number of samples must be sufficiently large to eliminate spurious local minima everywhere, even at adversarial locations. By contrast, the weaker notion of a local guarantee [11, 10, 15, 19, 5, 7, 20, 13]—one that is valid only for a specified set of initial points—is naturally less conservative, as it allows spurious local minima to exist outside of the specified set.

In this paper, we provide a unifying view between the notions of the global and local guarantees by quantifying the relationship between the sample complexity and the quality of the initial point. We restrict our attention to the matrix sensing problem, which seeks to recover a rank-$r$ positive semidefinite matrix $M^{*}=ZZ^{T}\in\mathbb{R}^{n\times n}$ with $Z\in\mathbb{R}^{n\times r}$ from $m$ sub-Gaussian linear measurements of the form

b\equiv\mathcal{A}(ZZ^{T})\equiv\left[\left\langle A_{1},M^{*}\right\rangle\quad\cdots\quad\left\langle A_{m},M^{*}\right\rangle\right]^{T} (1)

by solving the following non-convex optimization problem:

\min_{X\in\mathbb{R}^{n\times r}}f_{\mathcal{A}}(X)\equiv\left\|\mathcal{A}\left(XX^{T}-ZZ^{T}\right)\right\|^{2}=\sum_{i=1}^{m}\left(\left\langle A_{i},XX^{T}\right\rangle-b_{i}\right)^{2}. (2)
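Problem (2) is typically attacked with simple first-order methods. The sketch below is an illustration only; the Gaussian measurement matrices, step size, and random initialization are arbitrary choices made for this note, not the settings analyzed in the paper.

import numpy as np

rng = np.random.default_rng(0)
n, r, m = 20, 2, 400                              # dimensions and number of measurements

Z = rng.standard_normal((n, r))                   # ground truth factor, M* = Z Z^T
A = rng.standard_normal((m, n, n)) / np.sqrt(m)   # Gaussian (sub-Gaussian) measurement matrices
b = np.einsum('kij,ij->k', A, Z @ Z.T)            # b_k = <A_k, Z Z^T>

def f(X):
    """Objective f_A(X) = sum_k (<A_k, X X^T> - b_k)^2."""
    resid = np.einsum('kij,ij->k', A, X @ X.T) - b
    return np.sum(resid**2)

def grad_f(X):
    """Gradient of f_A at X: 2 * sum_k resid_k * (A_k + A_k^T) X."""
    resid = np.einsum('kij,ij->k', A, X @ X.T) - b
    S = np.einsum('k,kij->ij', resid, A)
    return 2.0 * (S + S.T) @ X

X = rng.standard_normal((n, r))                   # arbitrary initial point
step = 1e-3
for _ in range(2000):
    X -= step * grad_f(X)

print("final relative error:",
      np.linalg.norm(X @ X.T - Z @ Z.T) / np.linalg.norm(Z @ Z.T))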

We characterize a sharp “threshold” on the number of samples $m$ needed to prevent each specific point on the optimization landscape from becoming a spurious local minimum. While this threshold is difficult to compute exactly, we derive a closed-form lower-bound based on spurious critical points, and show that it constitutes a sharp lower-bound on the original threshold of interest. The lower-bound reveals a simple geometric relationship: a point $X$ is more likely to be a spurious local minimum if the column spaces of $X$ and $Z$ are close to orthogonal. Optimizing the closed-form lower-bound over regions of the landscape, we show that for initial points close to the ground truth, a constant factor improvement of the initial point amounts to a constant factor reduction in the number of samples needed to guarantee recovery.

2 Related Work

Local Guarantees. The earliest work on exact guarantees for non-convex optimization focused on generating a good initial guess within a local region of attraction. For instance, in [21, 24], the authors showed that when $\mathcal{A}$ satisfies $(\delta,6r)$-RIP with a constant $\delta\leq 1/10$, and there exists an initial point sufficiently close to the ground truth, then gradient descent starting from this initial point converges at a linear rate. The typical strategy for finding such an initial point is spectral initialization [11, 10, 21, 19, 5, 14, 6]: using the singular value decomposition of a surrogate matrix to find low-rank factors that are close to the ground truth.

In this paper, we focus on the trade-off between the quality of an initial point and the number of samples needed to prevent the existence of spurious local minima, while sidestepping the question of how the initial point is found. We note, however, that the number of samples needed to find an $\epsilon$-good initial guess (e.g. via spectral initialization) forms an interesting secondary trade-off. It remains future work to study the interaction between these two trade-offs.

Global Guarantees. Recent work focused on establishing a global guarantee that is independent of the initial guess [17, 1, 9, 3, 18, 12]. For our purposes, Bhojanapalli et al. [2] showed that RIP with $\delta_{2r}<1/5$ eliminates all spurious local minima, while Zhang et al. [23] refined this to $\delta_{2r}<1/2$ in the rank-1 case, and showed that this threshold is both necessary and sufficient. This paper is inspired by proof techniques in the latter paper; an important contribution of our paper is generalizing their rank-1 techniques to accommodate matrices of arbitrary rank.

3 Our Approach: Threshold RIP Constant

Previous work that studied the global optimization landscape of problem (2) typically relied on the restricted isometry property (RIP) of $\mathcal{A}$. It is now well-known that if the measurement operator $\mathcal{A}$ satisfies the restricted isometry property with a sufficiently small constant $\delta<1/5$, then problem (2) contains no spurious local minima; see Bhojanapalli et al. [2].

Definition 1 ($\delta$-RIP).

Let $\mathcal{A}:\mathbb{R}^{n\times n}\to\mathbb{R}^{m}$ be a linear measurement operator. We say that $\mathcal{A}$ satisfies the $\delta$-restricted isometry property (or simply $\delta$-RIP) if it satisfies the following inequality

(1-\delta)\|M\|_{F}^{2}\leq\|\mathcal{A}(M)\|^{2}\leq(1+\delta)\|M\|_{F}^{2}\qquad\forall M\in\mathcal{M}_{2r}

where $\mathcal{M}_{2r}=\{X\in\mathbb{R}^{n\times n}:\mathrm{rank}(X)\leq 2r\}$ denotes the set of matrices with rank at most $2r$. The RIP constant of $\mathcal{A}$ is the smallest value of $\delta$ such that the inequality above holds.

Let $\delta\in[0,1)$ denote the RIP constant of $\mathcal{A}$. It is helpful to view $\delta$ as a surrogate for the number of measurements $m\geq 0$, with a large value of $\delta$ corresponding to a smaller value of $m$ and vice versa. For a wide range of sub-Gaussian measurement ensembles, if $m\geq C_{0}nr/\delta^{2}$ where $C_{0}$ is an absolute constant, then $\mathcal{A}$ satisfies $\delta$-RIP with high probability [4, 16].
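To get a feel for $\delta$ as a surrogate for $m$, the following sketch builds a Gaussian measurement ensemble and probes its isometry ratio on random matrices of rank at most $2r$. The dimensions are arbitrary, and the Monte Carlo probe only gives a lower estimate of the true RIP constant (certifying $\delta$-RIP exactly is believed to be intractable in general).

import numpy as np

rng = np.random.default_rng(1)
n, r, m = 20, 2, 600

# Gaussian measurement ensemble, scaled so that E||A(M)||^2 = ||M||_F^2
A = rng.standard_normal((m, n, n)) / np.sqrt(m)

def A_op(M):
    """Apply the linear measurement operator A to a matrix M."""
    return np.einsum('kij,ij->k', A, M)

# Probe the isometry ratio ||A(M)||^2 / ||M||_F^2 over random matrices of
# rank at most 2r.  This only gives a lower estimate of the RIP constant.
ratios = []
for _ in range(2000):
    U = rng.standard_normal((n, 2 * r))
    V = rng.standard_normal((n, 2 * r))
    M = U @ V.T                        # random matrix of rank <= 2r
    ratios.append(np.sum(A_op(M) ** 2) / np.linalg.norm(M, 'fro') ** 2)

ratios = np.array(ratios)
delta_est = max(1 - ratios.min(), ratios.max() - 1)
print("empirical isometry ratios in [%.3f, %.3f], so delta >= %.3f"
      % (ratios.min(), ratios.max(), delta_est))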

Take $X\in\mathbb{R}^{n\times r}$ to be a spurious point, meaning that $XX^{T}\neq ZZ^{T}$. Our approach in this paper is to define a threshold number of measurements that would be needed to prevent $X$ from becoming a local minimum of problem (2). Viewing the RIP constant $\delta$ as a surrogate for the number of measurements $m$, we follow a construction of Zhang et al. [23], and instead define a threshold $\delta_{\mathrm{soc}}(X)$ on the RIP constant $\delta$ that would prevent $X$ from becoming a local minimum of problem (2). Such a construction must necessarily take into account all choices of $\mathcal{A}$ satisfying $\delta$-RIP, including those that adversarially target $X$, bending the optimization landscape into forming a region of convergence around the point. On the other hand, such adversarial choices of $\mathcal{A}$ must necessarily be defeated for a sufficiently small threshold on $\delta$, as we already know that spurious local minima cannot exist for $\delta<1/5$. The statement below makes this idea precise, and also extends it to a set of spurious points.

Definition 2 (Threshold for second-order condition).

Fix $Z\in\mathbb{R}^{n\times r}$. For $X\in\mathbb{R}^{n\times r}$, if $XX^{T}=ZZ^{T}$, then define $\delta_{\mathrm{soc}}(X)=1$. Otherwise, if $XX^{T}\neq ZZ^{T}$, then define

\delta_{\mathrm{soc}}(X)\equiv\min_{\mathcal{A}}\{\delta:\nabla f_{\mathcal{A}}(X)=0,\quad\nabla^{2}f_{\mathcal{A}}(X)\succeq 0,\quad\mathcal{A}\text{ satisfies }\delta\text{-RIP}\} (3)

where the minimum is taken over all linear measurements $\mathcal{A}:\mathbb{R}^{n\times n}\to\mathbb{R}^{m}$. For $\mathcal{W}\subseteq\mathbb{R}^{n\times r}$, define $\delta_{\mathrm{soc}}(\mathcal{W})=\inf_{X\in\mathcal{W}}\delta_{\mathrm{soc}}(X)$.

If $\delta<\delta_{\mathrm{soc}}(X)$, then $X$ cannot be a spurious local minimum by construction, or it would contradict the definition of $\delta_{\mathrm{soc}}(X)$ as the minimum value. By the same logic, if $\delta<\delta_{\mathrm{soc}}(\mathcal{W})$, then no choice of $X\in\mathcal{W}$ can be a spurious local minimum. In particular, it follows that $\delta_{\mathrm{soc}}(\mathbb{R}^{n\times r})$ is the usual global RIP threshold: if $\mathcal{A}$ satisfies $\delta$-RIP with $\delta<\delta_{\mathrm{soc}}(\mathbb{R}^{n\times r})$, then $f_{\mathcal{A}}(X)$ is guaranteed to admit no spurious local minima. Starting a local optimization algorithm from any initial point guarantees exact recovery of an $X$ satisfying $XX^{T}=ZZ^{T}$.

Now, suppose we are given an initial point $X_{0}$. It is natural to measure the quality of $X_{0}$ by its relative error, as in $\varepsilon=\|X_{0}X_{0}^{T}-ZZ^{T}\|_{F}/\|ZZ^{T}\|_{F}$. If we define an $\varepsilon$-neighborhood of all points with relative error at most $\varepsilon$,

\mathcal{B}_{\varepsilon}=\{X\in\mathbb{R}^{n\times r}:\|XX^{T}-ZZ^{T}\|_{F}\leq\varepsilon\|ZZ^{T}\|_{F}\} (4)

then it follows that $\delta_{\mathrm{soc}}(\mathcal{B}_{\varepsilon})$ is an analogous local RIP threshold: if $\mathcal{A}$ satisfies $\delta$-RIP with $\delta<\delta_{\mathrm{soc}}(\mathcal{B}_{\varepsilon})$, then $f_{\mathcal{A}}(X)$ is guaranteed to admit no spurious local minima over all $X\in\mathcal{B}_{\varepsilon}$. Starting a local optimization algorithm from the initial point $X_{0}$ guarantees either exact recovery of an $X$ satisfying $XX^{T}=ZZ^{T}$, or termination at a strictly worse point $X$ with $\|XX^{T}-ZZ^{T}\|_{F}>\|X_{0}X_{0}^{T}-ZZ^{T}\|_{F}$. Imposing further restrictions on the algorithm prevents the latter scenario from occurring (local strong convexity with gradient descent [19], strict decrements in the level sets [10, 23, 8]), and so exact recovery is guaranteed.

The numerical difference between the global threshold $\delta_{\mathrm{soc}}(\mathbb{R}^{n\times r})$ and the local threshold $\delta_{\mathrm{soc}}(\mathcal{B}_{\varepsilon})$ is precisely the number of samples that an $\varepsilon$-quality initial point $X_{0}$ is worth, up to some conversion factor. But two major difficulties remain in this line of reasoning. First, evaluating $\delta_{\mathrm{soc}}(X)$ for some $X\in\mathbb{R}^{n\times r}$ requires solving a minimization problem over the set of $\delta$-RIP operators. Second, evaluating $\delta_{\mathrm{soc}}(\mathcal{B}_{\varepsilon})$ in turn requires minimizing $\delta_{\mathrm{soc}}(X)$ over all choices of $X$ within an $\varepsilon$-neighborhood. Regarding the first point, Zhang et al. [23] showed that $\delta_{\mathrm{soc}}(X)$ is the optimal value of a convex optimization problem, and can therefore be evaluated to arbitrary precision using a numerical algorithm. In the rank-1 case, they solved this convex optimization in closed-form, and used it to optimize over all $X\in\mathcal{B}_{\varepsilon}$. Their closed-form solution spanned 9 journal pages, and invoked a number of properties specific to the rank-1 case (for example, $xy^{T}+yx^{T}=0$ implies $x=0$ or $y=0$, but $XY^{T}+YX^{T}=0$ may hold with $X\neq 0$ and $Y\neq 0$). The authors noted that a similar closed-form solution for the general rank-$r$ case appeared exceedingly difficult. While the overall proof technique is sharp and descriptive, its applicability appears to be entirely limited to the rank-1 case.

4 Main results

In this paper, we bypass the difficulty of deriving a closed-form solution for $\delta_{\mathrm{soc}}(X)$ altogether by adopting a sharp lower-bound. This is based on two key insights. First, a spurious local minimum must also be a spurious critical point, so the analogous threshold over critical points would give an obvious lower-bound $\delta_{\mathrm{foc}}(X)\leq\delta_{\mathrm{soc}}(X)$.

Definition 3 (Threshold for first-order condition).

Fix $Z\in\mathbb{R}^{n\times r}$. For $X\in\mathbb{R}^{n\times r}$, if $XX^{T}=ZZ^{T}$, then define $\delta_{\mathrm{foc}}(X)=1$. Otherwise, if $XX^{T}\neq ZZ^{T}$, then define

\delta_{\mathrm{foc}}(X)\equiv\min_{\mathcal{A}}\{\delta:\nabla f_{\mathcal{A}}(X)=0,\quad\mathcal{A}\text{ satisfies }\delta\text{-RIP}\}, (5)

where the minimum is taken over all linear measurements $\mathcal{A}:\mathbb{R}^{n\times n}\to\mathbb{R}^{m}$. For $\mathcal{W}\subseteq\mathbb{R}^{n\times r}$, define $\delta_{\mathrm{foc}}(\mathcal{W})=\inf_{X\in\mathcal{W}}\delta_{\mathrm{foc}}(X)$.

Whereas the main obstacle in Zhang et al. [23] is the considerable difficulty of deriving a closed-form solution for $\delta_{\mathrm{soc}}(X)$, we show in this paper that it is relatively straightforward to solve for $\delta_{\mathrm{foc}}(X)$ in closed-form, resulting in a simple, geometric solution.

Theorem 4.

Fix $Z\in\mathbb{R}^{n\times r}$. Given $\mathcal{A}$ satisfying $\delta$-RIP and $X\in\mathbb{R}^{n\times r}$ such that $XX^{T}\neq ZZ^{T}$, we have $\delta_{\mathrm{foc}}(X)=\cos\theta$, where

\sin\theta=\|Z^{T}(I-XX^{\dagger})Z\|_{F}\;\big{/}\;\|XX^{T}-ZZ^{T}\|_{F} (6)

and $X^{\dagger}$ denotes the pseudo-inverse of $X$. It follows that if $\delta<\cos\theta$, then $X$ is not a spurious critical point of $f_{\mathcal{A}}(X)$. If $\delta\geq\cos\theta$, then there exists some $\mathcal{A}^{\star}$ satisfying $\cos\theta$-RIP such that $\nabla f_{\mathcal{A}^{\star}}(X)=0$.

The complete proof of Theorem 4 is given in Appendix A, and a sketch is given in Section 5. There is a nice geometric interpretation: the exact value of $\delta_{\mathrm{foc}}(X)$ depends largely on the incidence angle between the column space of $X$ and the column space of $Z$. When the angle between $XX^{T}$ and $ZZ^{T}$ becomes small, the projection of $XX^{T}$ onto $ZZ^{T}$ becomes large. As a result, $\sin\theta$ becomes small and $\cos\theta$ becomes large. Therefore, Theorem 4 says that in regions where $XX^{T}$ and $ZZ^{T}$ are more aligned, fewer samples are required to prevent $X$ from becoming a spurious critical point. In regions where $XX^{T}$ and $ZZ^{T}$ are more orthogonal, a larger sample complexity is needed. Indeed, these are precisely the adversarial locations at which a large number of samples is required to prevent spurious local minima from appearing.
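The closed form in Theorem 4 is straightforward to evaluate numerically. A minimal sketch (the dimensions and the test point are arbitrary choices for illustration):

import numpy as np

def delta_foc(X, Z):
    """Closed-form threshold of Theorem 4: delta_foc(X) = cos(theta), where
    sin(theta) = ||Z^T (I - X X^+) Z||_F / ||X X^T - Z Z^T||_F."""
    n = X.shape[0]
    P = X @ np.linalg.pinv(X)                   # orthogonal projector onto range(X)
    num = np.linalg.norm(Z.T @ (np.eye(n) - P) @ Z, 'fro')
    den = np.linalg.norm(X @ X.T - Z @ Z.T, 'fro')
    sin_theta = num / den
    return np.sqrt(max(0.0, 1.0 - sin_theta**2))

rng = np.random.default_rng(2)
n, r = 20, 2
Z = rng.standard_normal((n, r))
X = Z + 0.1 * rng.standard_normal((n, r))       # a point near the ground truth
print("delta_foc(X) =", delta_foc(X, Z))        # close to 1 when X is well aligned with Z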

Figure 1: This paper is motivated by two key insights. First, it is relatively straightforward to solve for $\delta_{\mathrm{foc}}(X)$ in closed-form (Theorem 4). Second, the resulting lower-bound $\delta_{\mathrm{soc}}(X)\geq\max\{\delta_{\mathrm{foc}}(X),\delta^{*}\}$ (with $\delta^{*}=1/2$ for rank 1 and $\delta^{*}=1/5$ for rank $>1$) is remarkably tight. This means that $\max\{\delta_{\mathrm{foc}}(\mathcal{B}_{\varepsilon}),\delta^{*}\}$ is a tight lower bound on $\delta_{\mathrm{soc}}(\mathcal{B}_{\varepsilon})$.

The lower-bound $\delta_{\mathrm{foc}}(X)\leq\delta_{\mathrm{soc}}(X)$ appears conservative, because critical points should be much more ubiquitous than local minima over a non-convex landscape. In particular, observe that $\delta_{\mathrm{foc}}(X)=\cos\theta\to 0$ as $X\to 0$, which makes sense because $X=0$ is a saddle point for all choices of $\mathcal{A}$. In other words, for any region $\mathcal{W}$ that contains $0$, the lower-bound becomes trivial, as in $\delta_{\mathrm{foc}}(\mathcal{W})=0<\delta_{\mathrm{soc}}(\mathcal{W})$. Our second insight here is that we must simultaneously have $\delta_{\mathrm{soc}}(X)\geq 1/5$ due to the global threshold of Bhojanapalli et al. [2] (or $\delta_{\mathrm{soc}}(X)\geq 1/2$ in the rank-1 case due to Zhang et al. [23]). Extending this idea over sets yields the following lower-bound

\delta_{\mathrm{soc}}(\mathcal{W})\geq\max\{\delta_{\mathrm{foc}}(\mathcal{W}),\delta^{*}\}\qquad\text{for all }\mathcal{W}\subseteq\mathbb{R}^{n\times r}, (7)

where $\delta^{*}=1/2$ for $r=1$ and $\delta^{*}=1/5$ for $r>1$. This bound is remarkably tight, as shown in Figure 1 for $\mathcal{W}=\mathcal{B}_{\varepsilon}$ over a range of $\varepsilon$. Explicitly solving the optimization $\delta_{\mathrm{foc}}(\mathcal{B}_{\varepsilon})=\inf_{X\in\mathcal{B}_{\varepsilon}}\delta_{\mathrm{foc}}(X)$ using Theorem 4 and substituting into (7) yields the following theorem, in which we denote $[x]_{+}=\max\{0,x\}$.

Theorem 5.

Let $\mathcal{A}$ satisfy $\delta$-RIP. Then we have $\delta_{\mathrm{foc}}(\mathcal{B}_{\varepsilon})>\sqrt{1-C\varepsilon}$ for all $\varepsilon\leq 1/C$, where $C=\|ZZ^{T}\|_{F}/\sigma_{\min}^{2}(Z)$. Hence, if

\delta<\max\left\{\sqrt{[1-C\varepsilon]_{+}},\delta^{*}\right\} (8)

where $\delta^{*}=1/2$ if $r=1$ and $\delta^{*}=1/5$ if $r>1$, then $f_{\mathcal{A}}(X)$ has no spurious critical point within an $\varepsilon$-neighborhood of the solution:

\nabla f_{\mathcal{A}}(X)=0,\quad\|XX^{T}-ZZ^{T}\|_{F}\leq\varepsilon\|ZZ^{T}\|_{F}\quad\Longleftrightarrow\quad XX^{T}=ZZ^{T}. (9)

The complete proof of this theorem is given in Appendix B. Theorem 5 says that the number of samples needed to eliminate spurious critical points within an $\varepsilon$-neighborhood of the solution decreases dramatically as $\varepsilon$ becomes small. Given that $m\geq C_{0}nr/\delta^{2}$ sub-Gaussian measurements suffice to attain $\delta$-RIP with high probability, we can translate Theorem 5 into the following sample complexity bound.
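For concreteness, the threshold on the right-hand side of (8) can be tabulated for a given ground truth and neighborhood size; a small sketch (the factor matrix below is an arbitrary example, and $\delta^{*}$ follows the rank-based case split in the theorem):

import numpy as np

def local_rip_threshold(Z, eps):
    """Right-hand side of (8): max{ sqrt([1 - C*eps]_+), delta_star }, where
    C = ||Z Z^T||_F / sigma_min(Z)^2."""
    r = Z.shape[1]
    C = np.linalg.norm(Z @ Z.T, 'fro') / np.linalg.svd(Z, compute_uv=False)[-1] ** 2
    delta_star = 0.5 if r == 1 else 0.2
    return max(np.sqrt(max(0.0, 1.0 - C * eps)), delta_star)

rng = np.random.default_rng(3)
Z = rng.standard_normal((20, 2))
for eps in [0.5, 0.1, 0.01, 0.001]:
    print("eps = %.3f  ->  delta threshold = %.3f" % (eps, local_rip_threshold(Z, eps)))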

Corollary 6.

Let $\mathcal{A}:\mathbb{R}^{n\times n}\to\mathbb{R}^{m}$ be a sub-Gaussian measurement ensemble. If

m\geq\min\left\{\frac{1}{[1-C\varepsilon]_{+}},25\right\}C_{0}nr

then with high probability there are no spurious local minima within $\mathcal{B}_{\varepsilon}$.

The proof of Corollary 6 follows immediately from Theorem 5, combined with the direct relationship between the RIP property and the sample complexity of sub-Gaussian measurement ensembles. We see that the relationship between the quality of the initial point and the number of samples saved is essentially linear: improving the quality of the initial point by a linear factor corresponds to a linear decrease in sample complexity. Moreover, the rate of improvement depends on the constant $C$. This shows that in the non-convex setup of matrix sensing, there is a significant difference between a good initial point and a mediocre initial point. In the case that $C=\|ZZ^{T}\|_{F}/\sigma_{\min}^{2}(Z)$ is large, this difference is even more pronounced.
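A hypothetical back-of-the-envelope evaluation of the bound in Corollary 6 (the absolute constant $C_{0}$ is not specified by the theory, so the sketch reports the bound as a multiple of $C_{0}$):

import numpy as np

def sample_bound_over_C0(Z, eps):
    """Corollary 6: m >= min{ 1/[1 - C*eps]_+, 25 } * C0 * n * r, where
    C = ||Z Z^T||_F / sigma_min(Z)^2 and C0 is an unspecified absolute constant.
    Returns the bound divided by C0."""
    n, r = Z.shape
    C = np.linalg.norm(Z @ Z.T, 'fro') / np.linalg.svd(Z, compute_uv=False)[-1] ** 2
    gap = max(0.0, 1.0 - C * eps)
    factor = 25.0 if gap == 0.0 else min(1.0 / gap, 25.0)
    return factor * n * r

rng = np.random.default_rng(4)
Z = rng.standard_normal((50, 3))
for eps in [1.0, 0.1, 0.01]:
    print("eps = %.2f  ->  m >= %.1f * C0" % (eps, sample_bound_over_C0(Z, eps)))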

5 Proof of Main Results

5.1 Notation and Definitions

We use $\|\cdot\|$ for the vector 2-norm and $\|\cdot\|_{F}$ for the Frobenius norm of a matrix. For two square matrices $A$ and $B$, $A\succeq B$ means $A-B$ is positive semidefinite. The trace of a square matrix $A$ is denoted by $\mathrm{tr}(A)$. The vectorization $\mathrm{vec}(A)$ is the vector obtained by stacking the columns of $A$. Let $\mathcal{A}:\mathbb{R}^{n\times n}\to\mathbb{R}^{m}$ be a linear measurement operator, and let $Z\in\mathbb{R}^{n\times r}$ be a fixed ground truth matrix. We define $\mathbf{A}=[\mathrm{vec}(A_{1}),\ldots,\mathrm{vec}(A_{m})]^{T}$ as the matrix representation of $\mathcal{A}$, and note that $\mathrm{vec}[\mathcal{A}(X)]=\mathbf{A}\,\mathrm{vec}(X)$. We define the error vector $\mathbf{e}$ and its Jacobian $\mathbf{X}$ to satisfy

\mathbf{e}=\mathrm{vec}\,(XX^{T}-ZZ^{T}) (10a)
\mathbf{X}\,\mathrm{vec}\,(Y)=\mathrm{vec}\,(XY^{T}+YX^{T})\qquad\text{for all }Y\in\mathbb{R}^{n\times r}. (10b)
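For concreteness, the matrix $\mathbf{X}\in\mathbb{R}^{n^{2}\times nr}$ in (10b) can be assembled column by column from the defining identity; a minimal sketch (a generic construction written for this note, not code from the paper):

import numpy as np

def error_vec(X, Z):
    """e = vec(X X^T - Z Z^T), stacking columns (Fortran order)."""
    return (X @ X.T - Z @ Z.T).flatten(order='F')

def jacobian_mat(X):
    """The n^2-by-(n r) matrix X_bold with X_bold vec(Y) = vec(X Y^T + Y X^T)."""
    n, r = X.shape
    cols = []
    for j in range(n * r):
        Y = np.zeros((n, r))
        Y[np.unravel_index(j, (n, r), order='F')] = 1.0   # j-th basis matrix
        cols.append((X @ Y.T + Y @ X.T).flatten(order='F'))
    return np.column_stack(cols)

# quick check of the defining identity (10b) on a random Y
rng = np.random.default_rng(5)
n, r = 6, 2
X, Y = rng.standard_normal((n, r)), rng.standard_normal((n, r))
Xb = jacobian_mat(X)
assert np.allclose(Xb @ Y.flatten(order='F'),
                   (X @ Y.T + Y @ X.T).flatten(order='F'))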

5.2 Proof Sketch of Theorem 4

A complete proof of Theorem 4 relies on a few technical lemmas, so we defer the complete proof to Appendix A. The key insight is that $\delta_{\mathrm{foc}}(X)$ is the solution to a convex optimization problem, which we can solve in closed-form. At first sight, evaluating $\delta_{\mathrm{foc}}(X)$ seems very difficult, as it involves solving an optimization problem over the set of $\delta$-RIP operators, as defined in equation (5). However, a minor modification of Theorem 8 in Zhang et al. [23] shows that $\delta_{\mathrm{foc}}(X)$ can be reformulated as a convex optimization problem of the form

\eta(X)\quad\equiv\quad\max_{\eta,\mathbf{H}}\left\{\eta\quad:\quad\mathbf{X}^{T}\mathbf{H}\mathbf{e}=0,\quad\eta I\preceq\mathbf{H}\preceq I\right\} (11)

where $\eta(X)$ is related to $\delta_{\mathrm{foc}}(X)$ by

\delta_{\mathrm{foc}}(X)=\frac{1-\eta(X)}{1+\eta(X)}. (12)
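In principle, (11) can also be handed directly to a semidefinite programming solver, which provides a useful cross-check of the closed form on small instances. A sketch using the CVXPY modeling package is given below (assumptions: CVXPY with a default SDP-capable solver is installed; the $n^{2}\times n^{2}$ variable limits this to very small $n$).

import numpy as np
import cvxpy as cp

def delta_foc_sdp(X, Z):
    """Solve (11) numerically for eta(X), then return delta_foc via (12)."""
    n, r = X.shape
    N = n * n
    e = (X @ X.T - Z @ Z.T).flatten(order='F')             # error vector (10a)
    cols = []
    for j in range(n * r):                                  # Jacobian matrix (10b)
        Y = np.zeros((n, r))
        Y[np.unravel_index(j, (n, r), order='F')] = 1.0
        cols.append((X @ Y.T + Y @ X.T).flatten(order='F'))
    Xb = np.column_stack(cols)

    H = cp.Variable((N, N), symmetric=True)
    eta = cp.Variable()
    constraints = [Xb.T @ H @ e == 0,
                   H >> eta * np.eye(N),
                   H << np.eye(N)]
    cp.Problem(cp.Maximize(eta), constraints).solve()
    return (1 - eta.value) / (1 + eta.value)

rng = np.random.default_rng(6)
Z = rng.standard_normal((4, 1))
X = rng.standard_normal((4, 1))
print("delta_foc via SDP:", delta_foc_sdp(X, Z))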

We will show that problem (11) actually has a simple closed-form solution. First, we write its Lagrangian dual as

\begin{aligned}
\underset{y,U_{1},U_{2}}{\text{minimize}}\quad & \mathrm{tr}(U_{2}) \\
\text{subject to}\quad & (\mathbf{X}y)\mathbf{e}^{T}+\mathbf{e}(\mathbf{X}y)^{T}=U_{1}-U_{2} \\
& \mathrm{tr}(U_{1})=1,\quad U_{1},U_{2}\succeq 0.
\end{aligned} (13)

Notice that strong duality holds because Slater’s condition is trivially satisfied by the dual: $y=0$ and $U_{1}=U_{2}=2I/(n(n+1))$ is a strictly feasible point. It turns out that the dual problem can be rewritten as an optimization problem over the eigenvalues of the matrix $(\mathbf{X}y)\mathbf{e}^{T}+\mathbf{e}(\mathbf{X}y)^{T}$. The proof of this is given in Appendix A.

For any $\alpha\in\mathbb{R}$ we denote $[\alpha]_{+}=\max\{0,+\alpha\}$ and $[\alpha]_{-}=\max\{0,-\alpha\}$. The dual problem can be written as

\min_{y}\frac{\mathrm{tr}\,[M(y)]_{-}}{\mathrm{tr}\,[M(y)]_{+}}=\min_{y}\frac{\sum_{i}\lambda_{i}[M(y)]_{-}}{\sum_{i}\lambda_{i}[M(y)]_{+}},\quad\text{where}\quad M(y)=(\mathbf{X}y)\mathbf{e}^{T}+\mathbf{e}(\mathbf{X}y)^{T},

and $\lambda_{i}[M(y)]$ denotes the eigenvalues of the rank-2 matrix $M(y)$. It is easy to verify that the only two non-zero eigenvalues of $(\mathbf{X}y)\mathbf{e}^{T}+\mathbf{e}(\mathbf{X}y)^{T}$ are

\|\mathbf{X}y\|\|\mathbf{e}\|\left(\cos\theta_{y}\pm 1\right),\quad\text{ where }\cos\theta_{y}=\frac{\mathbf{e}^{T}\mathbf{X}y}{\|\mathbf{e}\|\|\mathbf{X}y\|}.

It follows that

\eta(X)=\min_{y}\frac{1-\cos\theta_{y}}{1+\cos\theta_{y}}

and therefore

\delta_{\mathrm{foc}}(X)=\max_{y}\cos\theta_{y}=\max_{y}\frac{\mathbf{e}^{T}\mathbf{X}y}{\|\mathbf{e}\|\|\mathbf{X}y\|}.

Let $y^{*}$ be the optimizer of the problem above; then $\theta_{y^{*}}$ is simply the incidence angle between the error vector $\mathbf{e}$ and the range of $\mathbf{X}$. Thus we have $y^{\star}=\arg\min_{y}\|\mathbf{e}-\mathbf{X}y\|$. Using Lemma 12 in Appendix A, we show that solving for $y^{*}$ yields a closed-form expression for $\theta_{y^{*}}$ of the form

\sin\theta_{y^{*}}=\frac{\|Z^{T}(I-XX^{\dagger})Z\|_{F}}{\|XX^{T}-ZZ^{T}\|_{F}}.

Hence we have $\delta_{\mathrm{foc}}(X)=\cos\theta$, with $\theta=\theta_{y^{*}}$ given by the equation above.
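The chain of identities above, $\delta_{\mathrm{foc}}(X)=\cos\theta_{y^{*}}=\|\mathbf{X}y^{\star}\|/\|\mathbf{e}\|$ with $y^{\star}=\arg\min_{y}\|\mathbf{e}-\mathbf{X}y\|$, is easy to check numerically against the closed form; a small sanity-check sketch on a random instance:

import numpy as np

rng = np.random.default_rng(7)
n, r = 8, 3
X = rng.standard_normal((n, r))
Z = rng.standard_normal((n, r))

# error vector e and Jacobian matrix X_bold from (10a)-(10b)
e = (X @ X.T - Z @ Z.T).flatten(order='F')
cols = []
for j in range(n * r):
    Y = np.zeros((n, r))
    Y[np.unravel_index(j, (n, r), order='F')] = 1.0
    cols.append((X @ Y.T + Y @ X.T).flatten(order='F'))
Xb = np.column_stack(cols)

# cos(theta) from the least-squares projection of e onto range(X_bold)
y_star, *_ = np.linalg.lstsq(Xb, e, rcond=None)
cos_ls = np.linalg.norm(Xb @ y_star) / np.linalg.norm(e)

# cos(theta) from the closed form of Theorem 4
P = X @ np.linalg.pinv(X)
sin_cf = (np.linalg.norm(Z.T @ (np.eye(n) - P) @ Z, 'fro')
          / np.linalg.norm(X @ X.T - Z @ Z.T, 'fro'))
cos_cf = np.sqrt(1 - sin_cf**2)

print(cos_ls, cos_cf)   # the two values should agree to numerical precision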

5.3 Proof of Theorem 5

The proof of Theorem 5 is based on the following lemma. Its proof is technical and can be found in Appendix B.

Lemma 7.

Let $Z\neq 0$ and suppose that $\|XX^{T}-ZZ^{T}\|_{F}\leq\epsilon\|ZZ^{T}\|_{F}$. Then

\sin^{2}\theta=\frac{\|Z^{T}(I-XX^{\dagger})Z\|_{F}^{2}}{\|XX^{T}-ZZ^{T}\|_{F}^{2}}\leq\frac{\epsilon}{2\sigma_{\min}^{2}(Z)/\|ZZ^{T}\|_{F}-\epsilon}.

To prove Theorem 5, we simply set $C_{1}=\sigma_{\min}^{2}(Z)/\|ZZ^{T}\|_{F}$ and write

\cos\theta=\sqrt{1-\sin^{2}\theta}\geq\sqrt{1-\frac{\epsilon}{2C_{1}-\epsilon}}.

It is easy to see that $\frac{\epsilon}{2C_{1}-\epsilon}$ is dominated by the linear function $\epsilon/C_{1}$ so long as $\epsilon\leq C_{1}$. This follows directly from the fact that $\frac{\epsilon}{2C_{1}-\epsilon}$ is convex in $\epsilon$ between $0$ and $C_{1}$, and agrees with the linear function at the two endpoints, so it lies below the linear function in between. Thus we have

\cos\theta\geq\sqrt{1-\frac{\epsilon}{C_{1}}}.

Since this lower bound holds for all $X$ in $\mathcal{B}_{\varepsilon}$, it follows that $\delta_{\mathrm{foc}}(\mathcal{B}_{\varepsilon})\geq\sqrt{1-\varepsilon/C_{1}}$.
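A quick numerical sanity check of the bounds above on a random instance (here $\varepsilon$ is taken to be the actual relative error of the perturbed point, so the hypothesis of Lemma 7 holds by construction):

import numpy as np

rng = np.random.default_rng(8)
n, r = 20, 3
Z = rng.standard_normal((n, r))
X = Z + 0.05 * rng.standard_normal((n, r))       # a point near the ground truth

eps = np.linalg.norm(X @ X.T - Z @ Z.T, 'fro') / np.linalg.norm(Z @ Z.T, 'fro')
C1 = np.linalg.svd(Z, compute_uv=False)[-1] ** 2 / np.linalg.norm(Z @ Z.T, 'fro')

P = X @ np.linalg.pinv(X)
sin2 = (np.linalg.norm(Z.T @ (np.eye(n) - P) @ Z, 'fro')
        / np.linalg.norm(X @ X.T - Z @ Z.T, 'fro')) ** 2

# Lemma 7: sin^2(theta) <= eps / (2*C1 - eps); Theorem 5: cos(theta) >= sqrt(1 - eps/C1)
print("sin^2 =", sin2, "<=", eps / (2 * C1 - eps))
print("cos   =", np.sqrt(1 - sin2), ">=", np.sqrt(max(0.0, 1 - eps / C1)))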

6 Numerical Results

In this section we give a geometric interpretation of Theorem 4, which we already alluded to in Section 4: the sample complexity needed to eliminate spurious critical points is small in regions where the column spaces of $X$ and $Z$ are more aligned, and large in regions where they are close to orthogonal. We also numerically verify that $\delta_{\mathrm{foc}}(X)$ is a tight lower bound for $\delta_{\mathrm{soc}}(X)$ for a wide range of $\varepsilon$, providing numerical evidence that the bound in Theorem 5 is tight.

Our main results and geometric insights hold for any rank, but for ease of visualization we focus on the rank-1 case, where $x$ and $z$ are now just vectors. To measure the alignment between the column space of $x$ and that of $z$ in the rank-1 case, we define the length ratio and the incidence angle as

\rho=\frac{\|x\|}{\|z\|},\quad\cos\phi=\frac{x^{T}z}{\|x\|\|z\|}.

Our goal is to plot how the sample complexity depends on this alignment. Visualizing the dependence of the sample complexity on $\rho$ and $\cos\phi$ is particularly easy in the rank-1 case because these two parameters completely determine the values of both $\delta_{\mathrm{foc}}(x)$ and $\delta_{\mathrm{soc}}(x)$; see Section 8.1 of [23] for a proof of this fact. This allows us to plot the level curves of $\delta_{\mathrm{foc}}(x)$ and $\delta_{\mathrm{soc}}(x)$ over the parameter space $(\rho,\phi)$ in Figure 2; these are shown by the blue curves. Since we are particularly interested in the sample complexity near the ground truth, we also plot the level sets of the function $\|xx^{T}-zz^{T}\|_{F}/\|zz^{T}\|_{F}$ using red curves. The horizontal axis is the value of $\rho\cos\phi$ and the vertical axis is the value of $\rho\sin\phi$.
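In the rank-1 case the closed form of Theorem 4 can be tabulated directly over $(\rho,\phi)$; the short sketch below evaluates a few grid values rather than reproducing the full level-curve plot of Figure 2 (the ground-truth vector and grid values are arbitrary):

import numpy as np

def delta_foc_rank1(rho, phi, z):
    """Evaluate delta_foc(x) for x = rho * ||z|| * (cos(phi) * z_hat + sin(phi) * w),
    where z_hat = z/||z|| and w is a unit vector orthogonal to z
    (assumes z points along the first coordinate axis)."""
    n = z.size
    z_hat = z / np.linalg.norm(z)
    w = np.zeros(n); w[1] = 1.0
    x = rho * np.linalg.norm(z) * (np.cos(phi) * z_hat + np.sin(phi) * w)
    P = np.outer(x, x) / np.dot(x, x)                 # projector onto span(x)
    sin_t = (abs(z @ (np.eye(n) - P) @ z)
             / np.linalg.norm(np.outer(x, x) - np.outer(z, z), 'fro'))
    return np.sqrt(max(0.0, 1.0 - sin_t**2))

z = np.zeros(5); z[0] = 1.0                           # unit ground truth along the first axis
for phi_deg in [10, 45, 80]:
    print("phi = %2d deg -> delta_foc = %.3f"
          % (phi_deg, delta_foc_rank1(0.8, np.radians(phi_deg), z)))
# delta_foc decreases as x becomes closer to orthogonal to z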

We can immediately see that in regions of the optimization landscape where $x$ is more aligned with $z$, i.e., where $\sin\phi$ is small, the values of both threshold functions tend to be high, and a relatively small number of samples suffices to prevent $x$ from becoming a spurious critical point. However, when $x$ and $z$ become closer to being orthogonal, i.e., when $\cos\phi$ is close to $0$, then $\delta_{\mathrm{foc}}(x)$ becomes arbitrarily small, and $\delta_{\mathrm{soc}}(x)$ also becomes smaller, albeit to a lesser extent. As a result, preventing $x$ from becoming a spurious critical point (or spurious local minimum) in these regions requires many more samples. This intuition carries over to the higher-rank case, even though visualization becomes difficult and a slightly more general definition of length ratio and alignment is required. Similar to the rank-1 case, in regions where $XX^{T}$ and $ZZ^{T}$ are more aligned, the sample complexity required to eliminate spurious critical points is small, while in regions where $XX^{T}$ and $ZZ^{T}$ are close to orthogonal, a large sample complexity is required.

Regarding the tightness of $\delta_{\mathrm{foc}}(X)$ as a lower bound for $\delta_{\mathrm{soc}}(X)$, note that if we look at the level sets of $\|xx^{T}-zz^{T}\|_{F}/\|zz^{T}\|_{F}$, we see that in regions close to the ground truth, both $\delta_{\mathrm{soc}}(x)$ and $\delta_{\mathrm{foc}}(x)$ are very close to $1$. This is in perfect agreement with our results in Theorem 5, where we showed that a small $\varepsilon$ results in a large $\delta_{\mathrm{foc}}(\mathcal{B}_{\varepsilon})$. Moreover, the shapes of the level curves of $\delta_{\mathrm{soc}}$ and $\delta_{\mathrm{foc}}$ that flow through the regions near the ground truth are almost identical. This indicates that for a large region near the ground truth, the second-order condition, i.e., the Hessian being positive semidefinite, is inactive. This is the underlying mechanism that causes $\delta_{\mathrm{foc}}$ to be a tight lower bound for $\delta_{\mathrm{soc}}$.

Figure 2: (a) the level sets of $\delta_{\mathrm{foc}}$ and $\|xx^{T}-zz^{T}\|_{F}/\|zz^{T}\|_{F}$; (b) the level sets of $\delta_{\mathrm{soc}}$ and $\|xx^{T}-zz^{T}\|_{F}/\|zz^{T}\|_{F}$.

7 Conclusions

Recent work by Bhojanapalli et al. [2] has shown that the non-convex optimization landscape of matrix sensing contains no spurious local minima when there is a sufficiently large number of samples. However, these theoretical bounds on the sample complexity are very conservative compared to the number of samples needed in real applications like power state estimation. In our paper, we provide one explanation for this phenomenon: in practice, we often have access to good initial points, which reduce the number of samples we need. The main results of our paper give a mathematical characterization of this phenomenon. We define a function $\delta_{\mathrm{soc}}(X)$ that gives a precise threshold on the number of samples needed to prevent $X$ from becoming a spurious local minimum. Although $\delta_{\mathrm{soc}}$ is difficult to compute exactly, we obtain a sharp, closed-form lower bound using convex optimization. As a result, we are able to characterize the trade-off between the quality of the initial point and the sample complexity. In particular, we show that a linear improvement in the quality of the initial point corresponds to a linear decrease in sample complexity.

On a more general level, our work uses new techniques to paint a fuller picture of the non-convex landscape of matrix sensing: the problem becomes more “non-convex” (requiring more samples to eliminate spurious local minima) as we move further away from the global minimum. Once we are sufficiently far away, it becomes necessary to rely on global guarantees instead. Thus, our work brings new insight into how a non-convex problem can gradually become more tractable either through more samples or through a better initial point, and provides a trade-off between these two mechanisms. For future work, it would be interesting to see whether similar techniques can be extended to other non-convex models such as neural networks.

Acknowledgements

Partial financial support was provided by the National Science Foundation under award ECCS-1808859.

Broader Impact

Many modern applications in engineering and computer science, and in machine learning in particular, have to deal with non-convex optimization. However, many aspects of non-convex optimization are still not well understood. Our paper provides more insight into the optimization landscape of a particular problem: low-rank matrix factorization. In addition, the methods we develop can potentially be used to understand many other non-convex problems. This is a step towards a more thorough analysis of current algorithms for non-convex optimization and also a step towards developing better and more efficient algorithms with theoretical guarantees.

References

  • [1] Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. In Conference on Learning Theory, pages 530–582, 2016.
  • [2] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.
  • [3] Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex burer-monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pages 2757–2765, 2016.
  • [4] Emmanuel Candès and Yaniv Plan. Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. IEEE Transactions on Information Theory, 2011.
  • [5] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
  • [6] Yudong Chen and Martin J Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.
  • [7] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems of equations is nearly as easy as solving linear systems. In Advances in Neural Information Processing Systems, pages 739–747, 2015.
  • [8] Yuejie Chi, Yue M Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239–5269, 2019.
  • [9] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
  • [10] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 665–674, 2013.
  • [11] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE transactions on information theory, 56(6):2980–2998, 2010.
  • [12] Qiuwei Li, Zhihui Zhu, and Gongguo Tang. The non-convex geometry of low-rank matrix optimization. Information and Inference: A Journal of the IMA, 8(1):51–96, 2019.
  • [13] Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and computational harmonic analysis, 47(3):893–934, 2019.
  • [14] Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. Foundations of Computational Mathematics, pages 1–182, 2019.
  • [15] Praneeth Netrapalli, UN Niranjan, Sujay Sanghavi, Animashree Anandkumar, and Prateek Jain. Non-convex robust pca. In Advances in Neural Information Processing Systems, pages 1107–1115, 2014.
  • [16] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.
  • [17] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery using nonconvex optimization. In International Conference on Machine Learning, pages 2351–2360, 2015.
  • [18] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.
  • [19] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.
  • [20] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Ben Recht. Low-rank solutions of linear matrix equations via procrustes flow. In International Conference on Machine Learning, pages 964–973, 2016.
  • [21] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
  • [22] Richard Zhang, Cédric Josz, Somayeh Sojoudi, and Javad Lavaei. How much restricted isometry is needed in nonconvex matrix recovery? In Advances in neural information processing systems, pages 5586–5597, 2018.
  • [23] Richard Y Zhang, Somayeh Sojoudi, and Javad Lavaei. Sharp restricted isometry bounds for the inexistence of spurious local minima in nonconvex matrix recovery. Journal of Machine Learning Research, 20(114):1–34, 2019.
  • [24] Qinqing Zheng and John Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In Advances in Neural Information Processing Systems, pages 109–117, 2015.

Appendix A.1

In Appendix A we fill out the missing details in the proof sketch of Section 5.2 and provide a complete proof of Theorem 4, which we restate below.

Theorem 8.

(Same as Theorem 4.) Fix $Z\in\mathbb{R}^{n\times r}$. Given $\mathcal{A}$ satisfying $\delta$-RIP and $X\in\mathbb{R}^{n\times r}$ such that $XX^{T}\neq ZZ^{T}$, we have $\delta_{\mathrm{foc}}(X)=\cos\theta$, where

\sin\theta=\|Z^{T}(I-XX^{\dagger})Z\|_{F}\;\big{/}\;\|XX^{T}-ZZ^{T}\|_{F} (14)

and $X^{\dagger}$ denotes the pseudo-inverse of $X$. It follows that if $\delta<\cos\theta$, then $X$ is not a spurious critical point of $f_{\mathcal{A}}(X)$. If $\delta\geq\cos\theta$, then there exists some $\mathcal{A}^{\star}$ satisfying $\cos\theta$-RIP such that $\nabla f_{\mathcal{A}^{\star}}(X)=0$.

Before we prove the theorem above, we first prove two technical lemmas. The first lemma gives an explicit formula for the eigenvalues of a rank-2 matrix, and the second lemma characterizes the solution of an SDP that will be part of the proof of Theorem 4.

Lemma 9.

Given $a,b\in\mathbb{R}^{n}$, the matrix $M=ab^{T}+ba^{T}$ has eigenvalues $\lambda_{1}\geq\cdots\geq\lambda_{n}$, where

\lambda_{i}=\begin{cases}+\|a\|\|b\|(1+\cos\theta)&i=1\\ -\|a\|\|b\|(1-\cos\theta)&i=n\\ 0&\text{otherwise}\end{cases}

and $\theta\equiv\arccos\left(\frac{a^{T}b}{\|a\|\|b\|}\right)$ is the angle between $a$ and $b$.

Lemma 10.

Given a matrix $M\neq 0$, we can split the matrix $M$ into a positive and a negative part satisfying

M=M_{+}-M_{-}\quad\text{where}\quad M_{+},M_{-}\succeq 0,\quad M_{+}M_{-}=0.

Then the following problem has optimal value

\min_{{\alpha\in\mathbb{R}\atop U,V\succeq 0}}\{\operatorname{tr}(V):\operatorname{tr}(U)=1,\;\alpha M=U-V\}=\min\left\{\frac{\operatorname{tr}\left(M_{-}\right)}{\operatorname{tr}\left(M_{+}\right)},\frac{\operatorname{tr}\left(M_{+}\right)}{\operatorname{tr}\left(M_{-}\right)}\right\}.
Proof.

(Lemma 9). Without loss of generality, assume that $\|a\|=\|b\|=1$. (Otherwise, we can rescale $\hat{a}=a/\|a\|$, $\hat{b}=b/\|b\|$ and write $M=\|a\|\|b\|(\hat{a}\hat{b}^{T}+\hat{b}\hat{a}^{T})$.) Now decompose $b$ into a tangent and a normal component with respect to $a$, as in

b=a\underbrace{a^{T}b}_{\cos\theta}+\underbrace{\left(I-aa^{T}\right)b}_{c\sin\theta}=a\cos\theta+c\sin\theta

where $c$ is a unit normal vector with $\|c\|=1$ and $a^{T}c=0$. Thus $ab^{T}+ba^{T}$ can be written as

ab^{T}+ba^{T}=\begin{bmatrix}a&c\end{bmatrix}\begin{bmatrix}2\cos\theta&\sin\theta\\ \sin\theta&0\end{bmatrix}\begin{bmatrix}a&c\end{bmatrix}^{T}.

This shows that $M$ shares its nonzero spectrum with a $2\times 2$ matrix whose eigenvalues are $\cos\theta\pm 1$. ∎
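A short numerical check of Lemma 9 on random vectors (illustration only):

import numpy as np

rng = np.random.default_rng(9)
a, b = rng.standard_normal(6), rng.standard_normal(6)
cos_t = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

eigs = np.linalg.eigvalsh(np.outer(a, b) + np.outer(b, a))   # ascending order
print(eigs[0], eigs[-1])                                     # extreme eigenvalues; the rest are ~0
print(-(1 - cos_t) * np.linalg.norm(a) * np.linalg.norm(b),
      (1 + cos_t) * np.linalg.norm(a) * np.linalg.norm(b))   # Lemma 9 prediction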

Proof.

(Lemma 10). In this proof we consider two cases: $\mathrm{tr}(M_{-})\leq\mathrm{tr}(M_{+})$ and $\mathrm{tr}(M_{-})\geq\mathrm{tr}(M_{+})$. We will see that in the first case, the optimal value is $\mathrm{tr}(M_{-})/\mathrm{tr}(M_{+})$, and in the second case, the optimal value is $\mathrm{tr}(M_{+})/\mathrm{tr}(M_{-})$.

First, assume that $\mathrm{tr}(M_{-})\leq\mathrm{tr}(M_{+})$. Let $p^{\star}$ be the optimal value. Then we have

\begin{aligned}
p^{\star}&=\max_{\beta}\min_{{\alpha\in\mathbb{R}\atop U,V\succeq 0}}\{\operatorname{tr}(V)+\beta\cdot[1-\operatorname{tr}(U)]:\alpha M=U-V\} && (15)\\
&=\max_{\beta}\min_{\alpha\in\mathbb{R}}\left\{\beta+\min_{U,V\succeq 0}\{\operatorname{tr}(V)-\beta\cdot\operatorname{tr}(U):\alpha M=U-V\}\right\}\\
&=\max_{\beta}\min_{\alpha\in\mathbb{R}}\left\{\beta+\min_{U}\left[\operatorname{tr}\left(U-\alpha M\right)-\beta\cdot\operatorname{tr}\left(U\right)\right]:U-\alpha M\succeq 0,\;U\succeq 0\right\}\\
&=\max_{\beta}\min_{\alpha\in\mathbb{R}}\left\{\beta+\min_{U}[-\alpha\,\mathrm{tr}(M)+(1-\beta)\,\mathrm{tr}(U)]:U-\alpha M\succeq 0,\;U\succeq 0\right\}\\
&=\max_{\beta\leq 1}\min_{\alpha\in\mathbb{R}}\left\{\beta+\min_{U}[-\alpha\,\mathrm{tr}(M)+(1-\beta)\,\mathrm{tr}(U)]:U-\alpha M\succeq 0,\;U\succeq 0\right\}. && (16)
\end{aligned}

Note that the first line dualizes the equality constraint $\mathrm{tr}(U)=1$ with a multiplier $\beta$. The second line simply rearranges the terms. The third line plugs in $V=U-\alpha M$. The fourth line again rearranges the terms. The last line follows from the observation that if $\beta>1$, then the inner minimization over $U$ goes to negative infinity, since the trace of $U$ can be arbitrarily large.

First, consider the case $\alpha\geq 0$. Then we have $\alpha M=\alpha M_{+}-\alpha M_{-}$. Since $1-\beta\geq 0$, the minimization over $U$ is achieved at $U=\alpha M_{+}$. Plugging this value into the optimization problem, (16) becomes

\max_{\beta\leq 1}\min_{\alpha\geq 0}\{\beta+\alpha[\mathrm{tr}(M_{-})-\beta\,\mathrm{tr}(M_{+})]\}.

If $\mathrm{tr}(M_{-})-\beta\,\mathrm{tr}(M_{+})<0$, then the optimal value of the inner minimization goes to negative infinity. On the other hand, if $\mathrm{tr}(M_{-})-\beta\,\mathrm{tr}(M_{+})\geq 0$, then the inner minimum is achieved at $\alpha=0$. Thus the problem above is equivalent to

\max_{\beta\leq 1}\{\beta:\mathrm{tr}(M_{-})-\beta\,\mathrm{tr}(M_{+})\geq 0\}.

Since $\mathrm{tr}(M_{-})\leq\mathrm{tr}(M_{+})$, the optimal value of the problem above is achieved at $\beta=\mathrm{tr}(M_{-})/\mathrm{tr}(M_{+})\leq 1$. Now suppose that $\alpha\leq 0$. Then the optimum over $U$ is achieved at $U=-\alpha M_{-}$. Plugging this value in, (16) becomes

\max_{\beta\leq 1}\min_{\alpha\leq 0}\{\beta+\alpha[\beta\,\mathrm{tr}(M_{-})-\mathrm{tr}(M_{+})]\}.

Similar to before, we must have $\beta\,\mathrm{tr}(M_{-})-\mathrm{tr}(M_{+})\leq 0$, so $\beta\leq\mathrm{tr}(M_{+})/\mathrm{tr}(M_{-})$. Since $\mathrm{tr}(M_{-})\leq\mathrm{tr}(M_{+})$, the optimal value in this case is just $\beta=1$. Combining the results for $\alpha\geq 0$ and $\alpha\leq 0$, we find that when $\mathrm{tr}(M_{-})\leq\mathrm{tr}(M_{+})$, the optimal value is

p^{*}=\min\left\{1,\frac{\mathrm{tr}(M_{-})}{\mathrm{tr}(M_{+})}\right\}=\frac{\mathrm{tr}(M_{-})}{\mathrm{tr}(M_{+})}.

Repeating the same arguments for the case $\mathrm{tr}(M_{-})\geq\mathrm{tr}(M_{+})$, we see that the optimal value becomes

p^{*}=\min\left\{\frac{\mathrm{tr}(M_{+})}{\mathrm{tr}(M_{-})},1\right\}=\frac{\mathrm{tr}(M_{+})}{\mathrm{tr}(M_{-})}.

Finally, combining these two cases, i.e., $\mathrm{tr}(M_{-})\geq\mathrm{tr}(M_{+})$ and $\mathrm{tr}(M_{-})\leq\mathrm{tr}(M_{+})$, we obtain

p^{*}=\min\left\{\frac{\operatorname{tr}\left(M_{-}\right)}{\operatorname{tr}\left(M_{+}\right)},\frac{\operatorname{tr}\left(M_{+}\right)}{\operatorname{tr}\left(M_{-}\right)}\right\},

which completes the proof.

Appendix A.2

Now we are ready to prove Theorem 4. Recall that the first-order threshold function is defined as the solution to the following optimization problem:

\delta_{\mathrm{foc}}(X)\equiv\min_{\mathcal{A}}\{\delta:\nabla f_{\mathcal{A}}(X)=0,\quad\mathcal{A}\text{ satisfies }\delta\text{-RIP}\}.

Using Theorem 8 from [23], the optimization problem above can be formulated as

\eta(X)\quad\equiv\quad\max_{\eta,\mathbf{H}}\left\{\eta\quad:\quad\mathbf{X}^{T}\mathbf{H}\mathbf{e}=0,\quad\eta I\preceq\mathbf{H}\preceq I\right\} (17)

where $\eta=(1-\delta_{\mathrm{foc}})/(1+\delta_{\mathrm{foc}})$. Our goal is to solve this optimization problem in closed form. In Section 5.2, we wrote the dual of problem (17) as

\begin{aligned}
\underset{y,U_{1},U_{2}}{\mathrm{minimize}}\quad & \mathrm{tr}(U_{2}) \\
\mathrm{subject\ to}\quad & (\mathbf{X}y)\mathbf{e}^{T}+\mathbf{e}(\mathbf{X}y)^{T}=U_{1}-U_{2} \\
& \mathrm{tr}(U_{1})=1,\quad U_{1},U_{2}\succeq 0,
\end{aligned} (18)

and stated that this dual problem can be rewritten as an optimization problem over the eigenvalues of a rank-2 matrix. This is given in the lemma below. To simplify notation, we define a positive/negative splitting: for any $\alpha\in\mathbb{R}$ we denote $[\alpha]_{+}=\max\{0,+\alpha\}$ and $[\alpha]_{-}=\max\{0,-\alpha\}$. This idea extends to matrices by applying the splitting to the eigenvalues.

Lemma 11.

Given data $\mathbf{e}$ and $\mathbf{X}\neq 0$, define

\begin{aligned}
\eta=\underset{y,U_{1},U_{2}}{\mathrm{min}}\quad & \mathrm{tr}(U_{2}) \\
\mathrm{subject\ to}\quad & (\mathbf{X}y)\mathbf{e}^{T}+\mathbf{e}(\mathbf{X}y)^{T}=U_{1}-U_{2} \\
& \mathrm{tr}(U_{1})=1,\quad U_{1},U_{2}\succeq 0.
\end{aligned} (19)

Define $M(y)$ to be the rank-2 matrix $(\mathbf{X}y)\mathbf{e}^{T}+\mathbf{e}(\mathbf{X}y)^{T}$ and let $\lambda_{i}[M(y)]$ denote its eigenvalues. Then $\eta$ can be evaluated as

\eta=\min_{y\neq 0}\frac{\mathrm{tr}\,[M(y)]_{-}}{\mathrm{tr}\,[M(y)]_{+}}=\min_{y\neq 0}\frac{\sum_{i}\lambda_{i}[M(y)]_{-}}{\sum_{i}\lambda_{i}[M(y)]_{+}}=\min_{y\neq 0}\frac{1-\cos\theta_{y}}{1+\cos\theta_{y}},

where $\cos\theta_{y}={\mathbf{e}^{T}\mathbf{X}y}/{\|\mathbf{e}\|\|\mathbf{X}y\|}$.

The proof of Lemma 11 relies mainly on the two lemmas we proved in the preceding section.

Proof.

(Lemma 11). Let $y=\alpha\hat{y}$, where $\|\hat{y}\|=1$ and $\alpha\in\mathbb{R}$. Then the optimization problem (19) becomes

\begin{aligned}
\eta=\underset{\alpha,\hat{y},U_{1},U_{2}}{\mathrm{min}}\quad & \mathrm{tr}(U_{2}) \\
\mathrm{subject\ to}\quad & \alpha\cdot[(\mathbf{X}\hat{y})\mathbf{e}^{T}+\mathbf{e}(\mathbf{X}\hat{y})^{T}]=U_{1}-U_{2} \\
& \mathrm{tr}(U_{1})=1,\quad\|\hat{y}\|=1,\quad U_{1},U_{2}\succeq 0.
\end{aligned}

To solve this problem, first we keep y^\hat{y} fixed, and optimize over α,U1,U2\alpha,U_{1},U_{2}. This gives us the problem

\begin{aligned}
\underset{\alpha,U_{1},U_{2}}{\mathrm{min}}\quad & \mathrm{tr}(U_{2}) \\
\mathrm{subject\ to}\quad & \alpha\cdot[(\mathbf{X}\hat{y})\mathbf{e}^{T}+\mathbf{e}(\mathbf{X}\hat{y})^{T}]=U_{1}-U_{2} \\
& \mathrm{tr}(U_{1})=1,\quad U_{1},U_{2}\succeq 0.
\end{aligned}

Notice that if we set $M(\hat{y})=(\mathbf{X}\hat{y})\mathbf{e}^{T}+\mathbf{e}(\mathbf{X}\hat{y})^{T}$, then the problem above has exactly the same form as the one in Lemma 10. Therefore, its optimal value is

\min\left\{\frac{\operatorname{tr}\left(M(\hat{y})_{-}\right)}{\operatorname{tr}\left(M(\hat{y})_{+}\right)},\frac{\operatorname{tr}\left(M(\hat{y})_{+}\right)}{\operatorname{tr}\left(M(\hat{y})_{-}\right)}\right\}.

Finally, to obtain $\eta$, we still need to optimize over $\hat{y}$, i.e.,

\eta=\min_{\|\hat{y}\|=1}\min\left\{\frac{\operatorname{tr}\left(M(\hat{y})_{-}\right)}{\operatorname{tr}\left(M(\hat{y})_{+}\right)},\frac{\operatorname{tr}\left(M(\hat{y})_{+}\right)}{\operatorname{tr}\left(M(\hat{y})_{-}\right)}\right\}.

Since both the numerator and the denominator scale linearly with $\|y\|$, the ratio is unchanged by rescaling $y$, so we can drop the constraint $\|\hat{y}\|=1$ and simply optimize over $y$, which gives us

\eta=\min_{y\neq 0}\min\left\{\frac{\operatorname{tr}\left(M({y})_{-}\right)}{\operatorname{tr}\left(M({y})_{+}\right)},\frac{\operatorname{tr}\left(M({y})_{+}\right)}{\operatorname{tr}\left(M({y})_{-}\right)}\right\}.

By Lemma 9, the only two nonzero eigenvalues of $M(y)$ are

\|\mathbf{X}y\|\|\mathbf{e}\|(\cos\theta_{y}+1),\qquad\|\mathbf{X}y\|\|\mathbf{e}\|(\cos\theta_{y}-1),

where $\cos\theta_{y}={\mathbf{e}^{T}\mathbf{X}y}/{\|\mathbf{e}\|\|\mathbf{X}y\|}$. It follows that $\mathrm{tr}(M_{-})=\|\mathbf{X}y\|\|\mathbf{e}\|(1-\cos\theta_{y})$ and $\mathrm{tr}(M_{+})=\|\mathbf{X}y\|\|\mathbf{e}\|(\cos\theta_{y}+1)$. Thus

\eta=\min_{y\neq 0}\min\left\{\frac{1-\cos\theta_{y}}{1+\cos\theta_{y}},\frac{1+\cos\theta_{y}}{1-\cos\theta_{y}}\right\}.

Notice that in the optimization problem above, if the minimum is achieved at some $y^{*}$, it must also be achieved at $-y^{*}$, due to symmetry. Therefore, it suffices to optimize over only the first term $\frac{1-\cos\theta_{y}}{1+\cos\theta_{y}}$, so we get

\eta=\min_{y\neq 0}\frac{1-\cos\theta_{y}}{1+\cos\theta_{y}}.

This completes the proof. ∎

Notice that Lemma 11 reduces problem (17) to depend only on the values of $\cos\theta_{y}$. Now, to complete the proof of Theorem 4, we just need one additional lemma that gives a closed-form solution for $\cos\theta_{y}$, which we state below.

Lemma 12.

Let $X,Z$ be $n\times r$ matrices of any rank, and define $\mathbf{e}$ and $\mathbf{X}\neq 0$ as in equations (10a) and (10b). Then, the incidence angle $\theta$ between $\mathbf{e}$ and $\mathrm{range}(\mathbf{X})$, defined as in

\cos\theta=\max_{y\neq 0}\left\{\frac{\mathbf{e}^{T}\mathbf{X}y}{\|\mathbf{e}\|\|\mathbf{X}y\|}\right\}=\frac{\|\mathbf{X}\mathbf{X}^{\dagger}\mathbf{e}\|}{\|\mathbf{e}\|},

has closed-form expression

\sin\theta=\frac{\|Z^{T}(I-XX^{\dagger})Z\|_{F}}{\|XX^{T}-ZZ^{T}\|_{F}}

where $X^{\dagger}$ denotes the Moore–Penrose pseudoinverse of $X$.

Proof.

(Lemma 12). Define $y^{\star}=\arg\min_{y}\|\mathbf{e}-\mathbf{X}y\|$ and decompose $\mathbf{e}=\mathbf{X}y^{\star}+w$. The optimality condition for $y^{\star}$ reads $\mathbf{X}^{T}(\mathbf{e}-\mathbf{X}y^{\star})=\mathbf{X}^{T}w=0$, so we substitute $\mathbf{e}^{T}\mathbf{X}=(y^{\star})^{T}\mathbf{X}^{T}\mathbf{X}$ to yield

\|\mathbf{e}\|\cos\theta=\|\mathbf{e}\|\max_{y\neq 0}\left\{\frac{\mathbf{e}^{T}\mathbf{X}y}{\|\mathbf{e}\|\|\mathbf{X}y\|}\right\}=\max_{y\neq 0}\left\{\frac{(y^{\star})^{T}\mathbf{X}^{T}\mathbf{X}y}{\|\mathbf{X}y\|}\right\}=\|\mathbf{X}y^{\star}\|,

and therefore $\|\mathbf{e}\|\sin\theta=\|w\|=\min_{y}\|\mathbf{e}-\mathbf{X}y\|$, because we have $\mathbf{e}=\mathbf{X}y^{\star}+w$ with $w^{T}\mathbf{X}y^{\star}=0$. Now, define $Q=\mathrm{orth}(X)\in\mathbb{R}^{n\times q}$ where $q=\mathrm{rank}(X)\leq r$, and define $P\in\mathbb{R}^{n\times(n-q)}$ as the orthogonal complement of $Q$. Decompose $X=Q\hat{X}$ and $Z=Q\hat{Z}_{1}+P\hat{Z}_{2}$, and note that

\begin{aligned}
\|w\|&=\min_{y}\|\mathbf{e}-\mathbf{X}y\|\\
&=\min_{Y}\|(XX^{T}-ZZ^{T})-(XY^{T}+YX^{T})\|_{F}\\
&=\min_{[\hat{Y}_{1};\hat{Y}_{2}]\in\mathbb{R}^{n\times r}}\left\|\begin{bmatrix}\hat{X}\hat{X}^{T}-\hat{Z}_{1}\hat{Z}_{1}^{T}&-\hat{Z}_{1}\hat{Z}_{2}^{T}\\ -\hat{Z}_{2}\hat{Z}_{1}^{T}&-\hat{Z}_{2}\hat{Z}_{2}^{T}\end{bmatrix}-\begin{bmatrix}\hat{X}\hat{Y}_{1}^{T}+\hat{Y}_{1}\hat{X}^{T}&\hat{X}\hat{Y}_{2}^{T}\\ \hat{Y}_{2}\hat{X}^{T}&0\end{bmatrix}\right\|_{F}\\
&=\|\hat{Z}_{2}\hat{Z}_{2}^{T}\|_{F}.
\end{aligned}

From the second line to the third, we apply a change of basis onto $[Q\ P]$, which preserves the Frobenius norm. To derive the last line, notice that the $q\times r$ matrix $\hat{X}$ has full row rank, so that $\hat{X}\hat{X}^{T}\succ 0$ and $\hat{X}\hat{X}^{\dagger}=I_{q}$. We want to show that there exists $\hat{Y}_{1}$ such that

\hat{X}\hat{Y}_{1}^{T}+\hat{Y}_{1}\hat{X}^{T}=\hat{X}\hat{X}^{T}-\hat{Z}_{1}\hat{Z}_{1}^{T}.

Since the right-hand side is symmetric, we can write it as $L+L^{T}$, where $L$ is some lower-triangular matrix. Thus it suffices to show that there exists $\hat{Y}_{1}$ such that $\hat{X}\hat{Y}_{1}^{T}=L$, which follows from the fact that $\hat{X}$ has full row rank. Similarly, there exists some $\hat{Y}_{2}$ such that $\hat{X}\hat{Y}_{2}^{T}=-\hat{Z}_{1}\hat{Z}_{2}^{T}$. Thus, all terms except the last one cancel out, and we are left with $\min_{y}\|\mathbf{e}-\mathbf{X}y\|=\|\hat{Z}_{2}\hat{Z}_{2}^{T}\|_{F}$.

Finally, note that $Q\hat{Z}_{1}=XX^{\dagger}Z$ and $P\hat{Z}_{2}=(I-XX^{\dagger})Z$, and that

\begin{aligned}
\|\hat{Z}_{2}\hat{Z}_{2}^{T}\|_{F}^{2}&=\|P\hat{Z}_{2}\hat{Z}_{2}^{T}P^{T}\|_{F}^{2}\\
&=\|(I-XX^{\dagger})ZZ^{T}(I-XX^{\dagger})\|_{F}^{2}\\
&=\mathrm{tr}[(I-XX^{\dagger})ZZ^{T}(I-XX^{\dagger})ZZ^{T}(I-XX^{\dagger})]\\
&=\mathrm{tr}[Z^{T}(I-XX^{\dagger})ZZ^{T}(I-XX^{\dagger})Z]\\
&=\|Z^{T}(I-XX^{\dagger})Z\|_{F}^{2}.
\end{aligned}

Substituting the definition of $\mathbf{e}$ completes the proof. ∎

Now Theorem 4 is a direct consequence of Lemmas 11 and 12. We give a proof below.

Proof.

(Theorem 4). Note that $\delta_{\mathrm{foc}}$ is related to $\eta$ by the equation

\eta=\frac{1-\delta_{\mathrm{foc}}}{1+\delta_{\mathrm{foc}}}.

Applying Lemma 11, we immediately get

\delta_{\mathrm{foc}}(X)=\max_{y\neq 0}\cos\theta_{y}=\max_{y\neq 0}\frac{\mathbf{e}^{T}\mathbf{X}y}{\|\mathbf{e}\|\|\mathbf{X}y\|}.

From Lemma 12, we see that this optimization problem over $y$ has a simple closed-form solution of the form

\delta_{\mathrm{foc}}(X)=\cos\theta,\quad\text{ where }\sin\theta=\frac{\|Z^{T}(I-XX^{\dagger})Z\|_{F}}{\|XX^{T}-ZZ^{T}\|_{F}}.

This completes the proof. ∎

Appendix B

In this section we provide a complete proof of Theorem 5, including all the intermediate calculations that were skipped in Section 5.3. We begin by proving a bound on $\sin\theta$.

Lemma 13 (Same as Lemma 7).

Let $Z\neq 0$ and suppose that $\|XX^{T}-ZZ^{T}\|_{F}\leq\epsilon\|ZZ^{T}\|_{F}$. Then

\sin^{2}\theta=\frac{\|Z^{T}(I-XX^{\dagger})Z\|_{F}^{2}}{\|XX^{T}-ZZ^{T}\|_{F}^{2}}\leq\frac{\epsilon}{2(\sigma_{\min}^{2}(Z)/\|ZZ^{T}\|_{F})-\epsilon}.
Proof.

The problem is homogeneous with respect to the scaling $X\leftarrow\alpha X$ and $Z\leftarrow\alpha Z$ for the same $\alpha$. Since $Z\neq 0$, we may rescale $X$ and $Z$ until $\|ZZ^{T}\|_{F}=1$. Additionally, we can assume that

X=\begin{bmatrix}X_{1}\\ 0\end{bmatrix}\qquad Z=\begin{bmatrix}Z_{1}\\ Z_{2}\end{bmatrix}\qquad\text{ where }X_{1},Z_{1}\in\mathbb{R}^{r\times r},\ Z_{2}\in\mathbb{R}^{(n-r)\times r}

due to the rotational invariance of the problem. (Concretely, we compute the QR decomposition $QR=[X,Z]$ with $Q\in\mathbb{R}^{n\times 2r}$, noting that $X=QQ^{T}X$ and $Z=QQ^{T}Z$. We then make a change of basis $X\leftarrow Q^{T}X$ and $Z\leftarrow Q^{T}Z$.) Then, observe that

\|Z^{T}(I-XX^{\dagger})Z\|_{F}=\left\|\begin{bmatrix}Z_{1}\\ Z_{2}\end{bmatrix}^{T}\left(I-\begin{bmatrix}I&0\\ 0&0\end{bmatrix}\right)\begin{bmatrix}Z_{1}\\ Z_{2}\end{bmatrix}\right\|_{F}=\|Z_{2}^{T}Z_{2}\|_{F}=\|Z_{2}Z_{2}^{T}\|_{F} (20)

and that $\|Z_{2}Z_{2}^{T}\|_{F}^{2}\leq\epsilon^{2}$ because

\begin{aligned}
\|XX^{T}-ZZ^{T}\|_{F}^{2}&=\left\|\begin{bmatrix}Z_{1}Z_{1}^{T}-X_{1}X_{1}^{T}&Z_{1}Z_{2}^{T}\\ Z_{2}Z_{1}^{T}&Z_{2}Z_{2}^{T}\end{bmatrix}\right\|_{F}^{2}\\
&=\|Z_{1}Z_{1}^{T}-X_{1}X_{1}^{T}\|_{F}^{2}+2\|Z_{1}Z_{2}^{T}\|_{F}^{2}+\|Z_{2}Z_{2}^{T}\|_{F}^{2}\leq\epsilon^{2}.
\end{aligned} (21)

In order to derive a non-vacuous bound, we will need to lower-bound the term $\|Z_{1}Z_{2}^{T}\|_{F}^{2}$ as follows:

\|Z_{1}Z_{2}^{T}\|_{F}^{2}=\mathrm{tr}(Z_{1}^{T}Z_{1}Z_{2}^{T}Z_{2})\geq\lambda_{\min}(Z_{1}^{T}Z_{1})\,\mathrm{tr}(Z_{2}^{T}Z_{2})=\sigma_{\min}^{2}(Z_{1})\|Z_{2}\|_{F}^{2}. (22)

To lower-bound $\sigma_{\min}^{2}(Z_{1})$, observe that

A+B\succeq\mu I\iff A\succeq\mu I-B,\qquad\text{and}\qquad\mu I-B\succeq(\mu-\|B\|_{2})I,

and therefore

\begin{aligned}
\sigma_{\min}^{2}(Z_{1})=\lambda_{\min}(Z_{1}^{T}Z_{1})&\geq\lambda_{\min}(Z_{1}^{T}Z_{1}+Z_{2}^{T}Z_{2})-\lambda_{\max}(Z_{2}^{T}Z_{2})\\
&=\sigma_{\min}^{2}(Z)-\|Z_{2}Z_{2}^{T}\|_{2}\geq\sigma_{\min}^{2}(Z)-\|Z_{2}Z_{2}^{T}\|_{F}\\
&\geq\sigma_{\min}^{2}(Z)-\epsilon.
\end{aligned} (23)

Finally, we substitute (20) and (21) and perform a sequence of reductions:

\begin{aligned}
\frac{\|Z^{T}(I-XX^{\dagger})Z\|_{F}^{2}}{\|XX^{T}-ZZ^{T}\|_{F}^{2}}&=\frac{\|Z_{2}Z_{2}^{T}\|_{F}^{2}}{\|Z_{1}Z_{1}^{T}-X_{1}X_{1}^{T}\|_{F}^{2}+2\|Z_{1}Z_{2}^{T}\|_{F}^{2}+\|Z_{2}Z_{2}^{T}\|_{F}^{2}}\\
&\overset{\text{(a)}}{\leq}\frac{\|Z_{2}Z_{2}^{T}\|_{F}^{2}}{2\|Z_{1}Z_{2}^{T}\|_{F}^{2}+\|Z_{2}Z_{2}^{T}\|_{F}^{2}}\overset{\text{(b)}}{\leq}\frac{\|Z_{2}Z_{2}^{T}\|_{F}^{2}}{2\sigma_{\min}^{2}(Z_{1})\|Z_{2}\|_{F}^{2}+\|Z_{2}Z_{2}^{T}\|_{F}^{2}}\\
&\overset{\text{(c)}}{\leq}\frac{\|Z_{2}Z_{2}^{T}\|_{F}\|Z_{2}\|_{F}^{2}}{2\sigma_{\min}^{2}(Z_{1})\|Z_{2}\|_{F}^{2}+\|Z_{2}Z_{2}^{T}\|_{F}\|Z_{2}\|_{F}^{2}}=\frac{\|Z_{2}Z_{2}^{T}\|_{F}}{2\sigma_{\min}^{2}(Z_{1})+\|Z_{2}Z_{2}^{T}\|_{F}}\\
&\overset{\text{(d)}}{\leq}\frac{\epsilon}{2(\sigma_{\min}^{2}(Z)-\epsilon)+\epsilon}\leq\frac{\epsilon}{2\sigma_{\min}^{2}(Z)-\epsilon}.
\end{aligned}

Step (a) drops the nonnegative term $\|Z_{1}Z_{1}^{T}-X_{1}X_{1}^{T}\|_{F}^{2}$ from the denominator (equivalently, sets $X_{1}=Z_{1}$ to minimize the denominator); step (b) bounds $\|Z_{1}Z_{2}^{T}\|_{F}^{2}$ using (22); step (c) bounds $\|Z_{2}Z_{2}^{T}\|_{F}\leq\|Z_{2}\|_{F}^{2}$, noting that a function of the form $x/(c+x)$ is increasing in $x$; step (d) substitutes $\|Z_{2}Z_{2}^{T}\|_{F}\leq\epsilon$ and $\sigma_{\min}^{2}(Z_{1})\geq\sigma_{\min}^{2}(Z)-\epsilon$. ∎