How Many Samples is a Good Initial Point Worth in Low-rank Matrix Recovery?
Abstract
Given a sufficiently large amount of labeled data, the non-convex low-rank matrix recovery problem contains no spurious local minima, so a local optimization algorithm is guaranteed to converge to a global minimum starting from any initial guess. However, the actual amount of data needed by this theoretical guarantee is very pessimistic, as it must prevent spurious local minima from existing anywhere, including at adversarial locations. In contrast, prior work based on good initial guesses has more realistic data requirements, because it allows spurious local minima to exist outside of a neighborhood of the solution. In this paper, we quantify the relationship between the quality of the initial guess and the corresponding reduction in data requirements. Using the restricted isometry constant as a surrogate for sample complexity, we compute a sharp “threshold” number of samples needed to prevent each specific point on the optimization landscape from becoming a spurious local minimum. Optimizing the threshold over regions of the landscape, we see that for initial points around the ground truth, a linear improvement in the quality of the initial guess amounts to a constant factor improvement in the sample complexity.
1 Introduction
A perennial challenge in non-convex optimization is the possible existence of bad or spurious critical points and local minima, which can cause a local optimization algorithm like gradient descent to slow down or get stuck. Several recent lines of work showed that the effects of non-convexity can be tamed through a large amount of diverse and high quality training data [17, 1, 9, 3, 18, 12]. Concretely, these authors showed that, for classes of problems based on random sampling, spurious critical points and local minima become progressively less likely to exist with the addition of each new sample. After a sufficiently large number of samples, all spurious local minima are eliminated, so any local optimization algorithm is guaranteed to converge to the globally optimal solution starting from an arbitrary, possibly random initial guess.
This notion of a global guarantee—one that is valid starting from any initial point—is considerably stronger than what is needed for empirical success to be observed [8]. For example, the existence of a spurious local minimum may not pose an issue if gradient descent does not converge towards it. However, a theoretical guarantee is no longer possible, as starting the algorithm from the spurious local minimum would result in failure [22]. As a consequence, these global guarantees tend to be pessimistic, because the number of samples must be sufficiently large to eliminate spurious local minima everywhere, even at adversarial locations. By contrast, the weaker notion of a local guarantee [11, 10, 15, 19, 5, 7, 20, 13]—one that is valid only for a specified set of initial points—is naturally less conservative, as it allows spurious local minima to exist outside of the specified set.
In this paper, we provide a unifying view between the notions of global and local guarantees by quantifying the relationship between the sample complexity and the quality of the initial point. We restrict our attention to the matrix sensing problem, which seeks to recover a rank-$r$ positive semidefinite matrix $M^{\star} = ZZ^T$ with $Z \in \mathbb{R}^{n \times r}$ from $m$ sub-Gaussian linear measurements of the form
$b \;=\; \mathcal{A}(ZZ^T) \;\equiv\; \big[\langle A_1, ZZ^T\rangle,\ \ldots,\ \langle A_m, ZZ^T\rangle\big]^T, \qquad (1)$
by solving the following non-convex optimization problem:
$\min_{X \in \mathbb{R}^{n \times r}} \; f(X) \;\equiv\; \big\|\mathcal{A}(XX^T) - b\big\|^2 \;=\; \big\|\mathcal{A}(XX^T - ZZ^T)\big\|^2. \qquad (2)$
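To fix ideas, here is a minimal NumPy sketch of the measurement model (1) and the objective of problem (2) under Gaussian (hence sub-Gaussian) measurements; the dimensions, scaling, and variable names are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 10, 2, 200                              # dimension, rank, number of measurements

Z = rng.standard_normal((n, r))                   # ground truth factor, M* = Z Z^T
A = rng.standard_normal((m, n, n)) / np.sqrt(m)   # Gaussian sensing matrices A_1, ..., A_m
b = np.einsum('kij,ij->k', A, Z @ Z.T)            # measurements b_k = <A_k, Z Z^T>

def f(X):
    """Non-convex least-squares objective of problem (2)."""
    residual = np.einsum('kij,ij->k', A, X @ X.T) - b
    return float(np.sum(residual ** 2))

def grad_f(X):
    """Gradient of f: sum_k 2 * residual_k * (A_k + A_k^T) @ X."""
    residual = np.einsum('kij,ij->k', A, X @ X.T) - b
    S = np.einsum('k,kij->ij', residual, A)
    return 2.0 * (S + S.T) @ X

print(f(Z), np.linalg.norm(grad_f(Z)))            # both are zero at the ground truth
```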
We characterize a sharp “threshold” on the number of samples needed to prevent each specific point on the optimization landscape from becoming a spurious local minimum. While this threshold is difficult to compute, we derive a closed-form lower-bound based on spurious critical points, and show that it constitutes a sharp lower-bound on the original threshold of interest. The lower-bound reveals a simple geometric relationship: a point $X$ is more likely to be a spurious local minimum if the column spaces of $X$ and $Z$ are close to orthogonal. Optimizing the closed-form lower-bound over regions of the landscape, we show that for initial points close to the ground truth, a constant factor improvement of the initial point amounts to a constant factor reduction in the number of samples needed to guarantee recovery.
2 Related Work
Local Guarantees. The earliest work on exact guarantees for non-convex optimization focused on generating a good initial guess within a local region of attraction. For instance, in [21, 24], the authors showed that when $\mathcal{A}$ satisfies RIP with a sufficiently small constant, and there exists an initial point sufficiently close to the ground truth, then gradient descent starting from this initial point has a linear convergence rate. The typical strategy to find such an initial point is spectral initialization [11, 10, 21, 19, 5, 14, 6]: using the singular value decomposition of a surrogate matrix to find low-rank factors that are close to the ground truth.
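As a concrete illustration, the sketch below shows one common form of spectral initialization (an eigendecomposition of a measurement-weighted surrogate matrix); the particular surrogate and scaling are illustrative assumptions and may differ from the schemes used in the cited works.

```python
import numpy as np

def spectral_init(A, b, r):
    """Illustrative spectral initialization: eigendecompose the surrogate
    M_hat = (1/m) * sum_k b_k * A_k and keep its top-r eigenpairs."""
    m = len(b)
    M_hat = np.einsum('k,kij->ij', b, A) / m
    M_hat = (M_hat + M_hat.T) / 2.0                   # symmetrize
    vals, vecs = np.linalg.eigh(M_hat)                # ascending eigenvalues
    top = np.argsort(vals)[::-1][:r]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
```

The resulting factor is then refined by a local method such as gradient descent on problem (2).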
In this paper, we focus on the trade-off between the quality of an initial point and the number of samples needed to prevent the existence of spurious local minima, while sidestepping the question of how the initial point is found. We note, however, that the number of samples needed to find an $\epsilon$-good initial guess (e.g. via spectral initialization) forms an interesting secondary trade-off. It remains future work to study the interaction between these two trade-offs.
Global Guarantees. Recent work focused on establishing a global guarantee that is independent of the initial guess [17, 1, 9, 3, 18, 12]. For our purposes, Bhojanapalli et al. [2] showed that RIP with $\delta < 1/5$ eliminates all spurious local minima, while Zhang et al. [23] refined this to $\delta < 1/2$ for the rank-1 case, and showed that this threshold is both necessary and sufficient. This paper is inspired by proof techniques in the latter paper; an important contribution of our paper is generalizing their rank-1 techniques to accommodate matrices of arbitrary rank.
3 Our Approach: Threshold RIP Constant
Previous work that studied the global optimization landscape of problem (2) typically relied on the restricted isometry property (RIP) of $\mathcal{A}$. It is now well-known that if the measurement operator satisfies the restricted isometry property with a sufficiently small constant, then problem (2) contains no spurious local minima; see Bhojanapalli et al. [2].
Definition 1 ($\delta$-RIP).
Let $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$ be a linear measurement operator. We say that $\mathcal{A}$ satisfies the $\delta$-restricted isometry property (or simply $\delta$-RIP) if it satisfies the following inequality
$(1 - \delta)\,\|M\|_F^2 \;\le\; \|\mathcal{A}(M)\|^2 \;\le\; (1 + \delta)\,\|M\|_F^2 \qquad \text{for all } M \in \mathcal{M}_{2r},$
where $\mathcal{M}_{2r}$ denotes the set of matrices with rank at most $2r$. The RIP constant of $\mathcal{A}$ is the smallest value of $\delta$ such that the inequality above holds.
Let $\delta$ denote the RIP constant of $\mathcal{A}$. It is helpful to view $\delta$ as a surrogate for the number of measurements $m$, with a large value of $m$ corresponding to a smaller value of $\delta$ and vice versa. For a wide range of sub-Gaussian measurement ensembles, if $m \ge C\, nr/\delta^2$ where $C$ is an absolute constant, then $\mathcal{A}$ satisfies $\delta$-RIP with high probability [4, 16].
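The RIP constant in Definition 1 is hard to certify exactly, but a quick Monte-Carlo check gives a lower-bound estimate of the distortion for a concrete ensemble. The sketch below assumes Gaussian measurements scaled so that the isometry holds in expectation; sampling random low-rank matrices can only underestimate the true constant.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m = 10, 2, 400
A = rng.standard_normal((m, n, n)) / np.sqrt(m)   # scaled so E ||A(M)||^2 = ||M||_F^2

def rip_distortion(A, rank, trials=2000):
    """Monte-Carlo lower bound on the RIP constant over matrices of rank <= `rank`:
    record how far ||A(M)||^2 / ||M||_F^2 strays from 1 over random low-rank M."""
    worst = 0.0
    n = A.shape[1]
    for _ in range(trials):
        M = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, n))
        ratio = np.sum(np.einsum('kij,ij->k', A, M) ** 2) / np.sum(M ** 2)
        worst = max(worst, abs(ratio - 1.0))
    return worst

print(rip_distortion(A, 2 * r))   # a lower bound on the delta of Definition 1
```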
Take $X$ to be a spurious point such that $XX^T \neq ZZ^T$. Our approach in this paper is to define a threshold number of measurements that would be needed to prevent $X$ from becoming a local minimum for problem (2). Viewing the RIP constant $\delta$ as a surrogate for the number of measurements $m$, we follow a construction of Zhang et al. [23], and instead define a threshold on the RIP constant that would prevent $X$ from becoming a local minimum for problem (2). Such a construction must necessarily take into account all choices of $\mathcal{A}$ satisfying $\delta$-RIP, including those that adversarially target $X$, bending the optimization landscape into forming a region of convergence around the point. On the other hand, such adversarial choices of $\mathcal{A}$ must necessarily be defeated for a sufficiently small threshold on $\delta$, as we already know that spurious local minima cannot exist once $\delta$ is small enough. The statement below makes this idea precise, and also extends it to a set of spurious points.
Definition 2 (Threshold for second-order condition).
Fix $Z \in \mathbb{R}^{n \times r}$. For $X \in \mathbb{R}^{n \times r}$, if $XX^T = ZZ^T$, then define $\delta_{\mathrm{soc}}(X) = 1$. Otherwise, if $XX^T \neq ZZ^T$, then define
$\delta_{\mathrm{soc}}(X) \;\equiv\; \min_{\mathcal{A}} \;\big\{\, \delta(\mathcal{A}) \;:\; \nabla f(X) = 0,\ \nabla^2 f(X) \succeq 0 \,\big\}, \qquad (3)$
where $\delta(\mathcal{A})$ is the RIP constant of $\mathcal{A}$ and the minimum is taken over all linear measurements $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$. For a set $\mathcal{X} \subseteq \mathbb{R}^{n \times r}$ define $\delta_{\mathrm{soc}}(\mathcal{X}) \equiv \inf_{X \in \mathcal{X}} \delta_{\mathrm{soc}}(X)$.
If $\delta < \delta_{\mathrm{soc}}(X)$, then $X$ cannot be a spurious local minimum by construction, or it would contradict the definition of $\delta_{\mathrm{soc}}(X)$ as the minimum value. By the same logic, if $\delta < \delta_{\mathrm{soc}}(\mathcal{X})$, then no choice of $X \in \mathcal{X}$ can be a spurious local minimum. In particular, it follows that $\delta_{\mathrm{soc}}(\mathbb{R}^{n \times r})$ is the usual global RIP threshold: if $\mathcal{A}$ satisfies $\delta$-RIP with $\delta < \delta_{\mathrm{soc}}(\mathbb{R}^{n \times r})$, then problem (2) is guaranteed to admit no spurious local minima. Starting a local optimization algorithm from any initial point guarantees exact recovery of an $X$ satisfying $XX^T = ZZ^T$.
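The two conditions inside Definition 2 (a vanishing gradient and a positive semidefinite Hessian) can be checked numerically at any candidate point for a given operator. The sketch below uses the exact gradient and Hessian quadratic form of the least-squares objective; sampling random directions gives only a necessary check of positive semidefiniteness, not a certificate, and all names are illustrative.

```python
import numpy as np

def grad(A, b, X):
    """Gradient of f(X) = sum_k (<A_k, X X^T> - b_k)^2."""
    resid = np.einsum('kij,ij->k', A, X @ X.T) - b
    S = np.einsum('k,kij->ij', resid, A)
    return 2.0 * (S + S.T) @ X

def hess_quadform(A, b, X, V):
    """Exact Hessian quadratic form <V, grad^2 f(X)[V]>:
    2 ||A(X V^T + V X^T)||^2 + 4 <A^*(A(XX^T) - b), V V^T>."""
    resid = np.einsum('kij,ij->k', A, X @ X.T) - b
    AV = np.einsum('kij,ij->k', A, X @ V.T + V @ X.T)
    S = np.einsum('k,kij->ij', resid, A)
    return 2.0 * np.sum(AV ** 2) + 4.0 * np.sum(((S + S.T) / 2.0) * (V @ V.T))

def looks_second_order_stationary(A, b, X, tol=1e-6, trials=500):
    """Necessary check only: small gradient and no sampled direction of negative curvature."""
    if np.linalg.norm(grad(A, b, X)) > tol:
        return False
    rng = np.random.default_rng(0)
    n, r = X.shape
    return all(hess_quadform(A, b, X, rng.standard_normal((n, r))) >= -tol
               for _ in range(trials))
```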
Now, suppose we are given an initial point $X_0$. It is natural to measure the quality of $X_0$ by its relative error $\|X_0 X_0^T - ZZ^T\|_F / \|ZZ^T\|_F$. If we define an $\epsilon$-neighborhood of all points with the same relative error or better,
$\mathcal{X}_\epsilon \;\equiv\; \big\{ X \in \mathbb{R}^{n \times r} : \|XX^T - ZZ^T\|_F \le \epsilon\,\|ZZ^T\|_F \big\}, \qquad (4)$
then it follows that $\delta_{\mathrm{soc}}(\mathcal{X}_\epsilon)$ is an analogous local RIP threshold: if $\mathcal{A}$ satisfies $\delta$-RIP with $\delta < \delta_{\mathrm{soc}}(\mathcal{X}_\epsilon)$, then problem (2) is guaranteed to admit no spurious local minima over all $X \in \mathcal{X}_\epsilon$. Starting a local optimization algorithm from the initial point $X_0$ guarantees either exact recovery of an $X$ satisfying $XX^T = ZZ^T$, or termination at a strictly worse point with relative error greater than $\epsilon$. Imposing further restrictions on the algorithm prevents the latter scenario from occurring (local strong convexity with gradient descent [19], strict decrements in the level sets [10, 23, 8]), and so exact recovery is guaranteed.
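A tiny helper for the relative-error measure and the $\epsilon$-neighborhood just defined; it assumes the neighborhood collects every $X$ whose relative error is at most $\epsilon$, which is how the surrounding discussion reads.

```python
import numpy as np

def relative_error(X, Z):
    """Relative error ||X X^T - Z Z^T||_F / ||Z Z^T||_F of a candidate point X."""
    return np.linalg.norm(X @ X.T - Z @ Z.T, 'fro') / np.linalg.norm(Z @ Z.T, 'fro')

def in_eps_neighborhood(X, Z, eps):
    """Membership test for the eps-neighborhood of the solution in (4)."""
    return relative_error(X, Z) <= eps
```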
The numerical difference between the global threshold $\delta_{\mathrm{soc}}(\mathbb{R}^{n \times r})$ and the local threshold $\delta_{\mathrm{soc}}(\mathcal{X}_\epsilon)$ is precisely the number of samples that an $\epsilon$-quality initial point is worth, up to some conversion factor. But two major difficulties remain in this line of reasoning. First, evaluating $\delta_{\mathrm{soc}}(X)$ for some $X$ requires solving a minimization problem over the set of RIP measurement operators. Second, evaluating $\delta_{\mathrm{soc}}(\mathcal{X}_\epsilon)$ in turn requires minimizing $\delta_{\mathrm{soc}}(X)$ over all choices of $X$ within an $\epsilon$-neighborhood. Regarding the first point, Zhang et al. [23] showed that $\delta_{\mathrm{soc}}(X)$ is the optimal value to a convex optimization problem, and can therefore be evaluated to arbitrary precision using a numerical algorithm. In the rank-1 case, they solved this convex optimization in closed form, and used it to optimize over the $\epsilon$-neighborhood. Their closed-form solution spanned 9 journal pages, and invoked a number of properties specific to the rank-1 case (for example, facts relating the vectors $x$ and $z$ that have no direct analog for higher-rank $X$ and $Z$). The authors noted that a similar closed-form solution for the general rank-$r$ case appeared exceedingly difficult. While the overall proof technique is sharp and descriptive, its applicability appears to be entirely limited to the rank-1 case.
4 Main results
In this paper, we bypass the difficulty of deriving a closed-form solution for $\delta_{\mathrm{soc}}(X)$ altogether by adopting a sharp lower-bound. This is based on two key insights. First, a spurious local minimum must also be a spurious critical point, so the analogous threshold over critical points gives an obvious lower-bound $\delta_{\mathrm{foc}}(X) \le \delta_{\mathrm{soc}}(X)$.
Definition 3 (Threshold for first-order condition).
Fix $Z \in \mathbb{R}^{n \times r}$. For $X \in \mathbb{R}^{n \times r}$, if $XX^T = ZZ^T$, then define $\delta_{\mathrm{foc}}(X) = 1$. Otherwise, if $XX^T \neq ZZ^T$, then define
$\delta_{\mathrm{foc}}(X) \;\equiv\; \min_{\mathcal{A}} \;\big\{\, \delta(\mathcal{A}) \;:\; \nabla f(X) = 0 \,\big\}, \qquad (5)$
where the minimum is taken over all linear measurements $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$. For a set $\mathcal{X} \subseteq \mathbb{R}^{n \times r}$ define $\delta_{\mathrm{foc}}(\mathcal{X}) \equiv \inf_{X \in \mathcal{X}} \delta_{\mathrm{foc}}(X)$.
Whereas the main obstacle in Zhang et al. [23] is the considerable difficulty of deriving a closed-form solution for $\delta_{\mathrm{soc}}(X)$, we show in this paper that it is relatively straightforward to solve for $\delta_{\mathrm{foc}}(X)$ in closed form, resulting in a simple, geometric solution.
Theorem 4.
Fix $Z \in \mathbb{R}^{n \times r}$. Given $\mathcal{A}$ satisfying $\delta$-RIP and $X$ such that $XX^T \neq ZZ^T$ and $\nabla f(X) = 0$, we have $\delta \ge \delta_{\mathrm{foc}}(X)$, where
(6) |
and $(\cdot)^{\dagger}$ denotes the Moore–Penrose pseudo-inverse. It follows that if $\delta < \delta_{\mathrm{foc}}(X)$, then $X$ is not a spurious critical point of problem (2). If $\delta \ge \delta_{\mathrm{foc}}(X)$, then there exists some $\mathcal{A}$ satisfying $\delta$-RIP such that $X$ is a spurious critical point.
The complete proof of Theorem 4 is given in Appendix A and a sketch is given in Section 5. There is a nice geometric interpretation: the exact value of $\delta_{\mathrm{foc}}(X)$ depends largely on the incidence angle between the column space of $X$ and the column space of $Z$. When the angle between the two column spaces becomes small, the projection of the error onto the column space becomes large; as a result, $\delta_{\mathrm{foc}}(X)$ becomes large. Therefore, Theorem 4 says that in regions where $X$ and $Z$ are more aligned, fewer samples are required to prevent $X$ from becoming a spurious critical point. In regions where $X$ and $Z$ are more orthogonal, a larger sample complexity is needed. Indeed, these are precisely the adversarial locations for which a large number of samples are required to prevent spurious local minima from appearing.
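One standard way to quantify the alignment described here is through the principal angles between the two column spaces. The helper below is a generic numerical recipe (not taken from the paper):

```python
import numpy as np

def principal_angles(X, Z):
    """Principal angles (in radians) between the column spaces of X and Z,
    from the singular values of Qx^T Qz with Qx, Qz orthonormal bases."""
    Qx, _ = np.linalg.qr(X)
    Qz, _ = np.linalg.qr(Z)
    cosines = np.clip(np.linalg.svd(Qx.T @ Qz, compute_uv=False), -1.0, 1.0)
    return np.arccos(cosines)
```

Small angles correspond to aligned subspaces, where few samples suffice; angles near $\pi/2$ correspond to the adversarial, nearly orthogonal configurations.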

The lower-bound $\delta_{\mathrm{foc}}$ appears conservative, because critical points are far more common than local minima over a non-convex landscape. In particular, observe that $\delta_{\mathrm{foc}}(X) \to 0$ as $X \to 0$, which makes sense because $X = 0$ is a critical point (indeed a saddle point) for all choices of $\mathcal{A}$. In other words, for any region $\mathcal{X}$ that contains $X = 0$, the lower-bound becomes trivial, as in $\delta_{\mathrm{foc}}(\mathcal{X}) = 0$. Our second insight here is that we must simultaneously have $\delta_{\mathrm{soc}}(X) \ge 1/5$ due to the global threshold of Bhojanapalli et al. [2] (or $\delta_{\mathrm{soc}}(X) \ge 1/2$ in the rank-1 case due to Zhang et al. [23]). Extending this idea over sets yields the following lower-bound
$\delta_{\mathrm{soc}}(\mathcal{X}_\epsilon) \;\ge\; \max\big\{ \delta_{\mathrm{foc}}(\mathcal{X}_\epsilon),\ \tau \big\}, \qquad (7)$
where $\tau = 1/2$ for $r = 1$ and $\tau = 1/5$ for $r > 1$. This bound is remarkably tight, as shown in Figure 1 over a range of $\epsilon$. Explicitly solving the optimization $\inf_{X \in \mathcal{X}_\epsilon} \delta_{\mathrm{foc}}(X)$ using Theorem 4 and substituting into (7) yields the following.
Theorem 5.
Let satisfy -RIP. Then we have for all , where . Hence, if
(8) |
where the constant differs between the cases $r = 1$ and $r > 1$, then problem (2) has no spurious critical point within an $\epsilon$-neighborhood of the solution:
(9) |
The complete proof of this theorem is in Appendix B. Theorem 5 says that the number of samples needed to eliminate spurious critical points within an $\epsilon$-neighborhood of the solution decreases dramatically as $\epsilon$ becomes small. Given that on the order of $nr/\delta^2$ sub-Gaussian measurements are needed to satisfy $\delta$-RIP, we can translate Theorem 5 into the following sample complexity bound.
Corollary 6.
Let $\mathcal{A}$ be a sub-Gaussian measurement ensemble. If
then with high probability there are no spurious local minima within the $\epsilon$-neighborhood $\mathcal{X}_\epsilon$ of the solution.
The proof of Corollary 6 follows immediately from Theorem 5 combined with the direct relationship between the RIP property and the sample complexity for sub-Gaussian measurement ensembles. We see that the relationship between the quality of the initial point and the number of samples saved is essentially linear: improving the quality of the initial point by a linear factor corresponds to a linear decrease in sample complexity. Moreover, the rate of improvement depends on the constant factor in the bound. This shows that in the non-convex setup of matrix sensing, there is a significant difference between a good initial point and a mediocre initial point. When this constant is large, the difference is even more pronounced.
5 Proof of Main Results
5.1 Notation and Definitions
We use $\|\cdot\|$ for the vector $\ell_2$-norm and $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. For two square matrices $A$ and $B$, $A \succeq B$ means that $A - B$ is positive semidefinite. The trace of a square matrix $A$ is denoted by $\mathrm{tr}(A)$. The vectorization $\mathrm{vec}(A)$ is the length-$n^2$ vector obtained by stacking the columns of $A$. Let $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$ be a linear measurement operator, and let $ZZ^T$ be a fixed ground truth matrix. We define $\mathbf{A} \in \mathbb{R}^{m \times n^2}$ as the matrix representation of $\mathcal{A}$, and note that $\mathbf{A}\,\mathrm{vec}(M) = \mathcal{A}(M)$ for all $M$. We define the error vector $e$ and its Jacobian $J$ to satisfy
$e \;\equiv\; \mathrm{vec}\big(XX^T - ZZ^T\big), \qquad \text{(10a)}$
$J\,\mathrm{vec}(V) \;\equiv\; \mathrm{vec}\big(XV^T + VX^T\big) \quad \text{for all } V \in \mathbb{R}^{n \times r}. \qquad \text{(10b)}$
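The sketch below realizes these objects numerically, under the interpretation that $\mathrm{vec}$ stacks columns, $e = \mathrm{vec}(XX^T - ZZ^T)$, and $J$ is the Jacobian of $\mathrm{vec}(XX^T)$ with respect to $\mathrm{vec}(X)$; these are assumptions consistent with the definitions above rather than code from the paper.

```python
import numpy as np

def vec(M):
    """Column-stacking vectorization."""
    return M.reshape(-1, order='F')

def matrix_representation(A):
    """m-by-n^2 matrix whose k-th row is vec(A_k), so it maps vec(M) to A(M)."""
    return np.stack([vec(Ak) for Ak in A])

def error_and_jacobian(X, Z):
    """Error vector e = vec(X X^T - Z Z^T) and Jacobian J with
    J @ vec(V) = vec(X V^T + V X^T) for every n-by-r matrix V."""
    n, r = X.shape
    e = vec(X @ X.T - Z @ Z.T)
    J = np.zeros((n * n, n * r))
    for k in range(n * r):
        V = np.zeros((n, r))
        V[k % n, k // n] = 1.0                 # k-th column-major basis matrix
        J[:, k] = vec(X @ V.T + V @ X.T)
    return e, J
```

Under this realization, the vectorized gradient of $f$ is $2 J^T \mathbf{A}^T \mathbf{A} e$, which is the quantity that the first-order condition in Definition 3 sets to zero.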
5.2 Proof Sketch of Theorem 4
A complete proof of Theorem 4 relies on a few technical lemmas, so we defer the details to Appendix A. The key insight is that $\delta_{\mathrm{foc}}(X)$ is the solution to a convex optimization problem, which we can solve in closed form. At first sight, evaluating $\delta_{\mathrm{foc}}(X)$ seems very difficult, as it involves solving an optimization problem over the set of RIP operators, as defined in equation (5). However, a minor modification of Theorem 8 in Zhang et al. [23] shows that $\delta_{\mathrm{foc}}(X)$ can be reformulated as a convex optimization problem of the form
(11) |
where is related to by
(12) |
We will show that problem (11) actually has a simple closed-form solution. First, we write its Lagrangian dual as
(13) | ||||
subject to | ||||
Notice that strong duality holds because Slater’s condition is trivially satisfied by the dual, which admits a strictly feasible point. It turns out that the dual problem can be rewritten as an optimization problem over the eigenvalues of a certain rank-2 matrix. The proof of this is in Appendix A.
For any $x \in \mathbb{R}$ we denote $[x]_+ \equiv \max(x, 0)$ and $[x]_- \equiv \max(-x, 0)$. The dual problem can be written as
and denotes the eigenvalues of the rank-2 matrix . It is easy to verify that the only two non-zero eigenvalues of are
It follows that
and therefore
Let $\theta^\star$ be the optimizer of the problem above; then $\theta^\star$ is simply the incidence angle between the column space of the Jacobian $J$ and the error vector $e$. Using Lemma 12 in Appendix A, we show that solving for $\theta^\star$ yields a closed-form expression for $\delta_{\mathrm{foc}}(X)$ of the form
Hence we have $\delta \ge \delta_{\mathrm{foc}}(X)$, with $\delta_{\mathrm{foc}}(X)$ given by the equation above.
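For completeness, here is a small numerical helper for the incidence angle mentioned in the sketch; it assumes the relevant subspace is the range of the Jacobian $J$ (consistent with the dimensions of the error vector $e$), and it also illustrates the pseudo-inverse that appears in Theorem 4. The function name is illustrative.

```python
import numpy as np

def angle_to_range(B, e):
    """Incidence angle between a vector e and the column space of B,
    computed via the orthogonal projector B @ pinv(B)."""
    proj = B @ (np.linalg.pinv(B) @ e)
    cos_theta = np.linalg.norm(proj) / np.linalg.norm(e)
    return np.arccos(np.clip(cos_theta, 0.0, 1.0))
```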
5.3 Proof of Theorem 5
The proof of Theorem 5 is based on the following lemma. Its proof is very technical and can be found in Appendix B.
Lemma 7.
Let and suppose that . Then
To prove Theorem 5, we simply set and write
It is easy to see that is dominated by the linear function so long as . This follows directly from the fact that is convex between and . Thus we have
Since this lower bound holds for all $X$ in $\mathcal{X}_\epsilon$, it follows that the same bound holds for $\delta_{\mathrm{foc}}(\mathcal{X}_\epsilon)$.
6 Numerical Results
In this section we give a geometric interpretation for Theorem 4, which we already alluded to in Section 4: the sample complexity to eliminate spurious critical points is small in regions where the column spaces of $X$ and $Z$ are more aligned, and large in regions where they are close to orthogonal. We also numerically verify that $\delta_{\mathrm{foc}}$ is a tight lower bound for $\delta_{\mathrm{soc}}$ over a wide range of $\epsilon$, providing numerical evidence that the bound in Theorem 5 is tight.
Our main results and geometric insights hold for any rank, but for ease of visualization we focus on the rank-1 case, where $X$ and $Z$ are now just vectors $x$ and $z$. To measure the alignment between the column space of $x$ and that of $z$ in the rank-1 case, we define the length ratio $\rho \equiv \|x\| / \|z\|$ and the incidence angle $\phi$ between $x$ and $z$.
Our goal is to plot how sample complexity depends on this alignment. Visualizing the dependence of sample complexity on $\rho$ and $\phi$ is particularly easy in rank-1 because these two parameters completely determine the values of both $\delta_{\mathrm{foc}}$ and $\delta_{\mathrm{soc}}$. See [23], Section 8.1, for a proof of this fact. This allows us to plot the level curves of $\delta_{\mathrm{foc}}$ and $\delta_{\mathrm{soc}}$ over the parameter space $(\rho, \phi)$ in Figure 2. This is shown by the blue curves. Since we are particularly interested in sample complexity near the ground truth, we also plot the level sets of the relative error using red curves. The two axes are the values of $\rho$ and $\phi$.
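In code, these two rank-1 parameters can be computed as follows (the function and symbol names are illustrative; the sign ambiguity $x \mapsto -x$ is folded into the angle with an absolute value):

```python
import numpy as np

def length_ratio_and_angle(x, z):
    """Length ratio ||x|| / ||z|| and incidence angle between the lines spanned by x and z."""
    rho = np.linalg.norm(x) / np.linalg.norm(z)
    cos_phi = abs(x @ z) / (np.linalg.norm(x) * np.linalg.norm(z))
    return rho, float(np.arccos(np.clip(cos_phi, 0.0, 1.0)))
```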
We can immediately see that in regions of the optimization landscape where $x$ is more aligned with $z$, i.e., when $\phi$ is small, the values of both threshold functions tend to be high, and a relatively small number of samples suffices to prevent $x$ from becoming a spurious critical point. However, when $x$ and $z$ become closer to being orthogonal, i.e., when $\phi$ is close to $\pi/2$, then $\delta_{\mathrm{foc}}$ becomes arbitrarily small, and $\delta_{\mathrm{soc}}$ also becomes smaller, albeit to a lesser extent. As a result, preventing $x$ from becoming a spurious critical point (or spurious local minimum) in these regions requires many more samples. This intuition carries over to the high-rank case, even though visualization becomes difficult and a slightly more general definition of length ratio and alignment is required. Similar to the rank-1 case, in regions where $X$ and $Z$ are more aligned, the sample complexity required to eliminate spurious critical points is small, and in regions where $X$ and $Z$ are close to orthogonal, a large sample complexity is required.
Regarding the tightness of using $\delta_{\mathrm{foc}}$ as a lower bound for $\delta_{\mathrm{soc}}$, note that if we look at the level sets of the relative error, we see that in regions close to the ground truth, both $\delta_{\mathrm{foc}}$ and $\delta_{\mathrm{soc}}$ become large. This is in perfect agreement with our results in Theorem 5, where we showed that a small $\epsilon$ results in a large threshold. Moreover, the shapes of the level curves of $\delta_{\mathrm{foc}}$ and $\delta_{\mathrm{soc}}$ that flow through the regions near the ground truth are almost identical. This indicates that for a large region near the ground truth, the second-order condition, i.e., the Hessian being positive semidefinite, is inactive. This is the underlying mechanism that causes $\delta_{\mathrm{foc}}$ to be a tight lower bound for $\delta_{\mathrm{soc}}$.


7 Conclusions
Recent work by Bhojanapalli et al. [2] has shown that the non-convex optimization landscape of matrix sensing contains no spurious local minima when there is a sufficiently large number of samples. However, these theoretical bounds on the sample complexity are very conservative compared to the number of samples needed in real applications like power system state estimation. In our paper, we provide one explanation for this phenomenon: in practice, we often have access to good initial points, which reduce the number of samples we need. The main results of our paper give a mathematical characterization of this phenomenon. We define a threshold function $\delta_{\mathrm{soc}}(X)$ that gives a precise threshold on the number of samples needed to prevent $X$ from becoming a spurious local minimum. Although $\delta_{\mathrm{soc}}(X)$ is difficult to compute exactly, we obtain a closed-form, sharp lower bound using convex optimization. As a result, we are able to characterize the tradeoff between the quality of the initial point and the sample complexity. In particular, we show that a linear improvement in the quality of the initial point corresponds to a linear decrease in sample complexity.
On a more general level, our work uses new techniques to paint a fuller picture of the non-convex landscape of matrix sensing: the problem becomes more “non-convex” (requiring more samples to eliminate spurious local minima) as we move further and further away from the global minimum. Once we are sufficiently far away, it becomes necessary to rely on global guarantees instead. Thus, our work brings new insight into how a non-convex problem can gradually become more tractable either through more samples or a better initial point, and provides a tradeoff between these two mechanisms. For future work, it would be interesting to see whether similar techniques can be extended to other non-convex models such as neural networks.
Acknowledgements
Partial financial support was provided by the National Science Foundation under award ECCS-1808859.
Broader Impact
Many modern applications in engineering and computer science, and in machine learning in particular, often have to deal with non-convex optimization. However, many aspects of non-convex optimization are still not well understood. Our paper provides more insight into the optimization landscape of a particular problem: low-rank matrix factorization. In addition, the methods we develop can potentially be used to understand many other non-convex problems. This is a step towards a more thorough analysis of current algorithms for non-convex optimization and also a step towards developing better and more efficient algorithms with theoretical guarantees.
References
- [1] Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. In Conference on Learning Theory, pages 530–582, 2016.
- [2] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.
- [3] Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex burer-monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pages 2757–2765, 2016.
- [4] Emmanuel J. Candès and Yaniv Plan. Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. IEEE Transactions on Information Theory, to appear, 2009.
- [5] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
- [6] Yudong Chen and Martin J Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.
- [7] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems of equations is nearly as easy as solving linear systems. In Advances in Neural Information Processing Systems, pages 739–747, 2015.
- [8] Yuejie Chi, Yue M Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239–5269, 2019.
- [9] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
- [10] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 665–674, 2013.
- [11] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE transactions on information theory, 56(6):2980–2998, 2010.
- [12] Qiuwei Li, Zhihui Zhu, and Gongguo Tang. The non-convex geometry of low-rank matrix optimization. Information and Inference: A Journal of the IMA, 8(1):51–96, 2019.
- [13] Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and computational harmonic analysis, 47(3):893–934, 2019.
- [14] Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. Foundations of Computational Mathematics, pages 1–182, 2019.
- [15] Praneeth Netrapalli, UN Niranjan, Sujay Sanghavi, Animashree Anandkumar, and Prateek Jain. Non-convex robust pca. In Advances in Neural Information Processing Systems, pages 1107–1115, 2014.
- [16] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.
- [17] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery using nonconvex optimization. In International Conference on Machine Learning, pages 2351–2360, 2015.
- [18] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.
- [19] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.
- [20] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Ben Recht. Low-rank solutions of linear matrix equations via procrustes flow. In International Conference on Machine Learning, pages 964–973, 2016.
- [21] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
- [22] Richard Zhang, Cédric Josz, Somayeh Sojoudi, and Javad Lavaei. How much restricted isometry is needed in nonconvex matrix recovery? In Advances in neural information processing systems, pages 5586–5597, 2018.
- [23] Richard Y Zhang, Somayeh Sojoudi, and Javad Lavaei. Sharp restricted isometry bounds for the inexistence of spurious local minima in nonconvex matrix recovery. Journal of Machine Learning Research, 20(114):1–34, 2019.
- [24] Qinqing Zheng and John Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In Advances in Neural Information Processing Systems, pages 109–117, 2015.
Appendix A.1
In Appendix A we fill in the missing details of the proof sketch in Section 5.2 and provide a complete proof of Theorem 4, which we restate below.
Theorem 8.
(Same as Theorem 4.) Fix $Z \in \mathbb{R}^{n \times r}$. Given $\mathcal{A}$ satisfying $\delta$-RIP and $X$ such that $XX^T \neq ZZ^T$ and $\nabla f(X) = 0$, we have $\delta \ge \delta_{\mathrm{foc}}(X)$, where
(14) |
and $(\cdot)^{\dagger}$ denotes the Moore–Penrose pseudo-inverse. It follows that if $\delta < \delta_{\mathrm{foc}}(X)$, then $X$ is not a spurious critical point of problem (2). If $\delta \ge \delta_{\mathrm{foc}}(X)$, then there exists some $\mathcal{A}$ satisfying $\delta$-RIP such that $X$ is a spurious critical point.
Before we prove the theorem above, we first prove two technical lemmas. The first lemma gives an explicit formula for the eigenvalues of a rank-2 matrix, and the second lemma characterizes the solution to an SDP that will be part of the proof of Theorem 4.
Lemma 9.
Given two vectors $u$ and $v$, the rank-2 matrix $uv^T + vu^T$ has non-zero eigenvalues
$\lambda_{\pm} = \|u\|\,\|v\|\,(\cos\theta \pm 1),$
where $\theta$ is the angle between $u$ and $v$.
Lemma 10.
Given a symmetric matrix, we can split it into a positive part and a negative part satisfying
Then the following problem has solution
Proof.
(Lemma 9). Without loss of generality, assume that (Otherwise, we can rescale and write . Now decompose into a tangent and normal component with respect to as in
where is a unit normal vector with and Thus can be written as
This shows that is spectrally similar to a matrix with eigenvalues . ∎
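A quick numerical sanity check of this eigenvalue formula, assuming the rank-2 matrix takes the symmetric form $uv^T + vu^T$; the dimension and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
u, v = rng.standard_normal(6), rng.standard_normal(6)

M = np.outer(u, v) + np.outer(v, u)                   # symmetric rank-2 matrix
eigs = np.linalg.eigvalsh(M)
nonzero = np.sort(eigs[np.abs(eigs) > 1e-10])

cos_theta = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
predicted = np.sort(np.linalg.norm(u) * np.linalg.norm(v)
                    * np.array([cos_theta - 1.0, cos_theta + 1.0]))
print(nonzero, predicted)                             # the two should agree
```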
Proof.
(Lemma 10). In this proof we will consider two cases: and . We’ll see that in the first case, the optimal value is and in the second case, the optimal value is .
First, assume that . Let be the optimal value. Then we have
(15) | ||||
(16) |
Note that the first line converts the equality constraint into a Lagrangian. The second line simply rearranges the terms. The third line plugs in . The fourth line again rearranges the terms. The last line follows from the observation that if , then the inner minimization over will go to negative infinity since the trace of can be arbitrarily large.
First, consider the case . Then we have . Since , the minimization over is achieved at . Plugging this value into the optimization problem, then (19) becomes
If , then the optimal value of the inner minimization will go to negative infinity. On the other hand, if then the minimum inside is achieved at . Thus the problem above is equivalent to
Since , the optimal value of the problem above is achieved at . Now suppose that . Then the optimal value for is achieved at . Plugging this value in, (19) becomes
Similar to before, we must have , so . Since , the optimal value in this case is just . Combining the results for and , we find that when , the optimal value is
Repeating the same arguments for when , we see that in this case the optimal value becomes
Finally, combining these two cases, i.e., and , we obtain
which completes the proof.
∎
Appendix A.2
Now we are ready to prove Theorem 4. Recall that the first-order threshold function $\delta_{\mathrm{foc}}(X)$ is defined as the solution to the following optimization problem:
Using Theorem 8 from [23], the optimization problem above can be formulated as
(17) |
where . Our goal is to solve this optimization problem in closed form. In Section 5.2, we wrote the dual of problem (17) as
(18) | ||||
and stated that this dual problem can be rewritten as an optimization problem over the eigenvalues of a rank-2 matrix. This is given in the lemma below. To simplify notation, here we define a positive/negative splitting: for any $x \in \mathbb{R}$ we denote $[x]_+ \equiv \max(x, 0)$ and $[x]_- \equiv \max(-x, 0)$. This idea can be extended to matrices by applying the splitting to the eigenvalues.
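In code, the eigenvalue-based splitting can be realized as below, assuming the convention $W = W_+ - W_-$ with both parts positive semidefinite (matching the scalar splitting $[x]_+ = \max(x, 0)$ and $[x]_- = \max(-x, 0)$):

```python
import numpy as np

def pos_neg_split(W):
    """Split a symmetric matrix as W = W_plus - W_minus with W_plus, W_minus both PSD,
    by applying the scalar splitting to the eigenvalues of W."""
    vals, vecs = np.linalg.eigh(W)
    W_plus = (vecs * np.maximum(vals, 0.0)) @ vecs.T
    W_minus = (vecs * np.maximum(-vals, 0.0)) @ vecs.T
    return W_plus, W_minus

W = np.array([[1.0, 2.0], [2.0, -3.0]])
W_plus, W_minus = pos_neg_split(W)
print(np.allclose(W, W_plus - W_minus))               # True
```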
Lemma 11.
Given data and , define
(19) | ||||
Define to be the rank-2 matrix and let denote its eigenvalues. Then can be evaluated as
where
The proof of Lemma 11 relies mainly on the two lemmas we proved in the preceding section.
Proof.
(Lemma 11). Let , where and . Thus the optimization problem (19) becomes
To solve this problem, first we keep fixed, and optimize over . This gives us the problem
Notice that if we set , then the problem above is in exactly the same form as the one in Lemma 10. Therefore, its optimal value is
Finally, to obtain , we still need to optimize over , i.e.,
Since both the numerator and the denominator are linear in , we can ignore the constraint and simply optimize over , which gives us
With Lemma 9, we see that the only two non-zero eigenvalues of are
where It follows that and . Thus
Notice that in the optimization problem above, if the minimum is achieved at some , it must also be achieved at , due to symmetry. Therefore, it suffices to optimize over only the first term , so we get
This completes the proof. ∎
Notice that Lemma 11 reduces problem (17) to an optimization that depends only on the eigenvalues of the rank-2 matrix. Now, to complete the proof of Theorem 4, we just need one additional lemma that gives a closed-form solution, which we state below.
Lemma 12.
Proof.
(Lemma 12). Define and decompose . The optimality condition for reads , so we substitute to yield
and therefore , because we have with . Now, define where , and define as the orthogonal complement of . Decompose , and , and note that
From the second line to the third, we apply a change of basis onto , which preserves the Frobenius norm. To derive the last line, notice that the matrix has full row rank, so that and . We want to show that there exists such that
Since the right hand side is symmetric, we can write it as , where is some lower-triangular matrix. Thus it suffices to show that there exists such that , which follows from the fact that has full row rank. Similarly, there exists some such that . Thus, all terms except the last one cancel out and we are left with .
Finally, note that and and that
Substituting the definition of completes the proof. ∎
Appendix B
In this section we provide a complete proof of Theorem 5, which includes all the intermediate calculations that were skipped in Section 5.3. We begin by proving a bound on $\delta_{\mathrm{foc}}$.
Lemma 13 (Same as Lemma 7).
Let and suppose that . Then
Proof.
The problem is homogeneous to scaling and for the same ; Since , we may rescale and until . Additionally, we can assume that
due to the rotational invariance of the problem. (Concretely, we compute the QR decomposition with noting that and . We then make a change of basis and ). Then, observe that
(20) |
and that because
(21) |
In order to derive a non-vacuous bound, we will need to lower-bound the term as follows
(22) |
To lower-bound , observe that
and therefore
(23) |
Finally, we substitute (20) and (21) and perform a sequence of reductions:
Step (a) sets to minimize the denominator; step (b) bounds using (22); step (c) bounds noting that a function like is increasing with ; step (d) substitutes and . ∎