Robust Sub-Gaussian Principal Component Analysis
and Width-Independent Schatten Packing
1 Introduction
We study two natural, but seemingly unrelated, problems in high-dimensional robust statistics and continuous optimization, respectively. As we will see, these problems have an intimate connection.
Problem 1: Robust sub-Gaussian principal component analysis. We consider the following statistical task, which we call robust sub-Gaussian principal component analysis (PCA). Given samples from a sub-Gaussian distribution (see Section 2 for a formal definition) with covariance , an ε fraction of which are arbitrarily corrupted, the task asks to output a unit vector with for tolerance (throughout, we use to denote the Schatten- norm; cf. Section 2 for more details). Ergo, the goal is to robustly return a -approximate top eigenvector of the covariance of the sub-Gaussian distribution . This is the natural extension of PCA to the robust statistics setting.
There has been a flurry of recent work on efficient algorithms for robust statistical tasks, e.g. covariance estimation and PCA. From an information-theoretic perspective, sub-Gaussian concentration suffices for robust covariance estimation. Nonetheless, to date all polynomial-time algorithms achieving nontrivial guarantees on covariance estimation of a sub-Gaussian distribution (including PCA specifically) in the presence of adversarial noise require additional algebraic structure. For instance, sum-of-squares certifiably bounded moments have been leveraged for polynomial-time covariance estimation [HL18, KSS18]; however, this is a stronger assumption than sub-Gaussianity.
In many applications (see discussion in [DKK+17]), the end goal of covariance estimation is PCA. Thus, a natural question which relaxes robust covariance estimation is: can we robustly estimate the top eigenvector of the covariance , assuming only sub-Gaussian concentration? Our work answers this question affirmatively via two incomparable algorithms. The first achieves in polynomial time; the second achieves in nearly-linear time under a mild gap assumption on . Moreover, both methods have nearly-optimal sample complexity.
Problem 2: Width-independent Schatten packing. We consider a natural generalization of packing semidefinite programs (SDPs) which we call Schatten packing. Given symmetric positive semidefinite matrices {A_i}_{i ∈ [n]} and a norm parameter p, a Schatten packing SDP asks to solve the optimization problem
min_{w ∈ Δ^n} ‖ ∑_{i ∈ [n]} w_i A_i ‖_p.    (1)
Here, ‖·‖_p is the Schatten-p norm of a matrix and Δ^n is the probability simplex (see Section 2). When p = ∞, (1) is the well-studied (standard) packing SDP objective [JY11, ALO16, PTZ16], which asks to find the most spectrally bounded convex combination of packing matrices. For smaller p, the objective encourages combinations more (spectrally) uniformly distributed over directions.
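To make the objective concrete, the following minimal numpy sketch evaluates the Schatten packing objective for a synthetic instance; the matrices, weights, and the choice p = 3 are placeholders for illustration only.

```python
import numpy as np

def schatten_norm(M, p):
    """Schatten-p norm of a symmetric matrix: the l_p norm of its eigenvalues."""
    eigvals = np.linalg.eigvalsh(M)
    return np.linalg.norm(eigvals, ord=p)

def packing_objective(matrices, weights, p):
    """Objective of the Schatten packing SDP: ||sum_i w_i A_i||_p."""
    combined = sum(w * A for w, A in zip(weights, matrices))
    return schatten_norm(combined, p)

# Synthetic example: random PSD matrices and a uniform point in the simplex.
rng = np.random.default_rng(0)
d, n = 5, 8
mats = []
for _ in range(n):
    B = rng.standard_normal((d, d))
    mats.append(B @ B.T)            # PSD by construction
w = np.full(n, 1.0 / n)             # uniform weights over the simplex
print(packing_objective(mats, w, p=3))
```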
The specialization of (1) to diagonal matrices is a smooth generalization of packing linear programs, previously studied in the context of fair resource allocation [MSZ16, DFO18]. For the p = ∞ case of (1), packing SDPs have the desirable property of admitting “width-independent” approximation algorithms by exploiting positivity structure. Specifically, width-independent solvers obtain multiplicative approximations with runtimes independent of, or logarithmically dependent on, size parameters of the problem. This is a strengthening of the additive notions of approximation typically used for approximate semidefinite programming. Our work gives the first width-independent solver for Schatten packing.
1.1 Previous work
Learning with adversarial outliers. The study of estimators robust to a small fraction of adversarial outliers dates back to foundational work, e.g. [Hub64, Tuk75]. Following more recent work [LRV16, DKK+19], there has been significant interest in efficient, robust algorithms for statistical tasks in high-dimensional settings. We focus on methods robustly estimating covariance properties here, and defer a thorough discussion of the (extensive) robust statistics literature to [Ste18, Li18, DK19].
There has been quite a bit of work in understanding and giving guarantees for robust covariance estimation where the uncorrupted distribution is exactly Gaussian [DKK+17, DKK+18, DKK+19, CDGW19]. These algorithms strongly use relationships between higher-order moments of Gaussian distributions via Isserlis’ theorem. Departing from the Gaussian setting, work of [LRV16] showed that if the distribution is an affine transformation of a 4-wise independent distribution, robust covariance estimation is possible. This was extended by [KSS18], which also assumed nontrivial structure in the moments of the distribution, namely that sub-Gaussianity was certifiable via the sum-of-squares proof system. To the best of our knowledge it has remained open to give nontrivial guarantees for robust estimation of any covariance properties under minimal assumptions, i.e. sub-Gaussian concentration.
All aforementioned algorithms also yield guarantees for robust PCA, by applying a top eigenvector method to the learned covariance. However, performing robust PCA via the intermediate covariance estimation step is lossy, both statistically and computationally. From a statistical perspective, samples are necessary to learn the covariance of a -dimensional Gaussian in Frobenius norm (and for known efficient algorithms for spectral norm error [DKS17]); in contrast, samples suffice for (non-robust) PCA. Computationally, even when the underlying distribution is exactly Gaussian, the best-known covariance estimation algorithms run in time ; algorithms working in more general settings based on the sum-of-squares approach require much more time. In contrast, the power method for PCA in a matrix takes time (we say if for some constant ). Motivated by this, our work initiates the direct study of robust PCA, which is often independently interesting in applications.
We remark there is another problem termed “robust PCA” in the literature, e.g. [CLMW11], under a different generative model. We defer a detailed discussion to [DKK+17], which experimentally shows that algorithms from that line of work do not transfer well to our corruption model.
Width-independent iterative methods. Semidefinite programming (SDP) and its linear programming specialization are fundamental computational tasks, with myriad applications in learning, operations research, and computer science. Though general-purpose polynomial time algorithms exist for SDPs ([NN94]), in practical settings in high dimensions, approximations depending linearly on input size and polynomially on error are sometimes desirable. To this end, approximation algorithms based on entropic mirror descent have been intensely studied [WK06, AK16, GHM15, AL17, CDST19], obtaining additive approximations to the objective with runtimes depending polynomially on , where is the “width”, the largest spectral norm of a constraint.
For structured SDPs, stronger guarantees can be obtained in terms of width. Specifically, several algorithms developed for packing SDPs ((1) with p = ∞) yield -multiplicative approximations to the objective, with logarithmic dependence on width [JY11, PTZ16, ALO16, JLL+20]. As upper bounds the objective value in this setting, in the worst case the runtimes of width-dependent solvers yielding -additive approximations have dependences similar to those of their width-independent counterparts. Width-independent solvers simultaneously yield stronger multiplicative bounds at all scales of objective value, making them desirable in suitable applications. In particular, packing SDPs have found great utility in robust statistics algorithm design [CG18, CDG19, CDGW19, DL19].
Beyond packing, width-independent guarantees in the SDP literature are few and far between; to our knowledge, other than the covering and mixed solvers of [JLL+20], ours is the first such guarantee for a broader family of objectives. (In concurrent and independent work, [CMY20] develops width-independent solvers for Ky Fan packing objectives, a different notion of generalization than the Schatten packing objectives we consider.) Our method complements analogous extensions in the width-dependent setting, e.g. [ALO15], as well as width-independent solvers for packing linear programs [MSZ16, DFO18]. We highlight the fair packing solvers of [MSZ16, DFO18], motivated by problems in equitable resource allocation, which further solved packing variants for . We find analogous problems in semidefinite settings interesting, and defer them to future work.
1.2 Our results
Robust sub-Gaussian principal component analysis. We give two algorithms for robust sub-Gaussian PCA (we follow the distribution and corruption model described in Assumption 1). Both are nearly sample-optimal, run in polynomial time, and assume only sub-Gaussianity. The first is via a simple filtering approach, summarized here and developed in Section 3.
Theorem 1.
Our second algorithm is more efficient under mild conditions, but yields a worse approximation for . Specifically, if there are few eigenvalues of larger than , our algorithm runs in nearly-linear time. Note that if there are many eigenvalues above this threshold, then the PCA problem itself is not very well-posed; our algorithm is very efficient in the interesting setting where the approximate top eigenvector is identifiable. We state our main algorithmic guarantee here, and defer details to Section 5.
Theorem 2.
We remark that samples are necessary for a -approximation to the top eigenvector of via uncorrupted samples from , so our first method is sample-optimal, as is our second up to a factor.
Width-independent Schatten packing. Our second method crucially requires an efficient solver for Schatten packing SDPs. We demonstrate that Schatten packing, i.e. (1) for arbitrary , admits width-independent solvers. We state an informal guarantee, and defer details to Section 4.
Theorem 3.
Let , and . There is an algorithm taking iterations, returning a multiplicative approximation to the problem (1). For odd , each iteration can be implemented in time nearly-linear in the number of nonzeros amongst all .
2 Preliminaries
General notation. denotes the set . Applied to a vector, is the norm; applied to a symmetric matrix, is the Schatten- norm, i.e. the norm of the spectrum. The dual norm of is for ; when , . is the -dimensional simplex (subset of positive orthant with -norm ) and we define to be the truncated simplex:
(2)
Matrices. is symmetric matrices, and is the positive semidefinite subset. is the identity of appropriate dimension. , , and Tr are the largest and smallest eigenvalues and trace of a symmetric matrix. For , and we use the Loewner order , ( iff ). The seminorm of is .
Fact 1.
For , with compatible dimension, . For , .
Fact 2.
We have the following characterization of the Schatten- norm: for , and ,
For , the satisfying is , so has spectrum .
Distributions. We denote drawing vector from distribution by , and the covariance of is . We say scalar distribution is -sub-Gaussian if and
Multivariate has sub-Gaussian proxy if its restriction to any unit is -sub-Gaussian, i.e.
(3)
We consider the following standard model for gross corruption with respect to a distribution .
Assumption 1 (Corruption model, see [DKK+19]).
Let be a mean-zero distribution on with covariance and sub-Gaussian proxy for a constant . Denote by index set with a set of (uncorrupted) samples . An adversary arbitrarily replaces points in ; we denote the new index set by , where is the (unknown) set of points added by an adversary, and is the set of points from that were not changed.
As we only estimate covariance properties, the assumption that is mean-zero only loses constants in problem parameters, by pairing samples and subtracting them (cf. [DKK+19], Section 4.5.1).
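As a loose illustration of this corruption model (not a restatement of Assumption 1, whose parameters are given above), the following sketch draws mean-zero Gaussian samples and lets an "adversary" replace an ε fraction of them with an arbitrary choice; the planted covariance and the adversarial strategy here are illustrative assumptions.

```python
import numpy as np

def corrupt_samples(n, d, eps, rng):
    """Draw n mean-zero Gaussian samples, then replace an eps fraction arbitrarily."""
    # Ground-truth covariance with a planted top direction (illustrative choice).
    Sigma = np.eye(d)
    Sigma[0, 0] = 5.0
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)   # uncorrupted samples
    k = int(eps * n)                                           # number of replaced points
    bad_idx = rng.choice(n, size=k, replace=False)             # indices handed to the adversary
    # One arbitrary adversarial strategy: plant a spurious direction e_1 + e_2.
    spike = np.zeros(d)
    spike[0] = spike[1] = 10.0
    X[bad_idx] = spike
    return X, bad_idx, Sigma

rng = np.random.default_rng(1)
X, bad, Sigma = corrupt_samples(n=2000, d=10, eps=0.05, rng=rng)
```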
3 Robust sub-Gaussian PCA via filtering
In this section, we sketch the proof of Theorem 1, which gives guarantees on our filtering algorithm for robust sub-Gaussian PCA. This algorithm obtains stronger statistical guarantees than Theorem 2, at the cost of super-linear runtime; the algorithm is given as Algorithm 6. Our analysis stems largely from concentration facts about sub-Gaussian distributions, as well as the following (folklore) fact regarding estimation of variance along any particular direction.
Lemma 1.
In other words, we show that using corrupted samples, we can efficiently estimate a -multiplicative approximation of the variance of in any unit direction (Corollary 5 gives a slightly stronger guarantee, namely that reusing samples does not break dependencies of ). This proof is deferred to Appendix B for completeness. Algorithm 6 combines this key insight with a soft filtering approach, suggested by the following known structural fact found in previous work (e.g. Lemma A.1 of [DHL19], see also [SCV17, Ste18]).
Lemma 2.
Let , be sets of nonnegative reals, and . Define , for all . Consider any disjoint partition , of with . Then, .
Our Algorithm 6, , takes as input a set of corrupted samples following Assumption 1 and the corruption parameter . At a high level, it initializes a uniform weight vector , and iteratively operates as follows (we denote by the empirical covariance ); a schematic code sketch follows the list.

1. Compute an approximate top eigenvector of via power iteration.
2. Compute .
3. If , then terminate and return .
4. Else:
(a) Sort indices by , with smallest.
(b) Let be the smallest set for which , and apply the downweighting procedure of Lemma 2 to this subset of indices.
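The following is a schematic sketch of this filtering loop. The termination slack and the size of the downweighted set are hypothetical placeholder constants (Algorithm 6 fixes them precisely), and `robust_var(X, u, eps)` stands in for the robust one-dimensional variance estimate of Lemma 1; the sketch illustrates control flow rather than the exact algorithm.

```python
import numpy as np

def top_eigvec(M, iters=50, rng=None):
    """Approximate top eigenvector of a symmetric PSD matrix via power iteration."""
    rng = rng or np.random.default_rng(0)
    v = rng.standard_normal(M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

def soft_filter_pca(X, eps, robust_var, max_iters=100, slack=1.2):
    """Schematic soft-filtering loop for robust PCA (constants are illustrative)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                        # uniform initial weights
    u = None
    for _ in range(max_iters):
        cov = (X * w[:, None]).T @ X               # weighted empirical covariance
        u = top_eigvec(cov)                        # step 1: approximate top eigenvector
        scores = (X @ u) ** 2                      # quadratic forms <x_i, u>^2
        sigma_u = robust_var(X, u, eps)            # step 2: robust variance in direction u
        if w @ scores <= slack * sigma_u:          # step 3: estimates agree, so terminate
            return u
        # Step 4: downweight the points with the largest scores (Lemma 2-style update).
        order = np.argsort(scores)[::-1]
        m = max(1, int(2 * eps * n))               # hypothetical size of the filtered set
        top = order[:m]
        w[top] *= 1.0 - scores[top] / max(scores[top].max(), 1e-12)
    return u
```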
The analysis of Algorithm 6 then proceeds in two stages.
Monotonicity of downweighting.
We show the invariant criteria for Lemma 2 (namely, that for the set in every iteration, there is more spectral mass on bad points than good) holds inductively for our algorithm. Specifically, lack of termination implies puts significant mass on bad directions, which combined with concentration of good directions yields the invariant. The details of this argument can be found as Lemma 13.
Roughly uniform weightings imply approximation quality.
As Lemma 2 then applies, the procedure always removes more mass from bad points than good, and thus can only remove at most mass in total by the corruption model. Thus, the weights are always roughly uniform (in ), which by standard concentration facts (see Appendix A) implies that the quality of the approximate top eigenvector is good. Moreover, the iteration count is bounded by roughly because whenever the algorithm does not terminate, enough mass is removed from large spectral directions. Combining this with the termination criteria implies that when a vector is returned, it is a close approximation to the top direction of . Details can be found as Lemma 15 and in the proof of Theorem 1.
4 Schatten packing
4.1 Mirror descent interpretation of [MRWZ16]
We begin by reinterpreting the [MRWZ16] solver, which achieves the state-of-the-art parallel runtime for packing LPs (the [MRWZ16] solver also generalizes to covering and mixed objectives; we focus on packing in this work). An packing LP algorithm solves the following decision problem. (Packing linear programs are sometimes expressed as the optimization problem , similarly to (1); these problems are equivalent up to a standard binary search, see e.g. the discussion in [JLL+20].)
Problem 1 ( packing linear program).
Given entrywise nonnegative , either find primal solution with or dual solution with .
The following result is shown in [MRWZ16].
Our interpretation of the analysis of [MRWZ16] combines two ingredients: a potential argument and mirror descent, where the latter yields a dual feasible point if did not grow sufficiently.
Potential argument. The potential used by [MRWZ16] is , well-known to be a -additive approximation of . As soon as or reaches the scale , by nonnegativity this becomes a multiplicative guarantee, motivating the setting of threshold . To prove the potential is monotone, [MRWZ16] uses step size and a Taylor approximation; combining with the termination condition yields the desired claim.
Mirror descent. To certify that grows sufficiently (e.g. the method terminates in few iterations, else dual feasibility holds), we interpret the step as approximate entropic mirror descent. Specifically, we track the quantity , and show that if has not grown sufficiently, then it must be bounded for every , certifying dual feasibility. Formally, for any sequence and , we show
The last inequality followed from being an upwards truncation. If is bounded (else, we have primal feasibility), we show the entire above expression is bounded for any . Thus, by setting and choosing to be each coordinate indicator, it follows that the average of all is coordinatewise at least , and solves Problem 1 as a dual solution.
Our is chosen as the (truncated) gradient of the function used in the potential analysis, so its form allows us to interpret dual feasibility (e.g. has norm 1 and is a valid dual point). Our analysis follows the pattern of standard mirror descent, complemented by side information which says that the lack of a primal solution can transform a regret guarantee into a feasibility bound. We apply this framework to analyze variants of Problem 1, via different potentials; nonetheless, our proofs are quite straightforward upon adopting this perspective, and we believe it may yield new insights for instances with positivity structure.
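For concreteness, here is a heavily simplified numpy sketch of one step of an update of this flavor for the standard packing setting. The smoothing parameter, step size, and truncation rule are illustrative choices standing in for the precise settings of [MRWZ16] and Algorithm 1; the sketch is meant only to show where the potential and the truncated gradient enter.

```python
import numpy as np

def smax(v, mu):
    """Soft-max potential: overestimates max(v) by at most mu * log(len(v))."""
    m = v.max()
    return m + mu * np.log(np.sum(np.exp((v - m) / mu)))

def packing_lp_step(A, x, eps):
    """One illustrative multiplicative-weights step for an l_infty packing problem.

    Parameter choices here are hypothetical; they mimic, but do not reproduce, [MRWZ16].
    """
    m, n = A.shape
    mu = eps / (4.0 * np.log(m))            # smoothing scale: smax is an eps/4-additive proxy
    z = A @ x / mu
    y = np.exp(z - z.max())
    y /= y.sum()                            # gradient of smax(Ax): a point in the simplex
    g = A.T @ y                             # load of each coordinate under the dual point y
    v = np.clip(1.0 - g, 0.0, None)         # truncated gradient ("upwards truncation")
    x_new = x + eps * x * v                 # multiplicative step, larger where load is small
    return x_new, y
```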
4.2 -norm packing linear programs
In this section, we give a complete self-contained example of the framework proposed in Section 4.1, for approximately solving norm packing linear programs. Specifically, we now consider the generalization of Problem 1 to norms; throughout, is the dual norm.
Problem 2 ( packing linear program).
Given entrywise nonnegative , either find primal solution with or dual solution with .
For , Problem 2 recovers Problem 1 up to constants as multiplicatively approximates by . We now state our method for solving Problem 2 as Algorithm 2.
Other than changing parameters, the only difference from Algorithm 1 is that is a point with unit norm induced by the gradient of our potential . We state our main potential fact; its proof, a straightforward Taylor expansion of , is deferred to Appendix C for brevity.
Lemma 3.
In all iterations of Algorithm 2, defining , .
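For intuition on the gradient-induced dual point mentioned above, the following few lines (a sketch, with a nonnegative vector z standing in for Ax and p > 1) verify numerically that y = (z/‖z‖_p)^{p−1} has unit dual norm and attains ⟨y, z⟩ = ‖z‖_p, matching the dual characterization in Fact 2.

```python
import numpy as np

p = 3.0
q = p / (p - 1.0)                                 # dual exponent: 1/p + 1/q = 1
z = np.array([0.5, 1.0, 2.0, 4.0])                # stand-in for the nonnegative vector Ax
zp = np.linalg.norm(z, ord=p)
y = (z / zp) ** (p - 1.0)                          # gradient-induced dual point
print(np.linalg.norm(y, ord=q))                    # = 1, so y is a valid dual point
print(y @ z, zp)                                   # <y, z> = ||z||_p (Holder equality case)
```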
We now prove our main result, which leverages the potential bound following the framework of Section 4.1. In the proof, we assume that entries of are bounded by ; this does not incur more loss than a constant multiple of in the guarantees, and a proof can be found as Lemma 16.
Proof.
The runtime follows from Line 7 (each iteration cost is dominated by multiplication through ), so we prove correctness. Define potential as in Lemma 3, and note that as ,
The second inequality followed from our assumption on entry sizes (Lemma 16). If Algorithm 2 breaks out of the while loop of Line 4, we have by Lemma 3 that for returned on Line 11,
Thus, primal feasibility is always correct. We now prove correctness of dual feasibility. First, let be the Kullback-Leibler divergence from to , for , . Define the normalized points in each iteration. Expanding definitions,
(4)
The only inequality used the bounds, for ,
Telescoping (4) over all iterations, and using for all since is uniform, we have that whenever Line 4 is not satisfied before the check on Line 7 (i.e. ),
(5)
The last inequality used by assumption, and . Next, since each entrywise, defining ,
(6)
Combining (5) and (6), and rearranging, yields by definition of ,
The last claim follows by setting to be each coordinate-sparse simplex vector. Finally, since , to show that is a correct dual solution to Problem 2 it suffices to show . This follows as is an average of the , convexity of norms, and that for all ,
∎
4.3 Schatten-norm packing semidefinite programs
We generalize Algorithm 2 to solve Schatten packing semidefinite programs, which we now define.
Problem 3.
Given , either find primal solution with or dual solution , with for all .
We assume that is an odd integer for simplicity (this suffices for our applications), and leave the cases when is even or noninteger as interesting future work. The potential used in the analysis and an overall guarantee are stated here; the proofs are deferred to Appendix C. They are simple modifications of Lemma 3 and Theorem 4, using trace inequalities (similar to those in [JLL+20]) in place of scalar inequalities, as well as efficient approximation of the quantities in Line 5 via the standard technique of Johnson-Lindenstrauss projections.
Lemma 4.
In all iterations of Algorithm 3, defining , .
4.4 Schatten packing with a constraint
We remark that the framework outlined in Section 4.1 is flexible enough to handle mixed-norm packing problems. Specifically, developments in Section 5 require the following guarantee.
Proposition 2.
Our method, found in Appendix C, approximately solves (7) by first applying a standard binary search to place on the right scale, for which it suffices to solve an approximate decision problem. Then, we apply a truncated mirror descent procedure on the potential , and prove correctness for solving the decision problem following the framework we outlined in Section 4.1.
5 Robust sub-Gaussian PCA in nearly-linear time
We give our nearly-linear time robust PCA method, leveraging developments of Section 4. Throughout, we will be operating under Assumption 1, for some corruption parameter with ; suffices. We now develop tools to prove Theorem 2.
Algorithm 4 uses three subroutines: our earlier method (Lemma 1), an application of our earlier Proposition 2 to approximate the solution to
(8)
and a method for computing approximate eigenvectors by [MM15] (discussed in Appendix D).
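At a very high level, Algorithm 4 composes these three subroutines. The following schematic sketch shows the data flow only; `solve_mixed_packing`, `top_eigvecs`, and `robust_variance` are hypothetical stand-ins for the solver of Proposition 2, the eigenvector routine of Proposition 3, and the estimator of Lemma 1, and the number of candidate directions k and all thresholds are placeholders.

```python
import numpy as np

def robust_pca_outline(X, eps, k, solve_mixed_packing, top_eigvecs, robust_variance):
    """Schematic data flow of the nearly-linear time robust PCA method.

    All three callables are hypothetical stand-ins for the paper's subroutines.
    """
    n, d = X.shape
    # Reweight samples by (approximately) solving the mixed packing SDP (8).
    w = solve_mixed_packing(X, eps)                  # weights in the truncated simplex
    Y = (X * w[:, None]).T @ X                       # weighted empirical second moment
    # Compute k orthonormal approximate top eigenvectors of Y (Proposition 3).
    Q = top_eigvecs(lambda V: Y @ V, d, k)
    # Robustly score each candidate direction and return the best one (Lemma 1).
    scores = [robust_variance(X, Q[:, j], eps) for j in range(k)]
    return Q[:, int(np.argmax(scores))]
```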
Proposition 3.
There is an algorithm (Algorithm 1, [MM15]), parameterized by , tolerance , , and , which outputs orthonormal with the guarantee
(9)
Here, is the largest eigenvalue of . The total time required by the method is .
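For intuition, the following is a simplified simultaneous power iteration sketch returning k orthonormal approximate top eigenvectors; it is a stand-in for, and converges more slowly than, the randomized block Krylov method of [MM15] referenced in Proposition 3.

```python
import numpy as np

def approx_top_eigvecs(matvec, d, k, iters, rng=None):
    """Simultaneous power iteration: orthonormal approximations to the top-k eigenspace.

    `matvec` applies the (implicit) PSD matrix to a d x k block; this is a simplified
    stand-in for the randomized block Krylov method of [MM15].
    """
    rng = rng or np.random.default_rng(0)
    Q = np.linalg.qr(rng.standard_normal((d, k)))[0]   # random orthonormal start block
    for _ in range(iters):
        Q = np.linalg.qr(matvec(Q))[0]                 # power step + re-orthonormalization
    return Q

# Example usage on an explicit PSD matrix M (in the paper, matvecs are done implicitly).
rng = np.random.default_rng(2)
B = rng.standard_normal((50, 50))
M = B @ B.T
Q = approx_top_eigvecs(lambda V: M @ V, d=50, k=5, iters=30, rng=rng)
print(np.diag(Q.T @ M @ Q))                            # approximate top-5 eigenvalues
```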
Algorithm 4 is computationally bottlenecked by the application of Proposition 2 on Line 2 and the call to on Line 4, from which the runtime guarantee of Theorem 2 follows straightforwardly. To demonstrate correctness, we first certify the quality of the solution to (8).
Lemma 5.
Let . With probability , the uniform distribution over attains value for objective (8), where for a universal constant .
The proof of this is similar to results in e.g. [DKK+19, Li18], and combines concentration guarantees with a union bound over all possible corruption sets . This implies the following immediately, upon applying the guarantees of Proposition 2.
Corollary 1.
Let be the output of Line 2 of . Then, we have , and under the guarantee of Lemma 5.
Let be the output of the solver. Recall that . Additionally, define
(10)
Notice in particular that , and that all these matrices are PSD. We next prove the second, crucial fact, which says that is a good approximator to in Loewner ordering:
Lemma 6.
Let . With probability at least , .
The proof combines the strategy in Lemma 5 with the guarantee of the SDP solver. Perhaps surprisingly, Corollary 1 and Lemma 6 are the only two properties about that our final analysis of Theorem 2 will need. In particular, we have the following key geometric proposition, which carefully combines trace inequalities to argue that the corrupted points cannot create too many new large eigendirections.
Proposition 4.
Proof.
For concreteness, we will define the parameters
For these choices of , , we will use the following (loose) approximations for sufficiently small :
(12)
Suppose for contradiction that all for . By applying the guarantee of Corollary 1 and Fact 2, it follows that
(13)
Let be the largest index such that , and note that . We define
That is, is the restriction of to its top eigendirections. Then,
(14)
In the second line, we used Lemma 6 twice, as well as the trace inequality Lemma 7 with and . Combining (13) with (14), and expanding the definition of , yields
(15)
The third line followed from the spectral bound of Lemma 6, and the fourth followed from the fact that , eigendecompose , as well as the assumption for all . Letting , and using both approximations in (12),
(16)
Next, we bound the last term of (15). By using ,
(17)
The second line used by definition of , Lemma 8 (twice), and . Combining (17) and (16) and plugging into (15),
By the choice of and (i.e. ), we attain a contradiction. ∎
In the proof of Proposition 4, we used the following facts.
Lemma 7.
Let be symmetric matrices and a positive integer. Then we have
Proof.
For any ,
The first step used the Extended Lieb-Thirring trace inequality for , (see e.g. Lemma 2.1, [ALO16]), and the second . Finally, induction on yields the claim. ∎
Lemma 8.
For all , .
Proof.
By the Courant-Fischer minimax characterization of eigenvalues,
However, we also have (Lemma 6), yielding the conclusion. ∎
The guarantees of Proposition 4 were geared towards exact eigenvectors of the matrix . We now modify the analysis to tolerate inexactness in the eigenvector computation, in line with the processing of Line 5 of our Algorithm 4. This yields our final claim in Theorem 2.
Corollary 2.
Proof.
Assume for contradiction that all have . We outline modifications to the proof of Proposition 4. Specifically, we redefine the matrix by
Because is a projection matrix, it is clear . Therefore, by combining the derivations (13) and (14), it remains true that
We now bound these two terms in an analogous way from Proposition 4, with negligible loss; combining these bounds will again yield a contradiction. First, we have the lower bound
Here, the last inequality applied the assumption (9) with respect to . Next, we upper bound
The first line used , the second used the definition of , the third used our assumption , and the last used (9) with respect to . Finally, the remaining derivation (17) is tolerant to additional factors of , yielding the same conclusion up to constants. ∎
Finally, we prove Theorem 2 by combining the tools developed thus far.
Proof of Theorem 2.
Correctness of the algorithm is immediate from Corollary 2 and the guarantees of . Concretely, Corollary 2 guarantees that one of the vectors we produce will be a -approximate top eigenvector (say some index ), and will only lose a negligible fraction of this quality (see Lemma 1); the best returned eigenvector as measured by can only improve the guarantee. Finally, the failure probability follows by combining the guarantees of Lemmas 1, 5, and 6.
We now discuss runtime. The complexity of lines 2, 4, and 5, as guaranteed by Proposition 2, Proposition 3, and Lemma 1 are respectively (recalling )
Throughout we use that we can compute matrix-vector products with an arbitrary linear combination of the in time ; it is easy to check that in all runtime guarantees, nnz can be replaced by this computational cost. Combining these bounds yields the final conclusion. ∎
Acknowledgments
We thank Swati Padmanabhan and Aaron Sidford for helpful discussions.
References
- [AK16] Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefinite programs. J. ACM, 63(2):12:1–12:35, 2016.
- [AL17] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the compressed leader: Faster online learning of eigenvectors and faster MMWU. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 116–125, 2017.
- [ALO15] Zeyuan Allen-Zhu, Zhenyu Liao, and Lorenzo Orecchia. Spectral sparsification and regret minimization beyond matrix multiplicative updates. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 237–245, 2015.
- [ALO16] Zeyuan Allen-Zhu, Yin Tat Lee, and Lorenzo Orecchia. Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 1824–1831, 2016.
- [CDG19] Yu Cheng, Ilias Diakonikolas, and Rong Ge. High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2755–2771, 2019.
- [CDGW19] Yu Cheng, Ilias Diakonikolas, Rong Ge, and David Woodruff. Faster algorithms for high-dimensional robust covariance estimation. arXiv preprint arXiv:1906.04661, 2019.
- [CDST19] Yair Carmon, John C. Duchi, Aaron Sidford, and Kevin Tian. A rank-1 sketch for matrix multiplicative weights. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pages 589–623, 2019.
- [CG18] Yu Cheng and Rong Ge. Non-convex matrix completion against a semi-random adversary. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 1362–1394, 2018.
- [CLMW11] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.
- [CMY20] Yeshwanth Cherapanamjeri, Sidhanth Mohanty, and Morris Yau. List decodable mean estimation in nearly linear time. CoRR, abs/2005.09796, 2020.
- [DFO18] Jelena Diakonikolas, Maryam Fazel, and Lorenzo Orecchia. Width-independence beyond linear objectives: Distributed fair packing and covering algorithms. CoRR, abs/1808.02517, 2018.
- [DG03] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60–65, 2003.
- [DHL19] Yihe Dong, Samuel Hopkins, and Jerry Li. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. In Advances in Neural Information Processing Systems, pages 6065–6075, 2019.
- [DK19] Ilias Diakonikolas and Daniel M Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.
- [DKK+17] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 999–1008. JMLR. org, 2017.
- [DKK+18] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robustly learning a Gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2683–2702. SIAM, 2018.
- [DKK+19] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM J. Comput., 48(2):742–864, 2019.
- [DKS17] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84. IEEE, 2017.
- [DL19] Jules Depersin and Guillaume Lecué. Robust subgaussian estimation of a mean vector in nearly linear time. CoRR, abs/1906.03058, 2019.
- [GHM15] Dan Garber, Elad Hazan, and Tengyu Ma. Online learning of eigenvectors. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 560–568, 2015.
- [HL18] Samuel B Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1021–1034, 2018.
- [Hub64] Peter J Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
- [JLL+20] Arun Jambulapati, Yin Tat Lee, Jerry Li, Swati Padmanabhan, and Kevin Tian. Positive semidefinite programming: Mixed, parallel, and width-independent. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 2020.
- [JY11] Rahul Jain and Penghui Yao. A parallel approximation algorithm for positive semidefinite programming. In IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS 2011, Palm Springs, CA, USA, October 22-25, 2011, pages 463–471, 2011.
- [KSS18] Pravesh K Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1035–1046, 2018.
- [Li18] Jerry Zheng Li. Principled approaches to robust machine learning and beyond. PhD thesis, Massachusetts Institute of Technology, 2018.
- [LRV16] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016.
- [MM15] Cameron Musco and Christopher Musco. Randomized block krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1396–1404, 2015.
- [MRWZ16] Michael W. Mahoney, Satish Rao, Di Wang, and Peng Zhang. Approximating the solution to mixed packing and covering LPs in parallel Õ(ε^{-3}) time. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 52:1–52:14, 2016.
- [MSZ16] Jelena Marasevic, Clifford Stein, and Gil Zussman. A fast distributed stateless algorithm for alpha-fair packing problems. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 54:1–54:15, 2016.
- [NN94] Yurii Nesterov and Arkadi Nemirovski. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, 1994.
- [PTZ16] Richard Peng, Kanat Tangwongsan, and Peng Zhang. Faster and simpler width-independent parallel algorithms for positive semidefinite programming. CoRR, abs/1201.5135, 2016.
- [RH17] Philippe Rigollet and Jan-Christian Hütter. High-Dimensional Statistics. 2017.
- [SCV17] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. arXiv preprint arXiv:1703.04940, 2017.
- [Ste18] Jacob Steinhardt. Robust Learning: Information Theory and Algorithms. PhD thesis, Stanford University, 2018.
- [Tuk75] John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531, 1975.
- [Ver16] Roman Vershynin. High-Dimensional Probability, An Introduction with Applications in Data Science. 2016.
- [WK06] Manfred K. Warmuth and Dima Kuzmin. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 1481–1488, 2006.
Appendix A Concentration
A.1 Sub-Gaussian concentration
We use the following concentration facts about sub-Gaussian distributions, which follow from standard techniques, and give an application bounding Schatten-norm deviations.
Lemma 9.
Under Assumption 1, there are universal constants , such that
Proof.
By observing (3), it is clear that the random vector for has covariance and sub-Gaussian proxy . For any fixed unit vector , by Lemma 1.12 of [RH17], the random variable is sub-exponential with parameter , so by Bernstein’s inequality (Theorem 1.13, [RH17]), defining for each ,
For shorthand define , and let be a maximal -net of the unit ball (as measured in distance). By Lemma 1.18 of [RH17], , so by a union bound,
Next, by a standard application of the triangle inequality (see e.g. Exercise 4.3.3, [Ver16])
with probability at least for appropriate , . The conclusion follows since its statement is scale invariant, so it suffices to show as we have that
∎
Corollary 3.
Let . Under Assumption 1, there are universal constants , with
A.2 Concentration under weightings in
We consider concentration of the empirical covariance under weightings which are not far from uniform, in spectral and Schatten senses.
Lemma 10.
Under Assumption 1, let , , and for a sufficiently large constant. Then for a universal constant ,
Proof.
Because the vertices of are uniform over sets with (see e.g. Section 4.1, [DKK+19]), by convexity of the Schatten- norm it suffices to prove
For any fixed , and recalling , we can decompose this sum as
(18)
By applying Corollary 3, it follows by setting and our choice of that
(19)
Moreover, for any fixed , setting where is a sufficiently large constant, so that for sufficiently small , ,
(20)
Here, we used that . Finally, union bounding over all possible sets implies that with probability at least , the following events hold:
Combining these bounds in the context of (18) after applying the triangle inequality, we have with probability at least for all the desired conclusion,
∎
Corollary 4.
Under Assumption 1, let for a sufficiently large constant. For universal and all , with probability at least ,
Proof.
Consider any unit vector . By similar arguments as in (19), (20), and applying a union bound over all with , with probability at least , it follows from Lemma 9 that
(21)
(22)
Therefore, again using the formula (18) and the triangle inequality yields the desired conclusion for all directions , which is equivalent to the spectral bound of the lemma statement. ∎
Appendix B Deferred proofs from Section 3
B.1 Robust univariate variance estimation
In this section, we prove Lemma 1, which allows us to robustly estimate the quadratic form of a vector in the covariance of a sub-Gaussian distribution from corrupted samples. Algorithm 5 is folklore, and intuitively very simple; it projects all samples onto , throws away the fraction of points with largest magnitude in this direction, and takes the mean of the remaining set.
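A minimal sketch of this truncation-based estimator follows; the fraction of discarded points is an illustrative constant multiple of ε (Algorithm 5 fixes the exact constant), and mean-zero samples are assumed as in Assumption 1.

```python
import numpy as np

def robust_1d_variance(X, u, eps):
    """Estimate the variance of <x, u> from eps-corrupted samples (X is n x d, u a unit vector).

    Following the description above: project onto u, discard the samples with the largest
    magnitude in this direction, and average the squared projections of the rest. The
    factor 2 in the discarded fraction is an illustrative choice, not the paper's constant.
    """
    proj = X @ u
    n = len(proj)
    keep = int(np.ceil((1.0 - 2.0 * eps) * n))        # drop the ~2*eps*n largest magnitudes
    order = np.argsort(np.abs(proj))                   # sort by |<x_i, u>|
    kept = proj[order[:keep]]
    return np.mean(kept ** 2)
```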
We require the following helper fact.
Fact 3.
Let be a sub-exponential random variable with parameter at most (we say mean-zero is sub-exponential with parameter if , ), and let . Then, for any event with , .
Proof.
We have by Hölder’s inequality that for any with ,
The second inequality is Lemma 1.10 [RH17]. Setting yields the result. ∎
See 1
Proof.
The runtime claim is immediate; we now turn our attention to correctness. We follow notation of Assumption 1, and in a slight abuse of notation, also define for . First, for , then is sub-exponential with parameter at most (Lemma 1.12, [RH17]). By Bernstein’s inequality, we have that if , then for all ,
(23)
Using this in a standard Chernoff bound, we have that with probability ,
(24)
Let , and let be distributed as , where . We observe is also sub-exponential with parameter , and that by Fact 3,
(25)
Define the interval and let be the set of points in that survive the truncation procedure, so that . Given event (24), for all , since there are at most points in outside , and . We decompose the deviation as follows:
(26)
Here we overloaded to mean that lies in the interval , and conditioned on lying entirely in . We bound each of these terms individually. First, for all , conditioning on (24) (i.e. all ), is an independent sample from . Thus, by (25) and Bernstein’s inequality,
(27)
with (conditional) probability at least . By a union bound, both events occur with probability at least ; condition on this for the remainder of the proof. Under this assumption, we control the other three terms of (26). Observe that , , and . Further, by definition of , every summand is at most . Thus,
(28)
(29)
(30)
Combining (27), (28), (29), and (30) in derivation (26) and dividing by yields the claim. ∎
Finally, we also give an alternative set of conditions under which we can certify correctness of . Specifically, this assumption will be useful in lifting independence assumptions between and our samples in repeated calls within Algorithm 6.
Assumption 2.
Under Assumption 1, let the following conditions hold for universal constant :
(31)
(32)
Note that (32) is a factor weaker in its guarantee than Corollary 4, and is over weights in a different set . Standard sub-Gaussian concentration (i.e. an unweighted variant of Corollary 4) and modifying the proof of Corollary 4 to take the constraint set and normalizing over vertex sets of size yield the following conclusion.
Lemma 11.
Let for a sufficiently large constant. Assumption 2 holds with probability at least .
We give a variant of Lemma 1 with slightly stronger guarantees for ; specifically, it holds for all simultaneously for a fixed set of samples satisfying Assumption 2.
Corollary 5.
Proof.
We discuss how to modify the derivations from Lemma 1 appropriately in the absence of applications of Bernstein’s inequality. First, note that appropriately combining (31) and (32) in a derivation such as (18) yields the following bound (deterministically under Assumption 2):
(33)
Now, consider the decomposition (26). We claim first that similarly to (28), (29), (30) we can bound each summand in the latter three terms by ; to prove this, it suffices to show that at least one filtered attains this bound, as then by definition of the algorithm, each non-filtered will as well. Note that a fraction between and of points in is filtered (since there are only points from ). The assumption (32) then implies precisely the desired bound on some filtered by placing uniform mass on filtered points from , and applying pigeonhole. So, all non-filtered are bounded by , yielding analogous statements to (28), (29), (30).
B.2 Preliminaries
For convenience, we give the following preliminaries before embarking on our proof of Theorem 1 and giving guarantees on Algorithm 6. First, we state a set of assumptions which augments Assumption 2 with one additional condition, used in bounding the iteration count of our algorithm.
Assumption 3.
Standard sub-Gaussian concentration inequalities and a union bound, combined with our earlier claim Lemma 11, then yield the following guarantee.
Lemma 12.
Let for a sufficiently large constant. Assumption 3 holds with probability at least .
B.3 Analysis of
For this section, for any nonnegative weights , define . We now state our algorithm, . At all iterations , it maintains a current nonnegative weight vector (initialized to be the uniform distribution on ), preserving the following invariants for all :
(35)
We assume that in Line 8, we also have , as we can assume at least one point is corrupted, i.e. (else standard algorithms suffice for our setting), so adding an additional can only change the sum by . We first prove that the invariants (35) are preserved; at a high level, we simply demonstrate that Lemma 2 holds via concentration on and lack of termination.
Lemma 13.
Proof.
The first part of (35) is immediate by observing the update in Line 9, so we show the second. We drop subscripts and superscripts for conciseness and focus on a single iteration . Let , and . By Lemma 2, it suffices to demonstrate that
(36)
First, , so by definition of index , we have . Define if , and otherwise, and observe . By modifying constants appropriately from (32), it follows from definition of that
(37)
On the other hand, by (33) we know that the total quadratic form over is bounded as
(38)
Here, we applied the observation that the normalized restricted to are in (e.g. using Lemma 14 inductively). However, since we did not terminate (Line 5), we must have by being a top eigenvector and Corollary 5 (we defer discussions of inexactness to Theorem 1) that
To obtain the last conclusion, we used (38). Finally, note that for all ,
by rearranging (37). This implies that
Thus, the desired inequality (36) follows from combining the above derivations, e.g. using (37) and
∎
Lemma 13 yields for all that by telescoping. Note that we can only remove at most mass from in total, as . For shorthand, denote the normalized weights . Then, the following is immediate by .
Using Lemma 14, we show that the output has the desired quality of being a large eigenvector.
Proof.
We assume for now that is an exact top eigenvector, and discuss inexactness while proving Theorem 1. By (33) and Lemma 14, as then the normalized restriction of to is in ,
We used the Courant-Fischer characterization of eigenvalues, and that is a top eigenvector of . Moreover, by termination conditions and Corollary 5 (correctness of ),
Combining these two bounds and rescaling yields the conclusion. ∎
Proof.
First, we will operate under Assumption 3, which holds with probability at least . It is clear that the analyses of Lemmas 13 and 15 hold with multiplicative approximations of the top eigenvector computation, which the power method provides with high probability. Thus, each iteration takes time , and we union bound over the number of iterations. We now give an iteration bound: in any iteration where we do not terminate, Lemma 2 implies
Here, the second line used Assumption 3, the third used that the are in sorted order, and the last used the definition of as well as the derivations of Lemma 15 (specifically, that spectrally dominates for roughly uniform ). The conclusion follows since there can be at most iterations, as the algorithm terminates when a fraction of the mass is removed, giving the overall runtime claim. ∎
Appendix C Deferred proofs from Section 4
C.1 Proofs from Section 4.2
Since our notion of approximation is multiplicative, we can assume without more than constant loss that has bounded entries. This observation is standard, and formalized in the following lemma.
Lemma 16 (Entrywise bounds on ).
Feasibility of Problem 2 is unaffected (up to constants in ) by removing columns of with entries larger than .
Proof.
If for any entry, then , else is already larger than . Ignoring all such entries of and rescaling can only change the objective by a factor. ∎
See 3
Proof.
Fix an iteration . Define , and note ; henceforth in this proof, we will drop subscripts when clear. Observe that
As , entrywise. Via for , it follows that
By direct manipulation of the above quantity, and recalling we defined ,
Using , i.e. , we thus obtain
Cauchy-Schwarz yields that , . Substituting into the above,
(39)
Finally, to bound this latter quantity, since , we observe that for all either or , in which case
Thus, plugging this bound into (39) entrywise,
Rearranging yields the desired claim. ∎
C.2 Proofs from Section 4.3
Our analysis of Algorithm 3 will use the following helper fact.
Lemma 17 (Spectral bounds on ).
Feasibility of Problem 3 is unaffected (up to constants in ) by removing matrices with an eigenvalue larger than .
Proof.
The proof is identical to that of Lemma 16; we also require the additional fact that the Schatten norm is monotone in the Loewner order, forcing the constraint . ∎
We remark that we can perform this preprocessing procedure via power iteration on each .
See 4
Proof.
Drop and define . For simplicity, define the matrices
We recall the Lieb-Thirring inequality . Applying this, we have
As , we have . Applying the bounds for , where we use that commutes with all , it follows that
Definitions of , , , and preservation of positiveness under Schur complements imply
Thus, . Applying this and recalling ,
By , taking roots we thus have
Finally, the conclusion follows as in Lemma 3; by linearity of trace and ,
Here, we used the inequality for all nonzero ,
∎
See 5
Proof.
The proof is analogous to that of Theorem 4; we sketch the main differences here. By applying Lemma 17 and monotonicity of Schatten norms in the Loewner order, we again have , implying correctness whenever the algorithm terminates on Line 4. Correctness of dual certification again follows from lack of termination and the choice of , as well as setting to indicate each coordinate. Finally, the returned matrix in Line 8 is correct by convexity of the Schatten- norm, and the fact that all have unit Schatten- norm.
We now discuss issues regarding computing in Line 5 of the algorithm, the bottleneck step; these techniques are standard in the approximate SDP literature, and we defer a more formal discussion to e.g. [JLL+20]. First, note that each coordinate of requires us to compute
(40)
We estimate the two quantities in the above expression each to multiplicative error with high probability. Union bounding over iterations, and modifying Lemma 4 to use the potential , the analysis remains valid up to constants in with this multiplicative approximation quality. We now discuss our approximation strategies.
For shorthand, denote . To estimate the denominator of (40), it suffices to multiplicatively approximate within a factor, as raising to the power can only improve this. To do so, we use the well-known fact (e.g. [DG03]) that letting be a matrix with independent entries , for , with probability ,
to a factor. To read this from the standard Johnson-Lindenstrauss guarantee, it suffices to factorize and use that the norm of each row of the square root is preserved with high probability under multiplication by , and then apply the cyclic property of the trace. Similarly, for each , we can approximate the numerators via
We can simultaneously compute all such quantities by first applying matrix-vector multiplications through to each row of , and then computing all quadratic forms. In total, the computational cost per iteration of all approximations is as desired. ∎
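As a concrete stand-in for this sketching step (the estimator below is a Hutchinson/Johnson-Lindenstrauss style trace estimator in the same spirit, not the exact factorization described above), one can estimate Tr(Y^p) using only k matrix-vector products with Y per power:

```python
import numpy as np

def sketch_trace_power(Y, p, k, rng=None):
    """Randomized estimate of Tr(Y^p) for symmetric Y via a k-row sketch.

    Uses only matrix-vector products with Y, never forming Y^p explicitly; the
    estimator is unbiased since E[g g^T] = I for Rademacher rows g.
    """
    rng = rng or np.random.default_rng(0)
    d = Y.shape[0]
    G = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)   # rows g_j / sqrt(k)
    V = G.T                                                   # d x k block of test vectors
    for _ in range(p):
        V = Y @ V                                             # V becomes Y^p G^T
    return np.trace(G @ V)                                    # (1/k) sum_j g_j^T Y^p g_j
```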
C.3 Proof of Proposition 2
In this section, following our prior developments, we prove the following claim.
See 2
C.3.1 Reduction to a decision problem
Given access to an oracle for the following approximate decision problem, we can implement an efficient binary search for estimating OPT. Specifically, letting the range of OPT be , we can subdivide the range into multiplicative intervals of width , and then perform a binary search using our decision oracle. This incurs a multiplicative overhead in the setting of Proposition 2 (see Appendix A, [JLL+20], for a more formal treatment).
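As a generic illustration of this reduction (the oracle, the bracketing range [lo, hi], and the tolerance are abstract placeholders), a multiplicative binary search over OPT can be organized as follows.

```python
import math

def multiplicative_binary_search(decide, lo, hi, eps):
    """Locate OPT within a (1 + eps) multiplicative factor using a decision oracle.

    `decide(t)` should return True iff the decision problem is feasible at scale t,
    and OPT is assumed to lie in [lo, hi]. The bisection runs over exponents, so
    roughly log(log(hi / lo) / eps) oracle calls suffice.
    """
    num_intervals = max(1, math.ceil(math.log(hi / lo) / math.log1p(eps)))
    left, right = 0, num_intervals
    while right - left > 1:
        mid = (left + right) // 2
        if decide(lo * (1.0 + eps) ** mid):
            right = mid
        else:
            left = mid
    return lo * (1.0 + eps) ** right

# Example: find the smallest t with t >= 7.3, up to a 1% multiplicative factor.
print(multiplicative_binary_search(lambda t: t >= 7.3, lo=1.0, hi=100.0, eps=0.01))
```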
Problem 4.
Given , either find primal solution with , , or conclude no satisfies , .
C.3.2 Preliminaries
We use the shorthand , and , so and are interchangeable up to factors. In other words, Problem 4 asks to certify whether there exists with
(41)
up to multiplicative tolerance on either side. Consider the potential function
(42)
It is clear that the first term of approximates the left hand side of (41) up to an additive factor, so if any of , , or reaches the scale and is bounded by , we can safely terminate and conclude primal feasibility for Problem 4. Next, we compute
(43)
The following helper lemma will be useful in concluding dual infeasibility of Problem 4.
Lemma 18.
Proof.
From the definitions in (43), it is clear that , where , are the dual norms of , respectively. Moreover, by the definition of , we have for all ,
This follows from the dual definition of the norm (see Fact 2). Now, note that for some nonnegative , summing to 1, using the above claim and (43),
as desired (here, we used positivity of all relevant quantities). ∎
C.3.3 Potential monotonicity
We prove a monotonicity property regarding the potential in (42).
Lemma 19.
Let satisfy , , let entrywise, and let , where . Then, .
Proof.
Denote for simplicity the threshold and the step vector . First, by prior calculations in Lemma 3 and Lemma 4, it follows that
Next, note that by entrywise and lack of termination (i.e. the threshold ),
Therefore, by for ,
(44)
Moreover, by applying Cauchy-Schwarz and the threshold once more,
(45)
Combining (44) and (45) (and applying similar reasoning to the term ), we conclude
Recall the inequality for nonnegative . Expanding the definition of and (cf. (42)), and plugging in the above bounds, we conclude that
As before, we show that this sum is entrywise nonpositive. For any with , we have
as desired, where we used that . This yields the conclusion . ∎
C.3.4 Algorithm and analysis
Proof of Proposition 2.
Correctness of the reduction to deciding Problem 4 follows from the discussion in Section C.3.1. Moreover, by the given Algorithm 7, it is clear (following e.g. the preprocessing of Lemma 17) that throughout the algorithm, so whenever the algorithm terminates we have primal feasibility. It suffices to prove that whenever the problem admits with
then the algorithm terminates on Line 5 in iterations. Analogously to Theorem 4, we have
Next, since is an upwards truncation of , applying Lemma 18 implies that
The conclusion follows by the definition of , as desired. Finally, the iteration complexity follows analogously to the discussion in Theorem 5’s proof, where the only expensive cost is estimating coordinates of the component of every iteration. ∎
Finally, we remark that by opening up the dual certificates , of our mirror descent analysis, we can in fact implement a stronger version of the decision Problem 4 which returns a feasible dual certificate whenever the primal problem is infeasible. We omit this extension for brevity, as it is unnecessary for our applications, but it is analogous to the analysis of Theorem 5.
Appendix D Deferred proofs from Section 5
D.1 Proof of Proposition 3
See 3
Proof.
We claim that Algorithm 1 of [MM15], applied to the matrix with a careful choice of exponent, yields this guarantee. Specifically, we choose , both of which satisfy the criteria in their main theorem, such that the iterates produced by simultaneous power iteration with exponent and with exponent are identical; it suffices to choose a multiple of . Thus, we can also apply their guarantees to and apply a union bound. Notice that their Algorithm 1 also contains some postprocessing to ensure that they obtain singular values in the right space, which is unnecessary for us, as our matrices are Hermitian. ∎
D.2 Proof of Lemma 5
See 5
Proof.
Lemma 10 implies that letting be the uniform distribution over the uncorrupted samples amongst , we have with probability at least , and denoting ,
Therefore, the mixed - packing semidefinite program
is feasible. This completes the proof. ∎
D.3 Proof of Lemma 6
See 6
Proof.
We follow the notation of (10). First, by the guarantees of Corollary 1,
Therefore, again applying Corollary 1, for all ,
We conclude that the set of weights belongs to . By applying Corollary 4 to these weights and adjusting the definition of by a constant, we conclude with probability at least
The conclusion follows by multiplying through by , and using the definition . ∎