Sublinear Time Eigenvalue Approximation via Random Sampling
Abstract
We study the problem of approximating the eigenspectrum of a symmetric matrix $A \in \mathbb{R}^{n \times n}$ with bounded entries (i.e., $\|A\|_\infty \le 1$). We present a simple sublinear time algorithm that approximates all eigenvalues of $A$ up to additive error $\pm \epsilon n$ using those of a randomly sampled principal submatrix. Our result can be viewed as a concentration bound on the complete eigenspectrum of a random submatrix, significantly extending known bounds on just the singular values (the magnitudes of the eigenvalues). We give improved error bounds of $\pm \epsilon \sqrt{\mathrm{nnz}(A)}$ and $\pm \epsilon \|A\|_F$ when the rows of $A$ can be sampled with probabilities proportional to their sparsities or their squared norms respectively. Here $\mathrm{nnz}(A)$ is the number of non-zero entries in $A$ and $\|A\|_F$ is its Frobenius norm. Even for the strictly easier problems of approximating the singular values or testing the existence of large negative eigenvalues (Bakshi, Chepurko, and Jayaram, FOCS ’20), our results are the first that take advantage of non-uniform sampling to give improved error bounds. From a technical perspective, our results require several new eigenvalue concentration and perturbation bounds for matrices with bounded entries. Our non-uniform sampling bounds require a new algorithmic approach, which judiciously zeroes out entries of a randomly sampled submatrix to reduce variance, before computing the eigenvalues of that submatrix as estimates for those of $A$. We complement our theoretical results with numerical simulations, which demonstrate the effectiveness of our algorithms in practice.
1 Introduction
Approximating the eigenvalues of a symmetric matrix is a fundamental problem – with applications in engineering, optimization, data analysis, spectral graph theory, and beyond. For an $n \times n$ matrix, all eigenvalues can be computed to high accuracy using direct eigendecomposition in $O(n^\omega)$ time, where $\omega \approx 2.37$ is the exponent of matrix multiplication [DDHK07, AW21]. When just a few of the largest magnitude eigenvalues are of interest, the power method and other iterative Krylov methods can be applied [Saa11]. These methods repeatedly multiply the matrix of interest by query vectors, requiring $O(n^2)$ time per multiplication when the matrix is dense and unstructured.
For large $n$, it is desirable to have even faster eigenvalue approximation algorithms, running in $o(n^2)$ time – i.e., sublinear in the size of the input matrix. Unfortunately, for general matrices, no non-trivial approximation can be computed in $o(n^2)$ time: without reading $\Omega(n^2)$ entries, it is impossible to distinguish with reasonable probability if all entries (and hence all eigenvalues) are equal to zero, or if there is a single pair of arbitrarily large entries at positions $(i,j)$ and $(j,i)$, leading to a pair of arbitrarily large eigenvalues. Given this, we seek to address the following question:
Under what assumptions on a symmetric input matrix can we compute non-trivial approximations to its eigenvalues in $o(n^2)$ time?
It is well known that time eigenvalue computation is possible for highly structured inputs, like tridiagonal or Toeplitz matrices [GE95]. For sparse or structured matrices that admit fast matrix vector multiplication, one can compute a small number of the largest in magnitude eigenvalues in time using iterative methods. Through the use of robust iterative methods, fast top eigenvalue estimation is also possible for matrices that admit fast approximate matrix-vector multiplication, such as kernel similarity matrices [GS91, HP14, BIMW21]. Our goal is to study simple, sampling-based sublinear time algorithms that work under much weaker assumptions on the input matrix.
1.1 Our Contributions
Our main contribution is to show that a very simple algorithm can be used to approximate all eigenvalues of any symmetric matrix with bounded entries. In particular, for any symmetric $A$ with $\|A\|_\infty \le 1$, sampling a random principal submatrix of $A$ of sufficiently large size and appropriately scaling its eigenvalues yields a $\pm \epsilon n$ additive error approximation to all eigenvalues of $A$ with good probability. (Here and throughout, $\tilde O(\cdot)$ hides logarithmic factors in the argument.) Note that by scaling, our algorithm gives a $\pm \epsilon \|A\|_\infty n$ approximation for any symmetric $A$. This result is formally stated below.
Theorem 1 (Sublinear Time Eigenvalue Approximation).
Let be symmetric with and eigenvalues . Let be formed by including each index independently with probability as in Algorithm 1. Let be the corresponding principal submatrix of , with eigenvalues .
For all with , let . For all with , let . For all other , let . If , for large enough constant , then with probability , for all ,
See Figure 1 for an illustration of how the eigenvalues of are mapped to estimates for all eigenvalues of . Note that the principal submatrix sampled in Theorem 1 will have rows/columns with high probability. Thus, with high probability, the algorithm reads just entries of and runs in time. Standard matrix concentration bounds imply that one can sample random entries from the random submatrix and preserve its eigenvalues to error with probability [AM07]. See Appendix F for a proof. This can be directly combined with Theorem 1 to give improved sample complexity:
Corollary 1 (Improved Sample Complexity via Entrywise Sampling).
Let be symmetric with and eigenvalues . For any , there is an algorithm that reads entries of and returns, with probability at least , for each satisfying .
Observe that the dependence on the failure probability $\delta$ in Theorem 1 and Corollary 1 can be improved via standard arguments: running the algorithm with constant failure probability, repeating $O(\log(1/\delta))$ times, and taking the median estimate for each eigenvalue. This guarantees that the algorithm fails with probability at most $\delta$, at the expense of an $O(\log(1/\delta))$ dependence in the complexity.
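This boosting step is simple to implement; a minimal sketch in Python, where `estimate_eigenvalues` is a stand-in (our name) for one run of Algorithm 1 with constant failure probability:

```python
import numpy as np

def boosted_estimates(A, eps, num_repeats, estimate_eigenvalues):
    """Repeat a constant-failure-probability estimator and take the
    entrywise median; the failure probability of each median estimate
    drops exponentially with num_repeats."""
    trials = np.array([estimate_eigenvalues(A, eps) for _ in range(num_repeats)])
    # Median over repetitions, separately for each eigenvalue index.
    return np.median(trials, axis=0)
```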
Comparison to known bounds. Theorem 1 can be viewed as a concentration inequality on the full eigenspectrum of a random principal submatrix of . This significantly extends prior work, which was able to bound just the spectral norm (i.e., the magnitude of the top eigenvalue) of a random principal submatrix [RV07, Tro08a]. Bakshi, Chepurko, and Jayaram [BCJ20] recently identified developing such full eigenspectrum concentration inequalities as an important step in expanding our knowledge of sublinear time property testing algorithms for bounded entry matrices.
Standard matrix concentration bounds [GT11] can be used to show that the singular values of (i.e., the magnitudes of its eigenvalues) are approximated by those of a random submatrix (see Appendix G) with independently sampled rows and columns. However, such a random matrix will not be symmetric or even have real eigenvalues in general, and thus no analogous bounds were previously known for the eigenvalues themselves.
Recently, Bakshi, Chepurko, and Jayaram [BCJ20] studied the closely related problem of testing positive semidefiniteness in the bounded entry model. They show how to test whether the minimum eigenvalue of is either greater than or smaller than by reading just entries. They show that this result is optimal in terms of query complexity, up to logarithmic factors. Like our approach, their algorithm is based on random principal submatrix sampling. Our eigenvalue approximation guarantee strictly strengthens the testing guarantee – given approximations to all eigenvalues, we immediately solve the testing problem. Thus, our query complexity is tight up to a factor. It is open if our higher sample complexity is necessary to solve the harder full eigenspectrum estimation problem. See Section 1.4 for further discussion.
Improved bounds for non-uniform sampling. Our second main contribution is to show that, when it is possible to efficiently sample rows/columns of $A$ with probabilities proportional to their sparsities or their squared norms, significantly stronger eigenvalue estimates can be obtained. In particular, letting $\mathrm{nnz}(A)$ denote the number of nonzero entries in $A$ and $\|A\|_F$ denote its Frobenius norm, we show that sparsity-based sampling yields eigenvalue estimates with error $\pm\epsilon\sqrt{\mathrm{nnz}(A)}$ and norm-based sampling gives error $\pm\epsilon\|A\|_F$. See Theorems 2 and 3 for formal statements. Observe that when $\|A\|_\infty \le 1$, the eigenvalues of $A$ are bounded in magnitude by $\|A\|_F \le \sqrt{\mathrm{nnz}(A)} \le n$. Thus, Theorems 2 and 3 are natural strengthenings of Theorem 1. Row norm-based sampling (Theorem 3) additionally removes the bounded entry requirement of Theorems 1 and 2.
As discussed in Section 1.3.1, sparsity-based sampling can be performed in sublinear time when $A$ is stored in a slightly augmented sparse matrix format or when $A$ is the adjacency matrix of a graph accessed in the standard graph query model of the sublinear algorithms literature [GR97]. Norm-based sampling can also be performed efficiently with an augmented matrix format, and is commonly studied in randomized and ‘quantum-inspired’ algorithms for linear algebra [FKV04, Tan18].
Theorem 2 (Sparse Matrix Eigenvalue Approximation).
Let be symmetric with and eigenvalues . Let be formed by including the th index independently with probability as in Algorithm 2. Here is the number of non-zero entries in the row of . Let be the corresponding principal submatrix of , and let be the estimate of computed from as in Algorithm 2. If , for large enough constant , then with probability , for all , .
Theorem 3 (Row Norm Based Matrix Eigenvalue Approximation).
Let be symmetric and eigenvalues . Let be formed by including the th index independently with probability as in Algorithm 3. Here is the norm of the row of . Let be the corresponding principal submatrix of , and let be the estimate of computed from as in Algorithm 3. If , for large enough constant , then with probability , for all ,
The above non-uniform sampling theorems immediately yield algorithms for testing the presence of a negative eigenvalue with magnitude at least $\epsilon\sqrt{\mathrm{nnz}(A)}$ or $\epsilon\|A\|_F$ respectively, strengthening the testing results of [BCJ20], which require eigenvalue magnitude at least $\epsilon n$. In the graph property testing literature, there is a rich line of work exploring the testing of bounded degree or sparse graphs [GR97, BSS10]. Theorem 2 can be thought of as a first step in establishing a related theory of sublinear time approximation algorithms and property testers for sparse matrices.
Surprisingly, in the non-uniform sampling case, the eigenvalue estimates derived from the sampled submatrix cannot simply be its scaled eigenvalues, as in Theorem 1. E.g., when $A$ is the identity, our row sampling probabilities are uniform in all cases. However, the scaled submatrix will be a scaled identity, with eigenvalues much larger than $1$ – failing to give a good approximation to the true eigenvalues (all of which are $1$) unless the sample is very large. To handle this, and related cases, we must argue that selectively zeroing out entries in sufficiently low probability rows/columns of the sampled submatrix (see Algorithms 2 and 3) does not significantly change the spectrum, and ensures concentration of the submatrix eigenvalues. It is not hard to see that simple random submatrix sampling fails even for the easier problem of singular value estimation. Theorems 2 and 3 give the first results of their kind for this problem as well.
1.2 Related Work
Eigenspectrum estimation is a key primitive in numerical linear algebra, typically known as spectral density estimation. The eigenspectrum is viewed as a distribution with mass at each of the eigenvalues, and the goal is to approximate this distribution [WWAF06, LSY16]. Applications include identifying motifs in social networks [DBB19], studying Hessian and weight matrix spectra in deep learning [SBL16, YGL+18, GKX19], ‘spectrum splitting’ in parallel eigensolvers [LXES19], and the study of many systems in experimental physics and chemistry [Wan94, SR94, HBT19].
Recent work has studied sublinear time spectral density estimation for graph structured matrices – Braverman, Krishnan, and Musco [BKM22] show that the spectral density of a normalized graph adjacency or Laplacian matrix can be estimated to $\epsilon$ error in the Wasserstein distance in sublinear time. Cohen-Steiner, Kong, Sohler, and Valiant study a similar setting, giving a runtime depending only on $1/\epsilon$ [CSKSV18]. We note that the additive error eigenvalue approximation result of Theorem 1 (analogously Theorems 2 and 3) directly gives an $\epsilon n$ approximation to the spectral density in the Wasserstein distance – extending the above results to a much broader class of matrices. When $\|A\|_\infty \le 1$, $A$ can have eigenvalues as large as $n$, while the normalized adjacency matrices studied in [CSKSV18, BKM22] have eigenvalues in $[-1, 1]$. So, while the results are not directly comparable, our Wasserstein error of $\epsilon n$ can be thought of as on the order of their error of $\epsilon$ after scaling.
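For intuition on the Wasserstein guarantee: for two spectra of the same length, viewed as uniform distributions over their eigenvalues, the Wasserstein-1 distance is simply the average absolute difference between sorted eigenvalues. A minimal sketch (function name ours):

```python
import numpy as np

def wasserstein1_spectra(eigs_a, eigs_b):
    """W1 distance between two equal-size spectra with uniform weights:
    the mean absolute gap between sorted eigenvalues. Per-eigenvalue
    additive error of eps*n therefore implies W1 error at most eps*n."""
    a = np.sort(np.asarray(eigs_a, dtype=float))
    b = np.sort(np.asarray(eigs_b, dtype=float))
    return float(np.mean(np.abs(a - b)))
```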
Our work is also closely related to a line of work on sublinear time property testing for bounded entry matrices, initiated by Balcan et al. [BLWZ19]. In that work, they study testing of rank, Schatten-$p$ norms, and several other global spectral properties. Sublinear time testing algorithms for the rank and other properties have also been studied under low-rank and bounded row norm assumptions on the input matrix [KS03, LWW14]. Recent work studies positive semidefiniteness testing and eigenvalue estimation in the matrix-vector query model, where each query computes $Ax$ for some chosen vector $x$. As in Theorem 3, eigenvalue estimation up to error $\pm\epsilon\|A\|_F$ can be achieved with a small number of such queries in this model [NSW22]. Finally, several works study streaming algorithms for eigenspectrum approximation [AN13, LNW14, LW16]. These algorithms are not sublinear time – they require at least linear time to process the input matrix. However, they use sublinear working memory. Note that Theorem 1 immediately gives a sublinear space streaming algorithm for eigenvalue estimation. We can simply store the sampled submatrix as its entries are updated.
1.3 Technical Overview
In this section, we overview the main techniques used to prove Theorem 1, and then discuss how these techniques are extended to prove Theorems 2 and 3. We start by defining a decomposition of any symmetric $A$ into the sum of two matrices containing its large and small magnitude eigendirections.
Definition 1.1 (Eigenvalue Split).
Let be symmetric. For any , let where is diagonal, with the eigenvalues of with magnitude on its diagonal, and has the corresponding eigenvectors as columns. Similarly, let where has the eigenvalues of with magnitude on its diagonal and has the corresponding eigenvectors as columns. Then, can be decomposed as
Any principal submatrix of , , can be similarly written as
where are the corresponding submatrices obtained by sampling rows of .
Since , and are all symmetric, we can use Weyl’s eigenvalue perturbation theorem [Wey12] to show that for all eigenvalues of ,
(1)
We will argue that the eigenvalues of approximate those of – i.e. all eigenvalues of with magnitude . Further, we will show that is small with good probability. Thus, via (1), the eigenvalues of approximate those of . In the estimation procedure of Theorem 1, all other small magnitude eigenvalues of are estimated to be , which will immediately give our approximation bound when the original eigenvalue has magnitude .
Bounding the eigenvalues of . The first step is to show that the eigenvalues of well-approximate those of . As in [BCJ20], we critically use that the eigenvectors corresponding to large eigenvalues are incoherent – intuitively, since is bounded, their mass must be spread out in order to witness a large eigenvalue. Specifically, [BCJ20] shows that for any eigenvector of with corresponding eigenvalue , . We give related bounds on the Euclidean norms of the rows of (the leverage scores of ), and on these rows after weighting by .
Using these incoherence bounds, we argue that the eigenvalues of approximate those of up to error. A key idea is to bound the eigenvalues of , which are identical to the non-zero eigenvalues of . Via a matrix Bernstein bound and our incoherence bounds on , we show that this matrix is close to with high probability. However, since may be complex, the matrix is not necessarily Hermitian and standard perturbation bounds [SgS90, HJ12] do not apply. Thus, to derive an eigenvalue bound, we apply a perturbation bound of Bhatia [Bha13], which generalizes Weyl’s inequality to the non-Hermitian case, with a factor loss. To the best of our knowledge, this is the first time that perturbation theory bounds for non-Hermitian matrices have been used to prove improved algorithmic results in the theoretical computer science literature.
We note that in Appendix B, we give an alternate bound, which instead analyzes the Hermitian matrix , whose eigenvalues are again identical to those of . This approach only requires Weyl’s inequality, and yields an overall bound of , improving the factors of Theorem 1 at the cost of worse dependence.
Bounding the spectral norm of . The next step is to show that all eigenvalues of are small provided a sufficiently large submatrix is sampled. This means that the “middle” eigenvalues of , i.e., those with smaller magnitude, do not contribute much to any eigenvalue . To do so, we apply a theorem of [RV07, Tro08a] which shows concentration of the spectral norm of a uniformly random submatrix of an entrywise bounded matrix. Observe that while , such a bound will not in general hold for . Nevertheless, we can use the incoherence of to show that is bounded, which, via the triangle inequality, yields a bound on . In the end, we show that if , with probability at least , . After the scaling in the estimation procedure of Theorem 1, this spectral norm bound translates into an additive error in approximating the eigenvalues of $A$.
Completing the argument. Once we establish the above bounds on and , Theorem 1 is essentially complete. Any eigenvalue in with magnitude will correspond to a nearby eigenvalue in and in turn, given our spectral norm bound on . An eigenvalue in with magnitude may or may not correspond to a nearby eigenvalue in (it will only if it lies in the range ). In any case however, in the estimation procedure of Theorem 1, such an eigenvalue will either be estimated using a small eigenvalue of , or be estimated as $0$. In both instances, the estimate will give error, as required.
Can we beat $\epsilon n$ additive error? It is natural to ask if our approach can be improved to yield sublinear time algorithms with stronger relative error approximation guarantees for $A$’s eigenvalues. Unfortunately, this is not possible – consider a matrix with just a single pair of entries set to $1$. To obtain relative error approximations to the two non-zero eigenvalues (which are $\pm 1$), we must find this pair, as otherwise we cannot distinguish $A$ from the all zeros matrix. This requires reading $\Omega(n^2)$ of $A$’s entries. More generally, consider $A$ with a random $\epsilon n \times \epsilon n$ principal submatrix populated by all $1$s, and with all other entries equal to $0$. $A$ has largest eigenvalue $\epsilon n$. However, if we read $o(1/\epsilon^2)$ entries of $A$, with good probability, we will not see even a single one, and thus we will not be able to distinguish $A$ from the all zeros matrix. This example establishes that any sublinear time algorithm with query complexity $o(1/\epsilon^2)$ must incur additive error at least $\epsilon n$.
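This intuition is easy to check numerically; a small simulation (parameters ours, chosen for illustration only) that plants an all-ones principal block and counts how often a budget of uniform entry queries misses it entirely:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, num_queries, trials = 2000, 100, 200, 500   # roughly n^2/s^2 = 400 queries needed on average

misses = 0
for _ in range(trials):
    # Plant an s x s all-ones principal block on a random index set.
    in_block = np.zeros(n, dtype=bool)
    in_block[rng.choice(n, size=s, replace=False)] = True
    # Query num_queries uniformly random entries (i, j) of the matrix.
    rows = rng.integers(0, n, size=num_queries)
    cols = rng.integers(0, n, size=num_queries)
    misses += not np.any(in_block[rows] & in_block[cols])

# Fraction of trials in which every queried entry was 0, i.e., the planted
# matrix (top eigenvalue s) is indistinguishable from the all-zeros matrix.
print(misses / trials)
```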
1.3.1 Improved Bounds via Non-Uniform Sampling
We now discuss how to give improved approximation bounds via non-uniform sampling. We focus on the bound of Theorem 2 using sparsity-based sampling. The proof of Theorem 3 for row norm sampling follows the same general ideas, but with some additional complications.
Theorem 2 requires sampling a submatrix , where each index is included in with probability . We reweight each sampled row by . Thus, if entry is sampled, it is scaled by . When the rows have uniform sparsity (so all ), this ensures that the full submatrix is scaled by , as in Theorem 1.
The proof of Theorem 2 follows the same outline as that of Theorem 1: we first argue that the outlying eigenvectors in are incoherent, giving a bound on the norm of each row of in terms of . We then apply a matrix Bernstein bound and Bhatia’s non-Hermitian eigenvalue perturbation bound to show that the eigenvalues of approximate those of up to .
Bounding the spectral norm of . The major challenge is showing that the subsampled middle eigendirections do not significantly increase the approximation error by bounding the by . This is difficult since the indices in are sampled nonuniformly, so existing bounds [Tro08a] on the spectral norm of uniformly random submatrices do not apply. We extend these bounds to the non-uniform sampling case, but still face an issue due to the rescaling of entries by . In fact, without additional algorithmic modifications, is simply not bounded by ! For example, as already discussed, if is the identity matrix, we get and so , assuming . Relatedly, suppose that is tridiagonal, with zeros on the diagonal and ones on the first diagonal above and below the main diagonal. Then, if , with constant probability, one of the ones will be sampled and scaled by . Thus, we will again have , assuming . Observe that this issue arises even when trying to approximate just the singular values (the eigenvalue magnitudes) via sampling. Thus, while an analogous bound to the uniform sampling result of Theorem 1 can easily be given for singular value estimation via matrix concentration inequalities (see Appendix G), to the best of our knowledge, Theorems 2 and 3 are the first of their kind even for singular value estimation.
Zeroing out entries in sparse rows/columns. To handle the above cases, we prove a novel perturbation bound, arguing that if we zero out any entry of where , then the eigenvalues of are not perturbed by more than . This can be thought of as a strengthening of Gershgorin’s circle theorem, which would ensure that zeroing out entries in rows/columns with does not perturb the eigenvalues by more than . Armed with this perturbation bound, we argue that if we zero out the appropriate entries of before computing its eigenvalues, then since we have removed entries in very sparse rows and columns which would be scaled by a large factor in , we can bound . This requires relating the magnitudes of the entries in to those in using the incoherence of the top eigenvectors, which gives bounds on the entries of .
Sampling model. We note that the sparsity-based sampling of Theorem 2 can be efficiently implemented in several natural settings. Given a matrix stored in sparse format, i.e., as a list of nonzero entries, we can easily sample a row $i$ with probability $\mathrm{nnz}(A_i)/\mathrm{nnz}(A)$ by sampling a uniformly random non-zero entry and looking at its corresponding row. Via standard techniques, we can convert several such samples into a sampled set close in distribution to having each index included independently with the probability used in Algorithm 2. If we store the row sparsities $\mathrm{nnz}(A_i)$, we can also efficiently access each sampling probability, which is needed for rescaling and zeroing out entries. Also observe that if $A$ is the adjacency matrix of a graph, in the standard graph query model [GR97], it is well known how to approximately count edges and sample them uniformly at random, i.e., compute $\mathrm{nnz}(A)$ and sample its nonzero entries, in sublinear time [GR08, ER18]. Further, it is typically assumed that one has access to the node degrees, i.e., to $\mathrm{nnz}(A_i)$. Thus, our algorithm can naturally be used to estimate spectral graph properties in sublinear time.
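For concreteness, here is a minimal sketch of how a row can be drawn with probability proportional to its sparsity from a SciPy CSR matrix (our choice of representation); converting repeated draws into an independent-inclusion sample, as described above, is omitted:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sample_row_by_sparsity(A_csr: csr_matrix, rng: np.random.Generator) -> int:
    """Return row index i with probability nnz(A_i) / nnz(A): pick a
    uniformly random nonzero entry and report the row containing it."""
    k = rng.integers(0, A_csr.nnz)  # uniform over stored nonzero entries
    # indptr[i] <= k < indptr[i+1] identifies the row of the k-th nonzero.
    return int(np.searchsorted(A_csr.indptr, k, side="right") - 1)
```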
The norm-based sampling of Theorem 3 can also be performed efficiently using an augmented data structure for storing $A$. Such data structures have been used extensively in the literature on quantum-inspired algorithms, and require just linear time to construct, linear space, and logarithmic time to update given an update to an entry of $A$ [Tan18, CCH+20].
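One plausible realization of such a structure (not necessarily the exact one of [Tan18, CCH+20]) is a Fenwick tree over the squared row norms, supporting fast updates and weighted sampling; a minimal sketch:

```python
import numpy as np

class SquaredNormSampler:
    """Maintains weights w[i] = ||A_i||_2^2 and samples row i with
    probability w[i] / sum_j w[j]. Updates take O(log n) time; sampling
    via prefix-sum binary search takes O(log^2 n) time."""

    def __init__(self, sq_norms):
        self.n = len(sq_norms)
        self.tree = np.zeros(self.n + 1)
        for i, w in enumerate(sq_norms):
            self.update(i, w)

    def update(self, i, delta):
        """Add delta to weight i (e.g., after an entry of row i changes)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def prefix(self, i):
        """Sum of weights w[0..i]."""
        i += 1
        total = 0.0
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)
        return total

    def sample(self, rng):
        """Draw a row index with probability proportional to its weight."""
        target = rng.random() * self.prefix(self.n - 1)
        lo, hi = 0, self.n - 1
        while lo < hi:                      # smallest i with prefix(i) > target
            mid = (lo + hi) // 2
            if self.prefix(mid) > target:
                hi = mid
            else:
                lo = mid + 1
        return lo
```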
1.4 Towards Optimal Query Complexity
As discussed, Bakshi et al. [BCJ20] show that any algorithm which can test with good probability whether has an eigenvalue or else has all non-negative eigenvalues must read entries of . This testing problem is strictly easier than outputting error estimates of all eigenvalues, so gives a lower bound for our setting. If the queried entries are restricted to fall in a submatrix, [BCJ20] shows that this submatrix must have dimensions , giving total query complexity . Closing the gap between our upper bound of and the lower bound of for submatrix queries is an intriguing open question.
We show in Appendix A that this gap can be easily closed via a surprisingly simple argument if $A$ is positive semidefinite (PSD). In that case, with . Writing for a sampling matrix , the non-zero eigenvalues of are identical to those of . Via a standard approximate matrix multiplication analysis [DK01], one can then show that, for , with probability at least , . Via Weyl’s inequality, this shows that the eigenvalues of , and hence , approximate those of up to error. In fact, via more refined eigenvalue perturbation bounds [Bha13] one can show an $\ell_2$ norm bound on the eigenvalue approximation errors, which can be much stronger than the $\ell_\infty$ norm bound of Theorem 1.
Unfortunately, this approach breaks down when $A$ has negative eigenvalues, and so cannot be factored as $BB^T$ for a real matrix $B$. This is more than a technical issue: observe that when $A$ is PSD and has $\|A\|_\infty \le 1$, it can have at most $1/\epsilon$ eigenvalues larger than $\epsilon n$ – since its trace, which is equal to the sum of its eigenvalues, is bounded by $n$, and since all eigenvalues are non-negative. When $A$ is not PSD, it can have many more eigenvalues with magnitude larger than $\epsilon n$. In particular, if $A$ is the tensor product of a random matrix and the all ones matrix, the bulk of its eigenvalues will concentrate around this magnitude. As a result it remains unclear whether we can match the dependence of the PSD case, or if a stronger lower bound can be shown for indefinite matrices.
Outside the dependence, it is unknown if full eigenspectrum approximation can be performed with sample complexity independent of the matrix size . [BCJ20] achieve this for the easier positive semidefiniteness testing problem, giving sample complexity . However our bounds have additional factors. As discussed, in Appendix B we give an alternate analysis for Theorem 1, which shows that sampling a submatrix suffices for eigenvalue approximation, saving a factor at the cost of worse dependence. However, removing the final seems difficult – it arises when bounding via bounds on the spectral norms of random principal submatrices [RV07]. Removing it seems as though it would require either improving such bounds, or taking a different algorithmic approach.
Also note that our and dependencies for non-uniform sampling (Theorems 2 and 3) are likely not tight. It is not hard to check that the lower bounds of [BCJ20] still hold in these settings. For example, in the sparsity-based sampling setting, by simply having the matrix entirely supported on a submatrix, the lower bounds of [BCJ20] directly carry over. Giving tight query complexity bounds here would also be interesting. Finally, it would be interesting to go beyond principal submatrix based algorithms, to achieve improved query complexity, as in Corollary 1. Finding an algorithm matching the overall query complexity lower bound of [BCJ20] is open even in the much simpler PSD setting.
2 Notation and Preliminaries
We now define notation and foundational results that we use throughout our work. For any integer $n$, let $[n]$ denote the set $\{1, \ldots, n\}$. We write matrices and vectors in bold literals – e.g., $A$ or $x$. We denote the eigenvalues of a symmetric matrix $A \in \mathbb{R}^{n \times n}$ by $\lambda_1(A) \ge \ldots \ge \lambda_n(A)$, in decreasing order. A symmetric matrix is positive semidefinite if all its eigenvalues are non-negative. For two matrices $A, B$, we let $A \preceq B$ denote that $B - A$ is positive semidefinite. For any matrix $A$ and $i \in [n]$, we let $A_i$ denote the $i$th row of $A$, $\mathrm{nnz}(A_i)$ denote the number of non-zero elements in this row, and $\|A_i\|_2$ denote its $\ell_2$ norm. We let $\mathrm{nnz}(A)$ denote the total number of non-zero elements in $A$. For a vector $x$, we let $\|x\|_2$ denote its Euclidean norm. For a matrix $A$, we let $\|A\|_\infty$ denote the largest magnitude of an entry, $\|A\|_2$ denote the spectral norm, $\|A\|_F$ denote the Frobenius norm, and we also use the maximum Euclidean norm of a column. For $A$ and $S \subseteq [n]$, we let $A_S$ denote the principal submatrix corresponding to $S$. We let $\|X\|_p = (\mathbb{E}|X|^p)^{1/p}$ denote the $L_p$ norm of a random variable $X$, where $\mathbb{E}$ denotes expectation.
We use the following basic facts and identities on eigenvalues throughout our proofs.
Fact 1 (Eigenvalue of Matrix Product).
For any two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$, the non-zero eigenvalues of $AB$ are identical to those of $BA$.
Fact 2 (Gershgorin’s circle theorem [Ger31]).
Let $A$ be an $n \times n$ matrix with entries $a_{ij}$. For $i \in [n]$, let $R_i = \sum_{j \neq i} |a_{ij}|$ be the sum of absolute values of non-diagonal entries in the $i$th row. Let $D(a_{ii}, R_i)$ be the closed disc centered at $a_{ii}$ with radius $R_i$. Then every eigenvalue of $A$ lies within one of the discs $D(a_{ii}, R_i)$.
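As a quick numerical sanity check of this fact (not used in any proof):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.uniform(-1, 1, size=(6, 6))

centers = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centers)  # off-diagonal row sums
eigs = np.linalg.eigvals(A)

# Every eigenvalue lies in at least one Gershgorin disc in the complex plane.
for lam in eigs:
    assert np.any(np.abs(lam - centers) <= radii + 1e-10)
```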
Fact 3 (Weyl’s Inequality [Wey12]).
For any two Hermitian matrices $A$ and $B = A + E$, and all $i \in [n]$: $|\lambda_i(B) - \lambda_i(A)| \le \|E\|_2$.
Weyl’s inequality ensures that a small Hermitian perturbation of a Hermitian matrix will not significantly change its eigenvalues. The bound can be extended to the case when the perturbation is not Hermitian, with a loss of an $O(\log n)$ factor; to the best of our knowledge this loss is necessary:
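A quick numerical illustration of Weyl's inequality for a symmetric perturbation (again, only a sanity check):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
A = rng.uniform(-1, 1, size=(n, n)); A = (A + A.T) / 2       # real symmetric
E = 0.05 * rng.standard_normal((n, n)); E = (E + E.T) / 2    # Hermitian perturbation

# eigvalsh returns eigenvalues in a consistent (ascending) order for both matrices.
max_shift = np.max(np.abs(np.linalg.eigvalsh(A + E) - np.linalg.eigvalsh(A)))
print(max_shift, "<=", np.linalg.norm(E, 2))                  # Weyl: shift at most ||E||_2
```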
Fact 4 (Non-Hermitian perturbation bound [Bha13]).
Let be Hermitian and be any matrix whose eigenvalues are such that (where denotes the real part of ). Let . For some universal constant ,
Beyond the above facts, we use several theorems to obtain eigenvalue concentration bounds. We first state a theorem from [Tro08a], which bounds the spectral norm of a principal submatrix sampled uniformly at random from a bounded entry matrix. We build on this to prove the full eigenspectrum concentration result of Theorem 1.
Theorem 4 (Random principal submatrix spectral norm bound [RV07, Tro08a]).
Let be Hermitian, decomposed into diagonal and off-diagonal parts: . Let be a diagonal sampling matrix with the diagonal entry set to independently with probability and otherwise. Then, for some universal constant ,
For Theorems 2 and 3, we need an extension of Theorem 4 to the setting where rows are sampled non-uniformly. We will use two bounds here. The first is a decoupling and recoupling result for matrix norms. One can prove this lemma following an analogous result in [Tro08a] for sampling rows/columns uniformly. The proof is almost identical so we omit it.
Lemma 1 (Decoupling and recoupling).
Let be a Hermitian matrix with zero diagonal. Let be a sequence of independent random variables such that with probability and otherwise. Let be a square diagonal sampling matrix with diagonal entry set to . Then:
where is an independent diagonal sampling matrix drawn from the same distribution as .
The second theorem bounds the spectral norm of a non-uniform random column sample of a matrix. We give a proof in Appendix D, again following a theorem in [Tro08b] for uniform sampling.
Theorem 5 (Non-uniform column sampling – spectral norm bound).
Let be an matrix with rank . Let be a sequence of independent random variables such that with probability and otherwise. Let be a square diagonal sampling matrix with diagonal entry set to .
We use a standard Matrix Bernstein inequality to bound the spectral norm of random submatrices.
Theorem 6 (Matrix Bernstein [Tro15]).
Consider a finite sequence of random matrices in . Assume that for all , Let and let be semidefinite upper-bounds for the matrix valued variances and :
Then, letting , for any ,
For real valued random variables, we use the standard Bernstein inequality.
Theorem 7 (Bernstein inequality [Ber27]).
Let $X_1, \ldots, X_n$ be independent random variables with zero mean such that $|X_i| \le M$ for all $i$. Then for all positive $t$, $\Pr\left[\sum_{i=1}^n X_i > t\right] \le \exp\left(\frac{-t^2/2}{\sum_{i=1}^n \mathbb{E}[X_i^2] + Mt/3}\right).$
3 Sublinear Time Eigenvalue Estimation using Uniform Sampling
We now prove our main eigenvalue estimation result – Theorem 1. We give the pseudocode for our principal submatrix based estimation procedure in Algorithm 1. We will show that any positive or negative eigenvalue of $A$ with sufficiently large magnitude will appear as an approximate eigenvalue in the sampled submatrix with good probability. Thus, in step 5 of Algorithm 1, the positive and negative eigenvalues of the sampled submatrix are used to estimate the outlying largest and smallest eigenvalues of $A$. All other interior eigenvalues of $A$ are estimated to be $0$, which will immediately give our approximation bound when the original eigenvalue has small magnitude.
Running time. Observe that the expected number of indices chosen by Algorithm 1 is . A standard concentration bound can be used to show that with high probability , the number of sampled entries is . Thus, the algorithm reads a total of entries of and runs in time – the time to compute a full eigendecomposition of .
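A minimal Python sketch of this procedure, with the index alignment simplified relative to the formal statement of Theorem 1 (the estimator below uses the positive scaled eigenvalues of the submatrix for the top of the spectrum, the negative ones for the bottom, and 0 everywhere else):

```python
import numpy as np

def estimate_eigenvalues_uniform(A, sample_prob, rng):
    """Simplified sketch of Algorithm 1: estimate all n eigenvalues of the
    symmetric matrix A from a uniformly sampled principal submatrix."""
    n = A.shape[0]
    S = np.flatnonzero(rng.random(n) < sample_prob)   # keep each index w.p. sample_prob
    A_S = A[np.ix_(S, S)]
    scaled = (n / len(S)) * np.sort(np.linalg.eigvalsh(A_S))[::-1]  # descending, rescaled

    estimates = np.zeros(n)
    num_pos = int(np.sum(scaled > 0))
    num_neg = int(np.sum(scaled < 0))
    # Positive submatrix eigenvalues estimate the largest eigenvalues of A,
    # negative ones estimate the smallest; interior eigenvalues stay 0.
    estimates[:num_pos] = scaled[:num_pos]
    if num_neg > 0:
        estimates[n - num_neg:] = scaled[len(scaled) - num_neg:]
    return estimates  # estimates[i] approximates the (i+1)-st largest eigenvalue of A
```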
3.1 Outer and Middle Eigenvalue Bounds
Recall that we will split into two symmetric matrices (Definition 1.1): which contains its large magnitude (outlying) eigendirections with eigenvalue magnitudes and which contains its small magnitude (middle) eigendirections.
We first show that the eigenvectors in are incoherent. I.e., that their (eigenvalue weighted) squared row norms are bounded. This ensures that the outlying eigenspace of is well-approximated via uniform sampling.
Lemma 2 (Incoherence of outlying eigenvectors).
Let be symmetric with . Let be as in Definition 1.1. Let denote the th row of . Then,
Proof.
Observe that . Let denote the th row of the . Then we have
(2)
where , is the th element of and . by assumption and since has orthonormal columns, its spectral norm is bounded by , thus we have
Therefore, by (2), we have:
(3)
Since by definition of , for all , we finally have
and
∎
Let be the scaled sampling matrix satisfying . We next apply Lemma 2 in conjunction with a matrix Bernstein bound to show that concentrates around its expectation, . Since by Fact 1, this matrix has identical eigenvalues to , this allows us to argue that the eigenvalues of approximate those of .
Lemma 3 (Concentration of outlying eigenvalues).
Proof.
Define . For all , let be the row of and define the matrix valued random variable
(4)
Define . Observe that are independent random variables and that . Further, observe that . Now, by Lemma 2. Thus, . The variance can be bounded as:
(5)
Again by Lemma 2, . Plugging back into (5) we can bound,
Since is PSD, this establishes that . We then apply Theorem 6 (the matrix Bernstein inequality) with , , and since there are at most outlying eigenvalues with magnitude in . This gives:
Thus, if we set for large enough , then the probability is bounded above by , completing the proof. ∎
We cannot prove an analogous leverage score bound to Lemma 2 for the interior eigenvectors of appearing in . Thus we cannot apply a matrix Bernstein bound as in Lemma 3. However, we can use Theorem 4 to show that the spectral norm of the random principal submatrix is not too large, and thus that the eigenvalues of are close to those of .
Lemma 4 (Spectral norm bound – sampled middle eigenvalues).
Proof.
Let where is the matrix of diagonal elements and the matrix of off-diagonal elements. Let be the binary sampling matrix with . From Theorem 4, we have for some constant ,
(6)
Considering the various terms in (6), we have and . We also have
and
The final bound follows since , where is an orthogonal projection matrix. Thus, by our assumption that . Plugging all these bounds into (6) we have, for some constant ,
(7)
It remains to bound . We have and thus by triangle inequality,
(8)
Writing (see Definition 1.1), and letting denote the th row of , the th element of has magnitude
by Cauchy-Schwarz. From Lemma 2, we have . Also, from (2), . Overall, for all we have , giving . Plugging back into (8) and in turn (7), we have for some constant ,
Setting for sufficiently large , all terms in the right hand side of the above equation are bounded by and so
Thus, by Markov’s inequality, with probability at least , we have . We can adjust by a constant to obtain the required bound. ∎
3.2 Main Accuracy Bounds
Proof.
Let be the binary sampling matrix with a single one in each column such that . Following Definition 1.1, we write . By Fact 1 we have that the nonzero eigenvalues of are identical to those of where is the square root matrix of such that .
Note that is Hermitian. However may be complex, and hence is not necessarily Hermitian, although it does have real eigenvalues. Thus, we can apply the perturbation bound of Fact 4 to and to claim for all , and some constant ,
By Lemma 3 applied with error , with probability at least , for any (for a large enough constant ) we have . Thus, for all ,
(9)
We note that the conceptual part of the proof is essentially complete: the nonzero eigenvalues of are identical to those of , which we have shown well approximate those of and in turn . i.e., the non-zero eigenvalues of approximate all outlying eigenvalues of . It remains to carefully argue how these approximations should be ‘lined up’ given the presence of zero eigenvalues in the spectrum of these matrices. We also must account for the impact of the interior eigenvalues in , which is limited by the spectral norm bound of Lemma 4.
Eigenvalue alignment and effect of interior eigenvalues. First recall that . By Lemma 4 applied with error , we have with probability at least when . By Weyl’s inequality (Fact 3), for all we thus have
(10)
Consider with . Since the nonzero eigenvalues of are identical to those of , , and so by (9),
(11)
Analogously, consider such that . We have , where is the dimension of – i.e., the number of outlying eigenvalues in . Again by (9) we have
(12)
Now the nonzero eigenvalues of are identical to those of . Consider such that . In this case, by (10), (11), and the triangle inequality, we have and thus we have . In turn, again applying (10), (11), and the triangle inequality, we have
Analogously, for such that , we have by (10) and (12) that . Thus . Again by (10), (12), and triangle inequality this gives
Now, consider all such that is not well approximated by one of the outlying eigenvalues of as argued above. By (10), (11), and (12), all such eigenvalues must have . Thus, if we approximate them in any way either by the remaining eigenvalues of with magnitude , or else by , we will approximate all to error at most . Thus, if (as in Algorithm 1) for with , we let and for with , let , and let for all other , we will have for all ,
Finally, by definition, for all , and thus, via the triangle inequality, we obtain our final error bound after adjusting constants on $\epsilon$.
Recall that we require for the outer eigenvalue bound of (9) to hold with probability . We require for to hold with probability by Lemma 4. Thus, for both conditions to hold simultaneously with probability by a union bound, it suffices to set , where we use that , as otherwise our algorithm can take to be the full matrix . Adjusting to completes the theorem. ∎
Remark: The proof of Lemma 3 and consequently, Theorem 1 can be modified to give better bounds for the case when the eigenvalues of lie in a bounded range – between and where . See Theorem 9 in Appendix C for details. For example, if all the top eigenvalues are equal, one can show that suffices to give error, nearly matching the lower bound of [BCJ20]. This seems to indicate that improving Theorem 1 in general requires tackling the case when the outlying eigenvalues in have a wide range.
4 Improved Bounds via Sparsity-Based Sampling
We now prove the approximation bound of Theorem 2, assuming the ability to sample each row with probability proportional to its number of non-zero entries. Pseudocode for our algorithm is given in Algorithm 2. Unlike in the uniform sampling case (Algorithm 1), we cannot simply sample a principal submatrix of $A$ and compute its eigenvalues. We must carefully zero out entries lying at the intersection of sparse rows and columns to ensure accuracy of our estimates. A similar approach is taken for the norm-based sampling result of Theorem 3. We defer that proof to Appendix E.
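A rough Python sketch of the overall procedure is below. The sampling probabilities and rescaling follow the discussion in Section 1.3.1, while `should_zero` is a placeholder for the thresholding rule of line 5 of Algorithm 2, which is not reproduced here; none of the constants should be read as those of the actual algorithm.

```python
import numpy as np

def estimate_eigenvalues_sparsity(A, s, rng, should_zero):
    """Sketch of sparsity-based estimation: sample rows with probability
    proportional to nnz(A_i), rescale, zero out entries flagged by
    should_zero(nnz_i, nnz_j), and read off the submatrix spectrum."""
    n = A.shape[0]
    row_nnz = np.count_nonzero(A, axis=1).astype(float)
    probs = np.minimum(1.0, s * row_nnz / row_nnz.sum())   # assumed inclusion probabilities
    S = np.flatnonzero(rng.random(n) < probs)

    # Rescale entry (i, j) by 1/sqrt(p_i * p_j); with uniform row sparsity this
    # reduces to the n/s scaling used in the uniform-sampling algorithm.
    scale = 1.0 / np.sqrt(probs[S])
    A_S = scale[:, None] * A[np.ix_(S, S)] * scale[None, :]

    # Zero out entries at the intersection of overly sparse rows/columns, so
    # that no entry is blown up by a huge rescaling factor.
    for a, i in enumerate(S):
        for b, j in enumerate(S):
            if should_zero(row_nnz[i], row_nnz[j]):
                A_S[a, b] = 0.0

    sub_eigs = np.sort(np.linalg.eigvalsh(A_S))[::-1]
    estimates = np.zeros(n)
    num_pos, num_neg = int(np.sum(sub_eigs > 0)), int(np.sum(sub_eigs < 0))
    estimates[:num_pos] = sub_eigs[:num_pos]
    if num_neg > 0:
        estimates[n - num_neg:] = sub_eigs[len(sub_eigs) - num_neg:]
    return estimates
```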
4.1 Preliminary Lemmas
Our first step is to argue that zeroing out entries in sparse rows/columns in step 5 of Algorithm 2 does not introduce significant error. We define to be the extension of to the original matrix – i.e., whenever or . Otherwise . We argue via a strengthening of Gershgorin’s theorem that for all .
After this step is complete, our proof follows the same general outline as that of Theorem 1 in Section 3. We split , arguing that (1) after sampling and (2) that the eigenvalues of are approximations to those of . In both cases, we critically use that the rescaling factors introduced in line 4 of Algorithm 2 do not introduce too much variance, due to the zeroing out of entries in .
Remark: Throughout, we will assume that does not have any rows/columns that are all , as such rows will never be sampled and will have no effect on the output of Algorithm 2. Additionally, we will assume that , as otherwise, has at most non-zero rows. Thus, rather than running Algorithm 2, we can directly compute the eigenvalues of .
Lemma 5.
Let be symmetric with and . Let have if or for a sufficiently large constant and otherwise. Then, for all ,
Proof.
We consider the matrix , which is defined identically to except we only set if . I.e., we do not have the condition requiring setting the diagonal to . We will show that . By Weyl’s inequality, and the assumption that , we then have as required.
Let be the set of rows/columns with and be the submatrix of formed with rows in and columns in . Define in the same way and observe that whenever .
When , we may zero out some entries of to produce . Let be equal to on this set of zeroed out entries, and everywhere else. Observe that . Next observe that has at most non-zero entries. Similarly, each row of has at most non-zero elements. Thus, for all , using that ,
Applying Gershgorin’s circle theorem (Fact 2) we thus have:
(13)
Let be a symmetric matrix such that , , and is zero everywhere else. By triangle inequality and the bound of (13),
Observe that, since we assume all rows have at least one non-zero entry, and . Therefore, can range from to . By triangle inequality,
Finally, setting large enough and using Weyl’s inequality (Fact 3) we have the required bound:
∎
We next give a bound on the coherence of the outlying eigenvectors of . This bound is analogous to Lemma 2, but is more refined, taking into account the sparsity of each row.
Lemma 6 (Incoherence of outlying eigenvectors in terms of sparsity).
Let be as in Lemma 5. Let where is diagonal, with the eigenvalues of with magnitude on its diagonal, and has columns equal to the corresponding eigenvectors. Let denote the th row of . Then,
Proof.
The proof is nearly identical to that of Lemma 2. Observe that . Letting denote the th row of the , we have
(14)
where , is the th element of and . Since has orthonormal columns, we thus have . Therefore, by (14),
(15)
Since by definition for all , we can conclude that and , which completes the lemma. ∎
4.2 Outer and Middle Eigenvalue Bounds
Using Lemma 6, we next argue that the eigenvalues of will approximate those of , and in turn those of . The proof is very similar to Lemma 3 in the uniform sampling case.
Lemma 7 (Concentration of outlying eigenvalues with sparsity-based sampling).
Let be as in Lemmas 5 and 6. Let , where , and are projections onto the eigenspaces with magnitude and respectively (analogous to Definition 1.1). As in Algorithm 2, for all let and let be a scaled diagonal sampling matrix such that the with probability and otherwise. If for a large enough constant , then with probability at least , .
Proof.
Define . For all , let be the row of and define the matrix valued random variable
(16)
Define . We can observe that are independent random variables and that . Let . Then, observe that . So, . Then, similar to the proof of Lemma 3, we need to bound for all and using the improved row norm bounds of Lemma 6. In particular, we have
(17)
By Lemma 6, . Plugging back into (17),
Since is PSD this establishes that . Since there are at most eigenvalues with absolute value , we can apply the matrix Bernstein inequality exactly as in the proof of Lemma 3 with to show that when for large enough , with probability at least , . ∎
We next bound the spectral norm of . This is the most challenging part of the proof – the rows of this matrix are sampled non-uniformly and scaled proportional to their inverse sampling probabilities, so we cannot apply existing bounds on the spectral norms of uniformly sampled random submatrices [RV07]. We extend these bounds to the non-uniform case, critically using that entries which would be scaled up significantly after sampling (i.e. those lying in sparse rows/columns), have already been set to in , and thus do not contribute to the spectral norm.
Lemma 8 (Concentration of middle eigenvalues with sparsity-based sampling).
Let be as in Lemmas 5 and 6. Let , where , and are projections onto the eigenspaces with magnitude and respectively (analogous to Definition 1.1). As in Algorithm 2, for all let and let be a scaled diagonal sampling matrix such that the with probability and otherwise. If for a large enough constant , then with probability at least ,
Proof.
The initial part of the proof follows the outline of the proof of the spectral norm bound for uniformly random submatrices (Theorem 4) of [Tro08a]. From Lemma 6, we have . Also, following the proof of Lemma 6, we have . Thus, for all , using the Cauchy-Schwarz inequality, we have
(18)
Let where and contain the off-diagonal and diagonal elements of respectively. Note that while is zero on the diagonal, may not be. We have:
Using Lemma 1 (decoupling) on , we get
(19)
where is an independent copy of . Upper bounding the rank of as and applying Theorem 5 twice to , once for each operator, we get
(20)
Plugging (20) into (19), we have:
(21)
We now proceed to bound each of the terms on the right hand side of (21). We start with . First, observe that . We consider two cases.
Case 1: . Then, and (since ). Then by (18), we have .
Case 2: . Then we have .
From the two cases above, for , we have:
(22)
We can bound similarly. Since and ,
(23)
where the second step follows from the fact that .
We next bound the term . Observe that , where is the th column/row of . We again consider the two cases when and :
Case 1: . Then .
Case 2: . Then . Thus, setting we have:
Thus, from the two cases above, for all , adjusting by a factor, we have for :
(24)
It remains to bound , which is the most complex part of the proof. Since is an independent copy of , we denote the norm of the th column of as . Then . We will argue that is bounded by with probability . Since our sampling probabilities are all at least and since , this value is also deterministically bounded by . Thus, our high probability bound implies the needed bound on .
We begin by observing that since , , and so to bound , it suffices to bound for all . Towards this end, for a fixed and any , define
Then and . Since is a sum of independent random variables, we can bound this quantity by applying Bernstein’s inequality. To do this, we must bound for all and . We will again consider the cases of and separately.
Case 1: . Then, we have . If then
where the fourth inequality uses (18). By the thresholding procedure which defines , if ,
(26)
and thus for and we have
If then we simply have
Overall for all ,
(27)
and since ,
(28)
For and large enough , we thus have .
We next bound the variance by:
where the last inequality uses (18). Now since for all and we have
(29)
Combining (26) with the second term to the right of (29) we have
and since , we have
(30)
Finally since and we have
(31)
For for large enough , we have .
Therefore, using (28) and (31) with , we can apply Bernstein inequality (Theorem 7) (for some constant ) to get
If we set , for some constant we have
Since , we have . Then with probability at least , for any row with , we have
for for large enough . Observe that, as in Lemma 3 w.l.o.g. we have assumed , since otherwise, our algorithm would read all entries of the matrix.
Case 2: . Then, we have . As in the case, we have from (27):
Now, we observe that , which gives us
(32)
Thus, for for a large enough constant and adjusting for other constants we have . Also observe that the expectation of can be bounded by:
Next, the variance of the sum of the random variables can again be bounded by following the analysis presented in and prior to (30) and (31) we have
(33)
where we again bound using
Then for , we have for large enough constant .
Using (32) and (33) and noting that we can apply the Bernstein inequality (Theorem 7):
If we set , then for some constant we have
Thus, since , when , setting for large enough , we have with probability
We thus have that with probability , for both cases when and , . Taking a union bound over all , with probability at least , for . As stated before, since for all , and since , we also have . Thus,
after adjusting by at most some constants. Overall, we finally get
Plugging this bound into (25), we have for ,
Finally after adjusting by a factor, we have for or ,
The final bound then follows via Markov’s inequality on . ∎
4.3 Main Accuracy Bound
We are finally ready to prove our main result for sparsity-based sampling, Theorem 2, which we restate below.
Proof.
With Lemmas 7 and 8 in place, the proof is nearly identical to that of Theorem 1, with the additional need to apply Lemma 5 to show that the eigenvalues of are close to those of .
For all let and let be a scaled diagonal sampling matrix such that the with probability and otherwise. Let be the matrix constructed from by zeroing out its elements as described in Lemma 5. Then, note that where is the submatrix constructed as in Algorithm 2. We first show that the eigenvalues of approximate those of up to error . The steps are almost identical to those in the proof of Theorem 1. We provide a brief outline of the steps but skip the details.
We split as where and contain eigenvalues of of magnitudes and. This implies where and . By Fact 1 we have that the nonzero eigenvalues of are identical to those of . Thus, applying the perturbation bound of Fact 4, we have:
From Lemma 7, we get for with probability at least . Thus, setting the error parameter to in Lemma 7, for , with probability at least we have:
(34)
We have thus shown that the non-zero eigenvalues of approximate all outlying eigenvalues of . Note that by Lemma 8, we also have with probability at least for . Then, similarly to the section on eigenvalue alignment of Theorem 1, we can argue how these approximations ‘line up’ in the presence of zero eigenvalues in the spectrum of these matrices, concluding that, for all ,
Finally, by Lemma 5, we have for all . Thus, via triangle inequality, , which gives the required bound after adjusting to .
Recall that we require for (34) to hold with probability . We also require for to hold with probability by Lemma 8. Thus, for both conditions to hold simultaneously with probability by a union bound, it suffices to set , where we use that , as otherwise our algorithm can take to be the full matrix . Adjusting to completes the theorem. ∎
5 Empirical Evaluation
We complement our theoretical results by evaluating Algorithms 1 (uniform sampling) and Algorithm 2 (sparsity-based sampling) in approximating the eigenvalues of several symmetric matrices. We defer an evaluation of Algorithm 3 (norm-based sampling) to later work. Algorithm 1 and Algorithm 2 perform very well. They seem to have error dependence roughly in practice, as compared to the dependence proven in Theorem 1 and dependence in Theorem 2. Closing the gap between the theory and observed results would be very interesting.
5.1 Datasets
We test Algorithm 1 (uniform sampler) on three dense matrices. We also compare the relative performance of Algorithm 1 and Algorithm 2 (sparsity sampler) on three other synthetic and real world matrices.
The first two dense matrices, following [CNX21], are created by sampling points from a binary image. We then normalize all the points in the range in both axes. The original image and resulting set of points are shown in Figure 2. We then compute a similarity matrix for the points using two common similarity functions used in machine learning and computer graphics: , the hyperbolic tangent; and , the thin plate spline. These measures lead to symmetric, indefinite, and entrywise bounded similarity matrices.
Our next dense matrix (called the block matrix) is based on the construction of the hard instance for the lower bound in [BCJ20] which shows that we need samples to compute approximations to the eigenvalues of a bounded entry matrix. It is a matrix containing a principal submatrix of all s, with the rest of the entries set to . It has and all other eigenvalues equal to .
We now describe the three matrices used to compare Algorithm 1 and Algorithm 2. All three are graph adjacency matrices, which are symmetric, indefinite, entrywise bounded and sparse. Spectral density estimation for graph structured matrices is an important primitive in network analysis [DBB19]. The first is a dense Erdös-Rényi graph with nodes and connection probability . The second two are real world graphs, taken from SNAP [LK14]; namely Facebook [ML12] and Arxiv COND-MAT [LKF07]. The Facebook graph contains nodes and directed edges. We symmetrize the adjacency matrix. Arxiv COND-MAT is a collaboration network between authors of Condensed Matter papers published on arXiv, containing nodes and undirected edges. Both these graphs are very sparse – the number of edges is a tiny fraction of the total edges in a complete graph with the same number of nodes.
5.2 Implementation Details
Apart from uniform random sampling (Algorithm 1), we also apply the sparsity-based sampling technique in Algorithm 2 and a modification to Algorithm 2, where we do not zero out the elements of the sampled submatrix (we call this simple sparsity sampler). In practice, to apply Algorithm 2, we zero out element (line 5 of Algorithm 2) if or , where is a constant and is the size of the sample. We set experimentally as this results in consistent behavior across datasets.
5.3 Experimental Setup
We subsample each matrix and compute its eigenvalues using numpy [Com21]. We then use our approximation algorithms to estimate the eigenvalues of by scaling the eigenvalues of the sampled submatrix. For trials, we report the logarithm of the average absolute scaled error, , where is the estimated eigenvalue in the trial, is the true eigenvalue and is the number of non-zero elements in . Recall that is an upper bound on all eigenvalue magnitudes. Also note that for the fully dense matrices, .
We repeat our experiments for trials at different sampling rates and aggregate the results. The resultant errors of estimation for dense matrices are plotted in Figure 3 and for the graph matrices are plotted in Figure 4. The -axis is the log proportion of the number of random samples chosen from the matrix. If we sample of the rows/columns, then the comes to around . In these log-log plots, if the sample size has polynomial dependence on , e.g., or error is achieved with sample size proportional to , we expect to see error falling off linearly, with slope equal to where is the exponent on .
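The slopes discussed here can be extracted with a least-squares fit in log-log space; a minimal sketch, where `sample_fracs` and `errors` stand for the measured sampling fractions and average scaled errors:

```python
import numpy as np

def loglog_slope(sample_fracs, errors):
    """Fit log10(error) = slope * log10(sample fraction) + intercept and
    return the slope, i.e., the empirical error vs. sample-size exponent."""
    slope, _intercept = np.polyfit(np.log10(sample_fracs), np.log10(errors), deg=1)
    return float(slope)
```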
As a baseline we also show the error if we approximate all eigenvalues with which results in an error of . This helps us to observe how the approximation algorithms perform for both large and small order eigenvalues, as opposed to just approximating everything by .
Code. All code is written in Python and available at https://github.com/archanray/eigenvalue_estimation.
5.4 Summary of Results
Our results are plotted in Figures 3 and 4. We observe relatively small error in approximating all eigenvalues, with the error decreasing as the number of samples increases. What is more interesting is that the relationship between sample size and error seems to be generally on the order of , our expected lower bound for approximating eigenvalues by randomly sampling a principal submatrix. This can be seen by observing the slope of approximately on the log-log error plots. In some cases, we do better in approximating small eigenvalues of $A$ – if the eigenvalue lies well within the range of middle eigenvalues, we may achieve a very good absolute error estimate simply by approximating it to $0$.
As expected, on the graph adjacency matrices (in Figure 4), sparsity-based sampling techniques generally achieve better error than uniform sampling. For the Erdös-Rényi graph, we expect the node degrees (and hence row sparsities) to be similar. Thus the sampling probability for each row will be roughly uniform, which leads to similar performance of sparsity-based techniques and uniform sampling. For the real world graphs, which have power law degree distributions, sparsity-based sampling techniques have a significant effect. As a result, Algorithm 2 and the simple sparsity sampler variant significantly outperform uniform sampling.
Algorithm 2 almost always dominates the simple sparsity sampler. In some cases the simple sparsity sampler performs better than or equivalently to Algorithm 2. This may happen for two reasons: 1) if Algorithm 2 zeroes out almost all of the sampled submatrix for small samples, the algorithm will underestimate the corresponding eigenvalue, and 2) the cut-off threshold for zeroing may be too high, leading to no difference between the simple sparsity sampler and Algorithm 2.
We also observe that approximating all eigenvalues with 0 results in very good approximation for the small eigenvalues of the Erdös-Rényi graph. We believe this is because the smaller eigenvalues are significantly less than the largest eigenvalue (which is on the order of ). We see similar trends of approximating eigenvalues with zero for the real world graphs too. But since eigenvalues at the extremes of the spectrum are of a larger order, we see reasonably good approximation for the sampling algorithms. Algorithm 2 outperforms approximation by $0$ in all of these cases.
For the dense matrices, uniform sampling almost always outperforms approximation by 0 when estimating any reasonably large eigenvalue. Additionally, note that the block matrix is rank- with true eigenvalues . Any sampled principal submatrix will also have rank at most . Thus, outside the top eigenvalue, the submatrix will have all zero eigenvalues. So, in theory, our algorithm should give perfect error for all eigenvalues outside the top, and we see that this is nearly the case. The very small and sporadic errors in the plots for these eigenvalues arise from numerical roundoff in the eigensolver. The only non-trivial approximation for this matrix is for the top eigenvalue. This approximation appears to have an error dependence of roughly , as expected.
6 Conclusion
We present efficient algorithms for estimating all eigenvalues of a symmetric matrix with bounded entries up to additive error , by reading just a random principal submatrix. We give improved error bounds of and when the rows/columns are sampled with probabilities proportional to their sparsities or squared norms, respectively.
As discussed, our work leaves several open questions. In particular, it is open whether our query complexity for approximation can be improved, possibly to total entries using principal submatrix queries or entries using general queries. The latter bound is open even when is PSD, a setting where we know that sampling a principal submatrix (with total entries) does suffice. Additionally, it is open whether we can achieve sample complexity independent of , by removing all factors, as has been done for the easier problem of testing positive semidefiniteness [BCJ20]. See Section 1.4 for more details.
It would also be interesting to extend our results to give improved approximation bounds for other properties of the matrix spectrum, such as various Schatten- norms and spectral summaries. For many of these problems large gaps in understanding exist – e.g., for approximation to the Schatten- norm, which requires queries, but for which no query algorithm is known. Applying our techniques to improve sublinear time PSD testing algorithms under an rather than approximation requirement [BCJ20] would also be interesting. Finally, it would be interesting to identify additional assumptions on or on the sampling model where stronger approximation guarantees (e.g., relative error) can be achieved in sublinear time.
Acknowledgements
We thank Ainesh Bakshi, Rajesh Jayaram, Anil Damle, and Christopher Musco for helpful conversations about this work. RB, CM, and AR were partially supported by an Adobe Research grant, along with NSF Grants 2046235 and 1763618. PD and GD were partially supported by NSF AF 1814041, NSF FRG 1760353, and DOE-SC0022085.
References
- [AM07] Dimitris Achlioptas and Frank McSherry. Fast computation of low-rank matrix approximations. Journal of the ACM (JACM), 54(2):9–es, 2007.
- [AN13] Alexandr Andoni and Huy L. Nguyễn. Eigenvalues of a matrix in the streaming model. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2013.
- [AW21] Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. In Proceedings of the 32nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021.
- [BCJ20] Ainesh Bakshi, Nadiia Chepurko, and Rajesh Jayaram. Testing positive semi-definiteness via random submatrices. Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2020.
- [Ber27] Serge Bernstein. Sur l’extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Mathematische Annalen, 97(1):1–59, 1927.
- [Bha13] Rajendra Bhatia. Matrix analysis. Springer Science & Business Media, 2013.
- [BIMW21] Arturs Backurs, Piotr Indyk, Cameron Musco, and Tal Wagner. Faster kernel matrix algebra via density estimation. Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
- [BKKS21] Vladimir Braverman, Robert Krauthgamer, Aditya R Krishnan, and Shay Sapir. Near-optimal entrywise sampling of numerically sparse matrices. In Proceedings of the 34th Annual Conference on Computational Learning Theory (COLT), 2021.
- [BKM22] Vladimir Braverman, Aditya Krishnan, and Christopher Musco. Linear and sublinear time spectral density estimation. Proceedings of the 54th Annual ACM Symposium on Theory of Computing (STOC), 2022.
- [BLWZ19] Maria-Florina Balcan, Yi Li, David P Woodruff, and Hongyang Zhang. Testing matrix rank, optimally. In Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2019.
- [BSS10] Itai Benjamini, Oded Schramm, and Asaf Shapira. Every minor-closed property of sparse graphs is testable. Advances in Mathematics, 223(6):2200–2218, 2010.
- [CCH+20] Nadiia Chepurko, Kenneth L Clarkson, Lior Horesh, Honghao Lin, and David P Woodruff. Quantum-inspired algorithms from randomized numerical linear algebra. arXiv:2011.04125, 2020.
- [CLM+15] Michael B Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Proceedings of the 6th Conference on Innovations in Theoretical Computer Science (ITCS), 2015.
- [CNX21] Difeng Cai, James Nagy, and Yuanzhe Xi. Fast and stable deterministic approximation of general symmetric kernel matrices in high dimensions. arXiv:2102.05215, 2021.
- [Com21] The Numpy Community. numpy.linalg.eigvals. https://numpy.org/doc/stable/reference/generated/numpy.linalg.eigvals.html, 2021.
- [CSKSV18] David Cohen-Steiner, Weihao Kong, Christian Sohler, and Gregory Valiant. Approximating the spectrum of a graph. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018.
- [DBB19] Kun Dong, Austin R Benson, and David Bindel. Network density of states. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2019.
- [DDHK07] James Demmel, Ioana Dumitriu, Olga Holtz, and Robert Kleinberg. Fast matrix multiplication is stable. Numerische Mathematik, 2007.
- [DK01] Petros Drineas and Ravi Kannan. Fast Monte-Carlo algorithms for approximate matrix multiplication. In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2001.
- [ER59] P. Erdös and A. Rényi. On random graphs I. Publicationes Mathematicae Debrecen, 1959.
- [ER18] Talya Eden and Will Rosenbaum. On sampling edges almost uniformly. SIAM Symposium on Simplicity in Algorithms (SOSA), 2018.
- [FKV04] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM (JACM), 51(6):1025–1041, 2004.
- [GE95] Ming Gu and Stanley C Eisenstat. A divide-and-conquer algorithm for the symmetric tridiagonal eigenproblem. SIAM Journal on Matrix Analysis and Applications, 1995.
- [Ger31] Semyon Aranovich Gershgorin. Über die Abgrenzung der Eigenwerte einer Matrix. Izvestiya Rossiyskoy akademii nauk. Seriya matematicheskaya, (6):749–754, 1931.
- [GKX19] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via Hessian eigenvalue density. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
- [GR97] Oded Goldreich and Dana Ron. Property testing in bounded degree graphs. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing (STOC), 1997.
- [GR08] Oded Goldreich and Dana Ron. Approximating average parameters of graphs. Random Structures & Algorithms, 32(4):473–493, 2008.
- [GS91] Leslie Greengard and John Strain. The fast gauss transform. SIAM Journal on Scientific and Statistical Computing, 1991.
- [GT11] Alex Gittens and Joel A Tropp. Tail bounds for all eigenvalues of a sum of random matrices. arXiv:1104.4513, 2011.
- [HBT19] Jonas Helsen, Francesco Battistel, and Barbara M Terhal. Spectral quantum tomography. Quantum Information, 2019.
- [HJ12] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, USA, 2nd edition, 2012.
- [HP14] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. Advances in Neural Information Processing Systems 27 (NIPS), 2014.
- [KS03] Robert Krauthgamer and Ori Sasson. Property testing of data dimensionality. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2003.
- [LK14] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, 2014.
- [LKF07] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007.
- [LNW14] Yi Li, Huy L. Nguyễn, and David P Woodruff. On sketching matrix norms and the top singular vector. In Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2014.
- [LSY16] Lin Lin, Yousef Saad, and Chao Yang. Approximating spectral densities of large matrices. SIAM Review, 2016.
- [LW16] Yi Li and David P Woodruff. On approximating functions of the singular values in a stream. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing (STOC), 2016.
- [LWW14] Yi Li, Zhengyu Wang, and David P Woodruff. Improved testing of low rank matrices. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2014.
- [LXES19] Ruipeng Li, Yuanzhe Xi, Lucas Erlandson, and Yousef Saad. The eigenvalues slicing library (EVSL): Algorithms, implementation, and software. SIAM Journal on Scientific Computing, 2019.
- [ML12] Julian J McAuley and Jure Leskovec. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems 25 (NIPS), 2012.
- [MU17] Michael Mitzenmacher and Eli Upfal. Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis. Cambridge University Press, 2017.
- [NSW22] Deanna Needell, William Swartworth, and David P Woodruff. Testing positive semidefiniteness using linear measurements. In Proceedings of the 63rd Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2022.
- [RV07] Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM (JACM), 2007.
- [Saa11] Yousef Saad. Numerical methods for large eigenvalue problems: revised edition. SIAM, 2011.
- [SBL16] Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv:1611.07476, 2016.
- [SgS90] G. W. Stewart and Ji guang Sun. Matrix Perturbation Theory. Academic Press, 1990.
- [SR94] RN Silver and H Röder. Densities of states of mega-dimensional Hamiltonian matrices. International Journal of Modern Physics C, 1994.
- [Tan18] Ewin Tang. Quantum-inspired classical algorithms for principal component analysis and supervised clustering. arXiv:1811.00414, 2018.
- [Tro08a] Joel A Tropp. Norms of random submatrices and sparse approximation. Comptes Rendus Mathematique, 2008.
- [Tro08b] Joel A. Tropp. The random paving property for uniformly bounded matrices. Studia Mathematica, 185:67–82, 2008.
- [Tro15] Joel A Tropp. An introduction to matrix concentration inequalities. arXiv:1501.01571, 2015.
- [Wan94] Lin-Wang Wang. Calculating the density of states and optical-absorption spectra of large quantum systems by the plane-wave moments method. Physical Review B, 1994.
- [Wey12] Hermann Weyl. The asymptotic distribution law of the eigenvalues of linear partial differential equations (with an application to the theory of cavity radiation). Mathematische Annalen, 1912.
- [WWAF06] Alexander Weiße, Gerhard Wellein, Andreas Alvermann, and Holger Fehske. The kernel polynomial method. Reviews of Modern Physics, 2006.
- [YGL+18] Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W Mahoney. Hessian-based analysis of large batch training and robustness to adversaries. arXiv:1802.08241, 2018.
Appendix A Eigenvalue Approximation for PSD Matrices
Here we give a simple proof showing that if Algorithm 1 is used to approximate the eigenvalues of a positive semidefinite (PSD) matrix (i.e., one with all non-negative eigenvalues) using a random submatrix, then the norm of the vector of eigenvalue approximation errors is bounded by . This much stronger result immediately implies that each eigenvalue of a PSD matrix can be approximated to additive error using just a random submatrix. The proof follows from a bound of [Bha13], which bounds the norm of the difference of the eigenvalue vectors of a Hermitian matrix and any other matrix by the Frobenius norm of the difference of the two matrices. This improves on the bound of Theorem 1 for general entrywise-bounded matrices by a factor, and matches the lower bound for principal submatrix queries in [BCJ20]. Note that the hard instance used to prove the lower bound in [BCJ20] can in fact be negated to be PSD, thus showing that our upper bound here is tight.
We first state the result from [Bha13] which we will be using in our proof.
Fact 5 (-norm bound on eigenvalues [Bha13]).
Let be Hermitian and be any matrix whose eigenvalues are such that (where denotes the real part of ). Let . Then
Our result is based on the following lemma, which we prove at the end of the section.
Lemma 9.
Consider a PSD matrix with . Let be sampled as in Algorithm 1 for . Let be the scaled sampling matrix satisfying . Then with probability at least ,
From the above Lemma we have:
Corollary 2 (Spectral norm bound – PSD matrices).
Consider a PSD matrix with . Let be a subset of indices formed by including each index in independently with probability as in Algorithm 1. Let be the corresponding principal submatrix of , with eigenvalues .
For all with , let . For all other , let . Then if , with probability at least ,
which implies that for all ,
Proof.
Let be sampled as in Algorithm 1 and let be the scaled sampling matrix satisfying . Since is PSD, we can write for some matrix . From Lemma 9, for , we have with probability at least :
Using Fact 5, we have,
(35) |
Also from Fact 1, we have for all . Thus,
Also by Fact 1, all non-zero eigenvalues of are equal to those of . All other eigenvalue estimates are set to . Further, for all , . Thus,
Adjusting to then gives us the bound. ∎
We now prove Lemma 9, using a standard approach for sampling based approximate matrix multiplication – see e.g. [DK01].
Proof of Lemma 9.
For let with probability and with probability . Thus and
Fixing , the are mean-zero, independent random variables. Thus we have:
since . Rearranging the sums we have:
Observe that , thus overall we have:
So by Markov’s inequality, with probability , . This completes the lemma after taking a square root. ∎
Remark: The proof of Lemma 9 can be easily modified to show that the th row of can be sampled with probability proportional to to approximate the eigenvalues of any PSD matrix up to error ( is the trace of ). When sampling with probabilities proportional to , we do not require a bounded entry assumption on .
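For intuition, the following self-contained snippet performs a quick numerical check of this PSD bound; the rank-50 test matrix, the uniform 10% sampling rate, and the n/|S| rescaling are illustrative choices, not prescriptions from the analysis above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s_frac, d = 2000, 0.1, 50

# Random PSD matrix with entries roughly in [-1, 1]: B = G G^T / d.
G = rng.uniform(-1, 1, size=(n, d))
B = G @ G.T / d

# Uniformly sampled principal submatrix, eigenvalues rescaled by n/|S|.
S = np.flatnonzero(rng.random(n) < s_frac)
est = (n / len(S)) * np.sort(np.linalg.eigvalsh(B[np.ix_(S, S)]))[::-1]
true = np.sort(np.linalg.eigvalsh(B))[::-1]

# Only |S| estimates are produced; the remaining eigenvalues are estimated as 0.
est_full = np.zeros(n)
est_full[:len(est)] = est
print("l2 error of eigenvalue estimates:", np.linalg.norm(est_full - true))
```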
Appendix B Alternate Bound for Uniform Sampling
In this section we provide an alternate bound for approximating eigenvalues with uniform sampling. The sample complexity is worse by a factor of for this approach, but better by a factor as compared to Theorem 1. We start with an analog to Lemma 3, showing that the outlying eigenspace remains nearly orthogonal after sampling. In particular, we show concentration of the Hermitian matrix about its expectation rather than the non-Hermitian as in Lemma 3. This allows us to use Weyl’s inequality in our final analysis, rather than the non-Hermitian eigenvalue perturbation bound of Fact 4, saving a factor in the sample complexity.
Lemma 10 (Near orthonormality – sampled outlying eigenvalues).
Let be sampled as in Algorithm 1 for where is a sufficiently large constant. Let be the scaled sampling matrix satisfying . Then with probability at least ,
Proof.
The result is standard in randomized numerical linear algebra – see e.g., [CLM+15]. For completeness, we give a proof here. Define . For all , let be the row of and define the matrix valued random variable
Then, similar to the proof of Lemma 3, define . Since are independent random variables and , we need to bound for all and . Observe , by row norm bounds of Lemma 2. Again, using Lemma 2 we have
where is the identity matrix of appropriate dimension. By setting , we can finally bound the probability of the event using Theorem 6 (the matrix Bernstein inequality) with if . Since these steps follow Lemma 3 nearly exactly, we omit them here. ∎
With Lemma 10 in place, we can now give our alternate sample complexity bound.
Theorem 8 (Sublinear Time Eigenvalue Approximation).
Let be symmetric with and eigenvalues . Let be formed by including each index independently with probability as in Algorithm 1. Let be the corresponding principal submatrix of , with eigenvalues .
For all with , let . For all with , let . For all other , let . If , for a large enough constant , then with probability , for all ,
Proof.
Let be the binary sampling matrix with a single one in each column such that . Let . Following Definition 1.1, we write . By Fact 1 we have that the nonzero eigenvalues of are identical to those of .
Note that is positive semidefinite. Writing its eigendecomposition, we can define the matrix square root with . By Lemma 10 applied with error , with probability at least , all eigenvalues of lie in the range . In turn, all eigenvalues of also lie in this range. Again using Fact 1, we have that the nonzero eigenvalues of , and in turn those of , are identical to those of .
Let . Since the diagonal entries of lie in , those of lie in . Thus, . We can write
We can then bound
Applying Weyl’s eigenvalue perturbation theorem (Fact 3), we thus have for all ,
(36) |
Note that we have shown that the nonzero eigenvalues of are identical to those of , which we have shown well approximate those of , and in turn , i.e., the non-zero eigenvalues of approximate all outlying eigenvalues of . We can also bound the middle eigenvalues using Lemma 4 as in Theorem 1. The only thing left is to argue how these approximations ‘line up’ in the presence of zero eigenvalues in the spectra of these matrices. This part of the proof again proceeds similarly to that of Theorem 1 in Section 3.2.
Analogous to Theorem 1, from Lemma 10 equation (36) holds with probability if . We also require for to hold with probability by Lemma 4. Thus, for both conditions to hold simultaneously with probability by a union bound, it suffices to set , where we use that , as otherwise our algorithm can take to be all of . Adjusting to completes the theorem. ∎
Appendix C Refined Bounds
In this section, we show how to obtain better query complexity or tighter approximation factors by modifying the proofs of Theorem 1 and Lemmas 2 and 3 under additional assumptions. We give an extension of Theorem 1 in Theorem 9 for the case when the eigenvalues of lie in a bounded range, between and where .
Theorem 9.
Let be symmetric with and eigenvalues . Let be as in Definition 1.1 such that for all eigenvalues we have either or where . Let be formed by including each index independently with probability as in Algorithm 1. Let be the corresponding principal submatrix of , with eigenvalues .
For all with , let . For all with , let . For all other , let . If , for large enough , then with probability at least , for all ,
Proof.
The proof follows by modifying the proofs of Theorem 1, Lemmas 2 and 3 to account for the tighter intervals. First observe that since for all , we can give a tighter row norm bound for from the proof of Lemma 2. In particular, from equation (3) we get:
We can then bound the number of samples we need to take such that for (as defined in Theorem 8) we have with probability at least via a matrix Bernstein bound. By appropriately modifying the proof of Lemma 3 to incorporate the stronger row norm bound for , we can show that sampling with probability for for large enough suffices. Specifically, we get , and for the Bernstein bound in Lemma 3 which enables us to get the tighter bound. Thus, we have with probability for following Lemma 3. We also require for to hold with probability by Lemma 4. Then, following the proof of Theorem 1, by Fact 4, for all , and some constant , we have:
Appendix D Spectral Norm Bounds for Non-Uniform Random Submatrices
See 5
Proof.
The proof follows from [Tro08b]. We begin by defining the following term
Now we have
where is the sequence of independent random variables such that with probability and otherwise, and is the th column of . Then, . Let be an independent copy of the sequence . Subtracting the mean and applying triangle inequality we have
Using Jensen’s inequality we have
The random variables are symmetric and independent. Let be i.i.d Rademacher random variables for all . Then applying the standard symmetrization argument followed by triangle inequality, we have:
Let . Let be the partial expectation with respect to , keeping the other random variables fixed. Then, we get:
Using Rudelson’s Lemma 11 of [Tro08b] for any matrix with columns and any we have
Since is concave for , using Jensen’s inequality we get:
Applying the above result to the matrix , we get:
Applying the Cauchy-Schwarz inequality, we get:
The above equation is of the form . Thus, the values of for which the above equation holds are given by . Thus, we get:
This gives us the final bound. ∎
Appendix E Improved Bounds via Row-Norm-Based Sampling
Building on the sparsity-based sampling results presented in Section 4, we now show how to obtain improved approximation error of assuming we can sample the rows of with probabilities proportional to their squared norms. The ability to sample by norms also allows us to remove the assumption that has bounded entries: our results apply to any symmetric matrix.
For technical reasons, we mix row norm sampling with uniform sampling, forming a random principal submatrix by sampling each index independently with probability and rescaling each sampled row/column by . As in the sparsity-based sampling setting, we must carefully zero out entries of the sampled submatrix to ensure concentration of the sampled eigenvalues. Pseudocode for the full algorithm is given in Algorithm 3.
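A schematic Python sketch of these two steps appears below; the mixture probabilities, the constant c, and especially the zeroing threshold tau are placeholders rather than Algorithm 3's exact formulas.

```python
import numpy as np

def norm_sample_submatrix(A, eps, c=1.0, rng=None):
    """Hedged sketch of the sampling and zeroing steps described above.
    The mixture probabilities, the rescaling, and especially the zeroing
    threshold are schematic placeholders, not Algorithm 3's exact formulas."""
    rng = rng or np.random.default_rng()
    n = A.shape[0]
    row_norms_sq = np.sum(A * A, axis=1)
    fro_sq = max(row_norms_sq.sum(), 1e-12)
    # Mix squared-row-norm sampling with uniform sampling, capped at 1.
    p = np.minimum(1.0, (c / eps**2) * (row_norms_sq / fro_sq + 1.0 / n))
    S = np.flatnonzero(rng.random(n) < p)
    scale = 1.0 / np.sqrt(p[S])
    A_S = A[np.ix_(S, S)] * np.outer(scale, scale)   # rescale kept rows/columns
    # Placeholder zeroing rule: drop rescaled off-diagonal entries that are
    # individually too large to concentrate; Algorithm 3's actual rule depends
    # on eps and the sampling probabilities p_i, p_j.
    tau = np.sqrt(fro_sq) / (eps * max(len(S), 1))
    off_diag = ~np.eye(len(S), dtype=bool)
    A_S[off_diag & (np.abs(A_S) > tau)] = 0.0
    return A_S, S
```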
E.1 Preliminary Lemmas
Our proof closely follows that of Theorem 2 in Section 4. We start by defining , obtained by zeroing out entries of as described in Algorithm 3. We have whenever 1) and or 2) and . Otherwise . Similar to the sparsity sampling case, we argue that the eigenvalues of are close to , i.e., zeroing out entries of according to the given condition doesn't change its eigenvalues by too much (Lemma 11). Then, we again split such that . We argue that after sampling, we have and that the eigenvalues of approximate those of up to error.
Lemma 11.
Let be symmetric. Let have if either 1) and or 2) and for a sufficiently large constant . Otherwise, . Then, for all ,
Proof.
Consider the matrix , which is defined identically to except that we only set if and . I.e., we do not zero out any entries on the diagonal as in . We will show that . If is zeroed out in , this implies that . Thus, and so . So, by the triangle inequality, we will then have . The lemma then follows from Weyl's inequality.
To show that , we use a variant of Gershgorin's theorem, as in the proof of Lemma 5. First, we split the entries of into level sets, according to their magnitudes. Let where if and otherwise. For , if and otherwise. We can also define where each are defined similarly. By triangle inequality, . First observe that . Further, we can assume without loss of generality that and so , as otherwise our algorithm can afford to read all of . So, it suffices to show that for all ,
(37) |
This will give , which gives the lemma after adjusting by a constant factor.
We now prove (37) for each . For , let be the set of rows/columns in with and let be the submatrix of formed with rows in and columns in . Define the submatrix of in the same way. Let and finally, let be the symmetric error matrix such that and .
Note that all rows from which we zero out entries must have at least one non-zero entry (otherwise all entries in that row/column are already zero), thus all such rows have and so are covered by the submatrices . Thus, by triangle inequality, we can bound
(38) |
To prove (37), we need to bound for all and . We use a case analysis.
Case 1: In this case, first observe that since the nonzero entries of lie in , for any , ,
Thus, by the assumed bound on , we have for any where is nonzero,
where the second inequality follows again from the fact that the nonzero entries of lie in . Thus, any with nonzero is not zeroed out in line 5 of Algorithm 3. So . Plugging into (38), we thus have:
(39) |
Case 2: In this case, observe that . We can see that has at most non-zero entries. Similarly, each row of has at most non-zero elements. Thus, for all , using the fact that all non-zero entries of are bounded by , we have:
Applying Gershgorin’s circle theorem (Theorem 2) we thus have:
and so
Plugging into (39), we thus have:
Setting , we thus have (37), and in turn the lemma. ∎
We next give a bound on the incoherence of the outlying eigenvectors of . This bound is again similar to Lemmas 2 and 6.
Lemma 12 (Incoherence of outlying eigenvectors in terms of norms).
Let be as in Lemma 11. Let where is diagonal, with the eigenvalues of with magnitude on its diagonal, and has columns equal to the corresponding eigenvectors. Let denote the th row of . Then,
Proof.
The proof is again nearly identical to that of Lemma 2. Observe that . Letting denote the th row of the , we have
(40) |
where , is the th element of and . Since has orthonormal columns, we have . Therefore, by (40),
(41) |
Since by definition for all , we can conclude that and , which completes the lemma. ∎
E.2 Outer and Middle Eigenvalue Bounds
Using Lemma 12, we next argue that the eigenvalues of will approximate those of , and in turn those of . The proof is very similar to Lemmas 3 and 7.
Lemma 13 (Concentration of outlying eigenvalues with norm based sampling).
Let be as in Algorithm 3. Let , where , and are projections onto the eigenspaces with magnitude and respectively. For all let and let be a scaled diagonal sampling matrix such that the with probability and otherwise. If for a large enough constant , then with probability at least , .
Proof.
We define the random variables and the set exactly as in the proof of Lemma 7. Then, as explained in the proof of Lemma 7, it suffices to bound . From (17) we have . Also from Lemma 11, we have and for all , . We thus get,
Since is PSD this establishes that . We can then apply the matrix Bernstein inequality exactly as in the proof of Lemma 3 to show that when for large enough , with probability at least , . ∎
We now bound the middle eigenvalues.
Lemma 14 (Concentration of middle eigenvalues with norm based sampling).
Let be as in Lemma 12. Let , where , and are projections onto the eigenspaces with magnitude and respectively (analogous to Definition 1.1). As in Algorithm 2, for all let and let be a scaled diagonal sampling matrix such that the with probability and otherwise. If for a large enough constant , then with probability at least ,
Proof.
First observe that since (for large enough ), the results of Lemmas 11 and 12 still hold. The proof follows the same structure as the proof bounding the middle eigenvalues for sparsity-based sampling in Lemma 8. From Lemma 12, we have . Also, following the proof of Lemma 12, we have . Thus, for all , using the Cauchy-Schwarz inequality, we have
(42) |
Let where and contain the off-diagonal and diagonal elements of respectively. Then following the proof of Lemma 8, we get:
(43) |
We now proceed to bound each of the terms on the right hand side of (43). We start with . First, observe that . We consider two cases.
Case 1: . Then, as we have since . So we must have that (since ). Then by (42), we have .
Case 2: . Then we have .
From the two cases above, for , we have:
(44) |
We can bound similarly. Since and ,
(45) |
where the second step follows from the fact that .
We next bound the term . Observe that , where is the th column/row of . We again consider the two cases when and :
Case 1: . Then .
Case 2: . Then . Thus, setting we have:
Thus, from the two cases above, for all , adjusting by a factor, we have for :
(46) |
Overall, plugging (44), (45), and (46) back into (43), we have :
(47) |
Finally we bound . As in the proof of Lemma 8, we have and we will argue that is bounded by with probability . Also as argued in the proof of Lemma 8, since , it suffices to bound for all with high probability. Again, for a fixed and any , define the random variables as:
Then and . We will again use Bernstein’s inequality to bound by bounding for all and . We consider the cases of and separately.
Case 1: . Then, we have . If then
where the fourth inequality uses (42). By the thresholding procedure which defines , if and ,
(48) |
and thus for and we have
Also since we must have as . If or , then we simply have
Overall for all ,
(49) |
and since ,
(50) |
For and large enough , we thus have .
We next bound the variance by:
where the last inequality uses (42). We thus get:
(51) |
Now as (and thus, ). Combining (48) with the second term to the right of (51) we have
and since , we have
(52) |
Finally since and we have
(53) |
For for large enough , we have .
Therefore, using (50) and (53) with , we can apply Bernstein inequality (Theorem 7) (for some constant ) to get
If we set , for some constant we have
Since , we have . Then with probability at least , for any row with , we have
for for large enough . Observe that, as in Lemma 3, w.l.o.g. we have assumed , since otherwise our algorithm would read all entries of the matrix.
Case 2: . Then, we have . As in the case when , (and thus ), we have from (49):
Now, we observe that , which gives us
(54) |
Note that if , the second term in (49) is bounded as for . Thus, for for a large enough constant and adjusting for other constants we have . Also observe that the expectation of can be bounded by:
Next, the variance of the sum of the random variables can again be bounded by following the analysis in and prior to (52) and (53); we have
(55) |
where we again bound using
Then for , we have for large enough constant .
Using (54) and (55) and noting that we can apply the Bernstein inequality (Theorem 7):
If we set , then for some constant we have
Thus, since , when , setting for large enough , we have with probability
We have proven that with probability , for both cases when and , . Taking a union bound over all , with probability at least , for . Also, since for all , . Thus, and we get,
after adjusting by at most some constants. Overall, we finally get
Plugging this bound into (47), we have for ,
Finally after adjusting by a factor, we have for or ,
The final bound then follows via Markov’s inequality on . ∎
E.3 Main Accuracy Bound
We are finally ready to state our main result for norm based sampling.
See 3
Proof.
The proof follows exactly the same structure as the proofs of Theorems 1 and 2 for uniform and sparsity based sampling respectively. We use the results of Lemmas 14 and 13 on the concentration of the middle and large eigenvalues for norm based sampling.
Analogous to Theorem 2, from Lemma 13 with error parameter the eigenvalues of approximate those of up to error with probability if . We also require for to hold with probability by Lemma 14. Thus, for both conditions to hold simultaneously with probability by a union bound, it suffices to set , where we use that , as otherwise our algorithm can take to be the full matrix . Adjusting to completes the theorem. ∎
Appendix F Eigenvalue Approximation via Entrywise Sampling
In this section we show that sampling entries from a bounded entry matrix preserves its eigenvalues up to error . We use this result to improve the sample complexity of Theorem 1 from to by applying entrywise sampling to further sparsify the submatrix that is sampled in Algorithm 1. Entrywise sampling results similar to what we show are well-known in the literature. See for example [AM07] and [BKKS21]. For completeness, we give a proof here using standard matrix concentration bounds.
Theorem 10 (Entrywise sampling – spectral norm bound).
Consider with . Let be constructed by setting for all and
For any , if for a large enough constant , then with probability at least , .
Note that by Weyl’s inequality (Fact 3), Theorem 10 immediately implies that the eigenvalues of approximate those of up to error with good probability.
Proof.
For any , define the symmetric random matrix with
Observe that . Further, each has just two non-zero values in different rows and columns. So
where the last inequality uses that . Additionally, is diagonal with two diagonal entries at or equal to . Thus, is also diagonal. We have
where in the final inequality we use that . Thus, since is diagonal, . Putting the above together using Theorem 6 we get,
Thus, for for large enough , with probability at least we have . ∎
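A hedged sketch of this entrywise sparsification step is given below; the keep probability p is left as a parameter standing in for the sampling rate required by Theorem 10.

```python
import numpy as np

def entrywise_sparsify(A, p, rng=None):
    """Keep each off-diagonal entry (symmetrically) with probability p, rescale
    kept entries by 1/p so the off-diagonal part is unbiased, and keep the
    diagonal exactly. p stands in for the rate required by Theorem 10."""
    rng = rng or np.random.default_rng()
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)                   # upper-triangular positions
    keep = rng.random(iu[0].size) < p
    B = np.zeros_like(A, dtype=float)
    B[iu] = np.where(keep, A[iu] / p, 0.0)         # surviving entries rescaled by 1/p
    B = B + B.T                                    # mirror to preserve symmetry
    np.fill_diagonal(B, np.diag(A))                # diagonal kept exactly
    return B
```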
F.1 Improved Sample Complexity via Entrywise Sampling
We can combine Theorem 10 directly with Theorem 1 to give an improved sample complexity for eigenvalue estimation. See 1
Proof.
Letting for large enough constant , by Theorem 1, for a random principal submatrix formed by sampling each index with probability , the eigenvalues of , after scaling up by a factor of approximate those of to error with probability at least . By Theorem 10, if we sample off-diagonal entries of with probability to produce , then we preserve its eigenvalues to error . Thus, after scaling by , the eigenvalues of approximate those of to error . Finally, observe that by a standard Chernoff bound, with probability at least . So adjusting by a constant, the scaled eigenvalues of give approximations to ’s eigenvalues. The expected number of entries read is . Additionally, by a standard Chernoff bound at most are read with probability at least . ∎
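The two-stage scheme described in this proof can be sketched as follows; s_frac and p are placeholders for the sampling rates of Theorems 1 and 10, and the n/|S| eigenvalue rescaling mirrors Algorithm 1.

```python
import numpy as np

def estimate_eigs_two_stage(A, s_frac, p, rng=None):
    """Illustration of the two-stage scheme: sample a uniform principal submatrix,
    sparsify its off-diagonal entries with probability p (rescaled by 1/p), then
    scale the resulting eigenvalues by n/|S|. s_frac and p are placeholders for
    the rates in Theorems 1 and 10."""
    rng = rng or np.random.default_rng()
    n = A.shape[0]
    S = np.flatnonzero(rng.random(n) < s_frac)
    A_S = A[np.ix_(S, S)].astype(float)
    iu = np.triu_indices(len(S), k=1)
    keep = rng.random(iu[0].size) < p
    B = np.zeros_like(A_S)
    B[iu] = np.where(keep, A_S[iu] / p, 0.0)       # sparsify and rescale off-diagonal
    B = B + B.T
    np.fill_diagonal(B, np.diag(A_S))              # keep the submatrix diagonal exactly
    return (n / max(len(S), 1)) * np.sort(np.linalg.eigvalsh(B))[::-1]
```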
Appendix G Singular Value Approximation via Sampling
We now show how to estimate the singular values of a bounded-entry matrix via random subsampling. Unlike in eigenvalue estimation, instead of sampling a random principal submatrix, here we sample a random submatrix with independent rows and columns. This allows us to apply known interior eigenvalue matrix Chernoff bounds to bound the perturbation in the singular values [GT11, BCJ20]. We first state a simplified version of Theorem 4.1 from [GT11] (also stated as Theorem 4.6 in [BCJ20]), simplified using standard upper bounds on the Chernoff bounds in [MU17].
Theorem 11 (Interior Eigenvalue Matrix Chernoff bounds – Theorem 4.1 of [GT11]).
Let be a finite sequence of independent, random, positive-semidefinite matrices with dimension , and assume that for some value almost surely. Given an integer , define
Then we have the tail inequalities:
We are now ready to state and prove the main theorem.
Theorem 12.
Let be a matrix with and singular values . Let be a scaled diagonal sampling matrix such that with probability and otherwise. Let be an independent and identically distributed random sampling matrix. Let be the sampled submatrix from with singular values . Then, if for some constant , with probability at least , for all ,
Proof.
We first prove that singular values of are close to those of . Let be matrix valued r.v.’s for such that:
where is the th row of written as a column vector. Then, and . We have and for .
Case 1: We will first prove that for all . Note that when , is trivially true. We now consider all such that . Setting , and (note that ) in Theorem 11, we get:
where is constant. So, for for any , we have with probability at least . Taking a square root on both sides we get . Taking a union bound over all with , holds for all such with probability at least .
Case 2: We now prove that for all . We first consider the case when . Setting , and (note that ) in Theorem 11, we get (for some constant ):
Thus, if , we have for all such that with probability at least via a union bound. Taking square root on both sides and using the facts that , and , we get .
We now consider the case . Setting , and (note that ) in Theorem 11, we get (for some constant ):
Thus, if , we have for all such that with probability at least via a union bound. Taking square root on both sides and using the fact that , and for any , we get . Thus, via a union bound over all , we have with probability .
Thus, via a union bound over the two cases above, for all with probability at least for we have, for all ,
(56) |
Next we prove that the singular values of are close to those of , using essentially the same approach as above. Let be a matrix-valued random variable for such that:
where is the th column of . Then, . Also, we have . First, using a standard Chernoff bound, we can claim that will sample at most rows from with probability at least for any . Thus, we have with probability . Let this event be called . We now consider two cases conditioned on the event .
Case 1: We first prove that for all . Again note that when this is trivially true. So we consider all such that . Setting , (as we have conditioned on ) and (note that ) in Theorem 11, we get:
where is some constant. So, for for any , we have with probability at least . Taking a square root on both sides we get . Taking a union bound over all with , holds for all such with probability at least .
Case 2: We now prove for all . We again first consider the case . Setting , and (note that ) in Theorem 11:
Then, similarly to the previous Case 2, taking the square root of both sides and applying a union bound, we get for all such that with probability at least for . The case is again similar to the previous Case 2. We set and apply Theorem 11, then take the square root on both sides to get with probability for all for . Thus, with probability , conditioned on the event , we have for all . Finally, via a union bound over the two cases above, and conditioned on , for all with probability at least for , we get
(57) |
Thus, taking a union bound over all the cases above (including ), from equations (56) and (57) and via the triangle inequality, we get: with probability at least (where is a small constant) for . Adjusting and by constant factors gives us the final bound. ∎
Remark on Rectangular Matrices: Though we have considered to be a square matrix for simplicity, notice that Theorem 12 also holds for any arbitrary (non-square) matrix , with replaced by in the sample complexity bound.
Remark on Non-Uniform Sampling: As discussed in Section 1.3.1, simple non-uniform random submatrix sampling via row/column sparsities or norms does not suffice to estimate the singular values up to the improved error bounds of or . A more complex strategy, such as the zeroing out used in Theorems 2 and 3, must be used. It is worth noting that, following the same proof as Theorem 12, it is easy to show that if is sampled according to row norms or sparsities and appropriately weighted, then the singular values of do approximate those of up to these improved error bounds. The proof breaks down when analyzing . would have to be sampled according to the row norms/sparsities of , not , for the proof to go through. However, in general, these sampling probabilities can differ significantly between and .
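A hedged sketch of the row/column sampling estimator analyzed in this appendix is given below; the uniform rate p and the 1/sqrt(p) rescaling of kept rows and columns stand in for the exact sampling matrices of Theorem 12.

```python
import numpy as np

def estimate_singular_values(A, p, rng=None):
    """Sketch of the estimator in Theorem 12: sample rows and columns independently,
    each with probability p, rescale kept rows and columns by 1/sqrt(p), and use the
    singular values of the resulting submatrix as estimates for the top singular
    values of A. The rate p is a placeholder for the rate required by the theorem."""
    rng = rng or np.random.default_rng()
    n, m = A.shape
    rows = np.flatnonzero(rng.random(n) < p)
    cols = np.flatnonzero(rng.random(m) < p)
    sub = A[np.ix_(rows, cols)] / p            # 1/sqrt(p) rescaling of rows and of columns
    sv = np.linalg.svd(sub, compute_uv=False)  # singular values of the sampled submatrix
    est = np.zeros(min(n, m))
    est[:len(sv)] = np.sort(sv)[::-1]          # remaining singular values estimated as 0
    return est
```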