Universal Matrix Sparsifiers and Fast Deterministic Algorithms for Linear Algebra
Let $\mathbf{S} \in \mathbb{R}^{n \times n}$ be any matrix satisfying $\|\mathbf{1} - \mathbf{S}\|_2 \le \epsilon n$, where $\mathbf{1}$ is the all ones matrix and $\|\cdot\|_2$ is the spectral norm. It is well-known that there exists such an $\mathbf{S}$ with just $O(n/\epsilon^2)$ non-zero entries achieving this guarantee: we can let $\mathbf{S}$ be the scaled adjacency matrix of a Ramanujan expander graph. We show that, beyond giving a sparse approximation to the all ones matrix, $\mathbf{S}$ yields a universal sparsifier for any positive semidefinite (PSD) matrix. In particular, for any PSD $\mathbf{A}$ which is normalized so that its entries are bounded in magnitude by $1$, we show that $\|\mathbf{A} - \mathbf{A} \circ \mathbf{S}\|_2 \le \epsilon n$, where $\circ$ denotes the entrywise (Hadamard) product. Our techniques also yield universal sparsifiers for non-PSD matrices. In this case, we show that if $\mathbf{S}$ satisfies $\|\mathbf{1} - \mathbf{S}\|_2 \le \frac{\epsilon^2 n}{c\log^2(1/\epsilon)}$ for some sufficiently large constant $c$, then $\|\mathbf{A} - \mathbf{A} \circ \mathbf{S}\|_2 \le \epsilon \sqrt{n \cdot \|\mathbf{A}\|_1}$, where $\|\mathbf{A}\|_1$ is the nuclear norm. Again letting $\mathbf{S}$ be a scaled Ramanujan graph adjacency matrix, this yields a sparsifier with $\widetilde{O}(n/\epsilon^4)$ entries. We prove that the above universal sparsification bounds for both PSD and non-PSD matrices are tight up to logarithmic factors.
Since $\mathbf{S}$ can be constructed deterministically without reading all of $\mathbf{A}$, our result for PSD matrices derandomizes and improves upon established results for randomized matrix sparsification, which require sampling a random subset of $\widetilde{O}(n/\epsilon^2)$ entries and only give an approximation to any fixed $\mathbf{A}$ with high probability. We further show that any randomized algorithm must read at least $\Omega(n/\epsilon^2)$ entries to spectrally approximate a general bounded entry $\mathbf{A}$ to error $\epsilon n$, thus proving that these existing randomized algorithms are optimal up to logarithmic factors. We leverage our deterministic sparsification results to give the first deterministic algorithms for several problems, including singular value and singular vector approximation and positive semidefiniteness testing, that run in faster than matrix multiplication time. This partially addresses a significant gap between randomized and deterministic algorithms for fast linear algebraic computation.
Finally, if $\mathbf{A}$ is PSD with entries in $\{-1, 0, 1\}$, we show that a spectral approximation with error $\epsilon n$ can be obtained by deterministically reading $\widetilde{O}(n/\epsilon)$ entries of $\mathbf{A}$. This improves the $\epsilon$ dependence of our result for general bounded entry PSD matrices by a quadratic factor and is information-theoretically optimal up to a logarithmic factor.
1 Introduction
A common task in processing large matrices is element-wise sparsification. Given an matrix , the goal is to choose a small subset of coordinates in , where , such that is small, where denotes the Hadamard (entrywise) product and is a sampling matrix which equals on the entries in , but is otherwise. As in previous work, we consider operator norm error, where for a matrix , . Elementwise sparsification has been widely studied [AM07, DZ11, AKL13, BKKS21] and has been used as a primitive in several applications, including low-rank approximation [AM07, Kun15], approximate eigenvector computation [AHK06, AM07], semi-definite programming [AHK05, d’A11], and matrix completion [CR12a, CT10]. Without loss of generality, one can scale the entries of so that the maximum entry is bounded by in absolute value, and we refer to such matrices as having bounded entries. With this normalization, it will be convenient to consider the task of finding a small subset with corresponding sampling matrix such that, for a given error parameter ,
$$\|\mathbf{A} - \mathbf{A} \circ \mathbf{S}\|_2 \le \epsilon n. \qquad (1)$$
One can achieve the error guarantee in (1) for any bounded entry matrix with high probability by uniformly sampling a set of entries of . See, e.g., Theorem 1 of [DZ11]. (To apply Thm. 1 of [DZ11] we first rescale so that all entries are in magnitude, and note that in this case. Thus, they sample entries. Rescaling their error guarantee analogously gives (1).) However, there are no known lower bounds for this problem, even if we consider the harder task of universal sparsification, which requires finding a fixed subset such that (1) holds simultaneously for every bounded entry matrix . The existence of such a fixed subset corresponds to the existence of a deterministic sublinear query algorithm that constructs a spectral approximation to any by forming (which requires reading just entries of ). As we will see, such algorithms have applications to fast deterministic algorithms for linear algebraic computation. We ask:
What is the size of the smallest set that achieves (1) simultaneously for every bounded entry matrix ?
Previously, no bound on the size of better than the trivial was known.
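To make the randomized baseline discussed above concrete, the following is a minimal numpy sketch of uniform entrywise sampling with inverse-probability rescaling (in the spirit of [DZ11]); the sample size and scaling used here are illustrative assumptions rather than the exact parameters of that result.

```python
import numpy as np

def random_entrywise_sparsifier(A, m, seed=0):
    """Randomized baseline: keep m uniformly sampled entries of A, rescaled so the
    sparse matrix is an unbiased estimate of A, and zero out everything else."""
    rng = np.random.default_rng(seed)
    n1, n2 = A.shape
    S = np.zeros((n1, n2))
    idx = rng.choice(n1 * n2, size=m, replace=False)   # uniform sample of entry positions
    rows, cols = np.unravel_index(idx, (n1, n2))
    S[rows, cols] = (n1 * n2) / m                       # inverse sampling probability
    return A * S                                        # Hadamard product A ∘ S

# Spectral error on the all ones matrix with ~ n log(n) / eps^2 sampled entries.
n, eps = 500, 0.2
A = np.ones((n, n))
m = int(n * np.log(n) / eps**2)
err = np.linalg.norm(A - random_entrywise_sparsifier(A, m), 2)
print(err, eps * n)                                     # err is on the order of eps * n
```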
1.1 Our Results
Our first result answers the above question for the class of symmetric positive semidefinite (PSD) matrices, i.e., those matrices for which for all vectors , or equivalently, whose eigenvalues are all non-negative. PSD matrices arise e.g., as covariance matrices, kernel similarity matrices, and graph Laplacians, and significant work has considered efficient algorithms for approximating their properties [WLZ16, CW17, XG10, Cha11, ACK+16, MMMW21, SKO21]. We summarize our results below and give a comparison to previous work in Table 1.
We show that there exists a set with that achieves (1) simultaneously for all bounded entry PSD matrices. This improves the best known randomized bound of for algorithms which only succeed on a fixed matrix with high probability.
Theorem 1 (Universal Sparsifiers for PSD Matrices).
There exists a subset of entries of such that, letting have for and otherwise, simultaneously for all PSD matrices with bounded entries,
$$\|\mathbf{A} - \mathbf{A} \circ \mathbf{S}\|_2 \le \epsilon n.$$

Table 1: Comparison of our universal sparsification upper and lower bounds for sparse approximations (Thms 1, 3) and for general, possibly non-sparse, approximations (Thm 7) with prior randomized sparsification results.
Theorem 1 can be viewed as a significant strengthening of a classic spectral graph expander guarantee [Mar73, Alo86]; indeed, letting be the all ones matrix, we have that if satisfies (1), then it matches the near-optimal spectral expansion of Ramanujan graphs, up to a constant factor. (If we let be the average number of entries per row of , and let be the binary adjacency matrix of the graph with edges in (i.e., ), then by (1) applied with , , while for , .) In fact, we prove Theorem 1 by proving a more general claim: that any matrix that sparsifies the all ones matrix also sparsifies every bounded entry PSD matrix. In particular, Theorem 1 follows as a direct corollary of:
Theorem 2 (Spectral Expanders are Universal Sparsifiers for PSD Matrices).
Let $\mathbf{1}$ be the $n \times n$ all ones matrix. Let $\mathbf{S}$ be any matrix such that $\|\mathbf{1} - \mathbf{S}\|_2 \le \epsilon n$. Then for any PSD matrix $\mathbf{A}$ with entries bounded in magnitude by $1$, $\|\mathbf{A} - \mathbf{A} \circ \mathbf{S}\|_2 \le \epsilon n$.
To prove Theorem 1 given Theorem 2, we let $E$ be the edge set of a Ramanujan spectral expander graph with degree $d = O(1/\epsilon^2)$ and adjacency matrix $\mathbf{A}_G$, and we let $\mathbf{S} = \frac{n}{d} \cdot \mathbf{A}_G$. The top eigenvector of $\mathbf{S}$ is the all ones vector with eigenvalue $n$, matching that of $\mathbf{1}$. All other eigenvalues of $\mathbf{S}$ are bounded in magnitude by $\frac{n}{d} \cdot 2\sqrt{d-1} \le \frac{2n}{\sqrt{d}} = O(\epsilon n)$. Combined, after adjusting $\epsilon$ by a constant factor, this shows that $\|\mathbf{1} - \mathbf{S}\|_2 \le \epsilon n$, as required.
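As an illustration of this construction, the sketch below builds $\mathbf{S}$ as a scaled adjacency matrix of a $d$-regular graph and checks the guarantee numerically. A networkx random regular graph (near-Ramanujan with high probability) stands in for a deterministic Ramanujan construction, and the constant in the choice of $d$ is illustrative.

```python
import numpy as np
import networkx as nx

def expander_sampling_matrix(n, eps, seed=0):
    """S = (n/d) * (adjacency matrix of a d-regular expander) with d = Theta(1/eps^2).
    A random regular graph stands in for an explicit Ramanujan construction."""
    d = min(n - 1, int(np.ceil(9.0 / eps**2)))
    if (n * d) % 2:                       # n*d must be even for a d-regular graph to exist
        d += 1
    G = nx.random_regular_graph(d, n, seed=seed)
    return (n / d) * nx.to_numpy_array(G)

n, eps = 400, 0.25
S = expander_sampling_matrix(n, eps)
ones = np.ones((n, n))
print(np.linalg.norm(ones - ones * S, 2), eps * n)   # ||1 - 1∘S||_2 vs eps*n

# Universal guarantee (Theorem 2): the same fixed S sparsifies any bounded entry PSD A.
rng = np.random.default_rng(0)
B = rng.uniform(-1.0, 1.0, (n, 20))
A = B @ B.T
A /= np.abs(A).max()                                 # normalize to entries bounded by 1
print(np.linalg.norm(A - A * S, 2), eps * n)
```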
Sparsity Lower Bound. We show that Theorem 1 is tight. Even if we only seek to approximate the all ones matrix (rather than all bounded PSD matrices), requires non-zero entries.
Theorem 3 (Sparsity Lower Bound – PSD Matrices).
Let be the all-ones matrix. Then, for any with for large enough constant , any with must have nonzero entries.
Theorem 3 resolves an open question of [BKKS21], which asked whether a spectral approximation must have non-zeros, where is the stable rank and , where is the row of , is the ‘numerical sparsity’. They give a lower bound when but ask if this can be extended to . For the all ones matrix, and , and so Theorem 3 resolves this question. Further, by applying the theorem to a block diagonal matrix with disjoint blocks of all ones, we resolve the question for integer and .
Non-PSD Matrices. The techniques used to prove Theorem 1 also give nearly tight universal sparsification bounds for general bounded entry (not necessarily PSD) matrices. In this case, we show that a subset of entries suffices to achieve spectral norm error depending on the nuclear norm , which is the sum of singular values of .
Theorem 4 (Universal Sparsifiers for Non-PSD Matrices).
There exists a subset of entries of such that, letting have for and otherwise, simultaneously for all symmetric matrices with bounded entries,
$$\|\mathbf{A} - \mathbf{A} \circ \mathbf{S}\|_2 \le \epsilon \sqrt{n \cdot \|\mathbf{A}\|_1}. \qquad (2)$$
Note that for a bounded entry PSD matrix , we have and so for PSD matrices, (1) and (2) are equivalent.
Remark 1.
Although Theorem 4 is stated for symmetric , one can first symmetrize by considering . Then for any , we have , and , so applying the above theorem to gives us a spectral approximation to .
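A minimal sketch of the symmetrization in Remark 1 (the standard Hermitian dilation):

```python
import numpy as np

def hermitian_dilation(A):
    """Embed a (possibly rectangular, asymmetric) matrix A into the symmetric block
    matrix [[0, A], [A^T, 0]]. Its spectral norm equals ||A||_2 and its nuclear norm
    is twice that of A, so sparsifying the dilation yields a spectral approximation of A."""
    m, n = A.shape
    return np.block([[np.zeros((m, m)), A],
                     [A.T, np.zeros((n, n))]])

A = np.random.default_rng(1).uniform(-1.0, 1.0, (300, 200))
H = hermitian_dilation(A)
print(np.linalg.norm(A, 2), np.linalg.norm(H, 2))    # identical spectral norms
```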
As with Theorem 1, Theorem 4 follows from a general claim which shows that sparsifying the all ones matrix to small enough error suffices to sparsify any bounded entry symmetric matrix.
Theorem 5 (Spectral Expanders are Universal Sparsifiers for Non-PSD Matrices).
Let be the all ones matrix. Let be any matrix such that for some large enough constant . Then for any symmetric with bounded entries,
Theorem 4 follows by applying Theorem 5 where is taken to be the scaled adjacency matrix of a Ramanujan expander graph with degree .
Lower Bounds for Non-PSD Matrices. We can prove that Theorem 4 is tight up to logarithmic factors. Our lower bound holds even for the easier problem of top singular value (spectral norm) approximation and against a more general class of algorithms, which non-adaptively and deterministically query entries of the input matrix. The idea is simple: since the entries read by the deterministic algorithm are fixed, we can construct two very different input instances on which the algorithm behaves identically: one which is the all ones matrix, and the other which is one only on the entries read by the algorithm and zero everywhere else. We show that if the number of entries that are read is too small, the top singular value of the second instance is significantly smaller than the first (as compared to their Schatten-1 norms), which violates the desired error bound of . To obtain a tight bound, we apply the above construction on just a subset of the rows of the matrix on which the algorithm does not read too many entries. Here, we critically use non-adaptivity, as this subset of rows can be fixed, independent of the input instance.
Theorem 6 (Non-Adaptive Query Lower Bound for Deterministic Spectral Approximation of Non-PSD Matrices).
For any , any deterministic algorithm that queries entries of a bounded entry matrix non-adaptively and outputs satisfying must read at least entries. Here is the largest singular value of .
Observe that Theorem 4 implies the existence of a non-adaptive deterministic algorithm for the above problem with query complexity , since can be computed with non-adaptive deterministic queries and since implies via Weyl’s inequality that . Also recall that while Theorem 4 is stated for bounded entry symmetric matrices, it applies to general (possibly asymmetric) matrices by Remark 1.
Note that Theorem 6 establishes a separation between universal sparsifiers and randomized sparsification since randomly sampling entries of any achieves error by [DZ11]. In fact, for the problem of just approximating to error , randomized algorithms using just samples are known [BCJ20, BDD+23]. Theorem 6 shows that universal sparsifiers for general bounded entry matrices (and in fact all non-adaptive deterministic algorithms for spectral approximation) require a worse dependence and error bound of . We can extend Theorem 6 to apply to general deterministic algorithms that query possibly adaptively, however the lower bound weakens to . Understanding if this gap between adaptive and non-adaptive query deterministic algorithms is real is an interesting open question.
Theorem 7 (Adaptive Query Lower Bound for Deterministic Spectral Approximation of Non-PSD Matrices).
For any , any deterministic algorithm that queries entries of a bounded entry matrix (possibly adaptively) and outputs satisfying must read at least entries. Here is the largest singular value of .
1.1.1 Applications to Fast Deterministic Algorithms for Linear Algebra
Given sampling matrix such that , one can use to approximate various linear algebraic properties of . For example, by Weyl’s inequality [Wey12], the eigenvalues of approximate those of up to additive error . Thus, our universal sparsification results (Theorems 1, 4) immediately give the first known deterministic algorithms for approximating the eigenspectrum of a bounded entry matrix up to small additive error with sublinear entrywise query complexity. Previously, only randomized sublinear query algorithms were known [WS00, MM17, MW17, TYUC17, BCJ20, BDD+23].
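As a simple illustration, a sketch of the resulting deterministic eigenvalue estimator, assuming $\mathbf{S}$ is a fixed universal sparsifier such as the scaled expander adjacency matrix sketched earlier:

```python
import numpy as np

def approx_eigenvalues(A, S):
    """Eigenvalues of the sparse matrix A∘S; by Weyl's inequality each is within
    additive error ||A - A∘S||_2 of the corresponding eigenvalue of A. Here S is a
    fixed universal sparsifier; for large n one would use a sparse eigensolver
    rather than a dense eigendecomposition."""
    return np.linalg.eigvalsh(A * S)[::-1]   # descending order
```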
Further, our results yield the first deterministic algorithms for several problems that run in time, where is the matrix multiplication exponent [AW21]. Consider e.g., approximating the top singular value (the spectral norm) of . A -relative error approximation can be computed in time with high probability using iterations of the Lanczos method with a random initialization [KW92, MM15]. This can be further accelerated e.g., via randomized entrywise sparsification [DZ11], allowing an additive approximation to to be computed in time. However, prior to our work, no fast deterministic algorithms were known, even with coarse approximation guarantees. The fastest approach was just to perform a full SVD of , requiring time. In fact, this gap between randomized and deterministic methods exists for many linear algebraic problems, and resolving it is a central open question.
By combining our universal sparsification results with a derandomized power method that uses a full subspace of vectors for initialization, and iteratively approximates the singular values of via ‘deflation’ [Saa11], we give, to the best of our knowledge, the first time deterministic algorithm for approximating all singular values of a bounded entry matrix to small additive error.
Theorem 8 (Deterministic Singular Value Approximation).
Let be a bounded entry symmetric matrix with singular values . Then there exists a deterministic algorithm that, given , reads entries of , runs in time, and returns singular value approximations satisfying for all ,
Further, for all , the algorithm returns a unit vector such that and the returned vectors are all mutually orthogonal.
The runtime of Theorem 8 is stated assuming we have already constructed a sampling matrix that samples entries of and satisfies Theorem 4. It is well known that, if we let be the adjacency matrix of a Ramanujan expander graph, it can indeed be constructed deterministically in time [Alo21]. This is lower order as compared to the runtime of unless for some large enough constant . Further, for a fixed input size (and in fact, for a range of input sizes), needs to be constructed only once – see Section 2 for further discussion.
Recall that if we further assume to be PSD, then the error in Theorem 8 is bounded by . The sample complexity and runtime also improve by factors due to the tighter universal sparsifier bound for PSD matrices given in Theorem 1. Also observe that while Theorem 8 gives additive error approximations to all of ’s singular values, these approximations are only meaningful for singular values larger than , of which there are at most . Similar additive error guarantees have been given using randomized algorithms [BDD+23, WS23]. Related bounds have also been studied in work on randomized methods for spectral density estimation [WWAF06, LSY16, BKM22].
We further leverage Theorem 8 to give the first time deterministic algorithm for testing if a bounded entry matrix is either PSD or has at least one large negative eigenvalue . Recent work has focused on optimal randomized methods for this problem [BCJ20, NSW22]. We also show that, under the assumption that for some , one can deterministically compute with in time. That is, one can compute a highly accurate approximation to the top singular value in roughly linear time in the input matrix size. Again, this is the first time deterministic algorithm for this problem, and matches the runtime of the best known randomized methods for high accuracy top singular value computation, up to a factor.
1.1.2 Beyond Sparse Approximations
It is natural to ask if it is possible to achieve better than sample complexity for spectral approximation by using an algorithm that does not output a sparse approximation to , but can output more general data structures, allowing it to avoid the sparsity lower bound of Theorem 3. Theorems 6 and 7 already rule this out for both non-adaptive and adaptive deterministic algorithms for non-PSD matrices. However, it is known that this is possible with randomized algorithms for PSD matrices. For example, following [AKM+17], one can apply Theorem 3 of [MM17] with error parameter . Observing that all ridge leverage score sampling probabilities are bounded by (e.g., via Lemma 6 of the same paper and the bounded entry assumption), one can show that a Nyström approximation based on uniformly sampled columns satisfies with high probability. Further can be constructed by reading just columns and thus entries of , giving a linear rather than quadratic dependence on as compared to Theorem 1. Unfortunately, derandomizing such a column-sampling-based approach seems difficult – any deterministic algorithm must read entries in columns of , as otherwise it will fail when is entirely supported on the unread columns.
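For intuition, the randomized column-sampling approach referenced above looks roughly as follows (a sketch of the standard Nyström construction; the column count and normalization are illustrative and do not reproduce the exact parameters of [MM17]):

```python
import numpy as np

def uniform_nystrom(A, k, seed=0):
    """Nyström approximation from k uniformly sampled columns of a PSD matrix A:
    A_tilde = A[:, C] pinv(A[C, C]) A[C, :]. Reads only k columns, i.e. n*k entries."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    C = rng.choice(n, size=k, replace=False)
    AC = A[:, C]                      # n x k block of sampled columns
    W = A[np.ix_(C, C)]               # k x k intersection block
    return AC @ np.linalg.pinv(W) @ AC.T

# Illustration on a bounded entry PSD matrix.
rng = np.random.default_rng(2)
B = rng.standard_normal((500, 30))
A = B @ B.T
A /= np.abs(A).max()
A_tilde = uniform_nystrom(A, k=120)
print(np.linalg.norm(A - A_tilde, 2))  # additive spectral error
```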
Nevertheless, in the special case where is PSD and has entries in , we show that a spectral approximation can be obtained by deterministically reading just entries of .
Theorem 9 (Deterministic Spectral Approximation of Binary Magnitude PSD Matrices).
Let be PSD. Then for any , there exists a deterministic algorithm which reads entries of and returns PSD such that .
Using Turán’s theorem, we show that Theorem 9 is information theoretically optimal up to a factor, even for the potentially much easier problem of eigenvalue approximation:
Theorem 10 (Deterministic Spectral Approximation of Binary PSD Matrices – Lower Bound).
Let be PSD. Then for any , any possibly adaptive deterministic algorithm which approximates all eigenvalues of up to additive error must read entries of .
An interesting open question is if sample complexity can be achieved for deterministic spectral approximation of any bounded entry PSD matrix, with a matrix of any form, matching what is known for randomized algorithms.
Finally, we show that the PSD assumption is critical to achieve query complexity, for both randomized and deterministic algorithms. Without this assumption, queries are required, even to achieve (1) with constant probability for a single input . For bounded entry matrices, the randomized element-wise sparsification algorithms of [AM07, DZ11] read just entries of , and so our lower bounds are the first to show near-optimality of the number of entries read of these algorithms, which may be of independent interest.
Theorem 11 (Lower Bound for Randomized Spectral Approximation).
Any randomized algorithm that (possibly adaptively) reads entries of a binary matrix to construct a data structure that, for satisfies with probability at least ,
must read entries of in the worst case, provided .
1.1.3 Relation to Spectral Graph Sparsification
We remark that while our work focuses on general bounded entry symmetric (PSD) matrices, when is a PSD graph Laplacian matrix, it is possible to construct a sparsifier with just entries, that achieves a relative error spectral approximation guarantee that can be much stronger than the additive error guarantee of (1) [BSST13]. In particular, one can achieve for all . To achieve such a guarantee however, it is not hard to see that the set of entries sampled in must depend on , and cannot be universal. Fast randomized algorithms for constructing have been studied extensively [ST04, SS08, BSST13]. Recent work has also made progress towards developing fast deterministic algorithms [CGL+20].
1.2 Road Map
We introduce notation in Section 2. In Section 3, we prove our main universal sparsification results for PSD and non-PSD matrices (Theorems 1 and 4). In Section 4 we prove nearly matching lower bounds (Theorems 3 and 6). We also prove Theorems 7 and 11 which give lower bounds for general deterministic (possibly adaptive) spectral approximation algorithms, and general randomized algorithms, respectively. In Section 5, we show how to give tighter deterministic spectral approximation results for PSD matrices with entries in . Finally, in Section 6, we discuss applications of our results to fast deterministic algorithms for singular value and vector approximation.
2 Notation and Preliminaries
We start by defining notation used throughout. For any integer , let denote the set .
Matrices and Vectors. Matrices are represented with bold uppercase literals, e.g., . Vectors are represented with bold lowercase literals, e.g., . and denote all ones (resp. all zeros) matrices or vectors. The identity matrix is denoted by . The sizes of these matrices vary based on their applications. For a vector , denotes its entry. For a matrix , denotes the entry in the th row and th column. For a vector (or matrix ), (resp. ) denotes its transpose. For two matrices of the same size, denotes the entrywise (Hadamard) product.
Matrix Norms and Properties. For a vector , denotes its Euclidean norm and denotes its norm. We denote the eigenvalues of a symmetric matrix as in decreasing order. A symmetric matrix is positive semidefinite (PSD) if for all . The singular values of a matrix are denoted as in decreasing order. We let denote the spectral norm, denote the largest magnitude of an entry, denote the Frobenius norm, and denote the Schatten-1 norm (also called the trace norm or nuclear norm).
Expander Graphs. Our universal sparsifier constructions are based on Ramanujan expander graphs [LPS88, Mar73, Alo86], defined below.
Definition 2.1 (Ramanujan Expander Graphs [LPS88]).
Let be a connected -regular, unweighted and undirected graph on vertices. Let be the adjacency matrix corresponding to and be its eigenvalue, such that . Then is called a Ramanujan graph if:
Equivalently, letting be the all ones matrix, .
Efficient Construction of Ramanujan graphs. Significant work has studied efficient constructions of Ramanujan graphs [Mar73, LPS88, Mor94, Coh16], and nearly linear time constructions, called strongly explicit constructions [Alo21], have been proposed. E.g., by Proposition 1.1 of [Alo21], we can construct, for any and , a graph on vertices with second eigenvalue at most in time. Additionally, in our applications, the expander just needs to be constructed once and can then be used for any input of size . In fact, a single expander can be used for a range of input sizes, by the argument below.
Though the size of the expander above is instead of exactly , and its expansion does not exactly hit the tight Ramanujan bound of , we can let our input matrix be the top principal submatrix of a slightly larger matrix that is zero everywhere else. Then, the top principal submatrix of a spectral approximation to is a spectral approximation to . Thus, obtaining a spectral approximation to via a Ramanujan sparsifier on vertices suffices. Similarly, in our applications, we can adjust by at most a constant factor to account for the bound on the second eigenvalue, rather than the tight Ramanujan bound of .
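The padding argument is easy to state in code; a short sketch, assuming S_big is an expander-based sampling matrix on $n' \ge n$ vertices:

```python
import numpy as np

def pad_and_sparsify(A, S_big):
    """If the available sampling matrix S_big is n' x n' with n' >= n, treat A as the
    top-left principal submatrix of an n' x n' matrix that is zero elsewhere. The
    top-left n x n block of the sparsified padded matrix then spectrally approximates A."""
    n, n_big = A.shape[0], S_big.shape[0]
    A_pad = np.zeros((n_big, n_big))
    A_pad[:n, :n] = A
    return (A_pad * S_big)[:n, :n]
```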
3 Universal Sparsifier Upper Bounds
We now prove our main results on universal sparsifiers for PSD matrices (Theorem 1) and general symmetric matrices (Theorem 4). Both theorems follow from general reductions which show that any sampling matrix that sparsifies the all ones matrix to sufficient accuracy (i.e., is a sufficiently good spectral expander) yields a universal sparsifier. We prove the reduction for the PSD case in Section 3.1, and then extend it to the non-PSD case in Section 3.2.
3.1 Universal Sparsifiers for PSD Matrices
In the PSD case, we prove the following:
See 2 Theorem 1 follows directly from Theorem 2 by letting be the scaled adjacency matrix of a Ramanujan expander graph with degree (Definition 2.1).
Proof of Theorem 2.
To prove the theorem it suffices to show that for any with , . Let and be the eigenvectors and eigenvalues of so that . Then we can expand out this error as:
where we use that for any matrix , . Now observe that if we let be a diagonal matrix with the entries of on its diagonal, then we can write . Plugging this back in we have:
(3)
In the second line we use that is PSD and thus is non-negative for all . We also use that . In the third line we bound using the assumption of the theorem statement.
Finally, writing and plugging back into (3), we have:
where in the last step we use that by our bounded entry assumption, and that is a unit vector. This completes the theorem. ∎
Remark 2.
Note that we can state a potentially stronger bound for Theorem 2 by observing that in the second to last step, , where is a diagonal matrix containing the diagonal elements of . That is, letting denote that is PSD, we have the following spectral approximation bound:
3.2 Universal Sparsifiers for Non-PSD Matrices
We next extend the above approach to give a similar reduction for universal sparsification of general symmetric matrices. See 5 Observe that as compared to the PSD case (Theorem 2), here we require that gives a stronger approximation to the all ones matrix. Theorem 4 follows directly by letting be the scaled adjacency matrix of a Ramanujan expander graph with degree (Definition 2.1).
Proof of Theorem 5.
To prove the theorem, it suffices to show that for any with , . We will split the entries of into level sets according to their magnitude. In particular, let and let . Then we can write , such that for any :
(4)
Via triangle inequality, we have:
(5)
We will bound each term in (5) by . Summing over all terms we achieve a final bound of . The theorem then follows by adjusting by a constant factor.
Fixing , let if and otherwise. Let if and otherwise. In the edge case, for , let and . Writing and applying triangle inequality, we can bound:
(6)
In the following, we separately bound the two terms in (6). Roughly, since has relatively small entries and thus is relatively well spread, we will be able to show, using a similar approach to Theorem 2, that the first term is small since sparsification with approximately preserves . On the other hand, since has relatively large entries and thus is relatively sparse, we can show that the second term is small since is small (and cannot be much larger).
Term 1: Well-Spread Vectors. We start by bounding . Let and be the eigenvectors and eigenvalues of so that . Then we write:
As in the proof of Theorem 2, if we let be a diagonal matrix with the entries of on its diagonal, then we have . Further, , where are diagonal with the entries of and on their diagonals. Plugging this back in we have:
(7)
Finally, observe that by definition, for any , and . Thus, . For , by definition and so, the bound holds vacuously.
Term 2: Sparse Vectors. We next bound the second term of (6): . We write , where and contain its positive and non-positive entries respectively. Similarly, write . We can then bound via triangle inequality:
(9)
We will bound each term in (9) by , giving that overall . We first observe that
By assumption . Further, for , since for all and since , has at most non-zero entries. Thus, . For , this bound holds trivially since .
Similarly, for , since and since by definition, has at most non-zero entries. So . For this bound holds trivially since . Putting these all together, we have
(10)
We next bound . Since and are both all positive vectors, and since by assumption ,
(11)
Following the same argument used to show (10), . Further, since by the assumption of the theorem, , and since and both are at most unit norm, . Thus, plugging back into (11), we have . Identical bounds will hold for the remaining three terms of (9), since for each, the two vectors in the quadratic form have entries that either always match or never match on sign. Plugging these bounds and (10) back into (9), we obtain
(12)
which completes the required bound in the sparse case.
4 Spectral Approximation Lower Bounds
We now show that our universal sparsifier upper bounds for both PSD matrices (Theorem 1) and non-PSD matrices (Theorem 4) are nearly tight. We also give query lower bounds against general deterministic (possibly adaptive) spectral approximation algorithms (Theorem 7) and general randomized spectral approximation algorithms (Theorem 11).
4.1 Sparsity Lower Bound for PSD Matrices
We first prove that every matrix which is an spectral approximation to the all-ones matrix must have non-zero entries. This shows that Theorem 1 is optimal up to constant factors, even for algorithms that sparsify just a single bounded entry PSD matrix. The idea of the lower bound is simple: if a matrix spectrally approximates the all-ones matrix, its entries must sum to . Thus, if has just non-zero entries, it must have Frobenius norm at least . Unless , this Frobenius norm is too large for to be a -spectral approximation of the all ones matrix (which has $\|\mathbf{1}\|_F = \|\mathbf{1}\|_2 = n$).
See 3
Proof.
If we let be the all ones vector, then since , implies that . This in turn implies:
where the last inequality follows from the assumption that . If has non-zero entries, since is the -norm of the entries, and is the -norm of the entries, we conclude by norm equivalence that . Therefore, by the triangle inequality,
as long as so that . Now, using that , we have:
By our assumption that , this means that , concluding the theorem. ∎
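In symbols, the counting argument reads as follows (a reconstruction with constants kept loose; $s$ denotes the number of non-zero entries of $\mathbf{S}$):

```latex
\begin{align*}
&\text{For the unit vector } x \text{ with all entries } 1/\sqrt{n}:\quad
 \tfrac{1}{n}\textstyle\sum_{i,j} \mathbf{S}_{ij} = x^\top \mathbf{S} x
 \;\ge\; x^\top \mathbf{1} x - \|\mathbf{1}-\mathbf{S}\|_2 \;\ge\; n - \epsilon n,\\
&\text{so } \textstyle\sum_{i,j}\mathbf{S}_{ij} \ge (1-\epsilon)n^2
 \text{ and, by Cauchy--Schwarz over the } s \text{ non-zeros, }
 \|\mathbf{S}\|_F \ge \tfrac{(1-\epsilon)n^2}{\sqrt{s}}.\\
&\text{Hence }\;
 \tfrac{(1-\epsilon)n^2}{\sqrt{s}} - n
 \;\le\; \|\mathbf{S}\|_F - \|\mathbf{1}\|_F
 \;\le\; \|\mathbf{1}-\mathbf{S}\|_F
 \;\le\; \sqrt{n}\,\|\mathbf{1}-\mathbf{S}\|_2
 \;\le\; \epsilon n^{3/2},\\
&\text{which rearranges to }\;
 \sqrt{s} \;\ge\; \tfrac{(1-\epsilon)n^2}{\epsilon n^{3/2} + n}
 = \Omega\!\big(\tfrac{\sqrt{n}}{\epsilon}\big)
 \;\Longrightarrow\; s = \Omega\!\big(\tfrac{n}{\epsilon^2}\big)
 \quad\text{whenever } \epsilon \gtrsim \tfrac{1}{\sqrt{n}}.
\end{align*}
```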
4.2 Lower Bounds for Deterministic Approximation of Non-PSD Matrices
We next show that our universal sparsification bound for non-PSD matrices (Theorem 4) is tight up to a factor. Our lower bound holds even for the easier problem of top singular value (spectral norm) approximation and against a more general class of algorithms, which non-adaptively and deterministically query entries of the input matrix. We show how to extend the lower bound to possibly adaptive deterministic algorithms in Theorem 7, but with a factor loss.
See 6
Proof.
Assume that we have a deterministic algorithm that non-adaptively reads entries of any bounded symmetric matrix and outputs with . Assume for the sake of contradiction that for some sufficiently small constant .
Let be a set of rows on which reads at most entries. Such a subset must exist since the average number of entries read in any set of rows is .
Let be the matrix which is on all the rows in and zero everywhere else. Let be the matrix which matches on all entries, except is on any entry in the rows of that is not read by the algorithm . Observe that reads the same entries (all ones) and thus outputs the same approximation for and . We now bound the allowed error of this approximation. is rank-1 and so:
We have since has just entries set to – the entries that reads in the rows of . Using that is supported only on the rows of , has rank at most . Thus, we can bound:
Now, since , our algorithm incurs error on one of the two input instances at least
where the inequality follows since, by assumption, for some small constant and , and thus .
The above is a contradiction when since the above error is at least and further, for , the error is at least if we set small enough. Thus, on at least one of the two input instances the error exceeds , yielding a contradiction. ∎
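The gap exploited in this construction is easy to see numerically; a small sketch with illustrative parameters ($k$ rows containing all the reads, $q$ read entries scattered within them):

```python
import numpy as np

# Two instances that a non-adaptive deterministic algorithm cannot distinguish:
# B1 is all ones on a fixed set of k rows; B2 agrees with B1 on the q "read" positions
# in those rows and is zero elsewhere. Their top singular values differ drastically.
n, k, q = 1000, 40, 400
rng = np.random.default_rng(0)          # stands in for the fixed, non-adaptive query set
B1 = np.zeros((n, n))
B1[:k, :] = 1.0
B2 = np.zeros((n, n))
for i in range(k):
    cols = rng.choice(n, size=q // k, replace=False)
    B2[i, cols] = 1.0                   # q read entries inside the k rows
s1 = np.linalg.svd(B1, compute_uv=False)[0]   # = sqrt(k * n)
s2 = np.linalg.svd(B2, compute_uv=False)[0]   # <= sqrt(q), far smaller
print(s1, s2)
```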
We can prove a variant on Theorem 6 when the algorithm is allowed to make adaptive queries. Here, our lower bound reduces to , as we are not able to restrict our hard case to a small set of rows of the input matrix. Closing the gap here – either by giving a stronger lower bound in the adaptive case or giving an adaptive query deterministic algorithm that achieves query complexity is an interesting open question.
See 7
Proof.
Assume that we have a deterministic algorithm that reads at most entries of any bounded entry matrix and outputs with . Assume for the sake of contradiction that for some sufficiently small constant .
Let be the set of entries that reads when given the all ones matrix as input. Let be the matrix which is one on every entry in and zero elsewhere. Observe that reads the same entries and thus outputs the same approximation for and . We now bound the allowed error of this approximation. is rank-1 and has . We have and can bound where the last bound follows from our assumption that . Now, since , our algorithm incurs an error on one of the two input instances at least
where the inequality holds since, by assumption, for some small constant and , and thus .
The above is a contradiction when since the error mentioned above is at least . Furthermore, since we have bounded , the error is greater than when . Thus, on at least one of the two input instances the error exceeds , yielding a contradiction. ∎
4.3 Lower Bound for Randomized Approximation of Non-PSD Matrices
Finally, we prove Theorem 11, which gives an query lower bound for spectral approximation of bounded entry matrices that holds even for randomized and adaptive algorithms that approximate in an arbitrary manner (not necessarily via sparsification). In particular, we show the lower bound against algorithms that produce any data structure, , that satisfies for all unit norm
To prove this lower bound, we let be a random matrix where each row has binary entries that are either i.i.d. fair coin flips, or else are coin flips with bias . Each row is unbiased with probability and biased with probability . We show that an -spectral approximation to suffices to identify for at least a fraction of the rows, whether or not they are biased – roughly since the approximation must preserve the fact that is large when is supported just on this biased set. Consider a communication problem in the blackboard model, in which each of players can access just a single entry of . It is known that if the players corresponding to a single row of the matrix want to identify with good probability whether or not it is biased, they must communicate at least bits [SWXZ22]. Further, via a direct sum type argument, we can show that for the players to identify the bias of a fraction of the rows with good probability, bits must be communicated. I.e., at least of the players must read their input bits, yielding our query complexity lower bound.
Note that there may be other proof approaches here, based on a direct sum of -player Gap-Hamming instances [BJKS04, CR12b, BGPW13], but our argument is simple and already gives optimal bounds.
Section Roadmap. In Section 4.3.1, we formally define the problem of identifying the bias of a large fraction of ’s rows as the -distributed detection problem (Definition 4.2). We prove a query lower bound for this problem in Lemma 3. Then, in Section 4.3.2, we prove Theorem 11 by showing a reduction from the distributed detection problem to spectral approximation.
4.3.1 Distributed Detection Lower Bound
Here, we define additional concepts and notation specific to this section. For random variables , and , let denote the entropy, denote the conditional entropy, denote mutual information, and denote conditional mutual information [Cov99]. We also use some ideas from communication complexity. Namely, we work in the blackboard model, where parties communicate by posting messages to a public blackboard, with access to public randomness and unlimited rounds of adaptivity. Let denote the transcript of a protocol posted to the blackboard, and let denote the length of a transcript. For a fixed protocol with fixed probability of success, we may assume, without loss of generality, that has a fixed pre-determined length (see Section 2.2 of [Rou16]).
Our query complexity lower bound for spectral approximation will follow from a reduction from the following testing problem.
Definition 4.1.
(-Distributed detection problem [SWXZ22]). For fixed distributions and , with , let be a random vector such that are sampled i.i.d. from , for . The distributed detection problem is the task of determining whether or , given the values of .
This decision problem can be naturally interpreted as a communication problem in the blackboard model, where players each have a single private bit corresponding to whether a unique entry of is zero or one. Prior work takes this view to lower bound the mutual information between the transcript of a protocol which correctly solves the -Distributed detection problem with constant advantage and the sampled bits in in the case .
Theorem 12.
(Theorem 6 in [SWXZ22]). Let be the transcript of a protocol that solves -distributed detection problem with probability for any fixed choice of . Then
Next, we define the -Distributed detection problem, which combines length- -Distributed detection problems into a single joint detection problem.
Definition 4.2.
(-Distributed detection problem) For distributed uniformly on the Hamming cube, generate the matrix such that if , then all entries in the -th row of are i.i.d. samples from . Otherwise, let all entries in the -th row be sampled from . The -Distributed detection problem is the task of recovering a vector such that .
Again, this vector recovery problem has a natural interpretation as a communication problem, where players each hold a single bit of information that corresponds to whether a unique entry of is zero or one. Throughout this section, we will view Definition 4.2 as a decision problem or communication problem as needed. Let be the transcript of a protocol that solves the -Distributed detection problem in the communication model introduced at the beginning of this section. Lower bounding the mutual information between the transcript, , and the private information held by the players (the entries of ) lower bounds the length of the transcript via the following argument.
Lemma 1.
If is the transcript of a protocol that solves the -Distributed detection problem and , then .
Proof.
For random variables , [Cov99]. If a random variable has finite support, then . Recall that is the number of bits in the transcript and hence . Hence, we conclude the statement. ∎
Next, we lower bound the number of entries in the matrix that must be observed to solve a variant of the -Distributed detection problem, where we aim to correctly decide each index of with constant success probability, rather than constructing such that . There are three sources of randomness in this problem: 1) the initial randomness in sampling , 2) the random variable which depends on , and 3) the random transcript which depends on . When necessary to avoid confusion, we will be explicit regarding which of these variables are considered fixed and which are considered as random in probabilistic statements.
Lemma 2.
Let and be distributed as in the -Distributed detection problem. Any protocol with transcript that constructs a vector such that , for all , (where the randomness is with respect to , , and ), must observe entries of in the worst case.
Proof.
Consider as players each holding a single bit of information corresponding to whether a unique entry of is zero or one. Here, let be the transcript of a protocol that satisfies for every . Note that any algorithm which solves this problem by querying entries of implies a protocol for the communication problem which communicates bits, since the algorithm could be simulated by (possibly adaptively) posting each queried bit to the blackboard.
First, we lower bound the un-conditional mutual information between the private information in the -th row of and the transcript . Note that is one or zero with equal probability, therefore,
Next, we decompose the mutual information by its definition and the chain rule; then we use the fact that and that conditioning can only decrease the entropy of a random variable.
Observe that determining with the guarantee solves the -Distributed detection problem with probability at least . Therefore, by Theorem 12, , and so we can use the previous two equations to lower bound the conditional mutual information:
(13)
Next, we lower bound the total mutual information between and . Mirroring the argument of Lemma 1 in [SWXZ22], we use that, for independent random variables, entropy is additive and conditional entropy is subadditive to lower bound the mutual information between and :
where the last step follows from (13). By Lemma 1, . Therefore, every algorithm which samples entries of to construct a vector satisfying must observe at least entries of . ∎
We now have the necessary results to prove a lower bound on the number of entries of that must be observed to solve the -Distributed detection problem, which we do by reducing to the problem in Lemma 2.
Lemma 3.
Any adaptive randomized algorithm which solves the -Distributed detection problem with probability at least (with respect to the randomness in , , and ) must observe entries of .
Proof.
Any algorithm that can produce a vector such that with probability at least could be used to guarantee that by the following argument.
For any input matrix , create the matrix such that , where is a permutation sampled uniformly from the symmetric group of size . Let be the vector which satisfies . Run the considered protocol on to recover such that with probability . Finally, let for all .
Then, . Since is uniformly distributed over , is the number of correctly decided entries divided by .
By Lemma 2, we conclude that solving the -Distributed detection with probability at least requires observing entries of in the worst case. ∎
4.3.2 Spectral Approximation Query Lower Bound
Next, we show that the -Distributed detection problem can be solved using a spectral approximation of , thereby implying a query complexity lower bound for spectral approximation. See 11
Proof.
We reduce solving the -Distributed detection problem to constructing a spectral approximation satisfying the guarantee in the above theorem statement. Throughout this proof, let be the parameter associated with the -Distributed detection problem and be the accuracy of the spectral approximation.
Let be the matrix associated with the -Distributed detection problem. Suppose that by reading entries of , we can create a data structure satisfying:
We show that for sufficiently small with respect to , such a spectral approximation is sufficient to solve the -Distributed detection problem with no further queries to .
First, let denote the number of ones in the -th row of . Observe that if , and, if . By Hoeffding’s inequality [Ver18],
Therefore, by the union bound over all ,
Let , such that . Then with probability at least ,
(14)
For large enough , we have with probability at least that there are at least biased rows, since the number of biased rows is distributed as . Define the set such that and contains a maximal number of biased rows. Define the set as,
We will show that can contain at most fewer biased rows than . By guarantee of the spectral approximation and (14), (note
If has fewer biased rows than , then, . By the optimality of , . Therefore, if , then,
Solving for implies, . By assumption of the theorem statement, , therefore, .
Let be the number of biased rows in . First, we consider the case where exactly. In this case, contains biased rows, and hence, contains at least biased rows for large enough . Therefore, deciding that all rows in are biased (i.e., setting for all ) will correctly decide of the biased rows. By looking at the complement of and , we conclude that of the unbiased rows are decided correctly as well. Hence, we can construct such that by the assignment for and otherwise.
The number of biased rows is distributed as a binomial random variable, i.e., . Therefore, by Hoeffding’s inequality, . For large enough , . Therefore, with probability at least , we can reduce to the case by assuming that at most an additional rows are misclassified as biased or unbiased. Hence, with probability at least , a spectral approximation of with accuracy is sufficient to recover at least of the biased rows with probability at least .
Correctly classifying of the rows as biased or unbiased requires observing entries of by Lemma 3. Since entries of must be observed to construct the spectral approximation used to solve the -Distributed detection problem, we conclude the lower bound of the theorem statement. ∎
Note that the assumption in the previous theorem is mild, since if this assumption does not hold, then we must read nearly all entries of the matrix anyways. We also show that our construction provides a lower bound even when the input is restricted to be symmetric.
Corollary 1.
The lower bound in Theorem 11 applies when the input is restricted to symmetric binary matrices.
Proof.
While the construction used in Theorem 11 is not symmetric, we can modify it to give the same lower bound for symmetric input. Let be defined as in the proof of Theorem 11, and let be the Hermitian dilation of , i.e.,
Any query can be simulated by the query , where and . The size of is and the size of is , hence, if is a spectral approximation of such that,
then we can simulate such that,
Therefore, the proof of Theorem 11 applies to the Hermitian dilation after adjusting for a constant factor in the accuracy parameter . ∎
5 Improved Bounds for Binary Magnitude PSD Matrices
We call a PSD matrix a binary magnitude PSD matrix if , i.e., if . In this section, we give deterministic spectral approximation algorithms for such matrices with improved dependencies. Specifically, we show that there exists a deterministic error spectral approximation algorithm reading just entries of the input matrix. This improves on our bound for general bounded entry PSD matrices (Theorem 1) by a factor, while losing a factor. Further, we show that it is tight up to a logarithmic factor, via a simple application of Turán’s theorem [Tur41].
The key idea of our improved algorithm is that for any PSD , we can write , so that , where are the and columns of respectively. Since has bounded entries, for any , . Thus, we can apply transitivity arguments: if is large (close to ) and is also large, then must be relatively large as well. Thus, we can hope to infer the contribution of an entry to without necessarily reading that entry, allowing for reduced sample complexity.
The logic is particularly clean when is binary. In this case, we can restrict to looking at the principal submatrix corresponding to for which . The remainder of the matrix is all zeros. For such , we have , and thus if and , we can conclude that . This implies that, up to a permutation of the row/column indices, is a block diagonal matrix, with blocks of all ones, corresponding to groups of indices with for all . The eigenvalues of this matrix are exactly the sizes of these blocks, and to recover a spectral approximation, it suffices to recover the set corresponding to any block of size .
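For concreteness, here is the transitivity computation in the binary case, written out under the factorization $\mathbf{A} = \mathbf{B}^\top \mathbf{B}$ with columns $b_i$ (so $\mathbf{A}_{ij} = \langle b_i, b_j \rangle$ and $\|b_i\|_2^2 = \mathbf{A}_{ii} = 1$ on the support):

```latex
\|b_i - b_j\|_2^2 \;=\; \mathbf{A}_{ii} + \mathbf{A}_{jj} - 2\mathbf{A}_{ij}
 \;=\; 2 - 2\mathbf{A}_{ij},
\qquad\text{so}\qquad
\mathbf{A}_{ij} = 1 \iff b_i = b_j .
```

Hence $\mathbf{A}_{ij} = \mathbf{A}_{jk} = 1$ implies $b_i = b_j = b_k$ and thus $\mathbf{A}_{ik} = \langle b_i, b_k \rangle = 1$, which is exactly the block (equivalence class) structure used below.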
To do so, we employ a weak notion of expanders [Pip87, WZ93], that requires that any two subsets of vertices both of size , are connected by at least one edge. Importantly, such expanders can be constructed with edges – beating Ramanujan graphs, which would require edges, but satisfy much stronger notions of edge discrepancy. Further, if we consider the graph whose edges are in the intersection of those in the expander graph and of the non-zero entries in , we can show that any block of ones in of size will correspond to a large sized connected component in this graph. We can form this graph by querying at positions (the edges in the expander) and then computing these large connected components to recover a spectral approximation. Each sparse connected component becomes a dense block of all ones in our approximation, avoiding the sparsity lower bound of Theorem 3. Noting that we can extend this logic to matrices with entries in , yields the main result of this section, Theorem 9.
It is not hard to see that is optimal, even for binary PSD matrices (Theorem 10). By Turán’s theorem, any sample set of size does not read any entries in some principal submatrix of size . By letting be the identity plus a block of all ones on this submatrix, we force our algorithm to incur approximation error, since it cannot find this large block of ones.
Section Roadmap. We formally prove the block diagonal structure of binary magnitude PSD matrices in Section 5.1. We leverage this structure to give our improved spectral approximation algorithm for these matrices in Section 5.2. Finally, in Section 5.3 we prove a nearly matching lower bound via Turán’s theorem.
5.1 Structure of Binary Magnitude PSD Matrices
As discussed, our improved algorithm hinges on the fact that, up to a permutation of the rows and columns, binary magnitude PSD matrices are block diagonal. Further, each block is itself a rank- matrix, consisting of two on-diagonal blocks of ’s and two off-diagonal blocks of ’s. Formally,
Lemma 4 (Binary Magnitude PSD Matrices are Block Matrices).
Let be PSD and let for be the eigenvalues of , with and . Then, there exists a permutation matrix such that
Further, each is a rank-1 PSD matrix. Letting denote the set of indices corresponding to the principal submatrix , which we call its support set, can be partitioned into two subsets and such that, if we let for and for , then are orthogonal and .
Proof.
Since is PSD, there exists with rows such that . Note that either or for all since . Also for any , if or , then . Otherwise, if or were , we would have or and thus .
Thus, if , we have , , and . I.e., we must have . Similarly, if , we have and and so . Finally, if , then . This implies that the indices can be grouped into disjoint subsets () such that for any and any , either or . Also for any and with , we must have . Label these subsets such that . Now observe that any subset can be further divided into two disjoint subsets and such that if or , but if and (or vice-versa), . Thus, the principal submatrix corresponding to , which we label is a rank-1 block matrix with two blocks of all ones on the principal submatrices corresponding to and and ’s on all remaining entries.
For any , let be a vector such that if , if and otherwise. Then, observe that . Thus, each is an eigenvalue of and each is an eigenvector. Since has disjoint supports, are orthogonal. Further, and thus, all other eigenvalues of are . Thus, we can write . This concludes the lemma. ∎
5.2 Improved Spectral Approximation of Binary Magnitude PSD matrices
Given Lemma 4, the key idea to efficiently recovering a spectral approximation to a binary magnitude PSD matrix is that we do not need to read very many entries in each block to identify the block. We just need enough to identify a large fraction of the indices in the support set .
To ensure that our sampling is able to do so, we will sample according to the edges of a certain type of weak expander graph, which ensures that any two vertex sets of large enough size are connected. I.e., that we read at least one entry in any large enough off-diagonal block of our matrix.
Definition 5.1 (-expander graphs [WZ93, Pip87]).
For any , an -expander graph is any undirected graph on vertices such that any two disjoint subsets of vertices containing at least vertices each are joined by an edge.
Moreover, it is not hard to show that -expanders with just edges exist.
Fact 1.
[WZ93] A random regular graph is an -expander with high probability.
We note that while a spectral expander graph, such as a Ramanujan graph, as used to achieve Theorem 1, will also be an -expander by the expander mixing lemma [CG02], such graphs require edges, as opposed to the given by Fact 1. Further, while Fact 1 is not constructive, in [WZ93] an explicit polynomial time construction of -expander graphs is shown, with just an loss. This result can be plugged directly into our algorithms to give a polynomial time constructible deterministic spectral approximation algorithm with sample complexity.
Fact 2 (Constructive -expanders [WZ93]).
There is a polynomial time algorithm that, given an integer and , constructs an -expanding graph on vertices with maximum degree .
We now prove our main technical result, which shows that if we read the entries of a binary magnitude PSD matrix according to an -expander graph, then we can approximately recover all blocks of this matrix by considering the connected components of the expander graph restricted to edges where is nonzero.
Theorem 13 (Approximation via Sampled Connected Components).
Let be a PSD matrix with eigenvalues and . Let be an -expanding graph on vertices with adjacency matrix . Let be the binary adjacency matrix with whenever and . Let be the graph whose adjacency matrix is . Let denote the support sets of ’s blocks (as defined in Lemma 4) with , and let be the induced subgraph of restricted to . Finally, let be the largest connected component of . We have:
Proof.
First observe that since , we trivially have that . Thus, it remains to show that . This holds vacuously if . Thus, we focus our attention on . For contradiction, assume . Let denote the connected components of such that . So . Observe that since all entries in in the principal submatrix indexed by (i.e., in from Lemma 4) have magnitude , . Consider two cases.
Case 1: . From our assumption: which implies that . Let so . Since are disconnected, there are no edges from to in , and thus no edges in , since . However, this contradicts the fact that is an expander and thus must have at least one edge between any two sets of vertices of size at least (Definition 5.1). Thus we have a contradiction and our assumption is incorrect. So we must have as needed.
Case 2: . Since , we have . Consider partitioning these connected components into two sets, and . Let be the vertex sets corresponding to these two partitions, i.e., and . Pick and to minimize . Since , for this optimal , we must have .
Since and are disconnected in , they must be disconnected in . Thus, we must have either or since any two subsets of vertices with size must have an edge between them in due to it being an -expander. Assume w.l.o.g. that . Then since , we must have . However, this means that , which contradicts the fact that we picked such that .
∎
We now present our deterministic algorithm for spectral approximation of binary magnitude PSD matrices, based on Theorem 13. The key idea is to compute all connected components of – the graph whose edges lie in the intersection of the edges of and the non-zero entries of . Any large enough block in (see Lemma 4) will correspond to a large connected component in this graph and can thus be identified.
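A minimal Python sketch of this procedure follows (a stand-in for Algorithm 1, not the paper's exact pseudocode): entries of $\mathbf{A}$ are read only through `query`, a random regular graph stands in for the explicit $(a, n)$-expander of Fact 2, and the degree and component-size threshold constants are illustrative.

```python
import numpy as np
import networkx as nx

def binary_psd_spectral_approx(query, n, eps, seed=0):
    """Given query access query(i, j) -> A[i, j] to a PSD matrix with entries in
    {-1, 0, 1}, return a PSD approximation built from the large blocks of A. Entries
    of A are read only at expander edges plus one column per large connected component."""
    d = min(n - 1, max(3, int(np.ceil((8.0 / eps) * np.log(2.0 / eps)))))
    if (n * d) % 2:
        d += 1
    H = nx.random_regular_graph(d, n, seed=seed)   # stand-in for an explicit weak expander
    G = nx.Graph()
    G.add_nodes_from(range(n))
    G.add_edges_from((i, j) for i, j in H.edges() if query(i, j) != 0)
    A_tilde = np.zeros((n, n))
    for comp in nx.connected_components(G):
        if len(comp) >= eps * n / 2:               # large component <=> large block of A
            v = next(iter(comp))
            col = np.array([query(i, v) for i in range(n)], dtype=float)
            A_tilde += np.outer(col, col)          # recovers the rank-1 block u u^T
    return A_tilde

# Usage on a {0,1} block diagonal PSD matrix: large blocks are recovered exactly,
# blocks smaller than ~eps*n are missed and contribute at most ~eps*n spectral error.
n, eps = 600, 0.2
A = np.zeros((n, n))
for start, size in [(0, 200), (200, 150), (350, 10)]:
    A[start:start + size, start:start + size] = 1.0
A_hat = binary_psd_spectral_approx(lambda i, j: A[i, j], n, eps)
print(np.linalg.norm(A - A_hat, 2), eps * n)
```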
Query complexity: Observe that Algorithm 1 requires querying at the locations where is non-zero (to perform step 2). By Fact 1, there are -expander graphs with just edges, so this requires queries. Further, in a graph with nodes, there can be at most connected components of size . Thus, line (7) requires total query complexity to read one column of corresponding to each large component identified.
See 9
Proof.
We will show that Algorithm 1 satisfies the requirements of the theorem. We have already argued above that its query complexity is .
By Lemma 4, we know that, up to a permutation, is a block matrix with rank- blocks with support sets . Since has edges only where is non-zero, each connected component in is entirely contained within one of these support sets. Further, if , then by Theorem 13, there is exactly one connected component, call it (the largest connected component of ) with vertices in and . If , then there is at most one such connected component by Theorem 13 (there may be none).
So, Step 7 is triggered for exactly one connected component for each with and at most one connected component for all other . If Line 7 is triggered, then we add to . Observe that is a column for , and thus by Lemma 4, is supported only on . We have that either on and on , or on and . That is, is equal to either or as defined in Lemma 4.
So overall, letting be the set of blocks for which one connected component is recovered in Line 7, . From Lemma 4, and so . Since the are orthogonal and since for we must have , this gives that , completing the theorem.
∎
5.3 Lower Bounds for Binary PSD Matrix Approximation
We can prove that our query complexity for binary magnitude matrices is optimal up to a factor via Turán’s Theorem, stated below. Our lower bound holds for the easier problem of eigenvalue approximation for PSD .
Fact 3 (Turán's theorem [Tur41, Aig95]).
Let be a graph on vertices that does not include a -clique as a subgraph. Let be the number of edges in . Then
Using Turán's theorem we can prove: See 10
Proof.
Let be a deterministic algorithm that approximates the eigenvalues of a binary PSD input matrix to additive error. Let , i.e., the identity matrix. Let be the first off-diagonal entries read by on input . Since the choices of the algorithm are deterministic, these same entries will be the first entries queried by from any input matrix which has ones on the diagonal and zeroes in all entries of , even if the algorithm is adaptive.
Without loss of generality, we can assume that the entries in are above the diagonal, since reading any entry below the diagonal of an input matrix would be redundant due to symmetry. Consider a graph G with vertex set and undirected edge set , that is, the edge set contains all undirected edges such that for . Hence, for all ,
where the first inequality is true since . Therefore, if we set , we conclude by the contrapositive of Fact 3 that there exists a clique of size in . Let denote the set of edges which forms this clique in .
Construct such that for all such that and for all such that . Then, there is a principal submatrix containing all ones in where the off-diagonal entries correspond to edges in along with the diagonal entries which are all ones by construction. Hence, contains a non-zero principal submatrix of size , and so it has an eigenvalue that is at least . Both and have identical values on the diagonal and on the first off-diagonal entries read by (i.e., the entries in ). Therefore, the entries of and queried by are identical, and hence cannot distinguish these two matrices, despite the fact that their maximum eigenvalues differ by .
After adjusting for constant factors, we conclude that any deterministic (possibly adaptive) algorithm that estimates the eigenvalues of a -matrix to additive error must observe entries of the input matrix in the worst case.
∎
6 Applications to Fast Deterministic Algorithms for Linear Algebra
We finally show how to leverage our universal sparsifiers to design the first time deterministic algorithms for several problems related to singular value and vector approximation. As discussed in Section 1.1.1, throughout this section, runtimes are stated assuming that we have already constructed a deterministic sampling matrix satisfying the universal sparsification guarantees of Theorem 1 or Theorem 4. When is the adjacency matrix of a Ramanujan graph, this can be done in time – see Section 1.1.1 and Section 2 for further discussion.
In Section 6.1, we consider approximating the singular values of to additive error . Observe that if we deterministically compute with via Theorem 4, by Weyl’s inequality, all singular values of approximate those of up to additive error . Thus we can focus on approximating ’s singular values.
To do so efficiently, we will use the fact that is sparse and so admits very fast time matrix-vector products. Focusing on the top singular value, it is well known that the power method, initialized with a vector that has non-negligible (i.e., ) inner product with the top singular vector of , converges to a relative error approximation in iterations. I.e., the method outputs with . Each iteration requires a single matrix-vector product with , yielding total runtime . Since is unknown, one typically initializes the power method with a random vector, which has non-negligible inner product with with high probability. To derandomize this approach, we simply try orthogonal starting vectors, e.g., the standard basis vectors, of which at least one must have large inner product with . This simple brute force approach yields total runtime , which, for large enough , is .
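As an illustration, the following Python sketch captures this brute-force derandomization; the function name, the exact iteration count, and the tie-breaking rule are our own choices rather than the paper's Algorithm 2.

```python
import numpy as np

def derandomized_power_method(matvec, n, eps):
    """Brute-force derandomized power method (a sketch, not the paper's exact
    Algorithm 2): try every standard basis vector e_1, ..., e_n as the start
    and keep the start whose final iterate z maximizes ||M z||.

    matvec : x -> M x for the symmetric matrix M under consideration
    Returns (sigma_est, z) with sigma_est = ||M z|| for the best start.
    """
    num_iters = max(1, int(np.ceil(np.log(max(n, 2) / eps) / eps)))  # O(log(n/eps)/eps)
    best_val, best_vec = 0.0, np.zeros(n)
    for j in range(n):
        z = np.zeros(n)
        z[j] = 1.0                       # deterministic starting vector e_j
        for _ in range(num_iters):
            z = matvec(z)
            nrm = np.linalg.norm(z)
            if nrm == 0.0:               # e_j lies in the null space; move on
                break
            z = z / nrm
        val = np.linalg.norm(matvec(z))  # ||M z|| never exceeds sigma_1, and is
                                         # at least (1 - eps) sigma_1 for some start
        if val > best_val:
            best_val, best_vec = val, z
    return best_val, best_vec
```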
To approximate more than just the top singular value, one would typically use a block power or Krylov method. However, derandomization is difficult here – unlike in the single vector case, it is not clear how to pick a small set of deterministic starting blocks for which at least one gives a good approximation. Thus, we instead use deflation [Saa11, AZL16]. We first use our deterministic power method to compute with . We then run the method again on the deflated matrix , which we can still multiply by in time. This second run outputs which is orthogonal to and which has . Since gives a good approximation to , we then argue that and so . We repeat this process to approximate all large singular values of . The key challenge is bounding the accumulation of error that occurs at each step.
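Continuing the sketch above, the deflation loop can be written as follows. For readability this version forms the deflated matrix explicitly; the implicit, runtime-efficient variant is sketched later in Section 6.1.

```python
import numpy as np

def deflated_top_singular_values(B, n, k, eps):
    """Estimate the top k singular values of a symmetric matrix B by repeated
    deflation, using derandomized_power_method from the previous sketch.
    The deflated matrix is formed explicitly here only for clarity; the runtime
    analysis in Section 6.1 applies the projections implicitly instead.
    Returns (sigmas, V): the estimates and the recovered unit vectors."""
    M = B.toarray() if hasattr(B, "toarray") else np.array(B, dtype=float)
    sigmas, V = [], []
    for _ in range(k):
        val, z = derandomized_power_method(lambda x: M @ x, n, eps)
        sigmas.append(val)
        V.append(z)
        P = np.eye(n) - np.outer(z, z)   # project away the recovered direction
        M = P @ M @ P                    # deflated matrix (I - z z^T) M (I - z z^T)
    return sigmas, V
```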
In Section 6.2, we apply the above derandomized power method approach to the problem of testing whether is PSD or has at least one large negative eigenvalue . We reduce this problem to approximating the top singular value of a shifted version of .
Finally, in Section 6.3, we consider approximating to high accuracy – e.g., to relative error with , under the assumption that for some . Here, one cannot simply approximate in place of , since the error due to sparsification is too large. Instead, we use our deflation approach to compute a ‘coarse’ approximate set of top singular vectors for . We then use these singular vectors to initialize a block Krylov method run on itself, which we argue converges in iterations, giving total runtime . The key idea in showing convergence is to argue that since by assumption, we have . So, there must be at least some singular value with that is larger than by . The presence of this spectral gap allows us to (1) argue that the first approximate singular vectors output by our deflation approach applied to have non-negligible inner product with the true top singular vectors of and (2) that the block Krylov method initialized with these vectors converges rapidly to a high accuracy approximation to , via known gap-dependent convergence bounds [MM15].
6.1 Deterministic Singular Value Approximation
We start by giving a deterministic algorithm to approximate all singular values of a symmetric bounded entry matrix up to error in time. Our main result appears as Theorem 8.
As discussed, we first deterministically compute a sparsified matrix from with . We then use a simple brute-force derandomization of the classical power method to approximate the top singular value of (Lemma 6). To approximate the rest of ’s singular values, we repeatedly deflate the matrix and compute the top singular value of the deflated matrix (Lemma 7). By Weyl’s inequality, the singular values of approximate those of up to error. Thus, we can argue that our approximations will approximate the singular values of itself, yielding Theorem 8. We first present the derandomized power method that will be applied to in Algorithm 2.
Let be a matrix with its SVD given by where are matrices with orthonormal columns containing the left and right singular vectors respectively and is a diagonal matrix containing the singular values . Observe that any vector can be written as where are scalar coefficients and are the columns of . Then, we have the following well-known result which shows that when the starting unit vector has a large enough inner product with i.e., , power iterations on converge to a approximation to within iterations. We will leverage this lemma to prove the correctness and runtime of our deterministic power method, Algorithm 2.
Lemma 5 (Power iterations – gap independent bound).
Let be a matrix with largest singular value and left singular vectors . Let be a unit vector where . Then, for , if we set ,
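In our own notation (the displayed guarantee is not reproduced above), the gap-independent statement being proved is roughly the following:

```latex
% Gap-independent power method bound, stated in our own notation: if z_0 is a
% unit vector with |<z_0, u_1>| >= 1/sqrt(n), then for
% t = O(log(n/eps)/eps), the normalized iterate z_t = A^t z_0 / ||A^t z_0||
% satisfies
\[
  \|A z_t\|_2 \;\ge\; (1-\epsilon)\,\sigma_1(A).
\]
```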
Proof.
We prove Lemma 5 for completeness. Let . Then, . First observe that . Thus, . Let be the smallest such that . Observe that if no such exists, then for any unit vector , , and thus the lemma holds trivially.
Let where and . Then:
Since for all , and since by assumption , for all . Also, for any , and for . Thus, we have . Setting , we get .
Next, observe that which implies that or . Thus, using the fact that and , and the fact that for , we get:
Finally, dividing both sides of the above equation by , we get . Taking the square root of both sides gives . ∎
Next, we prove the correctness of Algorithm 2 using Lemma 5. The idea is to run power iterations using all standard basis vectors as starting vectors. At least one of these vectors will have high inner product with , and so this starting vector will give us a good approximation to by Lemma 5. We also prove that the squared singular values of the deflated matrix are close to the squared singular values of the original matrix up to an additive error of . This will be used as part of the correctness proof for our main algorithm for singular value approximation.
Lemma 6.
Let be the output of Algorithm 2 for some matrix with singular values and error parameter as input. Then:
Further, for any :
Proof.
Let be the left singular vector corresponding to the largest singular value of . Observe that since , there exists at least one standard basis vector such that . After iterations of the inner loop of Algorithm 2, we have . Using Lemma 5, we thus have . Also since is a unit vector, we trivially have for all . Since for output by Algorithm 2, , we thus have:
(15) |
giving our first error bound.
We now prove the second bound on . Using the Pythagorean theorem, . We have . We thus have . Combining this with (15):
(16) |
where is the best rank- approximation to in the Frobenius norm, with . So, we have:
Simply subtracting from the left-hand side and pulling out of the summation for some gives that
(17) |
Using the eigenvalue min-max theorem (Fact 4), we have for all , . Using this fact in (17),
which implies that . This completes the proof after adjusting by a factor of . ∎
We now state as Algorithm 3 our main deterministic algorithm for estimating singular values of a matrix by leveraging Algorithm 2 as a subroutine. We will then show that by applying Algorithm 3 to a sparse spectral approximation of , we can estimate the singular values of up to error in time.
Lemma 7.
Let be the output of Algorithm 3 for some matrix with singular values and error parameter as input. Then, are orthogonal unit vectors and for any :
Proof.
Let for all . Since each is the output of Algorithm 2 with as the input, using the bound of Lemma 6, for any we get that
(18) |
Note that , where is a matrix with columns whose th column is equal to . We first prove that is a matrix with orthonormal columns. First note that each is a unit vector since Algorithm 2 always outputs a unit vector. We can prove that all are orthogonal to each other by induction. Suppose that all with are orthogonal to each other for some . Now, where is the identity matrix. Observe that lies in the column span of . Thus, to prove that is orthogonal to all with , it is enough to show that each such is orthogonal to the column space of . This follows since is a projection onto the span of . Thus, , and so is orthogonal to all with . Hence, is a matrix with orthonormal columns. This implies that is a rank projection matrix.
By the above orthogonality claim, and we have . So, to prove the lemma, it is enough to bound . We first prove the lower bound on using the min-max principle of eigenvalues, which we state below.
Fact 4 (Eigenvalue Minimax Principle – see e.g., [Bha13]).
Let be a symmetric matrix with eigenvalues , then for all ,
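For reference, the Courant–Fischer characterization underlying Fact 4 reads, in our own notation:

```latex
% Courant-Fischer min-max principle: for a symmetric M in R^{n x n} with
% eigenvalues lambda_1 >= ... >= lambda_n and any 1 <= k <= n,
\[
  \lambda_k(M) \;=\; \max_{\substack{S \subseteq \mathbb{R}^n \\ \dim(S) = k}}\;
  \min_{\substack{x \in S \\ \|x\|_2 = 1}} x^\top M x .
\]
```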
We now prove the upper bound on . For any and any from Lemma 6 we have:
Now using the fact that and , we get:
(19) |
From (18) we have . Squaring both sides and applying the bound from (19) times repeatedly on the RHS, we get
Finally, taking the square root on both sides and using that for any non-negative gives us the upper bound. ∎
With Lemma 7 in place, we are now ready to state and prove our main result on deterministic singular value estimation for any bounded entry matrix. See 8
Proof.
Note that can have at most singular values greater than . Thus, it is enough to approximate only the top singular values of up to error . The rest of the singular values can be approximated by to still have the required approximation error of . Let be the sparsification of as described in Theorem 4, which satisfies the spectral approximation guarantee of that theorem. Using Weyl’s inequality, we have for all :
Thus, it is enough to approximate the top singular values of up to error and then use the triangle inequality to get the final approximation to the top singular values of . To do so, we run Algorithm 3 with as the input matrix, as the error parameter, and as the number of singular values to be estimated. Using Lemma 7, we get orthonormal vectors for such that:
(20) |
Let for . Now since , using Weyl’s inequality, we have . Thus, for any , . So, from Equation (20), we get:
This gives us . Finally, we get:
where the second inequality follows from the bound on and Weyl’s inequality, which gives us . This gives us the required additive approximation error after adjusting by constants. Next, observe that for , using the triangle inequality:
where in the second step, the first term is bounded as . This gives us the second bound (after adjusting by constants).
Sample Complexity and Runtime Analysis. By Theorem 4, the number of queries to ’s entries needed to construct is . The loop estimating the singular values in Algorithm 3 runs times, and Algorithm 2 is called in Line 3 inside the loop each time with error parameter . At the i-th iteration of the loop in Algorithm 3, in Line 4 we have . Note that since the matrix could be dense, we do not explicitly compute to do power iterations in Algorithm 2 with this matrix. Instead, in each step of the power iteration in Line 5 of Algorithm 2, we can first calculate in time and then multiply this vector first with and then with in time to get . Thus, each power iteration (Lines 5 and 6 of Algorithm 2) takes time. Since we call Algorithm 2 with error parameter , the number of power iterations with each starting vector is . There are also different starting vectors. Thus, each call to Algorithm 2 in Line 3 of Algorithm 3 takes time, and the total running time of the algorithm is (as the loop in Algorithm 3 runs times). ∎
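The implicit product described in this analysis can be sketched in a few lines; it replaces the explicitly formed deflated matrix used in the earlier deflation sketch.

```python
import numpy as np

def deflated_matvec(AS, V, z):
    """One matrix-vector product with (I - V V^T)(A o S)(I - V V^T), applied
    implicitly as in the runtime analysis above.
    AS : sparse n x n matrix A o S;  V : n x i matrix of deflation vectors."""
    x = z - V @ (V.T @ z)     # project out span(V):  O(n i)
    y = AS @ x                # sparse product:       O(nnz(S))
    return y - V @ (V.T @ y)  # project again:        O(n i)
```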
6.2 Deterministic PSD Testing
We next leverage our deterministic power method approach (Algorithm 2) to give an time deterministic algorithm for testing whether a bounded entry matrix is either PSD or has at least one large negative eigenvalue . An optimal randomized algorithm for this problem with detection threshold was presented in [BCJ20]. The idea of our approach is to approximate the maximum singular value of , which will be smaller than 1 if is PSD and greater than 1 if is far from being PSD.
Theorem 14 (Deterministic PSD testing with -gap).
Given , let be a symmetric bounded entry matrix with eigenvalues such that either is PSD, i.e., , or is at least -far from being PSD, i.e., . There exists a deterministic algorithm that distinguishes between these two cases by reading entries of and runs in time .
Proof.
Let be a sparsification of as described in Theorem 4 with approximation parameter that satisfies . Using Weyl’s inequality we have for all , .
Let be the estimate of output by Algorithm 2 with input and error parameter . Then by Lemma 6, we have:
Let . Let be the estimate of output by Algorithm 2 with input and error parameter . Then by Lemma 6, we again have:
We now distinguish between the two cases as follows:
Case 1: When is PSD, we have . Now, .
First assume that . Now, . Thus, . Now, assume . Then, we must have as otherwise, we will have . Thus, we get that .
Case 2: When is far from being PSD, we have . Observe that . Then, we have .
Thus, in Case 1, we have while for Case 2, we have . So we can distinguish between the above two cases.
Sample Complexity and Runtime Analysis. The total runtime is the time taken to estimate and using Algorithm 2 with error parameter . This is since we run iterations of the power method with different starting vectors, where each iteration takes time since and both have non-zero entries. The sample complexity required to form and is .
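Putting the pieces together, here is a hedged Python sketch of the tester, reusing the derandomized_power_method sketch from earlier in this section. The particular shift and acceptance threshold below are illustrative stand-ins for the quantities in the proof above, not the paper's exact construction or constants.

```python
import numpy as np

def psd_test(AS, threshold, eps):
    """Sketch of the PSD tester.  AS is the sparse spectral approximation
    A o S, threshold is the detection scale of Theorem 14, and eps is the
    accuracy used inside the power method.  The shift lam_hat*I - AS and the
    acceptance rule are our own illustrative choices."""
    n = AS.shape[0]
    lam_hat, _ = derandomized_power_method(lambda x: AS @ x, n, eps)
    # Top singular value of the shifted matrix lam_hat*I - AS; when the shift
    # dominates, this is roughly lam_hat - lambda_min(A o S).
    v, _ = derandomized_power_method(lambda x: lam_hat * x - AS @ x, n, eps)
    # PSD inputs keep lambda_min(A o S) only slightly negative, so v stays
    # close to lam_hat; inputs with a large negative eigenvalue push v well
    # above lam_hat.
    return "PSD" if v <= lam_hat + threshold / 2 else "far from PSD"
```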
∎
6.3 High Accuracy Spectral Norm Approximation
Finally, we show that, under the assumption that for some , we can deterministically compute such that in time. This allows us to set to to get a high accuracy approximation to in roughly linear time in the input matrix size. This is the first time deterministic algorithm for this problem, and to the best of our knowledge matches the best known randomized methods for high accuracy top singular value computation, up to a factor. Our approach is described in Algorithm 4 below.
Algorithm 4 first computes (Step 2) a coarse approximate subspace of top singular vectors for using our deflation-based Algorithm 3 applied to the sparsified matrix . This subspace is then used in Step 3 to initialize a Block Krylov method applied to to compute a much more accurate approximation to . Our proof will leverage the analysis of Block Krylov methods given in [MM15]. Specifically, we will apply Theorem 13 of [MM15], which bounds the number of iterations required to find a accurate approximation to the top singular values in terms of the singular value gap for some block size .
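A sketch of this two-stage pipeline is given below, reusing the helper sketches from earlier in this section (deflated_top_singular_values plays the role of Step 2). The block Krylov stage here is a plain power-basis construction followed by Rayleigh–Ritz, a simplification of the method of [MM15] rather than its exact implementation.

```python
import numpy as np

def high_accuracy_sigma1(A, AS, k, q, eps_coarse):
    """Sketch of the two-stage pipeline of Algorithm 4.

    A          : the (dense, symmetric) input matrix
    AS         : its sparsification A o S
    k, q       : block size and number of Krylov iterations (set per the analysis)
    eps_coarse : error parameter for the coarse deflation stage
    """
    n = A.shape[0]
    # Stage 1 (Step 2): coarse top-k block computed only from the sparsifier.
    _, V = deflated_top_singular_values(AS, n, k, eps_coarse)
    Z = np.column_stack(V)                      # n x k starting block
    # Stage 2 (Step 3): block Krylov iterations on A itself.
    blocks = [Z]
    for _ in range(q):
        blocks.append(A @ (A @ blocks[-1]))     # apply A^T A = A^2 (A symmetric)
    Q, _ = np.linalg.qr(np.hstack(blocks))      # orthonormal Krylov basis
    T = Q.T @ A @ Q                             # Rayleigh-Ritz projection
    return float(np.max(np.abs(np.linalg.eigvalsh(T))))
```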
The proof of Theorem 13 in [MM15] uses a degree amplifying polynomial (for large enough ) in the Krylov subspace to significantly separate the top singular values from the rest. It is shown that if the starting block is such that projected onto the column span of is a good rank approximation to , then the approximate singular values and vectors output by the Block Krylov method will be accurate. To prove this condition for our starting block, we will apply the following well-known result, which bounds the error of any low rank approximation to a matrix in terms of the error of the best possible low rank approximation.
Lemma 8 (Lemma 48 of [Woo14], proved in [BDMI14]).
Let be a low-rank matrix factorization of , with , and . Let with be any matrix such that . Let be an orthonormal basis for the column span of . Then,
The above result lets us bound the difference between the Frobenius norm error of projecting onto the column span of , and the error of the best rank approximation of . The proof of Theorem 13 of [MM15] uses Lemma 4 of [MM15], (which in turn is proven using Lemma 8 above) to show that where is an orthonormal basis for , where is a random starting block, is some constant, and is the best rank approximation to . In order to apply this result in our setting, we need to show that the same bound applies for our starting block , which is computed deterministically by Algorithm 3.
In particular, by Lemma 8, it suffices to bound , where contains the top singular vectors of (which are identical to those of ). That is, we need to ensure that the columns of and the top singular vectors of have large inner product with each other. We next prove that this is the case when there is a large gap between the top singular values and the rest. We will later show that under the assumption that , there is always some that satisfies this gap assumption.
Lemma 9 (Starting Block Condition).
Let be a bounded entry symmetric matrix with singular values . Let where the sampling matrix is as specified in Theorem 4 with error parameter for a sufficiently small constant and some . For some , let be the output of Algorithm 3 with input and parameters and . For some , assume that for some constant ,
Also let be the matrix with columns equal to the singular vectors of corresponding to its largest singular values. Let contain the first columns of . Then for some constant ,
Proof.
Before proving the main result, we will bound the error between and where is a column of for any . From Lemma 7, we have for any :
since we run Algorithm 3 with error parameter . Also since , using Weyl’s inequality and the fact that we have (for some constant )
Now, using triangle inequality,
where the second inequality uses that . Thus, using the bounds on , we get:
for some constant . Finally, squaring both sides, we get for any :
(21) |
We are now ready to prove the main result of the lemma. Note that showing is equivalent to showing that . Assume for contradiction that . Then there is some vector with such that . Let . Observe that since has orthonormal columns. Then, for any column of ,
(22) |
where the last step follows by our assumption. Then, using triangle inequality and the above bound, we have , where the last equality uses the fact that since it has bounded entries. Thus, for any , squaring both sides of this bound, we get:
(23) |
for some constant , where the second inequality uses , and the last inequality uses (21). Moreover since is spanned by , we have:
(24) |
where the final inequality is by the gap assumption in the lemma statement.
Consider the matrix . Next, we show that under our assumption that , will be very close to orthonormal, and further that by (23) and (24), . Since has columns, this will lead to a contradiction since we must have . In particular, using (23) and (24),
(25) |
for some constant which will be positive as long as is large enough compared to (which can be made arbitrarily small by our setting of in the lemma statement).
We will now show that all eigenvalues of are close to , which implies that is approximately orthogonal. This allows us to translate the bound on given in (25) to an orthonormal basis for the column span of .
Observe that for and using (22), otherwise. Thus, for all . Further, when and as (and similarly when and ). Also, when and , otherwise. We can then apply Gershgorin’s circle theorem to bound the eigenvalues of .
Fact 5 (Gershgorin’s circle theorem [Ger31]).
Let with entries . For , let be the sum of absolute values of non-diagonal entries in the th row. Let be the closed disc centered at with radius . Then every eigenvalue of lies within one of the discs .
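In our own notation, the disc containment asserted by Fact 5 reads:

```latex
% Gershgorin's circle theorem: for M in R^{n x n} with entries m_{ij} and
% off-diagonal row sums R_i = sum_{j != i} |m_{ij}|, every eigenvalue lambda
% of M satisfies
\[
  \lambda \;\in\; \bigcup_{i=1}^{n} \left\{\, z \in \mathbb{C} : |z - m_{ii}| \le R_i \,\right\}.
\]
```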
From Fact 5, all eigenvalues of lie in the range . Thus, is a full rank matrix and so, is a rank matrix. Let be the SVD of , where and are orthonormal matrices and is a diagonal matrix of singular values of . Then, all eigenvalues of or all diagonal entries of lie in for some constant .
Now (for some constant ) where the last step follows from the fact that since is an orthonormal square matrix. Then we have
(26) |
for some constant , where the last inequality uses (25) and the fact that . However, by the optimality of the SVD for Frobenius norm low-rank approximation (based on the eigenvalue min-max theorem of Fact 4), for any matrix with orthonormal columns, . This contradicts (26), which was derived using the assumption . ∎
We now prove the main result of this section. We will first show that there will be a large (in terms of ) singular value gap for some and then use Lemmas 8 and 9 to prove a gap-dependent bound similar to Theorem 13 of [MM15].
Theorem 15 (High accuracy spectral norm approximation).
Let be a bounded entry symmetric matrix with largest singular value , such that for some . Then there exists an time deterministic algorithm that computes , such that .
Proof.
We will first show that there exists some such that there is a large enough gap between the squared singular values and . For we have and squaring both sides we have . Also since , we have . Thus there exists such that
(27) |
Let be the output of step 2 of Algorithm 4. Let be the matrix with the first columns of . Then performing Block Krylov iterations (i.e. step 3 of Algorithm 4) with as a starting block can only yield a larger approximation for as compared to doing Block Krylov iterations with as a starting block. Thus it suffices to show that Block Krylov iterations (step 3 of Algorithm 4) with starting block produces some matrix with orthonormal columns such that where is the first columns of .
To show this, we will use Theorem 13 of [MM15], which bounds the number of iterations of randomized Block Krylov in terms of the singular value gap. To apply this in our setting, we need to show that is a good enough starting block. Specifically, let be the amplifying polynomial used in the proof of Theorem 13 of [MM15]. Let be an orthonormal basis for the span of . Then, following the proof in [MM15], to prove the gap-dependent convergence bound, we just need to show that
where is the best rank approximation to and is a constant. To prove this, we proceed in a manner similar to the proof of Lemma 4 of [MM15]. Using Lemma 8, we have:
where contains the top singular vectors of as its columns (note that is symmetric, so it has the same left and right singular vectors). Thus, it suffices to show that
for some constant . Using the spectral submultiplicativity property, we have
where the second step uses the fact that . From Lemma 9, we have for some constant . Thus, we have shown that doing Block Krylov iterations with starting block gives us the same guarantees as Theorem 13 of [MM15] for gap-dependent convergence of randomized Block Krylov.
Specifically, we can apply the per-vector guarantee (equation 3 of [MM15]) i.e. if is the first column of the matrix after doing Block Krylov iterations, and is the singular vector of corresponding to its largest singular value, we have . This implies that (as ).
Runtime Analysis. The time to compute from in step 2 of Algorithm 4 with and error parameter (for some small ) is , since we only need to estimate the top singular values, and estimating each singular value using power iterations in Algorithm 3 takes time as described in the analysis of Theorem 8.
The number of iterations of the Block Krylov with starting block is given by as specified in Theorem 13 of [MM15]. Thus, we first need to bound . We have:
where the second inequality uses (27) and for the third inequality we use the fact that . Taking square root we have , where is a large enough constant. Thus, the number of iterations of Block Krylov in step 3 of Algorithm 4 is . From Theorem 7 of [MM15], for iterations, Block Krylov takes time where . Thus, the total time taken by step 3 of Algorithm 4 is . Since asymptotically dominates each term of , the total time taken by the Algorithm 4 is . ∎
Remark. Since we can apply the per-vector guarantee (Equation 3 of [MM15]) for , i.e., the output of Block Krylov in Algorithm 4, for all such that the condition in (27) holds and any , we have , where is the th column of . This implies that for all we have (as ). Thus we are able to approximate the top singular values and corresponding singular vectors of with high accuracy using Algorithm 4, where the th singular value of satisfies the condition of (27).
7 Conclusion
Our work has shown that it is possible to deterministically construct an entrywise sampling matrix (a universal sparsifier) with just non-zero entries such that for any bounded entry with , . We show how to achieve sparsity when is restricted to be PSD (Theorem 1) and when is general (Theorem 4), and prove that both these bounds are tight up to logarithmic factors (Theorems 3 and 6). Further, our proofs are based on simple reductions, which show that any that spectrally approximates the all ones matrix to sufficient accuracy (i.e., is a sufficiently good spectral expander) yields a universal sparsifier.
We apply our universal sparsification bounds to give the first time deterministic algorithms for several core linear algebraic problems, including singular value/vector approximation and positive semidefiniteness testing. When is restricted to be PSD and have entries in , we show how to achieve an improved deterministic query complexity of to construct a general spectral approximation , which may not be sparse (Theorem 9). We again show that this bound is tight up to a logarithmic factor (Theorem 10).
Our work leaves several open questions:
1. An interesting question is whether sample complexity can be achieved for deterministic spectral approximation of any bounded entry PSD matrix if the sparsity of the output is not restricted, thereby generalizing the upper bound for PSD -matrices proven in Theorem 9. This query complexity is known for randomized algorithms based on column sampling [MM17]; however, it is not currently known how to derandomize such results.
2. It would also be interesting to close the factor gap between our universal sparsification upper bound of queries for achieving spectral approximation error for non-PSD matrices (Theorem 4) and our query lower bound for general deterministic algorithms that make possibly adaptive queries to (Theorem 7). By Theorem 6, our universal sparsification bound is tight up to log factors for algorithms that make non-adaptive deterministic queries to . It is unknown if adaptive queries can be used to give improved algorithms.
3. Finally, it would be interesting to understand whether our deterministic algorithms for spectrum approximation can be improved. For example, can one compute an additive error or a relative error approximation to the top singular value for bounded entry and constant in time deterministically? Are there fundamental barriers that make doing so difficult?
Acknowledgements
We thank Christopher Musco for helpful conversations about this work. RB, CM, and AR were partially supported by an Adobe Research grant, along with NSF Grants 2046235, 1763618, and 1934846. AR was also partially supported by the Manning College of Information and Computer Sciences Dissertation Writing Fellowship. GD was partially supported by NSF AF 1814041, NSF FRG 1760353, and DOE-SC0022085. SS was supported by an NSERC Discovery Grant RGPIN-2018-06398 and a Sloan Research Fellowship. DW was supported by a Simons Investigator Award.
References
- [ACK+16] Alexandr Andoni, Jiecao Chen, Robert Krauthgamer, Bo Qin, David P. Woodruff, and Qin Zhang. On sketching quadratic forms. In Proceedings of the 7th Conference on Innovations in Theoretical Computer Science (ITCS), 2016.
- [AHK05] Sanjeev Arora, Elad Hazan, and Satyen Kale. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2005.
- [AHK06] Sanjeev Arora, Elad Hazan, and Satyen Kale. A fast random sampling algorithm for sparsifying matrices. In Proceedings of the 10th International Workshop on Randomization and Computation (RANDOM), 2006.
- [Aig95] Martin Aigner. Turán’s graph theorem. The American Mathematical Monthly, 1995.
- [AKL13] Dimitris Achlioptas, Zohar Shay Karnin, and Edo Liberty. Near-optimal entrywise sampling for data matrices. In Advances in Neural Information Processing Systems 26 (NeurIPS), 2013.
- [AKM+17] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
- [Alo86] Noga Alon. Eigenvalues and expanders. Combinatorica, 1986.
- [Alo21] Noga Alon. Explicit expanders of every degree and size. Combinatorica, 2021.
- [AM07] Dimitris Achlioptas and Frank McSherry. Fast computation of low rank matrix approximations. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing (STOC), 2007.
- [AW21] Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. In Proceedings of the 32nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021.
- [AZL16] Zeyuan Allen-Zhu and Yuanzhi Li. LazySVD: Even faster SVD decomposition yet without agonizing pain. Advances in Neural Information Processing Systems 29 (NeurIPS), 2016.
- [BCJ20] Ainesh Bakshi, Nadiia Chepurko, and Rajesh Jayaram. Testing positive semi-definiteness via random submatrices. In Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2020.
- [BDD+23] Rajarshi Bhattacharjee, Gregory Dexter, Petros Drineas, Cameron Musco, and Archan Ray. Sublinear time eigenvalue approximation via random sampling. In Proceedings of the 50th International Colloquium on Automata, Languages and Programming (ICALP), 2023.
- [BDMI14] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal column-based matrix reconstruction. SIAM Journal on Computing, 2014.
- [BGPW13] Mark Braverman, Ankit Garg, Denis Pankratov, and Omri Weinstein. Information lower bounds via self-reducibility. In Proceedings of the 8th International Computer Science Symposium in Russia (CSR), 2013.
- [Bha13] Rajendra Bhatia. Matrix analysis. Springer Science & Business Media, 2013.
- [BJKS04] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences, 2004.
- [BKKS21] Vladimir Braverman, Robert Krauthgamer, Aditya R. Krishnan, and Shay Sapir. Near-optimal entrywise sampling of numerically sparse matrices. In Proceedings of the 34th Annual Conference on Computational Learning Theory (COLT), 2021.
- [BKM22] Vladimir Braverman, Aditya Krishnan, and Christopher Musco. Sublinear time spectral density estimation. In Proceedings of the 54th Annual ACM Symposium on Theory of Computing (STOC), 2022.
- [BSST13] Joshua Batson, Daniel A Spielman, Nikhil Srivastava, and Shang-Hua Teng. Spectral sparsification of graphs: theory and algorithms. Communications of the ACM, 2013.
- [CG02] Fan Chung and Ronald Graham. Sparse quasi-random graphs. Combinatorica, 2002.
- [CGL+20] Julia Chuzhoy, Yu Gao, Jason Li, Danupon Nanongkai, Richard Peng, and Thatchaphol Saranurak. A deterministic algorithm for balanced cut with applications to dynamic connectivity, flows, and beyond. In Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2020.
- [Cha11] Françoise Chatelin. Spectral approximation of linear operators. SIAM, 2011.
- [Coh16] Michael B. Cohen. Ramanujan graphs in polynomial time. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2016.
- [Cov99] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
- [CR12a] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 2012.
- [CR12b] Amit Chakrabarti and Oded Regev. An optimal lower bound on the communication complexity of gap-Hamming-distance. SIAM Journal on Computing, 2012.
- [CT10] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transations on Information Theory, 2010.
- [CW17] Kenneth L. Clarkson and David P. Woodruff. Low-rank PSD approximation in input-sparsity time. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2017.
- [d’A11] Alexandre d’Aspremont. Subsampling algorithms for semidefinite programming. Stochastic Systems, 2011.
- [DZ11] Petros Drineas and Anastasios Zouzias. A note on element-wise matrix sparsification via a matrix-valued Bernstein inequality. Information Processing Letters, 2011.
- [Ger31] Semyon Aranovich Gershgorin. Über die Abgrenzung der Eigenwerte einer Matrix. Izvestiya Rossiyskoy akademii nauk. Seriya matematicheskaya, 1931.
- [Kun15] Abhisek Kundu. Element-wise matrix sparsification and reconstruction. PhD thesis, Rensselaer Polytechnic Institute, USA, 2015.
- [KW92] Jacek Kuczyński and Henryk Woźniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 1992.
- [LPS88] Alexander Lubotzky, Ralph Phillips, and Peter Sarnak. Ramanujan graphs. Combinatorica, 1988.
- [LSY16] Lin Lin, Yousef Saad, and Chao Yang. Approximating spectral densities of large matrices. SIAM Review, 2016.
- [Mar73] Grigorii Aleksandrovich Margulis. Explicit constructions of concentrators. Problemy Peredachi Informatsii, 1973.
- [MM15] Cameron Musco and Christopher Musco. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. Advances in Neural Information Processing Systems 28 (NeurIPS), 2015.
- [MM17] Cameron Musco and Christopher Musco. Recursive sampling for the Nyström method. Advances in Neural Information Processing Systems 30 (NeurIPS), 2017.
- [MMMW21] Raphael A Meyer, Cameron Musco, Christopher Musco, and David P Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA). SIAM, 2021.
- [Mor94] Moshe Morgenstern. Existence and explicit constructions of regular Ramanujan graphs for every prime power . Journal of Combinatorial Theory, Series B, 1994.
- [MW17] Cameron Musco and David P. Woodruff. Sublinear time low-rank approximation of positive semidefinite matrices. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2017.
- [NSW22] Deanna Needell, William Swartworth, and David P. Woodruff. Testing positive semidefiniteness using linear measurements. In Proceedings of the 63rd Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2022.
- [Pip87] Nicholas Pippenger. Sorting and selecting in rounds. SIAM Journal on Computing, 1987.
- [Rou16] Tim Roughgarden. Communication complexity (for algorithm designers). Foundations and Trends in Theoretical Computer Science, 2016.
- [Saa11] Yousef Saad. Numerical methods for large eigenvalue problems: Revised edition. SIAM, 2011.
- [SKO21] Florian Schäfer, Matthias Katzfuss, and Houman Owhadi. Sparse Cholesky factorization by Kullback–Leibler minimization. SIAM Journal on Scientific Computing, 2021.
- [SS08] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC), 2008.
- [ST04] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), 2004.
- [SWXZ22] Vaidehi Srinivas, David P. Woodruff, Ziyu Xu, and Samson Zhou. Memory bounds for the experts problem. In Proceedings of the 54th Annual ACM Symposium on Theory of Computing (STOC), 2022.
- [Tur41] Pál Turán. On an extremal problem in graph theory. Mat. Fiz. Lapok, 1941.
- [TYUC17] Joel A. Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Randomized single-view algorithms for low-rank matrix approximation, 2017.
- [Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science. Cambridge University Press, 2018.
- [Wey12] Hermann Weyl. The asymptotic distribution law of the eigenvalues of linear partial differential equations (with an application to the theory of cavity radiation). Mathematische Annalen, 1912.
- [WLZ16] Shusen Wang, Luo Luo, and Zhihua Zhang. SPSD matrix approximation vis column selection: Theories, algorithms, and extensions. The Journal of Machine Learning Research, 2016.
- [Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 2014.
- [WS00] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems 13 (NeurIPS), 2000.
- [WS23] David Woodruff and William Swartworth. Optimal eigenvalue approximation via sketching. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing (STOC), 2023.
- [WWAF06] Alexander Weisse, Gerhard Wellein, Andreas Alvermann, and Holger Fehske. The kernel polynomial method. Reviews of modern physics, 2006.
- [WZ93] Avi Wigderson and David Zuckerman. Expanders that beat the eigenvalue bound: Explicit construction and applications. In Proceedings of the 25th Annual ACM Symposium on Theory of Computing (STOC), 1993.
- [XG10] Jianlin Xia and Ming Gu. Robust approximate Cholesky factorization of rank-structured symmetric positive definite matrices. SIAM Journal on Matrix Analysis and Applications, 2010.