Sublinear Time Eigenvalue Approximation via Random Sampling
Abstract
We study the problem of approximating the eigenspectrum of a symmetric matrix $A \in \mathbb{R}^{n \times n}$ with bounded entries (i.e., $\|A\|_\infty \le 1$). We present a simple sublinear time algorithm that approximates all eigenvalues of $A$ up to additive error $\pm \epsilon n$ using those of a randomly sampled principal submatrix. Our result can be viewed as a concentration bound on the complete eigenspectrum of a random submatrix, significantly extending known bounds on just the singular values (the magnitudes of the eigenvalues). We give improved error bounds of $\pm \epsilon \sqrt{\mathrm{nnz}(A)}$ and $\pm \epsilon \|A\|_F$ when the rows of $A$ can be sampled with probabilities proportional to their sparsities or their squared norms respectively. Here $\mathrm{nnz}(A)$ is the number of non-zero entries in $A$ and $\|A\|_F$ is its Frobenius norm. Even for the strictly easier problems of approximating the singular values or testing the existence of large negative eigenvalues (Bakshi, Chepurko, and Jayaram, FOCS ’20), our results are the first that take advantage of non-uniform sampling to give improved error bounds. From a technical perspective, our results require several new eigenvalue concentration and perturbation bounds for matrices with bounded entries. Our non-uniform sampling bounds require a new algorithmic approach, which judiciously zeroes out entries of a randomly sampled submatrix to reduce variance, before computing the eigenvalues of that submatrix as estimates for those of $A$. We complement our theoretical results with numerical simulations, which demonstrate the effectiveness of our algorithms in practice.
1 Introduction
Approximating the eigenvalues of a symmetric matrix is a fundamental problem – with applications in engineering, optimization, data analysis, spectral graph theory, and beyond. For an $n \times n$ matrix, all eigenvalues can be computed to high accuracy using direct eigendecomposition in $O(n^\omega)$ time, where $\omega \approx 2.37$ is the exponent of matrix multiplication [DDHK07, AW21]. When just a few of the largest magnitude eigenvalues are of interest, the power method and other iterative Krylov methods can be applied [Saa11]. These methods repeatedly multiply the matrix of interest by query vectors, requiring $O(n^2)$ time per multiplication when the matrix is dense and unstructured.
For large $n$, it is desirable to have even faster eigenvalue approximation algorithms, running in $o(n^2)$ time – i.e., sublinear in the size of the input matrix. Unfortunately, for general matrices, no non-trivial approximation can be computed in $o(n^2)$ time: without reading $\Omega(n^2)$ entries, it is impossible to distinguish with reasonable probability if all entries (and hence all eigenvalues) are equal to zero, or if there is a single pair of arbitrarily large entries at positions $(i,j)$ and $(j,i)$, leading to a pair of arbitrarily large eigenvalues. Given this, we seek to address the following question:
Under what assumptions on a symmetric input matrix can we compute non-trivial approximations to its eigenvalues in $o(n^2)$ time?
It is well known that time eigenvalue computation is possible for highly structured inputs, like tridiagonal or Toeplitz matrices [GE95]. For sparse or structured matrices that admit fast matrix vector multiplication, one can compute a small number of the largest in magnitude eigenvalues in time using iterative methods. Through the use of robust iterative methods, fast top eigenvalue estimation is also possible for matrices that admit fast approximate matrix-vector multiplication, such as kernel similarity matrices [GS91, HP14, BIMW21]. Our goal is to study simple, sampling-based sublinear time algorithms that work under much weaker assumptions on the input matrix.
1.1 Our Contributions
Our main contribution is to show that a very simple algorithm can be used to approximate all eigenvalues of any symmetric matrix with bounded entries. In particular, for any symmetric $A$ with $\|A\|_\infty \le 1$, sampling a random principal submatrix of $A$ of sufficiently large size and appropriately scaling its eigenvalues yields a $\pm \epsilon n$ additive error approximation to all eigenvalues of $A$ with good probability. (Here and throughout, $\tilde O(\cdot)$ hides logarithmic factors in the argument.) Note that by scaling, our algorithm gives a $\pm \epsilon \|A\|_\infty n$ approximation for any symmetric $A$. This result is formally stated below.
Theorem 1 (Sublinear Time Eigenvalue Approximation).
Let be symmetric with and eigenvalues . Let be formed by including each index independently with probability as in Algorithm 1. Let be the corresponding principal submatrix of , with eigenvalues .
For all with , let . For all with , let . For all other , let . If , for large enough constant , then with probability , for all ,
See Figure 1 for an illustration of how the eigenvalues of are mapped to estimates for all eigenvalues of . Note that the principal submatrix sampled in Theorem 1 will have rows/columns with high probability. Thus, with high probability, the algorithm reads just entries of and runs in time. Standard matrix concentration bounds imply that one can sample random entries from the random submatrix and preserve its eigenvalues to error with probability [AM07]. See Appendix F for a proof. This can be directly combined with Theorem 1 to give improved sample complexity:
Corollary 1 (Improved Sample Complexity via Entrywise Sampling).
Let be symmetric with and eigenvalues . For any , there is an algorithm that reads entries of and returns, with probability at least , for each satisfying .
Observe that the dependence on the failure probability $\delta$ in Theorem 1 and Corollary 1 can be improved via standard arguments: running the algorithm with constant failure probability, repeating $O(\log(1/\delta))$ times, and taking the median estimate for each eigenvalue. This guarantees that the algorithm fails with probability at most $\delta$, at the expense of an $O(\log(1/\delta))$ dependence in the complexity.
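This boosting step is simple to implement; a minimal sketch in Python, where `estimate_eigenvalues` is a stand-in (our name) for one run of Algorithm 1 with constant failure probability:

```python
import numpy as np

def boosted_estimates(A, eps, num_repeats, estimate_eigenvalues):
    """Repeat a constant-failure-probability estimator and take the
    entrywise median; the failure probability of each median estimate
    drops exponentially with num_repeats."""
    trials = np.array([estimate_eigenvalues(A, eps) for _ in range(num_repeats)])
    # Median over repetitions, separately for each eigenvalue index.
    return np.median(trials, axis=0)
```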
Comparison to known bounds. Theorem 1 can be viewed as a concentration inequality on the full eigenspectrum of a random principal submatrix of . This significantly extends prior work, which was able to bound just the spectral norm (i.e., the magnitude of the top eigenvalue) of a random principal submatrix [RV07, Tro08a]. Bakshi, Chepurko, and Jayaram [BCJ20] recently identified developing such full eigenspectrum concentration inequalities as an important step in expanding our knowledge of sublinear time property testing algorithms for bounded entry matrices.
Standard matrix concentration bounds [GT11] can be used to show that the singular values of (i.e., the magnitudes of its eigenvalues) are approximated by those of a random submatrix (see Appendix G) with independently sampled rows and columns. However, such a random matrix will not be symmetric or even have real eigenvalues in general, and thus no analogous bounds were previously known for the eigenvalues themselves.
Recently, Bakshi, Chepurko, and Jayaram [BCJ20] studied the closely related problem of testing positive semidefiniteness in the bounded entry model. They show how to test whether the minimum eigenvalue of is either greater than or smaller than by reading just entries. They show that this result is optimal in terms of query complexity, up to logarithmic factors. Like our approach, their algorithm is based on random principal submatrix sampling. Our eigenvalue approximation guarantee strictly strengthens the testing guarantee – given approximations to all eigenvalues, we immediately solve the testing problem. Thus, our query complexity is tight up to a factor. It is open if our higher sample complexity is necessary to solve the harder full eigenspectrum estimation problem. See Section 1.4 for further discussion.
Improved bounds for non-uniform sampling. Our second main contribution is to show that, when it is possible to efficiently sample rows/columns of $A$ with probabilities proportional to their sparsities or their squared norms, significantly stronger eigenvalue estimates can be obtained. In particular, letting $\mathrm{nnz}(A)$ denote the number of nonzero entries in $A$ and $\|A\|_F$ denote its Frobenius norm, we show that sparsity-based sampling yields eigenvalue estimates with error $\pm\epsilon\sqrt{\mathrm{nnz}(A)}$ and norm-based sampling gives error $\pm\epsilon\|A\|_F$. See Theorems 2 and 3 for formal statements. Observe that when $\|A\|_\infty \le 1$, the eigenvalues of $A$ are bounded in magnitude by $\|A\|_F \le \sqrt{\mathrm{nnz}(A)} \le n$. Thus, Theorems 2 and 3 are natural strengthenings of Theorem 1. Row norm-based sampling (Theorem 3) additionally removes the bounded entry requirement of Theorems 1 and 2.
As discussed in Section 1.3.1, sparsity-based sampling can be performed in sublinear time when $A$ is stored in a slightly augmented sparse matrix format or when $A$ is the adjacency matrix of a graph accessed in the standard graph query model of the sublinear algorithms literature [GR97]. Norm-based sampling can also be performed efficiently with an augmented matrix format, and is commonly studied in randomized and ‘quantum-inspired’ algorithms for linear algebra [FKV04, Tan18].
Theorem 2 (Sparse Matrix Eigenvalue Approximation).
Let be symmetric with and eigenvalues . Let be formed by including the th index independently with probability as in Algorithm 2. Here is the number of non-zero entries in the row of . Let be the corresponding principal submatrix of , and let be the estimate of computed from as in Algorithm 2. If , for large enough constant , then with probability , for all , .
Theorem 3 (Row Norm Based Matrix Eigenvalue Approximation).
Let be symmetric and eigenvalues . Let be formed by including the th index independently with probability as in Algorithm 3. Here is the norm of the row of . Let be the corresponding principal submatrix of , and let be the estimate of computed from as in Algorithm 3. If , for large enough constant , then with probability , for all ,
The above non-uniform sampling theorems immediately yield algorithms for testing the presence of a negative eigenvalue with magnitude at least $\epsilon\sqrt{\mathrm{nnz}(A)}$ or $\epsilon\|A\|_F$ respectively, strengthening the testing results of [BCJ20], which require eigenvalue magnitude at least $\epsilon n$. In the graph property testing literature, there is a rich line of work exploring the testing of bounded degree or sparse graphs [GR97, BSS10]. Theorem 2 can be thought of as a first step in establishing a related theory of sublinear time approximation algorithms and property testers for sparse matrices.
Surprisingly, in the non-uniform sampling case, the eigenvalue estimates derived from the sampled submatrix cannot simply be its scaled eigenvalues, as in Theorem 1. E.g., when $A$ is the identity, our row sampling probabilities are uniform in all cases. However, the scaled submatrix will be a scaled identity, with eigenvalues much larger than $1$ – failing to give a good approximation to the true eigenvalues (all of which are $1$) unless the sample is very large. To handle this, and related cases, we must argue that selectively zeroing out entries in sufficiently low probability rows/columns of the sampled submatrix (see Algorithms 2 and 3) does not significantly change the spectrum, and ensures concentration of the submatrix eigenvalues. It is not hard to see that simple random submatrix sampling fails even for the easier problem of singular value estimation. Theorems 2 and 3 give the first results of their kind for this problem as well.
1.2 Related Work
Eigenspectrum estimation is a key primitive in numerical linear algebra, typically known as spectral density estimation. The eigenspectrum is viewed as a distribution with mass at each of the eigenvalues, and the goal is to approximate this distribution [WWAF06, LSY16]. Applications include identifying motifs in social networks [DBB19], studying Hessian and weight matrix spectra in deep learning [SBL16, YGL+18, GKX19], ‘spectrum splitting’ in parallel eigensolvers [LXES19], and the study of many systems in experimental physics and chemistry [Wan94, SR94, HBT19].
Recent work has studied sublinear time spectral density estimation for graph structured matrices – Braverman, Krishnan, and Musco [BKM22] show that the spectral density of a normalized graph adjacency or Laplacian matrix can be estimated to $\epsilon$ error in the Wasserstein distance in sublinear time. Cohen-Steiner, Kong, Sohler, and Valiant study a similar setting, giving a runtime depending only on $1/\epsilon$ [CSKSV18]. We note that the additive error eigenvalue approximation result of Theorem 1 (analogously Theorems 2 and 3) directly gives an $\epsilon n$ approximation to the spectral density in the Wasserstein distance – extending the above results to a much broader class of matrices. When $\|A\|_\infty \le 1$, $A$ can have eigenvalues as large as $n$, while the normalized adjacency matrices studied in [CSKSV18, BKM22] have eigenvalues in $[-1, 1]$. So, while the results are not directly comparable, our Wasserstein error of $\epsilon n$ can be thought of as on the order of their error of $\epsilon$ after scaling.
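For intuition on the Wasserstein guarantee: for two spectra of the same length, viewed as uniform distributions over their eigenvalues, the Wasserstein-1 distance is simply the average absolute difference between sorted eigenvalues. A minimal sketch (function name ours):

```python
import numpy as np

def wasserstein1_spectra(eigs_a, eigs_b):
    """W1 distance between two equal-size spectra with uniform weights:
    the mean absolute gap between sorted eigenvalues. Per-eigenvalue
    additive error of eps*n therefore implies W1 error at most eps*n."""
    a = np.sort(np.asarray(eigs_a, dtype=float))
    b = np.sort(np.asarray(eigs_b, dtype=float))
    return float(np.mean(np.abs(a - b)))
```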
Our work is also closely related to a line of work on sublinear time property testing for bounded entry matrices, initiated by Balcan et al. [BLWZ19]. In that work, they study testing of rank, Schatten-$p$ norms, and several other global spectral properties. Sublinear time testing algorithms for the rank and other properties have also been studied under low-rank and bounded row norm assumptions on the input matrix [KS03, LWW14]. Recent work studies positive semidefiniteness testing and eigenvalue estimation in the matrix-vector query model, where each query computes $Ax$ for some chosen vector $x$. As in Theorem 3, eigenvalue estimation up to error $\pm\epsilon\|A\|_F$ can be achieved with a small number of such queries in this model [NSW22]. Finally, several works study streaming algorithms for eigenspectrum approximation [AN13, LNW14, LW16]. These algorithms are not sublinear time – they require at least linear time to process the input matrix. However, they use sublinear working memory. Note that Theorem 1 immediately gives a sublinear space streaming algorithm for eigenvalue estimation. We can simply store the sampled submatrix as its entries are updated.
1.3 Technical Overview
In this section, we overview the main techniques used to prove Theorem 1, and then discuss how these techniques are extended to prove Theorems 2 and 3. We start by defining a decomposition of any symmetric $A$ into the sum of two matrices containing its large and small magnitude eigendirections.
Definition 1.1 (Eigenvalue Split).
Let be symmetric. For any , let where is diagonal, with the eigenvalues of with magnitude on its diagonal, and has the corresponding eigenvectors as columns. Similarly, let where has the eigenvalues of with magnitude on its diagonal and has the corresponding eigenvectors as columns. Then, can be decomposed as
Any principal submatrix of , , can be similarly written as
where are the corresponding submatrices obtained by sampling rows of .
Since , and are all symmetric, we can use Weyl’s eigenvalue perturbation theorem [Wey12] to show that for all eigenvalues of ,
(1)
We will argue that the eigenvalues of approximate those of – i.e. all eigenvalues of with magnitude . Further, we will show that is small with good probability. Thus, via (1), the eigenvalues of approximate those of . In the estimation procedure of Theorem 1, all other small magnitude eigenvalues of are estimated to be , which will immediately give our approximation bound when the original eigenvalue has magnitude .
Bounding the eigenvalues of . The first step is to show that the eigenvalues of well-approximate those of . As in [BCJ20], we critically use that the eigenvectors corresponding to large eigenvalues are incoherent – intuitively, since is bounded, their mass must be spread out in order to witness a large eigenvalue. Specifically, [BCJ20] shows that for any eigenvector of with corresponding eigenvalue , . We give related bounds on the Euclidean norms of the rows of (the leverage scores of ), and on these rows after weighting by .
Using these incoherence bounds, we argue that the eigenvalues of approximate those of up to error. A key idea is to bound the eigenvalues of , which are identical to the non-zero eigenvalues of . Via a matrix Bernstein bound and our incoherence bounds on , we show that this matrix is close to with high probability. However, since may be complex, the matrix is not necessarily Hermitian and standard perturbation bounds [SgS90, HJ12] do not apply. Thus, to derive an eigenvalue bound, we apply a perturbation bound of Bhatia [Bha13], which generalizes Weyl’s inequality to the non-Hermitian case, with a factor loss. To the best of our knowledge, this is the first time that perturbation theory bounds for non-Hermitian matrices have been used to prove improved algorithmic results in the theoretical computer science literature.
We note that in Appendix B, we give an alternate bound, which instead analyzes the Hermitian matrix , whose eigenvalues are again identical to those of . This approach only requires Weyl’s inequality, and yields an overall bound of , improving the factors of Theorem 1 at the cost of worse dependence.
Bounding the spectral norm of . The next step is to show that all eigenvalues of are small provided a sufficiently large submatrix is sampled. This means that the “middle” eigenvalues of , i.e., those with smaller magnitude, do not contribute much to any eigenvalue . To do so, we apply a theorem of [RV07, Tro08a] which shows concentration of the spectral norm of a uniformly random submatrix of an entrywise bounded matrix. Observe that while , such a bound will not in general hold for . Nevertheless, we can use the incoherence of to show that is bounded, which, via the triangle inequality, yields a bound on . In the end, we show that if , with probability at least , . After the scaling in the estimation procedure of Theorem 1, this spectral norm bound translates into an additive error in approximating the eigenvalues of $A$.
Completing the argument. Once we establish the above bounds on and , Theorem 1 is essentially complete. Any eigenvalue in with magnitude will correspond to a nearby eigenvalue in and in turn, given our spectral norm bound on . An eigenvalue in with magnitude may or may not correspond to a nearby eigenvalue in (it will only if it lies in the range ). In any case however, in the estimation procedure of Theorem 1, such an eigenvalue will either be estimated using a small eigenvalue of , or be estimated as $0$. In both instances, the estimate will give error, as required.
Can we beat $\epsilon n$ additive error? It is natural to ask if our approach can be improved to yield sublinear time algorithms with stronger relative error approximation guarantees for $A$’s eigenvalues. Unfortunately, this is not possible – consider a matrix with just a single pair of entries set to $1$. To obtain relative error approximations to the two non-zero eigenvalues (which are $\pm 1$), we must find this pair, as otherwise we cannot distinguish $A$ from the all zeros matrix. This requires reading $\Omega(n^2)$ of $A$’s entries. More generally, consider $A$ with a random $\epsilon n \times \epsilon n$ principal submatrix populated by all $1$s, and with all other entries equal to $0$. $A$ has largest eigenvalue $\epsilon n$. However, if we read $o(1/\epsilon^2)$ entries of $A$, with good probability, we will not see even a single one, and thus we will not be able to distinguish $A$ from the all zeros matrix. This example establishes that any sublinear time algorithm with query complexity $o(1/\epsilon^2)$ must incur additive error at least $\epsilon n$.
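This intuition is easy to check numerically; a small simulation (parameters ours, chosen for illustration only) that plants an all-ones principal block and counts how often a budget of uniform entry queries misses it entirely:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, num_queries, trials = 2000, 100, 200, 500   # roughly n^2/s^2 = 400 queries needed on average

misses = 0
for _ in range(trials):
    # Plant an s x s all-ones principal block on a random index set.
    in_block = np.zeros(n, dtype=bool)
    in_block[rng.choice(n, size=s, replace=False)] = True
    # Query num_queries uniformly random entries (i, j) of the matrix.
    rows = rng.integers(0, n, size=num_queries)
    cols = rng.integers(0, n, size=num_queries)
    misses += not np.any(in_block[rows] & in_block[cols])

# Fraction of trials in which every queried entry was 0, i.e., the planted
# matrix (top eigenvalue s) is indistinguishable from the all-zeros matrix.
print(misses / trials)
```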
1.3.1 Improved Bounds via Non-Uniform Sampling
We now discuss how to give improved approximation bounds via non-uniform sampling. We focus on the bound of Theorem 2 using sparsity-based sampling. The proof of Theorem 3 for row norm sampling follows the same general ideas, but with some additional complications.
Theorem 2 requires sampling a submatrix , where each index is included in with probability . We reweight each sampled row by . Thus, if entry is sampled, it is scaled by . When the rows have uniform sparsity (so all ), this ensures that the full submatrix is scaled by , as in Theorem 1.
The proof of Theorem 2 follows the same outline as that of Theorem 1: we first argue that the outlying eigenvectors in are incoherent, giving a bound on the norm of each row of in terms of . We then apply a matrix Bernstein bound and Bhatia’s non-Hermitian eigenvalue perturbation bound to show that the eigenvalues of approximate those of up to .
Bounding the spectral norm of . The major challenge is showing that the subsampled middle eigendirections do not significantly increase the approximation error by bounding the by . This is difficult since the indices in are sampled nonuniformly, so existing bounds [Tro08a] on the spectral norm of uniformly random submatrices do not apply. We extend these bounds to the non-uniform sampling case, but still face an issue due to the rescaling of entries by . In fact, without additional algorithmic modifications, is simply not bounded by ! For example, as already discussed, if is the identity matrix, we get and so , assuming . Relatedly, suppose that is tridiagonal, with zeros on the diagonal and ones on the first diagonal above and below the main diagonal. Then, if , with constant probability, one of the ones will be sampled and scaled by . Thus, we will again have , assuming . Observe that this issue arises even when trying to approximate just the singular values (the eigenvalue magnitudes) via sampling. Thus, while an analogous bound to the uniform sampling result of Theorem 1 can easily be given for singular value estimation via matrix concentration inequalities (see Appendix G), to the best of our knowledge, Theorems 2 and 3 are the first of their kind even for singular value estimation.
Zeroing out entries in sparse rows/columns. To handle the above cases, we prove a novel perturbation bound, arguing that if we zero out any entry of where , then the eigenvalues of are not perturbed by more than . This can be thought of as a strengthening of Gershgorin’s circle theorem, which would ensure that zeroing out entries in rows/columns with does not perturb the eigenvalues by more than . Armed with this perturbation bound, we argue that if we zero out the appropriate entries of before computing its eigenvalues, then since we have removed entries in very sparse rows and columns which would be scaled by a large factor in , we can bound . This requires relating the magnitudes of the entries in to those in using the incoherence of the top eigenvectors, which gives bounds on the entries of .
Sampling model. We note that the sparsity-based sampling of Theorem 2 can be efficiently implemented in several natural settings. Given a matrix stored in sparse format, i.e., as a list of nonzero entries, we can easily sample a row $i$ with probability $\mathrm{nnz}(A_i)/\mathrm{nnz}(A)$ by sampling a uniformly random non-zero entry and looking at its corresponding row. Via standard techniques, we can convert several such samples into a sampled set close in distribution to having each index included independently with the probability used in Algorithm 2. If we store the row sparsities $\mathrm{nnz}(A_i)$, we can also efficiently access each sampling probability, which is needed for rescaling and zeroing out entries. Also observe that if $A$ is the adjacency matrix of a graph, in the standard graph query model [GR97], it is well known how to approximately count edges and sample them uniformly at random, i.e., compute $\mathrm{nnz}(A)$ and sample its nonzero entries, in sublinear time [GR08, ER18]. Further, it is typically assumed that one has access to the node degrees, i.e., to $\mathrm{nnz}(A_i)$. Thus, our algorithm can naturally be used to estimate spectral graph properties in sublinear time.
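For concreteness, here is a minimal sketch of how a row can be drawn with probability proportional to its sparsity from a SciPy CSR matrix (our choice of representation); converting repeated draws into an independent-inclusion sample, as described above, is omitted:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sample_row_by_sparsity(A_csr: csr_matrix, rng: np.random.Generator) -> int:
    """Return row index i with probability nnz(A_i) / nnz(A): pick a
    uniformly random nonzero entry and report the row containing it."""
    k = rng.integers(0, A_csr.nnz)  # uniform over stored nonzero entries
    # indptr[i] <= k < indptr[i+1] identifies the row of the k-th nonzero.
    return int(np.searchsorted(A_csr.indptr, k, side="right") - 1)
```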
The norm-based sampling of Theorem 3 can also be performed efficiently using an augmented data structure for storing $A$. Such data structures have been used extensively in the literature on quantum-inspired algorithms, and require just linear time to construct, linear space, and logarithmic time to update given an update to an entry of $A$ [Tan18, CCH+20].
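One plausible realization of such a structure (not necessarily the exact one of [Tan18, CCH+20]) is a Fenwick tree over the squared row norms, supporting fast updates and weighted sampling; a minimal sketch:

```python
import numpy as np

class SquaredNormSampler:
    """Maintains weights w[i] = ||A_i||_2^2 and samples row i with
    probability w[i] / sum_j w[j]. Updates take O(log n) time; sampling
    via prefix-sum binary search takes O(log^2 n) time."""

    def __init__(self, sq_norms):
        self.n = len(sq_norms)
        self.tree = np.zeros(self.n + 1)
        for i, w in enumerate(sq_norms):
            self.update(i, w)

    def update(self, i, delta):
        """Add delta to weight i (e.g., after an entry of row i changes)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def prefix(self, i):
        """Sum of weights w[0..i]."""
        i += 1
        total = 0.0
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)
        return total

    def sample(self, rng):
        """Draw a row index with probability proportional to its weight."""
        target = rng.random() * self.prefix(self.n - 1)
        lo, hi = 0, self.n - 1
        while lo < hi:                      # smallest i with prefix(i) > target
            mid = (lo + hi) // 2
            if self.prefix(mid) > target:
                hi = mid
            else:
                lo = mid + 1
        return lo
```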
1.4 Towards Optimal Query Complexity
As discussed, Bakshi et al. [BCJ20] show that any algorithm which can test with good probability whether has an eigenvalue or else has all non-negative eigenvalues must read entries of . This testing problem is strictly easier than outputting error estimates of all eigenvalues, so gives a lower bound for our setting. If the queried entries are restricted to fall in a submatrix, [BCJ20] shows that this submatrix must have dimensions , giving total query complexity . Closing the gap between our upper bound of and the lower bound of for submatrix queries is an intriguing open question.
We show in Appendix A that this gap can be easily closed via a surprisingly simple argument if $A$ is positive semidefinite (PSD). In that case, with . Writing for a sampling matrix , the non-zero eigenvalues of are identical to those of . Via a standard approximate matrix multiplication analysis [DK01], one can then show that, for , with probability at least , . Via Weyl’s inequality, this shows that the eigenvalues of , and hence , approximate those of up to error. In fact, via more refined eigenvalue perturbation bounds [Bha13] one can show an $\ell_2$ norm bound on the eigenvalue approximation errors, which can be much stronger than the $\ell_\infty$ norm bound of Theorem 1.
Unfortunately, this approach breaks down when $A$ has negative eigenvalues, and so cannot be factored as $BB^T$ for a real matrix $B$. This is more than a technical issue: observe that when $A$ is PSD and has $\|A\|_\infty \le 1$, it can have at most $1/\epsilon$ eigenvalues larger than $\epsilon n$ – since its trace, which is equal to the sum of its eigenvalues, is bounded by $n$, and since all eigenvalues are non-negative. When $A$ is not PSD, it can have many more eigenvalues with magnitude larger than $\epsilon n$. In particular, if $A$ is the tensor product of a random matrix and the all ones matrix, the bulk of its eigenvalues will concentrate around this magnitude. As a result it remains unclear whether we can match the dependence of the PSD case, or if a stronger lower bound can be shown for indefinite matrices.
Outside the dependence, it is unknown if full eigenspectrum approximation can be performed with sample complexity independent of the matrix size . [BCJ20] achieve this for the easier positive semidefiniteness testing problem, giving sample complexity . However our bounds have additional factors. As discussed, in Appendix B we give an alternate analysis for Theorem 1, which shows that sampling a submatrix suffices for eigenvalue approximation, saving a factor at the cost of worse dependence. However, removing the final seems difficult – it arises when bounding via bounds on the spectral norms of random principal submatrices [RV07]. Removing it seems as though it would require either improving such bounds, or taking a different algorithmic approach.
Also note that our and dependencies for non-uniform sampling (Theorems 2 and 3) are likely not tight. It is not hard to check that the lower bounds of [BCJ20] still hold in these settings. For example, in the sparsity-based sampling setting, by simply having the matrix entirely supported on a submatrix, the lower bounds of [BCJ20] directly carry over. Giving tight query complexity bounds here would also be interesting. Finally, it would be interesting to go beyond principal submatrix based algorithms, to achieve improved query complexity, as in Corollary 1. Finding an algorithm matching the overall query complexity lower bound of [BCJ20] is open even in the much simpler PSD setting.
2 Notation and Preliminaries
We now define notation and foundational results that we use throughout our work. For any integer $n$, let $[n]$ denote the set $\{1, \ldots, n\}$. We write matrices and vectors in bold literals – e.g., $A$ or $x$. We denote the eigenvalues of a symmetric matrix $A \in \mathbb{R}^{n \times n}$ by $\lambda_1(A) \ge \ldots \ge \lambda_n(A)$, in decreasing order. A symmetric matrix is positive semidefinite if all its eigenvalues are non-negative. For two matrices $A, B$, we let $A \preceq B$ denote that $B - A$ is positive semidefinite. For any matrix $A$ and $i \in [n]$, we let $A_i$ denote the $i$th row of $A$, $\mathrm{nnz}(A_i)$ denote the number of non-zero elements in this row, and $\|A_i\|_2$ denote its $\ell_2$ norm. We let $\mathrm{nnz}(A)$ denote the total number of non-zero elements in $A$. For a vector $x$, we let $\|x\|_2$ denote its Euclidean norm. For a matrix $A$, we let $\|A\|_\infty$ denote the largest magnitude of an entry, $\|A\|_2$ denote the spectral norm, $\|A\|_F$ denote the Frobenius norm, and we also use the maximum Euclidean norm of a column. For $A$ and $S \subseteq [n]$, we let $A_S$ denote the principal submatrix corresponding to $S$. We let $\|X\|_p = (\mathbb{E}|X|^p)^{1/p}$ denote the $L_p$ norm of a random variable $X$, where $\mathbb{E}$ denotes expectation.
We use the following basic facts and identities on eigenvalues throughout our proofs.
Fact 1 (Eigenvalue of Matrix Product).
For any two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$, the non-zero eigenvalues of $AB$ are identical to those of $BA$.
Fact 2 (Gershgorin’s circle theorem [Ger31]).
Let $A$ be an $n \times n$ matrix with entries $a_{ij}$. For $i \in [n]$, let $R_i = \sum_{j \neq i} |a_{ij}|$ be the sum of absolute values of non-diagonal entries in the $i$th row. Let $D(a_{ii}, R_i)$ be the closed disc centered at $a_{ii}$ with radius $R_i$. Then every eigenvalue of $A$ lies within one of the discs $D(a_{ii}, R_i)$.
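As a quick numerical sanity check of this fact (not used in any proof):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.uniform(-1, 1, size=(6, 6))

centers = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centers)  # off-diagonal row sums
eigs = np.linalg.eigvals(A)

# Every eigenvalue lies in at least one Gershgorin disc in the complex plane.
for lam in eigs:
    assert np.any(np.abs(lam - centers) <= radii + 1e-10)
```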
Fact 3 (Weyl’s Inequality [Wey12]).
For any two Hermitian matrices $A$ and $B = A + E$, and all $i \in [n]$: $|\lambda_i(B) - \lambda_i(A)| \le \|E\|_2$.
Weyl’s inequality ensures that a small Hermitian perturbation of a Hermitian matrix will not significantly change its eigenvalues. The bound can be extended to the case when the perturbation is not Hermitian, with a loss of an $O(\log n)$ factor; to the best of our knowledge this loss is necessary:
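A quick numerical illustration of Weyl's inequality for a symmetric perturbation (again, only a sanity check):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
A = rng.uniform(-1, 1, size=(n, n)); A = (A + A.T) / 2       # real symmetric
E = 0.05 * rng.standard_normal((n, n)); E = (E + E.T) / 2    # Hermitian perturbation

# eigvalsh returns eigenvalues in a consistent (ascending) order for both matrices.
max_shift = np.max(np.abs(np.linalg.eigvalsh(A + E) - np.linalg.eigvalsh(A)))
print(max_shift, "<=", np.linalg.norm(E, 2))                  # Weyl: shift at most ||E||_2
```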
Fact 4 (Non-Hermitian perturbation bound [Bha13]).
Let be Hermitian and be any matrix whose eigenvalues are such that (where denotes the real part of ). Let . For some universal constant ,
Beyond the above facts, we use several theorems to obtain eigenvalue concentration bounds. We first state a theorem from [Tro08a], which bounds the spectral norm of a principal submatrix sampled uniformly at random from a bounded entry matrix. We build on this to prove the full eigenspectrum concentration result of Theorem 1.
Theorem 4 (Random principal submatrix spectral norm bound [RV07, Tro08a]).
Let be Hermitian, decomposed into diagonal and off-diagonal parts: . Let be a diagonal sampling matrix with the diagonal entry set to independently with probability and otherwise. Then, for some universal constant ,
For Theorems 2 and 3, we need an extension of Theorem 4 to the setting where rows are sampled non-uniformly. We will use two bounds here. The first is a decoupling and recoupling result for matrix norms. One can prove this lemma following an analogous result in [Tro08a] for sampling rows/columns uniformly. The proof is almost identical so we omit it.
Lemma 1 (Decoupling and recoupling).
Let be a Hermitian matrix with zero diagonal. Let be a sequence of independent random variables such that with probability and otherwise. Let be a square diagonal sampling matrix with diagonal entry set to . Then:
where is an independent diagonal sampling matrix drawn from the same distribution as .
The second theorem bounds the spectral norm of a non-uniform random column sample of a matrix. We give a proof in Appendix D, again following a theorem in [Tro08b] for uniform sampling.
Theorem 5 (Non-uniform column sampling – spectral norm bound).
Let be an matrix with rank . Let be a sequence of independent random variables such that with probability and otherwise. Let be a square diagonal sampling matrix with diagonal entry set to .
We use a standard Matrix Bernstein inequality to bound the spectral norm of random submatrices.
Theorem 6 (Matrix Bernstein [Tro15]).
Consider a finite sequence of random matrices in . Assume that for all , Let and let be semidefinite upper-bounds for the matrix valued variances and :
Then, letting , for any ,
For real valued random variables, we use the standard Bernstein inequality.
Theorem 7 (Bernstein inequality [Ber27]).
Let $X_1, \ldots, X_n$ be independent random variables with zero mean such that $|X_i| \le M$ for all $i$. Then for all positive $t$, $\Pr\left[\sum_{i=1}^n X_i > t\right] \le \exp\left(\frac{-t^2/2}{\sum_{i=1}^n \mathbb{E}[X_i^2] + Mt/3}\right).$
3 Sublinear Time Eigenvalue Estimation using Uniform Sampling
We now prove our main eigenvalue estimation result – Theorem 1. We give the pseudocode for our principal submatrix based estimation procedure in Algorithm 1. We will show that any positive or negative eigenvalue of $A$ with sufficiently large magnitude will appear as an approximate eigenvalue in the sampled submatrix with good probability. Thus, in step 5 of Algorithm 1, the positive and negative eigenvalues of the sampled submatrix are used to estimate the outlying largest and smallest eigenvalues of $A$. All other interior eigenvalues of $A$ are estimated to be $0$, which will immediately give our approximation bound when the original eigenvalue has small magnitude.
Running time. Observe that the expected number of indices chosen by Algorithm 1 is . A standard concentration bound can be used to show that with high probability , the number of sampled entries is . Thus, the algorithm reads a total of entries of and runs in time – the time to compute a full eigendecomposition of .
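A minimal Python sketch of this procedure, with the index alignment simplified relative to the formal statement of Theorem 1 (the estimator below uses the positive scaled eigenvalues of the submatrix for the top of the spectrum, the negative ones for the bottom, and 0 everywhere else):

```python
import numpy as np

def estimate_eigenvalues_uniform(A, sample_prob, rng):
    """Simplified sketch of Algorithm 1: estimate all n eigenvalues of the
    symmetric matrix A from a uniformly sampled principal submatrix."""
    n = A.shape[0]
    S = np.flatnonzero(rng.random(n) < sample_prob)   # keep each index w.p. sample_prob
    A_S = A[np.ix_(S, S)]
    scaled = (n / len(S)) * np.sort(np.linalg.eigvalsh(A_S))[::-1]  # descending, rescaled

    estimates = np.zeros(n)
    num_pos = int(np.sum(scaled > 0))
    num_neg = int(np.sum(scaled < 0))
    # Positive submatrix eigenvalues estimate the largest eigenvalues of A,
    # negative ones estimate the smallest; interior eigenvalues stay 0.
    estimates[:num_pos] = scaled[:num_pos]
    if num_neg > 0:
        estimates[n - num_neg:] = scaled[len(scaled) - num_neg:]
    return estimates  # estimates[i] approximates the (i+1)-st largest eigenvalue of A
```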
3.1 Outer and Middle Eigenvalue Bounds
Recall that we will split into two symmetric matrices (Definition 1.1): which contains its large magnitude (outlying) eigendirections with eigenvalue magnitudes and which contains its small magnitude (middle) eigendirections.
We first show that the eigenvectors in are incoherent. I.e., that their (eigenvalue weighted) squared row norms are bounded. This ensures that the outlying eigenspace of is well-approximated via uniform sampling.
Lemma 2 (Incoherence of outlying eigenvectors).
Let be symmetric with . Let be as in Definition 1.1. Let denote the th row of . Then,
Proof.
Observe that . Let denote the th row of the . Then we have
(2)
where , is the th element of and . by assumption and since has orthonormal columns, its spectral norm is bounded by , thus we have
Therefore, by (2), we have:
(3)
Since by definition of , for all , we finally have
and
∎
Let be the scaled sampling matrix satisfying . We next apply Lemma 2 in conjunction with a matrix Bernstein bound to show that concentrates around its expectation, . Since by Fact 1, this matrix has identical eigenvalues to , this allows us to argue that the eigenvalues of approximate those of .
Lemma 3 (Concentration of outlying eigenvalues).
Proof.
Define . For all , let be the row of and define the matrix valued random variable
(4)
Define . Observe that are independent random variables and that . Further, observe that . Now, by Lemma 2. Thus, . The variance can be bounded as:
(5)
Again by Lemma 2, . Plugging back into (5) we can bound,
Since is PSD, this establishes that . We then apply Theorem 6 (the matrix Bernstein inequality) with , , and since there are at most outlying eigenvalues with magnitude in . This gives:
Thus, if we set for large enough , then the probability is bounded above by , completing the proof. ∎
We cannot prove an analogous leverage score bound to Lemma 2 for the interior eigenvectors of appearing in . Thus we cannot apply a matrix Bernstein bound as in Lemma 3. However, we can use Theorem 4 to show that the spectral norm of the random principal submatrix is not too large, and thus that the eigenvalues of are close to those of .
Lemma 4 (Spectral norm bound – sampled middle eigenvalues).
Proof.
Let where is the matrix of diagonal elements and the matrix of off-diagonal elements. Let be the binary sampling matrix with . From Theorem 4, we have for some constant ,
(6)
Considering the various terms in (6), we have and . We also have
and
The final bound follows since , where is an orthogonal projection matrix. Thus, by our assumption that . Plugging all these bounds into (6) we have, for some constant ,
(7)
It remains to bound . We have and thus by triangle inequality,
(8)
Writing (see Definition 1.1), and letting denote the th row of , the th element of has magnitude
by Cauchy-Schwarz. From Lemma 2, we have . Also, from (2), . Overall, for all we have , giving . Plugging back into (8) and in turn (7), we have for some constant ,
Setting for sufficiently large , all terms in the right hand side of the above equation are bounded by and so
Thus, by Markov’s inequality, with probability at least , we have . We can adjust by a constant to obtain the required bound. ∎
3.2 Main Accuracy Bounds
Proof.
Let be the binary sampling matrix with a single one in each column such that . Following Definition 1.1, we write . By Fact 1 we have that the nonzero eigenvalues of are identical to those of where is the square root matrix of such that .
Note that is Hermitian. However may be complex, and hence is not necessarily Hermitian, although it does have real eigenvalues. Thus, we can apply the perturbation bound of Fact 4 to and to claim for all , and some constant ,
By Lemma 3 applied with error , with probability at least , for any (for a large enough constant ) we have . Thus, for all ,
(9)
We note that the conceptual part of the proof is essentially complete: the nonzero eigenvalues of are identical to those of , which we have shown well approximate those of and in turn . i.e., the non-zero eigenvalues of approximate all outlying eigenvalues of . It remains to carefully argue how these approximations should be ‘lined up’ given the presence of zero eigenvalues in the spectrum of these matrices. We also must account for the impact of the interior eigenvalues in , which is limited by the spectral norm bound of Lemma 4.
Eigenvalue alignment and effect of interior eigenvalues. First recall that . By Lemma 4 applied with error , we have with probability at least when . By Weyl’s inequality (Fact 3), for all we thus have
(10)
Consider with . Since the nonzero eigenvalues of are identical to those of , , and so by (9),
(11)
Analogously, consider such that . We have , where is the dimension of – i.e., the number of outlying eigenvalues in . Again by (9) we have
(12)
Now the nonzero eigenvalues of are identical to those of . Consider such that . In this case, by (10), (11), and the triangle inequality, we have and thus we have . In turn, again applying (10), (11), and the triangle inequality, we have
Analogously, for such that , we have by (10) and (12) that . Thus . Again by (10), (12), and triangle inequality this gives
Now, consider all such that is not well approximated by one of the outlying eigenvalues of as argued above. By (10), (11), and (12), all such eigenvalues must have . Thus, if we approximate them in any way either by the remaining eigenvalues of with magnitude , or else by , we will approximate all to error at most . Thus, if (as in Algorithm 1) for with , we let and for with , let , and let for all other , we will have for all ,
Finally, by definition, for all , and thus, via the triangle inequality, we obtain our final error bound after adjusting constants on $\epsilon$.
Recall that we require for the outer eigenvalue bound of (9) to hold with probability . We require for to hold with probability by Lemma 4. Thus, for both conditions to hold simultaneously with probability by a union bound, it suffices to set , where we use that , as otherwise our algorithm can take to be the full matrix . Adjusting to completes the theorem. ∎
Remark: The proof of Lemma 3 and consequently, Theorem 1 can be modified to give better bounds for the case when the eigenvalues of lie in a bounded range – between and where . See Theorem 9 in Appendix C for details. For example, if all the top eigenvalues are equal, one can show that suffices to give error, nearly matching the lower bound of [BCJ20]. This seems to indicate that improving Theorem 1 in general requires tackling the case when the outlying eigenvalues in have a wide range.
4 Improved Bounds via Sparsity-Based Sampling
We now prove the approximation bound of Theorem 2, assuming the ability to sample each row with probability proportional to its number of non-zero entries. Pseudocode for our algorithm is given in Algorithm 2. Unlike in the uniform sampling case (Algorithm 1), we cannot simply sample a principal submatrix of $A$ and compute its eigenvalues. We must carefully zero out entries lying at the intersection of sparse rows and columns to ensure accuracy of our estimates. A similar approach is taken for the norm-based sampling result of Theorem 3. We defer that proof to Appendix E.
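A rough Python sketch of the overall procedure is below. The sampling probabilities and rescaling follow the discussion in Section 1.3.1, while `should_zero` is a placeholder for the thresholding rule of line 5 of Algorithm 2, which is not reproduced here; none of the constants should be read as those of the actual algorithm.

```python
import numpy as np

def estimate_eigenvalues_sparsity(A, s, rng, should_zero):
    """Sketch of sparsity-based estimation: sample rows with probability
    proportional to nnz(A_i), rescale, zero out entries flagged by
    should_zero(nnz_i, nnz_j), and read off the submatrix spectrum."""
    n = A.shape[0]
    row_nnz = np.count_nonzero(A, axis=1).astype(float)
    probs = np.minimum(1.0, s * row_nnz / row_nnz.sum())   # assumed inclusion probabilities
    S = np.flatnonzero(rng.random(n) < probs)

    # Rescale entry (i, j) by 1/sqrt(p_i * p_j); with uniform row sparsity this
    # reduces to the n/s scaling used in the uniform-sampling algorithm.
    scale = 1.0 / np.sqrt(probs[S])
    A_S = scale[:, None] * A[np.ix_(S, S)] * scale[None, :]

    # Zero out entries at the intersection of overly sparse rows/columns, so
    # that no entry is blown up by a huge rescaling factor.
    for a, i in enumerate(S):
        for b, j in enumerate(S):
            if should_zero(row_nnz[i], row_nnz[j]):
                A_S[a, b] = 0.0

    sub_eigs = np.sort(np.linalg.eigvalsh(A_S))[::-1]
    estimates = np.zeros(n)
    num_pos, num_neg = int(np.sum(sub_eigs > 0)), int(np.sum(sub_eigs < 0))
    estimates[:num_pos] = sub_eigs[:num_pos]
    if num_neg > 0:
        estimates[n - num_neg:] = sub_eigs[len(sub_eigs) - num_neg:]
    return estimates
```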
4.1 Preliminary Lemmas
Our first step is to argue that zeroing out entries in sparse rows/columns in step 5 of Algorithm 2 does not introduce significant error. We define to be the extension of to the original matrix – i.e., whenever or . Otherwise . We argue via a strengthening of Gershgorin’s theorem that for all .
After this step is complete, our proof follows the same general outline as that of Theorem 1 in Section 3. We split , arguing that (1) after sampling and (2) that the eigenvalues of are approximations to those of . In both cases, we critically use that the rescaling factors introduced in line 4 of Algorithm 2 do not introduce too much variance, due to the zeroing out of entries in .
Remark: Throughout, we will assume that does not have any rows/columns that are all , as such rows will never be sampled and will have no effect on the output of Algorithm 2. Additionally, we will assume that , as otherwise, has at most non-zero rows. Thus, rather than running Algorithm 2, we can directly compute the eigenvalues of .
Lemma 5.
Let be symmetric with and . Let have if or for a sufficiently large constant and otherwise. Then, for all ,
Proof.
We consider the matrix , which is defined identically to except we only set if . I.e., we do not have the condition requiring setting the diagonal to . We will show that . By Weyl’s inequality, and the assumption that , we then have as required.
Let be the set of rows/columns with and be the submatrix of formed with rows in and columns in . Define in the same way and observe that whenever .
When , we may zero out some entries of to produce . Let be equal to on this set of zeroed out entries, and everywhere else. Observe that . Next observe that has at most non-zero entries. Similarly, each row of has at most non-zero elements. Thus, for all , using that ,
Applying Gershgorin’s circle theorem (Fact 2) we thus have:
(13)
Let be a symmetric matrix such that , , and is zero everywhere else. By triangle inequality and the bound of (13),
Observe that, since we assume all rows have at least one non-zero entry, and . Therefore, can range from to . By triangle inequality,
Finally, setting large enough and using Weyl’s inequality (Fact 3) we have the required bound:
∎
We next give a bound on the coherence of the outlying eigenvectors of . This bound is analogous to Lemma 2, but is more refined, taking into account the sparsity of each row.
Lemma 6 (Incoherence of outlying eigenvectors in terms of sparsity).
Let be as in Lemma 5. Let where is diagonal, with the eigenvalues of with magnitude on its diagonal, and has columns equal to the corresponding eigenvectors. Let denote the th row of . Then,
Proof.
The proof is nearly identical to that of Lemma 2. Observe that . Letting denote the th row of the , we have
(14)
where , is the th element of and . Since has orthonormal columns, we thus have . Therefore, by (14),
(15)
Since by definition for all , we can conclude that and , which completes the lemma. ∎
4.2 Outer and Middle Eigenvalue Bounds
Using Lemma 6, we next argue that the eigenvalues of will approximate those of , and in turn those of . The proof is very similar to Lemma 3 in the uniform sampling case.
Lemma 7 (Concentration of outlying eigenvalues with sparsity-based sampling).
Let be as in Lemmas 5 and 6. Let , where , and are projections onto the eigenspaces with magnitude and respectively (analogous to Definition 1.1). As in Algorithm 2, for all let and let be a scaled diagonal sampling matrix such that the with probability and otherwise. If for a large enough constant , then with probability at least , .
Proof.
Define . For all , let be the row of and define the matrix valued random variable
(16)
Define . We can observe that are independent random variables and that . Let . Then, observe that . So, . Then, similar to the proof of Lemma 3, we need to bound for all and using the improved row norm bounds of Lemma 6. In particular, we have
(17)
By Lemma 6, . Plugging back into (17),
Since is PSD this establishes that . Since there are at most eigenvalues with absolute value , we can apply the matrix Bernstein inequality exactly as in the proof of Lemma 3 with to show that when for large enough , with probability at least , . ∎
We next bound the spectral norm of . This is the most challenging part of the proof – the rows of this matrix are sampled non-uniformly and scaled proportional to their inverse sampling probabilities, so we cannot apply existing bounds on the spectral norms of uniformly sampled random submatrices [RV07]. We extend these bounds to the non-uniform case, critically using that entries which would be scaled up significantly after sampling (i.e. those lying in sparse rows/columns), have already been set to in , and thus do not contribute to the spectral norm.
Lemma 8 (Concentration of middle eigenvalues with sparsity-based sampling).
Let be as in Lemmas 5 and 6. Let , where , and are projections onto the eigenspaces with magnitude and respectively (analogous to Definition 1.1). As in Algorithm 2, for all let and let be a scaled diagonal sampling matrix such that the with probability and otherwise. If for a large enough constant , then with probability at least ,
Proof.
The initial part of the proof follows the outline of the proof of the spectral norm bound for uniformly random submatrices (Theorem 4) of [Tro08a]. From Lemma 6, we have . Also, following the proof of Lemma 6, we have . Thus, for all , using the Cauchy-Schwarz inequality, we have
(18)
Let where and contain the off-diagonal and diagonal elements of respectively. Note that while is zero on the diagonal, may not be. We have:
Using Lemma 1 (decoupling) on , we get
(19)
where is an independent copy of . Upper bounding the rank of as and applying Theorem 5 twice to , once for each operator, we get
(20)
Plugging (20) into (19), we have:
(21)
We now proceed to bound each of the terms on the right hand side of (21). We start with . First, observe that . We consider two cases.
Case 1: . Then, and (since ). Then by (18), we have .
Case 2: . Then we have .
From the two cases above, for , we have:
(22)
We can bound similarly. Since and ,
(23)
where the second step follows from the fact that .
We next bound the term . Observe that , where is the th column/row of . We again consider the two cases when and :
Case 1: . Then .
Case 2: . Then . Thus, setting we have:
Thus, from the two cases above, for all , adjusting by a factor, we have for :
(24)
It remains to bound , which is the most complex part of the proof. Since is an independent copy of , we denote the norm of the th column of as . Then . We will argue that is bounded by with probability . Since our sampling probabilities are all at least and since , this value is also deterministically bounded by . Thus, our high probability bound implies the needed bound on .
We begin by observing that since , , and so to bound , it suffices to bound for all . Towards this end, for a fixed and any , define
Then and . Since is a sum of independent random variables, we can bound this quantity by applying Bernstein’s inequality. To do this, we must bound for all and . We will again consider the cases of and separately.
Case 1: . Then, we have . If then
where the fourth inequality uses (18). By the thresholding procedure which defines , if ,
(26)
and thus for and we have
If then we simply have
Overall for all ,
(27)
and since ,
(28)
For and large enough , we thus have .
We next bound the variance by:
where the last inequality uses (18). Now since for all and we have
(29)
Combining (26) with the second term to the right of (29) we have
and since , we have
(30)
Finally since and we have
(31)
For for large enough , we have .
Therefore, using (28) and (31) with , we can apply Bernstein inequality (Theorem 7) (for some constant ) to get
If we set , for some constant we have
Since , we have . Then with probability at least , for any row with , we have
for for large enough . Observe that, as in Lemma 3 w.l.o.g. we have assumed , since otherwise, our algorithm would read all entries of the matrix.
Case 2: . Then, we have . As in the case, we have from (27):
Now, we observe that , which gives us
(32)
Thus, for for a large enough constant and adjusting for other constants we have . Also observe that the expectation of can be bounded by:
Next, the variance of the sum of the random variables can again be bounded by following the analysis presented in and prior to (30) and (31) we have
(33)
where we again bound using
Then for , we have for large enough constant .
Using (32) and (33) and noting that we can apply the Bernstein inequality (Theorem 7):
If we set , then for some constant we have
Thus, since , when , setting for large enough , we have with probability
We thus have that with probability , for both cases when and , . Taking a union bound over all , with probability at least , for . As stated before, since for all , and since , we also have . Thus,
after adjusting by at most some constants. Overall, we finally get
Plugging this bound into (25), we have for ,
Finally after adjusting by a factor, we have for or ,
The final bound then follows via Markov’s inequality on . ∎
4.3 Main Accuracy Bound
We are finally ready to prove our main result for sparsity-based sampling, Theorem 2, which we restate below.
Proof.
With Lemmas 7 and 8 in place, the proof is nearly identical to that of Theorem 1, with the additional need to apply Lemma 5 to show that the eigenvalues of are close to those of .
For all let and let be a scaled diagonal sampling matrix such that the with probability and otherwise. Let be the matrix constructed from by zeroing out its elements as described in Lemma 5. Then, note that where is the submatrix constructed as in Algorithm 2. We first show that the eigenvalues of approximate those of up to error . The steps are almost identical to those in the proof of Theorem 1. We provide a brief outline of the steps but skip the details.
We split as where and contain eigenvalues of of magnitudes and. This implies where and . By Fact 1 we have that the nonzero eigenvalues of are identical to those of . Thus, applying the perturbation bound of Fact 4, we have:
From Lemma 7, we get for with probability at least . Thus, setting the error parameter to in Lemma 7, for , with probability at least we have:
(34)
We have thus shown that the non-zero eigenvalues of approximate all outlying eigenvalues of . Note that by Lemma 8, we also have with probability at least for . Then, similarly to the section on eigenvalue alignment of Theorem 1, we can argue how these approximations ‘line up’ in the presence of zero eigenvalues in the spectrum of these matrices, concluding that, for all ,
Finally, by Lemma 5, we have for all . Thus, via triangle inequality, , which gives the required bound after adjusting to .
Recall that we require for (34) to hold with probability . We also require for to hold with probability by Lemma 8. Thus, for both conditions to hold simultaneously with probability by a union bound, it suffices to set , where we use that , as otherwise our algorithm can take to be the full matrix . Adjusting to completes the theorem. ∎
5 Empirical Evaluation
We complement our theoretical results by evaluating Algorithms 1 (uniform sampling) and Algorithm 2 (sparsity-based sampling) in approximating the eigenvalues of several symmetric matrices. We defer an evaluation of Algorithm 3 (norm-based sampling) to later work. Algorithm 1 and Algorithm 2 perform very well. They seem to have error dependence roughly in practice, as compared to the dependence proven in Theorem 1 and dependence in Theorem 2. Closing the gap between the theory and observed results would be very interesting.
5.1 Datasets
We test Algorithm 1 (uniform sampler) on three dense matrices. We also compare the relative performance of Algorithm 1 and Algorithm 2 (sparsity sampler) on three other synthetic and real world matrices.
The first two dense matrices, following [CNX21], are created by sampling points from a binary image. We then normalize all the points in the range in both axes. The original image and resulting set of points are shown in Figure 2. We then compute a similarity matrix for the points using two common similarity functions used in machine learning and computer graphics: , the hyperbolic tangent; and , the thin plate spline. These measures lead to symmetric, indefinite, and entrywise bounded similarity matrices.
Our next dense matrix (called the block matrix) is based on the construction of the hard instance for the lower bound in [BCJ20] which shows that we need samples to compute approximations to the eigenvalues of a bounded entry matrix. It is a matrix containing a principal submatrix of all s, with the rest of the entries set to . It has and all other eigenvalues equal to .
We now describe the three matrices used to compare Algorithm 1 and Algorithm 2. All three are graph adjacency matrices, which are symmetric, indefinite, entrywise bounded and sparse. Spectral density estimation for graph structured matrices is an important primitive in network analysis [DBB19]. The first is a dense Erdös-Rényi graph with nodes and connection probability . The second two are real world graphs, taken from SNAP [LK14]; namely Facebook [ML12] and Arxiv COND-MAT [LKF07]. The Facebook graph contains nodes and directed edges. We symmetrize the adjacency matrix. Arxiv COND-MAT is a collaboration network between authors of Condensed Matter papers published on arXiv, containing nodes and undirected edges. Both these graphs are very sparse – the number of edges is a tiny fraction of the total edges in a complete graph with the same number of nodes.
5.2 Implementation Details
Apart from uniform random sampling (Algorithm 1), we also apply the sparsity-based sampling technique in Algorithm 2 and a modification to Algorithm 2, where we do not zero out the elements of the sampled submatrix (we call this simple sparsity sampler). In practice, to apply Algorithm 2, we zero out element (line 5 of Algorithm 2) if or , where is a constant and is the size of the sample. We set experimentally as this results in consistent behavior across datasets.
5.3 Experimental Setup
We subsample each matrix and compute its eigenvalues using numpy [Com21]. We then use our approximation algorithms to estimate the eigenvalues of by scaling the eigenvalues of the sampled submatrix. For trials, we report the logarithm of the average absolute scaled error, , where is the estimated eigenvalue in the trial, is the true eigenvalue and is the number of non-zero elements in . Recall that is an upper bound on all eigenvalue magnitudes. Also note that for the fully dense matrices, .
We repeat our experiments for trials at different sampling rates and aggregate the results. The resultant errors of estimation for dense matrices are plotted in Figure 3 and for the graph matrices are plotted in Figure 4. The -axis is the log proportion of the number of random samples chosen from the matrix. If we sample of the rows/columns, then the comes to around . In these log-log plots, if the sample size has polynomial dependence on , e.g., or error is achieved with sample size proportional to , we expect to see error falling off linearly, with slope equal to where is the exponent on .
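The slopes discussed here can be extracted with a least-squares fit in log-log space; a minimal sketch, where `sample_fracs` and `errors` stand for the measured sampling fractions and average scaled errors:

```python
import numpy as np

def loglog_slope(sample_fracs, errors):
    """Fit log10(error) = slope * log10(sample fraction) + intercept and
    return the slope, i.e., the empirical error vs. sample-size exponent."""
    slope, _intercept = np.polyfit(np.log10(sample_fracs), np.log10(errors), deg=1)
    return float(slope)
```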
As a baseline we also show the error if we approximate all eigenvalues with which results in an error of . This helps us to observe how the approximation algorithms perform for both large and small order eigenvalues, as opposed to just approximating everything by .
Code. All code is written in Python and available at https://github.com/archanray/eigenvalue_estimation.
5.4 Summary of Results
Our results are plotted in Figures 3 and 4. We observe relatively small error in approximating all eigenvalues, with the error decreasing as the number of samples increases. What is more interesting is that the relationship between sample size and error seems to be generally on the order of , our expected lower bound for approximating eigenvalues by randomly sampling a principal submatrix. This can be seen by observing the slope of approximately on the log-log error plots. In some cases, we do better in approximating small eigenvalues of $A$ – if the eigenvalue lies well within the range of middle eigenvalues, we may achieve a very good absolute error estimate simply by approximating it to $0$.
As expected, on the graph adjacency matrices (in Figure 4), sparsity-based sampling techniques generally achieve better error than uniform sampling. For the Erdös-Rényi graph, we expect the node degrees (and hence row sparsities) to be similar. Thus the sampling probability for each row will be roughly uniform, which leads to similar performance of sparsity-based techniques and uniform sampling. For the real world graphs, which have power law degree distributions, sparsity-based sampling techniques have a significant effect. As a result, Algorithm 2 and the simple sparsity sampler variant significantly outperform uniform sampling.
Algorithm 2 almost always dominates the simple sparsity sampler. In some cases the simple sparsity sampler performs better than or equivalently to Algorithm 2. This may happen for two reasons: 1) if Algorithm 2 zeroes out almost all of the sampled submatrix for small samples, the algorithm will underestimate the corresponding eigenvalue, and 2) the cut-off threshold for zeroing may be too high, leading to no difference between the simple sparsity sampler and Algorithm 2.
We also observe that approximating all eigenvalues with 0 results in very good approximation for the small eigenvalues of the Erdös-Rényi graph. We believe this is because the smaller eigenvalues are significantly less than the largest eigenvalue (which is on the order of ). We see similar trends of approximating eigenvalues with zero for the real world graphs too. But since eigenvalues at the extremes of the spectrum are of a larger order, we see reasonably good approximation for the sampling algorithms. Algorithm 2 outperforms approximation by $0$ in all of these cases.
For the dense matrices, uniform sampling almost always outperforms approximation by 0 when estimating any reasonably large eigenvalue. Additionally, note that the block matrix is rank- with true eigenvalues . Any sampled principal submatrix will also have rank at most . Thus, outside the top eigenvalue, the submatrix will have all zero eigenvalues. So, in theory, our algorithm should give perfect error for all eigenvalues outside the top, and we see that this is nearly the case. The very small and sporadic errors in the plots for these eigenvalues arise from numerical roundoff in the eigensolver. The only non-trivial approximation for this matrix is for the top eigenvalue. This approximation appears to have an error dependence of roughly , as expected.
6 Conclusion
We present efficient algorithms for estimating all eigenvalues of a symmetric matrix with bounded entries up to additive error , by reading just a random principal submatrix. We give improved error bounds of and when the rows/columns are sampled with probabilities proportional to their sparsities or squared norms, respectively.
As discussed, our work leaves several open questions. In particular, it is open whether our query complexity for approximation can be improved, possibly to total entries using principal submatrix queries or entries using general queries. The latter bound is open even when is PSD, a setting where we know that sampling a principal submatrix (with total entries) does suffice. Additionally, it is open whether we can achieve sample complexity independent of , by removing all factors, as has been done for the easier problem of testing positive semidefiniteness [BCJ20]. See Section 1.4 for more details.
It would also be interesting to extend our results to give improved approximation bounds for other properties of the matrix spectrum, such as various Schatten- norms and spectral summaries. For many of these problems large gaps in understanding exist – e.g., for approximation to the Schatten- norm, which requires queries, but for which no query algorithm is known. Applying our techniques to improve sublinear time PSD testing algorithms under an rather than approximation requirement [BCJ20] would also be interesting. Finally, it would be interesting to identify additional assumptions on or on the sampling model where stronger approximation guarantees (e.g., relative error) can be achieved in sublinear time.
Acknowledgements
We thank Ainesh Bakshi, Rajesh Jayaram, Anil Damle, and Christopher Musco for helpful conversations about this work. RB, CM, and AR were partially supported by an Adobe Research grant, along with NSF Grants 2046235 and 1763618. PD and GD were partially supported by NSF AF 1814041, NSF FRG 1760353, and DOE-SC0022085.
References
- [AM07] Dimitris Achlioptas and Frank McSherry. Fast computation of low-rank matrix approximations. Journal of the ACM (JACM), 54(2):9–es, 2007.
- [AN13] Alexandr Andoni and Huy L. Nguyễn. Eigenvalues of a matrix in the streaming model. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2013.
- [AW21] Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. In Proceedings of the 32nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021.
- [BCJ20] Ainesh Bakshi, Nadiia Chepurko, and Rajesh Jayaram. Testing positive semi-definiteness via random submatrices. Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2020.
- [Ber27] Serge Bernstein. Sur l’extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Mathematische Annalen, 97(1):1–59, 1927.
- [Bha13] Rajendra Bhatia. Matrix analysis. Springer Science & Business Media, 2013.
- [BIMW21] Arturs Backurs, Piotr Indyk, Cameron Musco, and Tal Wagner. Faster kernel matrix algebra via density estimation. Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
- [BKKS21] Vladimir Braverman, Robert Krauthgamer, Aditya R Krishnan, and Shay Sapir. Near-optimal entrywise sampling of numerically sparse matrices. In Proceedings of the 34th Annual Conference on Computational Learning Theory (COLT), 2021.
- [BKM22] Vladimir Braverman, Aditya Krishnan, and Christopher Musco. Linear and sublinear time spectral density estimation. Proceedings of the 54th Annual ACM Symposium on Theory of Computing (STOC), 2022.
- [BLWZ19] Maria-Florina Balcan, Yi Li, David P Woodruff, and Hongyang Zhang. Testing matrix rank, optimally. In Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2019.
- [BSS10] Itai Benjamini, Oded Schramm, and Asaf Shapira. Every minor-closed property of sparse graphs is testable. Advances in Mathematics, 223(6):2200–2218, 2010.
- [CCH+20] Nadiia Chepurko, Kenneth L Clarkson, Lior Horesh, Honghao Lin, and David P Woodruff. Quantum-inspired algorithms from randomized numerical linear algebra. arXiv:2011.04125, 2020.
- [CLM+15] Michael B Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Proceedings of the 6th Conference on Innovations in Theoretical Computer Science (ITCS), 2015.
- [CNX21] Difeng Cai, James Nagy, and Yuanzhe Xi. Fast and stable deterministic approximation of general symmetric kernel matrices in high dimensions. arXiv:2102.05215, 2021.
- [Com21] The Numpy Community. numpy.linalg.eigvals. https://numpy.org/doc/stable/reference/generated/numpy.linalg.eigvals.html, 2021.
- [CSKSV18] David Cohen-Steiner, Weihao Kong, Christian Sohler, and Gregory Valiant. Approximating the spectrum of a graph. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018.
- [DBB19] Kun Dong, Austin R Benson, and David Bindel. Network density of states. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2019.
- [DDHK07] James Demmel, Ioana Dumitriu, Olga Holtz, and Robert Kleinberg. Fast matrix multiplication is stable. Numerische Mathematik, 2007.
- [DK01] Petros Drineas and Ravi Kannan. Fast Monte-Carlo algorithms for approximate matrix multiplication. In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2001.
- [ER59] P. Erdös and A. Rényi. On random graphs I. Publicationes Mathematicae Debrecen, 1959.
- [ER18] Talya Eden and Will Rosenbaum. On sampling edges almost uniformly. SIAM Symposium on Simplicity in Algorithms (SOSA), 2018.
- [FKV04] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM (JACM), 51(6):1025–1041, 2004.
- [GE95] Ming Gu and Stanley C Eisenstat. A divide-and-conquer algorithm for the symmetric tridiagonal eigenproblem. SIAM Journal on Matrix Analysis and Applications, 1995.
- [Ger31] Semyon Aranovich Gershgorin. Über die Abgrenzung der Eigenwerte einer Matrix. Izvestiya Rossiyskoy akademii nauk. Seriya matematicheskaya, (6):749–754, 1931.
- [GKX19] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via Hessian eigenvalue density. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
- [GR97] Oded Goldreich and Dana Ron. Property testing in bounded degree graphs. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing (STOC), 1997.
- [GR08] Oded Goldreich and Dana Ron. Approximating average parameters of graphs. Random Structures & Algorithms, 32(4):473–493, 2008.
- [GS91] Leslie Greengard and John Strain. The fast gauss transform. SIAM Journal on Scientific and Statistical Computing, 1991.
- [GT11] Alex Gittens and Joel A Tropp. Tail bounds for all eigenvalues of a sum of random matrices. arXiv:1104.4513, 2011.
- [HBT19] Jonas Helsen, Francesco Battistel, and Barbara M Terhal. Spectral quantum tomography. Quantum Information, 2019.
- [HJ12] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, USA, 2nd edition, 2012.
- [HP14] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. Advances in Neural Information Processing Systems 27 (NIPS), 2014.
- [KS03] Robert Krauthgamer and Ori Sasson. Property testing of data dimensionality. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2003.
- [LK14] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, 2014.
- [LKF07] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007.
- [LNW14] Yi Li, Huy L. Nguyễn, and David P Woodruff. On sketching matrix norms and the top singular vector. In Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2014.
- [LSY16] Lin Lin, Yousef Saad, and Chao Yang. Approximating spectral densities of large matrices. SIAM Review, 2016.
- [LW16] Yi Li and David P Woodruff. On approximating functions of the singular values in a stream. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing (STOC), 2016.
- [LWW14] Yi Li, Zhengyu Wang, and David P Woodruff. Improved testing of low rank matrices. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2014.
- [LXES19] Ruipeng Li, Yuanzhe Xi, Lucas Erlandson, and Yousef Saad. The eigenvalues slicing library (EVSL): Algorithms, implementation, and software. SIAM Journal on Scientific Computing, 2019.
- [ML12] Julian J McAuley and Jure Leskovec. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems 25 (NIPS), 2012.
- [MU17] Michael Mitzenmacher and Eli Upfal. Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis. Cambridge University Press, 2017.
- [NSW22] Deanna Needell, William Swartworth, and David P Woodruff. Testing positive semidefiniteness using linear measurements. In Proceedings of the 63rd Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2022.
- [RV07] Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM (JACM), 2007.
- [Saa11] Yousef Saad. Numerical methods for large eigenvalue problems: revised edition. SIAM, 2011.
- [SBL16] Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv:1611.07476, 2016.
- [SgS90] G. W. Stewart and Ji guang Sun. Matrix Perturbation Theory. Academic Press, 1990.
- [SR94] RN Silver and H Röder. Densities of states of mega-dimensional Hamiltonian matrices. International Journal of Modern Physics C, 1994.
- [Tan18] Ewin Tang. Quantum-inspired classical algorithms for principal component analysis and supervised clustering. arXiv:1811.00414, 2018.
- [Tro08a] Joel A Tropp. Norms of random submatrices and sparse approximation. Comptes Rendus Mathematique, 2008.
- [Tro08b] Joel A. Tropp. The random paving property for uniformly bounded matrices. Studia Mathematica, 185:67–82, 2008.
- [Tro15] Joel A Tropp. An introduction to matrix concentration inequalities. arXiv:1501.01571, 2015.
- [Wan94] Lin-Wang Wang. Calculating the density of states and optical-absorption spectra of large quantum systems by the plane-wave moments method. Physical Review B, 1994.
- [Wey12] Hermann Weyl. The asymptotic distribution law of the eigenvalues of linear partial differential equations (with an application to the theory of cavity radiation). Mathematische Annalen, 1912.
- [WWAF06] Alexander Weiße, Gerhard Wellein, Andreas Alvermann, and Holger Fehske. The kernel polynomial method. Reviews of Modern Physics, 2006.
- [YGL+18] Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W Mahoney. Hessian-based analysis of large batch training and robustness to adversaries. arXiv:1802.08241, 2018.
Appendix A Eigenvalue Approximation for PSD Matrices
Here we give a simple proof showing that if Algorithm 1 is used to approximate the eigenvalues of a positive semidefinite (PSD) matrix (i.e., one with all non-negative eigenvalues) using a random submatrix, then the norm of the vector of eigenvalue approximation errors is bounded by . This much stronger result immediately implies that each eigenvalue of a PSD matrix can be approximated to additive error using just a random submatrix. The proof follows from a bound of [Bha13], which bounds the norm of the difference of the eigenvalue vectors of a Hermitian matrix and any other matrix by the Frobenius norm of the difference of the two matrices. This improves on the bound of Theorem 1 for general entrywise-bounded matrices by a factor, and matches the lower bound for principal submatrix queries in [BCJ20]. Note that the hard instance used to prove the lower bound in [BCJ20] can in fact be negated to be PSD, thus showing that our upper bound here is tight.
We first state the result from [Bha13] which we will be using in our proof.
Fact 5 (-norm bound on eigenvalues [Bha13]).
Let be Hermitian and be any matrix whose eigenvalues are such that (where denotes the real part of ). Let . Then
Our result is based on the following lemma, which we prove at the end of the section.
Lemma 9.
Consider a PSD matrix with . Let be sampled as in Algorithm 1 for . Let be the scaled sampling matrix satisfying . Then with probability at least ,
From the above Lemma we have:
Corollary 2 (Spectral norm bound – PSD matrices).
Consider a PSD matrix with . Let be a subset of indices formed by including each index in independently with probability as in Algorithm 1. Let be the corresponding principal submatrix of , with eigenvalues .
For all with , let . For all other , let . Then if , with probability at least ,
which implies that for all ,
Proof.
Let be sampled as in Algorithm 1 and let be the scaled sampling matrix satisfying . Since is PSD, we can write for some matrix . From Lemma 9, for , we have with probability at least :
Using Fact 5, we have,
(35) |
Also from Fact 1, we have for all . Thus,
Also by Fact 1, all non-zero eigenvalues of are equal to those of . All other eigenvalue estimates are set to . Further, for all , . Thus,
Adjusting to then gives us the bound. ∎
We now prove Lemma 9, using a standard approach for sampling based approximate matrix multiplication – see e.g. [DK01].
Proof of Lemma 9.
For let with probability and with probability . Thus and
Fixing , the are mean-zero, independent random variables. Thus we have:
since . Rearranging the sums we have:
Observe that , thus overall we have:
So by Markov’s inequality, with probability , . This completes the lemma after taking a square root. ∎
Remark: The proof of Lemma 9 can be easily modified to show that the th row of can be sampled with probability proportional to to approximate the eigenvalues of any PSD matrix up to error ( is the trace of ). When sampling with probabilities proportional to , we do not require a bounded entry assumption on .
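For intuition, the following self-contained snippet performs a quick numerical check of this PSD bound; the rank-50 test matrix, the uniform 10% sampling rate, and the n/|S| rescaling are illustrative choices, not prescriptions from the analysis above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s_frac, d = 2000, 0.1, 50

# Random PSD matrix with entries roughly in [-1, 1]: B = G G^T / d.
G = rng.uniform(-1, 1, size=(n, d))
B = G @ G.T / d

# Uniformly sampled principal submatrix, eigenvalues rescaled by n/|S|.
S = np.flatnonzero(rng.random(n) < s_frac)
est = (n / len(S)) * np.sort(np.linalg.eigvalsh(B[np.ix_(S, S)]))[::-1]
true = np.sort(np.linalg.eigvalsh(B))[::-1]

# Only |S| estimates are produced; the remaining eigenvalues are estimated as 0.
est_full = np.zeros(n)
est_full[:len(est)] = est
print("l2 error of eigenvalue estimates:", np.linalg.norm(est_full - true))
```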
Appendix B Alternate Bound for Uniform Sampling
In this section we provide an alternate bound for approximating eigenvalues with uniform sampling. The sample complexity is worse by a factor of for this approach, but better by a factor as compared to Theorem 1. We start with an analog to Lemma 3, showing that the outlying eigenspace remains nearly orthogonal after sampling. In particular, we show concentration of the Hermitian matrix about its expectation rather than the non-Hermitian as in Lemma 3. This allows us to use Weyl’s inequality in our final analysis, rather than the non-Hermitian eigenvalue perturbation bound of Fact 4, saving a factor in the sample complexity.
Lemma 10 (Near orthonormality – sampled outlying eigenvalues).
Let be sampled as in Algorithm 1 for where is a sufficiently large constant. Let be the scaled sampling matrix satisfying . Then with probability at least ,
Proof.
The result is standard in randomized numerical linear algebra – see e.g., [CLM+15]. For completeness, we give a proof here. Define . For all , let be the row of and define the matrix valued random variable
Then, similar to the proof of Lemma 3, define . Since are independent random variables and , we need to bound for all and . Observe , by row norm bounds of Lemma 2. Again, using Lemma 2 we have
where is the identity matrix of appropriate dimension. By setting , we can finally bound the probability of the event using Theorem 6 (the matrix Bernstein inequality) with if . Since these steps follow Lemma 3 nearly exactly, we omit them here. ∎
With Lemma 10 in place, we can now give our alternate sample complexity bound.
Theorem 8 (Sublinear Time Eigenvalue Approximation).
Let be symmetric with and eigenvalues . Let be formed by including each index independently with probability as in Algorithm 1. Let be the corresponding principal submatrix of , with eigenvalues .
For all with , let . For all with , let . For all other , let . If , for a large enough constant , then with probability , for all ,
Proof.
Let be the binary sampling matrix with a single one in each column such that . Let . Following Definition 1.1, we write . By Fact 1 we have that the nonzero eigenvalues of are identical to those of .
Note that is positive semidefinite. Writing its eigendecomposition, we can define the matrix square root with . By Lemma 10 applied with error , with probability at least , all eigenvalues of lie in the range . In turn, all eigenvalues of also lie in this range. Again using Fact 1, we have that the nonzero eigenvalues of , and in turn those of , are identical to those of .
Let . Since the diagonal entries of lie in , those of lie in . Thus, . We can write
We can then bound
Applying Weyl’s eigenvalue perturbation theorem (Fact 3), we thus have for all ,
(36) |
Note that we have shown that the nonzero eigenvalues of are identical to those of , which we have shown well approximate those of , and in turn , i.e., the non-zero eigenvalues of approximate all outlying eigenvalues of . We can also bound the middle eigenvalues using Lemma 4 as in Theorem 1. The only thing left is to argue how these approximations ‘line up’ in the presence of zero eigenvalues in the spectra of these matrices. This part of the proof again proceeds similarly to that of Theorem 1 in Section 3.2.
Analogous to Theorem 1, from Lemma 10 equation (36) holds with probability if . We also require for to hold with probability by Lemma 4. Thus, for both conditions to hold simultaneously with probability by a union bound, it suffices to set , where we use that , as otherwise our algorithm can take to be all of . Adjusting to completes the theorem. ∎
Appendix C Refined Bounds
In this section, we show how to obtain better query complexity or tighter approximation factors by modifying the proofs of Theorem 1 and Lemmas 2 and 3 under additional assumptions. We give an extension of Theorem 1 in Theorem 9 for the case when the eigenvalues of lie in a bounded range, between and where .
Theorem 9.
Let be symmetric with and eigenvalues . Let be as in Definition 1.1 such that for all eigenvalues we have either or where . Let be formed by including each index independently with probability as in Algorithm 1. Let be the corresponding principal submatrix of , with eigenvalues .
For all with , let . For all with , let . For all other , let . If , for large enough , then with probability at least , for all ,
Proof.
The proof follows by modifying the proofs of Theorem 1, Lemmas 2 and 3 to account for the tighter intervals. First observe that since for all , we can give a tighter row norm bound for from the proof of Lemma 2. In particular, from equation (3) we get:
We can then bound the number of samples we need to take such that for (as defined in Theorem 8) we have with probability at least via a matrix Bernstein bound. By appropriately modifying the proof of Lemma 3 to incorporate the stronger row norm bound for , we can show that sampling with probability for for large enough suffices. Specifically, we get , and for the Bernstein bound in Lemma 3 which enables us to get the tighter bound. Thus, we have with probability for following Lemma 3. We also require for to hold with probability by Lemma 4. Then, following the proof of Theorem 1, by Fact 4, for all , and some constant , we have:
Appendix D Spectral Norm Bounds for Non-Uniform Random Submatrices
See 5
Proof.
The proof follows from [Tro08b]. We begin by defining the following term
Now we have
where is the sequence of independent random variables such that with probability and otherwise, and is the th column of . Then, . Let be an independent copy of the sequence . Subtracting the mean and applying triangle inequality we have
Using Jensen’s inequality we have
The random variables are symmetric and independent. Let be i.i.d Rademacher random variables for all . Then applying the standard symmetrization argument followed by triangle inequality, we have:
Let . Let be the partial expectation with respect to , keeping the other random variables fixed. Then, we get:
Using Rudelson’s Lemma 11 of [Tro08b] for any matrix with columns and any we have
Since is concave for , using Jensen’s inequality we get:
Applying the above result to the matrix , we get:
Applying the Cauchy-Schwarz inequality, we get:
The above equation is of the form . Thus, the values of for which the above equation holds are given by . Thus, we get:
This gives us the final bound. ∎
Appendix E Improved Bounds via Row-Norm-Based Sampling
Building on the sparsity-based sampling results presented in Section 4, we now show how to obtain improved approximation error of assuming we can sample the rows of with probabilities proportional to their squared norms. The ability to sample by norms also allows us to remove the assumption that has bounded entries: our results apply to any symmetric matrix.
For technical reasons, we mix row norm sampling with uniform sampling, forming a random principal submatrix by sampling each index independently with probability and rescaling each sampled row/column by . As in the sparsity-based sampling setting, we must carefully zero out entries of the sampled submatrix to ensure concentration of the sampled eigenvalues. Pseudocode for the full algorithm is given in Algorithm 3.
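A schematic Python sketch of these two steps appears below; the mixture probabilities, the constant c, and especially the zeroing threshold tau are placeholders rather than Algorithm 3's exact formulas.

```python
import numpy as np

def norm_sample_submatrix(A, eps, c=1.0, rng=None):
    """Hedged sketch of the sampling and zeroing steps described above.
    The mixture probabilities, the rescaling, and especially the zeroing
    threshold are schematic placeholders, not Algorithm 3's exact formulas."""
    rng = rng or np.random.default_rng()
    n = A.shape[0]
    row_norms_sq = np.sum(A * A, axis=1)
    fro_sq = max(row_norms_sq.sum(), 1e-12)
    # Mix squared-row-norm sampling with uniform sampling, capped at 1.
    p = np.minimum(1.0, (c / eps**2) * (row_norms_sq / fro_sq + 1.0 / n))
    S = np.flatnonzero(rng.random(n) < p)
    scale = 1.0 / np.sqrt(p[S])
    A_S = A[np.ix_(S, S)] * np.outer(scale, scale)   # rescale kept rows/columns
    # Placeholder zeroing rule: drop rescaled off-diagonal entries that are
    # individually too large to concentrate; Algorithm 3's actual rule depends
    # on eps and the sampling probabilities p_i, p_j.
    tau = np.sqrt(fro_sq) / (eps * max(len(S), 1))
    off_diag = ~np.eye(len(S), dtype=bool)
    A_S[off_diag & (np.abs(A_S) > tau)] = 0.0
    return A_S, S
```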
E.1 Preliminary Lemmas
Our proof closely follows that of Theorem 2 in Section 4. We start by defining , obtained by zeroing out entries of as described in Algorithm 3. We have whenever 1) and or 2) and . Otherwise . Similar to the sparsity sampling case, we argue that the eigenvalues of are close to , i.e., zeroing out entries of according to the given condition doesn't change its eigenvalues by too much (Lemma 11). Then, we again split such that . We argue that after sampling, we have and that the eigenvalues of approximate those of up to error.
Lemma 11.
Let be symmetric. Let have if either 1) and or 2) and for a sufficiently large constant . Otherwise, . Then, for all ,
Proof.
Consider the matrix , which is defined identically to except that we only set if and . I.e., we do not zero out any entries on the diagonal as in . We will show that . If is zeroed out in , this implies that . Thus, and so . So, by the triangle inequality, we will then have . The lemma then follows from Weyl's inequality.
To show that , we use a variant of Gershgorin's theorem, as in the proof of Lemma 5. First, we split the entries of into level sets, according to their magnitudes. Let where if and otherwise. For , if and otherwise. We can also define where each are defined similarly. By triangle inequality, . First observe that . Further, we can assume without loss of generality that and so , as otherwise our algorithm can afford to read all of . So, it suffices to show that for all ,
(37) |
This will give , which gives the lemma after adjusting by a constant factor.
We now prove (37) for each . For , let be the set of rows/columns in with and let be the submatrix of formed with rows in and columns in . Define the submatrix of in the same way. Let and finally, let be the symmetric error matrix such that and .
Note that all rows from which we zero out entries must have at least one non-zero entry (otherwise all entries in that row/column are already zero), thus all such rows have and so are covered by the submatrices . Thus, by triangle inequality, we can bound
(38) |
To prove (37), we need to bound for all and . We use a case analysis.
Case 1: In this case, first observe that since the nonzero entries of lie in , for any , ,
Thus, by the assumed bound on , we have for any where is nonzero,
where the second inequality follows again from the fact that the nonzero entries of lie in . Thus, any with nonzero is not zeroed out in line 5 of Algorithm 3. So . Plugging into (38), we thus have:
(39) |
Case 2: In this case, observe that . We can see that has at most non-zero entries. Similarly, each row of has at most non-zero elements. Thus, for all , using the fact that all non-zero entries of are bounded by , we have:
Applying Gershgorin’s circle theorem (Theorem 2) we thus have:
and so
Plugging into (39), we thus have:
Setting , we thus have (37), and in turn the lemma. ∎
We next give a bound on the incoherence of the outlying eigenvectors of . This bound is again similar to Lemmas 2 and 6.
Lemma 12 (Incoherence of outlying eigenvectors in terms of norms).
Let be as in Lemma 11. Let where is diagonal, with the eigenvalues of with magnitude on its diagonal, and has columns equal to the corresponding eigenvectors. Let denote the th row of . Then,
Proof.
The proof is again nearly identical to that of Lemma 2. Observe that . Letting denote the th row of the , we have
(40) |
where , is the th element of and . Since has orthonormal columns, we have . Therefore, by (40),
(41) |
Since by definition for all , we can conclude that and , which completes the lemma. ∎
E.2 Outer and Middle Eigenvalue Bounds
Using Lemma 12, we next argue that the eigenvalues of will approximate those of , and in turn those of . The proof is very similar to Lemmas 3 and 7.
Lemma 13 (Concentration of outlying eigenvalues with norm based sampling).
Let be as in Algorithm 3. Let , where , and are projections onto the eigenspaces with magnitude and respectively. For all let and let be a scaled diagonal sampling matrix such that the with probability and otherwise. If for a large enough constant , then with probability at least , .
Proof.
We define the random variables and the set exactly as in the proof of Lemma 7. Then, as explained in the proof of Lemma 7, it suffices to bound . From (17) we have . Also from Lemma 11, we have and for all , . We thus get,
Since is PSD this establishes that . We can then apply the matrix Bernstein inequality exactly as in the proof of Lemma 3 to show that when for large enough , with probability at least , . ∎
We now bound the middle eigenvalues.
Lemma 14 (Concentration of middle eigenvalues with norm based sampling).
Let be as in Lemma 12. Let , where , and are projections onto the eigenspaces with magnitude and respectively (analogous to Definition 1.1). As in Algorithm 2, for all let and let be a scaled diagonal sampling matrix such that the with probability and otherwise. If for a large enough constant , then with probability at least ,
Proof.
First observe that since (for large enough ), the results of Lemmas 11 and 12 still hold. The proof follows the same structure as the proof bounding the middle eigenvalues for sparsity-based sampling in Lemma 8. From Lemma 12, we have . Also, following the proof of Lemma 12, we have . Thus, for all , using the Cauchy-Schwarz inequality, we have
(42) |
Let where and contain the off-diagonal and diagonal elements of respectively. Then following the proof of Lemma 8, we get:
(43) |
We now proceed to bound each of the terms on the right hand side of (43). We start with . First, observe that . We consider two cases.
Case 1: . Then, as we have since . So we must have that (since ). Then by (42), we have .
Case 2: . Then we have .
From the two cases above, for , we have:
(44) |
We can bound similarly. Since and ,
(45) |
where the second step follows from the fact that .
We next bound the term . Observe that , where is the th column/row of . We again consider the two cases when and :
Case 1: . Then .
Case 2: . Then . Thus, setting we have:
Thus, from the two cases above, for all , adjusting by a factor, we have for :
(46) |
Overall, plugging (44), (45), and (46) back into (43), we have :
(47) |
Finally we bound . As in the proof of Lemma 8, we have and we will argue that is bounded by with probability . Also as argued in the proof of Lemma 8, since , it suffices to bound for all with high probability. Again, for a fixed and any , define the random variables as:
Then and . We will again use Bernstein’s inequality to bound by bounding for all and . We consider the cases of and separately.
Case 1: . Then, we have . If then
where the fourth inequality uses (42). By the thresholding procedure which defines , if and ,
(48) |
and thus for and we have
Also since we must have as . If or , then we simply have
Overall for all ,
(49) |
and since ,
(50) |
For and large enough , we thus have .
We next bound the variance by:
where the last inequality uses (42). We thus get:
(51) |
Now as (and thus, ). Combining (48) with the second term to the right of (51) we have
and since , we have
(52) |
Finally since and we have
(53) |
For for large enough , we have .
Therefore, using (50) and (53) with , we can apply Bernstein inequality (Theorem 7) (for some constant ) to get
If we set , for some constant we have
Since , we have . Then with probability at least , for any row with , we have
for for large enough . Observe that, as in Lemma 3, w.l.o.g. we have assumed , since otherwise our algorithm would read all entries of the matrix.
Case 2: . Then, we have . As in the case when , (and thus ), we have from (49):
Now, we observe that , which gives us
(54) |
Note that if , the second term in (49) is bounded as for . Thus, for for a large enough constant and adjusting for other constants we have . Also observe that the expectation of can be bounded by:
Next, the variance of the sum of the random variables can again be bounded by following the analysis in and prior to (52) and (53); we have
(55) |
where we again bound using
Then for , we have for large enough constant .
Using (54) and (55) and noting that we can apply the Bernstein inequality (Theorem 7):
If we set , then for some constant we have
Thus, since , when , setting for large enough , we have with probability
We have proven that with probability , for both cases when and , . Taking a union bound over all , with probability at least , for . Also, since for all , . Thus, and we get,
after adjusting by at most some constants. Overall, we finally get
Plugging this bound into (47), we have for ,
Finally after adjusting by a factor, we have for or ,
The final bound then follows via Markov’s inequality on . ∎
E.3 Main Accuracy Bound
We are finally ready to state our main result for norm based sampling.
See 3
Proof.
The proof follows exactly the same structure as the proofs of Theorems 1 and 2 for uniform and sparsity based sampling respectively. We use the results of Lemmas 14 and 13 on the concentration of the middle and large eigenvalues for norm based sampling.
Analogous to Theorem 2, from Lemma 13 with error parameter the eigenvalues of approximate those of up to error with probability if . We also require for to hold with probability by Lemma 14. Thus, for both conditions to hold simultaneously with probability by a union bound, it suffices to set , where we use that , as otherwise our algorithm can take to be the full matrix . Adjusting to completes the theorem. ∎
Appendix F Eigenvalue Approximation via Entrywise Sampling
In this section we show that sampling entries from a bounded entry matrix preserves its eigenvalues up to error . We use this result to improve the sample complexity of Theorem 1 from to by applying entrywise sampling to further sparsify the submatrix that is sampled in Algorithm 1. Entrywise sampling results similar to what we show are well-known in the literature. See for example [AM07] and [BKKS21]. For completeness, we give a proof here using standard matrix concentration bounds.
Theorem 10 (Entrywise sampling – spectral norm bound).
Consider with . Let be constructed by setting for all and
For any , if for a large enough constant , then with probability at least , .
Note that by Weyl’s inequality (Fact 3), Theorem 10 immediately implies that the eigenvalues of approximate those of up to error with good probability.
Proof.
For any , define the symmetric random matrix with
Observe that . Further, each has just two non-zero values in different rows and columns. So
where the last inequality uses that . Additionally, is diagonal with two diagonal entries at or equal to . Thus, is also diagonal. We have
where in the final inequality we use that . Thus, since is diagonal, . Putting the above together using Theorem 6 we get,
Thus, for for large enough , with probability at least we have . ∎
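A hedged sketch of this entrywise sparsification step is given below; the keep probability p is left as a parameter standing in for the sampling rate required by Theorem 10.

```python
import numpy as np

def entrywise_sparsify(A, p, rng=None):
    """Keep each off-diagonal entry (symmetrically) with probability p, rescale
    kept entries by 1/p so the off-diagonal part is unbiased, and keep the
    diagonal exactly. p stands in for the rate required by Theorem 10."""
    rng = rng or np.random.default_rng()
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)                   # upper-triangular positions
    keep = rng.random(iu[0].size) < p
    B = np.zeros_like(A, dtype=float)
    B[iu] = np.where(keep, A[iu] / p, 0.0)         # surviving entries rescaled by 1/p
    B = B + B.T                                    # mirror to preserve symmetry
    np.fill_diagonal(B, np.diag(A))                # diagonal kept exactly
    return B
```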
F.1 Improved Sample Complexity via Entrywise Sampling
We can combine Theorem 10 directly with Theorem 1 to give an improved sample complexity for eigenvalue estimation. See 1
Proof.
Letting for large enough constant , by Theorem 1, for a random principal submatrix formed by sampling each index with probability , the eigenvalues of , after scaling up by a factor of approximate those of to error with probability at least . By Theorem 10, if we sample off-diagonal entries of with probability to produce , then we preserve its eigenvalues to error . Thus, after scaling by , the eigenvalues of approximate those of to error . Finally, observe that by a standard Chernoff bound, with probability at least . So adjusting by a constant, the scaled eigenvalues of give approximations to ’s eigenvalues. The expected number of entries read is . Additionally, by a standard Chernoff bound at most are read with probability at least . ∎
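The two-stage scheme described in this proof can be sketched as follows; s_frac and p are placeholders for the sampling rates of Theorems 1 and 10, and the n/|S| eigenvalue rescaling mirrors Algorithm 1.

```python
import numpy as np

def estimate_eigs_two_stage(A, s_frac, p, rng=None):
    """Illustration of the two-stage scheme: sample a uniform principal submatrix,
    sparsify its off-diagonal entries with probability p (rescaled by 1/p), then
    scale the resulting eigenvalues by n/|S|. s_frac and p are placeholders for
    the rates in Theorems 1 and 10."""
    rng = rng or np.random.default_rng()
    n = A.shape[0]
    S = np.flatnonzero(rng.random(n) < s_frac)
    A_S = A[np.ix_(S, S)].astype(float)
    iu = np.triu_indices(len(S), k=1)
    keep = rng.random(iu[0].size) < p
    B = np.zeros_like(A_S)
    B[iu] = np.where(keep, A_S[iu] / p, 0.0)       # sparsify and rescale off-diagonal
    B = B + B.T
    np.fill_diagonal(B, np.diag(A_S))              # keep the submatrix diagonal exactly
    return (n / max(len(S), 1)) * np.sort(np.linalg.eigvalsh(B))[::-1]
```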
Appendix G Singular Value Approximation via Sampling
We now show how to estimate the singular values of a bounded-entry matrix via random subsampling. Unlike in eigenvalue estimation, instead of sampling a random principal submatrix, here we sample a random submatrix with independent rows and columns. This allows us to apply known interior eigenvalue matrix Chernoff bounds to bound the perturbation in the singular values [GT11, BCJ20]. We first state a simplified version of Theorem 4.1 from [GT11] (also stated as Theorem 4.6 in [BCJ20]), simplified using standard upper bounds on the Chernoff bounds in [MU17].
Theorem 11 (Interior Eigenvalue Matrix Chernoff bounds – Theorem 4.1 of [GT11]).
Let be a finite sequence of independent, random, positive-semidefinite matrices with dimension , and assume that for some value almost surely. Given an integer , define
Then we have the tail inequalities:
We are now ready to state and prove the main theorem.
Theorem 12.
Let be a matrix with and singular values . Let be a scaled diagonal sampling matrix such that with probability and otherwise. Let be an independent and identically distributed random sampling matrix. Let be the sampled submatrix from with singular values . Then, if for some constant , with probability at least , for all ,
Proof.
We first prove that singular values of are close to those of . Let be matrix valued r.v.’s for such that:
where is the th row of written as a column vector. Then, and . We have and for .
Case 1: We will first prove that for all . Note that when , is trivially true. We now consider all such that . Setting , and (note that ) in Theorem 11, we get:
where is constant. So, for for any , we have with probability at least . Taking a square root on both sides we get . Taking a union bound over all with , holds for all such with probability at least .
Case 2: We now prove that for all . We first consider the case when . Setting , and (note that ) in Theorem 11, we get (for some constant ):
Thus, if , we have for all such that with probability at least via a union bound. Taking square root on both sides and using the facts that , and , we get .
We now consider the case . Setting , and (note that ) in Theorem 11, we get (for some constant ):
Thus, if , we have for all such that with probability at least via a union bound. Taking square root on both sides and using the fact that , and for any , we get . Thus, via a union bound over all , we have with probability .
Thus, via a union bound over the two cases above, for all with probability at least for we have, for all ,
(56) |
Next we prove that the singular values of are close to those of , using essentially the same approach as above. Let be a matrix-valued random variable for such that:
where is the th column of . Then, . Also, we have . First, using a standard Chernoff bound, we can claim that will sample at most rows from with probability at least for any . Thus, we have with probability . Let this event be called . We now consider two cases conditioned on the event .
Case 1: We first prove that for all . Again note that when this is trivially true. So we consider all such that . Setting , (as we have conditioned on ) and (note that ) in Theorem 11, we get:
where is some constant. So, for for any , we have with probability at least . Taking a square root on both sides we get . Taking a union bound over all with , holds for all such with probability at least .
Case 2: We now prove for all . We again first consider the case . Setting , and (note that ) in Theorem 11:
Then, similarly to the previous Case 2, taking the square root of both sides and applying a union bound, we get for all such that with probability at least for . The case is again similar to the previous Case 2. We set and apply Theorem 11, then take the square root on both sides to get with probability for all for . Thus, with probability , conditioned on the event , we have for all . Finally, via a union bound over the two cases above, and conditioned on , for all with probability at least for , we get
(57) |
Thus, taking a union bound over all the cases above (including ), from equations (56) and (57) and via the triangle inequality, we get: with probability at least (where is a small constant) for . Adjusting and by constant factors gives us the final bound. ∎
Remark on Rectangular Matrices: Though we have considered to be a square matrix for simplicity, notice that Theorem 12 also holds for any arbitrary (non-square) matrix , with replaced by in the sample complexity bound.
Remark on Non-Uniform Sampling: As discussed in Section 1.3.1, simple non-uniform random submatrix sampling via row/column sparsities or norms does not suffice to estimate the singular values up to the improved error bounds of or . A more complex strategy, such as the zeroing out used in Theorems 2 and 3, must be used. It is worth noting that, following the same proof as Theorem 12, it is easy to show that if is sampled according to row norms or sparsities and appropriately weighted, then the singular values of do approximate those of up to these improved error bounds. The proof breaks down when analyzing . would have to be sampled according to the row norms/sparsities of , not , for the proof to go through. However, in general, these sampling probabilities can differ significantly between and .
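A hedged sketch of the row/column sampling estimator analyzed in this appendix is given below; the uniform rate p and the 1/sqrt(p) rescaling of kept rows and columns stand in for the exact sampling matrices of Theorem 12.

```python
import numpy as np

def estimate_singular_values(A, p, rng=None):
    """Sketch of the estimator in Theorem 12: sample rows and columns independently,
    each with probability p, rescale kept rows and columns by 1/sqrt(p), and use the
    singular values of the resulting submatrix as estimates for the top singular
    values of A. The rate p is a placeholder for the rate required by the theorem."""
    rng = rng or np.random.default_rng()
    n, m = A.shape
    rows = np.flatnonzero(rng.random(n) < p)
    cols = np.flatnonzero(rng.random(m) < p)
    sub = A[np.ix_(rows, cols)] / p            # 1/sqrt(p) rescaling of rows and of columns
    sv = np.linalg.svd(sub, compute_uv=False)  # singular values of the sampled submatrix
    est = np.zeros(min(n, m))
    est[:len(sv)] = np.sort(sv)[::-1]          # remaining singular values estimated as 0
    return est
```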