
Universal Matrix Sparsifiers and Fast Deterministic Algorithms for Linear Algebra

Rajarshi Bhattacharjee¹, Gregory Dexter², Cameron Musco¹, Archan Ray¹, Sushant Sachdeva³, David P. Woodruff⁴

¹University of Massachusetts Amherst, {rbhattacharj,cmusco,ray}@cs.umass.edu   ²Purdue University, gdexter@purdue.edu   ³University of Toronto, sachdeva@cs.toronto.edu   ⁴Carnegie Mellon University, dwoodruf@cs.cmu.edu

Let $\mathbf{S}\in\mathbb{R}^{n\times n}$ be any matrix satisfying $\|\mathbf{1}-\mathbf{S}\|_{2}\leq\epsilon n$, where $\mathbf{1}$ is the all ones matrix and $\|\cdot\|_{2}$ is the spectral norm. It is well-known that there exists such an $\mathbf{S}$ with just $O(n/\epsilon^{2})$ non-zero entries: we can let $\mathbf{S}$ be the scaled adjacency matrix of a Ramanujan expander graph. We show that, beyond giving a sparse approximation to the all ones matrix, $\mathbf{S}$ yields a universal sparsifier for any positive semidefinite (PSD) matrix. In particular, for any PSD $\mathbf{A}\in\mathbb{R}^{n\times n}$ which is normalized so that its entries are bounded in magnitude by $1$, we show that $\|\mathbf{A}-\mathbf{A}\circ\mathbf{S}\|_{2}\leq\epsilon n$, where $\circ$ denotes the entrywise (Hadamard) product. Our techniques also yield universal sparsifiers for non-PSD matrices. In this case, we show that if $\mathbf{S}$ satisfies $\|\mathbf{1}-\mathbf{S}\|_{2}\leq\frac{\epsilon^{2}n}{c\log^{2}(1/\epsilon)}$ for some sufficiently large constant $c$, then $\|\mathbf{A}-\mathbf{A}\circ\mathbf{S}\|_{2}\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$, where $\|\mathbf{A}\|_{1}$ is the nuclear norm. Again letting $\mathbf{S}$ be a scaled Ramanujan graph adjacency matrix, this yields a sparsifier with $\widetilde{O}(n/\epsilon^{4})$ entries. We prove that the above universal sparsification bounds for both PSD and non-PSD matrices are tight up to logarithmic factors.

Since $\mathbf{A}\circ\mathbf{S}$ can be constructed deterministically without reading all of $\mathbf{A}$, our result for PSD matrices derandomizes and improves upon established results for randomized matrix sparsification, which require sampling a random subset of $O\left(\frac{n\log n}{\epsilon^{2}}\right)$ entries and only give an approximation to any fixed $\mathbf{A}$ with high probability. We further show that any randomized algorithm must read at least $\Omega(n/\epsilon^{2})$ entries to spectrally approximate general $\mathbf{A}$ to error $\epsilon n$, thus proving that these existing randomized algorithms are optimal up to logarithmic factors. We leverage our deterministic sparsification results to give the first deterministic algorithms for several problems, including singular value and singular vector approximation and positive semidefiniteness testing, that run in faster than matrix multiplication time. This partially addresses a significant gap between randomized and deterministic algorithms for fast linear algebraic computation.

Finally, if $\mathbf{A}\in\{-1,0,1\}^{n\times n}$ is PSD, we show that a spectral approximation $\widetilde{\mathbf{A}}$ with $\|\mathbf{A}-\widetilde{\mathbf{A}}\|_{2}\leq\epsilon n$ can be obtained by deterministically reading $\widetilde{O}(n/\epsilon)$ entries of $\mathbf{A}$. This improves the $1/\epsilon$ dependence of our result for general PSD matrices by a quadratic factor and is information-theoretically optimal up to a logarithmic factor.

1 Introduction

A common task in processing large matrices is element-wise sparsification. Given an $n\times n$ matrix $\mathbf{A}$, the goal is to choose a small subset $S$ of coordinates in $[n]\times[n]$, where $[n]=\{1,2,\ldots,n\}$, such that $\|\mathbf{A}-\mathbf{A}\circ\mathbf{S}\|_{2}$ is small, where $\circ$ denotes the Hadamard (entrywise) product and $\mathbf{S}$ is a sampling matrix which equals $\frac{n^{2}}{|S|}$ on the entries in $S$ and $0$ otherwise. As in previous work, we consider operator norm error, where for a matrix $\mathbf{B}$, $\|\mathbf{B}\|_{2}\stackrel{\mathrm{def}}{=}\sup_{\mathbf{x}\in\mathbb{R}^{n}}\frac{\|\mathbf{B}\mathbf{x}\|_{2}}{\|\mathbf{x}\|_{2}}$. Elementwise sparsification has been widely studied [AM07, DZ11, AKL13, BKKS21] and has been used as a primitive in several applications, including low-rank approximation [AM07, Kun15], approximate eigenvector computation [AHK06, AM07], semi-definite programming [AHK05, d'A11], and matrix completion [CR12a, CT10]. Without loss of generality, one can scale the entries of $\mathbf{A}$ so that the maximum entry is bounded by $1$ in absolute value, and we refer to such matrices as having bounded entries. With this normalization, it will be convenient to consider the task of finding a small subset $S$ with corresponding sampling matrix $\mathbf{S}$ such that, for a given error parameter $\epsilon\in(0,1)$,

$$\left\|\mathbf{A}-\mathbf{A}\circ\mathbf{S}\right\|_{2}\leq\epsilon\cdot n. \qquad (1)$$
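To make the setup concrete, here is a minimal numpy sketch (purely illustrative, not a construction from the paper) that forms $\mathbf{S}$ for a uniformly random index set $S$ — the randomized baseline discussed next — and measures the error in (1):

```python
import numpy as np

def sample_sparsifier(n, s, rng):
    """Build the rescaled sampling matrix S for a uniformly random
    subset of s of the n^2 coordinates (nonzeros equal n^2 / s)."""
    S = np.zeros((n, n))
    idx = rng.choice(n * n, size=s, replace=False)
    S[np.unravel_index(idx, (n, n))] = n**2 / s
    return S

rng = np.random.default_rng(0)
n, eps = 500, 0.25
s = int(n * np.log(n) / eps**2)           # the O(n log n / eps^2) baseline
A = np.sign(rng.standard_normal((n, n)))  # entries in {-1, 1}
A = (A + A.T) / 2                         # symmetric, entries in [-1, 1]
S = sample_sparsifier(n, s, rng)
print(np.linalg.norm(A - A * S, 2), eps * n)  # error should fall below eps*n
```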

One can achieve the error guarantee in (1) for any bounded entry matrix $\mathbf{A}$ with high probability by uniformly sampling a set of $O\left(\frac{n\log n}{\epsilon^{2}}\right)$ entries of $\mathbf{A}$; see, e.g., Theorem 1 of [DZ11]. (To apply their Theorem 1, we first rescale $\mathbf{A}$ so that all entries are at most $1/n$ in magnitude, and note that $\|\mathbf{A}\|_{F}^{2}\leq 1$ in this case. Thus, they sample $s=O\left(\frac{n\log n}{\epsilon^{2}}\right)$ entries, and rescaling their error guarantee analogously gives (1).) However, there are no known lower bounds for this problem, even if we consider the harder task of universal sparsification, which requires finding a fixed subset $S$ such that (1) holds simultaneously for every bounded entry matrix $\mathbf{A}$. The existence of such a fixed subset $S$ corresponds to the existence of a deterministic sublinear query algorithm that constructs a spectral approximation to any $\mathbf{A}$ by forming $\mathbf{A}\circ\mathbf{S}$ (which requires reading just $|S|$ entries of $\mathbf{A}$). As we will see, such algorithms have applications to fast deterministic algorithms for linear algebraic computation. We ask:

What is the size of the smallest set $S$ that achieves (1) simultaneously for every bounded entry matrix $\mathbf{A}$?

Previously, no bound on the size of $S$ better than the trivial $O(n^{2})$ was known.

1.1 Our Results

Our first result answers the above question for the class of symmetric positive semidefinite (PSD) matrices, i.e., those matrices $\mathbf{A}$ for which $\mathbf{x}^{T}\mathbf{A}\mathbf{x}\geq 0$ for all vectors $\mathbf{x}\in\mathbb{R}^{n}$, or equivalently, whose eigenvalues are all non-negative. PSD matrices arise, e.g., as covariance matrices, kernel similarity matrices, and graph Laplacians, and significant work has considered efficient algorithms for approximating their properties [WLZ16, CW17, XG10, Cha11, ACK+16, MMMW21, SKO21]. We summarize our results below and give a comparison to previous work in Table 1.

We show that there exists a set $S$ with $|S|=O(n/\epsilon^{2})$ that achieves (1) simultaneously for all bounded entry PSD matrices. This improves the best known randomized bound of $O\left(\frac{n\log n}{\epsilon^{2}}\right)$ for algorithms which only succeed on a fixed matrix with high probability.

Theorem 1 (Universal Sparsifiers for PSD Matrices).

There exists a subset $S$ of $s=O(n/\epsilon^{2})$ entries of $[n]\times[n]$ such that, letting $\mathbf{S}\in\mathbb{R}^{n\times n}$ have $\mathbf{S}_{ij}=\frac{n^{2}}{s}$ for $(i,j)\in S$ and $\mathbf{S}_{ij}=0$ otherwise, simultaneously for all PSD matrices $\mathbf{A}\in\mathbb{R}^{n\times n}$ with bounded entries, $\left\|\mathbf{A}-\mathbf{A}\circ\mathbf{S}\right\|_{2}\leq\epsilon n$.

| Matrix Type | Approx. Type | Randomized Error | Randomized Sample Complexity | Deterministic Error | Deterministic Sample Complexity |
| --- | --- | --- | --- | --- | --- |
| $\Vert\mathbf{A}\Vert_{\infty}\leq 1$, $\mathbf{A}$ is PSD | sparse | $\epsilon n$ | $\Theta(n/\epsilon^{2})$ (Thms 1, 3) | $\epsilon n$ | $\Theta(n/\epsilon^{2})$ (Thms 1, 3) |
| $\Vert\mathbf{A}\Vert_{\infty}\leq 1$ | sparse | $\epsilon n$ | $O\left(\frac{n\log n}{\epsilon^{2}}\right)$ [AM07], $\Omega(n/\epsilon^{2})$ (Thm 3) | $\epsilon\max(n,\Vert\mathbf{A}\Vert_{1})$ | $O\left(\frac{n\log^{4}(1/\epsilon)}{\epsilon^{4}}\right)$ (Thm 4), $\Omega(n/\epsilon^{4})$ (Thm 6) |
| $\Vert\mathbf{A}\Vert_{\infty}\leq 1$, $\mathbf{A}$ is PSD | any | $\epsilon n$ | $O\left(\frac{n\log(1/\epsilon)}{\epsilon}\right)$ [MM17] | $\epsilon n$ | $O\left(\frac{n\log n}{\epsilon}\right)$ (Thm 9) for $\mathbf{A}\in\{-1,0,1\}^{n\times n}$, $\Omega(n/\epsilon)$ (Thm 10) |
| $\Vert\mathbf{A}\Vert_{\infty}\leq 1$ | any | $\epsilon n$ | $\Omega(n/\epsilon^{2})$ (Thm 11) | $\epsilon\max(n,\Vert\mathbf{A}\Vert_{1})$ | $\Omega(n/\epsilon^{2})$ (Thm 7) |

Table 1: Summary of results. Algorithms in the first two rows output a spectral approximation that is sparse, as in (1). Our deterministic algorithms in this case follow directly from our universal sparsification results. Algorithms in the last two rows can output an approximation of any form.

Theorem 1 can be viewed as a significant strengthening of a classic spectral graph expander guarantee [Mar73, Alo86]; indeed, letting $\mathbf{A}=\mathbf{1}$ be the all ones matrix, we have that if $\mathbf{A}\circ\mathbf{S}$ satisfies (1), then it matches the near-optimal spectral expansion of Ramanujan graphs, up to a constant factor. (To see this, let $d=s/n=O(1/\epsilon^{2})$ be the average number of entries per row of $\mathbf{S}$, and let $\mathbf{G}\in\{0,1\}^{n\times n}$ be the binary adjacency matrix of the graph with edges in $S$, i.e., $\mathbf{G}=\frac{s}{n^{2}}\cdot\mathbf{S}$. Then by (1) applied with $\mathbf{A}=\mathbf{1}$, $\lambda_{1}(\mathbf{G})=\frac{s}{n^{2}}\lambda_{1}(\mathbf{S})=\frac{s}{n^{2}}\lambda_{1}(\mathbf{1}\circ\mathbf{S})=\frac{s}{n^{2}}(\lambda_{1}(\mathbf{1})\pm\epsilon n)=\Theta(s/n)=\Theta(d)$, while for $i>1$, $|\lambda_{i}(\mathbf{G})|\leq\frac{s}{n^{2}}(|\lambda_{i}(\mathbf{1})|+\epsilon n)\leq\epsilon d=\Theta(\sqrt{d})$.) In fact, we prove Theorem 1 by proving a more general claim: any matrix $\mathbf{S}$ that sparsifies the all ones matrix also sparsifies every bounded entry PSD matrix. In particular, Theorem 1 follows as a direct corollary of:

Theorem 2 (Spectral Expanders are Universal Sparsifiers for PSD Matrices).

Let $\mathbf{1}\in\mathbb{R}^{n\times n}$ be the all ones matrix. Let $\mathbf{S}\in\mathbb{R}^{n\times n}$ be any matrix such that $\|\mathbf{1}-\mathbf{S}\|_{2}\leq\epsilon n$. Then for any PSD matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$ with bounded entries, $\|\mathbf{A}-\mathbf{A}\circ\mathbf{S}\|_{2}\leq\epsilon n$.

To prove Theorem 1 given Theorem 2, we let $S$ be the edge set of a Ramanujan spectral expander graph with degree $d=O(1/\epsilon^{2})$ and adjacency matrix $\mathbf{G}$. We have $s=nd=O(n/\epsilon^{2})$ and $\mathbf{S}=\frac{n^{2}}{s}\mathbf{G}=\frac{n}{d}\mathbf{G}$. Thus, the top eigenvector of $\mathbf{S}$ is the all ones vector with eigenvalue $\lambda_{1}(\mathbf{S})=\frac{n}{d}\lambda_{1}(\mathbf{G})=n$. All other eigenvalues are bounded in magnitude by $|\lambda_{i}(\mathbf{S})|=\frac{n}{d}|\lambda_{i}(\mathbf{G})|=O\left(\frac{n}{d}\cdot\sqrt{d}\right)=O(\epsilon n)$. Combined, after adjusting $\epsilon$ by a constant factor, this shows that $\|\mathbf{1}-\mathbf{S}\|_{2}\leq\epsilon n$, as required.
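The following numpy sketch illustrates this construction numerically. It is an assumption-laden stand-in: we use a random $d$-regular graph (which is near-Ramanujan with high probability) in place of the explicit constructions the paper relies on.

```python
import numpy as np
import networkx as nx

# Illustrative check of the expander-based construction; a random
# d-regular graph stands in for an explicit Ramanujan graph.
n, eps = 2000, 0.2
d = int(np.ceil(9 / eps**2))                 # degree d = O(1/eps^2)
G = nx.to_numpy_array(nx.random_regular_graph(d, n, seed=0))
S = (n / d) * G                              # scaled adjacency matrix

# lambda_1(S) = n with the all ones eigenvector, so ||1 - S||_2 is driven
# by the remaining eigenvalues, of magnitude O(n/sqrt(d)) = O(eps * n).
print(np.linalg.norm(np.ones((n, n)) - S, 2), eps * n)
```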

Sparsity Lower Bound. We show that Theorem 1 is tight: even if we only seek to approximate the all ones matrix (rather than all bounded PSD matrices), $\mathbf{S}$ requires $\Omega(n/\epsilon^{2})$ non-zero entries.

Theorem 3 (Sparsity Lower Bound – PSD Matrices).

Let $\mathbf{1}\in\mathbb{R}^{n\times n}$ be the all-ones matrix. Then, for any $\epsilon\in(0,1/2)$ with $\epsilon\geq c/\sqrt{n}$ for a large enough constant $c$, any $\mathbf{S}\in\mathbb{R}^{n\times n}$ with $\|\mathbf{1}-\mathbf{S}\|_{2}\leq\epsilon n$ must have $\Omega(n/\epsilon^{2})$ nonzero entries.

Theorem 3 resolves an open question of [BKKS21], which asked whether a spectral approximation must have $\Omega\left(\frac{\operatorname{ns}(\mathbf{A})\operatorname{sr}(\mathbf{A})}{\epsilon^{2}}\right)$ non-zeros, where $\operatorname{sr}(\mathbf{A})=\frac{\|\mathbf{A}\|_{F}^{2}}{\|\mathbf{A}\|_{2}^{2}}$ is the stable rank and $\operatorname{ns}(\mathbf{A})=\max_{i}\frac{\|\mathbf{a}_{i}\|_{1}^{2}}{\|\mathbf{a}_{i}\|_{2}^{2}}$ is the 'numerical sparsity', with $\mathbf{a}_{i}$ the $i^{th}$ row of $\mathbf{A}$. They give a lower bound when $\operatorname{sr}(\mathbf{A})=\Theta(n)$ but ask if this can be extended to $\operatorname{sr}(\mathbf{A})=o(n)$. For the all ones matrix, $\operatorname{sr}(\mathbf{A})=1$ and $\operatorname{ns}(\mathbf{A})=n$, and so Theorem 3 resolves this question. Further, by applying the theorem to a block diagonal matrix with $r$ disjoint $k\times k$ blocks of all ones, we resolve the question for any integer values of $\operatorname{sr}(\mathbf{A})$ and $\operatorname{ns}(\mathbf{A})$.

Non-PSD Matrices. The techniques used to prove Theorem 1 also give nearly tight universal sparsification bounds for general bounded entry (not necessarily PSD) matrices. In this case, we show that a subset $S$ of $O\left(\frac{n\log^{4}(1/\epsilon)}{\epsilon^{4}}\right)$ entries suffices to achieve spectral norm error depending on the nuclear norm $\|\mathbf{A}\|_{1}$, which is the sum of the singular values of $\mathbf{A}$.

Theorem 4 (Universal Sparsifiers for Non-PSD Matrices).

There exists a subset $S$ of $s=O\left(\frac{n\log^{4}(1/\epsilon)}{\epsilon^{4}}\right)$ entries of $[n]\times[n]$ such that, letting $\mathbf{S}\in\mathbb{R}^{n\times n}$ have $\mathbf{S}_{ij}=\frac{n^{2}}{s}$ for $(i,j)\in S$ and $\mathbf{S}_{ij}=0$ otherwise, simultaneously for all symmetric matrices $\mathbf{A}\in\mathbb{R}^{n\times n}$ with bounded entries,

$$\left\|\mathbf{A}-\mathbf{A}\circ\mathbf{S}\right\|_{2}\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1}). \qquad (2)$$

Note that for a bounded entry PSD matrix $\mathbf{A}$, we have $\|\mathbf{A}\|_{1}=\operatorname{tr}(\mathbf{A})=O(n)$, and so for PSD matrices, (1) and (2) are equivalent.

Remark 1.

Although Theorem 4 is stated for symmetric $\mathbf{A}$, one can first symmetrize $\mathbf{A}$ by considering $\mathbf{B}=\begin{bmatrix}\mathbf{0}&\mathbf{A}\\ \mathbf{A}^{T}&\mathbf{0}\end{bmatrix}$. Then for any $\mathbf{z}=(\mathbf{x},\mathbf{y})\in\mathbb{R}^{2n}$, we have $\mathbf{z}^{T}\mathbf{B}\mathbf{z}=2\mathbf{x}^{T}\mathbf{A}\mathbf{y}$, and $\|\mathbf{B}\|_{1}=2\|\mathbf{A}\|_{1}$, so applying the above theorem to $\mathbf{B}$ gives us a spectral approximation to $\mathbf{A}$.
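A quick numpy check of this symmetrization (illustrative only): the eigenvalues of $\mathbf{B}$ come in pairs $\pm\sigma_i(\mathbf{A})$, so a spectral approximation of $\mathbf{B}$ yields one for $\mathbf{A}$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.uniform(-1, 1, size=(4, 4))          # a bounded, asymmetric matrix
Z = np.zeros((4, 4))
B = np.block([[Z, A], [A.T, Z]])             # symmetrization from Remark 1

# |eigenvalues of B| list each singular value of A twice; take every other.
print(np.sort(np.abs(np.linalg.eigvalsh(B)))[::2])
print(np.sort(np.linalg.svd(A, compute_uv=False)))  # matches the line above
```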

As with Theorem 1, Theorem 4 follows from a general claim showing that sparsifying the all ones matrix to small enough error suffices to sparsify any bounded entry symmetric matrix.

Theorem 5 (Spectral Expanders are Universal Sparsifiers for Non-PSD Matrices).

Let $\mathbf{1}\in\mathbb{R}^{n\times n}$ be the all ones matrix. Let $\mathbf{S}\in\mathbb{R}^{n\times n}$ be any matrix such that $\|\mathbf{1}-\mathbf{S}\|_{2}\leq\frac{\epsilon^{2}n}{c\log^{2}(1/\epsilon)}$ for some large enough constant $c$. Then for any symmetric $\mathbf{A}\in\mathbb{R}^{n\times n}$ with bounded entries, $\|\mathbf{A}-\mathbf{A}\circ\mathbf{S}\|_{2}\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$.

Theorem 4 follows by applying Theorem 5 with $\mathbf{S}$ taken to be the scaled adjacency matrix of a Ramanujan expander graph with degree $d=O(\log^{4}(1/\epsilon)/\epsilon^{4})$.

Lower Bounds for Non-PSD Matrices. We prove that Theorem 4 is tight up to logarithmic factors. Our lower bound holds even for the easier problem of top singular value (spectral norm) approximation and against a more general class of algorithms, which non-adaptively and deterministically query entries of the input matrix. The idea is simple: since the entries read by a deterministic algorithm are fixed, we can construct two very different input instances on which the algorithm behaves identically: one which is the all ones matrix, and another which is one only on the entries read by the algorithm and zero everywhere else. We show that if the number of entries $s$ that are read is too small, the top singular value of the second instance is significantly smaller than that of the first (as compared to their Schatten-1 norms), which violates the desired error bound of $\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$. To obtain a tight bound, we apply the above construction on just a subset of the rows of the matrix on which the algorithm does not read too many entries. Here, we critically use non-adaptivity, as this subset of rows can be fixed independently of the input instance.
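The sketch below illustrates the two hard instances numerically (an assumption-laden toy: in the proof the read set is a fixed worst-case set, while here a random one stands in for it).

```python
import numpy as np

# A1 is all ones; A2 is one exactly on the s "read" coordinates. A
# non-adaptive deterministic algorithm sees identical values on both,
# yet their top singular values differ sharply when s is small.
n, s = 400, 2000
rng = np.random.default_rng(2)
idx = rng.choice(n * n, size=s, replace=False)
read = np.zeros((n, n), dtype=bool)
read[np.unravel_index(idx, (n, n))] = True

A1 = np.ones((n, n))                       # sigma_1(A1) = n
A2 = read.astype(float)                    # agrees with A1 on read entries
print(np.linalg.norm(A1, 2), np.linalg.norm(A2, 2))
```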

Theorem 6 (Non-Adaptive Query Lower Bound for Deterministic Spectral Approximation of Non-PSD Matrices).

For any $\epsilon\in(1/n^{1/4},1/4)$, any deterministic algorithm that queries entries of a bounded entry matrix $\mathbf{A}$ non-adaptively and outputs $\widetilde{\sigma}_{1}$ satisfying $|\sigma_{1}(\mathbf{A})-\widetilde{\sigma}_{1}|\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$ must read at least $\Omega\left(\frac{n}{\epsilon^{4}}\right)$ entries. Here $\sigma_{1}(\mathbf{A})=\|\mathbf{A}\|_{2}$ is the largest singular value of $\mathbf{A}$.

Observe that Theorem 4 implies the existence of a non-adaptive deterministic algorithm for the above problem with query complexity $O\left(\frac{n\log^{4}(1/\epsilon)}{\epsilon^{4}}\right)$, since $\mathbf{A}\circ\mathbf{S}$ can be computed with non-adaptive deterministic queries, and since $\|\mathbf{A}-\mathbf{A}\circ\mathbf{S}\|_{2}\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$ implies via Weyl's inequality that $|\sigma_{1}(\mathbf{A})-\sigma_{1}(\mathbf{A}\circ\mathbf{S})|\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$. Also recall that while Theorem 4 is stated for bounded entry symmetric matrices, it applies to general (possibly asymmetric) matrices by Remark 1.

Note that Theorem 6 establishes a separation between universal sparsifiers and randomized sparsification, since randomly sampling $O\left(\frac{n\log n}{\epsilon^{2}}\right)$ entries of any $\mathbf{A}\in\mathbb{R}^{n\times n}$ achieves error $\epsilon n$ by [DZ11]. In fact, for the problem of just approximating $\sigma_{1}(\mathbf{A})$ to error $\pm\epsilon n$, randomized algorithms using just $\mathrm{poly}(\log n,1/\epsilon)$ samples are known [BCJ20, BDD+23]. Theorem 6 shows that universal sparsifiers for general bounded entry matrices (and in fact all non-adaptive deterministic algorithms for spectral approximation) require a worse $1/\epsilon^{4}$ dependence and error bound of $\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$. We can extend Theorem 6 to apply to general deterministic algorithms that query $\mathbf{A}$ possibly adaptively; however, the lower bound weakens to $\Omega(n/\epsilon^{2})$. Understanding whether this gap between adaptive and non-adaptive query deterministic algorithms is real is an interesting open question.

Theorem 7 (Adaptive Query Lower Bound for Deterministic Spectral Approximation of Non-PSD Matrices).

For any $\epsilon\in(1/\sqrt{n},1/4)$, any deterministic algorithm that queries entries of a bounded entry matrix $\mathbf{A}$ (possibly adaptively) and outputs $\widetilde{\sigma}_{1}$ satisfying $|\sigma_{1}(\mathbf{A})-\widetilde{\sigma}_{1}|\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$ must read at least $\Omega\left(\frac{n}{\epsilon^{2}}\right)$ entries. Here $\sigma_{1}(\mathbf{A})=\|\mathbf{A}\|_{2}$ is the largest singular value of $\mathbf{A}$.

1.1.1 Applications to Fast Deterministic Algorithms for Linear Algebra

Given a sampling matrix $\mathbf{S}$ such that $\|\mathbf{A}-\mathbf{A}\circ\mathbf{S}\|_{2}\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$, one can use $\mathbf{A}\circ\mathbf{S}$ to approximate various linear algebraic properties of $\mathbf{A}$. For example, by Weyl's inequality [Wey12], the eigenvalues of $\mathbf{A}\circ\mathbf{S}$ approximate those of $\mathbf{A}$ up to additive error $\pm\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$. Thus, our universal sparsification results (Theorems 1, 4) immediately give the first known deterministic algorithms for approximating the eigenspectrum of a bounded entry matrix up to small additive error with sublinear entrywise query complexity. Previously, only randomized sublinear query algorithms were known [WS00, MM17, MW17, TYUC17, BCJ20, BDD+23].

Further, our results yield the first deterministic algorithms for several problems that run in $o(n^{\omega})$ time, where $\omega\approx 2.373$ is the matrix multiplication exponent [AW21]. Consider, e.g., approximating the top singular value $\sigma_{1}(\mathbf{A})$ (the spectral norm) of $\mathbf{A}$. A $(1+\epsilon)$-relative error approximation can be computed in $\widetilde{O}(n^{2}/\sqrt{\epsilon})$ time with high probability using $O(\log(n/\epsilon)/\sqrt{\epsilon})$ iterations of the Lanczos method with a random initialization [KW92, MM15]. This can be further accelerated, e.g., via randomized entrywise sparsification [DZ11], allowing an additive $\pm\epsilon n$ approximation to $\sigma_{1}(\mathbf{A})$ to be computed in $\widetilde{O}(n/\mathrm{poly}(\epsilon))$ time. However, prior to our work, no fast deterministic algorithms were known, even with coarse approximation guarantees. The fastest approach was simply to perform a full SVD of $\mathbf{A}$, requiring $\Theta(n^{\omega})$ time. In fact, this gap between randomized and deterministic methods exists for many linear algebraic problems, and resolving it is a central open question.

By combining our universal sparsification results with a derandomized power method that uses a full subspace of vectors for initialization and iteratively approximates the singular values of $\mathbf{A}\circ\mathbf{S}$ via 'deflation' [Saa11], we give, to the best of our knowledge, the first $o(n^{\omega})$ time deterministic algorithm for approximating all singular values of a bounded entry matrix $\mathbf{A}$ to small additive error.

Theorem 8 (Deterministic Singular Value Approximation).

Let $\mathbf{A}\in\mathbb{R}^{n\times n}$ be a bounded entry symmetric matrix with singular values $\sigma_{1}(\mathbf{A})\geq\sigma_{2}(\mathbf{A})\geq\ldots\geq\sigma_{n}(\mathbf{A})$. Then there exists a deterministic algorithm that, given $\epsilon\in(0,1)$, reads $\widetilde{O}\left(\frac{n}{\epsilon^{4}}\right)$ entries of $\mathbf{A}$, runs in $\widetilde{O}\left(\frac{n^{2}}{\epsilon^{8}}\right)$ time, and returns singular value approximations $\widetilde{\sigma}_{1}(\mathbf{A}),\ldots,\widetilde{\sigma}_{n}(\mathbf{A})$ satisfying, for all $i$,

$$|\sigma_{i}(\mathbf{A})-\widetilde{\sigma}_{i}(\mathbf{A})|\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1}).$$

Further, for all $i\leq 1/\epsilon$, the algorithm returns a unit vector $\mathbf{z}_{i}$ such that $\left|\|\mathbf{A}\mathbf{z}_{i}\|_{2}-\sigma_{i}(\mathbf{A})\right|\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$, and the returned vectors are all mutually orthogonal.

The runtime of Theorem 8 is stated assuming we have already constructed a sampling matrix $\mathbf{S}$ that samples $\widetilde{O}\left(\frac{n}{\epsilon^{4}}\right)$ entries of $\mathbf{A}$ and satisfies Theorem 4. It is well known that, if we let $\mathbf{S}$ be the adjacency matrix of a Ramanujan expander graph, it can indeed be constructed deterministically in $\widetilde{O}(n/\mathrm{poly}(\epsilon))$ time [Alo21]. This is lower order as compared to the runtime of $\widetilde{O}\left(\frac{n^{2}}{\epsilon^{8}}\right)$ unless $\epsilon<1/n^{c}$ for some large enough constant $c$. Further, for a fixed input size $n$ (and in fact, for a range of input sizes), $\mathbf{S}$ needs to be constructed only once – see Section 2 for further discussion.

Recall that if we further assume $\mathbf{A}$ to be PSD, then the error in Theorem 8 is bounded by $\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})\leq\epsilon n$. The sample complexity and runtime also improve by $\mathrm{poly}(1/\epsilon)$ factors due to the tighter universal sparsifier bound for PSD matrices given in Theorem 1. Also observe that while Theorem 8 gives additive error approximations to all of $\mathbf{A}$'s singular values, these approximations are only meaningful for singular values larger than $\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$, of which there are at most $1/\epsilon$. Similar additive error guarantees have been given using randomized algorithms [BDD+23, WS23]. Related bounds have also been studied in work on randomized methods for spectral density estimation [WWAF06, LSY16, BKM22].
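To make the deflation idea behind Theorem 8 concrete, here is a simplified sketch run on the sparse matrix $\mathbf{B}=\mathbf{A}\circ\mathbf{S}$. It is not the paper's algorithm: for brevity we use random starting vectors, whereas the actual method derandomizes the initialization with a full subspace of vectors and controls the error via Theorem 4.

```python
import numpy as np

def deflated_singular_values(B, k, iters=300, seed=0):
    """Approximate the top-k singular values of a symmetric matrix B by
    power iteration with deflation (simplified, randomized-start sketch)."""
    rng = np.random.default_rng(seed)
    n = B.shape[0]
    Z = np.zeros((n, 0))              # orthonormal directions found so far
    sigmas = []
    for _ in range(k):
        z = rng.standard_normal(n)
        for _ in range(iters):
            z = B @ (B @ z)           # power step on B^T B = B^2
            z -= Z @ (Z.T @ z)        # deflate previously found directions
            z /= np.linalg.norm(z)
        sigmas.append(np.linalg.norm(B @ z))   # sigma_i ~= ||B z_i||_2
        Z = np.column_stack([Z, z])
    return np.array(sigmas), Z        # values and mutually orthogonal z_i
```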

We further leverage Theorem 8 to give the first $o(n^{\omega})$ time deterministic algorithm for testing whether a bounded entry matrix is PSD or has at least one large negative eigenvalue $\leq-\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$. Recent work has focused on optimal randomized methods for this problem [BCJ20, NSW22]. We also show that, under the assumption that $\sigma_{1}(\mathbf{A})\geq\alpha\cdot\max(n,\|\mathbf{A}\|_{1})$ for some $\alpha\in(0,1)$, one can deterministically compute $\widetilde{\sigma}_{1}(\mathbf{A})$ with $|\sigma_{1}(\mathbf{A})-\widetilde{\sigma}_{1}(\mathbf{A})|\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$ in $\widetilde{O}\left(\frac{n^{2}\log(1/\epsilon)}{\mathrm{poly}(\alpha)}\right)$ time. That is, one can compute a highly accurate approximation to the top singular value in roughly linear time in the input matrix size. Again, this is the first $o(n^{\omega})$ time deterministic algorithm for this problem, and it matches the runtime of the best known randomized methods for high accuracy top singular value computation up to a $\mathrm{poly}(\log n,1/\alpha)$ factor.

1.1.2 Beyond Sparse Approximations

It is natural to ask whether it is possible to achieve better than $O(n/\epsilon^{2})$ sample complexity for spectral approximation using an algorithm that does not output a sparse approximation to $\mathbf{A}$, but can output more general data structures, allowing it to avoid the sparsity lower bound of Theorem 3. Theorems 6 and 7 already rule this out for both non-adaptive and adaptive deterministic algorithms on non-PSD matrices. However, it is known that this is possible with randomized algorithms for PSD matrices. For example, following [AKM+17], one can apply Theorem 3 of [MM17] with error parameter $\lambda=\epsilon n$. Observing that all ridge leverage score sampling probabilities are bounded by $1/\lambda$ (e.g., via Lemma 6 of the same paper and the bounded entry assumption), one can show that a Nyström approximation $\widetilde{\mathbf{A}}$ based on $\widetilde{O}(1/\epsilon)$ uniformly sampled columns satisfies $\|\mathbf{A}-\widetilde{\mathbf{A}}\|_{2}\leq\epsilon n$ with high probability. Further, $\widetilde{\mathbf{A}}$ can be constructed by reading just $\widetilde{O}(1/\epsilon)$ columns, and thus $\widetilde{O}(n/\epsilon)$ entries, of $\mathbf{A}$, giving a linear rather than quadratic dependence on $1/\epsilon$ as compared to Theorem 1. Unfortunately, derandomizing such a column-sampling-based approach seems difficult: any deterministic algorithm must read entries in $\Omega(n)$ columns of $\mathbf{A}$, as otherwise it will fail when $\mathbf{A}$ is entirely supported on the unread columns.
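The sketch below shows the bare Nyström form for uniformly sampled columns (illustrative only: the cited [MM17] result additionally regularizes and analyzes ridge leverage scores, which we omit).

```python
import numpy as np

def nystrom(A, cols):
    """Bare-bones Nystrom approximation A_tilde = C W^+ C^T, with C the
    sampled columns and W the induced principal submatrix."""
    C = A[:, cols]
    W = A[np.ix_(cols, cols)]
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(3)
n, m = 300, 60
X = rng.standard_normal((n, 20))
A = X @ X.T
A /= np.abs(A).max()                       # a bounded entry PSD matrix
cols = rng.choice(n, size=m, replace=False)
print(np.linalg.norm(A - nystrom(A, cols), 2))  # small: cols span the range
```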

Nevertheless, in the special case where $\mathbf{A}$ is PSD with entries in $\{-1,0,1\}$, we show that a spectral approximation can be obtained by deterministically reading just $\widetilde{O}(n/\epsilon)$ entries of $\mathbf{A}$.

Theorem 9 (Deterministic Spectral Approximation of Binary Magnitude PSD Matrices).

Let $\mathbf{A}\in\{-1,0,1\}^{n\times n}$ be PSD. Then for any $\epsilon\in(0,1)$, there exists a deterministic algorithm which reads $O\left(\frac{n\log n}{\epsilon}\right)$ entries of $\mathbf{A}$ and returns a PSD matrix $\widetilde{\mathbf{A}}\in\{-1,0,1\}^{n\times n}$ such that $\|\mathbf{A}-\widetilde{\mathbf{A}}\|_{2}\leq\epsilon n$.

Using Turán's theorem, we show that Theorem 9 is information-theoretically optimal up to a $\log n$ factor, even for the potentially much easier problem of eigenvalue approximation:

Theorem 10 (Deterministic Spectral Approximation of Binary PSD Matrices – Lower Bound).

Let $\mathbf{A}\in\{0,1\}^{n\times n}$ be PSD. Then for any $\epsilon\in(0,1)$, any (possibly adaptive) deterministic algorithm which approximates all eigenvalues of $\mathbf{A}$ up to $\epsilon n$ additive error must read $\Omega\left(\frac{n}{\epsilon}\right)$ entries of $\mathbf{A}$.

An interesting open question is whether $\widetilde{O}(n/\epsilon)$ sample complexity can be achieved for deterministic spectral approximation of any bounded entry PSD matrix, with an approximation $\widetilde{\mathbf{A}}$ of any form, matching what is known for randomized algorithms.

Finally, we show that the PSD assumption is critical to achieving $o(n/\epsilon^{2})$ query complexity, for both randomized and deterministic algorithms. Without this assumption, $\Omega(n/\epsilon^{2})$ queries are required, even to achieve (1) with constant probability for a single input $\mathbf{A}$. For bounded entry matrices, the randomized element-wise sparsification algorithms of [AM07, DZ11] read just $\widetilde{O}(n/\epsilon^{2})$ entries of $\mathbf{A}$, and so our lower bounds are the first to show near-optimality of the number of entries read by these algorithms, which may be of independent interest.

Theorem 11 (Lower Bound for Randomized Spectral Approximation).

Any randomized algorithm that (possibly adaptively) reads entries of a binary matrix $\mathbf{A}\in\{0,1\}^{n\times n}$ to construct a data structure $f:\mathbb{R}^{n}\times\mathbb{R}^{n}\rightarrow\mathbb{R}$ that, for $\epsilon\in(0,1)$, satisfies with probability at least $99/100$,

$$|f(\mathbf{x},\mathbf{y})-\mathbf{x}^{T}\mathbf{A}\mathbf{y}|\leq\epsilon n, \quad\text{ for all unit }\mathbf{x},\mathbf{y}\in\mathbb{R}^{n},$$

must read $\Omega\left(\frac{n}{\epsilon^{2}}\right)$ entries of $\mathbf{A}$ in the worst case, provided $\epsilon=\Omega\left(\frac{\log n}{\sqrt{n}}\right)$.

1.1.3 Relation to Spectral Graph Sparsification

We remark that while our work focuses on general bounded entry symmetric (PSD) matrices, when $\mathbf{A}$ is a PSD graph Laplacian matrix, it is possible to construct a sparsifier $\widetilde{\mathbf{A}}$ with just $\widetilde{O}(n/\epsilon^{2})$ entries that achieves a relative error spectral approximation guarantee, which can be much stronger than the additive error guarantee of (1) [BSST13]. In particular, one can achieve $(1-\epsilon)\mathbf{x}^{T}\mathbf{A}\mathbf{x}\leq\mathbf{x}^{T}\widetilde{\mathbf{A}}\mathbf{x}\leq(1+\epsilon)\mathbf{x}^{T}\mathbf{A}\mathbf{x}$ for all $\mathbf{x}\in\mathbb{R}^{n}$. To achieve such a guarantee, however, it is not hard to see that the set of entries sampled in $\widetilde{\mathbf{A}}$ must depend on $\mathbf{A}$, and so cannot be universal. Fast randomized algorithms for constructing $\widetilde{\mathbf{A}}$ have been studied extensively [ST04, SS08, BSST13]. Recent work has also made progress towards developing fast deterministic algorithms [CGL+20].

1.2 Road Map

We introduce notation in Section 2. In Section 3, we prove our main universal sparsification results for PSD and non-PSD matrices (Theorems 1 and 4). In Section 4, we prove nearly matching lower bounds (Theorems 3 and 6). We also prove Theorems 7 and 11, which give lower bounds for general deterministic (possibly adaptive) spectral approximation algorithms and general randomized algorithms, respectively. In Section 5, we show how to give tighter deterministic spectral approximation results for PSD matrices with entries in $\{-1,0,1\}$. Finally, in Section 6, we discuss applications of our results to fast deterministic algorithms for singular value and vector approximation.

2 Notation and Preliminaries

We start by defining notation used throughout. For any integer $n$, let $[n]$ denote the set $\{1,2,\ldots,n\}$.

Matrices and Vectors. Matrices are represented with bold uppercase literals, e.g., $\mathbf{A}$. Vectors are represented with bold lowercase literals, e.g., $\mathbf{x}$. $\mathbf{1}$ and $\mathbf{0}$ denote all ones (resp. all zeros) matrices or vectors; the sizes of these matrices vary based on context. The identity matrix is denoted by $\mathbf{I}$. For a vector $\mathbf{x}$, $\mathbf{x}(j)$ denotes its $j^{th}$ entry. For a matrix $\mathbf{A}$, $\mathbf{A}_{ij}$ denotes the entry in the $i^{th}$ row and $j^{th}$ column. For a vector $\mathbf{x}$ (or matrix $\mathbf{A}$), $\mathbf{x}^{T}$ (resp. $\mathbf{A}^{T}$) denotes its transpose. For two matrices $\mathbf{A},\mathbf{B}$ of the same size, $\mathbf{A}\circ\mathbf{B}$ denotes the entrywise (Hadamard) product.

Matrix Norms and Properties. For a vector $\mathbf{x}$, $\|\mathbf{x}\|_{2}$ denotes its Euclidean norm and $\|\mathbf{x}\|_{1}=\sum_{i=1}^{n}|\mathbf{x}(i)|$ denotes its $\ell_{1}$ norm. We denote the eigenvalues of a symmetric matrix $\mathbf{A}$ as $\lambda_{1}(\mathbf{A})\geq\lambda_{2}(\mathbf{A})\geq\ldots\geq\lambda_{n}(\mathbf{A})$, in decreasing order. A symmetric matrix is positive semidefinite (PSD) if $\lambda_{i}(\mathbf{A})\geq 0$ for all $i\in[n]$. The singular values of a matrix $\mathbf{A}$ are denoted $\sigma_{1}(\mathbf{A})\geq\sigma_{2}(\mathbf{A})\geq\ldots\geq\sigma_{n}(\mathbf{A})\geq 0$, in decreasing order. We let $\|\mathbf{A}\|_{2}=\max_{\mathbf{x}}\frac{\|\mathbf{A}\mathbf{x}\|_{2}}{\|\mathbf{x}\|_{2}}=\sigma_{1}(\mathbf{A})$ denote the spectral norm, $\|\mathbf{A}\|_{\infty}$ denote the largest magnitude of an entry, $\|\mathbf{A}\|_{F}=\left(\sum_{i,j}\mathbf{A}_{ij}^{2}\right)^{1/2}$ denote the Frobenius norm, and $\|\mathbf{A}\|_{1}=\sum_{i=1}^{n}\sigma_{i}(\mathbf{A})$ denote the Schatten-1 norm (also called the trace norm or nuclear norm).

Expander Graphs. Our universal sparsifier constructions are based on Ramanujan expander graphs [LPS88, Mar73, Alo86], defined below.

Definition 2.1 (Ramanujan Expander Graphs [LPS88]).

Let $G$ be a connected $d$-regular, unweighted, undirected graph on $n$ vertices. Let $\mathbf{G}\in\{0,1\}^{n\times n}$ be the adjacency matrix corresponding to $G$ and $\lambda_{i}$ be its $i^{th}$ eigenvalue, such that $\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{n}$. Then $G$ is called a Ramanujan graph if:

$$|\lambda_{i}|\leq 2\sqrt{d-1}, \quad\text{for all } i>1.$$

Equivalently, letting $\mathbf{1}$ be the $n\times n$ all ones matrix, $\left\|\mathbf{1}-\frac{n}{d}\mathbf{G}\right\|_{2}\leq\frac{n}{d}\cdot 2\sqrt{d-1}$.

Efficient Construction of Ramanujan Graphs. Significant work has studied efficient constructions for Ramanujan graphs [Mar73, LPS88, Mor94, Coh16], and nearly linear time constructions, called strongly explicit constructions [Alo21], have been proposed. E.g., by Proposition 1.1 of [Alo21], we can construct, for any $n$ and $d$, a graph on $n(1+o(1))$ vertices with second eigenvalue at most $(2+o(1))\sqrt{d}$ in $\widetilde{O}(nd+\mathrm{poly}(d))$ time. Additionally, in our applications, the expander needs to be constructed just once and can then be used for any input of size $n$. In fact, a single expander can be used for a range of input sizes, by the argument below.

Though the size of the expander above is $(1+o(1))n$ instead of exactly $n$, and its expansion does not exactly hit the tight Ramanujan bound of $2\sqrt{d-1}$, we can let our input matrix $\mathbf{A}$ be the top $n\times n$ principal submatrix of a slightly larger $n(1+o(1))\times n(1+o(1))$ matrix $\mathbf{A}'$ that is zero everywhere else. Then, the top $n\times n$ principal submatrix of a spectral approximation to $\mathbf{A}'$ is a spectral approximation to $\mathbf{A}$. Thus, obtaining a spectral approximation to $\mathbf{A}'$ via a Ramanujan sparsifier on $n(1+o(1))$ vertices suffices. Similarly, in our applications, we can adjust $d=1/\mathrm{poly}(\epsilon)$ by at most a constant factor to account for the $(2+o(1))\sqrt{d}$ bound on the second eigenvalue, rather than the tight Ramanujan bound of $2\sqrt{d-1}$.
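A short illustrative sketch of this padding argument (names and shapes are our own, for exposition):

```python
import numpy as np

def pad_and_sparsify(A, S_big):
    """Embed A as the top principal submatrix of a zero matrix matching
    the expander's n(1 + o(1)) vertices, sparsify, keep the n x n block."""
    n, n_big = A.shape[0], S_big.shape[0]
    A_big = np.zeros((n_big, n_big))
    A_big[:n, :n] = A                 # A' is zero outside the top block
    return (A_big * S_big)[:n, :n]    # top block of the sparsified A'
```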

3 Universal Sparsifier Upper Bounds

We now prove our main results on universal sparsifiers for PSD matrices (Theorem 1) and general symmetric matrices (Theorem 4). Both theorems follow from general reductions which show that any sampling matrix $\mathbf{S}$ that sparsifies the all ones matrix to sufficient accuracy (i.e., is a sufficiently good spectral expander) yields a universal sparsifier. We prove the reduction for the PSD case in Section 3.1 and then extend it to the non-PSD case in Section 3.2.

3.1 Universal Sparsifiers for PSD Matrices

In the PSD case, we prove the following:

Theorem 2 (restated). Let $\mathbf{1}\in\mathbb{R}^{n\times n}$ be the all ones matrix and let $\mathbf{S}\in\mathbb{R}^{n\times n}$ be any matrix such that $\|\mathbf{1}-\mathbf{S}\|_{2}\leq\epsilon n$. Then for any PSD matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$ with bounded entries, $\|\mathbf{A}-\mathbf{A}\circ\mathbf{S}\|_{2}\leq\epsilon n$.

Theorem 1 follows directly from Theorem 2 by letting $\mathbf{S}$ be the scaled adjacency matrix of a Ramanujan expander graph with degree $d=O(1/\epsilon^{2})$ (Definition 2.1).

Proof of Theorem 2.

To prove the theorem, it suffices to show that for any $\mathbf{x}\in\mathbb{R}^{n}$ with $\|\mathbf{x}\|_{2}=1$, $|\mathbf{x}^{T}\mathbf{A}\mathbf{x}-\mathbf{x}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}|\leq\epsilon n$. Let $\mathbf{v}_{1},\ldots,\mathbf{v}_{n}$ and $\lambda_{1},\ldots,\lambda_{n}$ be the eigenvectors and eigenvalues of $\mathbf{A}$, so that $\mathbf{A}=\sum_{i=1}^{n}\lambda_{i}\mathbf{v}_{i}\mathbf{v}_{i}^{T}$. Then we can expand out this error as:

$$|\mathbf{x}^{T}\mathbf{A}\mathbf{x}-\mathbf{x}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}| = \left|\sum_{i=1}^{n}\lambda_{i}\cdot\mathbf{x}^{T}\mathbf{v}_{i}\mathbf{v}_{i}^{T}\mathbf{x}-\sum_{i=1}^{n}\lambda_{i}\mathbf{x}^{T}(\mathbf{v}_{i}\mathbf{v}_{i}^{T}\circ\mathbf{S})\mathbf{x}\right|$$
$$= \left|\sum_{i=1}^{n}\lambda_{i}\mathbf{x}^{T}[\mathbf{v}_{i}\mathbf{v}_{i}^{T}\circ(\mathbf{1}-\mathbf{S})]\mathbf{x}\right|,$$

where we use that for any matrix $\mathbf{M}$, $\mathbf{M}\circ\mathbf{1}=\mathbf{M}$. Now observe that if we let $\mathbf{D}_{i}\in\mathbb{R}^{n\times n}$ be a diagonal matrix with the entries of $\mathbf{v}_{i}$ on its diagonal, then we can write $\mathbf{v}_{i}\mathbf{v}_{i}^{T}\circ(\mathbf{1}-\mathbf{S})=\mathbf{D}_{i}(\mathbf{1}-\mathbf{S})\mathbf{D}_{i}$. Plugging this back in, we have:

$$|\mathbf{x}^{T}\mathbf{A}\mathbf{x}-\mathbf{x}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}| = \left|\sum_{i=1}^{n}\lambda_{i}\mathbf{x}^{T}\mathbf{D}_{i}(\mathbf{1}-\mathbf{S})\mathbf{D}_{i}\mathbf{x}\right|$$
$$\leq \sum_{i=1}^{n}\lambda_{i}\|\mathbf{1}-\mathbf{S}\|_{2}\cdot\mathbf{x}^{T}\mathbf{D}_{i}^{2}\mathbf{x}$$
$$\leq \epsilon n\cdot\sum_{i=1}^{n}\lambda_{i}\mathbf{x}^{T}\mathbf{D}_{i}^{2}\mathbf{x}. \qquad (3)$$

In the second line we use that $\mathbf{A}$ is PSD, and thus $\lambda_{i}$ is non-negative for all $i$. We also use that $|\mathbf{x}^{T}\mathbf{D}_{i}(\mathbf{1}-\mathbf{S})\mathbf{D}_{i}\mathbf{x}|\leq\|\mathbf{1}-\mathbf{S}\|_{2}\cdot\|\mathbf{D}_{i}\mathbf{x}\|_{2}^{2}=\|\mathbf{1}-\mathbf{S}\|_{2}\cdot\mathbf{x}^{T}\mathbf{D}_{i}^{2}\mathbf{x}$. In the third line we bound $\|\mathbf{1}-\mathbf{S}\|_{2}\leq\epsilon n$ using the assumption of the theorem statement.

Finally, writing $\mathbf{x}^{T}\mathbf{D}_{i}^{2}\mathbf{x}=\sum_{j=1}^{n}\mathbf{x}(j)^{2}\mathbf{v}_{i}(j)^{2}$ and plugging back into (3), we have:

$$|\mathbf{x}^{T}\mathbf{A}\mathbf{x}-\mathbf{x}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}| \leq \epsilon n\cdot\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_{i}\mathbf{x}(j)^{2}\mathbf{v}_{i}(j)^{2}$$
$$= \epsilon n\cdot\sum_{j=1}^{n}\mathbf{x}(j)^{2}\sum_{i=1}^{n}\lambda_{i}\mathbf{v}_{i}(j)^{2}$$
$$= \epsilon n\cdot\sum_{j=1}^{n}\mathbf{x}(j)^{2}\mathbf{A}_{jj}$$
$$\leq \epsilon n,$$

where in the last step we use that $\mathbf{A}_{jj}\leq 1$ by our bounded entry assumption, and that $\mathbf{x}$ is a unit vector. This completes the theorem. ∎
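As a numerical sanity check of Theorem 2 (illustrative only), we can reuse a random $d$-regular graph as a near-Ramanujan stand-in, as in the sketch of Section 1.1, and test the guarantee on a bounded entry PSD matrix:

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(4)
n, d = 1000, 100
S = (n / d) * nx.to_numpy_array(nx.random_regular_graph(d, n, seed=0))
eps = np.linalg.norm(np.ones((n, n)) - S, 2) / n   # ||1 - S||_2 = eps * n

X = rng.uniform(-1, 1, size=(n, n))
A = X @ X.T
A /= np.abs(A).max()                # a bounded entry PSD matrix
# Theorem 2 guarantees the left value is at most the right value:
print(np.linalg.norm(A - A * S, 2), eps * n)
```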

Remark 2.

Note that we can state a potentially stronger bound for Theorem 2 by observing that, in the second to last step, $|\mathbf{x}^{T}\mathbf{A}\mathbf{x}-\mathbf{x}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}|\leq\epsilon n\cdot\sum_{j=1}^{n}\mathbf{x}(j)^{2}\mathbf{A}_{jj}=\epsilon n\cdot\mathbf{x}^{T}\mathbf{D}\mathbf{x}$, where $\mathbf{D}$ is a diagonal matrix containing the diagonal elements of $\mathbf{A}$. That is, letting $\mathbf{M}\preceq\mathbf{N}$ denote that $\mathbf{N}-\mathbf{M}$ is PSD, we have the following spectral approximation bound: $-\epsilon n\cdot\mathbf{D}\preceq\mathbf{A}-(\mathbf{A}\circ\mathbf{S})\preceq\epsilon n\cdot\mathbf{D}$.

3.2 Universal Sparsifiers for Non-PSD Matrices

We next extend the above approach to give a similar reduction for universal sparsification of general symmetric matrices.

Theorem 5 (restated). Let $\mathbf{1}\in\mathbb{R}^{n\times n}$ be the all ones matrix and let $\mathbf{S}\in\mathbb{R}^{n\times n}$ be any matrix such that $\|\mathbf{1}-\mathbf{S}\|_{2}\leq\frac{\epsilon^{2}n}{c\log^{2}(1/\epsilon)}$ for some large enough constant $c$. Then for any symmetric $\mathbf{A}\in\mathbb{R}^{n\times n}$ with bounded entries, $\|\mathbf{A}-\mathbf{A}\circ\mathbf{S}\|_{2}\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$.

Observe that, as compared to the PSD case (Theorem 2), here we require that $\mathbf{S}$ give a stronger approximation to the all ones matrix. Theorem 4 follows directly by letting $\mathbf{S}$ be the scaled adjacency matrix of a Ramanujan expander graph with degree $d=O\left(\frac{\log^{4}(1/\epsilon)}{\epsilon^{4}}\right)$ (Definition 2.1).

Proof of Theorem 5.

To prove the theorem, it suffices to show that for any $\mathbf{x}\in\mathbb{R}^{n}$ with $\|\mathbf{x}\|_{2}=1$, $|\mathbf{x}^{T}\mathbf{A}\mathbf{x}-\mathbf{x}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}|\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})$. We will split the entries of $\mathbf{x}$ into level sets according to their magnitude. In particular, let $\bar{\epsilon}=\frac{\epsilon}{\log_{2}(1/\epsilon)}$ and let $\ell=\log_{2}(1/\bar{\epsilon})$. Then we can write $\mathbf{x}=\sum_{i=0}^{\ell+1}\mathbf{x}_{i}$, such that for any $t\in[n]$:

$$\mathbf{x}_{i}(t)=\begin{cases}\mathbf{x}(t)&\text{if }i\in\{1,2,\ldots,\ell\}\text{ and }|\mathbf{x}(t)|\in\left(\frac{2^{i-1}}{\sqrt{n}},\frac{2^{i}}{\sqrt{n}}\right]\\ \mathbf{x}(t)&\text{if }i=0\text{ and }|\mathbf{x}(t)|\in\left[0,\frac{1}{\sqrt{n}}\right]\\ \mathbf{x}(t)&\text{if }i=\ell+1\text{ and }|\mathbf{x}(t)|\in\left(\frac{1}{\bar{\epsilon}\sqrt{n}},1\right]\\ 0&\text{otherwise}.\end{cases} \qquad (4)$$

By the triangle inequality, we have:

$$\left|\mathbf{x}^{T}\mathbf{A}\mathbf{x}-\mathbf{x}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}\right| \leq \sum_{i=0}^{\ell+1}\left|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}-\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}\right|. \qquad (5)$$

We will bound each term in (5) by $O(\bar{\epsilon}\cdot\max(n,\|\mathbf{A}\|_{1}))=O\left(\frac{\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})}{\log(1/\epsilon)}\right)=O\left(\frac{\epsilon\cdot\max(n,\|\mathbf{A}\|_{1})}{\ell+2}\right)$. Summing over all $\ell+2$ terms, we achieve a final bound of $O(\epsilon\cdot\max(n,\|\mathbf{A}\|_{1}))$. The theorem then follows by adjusting $\epsilon$ by a constant factor.

Fixing $i\in\{0,\ldots,\ell\}$, let $\mathbf{x}_{L}(t)=\mathbf{x}(t)$ if $|\mathbf{x}(t)|\leq\frac{1}{2^{i}\bar{\epsilon}\sqrt{n}}$ and $\mathbf{x}_{L}(t)=0$ otherwise. Let $\mathbf{x}_{H}(t)=\mathbf{x}(t)$ if $|\mathbf{x}(t)|>\frac{1}{2^{i}\bar{\epsilon}\sqrt{n}}$ and $\mathbf{x}_{H}(t)=0$ otherwise. In the edge case $i=\ell+1$, let $\mathbf{x}_{H}=\mathbf{x}$ and $\mathbf{x}_{L}=\mathbf{0}$. Writing $\mathbf{x}=\mathbf{x}_{H}+\mathbf{x}_{L}$ and applying the triangle inequality, we can bound:

$$\left|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}-\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}\right|\leq\left|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{L}-\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{L}\right|+\left|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{H}-\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H}\right|. \qquad (6)$$

In the following, we separately bound the two terms in (6). Roughly, since $\mathbf{x}_{L}$ has relatively small entries and thus is relatively well spread, we will be able to show, using a similar approach to Theorem 2, that the first term is small, since sparsification with $\mathbf{S}$ approximately preserves $\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{L}$. On the other hand, since $\mathbf{x}_{H}$ has relatively large entries and thus is relatively sparse, we can show that the second term is small, since $\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{H}$ is small (and $\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H}$ cannot be much larger).

Term 1: Well-Spread Vectors. We start by bounding $\left|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{L}-\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{L}\right|$. Let $\mathbf{v}_{1},\ldots,\mathbf{v}_{n}$ and $\lambda_{1},\ldots,\lambda_{n}$ be the eigenvectors and eigenvalues of $\mathbf{A}$, so that $\mathbf{A}=\sum_{k=1}^{n}\lambda_{k}\mathbf{v}_{k}\mathbf{v}_{k}^{T}$. Then we write:

$$|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{L}-\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{L}| = \left|\sum_{k=1}^{n}\lambda_{k}\cdot\mathbf{x}_{i}^{T}\mathbf{v}_{k}\mathbf{v}_{k}^{T}\mathbf{x}_{L}-\sum_{k=1}^{n}\lambda_{k}\mathbf{x}_{i}^{T}(\mathbf{v}_{k}\mathbf{v}_{k}^{T}\circ\mathbf{S})\mathbf{x}_{L}\right|$$
$$\leq \sum_{k=1}^{n}|\lambda_{k}|\cdot\left|\mathbf{x}_{i}^{T}[\mathbf{v}_{k}\mathbf{v}_{k}^{T}\circ(\mathbf{1}-\mathbf{S})]\mathbf{x}_{L}\right|.$$

As in the proof of Theorem 2, if we let $\mathbf{D}_{k}\in\mathbb{R}^{n\times n}$ be a diagonal matrix with the entries of $\mathbf{v}_{k}$ on its diagonal, then we have $\mathbf{v}_{k}\mathbf{v}_{k}^{T}\circ(\mathbf{1}-\mathbf{S})=\mathbf{D}_{k}(\mathbf{1}-\mathbf{S})\mathbf{D}_{k}$. Further, $\mathbf{x}_{i}^{T}\mathbf{D}_{k}(\mathbf{1}-\mathbf{S})\mathbf{D}_{k}\mathbf{x}_{L}=\mathbf{v}_{k}^{T}\mathbf{D}_{i}(\mathbf{1}-\mathbf{S})\mathbf{D}_{L}\mathbf{v}_{k}$, where $\mathbf{D}_{i},\mathbf{D}_{L}$ are diagonal with the entries of $\mathbf{x}_{i}$ and $\mathbf{x}_{L}$ on their diagonals. Plugging this back in, we have:

$$|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{L}-\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{L}| \leq \sum_{k=1}^{n}|\lambda_{k}|\cdot\left|\mathbf{v}_{k}^{T}\mathbf{D}_{i}(\mathbf{1}-\mathbf{S})\mathbf{D}_{L}\mathbf{v}_{k}\right|$$
$$\leq \|\mathbf{1}-\mathbf{S}\|_{2}\cdot\sum_{k=1}^{n}|\lambda_{k}|\cdot\|\mathbf{v}_{k}\|_{2}\cdot\|\mathbf{D}_{i}\|_{2}\cdot\|\mathbf{D}_{L}\|_{2}\cdot\|\mathbf{v}_{k}\|_{2}$$
$$\leq \|\mathbf{1}-\mathbf{S}\|_{2}\cdot\sum_{k=1}^{n}|\lambda_{k}|\cdot\|\mathbf{D}_{i}\|_{2}\cdot\|\mathbf{D}_{L}\|_{2}. \qquad (7)$$

Finally, observe that by definition, for any $i\in\{0,\ldots,\ell\}$, $\|\mathbf{D}_{i}\|_{2}=\max_{t\in[n]}|\mathbf{x}_{i}(t)|\leq\frac{2^{i}}{\sqrt{n}}$ and $\|\mathbf{D}_{L}\|_{2}=\max_{t\in[n]}|\mathbf{x}_{L}(t)|\leq\frac{1}{2^{i}\bar{\epsilon}\sqrt{n}}$. Thus, $\|\mathbf{D}_{i}\|_{2}\cdot\|\mathbf{D}_{L}\|_{2}\leq\frac{1}{\bar{\epsilon}n}$. For $i=\ell+1$, $\mathbf{x}_{L}=\mathbf{0}$ by definition, and so the bound $\|\mathbf{D}_{i}\|_{2}\cdot\|\mathbf{D}_{L}\|_{2}\leq\frac{1}{\bar{\epsilon}n}$ holds vacuously.

Also, by the assumption of the theorem, $\|\mathbf{1}-\mathbf{S}\|_{2}\leq\frac{\epsilon^{2}n}{c\log^{2}(1/\epsilon)}\leq\bar{\epsilon}^{2}n$. Plugging back into (7),

$$|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{L}-\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{L}|\leq\bar{\epsilon}^{2}n\cdot\frac{1}{\bar{\epsilon}n}\cdot\|\mathbf{A}\|_{1}\leq\bar{\epsilon}\cdot\|\mathbf{A}\|_{1}, \qquad (8)$$

as required.

Term 2: Sparse Vectors. We next bound the second term of (6), $\left|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{H}-\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H}\right|$. We write $\mathbf{x}_{i}=\mathbf{x}_{i,P}+\mathbf{x}_{i,N}$, where $\mathbf{x}_{i,P}$ and $\mathbf{x}_{i,N}$ contain its positive and non-positive entries, respectively. Similarly, write $\mathbf{x}_{H}=\mathbf{x}_{H,P}+\mathbf{x}_{H,N}$. We can then bound via the triangle inequality:

$$|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{H}-\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H}| \leq |\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{H}|+|\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H}|$$
$$\leq |\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{H}|+|\mathbf{x}_{i,P}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H,P}|+|\mathbf{x}_{i,P}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H,N}|+|\mathbf{x}_{i,N}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H,P}|+|\mathbf{x}_{i,N}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H,N}|. \qquad (9)$$

We will bound each term in (9) by $O(\bar{\epsilon}n)$, giving that overall $|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{H}-\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H}|=O(\bar{\epsilon}n)$. We first observe that

$$|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{H}|\leq\|\mathbf{x}_{i}\|_{1}\cdot\|\mathbf{x}_{H}\|_{1}\cdot\|\mathbf{A}\|_{\infty}.$$

By assumption, $\|\mathbf{A}\|_{\infty}\leq 1$. Further, for $i\in\{1,\ldots,\ell+1\}$, since $|\mathbf{x}_{i}(t)|\geq\frac{2^{i-1}}{\sqrt{n}}$ for all $t$ in the support of $\mathbf{x}_{i}$ and since $\|\mathbf{x}_{i}\|_{2}\leq 1$, $\mathbf{x}_{i}$ has at most $\frac{n}{2^{2i-2}}$ non-zero entries. Thus, $\|\mathbf{x}_{i}\|_{1}\leq\frac{\sqrt{n}}{2^{i-1}}$. For $i=0$, this bound holds trivially, since $\|\mathbf{x}_{i}\|_{1}\leq\sqrt{n}\cdot\|\mathbf{x}_{i}\|_{2}\leq\sqrt{n}\leq\frac{\sqrt{n}}{2^{i-1}}$.

Similarly, for $i\in\{0,\ldots,\ell\}$, since $\|\mathbf{x}_{H}\|_{2}\leq 1$ and since $|\mathbf{x}_{H}(t)|\geq\frac{1}{2^{i}\bar{\epsilon}\sqrt{n}}$ on its support by definition, $\mathbf{x}_{H}$ has at most $2^{2i}\bar{\epsilon}^{2}n$ non-zero entries. So $\|\mathbf{x}_{H}\|_{1}\leq 2^{i}\bar{\epsilon}\cdot\sqrt{n}$. For $i=\ell+1=\log_{2}(1/\bar{\epsilon})+1$, this bound holds trivially, since $\|\mathbf{x}_{H}\|_{1}\leq\sqrt{n}\cdot\|\mathbf{x}_{H}\|_{2}\leq 2^{i}\bar{\epsilon}\cdot\sqrt{n}$. Putting these all together, we have

$$|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{H}|\leq\|\mathbf{x}_{i}\|_{1}\cdot\|\mathbf{x}_{H}\|_{1}\cdot\|\mathbf{A}\|_{\infty}\leq\frac{\sqrt{n}}{2^{i-1}}\cdot 2^{i}\bar{\epsilon}\cdot\sqrt{n}=2\bar{\epsilon}n. \qquad (10)$$

We next bound $|\mathbf{x}_{i,P}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H,P}|$. Since $\mathbf{x}_{i,P}$ and $\mathbf{x}_{H,P}$ are both entrywise non-negative vectors, and since by assumption $\|\mathbf{A}\|_{\infty}\leq 1$,

$$|\mathbf{x}_{i,P}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H,P}| \leq |\mathbf{x}_{i,P}^{T}\mathbf{S}\mathbf{x}_{H,P}|$$
$$\leq |\mathbf{x}_{i,P}^{T}\mathbf{1}\mathbf{x}_{H,P}|+|\mathbf{x}_{i,P}^{T}(\mathbf{1}-\mathbf{S})\mathbf{x}_{H,P}|$$
$$\leq \|\mathbf{x}_{i,P}\|_{1}\cdot\|\mathbf{x}_{H,P}\|_{1}\cdot\|\mathbf{1}\|_{\infty}+\|\mathbf{x}_{i,P}\|_{2}\cdot\|\mathbf{x}_{H,P}\|_{2}\cdot\|\mathbf{1}-\mathbf{S}\|_{2}. \qquad (11)$$

Following the same argument used to show (10), $\|\mathbf{x}_{i,P}\|_{1}\cdot\|\mathbf{x}_{H,P}\|_{1}\cdot\|\mathbf{1}\|_{\infty}\leq 2\bar{\epsilon}n$. Further, since by the assumption of the theorem $\|\mathbf{1}-\mathbf{S}\|_{2}\leq\frac{\epsilon^{2}n}{c\log^{2}(1/\epsilon)}\leq\bar{\epsilon}^{2}n\leq\bar{\epsilon}n$, and since $\mathbf{x}_{i,P}$ and $\mathbf{x}_{H,P}$ are both at most unit norm, $\|\mathbf{x}_{i,P}\|_{2}\cdot\|\mathbf{x}_{H,P}\|_{2}\cdot\|\mathbf{1}-\mathbf{S}\|_{2}\leq\bar{\epsilon}n$. Thus, plugging back into (11), we have $|\mathbf{x}_{i,P}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H,P}|\leq 3\bar{\epsilon}n$. Identical bounds hold for the remaining three terms of (9), since for each, the two vectors in the quadratic form have entries that either always match or never match in sign. Plugging these bounds and (10) back into (9), we obtain

$$|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}_{H}-\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}_{H}|\leq 2\bar{\epsilon}n+4\cdot 3\bar{\epsilon}n\leq 14\bar{\epsilon}n, \qquad (12)$$

which completes the required bound in the sparse case.

Concluding the Proof. We finally plug our bounds for the sparse case (12) and the well-spread case (8) into (6) to obtain:

$$\left|\mathbf{x}_{i}^{T}\mathbf{A}\mathbf{x}-\mathbf{x}_{i}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}\right|\leq 14\bar{\epsilon}n+\bar{\epsilon}\|\mathbf{A}\|_{1}\leq 15\bar{\epsilon}\cdot\max(n,\|\mathbf{A}\|_{1}).$$

Plugging this bound into (5) gives $|\mathbf{x}^{T}\mathbf{A}\mathbf{x}-\mathbf{x}^{T}(\mathbf{A}\circ\mathbf{S})\mathbf{x}|=O(\epsilon\cdot\max(n,\|\mathbf{A}\|_{1}))$, which completes the theorem after adjusting $\epsilon$ by a constant factor. ∎

4 Spectral Approximation Lower Bounds

We now show that our universal sparsifier upper bounds for both PSD matrices (Theorem 1) and non-PSD matrices (Theorem 4) are nearly tight. We also give $\Omega(n/\epsilon^{2})$ query lower bounds against general deterministic (possibly adaptive) spectral approximation algorithms (Theorem 7) and general randomized spectral approximation algorithms (Theorem 11).

4.1 Sparsity Lower Bound for PSD Matrices

We first prove that every matrix which is an $\epsilon n$ spectral approximation to the all-ones matrix must have $\Omega\left(\frac{n}{\epsilon^{2}}\right)$ non-zero entries. This shows that Theorem 1 is optimal up to constant factors, even for algorithms that sparsify just a single bounded entry PSD matrix. The idea of the lower bound is simple: if a matrix $\mathbf{S}$ spectrally approximates the all-ones matrix, its entries must sum to $\Omega(n^{2})$. Thus, if $\mathbf{S}$ has just $s$ non-zero entries, it must have Frobenius norm at least $\Omega(n^{2}/\sqrt{s})$. Unless $s=\Omega(n/\epsilon^{2})$, this Frobenius norm is too large for $\mathbf{S}$ to be an $\epsilon n$-spectral approximation of the all ones matrix (which has $\|\mathbf{1}\|_{F}=n$).

See Theorem 3

Proof.

If we let 𝐱n\mathbf{x}\in\mathbb{R}^{n} be the all ones vector, then since 𝐱22=n\|\mathbf{x}\|_{2}^{2}=n, 𝟏𝐒2ϵn\|\mathbf{1}-\mathbf{S}\|_{2}\leq\epsilon n implies that |𝐱T𝐒𝐱𝐱T𝟏𝐱|=|𝐱T𝐒𝐱n2|ϵn2|\mathbf{x}^{T}\mathbf{S}\mathbf{x}-\mathbf{x}^{T}\mathbf{1}\mathbf{x}|=|\mathbf{x}^{T}\mathbf{S}\mathbf{x}-n^{2}|\leq\epsilon n^{2}. This in turn implies:

i,j[n]|𝐒ij|i,j[n]𝐒ijn2ϵn2n22,\displaystyle\sum_{i,j\in[n]}|\mathbf{S}_{ij}|\geq\sum_{i,j\in[n]}\mathbf{S}_{ij}\geq n^{2}-\epsilon n^{2}\geq\frac{n^{2}}{2},

where the last inequality follows from the assumption that ϵ12\epsilon\leq\frac{1}{2}. If 𝐒\mathbf{S} has ss non-zero entries, since i,j[n]|𝐒ij|\sum_{i,j\in[n]}|\mathbf{S}_{ij}| is the 1\ell_{1}-norm of the entries, and 𝐒F\|\mathbf{S}\|_{F} is the 2\ell_{2}-norm of the entries, we conclude by 12\ell_{1}-\ell_{2} norm equivalence that 𝐒F1si,j[n]|𝐒ij|n22s\|\mathbf{S}\|_{F}\geq\frac{1}{\sqrt{s}}\sum_{i,j\in[n]}|\mathbf{S}_{ij}|\geq\frac{n^{2}}{2\sqrt{s}}. Therefore, by the triangle inequality,

\displaystyle\|\mathbf{1}-\mathbf{S}\|_{F}\geq\|\mathbf{S}\|_{F}-\|\mathbf{1}\|_{F}\geq\frac{n^{2}}{2\sqrt{s}}-n\geq\frac{n^{2}}{4\sqrt{s}},

as long as sn216s\leq\frac{n^{2}}{16} so that n22s2n\frac{n^{2}}{2\sqrt{s}}\geq 2n. Now, using that 𝐀F2=i=1nσi2(𝐀)n𝐀22\|\mathbf{A}\|_{F}^{2}=\sum_{i=1}^{n}\sigma_{i}^{2}(\mathbf{A})\leq n\cdot\|\mathbf{A}\|_{2}^{2}, we have:

𝟏𝐒Fn24s𝟏𝐒F2n416s𝟏𝐒22n316s𝟏𝐒2n3/24s.\displaystyle\|\mathbf{1}-\mathbf{S}\|_{F}\geq\frac{n^{2}}{4\sqrt{s}}\implies\|\mathbf{1}-\mathbf{S}\|_{F}^{2}\geq\frac{n^{4}}{16s}\implies\|\mathbf{1}-\mathbf{S}\|_{2}^{2}\geq\frac{n^{3}}{16s}\implies\|\mathbf{1}-\mathbf{S}\|_{2}\geq\frac{n^{3/2}}{4\sqrt{s}}.

By our assumption that 𝟏𝐒2ϵn\|\mathbf{1}-\mathbf{S}\|_{2}\leq\epsilon n, this means that s=Ω(nϵ2)s=\Omega\left(\frac{n}{\epsilon^{2}}\right), concluding the theorem. ∎

4.2 Lower Bounds for Deterministic Approximation of Non-PSD Matrices

We next show that our universal sparsification bound for non-PSD matrices (Theorem 4) is tight up to a log4(1/ϵ)\log^{4}(1/\epsilon) factor. Our lower bound holds even for the easier problem of top singular value (spectral norm) approximation and against a more general class of algorithms, which non-adaptively and deterministically query entries of the input matrix. We show how to extend the lower bound to possibly adaptive deterministic algorithms in Theorem 7, but with a 1/ϵ21/\epsilon^{2} factor loss.

See Theorem 6

Proof.

Assume that we have a deterministic algorithm 𝒜\mathcal{A} that non-adaptively reads ss entries of any bounded symmetric matrix 𝐀\mathbf{A} and outputs σ~1\widetilde{\sigma}_{1} with |σ1(𝐀)σ~1|ϵmax(n,𝐀1)|\sigma_{1}(\mathbf{A})-\widetilde{\sigma}_{1}|\leq\epsilon\max(n,\|\mathbf{A}\|_{1}). Assume for the sake of contradiction that scnϵ4s\leq\frac{cn}{\epsilon^{4}} for some sufficiently small constant cc.

Let T\subset[n] be a set of n^{3/2}/s^{1/2} rows on which \mathcal{A} reads at most \sqrt{ns} entries in total. Such a set must exist: \mathcal{A} reads s entries in total, so the average number of entries read per row is s/n, and thus the n^{3/2}/s^{1/2} rows on which \mathcal{A} reads the fewest entries contain at most \frac{n^{3/2}}{s^{1/2}}\cdot\frac{s}{n}=\sqrt{ns} read entries.

Let \mathbf{1}_{T} be the matrix which is 1 on all the rows in T and zero everywhere else. Let \mathbf{S}_{T} be the matrix which matches \mathbf{1}_{T} on all entries, except that it is 0 on every entry in the rows of T that is not read by the algorithm \mathcal{A}. Observe that \mathcal{A} reads the same entries (all ones) and thus outputs the same approximation \widetilde{\sigma}_{1} for \mathbf{1}_{T} and \mathbf{S}_{T}. We now bound the allowed error of this approximation. \mathbf{1}_{T} is rank-1 and so:

𝟏T1=σ1(𝟏T)=𝟏TF=n5/4s1/4.\|\mathbf{1}_{T}\|_{1}=\sigma_{1}(\mathbf{1}_{T})=\|\mathbf{1}_{T}\|_{F}=\frac{n^{5/4}}{s^{1/4}}.

We have \sigma_{1}(\mathbf{S}_{T})\leq\|\mathbf{S}_{T}\|_{F}\leq n^{1/4}s^{1/4} since \mathbf{S}_{T} has at most \sqrt{ns} entries set to 1 – the entries that \mathcal{A} reads in the rows of T. Using that \mathbf{S}_{T} is supported only on the n^{3/2}/s^{1/2} rows of T, \mathbf{S}_{T} has rank at most n^{3/2}/s^{1/2}. Thus, we can bound:

𝐒T1n3/4s1/4𝐒TFn.\|\mathbf{S}_{T}\|_{1}\leq\frac{n^{3/4}}{s^{1/4}}\cdot\|\mathbf{S}_{T}\|_{F}\leq n.

Now, since |\sigma_{1}(\mathbf{1}_{T})-\sigma_{1}(\mathbf{S}_{T})|\geq\left|\frac{n^{5/4}}{s^{1/4}}-n^{1/4}s^{1/4}\right|, on one of the two input instances our algorithm incurs error at least

|n5/4s1/4n1/4s1/4|2n5/44s1/4,\frac{\left|\frac{n^{5/4}}{s^{1/4}}-n^{1/4}s^{1/4}\right|}{2}\geq\frac{n^{5/4}}{4s^{1/4}},

where the inequality follows since, by assumption, scnϵ4s\leq\frac{cn}{\epsilon^{4}} for some small constant cc and ϵ1n1/4\epsilon\geq\frac{1}{n^{1/4}}, and thus n1/4s1/4n5/42s1/4n^{1/4}s^{1/4}\leq\frac{n^{5/4}}{2s^{1/4}}.

The above is a contradiction when \epsilon<1/4 since the above error is at least 1/4\cdot\|\mathbf{1}_{T}\|_{1} and further, for s\leq\frac{cn}{\epsilon^{4}}, the error is at least \frac{\epsilon n}{4c^{1/4}}>\epsilon n\geq\epsilon\cdot\|\mathbf{S}_{T}\|_{1} if we set c small enough. Thus, on at least one of the two input instances the error exceeds \epsilon\cdot\max(n,\|\mathbf{A}\|_{1}), yielding a contradiction. ∎

We can prove a variant of Theorem 6 when the algorithm is allowed to make adaptive queries. Here, our lower bound reduces to \Omega(n/\epsilon^{2}), as we are not able to restrict our hard case to a small set of rows of the input matrix. Closing this gap, either by giving a stronger lower bound in the adaptive case or by giving an adaptive-query deterministic algorithm that achieves \widetilde{O}(n/\epsilon^{2}) query complexity, is an interesting open question.

See Theorem 7

Proof.

Assume that we have a deterministic algorithm 𝒜\mathcal{A} that reads at most ss entries of any bounded entry matrix 𝐀\mathbf{A} and outputs σ~1\widetilde{\sigma}_{1} with |σ1(𝐀)σ~1|ϵmax(n,𝐀1)|\sigma_{1}(\mathbf{A})-\widetilde{\sigma}_{1}|\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1}). Assume for the sake of contradiction that scnϵ2s\leq\frac{cn}{\epsilon^{2}} for some sufficiently small constant cc.

Let S be the set of entries that \mathcal{A} reads when given the all ones matrix \mathbf{1} as input. Let \mathbf{S} be the matrix which is one on every entry in S and zero elsewhere. Observe that \mathcal{A} reads the same entries and thus outputs the same approximation \widetilde{\sigma}_{1} for \mathbf{1} and \mathbf{S}. We now bound the allowed error of this approximation. \mathbf{1} is rank-1 and has \|\mathbf{1}\|_{1}=\sigma_{1}(\mathbf{1})=n. We have \sigma_{1}(\mathbf{S})\leq\|\mathbf{S}\|_{F}=\sqrt{s} and can bound \|\mathbf{S}\|_{1}\leq\sqrt{n}\cdot\|\mathbf{S}\|_{F}\leq\sqrt{sn}\leq\frac{c^{1/2}n}{\epsilon}, where the last bound follows from our assumption that s\leq\frac{cn}{\epsilon^{2}}. Now, since |\sigma_{1}(\mathbf{1})-\sigma_{1}(\mathbf{S})|\geq\left|n-\sqrt{s}\right|, on one of the two input instances our algorithm incurs error at least

|ns|2n4,\frac{\left|n-\sqrt{s}\right|}{2}\geq\frac{n}{4},

where the inequality holds since, by assumption, scnϵ2s\leq\frac{cn}{\epsilon^{2}} for some small constant cc and ϵ1n1/2\epsilon\geq\frac{1}{n^{1/2}}, and thus sn2\sqrt{s}\leq\frac{n}{2}.

The above is a contradiction when \epsilon<1/4, since the error above is at least 1/4\cdot\|\mathbf{1}\|_{1}=1/4\cdot\max(n,\|\mathbf{1}\|_{1})>\epsilon\cdot\max(n,\|\mathbf{1}\|_{1}). Furthermore, since we have bounded \|\mathbf{S}\|_{1}\leq\frac{c^{1/2}n}{\epsilon}, the error n/4 is also greater than \epsilon\cdot\max(n,\|\mathbf{S}\|_{1}) when c is set sufficiently small (e.g., c<1/16). Thus, on at least one of the two input instances the error exceeds \epsilon\cdot\max(n,\|\mathbf{A}\|_{1}), yielding a contradiction. ∎

4.3 Lower Bound for Randomized Approximation of Non-PSD Matrices

Finally, we prove Theorem 11, which gives an \Omega(n/\epsilon^{2}) query lower bound for spectral approximation of bounded entry matrices that holds even for randomized and adaptive algorithms that approximate \mathbf{A} in an arbitrary manner (not necessarily via sparsification). In particular, we show the lower bound against algorithms that produce any data structure f(\cdot,\cdot) that satisfies |f(\mathbf{x},\mathbf{y})-\mathbf{x}^{T}\mathbf{A}\mathbf{y}|\leq\epsilon n for all unit norm \mathbf{x},\mathbf{y}\in\mathbb{R}^{n}.

To prove this lower bound, we let 𝐀\mathbf{A} be a random matrix where each row has binary entries that are either i.i.d. fair coin flips, or else are coin flips with bias +ϵ+\epsilon. Each row is unbiased with probability 1/21/2 and biased with probability 1/21/2. We show that an ϵn\epsilon n-spectral approximation to 𝐀\mathbf{A} suffices to identify for at least a 9/109/10 fraction of the rows, whether or not they are biased – roughly since the approximation must preserve the fact that 𝐱T𝐀𝐱\mathbf{x}^{T}\mathbf{A}\mathbf{x} is large when 𝐱\mathbf{x} is supported just on this biased set. Consider a communication problem in the blackboard model, in which each of n2n^{2} players can access just a single entry of 𝐀\mathbf{A}. It is known that if the nn players corresponding to a single row of the matrix want to identify with good probability whether or not it is biased, they must communicate at least Ω(1/ϵ2)\Omega(1/\epsilon^{2}) bits [SWXZ22]. Further, via a direct sum type argument, we can show that for the n2n^{2} players to identify the bias of a 9/109/10 fraction of the rows with good probability, Ω(n/ϵ2)\Omega(n/\epsilon^{2}) bits must be communicated. I.e., at least Ω(n/ϵ2)\Omega(n/\epsilon^{2}) of the players must read their input bits, yielding our query complexity lower bound.

Note that there may be other proof approaches here, based on a direct sum of nn 22-player Gap-Hamming instances [BJKS04, CR12b, BGPW13], but our argument is simple and already gives optimal bounds.

Section Roadmap. In Section 4.3.1, we formally define the problem of identifying the bias of a large fraction of \mathbf{A}’s rows as the (\epsilon,n)-distributed detection problem (Definition 4.2). We prove an \Omega(n/\epsilon^{2}) query lower bound for this problem in Lemma 3. Then, in Section 4.3.2, we prove Theorem 11 by showing a reduction from the distributed detection problem to spectral approximation.

4.3.1 Distributed Detection Lower Bound

Here, we define additional concepts and notation specific to this section. For random variables XX, YY and ZZ, let H(X)H(X) denote the entropy, H(X|Y)H(X|Y) denote the conditional entropy, I(X;Y)I(X;Y) denote mutual information, and I(X;Y|Z)I(X;Y|Z) denote conditional mutual information [Cov99]. We also use some ideas from communication complexity. Namely, we work in the blackboard model, where TT parties communicate by posting messages to a public blackboard with access to public randomness with unlimited rounds of adaptivity. Let Π{0,1}\Pi\in\{0,1\}^{*} denote the transcript of a protocol posted to the blackboard, and let |Π||\Pi| denote the length of a transcript. For a fixed protocol with fixed probability of success, we may assume, without loss of generality, that Π\Pi has a fixed pre-determined length (see Section 2.2 of [Rou16]).

Our query complexity lower bound for spectral approximation will follow from a reduction from the following testing problem.

Definition 4.1.

(ϵ\epsilon-Distributed detection problem [SWXZ22]). For fixed distributions μ0=Bernoulli(1/2)\mu_{0}=\operatorname{Bernoulli}(1/2) and μ1=Bernoulli(1/2+ϵ)\mu_{1}=\operatorname{Bernoulli}(1/2+\epsilon), with ϵ[0,12]\epsilon\in[0,\frac{1}{2}], let 𝐱{0,1}n\mathbf{x}\in\{0,1\}^{n} be a random vector such that 𝐱1,,𝐱n\mathbf{x}_{1},\ldots,\mathbf{x}_{n} are sampled i.i.d. from μV\mu_{V}, for V{0,1}V\in\{0,1\}. The distributed detection problem is the task of determining whether V=0V=0 or V=1V=1, given the values of 𝐱\mathbf{x}.

This decision problem can be naturally interpreted as a communication problem in the blackboard model, where n players each have a single private bit corresponding to whether a unique entry of \mathbf{x} is zero or one. Prior work takes this view to lower bound the mutual information, in the case V=0, between the sampled bits in \mathbf{x} and the transcript of any protocol which correctly solves the \epsilon-distributed detection problem with constant advantage.

Theorem 12.

(Theorem 6 in [SWXZ22]). Let \Pi be the transcript of a protocol that solves the \epsilon-distributed detection problem with probability 1-p for any fixed choice of p\in[0,0.5). Then

I(X;Π|V=0)=Ω(ϵ2).\displaystyle I(X;\Pi|V=0)=\Omega(\epsilon^{-2}).

Next, we define the (ϵ,n)(\epsilon,n)-Distributed detection problem, which combines nn length-nn ϵ\epsilon-Distributed detection problems into a single joint detection problem.

Definition 4.2.

((ϵ,n)(\epsilon,n)-Distributed detection problem) For 𝐯{0,1}n\mathbf{v}\in\{0,1\}^{n} distributed uniformly on the Hamming cube, generate the matrix 𝐀{0,1}n×n\mathbf{A}\in\{0,1\}^{n\times n} such that if 𝐯i=0\mathbf{v}_{i}=0, then all entries in the ii-th row of 𝐀\mathbf{A} are i.i.d. samples from Bernoulli(1/2)\operatorname{Bernoulli}(1/2). Otherwise, let all entries in the ii-th row be sampled from Bernoulli(1/2+ϵ)\operatorname{Bernoulli}(1/2+\epsilon). The (ϵ,n)(\epsilon,n)-Distributed detection problem is the task of recovering a vector 𝐯^{0,1}n\hat{\mathbf{v}}\in\{0,1\}^{n} such that 𝐯^𝐯1n20\|\hat{\mathbf{v}}-\mathbf{v}\|_{1}\leq\frac{n}{20}.
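For concreteness, the following Python sketch (with hypothetical function and variable names of our choosing; it plays no role in the lower bound argument itself) samples an instance exactly as in Definition 4.2.

import numpy as np

def sample_detection_instance(n, eps, rng=None):
    # Sample (v, A) as in Definition 4.2: v is uniform on the Hamming cube;
    # row i of A is i.i.d. Bernoulli(1/2) if v_i = 0 and i.i.d.
    # Bernoulli(1/2 + eps) if v_i = 1.
    rng = np.random.default_rng() if rng is None else rng
    v = rng.integers(0, 2, size=n)
    p = 0.5 + eps * v                   # per-row success probability
    A = (rng.random((n, n)) < p[:, None]).astype(np.int8)
    return v, A

v, A = sample_detection_instance(n=1000, eps=0.05)
print(A[v == 0].mean(), A[v == 1].mean())   # ~0.5 vs. ~0.55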

Again, this vector recovery problem has a natural interpretation as a communication problem, where n2n^{2} players each hold a single bit of information that corresponds to whether a unique entry of 𝐀\mathbf{A} is zero or one. Throughout this section, we will view Definition 4.2 as a decision problem or communication problem as needed. Let Π\Pi be the transcript of a protocol that solves the (ϵ,n)(\epsilon,n)-Distributed detection problem in the communication model introduced at the beginning of this section. Lower bounding the mutual information between the transcript, Π\Pi, and the private information held by the players (the entries of 𝐀\mathbf{A}) lower bounds the length of the transcript via the following argument.

Lemma 1.

If Π\Pi is the transcript of a protocol that solves the (ϵ,n)(\epsilon,n)-Distributed detection problem and I(𝐀;Π)bI(\mathbf{A};\Pi)\geq b, then |Π|blog2|\Pi|\geq\frac{b}{\log 2}.

Proof.

For random variables X,YX,Y, I(X;Y)min{H(X),H(Y)}I(X;Y)\leq\min\{H(X),H(Y)\} [Cov99]. If a random variable XX has finite support, then H(X)log|supp(X)|H(X)\leq\log|\operatorname{supp}(X)|. Recall that |Π||\Pi| is the number of bits in the transcript and hence log2|Π|=log2|Π|H(Π)I(𝐀;Π)b\log 2^{|\Pi|}=\log 2\cdot|\Pi|\geq H(\Pi)\geq I(\mathbf{A};\Pi)\geq b. Hence, we conclude the statement. ∎

Next, we lower bound the number of entries in the matrix 𝐀\mathbf{A} that must be observed to solve a variant of the (ϵ,n)(\epsilon,n)-Distributed detection problem, where we aim to correctly decide each index of 𝐯\mathbf{v} with constant success probability, rather than constructing 𝐯^\hat{\mathbf{v}} such that 𝐯^𝐯1n20\|\hat{\mathbf{v}}-\mathbf{v}\|_{1}\leq\frac{n}{20}. There are three sources of randomness in this problem: 1) the initial randomness in sampling 𝐯\mathbf{v}, 2) the random variable 𝐀\mathbf{A} which depends on 𝐯\mathbf{v}, and 3) the random transcript Π\Pi which depends on 𝐀\mathbf{A}. When necessary to avoid confusion, we will be explicit regarding which of these variables are considered fixed and which are considered as random in probabilistic statements.

Lemma 2.

Let 𝐀\mathbf{A} and 𝐯\mathbf{v} be distributed as in the (ϵ,n)(\epsilon,n)-Distributed detection problem. Any protocol with transcript Π\Pi that constructs a vector 𝐯^\hat{\mathbf{v}} such that (𝐯i=𝐯^i)910\operatorname*{\mathbb{P}}(\mathbf{v}_{i}=\hat{\mathbf{v}}_{i})\geq\frac{9}{10}, for all i[n]i\in[n], (where the randomness is with respect to 𝐀\mathbf{A}, 𝐯\mathbf{v}, and Π\Pi), must observe Ω(nϵ2)\Omega(\frac{n}{\epsilon^{2}}) entries of 𝐀\mathbf{A} in the worst case.

Proof.

Consider 𝐀\mathbf{A} as n2n^{2} players each holding a single bit of information corresponding to whether a unique entry of 𝐀\mathbf{A} is zero or one. Here, let Π{0,1}\Pi\in\{0,1\}^{*} be the transcript of a protocol that satisfies (𝐯i=𝐯^i)910\operatorname*{\mathbb{P}}(\mathbf{v}_{i}=\hat{\mathbf{v}}_{i})\geq\frac{9}{10} for every i[n]i\in[n]. Note that any algorithm which solves this problem by querying kk entries of 𝐀\mathbf{A} implies a protocol for the communication problem which communicates kk bits, since the algorithm could be simulated by (possibly adaptively) posting each queried bit to the blackboard.

First, we lower bound the mutual information between the private information in the i-th row of \mathbf{A} and the transcript \Pi, conditioned on \mathbf{v}_{i}. Note that \mathbf{v}_{i} is one or zero with equal probability; therefore,

I(𝐀i;Π|𝐯i)=12I(𝐀i;Π|𝐯i=0)+12I(𝐀i;Π|𝐯i=1)12I(𝐀i;Π|𝐯i=0).\displaystyle I(\mathbf{A}_{i};\Pi|\mathbf{v}_{i})=\frac{1}{2}I(\mathbf{A}_{i};\Pi|\mathbf{v}_{i}=0)+\frac{1}{2}I(\mathbf{A}_{i};\Pi|\mathbf{v}_{i}=1)\geq\frac{1}{2}I(\mathbf{A}_{i};\Pi|\mathbf{v}_{i}=0).

Next, we decompose the mutual information by its definition and the chain rule, and then use the facts that 0\leq H(\mathbf{v}_{i})\leq 1 and that conditioning can only decrease the entropy of a random variable.

I(𝐀i;Π)=I(𝐀i;Π|𝐯i)+H(𝐯i|𝐀i,Π)H(𝐯i|𝐀i)H(𝐯i|Π)+H(𝐯i)\displaystyle I(\mathbf{A}_{i};\Pi)=I(\mathbf{A}_{i};\Pi|\mathbf{v}_{i})+H(\mathbf{v}_{i}|\mathbf{A}_{i},\Pi)-H(\mathbf{v}_{i}|\mathbf{A}_{i})-H(\mathbf{v}_{i}|\Pi)+H(\mathbf{v}_{i})
I(𝐀i;Π)I(𝐀i;Π|𝐯i)2.\displaystyle\Rightarrow I(\mathbf{A}_{i};\Pi)\geq I(\mathbf{A}_{i};\Pi|\mathbf{v}_{i})-2.

Observe that determining \hat{\mathbf{v}}_{i} with the guarantee \operatorname*{\mathbb{P}}(\mathbf{v}_{i}=\hat{\mathbf{v}}_{i})\geq\frac{9}{10} solves the \epsilon-distributed detection problem with probability at least \frac{9}{10}. Therefore, by Theorem 12, I(\mathbf{A}_{i};\Pi|\mathbf{v}_{i}=0)=\Omega(\epsilon^{-2}), and so we can combine the previous two equations to lower bound the unconditional mutual information:

I(𝐀i;Π)12I(𝐀i;Π|𝐯i=0)2=Ω(1ϵ2).\displaystyle I(\mathbf{A}_{i};\Pi)\geq\frac{1}{2}I(\mathbf{A}_{i};\Pi|\mathbf{v}_{i}=0)-2=\Omega(\frac{1}{\epsilon^{2}}). (13)

Next, we lower bound the total mutual information between 𝐀\mathbf{A} and Π\Pi. Mirroring the argument of Lemma 1 in [SWXZ22], we use that, for independent random variables, entropy is additive and conditional entropy is subadditive to lower bound the mutual information between 𝐀\mathbf{A} and Π\Pi:

\displaystyle I(\{\mathbf{A}_{i}\}_{i\in[n]};\Pi)=H(\{\mathbf{A}_{i}\}_{i\in[n]})-H(\{\mathbf{A}_{i}\}_{i\in[n]}|\Pi)
\displaystyle\geq\sum_{i=1}^{n}H(\mathbf{A}_{i})-H(\mathbf{A}_{i}|\Pi)
\displaystyle=\sum_{i=1}^{n}I(\mathbf{A}_{i};\Pi)=\Omega\left(\frac{n}{\epsilon^{2}}\right),

where the last step follows from (13). By Lemma 1, |Π|=Ω(I({𝐀i}i[n];Π))=Ω(nϵ2)|\Pi|=\Omega(I(\{\mathbf{A}_{i}\}_{i\in[n]};\Pi))=\Omega(\frac{n}{\epsilon^{2}}). Therefore, every algorithm which samples entries of 𝐀\mathbf{A} to construct a vector 𝐯^\hat{\mathbf{v}} satisfying (𝐯i=𝐯^i)910\operatorname*{\mathbb{P}}(\mathbf{v}_{i}=\hat{\mathbf{v}}_{i})\geq\frac{9}{10} must observe at least Ω(nϵ2)\Omega(\frac{n}{\epsilon^{2}}) entries of 𝐀\mathbf{A}. ∎

We now have the necessary results to prove a lower bound on the number of entries of 𝐀\mathbf{A} that must be observed to solve the (ϵ,n)(\epsilon,n)-Distributed detection problem, which we do by reducing to the problem in Lemma 2.

Lemma 3.

Any adaptive randomized algorithm which solves the (ϵ,n)(\epsilon,n)-Distributed detection problem with probability at least 1920\frac{19}{20} (with respect to the randomness in 𝐯\mathbf{v}, 𝐀\mathbf{A}, and Π\Pi) must observe Ω(nϵ2)\Omega(\frac{n}{\epsilon^{2}}) entries of 𝐀\mathbf{A}.

Proof.

Any algorithm that can produce a vector 𝐯^\hat{\mathbf{v}} such that 𝐯^𝐯1n20\|\hat{\mathbf{v}}-\mathbf{v}\|_{1}\leq\frac{n}{20} with probability at least 1920\frac{19}{20} could be used to guarantee that (𝐯^i=𝐯i)910\operatorname*{\mathbb{P}}(\hat{\mathbf{v}}_{i}=\mathbf{v}_{i})\geq\frac{9}{10} by the following argument.

For any input matrix \mathbf{A}, create the matrix \bar{\mathbf{A}} such that \bar{\mathbf{A}}_{i}=\mathbf{A}_{\sigma(i)}, where \sigma is a permutation sampled uniformly from the symmetric group on n elements. Let \mathbf{v}_{\sigma} be the vector which satisfies [\mathbf{v}_{\sigma}]_{i}=\mathbf{v}_{\sigma(i)}. Run the considered protocol on \bar{\mathbf{A}} to recover \mathbf{u} such that \|\mathbf{u}-\mathbf{v}_{\sigma}\|_{1}\leq\frac{n}{20} with probability \frac{19}{20}. Finally, let \hat{\mathbf{v}}_{i}=\mathbf{u}_{\sigma^{-1}(i)} for all i\in[n].

Then, \operatorname*{\mathbb{P}}(\hat{\mathbf{v}}_{i}=\mathbf{v}_{i})\geq\operatorname*{\mathbb{P}}(\|\mathbf{u}-\mathbf{v}_{\sigma}\|_{1}\leq\frac{n}{20})\cdot\operatorname*{\mathbb{P}}(\mathbf{u}_{\sigma^{-1}(i)}=\mathbf{v}_{\sigma^{-1}(i)}\,|\,\|\mathbf{u}-\mathbf{v}_{\sigma}\|_{1}\leq\frac{n}{20})\geq\frac{19}{20}\cdot\frac{19}{20}\geq\frac{9}{10}. Here, since \sigma(i) is uniformly distributed over [n], \operatorname*{\mathbb{P}}(\mathbf{u}_{\sigma^{-1}(i)}=\mathbf{v}_{\sigma^{-1}(i)}\,|\,\|\mathbf{u}-\mathbf{v}_{\sigma}\|_{1}\leq\frac{n}{20}) is the expected fraction of correctly decided entries, which is at least \frac{19}{20} whenever \|\mathbf{u}-\mathbf{v}_{\sigma}\|_{1}\leq\frac{n}{20}.

By Lemma 2, we conclude that solving the (\epsilon,n)-Distributed detection problem with probability at least \frac{19}{20} requires observing \Omega(\frac{n}{\epsilon^{2}}) entries of \mathbf{A} in the worst case. ∎
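The symmetrization step in the proof above is simple to express in code. The following Python sketch (where recover is a hypothetical black box implementing the assumed protocol, and all names are ours) permutes the rows, runs the protocol, and undoes the permutation, so that every index enjoys the same per-index success probability.

import numpy as np

def per_index_decoder(A, recover, rng=None):
    # Randomly permute the rows of A, run a recovery procedure that errs on
    # at most n/20 indices with probability 19/20, then undo the permutation.
    # By the argument above, each index is then decided correctly with
    # probability at least 9/10.
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    sigma = rng.permutation(n)    # row i of the permuted matrix is row sigma[i] of A
    u = recover(A[sigma])         # u approximates v[sigma]
    v_hat = np.empty(n, dtype=u.dtype)
    v_hat[sigma] = u              # i.e., v_hat[i] = u[sigma^{-1}(i)]
    return v_hat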

4.3.2 Spectral Approximation Query Lower Bound

Next, we show that the (\epsilon,n)-Distributed detection problem can be solved using a spectral approximation of \mathbf{A}, thereby implying a query complexity lower bound for spectral approximation. See Theorem 11.

Proof.

We reduce solving the (ϵ,n)(\epsilon,n)-Distributed detection problem to constructing a spectral approximation satisfying the guarantee in the above theorem statement. Throughout this proof, let ϵd\epsilon_{d} be the parameter associated with the (ϵ,n)(\epsilon,n)-Distributed detection problem and ϵs\epsilon_{s} be the accuracy of the spectral approximation.

Let 𝐀\mathbf{A} be the matrix associated with the (ϵd,n)(\epsilon_{d},n)-Distributed detection problem. Suppose that by reading rr entries of 𝐀\mathbf{A}, we can create a data structure f(,)f(\cdot,\cdot) satisfying:

\displaystyle|f(\mathbf{x},\mathbf{y})-\mathbf{x}^{T}\mathbf{A}\mathbf{y}|\leq\epsilon_{s}n\|\mathbf{x}\|_{2}\|\mathbf{y}\|_{2}\text{ for all input }\mathbf{x},\mathbf{y}\in\mathbb{R}^{n}.

We show that for \epsilon_{s} sufficiently small with respect to \epsilon_{d}, such a spectral approximation is sufficient to solve the (\epsilon_{d},n)-Distributed detection problem with no further queries to \mathbf{A}.

First, let BiB_{i} denote the number of ones in the ii-th row of 𝐀\mathbf{A}. Observe that BiBinomial(n,1/2)B_{i}\sim\operatorname{Binomial}(n,1/2) if 𝐯i=0\mathbf{v}_{i}=0, and, BiBinomial(n,1/2+ϵd)B_{i}\sim\operatorname{Binomial}(n,1/2+\epsilon_{d}) if 𝐯i=1\mathbf{v}_{i}=1. By Hoeffding’s inequality [Ver18],

P(|BiE[Bi]|>4nlogn)2exp(4nlogn2n)=2n2.\displaystyle P\left(|B_{i}-E[B_{i}]|>\sqrt{4n\log n}\right)\leq 2\exp\left(\frac{-4n\log n}{2n}\right)=\frac{2}{n^{2}}.

Therefore, by the union bound over all i[n]i\in[n],

\displaystyle P\left(\max_{i\in[n]}|B_{i}-E[B_{i}]|>\sqrt{4n\log n}\right)\leq\frac{2}{n}.

Let S_{1},S_{2}\subset[n] be any sets with |S_{1}|=|S_{2}|=\frac{n}{2}. Then with probability at least 1-\frac{2}{n},

|iS1BiiS2Bi|\displaystyle\left|\sum_{i\in S_{1}}B_{i}-\sum_{i\in S_{2}}B_{i}\right| |E[iS1BiiS2Bi]|+2n24nlogn\displaystyle\leq\left|E\left[\sum_{i\in S_{1}}B_{i}-\sum_{i\in S_{2}}B_{i}\right]\right|+\frac{2n}{2}\cdot\sqrt{4n\log n}
=|E[iS1BiiS2Bi]|+O(n3/2logn).\displaystyle=\left|E\left[\sum_{i\in S_{1}}B_{i}-\sum_{i\in S_{2}}B_{i}\right]\right|+O(n^{3/2}\log n). (14)

For large enough nn, we have with probability at least 99100\frac{99}{100} that there are at least n4\frac{n}{4} biased rows, since the number of biased rows is distributed as Binomial(n,1/2)\operatorname{Binomial}(n,1/2). Define the set SS^{*} such that |S|=n2|S^{*}|=\frac{n}{2} and SS^{*} contains a maximal number of biased rows. Define the set S¯\bar{S} as,

\displaystyle\bar{S}=\operatorname*{arg\,max}_{|S|=\frac{n}{2}}f(\mathbf{1}_{S},\mathbf{1}).

We will show that \bar{S} can contain at most k\leq\frac{n}{80}+o(n) fewer biased rows than S^{*}. By the guarantee of the spectral approximation and (14) (note that \mathbf{1}_{S}^{T}\mathbf{A}\mathbf{1}=\sum_{i\in S}B_{i}),

\displaystyle f(\mathbf{1}_{S^{*}},\mathbf{1})-f(\mathbf{1}_{\bar{S}},\mathbf{1})\geq\mathbf{1}_{S^{*}}^{T}\mathbf{A}\mathbf{1}-\mathbf{1}_{\bar{S}}^{T}\mathbf{A}\mathbf{1}-2\epsilon_{s}n\|\mathbf{1}_{S^{*}}\|_{2}\|\mathbf{1}\|_{2}
\displaystyle\geq E[\mathbf{1}_{S^{*}}^{T}\mathbf{A}\mathbf{1}-\mathbf{1}_{\bar{S}}^{T}\mathbf{A}\mathbf{1}]-O(n^{3/2}\log n)-2\epsilon_{s}n^{2}.

If \bar{S} has k fewer biased rows than S^{*}, then E[\mathbf{1}_{S^{*}}^{T}\mathbf{A}\mathbf{1}-\mathbf{1}_{\bar{S}}^{T}\mathbf{A}\mathbf{1}]=\epsilon_{d}nk. By the optimality of \bar{S}, f(\mathbf{1}_{S^{*}},\mathbf{1})-f(\mathbf{1}_{\bar{S}},\mathbf{1})\leq 0. Therefore, if \epsilon_{s}=\frac{\epsilon_{d}}{160}, then,

\displaystyle 0\geq f(\mathbf{1}_{S^{*}},\mathbf{1})-f(\mathbf{1}_{\bar{S}},\mathbf{1})\geq\epsilon_{d}nk-2\epsilon_{s}n^{2}-O(n^{3/2}\log n)=\epsilon_{d}nk-\frac{\epsilon_{d}n^{2}}{80}-O(n^{3/2}\log n).

Solving for k implies k\leq\frac{n}{80}+O(\frac{n^{1/2}\log n}{\epsilon_{d}}). By the assumption of the theorem statement, \frac{1}{\epsilon_{d}}=o(\frac{\sqrt{n}}{\log n}); therefore, k\leq\frac{n}{80}+o(n).

Let b be the number of biased rows in \mathbf{A}. First, we consider the case where b=\frac{n}{2} exactly. In this case, S^{*} contains \frac{n}{2} biased rows, and hence \bar{S} contains at least \frac{n}{2}-\frac{n}{80}-o(n)\geq\frac{n}{2}-\frac{n}{40} biased rows for large enough n. Therefore, declaring all rows in \bar{S} biased (i.e., setting \hat{\mathbf{v}}_{i}=1 for all i\in\bar{S}) correctly decides at least (\frac{n}{2}-\frac{n}{40})/\frac{n}{2}\geq\frac{19}{20} of the biased rows. By looking at the complements of S^{*} and \bar{S}, we conclude that \frac{19}{20} of the unbiased rows are decided correctly as well. Hence, we can construct \hat{\mathbf{v}} such that \|\mathbf{v}-\hat{\mathbf{v}}\|_{1}\leq\frac{n}{20} by the assignment \hat{\mathbf{v}}_{i}=1 for i\in\bar{S} and \hat{\mathbf{v}}_{i}=0 otherwise.

The number of biased rows is distributed as a binomial random variable, i.e., b\sim\operatorname{Binomial}(n,1/2). Therefore, by Hoeffding’s inequality, \operatorname*{\mathbb{P}}(|b-\frac{n}{2}|>10\sqrt{n})<0.01. For large enough n, 10\sqrt{n}\leq 0.01\cdot n. Therefore, with probability at least \frac{99}{100}, we can reduce to the case b=\frac{n}{2} by allowing at most an additional 0.01\cdot n rows to be misclassified as biased or unbiased. Hence, with probability at least \frac{99}{100}, a spectral approximation of \mathbf{A} with \epsilon_{s}=\Theta(\epsilon_{d}) accuracy is sufficient to correctly classify at least \frac{9}{10} of the rows as biased or unbiased.

Correctly classifying 910\frac{9}{10} of the rows as biased or unbiased requires observing Ω(nϵd2)\Omega(\frac{n}{\epsilon_{d}^{2}}) entries of 𝐀\mathbf{A} by Lemma 3. Since Ω(nϵd2)=Ω(nϵs2)\Omega(\frac{n}{\epsilon_{d}^{2}})=\Omega(\frac{n}{\epsilon_{s}^{2}}) entries of 𝐀\mathbf{A} must be observed to construct the spectral approximation used to solve the (ϵ,n)(\epsilon,n)-Distributed detection problem, we conclude the lower bound of the theorem statement. ∎

Note that the assumption \frac{1}{\epsilon}=o\left(\frac{\sqrt{n}}{\log n}\right) in the previous theorem is mild, since if this assumption does not hold, then we must read nearly all entries of the matrix anyway. We also show that our construction provides a lower bound even when the input is restricted to be symmetric.

Corollary 1.

The lower bound in Theorem 11 applies when the input is restricted to symmetric binary matrices.

Proof.

While the construction used in Theorem 11 is not symmetric, we can modify it to give the same lower bound for symmetric input. Let \mathbf{A} be defined as in the proof of Theorem 11, and let \bar{\mathbf{A}} be the Hermitian dilation of \mathbf{A}, i.e.,

𝐀¯=[𝟎𝐀𝐀T𝟎].\displaystyle\bar{\mathbf{A}}=\begin{bmatrix}\mathbf{0}&\mathbf{A}\\ \mathbf{A}^{T}&\mathbf{0}\end{bmatrix}.

Any query 𝐱T𝐀𝐲\mathbf{x}^{T}\mathbf{Ay} can be simulated by the query 𝐱¯T𝐀¯𝐲¯\bar{\mathbf{x}}^{T}\bar{\mathbf{A}}\bar{\mathbf{y}}, where 𝐱¯=[𝐱,𝟎]T\bar{\mathbf{x}}=[\mathbf{x},\mathbf{0}]^{T} and 𝐲¯=[𝟎,𝐲]T\bar{\mathbf{y}}=[\mathbf{0},\mathbf{y}]^{T}. The size of 𝐀\mathbf{A} is nn and the size of 𝐀¯\bar{\mathbf{A}} is 2n2n, hence, if f¯\bar{f} is a spectral approximation of 𝐀¯\bar{\mathbf{A}} such that,

\displaystyle|\bar{f}(\bar{\mathbf{x}},\bar{\mathbf{y}})-\bar{\mathbf{x}}^{T}\bar{\mathbf{A}}\bar{\mathbf{y}}|\leq\left(\frac{\epsilon_{s}}{2}\right)(2n)\|\bar{\mathbf{x}}\|_{2}\|\bar{\mathbf{y}}\|_{2},\text{ for all }\bar{\mathbf{x}},\bar{\mathbf{y}}\in\mathbb{R}^{2n},

then we can simulate f(,)f(\cdot,\cdot) such that,

|f(𝐱,𝐲)𝐱T𝐀𝐲|ϵsn𝐱2𝐲2, for all 𝐱,𝐲n.\displaystyle|f(\mathbf{x},\mathbf{y})-\mathbf{x}^{T}\mathbf{A}\mathbf{y}|\leq\epsilon_{s}n\|\mathbf{x}\|_{2}\|\mathbf{y}\|_{2},\text{ for all }\mathbf{x},\mathbf{y}\in\mathbb{R}^{n}.

Therefore, the proof of Theorem 11 applies to the Hermitian dilation 𝐀¯\bar{\mathbf{A}} after adjusting for a constant factor in the accuracy parameter ϵs\epsilon_{s}. ∎
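As a quick numerical illustration of the dilation argument (a sanity check in Python, not part of the proof), the following sketch verifies that a bilinear query on \mathbf{A} equals the corresponding query on its Hermitian dilation.

import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.integers(0, 2, size=(n, n)).astype(float)    # binary test matrix
Abar = np.block([[np.zeros((n, n)), A],
                 [A.T, np.zeros((n, n))]])           # Hermitian dilation

x, y = rng.standard_normal(n), rng.standard_normal(n)
xbar = np.concatenate([x, np.zeros(n)])              # [x, 0]
ybar = np.concatenate([np.zeros(n), y])              # [0, y]
assert np.isclose(x @ A @ y, xbar @ Abar @ ybar)     # the two queries agree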

5 Improved Bounds for Binary Magnitude PSD Matrices

We call a PSD matrix 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} a binary magnitude PSD matrix if i,j[n],|𝐀ij|{0,1}\forall i,j\in[n],\left|\mathbf{A}_{ij}\right|\in\{0,1\}, i.e., if 𝐀{1,0,1}n×n\mathbf{A}\in\{-1,0,1\}^{n\times n}. In this section, we give deterministic spectral approximation algorithms for such matrices with improved ϵ\epsilon dependencies. Specifically, we show that there exists a deterministic ϵn\epsilon n error spectral approximation algorithm reading just O(nlognϵ)O\left(\frac{n\log n}{\epsilon}\right) entries of the input matrix. This improves on our bound for general bounded entry PSD matrices (Theorem 1) by a 1/ϵ1/\epsilon factor, while losing a logn\log n factor. Further, we show that it is tight up to a logarithmic factor, via a simple application of Turán’s theorem [Tur41].

The key idea of our improved algorithm is that for any PSD \mathbf{A}, we can write \mathbf{A}=\mathbf{B}^{T}\mathbf{B}, so that \mathbf{A}_{ij}=\langle\mathbf{b}_{i},\mathbf{b}_{j}\rangle, where \mathbf{b}_{i},\mathbf{b}_{j} are the i^{th} and j^{th} columns of \mathbf{B}, respectively. Since \mathbf{A} has bounded entries, for any i, \mathbf{A}_{ii}=\|\mathbf{b}_{i}\|_{2}^{2}\leq 1. Thus, we can apply transitivity arguments: if |\mathbf{A}_{ij}|=|\langle\mathbf{b}_{i},\mathbf{b}_{j}\rangle| is large (close to 1) and |\mathbf{A}_{jk}|=|\langle\mathbf{b}_{j},\mathbf{b}_{k}\rangle| is also large, then |\mathbf{A}_{ik}|=|\langle\mathbf{b}_{i},\mathbf{b}_{k}\rangle| must be relatively large as well. Thus, we can hope to infer the contribution of an entry to \mathbf{x}^{T}\mathbf{A}\mathbf{x} without necessarily reading that entry, allowing for reduced sample complexity.

The logic is particularly clean when 𝐀\mathbf{A} is binary. In this case, we can restrict to looking at the principal submatrix corresponding to i[n]i\in[n] for which 𝐀ii=1\mathbf{A}_{ii}=1. The remainder of the matrix is all zeros. For such ii, we have 𝐛i2=1\|\mathbf{b}_{i}\|_{2}=1, and thus if 𝐀ij=1\mathbf{A}_{ij}=1 and 𝐀jk=1\mathbf{A}_{jk}=1, we can conclude that 𝐀ik=1\mathbf{A}_{ik}=1. This implies that, up to a permutation of the row/column indices, 𝐀\mathbf{A} is a block diagonal matrix, with rank(𝐀)\operatorname{rank}(\mathbf{A}) blocks of all ones, corresponding to groups of indices SkS_{k} with 𝐛i=𝐛j\mathbf{b}_{i}=\mathbf{b}_{j} for all i,jSki,j\in S_{k}. The eigenvalues of this matrix are exactly the sizes of these blocks, and to recover a spectral approximation, it suffices to recover the set SkS_{k} corresponding to any block of size ϵn\geq\epsilon n.
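The following small Python sketch illustrates this block structure (an illustrative example with arbitrarily chosen block sizes, not a general algorithm): a binary PSD matrix built from disjoint all-ones blocks has nonzero eigenvalues exactly equal to the block sizes.

import numpy as np

sizes = [5, 3, 2]                                  # block sizes = eigenvalues
n = sum(sizes) + 2                                 # two extra all-zero rows/columns
A = np.zeros((n, n))
offset = 0
for s in sizes:
    A[offset:offset + s, offset:offset + s] = 1.0  # an all-ones block
    offset += s
eigs = np.linalg.eigvalsh(A)
print(np.round(sorted(eigs[eigs > 1e-9], reverse=True), 6))   # -> [5. 3. 2.]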

To recover the support sets of these large blocks from few samples, we employ a weak notion of expanders [Pip87, WZ93], which requires only that any two subsets of vertices, each of size \geq\epsilon n, are connected by at least one edge. Importantly, such expanders can be constructed with \widetilde{O}(n/\epsilon) edges – beating Ramanujan graphs, which require \Omega(n/\epsilon^{2}) edges but satisfy much stronger notions of edge discrepancy. Further, if we consider the graph whose edges lie in the intersection of the edges of the expander graph and the non-zero entries of \mathbf{A}, we can show that any block of ones in \mathbf{A} of size \Omega(\epsilon n) will correspond to a large, \Omega(\epsilon n)-sized connected component in this graph. We can form this graph by querying \mathbf{A} at \widetilde{O}(n/\epsilon) positions (the edges of the expander) and then compute these large connected components to recover a spectral approximation. Each sparse connected component becomes a dense block of all ones in our approximation, avoiding the sparsity lower bound of Theorem 3. Extending this logic to matrices with entries in \{-1,0,1\} yields the main result of this section, Theorem 9.

It is not hard to see that O~(n/ϵ)\widetilde{O}(n/\epsilon) is optimal, even for binary PSD matrices (Theorem 10). By Turán’s theorem, any sample set of size o(n/ϵ)o(n/\epsilon) does not read any entries in some principal submatrix of size ϵn×ϵn\epsilon n\times\epsilon n. By letting 𝐀\mathbf{A} be the identity plus a block of all ones on this submatrix, we force our algorithm to incur ϵn\epsilon n approximation error, since it cannot find this large block of ones.

Section Roadmap. We formally prove the block diagonal structure of binary magnitude PSD matrices in Section 5.1. We leverage this structure to give our improved spectral approximation algorithm for these matrices in Section 5.2. Finally, in Section 5.3 we prove a nearly matching lower bound via Turán’s theorem.

5.1 Structure of Binary Magnitude PSD Matrices

As discussed, our improved algorithm hinges on the fact that, up to a permutation of the rows and columns, binary magnitude PSD matrices are block diagonal. Further, each block is itself a rank-11 matrix, consisting of two on-diagonal blocks of 11’s and two off-diagonal blocks of 1-1’s. Formally,

Lemma 4 (Binary Magnitude PSD Matrices are Block Matrices).

Let 𝐀{1,0,1}n×n\mathbf{A}\in\{-1,0,1\}^{n\times n} be PSD and let λi\lambda_{i} for i[n]i\in[n] be the eigenvalues of 𝐀\mathbf{A}, with λ1λ2λp>0\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{p}>0 and λp+1==λn=0\lambda_{p+1}=\ldots=\lambda_{n}=0. Then, there exists a permutation matrix 𝐏\mathbf{P} such that

𝐏𝐀𝐏T=[𝐀(1)𝟎𝟎𝟎𝟎𝐀(2)𝟎𝟎𝟎𝟎𝟎𝐀(p)𝟎𝟎𝟎𝟎𝟎].\displaystyle\mathbf{P}\mathbf{A}\mathbf{P}^{T}=\begin{bmatrix}\mathbf{A}^{(1)}&\mathbf{0}&\ldots&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\mathbf{A}^{(2)}&\ldots&\mathbf{0}&\mathbf{0}\\ \vdots&\vdots&\ddots&\vdots&\mathbf{0}\\ \mathbf{0}&\mathbf{0}&\ldots&\mathbf{A}^{(p)}&\mathbf{0}\\ \mathbf{0}&\mathbf{0}&\ldots&\mathbf{0}&\mathbf{0}\end{bmatrix}.

Further, each \mathbf{A}^{(i)}\in\{-1,1\}^{\lambda_{i}\times\lambda_{i}} is a \lambda_{i}\times\lambda_{i} rank-1 PSD matrix. Letting S_{i} denote the set of indices corresponding to the principal submatrix \mathbf{A}^{(i)}, which we call its support set, S_{i} can be partitioned into two subsets S_{i1} and S_{i2} such that, if we let \mathbf{v}_{i}(j)=1 for j\in S_{i1}, \mathbf{v}_{i}(j)=-1 for j\in S_{i2}, and \mathbf{v}_{i}(j)=0 otherwise, then \mathbf{v}_{1},\ldots,\mathbf{v}_{p} are orthogonal and \mathbf{A}=\sum_{i=1}^{p}\mathbf{v}_{i}\mathbf{v}_{i}^{T}.

Proof.

Since 𝐀\mathbf{A} is PSD, there exists 𝐁n×n\mathbf{B}\in\mathbb{R}^{n\times n} with rows 𝐁1,,𝐁nn\mathbf{B}_{1},\ldots,\mathbf{B}_{n}\in\mathbb{R}^{n} such that 𝐀=𝐁𝐁T\mathbf{A}=\mathbf{B}\mathbf{B}^{T}. Note that either 𝐀ii=0\mathbf{A}_{ii}=0 or 𝐀ii=1\mathbf{A}_{ii}=1 for all ii since 𝐀ii=𝐁i220\mathbf{A}_{ii}=\|\mathbf{B}_{i}\|_{2}^{2}\geq 0. Also for any i,j[n]i,j\in[n], if 𝐀ij=1\mathbf{A}_{ij}=1 or 𝐀ij=1\mathbf{A}_{ij}=-1, then 𝐀ii=𝐀jj=1\mathbf{A}_{ii}=\mathbf{A}_{jj}=1. Otherwise, if 𝐀ii\mathbf{A}_{ii} or 𝐀jj\mathbf{A}_{jj} were 0, we would have 𝐁i2=0\|\mathbf{B}_{i}\|_{2}=0 or 𝐁j2=0\|\mathbf{B}_{j}\|_{2}=0 and thus 𝐀ij=𝐁i𝐁jT=0\mathbf{A}_{ij}=\mathbf{B}_{i}\mathbf{B}_{j}^{T}=0.

Thus, if \mathbf{A}_{ij}=1, we have \mathbf{A}_{ii}=\|\mathbf{B}_{i}\|_{2}=1, \mathbf{A}_{jj}=\|\mathbf{B}_{j}\|_{2}=1, and \mathbf{B}_{i}^{T}\mathbf{B}_{j}=1. I.e., we must have \mathbf{B}_{i}=\mathbf{B}_{j}. Similarly, if \mathbf{A}_{ij}=-1, we have \|\mathbf{B}_{i}\|_{2}=\|\mathbf{B}_{j}\|_{2}=1 and \mathbf{B}_{i}^{T}\mathbf{B}_{j}=-1, and so \mathbf{B}_{i}=-\mathbf{B}_{j}. Finally, if \mathbf{A}_{ij}=0, then \mathbf{B}_{i}\perp\mathbf{B}_{j}. This implies that the indices i with \mathbf{A}_{ii}=1 can be grouped into disjoint subsets S_{1},S_{2},\ldots,S_{p} (S_{i}\subset[n]) such that for any S_{i} and any j,k\in S_{i}, either \mathbf{B}_{j}=\mathbf{B}_{k} or \mathbf{B}_{j}=-\mathbf{B}_{k}; the remaining indices, with \mathbf{A}_{ii}=0, correspond to all-zero rows and columns and form the final zero block in the lemma statement. Also, for any k\in S_{i} and \ell\in S_{j} with i\neq j, we must have \mathbf{B}_{k}\perp\mathbf{B}_{\ell}. Label these subsets such that |S_{1}|\geq|S_{2}|\geq\ldots\geq|S_{p}|. Now observe that any subset S_{i} can be further divided into two disjoint subsets S_{i1} and S_{i2} such that if j,k\in S_{i1} or j,k\in S_{i2}, then \mathbf{B}_{j}=\mathbf{B}_{k}, but if j\in S_{i1} and k\in S_{i2} (or vice-versa), then \mathbf{B}_{j}=-\mathbf{B}_{k}. Thus, the principal submatrix corresponding to S_{i}, which we label \mathbf{A}^{(i)}, is a rank-1 block matrix with two blocks of all ones on the principal submatrices corresponding to S_{i1} and S_{i2} and -1’s on all remaining entries.

For any S_{i}, let \mathbf{v}_{i}\in\{-1,0,1\}^{n} be the vector such that \mathbf{v}_{i}(j)=1 if j\in S_{i1}, \mathbf{v}_{i}(j)=-1 if j\in S_{i2}, and \mathbf{v}_{i}(j)=0 otherwise. Then, observe that \mathbf{A}\mathbf{v}_{i}=\lvert S_{i}\rvert\mathbf{v}_{i}. Thus, each \lvert S_{i}\rvert is an eigenvalue of \mathbf{A} and each \mathbf{v}_{i} is an eigenvector. Since S_{1},\ldots,S_{p} are disjoint, \mathbf{v}_{1},\ldots,\mathbf{v}_{p} have disjoint supports and are thus orthogonal. Further, \operatorname{tr}(\mathbf{A})=\sum_{i=1}^{n}\lambda_{i}=\sum_{i=1}^{p}|S_{i}|, and thus all other eigenvalues of \mathbf{A} are 0. Thus, we can write \mathbf{A}=\sum_{i=1}^{p}\mathbf{v}_{i}\mathbf{v}_{i}^{T}. This concludes the lemma. ∎

5.2 Improved Spectral Approximation of Binary Magnitude PSD matrices

Given Lemma 4, the key idea to efficiently recovering a spectral approximation to a binary magnitude PSD matrix 𝐀\mathbf{A} is that we do not need to read very many entries in each block 𝐀(i)\mathbf{A}^{(i)} to identify the block. We just need enough to identify a large fraction of the indices in the support set SiS_{i}.

To ensure that our sampling is able to do so, we will sample according to the edges of a certain type of weak expander graph, which ensures that any two vertex sets of large enough size are connected. I.e., that we read at least one entry in any large enough off-diagonal block of our matrix.

Definition 5.1 (ϵn\epsilon n-expander graphs [WZ93, Pip87]).

For any \epsilon\in(0,1), an \epsilon n-expander graph is any undirected graph on n vertices such that any two disjoint subsets of vertices containing at least \epsilon n vertices each are joined by an edge.

Moreover, it is not hard to show that ϵn\epsilon n-expanders with just O(nlogn/ϵ)O(n\log n/\epsilon) edges exist.

Fact 1.

[WZ93] A random d=O(lognϵ)d=O\left(\frac{\log n}{\epsilon}\right) regular graph is an ϵn\epsilon n-expander with high probability.

We note that while a spectral expander graph, such as a Ramanujan graph, as used to achieve Theorem 1, will also be an ϵn\epsilon n-expander by the expander mixing lemma [CG02], such graphs require O(n/ϵ2)O(n/\epsilon^{2}) edges, as opposed to the O~(n/ϵ)\widetilde{O}(n/\epsilon) given by Fact 1. Further, while Fact 1 is not constructive, in [WZ93] an explicit polynomial time construction of ϵn\epsilon n-expander graphs is shown, with just an no(1)n^{o(1)} loss. This result can be plugged directly into our algorithms to give a polynomial time constructible deterministic spectral approximation algorithm with O(n1+o(1)/ϵ)O(n^{1+o(1)}/\epsilon) sample complexity.

Fact 2 (Constructive ϵn\epsilon n-expanders [WZ93]).

There is a polynomial time algorithm that, given an integer nn and ϵ(0,1)\epsilon\in(0,1), constructs an ϵn\epsilon n-expanding graph on nn vertices with maximum degree no(1)ϵ\frac{n^{o(1)}}{\epsilon}.

We now prove our main technical result, which shows that if we read the entries of a binary magnitude PSD matrix 𝐀\mathbf{A} according to an ϵn\epsilon n-expander graph, then we can approximately recover all blocks of this matrix by considering the connected components of the expander graph restricted to edges where 𝐀\mathbf{A} is nonzero.

Theorem 13 (Approximation via Sampled Connected Components).

Let 𝐀{1,0,1}n×n\mathbf{A}\in\{-1,0,1\}^{n\times n} be a PSD matrix with eigenvalues λ1λ2λp>0\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{p}>0 and λp+1==λn=0\lambda_{p+1}=\ldots=\lambda_{n}=0. Let GG be an ϵ/6n\epsilon/6\cdot n-expanding graph on nn vertices with adjacency matrix 𝐁G\mathbf{B}_{G}. Let 𝐀G\mathbf{A}_{G} be the binary adjacency matrix with (𝐀G)ij=1(\mathbf{A}_{G})_{ij}=1 whenever (𝐁G)ij=1(\mathbf{B}_{G})_{ij}=1 and |𝐀ij|=1|\mathbf{A}_{ij}|=1. Let G¯\bar{G} be the graph whose adjacency matrix is 𝐀G\mathbf{A}_{G}. Let S1,,SpS_{1},\ldots,S_{p} denote the support sets of 𝐀\mathbf{A}’s blocks (as defined in Lemma 4) with |Si|=λi|S_{i}|=\lambda_{i}, and let G¯Si\bar{G}_{S_{i}} be the induced subgraph of G¯\bar{G} restricted to SiS_{i}. Finally, let Ci[n]C_{i}\subset[n] be the largest connected component of G¯Si\bar{G}_{S_{i}}. We have:

λiϵ/2n<|Ci|λi.\displaystyle\lambda_{i}-\epsilon/2\cdot n<|C_{i}|\leq\lambda_{i}.
Proof.

First observe that since |S_{i}|=\lambda_{i}, we trivially have that |C_{i}|\leq\lambda_{i}. Thus, it remains to show that |C_{i}|>\lambda_{i}-\epsilon/2\cdot n. This holds vacuously if \lambda_{i}\leq\epsilon/2\cdot n. Thus, we focus our attention on the case \lambda_{i}\geq\epsilon/2\cdot n. For contradiction, assume |C_{i}|\leq\lambda_{i}-\epsilon/2\cdot n. Let M_{1},\ldots,M_{r}\subseteq S_{i} denote the connected components of \bar{G}_{S_{i}} such that \left|M_{1}\right|\geq\ldots\geq\left|M_{r}\right|, so |C_{i}|=|M_{1}|. Observe that since all entries in \mathbf{A} in the principal submatrix indexed by S_{i} (i.e., in \mathbf{A}^{(i)} from Lemma 4) have magnitude 1, \bar{G}_{S_{i}}=G_{S_{i}}. Consider two cases.

Case 1: |C_{i}|\geq\epsilon/6\cdot n. From our assumption, \lambda_{i}-|C_{i}|\geq\epsilon/2\cdot n, which implies that \sum_{j=2}^{r}|M_{j}|\geq\epsilon/2\cdot n. Let C^{\prime}=\bigcup_{j=2}^{r}M_{j}, so |C^{\prime}|\geq\epsilon/2\cdot n. Since M_{1},\ldots,M_{r} are disconnected, there are no edges from C^{\prime} to C_{i} in \bar{G}, and thus no edges in G, since \bar{G}_{S_{i}}=G_{S_{i}}. However, this contradicts the fact that G is an \epsilon/6\cdot n-expander and thus must have at least one edge between any two sets of vertices of size at least \epsilon/6\cdot n (Definition 5.1). Thus we have a contradiction, and our assumption \lambda_{i}-|C_{i}|\geq\epsilon/2\cdot n is incorrect. So we must have |C_{i}|>\lambda_{i}-\epsilon/2\cdot n, as needed.

Case 2: |C_{i}|<\epsilon/6\cdot n. Since |C_{i}|<\epsilon/6\cdot n, we have \max_{j\in[r]}|M_{j}|<\epsilon/6\cdot n. Consider partitioning these connected components into two sets, T_{1} and T_{2}. Let V_{1},V_{2} be the vertex sets corresponding to these two partitions, i.e., V_{1}=\bigcup_{M_{i}\in T_{1}}M_{i} and V_{2}=\bigcup_{M_{i}\in T_{2}}M_{i}. Pick T_{1} and T_{2} to minimize ||V_{1}|-|V_{2}||. Since \max_{j\in[r]}|M_{j}|<\epsilon/6\cdot n, for this optimal T_{1},T_{2}, we must have ||V_{1}|-|V_{2}||<\epsilon/6\cdot n.

Since V_{1} and V_{2} are disconnected in \bar{G}_{S_{i}}, they must be disconnected in G_{S_{i}}. Thus, we must have either |V_{1}|<\epsilon/6\cdot n or |V_{2}|<\epsilon/6\cdot n, since any two subsets of vertices with size \geq\epsilon/6\cdot n must have an edge between them in G due to it being an \epsilon/6\cdot n-expander. Assume w.l.o.g. that |V_{2}|<\epsilon/6\cdot n. Then since |V_{1}|+|V_{2}|=|S_{i}|=\lambda_{i}\geq\epsilon/2\cdot n, we must have |V_{1}|>\epsilon/3\cdot n. However, this means that ||V_{1}|-|V_{2}||>\epsilon/6\cdot n, which contradicts the fact that we picked T_{1},T_{2} such that ||V_{1}|-|V_{2}||<\epsilon/6\cdot n. In both cases we reach a contradiction, so |C_{i}|>\lambda_{i}-\epsilon/2\cdot n, completing the proof. ∎

We now present our deterministic algorithm for spectral approximation of binary magnitude PSD matrices, based on Theorem 13. The key idea is to compute all connected components of G¯\bar{G} – the graph whose edges lie in the intersection of the edges of GG and the non-zero entries of 𝐀\mathbf{A}. Any large enough block in 𝐀\mathbf{A} (see Lemma 4) will correspond to a large connected component in this graph and can thus be identified.

Algorithm 1 Deterministic spectral approximator using expanders
1:PSD matrix 𝐀{1,0,1}n×n\mathbf{A}\in\{-1,0,1\}^{n\times n} and ϵ(0,1)\epsilon\in(0,1).
2:Construct an ϵ/6n\epsilon/6\cdot n-expanding graph GG on nn nodes with adjacency matrix 𝐁G\mathbf{B}_{G}.
3:Let \mathbf{A}_{G} be an n\times n matrix such that (\mathbf{A}_{G})_{ij}=1 if (\mathbf{B}_{G})_{ij}=1 and |\mathbf{A}_{ij}|=1, and (\mathbf{A}_{G})_{ij}=0 otherwise. Let \bar{G} be the graph corresponding to \mathbf{A}_{G}.
4:Compute all connected components of G¯\bar{G}, C1,C2,,Cr[n]C_{1},C_{2},\ldots,C_{r}\subseteq[n].
5:Initialize 𝐀~=𝟎n×n\widetilde{\mathbf{A}}=\mathbf{0}^{n\times n}.
6:for i=1,2,,ri=1,2,\ldots,r do
7:     if |Ci|>ϵn2|C_{i}|>\frac{\epsilon n}{2} then
8:         Pick any jCij\in C_{i} and let 𝐀~=𝐀~+𝐀j𝐀jT\widetilde{\mathbf{A}}=\widetilde{\mathbf{A}}+\mathbf{A}_{j}\mathbf{A}_{j}^{T}, where 𝐀j\mathbf{A}_{j} is the jthj^{th} column of 𝐀\mathbf{A}.
9:     end if
10:end for
11:return 𝐀~\widetilde{\mathbf{A}}.

Query complexity: Observe that Algorithm 1 requires querying \mathbf{A} at the locations where \mathbf{B}_{G} is non-zero (to form \mathbf{A}_{G} in Line 3). By Fact 1, there are \epsilon/6\cdot n-expander graphs with just O(\frac{n\log n}{\epsilon}) edges, so this requires O(\frac{n\log n}{\epsilon}) queries. Further, in a graph with n nodes, there can be at most 2/\epsilon connected components of size \geq\frac{\epsilon n}{2}. Thus, Line 8 requires total query complexity \leq\frac{2n}{\epsilon} to read one column of \mathbf{A} corresponding to each large component identified.
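For concreteness, a compact Python rendering of Algorithm 1 follows. It is a sketch under two assumptions we flag explicitly: the deterministic expander of Fact 2 is replaced by a randomized stand-in (a union of random permutation matchings, in the spirit of Fact 1), and the constant in the degree is arbitrary. Only the expander edge positions and one column per large component are ever read from \mathbf{A}.

import numpy as np

def random_expander_edges(n, d, rng):
    # Randomized stand-in for the eps*n-expander of Facts 1 and 2: the union
    # of d random permutation matchings is a good expander with high probability.
    edges = []
    for _ in range(d):
        perm = rng.permutation(n)
        edges.extend((i, int(perm[i])) for i in range(n) if i != perm[i])
    return edges

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def spectral_approx_binary_psd(A, eps, rng=None):
    # Sketch of Algorithm 1: query A only on expander edges, take connected
    # components of the surviving (nonzero) edges, and rebuild each large
    # block from a single column of A.
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    d = max(1, int(np.ceil(6 * np.log(n) / eps)))      # arbitrary constant
    parent = list(range(n))
    for i, j in random_expander_edges(n, d, rng):
        if A[i, j] != 0:                               # a query into A
            parent[find(parent, i)] = find(parent, j)
    comps = {}
    for i in range(n):
        comps.setdefault(find(parent, i), []).append(i)
    A_tilde = np.zeros_like(A, dtype=float)
    for comp in comps.values():
        if len(comp) > eps * n / 2:                    # Line 7 of Algorithm 1
            col = A[:, comp[0]].astype(float)          # Line 8: one column query
            A_tilde += np.outer(col, col)              # adds +/- v_i v_i^T
    return A_tilde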

See Theorem 9

Proof.

We will show that Algorithm 1 satisfies the requirements of the theorem. We have already argued above that its query complexity is O(nlognϵ)O(\frac{n\log n}{\epsilon}).

By Lemma 4, we know that, up to a permutation, \mathbf{A} is a block matrix with rank-1 blocks \mathbf{A}^{(1)},\mathbf{A}^{(2)},\ldots,\mathbf{A}^{(k)} with support sets S_{1},\ldots,S_{k}. Since \bar{G} has edges only where \mathbf{A} is non-zero, each connected component C_{i} in \bar{G} is entirely contained within one of these support sets. Further, if |S_{i}|=\lambda_{i}\geq\epsilon n, then by Theorem 13, there is exactly one connected component, call it D_{i} (the largest connected component of \bar{G}_{S_{i}}), with vertices in S_{i} and |D_{i}|>\epsilon n/2. If |S_{i}|=\lambda_{i}<\epsilon n, then there is at most one such connected component by Theorem 13 (there may be none).

So, Line 7 is triggered for exactly one connected component D_{i}\subseteq S_{i} for each S_{i} with |S_{i}|=\lambda_{i}\geq\epsilon n, and for at most one connected component for all other S_{i}. If Line 7 is triggered, then we add \mathbf{A}_{j}\mathbf{A}_{j}^{T} to \mathbf{\widetilde{A}}. Observe that \mathbf{A}_{j} is a column for j\in S_{i}, and thus by Lemma 4, \mathbf{A}_{j} is supported only on S_{i}. We have that either \mathbf{A}_{j}(t)=1 on S_{i1} and \mathbf{A}_{j}(t)=-1 on S_{i2}, or \mathbf{A}_{j}(t)=-1 on S_{i1} and \mathbf{A}_{j}(t)=1 on S_{i2}. That is, \mathbf{A}_{j} is equal to either \mathbf{v}_{i} or -\mathbf{v}_{i} as defined in Lemma 4.

So overall, letting Z\subseteq\{S_{1},\ldots,S_{k}\} be the set of blocks for which one connected component is recovered in Line 7, \mathbf{\widetilde{A}}=\sum_{i\in Z}\mathbf{v}_{i}\mathbf{v}_{i}^{T}. From Lemma 4, \mathbf{A}=\sum_{i=1}^{k}\mathbf{v}_{i}\mathbf{v}_{i}^{T} and so \mathbf{A}-\mathbf{\widetilde{A}}=\sum_{i\notin Z}\mathbf{v}_{i}\mathbf{v}_{i}^{T}. Since the \mathbf{v}_{i} are orthogonal and since for i\notin Z we must have \lambda_{i}=|S_{i}|=\|\mathbf{v}_{i}\mathbf{v}_{i}^{T}\|_{2}\leq\epsilon n, this gives that \|\mathbf{A}-\mathbf{\widetilde{A}}\|_{2}\leq\epsilon n, completing the theorem. ∎

5.3 Lower Bounds for Binary PSD Matrix Approximation

We can prove that our O(nlognϵ)O\left(\frac{n\log n}{\epsilon}\right) query complexity for binary magnitude matrices is optimal up to a logn\log n factor via Turán’s Theorem, stated below. Our lower bound holds for the easier problem of eigenvalue approximation for PSD 𝐀{0,1}n×n\mathbf{A}\in\{0,1\}^{n\times n}.

Fact 3 (Turán’s theorem [Tur41, Aig95]).

Let GG be a graph on nn vertices that does not include a (k+1)(k+1)-clique as a subgraph. Let t(n,k)t(n,k) be the number of edges in GG. Then

t(n,k)(k1)n22k.\displaystyle t(n,k)\leq\frac{(k-1)n^{2}}{2k}.

Using Turán’s theorem we can prove: See Theorem 10

Proof.

Let \mathcal{A} be a deterministic algorithm that approximates the eigenvalues of a binary PSD input matrix \mathbf{A} to \epsilon n additive error. Let \mathbf{A}_{0}=\mathbf{I}_{n}, i.e., the identity matrix. Let S\subset[n]\times[n] be the first \frac{n}{\epsilon} off-diagonal entries read by \mathcal{A} on input \mathbf{A}_{0}. Since the choices of the algorithm are deterministic, these same entries will be the first \frac{n}{\epsilon} entries queried by \mathcal{A} from any input matrix which has ones on the diagonal and zeroes in all entries of S, even if the algorithm is adaptive.

Without loss of generality, we can assume that the entries in S are above the diagonal, since reading any entry below the diagonal of an input matrix would be redundant due to symmetry. Consider a graph G with vertex set V=[n] and undirected edge set E=S^{c}, that is, the edge set contains all undirected edges (i,j) such that (i,j),(j,i)\not\in S for distinct i,j\in[n]. Hence, for all \epsilon\in(0,1),

|E|=(n2)nϵ=n22(n2+nϵ)>n222nϵ=(ϵn41)n22ϵn4,\displaystyle|E|={n\choose 2}-\frac{n}{\epsilon}=\frac{n^{2}}{2}-\left(\frac{n}{2}+\frac{n}{\epsilon}\right)>\frac{n^{2}}{2}-\frac{2n}{\epsilon}=\frac{(\frac{\epsilon n}{4}-1)n^{2}}{2\cdot\frac{\epsilon n}{4}},

where the first inequality is true since n2<nϵ\frac{n}{2}<\frac{n}{\epsilon}. Therefore, if we set k=ϵn4k=\frac{\epsilon n}{4}, we conclude by the contrapositive of Fact 3 that there exists a clique of size ϵn4\frac{\epsilon n}{4} in GG. Let TET\subset E denote the set of edges which forms this clique in GG.

Construct \mathbf{A}_{1} such that [\mathbf{A}_{1}]_{ij}=[\mathbf{A}_{0}]_{ij} for all i,j such that (i,j)\not\in T and [\mathbf{A}_{1}]_{ij}=[\mathbf{A}_{1}]_{ji}=1 for all i,j such that (i,j)\in T. Then, there is a principal submatrix containing all ones in \mathbf{A}_{1}, where the off-diagonal entries correspond to edges in T and the diagonal entries are all ones by construction. Hence, \mathbf{A}_{1} contains an all-ones principal submatrix of size \frac{\epsilon n}{4}, and so it has an eigenvalue that is at least \frac{\epsilon n}{4}. Both \mathbf{A}_{0} and \mathbf{A}_{1} have identical values on the diagonal and on the first \frac{n}{\epsilon} off-diagonal entries read by \mathcal{A} (i.e., the entries in S). Therefore, the \frac{n}{\epsilon} entries of \mathbf{A}_{0} and \mathbf{A}_{1} queried by \mathcal{A} are identical, and hence \mathcal{A} cannot distinguish these two matrices, despite the fact that their maximum eigenvalues (\frac{\epsilon n}{4} for \mathbf{A}_{1} and 1 for \mathbf{A}_{0}) differ by \frac{\epsilon n}{4}-1.

After adjusting for constant factors, we conclude that any deterministic (possibly adaptive) algorithm that estimates the eigenvalues of a \{0,1\}-matrix to \epsilon n additive error must observe \Omega(\frac{n}{\epsilon}) entries of the input matrix in the worst case. ∎

6 Applications to Fast Deterministic Algorithms for Linear Algebra

We finally show how to leverage our universal sparsifiers to design the first o(n^{\omega}) time deterministic algorithms for several problems related to singular value and vector approximation. As discussed in Section 1.1.1, throughout this section, runtimes are stated assuming that we have already constructed a deterministic sampling matrix \mathbf{S} satisfying the universal sparsification guarantees of Theorem 1 or Theorem 4. When \mathbf{S} is the adjacency matrix of a Ramanujan graph, this can be done in \widetilde{O}(n/\mathrm{poly}(\epsilon)) time – see Section 1.1.1 and Section 2 for further discussion.

In Section 6.1, we consider approximating the singular values of 𝐀\mathbf{A} to additive error ±ϵmax(n,𝐀1)\pm\epsilon\max(n,\|\mathbf{A}\|_{1}). Observe that if we deterministically compute 𝐀~=𝐀𝐒\mathbf{\widetilde{A}}=\mathbf{A}\circ\mathbf{S} with 𝐀𝐀~2ϵmax(n,𝐀1)\|\mathbf{A}-\mathbf{\widetilde{A}}\|_{2}\leq\epsilon\max(n,\|\mathbf{A}\|_{1}) via Theorem 4, by Weyl’s inequality, all singular values of 𝐀~\mathbf{\widetilde{A}} approximate those of 𝐀\mathbf{A} up to additive error ±ϵmax(n,𝐀1)\pm\epsilon\max(n,\|\mathbf{A}\|_{1}). Thus we can focus on approximating 𝐀~\mathbf{\widetilde{A}}’s singular values.

To do so efficiently, we will use the fact that \mathbf{\widetilde{A}} is sparse and so admits very fast \widetilde{O}\left(\frac{n}{\epsilon^{4}}\right) time matrix-vector products. Focusing on the top singular value, it is well known that the power method, initialized with a vector that has non-negligible (i.e., \Omega(1/\mathrm{poly}(n))) inner product with the top singular vector \mathbf{v}_{1} of \mathbf{\widetilde{A}}, converges to a (1\pm\epsilon) relative error approximation in O(\log n/\epsilon) iterations. I.e., the method outputs \mathbf{\widetilde{v}}_{1} with (1-\epsilon)\sigma_{1}(\mathbf{\widetilde{A}})\leq\|\mathbf{\widetilde{A}}\mathbf{\widetilde{v}}_{1}\|_{2}\leq\sigma_{1}(\mathbf{\widetilde{A}}). Each iteration requires a single matrix-vector product with \mathbf{\widetilde{A}}, yielding total runtime \widetilde{O}\left(\frac{n}{\epsilon^{5}}\right). Since \mathbf{v}_{1} is unknown, one typically initializes the power method with a random vector, which has non-negligible inner product with \mathbf{v}_{1} with high probability. To derandomize this approach, we simply try n orthogonal starting vectors, e.g., the standard basis vectors, of which at least one must have inner product at least 1/\sqrt{n} in magnitude with \mathbf{v}_{1}. This simple brute force approach yields total runtime \widetilde{O}\left(\frac{n^{2}}{\epsilon^{5}}\right), which, for large enough \epsilon, is o(n^{\omega}).
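For concreteness, the derandomized power method just described can be sketched as follows (Python; a simplified sketch rather than the exact procedure analyzed here, with the iteration count left to the caller; the optional project argument is the identity for now and is used for deflation below).

import numpy as np

def deterministic_power_method(M, iters, project=lambda v: v):
    # Try every standard basis vector as the start: at least one has inner
    # product >= 1/sqrt(n) in magnitude with the top singular vector, so no
    # random initialization is needed. Each iteration applies M and M.T once,
    # which is fast when M is the sparse matrix A_tilde = A o S.
    n = M.shape[1]
    best_sigma, best_v = -1.0, None
    for i in range(n):
        v = np.zeros(n)
        v[i] = 1.0
        for _ in range(iters):
            v = project(M.T @ (M @ project(v)))    # power step for (M P)^T (M P)
            nrm = np.linalg.norm(v)
            if nrm == 0.0:
                break
            v /= nrm
        sigma = float(np.linalg.norm(M @ project(v)))
        if sigma > best_sigma:
            best_sigma, best_v = sigma, project(v)
    return best_sigma, best_v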

To approximate more than just the top singular value, one would typically use a block power or Krylov method. However, derandomization is difficult here – unlike in the single vector case, it is not clear how to pick a small set of deterministic starting blocks for which at least one gives a good approximation. Thus, we instead use deflation [Saa11, AZL16]. We first use our deterministic power method to compute 𝐯~1 with ‖𝐀~𝐯~1‖2 ≈ σ1(𝐀~). We then run the method again on the deflated matrix 𝐀~(𝐈−𝐯~1𝐯~1T), which we can still apply to a vector in O~(n/ϵ^4) time. This second run outputs 𝐯~2, which is orthogonal to 𝐯~1 and has ‖𝐀~𝐯~2‖2 ≈ σ1(𝐀~(𝐈−𝐯~1𝐯~1T)). Since 𝐯~1 gives a good approximation to σ1(𝐀~), we then argue that σ1(𝐀~(𝐈−𝐯~1𝐯~1T)) ≈ σ2(𝐀~), and so ‖𝐀~𝐯~2‖2 ≈ σ2(𝐀~). We repeat this process to approximate all large singular values of 𝐀~. The key challenge is bounding the accumulation of error that occurs at each step.

In Section 6.2, we apply the above derandomized power method approach to the problem of testing whether 𝐀 is PSD or has at least one large negative eigenvalue, i.e., an eigenvalue less than −ϵ·max(n,‖𝐀‖1). We reduce this problem to approximating the top singular value of a shifted version of 𝐀.

Finally, in Section 6.3, we consider approximating σ1(𝐀) to high accuracy – e.g., to relative error (1±ϵ) with ϵ=1/poly(n) – under the assumption that σ1(𝐀) ≥ α·max(n,‖𝐀‖1) for some α. Here, one cannot simply approximate σ1(𝐀~) in place of σ1(𝐀), since the error due to sparsification is too large. Instead, we use our deflation approach to compute a 'coarse' approximate set of top singular vectors for 𝐀~. We then use these singular vectors to initialize a block Krylov method run on 𝐀 itself, which we argue converges in O(log(n/ϵ)/poly(α)) iterations, giving total runtime O~(n^2 log(1/ϵ)/poly(α)). The key idea in showing convergence is to argue that since σ1(𝐀) ≥ α·max(n,‖𝐀‖1) by assumption, we have σ_{2/α}(𝐀) ≤ (α/2)·‖𝐀‖1 ≤ σ1(𝐀)/2. So, there must be at least some singular value σp(𝐀) with p ≤ 2/α that is larger than σ_{p+1}(𝐀) by α^2/4·max(n,‖𝐀‖1). The presence of this spectral gap allows us to argue (1) that the first p approximate singular vectors output by our deflation approach applied to 𝐀~ have non-negligible inner product with the true top p singular vectors of 𝐀, and (2) that the block Krylov method initialized with these vectors converges rapidly to a high accuracy approximation to σ1(𝐀), via known gap-dependent convergence bounds [MM15].

6.1 Deterministic Singular Value Approximation

We start by giving a deterministic algorithm to approximate all singular values of a symmetric bounded entry matrix 𝐀 up to error ±ϵ·max(n,‖𝐀‖1) in O~(n^2/poly(ϵ)) time. Our main result appears as Theorem 8.

As discussed, we first deterministically compute a sparsified matrix 𝐀~ from 𝐀 with ‖𝐀−𝐀~‖2 ≤ ϵ·max(n,‖𝐀‖1). We then use a simple brute-force derandomization of the classical power method to approximate the top singular value of 𝐀~ (Lemma 6). To approximate the rest of 𝐀~'s singular values, we repeatedly deflate the matrix and compute the top singular value of the deflated matrix (Lemma 7). By Weyl's inequality, the singular values of 𝐀~ approximate those of 𝐀 up to ±ϵ·max(n,‖𝐀‖1) error. Thus, we can argue that our estimates also approximate the singular values of 𝐀 itself, yielding Theorem 8. We first present the derandomized power method that will be applied to 𝐀~ in Algorithm 2.

Algorithm 2 Deterministic Power Method for Largest Singular Value Estimation
Input: 𝐀∈ℝ^{n×n}, error parameter ϵ∈(0,1)
1: Initialize σ~=0 and t=c·log(n/ϵ)/ϵ, where c is a sufficiently large constant.
2: for k=1→n do
3:     𝐲k=𝐞k, where 𝐞k is the k-th standard basis vector.
4:     for j=1→t do
5:         𝐲k←(𝐀𝐀T)𝐲k.
6:         𝐲k←𝐲k/‖𝐲k‖2.
7:     end for
8:     if ‖𝐀T𝐲k‖2≥σ~ then
9:         𝐳←𝐲k.
10:         σ~←‖𝐀T𝐳‖2.
11:     end if
12: end for
13: return 𝐳
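As an illustration, the following is a minimal NumPy sketch of Algorithm 2. It is schematic rather than an exact rendering: the constant c=4 in the iteration count is an assumed illustrative choice, and for the runtimes claimed above 𝐀 would be stored sparsely (e.g., as a scipy.sparse CSR matrix) so that each product costs time proportional to its number of non-zeros.

import numpy as np

def deterministic_power_method(A, eps, c=4.0):
    """Sketch of Algorithm 2: returns (z, sigma) with
    (1 - eps) * sigma_1(A) <= ||A^T z||_2 <= sigma_1(A).
    A may be a dense ndarray or a scipy.sparse matrix."""
    n = A.shape[0]
    t = int(np.ceil(c * np.log(n / eps) / eps))  # t = c log(n/eps) / eps
    sigma, z = 0.0, None
    for k in range(n):                 # try every standard basis vector e_k
        y = np.zeros(n)
        y[k] = 1.0
        for _ in range(t):             # power iteration on A A^T
            y = A @ (A.T @ y)
            nrm = np.linalg.norm(y)
            if nrm == 0.0:             # e_k lies in the kernel of A^T; skip
                break
            y /= nrm
        est = np.linalg.norm(A.T @ y)
        if est >= sigma:
            sigma, z = est, y
    return z, sigma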

Let 𝐀∈ℝ^{n×n} be a matrix with SVD 𝐀=𝐔𝚺𝐕T, where 𝐔,𝐕∈ℝ^{n×n} are matrices with orthonormal columns containing the left and right singular vectors respectively, and 𝚺 is a diagonal matrix containing the singular values σ1(𝐀)≥σ2(𝐀)≥…≥σn(𝐀)≥0. Observe that any vector 𝐲∈ℝ^n can be written as 𝐲=∑_{i=1}^n ci𝐮i, where the ci are scalar coefficients and the 𝐮i are the columns of 𝐔. We then have the following well-known result, which shows that when the starting unit vector 𝐲 has a large enough inner product with 𝐮1, i.e., |c1|≥1/√n, power iterations on 𝐀𝐀T converge to a (1−ϵ) approximation to σ1(𝐀) within O~(1/ϵ) iterations. We will leverage this lemma to prove the correctness and runtime of our deterministic power method, Algorithm 2.

Lemma 5 (Power iterations – gap independent bound).

Let 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} be a matrix with largest singular value σ1(𝐀)\sigma_{1}(\mathbf{A}) and left singular vectors 𝐮1,,𝐮n\mathbf{u}_{1},\ldots,\mathbf{u}_{n}. Let 𝐲=i=1nci𝐮i\mathbf{y}=\sum_{i=1}^{n}c_{i}\mathbf{u}_{i} be a unit vector where |c1|1n|c_{1}|\geq\frac{1}{\sqrt{n}}. Then, for t=O(log(n/ϵ)ϵ)t=O\big{(}\frac{\log(n/\epsilon)}{\epsilon}\big{)}, if we set 𝐳=(𝐀𝐀T)t𝐲(𝐀𝐀T)t𝐲2\mathbf{z}=\frac{(\mathbf{A}\mathbf{A}^{T})^{t}\mathbf{y}}{\|(\mathbf{A}\mathbf{A}^{T})^{t}\mathbf{y}\|_{2}},

𝐀T𝐳2(1ϵ)σ1(𝐀).\displaystyle\|\mathbf{A}^{T}\mathbf{z}\|_{2}\geq(1-\epsilon)\sigma_{1}(\mathbf{A}).
Proof.

We prove Lemma 5 for completeness. Let 𝐫=(𝐀𝐀T)t𝐲\mathbf{r}=(\mathbf{A}\mathbf{A}^{T})^{t}\mathbf{y}. Then, 𝐳=𝐫𝐫2\mathbf{z}=\frac{\mathbf{r}}{\|\mathbf{r}\|_{2}}. First observe that 𝐀𝐀T=i=1nσi2(𝐀)𝐮i𝐮iT\mathbf{A}\mathbf{A}^{T}=\sum_{i=1}^{n}\sigma^{2}_{i}(\mathbf{A})\mathbf{u}_{i}\mathbf{u}_{i}^{T}. Thus, 𝐫=i=1nci(σi(𝐀))2t𝐮i\mathbf{r}=\sum_{i=1}^{n}c_{i}(\sigma_{i}(\mathbf{A}))^{2t}\mathbf{u}_{i}. Let k[n]k\in[n] be the smallest kk such that σk2(𝐀)(1ϵ)σ12(𝐀)\sigma^{2}_{k}(\mathbf{A})\leq(1-\epsilon)\sigma^{2}_{1}(\mathbf{A}). Observe that if no such kk exists, then for any unit vector 𝐳\mathbf{z}, 𝐀T𝐳2σn(𝐀)>(1ϵ)σ1(𝐀)\|\mathbf{A}^{T}\mathbf{z}\|_{2}\geq\sigma_{n}(\mathbf{A})>(1-\epsilon)\sigma_{1}(\mathbf{A}), and thus the lemma holds trivially.

Let 𝐫=𝐫1+𝐫2\mathbf{r}=\mathbf{r}_{1}+\mathbf{r}_{2} where 𝐫1=i=1k1ciσi2t(𝐀)𝐮i\mathbf{r}_{1}=\sum_{i=1}^{k-1}c_{i}\sigma^{2t}_{i}(\mathbf{A})\mathbf{u}_{i} and 𝐫2=i=knciσi2t(𝐀)𝐮i\mathbf{r}_{2}=\sum_{i=k}^{n}c_{i}\sigma^{2t}_{i}(\mathbf{A})\mathbf{u}_{i}. Then:

𝐫222𝐫122=i=knci2σi4t(𝐀)i=1k1ci2σi4t(𝐀)i=knci2σi4t(𝐀)c12σ14t(𝐀)i=kn(cic1)2(σi(𝐀)σ1(𝐀))4t.\displaystyle\frac{\|\mathbf{r}_{2}\|_{2}^{2}}{\|\mathbf{r}_{1}\|_{2}^{2}}=\frac{\sum_{i=k}^{n}c^{2}_{i}\sigma^{4t}_{i}(\mathbf{A})}{\sum_{i=1}^{k-1}c^{2}_{i}\sigma^{4t}_{i}(\mathbf{A})}\leq\frac{\sum_{i=k}^{n}c^{2}_{i}\sigma^{4t}_{i}(\mathbf{A})}{c^{2}_{1}\sigma^{4t}_{1}(\mathbf{A})}\leq\sum_{i=k}^{n}\bigg{(}\frac{c_{i}}{c_{1}}\bigg{)}^{2}\bigg{(}\frac{\sigma_{i}(\mathbf{A})}{\sigma_{1}(\mathbf{A})}\bigg{)}^{4t}.

Since ci^2≤1 for all i∈[n] (as 𝐲 is a unit vector), and since by assumption |c1|≥1/√n, we have |ci/c1|≤√n for all i∈[n]. Also, for any i≥k, (σi(𝐀)/σ1(𝐀))^2≤(1−ϵ), and since (1−ϵ)^{2t}≤e^{−2tϵ}, we have (1−ϵ)^{2t}≤δ for t=O(log(1/δ)/ϵ). Thus, we have ‖𝐫2‖2^2/‖𝐫1‖2^2≤n^2·δ. Setting δ=ϵ^2/n^2, which corresponds to t=O(log(n/ϵ)/ϵ), we get ‖𝐫2‖2^2/‖𝐫1‖2^2≤ϵ^2.

Next, observe that ‖𝐫‖2^2=‖𝐫1‖2^2+‖𝐫2‖2^2, which implies that ‖𝐫‖2^2≤(1+ϵ^2)‖𝐫1‖2^2, and hence ‖𝐫1‖2^2≥(1−ϵ)‖𝐫‖2^2. Thus, using the facts that 𝐫1T𝐀𝐀T𝐫2=0, that 𝐫2T𝐀𝐀T𝐫2≥0, and that σi^2(𝐀)≥(1−ϵ)σ1^2(𝐀) for i<k (by the minimality of k), we get:

𝐫T𝐀𝐀T𝐫=𝐫1T𝐀𝐀T𝐫1+𝐫2T𝐀𝐀T𝐫2\displaystyle\mathbf{r}^{T}\mathbf{A}\mathbf{A}^{T}\mathbf{r}=\mathbf{r}_{1}^{T}\mathbf{A}\mathbf{A}^{T}\mathbf{r}_{1}+\mathbf{r}_{2}^{T}\mathbf{A}\mathbf{A}^{T}\mathbf{r}_{2} 𝐫1T𝐀𝐀T𝐫1\displaystyle\geq\mathbf{r}_{1}^{T}\mathbf{A}\mathbf{A}^{T}\mathbf{r}_{1}
σk12(𝐀)𝐫122\displaystyle\geq\sigma_{k-1}^{2}(\mathbf{A})\cdot\|\mathbf{r}_{1}\|_{2}^{2}
(1ϵ)2σ12(𝐀)𝐫22.\displaystyle\geq(1-\epsilon)^{2}\cdot\sigma_{1}^{2}(\mathbf{A})\cdot\|\mathbf{r}\|_{2}^{2}.

Finally, dividing both sides of the above inequality by ‖𝐫‖2^2, we get 𝐳T𝐀𝐀T𝐳≥(1−ϵ)^2σ1^2(𝐀). Taking the square root of both sides gives ‖𝐀T𝐳‖2≥(1−ϵ)σ1(𝐀). ∎
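As a quick numeric illustration of Lemma 5 (not part of the proof), the following NumPy snippet checks the bound on a random symmetric test matrix, starting from the standard basis vector with the largest inner product with 𝐮1, which is guaranteed to satisfy |c1|≥1/√n; the constant 4 in the iteration count t is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(0)
n, eps = 200, 0.1
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # symmetric test matrix
u1 = np.linalg.svd(A)[0][:, 0]         # top left singular vector
k = int(np.argmax(np.abs(u1)))         # |<e_k, u_1>| >= 1/sqrt(n)
y = np.zeros(n)
y[k] = 1.0
t = int(np.ceil(4 * np.log(n / eps) / eps))
for _ in range(t):                     # power iteration on A A^T
    y = A @ (A.T @ y)
    y /= np.linalg.norm(y)
assert np.linalg.norm(A.T @ y) >= (1 - eps) * np.linalg.norm(A, 2)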

Next we prove the correctness of Algorithm 2 using Lemma 5. The idea is to run power iterations using all n standard basis vectors as starting vectors. At least one of these vectors must have large inner product with 𝐮1, and so this starting vector gives a good approximation to σ1(𝐀) by Lemma 5. We also prove that the squared singular values of the deflated matrix 𝐀−𝐳𝐳T𝐀 are close to the squared singular values of the original matrix up to an additive error of ϵ·σ1^2(𝐀). This will be used as part of the correctness proof for our main algorithm for singular value approximation.

Lemma 6.

Let 𝐳\mathbf{z} be the output of Algorithm 2 for some matrix 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} with singular values σ1(𝐀)σ2(𝐀)σn(𝐀)\sigma_{1}(\mathbf{A})\geq\sigma_{2}(\mathbf{A})\geq\ldots\geq\sigma_{n}(\mathbf{A}) and error parameter ϵ(0,1)\epsilon\in(0,1) as input. Then:

(1ϵ)σ1(𝐀)𝐀T𝐳2σ1(𝐀).(1-\epsilon)\sigma_{1}(\mathbf{A})\leq\|\mathbf{A}^{T}\mathbf{z}\|_{2}\leq\sigma_{1}(\mathbf{A}).

Further, for any i[n]i\in[n]:

σi2(𝐀𝐳𝐳T𝐀)σi+12(𝐀)+ϵσ12(𝐀).\sigma_{i}^{2}(\mathbf{A}-\mathbf{z}\mathbf{z}^{T}\mathbf{A})\leq\sigma_{i+1}^{2}(\mathbf{A})+\epsilon\sigma^{2}_{1}(\mathbf{A}).
Proof.

Let 𝐮1\mathbf{u}_{1} be the left singular vector corresponding to the largest singular value of 𝐀\mathbf{A}. Observe that since 𝐮12=1\|\mathbf{u}_{1}\|_{2}=1, there exists at least one standard basis vector 𝐞k\mathbf{e}_{k} such that |𝐞kT𝐮1|1n|\mathbf{e}_{k}^{T}\cdot\mathbf{u}_{1}|\geq\frac{1}{\sqrt{n}}. After tt iterations of the inner loop of Algorithm 2, we have 𝐲k=(𝐀𝐀T)t𝐞k(𝐀𝐀T)t𝐞k2\mathbf{y}_{k}=\frac{(\mathbf{A}\mathbf{A}^{T})^{t}\mathbf{e}_{k}}{\|(\mathbf{A}\mathbf{A}^{T})^{t}\mathbf{e}_{k}\|_{2}}. Using Lemma 5, we thus have 𝐀T𝐲k22(1ϵ)σ1(𝐀)\|\mathbf{A}^{T}\mathbf{y}_{k}\|_{2}^{2}\geq(1-\epsilon)\sigma_{1}(\mathbf{A}). Also since 𝐲k\mathbf{y}_{k} is a unit vector, we trivially have 𝐀T𝐲k2σ1(𝐀)\|\mathbf{A}^{T}\mathbf{y}_{k}\|_{2}\leq\sigma_{1}(\mathbf{A}) for all kk. Since for 𝐳\mathbf{z} output by Algorithm 2, 𝐀T𝐳2=maxk𝐀T𝐲k2\|\mathbf{A}^{T}\mathbf{z}\|_{2}=\max_{k}\|\mathbf{A}^{T}\mathbf{y}_{k}\|_{2}, we thus have:

(1ϵ)σ1(𝐀)𝐀T𝐳2σ1(𝐀),(1-\epsilon)\sigma_{1}(\mathbf{A})\leq\|\mathbf{A}^{T}\mathbf{z}\|_{2}\leq\sigma_{1}(\mathbf{A}), (15)

giving our first error bound.

We now prove the second bound on σi2(𝐀𝐳𝐳T𝐀)\sigma_{i}^{2}(\mathbf{A}-\mathbf{zz}^{T}\mathbf{A}). Using the Pythagorean theorem, 𝐀𝐳𝐳T𝐀F2=𝐀F2𝐳𝐳T𝐀F2\|\mathbf{A}-\mathbf{z}\mathbf{z}^{T}\mathbf{A}\|_{F}^{2}=\|\mathbf{A}\|_{F}^{2}-\|\mathbf{z}\mathbf{z}^{T}\mathbf{A}\|_{F}^{2}. We have 𝐳𝐳T𝐀F2=tr(𝐳𝐳T𝐀𝐀T𝐳𝐳T)=𝐳T𝐀𝐀T𝐳=𝐀T𝐳22\|\mathbf{z}\mathbf{z}^{T}\mathbf{A}\|_{F}^{2}=\operatorname{tr}(\mathbf{zz}^{T}\mathbf{AA}^{T}\mathbf{zz}^{T})=\mathbf{z}^{T}\mathbf{AA}^{T}\mathbf{z}=\|\mathbf{A}^{T}\mathbf{z}\|_{2}^{2}. We thus have 𝐀𝐳𝐳T𝐀F2=𝐀F2𝐀T𝐳22\|\mathbf{A}-\mathbf{z}\mathbf{z}^{T}\mathbf{A}\|_{F}^{2}=\|\mathbf{A}\|_{F}^{2}-\|\mathbf{A}^{T}\mathbf{z}\|_{2}^{2}. Combining this with (15):

𝐀𝐳𝐳T𝐀F2\displaystyle\|\mathbf{A}-\mathbf{z}\mathbf{z}^{T}\mathbf{A}\|_{F}^{2} 𝐀F2(1ϵ)2σ12(𝐀)\displaystyle\leq\|\mathbf{A}\|_{F}^{2}-(1-\epsilon)^{2}\sigma_{1}^{2}(\mathbf{A})
𝐀F2σ12(𝐀)+2ϵσ12(𝐀)\displaystyle\leq\|\mathbf{A}\|_{F}^{2}-\sigma_{1}^{2}(\mathbf{A})+2\epsilon\sigma_{1}^{2}(\mathbf{A})
=𝐀𝐀1F2+2ϵσ12(𝐀),\displaystyle=\|\mathbf{A}-\mathbf{A}_{1}\|_{F}^{2}+2\epsilon\sigma_{1}^{2}(\mathbf{A}), (16)

where 𝐀1=𝐮1𝐮1T𝐀\mathbf{A}_{1}=\mathbf{u}_{1}\mathbf{u}_{1}^{T}\mathbf{A} is the best rank-11 approximation to 𝐀\mathbf{A} in the Frobenius norm, with 𝐀𝐀1F2=i=2nσi2(𝐀)\|\mathbf{A}-\mathbf{A}_{1}\|_{F}^{2}=\sum_{i=2}^{n}\sigma_{i}^{2}(\mathbf{A}). So, we have:

j=1nσj2(𝐀𝐳𝐳T𝐀)=𝐀𝐳𝐳T𝐀F2\displaystyle\sum_{j=1}^{n}\sigma_{j}^{2}(\mathbf{A}-\mathbf{z}\mathbf{z}^{T}\mathbf{A})=\|\mathbf{A}-\mathbf{z}\mathbf{z}^{T}\mathbf{A}\|_{F}^{2} 𝐀𝐀1F2+2ϵσ12(𝐀)=j=2nσj2(𝐀)+2ϵσ12(𝐀).\displaystyle\leq\|\mathbf{A}-\mathbf{A}_{1}\|_{F}^{2}+2\epsilon\sigma^{2}_{1}(\mathbf{A})=\sum_{j=2}^{n}\sigma_{j}^{2}(\mathbf{A})+2\epsilon\sigma^{2}_{1}(\mathbf{A}).

Subtracting σn^2(𝐀−𝐳𝐳T𝐀) from the left-hand side and pulling σi^2(𝐀−𝐳𝐳T𝐀) out of the summation for some i∈[n−1] gives that

σi2(𝐀𝐳𝐳T𝐀)+ji,j[n1]σj2(𝐀𝐳𝐳T𝐀)\displaystyle\sigma_{i}^{2}(\mathbf{A}-\mathbf{z}\mathbf{z}^{T}\mathbf{A})+\sum_{j\neq i,j\in[n-1]}\sigma_{j}^{2}(\mathbf{A}-\mathbf{z}\mathbf{z}^{T}\mathbf{A}) j=2nσj2(𝐀)+2ϵσ12(𝐀).\displaystyle\leq\sum_{j=2}^{n}\sigma_{j}^{2}(\mathbf{A})+2\epsilon\sigma^{2}_{1}(\mathbf{A}). (17)

Using the eigenvalue min-max theorem (Fact 4), we have for all j[n1]j\in[n-1], σj(𝐀𝐳𝐳T𝐀)σj+1(𝐀)\sigma_{j}(\mathbf{A}-\mathbf{z}\mathbf{z}^{T}\mathbf{A})\geq\sigma_{j+1}(\mathbf{A}). Using this fact in (17),

σi2(𝐀𝐳𝐳T𝐀)+ji,j[n1]σj+12(𝐀)\displaystyle\sigma_{i}^{2}(\mathbf{A}-\mathbf{z}\mathbf{z}^{T}\mathbf{A})+\sum_{j\neq i,j\in[n-1]}\sigma_{j+1}^{2}(\mathbf{A}) j=2nσj2(𝐀)+2ϵσ12(𝐀),\displaystyle\leq\sum_{j=2}^{n}\sigma_{j}^{2}(\mathbf{A})+2\epsilon\sigma^{2}_{1}(\mathbf{A}),

which implies that σi2(𝐀𝐳𝐳T𝐀)σi+12(𝐀)+2ϵσ12(𝐀)\sigma_{i}^{2}(\mathbf{A}-\mathbf{z}\mathbf{z}^{T}\mathbf{A})\leq\sigma_{i+1}^{2}(\mathbf{A})+2\epsilon\sigma^{2}_{1}(\mathbf{A}). This completes the proof after adjusting ϵ\epsilon by a factor of 22. ∎

We now state as Algorithm 3 our main deterministic algorithm for estimating k singular values of a matrix by leveraging Algorithm 2 as a subroutine. We will then show that by applying Algorithm 3 to a sparse spectral approximation 𝐀~ of 𝐀, we can estimate the singular values of 𝐀 up to error ±ϵ·max(n,‖𝐀‖1) in O~(n^2/poly(ϵ)) time.

Algorithm 3 Deterministic Singular Value Estimation via Deflation
Input: 𝐀∈ℝ^{n×n}, error parameter ϵ∈(0,1), number of singular values to be estimated k≤n
1: 𝐀^{(1)}=𝐀.
2: for i=1→k do
3:     𝐳i← output of Algorithm 2 with input 𝐀^{(i)} and error parameter ϵ.
4:     𝐀^{(i+1)}←𝐀^{(i)}−𝐳i𝐳iT𝐀.
5: end for
6: return 𝐳1,𝐳2,…,𝐳k
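The deflation loop admits an equally short NumPy sketch, reusing deterministic_power_method from the sketch above. Here the deflated matrix is formed explicitly and densely for clarity; the runtime analysis following Theorem 8 explains how to instead apply it implicitly when the input is sparse.

import numpy as np

def deflation_singular_values(A, eps, k):
    """Sketch of Algorithm 3: returns z_1, ..., z_k as the columns of an
    n x k matrix, with ||A^T z_i||_2 approximating sigma_i(A) as in Lemma 7."""
    A_i = np.array(A, dtype=float)          # A^(1) = A
    cols = []
    for _ in range(k):
        z, _ = deterministic_power_method(A_i, eps)
        cols.append(z)
        A_i = A_i - np.outer(z, A.T @ z)    # A^(i+1) = A^(i) - z_i z_i^T A
    return np.column_stack(cols)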
Lemma 7.

Let {𝐳1,𝐳2,…,𝐳k} be the output of Algorithm 3 for some matrix 𝐀∈ℝ^{n×n} with singular values σ1(𝐀)≥σ2(𝐀)≥…≥σn(𝐀) and error parameter ϵ∈(0,1) as input. Then {𝐳1,𝐳2,…,𝐳k} are orthonormal vectors and, for any i∈[k]:

(1ϵ)σi(𝐀)𝐀T𝐳i2σi(𝐀)+σ1(𝐀)iϵ(1-\epsilon)\sigma_{i}(\mathbf{A})\leq\|\mathbf{A}^{T}\mathbf{z}_{i}\|_{2}\leq\sigma_{i}(\mathbf{A})+\sigma_{1}(\mathbf{A})\sqrt{i\cdot\epsilon}
Proof.

Let σ~i=(𝐀(i))T𝐳i2\widetilde{\sigma}_{i}=\|(\mathbf{A}^{(i)})^{T}\mathbf{z}_{i}\|_{2} for all ii. Since each 𝐳i\mathbf{z}_{i} is the output of Algorithm 2 with 𝐀(i)\mathbf{A}^{(i)} as the input, using the bound of Lemma 6, for any i[k]i\in[k] we get that

(1ϵ)σ1(𝐀(i))σ~iσ1(𝐀(i)).(1-\epsilon)\sigma_{1}(\mathbf{A}^{(i)})\leq\widetilde{\sigma}_{i}\leq\sigma_{1}(\mathbf{A}^{(i)}). (18)

Note that 𝐀^{(i)}=𝐀^{(i−1)}−𝐳_{i−1}𝐳_{i−1}T𝐀=𝐀−∑_{j=1}^{i−1}𝐳j𝐳jT𝐀=𝐀−𝐙_{i−1}𝐙_{i−1}T𝐀, where 𝐙_{i−1} is a matrix with i−1 columns whose jth column is 𝐳j. We first prove that 𝐙_{i−1} has orthonormal columns. Each 𝐳i is a unit vector, since Algorithm 2 always outputs a unit vector. We prove that the 𝐳i are mutually orthogonal by induction. Suppose that all 𝐳k with k∈[j] are orthogonal to each other for some j<i−1. Now, 𝐀^{(j+1)}=𝐀−𝐙j𝐙jT𝐀=(𝐈−𝐙j𝐙jT)𝐀, where 𝐈 is the identity matrix. Observe that 𝐳_{j+1} lies in the column span of 𝐀^{(j+1)}. Thus, to prove that 𝐳_{j+1} is orthogonal to all 𝐳k with k∈[j], it is enough to show that each such 𝐳k is orthogonal to the column space of 𝐀^{(j+1)}. This follows because 𝐳k=𝐙j𝐙jT𝐳k, as 𝐙j𝐙jT is the projection onto the span of 𝐳1,…,𝐳j. Thus, (𝐀^{(j+1)})T𝐳k=𝐀T(𝐈−𝐙j𝐙jT)𝐳k=𝐀T(𝐳k−𝐳k)=0, and so 𝐳_{j+1} is orthogonal to all 𝐳k with k∈[j]. Hence, 𝐙_{i−1} is a matrix with orthonormal columns, which implies that 𝐙_{i−1}𝐙_{i−1}T is a rank (i−1) projection matrix.

By the above orthogonality claim, 𝐙i1T𝐳i=𝟎\mathbf{Z}_{i-1}^{T}\mathbf{z}_{i}=\mathbf{0} and we have σ~i=(𝐀(i))T𝐳i2=(𝐀T𝐀T𝐙i1𝐙i1T)𝐳i2=𝐀T𝐳i2\widetilde{\sigma}_{i}=\|(\mathbf{A}^{(i)})^{T}\mathbf{z}_{i}\|_{2}=\|(\mathbf{A}^{T}-\mathbf{A}^{T}\mathbf{Z}_{i-1}\mathbf{Z}_{i-1}^{T})\mathbf{z}_{i}\|_{2}=\|\mathbf{A}^{T}\mathbf{z}_{i}\|_{2}. So, to prove the lemma, it is enough to bound σ~i\widetilde{\sigma}_{i}. We first prove the lower bound on σ~i\widetilde{\sigma}_{i} using the min-max principle of eigenvalues, which we state below.

Fact 4 (Eigenvalue Minimax Principle – see e.g., [Bha13]).

Let 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} be a symmetric matrix with eigenvalues λ1(𝐀)λ2(𝐀)λn(𝐀)\lambda_{1}(\mathbf{A})\geq\lambda_{2}(\mathbf{A})\geq\ldots\geq\lambda_{n}(\mathbf{A}), then for all i[n]i\in[n],

λi(𝐀)\displaystyle\lambda_{i}(\mathbf{A}) =maxS:dim(S)=imin𝐱S,𝐱2=1𝐱T𝐀𝐱=minS:dim(S)=ni+1max𝐱S,𝐱2=1𝐱T𝐀𝐱.\displaystyle=\max_{S\colon dim(S)=i}\min_{\mathbf{x}\in S,\|\mathbf{x}\|_{2}=1}\mathbf{x}^{T}\mathbf{A}\mathbf{x}=\min_{S\colon dim(S)=n-i+1}\max_{\mathbf{x}\in S,\|\mathbf{x}\|_{2}=1}\mathbf{x}^{T}\mathbf{A}\mathbf{x}.

By Fact 4, σ1(𝐀(i))2=σ1(𝐀𝐙i1𝐙i1T𝐀)2=λ1((𝐈𝐙i1𝐙i1T)𝐀𝐀T(𝐈𝐙i1𝐙i1T))λi(𝐀𝐀T)=σi(𝐀)2\sigma_{1}(\mathbf{A}^{(i)})^{2}=\sigma_{1}(\mathbf{A}-\mathbf{Z}_{i-1}\mathbf{Z}_{i-1}^{T}\mathbf{A})^{2}=\lambda_{1}((\mathbf{I}-\mathbf{Z}_{i-1}\mathbf{Z}_{i-1}^{T})\mathbf{A}\mathbf{A}^{T}(\mathbf{I}-\mathbf{Z}_{i-1}\mathbf{Z}_{i-1}^{T}))\geq\lambda_{i}(\mathbf{AA}^{T})=\sigma_{i}(\mathbf{A})^{2} since 𝐙i1𝐙i1T\mathbf{Z}_{i-1}\mathbf{Z}_{i-1}^{T} is a rank i1i-1 projection matrix. Thus, from (18), we get σ~i(1ϵ)σ1(𝐀(i))(1ϵ)σi(𝐀)\widetilde{\sigma}_{i}\geq(1-\epsilon)\sigma_{1}(\mathbf{A}^{(i)})\geq(1-\epsilon)\sigma_{i}(\mathbf{A}).

We now prove the upper bound on σ~i\widetilde{\sigma}_{i}. For any i[k]i\in[k] and any j[n]j\in[n] from Lemma 6 we have:

σj2(𝐀(i)𝐳i𝐳iT𝐀(i))σj+12(𝐀(i))+ϵσ12(𝐀(i)).\sigma_{j}^{2}(\mathbf{A}^{(i)}-\mathbf{z}_{i}\mathbf{z}_{i}^{T}\mathbf{A}^{(i)})\leq\sigma_{j+1}^{2}(\mathbf{A}^{(i)})+\epsilon\sigma^{2}_{1}(\mathbf{A}^{(i)}).

Now using the fact that 𝐀(i)𝐳i𝐳iT𝐀(i)=𝐀𝐙i1𝐙i1T𝐀𝐳i𝐳iT𝐀=𝐀𝐙i𝐙iT𝐀\mathbf{A}^{(i)}-\mathbf{z}_{i}\mathbf{z}_{i}^{T}\mathbf{A}^{(i)}=\mathbf{A}-\mathbf{Z}_{i-1}\mathbf{Z}_{i-1}^{T}\mathbf{A}-\mathbf{z}_{i}\mathbf{z}_{i}^{T}\mathbf{A}=\mathbf{A}-\mathbf{Z}_{i}\mathbf{Z}_{i}^{T}\mathbf{A} and σ1(𝐀(i))σ1(𝐀)\sigma_{1}(\mathbf{A}^{(i)})\leq\sigma_{1}(\mathbf{A}), we get:

σj2(𝐀𝐙i𝐙iT𝐀)σj+12(𝐀𝐙i1𝐙i1T𝐀)+ϵσ12(𝐀).\sigma_{j}^{2}(\mathbf{A}-\mathbf{Z}_{i}\mathbf{Z}_{i}^{T}\mathbf{A})\leq\sigma_{j+1}^{2}(\mathbf{A}-\mathbf{Z}_{i-1}\mathbf{Z}_{i-1}^{T}\mathbf{A})+\epsilon\sigma^{2}_{1}(\mathbf{A}). (19)

From (18) we have σ~iσ1(𝐀𝐙i1𝐙i1T𝐀)\widetilde{\sigma}_{i}\leq\sigma_{1}(\mathbf{A}-\mathbf{Z}_{i-1}\mathbf{Z}_{i-1}^{T}\mathbf{A}). Squaring both sides and applying the bound from (19) i1i-1 times repeatedly on the RHS, we get

σ~i2\displaystyle\widetilde{\sigma}^{2}_{i} σ12(𝐀𝐙i1𝐙i1T𝐀)σ22(𝐀𝐙i2𝐙i2T𝐀)+ϵσ12(𝐀)\displaystyle\leq\sigma^{2}_{1}(\mathbf{A}-\mathbf{Z}_{i-1}\mathbf{Z}_{i-1}^{T}\mathbf{A})\leq\sigma^{2}_{2}(\mathbf{A}-\mathbf{Z}_{i-2}\mathbf{Z}_{i-2}^{T}\mathbf{A})+\epsilon\sigma^{2}_{1}(\mathbf{A})
σ32(𝐀𝐙i3𝐙i3T𝐀)+2ϵσ12(𝐀)σi2(𝐀)+iϵσ12(𝐀).\displaystyle\leq\sigma^{2}_{3}(\mathbf{A}-\mathbf{Z}_{i-3}\mathbf{Z}_{i-3}^{T}\mathbf{A})+2\epsilon\sigma^{2}_{1}(\mathbf{A})\leq\ldots\leq\sigma^{2}_{i}(\mathbf{A})+i\cdot\epsilon\cdot\sigma^{2}_{1}(\mathbf{A}).

Finally, taking the square root on both sides and using that √a+√b ≥ √(a+b) for any non-negative a,b gives us the upper bound. ∎

With Lemma 7 in place, we are now ready to prove our main result on deterministic singular value estimation for any bounded entry matrix, Theorem 8.

Proof.

Note that 𝐀\mathbf{A} can have at most 1ϵ\frac{1}{\epsilon} singular values greater than ϵ𝐀1\epsilon\|\mathbf{A}\|_{1}. Thus, it is enough to approximate only the top 1ϵ\frac{1}{\epsilon} singular values of 𝐀\mathbf{A} up to error ϵmax(n,𝐀1)\epsilon\cdot\max(n,\|\mathbf{A}\|_{1}). The rest of the singular values can be approximated by 0 to still have the required approximation error of ϵmax(n,𝐀1)\epsilon\cdot\max(n,\|\mathbf{A}\|_{1}). Let 𝐀~=𝐀𝐒\widetilde{\mathbf{A}}=\mathbf{A}\circ\mathbf{S} be the sparsification of 𝐀\mathbf{A} as described in Theorem 4, which satisfies 𝐀𝐀~2ϵmax(n,𝐀1).\|\mathbf{A}-\widetilde{\mathbf{A}}\|_{2}\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1}). Using Weyl’s inequality we have for all i[n]i\in[n]:

|σi(𝐀)σi(𝐀~)|ϵmax(n,𝐀1).|\sigma_{i}(\mathbf{A})-\sigma_{i}(\widetilde{\mathbf{A}})|\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1}).

Thus it is enough to approximate the top 1/ϵ singular values of 𝐀~ up to ϵ·max(n,‖𝐀‖1) error and then use the triangle inequality to get the final approximation to the top 1/ϵ singular values of 𝐀. To do so, we run Algorithm 3 with 𝐀~ as the input matrix, ϵ^3 as the error parameter, and k=⌈1/ϵ⌉ as the number of singular values to be estimated. Using Lemma 7, we get orthonormal vectors 𝐳i for i∈[⌈1/ϵ⌉] such that:

(1ϵ3)σi(𝐀~)𝐀~𝐳i2σi(𝐀~)+σ1(𝐀~)iϵ3.(1-\epsilon^{3})\sigma_{i}(\widetilde{\mathbf{A}})\leq\|\widetilde{\mathbf{A}}\mathbf{z}_{i}\|_{2}\leq\sigma_{i}(\widetilde{\mathbf{A}})+\sigma_{1}(\widetilde{\mathbf{A}})\sqrt{i\cdot\epsilon^{3}}. (20)

Let σ~i(𝐀)=‖𝐀~𝐳i‖2 for i∈[⌈1/ϵ⌉]. Now, since σ1(𝐀)≤n, using Weyl's inequality, we have σ1(𝐀~)≤σ1(𝐀)+‖𝐀−𝐀~‖2≤σ1(𝐀)+ϵ·max(n,‖𝐀‖1)≤2max(n,‖𝐀‖1). Thus, for any i∈[⌈1/ϵ⌉], σ1(𝐀~)√(i·ϵ^3)≤2ϵ·max(n,‖𝐀‖1). So, from Equation (20), we get:

(1ϵ)σi(𝐀~)σ~i(𝐀)σi(𝐀~)+2ϵmax(n,𝐀1).(1-\epsilon)\sigma_{i}(\widetilde{\mathbf{A}})\leq\widetilde{\sigma}_{i}(\mathbf{A})\leq\sigma_{i}(\widetilde{\mathbf{A}})+2\epsilon\cdot\max(n,\|\mathbf{A}\|_{1}).

This gives us |σ~i(𝐀)σi(𝐀~)|2ϵmax(n,𝐀1)|\widetilde{\sigma}_{i}(\mathbf{A})-\sigma_{i}(\widetilde{\mathbf{A}})|\leq 2\epsilon\cdot\max(n,\|\mathbf{A}\|_{1}). Finally, we get:

|σ~i(𝐀)σi(𝐀)|\displaystyle|\widetilde{\sigma}_{i}(\mathbf{A})-\sigma_{i}(\mathbf{A})| |σ~i(𝐀)σi(𝐀~)|+|σi(𝐀~)σi(𝐀)|3ϵmax(n,𝐀1),\displaystyle\leq|\widetilde{\sigma}_{i}(\mathbf{A})-\sigma_{i}(\widetilde{\mathbf{A}})|+|\sigma_{i}(\widetilde{\mathbf{A}})-\sigma_{i}(\mathbf{A})|\leq 3\epsilon\max(n,\|\mathbf{A}\|_{1}),

where the second inequality follows from the bound on |σ~i(𝐀)−σi(𝐀~)| and Weyl's inequality, which gives us |σi(𝐀~)−σi(𝐀)|≤ϵ·max(n,‖𝐀‖1). This gives us the required additive approximation error after adjusting ϵ by constants. Next, observe that for i∈[⌈1/ϵ⌉], using the triangle inequality:

|𝐀𝐳i2σi(𝐀)|\displaystyle|\|\mathbf{A}\mathbf{z}_{i}\|_{2}-\sigma_{i}(\mathbf{A})| |𝐀𝐳i2𝐀~𝐳i2|+|𝐀~𝐳i2σi(𝐀)|4ϵmax(n,𝐀1),\displaystyle\leq|\|\mathbf{A}\mathbf{z}_{i}\|_{2}-\|\widetilde{\mathbf{A}}\mathbf{z}_{i}\|_{2}|+|\|\widetilde{\mathbf{A}}\mathbf{z}_{i}\|_{2}-\sigma_{i}(\mathbf{A})|\leq 4\epsilon\max(n,\|\mathbf{A}\|_{1}),

where in the second step, the first term is bounded as |𝐀𝐳i2𝐀~𝐳i2|(𝐀𝐀~)𝐳i2𝐀𝐀~2ϵmax(n,𝐀1)|\|\mathbf{A}\mathbf{z}_{i}\|_{2}-\|\widetilde{\mathbf{A}}\mathbf{z}_{i}\|_{2}|\leq\|(\mathbf{A}-\widetilde{\mathbf{A}})\mathbf{z}_{i}\|_{2}\leq\|\mathbf{A}-\widetilde{\mathbf{A}}\|_{2}\leq\epsilon\max(n,\|\mathbf{A}\|_{1}). This gives us the second bound (after adjusting ϵ\epsilon by constants).

Sample Complexity and Runtime Analysis. By Theorem 4, the number of queries to 𝐀's entries needed to construct 𝐀~=𝐀∘𝐒 is O~(n/ϵ^4). The loop estimating the singular values in Algorithm 3 runs ⌈1/ϵ⌉ times, and Algorithm 2 is called in Line 3 inside the loop each time with error parameter ϵ^3. At the ith iteration of the loop in Algorithm 3, in Line 4 we have 𝐀~^{(i+1)}=𝐀~^{(i)}−𝐳i𝐳iT𝐀~=𝐀~−𝐙i𝐙iT𝐀~. Note that since the matrix 𝐀~^{(i+1)} could be dense, we do not explicitly compute 𝐀~^{(i+1)} in order to run power iterations on it in Algorithm 2. Instead, in each step of power iteration in Line 5 of Algorithm 2, we can first calculate 𝐀~𝐲j in O~(n/ϵ^4) time, and then multiply this vector first by 𝐙iT and then by 𝐙i in O(n/ϵ) time to get 𝐙i𝐙iT𝐀~𝐲j. Thus, each power iteration (Lines 5 and 6 of Algorithm 2) takes O~(n/ϵ^4) time. Since we call Algorithm 2 with error parameter ϵ^3, the number of power iterations with each starting vector is O(log(n/ϵ)/ϵ^3). There are also n different starting vectors. Thus, each call to Algorithm 2 in Line 3 of Algorithm 3 takes O~(n^2/ϵ^7) time, and the total running time of the algorithm is O~(n^2/ϵ^8) (as the loop in Algorithm 3 runs ⌈1/ϵ⌉ times). ∎
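To make the implicit product used in this analysis concrete, here is a sketch of one application of the deflated matrix (𝐈−𝐙i𝐙iT)𝐀~ to a vector, assuming 𝐀~ is stored sparsely and 𝐙i is the dense n×i block of vectors found so far:

def deflated_matvec(A_tilde, Z, y):
    """Compute (A_tilde - Z Z^T A_tilde) y without forming the deflated
    matrix: one sparse product with A_tilde (O~(n/eps^4) time) plus two
    thin dense products with the n x i block Z (O(n * i) time)."""
    v = A_tilde @ y
    return v - Z @ (Z.T @ v)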

6.2 Deterministic PSD Testing

We next leverage our deterministic power method approach (Algorithm 2) to give an o(n^ω) time deterministic algorithm for testing whether a bounded entry matrix is PSD or has at least one large negative eigenvalue ≤−ϵ·max(n,‖𝐀‖1). An optimal randomized algorithm for this problem with detection threshold −ϵn was presented in [BCJ20]. The idea of our approach is to approximate the maximum singular value of 𝐁=𝐈−𝐀~/σ~1(𝐀~), which will be at most marginally larger than 1 if 𝐀 is PSD, and substantially larger than 1 if 𝐀 is far from being PSD.

Theorem 14 (Deterministic PSD testing with \ell_{\infty}-gap).

Given ϵ∈(0,1), let 𝐀∈ℝ^{n×n} be a symmetric bounded entry matrix with eigenvalues λ1(𝐀)≥λ2(𝐀)≥…≥λn(𝐀) such that either 𝐀 is PSD, i.e., λn(𝐀)≥0, or 𝐀 is at least ϵ·max(n,‖𝐀‖1)-far from being PSD, i.e., λn(𝐀)≤−ϵ·max(n,‖𝐀‖1). There exists a deterministic algorithm that distinguishes between these two cases by reading O~(n/ϵ^4) entries of 𝐀 and which runs in O~(n^2/ϵ^5) time.

Proof.

Let 𝐀~=𝐀𝐒\widetilde{\mathbf{A}}=\mathbf{A}\circ\mathbf{S} be a sparsification of 𝐀\mathbf{A} as described in Theorem 4 with approximation parameter ϵ10\frac{\epsilon}{10} that satisfies 𝐀𝐀~2ϵ10max(n,𝐀1)\|\mathbf{A}-\widetilde{\mathbf{A}}\|_{2}\leq\frac{\epsilon}{10}\max(n,\|\mathbf{A}\|_{1}). Using Weyl’s inequality we have for all i[n]i\in[n], |λi(𝐀)λi(𝐀~)|ϵ10max(n,𝐀1)|\lambda_{i}(\mathbf{A})-\lambda_{i}(\widetilde{\mathbf{A}})|\leq\frac{\epsilon}{10}\max(n,\|\mathbf{A}\|_{1}).

Let σ~1(𝐀~)\widetilde{\sigma}_{1}(\widetilde{\mathbf{A}}) be the estimate of 𝐀~2=σ1(𝐀~)\|\widetilde{\mathbf{A}}\|_{2}=\sigma_{1}(\widetilde{\mathbf{A}}) output by Algorithm 2 with input 𝐀~\widetilde{\mathbf{A}} and error parameter ϵ100\frac{\epsilon}{100}. Then by Lemma 6, we have:

(1ϵ100)σ1(𝐀~)σ~1(𝐀~)σ1(𝐀~).\left(1-\frac{\epsilon}{100}\right)\sigma_{1}(\widetilde{\mathbf{A}})\leq\widetilde{\sigma}_{1}(\widetilde{\mathbf{A}})\leq\sigma_{1}(\widetilde{\mathbf{A}}).

Let 𝐁=𝐈1σ~1(𝐀~)𝐀~\mathbf{B}=\mathbf{I}-\frac{1}{\widetilde{\sigma}_{1}(\widetilde{\mathbf{A}})}\widetilde{\mathbf{A}}. Let σ~1(𝐁)\widetilde{\sigma}_{1}(\mathbf{B}) be the estimate of σ1(𝐁)\sigma_{1}(\mathbf{B}) output by Algorithm 2 with input 𝐁\mathbf{B} and error parameter ϵ1000\frac{\epsilon}{1000}. Then by Lemma 6, we again have:

(1ϵ1000)σ1(𝐁)σ~1(𝐁)σ1(𝐁).\left(1-\frac{\epsilon}{1000}\right)\sigma_{1}(\mathbf{B})\leq\widetilde{\sigma}_{1}(\mathbf{B})\leq\sigma_{1}(\mathbf{B}).

We now distinguish between the two cases as follows:

Case 1: When 𝐀 is PSD, its entries are bounded in magnitude by 1, so ‖𝐀‖1=tr(𝐀)≤n, and hence max(n,‖𝐀‖1)=n. We thus have λn(𝐀~)≥λn(𝐀)−ϵn/10≥−ϵn/10. Now, σ~1(𝐁)≤σ1(𝐁)=max(|λ1(𝐁)|,|λn(𝐁)|)=max(|1−λn(𝐀~)/σ~1(𝐀~)|, |1−λ1(𝐀~)/σ~1(𝐀~)|).

First assume that |1−λn(𝐀~)/σ~1(𝐀~)|≥|1−λ1(𝐀~)/σ~1(𝐀~)|. Then, since λn(𝐀~)≥−ϵn/10, we have 1−λn(𝐀~)/σ~1(𝐀~)≤1+ϵn/(10σ~1(𝐀~)), and thus σ~1(𝐁)≤1+ϵn/(10σ~1(𝐀~)). Now, assume |1−λn(𝐀~)/σ~1(𝐀~)|<|1−λ1(𝐀~)/σ~1(𝐀~)|. Then we must have λ1(𝐀~)≥0, as otherwise we would have 1−λn(𝐀~)/σ~1(𝐀~)≥1−λ1(𝐀~)/σ~1(𝐀~)>0. Thus, we get σ~1(𝐁)≤|1−λ1(𝐀~)/σ~1(𝐀~)|=|1−σ1(𝐀~)/σ~1(𝐀~)|≤|1−1/(1−ϵ/100)|=(ϵ/100)/(1−ϵ/100)<ϵ/50<1+ϵn/(10σ~1(𝐀~)).

Case 2: When 𝐀 is ϵ·max(n,‖𝐀‖1) far from being PSD, we have λn(𝐀~)≤λn(𝐀)+ϵ·max(n,‖𝐀‖1)/10≤−(9/10)ϵ·max(n,‖𝐀‖1). Observe that σ1(𝐁)≥1−λn(𝐀~)/σ~1(𝐀~)≥1+(9/(10σ~1(𝐀~)))·ϵ·max(n,‖𝐀‖1). Then, we have σ~1(𝐁)≥(1−ϵ/1000)σ1(𝐁)≥(1−ϵ/1000)(1+(9/(10σ~1(𝐀~)))·ϵ·max(n,‖𝐀‖1))>1+(2/(10σ~1(𝐀~)))·ϵ·max(n,‖𝐀‖1)≥1+2ϵn/(10σ~1(𝐀~)).

Thus, in Case 1 we have σ~1(𝐁)≤1+ϵn/(10σ~1(𝐀~)), while in Case 2 we have σ~1(𝐁)>1+2ϵn/(10σ~1(𝐀~)). So we can distinguish between the two cases by comparing σ~1(𝐁) against the computable threshold 1+1.5ϵn/(10σ~1(𝐀~)).

Sample Complexity and Runtime Analysis. The total runtime is the time taken to estimate σ1(𝐀~) and σ1(𝐁) using Algorithm 2 with error parameter Θ(ϵ). This is O~(n^2/ϵ^5), since we run O(log(n/ϵ)/ϵ) iterations of the power method with n different starting vectors, where each iteration takes O~(n/ϵ^4) time, as 𝐀~ and 𝐁 both have O~(n/ϵ^4) non-zero entries. The sample complexity required to form 𝐀~ and 𝐁 is O~(n/ϵ^4). ∎
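The resulting tester is short; the following sketch combines the two power method calls above. Here 𝐀~ is assumed to be the sparsification of Theorem 4 with parameter ϵ/10, and the acceptance threshold 1+1.5ϵn/(10σ~1(𝐀~)) is our illustrative choice, sitting midway between the Case 1 and Case 2 bounds.

import scipy.sparse as sp

def psd_test(A_tilde, eps):
    """Sketch of the Theorem 14 tester: returns True if A is declared PSD,
    False if it is declared eps * max(n, ||A||_1)-far from PSD."""
    n = A_tilde.shape[0]
    _, sigma = deterministic_power_method(A_tilde, eps / 100)
    B = sp.identity(n, format="csr") - A_tilde / sigma  # B = I - A_tilde / sigma~
    _, sigma_B = deterministic_power_method(B, eps / 1000)
    return sigma_B <= 1 + 1.5 * eps * n / (10 * sigma)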

6.3 High Accuracy Spectral Norm Approximation

Finally, we show that, under the assumption that σ1(𝐀)≥α·max(n,‖𝐀‖1) for some α∈(0,1), we can deterministically compute σ~1(𝐀) such that σ~1(𝐀)≥(1−ϵ)σ1(𝐀) in O~(n^2 log(1/ϵ)/poly(α)) time. This allows us to set ϵ=1/n^c to get a high accuracy approximation to σ1(𝐀) in time roughly linear in the input matrix size. This is the first o(n^ω) time deterministic algorithm for this problem, and to the best of our knowledge it matches the best known randomized methods for high accuracy top singular value computation, up to a polylog(n,1/α) factor. Our approach is described in Algorithm 4 below.

Algorithm 4 High Accuracy Spectral Norm Approximator
Input: symmetric 𝐀∈ℝ^{n×n}, gap parameter α, and error parameter ϵ∈(0,1)
1: Let 𝐀~=𝐀∘𝐒, where 𝐒 is as specified in Theorem 4 with error parameter cα^4 for a sufficiently small constant c.
2: 𝐙← output of Algorithm 3 with inputs 𝐀~, error parameter ϵ~=cα^4, and k=2/α.
3: 𝐙~← output of the Block Krylov method (Algorithm 2 of [MM15]) with input matrix 𝐀, starting block 𝐙, and number of iterations q=c′·log(n/ϵ)/α^{3/2} for a sufficiently large constant c′.
4: return σ~1(𝐀)=√(𝐳~1T𝐀𝐀T𝐳~1), where 𝐳~1 is the first column of 𝐙~.
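A condensed NumPy sketch of Algorithm 4 follows, reusing deflation_singular_values from the sketch in Section 6.1. The Block Krylov step of [MM15] is rendered schematically as a Krylov basis plus Rayleigh-Ritz on 𝐀𝐀T, with each block re-orthonormalized for numerical stability; both matrices are taken as dense arrays for simplicity, and the constants c and c′ are illustrative assumptions.

import numpy as np

def high_accuracy_sigma1(A, A_tilde, alpha, eps, c=0.01, c_prime=4.0):
    """Sketch of Algorithm 4: returns an estimate >= (1 - eps) * sigma_1(A)."""
    n = A.shape[0]
    k = int(np.ceil(2 / alpha))
    Z = deflation_singular_values(A_tilde, c * alpha**4, k)   # Step 2: coarse block
    q = int(np.ceil(c_prime * np.log(n / eps) / alpha**1.5))  # Step 3: Block Krylov
    blocks, Y = [Z], Z
    for _ in range(q):
        Y = A @ (A.T @ Y)                    # next Krylov block (A A^T) Y
        Y, _ = np.linalg.qr(Y)               # re-orthonormalize to avoid overflow
        blocks.append(Y)
    K, _ = np.linalg.qr(np.hstack(blocks))   # orthonormal basis of the Krylov space
    M = K.T @ (A @ (A.T @ K))                # Rayleigh-Ritz on A A^T
    _, V = np.linalg.eigh(M)                 # eigenpairs in ascending order
    z1 = K @ V[:, -1]                        # top Ritz vector
    return float(np.sqrt(z1 @ (A @ (A.T @ z1))))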

Algorithm 4 first computes (Step 2) a coarse approximate subspace of top singular vectors for 𝐀\mathbf{A} using our deflation based Algorithm 3 applied to the sparsified matrix 𝐀~\mathbf{\widetilde{A}}. This subspace is then used in Step 3 to initialize a Block Krylov method applied to 𝐀\mathbf{A} to compute a much more accurate approximation to σ1(𝐀)\sigma_{1}(\mathbf{A}). Our proof will leverage the analysis of Block Krylov methods given in [MM15]. Specifically, we will apply Theorem 13 of [MM15] which bounds the number of iterations required to find a (1±ϵ)(1\pm\epsilon) accurate approximation to the top kk singular values in terms of the singular value gap σk(𝐀)σp+1(𝐀)\frac{\sigma_{k}(\mathbf{A})}{\sigma_{p+1}(\mathbf{A})} for some block size pkp\geq k.

The proof of Theorem 13 in [MM15] uses a degree qq amplifying polynomial f(𝐀)f(\mathbf{A}) (for large enough qq) in the Krylov subspace to significantly separate the top kk singular values from the rest. It is shown that if the starting block 𝐙\mathbf{Z} is such that f(𝐀)f(\mathbf{A}) projected onto the column span of f(𝐀)𝐙f(\mathbf{A})\mathbf{Z} is a good rank pp approximation to f(𝐀)f(\mathbf{A}), then the approximate singular values and vectors output by the Block Krylov method will be accurate. To prove this condition for our starting block, we will apply the following well-known result, which bounds the error of any low rank approximation to a matrix in terms of the error of the best possible low rank approximation.

Lemma 8 (Lemma 48 of [Woo14], proved in [BDMI14]).

Let 𝐀=𝐀𝐒𝐒T+𝐄\mathbf{A}=\mathbf{A}\mathbf{S}\mathbf{S}^{T}+\mathbf{E} be a low-rank matrix factorization of 𝐀\mathbf{A}, with 𝐒n×k\mathbf{S}\in\mathbb{R}^{n\times k}, and 𝐒T𝐒=𝐈k\mathbf{S}^{T}\mathbf{S}=\mathbf{I}_{k}. Let 𝐙n×c\mathbf{Z}\in\mathbb{R}^{n\times c} with ckc\geq k be any matrix such that rank(𝐒T𝐙)=rank(𝐒)=k\operatorname{rank}{(\mathbf{S}^{T}\mathbf{Z})}=\operatorname{rank}{(\mathbf{S})}=k. Let 𝐐\mathbf{Q} be an orthonormal basis for the column span of 𝐀𝐙\mathbf{AZ}. Then,

𝐀𝐐𝐐T𝐀F2𝐄F2+𝐄𝐙(𝐒T𝐙)+F2.\|\mathbf{A}-\mathbf{Q}\mathbf{Q}^{T}\mathbf{A}\|_{F}^{2}\leq\|\mathbf{E}\|_{F}^{2}+\|\mathbf{E}\mathbf{Z}(\mathbf{S}^{T}\mathbf{Z})^{+}\|_{F}^{2}.

The above result lets us bound the difference between the Frobenius norm error of projecting f(𝐀) onto the column span of f(𝐀)𝐙 and the error of the best rank p approximation of f(𝐀). The proof of Theorem 13 of [MM15] uses Lemma 4 of [MM15] (which in turn is proven using Lemma 8 above) to show that ‖f(𝐀)−𝐐𝐐Tf(𝐀)‖F^2≤n^c·‖f(𝐀)−(f(𝐀))p‖F^2, where 𝐐 is an orthonormal basis for the column span of f(𝐀)𝐙 for a random starting block 𝐙, c is some constant, and (f(𝐀))p is the best rank p approximation to f(𝐀). In order to apply this result in our setting, we need to show that the same bound applies for our starting block 𝐙, which is computed deterministically by Algorithm 3.

In particular, by Lemma 8, it suffices to bound (𝐕pT𝐙)+2\|(\mathbf{V}_{p}^{T}\mathbf{Z})^{+}\|_{2}, where 𝐕p\mathbf{V}_{p} contains the top pp singular vectors of 𝐀\mathbf{A} (which are identical to those of f(𝐀)f(\mathbf{A})). That is, we need to ensure that the columns of 𝐙\mathbf{Z} and the top pp singular vectors of 𝐀\mathbf{A} have large inner product with each other. We next prove that this is the case when there is a large gap between the top pp singular values and the rest. We will later show that under the assumption that σ1(𝐀)αmax(n,𝐀1)\sigma_{1}(\mathbf{A})\geq\alpha\cdot\max(n,\|\mathbf{A}\|_{1}), there is always some pp that satisfies this gap assumption.

Lemma 9 (Starting Block Condition).

Let 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} be a bounded entry symmetric matrix with singular values σ1(𝐀)σ2(𝐀)σn(𝐀)\sigma_{1}(\mathbf{A})\geq\sigma_{2}(\mathbf{A})\geq\ldots\geq\sigma_{n}(\mathbf{A}). Let 𝐀~=𝐀𝐒\widetilde{\mathbf{A}}=\mathbf{A}\circ\mathbf{S} where the sampling matrix 𝐒\mathbf{S} is as specified in Theorem 4 with error parameter ϵ=cα4\epsilon=c\alpha^{4} for a sufficiently small constant cc and some α(0,1)\alpha\in(0,1). For some k[n]k\in[n], let 𝐙n×k\mathbf{Z}\in\mathbb{R}^{n\times k} be the output of Algorithm 3 with input 𝐀~\widetilde{\mathbf{A}} and parameters k=2αk=\frac{2}{\alpha} and ϵ=cα4\epsilon=c\alpha^{4}. For some p2αp\leq\frac{2}{\alpha}, assume that for some constant c1c_{1},

σp2(𝐀)σp+12(𝐀)c1α3(max(n,𝐀1))2.\sigma_{p}^{2}(\mathbf{A})-\sigma_{p+1}^{2}(\mathbf{A})\geq{c}_{1}\alpha^{3}\cdot(\max(n,\|\mathbf{A}\|_{1}))^{2}.

Also let 𝐕pn×p\mathbf{V}_{p}\in\mathbb{R}^{n\times p} be the matrix with columns equal to the pp singular vectors of 𝐀\mathbf{A} corresponding to its largest pp singular values. Let 𝐙p\mathbf{Z}_{p} contain the first pp columns of 𝐙\mathbf{Z}. Then for some constant c2c_{2},

(𝐕pT𝐙p)+2c2n.\|(\mathbf{V}_{p}^{T}\mathbf{Z}_{p})^{+}\|_{2}\leq c_{2}n.
Proof.

Before proving the main result, we will bound the error between σj^2(𝐀) and ‖𝐀𝐳j‖2^2, where 𝐳j is a column of 𝐙, for any j∈[⌈2/α⌉]. From Lemma 7, we have for any j∈[⌈2/α⌉]:

𝐀~𝐳j2(1cα4)σj(𝐀~),\|\widetilde{\mathbf{A}}\mathbf{z}_{j}\|_{2}\geq(1-c\alpha^{4})\sigma_{j}(\widetilde{\mathbf{A}}),

since we run Algorithm 3 with error parameter ϵ=cα4\epsilon=c\alpha^{4}. Also since 𝐀~𝐀2cα4max(n,𝐀1)\|\widetilde{\mathbf{A}}-\mathbf{A}\|_{2}\leq c\alpha^{4}\max(n,\|\mathbf{A}\|_{1}), using Weyl’s inequality and the fact that σj(𝐀)max(n,𝐀1)\sigma_{j}(\mathbf{A})\leq\max(n,\|\mathbf{A}\|_{1}) we have (for some constant bb)

σj(𝐀)cα4max(n,𝐀1)σj(𝐀~)σj(𝐀)+cα4max(n,𝐀1)bmax(n,𝐀1).\sigma_{j}(\mathbf{A})-c\alpha^{4}\max(n,\|\mathbf{A}\|_{1})\leq\sigma_{j}(\widetilde{\mathbf{A}})\leq\sigma_{j}(\mathbf{A})+c\alpha^{4}\max(n,\|\mathbf{A}\|_{1})\leq b\max(n,\|\mathbf{A}\|_{1}).

Now, using triangle inequality,

𝐀𝐳j2𝐀~𝐳j2(𝐀~𝐀)𝐳j2(1cα4)σj(𝐀~)cα4max(n,𝐀1),\|\mathbf{A}\mathbf{z}_{j}\|_{2}\geq\|\widetilde{\mathbf{A}}\mathbf{z}_{j}\|_{2}-\|(\widetilde{\mathbf{A}}-\mathbf{A})\mathbf{z}_{j}\|_{2}\geq(1-c\alpha^{4})\sigma_{j}(\widetilde{\mathbf{A}})-c\alpha^{4}\max(n,\|\mathbf{A}\|_{1}),

where the second inequality uses that ‖(𝐀~−𝐀)𝐳j‖2≤‖𝐀~−𝐀‖2‖𝐳j‖2≤cα^4·max(n,‖𝐀‖1). Thus, using the bounds on σj(𝐀~), we get:

𝐀𝐳j2\displaystyle\|\mathbf{A}\mathbf{z}_{j}\|_{2} σj(𝐀)cα4max(n,𝐀1)cα4σj(𝐀~)cα4max(n,𝐀1)\displaystyle\geq\sigma_{j}(\mathbf{A})-c\alpha^{4}\max(n,\|\mathbf{A}\|_{1})-c\alpha^{4}\sigma_{j}(\widetilde{\mathbf{A}})-c\alpha^{4}\max(n,\|\mathbf{A}\|_{1})
σj(𝐀)c′′α4max(n,𝐀1),\displaystyle\geq\sigma_{j}(\mathbf{A})-c^{\prime\prime}\alpha^{4}\max(n,\|\mathbf{A}\|_{1}),

for some constant c′′. Finally, squaring both sides, we get for any j∈[⌈2/α⌉]:

𝐀𝐳j22\displaystyle\|\mathbf{A}\mathbf{z}_{j}\|_{2}^{2} σj2(𝐀)2c′′α4σj(𝐀)max(n,𝐀1)+(c′′)2α8(max(n,𝐀1))2\displaystyle\geq\sigma^{2}_{j}(\mathbf{A})-2c^{\prime\prime}\alpha^{4}\sigma_{j}(\mathbf{A})\max(n,\|\mathbf{A}\|_{1})+(c^{\prime\prime})^{2}\alpha^{8}(\max(n,\|\mathbf{A}\|_{1}))^{2}
σj2(𝐀)2c′′α4(max(n,𝐀1))2.\displaystyle\geq\sigma^{2}_{j}(\mathbf{A})-2c^{\prime\prime}\alpha^{4}(\max(n,\|\mathbf{A}\|_{1}))^{2}. (21)

We are now ready to prove the main result of the lemma. Note that showing (𝐕pT𝐙p)+2c2n\|(\mathbf{V}_{p}^{T}\mathbf{Z}_{p})^{+}\|_{2}\leq c_{2}n is equivalent to showing that σmin(𝐕pT𝐙p)1c2n\sigma_{\min}(\mathbf{V}_{p}^{T}\mathbf{Z}_{p})\geq\frac{1}{c_{2}n}. Assume for contradiction that σmin(𝐕pT𝐙p)1c2n\sigma_{\min}(\mathbf{V}_{p}^{T}\mathbf{Z}_{p})\leq\frac{1}{c_{2}n}. Then there is some vector 𝐲\mathbf{y} with 𝐲2=1\|\mathbf{y}\|_{2}=1 such that 𝐙pT𝐕p𝐲21c2n\|\mathbf{Z}_{p}^{T}\mathbf{V}_{p}\mathbf{y}\|_{2}\leq\frac{1}{c_{2}n}. Let 𝐰=𝐕p𝐲\mathbf{w}=\mathbf{V}_{p}\mathbf{y}. Observe that 𝐰2=1\|\mathbf{w}\|_{2}=1 since 𝐕p\mathbf{V}_{p} has orthonormal columns. Then, for any column 𝐳i\mathbf{z}_{i} of 𝐙p\mathbf{Z}_{p},

𝐰𝐰T𝐳i2=|𝐰T𝐳i|𝐰T𝐙p21c2n,\displaystyle\|\mathbf{w}\mathbf{w}^{T}\mathbf{z}_{i}\|_{2}=|\mathbf{w}^{T}\mathbf{z}_{i}|\leq\|\mathbf{w}^{T}\mathbf{Z}_{p}\|_{2}\leq\frac{1}{c_{2}n}, (22)

where the last step follows by our assumption. Then, using the triangle inequality and the above bound, we have ‖𝐀(𝐈−𝐰𝐰T)𝐳i‖2≥‖𝐀𝐳i‖2−‖𝐀𝐰𝐰T𝐳i‖2≥‖𝐀𝐳i‖2−‖𝐀‖2/(c2n)≥‖𝐀𝐳i‖2−1/c2, where the last inequality uses the fact that ‖𝐀‖2≤n since 𝐀 has bounded entries. Thus, for any i∈[p], squaring both sides of this bound, we get:

𝐀(𝐈𝐰𝐰T)𝐳i22\displaystyle\|\mathbf{A}(\mathbf{I}-\mathbf{ww}^{T})\mathbf{z}_{i}\|_{2}^{2} 𝐀𝐳i222𝐀𝐳i2c2\displaystyle\geq\|\mathbf{A}\mathbf{z}_{i}\|_{2}^{2}-\frac{2\|\mathbf{A}\mathbf{z}_{i}\|_{2}}{c_{2}}
𝐀𝐳i222c2max(n,𝐀1)\displaystyle\geq\|\mathbf{A}\mathbf{z}_{i}\|_{2}^{2}-\frac{2}{c_{2}}\max(n,\|\mathbf{A}\|_{1})
σi2(𝐀)c3α4max(n,𝐀1)2,\displaystyle\geq\sigma^{2}_{i}(\mathbf{A})-c_{3}\alpha^{4}\max(n,\|\mathbf{A}\|_{1})^{2}, (23)

for some constant c3c_{3}, where the second inequality uses 𝐀𝐳i2n\|\mathbf{A}\mathbf{z}_{i}\|_{2}\leq n, and the last inequality uses (21). Moreover since 𝐰\mathbf{w} is spanned by 𝐕p\mathbf{V}_{p}, we have:

‖𝐀𝐰‖2^2 ≥ min_{𝐱:‖𝐱‖2=1}‖𝐀𝐕p𝐱‖2^2 ≥ σp^2(𝐀) > σ_{p+1}^2(𝐀)+c1α^3·(max(n,‖𝐀‖1))^2, (24)

where the final inequality is by the gap assumption in the lemma statement.

Consider the matrix 𝐖=[𝐰,(𝐈𝐰𝐰T)𝐙p]\mathbf{W}=[\mathbf{w},(\mathbf{I}-\mathbf{ww}^{T})\mathbf{Z}_{p}]. Next, we show that under our assumption that σmin(𝐕pT𝐙p)1c2n\sigma_{\min}(\mathbf{V}_{p}^{T}\mathbf{Z}_{p})\leq\frac{1}{c_{2}n}, 𝐖\mathbf{W} will be very close to orthonormal, and further that by (23) and (24), 𝐀𝐖F2>i=1p+1σi2(𝐀)\|\mathbf{AW}\|_{F}^{2}>\sum_{i=1}^{p+1}\sigma_{i}^{2}(\mathbf{A}). Since 𝐖\mathbf{W} has p+1p+1 columns, this will lead to a contradiction since we must have 𝐀𝐖F2𝐀𝐕p+1F2=i=1p+1σi2(𝐀)\|\mathbf{AW}\|_{F}^{2}\leq\|\mathbf{AV}_{p+1}\|_{F}^{2}=\sum_{i=1}^{p+1}\sigma_{i}^{2}(\mathbf{A}). In particular, using (23) and (24),

𝐀𝐖F2\displaystyle\|\mathbf{A}\mathbf{W}\|_{F}^{2} =𝐀𝐰22+i=1p𝐀(𝐈𝐰𝐰T)𝐳i22\displaystyle=\|\mathbf{A}\mathbf{w}\|_{2}^{2}+\sum_{i=1}^{p}\|\mathbf{A}(\mathbf{I}-\mathbf{ww}^{T})\mathbf{z}_{i}\|_{2}^{2}
σp+12(𝐀)+c1α3(max(n,𝐀1))2+i=1pσi(𝐀)2c3pα4max(n,A1)2\displaystyle\geq\sigma_{p+1}^{2}(\mathbf{A})+c_{1}\alpha^{3}\cdot(\max(n,\|\mathbf{A}\|_{1}))^{2}+\sum_{i=1}^{p}\sigma_{i}(\mathbf{A})^{2}-c_{3}p\alpha^{4}\max(n,\|A\|_{1})^{2}
i=1p+1σi2(𝐀)+c4α3max(n,𝐀1)2,\displaystyle\geq\sum_{i=1}^{p+1}\sigma_{i}^{2}(\mathbf{A})+c_{4}\alpha^{3}\max(n,\|\mathbf{A}\|_{1})^{2}, (25)

for some constant c4, which will be positive as long as c1 is large enough compared to c3 (which can be made arbitrarily small by our setting of c2 and c in the lemma statement).

We will now show that all eigenvalues of 𝐖T𝐖\mathbf{W}^{T}\mathbf{W} are close to 11, which implies that 𝐖\mathbf{W} is approximately orthogonal. This allows us to translate the bound on 𝐀𝐖F2\|\mathbf{AW}\|_{F}^{2} given in (25) to an orthonormal basis 𝐐\mathbf{Q} for the column span of 𝐖\mathbf{W}.

Observe that (𝐖T𝐖)_{1,1}=‖𝐰‖2^2=1, and, using (22), for i>1 we have |(𝐖T𝐖)_{i,i}−𝐳_{i−1}T𝐳_{i−1}|=|𝐳_{i−1}T𝐰𝐰T𝐳_{i−1}|≤1/(c2^2n^2). Thus, |(𝐖T𝐖)_{i,i}−1|≤1/(c2^2n^2) for all i. Further, (𝐖T𝐖)_{1,j}=𝐰T(𝐈−𝐰𝐰T)𝐳_{j−1}=0 for j≠1, as ‖𝐰‖2=1 (and similarly (𝐖T𝐖)_{i,1}=0 for i≠1). Also, when i≠j and i,j>1, |(𝐖T𝐖)_{i,j}|=|𝐳_{i−1}T𝐳_{j−1}−𝐳_{i−1}T𝐰𝐰T𝐳_{j−1}|=|𝐳_{i−1}T𝐰𝐰T𝐳_{j−1}|≤1/(c2^2n^2). We can then apply Gershgorin's circle theorem to bound the eigenvalues of 𝐖T𝐖.

Fact 5 (Gershgorin’s circle theorem [Ger31]).

Let 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} with entries 𝐀ij\mathbf{A}_{ij}. For i[n]i\in[n], let 𝐑i\mathbf{R}_{i} be the sum of absolute values of non-diagonal entries in the iith row. Let D(𝐀ii,𝐑i)D(\mathbf{A}_{ii},\mathbf{R}_{i}) be the closed disc centered at 𝐀ii\mathbf{A}_{ii} with radius 𝐑i\mathbf{R}_{i}. Then every eigenvalue of 𝐀\mathbf{A} lies within one of the discs D(𝐀ii,𝐑i)D(\mathbf{A}_{ii},\mathbf{R}_{i}).

From Fact 5, all eigenvalues of 𝐖T𝐖 lie in the range [1−1/(c2^2 n), 1+1/(c2^2 n)]. Thus, 𝐖T𝐖 is a full rank matrix, and so 𝐖 is a rank p+1 matrix. Let 𝐐𝐒𝐑T be the SVD of 𝐖, where 𝐐∈ℝ^{n×(p+1)} and 𝐑∈ℝ^{(p+1)×(p+1)} have orthonormal columns and 𝐒∈ℝ^{(p+1)×(p+1)} is the diagonal matrix of singular values of 𝐖. Then all eigenvalues of (𝐖T𝐖)^{−1/2}, or equivalently all diagonal entries of 𝐒^{−1}, lie in [1−c5/n, 1+c5/n] for some constant c5.

Now, ‖𝐀𝐐‖F^2=‖𝐀𝐖𝐑𝐒^{−1}‖F^2≥(1−c6/n)‖𝐀𝐖𝐑‖F^2=(1−c6/n)‖𝐀𝐖‖F^2 (for some constant c6), where the last step follows from the fact that ‖𝐀𝐖𝐑‖F^2=‖𝐀𝐖‖F^2, since 𝐑 is an orthonormal square matrix. Then we have

𝐀𝐐F2(1c6n)𝐀𝐖F2i=1p+1σi2(𝐀)+c7α3max(n,𝐀1)2,\displaystyle\|\mathbf{A}\mathbf{Q}\|_{F}^{2}\geq(1-\frac{c_{6}}{n})\|\mathbf{A}\mathbf{W}\|_{F}^{2}\geq\sum_{i=1}^{p+1}\sigma_{i}^{2}(\mathbf{A})+c_{7}\alpha^{3}\max(n,\|\mathbf{A}\|_{1})^{2}, (26)

for some constant c7, where the last inequality uses (25) and the fact that (1−c6/n)·c4α^3·max(n,‖𝐀‖1)^2−(c5/n)·∑_{i=1}^{p+1}σi^2(𝐀)=Ω(α^3·max(n,‖𝐀‖1)^2). However, by the optimality of the SVD for Frobenius norm low-rank approximation (based on the eigenvalue min-max theorem of Fact 4), for any matrix 𝐗∈ℝ^{n×(p+1)} with orthonormal columns, ‖𝐀𝐗‖F^2≤‖𝐀𝐕_{p+1}‖F^2=∑_{i=1}^{p+1}σi^2(𝐀). This contradicts (26), which was derived using the assumption that σ_min(𝐕pT𝐙p)≤1/(c2n). ∎

We now prove the main result of this section. We will first show that there is a large singular value gap, of order α^3·(max(n,‖𝐀‖1))^2 between consecutive squared singular values, at some p=O(1/α), and then use Lemmas 8 and 9 to prove a gap-dependent bound similar to Theorem 13 of [MM15].

Theorem 15 (High accuracy spectral norm approximation).

Let 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} be a bounded entry symmetric matrix with largest singular value σ1(𝐀)\sigma_{1}(\mathbf{A}), such that σ1(𝐀)αmax(n,𝐀1)\sigma_{1}(\mathbf{A})\geq\alpha\cdot\max(n,\|\mathbf{A}\|_{1}) for some α(0,1)\alpha\in(0,1). Then there exists an O~(n2log(n/ϵ)α29)\widetilde{O}\left(\frac{n^{2}\log(n/\epsilon)}{\alpha^{29}}\right) time deterministic algorithm that computes 𝐳~n\widetilde{\mathbf{z}}\in\mathbb{R}^{n}, such that 𝐀𝐳~2(1ϵ)σ1(𝐀)\|\mathbf{A}\widetilde{\mathbf{z}}\|_{2}\geq(1-\epsilon)\sigma_{1}(\mathbf{A}).

Proof.

We first show that there exists some p[2α]p\in\left[\frac{2}{\alpha}\right] such that there is a large enough gap between the squared singular values σp2(𝐀)\sigma^{2}_{p}(\mathbf{A}) and σp+12(𝐀)\sigma^{2}_{p+1}(\mathbf{A}). For k=2αk=\frac{2}{\alpha}, we have σk(𝐀)α2max(n,𝐀1)\sigma_{k}(\mathbf{A})\leq\frac{\alpha}{2}\cdot\max(n,\|\mathbf{A}\|_{1}), since kσk(𝐀)i=1kσi(𝐀)𝐀1max(n,𝐀1)k\sigma_{k}(\mathbf{A})\leq\sum_{i=1}^{k}\sigma_{i}(\mathbf{A})\leq\|\mathbf{A}\|_{1}\leq\max(n,\|\mathbf{A}\|_{1}); squaring both sides gives σk2(𝐀)α24(max(n,𝐀1))2\sigma_{k}^{2}(\mathbf{A})\leq\frac{\alpha^{2}}{4}\cdot(\max(n,\|\mathbf{A}\|_{1}))^{2}. Also, since σ1(𝐀)αmax(n,𝐀1)\sigma_{1}(\mathbf{A})\geq\alpha\cdot\max(n,\|\mathbf{A}\|_{1}) by assumption, we have σ12(𝐀)α2(max(n,𝐀1))2\sigma_{1}^{2}(\mathbf{A})\geq\alpha^{2}\cdot(\max(n,\|\mathbf{A}\|_{1}))^{2}, and so σ12(𝐀)σk2(𝐀)34α2(max(n,𝐀1))2\sigma_{1}^{2}(\mathbf{A})-\sigma_{k}^{2}(\mathbf{A})\geq\frac{3}{4}\alpha^{2}\cdot(\max(n,\|\mathbf{A}\|_{1}))^{2}. Thus, by pigeonhole, there exists p[2α]p\in\left[\frac{2}{\alpha}\right] such that

σp2(𝐀)σp+12(𝐀)34α2(max(n,𝐀1))22α=38α3(max(n,𝐀1))2.\displaystyle\sigma_{p}^{2}(\mathbf{A})-\sigma_{p+1}^{2}(\mathbf{A})\geq\frac{\frac{3}{4}\alpha^{2}\cdot(\max(n,\|\mathbf{A}\|_{1}))^{2}}{\frac{2}{\alpha}}=\frac{3}{8}\alpha^{3}\cdot(\max(n,\|\mathbf{A}\|_{1}))^{2}. (27)
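This is a pigeonhole argument: the consecutive gaps sum to σ12(𝐀)σk2(𝐀)\sigma_{1}^{2}(\mathbf{A})-\sigma_{k}^{2}(\mathbf{A}), so the largest gap is at least the average. A minimal numerical sketch, with arbitrary synthetic values standing in for the squared singular values:

import numpy as np

# synthetic decreasing "squared singular values" (illustrative only)
sq = np.sort(np.random.default_rng(1).uniform(0.0, 1.0, 20))[::-1]
k = len(sq)

gaps = sq[:-1] - sq[1:]        # consecutive gaps sigma_p^2 - sigma_{p+1}^2
p = int(np.argmax(gaps))       # index achieving the largest gap

# pigeonhole: the largest of the k-1 gaps is at least the average gap
assert gaps[p] >= (sq[0] - sq[-1]) / (k - 1) - 1e-12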

Let 𝐙\mathbf{Z} be the output of step 2 of Algorithm 4, and let 𝐙pn×p\mathbf{Z}_{p}\in\mathbb{R}^{n\times p} be the matrix with the first pp columns of 𝐙\mathbf{Z}. Performing Block Krylov iterations (i.e., step 3 of Algorithm 4) with 𝐙\mathbf{Z} as the starting block yields at least as good an approximation to σ1(𝐀)\sigma_{1}(\mathbf{A}) as performing them with 𝐙p\mathbf{Z}_{p} as the starting block, since the Krylov subspace generated by 𝐙p\mathbf{Z}_{p} is contained in that generated by 𝐙\mathbf{Z}. Thus, it suffices to show that Block Krylov iterations (step 3 of Algorithm 4) with starting block 𝐙p\mathbf{Z}_{p} produce some matrix 𝐙~pn×p\widetilde{\mathbf{Z}}_{p}\in\mathbb{R}^{n\times p} with orthonormal columns such that 𝐳~1T𝐀𝐀T𝐳~1(1ϵ)σ1(𝐀)\sqrt{\widetilde{\mathbf{z}}_{1}^{T}\mathbf{A}\mathbf{A}^{T}\widetilde{\mathbf{z}}_{1}}\geq(1-\epsilon)\sigma_{1}(\mathbf{A}), where 𝐳~1\widetilde{\mathbf{z}}_{1} is the first column of 𝐙~p\widetilde{\mathbf{Z}}_{p}.

To show this, we use Theorem 13 of [MM15], which bounds the number of iterations of randomized Block Krylov in terms of the singular value gap. To apply it in our setting, we must show that 𝐙p\mathbf{Z}_{p} is a good enough starting block. Specifically, let f(𝐀)f(\mathbf{A}) be the amplifying polynomial used in the proof of Theorem 13 of [MM15], and let 𝐘\mathbf{Y} be an orthonormal basis for the span of f(𝐀)𝐙pf(\mathbf{A})\mathbf{Z}_{p}. Then, following the proof in [MM15], to obtain the gap-dependent convergence bound it suffices to show that

f(𝐀)𝐘𝐘Tf(𝐀)F2cn2f(𝐀)(f(𝐀))pF2\|f(\mathbf{A})-\mathbf{Y}\mathbf{Y}^{T}f(\mathbf{A})\|_{F}^{2}\leq cn^{2}\|f(\mathbf{A})-(f(\mathbf{A}))_{p}\|_{F}^{2}

where (f(𝐀))p(f(\mathbf{A}))_{p} is the best rank-pp approximation to f(𝐀)f(\mathbf{A}) and cc is a constant. To prove this, we proceed similarly to the proof of Lemma 4 of [MM15]. Using Lemma 8, we have:

f(𝐀)𝐘𝐘Tf(𝐀)F2f(𝐀)(f(𝐀))pF2+(f(𝐀)(f(𝐀))p)𝐙p(𝐔pT𝐙p)+F2\|f(\mathbf{A})-\mathbf{Y}\mathbf{Y}^{T}f(\mathbf{A})\|_{F}^{2}\leq\|f(\mathbf{A})-(f(\mathbf{A}))_{p}\|_{F}^{2}+\|(f(\mathbf{A})-(f(\mathbf{A}))_{p})\mathbf{Z}_{p}(\mathbf{U}_{p}^{T}\mathbf{Z}_{p})^{+}\|_{F}^{2}

where 𝐔pn×p\mathbf{U}_{p}\in\mathbb{R}^{n\times p} contains the top pp singular vectors of 𝐀\mathbf{A} as its columns (note that 𝐀\mathbf{A} is symmetric, so its left and right singular vectors coincide up to signs). Thus, it suffices to show that

(f(𝐀)(f(𝐀))p)𝐙p(𝐔pT𝐙p)+F2cn2f(𝐀)(f(𝐀))pF2\|(f(\mathbf{A})-(f(\mathbf{A}))_{p})\mathbf{Z}_{p}(\mathbf{U}_{p}^{T}\mathbf{Z}_{p})^{+}\|_{F}^{2}\leq cn^{2}\|f(\mathbf{A})-(f(\mathbf{A}))_{p}\|_{F}^{2}

for some constant cc. Using the spectral submultiplicativity property, we have

(f(𝐀)(f(𝐀))p)𝐙p(𝐔pT𝐙p)+F2\displaystyle\|(f(\mathbf{A})-(f(\mathbf{A}))_{p})\mathbf{Z}_{p}(\mathbf{U}_{p}^{T}\mathbf{Z}_{p})^{+}\|_{F}^{2} f(𝐀)(f(𝐀))pF2𝐙p(𝐔pT𝐙p)+22\displaystyle\leq\|f(\mathbf{A})-(f(\mathbf{A}))_{p}\|_{F}^{2}\|\mathbf{Z}_{p}(\mathbf{U}_{p}^{T}\mathbf{Z}_{p})^{+}\|_{2}^{2}
f(𝐀)(f(𝐀))pF2(𝐔pT𝐙p)+22,\displaystyle\leq\|f(\mathbf{A})-(f(\mathbf{A}))_{p}\|_{F}^{2}\|(\mathbf{U}_{p}^{T}\mathbf{Z}_{p})^{+}\|_{2}^{2},

where the second step uses the fact that 𝐙p2=1\|\mathbf{Z}_{p}\|_{2}=1, since 𝐙p\mathbf{Z}_{p} has orthonormal columns. From Lemma 9, we have (𝐔pT𝐙p)+2c1n\|(\mathbf{U}_{p}^{T}\mathbf{Z}_{p})^{+}\|_{2}\leq c_{1}n for some constant c1c_{1}, which gives the desired bound with c=c12c=c_{1}^{2}. Thus, we have shown that performing Block Krylov iterations with starting block 𝐙p\mathbf{Z}_{p} gives the same guarantees as Theorem 13 of [MM15] for gap-dependent convergence of randomized Block Krylov.

Specifically, we can apply the per-vector guarantee (Equation 3 of [MM15]): if 𝐳~1\widetilde{\mathbf{z}}_{1} is the first column of the matrix 𝐙~p\widetilde{\mathbf{Z}}_{p} output by the Block Krylov iterations, and 𝐯1\mathbf{v}_{1} is the singular vector of 𝐀\mathbf{A} corresponding to its largest singular value, then |𝐯1T𝐀𝐀T𝐯1𝐳~1T𝐀𝐀T𝐳~1|ϵσp+12(𝐀)ϵσ12(𝐀)|\mathbf{v}_{1}^{T}\mathbf{A}\mathbf{A}^{T}\mathbf{v}_{1}-\widetilde{\mathbf{z}}_{1}^{T}\mathbf{A}\mathbf{A}^{T}\widetilde{\mathbf{z}}_{1}|\leq\epsilon\sigma_{p+1}^{2}(\mathbf{A})\leq\epsilon\sigma_{1}^{2}(\mathbf{A}). This implies that 𝐳~1T𝐀𝐀T𝐳~1(1ϵ)σ1(𝐀)(1ϵ)σ1(𝐀)\sqrt{\widetilde{\mathbf{z}}_{1}^{T}\mathbf{A}\mathbf{A}^{T}\widetilde{\mathbf{z}}_{1}}\geq\sqrt{(1-\epsilon)}\cdot\sigma_{1}(\mathbf{A})\geq(1-\epsilon)\sigma_{1}(\mathbf{A}) (as 𝐯1T𝐀𝐀T𝐯1=σ12(𝐀)\mathbf{v}_{1}^{T}\mathbf{A}\mathbf{A}^{T}\mathbf{v}_{1}=\sigma^{2}_{1}(\mathbf{A})).
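For intuition, the following is a minimal sketch of the Block Krylov step (step 3 of Algorithm 4), in the spirit of Algorithm 2 of [MM15]. It is an illustrative stand-in only: the random Gaussian starting block replaces the deterministically computed 𝐙p\mathbf{Z}_{p}, and the helper name block_krylov_topk is ours, not from Algorithm 4.

import numpy as np
from scipy.linalg import orth, eigh

def block_krylov_topk(A, Z, q, k):
    """Block Krylov iteration with starting block Z, for symmetric A.
    Returns k orthonormal Ritz vectors approximating top singular vectors."""
    blocks = []
    B = A @ Z
    for _ in range(q + 1):
        blocks.append(B)
        B = A @ (A.T @ B)           # multiply by A A^T at each iteration
    Q = orth(np.hstack(blocks))     # orthonormal basis of the Krylov subspace
    M = Q.T @ A @ (A.T @ Q)         # Rayleigh-Ritz: project A A^T onto Q
    vals, vecs = eigh(M)            # eigenvalues in ascending order
    return Q @ vecs[:, -k:][:, ::-1]  # top-k Ritz vectors, largest first

# usage on a random symmetric test matrix (illustrative only)
rng = np.random.default_rng(0)
n, p, q = 200, 3, 10
A = rng.uniform(-1, 1, (n, n))
A = (A + A.T) / 2
Zt = block_krylov_topk(A, rng.standard_normal((n, p)), q, p)
z1 = Zt[:, 0]
print(np.linalg.norm(A @ z1), np.linalg.svd(A, compute_uv=False)[0])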

Runtime Analysis. The time to compute 𝐙\mathbf{Z} from 𝐀~\widetilde{\mathbf{A}} in step 2 of Algorithm 4, with k=2αk=\frac{2}{\alpha} and error parameter ϵ~=cα4\tilde{\epsilon}=c\alpha^{4} (for some small constant c<1c<1), is O~(n2α29)\widetilde{O}\left(\frac{n^{2}}{\alpha^{29}}\right): we only need to estimate the top 2α\frac{2}{\alpha} singular values, and estimating each singular value via power iteration in Algorithm 3 takes time O~(kn2log(n/ϵ~)ϵ~7)\widetilde{O}\left(\frac{kn^{2}\log(n/\tilde{\epsilon})}{\tilde{\epsilon}^{7}}\right), as described in the analysis of Theorem 8.
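Concretely, substituting k=2αk=\frac{2}{\alpha} and ϵ~=cα4\tilde{\epsilon}=c\alpha^{4} into this per-value cost gives

O~(kn2log(n/ϵ~)ϵ~7)=O~((2/α)n2log(n/α)c7α28)=O~(n2α29),\widetilde{O}\left(\frac{kn^{2}\log(n/\tilde{\epsilon})}{\tilde{\epsilon}^{7}}\right)=\widetilde{O}\left(\frac{(2/\alpha)\,n^{2}\log(n/\alpha)}{c^{7}\alpha^{28}}\right)=\widetilde{O}\left(\frac{n^{2}}{\alpha^{29}}\right),

where the constant cc and the logarithmic factors are absorbed into the O~()\widetilde{O}(\cdot) notation.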

The number of Block Krylov iterations with starting block 𝐙p\mathbf{Z}_{p} is O(log(n/ϵ)/min(1,σ1(𝐀)σp+1(𝐀)1))O\left(\log(n/\epsilon)/\sqrt{\min(1,\frac{\sigma_{1}(\mathbf{A})}{\sigma_{p+1}(\mathbf{A})}-1)}\right), as specified in Theorem 13 of [MM15]. Thus, we first need to lower bound σ1(𝐀)σp+1(𝐀)\frac{\sigma_{1}(\mathbf{A})}{\sigma_{p+1}(\mathbf{A})}. We have:

σ12(𝐀)σp+12(𝐀)σp2(𝐀)σp+12(𝐀)σp2(𝐀)σp2(𝐀)38α3(max(n,𝐀1))21138α31+38α3,\displaystyle\frac{\sigma_{1}^{2}(\mathbf{A})}{\sigma^{2}_{p+1}(\mathbf{A})}\geq\frac{\sigma^{2}_{p}(\mathbf{A})}{\sigma^{2}_{p+1}(\mathbf{A})}\geq\frac{\sigma^{2}_{p}(\mathbf{A})}{\sigma^{2}_{p}(\mathbf{A})-\frac{3}{8}\alpha^{3}\cdot(\max(n,\|\mathbf{A}\|_{1}))^{2}}\geq\frac{1}{1-\frac{3}{8}\alpha^{3}}\geq 1+\frac{3}{8}\alpha^{3},

where the second inequality uses (27), and the third inequality uses the fact that σp(𝐀)max(n,𝐀1)\sigma_{p}(\mathbf{A})\leq\max(n,\|\mathbf{A}\|_{1}). Taking square roots, we have σ1(𝐀)σp+1(𝐀)1+cα3\frac{\sigma_{1}(\mathbf{A})}{\sigma_{p+1}(\mathbf{A})}\geq 1+c\alpha^{3} for some constant c>0c>0. Thus, the number of Block Krylov iterations in step 3 of Algorithm 4 is O(log(n/ϵ)α3/2)O\left(\frac{\log(n/\epsilon)}{\alpha^{3/2}}\right). From Theorem 7 of [MM15], for qq iterations, Block Krylov takes time O(n2kq+nk2q2+k3q3)O(n^{2}kq+nk^{2}q^{2}+k^{3}q^{3}), where k=2αk=\frac{2}{\alpha}; substituting q=O(log(n/ϵ)α3/2)q=O\left(\frac{\log(n/\epsilon)}{\alpha^{3/2}}\right), the total time taken by step 3 of Algorithm 4 is O~(n2log(n/ϵ)α5/2+nlog2(n/ϵ)α5+log3(n/ϵ)α15/2)\widetilde{O}\left(\frac{n^{2}\log(n/\epsilon)}{\alpha^{5/2}}+\frac{n\log^{2}(n/\epsilon)}{\alpha^{5}}+\frac{\log^{3}(n/\epsilon)}{\alpha^{15/2}}\right). Since the O~(n2log(n/ϵ)α29)\widetilde{O}\left(\frac{n^{2}\log(n/\epsilon)}{\alpha^{29}}\right) cost of step 2 asymptotically dominates each of these terms, the total time taken by Algorithm 4 is O~(n2log(n/ϵ)α29)\widetilde{O}\left(\frac{n^{2}\log(n/\epsilon)}{\alpha^{29}}\right). ∎

Remark. Since we can apply the per-vector guarantee (Equation 3 of [MM15]) to 𝐙~p\widetilde{\mathbf{Z}}_{p}, i.e., the output of Block Krylov in Algorithm 4, for any p2αp\leq\lceil\frac{2}{\alpha}\rceil such that the condition in (27) holds and any ipi\leq p, we have |𝐯iT𝐀𝐀T𝐯i𝐳~iT𝐀𝐀T𝐳~i|ϵσp+12(𝐀)ϵσi2(𝐀)\lvert\mathbf{v}_{i}^{T}\mathbf{A}\mathbf{A}^{T}\mathbf{v}_{i}-\widetilde{\mathbf{z}}_{i}^{T}\mathbf{A}\mathbf{A}^{T}\widetilde{\mathbf{z}}_{i}\rvert\leq\epsilon\sigma_{p+1}^{2}(\mathbf{A})\leq\epsilon\sigma_{i}^{2}(\mathbf{A}), where 𝐳~i\widetilde{\mathbf{z}}_{i} is the iith column of 𝐙~p\widetilde{\mathbf{Z}}_{p}. This implies that for all ipi\leq p we have 𝐳~iT𝐀𝐀T𝐳~i(1ϵ)σi(𝐀)\sqrt{\widetilde{\mathbf{z}}_{i}^{T}\mathbf{A}\mathbf{A}^{T}\widetilde{\mathbf{z}}_{i}}\geq(1-\epsilon)\sigma_{i}(\mathbf{A}) (as 𝐯iT𝐀𝐀T𝐯i=σi2(𝐀)\mathbf{v}_{i}^{T}\mathbf{A}\mathbf{A}^{T}\mathbf{v}_{i}=\sigma^{2}_{i}(\mathbf{A})). Thus, we can approximate the top pp singular values and corresponding singular vectors of 𝐀\mathbf{A} to high accuracy using Algorithm 4, where the ppth singular value of 𝐀\mathbf{A} satisfies the condition of (27).

7 Conclusion

Our work has shown that it is possible to deterministically construct an entrywise sampling matrix 𝐒\mathbf{S} (a universal sparsifier) with just O~(n/ϵc)\widetilde{O}(n/\epsilon^{c}) non-zero entries such that for any bounded entry 𝐀\mathbf{A} with 𝐀1\|\mathbf{A}\|_{\infty}\leq 1, 𝐀𝐀𝐒2ϵmax(n,𝐀1)\|\mathbf{A}-\mathbf{A}\circ\mathbf{S}\|_{2}\leq\epsilon\cdot\max(n,\|\mathbf{A}\|_{1}). We show how to achieve sparsity O(n/ϵ2)O(n/\epsilon^{2}) when 𝐀\mathbf{A} is restricted to be PSD (Theorem 1) and O~(n/ϵ4)\widetilde{O}(n/\epsilon^{4}) when 𝐀\mathbf{A} is general (Theorem 4), and prove that both these bounds are tight up to logarithmic factors (Theorems 3 and 6). Further, our proofs are based on simple reductions, which show that any 𝐒\mathbf{S} that spectrally approximates the all ones matrix to sufficient accuracy (i.e., is a sufficiently good spectral expander) yields a universal sparsifier.
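To make the construction concrete, here is a minimal sketch (our illustration, not the paper's implementation): it uses a random dd-regular graph via networkx as a stand-in for an explicit Ramanujan expander, and checks the entrywise-sampling error on a synthetic bounded-entry PSD matrix.

import numpy as np
import networkx as nx  # random d-regular graph as a stand-in expander

rng = np.random.default_rng(0)
n, d = 500, 64

# S = (n/d) * adjacency of a d-regular graph; for a Ramanujan expander,
# ||1 - S||_2 = O(n/sqrt(d)), which random regular graphs match w.h.p.
G = nx.random_regular_graph(d, n, seed=0)
S = (n / d) * nx.to_numpy_array(G)

# a bounded-entry PSD test matrix A = B B^T, rescaled so |A_ij| <= 1
B = rng.standard_normal((n, 5))
A = B @ B.T
A = A / np.abs(A).max()

err = np.linalg.norm(A - A * S, 2)  # spectral norm of A - A∘S
print(err / n)  # small, on the order of ||1 - S||_2 / n = O(1/sqrt(d))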

We apply our universal sparsification bounds to give the first o(nω)o(n^{\omega}) time deterministic algorithms for several core linear algebraic problems, including singular value/vector approximation and positive semidefiniteness testing. When 𝐀\mathbf{A} is restricted to be PSD with entries in {1,0,1}\{-1,0,1\}, we show how to achieve an improved deterministic query complexity of O~(n/ϵ)\widetilde{O}(n/\epsilon) for constructing a general spectral approximation 𝐀~\mathbf{\widetilde{A}}, which may not be sparse (Theorem 9). We again show that this bound is tight up to a logarithmic factor (Theorem 10).

Our work leaves several open questions:

  1. An interesting question is whether O~(n/ϵ)\widetilde{O}(n/\epsilon) sample complexity can be achieved for deterministic spectral approximation of any bounded entry PSD matrix if the sparsity of the output is not restricted, thereby generalizing the upper bound for PSD {1,0,1}\{-1,0,1\}-matrices proven in Theorem 9. This query complexity is known for randomized algorithms based on column sampling [MM17]; however, it is not currently known how to derandomize such results.

  2. It would also be interesting to close the O~(1/ϵ2)\widetilde{O}(1/\epsilon^{2}) factor gap between our universal sparsification upper bound of O~(n/ϵ4)\widetilde{O}(n/\epsilon^{4}) queries for achieving ϵmax(n,𝐀1)\epsilon\cdot\max(n,\|\mathbf{A}\|_{1}) spectral approximation error for non-PSD matrices (Theorem 4) and our Ω(n/ϵ2)\Omega(n/\epsilon^{2}) query lower bound for general deterministic algorithms that make possibly adaptive queries to 𝐀\mathbf{A} (Theorem 7). By Theorem 6, our universal sparsification bound is tight up to log factors for algorithms that make non-adaptive deterministic queries to 𝐀\mathbf{A}. It is unknown whether adaptive queries can be used to give improved algorithms.

  3. Finally, it would be interesting to understand whether our deterministic algorithms for spectrum approximation can be improved. For example, can one compute an ϵn\epsilon n additive error or a (1+ϵ)(1+\epsilon) relative error approximation to the top singular value 𝐀2\|\mathbf{A}\|_{2} of a bounded entry 𝐀\mathbf{A}, for constant ϵ\epsilon, in o(nω)o(n^{\omega}) time deterministically? Are there fundamental barriers that make doing so difficult?

Acknowledgements

We thank Christopher Musco for helpful conversations about this work. RB, CM, and AR were partially supported by an Adobe Research grant, along with NSF Grants 2046235, 1763618, and 1934846. AR was also partially supported by the Manning College of Information and Computer Sciences Dissertation Writing Fellowship. GD was partially supported by NSF AF 1814041, NSF FRG 1760353, and DOE-SC0022085. SS was supported by an NSERC Discovery Grant RGPIN-2018-06398 and a Sloan Research Fellowship. DW was supported by a Simons Investigator Award.

References

  • [ACK+16] Alexandr Andoni, Jiecao Chen, Robert Krauthgamer, Bo Qin, David P Woodruff, and Qin Zhang. On sketching quadratic forms. In Proceedings of the 7th Conference on Innovations in Theoretical Computer Science (ITCS), 2016.
  • [AHK05] Sanjeev Arora, Elad Hazan, and Satyen Kale. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2005.
  • [AHK06] Sanjeev Arora, Elad Hazan, and Satyen Kale. A fast random sampling algorithm for sparsifying matrices. In Proceedings of the 10th International Workshop on Randomization and Computation (RANDOM), 2006.
  • [Aig95] Martin Aigner. Turán’s graph theorem. The American Mathematical Monthly, 1995.
  • [AKL13] Dimitris Achlioptas, Zohar Shay Karnin, and Edo Liberty. Near-optimal entrywise sampling for data matrices. In Advances in Neural Information Processing Systems 26 (NeurIPS), 2013.
  • [AKM+17] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
  • [Alo86] Noga Alon. Eigenvalues and expanders. Combinatorica, 1986.
  • [Alo21] Noga Alon. Explicit expanders of every degree and size. Combinatorica, 2021.
  • [AM07] Dimitris Achlioptas and Frank McSherry. Fast computation of low rank matrix approximations. In Proceedings of the \nth39 Annual ACM Symposium on Theory of Computing (STOC), 2007.
  • [AW21] Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. In Proceedings of the 32nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021.
  • [AZL16] Zeyuan Allen-Zhu and Yuanzhi Li. LazySVD: Even faster SVD decomposition yet without agonizing pain. Advances in Neural Information Processing Systems 29 (NeurIPS), 2016.
  • [BCJ20] Ainesh Bakshi, Nadiia Chepurko, and Rajesh Jayaram. Testing positive semi-definiteness via random submatrices. In Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2020.
  • [BDD+23] Rajarshi Bhattacharjee, Gregory Dexter, Petros Drineas, Cameron Musco, and Archan Ray. Sublinear time eigenvalue approximation via random sampling. Proceedings of the 50th International Colloquium on Automata, Languages and Programming (ICALP), 2023.
  • [BDMI14] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal column-based matrix reconstruction. SIAM Journal on Computing, 2014.
  • [BGPW13] Mark Braverman, Ankit Garg, Denis Pankratov, and Omri Weinstein. Information lower bounds via self-reducibility. In Proceedings of the 8th International Computer Science Symposium in Russia (CSR), 2013.
  • [Bha13] Rajendra Bhatia. Matrix analysis. Springer Science & Business Media, 2013.
  • [BJKS04] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences, 2004.
  • [BKKS21] Vladimir Braverman, Robert Krauthgamer, Aditya R Krishnan, and Shay Sapir. Near-optimal entrywise sampling of numerically sparse matrices. In Proceedings of the 34th Annual Conference on Computational Learning Theory (COLT), 2021.
  • [BKM22] Vladimir Braverman, Aditya Krishnan, and Christopher Musco. Sublinear time spectral density estimation. In Proceedings of the 54th Annual ACM Symposium on Theory of Computing (STOC), 2022.
  • [BSST13] Joshua Batson, Daniel A Spielman, Nikhil Srivastava, and Shang-Hua Teng. Spectral sparsification of graphs: theory and algorithms. Communications of the ACM, 2013.
  • [CG02] Fan Chung and Ronald Graham. Sparse quasi-random graphs. Combinatorica, 2002.
  • [CGL+20] Julia Chuzhoy, Yu Gao, Jason Li, Danupon Nanongkai, Richard Peng, and Thatchaphol Saranurak. A deterministic algorithm for balanced cut with applications to dynamic connectivity, flows, and beyond. In Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2020.
  • [Cha11] Françoise Chatelin. Spectral approximation of linear operators. SIAM, 2011.
  • [Coh16] Michael B Cohen. Ramanujan graphs in polynomial time. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2016.
  • [Cov99] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
  • [CR12a] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 2012.
  • [CR12b] Amit Chakrabarti and Oded Regev. An optimal lower bound on the communication complexity of gap-Hamming-distance. SIAM Journal on Computing, 2012.
  • [CT10] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transations on Information Theory, 2010.
  • [CW17] Kenneth L Clarkson and David P Woodruff. Low-rank PSD approximation in input-sparsity time. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2017.
  • [d’A11] Alexandre d’Aspremont. Subsampling algorithms for semidefinite programming. Stochastic Systems, 2011.
  • [DZ11] Petros Drineas and Anastasios Zouzias. A note on element-wise matrix sparsification via a matrix-valued Bernstein inequality. Information Processing Letters, 2011.
  • [Ger31] Semyon Aranovich Gershgorin. Über die Abgrenzung der Eigenwerte einer Matrix. Izvestiya Rossiyskoy akademii nauk. Seriya matematicheskaya, 1931.
  • [Kun15] Abhisek Kundu. Element-wise matrix sparsification and reconstruction. PhD thesis, Rensselaer Polytechnic Institute, USA, 2015.
  • [KW92] Jacek Kuczyński and Henryk Woźniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 1992.
  • [LPS88] Alexander Lubotzky, Ralph Phillips, and Peter Sarnak. Ramanujan graphs. Combinatorica, 1988.
  • [LSY16] Lin Lin, Yousef Saad, and Chao Yang. Approximating spectral densities of large matrices. SIAM Review, 2016.
  • [Mar73] Grigorii Aleksandrovich Margulis. Explicit constructions of concentrators. Problemy Peredachi Informatsii, 1973.
  • [MM15] Cameron Musco and Christopher Musco. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. Advances in Neural Information Processing Systems 28 (NeurIPS), 2015.
  • [MM17] Cameron Musco and Christopher Musco. Recursive sampling for the Nyström method. Advances in Neural Information Processing Systems 30 (NeurIPS), 2017.
  • [MMMW21] Raphael A Meyer, Cameron Musco, Christopher Musco, and David P Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA). SIAM, 2021.
  • [Mor94] Moshe Morgenstern. Existence and explicit constructions of q+1q+1 regular Ramanujan graphs for every prime power qq. Journal of Combinatorial Theory, Series B, 1994.
  • [MW17] Cameron Musco and David P. Woodruff. Sublinear time low-rank approximation of positive semidefinite matrices. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2017.
  • [NSW22] Deanna Needell, William Swartworth, and David P. Woodruff. Testing positive semidefiniteness using linear measurements. Proceedings of the 63rd Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2022.
  • [Pip87] Nicholas Pippenger. Sorting and selecting in rounds. SIAM Journal on Computing, 1987.
  • [Rou16] Tim Roughgarden. Communication complexity (for algorithm designers). Foundations and Trends in Theoretical Computer Science, 2016.
  • [Saa11] Yousef Saad. Numerical methods for large eigenvalue problems: Revised edition. SIAM, 2011.
  • [SKO21] Florian Schäfer, Matthias Katzfuss, and Houman Owhadi. Sparse Cholesky factorization by Kullback–Leibler minimization. SIAM Journal on Scientific Computing, 2021.
  • [SS08] Daniel A Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC), 2008.
  • [ST04] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), 2004.
  • [SWXZ22] Vaidehi Srinivas, David P Woodruff, Ziyu Xu, and Samson Zhou. Memory bounds for the experts problem. Proceedings of the 54th Annual ACM Symposium on Theory of Computing (STOC), 2022.
  • [Tur41] Pál Turán. On an extremal problem in graph theory. Mat. Fiz. Lapok, 1941.
  • [TYUC17] Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Randomized single-view algorithms for low-rank matrix approximation, 2017.
  • [Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science. Cambridge University Press, 2018.
  • [Wey12] Hermann Weyl. The asymptotic distribution law of the eigenvalues of linear partial differential equations (with an application to the theory of cavity radiation). Mathematische Annalen, 1912.
  • [WLZ16] Shusen Wang, Luo Luo, and Zhihua Zhang. SPSD matrix approximation vis column selection: Theories, algorithms, and extensions. The Journal of Machine Learning Research, 2016.
  • [Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 2014.
  • [WS00] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems 13 (NeurIPS), 2000.
  • [WS23] David Woodruff and William Swartworth. Optimal eigenvalue approximation via sketching. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing (STOC), 2023.
  • [WWAF06] Alexander Weisse, Gerhard Wellein, Andreas Alvermann, and Holger Fehske. The kernel polynomial method. Reviews of modern physics, 2006.
  • [WZ93] Avi Wigderson and David Zuckerman. Expanders that beat the eigenvalue bound: Explicit construction and applications. In Proceedings of the 25th Annual ACM Symposium on Theory of Computing (STOC), 1993.
  • [XG10] Jianlin Xia and Ming Gu. Robust approximate Cholesky factorization of rank-structured symmetric positive definite matrices. SIAM Journal on Matrix Analysis and Applications, 2010.