
Robust Dequantization of the
Quantum Singular Value Transformation and
Quantum Machine Learning Algorithms

François Le Gall
Nagoya University
[email protected]
Abstract

Several quantum algorithms for linear algebra problems, and in particular quantum machine learning problems, have been “dequantized” in the past few years. These dequantization results typically hold when classical algorithms can access the data via length-squared sampling. This assumption, which is standard in the field of randomized linear algebra, means that for a unit-norm vector u\in\mathbb{C}^{n}, we can sample from the distribution p_{u}\colon\{1,\ldots,n\}\to[0,1] defined as p_{u}(i)=|u(i)|^{2} for each i\in\{1,\ldots,n\}. Since this distribution corresponds to the distribution obtained by measuring the quantum state |u\rangle in the computational basis, length-squared sampling access gives a reasonable classical analogue to the kind of quantum access considered in many quantum algorithms for linear algebra problems.

In this work we investigate how robust these dequantization results are. We introduce the notion of approximate length-squared sampling, where classical algorithms are only able to sample from a distribution close to the ideal distribution in total variation distance. While quantum algorithms are natively robust against small perturbations, current techniques in dequantization are not. Our main technical contribution is showing how many techniques from randomized linear algebra can be adapted to work under this weaker assumption as well. We then use these techniques to show that the recent low-rank dequantization framework by Chia, Gilyén, Li, Lin, Tang and Wang (JACM 2022) and the dequantization framework for sparse matrices by Gharibian and Le Gall (STOC 2022), which are both based on the Quantum Singular Value Transformation, can be generalized to the case of approximate length-squared sampling access to the input. We also apply these results to obtain a robust dequantization of many quantum machine learning algorithms, including quantum algorithms for recommendation systems, supervised clustering and low-rank matrix inversion.

1 Introduction

1.1 Background

The quantum algorithm for matrix inversion by Harrow, Hassidim and Lloyd [23], often called simply the HHL algorithm, is a milestone in quantum algorithms for linear algebra. Given a well-conditioned invertible matrix A\in\mathbb{C}^{n\times n} and a vector u\in\mathbb{C}^{n}, the algorithm prepares in time \mathrm{poly}(\log n) a quantum state proportional to a vector \hat{x}\in\mathbb{C}^{n} close to the solution of the system of equations Ax=u. This quantum state can then be used to compute some partial information about \hat{x}. The HHL algorithm has been very influential for the development of quantum algorithms solving problems related to linear algebra, and has in particular led to several quantum machine learning algorithms ([7, 29, 30, 32, 35, 36, 37, 41] for instance).

An important assumption in the HHL algorithm, and in most of these extensions as well, is that the input vector should be accessible as a quantum state: the quantum computer has access to a few copies of the quantum state |v=i=1nv(i)|i|v\rangle=\sum_{i=1}^{n}v(i)|i\rangle on O(logn)O(\log n) qubits, for v=u/uv=u/\left\|u\right\|. In order for the HHL algorithm to be useful, it is thus crucial to be able to prepare this quantum state efficiently. One concrete proposal is having the vector uu stored in Quantum Random Access Memory (QRAM), in which case (when using the definition of QRAM in [29]) one copy of |v|v\rangle can be created in time poly(logn)\mathrm{poly}(\log n). Similar assumptions are needed on how the quantum algorithm can access the matrix AA. As discussed in [1], due to all these assumptions it is difficult to directly compare the performance of these quantum algorithms to the performance of classical methods.

A series of works initiated by Tang [39] has been investigating the importance of such assumptions for quantum machine learning. These results have shown that for most quantum machine learning algorithms in the QRAM model, if classical algorithms are allowed to access the input via length-squared111Length-squared sampling is also called 2\ell_{2}-norm importance sampling in the literature. sampling-and-query access (which we abbreviate as sampling-and-query access in this paper) then the quantum advantage vanishes: classical algorithms can achieve performance similar (i.e., polynomially related) to the performance of quantum algorithms. The concept of (length-squared) sampling-and-query access is a central concept in the literature on linear algebra algorithms based on random sampling [11, 12, 13, 14, 15, 17, 27, 28]. For a nonzero vector unu\in\mathbb{C}^{n}, sampling-and-query access to uu means that the following two operations can be implemented in poly(logn)\mathrm{poly}(\log n) time: on input i{1,,n}i\in\{1,\ldots,n\} we can query the entry u(i)u(i); we can get one sample from the distribution pu:{1,,n}[0,1]p_{u}\colon\{1,\ldots,n\}\to[0,1] defined as pu(i)=|u(i)|2/u2p_{u}(i)=|u(i)|^{2}/\left\|u\right\|^{2} for each i{1,,n}i\in\{1,\ldots,n\}. The first operation is standard and similar to the assumption (made in most classical algorithms) that the input is stored in Random Access Memory (RAM). It is the second operation that makes linear algebra algorithms based on random sampling extremely powerful. In the perspective of dequantization, this second operation can be seen as a natural classical analogue of the assumption that the quantum state |v|v\rangle (with v=u/uv=u/\left\|u\right\|) can be created at cost poly(logn)\mathrm{poly}(\log n), since measuring |v|v\rangle in the computational basis gives precisely a sample from the distribution pup_{u}.

The main thesis of the line of research on dequantization initiated in [39] is that the performance of quantum machine learning algorithms in the QRAM model should be compared to the performance of classical algorithms with sampling-and-query access to the input. More precisely, the main objective of this approach is to investigate if some quantum advantage persists even when compared to such classical algorithms. The answers obtained so far are unfortunately negative. After Tang’s celebrated dequantization [39] of the quantum algorithm for recommendation systems by Kerenidis and Prakash [29], which was one of the main candidates for quantum advantage in machine learning, several works have successively dequantized most known quantum machine learning algorithms in the QRAM model [3, 5, 4, 6, 10, 16, 18, 22, 26, 39]. The coup de grâce was recently given by Chia, Gilyén, Li, Lin, Tang and Wang [5], who developed a general framework for dequantization applying to almost all known quantum machine learning algorithms, and gave strong evidence for the lack of exponential speedup for quantum machine learning algorithms in the QRAM model.

1.2 Motivation: Approximate sampling-and-query access

In this work, we explore the possibility of dequantizing quantum algorithms in the QRAM model by classical algorithms using a weaker version of sampling-and-query access, which we call ε\varepsilon-approximate (length-squared) sampling-and-query access, where ε[0,1]\varepsilon\in[0,1] is a parameter. In ε\varepsilon-approximate sampling-and-query access we weaken the definition as follows: While the query operation is unchanged (on input i{1,,n}i\in\{1,\ldots,n\} we can query the entry u(i)u(i)), for the sampling operation we can only get one sample from a distribution p~u:{1,,n}[0,1]\tilde{p}_{u}\colon\{1,\ldots,n\}\to[0,1] that is close to pup_{u}. More precisely, the only guarantee is that the total variation distance between p~u\tilde{p}_{u} and pup_{u} is at most ε\varepsilon. (Standard sampling-and-query access corresponds to the case ε=0\varepsilon=0.) This variant corresponds to the setting where query access to uu can be easily implemented (e.g., when the input is stored in RAM) but perfect sampling access is not available or simply too costly to be implemented. We stress that in this setting, we do not know the value p~u(i)\tilde{p}_{u}(i) (and cannot compute it efficiently from the samples and queries).222If we had access to p~u(i)\tilde{p}_{u}(i), we could simply define a vector u~\tilde{u} such that p~u(i)=|u~(i)|2/u~2\tilde{p}_{u}(i)=|\tilde{u}(i)|^{2}/\left\|\tilde{u}\right\|^{2} and apply the framework from [5] on u~\tilde{u} instead of uu (since we could then easily obtain perfect sampling-and-query access to u~\tilde{u}).

It happens that this seemingly minor change is problematic for known dequantization techniques. Consider for instance the inner product estimation from [39], a technique also used in several other “quantum-inspired” algorithms, which dequantizes the SWAP test (a basic quantum algorithm that can be used to estimate the inner product between two pure states). Given two unit-norm vectors u,vnu,v\in\mathbb{R}^{n}, where uu is given via sampling-and-query access, the approach typically works as follows: sample an index i{1,,n}i\in\{1,\ldots,n\} according to the distribution pup_{u} and output the value v(i)u(i)\frac{v(i)}{u(i)}. Note that the expectation of this value is

\sum_{i=1}^{n}p_{u}(i)\frac{v(i)}{u(i)}=\sum_{i=1}^{n}\left|u(i)\right|^{2}\frac{v(i)}{u(i)}=\sum_{i=1}^{n}u(i)v(i)=(u,v).  (1)

It is easy to show that the variance of this estimator is small, and thus taking the mean for a few samples gives a good approximation of the inner product (u,v)(u,v). Note that this test, however, is not “robust” due to the division by u(i)u(i). In particular, when only able to sample from p~u\tilde{p}_{u} instead of pup_{u}, the same strategy does not work since the ratio p~u(i)1u(i)\tilde{p}_{u}(i)\frac{1}{u(i)} can become arbitrarily large, which significantly compromises the quality of the approximation.
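To illustrate Equation (1), here is a minimal Python sketch of the exact-sampling estimator for unit-norm real vectors; the function name and parameters are ours, chosen only for illustration. It assumes that the ideal distribution p_{u} is available, which is precisely what breaks down in the approximate setting discussed above.

import numpy as np

def inner_product_estimate(u, v, num_samples, rng):
    # Exact length-squared distribution p_u of the unit-norm vector u.
    p_u = np.abs(u) ** 2
    idx = rng.choice(len(u), size=num_samples, p=p_u)
    # Single-sample estimator v(i)/u(i); by Equation (1) its expectation is (u, v).
    return np.mean(v[idx] / u[idx])

rng = np.random.default_rng(0)
u = rng.standard_normal(1000); u /= np.linalg.norm(u)
v = rng.standard_normal(1000); v /= np.linalg.norm(v)
print(inner_product_estimate(u, v, 10000, rng), np.dot(u, v))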

Quantum algorithms in the QRAM model, on the other hand, are natively robust: since quantum evolutions are trace-preserving, for any quantum algorithm 𝒬\mathcal{Q} and any two quantum states |ϕ|\phi\rangle and |ψ|\psi\rangle such that the trace distance333We consider the trace distance since it directly gives an upper bound on the total variation distance of the corresponding probability distributions. Using another closeness measure between quantum states (e.g., the fidelity) would lead to another (possibly weaker) upper bound on the total variation distance. between |ϕ|\phi\rangle and |ψ|\psi\rangle is at most ε\varepsilon (which implies that the total variation distance between the distributions p|ϕp_{|\phi\rangle} and p|ψp_{|\psi\rangle} is at most ε\varepsilon), the distance between the output of 𝒬\mathcal{Q} on input |ϕ|\phi\rangle and the output of 𝒬\mathcal{Q} on input |ψ|\psi\rangle is at most ε\varepsilon. The central motivation of the present work is to understand whether this property of quantum algorithms can be exploited to show a quantum advantage in machine learning.

To our knowledge, this notion of approximate sampling-and-query access has never been considered in the literature on linear algebra via random sampling.444The slightly related notion of oversampling access has been considered in [13], and in [5] in the context of dequantization. This notion is conceptually different from ours (our notion only guarantees that the distribution is close to the ideal distribution in total variation distance). The main reason is that in those works, sampling access to the vectors and matrices is usually not given as input. Instead, the algorithm makes one pass over the whole data as a preliminary step in order to collect enough information to be able to implement efficient sampling access in the second phase of the algorithm. The time required by the preliminary step is typically linear in the input size (i.e., the number of nonzero entries of the vectors or matrices), which is enough to get an implementation of perfect sampling-and-query access to the input. In dequantization algorithms, on the other hand, the main goal is typically to construct algorithms with running time exponentially smaller than the input size, and thus making a pass over the whole data is not possible. In this setting, the implementation of sampling access becomes extremely challenging, and it makes sense to consider the scenario where only approximate sampling access can be implemented.

1.3 Our results

In this work we show that for essentially all known quantum algorithms that have been dequantized in the past few years, robust dequantization is also possible: there exist classical algorithms matching the performance of these quantum algorithms (up to polynomial factors) even when having only approximate sampling-and-query access to the input. In particular, this gives evidence for the lack of exponential speedup for quantum machine learning algorithms in the QRAM model even when classical algorithms have only approximate sampling-and-query access to the input.

Approximate oversampling and closeness properties.

Our first contribution is to develop a framework for working with vectors and matrices given by approximate sampling-and-query access. Following the approach developed in [5] (which used the concept of oversampling-and-query access), we develop our framework by introducing the concept of approximate oversampling-and-query access. In this setting, instead of being able to generate samples from a distribution \tilde{p}_{u} close to p_{u}, we can only generate samples from a distribution \tilde{q}_{u}\colon\{1,\ldots,n\}\to[0,1] such that \tilde{q}_{u}(i)\geq\frac{1}{\phi}\tilde{p}_{u}(i) holds for all i\in\{1,\ldots,n\}, for some \phi\geq 1 called the oversampling parameter. (Note that taking \phi=1 gives the usual approximate sampling-and-query access.) We prove the following two properties for approximate oversampling-and-query access.

  • Closure property: given approximate oversampling-and-query access to a small number of vectors, we can implement efficiently approximate oversampling-and-query access to a linear combination of these vectors (with a slightly worse oversampling parameter).

  • Conversion property: given approximate oversampling-and-query access to a vector, we can implement efficiently approximate sampling-and-query access with an overhead in the complexity corresponding to the oversampling parameter.

These two properties generalize the two properties shown (for the setting of perfect sampling-and-query access) in [5]. The closure property, which usually does not hold for approximate sampling-and-query access, is the reason why the notion of approximate oversampling-and-query access is very convenient to deal with. In addition, the conversion property shows that working with oversampling-and-query access is meaningful, since at the end of the computation it is possible to recover sampling-and-query access from the output (with an overhead in the complexity corresponding to the oversampling parameter).

Matrix multiplication.

Our second contribution is to develop a robust version of one of the central techniques in the area of randomized numerical linear algebra: matrix multiplication based on importance matrix sketches. Given two matrices X\in\mathbb{C}^{m\times n} and Y\in\mathbb{C}^{m\times n^{\prime}}, the approach by Drineas, Kannan and Mahoney [13], on which the dequantization framework by Chia, Gilyén, Li, Lin, Tang and Wang [5] is also based, creates a matrix \Sigma\in\mathbb{R}^{r\times m} with only one nonzero entry per row, and estimates the matrix product X^{\dagger}Y by the matrix product X^{\dagger}\Sigma^{\dagger}\Sigma Y. Note that X^{\dagger}\Sigma^{\dagger} is an n\times r matrix obtained by selecting r columns of X^{\dagger} (i.e., r rows of X) and renormalizing them, while \Sigma Y is an r\times n^{\prime} matrix obtained by selecting r rows of Y and renormalizing them. The matrix product X^{\dagger}\Sigma^{\dagger}\Sigma Y can thus be considered as a “sketch” of the matrix product X^{\dagger}Y. The quality of the approximation, i.e., the Frobenius norm

\left\|X^{\dagger}\Sigma^{\dagger}\Sigma Y-X^{\dagger}Y\right\|_{\mathrm{F}}

depends naturally on the choice of \Sigma. Drineas, Kannan and Mahoney [13] analyzed the optimal choice for \Sigma, and showed that this \Sigma can be constructed using (length-squared) sampling-and-query access to the matrices X and Y.555Sampling-and-query access to a matrix is defined using the concept of sampling-and-query access to a vector: we say that we have sampling-and-query access to a matrix A\in\mathbb{C}^{m\times n} if we have sampling-and-query access to each row of A and we have sampling-and-query access to the vector of row norms of the matrix A, i.e., the vector (\left\|A(1,\cdot)\right\|,\ldots,\left\|A(m,\cdot)\right\|). This notion has been extended to oversampling-and-query access to a matrix in the literature [5, 13]. In this work we further extend it to \varepsilon-approximate (over)sampling-and-query access to a matrix (see Definitions 7 and 8 in Section 2.3). In the dequantization setting, this result has been used in particular to obtain a sketch of the singular value transformation of a matrix [5, Theorem 5.1].
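As an illustration of this type of sketch under exact length-squared sampling (not the robust construction developed in Section 4), the following Python snippet builds the sampled-and-rescaled matrices \Sigma X and \Sigma Y directly instead of materializing \Sigma; the function name and parameters are ours.

import numpy as np

def sketched_product(X, Y, r, rng):
    # Length-squared row-sampling distribution over the rows of X.
    q = np.sum(np.abs(X) ** 2, axis=1)
    q = q / q.sum()
    rows = rng.choice(X.shape[0], size=r, p=q)
    scale = 1.0 / np.sqrt(r * q[rows])       # renormalization of the selected rows
    SX = scale[:, None] * X[rows, :]         # plays the role of Sigma X  (r x n)
    SY = scale[:, None] * Y[rows, :]         # plays the role of Sigma Y  (r x n')
    return SX.conj().T @ SY                  # estimator of X^dagger Y

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 20))
Y = rng.standard_normal((2000, 30))
approx = sketched_product(X, Y, r=500, rng=rng)
error = np.linalg.norm(approx - X.conj().T @ Y, 'fro')
print(error / (np.linalg.norm(X, 'fro') * np.linalg.norm(Y, 'fro')))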

In our setting, where we only have approximate (over)sampling-and-query access to the matrices XX and YY, several challenges appear when trying to adapt this technique. First, constructing efficiently unbiased estimators for matrix multiplication seems hopeless since the probability distribution from which we can sample is biased. Second, and more importantly, it is even unclear whether the approach from [5, 13] can be adapted, for the same reasons as mentioned in Section 1.2 for the inner product (observe that inner product is a special case of matrix product).

Our main technical contribution is to show that a robust version of matrix multiplication is possible, with a small bias depending on the parameter ε\varepsilon. Here are informal versions of our two results related to matrix multiplication (we refer to Sections 4.1 and 4.2 for the formal statements):

Theorem 1 (Informal version).

Given two matrices X\in\mathbb{C}^{m\times n} and Y\in\mathbb{C}^{m\times n^{\prime}}, assume that we have \varepsilon-approximate sampling-and-query access to X, for any \varepsilon\in[0,1]. For any \xi>0 and any \delta\in(0,1], we can construct efficiently a matrix sketch \Sigma\in\mathbb{R}^{r\times m} with r=\Theta\left(\frac{1}{\delta\xi^{2}}\right) such that

\mathrm{Pr}\left[\left\|X^{\dagger}\Sigma^{\dagger}\Sigma Y-X^{\dagger}Y\right\|_{\mathrm{F}}\leq\left(2\sqrt{2\varepsilon}+\xi\right)\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}\right]\geq 1-\delta.
Theorem 2 (Informal version).

Given two matrices X\in\mathbb{C}^{m\times n} and Y\in\mathbb{C}^{m\times n^{\prime}}, assume that we have \varepsilon-approximate sampling-and-query access to both X and Y, for any \varepsilon\in[0,1]. For any \xi>0 and any \delta\in(0,1], we can construct efficiently a matrix sketch \Sigma\in\mathbb{R}^{r\times m} with r=\Theta\left(\frac{\log(1/\delta)}{\xi^{2}}\right) such that

\mathrm{Pr}\left[\left\|X^{\dagger}\Sigma^{\dagger}\Sigma Y-X^{\dagger}Y\right\|_{\mathrm{F}}\leq\left(2\varepsilon+\xi\right)\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}\right]\geq 1-\delta.

Theorem 2 gives a smaller bias and an exponentially better dependence on \delta than Theorem 1. On the other hand, it requires \varepsilon-approximate sampling-and-query access to both X and Y. We note that the dependence on \delta in Theorem 1 is likely to be inevitable since even with perfect sampling-and-query access, it is not known how to achieve a better dependence on \delta when given access to only one matrix [13].

We then use these techniques to show how to compute a sketch of the singular value transformation of a matrix, similarly to what [5] has done in the perfect sampling-and-query access setting. Here is the informal version of our result (see Section 4.3 for the formal version):

Theorem 3 (Informal version).

Let X\in\mathbb{C}^{m\times n} be a matrix and f\colon\mathbb{R}\to\mathbb{C} be a smooth function. Assume that we have \varepsilon-approximate sampling-and-query access to X. For any parameters \delta\in(0,1] and \gamma>0, we can efficiently construct two matrices \Sigma\in\mathbb{C}^{r\times m} and C\in\mathbb{C}^{r\times c} with r,c=\mathrm{poly}(\log(1/\delta),1/\gamma,\left\|X\right\|_{\mathrm{F}}) such that if \varepsilon is small enough then the inequality

\left\|X^{\dagger}\Sigma^{\dagger}\bar{f}(CC^{\dagger})\Sigma X-f(X^{\dagger}X)\right\|_{\mathrm{F}}\leq\gamma

holds with probability at least 1-\delta, where \bar{f}(x):=f(x)/x.

Theorem 3 shows that in order to get a good “sketch” of a singular value transformation f(X^{\dagger}X) of the matrix X^{\dagger}X it is enough to compute \bar{f}(CC^{\dagger}) for the much smaller matrix CC^{\dagger}\in\mathbb{C}^{r\times r}. This technique is the basis for most of our advanced applications.

Applications.

By applying all the above techniques, we are able to robustly dequantize (i.e., dequantize even when the input is given via \varepsilon-approximate sampling-and-query access for \varepsilon>0) most666Due to space constraints, in this paper we only present our robust dequantization results for a few important quantum algorithms. Our framework does apply to more quantum algorithms. Actually, we have not been able to find an example of a quantum algorithm that can be dequantized but not robustly dequantized. of the quantum algorithms that have been dequantized in the literature. Below we briefly discuss all our applications and refer to Section 6 for details. Simplified statements of the complexities of our results and their comparison with prior works can be found in Table 1.777In this paper, the notation \tilde{O}(\cdot) removes factors polylogarithmic in the dimension of the matrices and vectors.

Problem | Quantum | Classical (perfect sampling) | Classical (\varepsilon-approximate sampling)
Inner product | \tilde{O}(1/\xi^{2}) | \tilde{O}(1/\xi^{2}) [39] | \tilde{O}(1/\xi^{2}) Props. 1, 6
Low-rank QSVT | \tilde{O}\left(\frac{d}{\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\|}\right) [21] | \tilde{O}\left(\frac{d^{16}}{\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\|^{6}}\right) [5] | \tilde{O}\left(\frac{d^{16}}{\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\|^{6}}\right) Th. 7
Sparse QSVT | \mathrm{poly}(s,d) [21] | \tilde{O}(s^{d}) [18] | \tilde{O}(s^{d}) Th. 8
Sparse matrix inversion | \mathrm{poly}(s,\kappa,1/\eta) [21, 23] | \tilde{O}(s^{\kappa\log(\kappa/\eta)}) [18] | \tilde{O}(s^{\kappa\log(\kappa/\eta)}) Th. 8
Supervised clustering | \tilde{O}(1/\eta) [31] | \tilde{O}(1/\eta^{2}) [5, 40] | \tilde{O}(1/\eta^{2}) Th. 9
Recommendation systems | \tilde{O}(\sqrt{K}) [29] | \tilde{O}(K^{12}) [39], \tilde{O}(K^{8}) [5] | \tilde{O}(K^{8}) Th. 10
Low-rank matrix inversion | \tilde{O}(\sqrt{K}) [21, 22] | \tilde{O}(K^{14}) [5] | \tilde{O}(K^{14}) Th. 11
Table 1: Overview of our applications and comparison with prior works (the recent works [2, 20, 38] are omitted from this table; we briefly discuss them at the very end of Section 1.3). The complexities reported in the last column of this table assume that the parameter \varepsilon is small enough (the precise requirements on \varepsilon are given in the formal statements of the results).
  • Inner product (Section 3 and Section 6.1). In the quantum setting, the SWAP test can be used to estimate the absolute value of the inner product between two pure states with additive error ξ\xi using O(1/ξ2)O(1/\xi^{2}) copies of the states.888If quantum circuits constructing the two quantum states are available, then the complexity can be further reduced to O(1/ξ)O(1/\xi) using amplitude estimation. The dequantization technique from [5, 39] estimates the inner product of two vectors uu and vv with error ξuv\xi\left\|u\right\|\left\|v\right\| in time O(1/ξ2)O(1/\xi^{2}) given sampling-and-query access to one of the vectors and query access to the other, which matches the complexity of the SWAP test for unit-norm vectors. By applying Theorem 1, we immediately obtain a classical algorithm that estimates the inner product with error (22ε+ξ)uv(2\sqrt{2\varepsilon}+\xi)\left\|u\right\|\left\|v\right\| in time O(1/ξ2)O(1/\xi^{2}) when given ε\varepsilon-approximate sampling-and-query access to one of the vectors and query access to the other (Proposition 1 in Section 3 and Proposition 6 in Section 6.1). The term 22ε2\sqrt{2\varepsilon} here is a bias (coming from the bias in Theorem 1) due to the fact that we are sampling from a biased distribution.999Such a bias is inevitable when estimating the inner product using samples from a biased distribution. Note that a similar bias would appear for quantum algorithms based on the SWAP test in the analogue setting where the input states are at (trace) distance ε\varepsilon from the ideal states. This result shows that when ε\varepsilon is small enough, i.e., when we are sampling from a distribution close enough to the ideal distribution in the total variation distance, we can obtain a performance close to the performance of algorithms having access to the ideal distribution.

  • “Low-rank” Quantum Singular Value transformation (Section 6.2). Chia, Gilyén, Li, Lin, Tang and Wang [5] have shown how to dequantize in the low-rank regime the Quantum Singular Value Transformation (QSVT), which is a recent powerful paradigm developed by Gilyén, Su, Low and Wiebe [21] that can be used to recover most known quantum algorithms and construct new ones (we refer to [33] for a good survey). Given an even polynomial 𝔭\mathfrak{p} of degree dd, a matrix Am×nA\in\mathbb{C}^{m\times n} and a vector bb with b=1\left\|b\right\|=1, the QSVT is a technique to construct a good approximation of a quantum state proportional to the vector 𝔭(AA)b\mathfrak{p}(\sqrt{A^{\dagger}A})b. In the low-rank regime, which corresponds to the assumption AF=1\left\|A\right\|_{\mathrm{F}}=1 (or more generally AF1\left\|A\right\|_{\mathrm{F}}\leq 1), the complexity of the approach is O~(d/𝔭(AA)b)\tilde{O}(d/\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\|). Chia, Gilyén, Li, Lin, Tang and Wang [5] showed how to dequantize this technique by proving that given sampling-and-query access to AA and bb, it is possible to implement with high probability sampling-and-query access to a vector close to 𝔭(AA)b\mathfrak{p}(\sqrt{A^{\dagger}A})b in time O~(d16/𝔭(AA)b6)\tilde{O}(d^{16}/\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\|^{6}). By adapting the methodology from [5], we show how to achieve the same running time when given only ε\varepsilon-approximate sampling-and-query access to AA and bb for ε>0\varepsilon>0, when ε\varepsilon is small enough.

  • Sparse Quantum Singular Value Transformation (Section 6.3). Gharibian and Le Gall [18] have considered the problem of estimating the value v𝔭(AA)uv^{\dagger}\mathfrak{p}(\sqrt{A^{\dagger}A})u for a matrix Am×nA\in\mathbb{C}^{m\times n} such that A=1\left\|A\right\|=1 (or more generally A1\left\|A\right\|\leq 1) and two vectors u,vnu,v\in\mathbb{C}^{n}. Note that the condition A1\left\|A\right\|\leq 1 is significantly weaker than the condition AF1\left\|A\right\|_{\mathrm{F}}\leq 1 since the inequality AAF\left\|A\right\|\leq\left\|A\right\|_{\mathrm{F}} always holds but AF\left\|A\right\|_{\mathrm{F}} can be as large as nA\sqrt{n}\left\|A\right\|. On the other hand, with this weaker assumption the problem can only be solved efficiently in the classical setting for special types of matrices, such as ss-sparse matrices (i.e., matrices containing at most ss nonzero entries per row and column). The main result from [18] indeed shows that when given sampling-and-query access to an ss-sparse matrix AA, a good estimate of the value v𝔭(AA)uv^{\dagger}\mathfrak{p}(\sqrt{A^{\dagger}A})u can be computed in O~(sd)\tilde{O}(s^{d}) time, which matches (up to polynomial factors) the complexity of quantum algorithms based on the QSVT for constant ss and constant dd (Ref. [18] also shows that the problem becomes 𝖡𝖰𝖯\mathsf{BQP}-hard for d=poly(n)d=\mathrm{poly}(n) and ss constant). By combining our techniques for robust estimation of the inner product with the main technical result from [18], we show that this dequantization result even holds when considering 𝖲𝖰ε(v)\mathsf{SQ}_{\varepsilon}(v) for ε>0\varepsilon>0, when ε\varepsilon is small enough.

    Using this result, we can generalize the other dequantization results in [18, Section 4], and in particular obtain an efficient classical algorithm computing a constant-precision estimation of the ground energy of a local Hamiltonian when given \varepsilon-approximate sampling-and-query access to a guiding vector u that has constant overlap with the ground state (Theorem 1 in [18] showed this result only for the case \varepsilon=0). Similarly, the implications for the Quantum PCP conjecture and the No Low-Energy Samplable States (NLSS) conjecture discussed in [18, Section 5] also generalize to the case of approximate sampling-and-query access to the witness states.

    While not considered in [18], the same approach can be used for sparse matrix inversion as well. As discussed in [21] (see also [33]), the QSVT can be used for matrix inversion by choosing a polynomial 𝔭\mathfrak{p} that approximates the function f(x)=1/xf(x)=1/x on the interval [1,1](1/κ,1/κ)[-1,1]\setminus(-1/\kappa,1/\kappa), where κ\kappa is the condition number. The analysis from [21] shows that a good enough approximation can be obtained by a polynomial of degree d=O(κlog(κ/η))d=O(\kappa\log(\kappa/\eta)), where η\eta is the needed precision. The dequantization technique from [18] then leads to a classical algorithm with complexity O~(sκlog(κ/η))\tilde{O}(s^{\kappa\log(\kappa/\eta)}) given exact sampling-and-query access to the input. Our result shows that this complexity holds even when considering approximate sampling-and-query access.

  • Supervised clustering (Section 6.4). Lloyd, Mohseni, and Rebentrost [31] developed a quantum algorithm for the supervised clustering problem, an important problem in machine learning. From a computational perspective, this quantum algorithm simply computes the inner product between two vectors representing appropriately the data, which can be done in O~(1/η)\tilde{O}\left(1/\eta\right) time using the SWAP test, where η\eta represents the precision parameter. Prior dequantization works [5, 40] showed how to construct classical algorithms solving this problem in time O(1/η2)O\left(1/\eta^{2}\right) given sampling-and-query access to the input. By applying our algorithm for robust estimation of the inner product, we immediately obtain the same complexity when given only ε\varepsilon-approximate sampling-and-query access for ε>0\varepsilon>0.

  • Recommendation systems (Section 6.5). In recommendation systems, the main computational task reduces to sampling from a row of a low-rank matrix close to the input matrix AA (which corresponds to the users’ preference matrix). A key parameter for expressing the complexity of quantum algorithms for recommendation systems [29] and their dequantization [5, 39] is the ratio K=AF2/σ2K=\left\|A\right\|_{\mathrm{F}}^{2}/\sigma^{2}, where σ\sigma is a threshold value used for the specification of the problem. Kerenidis and Prakash [29] have shown how to solve the problem in time O~(K)\tilde{O}(\sqrt{K}) on a quantum computer when AA is given in QRAM. Ref. [5] gave a classical algorithm solving the problem in time O~(K16)\tilde{O}(K^{16}) when AA is given via sampling-and-query access, which matches the complexity of the quantum algorithm up to a (large) polynomial factor and improved the complexity of Tang’s original algorithm [39]. By applying our framework, we show how to obtain the same complexity when given only ε\varepsilon-approximate sampling-and-query access to the input for ε>0\varepsilon>0, when ε\varepsilon is small enough.

  • “Low-rank” matrix inversion (Section 6.6). While the matrix inversion problem solved by the HHL quantum algorithm is 𝖡𝖰𝖯\mathsf{BQP}-hard [23] and thus unlikely to be dequantized, several low-rank versions of matrix inversion (or, more precisely, solving a system of linear equations) have been studied in recent works on dequantization [4, 5, 22]. Ref. [5], in particular, showed that classical algorithms can solve linear equations in time O~(K14)\tilde{O}(K^{14}) given sampling-and-query access to the matrix and the vector specifying the system of linear equations. Since quantum algorithms solve this problem in time O~(K)\tilde{O}(\sqrt{K}) when AA is given in QRAM [21, 22], the classical algorithms match this complexity up to a (large) polynomial factor. By applying our framework, we show how to obtain the same complexity when given only ε\varepsilon-approximate sampling-and-query access for ε>0\varepsilon>0, when ε\varepsilon is small enough.

All these results give strong evidence for the lack of exponential speedup for quantum machine learning algorithms in the QRAM model (and additionally for the framework from [18]) even when the classical algorithms only have approximate sampling-and-query access to the input. The main conceptual message of this paper is thus that no quantum advantage seems to arise in machine learning (at least for the problems we consider) from the ability of quantum algorithms to natively robustly manipulate classical information encoded as quantum states.

More recent quantum-inspired algorithms.

Very recently, Bakshi and Tang [2] have constructed quantum-inspired classical algorithms for the Quantum Singular Value transformation and its applications with improved bounds compared with [5]. For low-rank matrix inversion, recent works by Gilyén, Song and Tang [20] and Montanaro and Shao [38] have given quantum-inspired classical algorithms with improved bounds over [5] as well. These improved algorithms, which use iterative methods and new classical techniques, assume perfect sampling-and-query access to the data, as in [5]. It would be interesting to obtain similar improvements in the case of approximate sampling-and-query access.

1.4 Overview of our techniques

Our work adapts the frameworks developed by Chia, Gilyén, Li, Lin, Tang and Wang [5] and Gharibian and Le Gall [18] to the case of approximate sampling-and-query access to the input. In particular, we need to adapt essentially all the definitions and technical lemmas from [5], as well as their proofs. From a technical perspective, the three main contributions are the three techniques described below.

Robust estimation of the inner product.

We describe the main idea behind our robust estimation of the inner product. Consider two unit-norm vectors u,v\in\mathbb{R}^{n}, where u is given via \varepsilon-approximate sampling-and-query access and v is given by query access (i.e., on an input i\in\{1,\ldots,n\} we can query v(i)). A straightforward adaptation of the method described in Section 1.2 would be: sample an index i\in\{1,\ldots,n\} according to the distribution \tilde{p}_{u} and output the value v(i)/u(i). Comparing with Equation (1), we see that indices i such that u(i) is very small but \tilde{p}_{u}(i) is large lead to a large bias. This suggests the following strategy: use instead an estimator of the type

\begin{cases}0&\textrm{ if }i\in\Gamma,\\ \frac{v(i)}{u(i)}&\textrm{ if }i\notin\Gamma,\end{cases}

for a set \Gamma=\{i\in\{1,\ldots,n\}\>|\>\left|u(i)\right|\leq\theta(i)\}, where \theta(i) is a well-chosen threshold function. Ideally, we would like to choose \theta(i) based on the value of \tilde{p}_{u}(i). This is unfortunately not possible since we do not have access to the value \tilde{p}_{u}(i). Our key discovery is that choosing \theta(i) based on the value of v(i) does work. Using such an estimator leads to a biased estimator where the bias comes from two parts: the first part comes from the fact that we are sampling from \tilde{p}_{u} instead of p_{u}, and the second part comes from the fact that we are disregarding the contribution of the indices in the set \Gamma. When setting \theta(i)=\gamma\left|v(i)\right| for some constant \gamma, we discover that it is possible to derive “complementary” upper bounds on these two biases: the upper bound on the first bias is proportional to \gamma while the upper bound on the second bias is inversely proportional to \gamma. We then take the optimal value for \gamma and set \theta(i) accordingly. Concretely, we take \theta(i)=\sqrt{2\varepsilon}\left|v(i)\right| (with additional normalization factors when u and v are not unit-norm), which gives a bias of \sqrt{2\varepsilon}+\sqrt{2\varepsilon}=2\sqrt{2\varepsilon}. We refer to Section 3 for a fairly self-contained presentation of the details of our technique for robust estimation of the inner product.101010The inner product estimation algorithm of Section 3 is actually a special case of Theorem 1 in Section 4.1. We present it in Section 3 as a “warm-up” since for this special case the presentation is significantly lighter (in particular, we do not need the notion of matrix sketch introduced in Section 4.1).
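The following Python sketch implements this thresholded estimator under the stated assumptions (unit-norm real vectors, samples from a distribution at total variation distance at most \varepsilon from p_{u}); the function names and the way the perturbed distribution is built in the demo are ours, chosen only for illustration.

import numpy as np

def robust_inner_product(sample_index, query_u, query_v, eps, num_samples):
    # sample_index() draws an index from a distribution within eps of p_u in
    # total variation distance; query_u / query_v return entries of u and v.
    total = 0.0
    for _ in range(num_samples):
        i = sample_index()
        ui, vi = query_u(i), query_v(i)
        # Discard indices below the threshold theta(i) = sqrt(2*eps)*|v(i)|.
        if abs(ui) > np.sqrt(2 * eps) * abs(vi):
            total += vi / ui
    return total / num_samples

rng = np.random.default_rng(2)
n, eps = 1000, 0.01
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
p_tilde = (1 - eps) * u**2 + eps / n    # within eps of p_u in total variation distance
est = robust_inner_product(lambda: rng.choice(n, p=p_tilde),
                           lambda i: u[i], lambda i: v[i], eps, 20000)
print(est, np.dot(u, v))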

Extension to robust estimation of matrix multiplication.

Our second main technical contribution is extending this strategy to compute a general matrix product X^{\dagger}Y. We adapt the notion of matrix sketch, which is one of the central concepts in linear algebra algorithms based on random sampling [11, 12, 13, 14, 15, 17, 27, 28], to the setting of approximate sampling (see Definition 9 in Section 4.1). This is the key step that enables us to adapt the matrix multiplication sketching technique by Drineas, Kannan and Mahoney [13] to the setting where only \varepsilon-approximate sampling-and-query access to one matrix is given, for \varepsilon>0. When given approximate sampling access to only one of the matrices (X or Y), we apply a technique similar to our robust estimation of the inner product (indeed, we obtain the 2\sqrt{2\varepsilon} term in the bias of the sketch in Theorem 1). When given approximate sampling access to both matrices X and Y, we adapt the technique based on joint-matrix sampling from [13], which was also used in [5]. The first step is again to adapt the definition to the setting of approximate sampling (Definition 12 in Section 4.2). Using the inequality of arithmetic and geometric means and a finer analysis of the success probability based on McDiarmid’s bounded difference inequality, this technique leads to an algorithm (Theorem 2) with an exponentially better dependence on the error parameter \delta.

Robust rejection sampling.

Our third main technical contribution, crucial to dequantize quantum algorithms for the low-rank quantum singular value transformation and recommendation systems, is the conversion technique mentioned in Section 1.3, which converts approximate oversampling-and-query access into approximate sampling-and-query access. A similar conversion has been used in the perfect setting in [5, 39], based on the standard technique in probabilistic algorithms called rejection sampling (see, e.g., [9] for a detailed description of this technique). Our main technical contribution is to develop a robust version of rejection sampling.

Standard rejection sampling is a method to generate samples from a (hard) distribution p_{1}\colon\{1,\ldots,n\}\to[0,1] given the ability to generate samples from another (easier) distribution p_{2}\colon\{1,\ldots,n\}\to[0,1]. The method works as follows: sample j\in\{1,\ldots,n\} from p_{2}; output it with probability \frac{p_{1}(j)}{mp_{2}(j)} for a large value m (large enough such that p_{1}(j)\leq mp_{2}(j) holds for all j\in\{1,\ldots,n\}) and otherwise report “failure”. It is easy to see that the probability of outputting j is p_{1}(j)/m and the probability that the process does not fail is 1/m. In consequence, conditioned on the event that the process does not fail, the probability that sample j is output is precisely p_{1}(j). Repeating the process \Theta(m) times will then output a sample drawn from the distribution p_{1} with high probability. In Section 5.1, we show how to generalize this technique to the case where we can only generate samples from a probability distribution \tilde{p}_{2} that is close to p_{2} in total variation distance. At first glance, it seems that the same difficulty as when trying to extend Equation (1) to the approximate sampling setting arises. We show that the condition p_{1}(j)\leq mp_{2}(j) is actually strong enough to control the bias, which enables us to sample with high probability from a probability distribution close to p_{1} in total variation distance (see Proposition 3 in Section 5.1).
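For concreteness, here is a short Python sketch of standard rejection sampling as just described, with a toy uniform proposal (all names and constants are illustrative). In the robust version of Section 5.1, the proposal samples would instead come from a distribution \tilde{p}_{2} that is only close to p_{2} in total variation distance, and the condition p_{1}(j)\leq mp_{2}(j) is what controls the resulting bias.

import numpy as np

def rejection_sample(target_p, propose, proposal_p, m, max_rounds, rng):
    # Draw j from the proposal and accept it with probability
    # target_p(j) / (m * proposal_p(j)); each round accepts with probability 1/m.
    for _ in range(max_rounds):
        j = propose()
        if rng.random() < target_p(j) / (m * proposal_p(j)):
            return j
    return None  # report failure

rng = np.random.default_rng(3)
n = 8
p1 = np.array([0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05])  # hard distribution
p2 = np.full(n, 1.0 / n)                                         # easy (uniform) proposal
m = float(np.max(p1 / p2))                                       # ensures p1(j) <= m * p2(j)
draws = [rejection_sample(lambda j: p1[j], lambda: int(rng.integers(n)),
                          lambda j: p2[j], m, 50, rng) for _ in range(5000)]
draws = [j for j in draws if j is not None]
print(np.bincount(draws, minlength=n) / len(draws))              # should be close to p1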

2 Preliminaries

In this section we present the definitions needed for this work. In Section 2.1 we give general definitions and notations. In Section 2.2 we define several notions of access to a vector and in particular introduce the concept of approximate sampling-and-query access, which is one of the main conceptual contributions of this work. In Section 2.3, we extend these definitions to matrices.

2.1 Notations and general definitions

General notations.

In this paper we assume that arithmetic operations in \mathbb{C} (i.e., addition, subtraction, multiplication and division of two scalars) can be done at unit cost.111111All our upper bounds can easily be converted into upper bounds for the bit-complexity of the problem as well if the notions of query access and sampling access to the input are redefined appropriately to take into consideration the bit-complexity of the input vectors and matrices. For any complex number x, we write x^{\ast} for its complex conjugate. We denote by \mathbb{R}_{\geq 0} the set of non-negative real numbers. We say that a polynomial \mathfrak{p}\in\mathbb{C}[x] is even if \mathfrak{p}(x)=\mathfrak{p}(-x) for all x\in\mathbb{C}. Given a function f\colon\mathbb{R}\to\mathbb{C}, a set \Lambda\subseteq\mathbb{R} and a positive real number L, we say that f is L-Lipschitz on \Lambda if the inequality \left|f(x)-f(y)\right|\leq L\left|x-y\right| holds for any x,y\in\Lambda.

Matrices and vectors.

For a matrix Am×nA\in\mathbb{C}^{m\times n}, we write its entries as A(i,j)A(i,j) for (i,j){1,,m}×{1,,n}(i,j)\in\{1,\ldots,m\}\times\{1,\ldots,n\}. For each i{1,,m}i\in\{1,\ldots,m\}, we denote the ii-th row of AA as A(i,)A(i,\cdot) . For any j{1,,n}j\in\{1,\ldots,n\}, we denote the jj-th column of AA as A(,j)A(\cdot,j). We write AA^{\ast} for its conjugate and AA^{\dagger} for its transpose-conjugate, i.e., the matrices such that A(i,j)=A(i,j)A^{\ast}(i,j)=A(i,j)^{*} and A(i,j)=A(j,i)A^{\dagger}(i,j)=A(j,i)^{*} for all (i,j){1,,m}×{1,,n}(i,j)\in\{1,\ldots,m\}\times\{1,\ldots,n\}. We use similar notations for vectors.

For a vector unu\in\mathbb{C}^{n}, we denote its 2\ell_{2} norm as u\left\|u\right\|. For a matrix Am×nA\in\mathbb{C}^{m\times n}, we denote its spectral norm as A\left\|A\right\|, and denote its Frobenius norm as AF\left\|A\right\|_{\mathrm{F}}, i.e.,

\left\|A\right\|_{\mathrm{F}}=\sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}\left|A(i,j)\right|^{2}}.

We have \left\|A\right\|\leq\left\|A\right\|_{\mathrm{F}}\leq\sqrt{\min(m,n)}\left\|A\right\|.121212In this paper we systematically use the upper bound \left\|A\right\|\leq\left\|A\right\|_{\mathrm{F}} to state the bounds on the complexity of our algorithms only in terms of \left\|A\right\|_{\mathrm{F}}. It is possible (and done in prior works [5]) to give a finer analysis of the statements of upper bounds using both \left\|A\right\| and \left\|A\right\|_{\mathrm{F}}, but for clarity we abstain from doing this since it gives more complicated bounds.

Given a Hermitian matrix Hn×nH\in\mathbb{C}^{n\times n}, we order its eigenvalues in non-increasing order and denote them λ1(H)λ2(H)λn(H)\lambda_{1}(H)\geq\lambda_{2}(H)\geq\cdots\geq\lambda_{n}(H). We write

\mathrm{Spec}(H)=\{\lambda_{1}(H),\lambda_{2}(H),\ldots,\lambda_{n}(H)\}

for the spectrum of HH.

In Section 4.3 we will need the following two lemmas.

Lemma 1 (Hoffman-Wielandt theorem [24]).

For Hermitian matrices X,Ym×mX,Y\in\mathbb{C}^{m\times m}, we have

\sum_{i=1}^{m}\left|\lambda_{i}(X)-\lambda_{i}(Y)\right|^{2}\leq\left\|X-Y\right\|_{\mathrm{F}}^{2}.
Lemma 2 (Corollary 2.3 in [19]).

For two Hermitian matrices X,Ym×mX,Y\in\mathbb{C}^{m\times m} and a function f:f\colon\mathbb{R}\to\mathbb{C} that is LL-Lipschitz on Spec(X)Spec(Y)\mathrm{Spec}(X)\cup\mathrm{Spec}(Y), we have

\left\|f(X)-f(Y)\right\|_{\mathrm{F}}\leq L\left\|X-Y\right\|_{\mathrm{F}}.

For a positive semidefinite Hermitian matrix Hn×nH\in\mathbb{C}^{n\times n}, consider its spectral decomposition

H=\sum_{i=1}^{n}\lambda_{i}(H)v_{i}v_{i}^{\dagger},

where the {vi}i\{v_{i}\}_{i} are orthonormal vectors in n\mathbb{C}^{n}. For any function f:0f\colon\mathbb{R}_{\geq 0}\to\mathbb{C}, the matrix

f(H)=\sum_{i=1}^{n}f(\lambda_{i}(H))v_{i}v_{i}^{\dagger}

is called the singular value transformation of HH associated to ff (which in this case coincides with the eigenvalue transformation of HH associated to ff). We use this definition to define the concept, which we will consider in Sections 6.2 and 6.3, of (even) polynomial transformation of an arbitrary (i.e., non-Hermitian) matrix. For an even polynomial 𝔭[x]\mathfrak{p}\in\mathbb{C}[x] and a matrix Am×nA\in\mathbb{C}^{m\times n}, the (even) polynomial transformation of AA associated to 𝔭\mathfrak{p} is defined as 𝔭(AA)\mathfrak{p}(\sqrt{A^{\dagger}A}). Let us write

\mathfrak{p}(x)=a_{0}+a_{2}x^{2}+a_{4}x^{4}+\cdots+a_{d}x^{d},

where dd is the (even) degree of 𝔭\mathfrak{p} and a0,,ada_{0},\ldots,a_{d} are coefficients. It is easy to check that the following relation holds:

\mathfrak{p}(\sqrt{A^{\dagger}A})=a_{0}I_{n}+a_{2}A^{\dagger}A+a_{4}(A^{\dagger}A)^{2}+\cdots+a_{d}(A^{\dagger}A)^{d},

where InI_{n} denotes the identity matrix of dimension nn.
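The following Python sketch evaluates this identity directly (an illustrative helper of our own, not a procedure from the paper) and checks it against the singular value decomposition.

import numpy as np

def even_poly_transform(A, even_coeffs):
    # even_coeffs = [a_0, a_2, ..., a_d] for p(x) = a_0 + a_2 x^2 + ... + a_d x^d.
    # Uses p(sqrt(A^dagger A)) = a_0 I_n + a_2 A^dagger A + ... + a_d (A^dagger A)^{d/2}.
    n = A.shape[1]
    gram = A.conj().T @ A
    result = even_coeffs[0] * np.eye(n, dtype=complex)
    power = np.eye(n, dtype=complex)
    for a in even_coeffs[1:]:
        power = power @ gram
        result = result + a * power
    return result

A = np.random.default_rng(4).standard_normal((6, 4))
U, s, Vh = np.linalg.svd(A, full_matrices=False)
reference = Vh.conj().T @ np.diag(1 + 2 * s**2 + 0.5 * s**4) @ Vh
print(np.allclose(even_poly_transform(A, [1, 2, 0.5]), reference))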

We now define the singular value transformation of an arbitrary matrix. This concept will be used in Sections 6.5 and 6.6. For any matrix A\in\mathbb{C}^{m\times n}, consider its singular value decomposition

A=\sum_{i=1}^{\min(m,n)}\sigma_{i}u_{i}v_{i}^{\dagger},

where the σi\sigma_{i}’s are nonnegative real numbers, {ui}i\{u_{i}\}_{i} are orthonormal vectors in m\mathbb{C}^{m} and {vi}i\{v_{i}\}_{i} are orthonormal vectors in n\mathbb{C}^{n}. For any function f:0f\colon\mathbb{R}_{\geq 0}\to\mathbb{C} such that f(0)=0f(0)=0, the matrix

f(A)=\sum_{i=1}^{\min(m,n)}f(\sigma_{i})u_{i}v_{i}^{\dagger}

is called the singular value transformation of AA associated to ff. Note that when AA is a positive semidefinite Hermitian matrix we recover the above definition.131313For arbitrary matrices, the condition f(0)=0f(0)=0 guarantees that the singular value transformation is well defined.
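A direct way to compute such a transformation for small explicit matrices is via the singular value decomposition; the Python sketch below (an illustrative helper of our own) does exactly that and checks, on a positive semidefinite example, that the square-root transformation composed with itself recovers the matrix.

import numpy as np

def singular_value_transform(A, f):
    # f(A) = sum_i f(sigma_i) u_i v_i^dagger for a function f with f(0) = 0.
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    return (U * f(s)) @ Vh           # same as U @ diag(f(s)) @ Vh

rng = np.random.default_rng(5)
B = rng.standard_normal((5, 5))
H = B @ B.T                          # positive semidefinite Hermitian matrix
sqrt_H = singular_value_transform(H, np.sqrt)
print(np.allclose(sqrt_H @ sqrt_H, H))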

Probability distributions and matrix-valued random variables.

Given a probability distribution p:{1,,n}[0,1]p\colon\{1,\ldots,n\}\to[0,1], we denote its support as supp(p)\mathrm{supp}(p), i.e.,

\mathrm{supp}(p)=\{i\in\{1,\ldots,n\}\>|\>p(i)>0\}.

Given two probability distributions p,q:{1,,n}[0,1]p,q\colon\{1,\ldots,n\}\to[0,1] we write their total variation distance as

|p-q|_{\mathrm{tv}}=\frac{1}{2}\sum_{i=1}^{n}\left|p(i)-q(i)\right|.

As already mentioned in the introduction, for any nonzero vector unu\in\mathbb{C}^{n} we define the probability distribution pu:{1,,n}[0,1]p_{u}\colon\{1,\ldots,n\}\to[0,1] as

p_{u}(i)=\frac{|u(i)|^{2}}{\left\|u\right\|^{2}}

for each i{1,,n}i\in\{1,\ldots,n\}.

We will often work with matrix-valued random variables. We recall the definition of the expectation and variance of such variables. A matrix-valued random variable Z\in\mathbb{C}^{m\times n} is a matrix where each entry Z(i,j) is a complex-valued random variable, i.e., Z(i,j)=X(i,j)+i\>Y(i,j) for two real-valued random variables X(i,j) and Y(i,j). The expectation of Z is the matrix C\in\mathbb{C}^{m\times n} such that C(i,j)=\mathbb{E}\!\left[X(i,j)\right]+i\>\mathbb{E}\!\left[Y(i,j)\right]. The variance of Z is

\mathrm{Var}\left[Z\right]=\mathbb{E}\!\left[\left\|Z-\mathbb{E}\!\left[Z\right]\right\|_{\mathrm{F}}^{2}\right]=\mathbb{E}\!\left[\left\|Z\right\|_{\mathrm{F}}^{2}\right]-\left\|\mathbb{E}\!\left[Z\right]\right\|_{\mathrm{F}}^{2}.

With these definitions, Chebyshev’s inequality (which is just Markov’s inequality applied to the real-valued random variable Z𝔼[Z]F2\left\|Z-\mathbb{E}\!\left[Z\right]\right\|_{\mathrm{F}}^{2}) holds for matrix-valued random variables.

We will also need McDiarmid’s bounded difference inequality [34].

Lemma 3 (McDiarmid’s bounded difference inequality [34]).

Let X1,,XrX_{1},\ldots,X_{r} be independent random variables, where XiX_{i} takes values in a set AiA_{i} for each i{1,,r}i\in\{1,\ldots,r\}. Let f:A1××Arf\colon A_{1}\times\cdots\times A_{r}\to\mathbb{R} be a function satisfying the following property: for any i{1,,r}i\in\{1,\ldots,r\},

\left|f(x_{1},\ldots,x_{r})-f(x^{\prime}_{1},\ldots,x^{\prime}_{r})\right|\leq d_{i}

whenever the vectors (x1,,xr)(x_{1},\ldots,x_{r}) and (x1,,xr)(x^{\prime}_{1},\ldots,x^{\prime}_{r}) differ in just the ii-th coordinate. Then for any t>0t>0,

\mathrm{Pr}\left[\left|f(X_{1},\ldots,X_{r})-\mathbb{E}\!\left[f(X_{1},\ldots,X_{r})\right]\right|\geq t\right]\leq 2\exp\left(-\frac{2t^{2}}{\sum_{i=1}^{r}d_{i}^{2}}\right).
Powering lemma.

In order to amplify the success probability of probabilistic estimators, we will often use the following version of the “powering lemma” from [25]. Since our version is slightly more general than the original version (our version applies to complex-valued random variables as well), we include a proof for completeness.

Lemma 4.

Consider a randomized algorithm that produces an estimate μ~\tilde{\mu} of a complex-valued quantity μ\mu such that |μ~μ|ε\left|\tilde{\mu}-\mu\right|\leq\varepsilon holds with probability at least 3/43/4. Then, for any δ>0\delta>0, it suffices to repeat O(log(1/δ))O(\log(1/\delta)) times the algorithm to obtain an estimate μ^\hat{\mu} such that |μ^μ|2ε\left|\hat{\mu}-\mu\right|\leq\sqrt{2}\varepsilon holds with probability at least 1δ1-\delta.

Proof.

Let us write μ=μ1+iμ2\mu=\mu_{1}+i\mu_{2} with μ1,μ2\mu_{1},\mu_{2}\in\mathbb{R}. Each application of the algorithm returns a complex number μ~=μ~1+iμ~2\tilde{\mu}=\tilde{\mu}_{1}+i\tilde{\mu}_{2} with μ~1,μ~2\tilde{\mu}_{1},\tilde{\mu}_{2}\in\mathbb{R} such that both |μ~1μ1|ε\left|\tilde{\mu}_{1}-\mu_{1}\right|\leq\varepsilon and |μ~2μ2|ε\left|\tilde{\mu}_{2}-\mu_{2}\right|\leq\varepsilon hold with probability at least 3/43/4. We repeat the algorithm mm times. We compute the median of the real parts of the mm estimates and write it μ^1\hat{\mu}_{1}. Similarly, we compute the median of the imaginary parts of the mm estimates and write it μ^2\hat{\mu}_{2}. Finally, we output μ^=μ^1+iμ^2\hat{\mu}=\hat{\mu}_{1}+i\hat{\mu}_{2}.

By taking m=O(log(1/δ))m=O(\log(1/\delta)), Chernoff’s bound guarantees that the inequality |μ^1μ1|ε\left|\hat{\mu}_{1}-\mu_{1}\right|\leq\varepsilon holds with probability at least 1δ/21-\delta/2, and the inequality |μ^2μ2|ε\left|\hat{\mu}_{2}-\mu_{2}\right|\leq\varepsilon holds with probability at least 1δ/21-\delta/2. The inequality |μ^μ|2ε\left|\hat{\mu}-\mu\right|\leq\sqrt{2}\varepsilon then holds with probability at least 1δ1-\delta. ∎
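The proof above translates into a few lines of code; the Python sketch below is an illustration with an arbitrary repetition constant (the exact constant hidden in O(\log(1/\delta)) follows from the Chernoff bound and is not the point here), and all names are ours.

import numpy as np

def powered_estimate(run_once, delta, reps_constant=18):
    # Repeat the basic estimator O(log(1/delta)) times and take the median of the
    # real parts and the median of the imaginary parts, as in the proof of Lemma 4.
    m = max(1, int(np.ceil(reps_constant * np.log(2.0 / delta))))
    samples = np.array([run_once() for _ in range(m)], dtype=complex)
    return np.median(samples.real) + 1j * np.median(samples.imag)

# Toy estimator of mu = 0.3 + 0.4i: close to mu with probability 3/4, far off otherwise.
rng = np.random.default_rng(6)
mu = 0.3 + 0.4j
def run_once():
    if rng.random() < 0.75:
        return mu + 0.05 * (rng.random() - 0.5) * (1 + 1j)
    return mu + 5.0 * (rng.random() + 1j * rng.random())

print(powered_estimate(run_once, delta=1e-6))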

2.2 Sampling-and-query access to vectors

In this subsection we define several versions of sampling access to vectors. The relations between these notions are summarized in Figure 1.

[Figure 1 depicts the relations between \mathsf{Q}(u) (query access), \mathsf{S}_{\varepsilon}(u) (sampling access), \mathsf{SQ}_{\varepsilon}(u), {\mathsf{RSQ}_{\varepsilon,\delta,\eta}}(u) (randomized sampling-and-query access) and \mathsf{OSQ}_{\varepsilon,\phi}(u) (oversampling-and-query access), with the dotted arrow labeled Prop. 3.]
Figure 1: Relation between the notions of sampling access to a vector defined in Section 2.2. A plain arrow shows a trivial generalization. The dotted arrow refers to the implementation proved in Proposition 3 (which modifies the parameter ε\varepsilon).
Basic definitions.

We first define query access and (approximate) sampling access to a vector.

Definition 1.

For a nonzero vector u\in\mathbb{C}^{n}, we say that we have query access to u, which we write \mathsf{Q}(u), if on input i\in\{1,\ldots,n\} we can query the entry u(i). We denote \textbf{q}(u) the cost of implementing one such query.

Definition 2.

For a nonzero vector u\in\mathbb{C}^{n}, we say that we have \varepsilon-approximate sampling access to u, which we write \mathsf{S}_{\varepsilon}(u), if we can sample from a distribution \tilde{p}\colon\{1,\ldots,n\}\to[0,1] such that |p_{u}-\tilde{p}|_{\mathrm{tv}}\leq\varepsilon. We denote \textbf{s}(u) the cost of generating one sample.

We now define the concept of (approximate) sampling-and-query access, which is the main definition of this paper.

Definition 3.

For a nonzero vector u\in\mathbb{C}^{n} and a parameter \varepsilon\geq 0, we say that we have \varepsilon-approximate sampling-and-query access to u, which we write \mathsf{SQ}_{\varepsilon}(u), if we have \mathsf{Q}(u) and \mathsf{S}_{\varepsilon}(u) and additionally can get the value \left\|u\right\|. We write \textbf{sq}(u) the maximum among \textbf{q}(u), \textbf{s}(u) and the cost of obtaining \left\|u\right\|. When \varepsilon=0, we simply write \mathsf{SQ}(u).

Knowledge of u\left\|u\right\| is required since this norm cannot be easily computed from 𝖰(u)\mathsf{Q}(u) and 𝖲ε(u)\mathsf{S}_{\varepsilon}(u). The literature on sampling methods for linear algebra, as well as recent works on dequantization [5, 39], have used 𝖲𝖰(u)\mathsf{SQ}(u) (i.e., Definition 3 with ε=0\varepsilon=0).141414We also mention the work by Cotler, Huang and McClean [8], which compares 𝖲𝖰(u)\mathsf{SQ}(u) with quantum access to a quantum state proportional to uu and shows that 𝖲𝖰(u)\mathsf{SQ}(u) can in some cases be more powerful due to the query access allowed in the classical setting (this advantage naturally disappears if we allow quantum algorithms to use 𝖰(u)\mathsf{Q}(u)). We are not aware of any prior work that introduces a similar definition with ε>0\varepsilon>0.
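As a concrete (purely illustrative) data structure, the Python class below packages the three operations of Definition 3; here u, its norm and the perturbed distribution are stored explicitly, whereas in the dequantization setting each operation is an oracle of cost sq(u), typically polylogarithmic in n.

import numpy as np

class ApproxSQVector:
    # Illustrative container for eps-approximate sampling-and-query access to a
    # nonzero vector u (Definition 3).  The class name and fields are ours.
    def __init__(self, u, sample_dist, eps, rng=None):
        self._u = np.asarray(u)
        self._p = np.asarray(sample_dist)    # promised: |p - p_u|_tv <= eps
        self.eps = eps
        self.norm = np.linalg.norm(self._u)  # the value ||u||
        self._rng = np.random.default_rng() if rng is None else rng

    def query(self, i):                      # the operation Q(u)
        return self._u[i]

    def sample(self):                        # the operation S_eps(u)
        return int(self._rng.choice(len(self._p), p=self._p))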

More technical definitions.

We define two generalizations of our central definition (Definition 3) that will be useful to state our results.

The first one is a slight generalization of Definition 3, which only requires obtaining an approximation of \left\|u\right\| and only requires randomized procedures for sampling and norm estimation. This generalization is needed since in some applications we will not be able to implement sampling with probability one or compute exactly the norm of the vector.

Definition 4.

For a nonzero vector u\in\mathbb{C}^{n} and parameters \varepsilon,\delta,\eta\geq 0, we say that we have (\varepsilon,\delta,\eta)-approximate randomized sampling-and-query access to u, which we write {\mathsf{RSQ}_{\varepsilon,\delta,\eta}}(u), if the following three conditions are satisfied:

  • (i)

    we have 𝖰(u)\mathsf{Q}(u);

  • (ii)

    we can get an estimate u~\widetilde{\left\|u\right\|} such that |u~u|ηu\left|\widetilde{\left\|u\right\|}-\left\|u\right\|\right|\leq\eta\left\|u\right\| holds with probability at least 1δ1-\delta;

  • (iii)

    we can generate with probability at least 1δ1-\delta a sample from a distribution p~:{1,,n}[0,1]\tilde{p}\colon\{1,\ldots,n\}\to[0,1] such that |pup~|tvε|p_{u}-\tilde{p}|_{\mathrm{tv}}\leq\varepsilon. (This means that we have a Las Vegas algorithm that samples from p~\tilde{p} with probability at least 1δ1-\delta: the algorithm outputs an error message with probability at most δ\delta, but when it does not output an error message it returns a sample from the distribution p~\tilde{p}.)

We write rsq(u)\textbf{rsq}(u) the maximum among q(u)\textbf{q}(u), the cost of getting u~\widetilde{\left\|u\right\|} and the cost of sampling from p~\tilde{p}.

The second one is a more significant generalization based on the concept of oversampling from [13]. This is a generalization of the definition from [5], which corresponds to the case ε=0\varepsilon=0. This generalization is convenient since (as we will show later) it has several useful closure properties.

Definition 5.

For a nonzero vector unu\in\mathbb{C}^{n} and parameters ε0\varepsilon\geq 0 and ϕ1\phi\geq 1, we say that we have (ε,ϕ)(\varepsilon,\phi)-approximate oversampling-and-query access to uu, which we write 𝖮𝖲𝖰ε,ϕ(u)\mathsf{OSQ}_{\varepsilon,\phi}(u), if the following two conditions are satisfied:

  • (i)

    we have 𝖰(u)\mathsf{Q}(u);

  • (ii)

    we have 𝖲𝖰ε(u¯)\mathsf{SQ}_{\varepsilon}(\bar{u}) for a vector u¯n\bar{u}\in\mathbb{C}^{n} such that u¯2=ϕu2\left\|\bar{u}\right\|^{2}=\phi\left\|u\right\|^{2} and |u¯(i)||u(i)|\left|\bar{u}(i)\right|\geq\left|u(i)\right| for all i{1,,n}i\in\{1,\ldots,n\}.

We denote osq(u)=max{q(u),sq(u¯)}\textbf{osq}(u)=\max\{\textbf{q}(u),\textbf{sq}(\bar{u})\}.

2.3 Sampling-and-query access to matrices

We now define sampling access to matrices. We first state the definition of query access to a matrix, which is a direct generalization of the concept of query access to a vector (Definition 1).

Definition 6.

For a nonzero matrix Am×nA\in\mathbb{C}^{m\times n}, we say that we have query access to AA, which we write 𝖰(A)\mathsf{Q}(A), if on input (i,j){1,,m}×{1,,n}(i,j)\in\{1,\ldots,m\}\times\{1,\ldots,n\} we can query the entry A(i,j)A(i,j). We denote q(A)\textbf{q}(A) the cost of implementing such a query.

For a matrix Am×nA\in\mathbb{C}^{m\times n}, we denote rA\mathrm{r_{A}} the vector in m\mathbb{C}^{m} such that the ii-th coordinate is A(i,)\left\|A(i,\cdot)\right\|, for each i{1,,m}i\in\{1,\ldots,m\}. We now introduce our central definition of approximate sampling-and-query access to a matrix, which corresponds to approximate sampling-and-query access to all rows of AA and also to rA\mathrm{r_{A}}. This is a generalization of the standard definition from the literature, which corresponds to the case ε=0\varepsilon=0.

Definition 7.

For a nonzero matrix Am×nA\in\mathbb{C}^{m\times n} and a parameter ε0\varepsilon\geq 0, we say that we have ε\varepsilon-approximate sampling-and-query access to AA, which we write 𝖲𝖰ε(A)\mathsf{SQ}_{\varepsilon}(A), if the following two conditions are satisfied:

  • (i)

    for each i{1,,m}i\in\{1,\ldots,m\} such that A(i,)0\left\|A(i,\cdot)\right\|\neq 0, we have 𝖲𝖰ε(A(i,))\mathsf{SQ}_{\varepsilon}(A(i,\cdot));

  • (ii)

    we have 𝖲𝖰ε(rA)\mathsf{SQ}_{\varepsilon}(\mathrm{r_{A}}).

We denote sq(A)=max{sq(rA),sq(A(1,)),,sq(A(m,))}\textbf{sq}(A)=\max\{\textbf{sq}(\mathrm{r_{A}}),\textbf{sq}(A(1,\cdot)),\dots,\textbf{sq}(A(m,\cdot))\}. When ε=0\varepsilon=0, we simply write 𝖲𝖰(A)\mathsf{SQ}(A).

Note that from Condition (i) in Definition 7 we automatically get 𝖰(A)\mathsf{Q}(A), and from Condition (ii) we automatically get the norm AF\left\|A\right\|_{\mathrm{F}}.

We finally define oversampling access to a matrix, in a way similar to how oversampling access to a vector is defined. The case ε=0\varepsilon=0 corresponds to the model considered in [5].

Definition 8.

For a nonzero matrix Am×nA\in\mathbb{C}^{m\times n} and parameters ε0\varepsilon\geq 0 and ϕ1\phi\geq 1, we say that we have (ε,ϕ)(\varepsilon,\phi)-approximate oversampling-and-query access to AA, which we write 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A), if the following two conditions are satisfied:

  • (i)

    we have 𝖰(A)\mathsf{Q}(A);

  • (ii)

    we have 𝖲𝖰ε(A¯)\mathsf{SQ}_{\varepsilon}(\bar{A}) for a nonzero matrix A¯m×n\bar{A}\in\mathbb{C}^{m\times n} such that |A¯(i,j)||A(i,j)|\left|\bar{A}(i,j)\right|\geq\left|A(i,j)\right| for all (i,j){1,,m}×{1,,n}(i,j)\in\{1,\ldots,m\}\times\{1,\ldots,n\} and \left\|\bar{A}\right\|_{\mathrm{F}}^{2}=\phi\left\|A\right\|_{\mathrm{F}}^{2}.

We denote osq(A)=max{q(A),sq(A¯)}\textbf{osq}(A)=\max\{\textbf{q}(A),\textbf{sq}(\bar{A})\}.

3 Warm-Up: Robust Estimation of the Inner Product

As a warm-up, in this section we describe a special case that illustrates one of the main ideas of this paper: how to estimate an inner product given approximate sampling-and-query access to one vector and query access to the other. Here is the statement of the result:

Proposition 1.

For any ε[0,1]\varepsilon\in[0,1], assume that

  • we have 𝖲𝖰ε(u)\mathsf{SQ}_{\varepsilon}(u) for a nonzero vector unu\in\mathbb{C}^{n};

  • we have 𝖰(v)\mathsf{Q}(v) for a nonzero vector vnv\in\mathbb{C}^{n} and know a value v~\widetilde{\left\|v\right\|} such that v~v\widetilde{\left\|v\right\|}\geq\left\|v\right\|.

For any δ(0,1)\delta\in(0,1) and any ξ>0\xi>0, we can output an estimator α\alpha such that

|α(u,v)|(22ε+ξ)uv~\left|\alpha-(u,v)\right|\leq(2\sqrt{2\varepsilon}+\xi)\left\|u\right\|\widetilde{\left\|v\right\|}

holds with probability at least 1δ1-\delta at cost O(log(1/δ)ξ2(sq(u)+q(v)))O\left(\frac{\log(1/\delta)}{\xi^{2}}(\textbf{sq}(u)+\textbf{q}(v))\right).

Proposition 1 already generalizes the inner product estimation algorithm from [39, Proposition 4.2], which worked only for the case of perfect sampling-and-query access (i.e., ε=0\varepsilon=0). The key idea to prove Proposition 1 is to partition the coordinates into two carefully-chosen sets (the sets Γ\Gamma and {1,,n}Γ\{1,\ldots,n\}\setminus\Gamma in our proof) and define the estimator differently on these two sets. This idea will also be used to show the matrix multiplication algorithm of Theorem 1. Note that by applying Theorem 1, we will give in Section 6.1 another inner product estimation algorithm (see Proposition 6), which is more general than Proposition 1 since it applies even to the case of approximate oversampling-and-query access.
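For illustration only, the following Python sketch implements this estimator, using the hypothetical ApproxSQAccess wrapper sketched in Section 2.2 for 𝖲𝖰ε(u)\mathsf{SQ}_{\varepsilon}(u) and a plain array for 𝖰(v)\mathsf{Q}(v); the median-of-means amplification plays the role of Lemma 4, and the constants are indicative only.

```python
import numpy as np

def estimate_inner_product(sq_u, v, v_norm_bound, eps, xi, delta, rng=None):
    """Estimate (u, v) = sum_i u(i) * conj(v(i)) as in Proposition 1.

    sq_u: query/sample/norm access to u (e.g. the ApproxSQAccess sketch above).
    v_norm_bound: a known upper bound on ||v||.
    """
    rng = rng or np.random.default_rng()
    u_norm = sq_u.norm
    # Coordinates i with |v(i)| >= threshold * |u(i)| form the set Gamma of the proof.
    threshold = v_norm_bound / (np.sqrt(2 * eps) * u_norm) if eps > 0 else np.inf
    r = int(np.ceil(16 / xi ** 2))                       # samples averaged per estimate
    reps = max(1, int(np.ceil(18 * np.log(1 / delta))))  # repetitions for the median

    means = []
    for _ in range(reps):
        total = 0.0
        for _ in range(r):
            i = sq_u.sample()
            u_i = sq_u.query(i)
            # X = 0 on Gamma, and X = conj(v(i)) ||u||^2 / conj(u(i)) outside Gamma.
            if abs(v[i]) < threshold * abs(u_i):
                total += np.conj(v[i]) * u_norm ** 2 / np.conj(u_i)
        means.append(total / r)
    # Median of the averaged estimates (taken on real and imaginary parts separately).
    return np.median(np.real(means)) + 1j * np.median(np.imag(means))
```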

Proof of Proposition 1.

Let us define the set

Γ={i{1,,n}||v(i)|v~2εu|u(i)|}.\Gamma=\left\{i\in\{1,\ldots,n\}\>|\>\left|v(i)\right|\geq\frac{\widetilde{\left\|v\right\|}}{\sqrt{2\varepsilon}\left\|u\right\|}\left|u(i)\right|\right\}.

Note that u(i)0u(i)\neq 0 and

|v(i)||u(i)|v~2εu\frac{\left|v(i)\right|}{\left|u(i)\right|}\leq\frac{\widetilde{\left\|v\right\|}}{\sqrt{2\varepsilon}\left\|u\right\|}

hold for all iΓi\notin\Gamma. Let XX be the random variable representing the output of the following process: sample an index i{1,,n}i\in\{1,\ldots,n\} from the probability distribution p~u\tilde{p}_{u} and output

{0 if iΓ,v(i)u2u(i) if iΓ.\begin{cases}0&\textrm{ if }i\in\Gamma,\\ \frac{v(i)^{\ast}\left\|u\right\|^{2}}{u(i)^{\ast}}&\textrm{ if }i\notin\Gamma.\end{cases}

We have

|𝔼[X](u,v)|\displaystyle\left|\mathbb{E}\!\left[X\right]-(u,v)\right| =|iΓv(i)u2u(i)p~u(i)i=1nu(i)v(i)|\displaystyle=\left|\sum_{i\notin\Gamma}\frac{v(i)^{\ast}\left\|u\right\|^{2}}{u(i)^{\ast}}\tilde{p}_{u}(i)-\sum_{i=1}^{n}u(i)v(i)^{\ast}\right|
|iΓv(i)u2u(i)p~u(i)iΓv(i)u2u(i)pu(i)|+|iΓv(i)u2u(i)pu(i)i=1nu(i)v(i)|\displaystyle\leq\left|\sum_{i\notin\Gamma}\frac{v(i)^{\ast}\left\|u\right\|^{2}}{u(i)^{\ast}}\tilde{p}_{u}(i)-\sum_{i\notin\Gamma}\frac{v(i)^{\ast}\left\|u\right\|^{2}}{u(i)^{\ast}}p_{u}(i)\right|+\left|\sum_{i\notin\Gamma}\frac{v(i)^{\ast}\left\|u\right\|^{2}}{u(i)^{\ast}}p_{u}(i)-\sum_{i=1}^{n}u(i)v(i)^{\ast}\right|
u2v~2εuiΓ|p~u(i)pu(i)|+|iΓu(i)v(i)|\displaystyle\leq\frac{\left\|u\right\|^{2}\widetilde{\left\|v\right\|}}{\sqrt{2\varepsilon}\left\|u\right\|}\sum_{i\notin\Gamma}\left|\tilde{p}_{u}(i)-p_{u}(i)\right|+\left|\sum_{i\in\Gamma}u(i)v(i)^{\ast}\right|
uv~2ε2ε+(iΓ|u(i)|2)(iΓ|v(i)|2)\displaystyle\leq\frac{\left\|u\right\|\widetilde{\left\|v\right\|}}{\sqrt{2\varepsilon}}2\varepsilon+\sqrt{\left(\sum_{i\in\Gamma}\left|u(i)\right|^{2}\right)\left(\sum_{i\in\Gamma}\left|v(i)\right|^{2}\right)}
uv~2ε2ε+(iΓ2εu2v~2|v(i)|2)(iΓ|v(i)|2)\displaystyle\leq\frac{\left\|u\right\|\widetilde{\left\|v\right\|}}{\sqrt{2\varepsilon}}2\varepsilon+\sqrt{\left(\sum_{i\in\Gamma}\frac{2\varepsilon\left\|u\right\|^{2}}{\widetilde{\left\|v\right\|}^{2}}\left|v(i)\right|^{2}\right)\left(\sum_{i\in\Gamma}\left|v(i)\right|^{2}\right)}
2εuv~+2εuv\displaystyle\leq\sqrt{2\varepsilon}\left\|u\right\|\widetilde{\left\|v\right\|}+\sqrt{2\varepsilon}\left\|u\right\|\left\|v\right\|
≤22εuv~.\displaystyle\leq 2\sqrt{2\varepsilon}\left\|u\right\|\widetilde{\left\|v\right\|}.

We now compute the variance:

Var[X]\displaystyle\mathrm{Var}\left[X\right] =𝔼[|X|2]|𝔼[X]|2\displaystyle=\mathbb{E}\!\left[\left|X\right|^{2}\right]-\left|\mathbb{E}\!\left[X\right]\right|^{2}
𝔼[|X|2]\displaystyle\leq\mathbb{E}\!\left[\left|X\right|^{2}\right]
=iΓ|v(i)|2u4|u(i)|2p~u(i)\displaystyle=\sum_{i\notin\Gamma}\frac{\left|v(i)\right|^{2}\left\|u\right\|^{4}}{\left|u(i)\right|^{2}}\tilde{p}_{u}(i)
=iΓ|v(i)|2u4|u(i)|2(p~u(i)pu(i))+iΓ|v(i)|2u4|u(i)|2pu(i)\displaystyle=\sum_{i\notin\Gamma}\frac{\left|v(i)\right|^{2}\left\|u\right\|^{4}}{\left|u(i)\right|^{2}}(\tilde{p}_{u}(i)-p_{u}(i))+\sum_{i\notin\Gamma}\frac{\left|v(i)\right|^{2}\left\|u\right\|^{4}}{\left|u(i)\right|^{2}}p_{u}(i)
=iΓ|v(i)|2u4|u(i)|2(p~u(i)pu(i))+iΓ|v(i)|2u4u2\displaystyle=\sum_{i\notin\Gamma}\frac{\left|v(i)\right|^{2}\left\|u\right\|^{4}}{\left|u(i)\right|^{2}}(\tilde{p}_{u}(i)-p_{u}(i))+\sum_{i\notin\Gamma}\left|v(i)\right|^{2}\frac{\left\|u\right\|^{4}}{\left\|u\right\|^{2}}
u2v~22ε2|p~upu|tv+u2v2\displaystyle\leq\frac{\left\|u\right\|^{2}\widetilde{\left\|v\right\|}^{2}}{2\varepsilon}2|\tilde{p}_{u}-p_{u}|_{\mathrm{tv}}+\left\|u\right\|^{2}\left\|v\right\|^{2}
2u2v~2.\displaystyle\leq 2\left\|u\right\|^{2}\widetilde{\left\|v\right\|}^{2}.

Finally, we explain how to compute our estimate of the inner product (u,v)(u,v). Take r=16/ξ2r=\left\lceil 16/\xi^{2}\right\rceil independent copies of XX and denote the corresponding random variables as X1,,XrX_{1},\ldots,X_{r}. Consider the random variable Z=X1++XrrZ=\frac{X_{1}+\cdots+X_{r}}{r}. By Chebyshev’s inequality we have

Pr[|ZE[Z]|>ξ2uv~]2Var[Z]ξ2u2v~2=2Var[X]rξ2u2v~214.\mathrm{Pr}\left[\left|Z-E[Z]\right|>\frac{\xi}{\sqrt{2}}\left\|u\right\|\widetilde{\left\|v\right\|}\right]\leq\frac{2\mathrm{Var}\left[Z\right]}{\xi^{2}\left\|u\right\|^{2}\widetilde{\left\|v\right\|}^{2}}=\frac{2\mathrm{Var}\left[X\right]}{r\xi^{2}\left\|u\right\|^{2}\widetilde{\left\|v\right\|}^{2}}\leq\frac{1}{4}.

We then use Lemma 4 to compute an estimate α\alpha such that

Pr[|αE[Z]|ξuv~]1δ.\mathrm{Pr}\left[\left|\alpha-E[Z]\right|\leq\xi\left\|u\right\|\widetilde{\left\|v\right\|}\right]\geq 1-\delta.

By the triangle inequality we conclude that

Pr[|α(u,v)|(22ε+ξ)uv~]1δ.\mathrm{Pr}\left[\left|\alpha-(u,v)\right|\leq\left(2\sqrt{2\varepsilon}+\xi\right)\left\|u\right\|\widetilde{\left\|v\right\|}\right]\geq 1-\delta.

The overall complexity is

O(rlog(1/δ)(sq(u)+q(v)))=O(log(1/δ)ξ2(sq(u)+q(v))),O\left(r\log(1/\delta)(\textbf{sq}(u)+\textbf{q}(v))\right)=O\left(\frac{\log(1/\delta)}{\xi^{2}}(\textbf{sq}(u)+\textbf{q}(v))\right),

as claimed. ∎

4 Matrix Multiplication using Importance Matrix Sketches

In this section we introduce several notions of approximate matrix sketching and show how to use them to approximate matrix products and singular value transformations.

4.1 Approximate importance matrix sketches and matrix multiplication

In this subsection we define the basic notion of approximate matrix sketch needed for this work and show how to use it to approximate the matrix product.

Here is the first key definition.

Definition 9.

Let r,mr,m be two positive integers such that rmr\leq m, and p,p~p,\tilde{p} be two probability distributions over {1,,m}\{1,\ldots,m\}. The r×mr\times m matrix sampled according to (p,p~)(p,\tilde{p}) is the matrix Sr×mS\in\mathbb{R}^{r\times m} obtained by the following process: for each i{1,,r}i\in\{1,\ldots,r\}, choose an index si{1,,m}s_{i}\in\{1,\ldots,m\} by sampling from the distribution p~\tilde{p}, and set the ii-th row of SS as

S(i,)={esirp(si)if p(si)>0,𝟎if p(si)=0,S(i,\cdot)=\begin{cases}\frac{e_{s_{i}}}{\sqrt{rp(s_{i})}}&\textrm{if }p(s_{i})>0,\\ {\mathbf{0}}&\textrm{if }p(s_{i})=0,\end{cases}

where esie_{s_{i}}\in\mathbb{R}^{1\times m} is the row-vector that has coordinate 1 in the sis_{i}-th position and 0 elsewhere, and 𝟎{\mathbf{0}} denotes the all-zero row-vector. (For our purpose, the definition of S(i,)S(i,\cdot) for the case p(si)=0p(s_{i})=0 is actually arbitrary: this case will never occur in our calculations since we will work on the support of the distribution pp.) We call the list ((s1,α1),,(sr,αr))((s_{1},\alpha_{1}),\ldots,(s_{r},\alpha_{r})), where αi=1/rp(si)\alpha_{i}=1/\sqrt{rp(s_{i})} if p(si)>0p(s_{i})>0 and αi=0\alpha_{i}=0 otherwise, the standard description of SS.
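As an illustration, the sampling process of Definition 9 can be written in a few lines of Python; only the standard description ((s1,α1),,(sr,αr))((s_{1},\alpha_{1}),\ldots,(s_{r},\alpha_{r})) is returned, since the matrix SS itself never needs to be formed explicitly (the distributions pp and p~\tilde{p} are passed as arrays purely for convenience).

```python
import numpy as np

def sample_sketch_description(p, p_tilde, r, rng=None):
    """Standard description of an r x m matrix sampled according to (p, p_tilde),
    as in Definition 9: a list of pairs (s_i, alpha_i) with alpha_i = 1/sqrt(r p(s_i))."""
    rng = rng or np.random.default_rng()
    m = len(p)
    description = []
    for _ in range(r):
        s_i = rng.choice(m, p=p_tilde)   # row index drawn from the accessible distribution
        alpha_i = 1.0 / np.sqrt(r * p[s_i]) if p[s_i] > 0 else 0.0
        description.append((int(s_i), alpha_i))
    return description
```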

We now define approximate importance matrix sketches.

Definition 10.

Given a nonzero matrix Am×nA\in\mathbb{C}^{m\times n}, a positive integer rmr\leq m and two parameters ε[0,1]\varepsilon\in[0,1] and ϕ1\phi\geq 1, an (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance matrix sketch of AA is a matrix Sr×mS\in\mathbb{R}^{r\times m} that is sampled according to (p,p~)(p,\tilde{p}) for some probability distributions p,p~p,\tilde{p} satisfying the following two conditions:

  • p(i)1ϕprA(i)p(i)\geq\frac{1}{\phi}p_{\mathrm{r_{A}}}(i) for all i{1,,m}i\in\{1,\ldots,m\};

  • |p~p|tvε|\tilde{p}-p|_{\mathrm{tv}}\leq\varepsilon.

This notion of approximate importance matrix sketch is motivated by the following definition.

Definition 11.

Consider a nonzero matrix Am×nA\in\mathbb{C}^{m\times n} with 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A), and an integer rmr\leq m. Let A¯\bar{A} denote the matrix from Definition 8 and p~rA¯\tilde{p}_{\mathrm{{r}_{\bar{A}}}} denote the distribution such that |p~rA¯prA¯|tvε|\tilde{p}_{\mathrm{{r}_{\bar{A}}}}-p_{\mathrm{{r}_{\bar{A}}}}|_{\mathrm{tv}}\leq\varepsilon corresponding to 𝖲𝖰ε(A¯)\mathsf{SQ}_{\varepsilon}(\bar{A}). The r×mr\times m matrix sampled according to (prA¯,p~rA¯)(p_{\mathrm{{r}_{\bar{A}}}},\tilde{p}_{\mathrm{{r}_{\bar{A}}}}) is called the (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance matrix sketch of AA associated with 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A).

It is straightforward to check that the matrix sketch of Definition 11 is an (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance matrix sketch of AA according to Definition 10: indeed, prA¯(i)=A¯(i,)2A¯F2A(i,)2ϕAF2=1ϕprA(i)p_{\mathrm{{r}_{\bar{A}}}}(i)=\frac{\left\|\bar{A}(i,\cdot)\right\|^{2}}{\left\|\bar{A}\right\|_{\mathrm{F}}^{2}}\geq\frac{\left\|A(i,\cdot)\right\|^{2}}{\phi\left\|A\right\|_{\mathrm{F}}^{2}}=\frac{1}{\phi}p_{\mathrm{r_{A}}}(i) for all i{1,,m}i\in\{1,\ldots,m\}. Definition 11 is the special case of Definition 10 that is relevant for algorithmic applications when we can access a matrix via approximate oversampling-and-query access. Since all the results of this subsection work for Definition 10, we nevertheless present our results using this more general (and mathematically more elegant) definition.

The following lemma, which extends Lemma 4.2 in [5] to the setting of approximate importance matrix sketches, shows that if SS is an (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance matrix sketch of AA with rr large enough, ε\varepsilon small enough and ϕ\phi close enough to 1, then SAF\left\|SA\right\|_{\mathrm{F}} is a good approximation of AF\left\|A\right\|_{\mathrm{F}}.

Lemma 5.

Given a matrix Am×nA\in\mathbb{C}^{m\times n}, a positive integer rmr\leq m and two parameters ε[0,1]\varepsilon\in[0,1] and ϕ1\phi\geq 1, consider an (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance matrix sketch SS of AA. We have

[SA](i,)2ϕAF2r\left\|[SA](i,\cdot)\right\|^{2}\leq\frac{\phi\left\|A\right\|_{\mathrm{F}}^{2}}{r} (2)

for all i{1,,r}i\in\{1,\ldots,r\}. Moreover, for any δ(0,1]\delta\in(0,1] we have

Pr[|SAF2AF2|(2ε+ln(2/δ)2r)ϕAF2]1δ.\mathrm{Pr}\left[\left|\left\|SA\right\|_{\mathrm{F}}^{2}-\left\|A\right\|_{\mathrm{F}}^{2}\right|\leq\left(2\varepsilon+\sqrt{\frac{\ln{(2/\delta)}}{2r}}\right)\phi\left\|A\right\|_{\mathrm{F}}^{2}\right]\geq 1-\delta.
Proof.

For any i{1,,r}i\in\{1,\ldots,r\}, we have

[SA](i,)2=A(si,)2rp(si)ϕA(si,)2rprA(si)=ϕAF2r\left\|[SA](i,\cdot)\right\|^{2}=\frac{\left\|A(s_{i},\cdot)\right\|^{2}}{rp(s_{i})}\leq\frac{\phi\left\|A(s_{i},\cdot)\right\|^{2}}{rp_{\mathrm{r_{A}}}(s_{i})}=\frac{\phi\left\|A\right\|_{\mathrm{F}}^{2}}{r}

if p(si)>0p(s_{i})>0. If p(si)=0p(s_{i})=0 we have [SA](i,)=0\left\|[SA](i,\cdot)\right\|=0, which trivially satisfies the inequality.

Let us now prove the second part. We have

|𝔼[[SA](i,)2]AF2r|\displaystyle\left|\mathbb{E}\!\left[\left\|[SA](i,\cdot)\right\|^{2}\right]-\frac{\left\|A\right\|_{\mathrm{F}}^{2}}{r}\right| =|ssupp(p)p~(s)A(s,)2rp(s)ssupp(p)A(s,)2r|\displaystyle=\left|\sum_{s\in\mathrm{supp}(p)}\tilde{p}(s)\frac{\left\|A(s,\cdot)\right\|^{2}}{rp(s)}-\frac{\sum_{s\in\mathrm{supp}(p)}\left\|A(s,\cdot)\right\|^{2}}{r}\right|
=|ssupp(p)(p~(s)p(s))A(s,)2rp(s)|\displaystyle=\left|\sum_{s\in\mathrm{supp}(p)}(\tilde{p}(s)-p(s))\frac{\left\|A(s,\cdot)\right\|^{2}}{rp(s)}\right|
ssupp(p)|p~(s)p(s)|A(s,)2rp(s)\displaystyle\leq\sum_{s\in\mathrm{supp}(p)}\left|\tilde{p}(s)-p(s)\right|\frac{\left\|A(s,\cdot)\right\|^{2}}{rp(s)}
ssupp(p)|p~(s)p(s)|ϕAF2r\displaystyle\leq\sum_{s\in\mathrm{supp}(p)}\left|\tilde{p}(s)-p(s)\right|\frac{\phi\left\|A\right\|_{\mathrm{F}}^{2}}{r}
2εϕAF2r,\displaystyle\leq\frac{2\varepsilon\phi\left\|A\right\|_{\mathrm{F}}^{2}}{r},

where we used again p(s)1ϕprA(s)p(s)\geq\frac{1}{\phi}p_{\mathrm{r_{A}}}(s) to derive the second inequality. We thus have

\left|\mathbb{E}\!\left[\left\|SA\right\|_{\mathrm{F}}^{2}\right]-\left\|A\right\|_{\mathrm{F}}^{2}\right|\leq 2\varepsilon\phi\left\|A\right\|_{\mathrm{F}}^{2}.

Observe that the quantity \left\|SA\right\|_{\mathrm{F}}^{2}=\sum_{i=1}^{r}\left\|[SA](i,\cdot)\right\|^{2} is a sum of rr independent random variables, each in the interval [0,ϕAF2/r][0,\phi\left\|A\right\|_{\mathrm{F}}^{2}/r] (the upper bound follows from Equation (2)). From Hoeffding’s inequality, we conclude that

Pr[|SAF2𝔼[SAF2]|ln(2/δ)2rϕAF2]δ.\mathrm{Pr}\left[\left|\left\|SA\right\|_{\mathrm{F}}^{2}-\mathbb{E}\!\left[\left\|SA\right\|_{\mathrm{F}}^{2}\right]\right|\geq\sqrt{\frac{\ln{(2/\delta)}}{2r}}\phi\left\|A\right\|_{\mathrm{F}}^{2}\right]\leq\delta.

From the triangle inequality we get

Pr[|SAF2AF2|(2ε+ln(2/δ)2r)ϕAF2]1δ,\mathrm{Pr}\left[\left|\left\|SA\right\|_{\mathrm{F}}^{2}-\left\|A\right\|_{\mathrm{F}}^{2}\right|\leq\left(2\varepsilon+\sqrt{\frac{\ln{(2/\delta)}}{2r}}\right)\phi\left\|A\right\|_{\mathrm{F}}^{2}\right]\geq 1-\delta,

as claimed. ∎

Here is the main result of this subsection, which shows how to approximate a matrix product XYX^{\dagger}Y by using an approximate matrix sketch of XX.

Theorem 1 (formal statement).

Given two nonzero matrices Xm×nX\in\mathbb{C}^{m\times n} and Ym×nY\in\mathbb{C}^{m\times n^{\prime}}, a positive integer rmr\leq m and two parameters ε[0,1]\varepsilon\in[0,1] and ϕ1\phi\geq 1, define the set

Γ={j{1,,m}|Y(j,)YF2ϕεXFX(j,)}.\Gamma=\left\{j\in\{1,\ldots,m\}\>|\>\left\|Y(j,\cdot)\right\|\geq\frac{\left\|Y\right\|_{\mathrm{F}}}{\sqrt{2\phi\varepsilon}\left\|X\right\|_{\mathrm{F}}}\left\|X(j,\cdot)\right\|\right\}.

Consider an (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance matrix sketch Σ\Sigma of XX, obtained by sampling the indices s1,,srs_{1},\ldots,s_{r} as in Definition 9. Let Σ¯r×m\bar{\Sigma}\in\mathbb{R}^{r\times m} be the matrix obtained from Σ\Sigma by the following process: for each i{1,,r}i\in\{1,\ldots,r\}, if siΓs_{i}\in\Gamma then replace the ii-th row of Σ\Sigma by an all-zero row. Then for any δ>0\delta>0,

Pr[XΣ¯Σ¯YXYF(22ε+2rδ)ϕXFYF]1δ.\mathrm{Pr}\left[\left\|X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y-X^{\dagger}Y\right\|_{\mathrm{F}}\leq\left(2\sqrt{2\varepsilon}+\sqrt{\frac{2}{r\delta}}\right)\sqrt{\phi}\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}\right]\geq 1-\delta.

The informal version of Theorem 1 given in the introduction corresponds to the case ϕ=1\phi=1 (the claim that Σ¯\bar{\Sigma} can be efficiently constructed follows from Lemma 6 in Section 5.3).
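Concretely, XΣ¯Σ¯YX^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y is a sum of at most rr rank-one terms, one for each sampled index that falls outside Γ\Gamma. The following Python sketch spells this out for ε>0\varepsilon>0; dense arrays and an explicitly computed Γ\Gamma are used purely for illustration, whereas in the dequantization setting only the sampled rows would ever be read.

```python
import numpy as np

def sketched_product(X, Y, description, eps, phi):
    """Approximate X^dagger Y by X^dagger Sigma_bar^dagger Sigma_bar Y (Theorem 1).

    `description` is the standard description ((s_1, alpha_1), ..., (s_r, alpha_r))
    of an (r, eps, phi)-approximate importance matrix sketch of X."""
    X = np.asarray(X, dtype=complex)
    Y = np.asarray(Y, dtype=complex)
    x_rows = np.linalg.norm(X, axis=1)
    y_rows = np.linalg.norm(Y, axis=1)
    # j is in Gamma when ||Y(j,.)|| >= ||Y||_F / (sqrt(2 phi eps) ||X||_F) * ||X(j,.)||.
    thresh = np.linalg.norm(Y) / (np.sqrt(2 * phi * eps) * np.linalg.norm(X))
    approx = np.zeros((X.shape[1], Y.shape[1]), dtype=complex)
    for s_j, alpha in description:
        if y_rows[s_j] >= thresh * x_rows[s_j]:
            continue  # s_j lies in Gamma: the corresponding row of Sigma_bar is zeroed
        # alpha^2 * X(s_j,.)^dagger Y(s_j,.) is one rank-one term of the sum
        approx += alpha ** 2 * np.outer(np.conj(X[s_j]), Y[s_j])
    return approx
```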

Proof of Theorem 1.

At a high level, our proof applies a strategy similar to the one used in Lemma 4 in [13] (and also used in [5]). In particular, we compute the expectation of the sketch, bound its variance and conclude using Chebyshev’s inequality. Significant additional care is needed in the analysis since we are not working with Σ\Sigma but with Σ¯\bar{\Sigma}, and since these sketches are obtained by sampling not from the perfect distribution but only from a distribution close to it, which leads to a biased estimator instead of the unbiased estimators of those prior works.

Let pp and p~\tilde{p} denote the probability distributions from Definition 10. Observe that for any jΓj\notin\Gamma we necessarily have X(j,)>0\left\|X(j,\cdot)\right\|>0, and thus p(j)>0p(j)>0 since p(j)1ϕprX(j)p(j)\geq\frac{1}{\phi}p_{\mathrm{r_{X}}}(j).

We first show that the expectation of XΣ¯Σ¯YX^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y is close to XYX^{\dagger}Y:

𝔼[XΣ¯Σ¯Y]XYF\displaystyle\left\|\mathbb{E}\!\left[X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y\right]-X^{\dagger}Y\right\|_{\mathrm{F}} =𝔼[i=1r[(Σ¯X)](,i)[Σ¯Y](i,)]j=1mX(,j)Y(j,)F\displaystyle=\left\|\mathbb{E}\!\left[\sum_{i=1}^{r}[(\bar{\Sigma}X)^{\dagger}](\cdot,i)[\bar{\Sigma}Y](i,\cdot)\right]-\sum_{j=1}^{m}X^{\dagger}(\cdot,j)Y(j,\cdot)\right\|_{\mathrm{F}}
=𝔼[i=1r([Σ¯X](i,))[Σ¯Y](i,)]j=1mX(j,)Y(j,)F\displaystyle=\left\|\mathbb{E}\!\left[\sum_{i=1}^{r}([\bar{\Sigma}X](i,\cdot))^{\dagger}[\bar{\Sigma}Y](i,\cdot)\right]-\sum_{j=1}^{m}X(j,\cdot)^{\dagger}Y(j,\cdot)\right\|_{\mathrm{F}}
=r𝔼[([Σ¯X](1,))[Σ¯Y](1,)]j=1mX(j,)Y(j,)F\displaystyle=\left\|r\mathbb{E}\!\left[([\bar{\Sigma}X](1,\cdot))^{\dagger}[\bar{\Sigma}Y](1,\cdot)\right]-\sum_{j=1}^{m}X(j,\cdot)^{\dagger}Y(j,\cdot)\right\|_{\mathrm{F}}
=rjΓp~(j)X(j,)Y(j,)rp(j)j=1mX(j,)Y(j,)F\displaystyle=\left\|r\sum_{j\notin\Gamma}\tilde{p}(j)\frac{X(j,\cdot)^{\dagger}Y(j,\cdot)}{rp(j)}-\sum_{j=1}^{m}X(j,\cdot)^{\dagger}Y(j,\cdot)\right\|_{\mathrm{F}}
=jΓ(p~(j)p(j))X(j,)Y(j,)p(j)jΓX(j,)Y(j,)F\displaystyle=\left\|\sum_{j\notin\Gamma}\left(\tilde{p}(j)-p(j)\right)\frac{X(j,\cdot)^{\dagger}Y(j,\cdot)}{p(j)}-\sum_{j\in\Gamma}X(j,\cdot)^{\dagger}Y(j,\cdot)\right\|_{\mathrm{F}}
jΓ|p~(j)p(j)|ϕXF2Y(j,)X(j,)+jΓX(j,)Y(j,)\displaystyle\leq\sum_{j\notin\Gamma}\left|\tilde{p}(j)-p(j)\right|\frac{\phi\left\|X\right\|_{\mathrm{F}}^{2}\left\|Y(j,\cdot)\right\|}{\left\|X(j,\cdot)\right\|}+\sum_{j\in\Gamma}\left\|X(j,\cdot)\right\|\left\|Y(j,\cdot)\right\|
ϕXFYF2ϕε2ε+jΓ2ϕεXFYFY(j,)2\displaystyle\leq\frac{\phi\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}}{\sqrt{2\phi\varepsilon}}2\varepsilon+\sum_{j\in\Gamma}\frac{\sqrt{2\phi\varepsilon}\left\|X\right\|_{\mathrm{F}}}{\left\|Y\right\|_{\mathrm{F}}}\left\|Y(j,\cdot)\right\|^{2}
22ϕεXFYF.\displaystyle\leq 2\sqrt{2\phi\varepsilon}\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}.

Let us now bound the variance \mathrm{Var}\left[X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y\right]=\mathbb{E}\!\left[\left\|X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y-\mathbb{E}\!\left[X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y\right]\right\|_{\mathrm{F}}^{2}\right]. Each entry of X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y is a sum of rr independent terms, one per row of \bar{\Sigma}, so its variance is the sum of the variances of these terms, and the variance of each term is at most its second moment. We thus have:

\displaystyle\mathrm{Var}\left[X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y\right]=\sum_{i=1}^{n}\sum_{j=1}^{n^{\prime}}\mathrm{Var}\left[[X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y](i,j)\right]
\displaystyle\leq r\sum_{i=1}^{n}\sum_{j=1}^{n^{\prime}}\mathbb{E}\!\left[\left|[\bar{\Sigma}X](1,i)[\bar{\Sigma}Y](1,j)\right|^{2}\right]
\displaystyle=r\mathbb{E}\!\left[\sum_{i=1}^{n}\sum_{j=1}^{n^{\prime}}\left|[\bar{\Sigma}X](1,i)[\bar{\Sigma}Y](1,j)\right|^{2}\right]
=r𝔼[[Σ¯X](1,)2[Σ¯Y](1,)2]\displaystyle=r\mathbb{E}\!\left[\left\|[\bar{\Sigma}X](1,\cdot)\right\|^{2}\left\|[\bar{\Sigma}Y](1,\cdot)\right\|^{2}\right]
=rjΓp~(j)X(j,)2Y(j,)2(rp(j))2\displaystyle=r\sum_{j\notin\Gamma}\tilde{p}(j)\frac{\left\|X(j,\cdot)\right\|^{2}\left\|Y(j,\cdot)\right\|^{2}}{(rp(j))^{2}}
=rjΓ(p~(j)p(j))X(j,)2Y(j,)2(rp(j))2+rjΓp(j)X(j,)2Y(j,)2(rp(j))2\displaystyle=r\sum_{j\notin\Gamma}(\tilde{p}(j)-p(j))\frac{\left\|X(j,\cdot)\right\|^{2}\left\|Y(j,\cdot)\right\|^{2}}{(rp(j))^{2}}+r\sum_{j\notin\Gamma}p(j)\frac{\left\|X(j,\cdot)\right\|^{2}\left\|Y(j,\cdot)\right\|^{2}}{(rp(j))^{2}}
rjΓ(p~(j)p(j))X(j,)4YF22ϕε(rp(j))2XF2+ϕXF2YF2r.\displaystyle\leq r\sum_{j\notin\Gamma}(\tilde{p}(j)-p(j))\frac{\left\|X(j,\cdot)\right\|^{4}\left\|Y\right\|_{\mathrm{F}}^{2}}{2\phi\varepsilon(rp(j))^{2}\left\|X\right\|_{\mathrm{F}}^{2}}+\frac{\phi\left\|X\right\|_{\mathrm{F}}^{2}\left\|Y\right\|_{\mathrm{F}}^{2}}{r}.

Using the inequality p(j)1ϕprX(j)p(j)\geq\frac{1}{\phi}p_{\mathrm{r_{X}}}(j) and prX(j)=X(j,)2XF2p_{\mathrm{r_{X}}}(j)=\frac{\left\|X(j,\cdot)\right\|^{2}}{\left\|X\right\|_{\mathrm{F}}^{2}}, we get

Var[XΣ¯Σ¯Y]jΓ(p~(j)p(j))ϕXF2YF22εr+ϕXF2YF2r2ϕXF2YF2r.\mathrm{Var}\left[X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y\right]\leq\sum_{j\notin\Gamma}(\tilde{p}(j)-p(j))\frac{\phi\left\|X\right\|_{\mathrm{F}}^{2}\left\|Y\right\|_{\mathrm{F}}^{2}}{2\varepsilon r}+\frac{\phi\left\|X\right\|_{\mathrm{F}}^{2}\left\|Y\right\|_{\mathrm{F}}^{2}}{r}\leq\frac{2\phi\left\|X\right\|_{\mathrm{F}}^{2}\left\|Y\right\|_{\mathrm{F}}^{2}}{r}.

Using Chebyshev’s inequality we obtain

Pr[XΣ¯Σ¯Y𝔼[XΣ¯Σ¯Y]F2ϕrδXFYF]Var[XΣ¯Σ¯Y]2ϕrδXF2YF2δ.\mathrm{Pr}\left[\left\|X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y-\mathbb{E}\!\left[X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y\right]\right\|_{\mathrm{F}}\geq\sqrt{\frac{2\phi}{r\delta}}\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}\right]\leq\frac{\mathrm{Var}\left[X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y\right]}{\frac{2\phi}{r\delta}\left\|X\right\|_{\mathrm{F}}^{2}\left\|Y\right\|_{\mathrm{F}}^{2}}\leq\delta.

We conclude the proof by using the triangle inequality:

Pr[XΣ¯Σ¯YXYF(22ε+2rδ)ϕXFYF]1δ,\mathrm{Pr}\left[\left\|X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y-X^{\dagger}Y\right\|_{\mathrm{F}}\leq\left(2\sqrt{2\varepsilon}+\sqrt{\frac{2}{r\delta}}\right)\sqrt{\phi}\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}\right]\geq 1-\delta,

as claimed. ∎

4.2 Approximate joint importance matrix sketches and matrix multiplication

In order to compute an approximation of the product of two matrices AA and BB with dependence on the error parameter δ\delta better than in Theorem 1, we will need a slightly different notion of sketch that we call approximate joint importance matrix sketch and define as follows.

Definition 12.

Given two nonzero matrices Am×nA\in\mathbb{C}^{m\times n} and Bm×nB\in\mathbb{C}^{m\times n^{\prime}}, a positive integer rmr\leq m and three parameters ε(0,1)\varepsilon\in(0,1) and ϕ,ϕ1\phi,\phi^{\prime}\geq 1, an (r,ε,ϕ,ϕ)(r,\varepsilon,\phi,\phi^{\prime})-approximate joint importance matrix sketch of AA and BB is a matrix Sr×mS\in\mathbb{R}^{r\times m} that is sampled according to (p,p~)(p,\tilde{p}) for some probability distributions p,p~p,\tilde{p} satisfying the following conditions:

  • (i)

    p(i)12(1ϕprA(i)+1ϕprB(i))p(i)\geq\frac{1}{2}(\frac{1}{\phi}p_{\mathrm{r_{A}}}(i)+\frac{1}{\phi^{\prime}}p_{\mathrm{r_{B}}}(i)) for all i{1,,m}i\in\{1,\ldots,m\};

  • (ii)

    |p~p|tvε|\tilde{p}-p|_{\mathrm{tv}}\leq\varepsilon.

This notion of approximate joint importance matrix sketch is motivated by the following definition.

Definition 13.

Consider two nonzero matrices Am×nA\in\mathbb{C}^{m\times n}, Bm×nB\in\mathbb{C}^{m\times n^{\prime}} with 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A), 𝖮𝖲𝖰ε,ϕ(B)\mathsf{OSQ}_{\varepsilon,\phi^{\prime}}(B) and an integer rmr\leq m. Let A¯,B¯\bar{A},\bar{B} denote the matrices from Definition 8 and p~rA¯,p~rB¯\tilde{p}_{\mathrm{{r}_{\bar{A}}}},\tilde{p}_{\mathrm{{r}_{\bar{B}}}} denote the distributions such that |p~rA¯prA¯|tvε|\tilde{p}_{\mathrm{{r}_{\bar{A}}}}-p_{\mathrm{{r}_{\bar{A}}}}|_{\mathrm{tv}}\leq\varepsilon and |p~rB¯prB¯|tvε|\tilde{p}_{\mathrm{{r}_{\bar{B}}}}-p_{\mathrm{{r}_{\bar{B}}}}|_{\mathrm{tv}}\leq\varepsilon corresponding to 𝖲𝖰ε(A¯)\mathsf{SQ}_{\varepsilon}(\bar{A}) and 𝖲𝖰ε(B¯)\mathsf{SQ}_{\varepsilon}(\bar{B}), respectively. The r×mr\times m matrix sampled according to (12(prA¯+prB¯),12(p~rA¯+p~rB¯))(\frac{1}{2}(p_{\mathrm{{r}_{\bar{A}}}}+p_{\mathrm{{r}_{\bar{B}}}}),\frac{1}{2}(\tilde{p}_{\mathrm{{r}_{\bar{A}}}}+\tilde{p}_{\mathrm{{r}_{\bar{B}}}})) is called the (r,ε,ϕ,ϕ)(r,\varepsilon,\phi,\phi^{\prime})-approximate joint importance matrix sketch of AA and BB associated with 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) and 𝖮𝖲𝖰ε,ϕ(B)\mathsf{OSQ}_{\varepsilon,\phi^{\prime}}(B).

It is again straightforward to check that the matrix sketch of Definition 13 is an (r,ε,ϕ,ϕ)(r,\varepsilon,\phi,\phi^{\prime})-approximate joint importance matrix sketch of AA and BB according to Definition 12: indeed, 12(prA¯(i)+prB¯(i))=12(A¯(i,)2A¯F2+B¯(i,)2B¯F2)12(A(i,)2ϕAF2+B(i,)2ϕBF2)=12(1ϕprA(i)+1ϕprB(i))\frac{1}{2}\left(p_{\mathrm{{r}_{\bar{A}}}}(i)+p_{\mathrm{{r}_{\bar{B}}}}(i)\right)=\frac{1}{2}\left(\frac{\left\|\bar{A}(i,\cdot)\right\|^{2}}{\left\|\bar{A}\right\|_{\mathrm{F}}^{2}}+\frac{\left\|\bar{B}(i,\cdot)\right\|^{2}}{\left\|\bar{B}\right\|_{\mathrm{F}}^{2}}\right)\geq\frac{1}{2}\left(\frac{\left\|A(i,\cdot)\right\|^{2}}{\phi\left\|A\right\|_{\mathrm{F}}^{2}}+\frac{\left\|B(i,\cdot)\right\|^{2}}{\phi^{\prime}\left\|B\right\|_{\mathrm{F}}^{2}}\right)=\frac{1}{2}\left(\frac{1}{\phi}p_{\mathrm{r_{A}}}(i)+\frac{1}{\phi^{\prime}}p_{\mathrm{r_{B}}}(i)\right) for all i{1,,m}i\in\{1,\ldots,m\}. Definition 13 is the special case of Definition 12 that is relevant for algorithmic applications when we can access the two matrices via approximate oversampling-and-query access. Since all the results of this subsection work for Definition 12, we again present our results using this more general (and mathematically more elegant) definition.

The following easy proposition, similar to the statement used in the proof of Lemma 4.6 in [5] for perfect oversampling-and-query access, shows that a joint importance matrix sketch of AA and BB is an importance matrix sketch of both AA and BB (with a slightly worse oversampling parameter), and shows Inequality (3), which will be crucial for the matrix multiplication algorithm of Theorem 2.

Proposition 2.

An (r,ε,ϕ,ϕ)(r,\varepsilon,\phi,\phi^{\prime})-approximate joint importance matrix sketch of Am×nA\in\mathbb{C}^{m\times n} and Bm×nB\in\mathbb{C}^{m\times n^{\prime}} is an (r,ε,2ϕ)(r,\varepsilon,2\phi)-approximate importance matrix sketch of AA and an (r,ε,2ϕ)(r,\varepsilon,2\phi^{\prime})-approximate importance matrix sketch of BB. Moreover, for any probability distribution pp satisfying Condition (i) of Definition 12 we have

p(i)A(i,)B(i,)ϕϕAFBFp(i)\geq\frac{\left\|A(i,\cdot)\right\|\left\|B(i,\cdot)\right\|}{\sqrt{\phi\phi^{\prime}}\left\|A\right\|_{\mathrm{F}}\left\|B\right\|_{\mathrm{F}}} (3)

for all i{1,,m}i\in\{1,\ldots,m\}.

Proof.

The first part is trivial since we have p(i)12ϕprA(i)p(i)\geq\frac{1}{2\phi}p_{\mathrm{r_{A}}}(i) and p(i)12ϕprB(i)p(i)\geq\frac{1}{2\phi^{\prime}}p_{\mathrm{r_{B}}}(i) for all i{1,,m}i\in\{1,\ldots,m\}. The second part follows from the inequality of arithmetic and geometric means: for each i{1,,m}i\in\{1,\ldots,m\} we get

p(i)\displaystyle p(i) 12(1ϕprA(i)+1ϕprB(i))prA(i)prB(i)ϕϕ=A(i,)B(i,)ϕϕAFBF,\displaystyle\geq\frac{1}{2}\left(\frac{1}{\phi}p_{\mathrm{r_{A}}}(i)+\frac{1}{\phi^{\prime}}p_{\mathrm{r_{B}}}(i)\right)\geq\ \sqrt{\frac{p_{\mathrm{r_{A}}}(i)p_{\mathrm{r_{B}}}(i)}{\phi\phi^{\prime}}}=\frac{\left\|A(i,\cdot)\right\|\left\|B(i,\cdot)\right\|}{\sqrt{\phi\phi^{\prime}}\left\|A\right\|_{\mathrm{F}}\left\|B\right\|_{\mathrm{F}}},

as claimed. ∎

We can now state the main theorem of this subsection: the product of two matrices can be approximated well using a joint importance matrix sketch.

Theorem 2 (formal statement).

Given two matrices Xm×nX\in\mathbb{C}^{m\times n} and Ym×nY\in\mathbb{C}^{m\times n^{\prime}}, a positive integer rmr\leq m and three parameters ε(0,1)\varepsilon\in(0,1) and ϕ,ϕ1\phi,\phi^{\prime}\geq 1, consider an (r,ε,ϕ,ϕ)(r,\varepsilon,\phi,\phi^{\prime})-approximate joint importance matrix sketch Σ\Sigma of XX and YY. Then for any δ(0,1]\delta\in(0,1],

Pr[XΣΣYXYFϕϕ(2ε+7ln(2/δ)r)XFYF]1δ.\mathrm{Pr}\left[\left\|X^{\dagger}\Sigma^{\dagger}\Sigma Y-X^{\dagger}Y\right\|_{\mathrm{F}}\leq\sqrt{\phi\phi^{\prime}}\left(2\varepsilon+\sqrt{\frac{7\ln(2/\delta)}{r}}\right)\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}\right]\geq 1-\delta.

Note that the dependence on δ\delta in Theorem 2 is exponentially better than in Theorem 1. The informal version of Theorem 2 given in the introduction corresponds to the case ϕ=ϕ=1\phi=\phi^{\prime}=1 (the claim that Σ\Sigma can be efficiently constructed follows from Lemma 6 in Section 5.3).

Proof of Theorem 2.

At a high level, our proof follows the strategy from [13] (and also used in [5]) for the case of perfect importance matrix sketches. In particular, we use McDiarmid’s bounded difference inequality. Significant additional care is needed in the analysis to control the impact of not being able to sample from the perfect distribution.

Let pp and p~\tilde{p} denote the probability distributions from Definition 12. Observe that p(j)=0p(j)=0 implies that prX(j)=prY(j)=0p_{\mathrm{r_{X}}}(j)=p_{\mathrm{r_{Y}}}(j)=0, which implies X(j,)=Y(j,)=0\left\|X(j,\cdot)\right\|=\left\|Y(j,\cdot)\right\|=0.

We first show that the expectation of XΣΣYX^{\dagger}\Sigma^{\dagger}\Sigma Y is close to XYX^{\dagger}Y:

𝔼[XΣΣY]XYF\displaystyle\left\|\mathbb{E}\!\left[X^{\dagger}\Sigma^{\dagger}\Sigma Y\right]-X^{\dagger}Y\right\|_{\mathrm{F}} =𝔼[i=1r[(ΣX)](,i)[ΣY](i,)]j=1mX(,j)Y(j,)F\displaystyle=\left\|\mathbb{E}\!\left[\sum_{i=1}^{r}[(\Sigma X)^{\dagger}](\cdot,i)[\Sigma Y](i,\cdot)\right]-\sum_{j=1}^{m}X^{\dagger}(\cdot,j)Y(j,\cdot)\right\|_{\mathrm{F}}
=𝔼[i=1r([ΣX](i,))[ΣY](i,)]jsupp(p)X(j,)Y(j,)F\displaystyle=\left\|\mathbb{E}\!\left[\sum_{i=1}^{r}([\Sigma X](i,\cdot))^{\dagger}[\Sigma Y](i,\cdot)\right]-\sum_{j\in\mathrm{supp}(p)}X(j,\cdot)^{\dagger}Y(j,\cdot)\right\|_{\mathrm{F}}
=r𝔼[([ΣX](1,))[ΣY](1,)]jsupp(p)X(j,)Y(j,)F\displaystyle=\left\|r\mathbb{E}\!\left[([\Sigma X](1,\cdot))^{\dagger}[\Sigma Y](1,\cdot)\right]-\sum_{j\in\mathrm{supp}(p)}X(j,\cdot)^{\dagger}Y(j,\cdot)\right\|_{\mathrm{F}}
=rjsupp(p)p~(j)X(j,)Y(j,)rp(j)jsupp(p)X(j,)Y(j,)F\displaystyle=\left\|r\sum_{j\in\mathrm{supp}(p)}\tilde{p}(j)\frac{X(j,\cdot)^{\dagger}Y(j,\cdot)}{rp(j)}-\sum_{j\in\mathrm{supp}(p)}X(j,\cdot)^{\dagger}Y(j,\cdot)\right\|_{\mathrm{F}}
=jsupp(p)(p~(j)p(j))X(j,)Y(j,)p(j)F\displaystyle=\left\|\sum_{j\in\mathrm{supp}(p)}\left(\tilde{p}(j)-p(j)\right)\frac{X(j,\cdot)^{\dagger}Y(j,\cdot)}{p(j)}\right\|_{\mathrm{F}}
jsupp(p)|p~(j)p(j)|ϕϕXFYF\displaystyle\leq\sum_{j\in\mathrm{supp}(p)}\left|\tilde{p}(j)-p(j)\right|\sqrt{\phi\phi^{\prime}}\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}
2εϕϕXFYF,\displaystyle\leq 2\varepsilon\sqrt{\phi\phi^{\prime}}\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}},

where we used Inequality (3) to derive the first inequality.

Consider the indices s1,,sr{1,,m}s_{1},\ldots,s_{r}\in\{1,\ldots,m\} chosen when constructing the matrix sketch Σ\Sigma (see Definition 9). Define the function

f(s1,,sr)=XΣΣY𝔼[XΣΣY]F.f(s_{1},\ldots,s_{r})=\left\|X^{\dagger}\Sigma^{\dagger}\Sigma Y-\mathbb{E}\!\left[X^{\dagger}\Sigma^{\dagger}\Sigma Y\right]\right\|_{\mathrm{F}}.

We have

𝔼[f]\displaystyle\mathbb{E}\!\left[f\right] =𝔼[XΣΣY𝔼[XΣΣY]F]\displaystyle=\mathbb{E}\!\left[\left\|X^{\dagger}\Sigma^{\dagger}\Sigma Y-\mathbb{E}\!\left[X^{\dagger}\Sigma^{\dagger}\Sigma Y\right]\right\|_{\mathrm{F}}\right]
𝔼[XΣΣY𝔼[XΣΣY]F2]\displaystyle\leq\sqrt{\mathbb{E}\!\left[\left\|X^{\dagger}\Sigma^{\dagger}\Sigma Y-\mathbb{E}\!\left[X^{\dagger}\Sigma^{\dagger}\Sigma Y\right]\right\|_{\mathrm{F}}^{2}\right]}
\displaystyle=\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n^{\prime}}\mathrm{Var}\left[[X^{\dagger}\Sigma^{\dagger}\Sigma Y](i,j)\right]},

where we used Jensen’s inequality for the inequality, and the definition of the Frobenius norm together with linearity of expectation for the last equality. Since each entry of X^{\dagger}\Sigma^{\dagger}\Sigma Y is a sum of rr independent terms (one per row of \Sigma), and since the variance of each term is at most its second moment, we thus have:

\displaystyle\mathbb{E}\!\left[f\right]\leq\sqrt{r\sum_{i=1}^{n}\sum_{j=1}^{n^{\prime}}\mathbb{E}\!\left[\left|[\Sigma X](1,i)[\Sigma Y](1,j)\right|^{2}\right]}
\displaystyle=\sqrt{r\mathbb{E}\!\left[\sum_{i=1}^{n}\sum_{j=1}^{n^{\prime}}\left|[\Sigma X](1,i)[\Sigma Y](1,j)\right|^{2}\right]}
=r𝔼[ΣX(1,)2ΣY(1,)2]\displaystyle=\sqrt{r\mathbb{E}\!\left[\left\|\Sigma X(1,\cdot)\right\|^{2}\left\|\Sigma Y(1,\cdot)\right\|^{2}\right]}
=rjsupp(p)p~(j)X(j,)2rp(j)Y(j,)2rp(j)\displaystyle=\sqrt{r\sum_{j\in\mathrm{supp}(p)}\tilde{p}(j)\frac{\left\|X(j,\cdot)\right\|^{2}}{rp(j)}\frac{\left\|Y(j,\cdot)\right\|^{2}}{rp(j)}}
1rjsupp(p)p~(j)ϕϕXF2YF2\displaystyle\leq\sqrt{\frac{1}{r}\sum_{j\in\mathrm{supp}(p)}\tilde{p}(j)\phi\phi^{\prime}\left\|X\right\|_{\mathrm{F}}^{2}\left\|Y\right\|_{\mathrm{F}}^{2}}
ϕϕXFYFr.\displaystyle\leq\frac{\sqrt{\phi\phi^{\prime}}\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}}{\sqrt{r}}.

Let i=(i1,,ir)\vec{i}=(i_{1},\ldots,i_{r}) and i=(i1,,ir)\vec{i^{\prime}}=(i^{\prime}_{1},\ldots,i^{\prime}_{r}) be two vectors of variables that differ in only one coordinate. Using the reverse triangle inequality and Inequality (3), we obtain

|f(i)f(i)|2maxjsupp(p)X(j,)Y(j,)rp(j)F2ϕϕXFYFr.\left|f(\vec{i})-f(\vec{i^{\prime}})\right|\leq 2\max_{j\in\mathrm{supp}(p)}\left\|\frac{X(j,\cdot)^{\dagger}Y(j,\cdot)}{rp(j)}\right\|_{\mathrm{F}}\leq\frac{2\sqrt{\phi\phi^{\prime}}\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}}{r}.

By using McDiarmid’s bounded difference inequality (Lemma 3), we conclude that

Pr[|f𝔼[f]|2ln(2/δ)ϕϕrXFYF]δ.\mathrm{Pr}\left[\left|f-\mathbb{E}\!\left[f\right]\right|\geq\sqrt{\frac{2\ln(2/\delta)\phi\phi^{\prime}}{r}}\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}\right]\leq\delta.

As a consequence, the following inequalities hold with probability at least 1δ1-\delta:

XΣΣY𝔼[XΣΣY]F\displaystyle\left\|X^{\dagger}\Sigma^{\dagger}\Sigma Y-\mathbb{E}\!\left[X^{\dagger}\Sigma^{\dagger}\Sigma Y\right]\right\|_{\mathrm{F}} 𝔼[f]+2ln(2/δ)ϕϕrXFYF\displaystyle\leq\mathbb{E}\!\left[f\right]+\sqrt{\frac{2\ln(2/\delta)\phi\phi^{\prime}}{r}}\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}
ϕϕ(1r+2ln(2/δ)r)XFYF,\displaystyle\leq\sqrt{\phi\phi^{\prime}}\left(\frac{1}{\sqrt{r}}+\sqrt{\frac{2\ln(2/\delta)}{r}}\right)\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}},

and thus

XΣΣYXYF\displaystyle\left\|X^{\dagger}\Sigma^{\dagger}\Sigma Y-X^{\dagger}Y\right\|_{\mathrm{F}} ϕϕ(2ε+1r+2ln(2/δ)r)XFYF\displaystyle\leq\sqrt{\phi\phi^{\prime}}\left(2\varepsilon+\frac{1}{\sqrt{r}}+\sqrt{\frac{2\ln(2/\delta)}{r}}\right)\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}}
ϕϕ(2ε+7ln(2/δ)r)XFYF,\displaystyle\leq\sqrt{\phi\phi^{\prime}}\left(2\varepsilon+\sqrt{\frac{7\ln(2/\delta)}{r}}\right)\left\|X\right\|_{\mathrm{F}}\left\|Y\right\|_{\mathrm{F}},

as claimed, where we used the inequality 1+\sqrt{2\ln(2/\delta)}\leq\sqrt{7\ln(2/\delta)}, which holds for any \delta\in(0,1]. ∎

4.3 Singular value transformation

In this subsection we show how matrix multiplication via importance matrix sketches can be used to approximate the singular value transformation of a positive semidefinite Hermitian matrix of the form AAA^{\dagger}A.

We start with two definitions.

Definition 14.

Given a positive semidefinite Hermitian matrix Hn×nH\in\mathbb{C}^{n\times n}, for any χ0\chi\geq 0 we write

Specχ(H)={x0||xz|χ for some zSpec(H)}.\mathrm{Spec}_{\chi}(H)=\left\{x\in\mathbb{R}_{\geq 0}\>|\>\left|x-z\right|\leq\chi\textrm{ for some }z\in\mathrm{Spec}(H)\right\}.
Definition 15.

Given two parameters L,L¯0L,\bar{L}\geq 0 and a set Λ0\Lambda\subseteq\mathbb{R}_{\geq 0}, a function f:0f\colon\mathbb{R}_{\geq 0}\to\mathbb{C} is (L,L¯)(L,\bar{L})-smooth on Λ\Lambda if it satisfies the following conditions:

  • f(0)=0f(0)=0 and the limit of the function f(x)x\frac{f(x)}{x} at x=0x=0 exists (we denote this limit by cc);

  • ff is LL-Lipschitz over Λ\Lambda;

  • the function f¯:0\bar{f}\colon\mathbb{R}_{\geq 0}\to\mathbb{C} defined as f¯(x)=f(x)x\bar{f}(x)=\frac{f(x)}{x} for x>0x>0 and f¯(0)=c\bar{f}(0)=c is L¯\bar{L}-Lipschitz over Λ\Lambda.

We can now state the main result of this subsection.

Theorem 3 (formal statement).

Let Am×nA\in\mathbb{C}^{m\times n} be a matrix and f:0f\colon\mathbb{R}_{\geq 0}\to\mathbb{C} be an (L,L¯)(L,\bar{L})-smooth function on Specχ(AA)\mathrm{Spec}_{\chi}(A^{\dagger}A), for some L,L¯,χ>0L,\bar{L},\chi>0. For any parameters δ(0,1]\delta\in(0,1] and γ>0\gamma>0 and any ϕ,ϕ1\phi,\phi^{\prime}\geq 1, consider any ε0\varepsilon\geq 0 and any integers r{1,,m}r\in\{1,\ldots,m\} and c{1,,n}c\in\{1,\ldots,n\} satisfying the following conditions:

ε\displaystyle\varepsilon min(18ϕ,χ24(ϕ+ϕ)AF2,γ24LϕAF2,γ48L¯ϕAF4),\displaystyle\leq\min\left(\frac{1}{8\phi},\frac{\chi}{24(\phi+\phi^{\prime})\left\|A\right\|_{\mathrm{F}}^{2}},\frac{\gamma}{24L\phi\left\|A\right\|_{\mathrm{F}}^{2}},\frac{\gamma}{48\bar{L}\phi^{\prime}\left\|A\right\|_{\mathrm{F}}^{4}}\right), (4)
r\displaystyle r max(2ϕ2ln(6/δ),112ϕ2AF4(ln(6/δ)χ2+ln(6/δ)L2γ2)),\displaystyle\geq\max\left(2\phi^{2}\ln(6/\delta),112\phi^{2}\left\|A\right\|_{\mathrm{F}}^{4}\left(\frac{\ln(6/\delta)}{\chi^{2}}+\frac{\ln(6/\delta)L^{2}}{\gamma^{2}}\right)\right), (5)
c\displaystyle c 112ϕ2AF4(ln(6/δ)χ2+4ln(6/δ)L¯2AF4γ2).\displaystyle\geq 112\phi^{\prime 2}\left\|A\right\|_{\mathrm{F}}^{4}\left(\frac{\ln(6/\delta)}{\chi^{2}}+\frac{4\ln(6/\delta)\bar{L}^{2}\left\|A\right\|_{\mathrm{F}}^{4}}{\gamma^{2}}\right). (6)

Let Sr×mS\in\mathbb{R}^{r\times m} be an (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance matrix sketch of AA, and Tc×nT\in\mathbb{R}^{c\times n} be a (c,ε,ϕ)(c,\varepsilon,\phi^{\prime})-approximate importance matrix sketch of (SA)(SA)^{\dagger}. Then for R=SAR=SA and C=SATC=SAT^{\dagger}, the inequality

Rf¯(CC)Rf(AA)Fγ\left\|R^{\dagger}\bar{f}(CC^{\dagger})R-f(A^{\dagger}A)\right\|_{\mathrm{F}}\leq\gamma

holds with probability at least 1δ1-\delta.

The informal version of Theorem 3 given in the introduction corresponds to the case ϕ,L,L¯=O(1)\phi,L,\bar{L}=O(1) and χ=\chi=\infty. The claim that RR and CC can be efficiently (implicitly) constructed follows from Lemmas 6 and 8 in Section 5.3.

Theorem 3 is a generalization of Theorem 5.1 in [5], which focused on perfect sampling-and-query access (i.e., ε=0\varepsilon=0). Just as the proof in [5] combined several techniques for perfect importance matrix sketches, our proof combines essentially all the tools we have developed so far for approximate importance matrix sketches.
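To see how the ingredients of Theorem 3 fit together, here is a small Python sketch that forms R=SAR=SA and C=SATC=SAT^{\dagger} from the standard descriptions of the two sketches and evaluates Rf¯(CC)RR^{\dagger}\bar{f}(CC^{\dagger})R through an eigendecomposition of the r×rr\times r matrix CCCC^{\dagger}. Dense arrays and explicit sketch matrices are used purely for illustration (the actual algorithm only ever touches the sampled rows and columns), and f¯\bar{f} is assumed to accept NumPy arrays.

```python
import numpy as np

def build_sketch_matrix(description, num_cols):
    """Explicit sketch matrix from its standard description (illustration only)."""
    S = np.zeros((len(description), num_cols))
    for i, (s_i, alpha_i) in enumerate(description):
        S[i, s_i] = alpha_i
    return S

def approx_svt(A, desc_rows, desc_cols, f_bar):
    """Approximate f(A^dagger A) by R^dagger f_bar(C C^dagger) R as in Theorem 3,
    where f_bar(x) = f(x)/x (extended by continuity at 0)."""
    m, n = A.shape
    S = build_sketch_matrix(desc_rows, m)      # (r, eps, phi)-sketch of A
    R = S @ A                                  # R = SA, an r x n matrix
    T = build_sketch_matrix(desc_cols, n)      # (c, eps, phi')-sketch of (SA)^dagger
    C = R @ T.conj().T                         # C = S A T^dagger, an r x c matrix
    # Apply f_bar to the Hermitian matrix C C^dagger via its eigendecomposition.
    w, V = np.linalg.eigh(C @ C.conj().T)
    fbar_CC = (V * f_bar(np.clip(w, 0.0, None))) @ V.conj().T
    return R.conj().T @ fbar_CC @ R
```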

Proof of Theorem 3.

First observe that from Lemma 5, with probability at least 1δ/31-\delta/3 we have

RF22AF2.\left\|R\right\|_{\mathrm{F}}^{2}\leq 2\left\|A\right\|_{\mathrm{F}}^{2}.

We assume below that this inequality holds.

From Theorem 2 with X=Y=AX=Y=A and Σ=S\Sigma=S (note that SS is obviously also an (r,ε,ϕ,ϕ)(r,\varepsilon,\phi,\phi)-approximate joint importance sketch of AA and AA), we know that the inequality

RRAAFmin((χ4+χ12)ϕAF2ϕAF2,(γ4+γ12)ϕAF2ϕLAF2)min(χ3,γ3L)\left\|R^{\dagger}R-A^{\dagger}A\right\|_{\mathrm{F}}\leq\min\left(\left(\frac{\chi}{4}+\frac{\chi}{12}\right)\frac{\phi\left\|A\right\|_{\mathrm{F}}^{2}}{\phi\left\|A\right\|_{\mathrm{F}}^{2}},\left(\frac{\gamma}{4}+\frac{\gamma}{12}\right)\frac{\phi\left\|A\right\|_{\mathrm{F}}^{2}}{\phi L\left\|A\right\|_{\mathrm{F}}^{2}}\right)\leq\min\left(\frac{\chi}{3},\frac{\gamma}{3L}\right)

holds with probability at least 1δ/31-\delta/3. By applying again Theorem 2 with X=Y=RX=Y=R^{\dagger} and Σ=T\Sigma=T, we know that the inequality

CCRRFmin((χ4+χ12)ϕRF2ϕAF2,(γ8+γ24)ϕRF2ϕL¯AF4)min(2χ3,γ3L¯AF2)\left\|CC^{\dagger}-RR^{\dagger}\right\|_{\mathrm{F}}\leq\min\left(\left(\frac{\chi}{4}+\frac{\chi}{12}\right)\frac{\phi^{\prime}\left\|R^{\dagger}\right\|_{\mathrm{F}}^{2}}{\phi^{\prime}\left\|A\right\|_{\mathrm{F}}^{2}},\left(\frac{\gamma}{8}+\frac{\gamma}{24}\right)\frac{\phi^{\prime}\left\|R\right\|_{\mathrm{F}}^{2}}{\phi^{\prime}\bar{L}\left\|A\right\|_{\mathrm{F}}^{4}}\right)\leq\min\left(\frac{2\chi}{3},\frac{\gamma}{3\bar{L}\left\|A\right\|_{\mathrm{F}}^{2}}\right)

holds with probability at least 1δ/31-\delta/3 as well. We assume below that these inequalities hold.

Applying Lemma 1 twice we thus obtain

i=1n|λi(RR)λi(AA)|2\displaystyle\sqrt{\sum_{i=1}^{n}\left|\lambda_{i}(R^{\dagger}R)-\lambda_{i}(A^{\dagger}A)\right|^{2}} RRAAFχ3,\displaystyle\leq\left\|R^{\dagger}R-A^{\dagger}A\right\|_{\mathrm{F}}\leq\frac{\chi}{3}, (7)
i=1r|λi(CC)λi(RR)|2\displaystyle\sqrt{\sum_{i=1}^{r}\left|\lambda_{i}(CC^{\dagger})-\lambda_{i}(RR^{\dagger})\right|^{2}} CCRRF2χ3.\displaystyle\leq\left\|CC^{\dagger}-RR^{\dagger}\right\|_{\mathrm{F}}\leq\frac{2\chi}{3}. (8)

These inequalities imply the following inclusions (observe that Inequalities (7) and (8) are actually significantly stronger bounds than these inclusions, since they bound the sum of the differences of eigenvalues; the inclusions will nevertheless be enough for our purpose):

Spec(RR)\displaystyle\mathrm{Spec}(R^{\dagger}R) Specχ/3(AA),\displaystyle\subseteq\mathrm{Spec}_{\chi/3}(A^{\dagger}A),
Spec(CC)\displaystyle\mathrm{Spec}(CC^{\dagger}) Spec2χ/3(RR)Spec2χ/3(RR)Specχ(AA).\displaystyle\subseteq\mathrm{Spec}_{2\chi/3}(RR^{\dagger})\subseteq\mathrm{Spec}_{2\chi/3}(R^{\dagger}R)\subseteq\mathrm{Spec}_{\chi}(A^{\dagger}A).

From Lemma 2, we thus obtain

f(RR)f(AA)F\displaystyle\left\|f(R^{\dagger}R)-f(A^{\dagger}A)\right\|_{\mathrm{F}} LRRAAFγ3,\displaystyle\leq L\left\|R^{\dagger}R-A^{\dagger}A\right\|_{\mathrm{F}}\leq\frac{\gamma}{3},
f¯(CC)f¯(RR)F\displaystyle\left\|\bar{f}(CC^{\dagger})-\bar{f}(RR^{\dagger})\right\|_{\mathrm{F}} L¯CCRRFγ3AF2.\displaystyle\leq\bar{L}\left\|CC^{\dagger}-RR^{\dagger}\right\|_{\mathrm{F}}\leq\frac{\gamma}{3\left\|A\right\|_{\mathrm{F}}^{2}}.

Putting all these inequalities together, we conclude that with probability at least 1δ1-\delta:

Rf¯(CC)Rf(AA)F\displaystyle\left\|R^{\dagger}\bar{f}(CC^{\dagger})R-f(A^{\dagger}A)\right\|_{\mathrm{F}} Rf¯(RR)Rf(AA)F+R(f¯(CC)f¯(RR))RF\displaystyle\leq\left\|R^{\dagger}\bar{f}(RR^{\dagger})R-f(A^{\dagger}A)\right\|_{\mathrm{F}}+\left\|R^{\dagger}(\bar{f}(CC^{\dagger})-\bar{f}(RR^{\dagger}))R\right\|_{\mathrm{F}}
=f(RR)f(AA)F+R(f¯(CC)f¯(RR))RF\displaystyle=\left\|f(R^{\dagger}R)-f(A^{\dagger}A)\right\|_{\mathrm{F}}+\left\|R^{\dagger}(\bar{f}(CC^{\dagger})-\bar{f}(RR^{\dagger}))R\right\|_{\mathrm{F}}
f(RR)f(AA)F+R2f¯(CC)f¯(RR)F\displaystyle\leq\left\|f(R^{\dagger}R)-f(A^{\dagger}A)\right\|_{\mathrm{F}}+\left\|R\right\|^{2}\left\|\bar{f}(CC^{\dagger})-\bar{f}(RR^{\dagger})\right\|_{\mathrm{F}}
γ3+γR23AF2\displaystyle\leq\frac{\gamma}{3}+\frac{\gamma\left\|R\right\|^{2}}{3\left\|A\right\|_{\mathrm{F}}^{2}}
γ,\displaystyle\leq\gamma,

where the identity Rf¯(RR)R=f(RR)R^{\dagger}\bar{f}(RR^{\dagger})R=f(R^{\dagger}R) can be easily shown by considering the singular value decomposition of RR. ∎

5 Technical Tools

In this section we present three technical tools that will be used in the applications we discuss in Section 6. In Section 5.1 we show how to implement randomized sampling from oversampling. In Section 5.2 we show that the concept of oversampling-and-query access is closed under taking linear combinations of rows. Finally, in Section 5.3 we show how to concretely construct importance matrix sketches, and we prove several technical lemmas that will be used in Section 6.

5.1 From oversampling to sampling

The goal of this subsection is to prove the following proposition that shows how to implement approximate (randomized) sampling-and-query access to a vector given approximate oversampling-and-query access to the same vector.

Proposition 3.

For a vector unu\in\mathbb{C}^{n}, assume that we have 𝖮𝖲𝖰ε,ϕ(u)\mathsf{OSQ}_{\varepsilon,\phi}(u) with parameters ϕ1\phi\geq 1 and ε[0,12ϕ)\varepsilon\in[0,\frac{1}{2\phi}) and assume that we know an upper bound ϕmaxϕ\phi_{\textup{max}}\geq\phi. Then for any parameters δ,η(0,1]\delta,\eta\in(0,1], we can implement 𝖱𝖲𝖰ε,δ,η(u){\mathsf{RSQ}_{\varepsilon^{\prime},\delta,\eta}}(u) with ε3εϕ\varepsilon^{\prime}\leq 3\varepsilon\phi at cost

O(ϕmaxlog(1/δ)η2osq(u)).O\left(\frac{\phi_{\textup{max}}\log(1/\delta)}{\eta^{2}}\textbf{osq}(u)\right).

This proposition is a generalization of Lemma 3.5 in [5] and Proposition 4.3 in [39], which focused on perfect sampling (i.e., ε=0\varepsilon=0) and used rejection sampling. To prove Proposition 3, we first develop a theory of robust rejection sampling in Section 5.1.1. We then prove Proposition 3 in Section 5.1.2.

5.1.1 Robust rejection sampling

Standard rejection sampling.

Rejection sampling is a method to generate samples from a (hard) distribution p1:{1,,n}[0,1]p_{1}\colon\{1,\ldots,n\}\to[0,1] given the ability to generate samples from an (easier) distribution p2:{1,,n}[0,1]p_{2}\colon\{1,\ldots,n\}\to[0,1]. In its standard statement, the following three conditions should be satisfied:

  • (i)

    we can generate samples from p2p_{2};

  • (ii)

    we know a value mm such that p1(j)mp2(j)1\frac{p_{1}(j)}{mp_{2}(j)}\leq 1 holds for all jj;

  • (iii)

    given j{1,,n}j\in\{1,\ldots,n\}, we can compute p1(j)p_{1}(j) and p2(j)p_{2}(j).

The method works as follows: sample j{1,,n}j\in\{1,\ldots,n\} from p2p_{2}; output it with probability p1(j)mp2(j)\frac{p_{1}(j)}{mp_{2}(j)} and otherwise report “failure”. It is not difficult to show that conditioned on the event that the process does not fail, which happens with probability 1/m1/m, the probability that sample jj is output is precisely p1(j)p_{1}(j). Repeating the process Θ(m)\Theta(m) times will then output a sample drawn from the distribution p1p_{1} with high probability.
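This standard procedure can be transcribed directly into Python (the function below is ours and purely illustrative): sample_p2 draws from p2p_{2} and ratio(j) returns p1(j)/(mp2(j))p_{1}(j)/(mp_{2}(j)).

```python
import numpy as np

def rejection_sample(sample_p2, ratio, m, rng=None, max_tries=None):
    """Standard rejection sampling: draw j from p2 and accept it with probability
    ratio(j) = p1(j) / (m * p2(j)).  Conditioned on acceptance, the output is
    distributed exactly according to p1."""
    rng = rng or np.random.default_rng()
    tries = max_tries if max_tries is not None else 10 * m   # Theta(m) attempts suffice whp
    for _ in range(int(np.ceil(tries))):
        j = sample_p2()
        if rng.random() <= ratio(j):
            return j
    raise RuntimeError("rejection sampling failed after the allotted number of attempts")
```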

Robust rejection sampling.

We consider the setting where we have two probability distributions p1,p2:{1,,n}[0,1]p_{1},p_{2}\colon\{1,\ldots,n\}\to[0,1] satisfying the following conditions:

  • (i’)

    we can generate samples from a probability distribution p~2:{1,,n}[0,1]\tilde{p}_{2}\colon\{1,\ldots,n\}\to[0,1] such that |p~2p2|tvε|\tilde{p}_{2}-p_{2}|_{\mathrm{tv}}\leq\varepsilon;

  • (ii’)

    there exists a value mm such that p1(j)mp2(j)1\frac{p_{1}(j)}{mp_{2}(j)}\leq 1 holds for all jj;

  • (iii’)

    for each j{1,,n}j\in\{1,\ldots,n\} we can compute the value p1(j)mp2(j)\frac{p_{1}(j)}{mp_{2}(j)}.

In particular, we do not assume that we can compute p~2(j)\tilde{p}_{2}(j) efficiently. Condition (i’) is a clear relaxation of Condition (i). Conditions (ii’) and (iii’) are slight relaxations of Conditions (ii) and (iii) that will be crucial to derive our results (since in our setting we are not able to easily compute p1(j)p_{1}(j) and p2(j)p_{2}(j)).

The goal is to output a sample drawn from a distribution close to p1p_{1}. The basic sampling procedure we use is described in Figure 2. Note that Step 1 can be implemented due to Condition (i’), while Step 3 can be implemented due to Condition (iii’).

Here is our main result.

Procedure 𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{Sample}:
1. Take a sample j{1,,n}j\in\{1,\ldots,n\} from the probability distribution p~2\tilde{p}_{2}.
2. Take tt uniformly at random in the interval [0,1][0,1].
3. If tp1(j)mp2(j)t\leq\frac{p_{1}(j)}{mp_{2}(j)} then output jj, else output “failure”.


Figure 2: Procedure 𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{Sample}.
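One run of Procedure 𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{Sample} can be sketched in Python as follows (with illustrative helper names); note that, as in Conditions (i’)–(iii’), the index is drawn from p~2\tilde{p}_{2} and only the ratio p1(j)/(mp2(j))p_{1}(j)/(mp_{2}(j)) ever needs to be evaluated, never p1p_{1} or p2p_{2} individually.

```python
def procedure_sample(sample_p2_tilde, ratio, rng):
    """One execution of Procedure Sample: returns an index j on success, None on failure."""
    j = sample_p2_tilde()                  # Step 1: sample from the accessible distribution p2_tilde
    t = rng.random()                       # Step 2: t uniform in [0, 1]
    return j if t <= ratio(j) else None    # Step 3: accept iff t <= p1(j) / (m p2(j))
```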

The following proposition analyzes the behavior of the procedure 𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{Sample}.

Proposition 4.

For any ε[0,12m]\varepsilon\in[0,\frac{1}{2m}], the success probability of the procedure 𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{Sample} is Θ(1/m)\Theta(1/m) and conditioned on success, the total variation distance between the distribution of its output jj and p1p_{1} is at most 3εm3\varepsilon m.

Proof.

Let paccp_{\mathrm{acc}} denote the probability that Procedure 𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{Sample} does not fail. We have

pacc=j=1np~2(j)p1(j)mp2(j)\displaystyle p_{\mathrm{acc}}=\sum_{j=1}^{n}\tilde{p}_{2}(j)\frac{p_{1}(j)}{mp_{2}(j)} =j=1np2(j)p1(j)mp2(j)+j=1n(p2(j)p~2(j))p1(j)mp2(j)\displaystyle=\sum_{j=1}^{n}p_{2}(j)\frac{p_{1}(j)}{mp_{2}(j)}+\sum_{j=1}^{n}(p_{2}(j)-\tilde{p}_{2}(j))\frac{p_{1}(j)}{mp_{2}(j)}
=1m+j=1n(p2(j)p~2(j))p1(j)mp2(j)\displaystyle=\frac{1}{m}+\sum_{j=1}^{n}(p_{2}(j)-\tilde{p}_{2}(j))\frac{p_{1}(j)}{mp_{2}(j)}

and then

\displaystyle\left|p_{\mathrm{acc}}-\frac{1}{m}\right|=\left|\sum_{j=1}^{n}\left(p_{2}(j)-\tilde{p}_{2}(j)\right)\frac{p_{1}(j)}{mp_{2}(j)}\right|\leq|p_{2}-\tilde{p}_{2}|_{\mathrm{tv}}\leq\varepsilon,

where the first inequality holds because 0p1(j)mp2(j)10\leq\frac{p_{1}(j)}{mp_{2}(j)}\leq 1 for all jj (a signed sum of the differences p2(j)p~2(j)p_{2}(j)-\tilde{p}_{2}(j) weighted by values in [0,1][0,1] is at most |p2p~2|tv|p_{2}-\tilde{p}_{2}|_{\mathrm{tv}} in absolute value). Note that we thus have

1pacc\displaystyle\frac{1}{p_{\mathrm{acc}}} [m(1+εm),m(1εm)][(1εm)m,(1+2εm)m],\displaystyle\in\left[\frac{m}{(1+\varepsilon m)},\frac{m}{(1-\varepsilon m)}\right]\subseteq\left[(1-\varepsilon m)m,(1+2\varepsilon m)m\right],

since we are assuming ε[0,12m]\varepsilon\in[0,\frac{1}{2m}]. In particular we have pacc=Θ(1/m)p_{\mathrm{acc}}=\Theta(1/m).

Let p^:{1,,n}[0,1]\hat{p}\colon\{1,\ldots,n\}\to[0,1] be the probability distribution corresponding to the output of Procedure 𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{Sample} conditioned on the event that Procedure 𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{Sample} stops. By definition we have

p^(j)=p~2(j)p1(j)mpaccp2(j)\hat{p}(j)=\frac{\tilde{p}_{2}(j)p_{1}(j)}{mp_{\textrm{acc}}p_{2}(j)}

for all j{1,,n}j\in\{1,\ldots,n\}.

We now show that the distribution p^\hat{p} is close to the distribution p1p_{1}. We have

|p^p1|tv\displaystyle|\hat{p}-p_{1}|_{\mathrm{tv}} =12j=1n|p~2(j)p1(j)mpaccp2(j)p1(j)|\displaystyle=\frac{1}{2}\sum_{j=1}^{n}\left|\frac{\tilde{p}_{2}(j)p_{1}(j)}{mp_{\textrm{acc}}p_{2}(j)}-p_{1}(j)\right|
12j=1n|p2(j)p1(j)mpaccp2(j)p1(j)|+12j=1n|(p2(j)p~2(j))p1(j)mpaccp2(j)|\displaystyle\leq\frac{1}{2}\sum_{j=1}^{n}\left|\frac{p_{2}(j)p_{1}(j)}{mp_{\textrm{acc}}p_{2}(j)}-p_{1}(j)\right|+\frac{1}{2}\sum_{j=1}^{n}\left|\frac{(p_{2}(j)-\tilde{p}_{2}(j))p_{1}(j)}{mp_{\textrm{acc}}p_{2}(j)}\right|
=12j=1n|(1mpacc1)p1(j)|+12j=1n|(p2(j)p~2(j))p1(j)mpaccp2(j)|\displaystyle=\frac{1}{2}\sum_{j=1}^{n}\left|\left(\frac{1}{mp_{\textrm{acc}}}-1\right)p_{1}(j)\right|+\frac{1}{2}\sum_{j=1}^{n}\left|\frac{(p_{2}(j)-\tilde{p}_{2}(j))p_{1}(j)}{mp_{\textrm{acc}}p_{2}(j)}\right|
\displaystyle\leq\frac{1}{2}\left|\frac{1}{mp_{\textrm{acc}}}-1\right|+\frac{1}{p_{\textrm{acc}}}|p_{2}-\tilde{p}_{2}|_{\mathrm{tv}}
εm+(1+2εm)mε\displaystyle\leq\varepsilon m+(1+2\varepsilon m)m\cdot\varepsilon
3εm,\displaystyle\leq 3\varepsilon m,

where we used again the condition p1(j)mp2(j)1\frac{p_{1}(j)}{mp_{2}(j)}\leq 1 for the second inequality and εm1/2\varepsilon m\leq 1/2 for the last inequality. ∎

5.1.2 Proof of Proposition 3

Proof of Proposition 3.

We take p1=pup_{1}=p_{u}, p2=pu¯p_{2}=p_{\bar{u}}, p~2=p~u¯\tilde{p}_{2}=\tilde{p}_{\bar{u}} and m=ϕm=\phi. Note that

p1(j)mp2(j)=|u(j)|2u¯2ϕ|u¯(j)|2u2=|u(j)|2|u¯(j)|21\frac{p_{1}(j)}{mp_{2}(j)}=\frac{\left|u(j)\right|^{2}\left\|\bar{u}\right\|^{2}}{\phi\left|\bar{u}(j)\right|^{2}\left\|u\right\|^{2}}=\frac{\left|u(j)\right|^{2}}{\left|\bar{u}(j)\right|^{2}}\leq 1 (9)

holds for all j{1,,n}j\in\{1,\ldots,n\}, as required.

Consider repeating the procedure 𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{Sample} until it outputs a sample. Proposition 4 guarantees that the expected number of repetitions is O(ϕ)O(\phi) and conditioned on success, the total variation distance between the output distribution and pup_{u} is at most 3εϕ3\varepsilon\phi. Repeating the procedure O(ϕmaxlog(1/δ))O(\phi_{\textrm{max}}\log(1/\delta)) times guarantees that with probability at least 1δ1-\delta the procedure successfully outputs a sample. For each execution of the procedure 𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{Sample}, Step 1 only requires taking one sample from p~u¯\tilde{p}_{\bar{u}}, which has cost sq(u¯)\textbf{sq}(\bar{u}), and Step 3 can be implemented via Equation (9) by querying |u(j)|\left|u(j)\right| and |u¯(j)|\left|\bar{u}(j)\right|, which has cost q(u)+q(u¯)\textbf{q}(u)+\textbf{q}(\bar{u}). To get a sample, the overall complexity is thus O(ϕmaxlog(1/δ)osq(u))O(\phi_{\textrm{max}}\log(1/\delta)\cdot\textbf{osq}(u)).

To compute an approximation of u\left\|u\right\|, observe that u=u¯/ϕ\left\|u\right\|=\left\|\bar{u}\right\|/\sqrt{\phi}. Since we can get u¯\left\|\bar{u}\right\|, we only need to approximate ϕ1\phi^{-1}. We run the procedure 𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{Sample} r=3ϕmaxlog(2/δ)/η2r=\left\lceil 3\phi_{\textrm{max}}\log(2/\delta)/\eta^{2}\right\rceil times. Let XX be the fraction of these rr executions that are successful and observe that 𝔼[X]=ϕ1\mathbb{E}\!\left[X\right]=\phi^{-1}. From Chernoff’s bound we have

Pr[|Xϕ1|ηϕ1]2exp(η2r3ϕ)δ.\mathrm{Pr}\left[\left|X-\phi^{-1}\right|\geq\eta\phi^{-1}\right]\leq 2\exp\left(-\frac{\eta^{2}r}{3\phi}\right)\leq\delta.

We take XX as our estimator of ϕ1\phi^{-1} and output Xu¯\sqrt{X}\left\|\bar{u}\right\|. Then with probability at least 1δ1-\delta we have

|uXu¯|\displaystyle\left|\left\|u\right\|-\sqrt{X}\left\|\bar{u}\right\|\right| ηu,\displaystyle\leq\eta\left\|u\right\|,

as claimed, since \left|X-\phi^{-1}\right|\leq\eta\phi^{-1} implies \sqrt{X}\left\|\bar{u}\right\|\in[\sqrt{1-\eta}\left\|u\right\|,\sqrt{1+\eta}\left\|u\right\|]\subseteq[(1-\eta)\left\|u\right\|,(1+\eta)\left\|u\right\|]. The overall cost of computing an approximation of u\left\|u\right\| is thus

O(rosq(u))=O(ϕmaxlog(1/δ)η2osq(u)),O(r\cdot\textbf{osq}(u))=O\left(\frac{\phi_{\textrm{max}}\log(1/\delta)}{\eta^{2}}\textbf{osq}(u)\right),

which gives the claimed cost for 𝖱𝖲𝖰ε,δ,η(u){\mathsf{RSQ}_{\varepsilon^{\prime},\delta,\eta}}(u). ∎
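To make the rejection-sampling routine of Propositions 3 and 4 concrete, here is a minimal Python sketch (not part of the paper). It assumes callables sample_ubar() returning an index drawn from the approximate distribution p~_ū, query_u(j) and query_ubar(j) returning the magnitudes of u(j) and ū(j), the value norm_ubar = ‖ū‖, and the known bound ϕ_max; all of these names, as well as the numerical constants, are illustrative assumptions.

```python
import math
import random

def procedure_sample(sample_ubar, query_u, query_ubar):
    """One run of Procedure Sample: draw j from the approximate distribution
    p~_ubar (Step 1), then accept it with probability |u(j)|^2 / |ubar(j)|^2,
    as in Equation (9) (Step 3).  Returns the accepted index or None."""
    j = sample_ubar()
    if random.random() < abs(query_u(j)) ** 2 / abs(query_ubar(j)) ** 2:
        return j
    return None

def rsq_sample(sample_ubar, query_u, query_ubar, phi_max, delta):
    """Repeat Procedure Sample O(phi_max log(1/delta)) times; at least one run
    is accepted with probability at least 1 - delta (the constant 4 is illustrative)."""
    repetitions = math.ceil(4 * phi_max * math.log(1 / delta))
    for _ in range(repetitions):
        j = procedure_sample(sample_ubar, query_u, query_ubar)
        if j is not None:
            return j
    raise RuntimeError("all runs were rejected")

def rsq_norm(sample_ubar, query_u, query_ubar, norm_ubar, phi_max, eta, delta):
    """Estimate ||u|| from the empirical acceptance fraction X, which concentrates
    around phi^{-1}; under the convention ||ubar||^2 = phi * ||u||^2 the output
    sqrt(X) * ||ubar|| is then a (1 +/- eta)-approximation of ||u||."""
    r = math.ceil(3 * phi_max * math.log(2 / delta) / eta ** 2)
    accepted = sum(procedure_sample(sample_ubar, query_u, query_ubar) is not None
                   for _ in range(r))
    return math.sqrt(accepted / r) * norm_ubar
```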

5.2 Sampling from linear combinations of rows

In this subsection we show that our notion of approximate oversampling-and-query access is robust under taking linear combinations of a small number of vectors. Here is our main result.

Proposition 5.

For a matrix Rr×nR\in\mathbb{C}^{r\times n}, assume that we have 𝖮𝖲𝖰ε,ϕi(R(i,))\mathsf{OSQ}_{\varepsilon,\phi_{i}}(R(i,\cdot)) for each i{1,,r}i\in\{1,\ldots,r\} with parameters ε0\varepsilon\geq 0 and ϕ1,,ϕr1\phi_{1},\ldots,\phi_{r}\geq 1. Let v=(λ1,,λr)v^{\dagger}=(\lambda_{1},\ldots,\lambda_{r}) be a known row-vector, and assume that the row-vector vRv^{\dagger}R is nonzero. We can then implement 𝖮𝖲𝖰ε,ϕ(vR)\mathsf{OSQ}_{\varepsilon,\phi}(v^{\dagger}R) with

ϕ=ri=1rϕiλiR(i,)2vR2\phi=r\frac{\sum_{i=1}^{r}\phi_{i}\left\|\lambda_{i}R(i,\cdot)\right\|^{2}}{\left\|v^{\dagger}R\right\|^{2}} (10)

at cost O(r+osq(R(1,))++osq(R(r,)))O(r+\textbf{osq}(R(1,\cdot))+\cdots+\textbf{osq}(R(r,\cdot))).

This proposition is a generalization of Lemma 3.6 in [5] and Proposition 4.3 in [39]. The proof strategy is similar, but additional care is needed in the analysis to ensure that the total variation distance between the distribution we construct and the ideal distribution does not increase beyond ε\varepsilon.

Proof of Proposition 5.

Note that vR=i=1rλiR(i,).v^{\dagger}R=\sum_{i=1}^{r}\lambda_{i}R(i,\cdot). For any j{1,,n}j\in\{1,\ldots,n\} we can compute [vR](j)[v^{\dagger}R](j) by querying R(1,j),,R(r,j)R(1,j),\ldots,R(r,j) and then performing arithmetic operations, which means that we can implement one query access to the vector vRv^{\dagger}R at cost O(r+q(R(1,))++q(R(r,)))O(r+\textbf{q}(R(1,\cdot))+\cdots+\textbf{q}(R(r,\cdot))).

The remainder of the proof explains how to implement approximate sampling-and-query access to a vector ww such that w2=ϕvR2\left\|w\right\|^{2}=\phi\left\|v^{\dagger}R\right\|^{2} for ϕ\phi satisfying Equation (10), and |w(j)||[vR](j)|\left|w(j)\right|\geq\left|[v^{\dagger}R](j)\right| for all j{1,,n}j\in\{1,\ldots,n\}.

Notations.

We first introduce notations related to the vectors we can query and sample. From the assumptions, for each i{1,,r}i\in\{1,\ldots,r\} we have query access to R(i,)R(i,\cdot) and approximate sampling-and-query access to a vector, which we denote uinu_{i}\in\mathbb{C}^{n}, such that ui2=ϕiR(i,)2\left\|u_{i}\right\|^{2}=\phi_{i}\left\|R(i,\cdot)\right\|^{2} and |ui(j)||R(i,j)|\left|u_{i}(j)\right|\geq\left|R(i,j)\right| for all j{1,,n}j\in\{1,\ldots,n\}. This means that for each i{1,,r}i\in\{1,\ldots,r\}, each of the following operations can be implemented at cost osq(R(i,))\textbf{osq}(R(i,\cdot)):

  • query access to R(i,)R(i,\cdot) and uiu_{i};

  • getting ui\left\|u_{i}\right\|;

  • getting one sample from a distribution p~ui:{1,,n}[0,1]\tilde{p}_{u_{i}}\colon\{1,\ldots,n\}\to[0,1] such that |p~uipui|tvε|\tilde{p}_{u_{i}}-{p}_{u_{i}}|_{\mathrm{tv}}\leq\varepsilon.

Definition of the vector 𝒘\boldsymbol{w}.

Define the row-vector w1×nw\in\mathbb{R}^{1\times n} as follows:

w(j)=ri=1r|λiui(j)|2w(j)=\sqrt{r\sum_{i=1}^{r}\left|\lambda_{i}u_{i}(j)\right|^{2}}

for all j{1,,n}j\in\{1,\ldots,n\}. For all j{1,,n}j\in\{1,\ldots,n\}, we have

w(j)ri=1r|λiR(i,j)|2|i=1rλiR(i,j)|=|[vR](j)|,w(j)\geq\sqrt{r\sum_{i=1}^{r}\left|\lambda_{i}R(i,j)\right|^{2}}\geq\left|\sum_{i=1}^{r}\lambda_{i}R(i,j)\right|=\left|[v^{\dagger}R](j)\right|,

as claimed, where the first inequality uses |ui(j)||R(i,j)|\left|u_{i}(j)\right|\geq\left|R(i,j)\right| and the second follows from the Cauchy–Schwarz inequality. We also have w2=ϕvR2\left\|w\right\|^{2}=\phi\left\|v^{\dagger}R\right\|^{2} for

ϕ=ri=1r|λi|2ui2i=1rλiR(i,)2=ri=1rϕi|λi|2R(i,)2i=1rλiR(i,)2,\displaystyle\phi=\frac{r\sum_{i=1}^{r}\left|\lambda_{i}\right|^{2}\left\|u_{i}\right\|^{2}}{\left\|\sum_{i=1}^{r}\lambda_{i}R(i,\cdot)\right\|^{2}}=\frac{r\sum_{i=1}^{r}\phi_{i}\left|\lambda_{i}\right|^{2}\left\|R(i,\cdot)\right\|^{2}}{\left\|\sum_{i=1}^{r}\lambda_{i}R(i,\cdot)\right\|^{2}},

as claimed.

Query access to 𝒘\boldsymbol{w}.

For any j{1,,n}j\in\{1,\ldots,n\} we can compute w(j)w(j) by querying u1(j),,ur(j)u_{1}(j),\ldots,u_{r}(j) and then performing arithmetic operations, which means that we can implement one query access to the vector ww with cost O(r+osq(R(1,))++osq(R(r,)))O(r+\textbf{osq}(R(1,\cdot))+\cdots+\textbf{osq}(R(r,\cdot))).

Estimating the norm of 𝒘\boldsymbol{w}.

Observe that

w=ri=1r|λi|2ui2.\left\|w\right\|=\sqrt{r\sum_{i=1}^{r}\left|\lambda_{i}\right|^{2}\left\|u_{i}\right\|^{2}}.

This quantity can be computed using O(r)O(r) arithmetic operations once we know u1,,ur\left\|u_{1}\right\|,\ldots,\left\|u_{r}\right\|. The cost is O(r+osq(R(1,))++osq(R(r,)))O(r+\textbf{osq}(R(1,\cdot))+\cdots+\textbf{osq}(R(r,\cdot))).

Sampling from 𝒘\boldsymbol{w}.

Let q:{1,,r}[0,1]q\colon\{1,\ldots,r\}\to[0,1] be the probability distribution such that

q(i)=|λi|2ui2=1r|λ|2u2.q(i)=\frac{\left|\lambda_{i}\right|^{2}\left\|u_{i}\right\|^{2}}{\sum_{\ell=1}^{r}\left|\lambda_{\ell}\right|^{2}\left\|u_{\ell}\right\|^{2}}.

Observe that the distribution qq can be explicitly constructed by getting the norms u1,,ur\left\|u_{1}\right\|,\ldots,\left\|u_{r}\right\|. In order to sample from ww, we execute the following process.

1. Take a sample i{1,,r}i\in\{1,\ldots,r\} from the probability distribution qq.

2. Take a sample j{1,,n}j\in\{1,\ldots,n\} from the probability distribution p~ui\tilde{p}_{u_{i}}.

The cost of the whole process (including the construction of qq) is

O(r+osq(R(1,))++osq(R(r,))).O(r+\textbf{osq}(R(1,\cdot))+\cdots+\textbf{osq}(R(r,\cdot))).

Let P:{1,,n}[0,1]P\colon\{1,\ldots,n\}\to[0,1] be the probability distribution of the sample obtained at Step 2, i.e.,

P(j)=i=1rq(i)p~ui(j)P(j)=\sum_{i=1}^{r}q(i)\tilde{p}_{u_{i}}(j)

for all j{1,,n}j\in\{1,\ldots,n\}. We have

|Ppw|tv\displaystyle|P-p_{w}|_{\mathrm{tv}} =12j=1n|i=1rq(i)p~ui(j)|w(j)|2w2|\displaystyle=\frac{1}{2}\sum_{j=1}^{n}\left|\sum_{i=1}^{r}q(i)\tilde{p}_{u_{i}}(j)-\frac{\left|w(j)\right|^{2}}{\left\|w\right\|^{2}}\right|
=12j=1n|i=1rq(i)p~ui(j)ri=1r|λiui(j)|2r=1r|λ|2u2|\displaystyle=\frac{1}{2}\sum_{j=1}^{n}\left|\sum_{i=1}^{r}q(i)\tilde{p}_{u_{i}}(j)-\frac{r\sum_{i=1}^{r}\left|\lambda_{i}u_{i}(j)\right|^{2}}{r\sum_{\ell=1}^{r}\left|\lambda_{\ell}\right|^{2}\left\|u_{\ell}\right\|^{2}}\right|
=12j=1n|i=1rq(i)p~ui(j)i=1rq(i)|ui(j)|2ui2|\displaystyle=\frac{1}{2}\sum_{j=1}^{n}\left|\sum_{i=1}^{r}q(i)\tilde{p}_{u_{i}}(j)-\sum_{i=1}^{r}q(i)\frac{\left|u_{i}(j)\right|^{2}}{\left\|u_{i}\right\|^{2}}\right|
i=1rq(i)|p~uipui|tv\displaystyle\leq\sum_{i=1}^{r}q(i)|\tilde{p}_{u_{i}}-p_{u_{i}}|_{\mathrm{tv}}
ε,\displaystyle\leq\varepsilon,

which concludes the proof. ∎
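The construction in this proof is short enough to spell out in code. The following Python sketch (purely illustrative; lams, samplers, query_us and norms_u stand for the assumed access to the coefficients and to the vectors u_1, …, u_r) implements query access to w, the computation of ‖w‖ from the norms alone, and the two-step sampling process above.

```python
import math
import random

def combination_access(lams, samplers, query_us, norms_u):
    """Access to the vector w of the proof for v^dagger R = sum_i lambda_i R(i,.).
    lams are the known coefficients; samplers[i]() draws an index from p~_{u_i};
    query_us[i](j) returns u_i(j); norms_u[i] is ||u_i||."""
    r = len(lams)

    def query_w(j):
        # w(j) = sqrt(r * sum_i |lambda_i u_i(j)|^2)
        return math.sqrt(r * sum(abs(lams[i] * query_us[i](j)) ** 2 for i in range(r)))

    # ||w|| = sqrt(r * sum_i |lambda_i|^2 ||u_i||^2), computable from the norms alone
    weights = [abs(lams[i]) ** 2 * norms_u[i] ** 2 for i in range(r)]
    norm_w = math.sqrt(r * sum(weights))

    # the distribution q(i) proportional to |lambda_i|^2 ||u_i||^2
    total = sum(weights)
    q = [wgt / total for wgt in weights]

    def sample_w():
        # Step 1: pick a row index i ~ q;  Step 2: draw j from p~_{u_i}
        i = random.choices(range(r), weights=q)[0]
        return samplers[i]()

    return query_w, norm_w, sample_w
```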

5.3 Importance matrix sketches associated with sampling access

We now discuss how to construct and use importance matrix sketches when given oversampling-and-query access to matrices.

Remember how we defined a special case of approximate importance matrix sketches in Definition 11 and a special case of approximate joint importance matrix sketches in Definition 13. We first observe that the standard representation (as defined in Definition 9) of such matrix sketches can be computed efficiently:

Lemma 6.

The standard description of the (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance matrix sketch of AA associated with 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) can be computed at cost O(rosq(A))O(r\cdot\textbf{osq}(A)). The standard description of the (r,ε,ϕ,ϕ)(r,\varepsilon,\phi,\phi^{\prime})-approximate joint importance matrix sketch of AA and BB associated with 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) and 𝖮𝖲𝖰ε,ϕ(B)\mathsf{OSQ}_{\varepsilon,\phi^{\prime}}(B) can be computed at cost O(r(osq(A)+osq(B)))O(r\cdot(\textbf{osq}(A)+\textbf{osq}(B))).

Proof.

The first part follows from the properties of the distributions p~rA¯\tilde{p}_{\mathrm{{r}_{\bar{A}}}} and prA¯p_{\mathrm{{r}_{\bar{A}}}}. Observe in particular that since we have 𝖲𝖰ε(rA¯)\mathsf{SQ}_{\varepsilon}(\mathrm{{r}_{\bar{A}}}), we can sample from p~rA¯\tilde{p}_{\mathrm{{r}_{\bar{A}}}} (which is ε\varepsilon-close to prA¯p_{\mathrm{{r}_{\bar{A}}}} in total variation distance) at cost osq(A)\textbf{osq}(A) and compute prA¯(i)=|rA¯(i)|2/rA¯2p_{\mathrm{{r}_{\bar{A}}}}(i)=\left|\mathrm{{r}_{\bar{A}}}(i)\right|^{2}/\left\|\mathrm{{r}_{\bar{A}}}\right\|^{2} at cost O(osq(A))O(\textbf{osq}(A)) for any i{1,,n}i\in\{1,\ldots,n\}. We can thus compute the standard description of the (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance matrix sketch of AA associated with 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) at cost O(rosq(A))O(r\cdot\textbf{osq}(A)).

The proof of the second part is similar. For sampling from the distribution 12(prA¯+prB¯)\frac{1}{2}(p_{\mathrm{{r}_{\bar{A}}}}+p_{\mathrm{{r}_{\bar{B}}}}), we simply take a random bit bb, and sample from prA¯p_{\mathrm{{r}_{\bar{A}}}} if b=0b=0 and from prB¯p_{\mathrm{{r}_{\bar{B}}}} if b=1b=1. ∎
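As an illustration of the first part of the lemma, the following Python sketch computes the pairs (s_i, α_i) of the standard description, assuming callables sample_row_norm() (an index drawn from p~_{r_Ā}), rownorm_Abar(i) (the value ‖Ā(i,·)‖) and the value ‖Ā‖_F; all names are illustrative, and the sketch only restates the definitions used above.

```python
import math

def standard_description(r, sample_row_norm, rownorm_Abar, frobenius_Abar):
    """Draw r row indices s_1, ..., s_r from p~_{r_Abar} and attach the rescaling
    factors alpha_i = 1 / sqrt(r * p_{r_Abar}(s_i)) (with alpha_i = 0 when the
    sampled row of Abar is zero)."""
    description = []
    for _ in range(r):
        s = sample_row_norm()
        p_s = (rownorm_Abar(s) / frobenius_Abar) ** 2   # p_{r_Abar}(s)
        alpha = 0.0 if p_s == 0 else 1.0 / math.sqrt(r * p_s)
        description.append((s, alpha))
    return description
```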

We now show how to get oversampling-and-query access to the matrices SASA and (SA)(SA)^{\dagger} when given an approximate importance matrix sketch of AA associated with 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A). The following two lemmas generalize similar results proved in [5, 17] for the case of perfect oversampling-and-query access.

Lemma 7.

For a nonzero matrix Am×nA\in\mathbb{C}^{m\times n}, any parameters ε[0,1]\varepsilon\in[0,1], ϕ1\phi\geq 1 and any integer r1r\geq 1, assume that we have 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) and know the standard representation of the (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance matrix sketch of AA associated with 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A). Assume that the inequality SAF212AF2\left\|SA\right\|_{\mathrm{F}}^{2}\geq\frac{1}{2}\left\|A\right\|_{\mathrm{F}}^{2} holds. We can then implement 𝖮𝖲𝖰ε,ϕ(SA)\mathsf{OSQ}_{\varepsilon,\phi^{\prime}}(SA) with ϕ2ϕ\phi^{\prime}\leq 2\phi at cost O(osq(A))O(\textbf{osq}(A)).

Proof.

Write ((s1,α1),,(sr,αr))((s_{1},\alpha_{1}),\ldots,(s_{r},\alpha_{r})) the standard representation of SS, where

αi=1rprA¯(si)=A¯FrA¯(si,)ϕAFrA(si,)\alpha_{i}=\frac{1}{\sqrt{rp_{\mathrm{{r}_{\bar{A}}}}(s_{i})}}=\frac{\left\|\bar{A}\right\|_{\mathrm{F}}}{\sqrt{r}\left\|\bar{A}(s_{i},\cdot)\right\|}\leq\frac{\sqrt{\phi}\left\|A\right\|_{\mathrm{F}}}{\sqrt{r}\left\|A(s_{i},\cdot)\right\|}

if A¯(si,)0\left\|\bar{A}(s_{i},\cdot)\right\|\neq 0, and αi=0\alpha_{i}=0 if A¯(si,)=0\left\|\bar{A}(s_{i},\cdot)\right\|=0. Define

Δ={i{1,,r}|A¯(si,)0}.\Delta=\left\{i\in\{1,\ldots,r\}\>|\>\left\|\bar{A}(s_{i},\cdot)\right\|\neq 0\right\}.

For any (i,j){1,,r}×{1,,n}(i,j)\in\{1,\ldots,r\}\times\{1,\ldots,n\}, we can compute [SA](i,j)[SA](i,j) using the identity [SA](i,j)=αiA(si,j)[SA](i,j)=\alpha_{i}A(s_{i},j), which requires only one query access to AA and an arithmetic operation.

We define the matrix SA¯=SA¯\,\overline{\!{SA}\!}\,=S\bar{A}. We have

SA¯F2=i=1r|αi|2A¯(si,)2=|Δ|rA¯F2ϕAF22ϕSAF2\left\|\,\overline{\!{SA}\!}\,\right\|_{\mathrm{F}}^{2}=\sum_{i=1}^{r}\left|\alpha_{i}\right|^{2}\left\|\bar{A}(s_{i},\cdot)\right\|^{2}=\frac{\left|\Delta\right|}{r}\left\|\bar{A}\right\|_{\mathrm{F}}^{2}\leq\phi\left\|A\right\|_{\mathrm{F}}^{2}\leq 2\phi\left\|SA\right\|_{\mathrm{F}}^{2}

and |[SA¯](i,j)|=|αi||A¯(si,j)||αi||A(si,j)|=|[SA](i,j)|\left|[\,\overline{\!{SA}\!}\,](i,j)\right|=\left|\alpha_{i}\right|\left|\bar{A}(s_{i},j)\right|\geq\left|\alpha_{i}\right|\left|A(s_{i},j)\right|=\left|[SA](i,j)\right| for all (i,j){1,,r}×{1,,n}(i,j)\in\{1,\ldots,r\}\times\{1,\ldots,n\}. We now explain how to implement 𝖲𝖰ε(SA¯)\mathsf{SQ}_{\varepsilon}(\,\overline{\!{SA}\!}\,).

Approximate sampling-and-query access to the rows of SA¯\boldsymbol{\,\overline{\!{SA}\!}\,}. For each i{1,,r}i\in\{1,\ldots,r\}, we explain how to implement 𝖲𝖰ε([SA¯](i,))\mathsf{SQ}_{\varepsilon}([\,\overline{\!{SA}\!}\,](i,\cdot)). We can implement query access from the identity [SA¯](i,j)=αiA¯(si,j)[\overline{SA}](i,j)=\alpha_{i}\bar{A}(s_{i},j) using only one query access to A¯\bar{A} and one arithmetic operation. Similarly, we can compute [SA¯](i,)=|αi|A¯(si,)\left\|[\,\overline{\!{SA}\!}\,](i,\cdot)\right\|=\left|\alpha_{i}\right|\left\|\bar{A}(s_{i},\cdot)\right\| by getting the norm A¯(si,)\left\|\bar{A}(s_{i},\cdot)\right\|. To sample from a nonzero vector SA¯(i,)\,\overline{\!{SA}\!}\,(i,\cdot), we simply sample from the distribution p~A¯(si,)\tilde{p}_{\bar{A}(s_{i},\cdot)}. Since pSA¯(i,)=pA¯(si,)p_{\,\overline{\!{SA}\!}\,(i,\cdot)}=p_{\bar{A}(s_{i},\cdot)}, we have

|p~A¯(si,)pSA¯(i,)|tv=|p~A¯(si,)pA¯(si,)|tvε.|\tilde{p}_{\bar{A}(s_{i},\cdot)}-p_{\,\overline{\!{SA}\!}\,(i,\cdot)}|_{\mathrm{tv}}=|\tilde{p}_{\bar{A}(s_{i},\cdot)}-p_{\bar{A}(s_{i},\cdot)}|_{\mathrm{tv}}\leq\varepsilon.

The cost of each of these operations is O(osq(A))O(\textbf{osq}(A)).

Approximate sampling-and-query access to the row norm vector of SA¯\boldsymbol{\,\overline{\!{SA}\!}\,}. We now explain how to implement 𝖲𝖰ε(rSA¯)\mathsf{SQ}_{\varepsilon}(\mathrm{r_{\,\overline{\!{SA}\!}\,}}) for the vector rSA¯=([SA¯](1,),,[SA¯](r,))\mathrm{r_{\,\overline{\!{SA}\!}\,}}=(\left\|[\,\overline{\!{SA}\!}\,](1,\cdot)\right\|,\ldots,\left\|[\,\overline{\!{SA}\!}\,](r,\cdot)\right\|). We have rSA¯(i)=|αi|A¯(si,)=1/rA¯F\mathrm{r_{\,\overline{\!{SA}\!}\,}}(i)=\left|\alpha_{i}\right|\left\|\bar{A}(s_{i},\cdot)\right\|=\sqrt{1/r}\left\|\bar{A}\right\|_{\mathrm{F}} for all iΔi\in\Delta, and rSA¯(i)=0\mathrm{r_{\,\overline{\!{SA}\!}\,}}(i)=0 for all iΔi\notin\Delta. We can implement query access to rSA¯\mathrm{r_{\,\overline{\!{SA}\!}\,}} using one query to the norm A¯(si,)\left\|\bar{A}(s_{i},\cdot)\right\| (or to the norm rA¯=A¯F\left\|\mathrm{{r}_{\bar{A}}}\right\|=\left\|\bar{A}\right\|_{\mathrm{F}}) and one arithmetic operation. Since rSA¯=|Δ|/rA¯F=|Δ|/rrA¯\left\|\mathrm{r_{\,\overline{\!{SA}\!}\,}}\right\|=\sqrt{\left|\Delta\right|/r}\left\|\bar{A}\right\|_{\mathrm{F}}=\sqrt{\left|\Delta\right|/r}\left\|\mathrm{{r}_{\bar{A}}}\right\|, we can compute rSA¯\left\|\mathrm{r_{\,\overline{\!{SA}\!}\,}}\right\| by getting rA¯\left\|\mathrm{{r}_{\bar{A}}}\right\|, which can be done at cost osq(A)\textbf{osq}(A). Implementing (perfect) sampling access to rSA¯\mathrm{r_{\,\overline{\!{SA}\!}\,}} is trivial since prSA¯p_{\mathrm{r_{\,\overline{\!{SA}\!}\,}}} is the uniform distribution over Δ\Delta. ∎

Lemma 8.

For a nonzero matrix Am×nA\in\mathbb{C}^{m\times n}, any parameters ε[0,1]\varepsilon\in[0,1], ϕ1\phi\geq 1 and any integer r1r\geq 1, assume that we have 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) and know the standard representation of the (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance matrix sketch of AA associated with 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A). Assume that the inequality SAF212AF2\left\|SA\right\|_{\mathrm{F}}^{2}\geq\frac{1}{2}\left\|A\right\|_{\mathrm{F}}^{2} holds. We can then implement 𝖮𝖲𝖰ε,ϕ((SA))\mathsf{OSQ}_{\varepsilon,\phi^{\prime}}((SA)^{\dagger}) with ϕ2ϕ\phi^{\prime}\leq 2\phi at cost O(rosq(A))O(r\cdot\textbf{osq}(A)).

Proof.

For conciseness, let us write M=(SA)M=(SA)^{\dagger}. Write ((s1,α1),,(sr,αr))((s_{1},\alpha_{1}),\ldots,(s_{r},\alpha_{r})) the standard representation of SS and define the set Δ\Delta as in the proof of Lemma 7.

For any (i,j){1,,r}×{1,,n}(i,j)\in\{1,\ldots,r\}\times\{1,\ldots,n\}, we can compute M(j,i)M(j,i) using the identity M(j,i)=[SA](i,j)=αiA(si,j)M(j,i)=[SA](i,j)=\alpha_{i}A(s_{i},j), which requires only one query access to AA and an arithmetic operation.

We define the matrix M¯=(SA¯)\,\overline{\!{M}\!}\,=(S\bar{A})^{\dagger}. We have

M¯F2=i=1r|αi|2A¯(si,)2=|Δ|rA¯F2ϕAF22ϕSAF2=2ϕMF2\left\|\,\overline{\!{M}\!}\,\right\|_{\mathrm{F}}^{2}=\sum_{i=1}^{r}\left|\alpha_{i}\right|^{2}\left\|\bar{A}(s_{i},\cdot)\right\|^{2}=\frac{\left|\Delta\right|}{r}\left\|\bar{A}\right\|_{\mathrm{F}}^{2}\leq\phi\left\|A\right\|_{\mathrm{F}}^{2}\leq 2\phi\left\|SA\right\|_{\mathrm{F}}^{2}=2\phi\left\|M\right\|_{\mathrm{F}}^{2}

and |M¯(j,i)|=|αi||A¯(si,j)||αi||A(si,j)|=|M(j,i)|\left|\,\overline{\!{M}\!}\,(j,i)\right|=\left|\alpha_{i}\right|\left|\bar{A}(s_{i},j)\right|\geq\left|\alpha_{i}\right|\left|A(s_{i},j)\right|=\left|M(j,i)\right| for any (i,j){1,,r}×{1,,n}(i,j)\in\{1,\ldots,r\}\times\{1,\ldots,n\}. We now explain how to implement 𝖲𝖰ε(M¯)\mathsf{SQ}_{\varepsilon}(\,\overline{\!{M}\!}\,).

Approximate sampling-and-query access to the rows of M¯\boldsymbol{\,\overline{\!{M}\!}\,}. For each j{1,,n}j\in\{1,\ldots,n\}, we explain how to implement 𝖲𝖰ε(M¯(j,))\mathsf{SQ}_{\varepsilon}(\,\overline{\!{M}\!}\,(j,\cdot)). For any i{1,,r}i\in\{1,\ldots,r\}, we can implement query access to this vector from the identity M¯(j,i)=αiA¯(si,j)\,\overline{\!{M}\!}\,(j,i)=\alpha_{i}\bar{A}(s_{i},j) using only one query access to A¯\bar{A} and one arithmetic operation. We can compute its norm M¯(j,)\left\|\,\overline{\!{M}\!}\,(j,\cdot)\right\| by querying A¯(s1,j),,A¯(sr,j)\bar{A}(s_{1},j),\ldots,\bar{A}(s_{r},j) and then performing O(r)O(r) arithmetic operations, which has cost O(rosq(A))O(r\cdot\textbf{osq}(A)). Similarly, we can (exactly) sample from the distribution pM¯(j,)p_{\,\overline{\!{M}\!}\,(j,\cdot)} by first querying A¯(s1,j),,A¯(sr,j)\bar{A}(s_{1},j),\ldots,\bar{A}(s_{r},j) and then explicitly computing this distribution. The cost is again O(rosq(A))O(r\cdot\textbf{osq}(A)).

Approximate sampling-and-query access to the row norm vector of M¯\boldsymbol{\,\overline{\!{M}\!}\,}. We now explain how to implement 𝖲𝖰ε(rM¯)\mathsf{SQ}_{\varepsilon}(\mathrm{{r}_{\,\overline{\!{M}\!}\,}}) for the vector rM¯=(M¯(1,),,M¯(n,))\mathrm{{r}_{\,\overline{\!{M}\!}\,}}=(\left\|\,\overline{\!{M}\!}\,(1,\cdot)\right\|,\ldots,\left\|\,\overline{\!{M}\!}\,(n,\cdot)\right\|). Query access can be done at cost O(rosq(A))O(r\cdot\textbf{osq}(A)) by computing M¯(j,)\left\|\,\overline{\!{M}\!}\,(j,\cdot)\right\| as described above. Since rM¯=M¯F=|Δ|/rA¯F\left\|\mathrm{r_{\,\overline{\!{M}\!}\,}}\right\|=\left\|\,\overline{\!{M}\!}\,\right\|_{\mathrm{F}}=\sqrt{\left|\Delta\right|/r}\left\|\bar{A}\right\|_{\mathrm{F}}, we can compute rM¯\left\|\mathrm{r_{\,\overline{\!{M}\!}\,}}\right\| at cost osq(A)\textbf{osq}(A) by getting rA¯=A¯F\left\|\mathrm{{r}_{\bar{A}}}\right\|=\left\|\bar{A}\right\|_{\mathrm{F}}. Sampling from rM¯\mathrm{{r}_{\,\overline{\!{M}\!}\,}} is more delicate. Here is our strategy: we take an index iΔi\in\Delta uniformly at random, and then return the index j{1,,n}j\in\{1,\ldots,n\} sampled from the distribution p~A¯(si,)\tilde{p}_{\bar{A}(s_{i},\cdot)}. Using the fact that A¯(si,)=0\left\|\bar{A}(s_{i},\cdot)\right\|=0 for iΔi\notin\Delta, the total variation distance between the resulting distribution and the distribution prM¯p_{\mathrm{{r}_{\,\overline{\!{M}\!}\,}}} can be evaluated as follows:

12j=1n|1|Δ|iΔp~A¯(si,)(j)prM¯(j)|=\displaystyle\frac{1}{2}\sum_{j=1}^{n}\left|\frac{1}{\left|\Delta\right|}\sum_{i\in\Delta}\tilde{p}_{\bar{A}(s_{i},\cdot)}(j)-p_{\mathrm{r_{\,\overline{\!{M}\!}\,}}}(j)\right|= 12j=1n|1|Δ|iΔp~A¯(si,)(j)M¯(j,)2rM¯2|\displaystyle\frac{1}{2}\sum_{j=1}^{n}\left|\frac{1}{\left|\Delta\right|}\sum_{i\in\Delta}\tilde{p}_{\bar{A}(s_{i},\cdot)}(j)-\frac{\left\|\,\overline{\!{M}\!}\,(j,\cdot)\right\|^{2}}{\left\|\mathrm{{r}_{\,\overline{\!{M}\!}\,}}\right\|^{2}}\right|
=\displaystyle= 12j=1n|1|Δ|iΔp~A¯(si,)(j)r|Δ|iΔ[SA¯](i,j)2A¯F2|\displaystyle\frac{1}{2}\sum_{j=1}^{n}\left|\frac{1}{\left|\Delta\right|}\sum_{i\in\Delta}\tilde{p}_{\bar{A}(s_{i},\cdot)}(j)-\frac{r}{\left|\Delta\right|}\sum_{i\in\Delta}\frac{\left\|[S\bar{A}](i,j)\right\|^{2}}{\left\|\bar{A}\right\|_{\mathrm{F}}^{2}}\right|
\displaystyle\leq 12j=1niΔ|p~A¯(si,)(j)|Δ|r[SA¯](i,j)2|Δ|A¯F2|\displaystyle\frac{1}{2}\sum_{j=1}^{n}\sum_{i\in\Delta}\left|\frac{\tilde{p}_{\bar{A}(s_{i},\cdot)}(j)}{\left|\Delta\right|}-\frac{r\left\|[S\bar{A}](i,j)\right\|^{2}}{\left|\Delta\right|\left\|\bar{A}\right\|_{\mathrm{F}}^{2}}\right|
=\displaystyle= 12iΔj=1n|p~A¯(si,)(j)|Δ|rαi2A¯(si,j)2|Δ|A¯F2|\displaystyle\frac{1}{2}\sum_{i\in\Delta}\sum_{j=1}^{n}\left|\frac{\tilde{p}_{\bar{A}(s_{i},\cdot)}(j)}{\left|\Delta\right|}-\frac{r\alpha_{i}^{2}\left\|\bar{A}(s_{i},j)\right\|^{2}}{\left|\Delta\right|\left\|\bar{A}\right\|_{\mathrm{F}}^{2}}\right|
=\displaystyle= 12iΔj=1n|p~A¯(si,)(j)|Δ|A¯F2|Δ|A¯(si,)2A¯(si,j)2A¯F2|\displaystyle\frac{1}{2}\sum_{i\in\Delta}\sum_{j=1}^{n}\left|\frac{\tilde{p}_{\bar{A}(s_{i},\cdot)}(j)}{\left|\Delta\right|}-\frac{\left\|\bar{A}\right\|_{\mathrm{F}}^{2}}{\left|\Delta\right|\left\|\bar{A}(s_{i},\cdot)\right\|^{2}}\frac{\left\|\bar{A}(s_{i},j)\right\|^{2}}{\left\|\bar{A}\right\|_{\mathrm{F}}^{2}}\right|
=\displaystyle= 1|Δ|iΔ|p~A¯(si,)pA¯(si,)|tv\displaystyle\frac{1}{\left|\Delta\right|}\sum_{i\in\Delta}|\tilde{p}_{\bar{A}(s_{i},\cdot)}-p_{\bar{A}(s_{i},\cdot)}|_{\mathrm{tv}}
\displaystyle\leq ε.\displaystyle\varepsilon.

The cost to output one sample is O(osq(A))O(\textbf{osq}(A)). ∎
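The sampling strategy for the row norm vector at the end of this proof is easy to state in code. A minimal sketch (helper names are assumptions): Delta is the list of indices i with ‖Ā(s_i,·)‖ ≠ 0, s_indices holds s_1,…,s_r, and sample_in_row_Abar(s) draws an index from p~_{Ā(s,·)}.

```python
import random

def sample_row_norm_Mbar(Delta, s_indices, sample_in_row_Abar):
    """Sample an index approximately distributed as p_{r_Mbar}: pick i uniformly
    in Delta, then draw an index from p~_{Abar(s_i, .)}."""
    i = random.choice(Delta)
    return sample_in_row_Abar(s_indices[i])
```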

6 Applications

In this section we discuss applications of the framework and techniques developed in the previous sections. In Section 6.1 we first show how to apply Theorem 1 to estimate the inner product of two vectors. In Sections 6.2 and 6.3, we apply our machinery to obtain a robust dequantization of the Quantum Singular Value Transformation in the low-rank setting (Section 6.2) and the sparse setting (Section 6.3). In Sections 6.4, 6.5 and 6.6 we then show how to obtain a robust dequantization of known quantum algorithms for supervised clustering, recommendation systems and low-rank matrix inversion.

When we consider in this section oversampling-and-query access to the input, for simplicity we always assume that we know the oversampling parameter ϕ\phi (if we know instead only an upper bound ϕmaxϕ\phi_{\textrm{max}}\geq\phi, all the statements still hold by replacing ϕ\phi by ϕmax\phi_{\textrm{max}}).

6.1 Estimation of the inner product with oversampling access

In this subsection we describe how to estimate the inner product with oversampling access. Here is our main result, which essentially generalizes Proposition 1 to approximate oversampling-and-query access.

Proposition 6.

For any ε[0,1]\varepsilon\in[0,1], any δ(0,1]\delta\in(0,1], any ξ>0\xi>0 and any ϕ1\phi\geq 1, assume that

  • we have 𝖮𝖲𝖰ε,ϕ(u)\mathsf{OSQ}_{\varepsilon,\phi}(u) for a nonzero vector umu\in\mathbb{C}^{m},

  • we have 𝖰(v)\mathsf{Q}(v) for a nonzero vector vmv\in\mathbb{C}^{m} and know v\left\|v\right\|.

We can then output an estimator α\alpha such that |α(u,v)|(22ϕε+ξ)uv\left|\alpha-(u,v)\right|\leq\left(2\sqrt{2\phi\varepsilon}+\xi\right)\left\|u\right\|\left\|v\right\| holds with probability at least 1δ1-\delta at cost O(ϕlog(1/δ)ξ2(osq(u)+q(v)))O\left(\frac{\phi\log(1/\delta)}{\xi^{2}}(\textbf{osq}(u)+\textbf{q}(v))\right).

Proof.

Take

r=16ϕξ2r=\left\lceil\frac{16\phi}{\xi^{2}}\right\rceil

and consider the (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance matrix sketch Σ\Sigma of uu associated with 𝖮𝖲𝖰ε,ϕ(u)\mathsf{OSQ}_{\varepsilon,\phi}(u). The standard description of this sketch can be computed at cost O(rosq(u))O(r\cdot\textbf{osq}(u)) via Lemma 6. We apply Theorem 1 with n=n=1n=n^{\prime}=1 and X=uX=u^{\ast} and Y=vY=v^{\ast} (remember that uu^{\ast} and vv^{\ast} denote the complex conjugate of uu and vv). Consider the complex number β=XΣ¯Σ¯Y\beta=X^{\dagger}\bar{\Sigma}^{\dagger}\bar{\Sigma}Y, which can be computed at cost O(r(osq(u)+q(v)))O(r\cdot(\textbf{osq}(u)+\textbf{q}(v))). The first part of the proof of Theorem 1 shows that the inequality

|𝔼[β](u,v)|22ϕεuv\left|\mathbb{E}\!\left[\beta\right]-(u,v)\right|\leq 2\sqrt{2\phi\varepsilon}\left\|u\right\|\left\|v\right\|

holds. The second part of the proof of Theorem 1 shows that the inequality

|β𝔼[β]|ξ2uv\left|\beta-\mathbb{E}\!\left[\beta\right]\right|\leq\frac{\xi}{\sqrt{2}}\left\|u\right\|\left\|v\right\|

holds with probability at least 3/43/4. By using Lemma 4, we can get an estimate α\alpha such that |α𝔼[β]|ξuv\left|\alpha-\mathbb{E}\!\left[\beta\right]\right|\leq\xi\left\|u\right\|\left\|v\right\| holds with probability at least 1δ1-\delta. From the triangle inequality we thus get

Pr[|α(u,v)|(22ϕε+ξ)uv]1δ,\mathrm{Pr}\left[\left|\alpha-(u,v)\right|\leq\left(2\sqrt{2\phi\varepsilon}+\xi\right)\left\|u\right\|\left\|v\right\|\right]\geq 1-\delta,

as claimed. ∎
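Up to the rescaling factors of the sketch, the quantity β used in this proof is essentially an importance-sampling estimator: it averages terms of the form u(s)v(s)/p_ū(s) over indices s drawn from the approximate distribution p~_ū. The following Python sketch, written for real vectors and with illustrative helper names (not the paper's notation), computes a median of independent such means to boost the success probability, in the spirit of Lemma 4.

```python
import statistics

def inner_product_estimate(sample_ubar, query_u, query_ubar, norm_ubar, query_v,
                           r, num_medians=9):
    """Importance-sampling estimator of (u, v), written for real vectors.
    sample_ubar() draws an index from p~_ubar; query_u, query_ubar, query_v give
    entry access; norm_ubar = ||ubar||.  Indices with ubar(s) = 0 have probability
    zero under the ideal distribution p_ubar."""
    def one_mean():
        total = 0.0
        for _ in range(r):
            s = sample_ubar()
            # one term u(s) * v(s) / p_ubar(s) of the estimator
            total += query_u(s) * query_v(s) * norm_ubar ** 2 / query_ubar(s) ** 2
        return total / r

    # take a median of independent means to boost the success probability
    return statistics.median(one_mean() for _ in range(num_medians))
```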

As a corollary, we can estimate values of the form uAvu^{\dagger}Av as well:

Corollary 1.

For any ε[0,1]\varepsilon\in[0,1], any δ(0,1]\delta\in(0,1], any ξ>0\xi>0 and any ϕ1\phi\geq 1, assume that

  • we have 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) for a nonzero matrix Am×mA\in\mathbb{C}^{m\times m},

  • we have 𝖰(u)\mathsf{Q}(u) and 𝖰(v)\mathsf{Q}(v) for two nonzero vectors u,vmu,v\in\mathbb{C}^{m} and know u\left\|u\right\| and v\left\|v\right\|.

We can then output an estimator α\alpha such that |αuAv|(4ϕε+ξ)uAFv\left|\alpha-u^{\dagger}Av\right|\leq\left(4\sqrt{\phi\varepsilon}+\xi\right)\left\|u\right\|\left\|A\right\|_{\mathrm{F}}\left\|v\right\| holds with probability at least 1δ1-\delta at cost O(ϕlog(1/δ)ξ2(osq(A)+q(u)+q(v)))O\left(\frac{\phi\log(1/\delta)}{\xi^{2}}(\textbf{osq}(A)+\textbf{q}(u)+\textbf{q}(v))\right).

Proof.

Observe that uAv=Tr(uAv)=Tr(Avu)=(x,y)u^{\dagger}Av=\mathrm{Tr}(u^{\dagger}Av)=\mathrm{Tr}(Avu^{\dagger})=(x,y), where xx is the vector of length m2m^{2} obtained by concatenating the entries of matrix AA row by row, and yy is the vector of length m2m^{2} obtained by concatenating the entries of the matrix vuvu^{\dagger} column by column.

We show how to implement 𝖮𝖲𝖰2ε,ϕ(x)\mathsf{OSQ}_{2\varepsilon,\phi}(x) at cost O(osq(A))O(\textbf{osq}(A)). We define x¯\bar{x} as the vector of length m2m^{2} obtained by concatenating the entries of matrix A¯\bar{A} row by row, where A¯\bar{A} represents the matrix in Definition 8. We index the entries of xx and x¯\bar{x} by (i,j){1,,m}×{1,,m}(i,j)\in\{1,\ldots,m\}\times\{1,\ldots,m\}, so that x(i,j)=A(i,j)x(i,j)=A(i,j) and x¯(i,j)=A¯(i,j)\bar{x}(i,j)=\bar{A}(i,j). Query access to xx and x¯\bar{x}, as well as the norms x\left\|x\right\| and x¯\left\|\bar{x}\right\|, can trivially be obtained from 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A). Let us write Δ={i{1,,m}|A¯(i,)0}\Delta=\{i\in\{1,\ldots,m\}\>|\>\left\|\bar{A}(i,\cdot)\right\|\neq 0\} To sample from x¯\bar{x}, we sample an index i{1,,m}i\in\{1,\ldots,m\} from p~rA¯\tilde{p}_{\mathrm{{r}_{\bar{A}}}}. If iΔi\in\Delta, then we sample an index j{1,,m}j\in\{1,\ldots,m\} from p~A¯(i,)\tilde{p}_{\bar{A}(i,\cdot)} and output the index (i,j)(i,j). If iΔi\notin\Delta, then we output the index (i,1)(i,1). Let P:{1,,m}×{1,,m}[0,1]P\colon\{1,\ldots,m\}\times\{1,\ldots,m\}\to[0,1] denote the resulting probability distribution. For conciseness, let us write p=prA¯p=p_{\mathrm{r_{\bar{A}}}}, p~=p~rA¯\tilde{p}=\tilde{p}_{\mathrm{r_{\bar{A}}}} and pi=pA¯(i,)p_{i}=p_{\bar{A}(i,\cdot)}, p~i=p~A¯(i,)\tilde{p}_{i}=\tilde{p}_{\bar{A}(i,\cdot)} for all i{1,,m}i\in\{1,\ldots,m\}. We have

|Ppx¯|tv\displaystyle|P-p_{\bar{x}}|_{\mathrm{tv}} =12iΔj=1m|p~(i)p~i(j)|A¯(i,j)|2A¯F2|+12iΔ|p~(i)0|\displaystyle=\frac{1}{2}\sum_{i\in\Delta}\sum_{j=1}^{m}\left|\tilde{p}(i)\tilde{p}_{i}(j)-\frac{\left|\bar{A}(i,j)\right|^{2}}{\left\|\bar{A}\right\|_{\mathrm{F}}^{2}}\right|+\frac{1}{2}\sum_{i\notin\Delta}\left|\tilde{p}(i)-0\right|
=12iΔj=1m|p~(i)p~i(j)p(i)pi(j)|+12iΔ|p~(i)p(i)|\displaystyle=\frac{1}{2}\sum_{i\in\Delta}\sum_{j=1}^{m}\left|\tilde{p}(i)\tilde{p}_{i}(j)-p(i)p_{i}(j)\right|+\frac{1}{2}\sum_{i\notin\Delta}\left|\tilde{p}(i)-p(i)\right|
12iΔj=1m|p~(i)p~i(j)p~(i)pi(j)|+12iΔj=1m|p~(i)pi(j)p(i)pi(j)|+12iΔ|p~(i)p(i)|\displaystyle\leq\frac{1}{2}\sum_{i\in\Delta}\sum_{j=1}^{m}\left|\tilde{p}(i)\tilde{p}_{i}(j)-\tilde{p}(i)p_{i}(j)\right|+\frac{1}{2}\sum_{i\in\Delta}\sum_{j=1}^{m}\left|\tilde{p}(i)p_{i}(j)-p(i)p_{i}(j)\right|+\frac{1}{2}\sum_{i\notin\Delta}\left|\tilde{p}(i)-p(i)\right|
=iΔp~(i)|p~ipi|tv+12iΔ|p~(i)p(i)|+12iΔ|p~(i)p(i)|\displaystyle=\sum_{i\in\Delta}\tilde{p}(i)|\tilde{p}_{i}-p_{i}|_{\mathrm{tv}}+\frac{1}{2}\sum_{i\in\Delta}\left|\tilde{p}(i)-p(i)\right|+\frac{1}{2}\sum_{i\notin\Delta}\left|\tilde{p}(i)-p(i)\right|
2ε.\displaystyle\leq 2\varepsilon.

We can implement 𝖰(y)\mathsf{Q}(y) at cost q(u)+q(v)\textbf{q}(u)+\textbf{q}(v). We have x=AF\left\|x\right\|=\left\|A\right\|_{\mathrm{F}} and y=uv\left\|y\right\|=\left\|u\right\|\left\|v\right\|. We then use Proposition 6 to estimate the inner product, which gives the claimed approximation factor and complexity. ∎
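The two-stage sampling used in this proof (first a row index from p~_{r_Ā}, then a column index from p~_{Ā(i,·)}) is the standard way of turning oversampling access to a matrix into oversampling access to its flattened vector of entries. A small Python sketch, with assumed helper callables and 0-indexed columns for illustration:

```python
def sample_flattened_entry(sample_row_norm, sample_in_row, rownorm_Abar):
    """Return a pair (i, j) approximately distributed as p_xbar, where xbar is the
    row-by-row flattening of Abar: draw i from p~_{r_Abar}, then j from p~_{Abar(i,.)};
    if the sampled row of Abar is zero, output a fixed pair for that row."""
    i = sample_row_norm()
    if rownorm_Abar(i) == 0:
        return (i, 0)
    return (i, sample_in_row(i))
```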

6.2 Dequantizing the Quantum Singular Value Transformation

In this subsection we consider the following problem.

Quantum Singular Value Transformation

Input: a matrix Am×nA\in\mathbb{C}^{m\times n} with AF=1\left\|A\right\|_{\mathrm{F}}=1, a vector bnb\in\mathbb{C}^{n} with b=1\left\|b\right\|=1, an even polynomial 𝔭\mathfrak{p} of degree d2d\geq 2 such that |𝔭(x)|1\left|\mathfrak{p}(x)\right|\leq 1 for all x[1,1]x\in[-1,1], and two parameters δ(0,1]\delta\in(0,1] and η(0,1/2]\eta\in(0,1/2].

Goal: find with probability at least 1δ1-\delta a vector vnv\in\mathbb{C}^{n} such that

v𝔭(AA)bη𝔭(AA)b.\left\|v-\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|\leq\eta\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|. (11)

Quantum algorithms for the Quantum Singular Value Transformation [21] output with probability at least 1δ1-\delta a quantum state proportional to the vector 𝔭(AA)b\mathfrak{p}(\sqrt{A^{\dagger}A})b in time

O~(dlog(1/δ)𝔭(AA)b)\tilde{O}\left(\frac{d\log(1/\delta)}{\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|}\right)

when AA and bb are given in QRAM. The dequantization framework from [5] assumes that we have perfect sampling-and-query access to AA and bb (i.e., 𝖲𝖰(A)\mathsf{SQ}(A) and 𝖲𝖰(b)\mathsf{SQ}(b)), and gives a classical algorithm that after a preprocessing stage running in time

O~(d16log3(1/δ)η6𝔭(AA)b6),\tilde{O}\left(\frac{d^{16}\log^{3}(1/\delta)}{\eta^{6}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|^{6}}\right),

implements 𝖱𝖲𝖰0,δ,η(v){\mathsf{RSQ}_{0,\delta^{\prime},\eta^{\prime}}}(v) for a vector vv satisfying Condition (11) with probability at least 1δ1-\delta at cost

O~(d12log2(1/δ)log(1/δ)η4η2𝔭(AA)b6),\tilde{O}\left(\frac{d^{12}\log^{2}(1/\delta)\log(1/\delta^{\prime})}{\eta^{4}\eta^{\prime 2}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|^{6}}\right),

for any δ,η(0,1]\delta^{\prime},\eta^{\prime}\in(0,1]. We show how to achieve a similar running time when given only 𝖲𝖰ε(A)\mathsf{SQ}_{\varepsilon}(A) and 𝖲𝖰ε(b)\mathsf{SQ}_{\varepsilon}(b) for ε>0\varepsilon>0, when ε\varepsilon is small enough. More generally, we also consider the case where we have 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) and 𝖮𝖲𝖰ε,ϕ(b)\mathsf{OSQ}_{\varepsilon,\phi}(b) with ϕ>1\phi>1.

Theorem 7.

There exists an absolute constant τ(0,1)\tau\in(0,1) such that the following holds. Assume that we have 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) and 𝖮𝖲𝖰ε,ϕ(b)\mathsf{OSQ}_{\varepsilon,\phi}(b) at cost osq(A)=osq(b)=O~(1)\textbf{osq}(A)=\textbf{osq}(b)=\tilde{O}(1) for some ϕ1\phi\geq 1 and ε0\varepsilon\geq 0. For any ε,δ,η(0,1]\varepsilon^{\prime},\delta^{\prime},\eta^{\prime}\in(0,1], if

εετη2𝔭(AA)b4ϕ3d8log(1/δ)\varepsilon\leq\varepsilon^{\prime}\cdot\frac{\tau\eta^{2}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|^{4}}{\phi^{3}d^{8}\log(1/\delta)} (12)

then there is a classical algorithm that after a preprocessing stage running in time

O~(ϕ6d16log3(1/δ)η6𝔭(AA)b6)\tilde{O}\left(\frac{\phi^{6}d^{16}\log^{3}(1/\delta)}{\eta^{6}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|^{6}}\right)

implements 𝖱𝖲𝖰ε,δ,η(v){\mathsf{RSQ}_{\varepsilon^{\prime},\delta^{\prime},\eta^{\prime}}}(v) for a vector vv satisfying Condition (11) with probability at least 1δ1-\delta at cost

O~(ϕ5d12log2(1/δ)log(1/δ)η4η2𝔭(AA)b6).\tilde{O}\left(\frac{\phi^{5}d^{12}\log^{2}(1/\delta)\log(1/\delta^{\prime})}{\eta^{4}\eta^{\prime 2}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|^{6}}\right).
Proof.

Our proof follows the same strategy as in [5]. Let us consider the polynomial 𝔮\mathfrak{q} such that 𝔭(x)𝔭(0)=𝔮(x2)\mathfrak{p}(x)-\mathfrak{p}(0)=\mathfrak{q}(x^{2}) and define the function f:f\colon\mathbb{R}\to\mathbb{C} as follows:

f(x)={𝔮(1)if x1,𝔮(x)if x[1,1],𝔮(1)if x1.f(x)=\begin{cases}\mathfrak{q}(-1)&\textrm{if }x\leq-1,\\ \mathfrak{q}(x)&\textrm{if }x\in[-1,1],\\ \mathfrak{q}(1)&\textrm{if }x\geq 1.\end{cases}

As shown in Lemma 6.3 of [5], the function ff is LL-Lipschitz on \mathbb{R} with L=O(d2)L=O(d^{2}), the function f¯\bar{f} is L¯\bar{L}-Lipschitz on \mathbb{R} with L¯=O(d4)\bar{L}=O(d^{4}), and we have

Lmax:=maxx|f¯(x)|=O(d2).L_{\textrm{max}}:=\max_{x\in\mathbb{R}}\left|\bar{f}(x)\right|=O(d^{2}).

Let us set

r\displaystyle r =max(92ϕ2ln(6/δ),112ϕ2ln(6/δ)L2(η2𝔭(AA)b)2)=O(ϕ2d4log(1/δ)η2𝔭(AA)b2),\displaystyle=\left\lceil\max\left(\frac{9}{2}\phi^{2}\ln\left(6/\delta\right),\frac{112\phi^{2}\ln(6/\delta)L^{2}}{\left(\frac{\eta}{2}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|\right)^{2}}\right)\right\rceil=O\left(\frac{\phi^{2}d^{4}\log(1/\delta)}{\eta^{2}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|^{2}}\right),
c\displaystyle c =448ϕ2ln(6/δ)L¯2(η2𝔭(AA)b)2=O(ϕ2d8log(1/δ)η2𝔭(AA)b2).\displaystyle=\left\lceil\frac{448\phi^{2}\ln(6/\delta)\bar{L}^{2}}{\left(\frac{\eta}{2}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|\right)^{2}}\right\rceil=O\left(\frac{\phi^{2}d^{8}\log(1/\delta)}{\eta^{2}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|^{2}}\right).

We first consider the (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance sketch Sr×mS\in\mathbb{R}^{r\times m} of AA associated with 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A), and construct its standard representation via Lemma 6 at cost O~(r)\tilde{O}(r). Let us write R=SAR=SA. Observe that by taking the constant τ\tau small enough, any ε\varepsilon satisfying Condition (12) also satisfies the condition ε1/(6ϕ)\varepsilon\leq 1/(6\phi). From Lemma 5, with probability at least 1δ/31-\delta/3 we thus have

RF2[12AF2,32AF2].\left\|R\right\|_{\mathrm{F}}^{2}\in\left[\frac{1}{2}\left\|A\right\|_{\mathrm{F}}^{2},\frac{3}{2}\left\|A\right\|_{\mathrm{F}}^{2}\right]. (13)

We assume below that this inequality holds. From Lemma 8, we can then implement 𝖮𝖲𝖰ε,ϕ(R)\mathsf{OSQ}_{\varepsilon,\phi^{\prime}}(R^{\dagger}) with ϕ2ϕ\phi^{\prime}\leq 2\phi at cost O~(r)\tilde{O}(r).

We now consider the (c,ε,ϕ)(c,\varepsilon,\phi^{\prime})-approximate importance sketch Tc×nT\in\mathbb{R}^{c\times n} of RR^{\dagger} associated with 𝖮𝖲𝖰ε,ϕ(R)\mathsf{OSQ}_{\varepsilon,\phi^{\prime}}(R^{\dagger}), and construct its standard representation via Lemma 6 at cost O(cosq(R))=O~(cr)O(c\cdot\textbf{osq}(R^{\dagger}))=\tilde{O}(cr). Let us write C=SATC=SAT^{\dagger}. We now apply Theorem 3 (with χ=\chi=\infty since our bounds on the Lipschitz constants LL and L¯\bar{L} hold over \mathbb{R}). Observe that by taking the constant τ\tau small enough, any ε\varepsilon satisfying Condition (12) also satisfies Condition (4) with γ=η2𝔭(AA)b\gamma=\frac{\eta}{2}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|. Theorem 3 thus shows that

Rf¯(CC)Rf(AA)Fη2𝔭(AA)b\left\|R^{\dagger}\bar{f}(CC^{\dagger})R-f(A^{\dagger}A)\right\|_{\mathrm{F}}\leq\frac{\eta}{2}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|

holds with probability at least 1δ/31-\delta/3. Note that we then have

Rf¯(CC)Rb+𝔭(0)b𝔭(AA)bRf¯(CC)Rf(AA)Fbη2𝔭(AA)b.\left\|R^{\dagger}\bar{f}(CC^{\dagger})Rb+\mathfrak{p}(0)b-\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|\leq\left\|R^{\dagger}\bar{f}(CC^{\dagger})R-f(A^{\dagger}A)\right\|_{\mathrm{F}}\left\|b\right\|\leq\frac{\eta}{2}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|.

We now explain how to implement access to a good approximation vv of the vector Rf¯(CC)Rb+𝔭(0)bR^{\dagger}\bar{f}(CC^{\dagger})Rb+\mathfrak{p}(0)b.

We first explain how to explicitly compute a vector uru\in\mathbb{C}^{r} close to RbRb. First, we take

r=7ln(6/δ)ϕ2(η𝔭(AA)b/(16Lmax))2=O(ϕ2d4log(1/δ)η2𝔭(AA)b2)r^{\prime}=\left\lceil\frac{7\ln(6/\delta)\phi^{2}}{\left(\eta\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|/(16L_{\textrm{max}})\right)^{2}}\right\rceil=O\left(\frac{\phi^{2}d^{4}\log(1/\delta)}{\eta^{2}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|^{2}}\right)

and consider the (r,ε,ϕ,ϕ)(r^{\prime},\varepsilon,\phi^{\prime},\phi)-approximate joint importance sketch Σr×n\Sigma\in\mathbb{R}^{r^{\prime}\times n} of RR^{\dagger} and bb associated with 𝖮𝖲𝖰ε,ϕ(R)\mathsf{OSQ}_{\varepsilon,\phi^{\prime}}(R^{\dagger}) and 𝖮𝖲𝖰ε,ϕ(b)\mathsf{OSQ}_{\varepsilon,\phi}(b), and construct its standard representation, which can be done at cost O(r(osq(R)+osq(b)))=O~(rr)O(r^{\prime}\cdot(\textbf{osq}(R^{\dagger})+\textbf{osq}(b)))=\tilde{O}(rr^{\prime}) via Lemma 6. We set u=RΣΣbu=R\Sigma^{\dagger}\Sigma b. Note that uu can be computed explicitly at cost O(rr)O(rr^{\prime}) by computing explicitly RΣr×rR\Sigma^{\dagger}\in\mathbb{C}^{r\times r^{\prime}} and Σbr\Sigma b\in\mathbb{C}^{r^{\prime}}, and then performing explicitly the matrix-vector product. By taking the constant τ\tau small enough, we can ensure that any ε\varepsilon satisfying Condition (12) also satisfies the inequality εηp(AA)b32Lmax2ϕ\varepsilon\leq\frac{\eta\left\|p(\sqrt{A^{\dagger}A})b\right\|}{32L_{\textrm{max}}\sqrt{2}\phi}. Theorem 2 thus guarantees that

uRb(116Lmax+116Lmax)η𝔭(AA)bRFbη4Lmax𝔭(AA)b\displaystyle\left\|u-Rb\right\|\leq\left(\frac{1}{16L_{\textrm{max}}}+\frac{1}{16L_{\textrm{max}}}\right)\eta\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|\left\|R\right\|_{\mathrm{F}}\left\|b\right\|\leq\frac{\eta}{4L_{\textrm{max}}}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|

with probability at least 1δ/31-\delta/3. We assume below that this is the case. Note that

uRb+η4Lmax𝔭(AA)b2+η4Lmax=O(1).\left\|u\right\|\leq\left\|Rb\right\|+\frac{\eta}{4L_{\textrm{max}}}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|\leq 2+\frac{\eta}{4L_{\textrm{max}}}=O(1). (14)

Since uru\in\mathbb{C}^{r} and Cr×cC\in\mathbb{C}^{r\times c}, we can compute explicitly the vector w=f¯(CC)uw=\bar{f}(CC^{\dagger})u using O(r2c+dr+r2)O(r^{2}c+dr+r^{2}) arithmetic operations by first computing explicitly f¯(CC)\bar{f}(CC^{\dagger}) and then performing the matrix-vector product.

All the steps we have described so far correspond to the preprocessing stage. The overall complexity of this stage is

O~(rr+r2c+dr)=O~(r2c)=O~(ϕ6d16log3(1/δ)η6𝔭(AA)b6).\tilde{O}\left(rr^{\prime}+r^{2}c+dr\right)=\tilde{O}\left(r^{2}c\right)=\tilde{O}\left(\frac{\phi^{6}d^{16}\log^{3}(1/\delta)}{\eta^{6}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|^{6}}\right).

Let us write v=Rf¯(CC)u+𝔭(0)bv=R^{\dagger}\bar{f}(CC^{\dagger})u+\mathfrak{p}(0)b. Putting everything together, the following upper bound holds with probability at least 1δ1-\delta:

v𝔭(AA)b\displaystyle\left\|v-\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\| =Rf¯(CC)uf(AA)b\displaystyle=\left\|R^{\dagger}\bar{f}(CC^{\dagger})u-f(A^{\dagger}A)b\right\|
Rf¯(CC)(uRb)+Rf¯(CC)Rbf(AA)b\displaystyle\leq\left\|R^{\dagger}\bar{f}(CC^{\dagger})(u-Rb)\right\|+\left\|R^{\dagger}\bar{f}(CC^{\dagger})Rb-f(A^{\dagger}A)b\right\|
Rf¯(CC)uRb+Rf¯(CC)Rbf(AA)b\displaystyle\leq\left\|R^{\dagger}\right\|\left\|\bar{f}(CC^{\dagger})\right\|\left\|u-Rb\right\|+\left\|R^{\dagger}\bar{f}(CC^{\dagger})Rb-f(A^{\dagger}A)b\right\|
2LmaxuRb+Rf¯(CC)Rf(AA)F\displaystyle\leq 2L_{\textrm{max}}\left\|u-Rb\right\|+\left\|R^{\dagger}\bar{f}(CC^{\dagger})R-f(A^{\dagger}A)\right\|_{\mathrm{F}}
η𝔭(AA)b.\displaystyle\leq\eta\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|.

Finally, we explain how to implement approximate randomized sampling-and-query access from vv. Observe that vv is a linear combination of r+1r+1 vectors: a linear combination of the rr columns of RR^{\dagger} (i.e., rows of RR) and bb, where the coefficients of the linear combination are given by ww and 𝔭(0)\mathfrak{p}(0). Via Lemma 7, we can implement 𝖮𝖲𝖰ε,ϕ′′(R)\mathsf{OSQ}_{\varepsilon,\phi^{\prime\prime}}(R) with ϕ′′2ϕ\phi^{\prime\prime}\leq 2\phi at cost O~(1)\tilde{O}(1). Proposition 5 guarantees that we can implement 𝖮𝖲𝖰ε,φ(v)\mathsf{OSQ}_{\varepsilon,\varphi}(v) at cost O~(r)\tilde{O}(r), with

φ=O((r+1)ϕ′′i=1r|w(i)|2R(i,)2+ϕ|𝔭(0)|2b2v2).\varphi=O\left((r+1)\frac{\phi^{\prime\prime}\sum_{i=1}^{r}\left|w(i)\right|^{2}\left\|R(i,\cdot)\right\|^{2}+\phi\left|\mathfrak{p}(0)\right|^{2}\left\|b\right\|^{2}}{\left\|v\right\|^{2}}\right).

Using wLmaxu\left\|w\right\|\leq L_{\textrm{max}}\left\|u\right\|, Inequalities (13) and (14), the assumption b=1\left\|b\right\|=1 and the fact v=Θ(𝔭(AA)b)\left\|v\right\|=\Theta(\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\|), which holds since we are assuming η1/2\eta\leq 1/2, we get

φ=O(rϕLmax2RF2+ϕv2)=O(ϕ3d8log(1/δ)η2𝔭(AA)b4).\varphi=O\left(r\frac{\phi L^{2}_{\textrm{max}}\left\|R\right\|_{\mathrm{F}}^{2}+\phi}{\left\|v\right\|^{2}}\right)=O\left(\frac{\phi^{3}d^{8}\log(1/\delta)}{\eta^{2}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|^{4}}\right).

By taking the constant τ\tau small enough, any ε\varepsilon satisfying Condition (12) also satisfies the inequality 3εφε3\varepsilon\varphi\leq\varepsilon^{\prime}. We then use Proposition 3 to get 𝖱𝖲𝖰ε,δ,η(v){\mathsf{RSQ}_{\varepsilon^{\prime},\delta^{\prime},\eta^{\prime}}}(v) at cost

O(φlog(1/δ)η2osq(v))=O~(φlog(1/δ)η2r)=O~(d12ϕ5log2(1/δ)log(1/δ)η4η2𝔭(AA)b6),O\left(\frac{\varphi\log(1/\delta^{\prime})}{\eta^{\prime 2}}\textbf{osq}(v)\right)=\tilde{O}\left(\frac{\varphi\log(1/\delta^{\prime})}{\eta^{\prime 2}}r\right)=\tilde{O}\left(\frac{d^{12}\phi^{5}\log^{2}(1/\delta)\log(1/\delta^{\prime})}{\eta^{4}\eta^{\prime 2}\left\|\mathfrak{p}(\sqrt{A^{\dagger}A})b\right\|^{6}}\right),

as claimed. ∎
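To summarize the structure of this proof, here is a high-level Python/NumPy sketch of the preprocessing pipeline (sketch the rows of A, sketch the columns of R, form the small matrix C, apply f̄ through an eigendecomposition, and combine). This is only a dense, exact-sampling illustration under several simplifying assumptions: the function importance_rows and the arguments q_coeffs and p0 are illustrative, the vector Rb is computed exactly rather than through the joint sketch Σ, and f̄ is evaluated with the convention f̄(0)=0 (recall f(0)=0 here); the actual algorithm never forms A or R densely and only uses the (over)sampling access described above.

```python
import numpy as np

def importance_rows(M, r, rng):
    """Row importance sketch: pick r row indices proportionally to the squared row
    norms of M and rescale each picked row by 1 / sqrt(r * p(row))."""
    p = np.linalg.norm(M, axis=1) ** 2
    p = p / p.sum()
    idx = rng.choice(M.shape[0], size=r, p=p)
    scale = 1.0 / np.sqrt(r * p[idx])
    return scale[:, None] * M[idx, :]

def qsvt_preprocessing(A, b, q_coeffs, p0, r, c, rng=None):
    """Dense illustration of the preprocessing for p(x) = q(x^2) + p0, with
    q_coeffs the coefficients of q in increasing degree (q(0) = 0).
    Returns a query function for v = R^dagger fbar(CC^dagger) (Rb) + p0 * b."""
    rng = rng or np.random.default_rng()
    R = importance_rows(A, r, rng)                      # R = S A            (r x n)
    Ct = importance_rows(R.conj().T, c, rng)            # T R^dagger         (c x r)
    C = Ct.conj().T                                     # C = S A T^dagger   (r x c)

    # fbar(CC^dagger) through an eigendecomposition of the small r x r matrix;
    # fbar(x) = f(x)/x with f(x) = q(x) on [0, 1] (clipped outside) and f(0) = 0.
    evals, evecs = np.linalg.eigh(C @ C.conj().T)
    qpoly = np.polynomial.Polynomial(q_coeffs)
    fbar_vals = np.zeros_like(evals)
    nz = evals > 1e-12
    fbar_vals[nz] = qpoly(np.clip(evals[nz], 0.0, 1.0)) / evals[nz]
    fbar = (evecs * fbar_vals) @ evecs.conj().T

    # In the actual algorithm the vector Rb is itself approximated through a joint
    # importance sketch Sigma; here it is computed exactly for readability.
    u = R @ b
    w = fbar @ u                                        # w = fbar(CC^dagger) u

    def query_v(j):
        # v(j) = [R^dagger w](j) + p0 * b(j)
        return np.vdot(R[:, j], w) + p0 * b[j]

    return query_v
```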

6.3 Dequantizing the sparse Quantum Singular Value Transformation

In this subsection, we discuss the main problem considered by Gharibian and Le Gall [18].

For any integer s1s\geq 1, we say that a matrix is ss-sparse if it contains at most ss nonzero entries per row and column. The problem considered in [18, Section 4] is as follows.

Sparse Quantum Singular Value Transformation (SparseQSVT)

Input: an ss-sparse matrix Am×nA\in\mathbb{C}^{m\times n} for s2s\geq 2 such that A1\left\|A\right\|\leq 1, two vectors u,vnu,v\in\mathbb{C}^{n} such that u1\left\|u\right\|\leq 1 and v1\left\|v\right\|\leq 1, an even polynomial 𝔭[x]\mathfrak{p}\in\mathbb{R}[x] of degree dd with |𝔭(x)|1|\mathfrak{p}(x)|\leq 1 for all x[1,1]x\in[-1,1], a parameter η(0,1]\eta\in(0,1].

Goal: compute an estimate z^\hat{z}\in\mathbb{C} such that |z^v𝔭(AA)u|η|\hat{z}-v^{\dagger}\mathfrak{p}(\sqrt{A^{\dagger}A})u|\leq\eta.

The main difference with the problem considered in Section 6.2 is that the normalization on AA is now A1\left\|A\right\|\leq 1 instead of AF=1\left\|A\right\|_{\mathrm{F}}=1. Note that this is a significantly weaker assumption since the inequality AAF\left\|A\right\|\leq\left\|A\right\|_{\mathrm{F}} always holds and AF\left\|A\right\|_{\mathrm{F}} can be as large as nA\sqrt{n}\left\|A\right\|. On the other hand, with this weaker assumption the problem can only be solved efficiently in the classical setting for special types of matrices, such as ss-sparse matrices. The main dequantization result from [18] indeed shows that when given 𝖰(A)\mathsf{Q}(A), 𝖰(u)\mathsf{Q}(u) and 𝖲𝖰(v)\mathsf{SQ}(v) with costs q(A)=q(u)=sq(v)=O~(1)\textbf{q}(A)=\textbf{q}(u)=\textbf{sq}(v)=\tilde{O}(1), the problem SparseQSVT can be solved classically with probability at least 11/poly(n)1-1/\mathrm{poly}(n) in O~(sd/η2)\tilde{O}(s^{d}/\eta^{2}) time. (The result from [18] is actually slightly more general: it also works when vv can be accessed by a slight generalization of sampling-and-query access, called “ζ\zeta-sampling-access” in [18], where the sampling is done with small multiplicative error; see Definition 2.4 in [18].) This result is obtained by showing how to implement efficiently query access to the vector 𝔭(AA)u\mathfrak{p}(\sqrt{A^{\dagger}A})u when AA is ss-sparse, and then using the inner product estimation technique from [39]. By using instead our technique for robust estimation of the inner product, we show that this dequantization result even holds when considering 𝖲𝖰ε(v)\mathsf{SQ}_{\varepsilon}(v) for ε>0\varepsilon>0, when ε\varepsilon is small enough.

Theorem 8.

Given 𝖰(A)\mathsf{Q}(A), 𝖰(u)\mathsf{Q}(u) and 𝖲𝖰ε(v)\mathsf{SQ}_{\varepsilon}(v) with εη2/9\varepsilon\leq\eta^{2}/9 and costs q(A)=q(u)=sq(v)=O~(1)\textbf{q}(A)=\textbf{q}(u)=\textbf{sq}(v)=\tilde{O}(1), the problem SparseQSVT can be solved classically with probability at least 11/poly(n)1-1/\mathrm{poly}(n) in O~(sd/η2)\tilde{O}(s^{d}/\eta^{2}) time.

Proof.

Lemma 8 in [18] shows that given 𝖰(A)\mathsf{Q}(A) and 𝖰(u)\mathsf{Q}(u) at cost q(A)=q(u)=O~(1)\textbf{q}(A)=\textbf{q}(u)=\tilde{O}(1), we can implement query access to 𝔭(AA)u\mathfrak{p}(\sqrt{A^{\dagger}A})u at cost O~(sd)\tilde{O}(s^{d}). Let wnw\in\mathbb{C}^{n} denote the complex conjugate of the vector 𝔭(AA)u\mathfrak{p}(\sqrt{A^{\dagger}A})u. We immediately get 𝖰(w)\mathsf{Q}(w) from 𝖰(𝔭(AA)u)\mathsf{Q}(\mathfrak{p}(\sqrt{A^{\dagger}A})u).

Observe that w=𝔭(AA)uu1\left\|w\right\|=\|\mathfrak{p}(\sqrt{A^{\dagger}A})u\|\leq\left\|u\right\|\leq 1 from our assumptions on 𝔭\mathfrak{p} and u\left\|u\right\|. We have

(v,w)=v𝔭(AA)u.(v,w)=v^{\dagger}\mathfrak{p}(\sqrt{A^{\dagger}A})u.

We can then use Proposition 1 and output at cost

O(log(n)η2(sq(v)+q(w)))=O(sdlog(n)η2),O\left(\frac{\log(n)}{\eta^{2}}(\textbf{sq}(v)+\textbf{q}(w))\right)=O\left(\frac{s^{d}\log(n)}{\eta^{2}}\right),

an estimator z^\hat{z} such that

|z^v𝔭(AA)u|(22ε+η100)vη\left|\hat{z}-v^{\dagger}\mathfrak{p}(\sqrt{A^{\dagger}A})u\right|\leq(2\sqrt{2\varepsilon}+\frac{\eta}{100})\left\|v\right\|\leq\eta

holds with probability at least 11/poly(n)1-1/\mathrm{poly}(n). ∎
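To illustrate how query access to 𝔭(√(A†A))u can be realized for an s-sparse matrix, here is a naive Python sketch based on expanding the matrix powers; it is not the exact construction of Lemma 8 in [18], but it exhibits the same s^d scaling. The adjacency-list callables rows_of_col, cols_of_row, A_entry, the vector u, and the coefficient list q_coeffs (with 𝔭(x)=𝔮(x²)+𝔭(0) and 𝔮(0)=0) are all assumptions for illustration.

```python
def entry_AtA_power(k, j, rows_of_col, cols_of_row, A_entry, u):
    """Entry [(A^dagger A)^k u](j) of an s-sparse matrix A, computed by expanding
    the product; the recursion explores at most s^(2k) terms.
    rows_of_col(j): row indices l with A(l, j) != 0;
    cols_of_row(l): column indices j' with A(l, j') != 0;
    A_entry(l, j): the entry A(l, j); u: the input vector (query access)."""
    if k == 0:
        return u[j]
    total = 0.0
    for l in rows_of_col(j):               # [A^dagger x](j) = sum_l conj(A(l, j)) x(l)
        inner = 0.0
        for jp in cols_of_row(l):          # [A y](l) = sum_{j'} A(l, j') y(j')
            inner += A_entry(l, jp) * entry_AtA_power(k - 1, jp, rows_of_col,
                                                      cols_of_row, A_entry, u)
        total += A_entry(l, j).conjugate() * inner
    return total

def query_p_sqrt_AtA_u(j, q_coeffs, p0, rows_of_col, cols_of_row, A_entry, u):
    """Query access to [p(sqrt(A^dagger A)) u](j) = p0*u(j) + sum_k q_k [(A^dagger A)^k u](j),
    where p(x) = q(x^2) + p0 and q_coeffs = (q_1, ..., q_{d/2})."""
    value = p0 * u[j]
    for k, qk in enumerate(q_coeffs, start=1):
        value += qk * entry_AtA_power(k, j, rows_of_col, cols_of_row, A_entry, u)
    return value
```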

As mentioned in the introduction, using Theorem 8 we can generalize the other dequantization results in [18, Section 4], and in particular obtain an efficient classical algorithm computing a constant-precision estimation of the ground state energy of a local Hamiltonian when given 𝖲𝖰ε(u)\mathsf{SQ}_{\varepsilon}(u) for a guiding vector uu that has constant overlap with the ground state (Theorem 1 in [18] showed this result only for the case ε=0\varepsilon=0). Similarly, the implications for the Quantum PCP conjecture and the No Low-Energy Samplable States (NLSS) conjecture discussed in [18, Section 5] also generalize to the case of approximate sampling-and-query access to the witness states (i.e., 𝖲𝖰ε(|φ)\mathsf{SQ}_{\varepsilon}(|\varphi\rangle) for a witness |φ|\varphi\rangle instead of 𝖲𝖰(|φ)\mathsf{SQ}(|\varphi\rangle)).

6.4 Dequantizing quantum algorithms for supervised clustering

In the supervised clustering problem considered by Lloyd, Mohseni, and Rebentrost [31], the main computational task reduces to the following problem.

Supervised Clustering

Input: a matrix Mn×dM\in\mathbb{R}^{n\times d} with no all-zero row, a row-vector w1×nw\in\mathbb{R}^{1\times n} and parameters η,δ(0,1)\eta,\delta\in(0,1).

Goal: output an estimation α\alpha of wM2\left\|wM\right\|^{2} such that |αwM2|ηMF2w2\left|\alpha-\left\|wM\right\|^{2}\right|\leq\eta\left\|M\right\|_{\mathrm{F}}^{2}\left\|w\right\|^{2} holds with probability at least 1δ1-\delta.

Ref. [31] proposed a quantum algorithm for supervised clustering by solving the above problem using the SWAP test in time O~(log(1/δ)η)\tilde{O}\left(\frac{\log(1/\delta)}{\eta}\right), assuming that the data is stored in QRAM. Prior dequantization works [5, 40] showed how to construct classical algorithms solving this problem in time O(log(1/δ)η2(sq(M)+q(w)))O\left(\frac{\log(1/\delta)}{\eta^{2}}(\textbf{sq}(M)+\textbf{q}(w))\right) when given 𝖲𝖰(M)\mathsf{SQ}(M) and 𝖰(w)\mathsf{Q}(w). Here we show that dequantization is still possible even when given only 𝖲𝖰ε(M)\mathsf{SQ}_{\varepsilon}(M) for ε\varepsilon small enough.

Theorem 9.

Given 𝖲𝖰ε(M)\mathsf{SQ}_{\varepsilon}(M) and 𝖰(w)\mathsf{Q}(w) for εη225\varepsilon\leq\frac{\eta^{2}}{25}, there is a classical algorithm that solves the problem Supervised Clustering at cost O(log(1/δ)η2(sq(M)+q(w)))O\left(\frac{\log(1/\delta)}{\eta^{2}}(\textbf{sq}(M)+\textbf{q}(w))\right).

Proof.

We first observe that wM2=(u,v)\left\|wM\right\|^{2}=(u,v) for the vectors u,vdn2u,v\in\mathbb{R}^{dn^{2}} defined as follows. The coordinates of uu and vv are indexed by triples (i,j,k){1,,n}×{1,,d}×{1,,n}(i,j,k)\in\{1,\ldots,n\}\times\{1,\ldots,d\}\times\{1,\ldots,n\}. We set

ui,j,k\displaystyle u_{i,j,k} =M(i,j)M(k,),\displaystyle=M(i,j)\left\|M(k,\cdot)\right\|,
vi,j,k\displaystyle v_{i,j,k} =wiwkM(k,j)M(k,)\displaystyle=\frac{w_{i}w_{k}M(k,j)}{\left\|M(k,\cdot)\right\|}

for each (i,j,k){1,,n}×{1,,d}×{1,,n}(i,j,k)\in\{1,\ldots,n\}\times\{1,\ldots,d\}\times\{1,\ldots,n\}. We have

(u,v)\displaystyle(u,v) =j=1d(i=1nwiM(i,j))(k=1nwkM(k,j))=wM2,\displaystyle=\sum_{j=1}^{d}\left(\sum_{i=1}^{n}w_{i}M(i,j)\right)\left(\sum_{k=1}^{n}w_{k}M(k,j)\right)=\left\|wM\right\|^{2},
u\displaystyle\left\|u\right\| =i=1nj=1dk=1n|M(i,j)|2M(k,)2=MF2\displaystyle=\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{d}\sum_{k=1}^{n}\left|M(i,j)\right|^{2}\left\|M(k,\cdot)\right\|^{2}}=\left\|M\right\|_{\mathrm{F}}^{2}
v\displaystyle\left\|v\right\| =i=1nj=1dk=1n|wi|2|wk|2|M(k,j)|2M(k,)2=w2.\displaystyle=\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{d}\sum_{k=1}^{n}\frac{\left|w_{i}\right|^{2}\left|w_{k}\right|^{2}\left|M(k,j)\right|^{2}}{\left\|M(k,\cdot)\right\|^{2}}}=\left\|w\right\|^{2}.

Note that from 𝖲𝖰ε(M)\mathsf{SQ}_{\varepsilon}(M) and 𝖰(w)\mathsf{Q}(w) we can implement 𝖰(u)\mathsf{Q}(u) and 𝖰(v)\mathsf{Q}(v) at cost O(sq(M)+q(w))O(\textbf{sq}(M)+\textbf{q}(w)). In order to implement sample access to uu, we sample ii and kk from p~rM\tilde{p}_{\mathrm{r_{M}}}, then sample jj from p~M(i,)\tilde{p}_{M(i,\cdot)}. Let P(i,j,k)P(i,j,k) denote the resulting probability distribution. For conciseness, let us write p=prMp=p_{\mathrm{r_{M}}}, p~=p~rM\tilde{p}=\tilde{p}_{\mathrm{r_{M}}} and pi=pM(i,)p_{i}=p_{M(i,\cdot)}, p~i=p~M(i,)\tilde{p}_{i}=\tilde{p}_{M(i,\cdot)} for each i{1,,n}i\in\{1,\ldots,n\}. We have

|Ppu|tv\displaystyle|P-p_{u}|_{\mathrm{tv}} =12i=1nj=1dk=1n|P(i,j,k)pu(i,j,k)|,\displaystyle=\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{d}\sum_{k=1}^{n}\left|P(i,j,k)-p_{u}(i,j,k)\right|,
=12i=1nj=1dk=1n|p~(i)p~i(j)p~(k)M(i,)2MF2|M(i,j)|2M(i,)2M(k,)2MF2|\displaystyle=\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{d}\sum_{k=1}^{n}\left|\tilde{p}(i)\tilde{p}_{i}(j)\tilde{p}(k)-\frac{\left\|M(i,\cdot)\right\|^{2}}{\left\|M\right\|_{\mathrm{F}}^{2}}\frac{\left|M(i,j)\right|^{2}}{\left\|M(i,\cdot)\right\|^{2}}\frac{\left\|M(k,\cdot)\right\|^{2}}{\left\|M\right\|_{\mathrm{F}}^{2}}\right|
=12i=1nj=1dk=1n|p~(i)p~i(j)p~(k)p(i)pi(j)p(k)|\displaystyle=\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{d}\sum_{k=1}^{n}\left|\tilde{p}(i)\tilde{p}_{i}(j)\tilde{p}(k)-p(i)p_{i}(j)p(k)\right|
12i=1nj=1dk=1n|p~(i)p~i(j)p~(k)p(i)p~i(j)p~(k)|+|p(i)p~i(j)p~(k)p(i)pi(j)p~(k)|\displaystyle\leq\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{d}\sum_{k=1}^{n}\left|\tilde{p}(i)\tilde{p}_{i}(j)\tilde{p}(k)-p(i)\tilde{p}_{i}(j)\tilde{p}(k)\right|+\left|p(i)\tilde{p}_{i}(j)\tilde{p}(k)-p(i)p_{i}(j)\tilde{p}(k)\right|
+|p(i)pi(j)p~(k)p(i)pi(j)p(k)|\displaystyle\hskip 199.16928pt+\left|p(i)p_{i}(j)\tilde{p}(k)-p(i)p_{i}(j)p(k)\right|
=12i=1n|p~(i)p(i)|j=1dp~i(j)k=1np~(k)+12i=1np(i)j=1d|p~i(j)pi(j)|k=1np~(k)\displaystyle=\frac{1}{2}\sum_{i=1}^{n}\left|\tilde{p}(i)-p(i)\right|\sum_{j=1}^{d}\tilde{p}_{i}(j)\sum_{k=1}^{n}\tilde{p}(k)+\frac{1}{2}\sum_{i=1}^{n}p(i)\sum_{j=1}^{d}\left|\tilde{p}_{i}(j)-p_{i}(j)\right|\sum_{k=1}^{n}\tilde{p}(k)
+12i=1np(i)j=1dpi(j)k=1n|p~(k)p(k)|\displaystyle\hskip 199.16928pt+\frac{1}{2}\sum_{i=1}^{n}p(i)\sum_{j=1}^{d}p_{i}(j)\sum_{k=1}^{n}\left|\tilde{p}(k)-p(k)\right|
=|p~p|tv+i=1np(i)|p~ipi|tv+|p~p|tv\displaystyle=|\tilde{p}-p|_{\mathrm{tv}}+\sum_{i=1}^{n}p(i)|\tilde{p}_{i}-p_{i}|_{\mathrm{tv}}+|\tilde{p}-p|_{\mathrm{tv}}
3ε.\displaystyle\leq 3\varepsilon.

This means that we can implement 𝖲𝖰3ε(u)\mathsf{SQ}_{3\varepsilon}(u) at cost O~(sq(M))\tilde{O}(\textbf{sq}(M)).

We can now use Proposition 6: we can output an estimator α\alpha such that

|α(u,v)|(26ε+η100)uv=(26ε+η100)MF2w2ηMF2w2\left|\alpha-(u,v)\right|\leq\left(2\sqrt{6\varepsilon}+\frac{\eta}{100}\right)\left\|u\right\|\left\|v\right\|=\left(2\sqrt{6\varepsilon}+\frac{\eta}{100}\right)\left\|M\right\|_{\mathrm{F}}^{2}\left\|w\right\|^{2}\leq\eta\left\|M\right\|_{\mathrm{F}}^{2}\left\|w\right\|^{2}

holds with probability at least 1δ1-\delta at cost O~(log(1/δ)η2(sq(M)+q(w)))\tilde{O}\left(\frac{\log(1/\delta)}{\eta^{2}}(\textbf{sq}(M)+\textbf{q}(w))\right). ∎
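For illustration, the estimator assembled in this proof can be written out directly: sample triples (i,j,k) as described above, form the importance-sampling terms ‖u‖²v_{i,j,k}/u_{i,j,k}, and average them (here with a median over independent means to boost the success probability). A minimal Python sketch, with all helper names being assumptions and written for real data:

```python
import statistics

def estimate_wM_squared(sample_row_norm, sample_in_row, query_M, rownorm_M,
                        frob_M, w, r, num_medians=9):
    """Importance-sampling estimator of ||wM||^2 = (u, v) from the proof above.
    sample_row_norm() draws an index from p~_{r_M}; sample_in_row(i) draws a column
    index from p~_{M(i,.)}; query_M(i, j) and rownorm_M(k) give entries and row norms
    of M; frob_M = ||M||_F; w is given by query access (here simply a list)."""
    def one_mean():
        total = 0.0
        for _ in range(r):
            i, k = sample_row_norm(), sample_row_norm()
            j = sample_in_row(i)
            # one term ||u||^2 * v_{i,j,k} / u_{i,j,k} of the inner product estimator
            term = frob_M ** 4 * w[i] * w[k] * query_M(k, j)
            total += term / (query_M(i, j) * rownorm_M(k) ** 2)
        return total / r

    # median of independent means to boost the success probability
    return statistics.median(one_mean() for _ in range(num_medians))
```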

6.5 Dequantizing quantum algorithms for recommendation systems

In recommendation systems, the main computational task reduces to sampling from a row of a low-rank matrix close to the input matrix (which corresponds to the users’ preference matrix). In particular, quantum algorithms [29] and their dequantization [39, 21] consider and solve the following version.

Recommendation Systems

Input: a matrix Am×nA\in\mathbb{R}^{m\times n}, an index i{1,,m}i\in\{1,\ldots,m\}, and four parameters σ,ν(0,1]\sigma,\nu\in(0,1] and ξ,δ(0,1/2]\xi,\delta\in(0,1/2].

Goal: sample from the probability distribution pA^(i,)p_{\hat{A}(i,\cdot)} up to δ\delta error in total variation distance, for a matrix A^m×n\hat{A}\in\mathbb{R}^{m\times n} such that A^Aσ,ξFνAF\left\|\hat{A}-A_{\sigma,\xi}\right\|_{\mathrm{F}}\leq\nu\left\|A\right\|_{\mathrm{F}}, where Aσ,ξ=f(A)A_{\sigma,\xi}=f(A) for some arbitrary function ff such that

{f(x)=x if xσ(1+ξ)f(x)=0 if x<σ(1ξ)f(x)[0,x] if x[σ(1ξ),σ(1+ξ))\begin{cases}f(x)=x&\textrm{ if }x\geq\sigma(1+\xi)\\ f(x)=0&\textrm{ if }x<\sigma(1-\xi)\\ f(x)\in[0,x]&\textrm{ if }x\in[\sigma(1-\xi),\sigma(1+\xi))\end{cases} (15)

holds.

In applications to recommendation systems, the parameter ξ\xi is taken as a constant (e.g., ξ=1/6\xi=1/6). The matrix Aσ,ξA_{\sigma,\xi} represents some kind of low-rank approximation of AA. The intuition is that if the singular values of the matrix AA “concentrate” on a few high values, then by choosing σ\sigma large enough the matrix Aσ,ξA_{\sigma,\xi} will have low rank (since Aσ,ξA_{\sigma,\xi} has only a few nonzero singular values) and will also be a good approximation of AA (since the spectrum concentrates on these high singular values). This typically corresponds to the case where the ratio AF/σ\left\|A\right\|_{\mathrm{F}}/\sigma is small.

This parameter K=AF2σ2K=\frac{\left\|A\right\|_{\mathrm{F}}^{2}}{\sigma^{2}} is indeed crucial for analyzing the performance of quantum algorithms for recommendation systems and their dequantization (these algorithms are efficient only when KK is small). We assume that K1K\geq 1, since this is the interesting regime. Kerenidis and Prakash [29] have shown how to solve Recommendation Systems in time O~(K)\tilde{O}(\sqrt{K}) when AA is given in QRAM. Chia, Gilyén, Li, Lin, Tang and Wang [5] gave a classical algorithm solving the problem in time O~(K8log3(1/δ)ξ6ν6+K3log2(1/δ)ξ2ν2A(i,)2A^(i,)2)\tilde{O}\left(\frac{K^{8}\log^{3}(1/\delta)}{\xi^{6}\nu^{6}}+\frac{K^{3}\log^{2}(1/\delta)}{\xi^{2}\nu^{2}}\frac{\left\|A(i,\cdot)\right\|^{2}}{\left\|\hat{A}(i,\cdot)\right\|^{2}}\right) given 𝖲𝖰(A)\mathsf{SQ}(A) with sq(A)=O~(1)\textbf{sq}(A)=\tilde{O}(1), which improved the complexity of the first dequantization algorithm by Tang [39]. We show how to obtain the same complexity when given only 𝖲𝖰ε(A)\mathsf{SQ}_{\varepsilon}(A) for ε\varepsilon small enough, or more generally even 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) with ϕ>1\phi>1.

Theorem 10.

Assume that we have 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) with osq(A)=O~(1)\textbf{osq}(A)=\tilde{O}(1), for any ϕ1\phi\geq 1 and any

εmin(νξ48K2ϕ,δA^(i,)245Kϕ2A(i,)2).\varepsilon\leq\min\left(\frac{\nu\xi}{48K^{2}\phi},\frac{\delta\left\|\hat{A}(i,\cdot)\right\|^{2}}{45K\phi^{2}\left\|A(i,\cdot)\right\|^{2}}\right).

There is a classical algorithm that solves the problem Recommendation Systems in time

O~(K8ϕ6log3(1/δ)ξ6ν6+K3ϕ6log2(1/δ)ξ2ν2A(i,)2A^(i,)2).\tilde{O}\left(\frac{K^{8}\phi^{6}\log^{3}(1/\delta)}{\xi^{6}\nu^{6}}+\frac{K^{3}\phi^{6}\log^{2}(1/\delta)}{\xi^{2}\nu^{2}}\frac{\left\|A(i,\cdot)\right\|^{2}}{\left\|\hat{A}(i,\cdot)\right\|^{2}}\right).
Proof.

Our proof follows the same approach as in [5]. Let us define the function

{t(x)=1 if x(1+ξ)2σ2,t(x)=0 if x<(1ξ)2σ2,t(x)=x(1ξ)2σ24ξσ2 if x[(1ξ)2σ2,(1+ξ)2σ2),\begin{cases}t(x)=1&\textrm{ if }x\geq(1+\xi)^{2}\sigma^{2},\\ t(x)=0&\textrm{ if }x<(1-\xi)^{2}\sigma^{2},\\ t(x)=\frac{x-(1-\xi)^{2}\sigma^{2}}{4\xi\sigma^{2}}&\textrm{ if }x\in[(1-\xi)^{2}\sigma^{2},(1+\xi)^{2}\sigma^{2}),\end{cases}

and set Aσ,ξ=At(AA)A_{\sigma,\xi}=A\>t(A^{\dagger}A). Note that this matrix satisfies the conditions of Equation (15). Dividing t(x)t(x) by xx gives

{t¯(x)=1/x if x(1+ξ)2σ2,t¯(x)=0 if x<(1ξ)2σ2,t¯(x)=14ξσ2(1ξ)2σ24ξσ2x if x[(1ξ)2σ2,(1+ξ)2σ2).\begin{cases}\bar{t}(x)=1/x&\textrm{ if }x\geq(1+\xi)^{2}\sigma^{2},\\ \bar{t}(x)=0&\textrm{ if }x<(1-\xi)^{2}\sigma^{2},\\ \bar{t}(x)=\frac{1}{4\xi\sigma^{2}}-\frac{(1-\xi)^{2}\sigma^{2}}{4\xi\sigma^{2}x}&\textrm{ if }x\in[(1-\xi)^{2}\sigma^{2},(1+\xi)^{2}\sigma^{2}).\\ \end{cases}

Observe that t(x)t(x) is 14ξσ2\frac{1}{4\xi\sigma^{2}}-Lipschitz over \mathbb{R} and t¯(x)\bar{t}(x) is 14ξ(1ξ)2σ4\frac{1}{4\xi(1-\xi)^{2}\sigma^{4}}-Lipschitz over \mathbb{R}. Also observe that

maxx{t¯(x)}=1(1+ξ)2σ21σ2.\max_{x\in\mathbb{R}}\{\bar{t}(x)\}=\frac{1}{(1+\xi)^{2}\sigma^{2}}\leq\frac{1}{\sigma^{2}}. (16)
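
In the proof, \bar{t} is applied to an r\times r positive semidefinite matrix (the matrix CC^{\dagger} constructed below); computationally this simply means applying the scalar function \bar{t} to its eigenvalues. A minimal numpy sketch of this step (function and variable names are ours):

```python
import numpy as np

def apply_tbar(M, sigma, xi):
    """Apply the scalar function tbar defined above to the eigenvalues of a
    positive semidefinite matrix M (in the proof, M = C @ C.T).  The
    eigendecomposition costs O(r^3) for an r x r matrix, which fits in the
    O(r^2 c) budget of the proof since c >= r for the choices of r and c below."""
    lo, hi = (1 - xi) ** 2 * sigma ** 2, (1 + xi) ** 2 * sigma ** 2

    def tbar(x):
        if x >= hi:
            return 1.0 / x
        if x < lo:   # also catches tiny negative eigenvalues due to rounding
            return 0.0
        return 1.0 / (4 * xi * sigma ** 2) - lo / (4 * xi * sigma ** 2 * x)

    w, V = np.linalg.eigh(M)                        # M = V diag(w) V^T
    return (V * np.array([tbar(x) for x in w])) @ V.T
```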

We choose appropriate integers r{1,,m}r\in\{1,\ldots,m\} and c{1,,n}c\in\{1,\ldots,n\} such that

r\displaystyle r =Θ(ϕ2log(1/δ)(1+AF4ξ2σ4ν2))=Θ(K2ϕ2log(1/δ)ξ2ν2),\displaystyle=\Theta\left(\phi^{2}\log(1/\delta)\left(1+\frac{\left\|A\right\|_{\mathrm{F}}^{4}}{\xi^{2}\sigma^{4}\nu^{2}}\right)\right)=\Theta\left(\frac{K^{2}\phi^{2}\log(1/\delta)}{\xi^{2}\nu^{2}}\right),
c\displaystyle c =Θ(ϕ2log(1/δ)AF8ξ2(1ξ2)2σ8ν2)=Θ(K4ϕ2log(1/δ)ξ2ν2).\displaystyle=\Theta\left(\frac{\phi^{2}\log(1/\delta)\left\|A\right\|_{\mathrm{F}}^{8}}{\xi^{2}(1-\xi^{2})^{2}\sigma^{8}\nu^{2}}\right)=\Theta\left(\frac{K^{4}\phi^{2}\log(1/\delta)}{\xi^{2}\nu^{2}}\right).
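
In the exact-sampling case, the sketches R=SA and C=SAT^{\dagger} constructed in the next paragraphs amount to length-squared importance sampling of rows with rescaling; the point of the robust statements invoked below (Lemma 5, Lemma 8, Theorems 2 and 3) is that approximate, oversampled versions of these distributions suffice. A minimal exact-sampling illustration (names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def length_squared_row_sketch(M, num_rows):
    """Sample num_rows rows of M with probability proportional to their squared
    norms and rescale them, so that E[R^T R] = M^T M.  This is the exact
    (non-robust) version of the approximate importance sketches of the paper."""
    p = np.sum(M ** 2, axis=1) / np.sum(M ** 2)
    idx = rng.choice(M.shape[0], size=num_rows, p=p)
    return M[idx, :] / np.sqrt(num_rows * p[idx])[:, None]

A = rng.standard_normal((2000, 1500))
R = length_squared_row_sketch(A, num_rows=50)       # plays the role of R = SA
C = length_squared_row_sketch(R.T, num_rows=80).T   # plays the role of C = SAT^dagger
```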

Consider the (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance sketch Sr×mS\in\mathbb{R}^{r\times m} of AA associated with 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A), and write R=SAR=SA. From Lemma 5, with probability at least 1δ/51-\delta/5 we have (note that the inequality ενξ48ϕK2\varepsilon\leq\frac{\nu\xi}{48\phi K^{2}} implies that 2εϕ1/32\varepsilon\phi\leq 1/3, which is enough to get the bounds on RF2\left\|R\right\|_{\mathrm{F}}^{2} below)

RF2[12AF2,32AF2].\left\|R\right\|_{\mathrm{F}}^{2}\in\left[\frac{1}{2}\left\|A\right\|_{\mathrm{F}}^{2},\frac{3}{2}\left\|A\right\|_{\mathrm{F}}^{2}\right].

We assume below that this inequality holds. From Lemma 8, we can implement 𝖮𝖲𝖰ε,ϕ(R)\mathsf{OSQ}_{\varepsilon,\phi^{\prime}}(R^{\dagger}) with ϕ=2ϕ\phi^{\prime}=2\phi at cost O~(r)\tilde{O}(r). We now consider the (c,ε,ϕ)(c,\varepsilon,\phi^{\prime})-approximate importance sketch Tc×nT\in\mathbb{R}^{c\times n} of RR^{\dagger} associated with 𝖮𝖲𝖰ε,ϕ(R)\mathsf{OSQ}_{\varepsilon,\phi^{\prime}}(R^{\dagger}). Let us write C=SATC=SAT^{\dagger}. Theorem 3 (with χ=\chi=\infty since our Lipschitz constants for t(x)t(x) and t¯(x)\bar{t}(x) hold over \mathbb{R}) guarantees that the inequality

Rt¯(CC)Rt(AA)Fν2\left\|R^{\dagger}\bar{t}(CC^{\dagger})R-t(A^{\dagger}A)\right\|_{\mathrm{F}}\leq\frac{\nu}{2} (17)

holds with probability at least 1δ/51-\delta/5 (note that the inequality ενξ48ϕK2\varepsilon\leq\frac{\nu\xi}{48\phi K^{2}} implies that ε\varepsilon satisfies Condition (4) of the statement of Theorem 3 with γ=ν/2\gamma=\nu/2). Assume below that this is the case. Additionally, we have

t¯(CC)R=Rt¯(CC)Rt(AA)+ν/21+ν/2.\left\|\sqrt{\bar{t}(CC^{\dagger})}R\right\|=\sqrt{\left\|R^{\dagger}\bar{t}(CC^{\dagger})R\right\|}\leq\sqrt{\left\|t(A^{\dagger}A)\right\|+\nu/2}\leq\sqrt{1+\nu/2}. (18)

We set A^=ARt¯(CC)R\hat{A}=A^{\prime}R^{\dagger}\bar{t}(CC^{\dagger})R, where Am×nA^{\prime}\in\mathbb{R}^{m\times n} is defined below. First, observe that since we have 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A), for each j{1,,m}j\in\{1,\ldots,m\} we have 𝖮𝖲𝖰ε,ϕ(A(j,))\mathsf{OSQ}_{\varepsilon,\phi}(A(j,\cdot)) and thus 𝖮𝖲𝖰ε,ϕ(A(j,))\mathsf{OSQ}_{\varepsilon,\phi}(A(j,\cdot)^{\dagger}) as well. For some integer r{1,,m}r^{\prime}\in\{1,\ldots,m\}, we consider the (r,ε,ϕ,ϕ)(r^{\prime},\varepsilon,\phi,\phi^{\prime})-approximate joint matrix sketch Σjr×n\Sigma_{j}\in\mathbb{R}^{r^{\prime}\times n} of A(j,)A(j,\cdot)^{\dagger} and RR^{\dagger} associated with 𝖮𝖲𝖰ε,ϕ(A(j,))\mathsf{OSQ}_{\varepsilon,\phi}(A(j,\cdot)^{\dagger}) and 𝖮𝖲𝖰ε,ϕ(R)\mathsf{OSQ}_{\varepsilon,\phi^{\prime}}(R^{\dagger}). We then set the jj-th row of the matrix AA^{\prime} as A(j,)=A(j,)ΣjΣjA^{\prime}(j,\cdot)=A(j,\cdot)\Sigma_{j}^{\dagger}\Sigma_{j}. By taking

r\displaystyle r^{\prime} =Θ(Kϕ2log(m/δ)ν2),\displaystyle=\Theta\left(\frac{K\phi^{2}\log(m/\delta)}{\nu^{2}}\right),

Theorem 2 guarantees that for each j{1,,m}j\in\{1,\ldots,m\}, the inequality

A(j,)RA(j,)R(2εϕϕ+σν16AF)A(j,)RFσνA(j,)4\left\|A^{\prime}(j,\cdot)R^{\dagger}-A(j,\cdot)R^{\dagger}\right\|\leq\left(2\varepsilon\sqrt{\phi\phi^{\prime}}+\frac{\sigma\nu}{16\left\|A\right\|_{\mathrm{F}}}\right)\left\|A(j,\cdot)^{\dagger}\right\|\left\|R^{\dagger}\right\|_{\mathrm{F}}\leq\frac{\sigma\nu\left\|A(j,\cdot)\right\|}{4} (19)

holds with probability at least 1δ/(5m)1-\delta/(5m), since 2εϕϕ2ϕνξ24ϕK2σν16AF2\varepsilon\sqrt{\phi\phi^{\prime}}\leq\frac{\sqrt{2}\phi\nu\xi}{24\phi K^{2}}\leq\frac{\sigma\nu}{16\left\|A\right\|_{\mathrm{F}}}. The inequality

ARARF=j=1mA(j,)RA(j,)R2σνAF4\left\|A^{\prime}R^{\dagger}-AR^{\dagger}\right\|_{\mathrm{F}}=\sqrt{\sum_{j=1}^{m}\left\|A^{\prime}(j,\cdot)R^{\dagger}-A(j,\cdot)R^{\dagger}\right\|^{2}}\leq\frac{\sigma\nu\left\|A\right\|_{\mathrm{F}}}{4} (20)

thus holds with probability at least 1δ/51-\delta/5.
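
The step above is an instance of approximate matrix multiplication by importance sampling: A(j,\cdot)\Sigma_{j}^{\dagger}\Sigma_{j}R^{\dagger} is an unbiased estimator of A(j,\cdot)R^{\dagger} when the r^{\prime} sampled coordinates are drawn from a suitable mixture of the length-squared distributions of A(j,\cdot)^{\dagger} and of the rows of R^{\dagger}. The following exact-sampling sketch illustrates the idea (the mixture weights and the names are our own choices; Theorem 2 is the robust statement actually used above):

```python
import numpy as np

rng = np.random.default_rng(1)

def approx_row_times_matrix(a_row, Rt, num_samples):
    """Estimate a_row @ Rt (a_row of length n, Rt an n x r matrix) by sampling
    num_samples coordinates from a mixture of the length-squared distributions
    of a_row and of the rows of Rt, then rescaling.  The estimator is unbiased,
    and its standard deviation scales as ||a_row|| * ||Rt||_F / sqrt(num_samples),
    matching the form of Inequality (19)."""
    n = a_row.shape[0]
    p_a = a_row ** 2 / np.sum(a_row ** 2)
    p_R = np.sum(Rt ** 2, axis=1) / np.sum(Rt ** 2)
    p = 0.5 * (p_a + p_R)                       # joint importance distribution
    idx = rng.choice(n, size=num_samples, p=p)
    scale = 1.0 / np.sqrt(num_samples * p[idx])
    # (a_row Sigma^T)(Sigma Rt), with Sigma the rescaled sampling matrix:
    return (a_row[idx] * scale) @ (Rt[idx, :] * scale[:, None])
```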

Putting everything together, the following holds with probability at least 13δ/51-3\delta/5:

A^Aσ,ξF\displaystyle\left\|\hat{A}-A_{\sigma,\xi}\right\|_{\mathrm{F}} =(ARAR)t¯(CC)R+ARt¯(CC)RAt(AA)F\displaystyle=\left\|(A^{\prime}R^{\dagger}-AR^{\dagger})\bar{t}(CC^{\dagger})R+AR^{\dagger}\bar{t}(CC^{\dagger})R-A\>t(A^{\dagger}A)\right\|_{\mathrm{F}}
(ARAR)t¯(CC)RF+ARt¯(CC)RAt(AA)F\displaystyle\leq\left\|(A^{\prime}R^{\dagger}-AR^{\dagger})\bar{t}(CC^{\dagger})R\right\|_{\mathrm{F}}+\left\|AR^{\dagger}\bar{t}(CC^{\dagger})R-A\>t(A^{\dagger}A)\right\|_{\mathrm{F}}
ARARFt¯(CC)t¯(CC)R+ARt¯(CC)Rt(AA)F\displaystyle\leq\left\|A^{\prime}R^{\dagger}-AR^{\dagger}\right\|_{\mathrm{F}}\left\|\sqrt{\bar{t}(CC^{\dagger})}\right\|\left\|\sqrt{\bar{t}(CC^{\dagger})}R\right\|+\left\|A\right\|\left\|R^{\dagger}\bar{t}(CC^{\dagger})R-t(A^{\dagger}A)\right\|_{\mathrm{F}}
σνAF41σ1+ν/2+Aν2\displaystyle\leq\frac{\sigma\nu\left\|A\right\|_{\mathrm{F}}}{4}\frac{1}{\sigma}\sqrt{1+\nu/2}+\left\|A\right\|\frac{\nu}{2}
νAF,\displaystyle\leq\nu\left\|A\right\|_{\mathrm{F}},

where we used Inequalities (16), (17), (18) and (20) to derive the third inequality. We assume below that this upper bound holds.

Finally, we explain how to sample from the row-vector A^(i,)\hat{A}(i,\cdot). We first compute explicitly the row-vector x=A(i,)Rt¯(CC)=A(i,)ΣiΣiRt¯(CC)1×rx=A^{\prime}(i,\cdot)R^{\dagger}\bar{t}(CC^{\dagger})=A(i,\cdot)\Sigma_{i}^{\dagger}\Sigma_{i}R^{\dagger}\bar{t}(CC^{\dagger})\in\mathbb{R}^{1\times r} as follows. We compute the matrix CC explicitly, then compute the singular value decomposition of CCCC^{\dagger}, and then the matrix t¯(CC)\bar{t}(CC^{\dagger}). This can be done in O(r2c)O(r^{2}c) time. Then, we compute the vector A(i,)ΣiΣiRA(i,\cdot)\Sigma_{i}^{\dagger}\Sigma_{i}R^{\dagger} explicitly in O(r+rr)O(r^{\prime}+rr^{\prime}) time by computing the row-vector A(i,)Σi1×rA(i,\cdot)\Sigma_{i}^{\dagger}\in\mathbb{R}^{1\times r^{\prime}} and the matrix ΣiRr×r\Sigma_{i}R^{\dagger}\in\mathbb{R}^{r^{\prime}\times r}, and then computing their product. We then compute xx by multiplying this vector by the matrix t¯(CC)\bar{t}(CC^{\dagger}), which can be done in O(r2)O(r^{2}) time. Finally, in order to sample from A^(i,)=xR\hat{A}(i,\cdot)=xR, we use Proposition 5. We can get 𝖮𝖲𝖰ε,φ(xR)\mathsf{OSQ}_{\varepsilon,\varphi}(xR) with

φ\displaystyle\varphi =r=1rϕ|x|2R(,)2A^(i,)2=1rϕ2|x|2AF2A^(i,)2=ϕ2x2AF2A^(i,)2\displaystyle=r\frac{\sum_{\ell=1}^{r}\phi\left|x_{\ell}\right|^{2}\left\|R(\ell,\cdot)\right\|^{2}}{\left\|\hat{A}(i,\cdot)\right\|^{2}}\leq\frac{\sum_{\ell=1}^{r}\phi^{2}\left|x_{\ell}\right|^{2}\left\|A\right\|_{\mathrm{F}}^{2}}{\left\|\hat{A}(i,\cdot)\right\|^{2}}=\frac{\phi^{2}\left\|x\right\|^{2}\left\|A\right\|_{\mathrm{F}}^{2}}{\left\|\hat{A}(i,\cdot)\right\|^{2}} (21)

at cost O(r)O(r), where we used Inequality (2). We can upper bound φ\varphi as follows:

φ\displaystyle\varphi (A(i,)Rt¯(CC)+(A(i,)A(i,))Rt¯(CC))2ϕ2AF2A^(i,)2\displaystyle\leq\left(\left\|A(i,\cdot)R^{\dagger}\bar{t}(CC^{\dagger})\right\|+\left\|(A^{\prime}(i,\cdot)-A(i,\cdot))R^{\dagger}\bar{t}(CC^{\dagger})\right\|\right)^{2}\frac{\phi^{2}\left\|A\right\|_{\mathrm{F}}^{2}}{\left\|\hat{A}(i,\cdot)\right\|^{2}}
(A(i,)Rt¯(CC)t¯(CC)+(A(i,)A(i,))RFt¯(CC))2ϕ2AF2A^(i,)2.\displaystyle\leq\left(\left\|A(i,\cdot)\right\|\left\|R^{\dagger}\sqrt{\bar{t}(CC^{\dagger})}\right\|\left\|\sqrt{\bar{t}(CC^{\dagger})}\right\|+\left\|(A^{\prime}(i,\cdot)-A(i,\cdot))R^{\dagger}\right\|_{\mathrm{F}}\left\|\bar{t}(CC^{\dagger})\right\|\right)^{2}\frac{\phi^{2}\left\|A\right\|_{\mathrm{F}}^{2}}{\left\|\hat{A}(i,\cdot)\right\|^{2}}.

Using (16), (18) and (19), we obtain

φ\displaystyle\varphi (A(i,)1+ν/2σ+σνA(i,)41σ2)2ϕ2AF2A^(i,)2\displaystyle\leq\left(\left\|A(i,\cdot)\right\|\frac{\sqrt{1+\nu/2}}{\sigma}+\frac{\sigma\nu\left\|A(i,\cdot)\right\|}{4}\frac{1}{\sigma^{2}}\right)^{2}\frac{\phi^{2}\left\|A\right\|_{\mathrm{F}}^{2}}{\left\|\hat{A}(i,\cdot)\right\|^{2}}
(1+ν/2+ν4)2ϕ2A(i,)2AF2σ2A^(i,)2\displaystyle\leq\left(\sqrt{1+\nu/2}+\frac{\nu}{4}\right)^{2}\frac{\phi^{2}\left\|A(i,\cdot)\right\|^{2}\left\|A\right\|_{\mathrm{F}}^{2}}{\sigma^{2}\left\|\hat{A}(i,\cdot)\right\|^{2}}
3ϕ2A(i,)2AF2σ2A^(i,)2\displaystyle\leq 3\phi^{2}\frac{\left\|A(i,\cdot)\right\|^{2}\left\|A\right\|_{\mathrm{F}}^{2}}{\sigma^{2}\left\|\hat{A}(i,\cdot)\right\|^{2}}
=3Kϕ2A(i,)2A^(i,)2.\displaystyle=3K\phi^{2}\frac{\left\|A(i,\cdot)\right\|^{2}}{\left\|\hat{A}(i,\cdot)\right\|^{2}}.

Finally, we use Proposition 3 to implement 𝖱𝖲𝖰ε,δ5,1(xR){\mathsf{RSQ}_{\varepsilon^{\prime},\frac{\delta}{5},1}}(xR) with ε3εφ\varepsilon^{\prime}\leq 3\varepsilon\varphi at cost O(φrlog(1/δ))O(\varphi r\log(1/\delta)). In particular, we can sample with probability at least 1δ/51-\delta/5 (which gives overall success probability at least 14δ/51-4\delta/5) from a distribution pp such that

|ppxR|tv3εφδ/5.|p-p_{xR}|_{\mathrm{tv}}\leq 3\varepsilon\varphi\leq\delta/5.

Accounting for the failure probability in the total variation distance, we get an overall total variation distance of at most δ\delta, as desired (recall that xR=A^(i,)xR=\hat{A}(i,\cdot)). The overall complexity is

O~(r2c+rr+r2+φrlog(1/δ))=O~(K8ϕ6log3(1/δ)ξ6ν6+K3ϕ6log2(1/δ)ξ2ν2A(i,)2A^(i,)2)\tilde{O}\left(r^{2}c+rr^{\prime}+r^{2}+\varphi r\log(1/\delta)\right)=\tilde{O}\left(\frac{K^{8}\phi^{6}\log^{3}(1/\delta)}{\xi^{6}\nu^{6}}+\frac{K^{3}\phi^{6}\log^{2}(1/\delta)}{\xi^{2}\nu^{2}}\frac{\left\|A(i,\cdot)\right\|^{2}}{\left\|\hat{A}(i,\cdot)\right\|^{2}}\right),

as claimed. ∎
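
For intuition on the last step of this proof, the following is the standard rejection-sampling routine (cf. [39, 5]) for sampling an index of xR given length-squared access to the rows of R, written here with exact sampling probabilities and exact queries to the entries of xR; Propositions 3 and 5 used above are the analogues that tolerate approximate access. The expected number of rejection rounds is exactly the oversampling factor \varphi bounded in the proof.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_index_from_xR(x, R, max_rounds=100000):
    """Return an index j with probability proportional to (xR)_j^2, using
    length-squared samples from the rows of R plus single-entry queries.
    A proposed j is accepted with probability (xR)_j^2 / (r * sum_l x_l^2 R_{l,j}^2),
    which is at most 1 by the Cauchy-Schwarz inequality; the accepted index is
    then distributed exactly as p_{xR}."""
    r, n = R.shape
    row_norms2 = np.sum(R ** 2, axis=1)
    weights = x ** 2 * row_norms2
    weights = weights / weights.sum()
    y = x @ R    # in the sublinear algorithm, entries of xR are queried on demand
    for _ in range(max_rounds):
        l = rng.choice(r, p=weights)                       # pick a row of R
        j = rng.choice(n, p=R[l] ** 2 / row_norms2[l])     # length-squared sample from it
        if rng.random() < y[j] ** 2 / (r * np.sum(x ** 2 * R[:, j] ** 2)):
            return j
    raise RuntimeError("rejection sampling did not terminate")
```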

6.6 Dequantizing quantum algorithms for low-rank matrix inversion

In this subsection we consider the following problem. We again write K=AF2/σ2K=\left\|A\right\|_{\mathrm{F}}^{2}/\sigma^{2} and assume that K1K\geq 1.

Matrix Inversion
Input: a nonzero matrix Am×nA\in\mathbb{R}^{m\times n}, a vector bmb\in\mathbb{C}^{m}, and four parameters σ,δ(0,1]\sigma,\delta\in(0,1] and η,ξ(0,1/2]\eta,\xi\in(0,1/2].
Assumption: bWσ(A)b\in W_{\sigma}(A), where Wσ(A)W_{\sigma}(A) denotes the subspace of m\mathbb{C}^{m} spanned by the left singular vectors of AA associated with singular values at least σ\sigma.
Goal: find a vector x^\hat{x} such that x^xηx\left\|\hat{x}-x^{\ast}\right\|\leq\eta\left\|x^{\ast}\right\| (22) holds for x=Aσ,ξ+bx^{\ast}=A^{+}_{\sigma,\xi}b with probability at least 1δ1-\delta, where Aσ,ξ+=f(A)A^{+}_{\sigma,\xi}=f(A) for some arbitrary function ff such that {f(x)=1/x if xσf(x)=0 if x<σ(1ξ)f(x)[0,1/σ] if x[σ(1ξ),σ)\begin{cases}f(x)=1/x&\textrm{ if }x\geq\sigma\\ f(x)=0&\textrm{ if }x<\sigma(1-\xi)\\ f(x)\in[0,1/\sigma]&\textrm{ if }x\in[\sigma(1-\xi),\sigma)\end{cases} (23) holds.
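
As for the recommendation problem, it is useful to keep in mind a dense (non-sublinear) numpy sketch of the target vector x^{\ast}=A^{+}_{\sigma,\xi}b, for one valid choice of the function f in Equation (23); the algorithm of Theorem 11 below never computes the SVD of A.

```python
import numpy as np

def regularized_solve(A, b, sigma, xi):
    """Compute x* = A^+_{sigma,xi} b for one valid choice of f in Equation (23):
    invert singular values >= sigma, zero out those below sigma(1-xi), and
    interpolate linearly in between (values stay in [0, 1/sigma])."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    f = np.zeros_like(s)
    high = s >= sigma
    mid = (~high) & (s >= sigma * (1 - xi))
    f[high] = 1.0 / s[high]
    f[mid] = (s[mid] - sigma * (1 - xi)) / (xi * sigma ** 2)
    return Vt.T @ (f * (U.T @ b))
```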

Note that the assumption bWσ(A)b\in W_{\sigma}(A) is only needed to ensure that the inequality

bAx\frac{\left\|b\right\|}{\left\|A\right\|}\leq\left\|x^{\ast}\right\| (24)

holds. This inequality makes easier to state the complexity of our algorithm and has been used (implicitly) in [5] as well.

The problem Matrix Inversion can be considered as a low-rank version of the matrix inversion problem solved by the HHL quantum algorithm [23], which has been studied in several dequantization works [4, 5, 22]. Ref. [5], in particular, showed that given perfect oversampling-and-query access to AA (i.e., 𝖮𝖲𝖰0,ϕ(A)\mathsf{OSQ}_{0,\phi}(A)) and 𝖰(b)\mathsf{Q}(b) at cost osq(A)=q(b)=O~(1)\textbf{osq}(A)=\textbf{q}(b)=\tilde{O}(1), after a preprocessing step running in O~(K14ϕ6log3(1/δ)ξ6η6)\tilde{O}\left(\frac{K^{14}\phi^{6}\log^{3}(1/\delta)}{\xi^{6}\eta^{6}}\right) time it is possible to implement 𝖱𝖲𝖰0,δ,η(x^){\mathsf{RSQ}_{0,\delta^{\prime},\eta^{\prime}}}(\hat{x}) for a vector x^\hat{x} satisfying Condition (22) with probability at least 1δ1-\delta at cost

O~(K7ϕ4log(1/δ)log(1/δ)ξ2η2η2x2x^2),\tilde{O}\left(\frac{K^{7}\phi^{4}\log(1/\delta)\log(1/\delta^{\prime})}{\xi^{2}\eta^{2}\eta^{\prime 2}}\frac{\left\|x^{\ast}\right\|^{2}}{\left\|\hat{x}\right\|^{2}}\right),

for any δ,η(0,1)\delta^{\prime},\eta^{\prime}\in(0,1). We show how to obtain the same complexity when given only 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) for some ε>0\varepsilon>0, provided ε\varepsilon is small enough.

Theorem 11.

There exists an absolute constant τ(0,1]\tau\in(0,1] such that the following holds. Assume that we have 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) and 𝖰(b)\mathsf{Q}(b) at cost osq(A)=q(b)=O~(1)\textbf{osq}(A)=\textbf{q}(b)=\tilde{O}(1), for some ϕ1\phi\geq 1 and ε0\varepsilon\geq 0. For any ε,δ,η(0,1)\varepsilon^{\prime},\delta^{\prime},\eta^{\prime}\in(0,1), if

ετη2K3 and εετϕ2K3\varepsilon\leq\frac{\tau\eta^{2}}{K^{3}}\hskip 11.38109pt\textrm{ and }\hskip 11.38109pt\varepsilon\leq\varepsilon^{\prime}\cdot\frac{\tau}{\phi^{2}K^{3}} (25)

then there is a classical algorithm that after a preprocessing stage running in time

O(K14ϕ6log3(1/δ)ξ6η6)O\left(\frac{K^{14}\phi^{6}\log^{3}(1/\delta)}{\xi^{6}\eta^{6}}\right)

implements 𝖱𝖲𝖰ε,δ,η(x^){\mathsf{RSQ}_{\varepsilon^{\prime},\delta^{\prime},\eta^{\prime}}}(\hat{x}) for a vector x^\hat{x} satisfying Condition (22) with probability at least 1δ1-\delta at cost

O(K7ϕ4log(1/δ)log(1/δ)ξ2η2η2x2x^2).O\left(\frac{K^{7}\phi^{4}\log(1/\delta)\log(1/\delta^{\prime})}{\xi^{2}\eta^{2}\eta^{\prime 2}}\frac{\left\|x^{\ast}\right\|^{2}}{\left\|\hat{x}\right\|^{2}}\right).
Proof.

Our proof follows the same approach as in [5]. We define the function

{t(x)=1/x if xσ2,t(x)=x(1ξ)2σ2ξ(2ξ)σ4 if x[(1ξ)2σ2,σ2),t(x)=0 if x<(1ξ)2σ2,\begin{cases}t(x)=1/x&\textrm{ if }x\geq\sigma^{2},\\ t(x)=\frac{x-(1-\xi)^{2}\sigma^{2}}{\xi(2-\xi)\sigma^{4}}&\textrm{ if }x\in[(1-\xi)^{2}\sigma^{2},\sigma^{2}),\\ t(x)=0&\textrm{ if }x<(1-\xi)^{2}\sigma^{2},\end{cases}

and set Aσ,ξ+=t(AA)AA^{+}_{\sigma,\xi}=t(A^{\dagger}A)\>A^{\dagger}, so that x=t(AA)Abx^{\ast}=t(A^{\dagger}A)\>A^{\dagger}b. Note that this matrix satisfies the conditions of Equation (23). Observe that t(x)t(x) is LL-Lipschitz over \mathbb{R} and t¯(x)\bar{t}(x) is L¯\bar{L}-Lipschitz over \mathbb{R} with

L=1ξ(2ξ)σ4 and L¯=1ξ(2ξ)(1ξ)2σ6.L=\frac{1}{\xi(2-\xi)\sigma^{4}}\hskip 14.22636pt\textrm{ and }\hskip 14.22636pt\bar{L}=\frac{1}{\xi(2-\xi)(1-\xi)^{2}\sigma^{6}}.

Also observe that

maxx{t(x)}=1σ2 and maxx{t¯(x)}=1σ4\max_{x\in\mathbb{R}}\{t(x)\}=\frac{1}{\sigma^{2}}\hskip 14.22636pt\textrm{ and }\hskip 14.22636pt\max_{x\in\mathbb{R}}\{\bar{t}(x)\}=\frac{1}{\sigma^{4}} (26)

We will set

r\displaystyle r =Θ(ϕ2AF8L2log(1/δ)η2)=Θ(ϕ2K4log(1/δ)ξ2η2),\displaystyle=\Theta\left(\frac{\phi^{2}\left\|A\right\|_{\mathrm{F}}^{8}L^{2}\log(1/\delta)}{\eta^{2}}\right)=\Theta\left(\frac{\phi^{2}K^{4}\log(1/\delta)}{\xi^{2}\eta^{2}}\right),
c\displaystyle c =Θ(ϕ2AF12L¯2log(1/δ)η2)=Θ(ϕ2K6log(1/δ)ξ2η2).\displaystyle=\Theta\left(\frac{\phi^{2}\left\|A\right\|_{\mathrm{F}}^{12}\bar{L}^{2}\log(1/\delta)}{\eta^{2}}\right)=\Theta\left(\frac{\phi^{2}K^{6}\log(1/\delta)}{\xi^{2}\eta^{2}}\right).

Consider the (r,ε,ϕ)(r,\varepsilon,\phi)-approximate importance sketch Sr×mS\in\mathbb{R}^{r\times m} of AA associated with 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) and write R=SAR=SA. Observe that by taking the constant τ\tau small enough, any ε\varepsilon satisfying Condition (25) also satisfies the condition 2εϕ1/32\varepsilon\phi\leq 1/3. From Lemma 5, we can then guarantee that with probability at least 1δ/31-\delta/3

RF2[12AF2,32AF2]\left\|R\right\|_{\mathrm{F}}^{2}\in\left[\frac{1}{2}\left\|A\right\|_{\mathrm{F}}^{2},\frac{3}{2}\left\|A\right\|_{\mathrm{F}}^{2}\right] (27)

holds. We assume below that this is the case. From Lemma 8, we can implement 𝖮𝖲𝖰ε,ϕ(R)\mathsf{OSQ}_{\varepsilon,\phi^{\prime}}(R^{\dagger}) with ϕ2ϕ\phi^{\prime}\leq 2\phi at cost O~(r)\tilde{O}(r). We now consider the (c,ε,ϕ)(c,\varepsilon,\phi^{\prime})-approximate importance sketch Tc×nT\in\mathbb{R}^{c\times n} of RR^{\dagger} associated with 𝖮𝖲𝖰ε,ϕ(R)\mathsf{OSQ}_{\varepsilon,\phi^{\prime}}(R^{\dagger}) and write C=SATC=SAT^{\dagger}. Observe that by taking the constant τ\tau small enough, any ε\varepsilon satisfying Condition (25) also satisfies Condition (4) of Theorem 3 with γ=η2AF2\gamma=\frac{\eta}{2\left\|A\right\|_{\mathrm{F}}^{2}} and χ=\chi=\infty. Theorem 3 thus guarantees that the inequality

Rt¯(CC)Rt(AA)Fη2AF2\left\|R^{\dagger}\bar{t}(CC^{\dagger})R-t(A^{\dagger}A)\right\|_{\mathrm{F}}\leq\frac{\eta}{2\left\|A\right\|_{\mathrm{F}}^{2}} (28)

holds with probability at least 1δ/31-\delta/3. Assume below that this is the case.

Observe that

Rt¯(CC)Rt¯(CC)t¯(CC)t(AA)+η2AF21σ22σ3,\left\|R^{\dagger}\bar{t}(CC^{\dagger})\right\|\leq\left\|R^{\dagger}\sqrt{\bar{t}(CC^{\dagger})}\right\|\left\|\sqrt{\bar{t}(CC^{\dagger})}\right\|\leq\sqrt{\left\|t(AA^{\dagger})\right\|+\frac{\eta}{2\left\|A\right\|_{\mathrm{F}}^{2}}}\cdot\frac{1}{\sigma^{2}}\leq\frac{2}{\sigma^{3}}, (29)

where we used the assumption η1/2\eta\leq 1/2 (which implies ηK/2\eta\leq K/2 since K1K\geq 1).

We now explain how to get an approximation uu of the vector RAbrRA^{\dagger}b\in\mathbb{C}^{r}. For each i{1,,r}i\in\{1,\ldots,r\}, observe that bAR(i,)=R(i,)Abb^{\dagger}AR(i,\cdot)^{\dagger}=R(i,\cdot)A^{\dagger}b. Note that by taking the constant τ\tau small enough, we can guarantee that any ε\varepsilon satisfying Condition (25) also satisfies the inequality εη64K3/2\sqrt{\varepsilon}\leq\frac{\eta}{64K^{3/2}}. Since we have 𝖮𝖲𝖰ε,ϕ(A)\mathsf{OSQ}_{\varepsilon,\phi}(A) and both 𝖰(R(i,))\mathsf{Q}(R(i,\cdot)^{\dagger}) and 𝖰(b)\mathsf{Q}(b), we use Corollary 1 to compute at cost O(ϕK3log(r/δ)η2)O\left(\frac{\phi K^{3}\log(r/\delta)}{\eta^{2}}\right) an estimate αi\alpha_{i} such that

|αiR(i,)Ab|(4ε+η16K3/2)R(i,)AFbη8K3/2R(i,)AFb\left|\alpha_{i}-R(i,\cdot)A^{\dagger}b\right|\leq\left(4\sqrt{\varepsilon}+\frac{\eta}{16K^{3/2}}\right)\left\|R(i,\cdot)\right\|\left\|A\right\|_{\mathrm{F}}\left\|b\right\|\leq\frac{\eta}{8K^{3/2}}\left\|R(i,\cdot)\right\|\left\|A\right\|_{\mathrm{F}}\left\|b\right\|

holds with probability at least 1δ/(3r)1-\delta/(3r). We set u(i)=αiu(i)=\alpha_{i}. The total time complexity of computing the whole vector uu is

O(rϕK3log(r/δ)η2)=O~(ϕ3K7log2(1/δ)ξ2η4).O\left(r\frac{\phi K^{3}\log(r/\delta)}{\eta^{2}}\right)=\tilde{O}\left(\frac{\phi^{3}K^{7}\log^{2}(1/\delta)}{\xi^{2}\eta^{4}}\right). (30)

Then with probability at least 1δ/31-\delta/3 we have

RAbu=i=1r|αiR(i,)Ab|2ησ38AF3RFAFbησ34x,\left\|RA^{\dagger}b-u\right\|=\sqrt{\sum_{i=1}^{r}\left|\alpha_{i}-R(i,\cdot)A^{\dagger}b\right|^{2}}\leq\frac{\eta\sigma^{3}}{8\left\|A\right\|_{\mathrm{F}}^{3}}\left\|R\right\|_{\mathrm{F}}\left\|A\right\|_{\mathrm{F}}\left\|b\right\|\leq\frac{\eta\sigma^{3}}{4}\left\|x^{\ast}\right\|, (31)

where the last inequality uses (24) and (27).
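
For intuition, each estimate \alpha_{i} above can be obtained from the standard length-squared importance-sampling estimator for bilinear forms in A (Corollary 1, used above, is a robust version of this estimator): sample an entry (k,j) of A with probability A(k,j)^{2}/\left\|A\right\|_{\mathrm{F}}^{2} and average \left\|A\right\|_{\mathrm{F}}^{2}\,b(k)\,v(j)/A(k,j), where v plays the role of R(i,\cdot)^{\dagger}. A minimal sketch in the real, exact-sampling case (names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def estimate_bilinear_form(A, b, v, num_samples):
    """Unbiased estimator of b^T A v using length-squared samples of the entries
    of A: sample (k, j) with probability A[k, j]^2 / ||A||_F^2 and average
    ||A||_F^2 * b[k] * v[j] / A[k, j].  The standard deviation of the average is
    at most ||A||_F ||b|| ||v|| / sqrt(num_samples), which is roughly why the
    proof uses on the order of K^3 / eta^2 samples per entry of u (up to the
    phi and log factors)."""
    fro2 = np.sum(A ** 2)
    row_p = np.sum(A ** 2, axis=1) / fro2
    total = 0.0
    for _ in range(num_samples):
        k = rng.choice(A.shape[0], p=row_p)
        j = rng.choice(A.shape[1], p=A[k] ** 2 / np.sum(A[k] ** 2))
        total += fro2 * b[k] * v[j] / A[k, j]
    return total / num_samples
```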

We set x^=Rt¯(CC)u\hat{x}=R^{\dagger}\bar{t}(CC^{\dagger})u. Putting everything together, the following inequalities hold with probability at least 1δ1-\delta:

x^x\displaystyle\left\|\hat{x}-x^{\ast}\right\| Rt¯(CC)(RAbu)+Rt¯(CC)RAbx\displaystyle\leq\left\|R^{\dagger}\bar{t}(CC^{\dagger})(RA^{\dagger}b-u)\right\|+\left\|R^{\dagger}\bar{t}(CC^{\dagger})RA^{\dagger}b-x^{\ast}\right\|
Rt¯(CC)RAbu+Rt¯(CC)Rt(AA)Ab\displaystyle\leq\left\|R^{\dagger}\bar{t}(CC^{\dagger})\right\|\left\|RA^{\dagger}b-u\right\|+\left\|R^{\dagger}\bar{t}(CC^{\dagger})R-t(A^{\dagger}A)\right\|\left\|A^{\dagger}b\right\|
η2x+η2AF2Ab\displaystyle\leq\frac{\eta}{2}\left\|x^{\ast}\right\|+\frac{\eta}{2\left\|A\right\|_{\mathrm{F}}^{2}}\left\|A\right\|\left\|b\right\|
ηx,\displaystyle\leq\eta\left\|x^{\ast}\right\|,

where we used (28), (29) and (31) for the third inequality, and (24) for the last inequality.

Finally, we explain how to implement sampling-and-query access to x^\hat{x}. We can compute explicitly the matrix t¯(CC)\bar{t}(CC^{\dagger}) in O(r2c)O(r^{2}c) time, and then compute the vector t¯(CC)ur\bar{t}(CC^{\dagger})u\in\mathbb{C}^{r} in O(r2)O(r^{2}) time. The overall complexity of this step is

O(ϕ6K14log3(1/δ)ξ6η6),O\left(\frac{\phi^{6}K^{14}\log^{3}(1/\delta)}{\xi^{6}\eta^{6}}\right),

which dominates the cost (30) of computing the vector uu and thus determines the complexity of the preprocessing stage.

Since x^\hat{x} is a linear combination of the columns of RR^{\dagger} (i.e., the rows of RR), we can then use Proposition 5 to implement 𝖮𝖲𝖰ε,φ(x^)\mathsf{OSQ}_{\varepsilon,\varphi}(\hat{x}) at cost O(r)O(r) with

φ=ri=1r2ϕ|[t¯(CC)u](i)|2R(i,)2x^24ϕ2t¯(CC)u2AF2x^2,\displaystyle\varphi=r\frac{\sum_{i=1}^{r}2\phi|[\bar{t}(CC^{\dagger})u](i)|^{2}\left\|R(i,\cdot)^{\dagger}\right\|^{2}}{\left\|\hat{x}\right\|^{2}}\leq\frac{4\phi^{2}\left\|\bar{t}(CC^{\dagger})u\right\|^{2}\left\|A\right\|_{\mathrm{F}}^{2}}{\left\|\hat{x}\right\|^{2}},

where we used Equation (2). Using (26), (29) and (31) we further have:

φ\displaystyle\varphi =O(ϕ2(t¯(CC)RAb+t¯(CC)(RAbu))2AF2x^2)\displaystyle=O\left(\frac{\phi^{2}(\left\|\bar{t}(CC^{\dagger})RA^{\dagger}b\right\|+\left\|\bar{t}(CC^{\dagger})(RA^{\dagger}b-u)\right\|)^{2}\left\|A\right\|_{\mathrm{F}}^{2}}{\left\|\hat{x}\right\|^{2}}\right)
=O(ϕ2(t¯(CC)RAb+t¯(CC)RAbu)2AF2x^2)\displaystyle=O\left(\frac{\phi^{2}(\left\|\bar{t}(CC^{\dagger})R\right\|\left\|A^{\dagger}\right\|\left\|b\right\|+\left\|\bar{t}(CC^{\dagger})\right\|\left\|RA^{\dagger}b-u\right\|)^{2}\left\|A\right\|_{\mathrm{F}}^{2}}{\left\|\hat{x}\right\|^{2}}\right)
=O(ϕ2(2σ3Ab+1σ4ησ34x)2AF2x^2)\displaystyle=O\left(\frac{\phi^{2}\left(\frac{2}{\sigma^{3}}\left\|A\right\|\left\|b\right\|+\frac{1}{\sigma^{4}}\frac{\eta\sigma^{3}}{4}\left\|x^{\ast}\right\|\right)^{2}\left\|A\right\|_{\mathrm{F}}^{2}}{\left\|\hat{x}\right\|^{2}}\right)
=O(ϕ2(2σ3AF2x+1σ4ησ34x)2AF2x^2)\displaystyle=O\left(\frac{\phi^{2}\left(\frac{2}{\sigma^{3}}\left\|A\right\|_{\mathrm{F}}^{2}\left\|x^{\ast}\right\|+\frac{1}{\sigma^{4}}\frac{\eta\sigma^{3}}{4}\left\|x^{\ast}\right\|\right)^{2}\left\|A\right\|_{\mathrm{F}}^{2}}{\left\|\hat{x}\right\|^{2}}\right)

Using the assumption η1/2\eta\leq 1/2 (which again implies ηK/2\eta\leq K/2 since K1K\geq 1), we obtain

φ=O(ϕ2K3x2x^2).\varphi=O\left(\frac{\phi^{2}K^{3}\left\|x^{\ast}\right\|^{2}}{\left\|\hat{x}\right\|^{2}}\right).

When the inequality x^xηx\left\|\hat{x}-x^{\ast}\right\|\leq\eta\left\|x^{\ast}\right\| holds (which happens with probability at least 1δ1-\delta), we get the upper bound φ=O(ϕ2K3)\varphi=O(\phi^{2}K^{3}) since by assumption we have η1/2\eta\leq 1/2. By taking the constant τ\tau small enough, any ε\varepsilon satisfying Condition (25) also satisfies the inequality 3εφε3\varepsilon\varphi\leq\varepsilon^{\prime}. We then use Proposition 3 to implement 𝖱𝖲𝖰ε,δ,η(x^){\mathsf{RSQ}_{\varepsilon^{\prime},\delta^{\prime},\eta^{\prime}}}(\hat{x}) at cost

O(φlog(1/δ)η2r)=O(ϕ4K7log(1/δ)log(1/δ)ξ2η2η2x2x^2),O\left(\frac{\varphi\log(1/\delta^{\prime})}{\eta^{\prime 2}}r\right)=O\left(\frac{\phi^{4}K^{7}\log(1/\delta)\log(1/\delta^{\prime})}{\xi^{2}\eta^{2}\eta^{\prime 2}}\frac{\left\|x^{\ast}\right\|^{2}}{\left\|\hat{x}\right\|^{2}}\right),

as claimed. ∎

Acknowledgments

The author is grateful to András Gilyén, Will Rosenbaum, Ewin Tang and Santosh Vempala for helpful correspondence. He also thanks Ansis Rosmanis for valuable discussions. The author was supported by JSPS KAKENHI grants Nos. JP19H04066, JP20H05966, JP20H00579, JP20H04139, JP21H04879, JP24H00071 and MEXT Quantum Leap Flagship Program (MEXT Q-LEAP) grant No. JPMXS0120319794.

References

  • [1] Scott Aaronson. Read the fine print. Nature Physics, 11:291–293, 2015.
  • [2] Ainesh Bakshi and Ewin Tang. An improved classical singular value transformation for quantum machine learning. In Proceedings of the 2024 ACM-SIAM Symposium on Discrete Algorithms (SODA 2024), pages 2398–2453, 2024.
  • [3] Zhihuai Chen, Yinan Li, Xiaoming Sun, Pei Yuan, and Jialin Zhang. A quantum-inspired classical algorithm for separable non-negative matrix factorization. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019), pages 4511–4517, 2019.
  • [4] Nai-Hui Chia, András Gilyén, Han-Hsuan Lin, Seth Lloyd, Ewin Tang, and Chunhao Wang. Quantum-inspired algorithms for solving low-rank linear equation systems with logarithmic dependence on the dimension. In Proceedings of the 31st International Symposium on Algorithms and Computation (ISAAC 2020), volume 181 of LIPIcs, pages 47:1–47:17, 2020.
  • [5] Nai-Hui Chia, András Pal Gilyén, Tongyang Li, Han-Hsuan Lin, Ewin Tang, and Chunhao Wang. Sampling-based sublinear low-rank matrix arithmetic framework for dequantizing quantum machine learning. Journal of the ACM, 69(5):33:1–33:72, 2022.
  • [6] Nai-Hui Chia, Tongyang Li, Han-Hsuan Lin, and Chunhao Wang. Quantum-inspired sublinear algorithm for solving low-rank semidefinite programming. In Proceedings of the 45th International Symposium on Mathematical Foundations of Computer Science (MFCS 2020), volume 170 of LIPIcs, pages 23:1–23:15, 2020.
  • [7] Iris Cong and Luming Duan. Quantum discriminant analysis for dimensionality reduction and classification. New Journal of Physics, 18(7):073011, 2016.
  • [8] Jordan Cotler, Hsin-Yuan Huang, and Jarrod R. McClean. Revisiting dequantization and quantum advantage in learning tasks. ArXiv:2112.00811, 2021.
  • [9] Luc Devroye. Non-Uniform Random Variate Generation. Springer, 1986.
  • [10] Chen Ding, Tian-Yi Bao, and He-Liang Huang. Quantum-inspired support vector machine. IEEE Transactions on Neural Networks and Learning Systems, 33(12):7210–7222, 2022.
  • [11] Petros Drineas, Alan M. Frieze, Ravi Kannan, Santosh S. Vempala, and V. Vinay. Clustering large graphs via the singular value decomposition. Machine Learning, 56(1-3):9–33, 2004.
  • [12] Petros Drineas and Ravi Kannan. Fast Monte-Carlo algorithms for approximate matrix multiplication. In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science (FOCS 2001), pages 452–459, 2001.
  • [13] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157, 2006.
  • [14] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices II: computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158–183, 2006.
  • [15] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184–206, 2006.
  • [16] Yuxuan Du, Min-Hsiu Hsieh, Tongliang Liu, and Dacheng Tao. Quantum-inspired algorithm for general minimum conical hull problems. Physical Review Research, 2:033199, 2020.
  • [17] Alan M. Frieze, Ravi Kannan, and Santosh S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM, 51(6):1025–1041, 2004.
  • [18] Sevag Gharibian and François Le Gall. Dequantizing the quantum singular value transformation: hardness and applications to quantum chemistry and the quantum PCP conjecture. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2022), pages 19–32, 2022.
  • [19] Michael I. Gil. Perturbations of functions of diagonalizable matrices. Electronic Journal of Linear Algebra, 20:303–313, 2010.
  • [20] András Gilyén, Zhao Song, and Ewin Tang. An improved quantum-inspired algorithm for linear regression. Quantum, 6:754, 2022.
  • [21] András Gilyén, Yuan Su, Guang Hao Low, and Nathan Wiebe. Quantum singular value transformation and beyond: exponential improvements for quantum matrix arithmetics. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC 2019), pages 193–204, 2019.
  • [22] András Gilyén, Seth Lloyd, and Ewin Tang. Quantum-inspired low-rank stochastic regression with logarithmic dependence on the dimension. ArXiv:1811.04909, 2018.
  • [23] Aram W. Harrow, Avinatan Hassidim, and Seth Lloyd. Quantum algorithm for linear systems of equations. Physical Review Letters, 103:150502, 2009. Full version available as ArXiv:0811.3171.
  • [24] Alan J. Hoffman and Helmut W. Wielandt. The variation of the spectrum of a normal matrix. Duke Mathematical Journal, 20(1):37–39, 1953.
  • [25] Mark Jerrum, Leslie G. Valiant, and Vijay V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.
  • [26] Dhawal Jethwani, François Le Gall, and Sanjay K. Singh. Quantum-inspired classical algorithms for singular value transformation. In Proceedings of the 45th International Symposium on Mathematical Foundations of Computer Science (MFCS 2020), volume 170 of LIPIcs, pages 53:1–53:14, 2020.
  • [27] Ravi Kannan and Santosh S. Vempala. Spectral algorithms. Foundations and Trends in Theoretical Computer Science, 4(3-4):157–288, 2009.
  • [28] Ravindran Kannan and Santosh S. Vempala. Randomized algorithms in numerical linear algebra. Acta Numerica, 26:95–135, 2017.
  • [29] Iordanis Kerenidis and Anupam Prakash. Quantum recommendation systems. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), volume 67 of LIPIcs, pages 49:1–49:21, 2017.
  • [30] Yang Liu and Shengyu Zhang. Fast quantum algorithms for least squares regression and statistic leverage scores. Theoretical Computer Science, 657(A):38–47, 2017.
  • [31] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum algorithms for supervised and unsupervised machine learning. ArXiv:1307.0411, 2013.
  • [32] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum principal component analysis. Nature Physics, 10:631–633, 2014.
  • [33] John M. Martyn, Zane M. Rossi, Andrew K. Tan, and Isaac L. Chuang. Grand unification of quantum algorithms. PRX Quantum, 2:040203, 2021.
  • [34] Colin McDiarmid. On the method of bounded differences. In Surveys in combinatorics, 1989 (Norwich, 1989), volume 141 of London Mathematical Society Lecture Note series, pages 148–188. Cambridge University Press, 1989.
  • [35] Patrick Rebentrost, Thomas R. Bromley, Christian Weedbrook, and Seth Lloyd. Quantum Hopfield neural network. Physical Review A, 98(4), 2018.
  • [36] Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. Quantum support vector machine for big data classification. Physical Review Letters, 113:130503, 2014.
  • [37] Patrick Rebentrost, Adrian Steffens, Iman Marvian, and Seth Lloyd. Quantum singular-value decomposition of nonsparse low-rank matrices. Physical Review A, 97:012327, 2018.
  • [38] Changpeng Shao and Ashley Montanaro. Faster quantum-inspired algorithms for solving linear systems. ACM Transactions on Quantum Computing, 3(4), 2022.
  • [39] Ewin Tang. A quantum-inspired classical algorithm for recommendation systems. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC 2019), pages 217–228, 2019.
  • [40] Ewin Tang. Quantum principal component analysis only achieves an exponential speedup because of its state preparation assumptions. Physical Review Letters, 127:060503, 2021.
  • [41] Nathan Wiebe, Daniel Braun, and Seth Lloyd. Quantum algorithm for data fitting. Physical Review Letters, 109:050505, 2012.