
Faster Linear Algebra for Distance Matrices

Piotr Indyk
MIT
[email protected]
   Sandeep Silwal
MIT
[email protected]
Abstract

The distance matrix of a dataset X of n points with respect to a distance function f represents all pairwise distances between points in X induced by f. Due to their wide applicability, distance matrices and related families of matrices have been the focus of many recent algorithmic works. We continue this line of research and take a broad view of algorithm design for distance matrices with the goal of designing fast algorithms, which are specifically tailored for distance matrices, for fundamental linear algebraic primitives. Our results include efficient algorithms for computing matrix-vector products for a wide class of distance matrices, such as the \ell_{1} metric for which we get a linear runtime, as well as an \Omega(n^{2}) lower bound for any algorithm which computes a matrix-vector product for the \ell_{\infty} case, showing a separation between the \ell_{1} and the \ell_{\infty} metrics. Our upper bound results, in conjunction with recent works on the matrix-vector query model, have many further downstream applications, including the fastest algorithm for computing a relative error low-rank approximation for the distance matrix induced by the \ell_{1} and \ell_{2}^{2} functions and the fastest algorithm for computing an additive error low-rank approximation for the \ell_{2} metric, in addition to applications for fast matrix multiplication among others. We also give algorithms for constructing distance matrices and show that one can construct an approximate \ell_{2} distance matrix in time faster than the bound implied by the Johnson-Lindenstrauss lemma.

1 Introduction

Given a set of n points X=\{x_{1},\ldots,x_{n}\}, the distance matrix of X with respect to a distance function f is defined as the n\times n matrix A satisfying A_{i,j}=f(x_{i},x_{j}). Distance matrices are ubiquitous objects arising in various applications ranging from learning image manifolds [TSL00, WS06], signal processing [SY07], and biological analysis [HS93] to non-linear dimensionality reduction [Kru64, Kru78, TSL00, CC08], to name a few (we refer the reader to the survey [DPRV15] for a more thorough discussion of applications of distance matrices). Unfortunately, explicitly computing and storing A requires at least \Omega(n^{2}) time and space. Such complexities are prohibitive for scaling to large datasets.

A silver lining is that in many settings, the matrix A is not explicitly required. Indeed in many applications, it suffices to compute some underlying function or property of A, such as the eigenvalues and eigenvectors of A or a low-rank approximation of A. Thus an algorithm designer can hope to use the special geometric structure encoded by A to design faster algorithms tailored for such tasks.

Therefore, it is not surprising that many recent works explicitly take advantage of the underlying geometric structure of distance matrices, and other related families of matrices, to design fast algorithms (see Section 1.2 for a thorough discussion of prior works). In this work, we continue this line of research and take a broad view of algorithm design for distance matrices. Our main motivating question is the following:

Can we design algorithms for fundamental linear algebraic primitives which are specifically tailored for distance matrices and related families of matrices?

We make progress towards the motivating question by studying three of the most fundamental primitives in algorithmic linear algebra. Specifically:

  1. We study upper and lower bounds for computing matrix-vector products for a wide array of distance matrices,

  2. We give algorithms for multiplying distance matrices faster than general matrices, and

  3. We give fast algorithms for constructing distance matrices.

1.1 Our Results

We now describe our contributions in more detail.

1. We study upper and lower bounds for constructing matrix-vector queries for a wide array of distance matrices.

A matrix-vector query algorithm accepts a vector z as input and outputs the vector Az. There is substantial motivation for studying such queries. Indeed, there is now a rich literature for fundamental linear algebra algorithms which are in the “matrix free” or “implicit” model. These algorithms only assume access to the underlying matrix via matrix-vector queries. Some well known algorithms in this model include the power method for computing eigenvalues and the conjugate gradient descent method for solving a system of linear equations. For many fundamental functions of A, nearly optimal bounds in terms of the number of queries have been achieved [MM15, BHSW20, BCW22]. Furthermore, having access to matrix-vector queries also allows the simulation of any randomized sketching algorithm, a well studied algorithmic paradigm in its own right [Woo14]. This is because randomized sketching algorithms operate on the matrix \Pi A or A\Pi where \Pi is a suitably chosen random matrix, such as a Gaussian matrix. Typically, \Pi is chosen so that the sketches \Pi A or A\Pi have significantly smaller row or column dimension compared to A. If A is symmetric, we can easily acquire both types of matrix sketches via a small number of matrix-vector queries.
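For concreteness, the following is a minimal NumPy sketch (Python being the language of our experiments in Section 4) of how a matrix-free algorithm such as the power method interacts with a matrix solely through a hypothetical matvec callable; the callable stands in for whatever fast matrix-vector query one can supply.

```python
import numpy as np

def power_method(matvec, n, iters=100, seed=0):
    """Estimate the top eigenvalue/eigenvector of a symmetric n x n matrix
    that is accessible only through the matrix-vector oracle `matvec`."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = matvec(v)               # the only way the matrix is ever accessed
        v = w / np.linalg.norm(w)
    return v @ matvec(v), v         # Rayleigh quotient and approximate top eigenvector
```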

Therefore, creating efficient versions of matrix-vector queries for distance matrices automatically lends itself to many further downstream applications. We remark that our algorithms can access the set of input points but do not explicitly create the distance matrix. A canonical example of our upper bound results is the construction of matrix-vector queries for the function f(x,y)=\|x-y\|_{p}^{p}.

Theorem 1.1.

Let p\geq 1 be an integer. Suppose we are given a dataset of n points X=\{x_{1},\ldots,x_{n}\}\subset\mathbb{R}^{d}. X implicitly defines the matrix A_{i,j}=\|x_{i}-x_{j}\|_{p}^{p}. Given a query z\in\mathbb{R}^{n}, we can compute Az exactly in time O(ndp). If p is odd, we also require O(nd\log n) preprocessing time.

We give similar guarantees for a wide array of functions f and we refer the reader to Table 1 which summarizes our matrix-vector query upper bound results. Note that some of the functions f we study in Table 1 do not necessarily induce a metric in the strict mathematical sense (for example the function f(x,y)=\|x-y\|_{2}^{2} does not satisfy the triangle inequality). Nevertheless, we still refer to such functions under the broad umbrella term of “distance functions” for ease of notation. We always explicitly state the function f we are referring to.

Crucially, most of our bounds have a linear dependency on n which allows for scalable computation as the size of the dataset X grows. Our upper bounds are optimal in many cases, see Theorem A.13.

Function | f(x,y) | Preprocessing | Query Time | Reference
\ell_{p}^{p} for p even | \|x-y\|_{p}^{p} | - | O(ndp) | Thms. A.1 / A.3
\ell_{p}^{p} for p odd | \|x-y\|_{p}^{p} | O(nd\log n) | O(ndp) | Thms. 2.2 / A.4
Mixed \ell_{\infty} | \max_{i,j}|x_{i}-y_{j}| | O(nd\log n) | O(n^{2}) | Thm. A.5
Mahalanobis Distance² | x^{T}My | O(nd^{2}) | O(nd) | Thm. A.6
Polynomial Kernel | \langle x,y\rangle^{p} | - | O(nd^{p}) | Thm. A.7
Total Variation Distance | \text{TV}(x,y) | O(nd\log n) | O(nd) | Thm. A.8
KL Divergence | \text{D}_{\text{KL}}(x\,\|\,y) | - | O(nd) | Thm. A.2
Symmetric Divergence | \text{D}_{\text{KL}}(x\,\|\,y)+\text{D}_{\text{KL}}(y\,\|\,x) | - | O(nd) | Thm. A.9
Cross Entropy | H(x,y) | - | O(nd) | Thm. A.9
Hellinger Distance² | \sum_{i=1}^{d}\sqrt{x(i)y(i)} | - | O(nd) | Thm. A.10
Table 1: A summary of our results for exact matrix-vector queries.

Combining our upper bound results with optimized matrix-free methods, immediate corollaries of our results include faster algorithms for eigenvalue and singular value computations and low-rank approximations. Low-rank approximation is of special interest as it has been widely studied for distance matrices; for low-rank approximation, our bounds outperform prior results for specific distance functions. For example, for the \ell_{1} and \ell_{2}^{2} cases (and in general PSD matrices), [BCW20] showed that a rank-k approximation can be found in time O(ndk/\varepsilon+nk^{w-1}/\varepsilon^{w-1}). This bound has extra \text{poly}(1/\varepsilon) overhead compared to our bound stated in Table 2. The work of [IVWW19] has a worse \text{poly}(k,1/\varepsilon) overhead for an additive error approximation for the \ell_{2} case. See Section 1.2 for further discussion of prior works. The downstream applications of matrix-vector queries are summarized in Table 2.

We also study fundamental limits for any upper bound algorithms. In particular, we show that no algorithm can compute a matrix-vector query for general inputs for the \ell_{\infty} metric in subquadratic time, assuming a standard complexity-theory assumption called the Strong Exponential Time Hypothesis (SETH) [IP01, IPZ01].

Theorem 1.2.

For any \alpha>0 and d=\omega(\log n), any algorithm for exactly computing Az for any input z, where A is the \ell_{\infty} distance matrix, requires \Omega(n^{2-\alpha}) time (assuming SETH).

This shows a separation between the functions listed in Table 1 and the \ell_{\infty} metric. Surprisingly, we can answer approximate matrix-vector queries in substantially faster time.

Theorem 1.3.

Suppose X\subseteq\{0,1,\ldots,O(1)\}^{d} and let A be the corresponding \ell_{\infty} distance matrix. We can compute By in time O(n\cdot d^{O(\sqrt{d}\log(d/\varepsilon))}) for a matrix B satisfying \|A-B\|_{\infty}\leq\varepsilon.

To put the above result into context, the lower bound of Theorem 1.2 holds for point sets in \{0,1,2\}^{d} in d\approx\log n dimensions. In contrast, if we relax to an approximation guarantee, we can obtain a subquadratic-time algorithm for d up to \Theta(\log^{2}(n)/\log\log(n)).

Finally, we provide a general understanding of the limits of our upper bound techniques. In Theorem B.1, we show that, essentially, the only functions f to which our upper bound techniques apply are those with a “linear structure” after a suitable transformation. We refer to Appendix Section B for details.

Problem | f(x,y) | Runtime | Prior Work
(1+\varepsilon) relative error rank-k low-rank approximation | \ell_{1},\ell_{2}^{2} | \tilde{O}\left(\frac{ndk}{\varepsilon^{1/3}}+\frac{nk^{w-1}}{\varepsilon^{(w-1)/3}}\right) (Theorem C.4) | O\left(\frac{ndk}{\varepsilon}+\frac{nk^{w-1}}{\varepsilon^{w-1}}\right) [BCW20]
Additive error \varepsilon\|A\|_{F} rank-k low-rank approximation | \ell_{2} | \tilde{O}\left(\frac{ndk}{\varepsilon^{1/3}}+\frac{nk^{w-1}}{\varepsilon^{(w-1)/3}}\right) (Theorem C.6) | \tilde{O}(nd\cdot\text{poly}(k,1/\varepsilon)) [IVWW19]
(1+\varepsilon) relative error rank-k low-rank approximation | Any in Table 1 | \tilde{O}\left(\frac{Tk}{\varepsilon^{1/3}}+\frac{nk^{w-1}}{\varepsilon^{(w-1)/3}}\right) (Theorem C.7) | \tilde{O}\left(\frac{n^{2}dk}{\varepsilon^{1/3}}+\frac{nk^{w-1}}{\varepsilon^{(w-1)/3}}\right) [BCW22]
(1\pm\varepsilon) approximation to top k singular values | Any in Table 1 | \tilde{O}\left(\frac{Tk}{\varepsilon^{1/2}}+\frac{nk^{2}}{\varepsilon}+\frac{k^{3}}{\varepsilon^{3/2}}\right) (Theorem C.8) | \tilde{O}\left(\frac{n^{2}dk}{\varepsilon^{1/2}}+\frac{nk^{2}}{\varepsilon}+\frac{k^{3}}{\varepsilon^{3/2}}\right) [MM15]
Multiply distance matrix A with any B\in\mathbb{R}^{n\times n} | Any in Table 1 | O(Tn) (Lemma C.9) | O(n^{w})
Multiply two distance matrices A and B | \ell_{2}^{2} | O(n^{2}d^{w-2}) (Lemma C.11) | O(n^{w})
Table 2: Applications of our matrix-vector query results. T denotes the matrix-vector query time, given in Table 1. w\approx 2.37 is the matrix multiplication constant [AW21].

2. We give algorithms for multiplying distance matrices faster than general matrices.

Fast matrix-vector queries also automatically imply fast matrix multiplication, which can be reduced to a series of matrix-vector queries. For concreteness, if f is the \ell_{p}^{p} function which induces A, and B is any n\times n matrix, we can compute AB in time O(n^{2}dp). This is substantially faster than the general matrix multiplication bound of n^{w}\approx n^{2.37}. We also give an improvement of this result for the case where we are multiplying two distance matrices arising from \ell_{2}^{2}. See Table 2 for a summary.

3. We give fast algorithms for constructing distance matrices.

Finally, we give fast algorithms for constructing approximate distance matrices. To establish some context, recall the classical Johnson-Lindenstrauss (JL) lemma which (roughly) states that a random projection of a dataset X\subset\mathbb{R}^{d} of size n onto a dimension of size O(\log n) approximately preserves all pairwise distances [JL84]. A common application of this lemma is to instantiate the \ell_{2} distance matrix. A naive algorithm which computes the distance matrix after performing the JL projection requires approximately O(n^{2}\log n) time. Surprisingly, we show that the JL lemma is not tight with respect to creating an approximate \ell_{2} distance matrix; we show that one can initialize the \ell_{2} distance matrix in an asymptotically better runtime.

Theorem 1.4 (Informal; see Theorem D.5).

We can calculate an n\times n matrix B such that each (i,j) entry B_{ij} of B satisfies (1-\varepsilon)\|x_{i}-x_{j}\|_{2}\leq B_{ij}\leq(1+\varepsilon)\|x_{i}-x_{j}\|_{2} in time O(\varepsilon^{-2}n^{2}\log^{2}(\varepsilon^{-1}\log n)).

Our result can be viewed as the natural runtime bound which would follow if the JL lemma implied an embedding dimension bound of O(\text{poly}(\log\log n)). While this is impossible, as it would imply an exponential improvement over the JL bound which is tight [LN17], we achieve our speedup by carefully reusing distance calculations via tools from metric compression [IRW17]. Our results also extend to the \ell_{1} distance matrix; see Theorem D.5 for details.
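For context, the following is a minimal NumPy sketch of the naive JL baseline that Theorem 1.4 improves upon: project onto m=O(\varepsilon^{-2}\log n) dimensions and then compute all pairwise distances of the projected points, for roughly O(\varepsilon^{-2}n^{2}\log n) total time. The constant 8 in the target dimension is an illustrative choice rather than a tuned bound.

```python
import numpy as np

def jl_distance_matrix(X, eps=0.5, seed=0):
    """Naive baseline: JL projection followed by exact pairwise l2 distances."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    m = int(np.ceil(8 * np.log(n) / eps ** 2))        # target dimension O(eps^-2 log n)
    Y = X @ rng.standard_normal((d, m)) / np.sqrt(m)  # random Gaussian projection
    sq = np.sum(Y ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Y @ Y.T, 0.0)
    return np.sqrt(D2)                                # (1 +/- eps)-approximate distances w.h.p.
```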

Notation.

Our dataset will be the n points X=\{x_{1},\ldots,x_{n}\}\subset\mathbb{R}^{d}. For points in X, we denote x_{i}(j) to be the jth coordinate of point x_{i} for clarity. For all other vectors v, v_{i} denotes the ith coordinate. We are interested in matrices of the form A_{i,j}=f(x_{i},x_{j}) for f:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R} which measures the similarity between any pair of points. f might not necessarily be a distance function but we use the terminology “distance function” for ease of notation. We will always explicitly state the function f as needed. w\approx 2.37 denotes the matrix multiplication constant, i.e., the exponent of n in the time required to compute the product of two n\times n matrices [AW21].

1.2 Related Works

Matrix-Vector Product Queries.

Our work can be understood as being part of a long line of classical works on the matrix free or implicit model as well as the active recent line of works on the matrix-vector query model. Many widely used linear algebraic algorithms such as the power method, the Lanczos algorithm [Lan50], conjugate gradient descent [S+94], and Wiedemann’s coordinate recurrence algorithm [Wie86], to name a few, all fall into this paradigm. Recent works such as [MM15, BHSW20, BCW22] have succeeded in precisely nailing down the query complexity of these classical algorithms in addition to various other algorithmic tasks such as low-rank approximation [BCW22], trace estimation [MMMW21], and other linear-algebraic functions [SWYZ21b, RWZ20]. There is also a rich literature on query based algorithms in other contexts with the goal of minimizing the number of queries used. Examples include graph queries [Gol17], distribution queries [Can20], and constraint based queries [ES20] in property testing, inner product queries in compressed sensing [EK12], and quantum queries [LSZ21, CHL21].

Most prior works on query based models assume black-box access to matrix-vector queries. While this is a natural model which allows for the design of non-trivial algorithms and lower bounds, it is not always clear how such queries can be initialized. In contrast, the focus of our work is not on obtaining query complexity bounds, but rather complementing prior works by creating an efficient matrix-vector query for a natural class of matrices.

Subquadratic Algorithms for Distance Matrices.

Most work on subquadratic algorithms for distance matrices has focused on the problem of computing a low-rank approximation. [BW18, IVWW19] both obtain an additive error low-rank approximation applicable for all distance matrices. These works only assume access to the entries of the distance matrix whereas we assume we also have access to the underlying dataset. [BCW20] study the problem of computing the low-rank approximation of PSD matrices, also with sample access to the entries of the matrix. Their results extend to low-rank approximation for the \ell_{1} and \ell_{2}^{2} distance matrices in addition to other more specialized metrics such as spherical metrics. Table 2 lists the runtime comparisons between their results and ours.

Practically, the algorithm of [IVWW19] is the easiest to implement and has outstanding empirical performance. We note that we can easily simulate their algorithm with no overall asymptotic runtime overhead using O(\log n) matrix-vector queries. Indeed, their algorithm proceeds by sampling rows of the matrix according to their \ell_{2}^{2} value and then post-processing these rows. The sampling probabilities only need to be accurate up to a factor of two. We can acquire these sampling probabilities by performing O(\log n) matrix-vector queries which sketch the rows onto dimension O(\log n) and preserve all row norms up to a factor of two with high probability due to the Johnson-Lindenstrauss lemma [JL84]. This procedure only incurs an additional runtime of O(T\log n) where T is the time required to perform a matrix-vector query.
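A minimal sketch of this simulation, assuming a hypothetical matvec callable implementing one of our queries: sketch the rows of A with O(\log n) Gaussian matrix-vector products and read off approximate squared row norms.

```python
import numpy as np

def approx_sq_row_norms(matvec, n, c=10, seed=0):
    """Estimate squared row norms of an implicit n x n matrix A up to a small
    constant factor (w.h.p.) using O(log n) matrix-vector queries."""
    rng = np.random.default_rng(seed)
    m = int(np.ceil(c * np.log(n)))                             # sketch dimension O(log n)
    G = rng.standard_normal((n, m)) / np.sqrt(m)                # scaled Gaussian JL matrix
    AG = np.column_stack([matvec(G[:, j]) for j in range(m)])   # A @ G, one query per column
    return np.sum(AG ** 2, axis=1)                              # row norms of AG ~ row norms of A
```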

The paper [ILLP04] shows that the exact L_{1} distance matrix can be created in time O(n^{(w+3)/2})\approx n^{2.69} in the case of d=n, which is asymptotically faster than the naive bound of O(n^{2}d)=O(n^{3}). In contrast, we focus on creating (entry-wise) approximate distance matrices for all values of d.

We also compare to the paper of [ACSS20]. In summary, their main upper bounds are approximation algorithms while we mainly focus on exact algorithms. Concretely, they study matrix-vector products for matrices of the form A_{i,j}=f(\|x_{i}-x_{j}\|_{2}^{2}) for some function f:\mathbb{R}\rightarrow\mathbb{R}. They present results on approximating the matrix-vector product of A where the approximation error is additive. They also consider a wide range of f, including polynomials and other kernels, but the input to f is always the squared \ell_{2} distance. In contrast, we also present exact algorithms, i.e., with no approximation errors. For example, one of our main upper bounds is an exact algorithm when A_{i,j}=\|x_{i}-x_{j}\|_{1} (see Table 1 for the full list). Since it is possible to approximately embed the \ell_{1} distance into \ell_{2}^{2}, their methods could be used to derive approximate algorithms for \ell_{1}, but not exact ones. Furthermore, we also study a wide variety of other distance functions such as \ell_{\infty} and \ell_{p}^{p} (and others listed in Table 1) which are not studied in Alman et al. In terms of technique, the main upper bound technique of Alman et al. is to expand f(\|x_{i}-x_{j}\|_{2}^{2}) and approximate the resulting quantity via a polynomial. This is related to our upper bound results for \ell_{p}^{p} for even p where we also use polynomials. However, our results are exact, while theirs are approximate. Our \ell_{1} upper bound technique is orthogonal to the polynomial approximation techniques used in Alman et al. We also employ polynomial techniques to give upper bounds for the approximate \ell_{\infty} distance function which is not studied in Alman et al. Lastly, Alman et al. also focus on problems involving the Laplacian matrix of the weighted graph represented by the distance matrix, such as spectral sparsification and Laplacian system solving. In contrast, we study different problems including low-rank approximations, eigenvalue estimation, and the task of initializing an approximate distance matrix. We do not consider the distance matrix as a graph or consider the associated Laplacian matrix.

It is also easy to verify the “folklore” fact that for a Gram matrix AA^{T}, we can compute AA^{T}v in time O(nd) if A\in\mathbb{R}^{n\times d} by computing A^{T}v first and then A(A^{T}v). Our upper bound for the \ell_{2}^{2} function can be reduced to this folklore fact by noting that \|x-y\|_{2}^{2}=\|x\|_{2}^{2}+\|y\|_{2}^{2}-2\langle x,y\rangle. Thus the \ell_{2}^{2} matrix can be decomposed into two rank one components due to the terms \|x\|_{2}^{2} and \|y\|_{2}^{2}, and a Gram matrix due to the term \langle x,y\rangle. This decomposition of the \ell_{2}^{2} matrix is well-known (see Section 2 in [DPRV15]). Hence, a matrix-vector query for the \ell_{2}^{2} matrix easily reduces to the Gram matrix case. Nevertheless, we explicitly state the \ell_{2}^{2} upper bound for completeness since we also consider all \ell_{p}^{p} functions for any integer p\geq 1.

Polynomial Kernels.

There have also been works on faster algorithms for approximating a kernel matrix K defined as the n\times n matrix with entries K_{i,j}=k(x_{i},x_{j}) for a kernel function k. Specifically for the polynomial kernel k(x_{i},x_{j})=\langle x_{i},x_{j}\rangle^{p}, recent works such as [ANW14, AKK+20, WZ20, SWYZ21a] have shown how to find a sketch K^{\prime} of K which approximately satisfies \|K^{\prime}z\|_{2}\approx\|Kz\|_{2} for all z. In contrast, we can exactly simulate the matrix-vector product Kz. Our runtime is O(nd^{p}) which has a linear dependence on n but an exponential dependence on p while the aforementioned works have at least a quadratic dependence on n but a polynomial dependence on p. Thus our results are mostly applicable to the setting where our dataset is large, i.e. n\gg d and p is a small constant. For example, p=2 is a common choice in practice [CHC+10]. Algorithms with polynomial dependence in d and p but quadratic dependence in n are suited for smaller datasets which have very large d and large p. Note that a large p might arise if one approximates a non-polynomial kernel using a polynomial kernel via a Taylor expansion. We refer to the references within [ANW14, AKK+20, WZ20, SWYZ21a] for additional related work. There is also work on kernel density estimation (KDE) data structures which, upon query y, allow for estimation of the sum \sum_{x\in X}k(x,y) in time sublinear in |X| after some preprocessing on the dataset X. For widely used kernels such as the Gaussian and Laplacian kernels, KDE data structures were used in [BIMW21] to create a matrix-vector query algorithm for kernel matrices in time subquadratic in |X| for input vectors which are entrywise non-negative. We refer the reader to [CS17, BCIS18, SRB+19, BIW19, CKNS20] for prior works on KDE data structures.

2 Faster Matrix-Vector Product Queries for \ell_{1}

We derive faster matrix-vector queries for distance matrices for a wide array of distance metrics. First we consider the case of the \ell_{1} metric such that A_{i,j}=f(x_{i},x_{j}) where f(x,y)=\|x-y\|_{1}=\sum_{i=1}^{d}|x_{i}-y_{i}|.

Algorithm 1 Preprocessing
1:Input: Dataset X\subset\mathbb{R}^{d}
2:procedure Preprocessing
3:     for i\in[d] do
4:         T_{i}\leftarrow sorted array of the ith coordinates of all x\in X.
5:     end for
6:end procedure

We first analyze the correctness of Algorithm 2.

Theorem 2.1.

Let A_{i,j}=\|x_{i}-x_{j}\|_{1}. Algorithm 2 computes Ay exactly.

Proof.

Consider any coordinate k\in[n]. We show that (Ay)_{k} is computed exactly. We have

(Ay)(k)=j=1nyjxkxj1=j=1ni=1dyj|xk(i)xj(i)|=i=1dj=1nyj|xk(i)xj(i)|.(Ay)(k)=\sum_{j=1}^{n}y_{j}\|x_{k}-x_{j}\|_{1}=\sum_{j=1}^{n}\sum_{i=1}^{d}y_{j}|x_{k}(i)-x_{j}(i)|=\sum_{i=1}^{d}\sum_{j=1}^{n}y_{j}|x_{k}(i)-x_{j}(i)|.

Let \pi^{i} denote the order of [n] induced by T_{i}. We have

\sum_{i=1}^{d}\sum_{j=1}^{n}y_{j}|x_{k}(i)-x_{j}(i)|=\sum_{i=1}^{d}\left(\sum_{j:\pi^{i}(k)\leq\pi^{i}(j)}y_{j}(x_{j}(i)-x_{k}(i))+\sum_{j:\pi^{i}(k)>\pi^{i}(j)}y_{j}(x_{k}(i)-x_{j}(i))\right).

We now consider the inner sum. It rearranges to the following:

x_{k}(i)\left(\sum_{j:\pi^{i}(k)>\pi^{i}(j)}y_{j}-\sum_{j:\pi^{i}(k)\leq\pi^{i}(j)}y_{j}\right)+\sum_{j:\pi^{i}(k)\leq\pi^{i}(j)}y_{j}x_{j}(i)-\sum_{j:\pi^{i}(k)>\pi^{i}(j)}y_{j}x_{j}(i)
=x_{k}(i)\cdot(S_{3}-S_{4})+S_{2}-S_{1},

where S_{1},S_{2},S_{3}, and S_{4} are defined in lines 15-18 of Algorithm 2 and the last equality follows from the definition of the arrays B_{i} and C_{i}. Summing over all i\in[d] gives us the desired result. ∎

The following theorem readily follows.

Theorem 2.2.

Suppose we are given a dataset \{x_{1},\ldots,x_{n}\} which implicitly defines the distance matrix A_{i,j}=\|x_{i}-x_{j}\|_{1}. Given a query y\in\mathbb{R}^{n}, we can compute Ay exactly in O(nd) query time. We also require a one-time O(nd\log n) preprocessing step which can be reused for all queries.

Algorithm 2 matrix-vector Query for p=1
1:Input: Dataset X\subset\mathbb{R}^{d}
2:Output: z=Ay
3:procedure Query(\{T_{i}\}_{i\in[d]}, y)
4:     y_{1},\cdots,y_{n}\leftarrow coordinates of y.
5:     Associate every x_{i}\in X with the scalar y_{i}
6:     for i\in[d] do
7:         Compute two arrays B_{i},C_{i} as follows.
8:         B_{i} contains the partial sums of y_{j}x_{j}(i) computed in the order induced by T_{i}
9:         C_{i} contains the partial sums of y_{j} computed in the order induced by T_{i}
10:     end for
11:     z\leftarrow 0^{n}
12:     for k\in[n] do
13:         for i\in[d] do
14:              q\leftarrow position of x_{k}(i) in the order of T_{i}
15:              S_{1}\leftarrow B_{i}[q]
16:              S_{2}\leftarrow B_{i}[n]-B_{i}[q]
17:              S_{3}\leftarrow C_{i}[q]
18:              S_{4}\leftarrow C_{i}[n]-C_{i}[q]
19:              z(k) += x_{k}(i)\cdot(S_{3}-S_{4})+S_{2}-S_{1}
20:         end for
21:     end for
22:end procedure
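The following is a minimal NumPy sketch of Algorithms 1 and 2 (a sketch of the method above, not production code): np.argsort plays the role of the sorted arrays T_{i} and np.cumsum plays the role of the partial-sum arrays B_{i} and C_{i}.

```python
import numpy as np

def l1_preprocess(X):
    """Algorithm 1: for each coordinate i, the permutation sorting X[:, i] (O(nd log n))."""
    return np.argsort(X, axis=0)

def l1_matvec(X, order, y):
    """Algorithm 2: exact z = A y for A[k, j] = ||x_k - x_j||_1 in O(nd) time."""
    n, d = X.shape
    z = np.zeros(n)
    for i in range(d):
        perm = order[:, i]                 # indices sorted by the i-th coordinate (T_i)
        xs, ys = X[perm, i], y[perm]
        B = np.cumsum(ys * xs)             # partial sums of y_j * x_j(i) in sorted order
        C = np.cumsum(ys)                  # partial sums of y_j in sorted order
        rank = np.empty(n, dtype=int)
        rank[perm] = np.arange(n)          # rank[k] = position q of x_k(i) in T_i
        S1, S2 = B[rank], B[-1] - B[rank]
        S3, S4 = C[rank], C[-1] - C[rank]
        z += X[:, i] * (S3 - S4) + S2 - S1
    return z

# Sanity check against the explicit distance matrix on a tiny instance:
# X = np.random.rand(50, 3); y = np.random.randn(50)
# A = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)
# assert np.allclose(A @ y, l1_matvec(X, l1_preprocess(X), y))
```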

3 Lower and Upper Bounds for \ell_{\infty}

In this section we give a proof of Theorem 1.2. Specifically, we give a reduction from the so-called Orthogonal Vectors Problem (OVP) [Wil05] to the problem of computing the matrix-vector product Az, where A_{i,j}=\|x_{i}-x_{j}\|_{\infty}, for a given set of points X=\{x_{1},\ldots,x_{n}\}. The orthogonal vectors problem is defined as follows: given two sets of vectors A=\{a^{1},\ldots,a^{n}\} and B=\{b^{1},\ldots,b^{n}\}, A,B\subset\{0,1\}^{d}, |A|=|B|=n, determine whether there exist x\in A and y\in B such that the dot product x\cdot y=\sum_{j=1}^{d}x_{j}y_{j} (taken over the reals) is equal to 0. It is known that if OVP can be solved in strongly subquadratic time O(n^{2-\alpha}) for any constant \alpha>0 and d=\omega(\log n), then SETH is false. Thus, an efficient reduction from OVP to the matrix-vector product problem yields Theorem 1.2.

Lemma 3.1.

If the matrix-vector product problem for \ell_{\infty} distance matrices induced by n vectors of dimension d can be solved in time T(n,d), then OVP (with the same parameters) can be solved in time O(T(n,d)).

Proof.

Define two functions f,g:\{0,1\}\to[0,1] such that f(0)=g(0)=1/2, f(1)=0, g(1)=1. We extend both functions to vectors by applying f and g coordinate-wise, and to sets by letting f(\{a^{1},\ldots,a^{n}\})=\{f(a^{1}),\ldots,f(a^{n})\}; the function g is extended in the same way for B. Observe that, for any pair of non-zero vectors a,b\in\{0,1\}^{d}, we have \|f(a)-g(b)\|_{\infty}=1 if and only if a\cdot b>0, and \|f(a)-g(b)\|_{\infty}=1/2 otherwise.

Consider two sets of binary vectors A and B. Without loss of generality we can assume that the vectors are non-zero, since otherwise the problem is trivial. Define three distance matrices: a matrix M_{A} defined by the set f(A), a matrix M_{B} defined by the set g(B), and M_{AB} defined by the set f(A)\cup g(B). Furthermore, let M be the “cross-distance” matrix, such that M_{i,j}=\|f(a^{i})-g(b^{j})\|_{\infty}. Observe that the matrix M_{AB} contains blocks M_{A} and M_{B} on its diagonal, and blocks M and M^{T} off-diagonal. Thus, M_{AB}\cdot 1=M_{A}\cdot 1+M_{B}\cdot 1+2M\cdot 1, where 1 denotes an all-ones vector of the appropriate dimension. Since M\cdot 1=(M_{AB}\cdot 1-M_{A}\cdot 1-M_{B}\cdot 1)/2, we can calculate M\cdot 1 in time O(T(n,d)). Since all entries of M are either 1 or 1/2, we have that M\cdot 1<n^{2} if and only if there is an entry M_{i,j}=1/2. However, this only occurs if a^{i}\cdot b^{j}=0. ∎

3.1 Approximate \ell_{\infty} Matrix-Vector Queries

In light of the lower bounds given above, we consider initializing approximate matrix-vector queries for the \ell_{\infty} function. Note that the lower bound holds for points in \{0,1,2\}^{d} and thus it is natural to consider approximate upper bounds for the case of a limited alphabet.

Binary Case.

We first consider the case that all points x\in X are from \{0,1\}^{d}. We claim the existence of a polynomial T with the following properties; indeed, the standard Chebyshev polynomials satisfy the following lemma (see, e.g., Chapter 2 in [SV+14]).

Lemma 3.2.

There exists a polynomial T:\mathbb{R}\rightarrow\mathbb{R} of degree O(\sqrt{d}\log(1/\varepsilon)) such that T(0)=0 and |T(x)-1|\leq\varepsilon for all x\in[1/d,1].
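As an illustration, the following is a minimal numerical sketch of one standard way to realize such a T (an assumed construction for illustration, not necessarily the one in [SV+14]): map [1/d,1] onto [-1,1], evaluate the degree-k Chebyshev polynomial there, and normalize by its large value at the image of 0.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

d, eps = 100, 1e-3
k = int(np.ceil(np.sqrt(d) * np.log(1 / eps)))    # degree O(sqrt(d) log(1/eps))

a, b = 1.0 / d, 1.0
to_unit = lambda x: (2 * x - (a + b)) / (b - a)   # affine map sending [1/d, 1] to [-1, 1]
Tk = C.Chebyshev.basis(k)                         # k-th Chebyshev polynomial of the first kind

def T(x):
    # T(0) = 0 exactly, and |T(x) - 1| = |T_k(to_unit(x))| / |T_k(to_unit(0))| <= eps on [1/d, 1],
    # since |T_k| <= 1 on [-1, 1] while T_k grows exponentially in k outside that interval.
    return 1.0 - Tk(to_unit(x)) / Tk(to_unit(0))

xs = np.linspace(a, b, 1000)
print(T(0.0), np.max(np.abs(T(xs) - 1.0)))        # prints ~0.0 and a value <= eps
```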

Now note that \|x-y\|_{\infty} can only take on two values, 0 or 1. Furthermore, \|x-y\|_{\infty}=0 if and only if \|x-y\|_{2}^{2}=0 and \|x-y\|_{\infty}=1 if and only if \|x-y\|_{2}^{2}\geq 1. Therefore, \|x-y\|_{\infty}=0 if and only if T(\|x-y\|_{2}^{2}/d)=0 and \|x-y\|_{\infty}=1 if and only if |T(\|x-y\|_{2}^{2}/d)-1|\leq\varepsilon. Thus, we have that

|A_{i,j}-T(\|x_{i}-x_{j}\|_{2}^{2}/d)|=|\|x_{i}-x_{j}\|_{\infty}-T(\|x_{i}-x_{j}\|_{2}^{2}/d)|\leq\varepsilon

for all entries A_{i,j} of A. Note that T(\|x-y\|_{2}^{2}/d) is a polynomial with O((2d)^{t}) monomials in the variables x(1),\ldots,x(d), where t=O(\sqrt{d}\log(1/\varepsilon)) denotes the degree of T. Consider the matrix B satisfying B_{i,j}=T(\|x_{i}-x_{j}\|_{2}^{2}/d). Using the same ideas as our upper bound results for f(x,y)=\langle x,y\rangle^{p}, it is straightforward to calculate the matrix-vector product By (see Section A.2). To summarize, for each k\in[n], we write the kth coordinate of By as a polynomial in the d coordinates of x_{k}. This polynomial has O((2d)^{t}) monomials and can be constructed in O(n(2d)^{t}) time. Once constructed, we can evaluate the polynomial at x_{1},\ldots,x_{n} to obtain all the n coordinates of By. Each evaluation requires O((2d)^{t}) time, resulting in an overall time bound of O(n(2d)^{t}).

Theorem 3.3.

Let A_{i,j}=\|x_{i}-x_{j}\|_{\infty}. We can compute By in time O(n(2d)^{\sqrt{d}\log(1/\varepsilon)}) where \|A-B\|_{\infty}\leq\varepsilon.

Entries in \{0,\ldots,M\}.

We now consider the case that all points x\in X are from \{0,\ldots,M\}^{d}. Our argument will be a generalization of the previous section. At a high level, our goal is to detect which of the M+1 possible values in \{0,\ldots,M\} is equal to the \ell_{\infty} norm. To do so, we appeal to the prior section and design estimators which approximate the indicator function “\|x-y\|_{\infty}\geq i”. By summing up these indicators, we can approximate \|x-y\|_{\infty}.

Our estimators will again be designed using the Chebyshev polynomials. To motivate them, suppose that we want to detect if \|x-y\|_{\infty}\geq i or if \|x-y\|_{\infty}<i. In the first case, some entry in x-y will have absolute value at least i whereas in the other case, all entries of x-y will be bounded by i-1 in absolute value. Thus if we can boost this ‘signal’, we can apply a polynomial which performs thresholding to distinguish the two cases. This motivates considering the function \|x-y\|_{k}^{k} for a larger power k. In particular, in the case that \|x-y\|_{\infty}\geq i, we have \|x-y\|_{k}^{k}\geq i^{k} and otherwise, \|x-y\|_{k}^{k}\leq di^{k-1}. By setting k\approx\log(d), the first value is much larger than the latter, which we can detect using the ‘threshold’ polynomials of the previous section.

We now formalize our intuition. It is known that appropriately scaled Chebyshev polynomials satisfy the following guarantees (see, e.g., Chapter 2 in [SV+14]).

Lemma 3.4.

There exists a polynomial T:\mathbb{R}\rightarrow\mathbb{R} of degree O(\sqrt{t}\log(t/\varepsilon)) such that |T(x)|\leq\varepsilon/t for all x\in[0,1/(10t)] and |T(x)-1|\leq\varepsilon/t^{2} for all x\in[1/t,1].

Given x,y\in\mathbb{R}^{d}, our estimator will first try to detect if \|x-y\|_{\infty}\geq i. Let T_{1} be a polynomial from Lemma 3.4 with t=O(M^{k}) for k=O(M\log(Md)), assuming k is even. Let T_{2} be a polynomial from Lemma 3.4 with t=O(\sqrt{d}\log(M/\varepsilon)). Our estimator will be

T_{2}\left(\frac{1}{d}\sum_{j=1}^{d}T_{1}\left(\frac{(x(j)-y(j))^{k}}{i^{k}\cdot M^{k}}\right)\right).

If coordinate j is such that |x(j)-y(j)|\geq i, then

\frac{(x(j)-y(j))^{k}}{i^{k}\cdot M^{k}}\geq\frac{1}{M^{k}}

and so T_{1} will evaluate to a value very close to 1. Otherwise, we know that

\frac{(x(j)-y(j))^{k}}{i^{k}\cdot M^{k}}\leq\frac{(i-1)^{k}}{i^{k}M^{k}}=\frac{1}{M^{k}}\left(1-1/i\right)^{k}\ll\frac{1}{M^{k}}\cdot\frac{1}{\text{poly}(M,d)}

by our choice of k, which means that T_{1} will evaluate to a value close to 0. Formally,

\frac{1}{d}\sum_{j=1}^{d}T_{1}\left(\frac{(x(j)-y(j))^{k}}{i^{k}\cdot M^{k}}\right)

will be at least 1/d if there is a j\in[d] with |x(j)-y(j)|\geq i and otherwise will be at most 1/(10d). By our choice of T_{2}, the overall estimator outputs a value at least 1-\varepsilon in the first case and a value at most \varepsilon in the second case.

The polynomial which is the composition of T_{2} and T_{1} has O\left(\left(dk\cdot\text{deg}(T_{1})\right)^{\text{deg}(T_{2})}\right)=(dM)^{O(M\sqrt{d}\log(Md))} monomials, if we consider the expression as a polynomial in the variables x(1),\ldots,x(d). Our final estimator will be the sum across all i\geq 1. Following our upper bound techniques for matrix-vector products for polynomials, e.g. in Section A.2, and as outlined in the prior section, we get the following overall query time:

Theorem 3.5.

Suppose we are given X=\{x_{1},\ldots,x_{n}\}\subseteq\{0,\ldots,M\}^{d} which implicitly defines the matrix A_{i,j}=\|x_{i}-x_{j}\|_{\infty}. For any query y, we can compute By in time n\cdot(dM)^{O(M\sqrt{d}\log(Md/\varepsilon))} where \|A-B\|_{\infty}\leq\varepsilon.

4 Empirical Evaluation

We perform an empirical evaluation of our matrix-vector query for the \ell_{1} distance function. We chose to implement our \ell_{1} upper bound since it is a clean algorithm which possesses many of the same underlying algorithmic ideas as some of our other upper bound results. We envision that similar empirical results hold for most of our upper bounds in Table 1. Furthermore, matrix-vector queries are the dominating subroutine in many key practical linear algebra algorithms such as the power method for eigenvalue estimation or iterative methods for linear regression: a fast matrix-vector query runtime automatically translates to faster algorithms for downstream applications.

Dataset | (n,d) | Algo. | Preprocessing Time | Avg. Query Time
Gaussian Mixture | (5\cdot 10^{4}, 50) | Naive | 453.7 s | 43.3 s
Gaussian Mixture | (5\cdot 10^{4}, 50) | Ours | 0.55 s | 0.09 s
MNIST | (5\cdot 10^{4}, 784) | Naive | 2672.5 s | 38.6 s
MNIST | (5\cdot 10^{4}, 784) | Ours | 5.5 s | 1.9 s
Glove | (1.2\cdot 10^{6}, 50) | Naive | - | \approx 2.6 days (estimated)
Glove | (1.2\cdot 10^{6}, 50) | Ours | 16.8 s | 3.4 s
Table 3: Dataset description and empirical results. (n,d) denotes the number of points and dimension of the dataset, respectively. Query times are averaged over 10 trials with Gaussian vectors as queries.

Experimental Design.

We chose two real and one synthetic dataset for our experiments. We have two “small” datasets and one “large” dataset. The two small datasets have 5\cdot 10^{4} points whereas the large dataset has approximately 10^{6} points. The first dataset consists of points drawn from a mixture of three spherical Gaussians in \mathbb{R}^{50}. The second dataset is the standard MNIST dataset [LeC98] and finally, our large dataset is Glove word embeddings in \mathbb{R}^{50} [PSM14] (available at http://github.com/erikbern/ann-benchmarks/).

The two small datasets are small enough that one can feasibly initialize the full n\times n distance matrix in memory in reasonable time. A 5\cdot 10^{4}\times 5\cdot 10^{4} matrix with each entry stored using 32 bits requires 10 gigabytes of space. This is simply impossible for the Glove dataset as approximately 5.8 terabytes of space is required to initialize the distance matrix (in contrast, the dataset itself only requires <0.3 gigabytes to store).

The naive algorithm for the small datasets is the following: we initialize the full distance matrix (which will count towards preprocessing), and then we use the full distance matrix to perform a matrix-vector query. Note that having the full matrix to perform a matrix-vector product only helps the naive algorithm since it can now take advantage of optimized linear algebra subroutines for matrix multiplication and does not need to explicitly calculate the matrix entries. Since we cannot initialize the full distance matrix for the large dataset, the naive algorithm in this case will compute the matrix-vector product in a standalone fashion by generating the entries of the distance matrix on the fly. We compare the naive algorithm to our Algorithms 1 and 2.

Our experiments are performed on a 2021 M1 MacBook Pro with 32 gigabytes of RAM. We implement all algorithms in Python 3.9 using NumPy with Numba acceleration to speed up all algorithms whenever possible.

Results.

Results are shown in Table 3. We show preprocessing and query time for both the naive and our algorithm in seconds. The query time is averaged over 10 trials using Gaussian vectors as queries. For the Glove dataset, it was infeasible to calculate even a single matrix-vector product, even using fast Numba accelerated code. We thus estimated the full query time by calculating the time on a subset of 5\cdot 10^{4} points of the Glove dataset and extrapolating to the full dataset by multiplying the query time by (n/(5\cdot 10^{4}))^{2} where n is the total number of points. We see that in all cases, our algorithm outperforms the naive algorithm in both preprocessing time and query time and the gains become increasingly substantial as the dataset size increases, as predicted by our theoretical results.

Acknowledgements.

This research was supported by the NSF TRIPODS program (award DMS-2022448), Simons Investigator Award, MIT-IBM Watson collaboration, GIST- MIT Research Collaboration grant, and NSF Graduate Research Fellowship under Grant No. 1745302.

References

  • [ACSS20] Josh Alman, Timothy Chu, Aaron Schild, and Zhao Song. Algorithms and hardness for linear algebra on geometric graphs. In Sandy Irani, editor, 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, pages 541–552. IEEE, 2020.
  • [AG11] Uri M Ascher and Chen Greif. A first course on numerical methods. SIAM, 2011.
  • [AKK+20] Thomas D Ahle, Michael Kapralov, Jakob BT Knudsen, Rasmus Pagh, Ameya Velingker, David P Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 141–160. SIAM, 2020.
  • [ANW14] Haim Avron, Huy Nguyen, and David Woodruff. Subspace embeddings for the polynomial kernel. Advances in neural information processing systems, 27, 2014.
  • [AW21] Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 522–539. SIAM, 2021.
  • [BCIS18] Arturs Backurs, Moses Charikar, Piotr Indyk, and Paris Siminelakis. Efficient density evaluation for smooth kernels. 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 615–626, 2018.
  • [BCW20] Ainesh Bakshi, Nadiia Chepurko, and David P Woodruff. Robust and sample optimal algorithms for psd low rank approximation. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 506–516. IEEE, 2020.
  • [BCW22] Ainesh Bakshi, Kenneth L Clarkson, and David P Woodruff. Low-rank approximation with 1/\varepsilon^{1/3} matrix-vector products. arXiv preprint arXiv:2202.05120, 2022.
  • [BHSW20] Mark Braverman, Elad Hazan, Max Simchowitz, and Blake Woodworth. The gradient complexity of linear regression. In Conference on Learning Theory, pages 627–647. PMLR, 2020.
  • [BIMW21] Arturs Backurs, Piotr Indyk, Cameron Musco, and Tal Wagner. Faster kernel matrix algebra via density estimation. In Proceedings of the 38th International Conference on Machine Learning, pages 500–510, 2021.
  • [BIW19] Arturs Backurs, Piotr Indyk, and Tal Wagner. Space and time efficient kernel density estimation in high dimensions. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, NeurIPS, pages 15773–15782, 2019.
  • [BW18] Ainesh Bakshi and David Woodruff. Sublinear time low-rank approximation of distance matrices. Advances in Neural Information Processing Systems, 31, 2018.
  • [Can20] Clément L Canonne. A survey on distribution testing: Your data is big. but is it blue? Theory of Computing, pages 1–100, 2020.
  • [CC08] Michael AA Cox and Trevor F Cox. Multidimensional scaling. In Handbook of data visualization, pages 315–347. Springer, 2008.
  • [CHC+10] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin. Training and testing low-degree polynomial data mappings via linear svm. Journal of Machine Learning Research, 11(4), 2010.
  • [CHL21] Andrew M Childs, Shih-Han Hung, and Tongyang Li. Quantum query complexity with matrix-vector products. arXiv preprint arXiv:2102.11349, 2021.
  • [CKNS20] Moses Charikar, Michael Kapralov, Navid Nouri, and Paris Siminelakis. Kernel density estimation through density constrained near neighbor search. 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 172–183, 2020.
  • [CS17] Moses Charikar and Paris Siminelakis. Hashing-based-estimators for kernel density in high dimensions. In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS, pages 1032–1043, 2017.
  • [DPRV15] Ivan Dokmanic, Reza Parhizkar, Juri Ranieri, and Martin Vetterli. Euclidean distance matrices: essential theory, algorithms, and applications. IEEE Signal Processing Magazine, 32(6):12–30, 2015.
  • [EK12] Yonina C Eldar and Gitta Kutyniok. Compressed sensing: theory and applications. Cambridge university press, 2012.
  • [ES20] Rogers Epstein and Sandeep Silwal. Property testing of lp-type problems. In 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
  • [Gol17] Oded Goldreich. Introduction to property testing. Cambridge University Press, 2017.
  • [HS93] Liisa Holm and Chris Sander. Protein structure comparison by alignment of distance matrices. Journal of molecular biology, 233(1):123–138, 1993.
  • [ILLP04] Piotr Indyk, Moshe Lewenstein, Ohad Lipsky, and Ely Porat. Closest pair problems in very high dimensions. In International Colloquium on Automata, Languages, and Programming, pages 782–792. Springer, 2004.
  • [IP01] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-sat. Journal of Computer and System Sciences, 62(2):367–375, 2001.
  • [IPZ01] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63(4):512–530, 2001.
  • [IRW17] Piotr Indyk, Ilya Razenshteyn, and Tal Wagner. Practical data-dependent metric compression with provable guarantees. Advances in Neural Information Processing Systems, 30, 2017.
  • [IVWW19] Pitor Indyk, Ali Vakilian, Tal Wagner, and David P Woodruff. Sample-optimal low-rank approximation of distance matrices. In Conference on Learning Theory, pages 1723–1751. PMLR, 2019.
  • [JL84] W. Johnson and J. Lindenstrauss. Extensions of lipschitz maps into a hilbert space. Contemporary Mathematics, 26:189–206, 01 1984.
  • [Kru64] Joseph B Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.
  • [Kru78] Joseph B Kruskal. Multidimensional scaling. Number 11. Sage, 1978.
  • [Kuc09] Marek Kuczma. An introduction to the theory of functional equations and inequalities: Cauchy’s equation and Jensen’s inequality. Springer Science & Business Media, 2009.
  • [Lan50] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. 1950.
  • [LeC98] Yann LeCun. The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/, 1998.
  • [LN17] Kasper Green Larsen and Jelani Nelson. Optimality of the johnson-lindenstrauss lemma. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 633–638. IEEE, 2017.
  • [LSZ21] Troy Lee, Miklos Santha, and Shengyu Zhang. Quantum algorithms for graph problems with cut queries. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 939–958. SIAM, 2021.
  • [MM15] Cameron Musco and Christopher Musco. Randomized block krylov methods for stronger and faster approximate singular value decomposition. Advances in neural information processing systems, 28, 2015.
  • [MMMW21] Raphael A Meyer, Cameron Musco, Christopher Musco, and David P Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142–155. SIAM, 2021.
  • [PSM14] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
  • [RWZ20] Cyrus Rashtchian, David P Woodruff, and Hanlin Zhu. Vector-matrix-vector queries for solving linear algebra, statistics, and graph problems. arXiv preprint arXiv:2006.14015, 2020.
  • [S+94] Jonathan Richard Shewchuk et al. An introduction to the conjugate gradient method without the agonizing pain, 1994.
  • [SRB+19] Paris Siminelakis, Kexin Rong, Peter Bailis, Moses Charikar, and Philip Alexander Levis. Rehashing kernel evaluation in high dimensions. In Proceedings of the 36th International Conference on Machine Learning, ICML, pages 5789–5798, 2019.
  • [SV+14] Sushant Sachdeva, Nisheeth K Vishnoi, et al. Faster algorithms via approximation theory. Foundations and Trends® in Theoretical Computer Science, 9(2):125–210, 2014.
  • [SWYZ21a] Zhao Song, David Woodruff, Zheng Yu, and Lichen Zhang. Fast sketching of polynomial kernels of polynomial degree. In International Conference on Machine Learning, pages 9812–9823. PMLR, 2021.
  • [SWYZ21b] Xiaoming Sun, David P Woodruff, Guang Yang, and Jialin Zhang. Querying a matrix through matrix-vector products. ACM Transactions on Algorithms (TALG), 17(4):1–19, 2021.
  • [SY07] Anthony Man-Cho So and Yinyu Ye. Theory of semidefinite programming for sensor network localization. Mathematical Programming, 109(2):367–384, 2007.
  • [TSL00] Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
  • [Wie86] Douglas Wiedemann. Solving sparse linear equations over finite fields. IEEE transactions on information theory, 32(1):54–62, 1986.
  • [Wil05] Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theoretical Computer Science, 348(2-3):357–365, 2005.
  • [Woo14] David P Woodruff. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014.
  • [WS06] Kilian Q Weinberger and Lawrence K Saul. Unsupervised learning of image manifolds by semidefinite programming. International journal of computer vision, 70(1):77–90, 2006.
  • [WZ20] David Woodruff and Amir Zandieh. Near input sparsity time kernel embeddings via adaptive sampling. In International Conference on Machine Learning, pages 10324–10333. PMLR, 2020.

Appendix A Omitted Upper Bounds for Faster Matrix-Vector Queries

We now consider the case of \ell_{p}^{p} for p=2. Generalizing the results for p=1 and p=2 allows us to handle general \ell_{p}^{p} functions.

Algorithm 3 matrix-vector Query for p=2
1:Input: Dataset X\subset\mathbb{R}^{d}
2:Output: z=Ay
3:procedure Query(y)
4:     v\leftarrow\sum_{j=1}^{n}y_{j}x_{j}
5:     S_{1}\leftarrow\sum_{j=1}^{n}y_{j}
6:     S_{2}\leftarrow\sum_{j=1}^{n}y_{j}\|x_{j}\|_{2}^{2}
7:     z\leftarrow 0^{n}
8:     for k\in[n] do
9:         z(k)\leftarrow S_{1}\|x_{k}\|_{2}^{2}+S_{2}-2\langle x_{k},v\rangle
10:     end for
11:end procedure
Theorem A.1.

We can compute Ay in O(nd) query time.

Proof.

The proof follows from the following calculation of the kth coordinate of Ay:

(Ay)(k)=j=1nyjxkxj22=xk22(j=1nyj2)+j=1nyj2xj222xk,j=1nyjxj.(Ay)(k)=\sum_{j=1}^{n}y_{j}\|x_{k}-x_{j}\|_{2}^{2}=\|x_{k}\|_{2}^{2}\left(\sum_{j=1}^{n}y_{j}^{2}\right)+\sum_{j=1}^{n}y_{j}^{2}\|x_{j}\|_{2}^{2}-2\left\langle x_{k},\sum_{j=1}^{n}y_{j}x_{j}\right\rangle.\qed

We can extend our results to general \ell_{p}^{p} functions as well as a wide array of commonly used functions to measure (dis)similarity between vectors. For example, suppose each point x_{i} represents a probability distribution over the domain [d]:=\{1,\ldots,d\}. A widely used “distance” function over distributions is the KL-divergence defined as

f(x_{i},x_{j})=\text{D}_{\text{KL}}(x_{i}\,\|\,x_{j})=\sum_{k\in[d]}x_{i}(k)\log(x_{i}(k))-x_{i}(k)\log(x_{j}(k))=-H(x_{i})-\sum_{k\in[d]}x_{i}(k)\log(x_{j}(k)),

where H is the entropy function. Our techniques extend to the KL-divergence as well.

Algorithm 4 matrix-vector Query for KL Divergence
1:Input: Dataset X\subset\mathbb{R}^{d}
2:Output: z=Ay
3:procedure Query(y)
4:     S_{i}\leftarrow\sum_{j=1}^{n}y_{j}\log(x_{j}(i)) for all i\in[d]
5:     H_{i}\leftarrow H(x_{i}) for all i\in[n]
6:     Y\leftarrow\sum_{j=1}^{n}y_{j}
7:     z\leftarrow 0^{n}
8:     for k\in[n] do
9:         z(k)\leftarrow-H_{k}\cdot Y-\sum_{i=1}^{d}x_{k}(i)S_{i}
10:     end for
11:end procedure
Theorem A.2.

We can compute Ay in O(nd) query time.

Proof.

Note that computing all of the S_{i} and H_{i} takes O(nd) time. Now

(Ay)(k) =\sum_{j=1}^{n}y_{j}\text{D}_{\text{KL}}(x_{k}\,\|\,x_{j})
=-\sum_{j=1}^{n}y_{j}H(x_{k})-\sum_{j=1}^{n}y_{j}\sum_{i=1}^{d}x_{k}(i)\log(x_{j}(i))
=-H(x_{k})\left(\sum_{j=1}^{n}y_{j}\right)-\sum_{i=1}^{d}\sum_{j=1}^{n}y_{j}x_{k}(i)\log(x_{j}(i))
=-H_{k}\cdot Y-\sum_{i=1}^{d}x_{k}(i)S_{i},

as desired. ∎
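The following is a minimal NumPy sketch of Algorithm 4; the rows of X are assumed to be strictly positive probability vectors so that all logarithms are finite.

```python
import numpy as np

def kl_matvec(X, y):
    """Exact z = A y for A[k, j] = D_KL(x_k || x_j) in O(nd) time (Algorithm 4)."""
    logX = np.log(X)                  # assumes strictly positive entries
    S = logX.T @ y                    # S[i] = sum_j y_j log x_j(i)
    H = -np.sum(X * logX, axis=1)     # entropies H(x_k)
    Y = y.sum()
    return -H * Y - X @ S             # z(k) = -H_k * Y - sum_i x_k(i) * S_i
```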

A.1 General p

We now consider the case of a general non-negative even integer p.

Algorithm 5 matrix-vector Query for even p
1:Input: Dataset X\subset\mathbb{R}^{d}
2:Output: z=Ay
3:procedure Query(y)
4:     Compute all the values S_{i,t}:=\binom{p}{t}(-1)^{p-t}\sum_{j=1}^{n}y_{j}x_{j}(i)^{p-t} for all i\in[d] and t\in\{0,\ldots,p\}
5:     z\leftarrow 0^{n}
6:     for k\in[n] do
7:         z(k)\leftarrow\sum_{i=1}^{d}\sum_{t=0}^{p}x_{k}(i)^{t}S_{i,t}
8:     end for
9:end procedure
Theorem A.3.

We can compute Ay in O(ndp) query time.

Proof.

Consider the following calculation of the kth coordinate of Ay:

(Ay)(k) =\sum_{j=1}^{n}y_{j}\|x_{k}-x_{j}\|_{p}^{p}
=\sum_{j=1}^{n}y_{j}\sum_{i=1}^{d}(x_{k}(i)-x_{j}(i))^{p}
=\sum_{j=1}^{n}\sum_{i=1}^{d}\sum_{t=0}^{p}\binom{p}{t}y_{j}x_{k}(i)^{t}x_{j}(i)^{p-t}(-1)^{p-t}
=\sum_{i=1}^{d}\sum_{t=0}^{p}x_{k}(i)^{t}\binom{p}{t}(-1)^{p-t}\sum_{j=1}^{n}y_{j}x_{j}(i)^{p-t}
=\sum_{i=1}^{d}\sum_{t=0}^{p}x_{k}(i)^{t}S_{i,t}.

Note that computing S_{i,t} for all i and t takes O(ndp) time. Then returning the value of (Ay)_{k} takes O(dp) time for each k\in[n], resulting in the claimed runtime. ∎
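The following is a minimal NumPy sketch of Algorithm 5 with the binomial expansion written out explicitly.

```python
import numpy as np
from math import comb

def lp_even_matvec(X, y, p):
    """Exact z = A y for A[k, j] = ||x_k - x_j||_p^p (p even) in O(ndp) time (Algorithm 5)."""
    n, d = X.shape
    z = np.zeros(n)
    for t in range(p + 1):
        # S[i] = binom(p, t) * (-1)^(p - t) * sum_j y_j x_j(i)^(p - t)
        S = comb(p, t) * (-1) ** (p - t) * ((X ** (p - t)).T @ y)
        z += (X ** t) @ S
    return z
```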

The case of a general non-negative odd integer p follows in a straightforward manner by combining the above techniques with those of the p=1 case of Theorem 2.2, so we omit the proof.

Theorem A.4.

For odd integer p, we can compute Ay in O(nd\log n) preprocessing time and O(ndp) query time.

A.2 Other Distance Functions

In this section we initialize matrix-vector queries for a wide variety of “distance” functions.

‘Mixed’ \ell_{\infty}.

We consider the case of a ‘permutation invariant’ version of the \ell_{\infty} norm defined as follows:

f(x,y)=maxi[d],j[d]|xiyj|.f(x,y)=\max_{i\in[d],j\in[d]}|x_{i}-y_{j}|.

f is not a norm but we will refer to it as ‘mixed’ \ell_{\infty}.

Algorithm 6 Preprocessing
1:Input: Dataset X\subset\mathbb{R}^{d}
2:procedure Preprocessing
3:     for j\in[n] do
4:         \min_{j},\max_{j}\leftarrow minimum and maximum values of the entries of x_{j}, respectively.
5:     end for
6:end procedure
Algorithm 7 matrix-vector Query for mixed \ell_{\infty}
1:Input: Dataset X\subset\mathbb{R}^{d}
2:Output: z=Ay
3:procedure Query(\{\min_{j},\max_{j}\}_{j=1}^{n},y)
4:     z\leftarrow 0^{n}
5:     for k\in[n] do
6:         z(k)\leftarrow\sum_{j=1}^{n}y_{j}\cdot\max\left(|\min_{k}-\min_{j}|,|\min_{k}-\max_{j}|,|\max_{k}-\min_{j}|,|\max_{k}-\max_{j}|\right)
7:     end for
8:end procedure
Theorem A.5.

We can compute AyAy in O(nd)O(nd) preprocessing time and O(n2)O(n^{2}) query time.

Proof.

The preprocessing time holds because we calculate the maximum and minimum of a list of d numbers a total of n times. For the query time, note that each z(k) takes O(n) time to compute since we perform an O(1) operation for each index of the sum in Line 6 of Algorithm 7.

To prove correctness, note that for any two vectors x,y\in\mathbb{R}^{d}, the maximum value of |x_{i}-y_{j}| is attained when x_{i} and y_{j} are among the minimum/maximum values of the coordinates of x and y respectively. To see this, fix a value of x_{i}. We never decrease |x_{i}-y_{j}| by setting y_{j} to be either the maximum or the minimum coordinate of y, and the symmetric argument applies to x_{i}. ∎
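A short numpy sketch of Algorithms 6 and 7 follows (the brute-force check is our own illustration); the point is that after the O(nd) min/max preprocessing, the query only touches the 2n stored endpoints rather than the raw d-dimensional points.

import numpy as np

def mixed_linf_matvec(X, y):
    # Preprocessing: per-point minimum and maximum entries (O(nd)).
    lo, hi = X.min(axis=1), X.max(axis=1)
    # Query: A[k, j] is the largest of the four endpoint differences (O(n^2)).
    D = np.maximum.reduce([
        np.abs(lo[:, None] - lo[None, :]),
        np.abs(lo[:, None] - hi[None, :]),
        np.abs(hi[:, None] - lo[None, :]),
        np.abs(hi[:, None] - hi[None, :]),
    ])
    return D @ y

rng = np.random.default_rng(2)
X, y = rng.standard_normal((30, 5)), rng.standard_normal(30)
A = np.array([[np.max(np.abs(xk[:, None] - xj[None, :])) for xj in X] for xk in X])
assert np.allclose(mixed_linf_matvec(X, y), A @ y)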

Mahalanobis Distance Squared.

We consider the function

f(x,y)=xTMyf(x,y)=x^{T}My

for some d\times d matrix M. This bilinear form underlies the well-known squared Mahalanobis distance: (x-y)^{T}M(x-y)=x^{T}Mx+y^{T}My-2x^{T}My, and the two remaining terms contribute a matrix of rank at most two, so handling the bilinear form is the main task.

Algorithm 8 Preprocessing
1:Input: Dataset XdX\subset\mathbb{R}^{d}
2:procedure Preprocessing
3:     Sd×nS\leftarrow d\times n matrix where the jjth column is MxjMx_{j} for all j[n]j\in[n].
4:end procedure
Algorithm 9 matrix-vector Query for Mahalanobis distance squared
1:Input: Dataset XdX\subset\mathbb{R}^{d}
2:Output: z=Ayz=Ay
3:procedure Query( S,yS,y)
4:     vSyv\leftarrow Sy
5:     z0nz\leftarrow 0^{n}
6:     for k[n]k\in[n] do
7:         z(k)xk,vz(k)\leftarrow\langle x_{k},v\rangle
8:     end for
9:end procedure
Theorem A.6.

We can compute AyAy in O(nd2)O(nd^{2}) preprocessing time and O(nd)O(nd) query time.

Proof.

Note that the kkth coordinate of AyAy is given by

(Ay)(k)=\sum_{j=1}^{n}y_{j}x_{k}^{T}Mx_{j}=\left\langle x_{k},\sum_{j=1}^{n}y_{j}Mx_{j}\right\rangle=\langle x_{k},Sy\rangle

which proves correctness. It takes O(nd2)O(nd^{2}) time to compute SS, O(nd)O(nd) time to compute SySy, and then O(d)O(d) time to compute the kkth coordinate of AyAy for all k[n]k\in[n]. ∎
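A minimal numpy sketch of Algorithms 8 and 9 (the names and the check are our own):

import numpy as np

def bilinear_matvec(X, M, y):
    # Preprocessing: S = M X^T in O(nd^2) time; query: X (S y) in O(nd) time.
    S = M @ X.T
    return X @ (S @ y)

rng = np.random.default_rng(3)
X, M, y = rng.standard_normal((25, 7)), rng.standard_normal((7, 7)), rng.standard_normal(25)
assert np.allclose(bilinear_matvec(X, M, y), (X @ M @ X.T) @ y)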

Polynomial Kernels.

We now consider polynomial kernels of the form

f(x,y)=x,yp.f(x,y)=\langle x,y\rangle^{p}.
Theorem A.7.

We can compute AyAy in O(ndp)O(nd^{p}) query time.

Proof Sketch.

Consider the following expression

g(z)=j=1nyjz,xjpg(z)=\sum_{j=1}^{n}y_{j}\langle z,x_{j}\rangle^{p}

as a polynomial g:\mathbb{R}^{d}\rightarrow\mathbb{R} in the d coordinates of z. By rearranging, the above sum can be written as a sum over O(d^{p}) terms, corresponding to each monomial z_{1}^{a_{1}}\cdots z_{d}^{a_{d}} where a_{1}+\ldots+a_{d}=p. All O(d^{p}) coefficients can be computed in O(nd^{p}) total time given the points x_{j} and y (for example, by forming the p-fold tensor powers of each x_{j}). Once computed, we can evaluate the polynomial at z=x_{j} for all j, which gives the coordinates of Ay. Again, this can be viewed as “linearizing” the kernel given by \langle x,y\rangle^{p}. ∎

We note that a proof similar to that of Theorem A.7 was given in Section 5.3 of [ACSS20] by expanding the relevant quantity as a polynomial; see Section 1.2 for detailed comparison between [ACSS20] and our work.
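One way to realize the linearization in code is to materialize the p-fold tensor power of every point, so that A=\Phi\Phi^{T}; the numpy sketch below does exactly that (this explicit feature construction is our own illustration of the proof sketch, with the stated O(nd^{p}) cost).

import numpy as np

def tensor_features(X, p):
    # Rows are the p-fold tensor powers of the points, flattened to length d^p.
    n, d = X.shape
    Phi = np.ones((n, 1))
    for _ in range(p):
        Phi = (Phi[:, :, None] * X[:, None, :]).reshape(n, -1)
    return Phi

def poly_kernel_matvec(X, y, p):
    # A[k, j] = <x_k, x_j>^p = <phi(x_k), phi(x_j)>, so A y = Phi (Phi^T y).
    Phi = tensor_features(X, p)
    return Phi @ (Phi.T @ y)

rng = np.random.default_rng(4)
X, y = rng.standard_normal((20, 4)), rng.standard_normal(20)
assert np.allclose(poly_kernel_matvec(X, y, 3), ((X @ X.T) ** 3) @ y)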

A.3 Distances for Distributions

We now consider the case that each xix_{i} specifies a discrete distribution over a domain of dd elements. Matrices AA where Ai,jA_{i,j} is a function computing distances between distributions xix_{i} and xjx_{j} have recently been studied in machine learning.

We consider how to construct matrix-vector queries for such matrices for a range of widely used distance measures on distributions. First note that a result on the TV distance follows immediately from our 1\ell_{1} result.

Theorem A.8.

Suppose Ai,j=TV(xi,xj)A_{i,j}=\textup{TV}(x_{i},x_{j}). We can compute AyAy in O(ndlogn)O(nd\log n) preprocessing time and O(nd)O(nd) query time.

We now consider some other distance functions on distributions.

Divergences.

Through a similar calculation as the KL divergence case, we can also achieve O(nd)O(nd) query times if ff is the Jensen-Shannon divergence, defined as

f(x,y)=DKL(xy)+DKL(yx)2,f(x,y)=\frac{\text{D}_{\text{KL}}(x\,\|\,y)+\text{D}_{\text{KL}}(y\,\|\,x)}{2},

as well as the cross entropy function.

Theorem A.9.

Let ff be the Jensen-Shannon divergence or cross entropy function. Then AyAy can be computed in O(nd)O(nd) time.

Through a similar calculation (after the coordinate-wise map x(i)\mapsto\sqrt{x(i)}, the function below is simply an inner product), we can also perform matrix-vector multiplication queries in the case that

f(x,y)=i=1dx(i)y(i).f(x,y)=\sum_{i=1}^{d}\sqrt{x(i)y(i)}.

For probability distributions this quantity equals one minus the squared Hellinger distance under the standard normalization, so a matrix-vector query for the squared Hellinger distance matrix is simply (\sum_{j}y_{j})\mathbf{1}-Ay where A_{i,j}=f(x_{i},x_{j}).

Theorem A.10.

Let ff be the squared Hellinger distance. Then AyAy can be computed in O(nd)O(nd) time.
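After the coordinate-wise square root the matrix is a Gram matrix, so the query is two matrix-vector products with an n\times d matrix; a two-line numpy sketch (our own illustration):

import numpy as np

def sqrt_inner_matvec(X, y):
    # A[k, j] = sum_i sqrt(x_k(i) x_j(i)); rows of X are non-negative distributions.
    R = np.sqrt(X)
    return R @ (R.T @ y)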

A.4 Approximate Matrix-Vector Query for \texorpdfstring2\ell_{2}L-2

While our techniques do not extend to the 2\ell_{2} case for exact matrix-vector queries, we can nonetheless instantiate approximate matrix-vector queries for the 2\ell_{2} function. We first recall the following well known embedding result.

Theorem A.11.

Let ε(0,1)\varepsilon\in(0,1) and define T:dkT:\mathbb{R}^{d}\rightarrow\mathbb{R}^{k} by

T(x)i=1βkj=1dZijxj,i=1,,kT(x)_{i}=\frac{1}{\beta k}\sum_{j=1}^{d}Z_{ij}x_{j},\quad i=1,\ldots,k

where \beta=\sqrt{2/\pi} and the Z_{ij} are i.i.d. standard normal random variables. Then for every vector x\in\mathbb{R}^{d}, we have

\Pr[(1-\varepsilon)\|x\|_{2}\leq\|T(x)\|_{1}\leq(1+\varepsilon)\|x\|_{2}]\geq 1-e^{-c\varepsilon^{2}k},

where c>0c>0 is a constant.

We can instantiate approximate matrix-vector queries for f(x,y)=xy2f(x,y)=\|x-y\|_{2} via the following algorithm.

Algorithm 10 Preprocessing
1:Input: Dataset XdX\subset\mathbb{R}^{d}
2:procedure Preprocessing(TT)
3:     XTXX^{\prime}\leftarrow TX where TT is the linear map from Theorem A.11
4:     Run Algorithm 1 on XX^{\prime}
5:end procedure

For queries, we just run Algorithm 2 on XX^{\prime}. We have the following guarantee:

Theorem A.12.

Let Ai,j=xixj2A_{i,j}=\|x_{i}-x_{j}\|_{2}. There exists a matrix BB such that we can compute ByBy in O(nd2+ndlogn)O(nd^{2}+nd\log n) preprocessing time and O(nlog(n)/ε2)O(n\log(n)/\varepsilon^{2}) query time where

ABFεAF\|A-B\|_{F}\leq\varepsilon\|A\|_{F}

with probability 11/poly(n)1-1/\textup{poly}(n).

Proof.

The preprocessing and query time follow from the time required to apply the transformation T from Theorem A.11 to our set of points X as well as the time needed for the \ell_{1} matrix-vector query result of Theorem 2.1. The Frobenius norm guarantee follows from the fact that every entry of A is approximated to within a 1\pm\varepsilon relative error in B by Theorem A.11. ∎
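The two ingredients can be sketched directly in numpy: a per-coordinate sorting-and-prefix-sum routine in the spirit of the \ell_{1} query of Theorem 2.1 (re-sorting on every call rather than caching the preprocessing), composed with the scaled Gaussian map of Theorem A.11. The function names and the default target dimension k below are our own.

import numpy as np

def l1_matvec(X, y):
    # Exact A @ y for A[k, j] = ||x_k - x_j||_1, handled coordinate by coordinate.
    n, d = X.shape
    z = np.zeros(n)
    for i in range(d):
        order = np.argsort(X[:, i])
        v, w = X[order, i], y[order]
        pw, pwv = np.cumsum(w), np.cumsum(w * v)      # prefix sums of weights and weight*value
        rank = np.empty(n, dtype=int)
        rank[order] = np.arange(1, n + 1)             # 1-indexed position of x_k(i) in sorted order
        a, W, WV = X[:, i], pw[-1], pwv[-1]
        below_w, below_wv = pw[rank - 1], pwv[rank - 1]
        # sum_j y_j |x_k(i) - x_j(i)| split into points below and above x_k(i)
        z += a * below_w - below_wv + (WV - below_wv) - a * (W - below_w)
    return z

def approx_l2_matvec(X, y, k=200, seed=0):
    # Embed l_2 into l_1 with a scaled Gaussian map (beta = sqrt(2/pi)), then
    # answer the query exactly on the embedded points.
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((X.shape[1], k))
    Xp = (X @ Z) / (np.sqrt(2.0 / np.pi) * k)
    return l1_matvec(Xp, y)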

A.5 Matrix-Vector Query Lower Bounds

Table 1 shows that we can instantiate matrix-vector queries for a variety of distance functions in O(nd) time. It is straightforward to see that this bound is optimal for a large class of distance matrices.

Theorem A.13.

Consider the case that Ai,j=f(xi,xj)A_{i,j}=f(x_{i},x_{j}) satisfying f(x,x)=0f(x,x)=0 for all xx. Further assume that for all xx, there exists an input yy such that f(x,y)=1f(x,y)=1. An algorithm which outputs an entry-wise approximation of AzAz to any constant factor for input zz requires Ω(nd)\Omega(nd) time in the worst case.

Proof.

We consider two cases for the input points of A. In the first case, all points in our dataset X are identical. In the second case, a uniformly random point is at distance 1 (under f) from the remaining n-1 identical points. Computing the product of A with the all ones vector distinguishes the two cases: the entries of A\mathbf{1} sum to 0 in the first case, whereas they sum to at least n-1 in the second case. Thus to approximate A\mathbf{1} entry-wise to any constant factor, we must distinguish the two cases. If we read o(n) points, then with good probability we miss the outlier point and cannot distinguish the two cases. Thus, we must read \Omega(n) points, requiring \Omega(nd) time. ∎

Appendix B When Do Our Upper Bound Techniques Work?

By this point, we have seen many examples of matrix-vector queries which can be instantiated, as well as a lower bound for a natural distance function which prohibits any subquadratic time algorithm. Naturally, we are thus interested in the limits of our upper bound techniques for instantiating fast matrix-vector products. An understanding of such limits sheds light on families of structured matrices which may admit fast matrix-vector queries in general. In this section we fully characterize the capabilities of our upper bound methods and show that, essentially, our techniques can only work for “linear” functions (in a possibly different basis).

First we set some notation. Let AA be a n×nn\times n matrix we wish to compute where each (i,j)(i,j) entry is given by f(xi,xj)f(x_{i},x_{j}). Given a query vector znz\in\mathbb{R}^{n}, the kkth coordinate of AzAz is given by

(Az)(k)=i=1nzif(xk,xi).(Az)(k)=\sum_{i=1}^{n}z_{i}f(x_{k},x_{i}).

An example choice of f is given by f(x,y)=\sum_{j=1}^{d}x(j)\log(y(j)) (assuming all the coordinates of x and y are strictly positive; note this is related to the cross entropy function in Table 1).

We first highlight the major steps which are common to all of our upper bound algorithms using ff as an example. Our upper bound technique proceeds as follows:

  • Break up f(x,y)f(x,y) into a sum over dd terms: j=1dx(j)log(y(j))\sum_{j=1}^{d}x(j)\log(y(j)).

  • Switch the order of summation:

    (Az)(k)=i=1nzif(xk,xi)=j=1di=1nzixk(j)log(xi(j)).(Az)(k)=\sum_{i=1}^{n}z_{i}f(x_{k},x_{i})=\sum_{j=1}^{d}\sum_{i=1}^{n}z_{i}x_{k}(j)\log(x_{i}(j)).
  • Evaluate each of the inner dd summations with 11 evaluation each (after some preprocessing). In other words, for a fixed jj, each of the sums i=1nzixk(j)log(xi(j))\sum_{i=1}^{n}z_{i}x_{k}(j)\log(x_{i}(j)) can be computed as one evaluation, namely xk(j)(i=1nzilog(xi(j)))x_{k}(j)\cdot\Big{(}\sum_{i=1}^{n}z_{i}\log(x_{i}(j))\Big{)} and in preprocessing, we can compute i=1nzilog(xi(j))\sum_{i=1}^{n}z_{i}\log(x_{i}(j)) as it does not depend on the coordinate kk.
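Concretely, for the running example the outline amounts to two dense matrix-vector products; the following numpy sketch (our own, assuming strictly positive coordinates) is the whole query.

import numpy as np

def xlogy_matvec(X, z):
    # (Az)(k) = sum_j x_k(j) * s(j) with s(j) = sum_i z_i log x_i(j) precomputed once.
    s = z @ np.log(X)
    return X @ s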

The key steps of the above outline, namely switching the order of summation and precomputation of repeated terms, can be encapsulated in the following framework.

Theorem B.1.

Suppose there exist mappings T1,T2:ddT_{1},T_{2}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d^{\prime}} (possibly non-linear) and a continuous g:×g:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R} such that for every kk,

(Az)(k)\displaystyle(Az)(k) =i=1nzif(xk,xi)\displaystyle=\sum_{i=1}^{n}z_{i}f(x_{k},x_{i})
=i=1nzij=1dg(T1(xk)(j),T2(xi)(j))(breaking f into sum over d terms)\displaystyle=\sum_{i=1}^{n}z_{i}\sum_{j=1}^{d^{\prime}}g\left(T_{1}(x_{k})(j),T_{2}(x_{i})(j)\right)\quad\textup{(breaking $f$ into sum over $d^{\prime}$ terms)}
=j=1di=1nzig(T1(xk)(j),T2(xi)(j))(switching order of summation).\displaystyle=\sum_{j=1}^{d^{\prime}}\sum_{i=1}^{n}z_{i}\cdot g\left(T_{1}(x_{k})(j),T_{2}(x_{i})(j)\right)\quad\textup{(switching order of summation).}

Further suppose that each of the terms i=1nzig(T1(xk)(j),T2(xi)(j))\sum_{i=1}^{n}z_{i}\cdot g\left(T_{1}(x_{k})(j),T_{2}(x_{i})(j)\right) can be evaluated as

i=1nzig(T1(xk)(j),T2(xi)(j))=g(T1(xk)(j),i=1nziT2(xi)(j))\sum_{i=1}^{n}z_{i}\cdot g\left(T_{1}(x_{k})(j),T_{2}(x_{i})(j)\right)=g\left(T_{1}(x_{k})(j),\sum_{i=1}^{n}z_{i}T_{2}(x_{i})(j)\right)

for any choice of the vector zz. Then g(a,b)g(a,b) must be a linear function in bb.

Theorem B.1 is stated in quite general terms. We are stipulating the following: the functions T_{1},T_{2} represent possibly non-linear transformations of x and y respectively into \mathbb{R}^{d^{\prime}} such that f(x,y) can be decomposed as a sum over d^{\prime} function evaluations. Each function evaluation takes in the same coordinate, say the jth coordinate, of both T_{1}(x) and T_{2}(y) and computes the function g(T_{1}(x)(j),T_{2}(y)(j)). Finally, the resulting sum \sum_{i=1}^{n}z_{i}\cdot g(T_{1}(x_{k})(j),T_{2}(x_{i})(j)) can be computed as g\big(T_{1}(x_{k})(j),\sum_{i=1}^{n}z_{i}T_{2}(x_{i})(j)\big).

If these conditions hold (which is precisely the case in the proof of all our upper bound results), then it must be the case that gg has a very special form, in particular, gg must be a linear function in its second variable. To make the setting more concrete, we map the terminology of Theorem B.1 into some examples from our upper bound results.

First consider the case that f(x,y)=\langle x,y\rangle. In this case, both T_{1} and T_{2} are the identity maps and g(a,b)=ab. It is indeed the case that g(a,b) is linear in b. Now consider a slightly more complicated choice f(x,y)=\sum_{j=1}^{d}x(j)\log(y(j)). Here, T_{1} is the identity but T_{2} is the coordinate-wise map T_{2}(y)=[\log(y_{1}),\ldots,\log(y_{d})]. The function g again satisfies g(a,b)=ab. Finally we consider the example f(x,y)=\|x-y\|_{2}^{2} which sets d^{\prime}\gg d. In particular, the mappings T_{1},T_{2} expand x,y into O(d^{2})-dimensional vectors whose coordinates represent all pairwise products of coordinates of x and y respectively. (The reader may realize that this particular case is an example of “linearizing” the kernel given by f.) Again g is the same function as before.

The proof of Theorem B.1 relies on the following classical result on the solutions of Cauchy’s functional equation.

Theorem B.2.

Let t:t:\mathbb{R}\rightarrow\mathbb{R} be a continuous function which satisfies t(x+y)=t(x)+t(y)t(x+y)=t(x)+t(y) for all inputs x,yx,y in its domain. Then tt must be a linear function.

For us the hypothesis that t is continuous suffices, but it is known that the above result follows from much weaker hypotheses. We refer the reader to [Kuc09] for references related to Cauchy's functional equation.

Proof of Theorem B.1.

Our goal is to show that if

i=1nzig(T1(xk)(j),T2(xi)(j))=g(T1(xk)(j),i=1nziT2(xi)(j))\sum_{i=1}^{n}z_{i}\cdot g\left(T_{1}(x_{k})(j),T_{2}(x_{i})(j)\right)=g\left(T_{1}(x_{k})(j),\sum_{i=1}^{n}z_{i}T_{2}(x_{i})(j)\right)

for all z and choices of input points x_{i}, then g must be linear in the second variable. First set z_{1}=z_{2}=1 and z_{j}=0 for all j\geq 3. For ease of notation, denote q:=T_{1}(x_{k})(j). As we vary the coordinates of the points x_{1} and x_{2}, the values T_{2}(x_{1})(j) and T_{2}(x_{2})(j) vary over the range of T_{2}. Thus,

g(q,a)+g(q,b)=g(q,a+b)g(q,a)+g(q,b)=g(q,a+b)

for all possible inputs a,ba,b. However, this is exactly the hypothesis of Theorem B.2 so it follows that gg must be a linear function in its second coordinate, as desired. ∎

While the proof of Theorem B.1 is straightforward, it precisely captures the scenarios where our upper bound techniques apply. In short, it implies that f must have a linear structure, under a suitable change of basis, for our techniques to hold. If this is not the case, then our techniques do not apply and new ideas are needed. Nevertheless, as displayed by the versatility of examples in Table 1, such a structure is quite common in many applications where matrices of distance or similarity functions arise.

The observant reader might wonder how our result for the \ell_{1} function fits into the above framework, as it is not obviously linear. However, we note that the function h_{j}(x)=\sum_{i=1}^{n}|x(j)-x_{i}(j)| (which plays the role of the sum \sum_{i=1}^{n}z_{i}\cdot g(T_{1}(x_{k})(j),T_{2}(x_{i})(j)) in the statement of Theorem B.1) is actually a piecewise linear function in x(j). The sorting preprocessing we performed for Theorem 2.1 can be thought of as creating a data structure which allows us to efficiently index into the correct linear piece.

Appendix C Applications of Matrix-Vector Products

C.1 Preliminary Tools

We highlight specific prior results which we use in conjunction with our matrix-vector query upper bounds to obtain improved algorithmic results. First we recall a result of [BCW22] which gives a nearly optimal low-rank approximation result in terms of the number of matrix-vector queries required.

Theorem C.1 (Theorem 5.1 in [BCW22]).

Given matrix-vector query access to a matrix An×nA\in\mathbb{R}^{n\times n}, accuracy parameter ε(0,1),k[n]\varepsilon\in(0,1),k\in[n] and any p1p\geq 1, there exists an algorithm which uses O~(k/ε1/3)\tilde{O}(k/\varepsilon^{1/3}) matrix-vector queries and outputs a n×kn\times k matrix ZZ with orthonormal columns such that with probability at least 9/109/10,

A(IZZT)p(1+ε)minU:UTU=IkA(IUUT)p\|A(I-ZZ^{T})\|_{p}\leq(1+\varepsilon)\min_{U:U^{T}U=I_{k}}\|A(I-UU^{T})\|_{p}

where \|M\|_{p}=(\sum_{i=1}^{n}\sigma_{i}(M)^{p})^{1/p} is the p-th Schatten norm and \sigma_{1}(M),\ldots,\sigma_{n}(M) are the singular values of M. The runtime of the algorithm is \tilde{O}(Tk/\varepsilon^{1/3}+nk^{w-1}/\varepsilon^{(w-1)/3}) where T is the time needed to compute a matrix-vector query.
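To illustrate the interface (but not the algorithm) of Theorem C.1, the sketch below runs plain randomized subspace iteration for a symmetric A while touching A only through a matvec callback; the function, its parameters, and the Rayleigh-Ritz truncation are our own simplified stand-in for the sharper method of [BCW22].

import numpy as np

def lowrank_from_matvec(matvec, n, k, iters=4, oversample=10, seed=0):
    # Returns Z (n x k, orthonormal columns); Z Z^T A is the rank-k approximation.
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.standard_normal((n, k + oversample)))[0]
    for _ in range(iters):
        Y = np.column_stack([matvec(q) for q in Q.T])   # A @ Q, one query per column
        Q = np.linalg.qr(Y)[0]
    AQ = np.column_stack([matvec(q) for q in Q.T])
    B = Q.T @ AQ                                        # small Rayleigh-Ritz matrix
    _, _, Vt = np.linalg.svd(B)
    return Q @ Vt[:k].T

# Example: plug in any of the fast distance-matrix queries from Appendix A, e.g.
#   Z = lowrank_from_matvec(lambda v: l1_matvec(X, v), n=X.shape[0], k=10)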

The second result is from [MM15], which gives an optimized analysis of a variant of the power method for computing the top k singular values.

Theorem C.2 (Theorem 11 and 77 in [MM15]).

Given matrix-vector query access to a matrix An×nA\in\mathbb{R}^{n\times n}, accuracy parameter ε(0,1),k[n]\varepsilon\in(0,1),k\in[n], there exists an algorithm which uses O~(k/ε1/2)\tilde{O}(k/\varepsilon^{1/2}) matrix-vector queries and outputs a 1±ε1\pm\varepsilon approximation to the top kk singular values of AA. The runtime of the algorithm is O~(Tk/ε1/2+nk2/ε+k3/ε3/2)\tilde{O}(Tk/\varepsilon^{1/2}+nk^{2}/\varepsilon+k^{3}/\varepsilon^{3/2}).

Lastly, we recall the guarantees of the classical conjugate gradient descent method.

Theorem C.3.

Let AA be a symmetric PSD matrix and consider the linear system Ax=bAx=b and let x=argminxAxb2x^{*}=\text{argmin}_{x}\|Ax-b\|_{2}. Let κ\kappa denote the condition number of AA. Given a starting vector x0x_{0}, the conjugate gradient descent algorithm uses O(κlog(1/ε))O(\sqrt{\kappa}\log(1/\varepsilon)) matrix-vector queries and returns xx such that

xxAεx0xA\|x-x^{*}\|_{A}\leq\varepsilon\|x_{0}-x^{*}\|_{A}

where xA=(xTAx)1/2\|x\|_{A}=(x^{T}Ax)^{1/2} denotes the AA-norm.

Note that some matrices in our setting are PSD, for example if A_{i,j}=\langle x_{i},x_{j}\rangle. For non-PSD matrices A, one can also use the conjugate gradient descent method on the matrix A^{T}A, which squares the condition number. Alternatively, there are more sophisticated algorithms which work directly with the matrix-vector queries of A for non-PSD matrices; for example, see the references in Chapters 6 and 7 of [AG11]. We omit their discussion for clarity and just note that in practice, iterative methods which directly use matrix-vector queries are preferred for linear system solving.
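In code, "matrix-free" solvers take exactly this form; the snippet below wraps a matvec callback in a scipy LinearOperator and hands it to conjugate gradient (our own illustration; it assumes the wrapped matrix is symmetric PSD, e.g. a Gram-type query such as the Mahalanobis one above with a PSD M).

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_with_matvec(matvec, b):
    # Conjugate gradient only applies A to vectors, so the n x n matrix is never formed.
    n = b.shape[0]
    A_op = LinearOperator((n, n), matvec=matvec, dtype=float)
    x, info = cg(A_op, b)
    return x if info == 0 else None   # info > 0: iteration limit reached without converging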

C.2 Applications

We now derive specific applications using prior results from “matrix-free” methods. First we cover low-rank approximation.

For the 1\ell_{1} and 22\ell_{2}^{2} distance matrices, we improve upon prior works for computing a relative error low-rank approximation. While we can obtain such an approximation for a wide variety of Schatten norms, we state the bound in terms of the Frobenius norm since it has been studied in prior works.

Theorem C.4.

Let p1p\geq 1 and consider the case that Ai,j=xixjppA_{i,j}=\|x_{i}-x_{j}\|_{p}^{p} for all i,ji,j. We can compute a matrix BB such that

ABF(1+ε)AAkF\|A-B\|_{F}\leq(1+\varepsilon)\|A-A_{k}\|_{F}

where AkA_{k} denotes the optimal rank-kk approximation to AA in Frobenius norm. The runtime is O~(ndpk/ε1/3+nkw1/ε(w1)/3)\tilde{O}(ndpk/\varepsilon^{1/3}+nk^{w-1}/\varepsilon^{(w-1)/3}).

Proof.

The theorem follows from combining the matrix-vector query runtime of Table 1 and Theorem C.1. ∎

Note that the best prior result for the special case of \ell_{1} and \ell_{2}^{2} is from [BCW20], where they obtained a runtime bound of O(ndk/\varepsilon+nk^{w-1}/\varepsilon^{w-1}). Thus our bound improves upon this by a multiplicative factor of \text{poly}(1/\varepsilon). We point out that the bound of O(ndk/\varepsilon+nk^{w-1}/\varepsilon^{w-1}) is actually optimal for the class of algorithms which sample the entries of A. Thus our results show that if we know the set of points beforehand, which is a natural assumption, one can overcome such lower bounds.

For the case of A_{i,j}=\|x_{i}-x_{j}\|_{2}, we cannot hope to achieve a relative error guarantee for low-rank approximation since we only have fast matrix-vector queries to the matrix B where B_{i,j}=(1\pm\varepsilon)\|x_{i}-x_{j}\|_{2} via Theorem A.12. Nevertheless, we can still obtain an additive error low-rank approximation guarantee which outperforms prior works. First we show that our approximate matrix-vector queries are sufficient to obtain such a guarantee.

Lemma C.5.

Let A,BA,B satisfy ABFεAF\|A-B\|_{F}\leq\varepsilon\|A\|_{F} and suppose AA^{\prime} and BB^{\prime} are the best rank-rr approximation of AA and BB respectively in the Frobenius norm. Then

ABFAAF+2εAF.\|A-B^{\prime}\|_{F}\leq\|A-A^{\prime}\|_{F}+2\varepsilon\|A\|_{F}.
Proof.

We have

ABF\displaystyle\|A-B^{\prime}\|_{F} ABF+BBF\displaystyle\leq\|A-B\|_{F}+\|B-B^{\prime}\|_{F}
εAF+BAF\displaystyle\leq\varepsilon\|A\|_{F}+\|B-A^{\prime}\|_{F}
εAF+BAF+AAF\displaystyle\leq\varepsilon\|A\|_{F}+\|B-A\|_{F}+\|A-A^{\prime}\|_{F}
AAF+2εAF.\displaystyle\leq\|A-A^{\prime}\|_{F}+2\varepsilon\|A\|_{F}.\qed
Theorem C.6.

Let Ai,j=xixj2A_{i,j}=\|x_{i}-x_{j}\|_{2} for all i,ji,j. We can compute a matrix BB such that

ABFAAkF+εAF\|A-B\|_{F}\leq\|A-A_{k}\|_{F}+\varepsilon\|A\|_{F}

with probability 11/poly(n)1-1/\textup{poly}(n) where AkA_{k} denotes the optimal rank-kk approximation to AA in Frobenius norm. The runtime is O~(ndk/ε1/3+nkw1/ε(w1)/3)\tilde{O}(ndk/\varepsilon^{1/3}+nk^{w-1}/\varepsilon^{(w-1)/3}).

Proof.

The runtime follows from applying Theorem C.1 on the matrix created after applying Theorems A.11 and A.12. The approximation guarantee follows from Lemma C.5. ∎

The best prior work for additive error low-rank approximation for this case is due to [IVWW19] which obtained such a guarantee with runtime O~(ndpoly(k,1/ε))\tilde{O}(nd\cdot\text{poly}(k,1/\varepsilon)) for a large unspecified polynomial in kk and 1/ε1/\varepsilon. Lastly we note that our relative error low-rank approximation guarantee holds for any ff in Table 1, as summarized in Table 2.

Theorem C.7.

Suppose we have exact matrix-vector query access to a matrix AA with each query taking time TT. Then we can output a matrix BB such that

ABF(1+ε)AAkF\|A-B\|_{F}\leq(1+\varepsilon)\|A-A_{k}\|_{F}

where AkA_{k} denotes the optimal rank-kk approximation to AA in Frobenius norm. The runtime is O~(Tk/ε1/3+nkw1/ε(w1)/3)\tilde{O}(Tk/\varepsilon^{1/3}+nk^{w-1}/\varepsilon^{(w-1)/3}).

Directly appealing to Theorems C.2 and C.3 in conjunction with our matrix-vector queries achieves the fastest runtime for computing the top kk singular values and solving linear systems for a wide variety of distance matrices that we are aware of.

Theorem C.8.

Suppose we have exact matrix-vector query access to a matrix A with each query taking time T. We can compute a 1\pm\varepsilon approximation to the top k singular values of A in time \tilde{O}(Tk/\varepsilon^{1/2}+nk^{2}/\varepsilon+k^{3}/\varepsilon^{3/2}). Furthermore, we can solve linear systems for A with the same guarantees as any iterative method which only uses matrix-vector queries, with a multiplicative overhead of T.

Finally, we can also perform matrix multiplication faster with distance matrices compared to the general runtime of nwn2.37n^{w}\approx n^{2.37}. This follows from the following lemma.

Lemma C.9.

Suppose A\in\mathbb{R}^{n\times n} admits an exact matrix-vector query algorithm running in time T. Then for any B\in\mathbb{R}^{n\times n}, we can compute AB in time O(Tn).

Proof.

We can compute ABAB by computing the product of AA with the nn columns of BB separately. ∎
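In code this is a one-liner (our own illustration):

import numpy as np

def matmul_via_matvec(matvec, B):
    # Compute A @ B column by column, using one matrix-vector query per column of B.
    return np.column_stack([matvec(B[:, j]) for j in range(B.shape[1])])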

As a corollary, we obtain faster matrix multiplication for every family of matrices for which we have obtained a fast matrix-vector query. We state one such corollary for the \ell_{p}^{p} case.

Corollary C.10.

Let p1p\geq 1 and consider the case that Ai,j=xixjppA_{i,j}=\|x_{i}-x_{j}\|_{p}^{p} for all i,ji,j. For any other matrix BB, we can compute ABAB in time O(n2dp)O(n^{2}dp).

We can improve upon the above result slightly if we are multiplying two distance matrices for the p=2p=2 case.

Lemma C.11.

Consider the case that Ai,j=xixj22A_{i,j}=\|x_{i}-x_{j}\|_{2}^{2} for all i,ji,j and Bi,j=yiyj22B_{i,j}=\|y_{i}-y_{j}\|_{2}^{2}, i.e., both AA and BB are n×nn\times n matrices with f=22f=\ell_{2}^{2}. We can compute ABAB in time O(n2dw2)O(n^{2}d^{w-2}).

Proof.

Since \|x_{i}-x_{j}\|_{2}^{2}=\|x_{i}\|_{2}^{2}+\|x_{j}\|_{2}^{2}-2\langle x_{i},x_{j}\rangle, both A and B decompose into -2XX^{T} and -2YY^{T} respectively, plus matrices of rank at most two whose products with any n\times n matrix can be formed in O(n^{2}+nd) additional time. Hence it suffices to compute the product XX^{T}YY^{T} where X,Y\in\mathbb{R}^{n\times d} are the matrices with the points x_{i} and y_{i} in the rows respectively. Z_{1}:=X^{T}Y\in\mathbb{R}^{d\times d} can be computed in O(nd^{2}) time. Then Z_{2}:=XZ_{1}\in\mathbb{R}^{n\times d} can also be computed in O(nd^{2}) time. Finally, we need to compute Z_{2}\times Y^{T}. This can be done in O(n^{2}d^{w-2}) time by decomposing both Z_{2} and Y^{T} into n/d many d\times d square matrices and using the standard matrix multiplication bound on each pair of square matrices. This results in the claimed runtime of O((n/d)^{2}\cdot d^{w})=O(n^{2}d^{w-2}). ∎
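The heart of the proof is the regrouping X(X^{T}Y)Y^{T}; the numpy sketch below shows the grouping (the final n\times d by d\times n product is done here the straightforward O(n^{2}d) way, since numpy does not perform the d\times d blocking used to reach O(n^{2}d^{w-2})).

import numpy as np

def gram_product(X, Y):
    # (X X^T)(Y Y^T) without ever forming an n x n Gram matrix as an intermediate.
    Z1 = X.T @ Y          # d x d, O(n d^2)
    Z2 = X @ Z1           # n x d, O(n d^2)
    return Z2 @ Y.T       # n x n, the only term whose cost grows with n^2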

Appendix D A Fast Algorithm for Creating \ell_{1} and \ell_{2} Distance Matrices

We now present a fast algorithm for creating distance matrices which addresses our contribution (3)(3) stated in the introduction. Given a set of nn points x1,,xnx_{1},\ldots,x_{n} in d\mathbb{R}^{d}, our goal is to initialize an approximate n×nn\times n distance matrix BB for the 1\ell_{1} distance which satisfies

Bij=(1±ε)xixj1B_{ij}=(1\pm\varepsilon)\|x_{i}-x_{j}\|_{1} (1)

for all entries of B where 0<\varepsilon<1 is a precision parameter. The straightforward way to create the exact distance matrix takes O(n^{2}d) time, and by using the stability of Cauchy random variables, we can create a B which satisfies (1) in O(n^{2}\log n) time for any constant \varepsilon. (Note the Johnson-Lindenstrauss lemma implies a similar guarantee for the \ell_{2} distance matrix.) The goal of this section is to improve upon this ‘baseline’ runtime of O(n^{2}\log n). The final runtime guarantees of this section will be of the form O(n^{2}\cdot\text{poly}(\log\log n)).

Our improvement holds in the Word-RAM model of computation. Formally, we assume each memory cell (i.e. word) can hold O(\log n) bits and certain computations on words take O(1) time. The only assumptions we require are that the arithmetic operations of adding or subtracting words, as well as performing left or right bit shifts on words, take constant time.

We first present prior work on metric compression of [IRW17] in Section D.1. Our algorithm description starts from Section D.2, which describes our preprocessing step. Section D.3 then presents our key algorithmic ideas, whose runtime and accuracy are analyzed in Sections D.4 and D.5.

D.1 Metric Compression Tree of [IRW17]

The starting point of our result is the metric compression tree construction of [IRW17], whose properties we summarize below. First we introduce some useful definitions. The aspect ratio \Phi of X is defined as

Φ=maxi,jxixj1minijxixj1.\Phi=\frac{\max_{i,j}\|x_{i}-x_{j}\|_{1}}{\min_{i\neq j}\|x_{i}-x_{j}\|_{1}}.

Let Δ=maxi[n]x1xi1\Delta^{\prime}=\max_{i\in[n]}\|x_{1}-x_{i}\|_{1} and Δ=2logΔ\Delta=2^{\lceil\log\Delta^{\prime}\rceil}.

Theorem 11 of [IRW17] implies the following result. Given a dataset X=\{x_{1},\ldots,x_{n}\}\subset\mathbb{R}^{d} with aspect ratio \Phi, there exists a tree data structure T which allows for the computation of a compressed representation of X for the purposes of distance computations. At a high level, T is created by enclosing X in a large enough and appropriately shifted axis-parallel square and then recursively dividing into smaller squares (also called cells) with half the side-length until all points of X are contained in their own cell. The edges of T encode the cell containment relationships. Formally, T has the following properties:

  • The leaf nodes of TT correspond to the points of XX.

  • The edges of TT are of two types: short edges and long edges which are defined as follows. Short edges have a length dd bit vector associated with them whereas long edges have an integer O(logΦ)\leq O(\log\Phi) associated with them.

  • Each long edge with associated integer kk represents a non-branching path of length kk of short edges, all of whose associated length dd bit vectors are the 0 string.

  • Each node of T (including the nodes that are on paths which are compressed as long edges) has an associated level -\infty<\ell\leq\log(4\Delta). A level \ell of a node v corresponds to an axis-parallel square G_{\ell} of side length 2^{\ell} which contains all axis-parallel squares of child nodes of v.

The notion of a padded point is important for the metric compression properties of TT.

Definition D.1 (Padded Point).

A point xix_{i} is (ε,Λ,)(\varepsilon,\Lambda,\ell)-padded, if the grid cell GG_{\ell} of side length 22^{\ell} that contains xix_{i} also contains the ball of radius ρ()\rho(\ell) centered at xix_{i}, where

ρ()=8ε12Λd.\rho(\ell)=8\varepsilon^{-1}2^{\ell-\Lambda}\sqrt{d}.

We say that xix_{i} is (ε,Λ)(\varepsilon,\Lambda)-padded in TT, if it is (ε,Λ,)(\varepsilon,\Lambda,\ell)-padded for every level \ell.

The following lemma is proven in [IRW17]. First define

Λ=log(16d1.5logΦ/(εδ)).\Lambda=\log(16d^{1.5}\log\Phi/(\varepsilon\delta)). (2)
Lemma D.1 (Lemma 1 in [IRW17]).

Consider the construction of TT defined formally in Section 33 of [IRW17]. Every point xix_{i} is (ε,Λ)(\varepsilon,\Lambda)-padded in T with probability 1δ1-\delta.

Now let x be any point in our dataset. We can construct \widetilde{x}\in\mathbb{R}^{d}, called the decompression of x, from T with the following procedure: we follow the downward path from the root of T to the leaf associated with x and collect a bit string for each of the d coordinates of \widetilde{x}. When going down a short edge with an associated bit vector b, we concatenate the ith bit of b to the end of the bit string that we are keeping track of for the ith coordinate of \widetilde{x}. When going down a long edge, we concatenate a number of zeros equalling the integer associated with the long edge. Finally, a binary point is placed in the resulting bit string of each coordinate after the bit corresponding to level 0. The collected bits then correspond to the binary expansion of the coordinates of \widetilde{x}. For a more thorough description of the decompression scheme, see Section 33 of [IRW17].

The decompression scheme is useful because it allows approximate distance computations.

Lemma D.2 (Lemma 2 in [IRW17]).

If a point xix_{i} is (ε,Λ)(\varepsilon,\Lambda)-padded in TT, then for every j[n]j\in[n],

(1ε)x~ix~j1xixj1(1+ε)x~ix~j1.(1-\varepsilon)\|\widetilde{x}_{i}-\widetilde{x}_{j}\|_{1}\leq\|x_{i}-x_{j}\|_{1}\leq(1+\varepsilon)\|\widetilde{x}_{i}-\widetilde{x}_{j}\|_{1}.

We now cite the runtime and space required for TT. The following theorem follows from the results of [IRW17].

Theorem D.3.

Let L=logΦ+ΛL=\log\Phi+\Lambda. TT has O(nΛ)O(n\Lambda) edges, height LL, its total size is O(ndΛ+nlogn)O(nd\Lambda+n\log n) bits, and its construction time is O(ndL)O(ndL).

We contrast the above guarantees with the naive representation of X which stores O(\log n) bits of precision for each coordinate and occupies O(nd\log n) bits of space, whereas T occupies roughly O(nd\log\log n+n\log n) bits.

Finally, Theorem 22 in [IRW17] implies we can create a collection of O(logn)O(\log n) trees TT (by setting δ\delta to be a small constant in (2)) such that every point in XX is padded in at least one tree in the collection.

D.2 Step 1: Preprocessing Metric Compression Trees

We now describe the preprocessing steps needed for our faster distance matrix compression. Let

w=4dΛlognw=\frac{4d\Lambda}{\log n} (3)

and recall our setting of Λ\Lambda in (2). Note that we assume ww is an integer which implicitly assumes 4dΛlogn4d\Lambda\geq\log n.

First we describe the preprocessing of TT. Consider a short edge of TT with an associated dd length bit string bb. We break up bb into ww equal chunks, each containing d/wd/w bits. Consider a single chunk cc. We pad (an equal number of) 2Λ2\Lambda many 0’s after each bit in cc so that the total number of bits is logn/2\log n/2. We then store each padded chunk in one word. We do this for every chunk resulting in ww words for each short edge and we do this for all short edges in all the trees.

The second preprocessing step we perform is creating a O(\sqrt{n})\times O(\sqrt{n}) table A. The rows and columns of A are indexed by all possible bit strings with \frac{\log n}{2} bits. The entries of A record evaluations of the function f(x,y) defined as follows: given x,y\in\{0,1\}^{\frac{\log n}{2}}, consider the partition of each of them into d/w blocks, each with an equal number of bits (2\Lambda bits per block; note that 2\Lambda\cdot d/w=(\log n)/2). Each block defines an integer. Doing so results in d/w integers x^{1},\ldots,x^{d/w} derived from x and d/w integers y^{1},\ldots,y^{d/w} derived from y. Finally,

f(x,y)=i=1d/w|xiyi|.f(x,y)=\sum_{i=1}^{d/w}|x^{i}-y^{i}|.
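For concreteness, one entry of the table A can be described by the following small routine (our own sketch), which splits two packed words into their fixed-width blocks and sums the absolute block differences; in the actual data structure all O(n) possible pairs of (\log n)/2-bit words are tabulated once so that each later evaluation is a single O(1) lookup.

def block_l1(xw, yw, num_blocks, block_bits):
    # xw, yw: packed words; each holds num_blocks integers of block_bits bits each.
    mask = (1 << block_bits) - 1
    total = 0
    for i in range(num_blocks):
        shift = i * block_bits
        total += abs(((xw >> shift) & mask) - ((yw >> shift) & mask))
    return total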

D.3 Step 2: Depth-First Search

We now calculate one row of the distance matrix, from a padded point x to all other points in our dataset. Our main subroutine is a tree search procedure. Its input is a node v in a tree T and it performs a depth-first search on the subtree rooted at v as described in Algorithm 11. Given an internal vertex v, it calculates all the distances between the padded point x and all data points in our dataset which are leaf nodes in the subtree rooted at v. A summary of the algorithm follows.

We perform a depth-first search starting at v. As we traverse the tree, we keep track of the current embedding of the internal nodes by collecting bits along the edges of T: we append bits as we descend the tree and remove them as we move up edges. However, we only keep track of this embedding up to 2\Lambda levels below v. After that, we continue traversing the tree but do not update the embedding. The reason is that after 2\Lambda levels, the embedding is precise enough, with respect to computing the distance to x, for all nodes below. Towards this end, we also track how many levels below v the tree search currently is and update this value appropriately.

The current embedding is tracked using w words. Recall that the bit string of every short edge has been repackaged into w words, each ‘responsible’ for d/w coordinates. Furthermore, in each word on the edge, we have padded 0's between the bits of each of the d/w coordinates. When we need to update the current embedding by incorporating the bits along a short edge e, we simply perform a bit shift on each of the w words on e and add it to the w words we are keeping track of. We need to make sure we place the bits ‘in order’: for each of the d coordinates of our tracked embedding, the bits on an edge e should precede the bits on the edge directly following e in the tree search. Due to the padding from the preprocessing step, the bit shift ensures that the bits on the short edges after e are placed in their appropriate positions, in order, in the embedding representation.

Algorithm 11 DFS in Subtree
1:Input: Metric Compression Tree TT, node vv
2:procedure Search( T,vT,v)
3:     Initialize a global counter pp for the number of levels which have been processed. Initially set to 0 and will be at most 2Λ2\Lambda
4:     Initialize ww words t1,,twt_{1},\ldots,t_{w}, all initially 0
5:     Initialize a global counter rr for the current level which is initially set to the level of vv in TT
6:     Perform a depth-first search in the subtree rooted at vv. Run Process-Short-Edge if a short edge is encountered, Process-Long-Edge if a long edge is encountered, and Process-Leaf when we arrive at a leaf.
7:end procedure

While performing the depth-first search, we will encounter both short and long edges. When encountering a short edge, we run the function Process-Short-Edge and similarly, we run Process-Long-Edge when a long edge is encountered. Finally if we arrive at a leaf node, we run Process-Leaf.

Algorithm 12 Process Short Edge
1:Input: Short edge ee, number of processed nodes pp, t1,,twt_{1},\ldots,t_{w}
2:procedure Process-Short-Edge(e,p,t1,,twe,p,t_{1},\ldots,t_{w})
3:     Let e1,,ewe_{1},\ldots,e_{w} be the ww words associated with edge ee
4:     If search is traversing down ee and p<2Λp<2\Lambda, add 2pei2^{-p}e_{i} to tit_{i} for all 1iw1\leq i\leq w and increment pp
5:     If search is traversing up ee and p2Λp\leq 2\Lambda, subtract 2pei2^{-p}e_{i} from tit_{i} for all 1iw1\leq i\leq w and decrement pp
6:     Update rr to the level of the current node
7:end procedure
Algorithm 13 Process Long Edge
1:Input: Long edge ee, number of processed nodes pp
2:procedure Process-Long-Edge(e,pe,p)
3:     If search is traversing down ee and p<2Λp<2\Lambda, increment pp
4:     If search is traversing up ee and p2Λp\leq 2\Lambda, decrement pp
5:     Update rr to the level of the current node
6:end procedure

When we arrive at a leaf node y, we have the decompression of y computed from the tree. Note that we have only kept track of the bits after node v (up to limited precision), since all prior bits are the same for y and x, as they lie in the same subtree. More specifically, we have w words t_{1},\ldots,t_{w}. The first word t_{1} has 2\Lambda bits of each of the first d/w coordinates of y. For every coordinate, the 2\Lambda bits respect the order described in the decompression step in Section D.2. A similar fact is true for the rest of the words t_{i}. Now to calculate the distance between x and y, we just have to consider the 2\Lambda bits of all d coordinates of x which come after descending down vertex v. We then repackage these 2d\Lambda total bits into w words in the same format as y. Note this preprocessing for x only happens once (at the subtree level) and can be used for all leaves in the subtree rooted at v.

Algorithm 14 Process Leaf
1:Input: t1,,twt_{1},\ldots,t_{w}
2:procedure Process-Leaf(y,t1,,twy,t_{1},\ldots,t_{w})
3:     Let the point yy correspond to the current leaf node
4:     Let s1,,sws_{1},\ldots,s_{w} denote the embedding of xx after node vv, preprocessed to be in the same format as t1,,twt_{1},\ldots,t_{w}
5:     Report i=1wA[ti,si]\sum_{i=1}^{w}A[t_{i},s_{i}] as the distance between xx and yy
6:end procedure

Finally, the complete algorithm just calls Algorithm 11 on successive parent nodes of xx. We mark each subtree that has already been processed (at the root node) so that the subtree is only ever visited once. The number of calls to Algorithm 11 is at most the height of the tree, which is bounded by O(logΦ+Λ)O(\log\Phi+\Lambda). We then repeat this for all points xx in our dataset (using the tree which xx is padded in) to create the full distance matrix.

D.4 Runtime Analysis

We consider the runtime required to compute the row corresponding to a padded point x in the distance matrix. Multiplying by n gives the total runtime. Consider the tree T in which x is padded and which we use for the algorithm described in the previous section, and recall the properties of T outlined in Theorem D.3. T has O(n\Lambda) edges, each of which is visited at most twice in the tree search (going down and up). Thus the time to traverse the tree is O(n\Lambda). There are also at most O(n\Lambda) short edges in T. Updating the embedding given by t_{1},\ldots,t_{w} takes O(w) time per edge since it can be done in O(w) total word operations. Long edges do not require this time since they represent 0 bits; for long edges, we just increment the counter for the current level. Altogether, the total runtime for updating t_{1},\ldots,t_{w} across all calls to Algorithm 11 for the padded point x is O(n\Lambda w). Finally, calculating the distance from x to a fixed point y requires O(w) time since we just index into the array A w times. Thus the total runtime per point is dominated by O(n\Lambda w), and the total runtime for computing all rows of the distance matrix is

O(n2Λw)=O(n2dΛ2logn)=O(n2dlognlog2(dlogΦε))O(n^{2}\Lambda w)=O\left(\frac{n^{2}d\Lambda^{2}}{\log n}\right)=O\left(\frac{n^{2}d}{\log n}\,\log^{2}\left(\frac{d\log\Phi}{\varepsilon}\right)\right)

by setting δ\delta to be a small constant in (2).

D.5 Accuracy Analysis

We now show that the distances we calculate are accurate within a 1±ε1\pm\varepsilon multiplicative factor. The lemma below shows that if a padded point xx and another point yy have a sufficiently far away least-common ancestor in TT, then we can disregard many lower order bits in the decompression computed from TT while still guaranteeing accurate distance measurements. The lemma crucially relies on xx being padded.

Lemma D.4.

Suppose xx is (ε,Λ)(\varepsilon,\Lambda)-padded in TT. For another point yy, suppose the least common ancestor of xx and yy is at level \ell. Let x~\widetilde{x} and y~\widetilde{y} denote the sketches of xx and yy produced by TT. Let x~\widetilde{x}^{\prime} be a modified version of x~\widetilde{x} where for each of the dd coordinates, we remove all the bits acquired after level 2Λ\ell-2\Lambda. Similarly define y~\widetilde{y}^{\prime}. Then

x~y~1=(1±ε)xy1.\|\widetilde{x}^{\prime}-\widetilde{y}^{\prime}\|_{1}=(1\pm\varepsilon)\|x-y\|_{1}.
Proof.

Since x is padded, we know that \|x-y\|_{1}\geq\rho(\ell-1) by Definition D.1. On the other hand, if we ignore the bits after level \ell-2\Lambda for every coordinate of \widetilde{x} and \widetilde{y}, the additive approximation error in the distance is bounded by a constant factor times

d\cdot\sum_{i=-\infty}^{\ell-2\Lambda-1}2^{i}=d\cdot 2^{\ell-2\Lambda}.

From our choice of \Lambda, we can easily verify that d\cdot 2^{\ell-2\Lambda}\leq\varepsilon\rho(\ell-1)/2. Putting everything together and adjusting the value of \varepsilon, we have

\|\widetilde{x}^{\prime}-\widetilde{y}^{\prime}\|_{1}=\|\widetilde{x}-\widetilde{y}\|_{1}\pm\varepsilon\rho(\ell-1)/2=(1\pm\varepsilon/2)\|x-y\|_{1}\pm\varepsilon\rho(\ell-1)/2=(1\pm\varepsilon)\|x-y\|_{1}

where we have used the fact that x~y~1=(1±ε/2)xy1\|\widetilde{x}-\widetilde{y}\|_{1}=(1\pm\varepsilon/2)\|x-y\|_{1} from the guarantees of the compression tree of [IRW17]. ∎

Putting together our results above along with the Johnson-Lindenstrauss Lemma and Theorem A.11 proves the following theorem.

Theorem D.5.

Let X=\{x_{1},\ldots,x_{n}\} be a dataset of n points in d dimensions with aspect ratio \Phi. We can calculate an n\times n matrix B such that each (i,j) entry B_{ij} of B satisfies

(1ε)xixj1Bij(1+ε)xixj1(1-\varepsilon)\|x_{i}-x_{j}\|_{1}\leq B_{ij}\leq(1+\varepsilon)\|x_{i}-x_{j}\|_{1}

in time

O(n2dlognlog2(dlogΦε)).O\left(\frac{n^{2}d}{\log n}\,\log^{2}\left(\frac{d\log\Phi}{\varepsilon}\right)\right).

Assuming the aspect ratio is polynomially bounded, we can compute an n\times n matrix B such that each (i,j) entry B_{ij} of B satisfies

(1ε)xixj2Bij(1+ε)xixj2(1-\varepsilon)\|x_{i}-x_{j}\|_{2}\leq B_{ij}\leq(1+\varepsilon)\|x_{i}-x_{j}\|_{2}

with probability 11/poly(n)1-1/\text{poly}(n). The construction time is

O(n2ε2log2(lognε)).O\left(\frac{n^{2}}{\varepsilon^{2}}\,\log^{2}\left(\frac{\log n}{\varepsilon}\right)\right).

D.6 A Faster Algorithm for \ell_{\infty} Distance Matrix Construction Over a Bounded Alphabet

In this section, we show how to create the \ell_{\infty} distance matrix. Recall from Section 3 that there exists no o(n2)o(n^{2}) time algorithm to compute a matrix-vector query for the \ell_{\infty} distance matrix, assuming SETH, even for nn points in {0,1,2}d\{0,1,2\}^{d}. This suggests that any algorithm for computing a matrix-vector query needs to initialize the distance matrix. However, there is still a gap between a Ω(n2)\Omega(n^{2}) lower bound for matrix-vector queries and the naive O(n2d)O(n^{2}d) time needed to compute the \ell_{\infty} distance matrix. We make progress towards showing that this gap is not necessary. Our main result is that surprisingly, we can initialize the \ell_{\infty} distance matrix in time much faster than the naive O(n2d)O(n^{2}d) time.

Theorem D.6.

Given nn points over {0,1,,M}d\{0,1,\ldots,M\}^{d}, we can initialize the exact \ell_{\infty} distance matrix in time O(Mw1n2(dlogM)w2)O(M^{w-1}n^{2}(d\log M)^{w-2}) where ww is the matrix multiplication constant. We can also initialize a n×nn\times n matrix BB such that each (i,j)(i,j) entry BijB_{ij} of BB satisfies

(1ε)xixjBij(1+ε)xixj(1-\varepsilon)\|x_{i}-x_{j}\|_{\infty}\leq B_{ij}\leq(1+\varepsilon)\|x_{i}-x_{j}\|_{\infty}

in time O~(ε1n2(dM)w2)\tilde{O}(\varepsilon^{-1}n^{2}(dM)^{w-2}).

Thus for M=O(1)M=O(1), which is the setting of the lower bound, we can initialize the distance matrix in time O(n2dw2)O(n^{2}d^{w-2}) and thus compute a matrix-vector query in that time as well.

Proof.

The starting point of the proof is to first design an algorithm which, for a fixed threshold i, constructs a matrix whose (k,j) entry indicates whether \|x_{k}-x_{j}\|_{\infty}\leq i or \|x_{k}-x_{j}\|_{\infty}>i. Given this, we can then sum over all M choices of the threshold and construct the full distance matrix. Thus it suffices to solve this intermediate task.

Pick a sufficiently large even p such that di^{p}<(i+1)^{p}. A choice of p=O(M\log d) suffices. Now in the case that \|x-y\|_{\infty}\leq i, we have \|x-y\|_{p}^{p}\leq di^{p} and otherwise, \|x-y\|_{p}^{p}\geq(i+1)^{p}. Thus, the matrix C with entries \|x_{k}-x_{j}\|_{p}^{p} is able to distinguish the two cases, so it suffices to create such a matrix. Now we can write \|x-y\|_{p}^{p} as an inner product in O(pd) variables via the binomial expansion of each coordinate, i.e., C is a Gram-type matrix. Thus computing C can be done by computing a product of an n\times O(pd) matrix with an O(pd)\times n matrix, which can be done in

O((npd)2(pd)w2)=O(n2(pd)w2).O\left(\left(\frac{n}{pd}\right)^{2}(pd)^{w-2}\right)=O(n^{2}(pd)^{w-2}).

time by partitioning each matrix into square submatrices of dimension O(pd)O(pd). Plugging in the bound for pp and considering all possible choices of ii results in the final runtime bound of O(Mw1n2(dlogM)w2)O(M^{w-1}n^{2}(d\log M)^{w-2}), as desired.

Now if we only want to approximate each entry up to a multiplicative 1±ε1\pm\varepsilon factor, it suffices to only loop over ii’s which are increasing by powers of 1+cε1+c\varepsilon for a small constant cc. This replaces an O(M)O(M) factor by an O(ε1logM)O(\varepsilon^{-1}\log M) factor. ∎
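The following numpy sketch implements the thresholding idea of the proof end to end for small alphabets (our own illustration: it uses plain floating-point matrix products rather than the blocked fast matrix multiplication, and float64 is only exact here when M and d are modest).

import numpy as np
from math import comb, floor, log

def linf_distance_matrix(X, M):
    # D[k, j] = ||x_k - x_j||_inf for integer points in {0, ..., M}^d.
    n, d = X.shape
    Xf = X.astype(float)
    D = np.zeros((n, n), dtype=np.int64)
    for i in range(M):   # D[k, j] equals the number of thresholds i with distance > i
        # smallest even p with d * i^p < (i + 1)^p
        p = 2 if i == 0 else 2 * (floor(log(d) / (2 * log((i + 1) / i))) + 1)
        ts = np.arange(p + 1)
        coef = np.array([comb(p, t) for t in ts], dtype=float)
        Phi = coef * Xf[:, :, None] ** ts                       # n x d x (p+1)
        Psi = ((-1.0) ** (p - ts)) * Xf[:, :, None] ** (p - ts)
        C = Phi.reshape(n, -1) @ Psi.reshape(n, -1).T           # C[k, j] = ||x_k - x_j||_p^p
        D += C > d * float(i) ** p + 0.5
    return D

rng = np.random.default_rng(5)
X = rng.integers(0, 3, size=(30, 10))                            # alphabet {0, 1, 2}, M = 2
A = np.max(np.abs(X[:, None, :] - X[None, :, :]), axis=2)
assert np.array_equal(linf_distance_matrix(X, 2), A)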