Faster Linear Algebra for Distance Matrices
Abstract
The distance matrix of a dataset $X$ of $n$ points with respect to a distance function $f$ represents all pairwise distances between points in $X$ induced by $f$. Due to their wide applicability, distance matrices and related families of matrices have been the focus of many recent algorithmic works. We continue this line of research and take a broad view of algorithm design for distance matrices with the goal of designing fast algorithms, which are specifically tailored for distance matrices, for fundamental linear algebraic primitives. Our results include efficient algorithms for computing matrix-vector products for a wide class of distance matrices, such as the $\ell_1$ metric for which we get a linear runtime, as well as an $\Omega(n^2)$ lower bound for any algorithm which computes a matrix-vector product for the $\ell_\infty$ case, showing a separation between the $\ell_1$ and the $\ell_\infty$ metrics. Our upper bound results, in conjunction with recent works on the matrix-vector query model, have many further downstream applications, including the fastest algorithm for computing a relative error low-rank approximation for the distance matrix induced by the $\ell_1$ and $\ell_2^2$ functions and the fastest algorithm for computing an additive error low-rank approximation for the $\ell_2$ metric, in addition to applications for fast matrix multiplication, among others. We also give algorithms for constructing distance matrices and show that one can construct an approximate $\ell_2$ distance matrix in time faster than the bound implied by the Johnson-Lindenstrauss lemma.
1 Introduction
Given a set of points $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$, the distance matrix of $X$ with respect to a distance function $f$ is defined as the $n \times n$ matrix $A$ satisfying $A_{ij} = f(x_i, x_j)$. Distance matrices are ubiquitous objects arising in various applications ranging from learning image manifolds [TSL00, WS06], signal processing [SY07], and biological analysis [HS93], to non-linear dimensionality reduction [Kru64, Kru78, TSL00, CC08], to name a few (we refer the reader to the survey [DPRV15] for a more thorough discussion of applications of distance matrices). Unfortunately, explicitly computing and storing $A$ requires at least $\Omega(n^2)$ time and space. Such complexities are prohibitive for scaling to large datasets.
A silver lining is that in many settings, the matrix $A$ is not explicitly required. Indeed, in many applications it suffices to compute some underlying function or property of $A$, such as the eigenvalues and eigenvectors of $A$ or a low-rank approximation of $A$. Thus an algorithm designer can hope to use the special geometric structure encoded by $A$ to design faster algorithms tailored for such tasks.
Therefore, it is not surprising that many recent works explicitly take advantage of the underlying geometric structure of distance matrices, and other related families of matrices, to design fast algorithms (see Section 1.2 for a thorough discussion of prior works). In this work, we continue this line of research and take a broad view of algorithm design for distance matrices. Our main motivating question is the following:
Can we design algorithms for fundamental linear algebraic primitives which are specifically tailored for distance matrices and related families of matrices?
We make progress towards the motivating question by studying three of the most fundamental primitives in algorithmic linear algebra. Specifically:
1. We study upper and lower bounds for computing matrix-vector products for a wide array of distance matrices,
2. We give algorithms for multiplying distance matrices faster than general matrices, and,
3. We give fast algorithms for constructing distance matrices.
1.1 Our Results
We now describe our contributions in more detail.
1. We study upper and lower bounds for constructing matrix-vector queries for a wide array of distance matrices.
A matrix-vector query algorithm accepts a vector $z$ as input and outputs the vector $Az$. There is substantial motivation for studying such queries. Indeed, there is now a rich literature for fundamental linear algebra algorithms which are in the “matrix free” or “implicit” model. These algorithms only assume access to the underlying matrix via matrix-vector queries. Some well known algorithms in this model include the power method for computing eigenvalues and the conjugate gradient method for solving a system of linear equations. For many fundamental functions of $A$, nearly optimal bounds in terms of the number of queries have been achieved [MM15, BHSW20, BCW22]. Furthermore, having access to matrix-vector queries also allows the simulation of any randomized sketching algorithm, a well studied algorithmic paradigm in its own right [Woo14]. This is because randomized sketching algorithms operate on the matrix $SA$ or $AS$ where $S$ is a suitably chosen random matrix, such as a Gaussian matrix. Typically, $S$ is chosen so that the sketches $SA$ or $AS$ have significantly smaller row or column dimension compared to $A$. If $A$ is symmetric, we can easily acquire both types of matrix sketches via a small number of matrix-vector queries.
Therefore, creating efficient versions of matrix-vector queries for distance matrices automatically lends itself to many further downstream applications. We remark that our algorithms have access to the set of input points but do not explicitly create the distance matrix. A canonical example of our upper bound results is the construction of matrix-vector queries for the $\ell_1$ distance function $f(x, y) = \|x - y\|_1$.
Theorem 1.1.
Let $p \ge 1$ be an integer. Suppose we are given a dataset $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$. $X$ implicitly defines the matrix $A_{ij} = \|x_i - x_j\|_p^p$. Given a query $z \in \mathbb{R}^n$, we can compute $Az$ exactly in time $O(ndp)$. If $p$ is odd, we also require $O(nd \log n)$ preprocessing time.
We give similar guarantees for a wide array of functions $f$ and we refer the reader to Table 1, which summarizes our matrix-vector query upper bound results. Note that some of the functions we study in Table 1 do not necessarily induce a metric in the strict mathematical sense (for example, some of them do not satisfy the triangle inequality). Nevertheless, we still refer to such functions under the broad umbrella term of “distance functions” for ease of notation. We always explicitly state the function $f$ we are referring to.
Crucially, most of our bounds have a linear dependency on $n$, which allows for scalable computation as the size of the dataset grows. Our upper bounds are optimal in many cases; see Theorem A.13.
| Function $f(x, y)$ | Preprocessing | Query Time | Reference |
|---|---|---|---|
| $\ell_p^p$ for even $p$ | -- | $O(ndp)$ | Thms. A.1 / A.3 |
| $\ell_p^p$ for odd $p$ | $O(nd \log n)$ | $O(ndp)$ | Thms. 2.2 / A.4 |
| Mixed $\ell_\infty$ | | | Thm. A.5 |
| Mahalanobis Distance$^2$ | | | Thm. A.6 |
| Polynomial Kernel | -- | $O(nd^p)$ | Thm. A.7 |
| Total Variation Distance | $O(nd \log n)$ | $O(nd)$ | Thm. A.8 |
| KL Divergence | -- | $O(nd)$ | Thm. A.2 |
| Symmetric Divergence | -- | $O(nd)$ | Thm. A.9 |
| Cross Entropy | -- | $O(nd)$ | Thm. A.9 |
| Hellinger Distance$^2$ | -- | $O(nd)$ | Thm. A.10 |
Combining our upper bound results with optimized matrix-free methods, immediate corollaries of our results include faster algorithms for eigenvalue and singular value computations and low-rank approximations. Low-rank approximation is of special interest as it has been widely studied for distance matrices; for low-rank approximation, our bounds outperform prior results for specific distance functions. For example, for the $\ell_1$ and $\ell_2^2$ cases (and PSD matrices in general), [BCW20] showed that a rank-$k$ approximation can be found in a runtime that carries extra multiplicative overhead compared to our bound stated in Table 2. The work of [IVWW19] has a worse overhead for an additive error approximation for the $\ell_2$ case. See Section 1.2 for further discussion of prior works. The downstream applications of matrix-vector queries are summarized in Table 2.
We also study fundamental limits for any upper bound algorithms. In particular, we show that no algorithm can compute a matrix-vector query for general inputs for the $\ell_\infty$ metric in subquadratic time, assuming a standard complexity-theoretic assumption called the Strong Exponential Time Hypothesis (SETH) [IP01, IPZ01].
Theorem 1.2.
For any $\alpha > 0$ and $d = \omega(\log n)$, any algorithm for exactly computing $Az$ for any input $z$, where $A_{ij} = \|x_i - x_j\|_\infty$ is the $\ell_\infty$ distance matrix, requires $\Omega(n^{2-\alpha})$ time (assuming SETH).
This shows a separation between the functions listed in Table 1 and the $\ell_\infty$ metric. Surprisingly, we can answer approximate $\ell_\infty$ matrix-vector queries in substantially faster time.
Theorem 1.3.
Suppose all points have coordinates in a constant-size alphabet. Then for any query $z$, we can compute $Bz$ in subquadratic time for a suitable range of $d$, where $B$ approximates the $\ell_\infty$ distance matrix $A$ entrywise up to additive error $\epsilon$; see Theorems 3.3 and 3.5 for the precise bounds.
To put the above result into context, the lower bound of Theorem 1.2 holds for point sets with a constant-size alphabet in $\omega(\log n)$ dimensions. In contrast, if we relax to an approximation guarantee, we can obtain a subquadratic-time algorithm for a non-trivial range of $d$.
Finally, we provide a general understanding of the limits of our upper bound techniques. In Theorem B.1, we show that essentially the only functions $f$ to which our upper bound techniques apply are those with a “linear structure” after a suitable transformation. We refer to Appendix B for details.
| Problem | Function $f$ | Runtime | Prior Work |
|---|---|---|---|
| Relative error low-rank approximation | $\ell_1$, $\ell_2^2$ | Thm. C.4 | [BCW20] |
| Additive error low-rank approximation | $\ell_2$ | Thm. C.6 | [IVWW19] |
| Relative error low-rank approximation | Any $f$ in Table 1 | Thm. C.7 | |
| Top singular values / linear systems | Any $f$ in Table 1 | Thm. C.8 | |
| Matrix multiplication | Any $f$ in Table 1 | Cor. C.10 / Lem. C.11 | |
2. We give algorithms for multiplying distance matrices faster than general matrices.
Fast matrix-vector queries also automatically imply fast matrix multiplication, which can be reduced to a series of matrix-vector queries. For concreteness, if $A$ is a distance matrix that admits a matrix-vector query in time $T$ (for example, $T = O(nd)$ for the $\ell_1$ matrix), and $B$ is any $n \times m$ matrix, we can compute $AB$ in time $O(Tm)$. This is substantially faster than the general matrix multiplication bound. We also give an improvement of this result for the case where we are multiplying two distance matrices arising from the $\ell_2^2$ function. See Table 2 for a summary.
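The reduction is immediate: one matrix-vector query per column of $B$. Below is a minimal Python sketch; the `matvec` argument stands in for any of our fast query routines, and the explicit matrix is used only for verification.

```python
import numpy as np

def matmul_via_matvec(matvec, B):
    """Multiply an implicit n x n matrix A by an explicit n x m matrix B,
    using only the matrix-vector oracle `matvec` (one call per column of B)."""
    return np.column_stack([matvec(B[:, j]) for j in range(B.shape[1])])

# Toy usage with an explicit matrix standing in for a distance matrix.
A = np.random.rand(5, 5)
B = np.random.rand(5, 3)
assert np.allclose(matmul_via_matvec(lambda z: A @ z, B), A @ B)
```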
3. We give fast algorithms for constructing distance matrices.
Finally, we give fast algorithms for constructing approximate distance matrices. To establish some context, recall the classical Johnson-Lindenstrauss (JL) lemma which (roughly) states that a random projection of a dataset of $n$ points onto a dimension of size $O(\epsilon^{-2}\log n)$ approximately preserves all pairwise distances up to a $1 \pm \epsilon$ factor [JL84]. A common application of this lemma is to instantiate the distance matrix. A naive algorithm which computes the distance matrix after performing the JL projection requires approximately $O(n^2 \epsilon^{-2}\log n)$ time. Surprisingly, we show that the JL lemma is not tight with respect to creating an approximate distance matrix; we show that one can initialize the distance matrix in an asymptotically better runtime.
Theorem 1.4 (Informal; See Theorem D.5 ).
We can calculate a matrix $B$ such that each entry of $B$ satisfies $(1-\epsilon)\|x_i - x_j\|_2 \le B_{ij} \le (1+\epsilon)\|x_i - x_j\|_2$, in time asymptotically faster than the $O(n^2 \epsilon^{-2}\log n)$ bound implied by the JL lemma (see Theorem D.5 for the precise runtime).
Our result can be viewed as the natural runtime bound which would follow if the JL lemma implied a smaller embedding dimension. While improving the JL bound itself is impossible, since the bound is tight [LN17], we achieve our speedup by carefully reusing distance calculations via tools from metric compression [IRW17]. Our results also extend to the $\ell_1$ distance matrix; see Theorem D.5 for details.
Notation.
Our dataset will be the $n$ points $x_1, \ldots, x_n \in \mathbb{R}^d$. For points in the dataset, we denote $x_i(k)$ to be the $k$th coordinate of point $x_i$ for clarity. For all other vectors $v$, $v_k$ denotes the $k$th coordinate. We are interested in matrices of the form $A_{ij} = f(x_i, x_j)$ for a function $f$ which measures the similarity between any pair of points. $f$ might not necessarily be a distance function but we use the terminology “distance function” for ease of notation. We will always explicitly state the function $f$ as needed. $\omega$ denotes the matrix multiplication constant, i.e., the exponent of $n$ in the time required to compute the product of two $n \times n$ matrices [AW21].
1.2 Related Works
Matrix-Vector Products Queries.
Our work can be understood as being part of a long line of classical works on the matrix free or implicit model as well as the active recent line of works on the matrix-vector query model. Many widely used linear algebraic algorithms such as the power method, the Lanczos algorithm [Lan50], conjugate gradient descent [S+94], and Wiedemann’s coordinate recurrence algorithm [Wie86], to name a few, all fall into this paradigm. Recent works such as [MM15, BHSW20, BCW22] have succeeded in precisely nailing down the query complexity of these classical algorithms in addition to various other algorithmic tasks such as low-rank approximation [BCW22], trace estimation [MMMW21], and other linear-algebraic functions [SWYZ21b, RWZ20]. There is also a rich literature on query based algorithms in other contexts with the goal of minimizing the number of queries used. Examples include graph queries [Gol17], distribution queries [Can20], and constraint based queries [ES20] in property testing, inner product queries in compressed sensing [EK12], and quantum queries [LSZ21, CHL21].
Most prior works on query based models assume black-box access to matrix-vector queries. While this is a natural model which allows for the design of non-trivial algorithms and lower bounds, it is not always clear how such queries can be implemented efficiently. In contrast, the focus of our work is not on obtaining query complexity bounds, but rather on complementing prior works by creating efficient matrix-vector queries for a natural class of matrices.
Subquadratic Algorithms for Distance Matrices.
Most work on subquadratic algorithms for distance matrices has focused on the problem of computing a low-rank approximation. [BW18, IVWW19] both obtain an additive error low-rank approximation applicable for all distance matrices. These works only assume access to the entries of the distance matrix, whereas we assume we also have access to the underlying dataset. [BCW20] study the problem of computing the low-rank approximation of PSD matrices with sample access to the entries of the matrix. Their results extend to low-rank approximation for the $\ell_1$ and $\ell_2^2$ distance matrices in addition to other more specialized metrics such as spherical metrics. Table 2 lists the runtime comparisons between their results and ours.
Practically, the algorithm of [IVWW19] is the easiest to implement and has outstanding empirical performance. We note that we can easily simulate their algorithm with no overall asymptotic runtime overhead using matrix-vector queries. Indeed, their algorithm proceeds by sampling rows of the matrix according to their norms and then post-processing these rows. The sampling probabilities only need to be accurate up to a factor of two. We can acquire these sampling probabilities by performing $O(\log n)$ matrix-vector queries, which sketch the rows onto dimension $O(\log n)$ and preserve all row norms up to a factor of two with high probability due to the Johnson-Lindenstrauss lemma [JL84]. This procedure only incurs an additional runtime of $O(T \log n)$ where $T$ is the time required to perform a matrix-vector query.
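A minimal sketch of this row-norm estimation step, assuming only a matrix-vector oracle (function names are ours; the sampling and post-processing of [IVWW19] are not reproduced here):

```python
import numpy as np

def approx_row_norms(matvec, n, trials=32, seed=0):
    """Estimate all row norms of an implicit n x n matrix A using `trials`
    matrix-vector queries with Gaussian vectors: the rows of the sketch A @ G
    preserve row norms up to a constant factor with high probability (JL)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n, trials)) / np.sqrt(trials)
    sketch = np.column_stack([matvec(G[:, t]) for t in range(trials)])  # = A @ G
    return np.linalg.norm(sketch, axis=1)  # i-th entry approximates ||A_i||_2

# Sampling probabilities proportional to (squared) row norms can then be
# obtained by normalizing the returned estimates.
```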
The paper [ILLP04] shows that the exact distance matrix can be created in time asymptotically faster than the naive $O(n^2 d)$ bound for certain metrics. In contrast, we focus on creating (entry-wise) approximate distance matrices.
We also compare to the paper of [ACSS20]. In summary, their main upper bounds are approximation algorithms, while we mainly focus on exact algorithms. Concretely, they study matrix-vector products for matrices of the form $A_{ij} = f(\|x_i - x_j\|_2^2)$ for some function $f$. They present results on approximating such matrix-vector products where the approximation error is additive. They also consider a wide range of $f$, including polynomials and other kernels, but the input to $f$ is always the squared $\ell_2$ distance. In contrast, we also present exact algorithms, i.e., with no approximation errors. For example, one of our main upper bounds is an exact algorithm when $f(x, y) = \|x - y\|_1$ (see Table 1 for the full list). Since it is possible to approximately embed the $\ell_1$ distance into $\ell_2^2$, their methods could be used to derive approximate algorithms for $\ell_1$, but not exact ones. Furthermore, we also study a wide variety of other distance functions, such as $\ell_p^p$ and $\ell_\infty$ (and others listed in Table 1), which are not studied in Alman et al. In terms of technique, the main upper bound technique of Alman et al. is to expand the squared distance and approximate the resulting quantity via a polynomial. This is related to our upper bound results for $\ell_p^p$ for even $p$, where we also use polynomials. However, our results are exact, while theirs are approximate. Our upper bound techniques are largely orthogonal to the polynomial approximation techniques used in Alman et al. We also employ polynomial techniques to give upper bounds for the approximate $\ell_\infty$ distance function, which is not studied in Alman et al. Lastly, Alman et al. also focus on problems related to the Laplacian matrix of the weighted graph represented by the distance matrix, such as spectral sparsification and Laplacian system solving. In contrast, we study different problems including low-rank approximations, eigenvalue estimation, and the task of initializing an approximate distance matrix. We do not consider the distance matrix as a graph or consider the associated Laplacian matrix.
It is also easy to verify the “folklore” fact that for a Gram matrix $XX^\top$, where $X \in \mathbb{R}^{n \times d}$ is the matrix with the points in its rows, we can compute $XX^\top z$ in $O(nd)$ time by computing $X^\top z$ first and then $X(X^\top z)$. Our upper bound for the $\ell_2^2$ function can be reduced to this folklore fact by noting that $\|x_i - x_j\|_2^2 = \|x_i\|_2^2 + \|x_j\|_2^2 - 2\langle x_i, x_j\rangle$. Thus the matrix can be decomposed into two rank-one components, due to the terms $\|x_i\|_2^2$ and $\|x_j\|_2^2$, and a Gram matrix, due to the term $\langle x_i, x_j\rangle$. This decomposition of the $\ell_2^2$ matrix is well-known (see [DPRV15]). Hence, a matrix-vector query for the $\ell_2^2$ matrix easily reduces to the Gram matrix case. Nevertheless, we explicitly state the upper bound for completeness since we also consider all $\ell_p^p$ functions for any integer $p$.
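A minimal sketch of this decomposition-based query (names are ours; the explicit matrix is built only to check the output):

```python
import numpy as np

def sq_euclidean_matvec(X, z):
    """Exact A @ z for A_ij = ||x_i - x_j||_2^2 in O(nd) time, using
    ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>."""
    sq = np.einsum('ij,ij->i', X, X)        # ||x_i||^2 for all i, O(nd)
    s, t = z.sum(), sq @ z                  # scalars
    v = X.T @ z                             # O(nd)
    return sq * s + t - 2.0 * (X @ v)       # O(nd)

# Sanity check against the explicit matrix on a small instance.
X = np.random.rand(6, 3); z = np.random.rand(6)
A = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
assert np.allclose(sq_euclidean_matvec(X, z), A @ z)
```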
Polynomial Kernels.
There have also been works on faster algorithms for approximating a kernel matrix, defined as the matrix $K$ with entries $K_{ij} = k(x_i, x_j)$ for a kernel function $k$. Specifically for the polynomial kernel, recent works such as [ANW14, AKK+20, WZ20, SWYZ21a] have shown how to find a low-dimensional sketch of the kernel's feature map which approximately preserves the action of $K$. In contrast, we can exactly simulate the matrix-vector product $Kz$. Our runtime is $O(nd^p)$, which has a linear dependence on $n$ but an exponential dependence on the degree $p$, while the aforementioned works have polynomial dependence on $d$ and $p$ but are approximate and carry overheads in other parameters such as $1/\epsilon$. Thus our results are mostly applicable to the setting where the dataset is large, i.e., $n \gg d$, and $p$ is a small constant. For example, $p = 2$ is a common choice in practice [CHC+10]. Algorithms with polynomial dependence on $d$ and $p$ are better suited for smaller datasets which have very large $d$ and large $p$. Note that a large $p$ might arise if one approximates a non-polynomial kernel using a polynomial kernel via a Taylor expansion. We refer to the references within [ANW14, AKK+20, WZ20, SWYZ21a] for additional related work. There is also work on kernel density estimation (KDE) data structures which, upon query $y$, allow for estimation of the sum $\sum_{i=1}^n k(x_i, y)$ in time sublinear in $n$ after some preprocessing on the dataset $X$. For widely used kernels such as the Gaussian and Laplacian kernels, KDE data structures were used in [BIMW21] to create a matrix-vector query algorithm for kernel matrices in time subquadratic in $n$ for input vectors which are entry-wise non-negative. We refer the reader to [CS17, BCIS18, SRB+19, BIW19, CKNS20] for prior works on KDE data structures.
2 Faster Matrix-Vector Product Queries for $\ell_1$
We derive faster matrix-vector queries for distance matrices for a wide array of distance metrics. First we consider the case of the $\ell_1$ metric, i.e., $A_{ij} = \|x_i - x_j\|_1$ where $x_i, x_j \in \mathbb{R}^d$.
We first analyze the correctness of Algorithm 2.
Theorem 2.1.
Let $z \in \mathbb{R}^n$. Algorithm 2 computes $Az$ exactly.
Proof.
Consider any coordinate $i \in [n]$. We show that $(Az)_i$ is computed exactly. We have
$$(Az)_i = \sum_{j=1}^n \|x_i - x_j\|_1 z_j = \sum_{k=1}^d \sum_{j=1}^n |x_i(k) - x_j(k)|\, z_j.$$
Fix a coordinate $k$ and let $\pi$ denote the order on $[n]$ induced by sorting the values $x_j(k)$, so that $x_{\pi(1)}(k) \le \cdots \le x_{\pi(n)}(k)$, and let $r$ be the position of $i$ in this order. We now consider the inner sum. It rearranges to the following:
$$\sum_{j=1}^n |x_i(k) - x_j(k)|\, z_j \;=\; x_i(k)\Big(\sum_{t \le r} z_{\pi(t)} - \sum_{t > r} z_{\pi(t)}\Big) \;-\; \sum_{t \le r} x_{\pi(t)}(k)\, z_{\pi(t)} \;+\; \sum_{t > r} x_{\pi(t)}(k)\, z_{\pi(t)},$$
where the prefix sums of the values $z_{\pi(t)}$ and of the values $x_{\pi(t)}(k) z_{\pi(t)}$ (in sorted order) are exactly the arrays maintained in Algorithm 2, and the last equality follows from the definition of these arrays. Summing over all $k$ gives us the desired result. ∎
The following theorem readily follows.
Theorem 2.2.
Suppose we are given a dataset of $n$ points in $\mathbb{R}^d$ which implicitly defines the $\ell_1$ distance matrix $A_{ij} = \|x_i - x_j\|_1$. Given a query $z \in \mathbb{R}^n$, we can compute $Az$ exactly in $O(nd)$ query time. We also require a one-time preprocessing step taking $O(nd \log n)$ time, which can be reused for all queries.
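The following is a minimal NumPy sketch consistent with the preprocessing and query procedure described above (the paper's Algorithms 1 and 2); the function names and the brute-force check are ours.

```python
import numpy as np

def l1_preprocess(X):
    """O(nd log n) preprocessing: sort each coordinate once."""
    return np.argsort(X, axis=0)             # n x d, per-column sort order

def l1_matvec(X, order, z):
    """Exact A @ z for A_ij = ||x_i - x_j||_1 in O(nd) time per query."""
    n, d = X.shape
    out = np.zeros(n)
    for k in range(d):
        idx = order[:, k]
        xs, zs = X[idx, k], z[idx]            # k-th coordinates and z in sorted order
        pz = np.cumsum(zs)                    # prefix sums of z
        pxz = np.cumsum(xs * zs)              # prefix sums of x * z
        Z, XZ = pz[-1], pxz[-1]
        # For the point of sorted rank r: sum_j |x - x_j(k)| z_j
        #   = x * (2 pz[r] - Z) - 2 pxz[r] + XZ.
        out[idx] += xs * (2.0 * pz - Z) - 2.0 * pxz + XZ
    return out

# Sanity check on a small instance.
X = np.random.rand(7, 4); z = np.random.randn(7)
A = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)
assert np.allclose(l1_matvec(X, l1_preprocess(X), z), A @ z)
```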
3 Lower and Upper Bounds for $\ell_\infty$
In this section we give a proof of Theorem 1.2. Specifically, we give a reduction from the so-called Orthogonal Vectors Problem (OVP) [Wil05] to the problem of computing the matrix-vector product $Az$, where $A_{ij} = \|x_i - x_j\|_\infty$ for a given set of points $x_1, \ldots, x_n$. The orthogonal vectors problem is defined as follows: given two sets of binary vectors $A = \{a_1, \ldots, a_n\} \subset \{0,1\}^d$ and $B = \{b_1, \ldots, b_n\} \subset \{0,1\}^d$, determine whether there exist $a \in A$ and $b \in B$ such that the dot product $a \cdot b$ (taken over the reals) is equal to $0$. It is known that if OVP can be solved in strongly subquadratic time $O(n^{2-\alpha})$ for any constant $\alpha > 0$ and $d = \omega(\log n)$, then SETH is false. Thus, an efficient reduction from OVP to the matrix-vector product problem yields Theorem 1.2.
Lemma 3.1.
If the matrix-vector product problem for $\ell_\infty$ distance matrices induced by $n$ vectors of dimension $d$ can be solved in time $T(n, d)$, then OVP with sets of size $n$ in dimension $d$ can be solved in time $O(T(2n, d) + nd)$.
Proof.
Define two functions $\varphi, \psi$ mapping $\{0, 1\}$ into a constant-size alphabet; we extend both functions to vectors by applying $\varphi$ and $\psi$ coordinate-wise, and to sets by letting $\varphi(A) = \{\varphi(a) : a \in A\}$ (the function $\psi$ is extended in the same way to $B$). The maps are chosen so that, for any pair of non-zero vectors $a \in A$ and $b \in B$, we have $\|\varphi(a) - \psi(b)\|_\infty = 1$ if and only if $a \cdot b = 0$, and $\|\varphi(a) - \psi(b)\|_\infty = 2$ otherwise.

Consider two sets of binary vectors $A$ and $B$. Without loss of generality we can assume that the vectors are non-zero, since otherwise the problem is trivial. Define three distance matrices: the matrix $M_A$ defined by the set $\varphi(A)$, the matrix $M_B$ defined by the set $\psi(B)$, and $M$ defined by the set $\varphi(A) \cup \psi(B)$. Furthermore, let $C$ be the “cross-distance” matrix, such that $C_{ij} = \|\varphi(a_i) - \psi(b_j)\|_\infty$. Observe that the matrix $M$ contains the blocks $M_A$ and $M_B$ on its diagonal, and the blocks $C$ and $C^\top$ off-diagonal. Thus $M\mathbf{1}$ stacks $M_A\mathbf{1} + C\mathbf{1}$ on top of $C^\top\mathbf{1} + M_B\mathbf{1}$, where $\mathbf{1}$ denotes an all-ones vector of the appropriate dimension. Since each of $M\mathbf{1}$, $M_A\mathbf{1}$ and $M_B\mathbf{1}$ can be obtained with one matrix-vector query, we can calculate $C\mathbf{1}$ using a constant number of matrix-vector queries (on point sets of size at most $2n$) plus $O(n)$ additional time. Since all entries of $C$ are either $1$ or $2$, some coordinate of $C\mathbf{1}$ is less than $2n$ if and only if there is an entry $C_{ij} = 1$. However, this only occurs if there exist $a \in A$ and $b \in B$ with $a \cdot b = 0$. ∎
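Below is a small, self-contained demonstration of the reduction's key property. The coordinate maps $\varphi(t) = t$ and $\psi(t) = -t$ are one concrete choice satisfying the 1-versus-2 property (an assumption on our part; the proof fixes its own maps), and the cross-distance matrix is built explicitly only to verify the gadget, whereas the reduction itself recovers $C\mathbf{1}$ from matrix-vector queries.

```python
import itertools
import numpy as np

# One concrete choice of coordinate maps (our assumption): phi(t) = t, psi(t) = -t.
# Then ||phi(a) - psi(b)||_inf = max_k (a_k + b_k), which is 2 iff a . b != 0
# and 1 iff a . b == 0 (for non-zero binary a, b).
phi = lambda a: a.astype(float)
psi = lambda b: -b.astype(float)

def has_orthogonal_pair_via_linf(A, B):
    P, Q = phi(A), psi(B)                               # mapped point sets
    C = np.abs(P[:, None, :] - Q[None, :, :]).max(-1)   # cross L_inf distances
    row_sums = C @ np.ones(C.shape[1])                  # one matvec with all-ones
    return row_sums.min() < 2 * C.shape[1]              # some entry equals 1

# Exhaustive check on small instances.
for d in (2, 3):
    vecs = [v for v in itertools.product([0, 1], repeat=d) if any(v)]
    for A in itertools.combinations(vecs, 2):
        for B in itertools.combinations(vecs, 2):
            A_, B_ = np.array(A), np.array(B)
            truth = any((a @ b) == 0 for a in A_ for b in B_)
            assert has_orthogonal_pair_via_linf(A_, B_) == truth
```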
3.1 Approximate $\ell_\infty$ Matrix-Vector Queries
In light of the lower bound given above, we consider implementing approximate matrix-vector queries for the $\ell_\infty$ function. Note that the lower bound holds for points whose coordinates lie in a constant-size alphabet, and thus it is natural to consider approximate upper bounds for the case of a limited alphabet.
Binary Case.
We first consider the case that all points are from $\{0, 1\}^d$. We first claim the existence of a polynomial with the following properties. Indeed, suitably scaled Chebyshev polynomials satisfy the following lemma; see, e.g., Chapter 2 in [SV+14].
Lemma 3.2.
There exists a polynomial $q$ of degree $O(\sqrt{d}\log(1/\epsilon))$ such that $q(0) = 0$ and $|q(t) - 1| \le \epsilon$ for all $t \in [1, d]$.
Now note that $\|x_i - x_j\|_\infty$ can only take on two values, $0$ or $1$. Furthermore, it is $0$ if and only if $x_i = x_j$ and $1$ if and only if $x_i \ne x_j$. Therefore, $\|x_i - x_j\|_\infty = 0$ if and only if $\|x_i - x_j\|_2^2 = 0$, and $\|x_i - x_j\|_\infty = 1$ if and only if $\|x_i - x_j\|_2^2 \ge 1$. Thus, we have that
$$\big|\, q(\|x_i - x_j\|_2^2) - \|x_i - x_j\|_\infty \,\big| \le \epsilon$$
for all entries of $A$. Note that $q(\|x_i - x_j\|_2^2)$ is a polynomial with $d^{O(\sqrt{d}\log(1/\epsilon))}$ monomials in the variables $x_i(1), \ldots, x_i(d), x_j(1), \ldots, x_j(d)$. Consider the matrix $B$ satisfying $B_{ij} = q(\|x_i - x_j\|_2^2)$. Using the same ideas as our upper bound results for $\ell_p^p$, it is straightforward to calculate the matrix-vector product $Bz$ (see Section A.2). To summarize, for each $i$, we write the $i$th coordinate of $Bz$ as a polynomial in the coordinates of $x_i$. This polynomial has $d^{O(\sqrt{d}\log(1/\epsilon))}$ monomials and can be constructed in $n \cdot d^{O(\sqrt{d}\log(1/\epsilon))}$ time. Once constructed, we can evaluate the polynomial at $x_i$ to obtain all the coordinates of $Bz$. Each evaluation requires $d^{O(\sqrt{d}\log(1/\epsilon))}$ time, resulting in an overall time bound of $n \cdot d^{O(\sqrt{d}\log(1/\epsilon))}$.
Theorem 3.3.
Let $X \subseteq \{0, 1\}^d$. We can compute $Bz$ in time $n \cdot d^{O(\sqrt{d}\log(1/\epsilon))}$, where $\|A - B\|_\infty \le \epsilon$ entrywise.
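The following is a small numerical illustration of one standard Chebyshev threshold construction consistent with Lemma 3.2 (the exact normalization used in the paper may differ):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def threshold_poly(d, eps):
    """One standard Chebyshev construction: a polynomial p of degree
    O(sqrt(d) log(1/eps)) with p(0) = 0 and |p(t) - 1| <= eps on [1, d]."""
    k = int(np.ceil(0.5 * np.sqrt(d - 1) * np.log(2.0 / eps)))
    Tk = C.Chebyshev.basis(k)
    m = lambda t: (2.0 * t - (d + 1)) / (d - 1)   # affine map sending [1, d] -> [-1, 1]
    scale = Tk(m(0.0))                            # |T_k| is very large just outside [-1, 1]
    def p(t):
        return 1.0 - Tk(m(t)) / scale
    return p, k

d, eps = 100, 1e-2
p, deg = threshold_poly(d, eps)
ts = np.arange(0, d + 1).astype(float)
vals = p(ts)
print("degree:", deg, " p(0) =", vals[0], " max |p(t)-1| on [1,d]:", np.abs(vals[1:] - 1).max())
```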
Entries in $\{0, 1, \ldots, M\}$.
We now consider the case that all points are from $\{0, 1, \ldots, M\}^d$ for a constant $M$. Our argument will be a generalization of the previous section. At a high level, our goal is to detect which of the possible values in $\{0, 1, \ldots, M\}$ is equal to the $\ell_\infty$ norm. To do so, we appeal to the prior section and design estimators which approximate the indicator functions $\mathbf{1}[\|x_i - x_j\|_\infty \ge v]$ for $v = 1, \ldots, M$. By summing up these indicators, we can approximate $\|x_i - x_j\|_\infty$.
Our estimators will again be designed using the Chebyshev polynomials. To motivate them, suppose that we want to detect if $\|x_i - x_j\|_\infty \ge v$ or if $\|x_i - x_j\|_\infty \le v - 1$. In the first case, some entry of $x_i - x_j$ will have absolute value at least $v$, whereas in the other case, all entries of $x_i - x_j$ will be bounded by $v - 1$ in absolute value. Thus if we can boost this ‘signal’, we can apply a polynomial which performs thresholding to distinguish the two cases. This motivates considering powers of the coordinates of $x_i - x_j$ for a larger power. In particular, in the case that $\|x_i - x_j\|_\infty \ge v$, we have $\sum_k (x_i(k) - x_j(k))^{2t} \ge v^{2t}$, and otherwise, $\sum_k (x_i(k) - x_j(k))^{2t} \le d(v-1)^{2t}$. By setting $t$ sufficiently large (logarithmic in $d$ for constant $M$), the first value is much larger than the latter, which we can detect using the ‘threshold’ polynomials of the previous section.
We now formalize our intuition. It is known that appropriately scaled Chebyshev polynomials satisfy the following guarantees; see, e.g., Chapter 2 in [SV+14].
Lemma 3.4.
There exists a polynomial $r$ of degree $O(\sqrt{k}\log(1/\delta))$ such that $|r(t)| \le \delta$ for all $t \in [0, 1]$ and $r(t) \ge 1$ for all $t \ge 1 + 1/k$.
Given $v$, our estimator will first try to detect if $\|x_i - x_j\|_\infty \ge v$. Let $r_1$ be a polynomial from Lemma 3.4, rescaled so that its input is $\sum_k (x_i(k) - x_j(k))^{2t}$ for the even power $2t$ chosen as above, and let $r_2$ be a polynomial from Lemma 3.4 with a smaller error parameter. Our estimator will be the composition $r_2 \circ r_1$.

If some coordinate $k$ is such that $|x_i(k) - x_j(k)| \ge v$, then $\sum_k (x_i(k) - x_j(k))^{2t} \ge v^{2t}$, and so $r_1$ will evaluate to a value at least $1$. Otherwise, we know that $\sum_k (x_i(k) - x_j(k))^{2t} \le d(v - 1)^{2t}$, which is much smaller than $v^{2t}$ by our choice of $t$, which means that $r_1$ will evaluate to a value close to $0$. Formally, the estimator will be at least $1 - \epsilon/M$ if there is a $k$ with $|x_i(k) - x_j(k)| \ge v$, and otherwise it will be at most $\epsilon/M$. Summing these estimates over $v = 1, \ldots, M$ therefore gives a value within $\epsilon$ of $\|x_i - x_j\|_\infty$.

The polynomial which is the composition of $r_1$ and $r_2$, viewed as a polynomial in the variables $x_i(1), \ldots, x_i(d), x_j(1), \ldots, x_j(d)$, has $d^{O(\deg(r_2 \circ r_1))}$ monomials. Our final estimator will be the sum across all $v$. Following our upper bound techniques for matrix-vector products for polynomials, e.g., in Section A.2, and as outlined in the prior section, we get the following overall query time:
Theorem 3.5.
Suppose we are given $X \subseteq \{0, 1, \ldots, M\}^d$ for a constant $M$, which implicitly defines the $\ell_\infty$ distance matrix $A$. For any query $z$, we can compute $Bz$, where $\|A - B\|_\infty \le \epsilon$ entrywise, in time subquadratic in $n$ for a suitable range of $d$; the exponent is analogous to that of Theorem 3.3, with an additional dependence on the alphabet size $M$.
4 Empirical Evaluation
We perform an empirical evaluation of our matrix-vector query for the $\ell_1$ distance function. We chose to implement our $\ell_1$ upper bound since it is a clean algorithm which possesses many of the same underlying algorithmic ideas as some of our other upper bound results. We envision that similar empirical results hold for most of our upper bounds in Table 1. Furthermore, matrix-vector queries are the dominating subroutine in many key practical linear algebra algorithms, such as the power method for eigenvalue estimation or iterative methods for linear regression: a fast matrix-vector query runtime automatically translates to faster algorithms for downstream applications.
| Dataset | Algorithm | Preprocessing Time | Avg. Query Time |
|---|---|---|---|
| Gaussian Mixture | Naive | 453.7 s | 43.3 s |
| Gaussian Mixture | Ours | 0.55 s | 0.09 s |
| MNIST | Naive | 2672.5 s | 38.6 s |
| MNIST | Ours | 5.5 s | 1.9 s |
| Glove | Naive | -- | 2.6 days (estimated) |
| Glove | Ours | 16.8 s | 3.4 s |
Experimental Design.
We chose two real and one synthetic dataset for our experiments. We have two “small” datasets and one “large” dataset. The two small datasets are small enough that their full distance matrices can be materialized, whereas the large dataset contains over a million points. The first dataset consists of points drawn from a mixture of three spherical Gaussians. The second dataset is the standard MNIST dataset [LeC98] and finally, our large dataset consists of Glove word embeddings [PSM14] (can be accessed at http://github.com/erikbern/ann-benchmarks/).
The two small datasets are small enough that one can feasibly initialize the full distance matrix in memory in reasonable time, although doing so already requires multiple gigabytes of space. This is simply impossible for the Glove dataset: on the order of terabytes of space would be required to initialize its distance matrix (in contrast, the dataset itself only requires a few gigabytes to store).
The naive algorithm for the small datasets is the following: we initialize the full distance matrix (which counts towards preprocessing), and then we use the full distance matrix to perform a matrix-vector query. Note that having the full matrix to perform a matrix-vector product only helps the naive algorithm, since it can then take advantage of optimized linear algebra subroutines for matrix multiplication and does not need to explicitly calculate the matrix entries. Since we cannot initialize the full distance matrix for the large dataset, the naive algorithm in this case computes the matrix-vector product in a standalone fashion by generating the entries of the distance matrix on the fly. We compare the naive algorithm to our Algorithms 1 and 2.
Our experiments are done on a 2021 M1 MacBook Pro with 32 gigabytes of RAM. We implement all algorithms in Python 3.9 using NumPy, with Numba acceleration to speed up all algorithms whenever possible.
Results.
Results are shown in Table 3. We show preprocessing and query times for both the naive algorithm and our algorithm in seconds. The query time is averaged over multiple trials using Gaussian vectors as queries. For the Glove dataset, it was infeasible to calculate even a single exact matrix-vector product, even using fast Numba-accelerated code. We thus estimated the full query time by calculating the time on a subset of points of the Glove dataset and extrapolating to the full dataset in proportion to the total number of points. We see that in all cases, our algorithm outperforms the naive algorithm in both preprocessing time and query time, and the gains become increasingly substantial as the dataset size increases, as predicted by our theoretical results.
Acknowledgements.
This research was supported by the NSF TRIPODS program (award DMS-2022448), Simons Investigator Award, MIT-IBM Watson collaboration, GIST- MIT Research Collaboration grant, and NSF Graduate Research Fellowship under Grant No. 1745302.
References
- [ACSS20] Josh Alman, Timothy Chu, Aaron Schild, and Zhao Song. Algorithms and hardness for linear algebra on geometric graphs. In Sandy Irani, editor, 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, pages 541–552. IEEE, 2020.
- [AG11] Uri M Ascher and Chen Greif. A first course on numerical methods. SIAM, 2011.
- [AKK+20] Thomas D Ahle, Michael Kapralov, Jakob BT Knudsen, Rasmus Pagh, Ameya Velingker, David P Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 141–160. SIAM, 2020.
- [ANW14] Haim Avron, Huy Nguyen, and David Woodruff. Subspace embeddings for the polynomial kernel. Advances in neural information processing systems, 27, 2014.
- [AW21] Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 522–539. SIAM, 2021.
- [BCIS18] Arturs Backurs, Moses Charikar, Piotr Indyk, and Paris Siminelakis. Efficient density evaluation for smooth kernels. 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 615–626, 2018.
- [BCW20] Ainesh Bakshi, Nadiia Chepurko, and David P Woodruff. Robust and sample optimal algorithms for psd low rank approximation. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 506–516. IEEE, 2020.
- [BCW22] Ainesh Bakshi, Kenneth L Clarkson, and David P Woodruff. Low-rank approximation with matrix-vector products. arXiv preprint arXiv:2202.05120, 2022.
- [BHSW20] Mark Braverman, Elad Hazan, Max Simchowitz, and Blake Woodworth. The gradient complexity of linear regression. In Conference on Learning Theory, pages 627–647. PMLR, 2020.
- [BIMW21] Arturs Backurs, Piotr Indyk, Cameron Musco, and Tal Wagner. Faster kernel matrix algebra via density estimation. In Proceedings of the 38th International Conference on Machine Learning, pages 500–510, 2021.
- [BIW19] Arturs Backurs, Piotr Indyk, and Tal Wagner. Space and time efficient kernel density estimation in high dimensions. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, NeurIPS, pages 15773–15782, 2019.
- [BW18] Ainesh Bakshi and David Woodruff. Sublinear time low-rank approximation of distance matrices. Advances in Neural Information Processing Systems, 31, 2018.
- [Can20] Clément L Canonne. A survey on distribution testing: Your data is big. but is it blue? Theory of Computing, pages 1–100, 2020.
- [CC08] Michael AA Cox and Trevor F Cox. Multidimensional scaling. In Handbook of data visualization, pages 315–347. Springer, 2008.
- [CHC+10] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin. Training and testing low-degree polynomial data mappings via linear svm. Journal of Machine Learning Research, 11(4), 2010.
- [CHL21] Andrew M Childs, Shih-Han Hung, and Tongyang Li. Quantum query complexity with matrix-vector products. arXiv preprint arXiv:2102.11349, 2021.
- [CKNS20] Moses Charikar, Michael Kapralov, Navid Nouri, and Paris Siminelakis. Kernel density estimation through density constrained near neighbor search. 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 172–183, 2020.
- [CS17] Moses Charikar and Paris Siminelakis. Hashing-based-estimators for kernel density in high dimensions. In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS, pages 1032–1043, 2017.
- [DPRV15] Ivan Dokmanic, Reza Parhizkar, Juri Ranieri, and Martin Vetterli. Euclidean distance matrices: essential theory, algorithms, and applications. IEEE Signal Processing Magazine, 32(6):12–30, 2015.
- [EK12] Yonina C Eldar and Gitta Kutyniok. Compressed sensing: theory and applications. Cambridge university press, 2012.
- [ES20] Rogers Epstein and Sandeep Silwal. Property testing of lp-type problems. In 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
- [Gol17] Oded Goldreich. Introduction to property testing. Cambridge University Press, 2017.
- [HS93] Liisa Holm and Chris Sander. Protein structure comparison by alignment of distance matrices. Journal of molecular biology, 233(1):123–138, 1993.
- [ILLP04] Piotr Indyk, Moshe Lewenstein, Ohad Lipsky, and Ely Porat. Closest pair problems in very high dimensions. In International Colloquium on Automata, Languages, and Programming, pages 782–792. Springer, 2004.
- [IP01] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-sat. Journal of Computer and System Sciences, 62(2):367–375, 2001.
- [IPZ01] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63(4):512–530, 2001.
- [IRW17] Piotr Indyk, Ilya Razenshteyn, and Tal Wagner. Practical data-dependent metric compression with provable guarantees. Advances in Neural Information Processing Systems, 30, 2017.
- [IVWW19] Piotr Indyk, Ali Vakilian, Tal Wagner, and David P Woodruff. Sample-optimal low-rank approximation of distance matrices. In Conference on Learning Theory, pages 1723–1751. PMLR, 2019.
- [JL84] W. Johnson and J. Lindenstrauss. Extensions of lipschitz maps into a hilbert space. Contemporary Mathematics, 26:189–206, 01 1984.
- [Kru64] Joseph B Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.
- [Kru78] Joseph B Kruskal. Multidimensional scaling. Number 11. Sage, 1978.
- [Kuc09] Marek Kuczma. An introduction to the theory of functional equations and inequalities: Cauchy’s equation and Jensen’s inequality. Springer Science & Business Media, 2009.
- [Lan50] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. 1950.
- [LeC98] Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- [LN17] Kasper Green Larsen and Jelani Nelson. Optimality of the johnson-lindenstrauss lemma. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 633–638. IEEE, 2017.
- [LSZ21] Troy Lee, Miklos Santha, and Shengyu Zhang. Quantum algorithms for graph problems with cut queries. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 939–958. SIAM, 2021.
- [MM15] Cameron Musco and Christopher Musco. Randomized block krylov methods for stronger and faster approximate singular value decomposition. Advances in neural information processing systems, 28, 2015.
- [MMMW21] Raphael A Meyer, Cameron Musco, Christopher Musco, and David P Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142–155. SIAM, 2021.
- [PSM14] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
- [RWZ20] Cyrus Rashtchian, David P Woodruff, and Hanlin Zhu. Vector-matrix-vector queries for solving linear algebra, statistics, and graph problems. arXiv preprint arXiv:2006.14015, 2020.
- [S+94] Jonathan Richard Shewchuk et al. An introduction to the conjugate gradient method without the agonizing pain, 1994.
- [SRB+19] Paris Siminelakis, Kexin Rong, Peter Bailis, Moses Charikar, and Philip Alexander Levis. Rehashing kernel evaluation in high dimensions. In Proceedings of the 36th International Conference on Machine Learning, ICML, pages 5789–5798, 2019.
- [SV+14] Sushant Sachdeva, Nisheeth K Vishnoi, et al. Faster algorithms via approximation theory. Foundations and Trends® in Theoretical Computer Science, 9(2):125–210, 2014.
- [SWYZ21a] Zhao Song, David Woodruff, Zheng Yu, and Lichen Zhang. Fast sketching of polynomial kernels of polynomial degree. In International Conference on Machine Learning, pages 9812–9823. PMLR, 2021.
- [SWYZ21b] Xiaoming Sun, David P Woodruff, Guang Yang, and Jialin Zhang. Querying a matrix through matrix-vector products. ACM Transactions on Algorithms (TALG), 17(4):1–19, 2021.
- [SY07] Anthony Man-Cho So and Yinyu Ye. Theory of semidefinite programming for sensor network localization. Mathematical Programming, 109(2):367–384, 2007.
- [TSL00] Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
- [Wie86] Douglas Wiedemann. Solving sparse linear equations over finite fields. IEEE transactions on information theory, 32(1):54–62, 1986.
- [Wil05] Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theoretical Computer Science, 348(2-3):357–365, 2005.
- [Woo14] David P Woodruff. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014.
- [WS06] Kilian Q Weinberger and Lawrence K Saul. Unsupervised learning of image manifolds by semidefinite programming. International journal of computer vision, 70(1):77–90, 2006.
- [WZ20] David Woodruff and Amir Zandieh. Near input sparsity time kernel embeddings via adaptive sampling. In International Conference on Machine Learning, pages 10324–10333. PMLR, 2020.
Appendix A Omitted Upper Bounds for Faster Matrix-Vector Queries
We now consider the case of $f(x, y) = \|x - y\|_2^2$. Generalizing the results for the $\ell_1$ and $\ell_2^2$ cases allows us to handle general $\ell_p^p$ functions.
Theorem A.1.
For $f(x, y) = \|x - y\|_2^2$, we can compute $Az$ in $O(nd)$ query time.
Proof.
The proof follows from the following calculation of the $i$th coordinate of $Az$:
$$(Az)_i = \sum_{j=1}^n \|x_i - x_j\|_2^2\, z_j = \|x_i\|_2^2 \sum_j z_j + \sum_j \|x_j\|_2^2 z_j - 2\, x_i^\top \sum_j z_j x_j.$$
The three sums on the right-hand side do not depend on $i$ and can be computed in $O(nd)$ total time, after which every coordinate of $Az$ can be evaluated in $O(d)$ time. ∎
We can extend our results to general $\ell_p^p$ functions as well as a wide array of commonly used functions to measure (dis)similarity between vectors. For example, suppose the points represent probability distributions over the domain $[d]$. A widely used “distance” function over distributions is the KL-divergence defined as
$$\mathrm{KL}(x \| y) = \sum_{k=1}^d x(k)\log\frac{x(k)}{y(k)} = -H(x) - \sum_{k=1}^d x(k)\log y(k),$$
where $H$ is the entropy function. Our techniques extend to the KL-divergence as well.
Theorem A.2.
For $f(x, y) = \mathrm{KL}(x \| y)$, we can compute $Az$ in $O(nd)$ query time.
Proof.
Note that computing $H(x_i)$ for all $i \in [n]$ takes $O(nd)$ time. Now
$$(Az)_i = \sum_{j=1}^n \mathrm{KL}(x_i \| x_j)\, z_j = -H(x_i)\sum_j z_j - \sum_{k=1}^d x_i(k) \sum_j z_j \log x_j(k).$$
The inner sums $\sum_j z_j \log x_j(k)$ for all $k \in [d]$ can be computed once in $O(nd)$ time, after which each coordinate of $Az$ takes $O(d)$ time, as desired. ∎
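A minimal sketch of this KL-divergence query (assuming $A_{ij} = \mathrm{KL}(x_i \| x_j)$ with strictly positive coordinates; names are ours):

```python
import numpy as np

def kl_matvec(X, z):
    """Exact A @ z for A_ij = KL(x_i || x_j) = -H(x_i) - sum_k x_i(k) log x_j(k),
    in O(nd) time. Rows of X are probability distributions with positive entries."""
    logX = np.log(X)
    negH = np.einsum('ij,ij->i', X, logX)    # -H(x_i) = sum_k x_i(k) log x_i(k)
    w = logX.T @ z                           # w_k = sum_j z_j log x_j(k), O(nd)
    return negH * z.sum() - X @ w

# Sanity check on a small instance.
rng = np.random.default_rng(1)
X = rng.random((5, 4)) + 0.1; X /= X.sum(1, keepdims=True)
z = rng.standard_normal(5)
A = np.array([[np.sum(xi * np.log(xi / xj)) for xj in X] for xi in X])
assert np.allclose(kl_matvec(X, z), A @ z)
```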
A.1 General $p$
We now consider the case of a general non-negative even integer $p$.
Theorem A.3.
For $f(x, y) = \|x - y\|_p^p$ with $p$ a non-negative even integer, we can compute $Az$ in $O(ndp)$ query time.
Proof.
Consider the following calculation of the $i$th coordinate of $Az$:
$$(Az)_i = \sum_{j=1}^n \|x_i - x_j\|_p^p\, z_j = \sum_{k=1}^d \sum_{j=1}^n (x_i(k) - x_j(k))^p z_j = \sum_{k=1}^d \sum_{m=0}^p \binom{p}{m}(-1)^{p-m} x_i(k)^m \sum_{j=1}^n x_j(k)^{p-m} z_j.$$
Note that computing $S_{k, q} = \sum_j x_j(k)^q z_j$ for all $k \in [d]$ and $q \in \{0, \ldots, p\}$ takes $O(ndp)$ time. Then returning the value of $(Az)_i$ for all $i$ takes $O(ndp)$ time, resulting in the claimed runtime. ∎
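A minimal sketch of this binomial-expansion query for even $p$ (names and the brute-force check are ours):

```python
import numpy as np
from math import comb

def lp_even_matvec(X, z, p):
    """Exact A @ z for A_ij = ||x_i - x_j||_p^p with p a positive even integer,
    in O(ndp) time, via the binomial expansion of (x_i(k) - x_j(k))^p."""
    n, d = X.shape
    powers = np.stack([X ** q for q in range(p + 1)])     # (p+1) x n x d
    S = np.einsum('qjk,j->qk', powers, z)                 # S[q, k] = sum_j x_j(k)^q z_j
    out = np.zeros(n)
    for m in range(p + 1):
        coef = comb(p, m) * (-1) ** (p - m)
        out += coef * np.einsum('ik,k->i', powers[m], S[p - m])
    return out

# Sanity check with p = 4.
X = np.random.rand(6, 3); z = np.random.randn(6); p = 4
A = (np.abs(X[:, None, :] - X[None, :, :]) ** p).sum(-1)
assert np.allclose(lp_even_matvec(X, z, p), A @ z)
```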
The case of a general non-negative odd integer $p$ follows in a straightforward manner by combining the above techniques with those of the $\ell_1$ case of Theorem 2.2, so we omit the proof.
Theorem A.4.
For an odd integer $p$, we can compute $Az$ in $O(nd\log n)$ preprocessing time and $O(ndp)$ query time.
A.2 Other Distance Functions
In this section we initialize matrix-vector queries for a wide variety of “distance” functions.
‘Mixed’ $\ell_\infty$.

We consider the case of a ‘permutation invariant’ version of the $\ell_\infty$ norm defined as follows:
$$f(x, y) = \max_{k, l \in [d]} |x(k) - y(l)|.$$
$f$ is not a norm but we will refer to it as ‘mixed’ $\ell_\infty$.
Theorem A.5.
We can compute $Az$ in $O(nd)$ preprocessing time; after preprocessing, the query time is independent of $d$.
Proof.
The preprocessing time holds because we calculate the maximum and the minimum of a list of $d$ numbers a total of $n$ times. For the query time, note that each coordinate of $Az$ can be computed from these precomputed maxima and minima alone, using one $\max$ operation for each index of the sum in Line 6 of Algorithm 7.

To prove correctness, note that for any two vectors $x, y$, the maximum value of $|x(k) - y(l)|$ is attained when $x(k)$ and $y(l)$ are among the minimum / maximum values of the coordinates of $x$ and $y$ respectively. To see this, fix a value of $k$. We can always increase $|x(k) - y(l)|$ by setting $y(l)$ to be the maximum or minimum coordinate over all $l$. ∎
Mahalanobis Distance Squared.
We consider the function
$$f(x, y) = (x - y)^\top M (x - y)$$
for some matrix $M \in \mathbb{R}^{d \times d}$. This is the squared version of the well-known Mahalanobis distance.
Theorem A.6.
We can compute $Az$ in $O(nd^2)$ preprocessing time and $O(nd + d^2)$ query time.
Proof.
Note that the $i$th coordinate of $Az$ is given by
$$(Az)_i = (x_i^\top M x_i)\sum_j z_j + \sum_j z_j\, x_j^\top M x_j - x_i^\top M \Big(\sum_j z_j x_j\Big) - \Big(\sum_j z_j x_j\Big)^{\!\top} M x_i,$$
which proves correctness. It takes $O(nd^2)$ time to compute $x_i^\top M x_i$ for all $i$ (the preprocessing step), $O(nd + d^2)$ time to compute $Mv$ and $M^\top v$ for $v = \sum_j z_j x_j$, and then $O(nd)$ time to compute the $i$th coordinate of $Az$ for all $i$. ∎
Polynomial Kernels.
We now consider polynomial kernels of the form $f(x, y) = (x^\top y)^p$ for a positive integer $p$ (the same approach handles kernels with an additive constant, such as $(c + x^\top y)^p$).
Theorem A.7.
We can compute $Az$ in $O(nd^p)$ query time.
Proof Sketch.
Consider the following expression
$$(Az)_i = \sum_{j=1}^n (x_i^\top x_j)^p z_j$$
as a polynomial in the coordinates of $x_i$. By rearranging, the above sum can be written as a sum over $d^p$ terms, corresponding to each monomial $x_i(k_1)\cdots x_i(k_p)$ where $k_1, \ldots, k_p \in [d]$. The coefficient of each term is $\sum_j z_j\, x_j(k_1)\cdots x_j(k_p)$, and all $d^p$ coefficients can be computed in $O(nd^p)$ total time given the dataset and $z$. Once computed, we can evaluate the polynomial at $x_i$ for all $i$, which forms the coordinates of $Az$. Again, this can be viewed as “linearizing” the kernel given by $f$. ∎
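A minimal sketch of this linearization for the kernel $(x^\top y)^p$ (if the kernel includes an additive constant, the same approach applies after appending a constant coordinate; names are ours):

```python
import numpy as np
from itertools import product

def poly_kernel_matvec(X, z, p):
    """Exact A @ z for A_ij = (x_i . x_j)^p in O(n d^p) time by linearizing the
    kernel: (x . y)^p = sum over tuples (k_1,...,k_p) of prod_t x(k_t) * prod_t y(k_t)."""
    n, d = X.shape
    tuples = list(product(range(d), repeat=p))      # all d^p index tuples
    Phi = np.empty((n, len(tuples)))                # explicit feature map Phi(x)
    for c, ks in enumerate(tuples):
        Phi[:, c] = np.prod(X[:, list(ks)], axis=1)
    return Phi @ (Phi.T @ z)                        # A = Phi Phi^T, so Az = Phi (Phi^T z)

# Sanity check with p = 3.
X = np.random.rand(5, 3); z = np.random.randn(5); p = 3
A = (X @ X.T) ** p
assert np.allclose(poly_kernel_matvec(X, z, p), A @ z)
```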
A.3 Distances for Distributions
We now consider the case that each $x_i$ specifies a discrete distribution over a domain of $d$ elements. Matrices $A_{ij} = f(x_i, x_j)$, where $f$ is a function computing distances between distributions, have recently been studied in machine learning.
We consider how to construct matrix-vector queries for such matrices for a range of widely used distance measures on distributions. First note that a result for the total variation (TV) distance follows immediately from our $\ell_1$ result, since the TV distance is half the $\ell_1$ distance.
Theorem A.8.
Suppose $f$ is the total variation distance. We can compute $Az$ in $O(nd\log n)$ preprocessing time and $O(nd)$ query time.
We now consider some other distance functions on distributions.
Divergences.
Through a similar calculation as in the KL divergence case, we can also achieve $O(nd)$ query times if $f$ is the Jensen-Shannon divergence, as well as for the cross entropy function.
Theorem A.9.
Let $f$ be the Jensen-Shannon divergence or the cross entropy function. Then $Az$ can be computed in $O(nd)$ time.
Through a similar calculation as done in Section A.2, we can also perform matrix-vector multiplication queries in the case that
$$f(x, y) = \frac{1}{2}\sum_{k=1}^d \big(\sqrt{x(k)} - \sqrt{y(k)}\big)^2.$$
This is the squared Hellinger distance.
Theorem A.10.
Let $f$ be the squared Hellinger distance. Then $Az$ can be computed in $O(nd)$ time.
A.4 Approximate Matrix-Vector Queries for $\ell_2$
While our techniques do not extend to exact matrix-vector queries for the $\ell_2$ case, we can nonetheless instantiate approximate matrix-vector queries for the $\ell_2$ function. We first recall the following well known embedding result.
Theorem A.11.
Let $\epsilon \in (0, 1)$ and define $\Pi : \mathbb{R}^d \to \mathbb{R}^m$ by
$$\Pi(x) = \frac{1}{c\, m}\big(\langle g_1, x\rangle, \ldots, \langle g_m, x\rangle\big),$$
where the $g_t \in \mathbb{R}^d$ have i.i.d. standard Gaussian entries and $m = O(\epsilon^{-2}\log n)$. Then for every vector $x$ in any fixed set of $\mathrm{poly}(n)$ vectors, with high probability we have
$$(1-\epsilon)\|x\|_2 \le \|\Pi(x)\|_1 \le (1+\epsilon)\|x\|_2,$$
where $c$ is an absolute constant.
We can instantiate approximate matrix-vector queries for $\ell_2$ via the following algorithm: in preprocessing, we apply the map $\Pi$ of Theorem A.11 to every point of the dataset. For queries, we then simply run Algorithm 2 on the projected points. We have the following guarantee:
Theorem A.12.
Let $\epsilon \in (0, 1)$ and let $A$ be the $\ell_2$ distance matrix. There exists a matrix $B$ such that we can compute $Bz$ in $O(nd\,\epsilon^{-2}\log n)$ preprocessing time and $O(n\,\epsilon^{-2}\log n)$ query time, where
$$\|A - B\|_F \le \epsilon\,\|A\|_F$$
with high probability.
Proof.
The preprocessing and query times follow from the time required to apply the transformation from Theorem A.11 to our set of points, as well as the time needed for the matrix-vector query result of Theorem 2.1 applied in dimension $m = O(\epsilon^{-2}\log n)$. The Frobenius norm guarantee follows from the fact that every entry of $A$ is approximated with relative error $\epsilon$ by the corresponding entry of $B$, using Theorem A.11. ∎
A.5 Matrix-Vector Query Lower Bounds
Table 1 shows that we can initialize matrix-vector queries for a variety of distance functions in roughly $O(nd)$ time. It is straightforward to see that this bound is optimal for a large class of distance matrices.
Theorem A.13.
Consider a distance function $f$ satisfying $f(x, x) = 0$ for all $x$. Further assume that for every $x$, there exists an input $y$ such that $f(x, y) \ne 0$. Any algorithm which outputs an entry-wise approximation of $Az$ to any constant factor for an arbitrary input $z$ requires $\Omega(nd)$ time in the worst case.
Proof.
We consider two cases for the input points. In the first case, all points in our dataset are identical. In the second case, a randomly chosen point is at a nonzero distance from the remaining identical points. Computing the product of $A$ with the all-ones vector allows us to distinguish the two cases, as the entries of the product sum to $0$ in the first case whereas they sum to a nonzero value in the second case. Thus to approximate $Az$ entry-wise to any constant factor, we must distinguish the two cases. If we read $o(n)$ points, then with good probability we will not encounter the distinguished point and hence cannot distinguish the two cases. Thus, we must read $\Omega(n)$ points, requiring $\Omega(nd)$ time. ∎
Appendix B When Do Our Upper Bound Techniques Work?
By this point, we have seen many examples of matrix-vector queries which can be instantiated, as well as a lower bound for a natural distance function which prohibits any subquadratic time algorithm. Naturally, we are thus interested in the limits of our upper bound techniques for instantiating fast matrix-vector products. An understanding of such limits sheds light on families of structured matrices which may admit fast matrix-vector queries in general. In this section we fully characterize the capabilities of our upper bound methods and show that essentially our techniques can only work for “linear” functions (in a possibly different basis).
First we set some notation. Let $A \in \mathbb{R}^{n \times n}$ be a matrix we wish to compute, where each entry is given by $A_{ij} = f(x_i, x_j)$. Given a query vector $z$, the $i$th coordinate of $Az$ is given by
$$(Az)_i = \sum_{j=1}^n f(x_i, x_j)\, z_j.$$
An example choice of $f$ is given by $f(x, y) = \sum_{k=1}^d x(k)\log y(k)$ (assuming all the coordinates of $x$ and $y$ are entry-wise non-negative; note this is related to the cross entropy function in Table 1).
We first highlight the major steps which are common to all of our upper bound algorithms, using $f(x, y) = \sum_k x(k)\log y(k)$ as an example. Our upper bound technique proceeds as follows:

- Break up $f(x_i, x_j)$ into a sum over $d$ terms: $(Az)_i = \sum_{j=1}^n z_j \sum_{k=1}^d x_i(k)\log x_j(k)$.
- Switch the order of summation: $(Az)_i = \sum_{k=1}^d x_i(k) \sum_{j=1}^n z_j \log x_j(k)$.
- Evaluate each of the $d$ inner summations with one evaluation each (after some preprocessing). In other words, for a fixed $k$, the inner sum $W_k = \sum_{j=1}^n z_j \log x_j(k)$ can be computed once and reused for every $i$; in preprocessing, we can compute the values $\log x_j(k)$ since they do not depend on the coordinate $i$.
The key steps of the above outline, namely switching the order of summation and precomputation of repeated terms, can be encapsulated in the following framework.
Theorem B.1.
Suppose there exist mappings (possibly non-linear) $\varphi, \psi : \mathbb{R}^d \to \mathbb{R}^{d'}$ and a continuous $h : \mathbb{R}^2 \to \mathbb{R}$ such that for every $x_i, x_j$,
$$f(x_i, x_j) = \sum_{k=1}^{d'} h\big(\varphi(x_i)_k, \psi(x_j)_k\big).$$
Further suppose that each of the terms can be evaluated as
$$\sum_{j=1}^n z_j\, h\big(\varphi(x_i)_k, \psi(x_j)_k\big) = h\Big(\varphi(x_i)_k,\, \sum_{j=1}^n z_j\, \psi(x_j)_k\Big)$$
for any choice of the vector $z$. Then $h$ must be a linear function in its second variable.
Theorem B.1 is stated in quite general terms. We are stipulating the following: the functions $\varphi, \psi$ represent possibly non-linear transformations applied to $x_i$ and $x_j$ respectively, such that $f(x_i, x_j)$ can be decomposed as a sum of function evaluations. Each function evaluation takes in the same coordinate, say the $k$th coordinate, of both $\varphi(x_i)$ and $\psi(x_j)$ and computes the function $h$. Finally, the resulting weighted sum over $j$ can be computed as a single evaluation of $h$ at $\varphi(x_i)_k$ and the aggregated quantity $\sum_j z_j \psi(x_j)_k$.
If these conditions hold (which is precisely the case in the proofs of all our upper bound results), then it must be the case that $h$ has a very special form; in particular, $h$ must be a linear function in its second variable. To make the setting more concrete, we map the terminology of Theorem B.1 onto some examples from our upper bound results.
First consider the case that $f(x, y) = x^\top y$. In this case, both $\varphi$ and $\psi$ are the identity maps and $h(a, b) = ab$. It is indeed the case that $h$ is linear in $b$. Now consider the slightly more complicated choice $f(x, y) = \sum_k x(k)\log y(k)$. Here, $\varphi$ is again the identity, but $\psi$ is a coordinate-wise map such that $\psi(y)_k = \log y(k)$. The function $h$ again satisfies $h(a, b) = ab$. Finally we consider the example which sets $f(x, y) = (x^\top y)^2$. In particular, the mappings $\varphi$ and $\psi$ expand $x$ and $y$ into $d^2$-dimensional vectors, whose coordinates represent all possible products of two coordinates of $x$ and $y$ respectively. (The reader may realize that this particular case is an example of “linearizing” the kernel given by $f$.) Again $h$ is the same function as before.
The proof of Theorem B.1 relies on the following classical result on the solutions of Cauchy’s functional equation.
Theorem B.2.
Let $g : \mathbb{R} \to \mathbb{R}$ be a continuous function which satisfies $g(u + v) = g(u) + g(v)$ for all inputs in its domain. Then $g$ must be a linear function.
For us, the hypothesis that $g$ is continuous suffices, but it is known that the above result follows from much weaker hypotheses. We refer the reader to [Kuc09] for references related to Cauchy’s functional equation.
Proof of Theorem B.1.
Our goal is to show that if
$$\sum_{j=1}^n z_j\, h(a, b_j) = h\Big(a,\, \sum_{j=1}^n z_j b_j\Big)$$
for all $a$ in the range of $\varphi$, all $b_j$ in the range of $\psi$, and all choices of input points and of $z$, then $h$ must be linear in its second variable. First set $z_1 = z_2 = 1$ and $z_j = 0$ for all $j > 2$. For ease of notation, denote $g(t) = h(a, t)$ for a fixed $a$. As we vary the coordinates of the points $x_1$ and $x_2$, the values $b_1$ and $b_2$ vary over the range of $\psi$. Thus,
$$g(b_1) + g(b_2) = g(b_1 + b_2)$$
for all possible inputs $b_1, b_2$. However, this is exactly the hypothesis of Theorem B.2, so it follows that $g$, and hence $h$, must be a linear function in its second coordinate, as desired. ∎
While the proof of Theorem B.1 is straightforward, it precisely captures the scenarios where our upper bound techniques apply. In short, it implies that $f$ must have a linear structure, under a suitable change of basis, for our techniques to hold. If that is not the case, then our techniques do not apply and new ideas are needed. Nevertheless, as displayed by the versatility of examples in Table 1, such a structure is quite common in many applications where matrices of distance or similarity functions arise.
The observant reader might wonder how our result for the $\ell_1$ function fits into the above framework, as it is not obviously linear. However, we note that the inner sum $\sum_j z_j\, |x_i(k) - x_j(k)|$ (which appears in the theorem statement of Theorem B.1) is actually a piece-wise linear function of $x_i(k)$. The sorting preprocessing we performed for Theorem 2.1 can be thought of as creating a data structure which allows us to efficiently index into the correct linear piece.
Appendix C Applications of Matrix-Vector Products
C.1 Preliminary Tools
We highlight specific prior results which we use in conjunction with our matrix-vector query upper bounds to obtain improved algorithmic results. First we recall a result of [BCW22], which gives a nearly optimal low-rank approximation algorithm in terms of the number of matrix-vector queries required.
Theorem C.1 (Theorem 5.1 in [BCW22]).
Given matrix-vector query access to a matrix $A \in \mathbb{R}^{n \times n}$, an accuracy parameter $\epsilon \in (0, 1)$, and any integer $k \ge 1$, there exists an algorithm which uses $\tilde{O}(k/\epsilon^{1/3})$ matrix-vector queries and outputs a matrix $Z \in \mathbb{R}^{n \times k}$ with orthonormal columns such that, with high probability,
$$\|A - ZZ^\top A\|_{S_p} \le (1+\epsilon)\min_{U:\, \mathrm{rank}(U) = k}\|A - U\|_{S_p},$$
where $\|B\|_{S_p} = \big(\sum_i \sigma_i(B)^p\big)^{1/p}$ is the Schatten $p$-norm and the $\sigma_i(B)$ are the singular values of $B$. The runtime of the algorithm is $\tilde{O}(Tk/\epsilon^{1/3}) + n\cdot\mathrm{poly}(k, 1/\epsilon)$, where $T$ is the time for computing a matrix-vector query.
The second result is from [MM15], which gives an optimized analysis of a variant of the power method (block Krylov iteration) for computing the top singular values.
Theorem C.2 ([MM15]).

Given matrix-vector query access to a matrix $A$ and an accuracy parameter $\epsilon \in (0, 1)$, there exists an algorithm which uses $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector queries and outputs $(1 \pm \epsilon)$-approximations to the top $k$ singular values of $A$. The runtime of the algorithm is $\tilde{O}(Tk/\sqrt{\epsilon}) + n\cdot\mathrm{poly}(k, 1/\epsilon)$, where $T$ is the time for computing a matrix-vector query.
Lastly, we recall the guarantees of the classical conjugate gradient method.
Theorem C.3.
Let $A$ be a symmetric PSD matrix, consider the linear system $Ax = b$, and let $x^*$ be its solution. Let $\kappa$ denote the condition number of $A$. Given a starting vector $x_0$, the conjugate gradient algorithm uses $O(\sqrt{\kappa}\log(1/\epsilon))$ matrix-vector queries and returns $x$ such that
$$\|x - x^*\|_A \le \epsilon\,\|x_0 - x^*\|_A,$$
where $\|v\|_A = \sqrt{v^\top A v}$ denotes the $A$-norm.
Note that many matrices in our setting are also PSD, for example kernel matrices such as the polynomial kernel matrix. For non-PSD matrices $A$, one can also run conjugate gradient on the matrix $A^\top A$, which squares the condition number. There are also more sophisticated algorithms which work directly with matrix-vector queries of $A$ for non-PSD matrices; see, for example, the references in Chapters 6 and 7 of [AG11]. We omit their discussion for clarity and just note that in practice, iterative methods which directly use matrix-vector queries are preferred for linear system solving.
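As a concrete illustration (a minimal sketch, not the paper's implementation), the snippet below solves a linear system through SciPy's conjugate gradient while exposing the matrix only through a matvec callback. The linear-kernel-plus-ridge matrix is a stand-in assumption; any of our fast distance-matrix queries could be plugged in instead.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# A PSD matrix accessed only through a fast matrix-vector routine:
#     A z = X (X^T z) + lam * z,   which costs O(nd) per query.
rng = np.random.default_rng(0)
n, d, lam = 500, 20, 1.0
X = rng.standard_normal((n, d))
matvec = lambda z: X @ (X.T @ z) + lam * z

A_op = LinearOperator((n, n), matvec=matvec)   # implicit matrix
b = rng.standard_normal(n)
x, info = cg(A_op, b)                          # CG uses only matvec access
print("converged:", info == 0, " residual:", np.linalg.norm(matvec(x) - b))
```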
C.2 Applications
We now derive specific applications using prior results from “matrix-free” methods. First we cover low-rank approximation.
For the $\ell_1$ and $\ell_2^2$ distance matrices, we improve upon prior works for computing a relative error low-rank approximation. While we can obtain such an approximation for a wide variety of Schatten norms, we state the bound in terms of the Frobenius norm since it has been studied in prior works.
Theorem C.4.
Let $\epsilon \in (0, 1)$ and consider the case that $A_{ij} = \|x_i - x_j\|_1$ or $A_{ij} = \|x_i - x_j\|_2^2$ for all $i, j$. We can compute a matrix $Z \in \mathbb{R}^{n \times k}$ with orthonormal columns such that
$$\|A - ZZ^\top A\|_F \le (1+\epsilon)\,\|A - A_k\|_F,$$
where $A_k$ denotes the optimal rank-$k$ approximation to $A$ in Frobenius norm. The runtime is the bound of Theorem C.1 with $T = O(nd)$.
Proof.
The result follows by combining the exact matrix-vector queries of Theorems 2.2 and A.1 with Theorem C.1. ∎

Note that the best prior result for the special case of $\ell_1$ and $\ell_2^2$ is from [BCW20], whose runtime bound carries additional multiplicative overhead compared to ours. We point out that their bound is actually optimal for the class of algorithms which only sample entries of $A$. Thus our results show that if we know the set of points beforehand, which is a natural assumption, one can overcome such lower bounds.
For the case of $\ell_2$, we cannot hope to achieve a relative error guarantee via our approach, since we only have fast approximate matrix-vector queries for a matrix $B$ satisfying $\|A - B\|_F \le \epsilon\|A\|_F$ via Theorem A.12. Nevertheless, we can still obtain an additive error low-rank approximation guarantee which outperforms prior works. First we show that our approximate matrix-vector queries are sufficient to obtain such a guarantee.
Lemma C.5.
Let $B$ satisfy $\|A - B\|_F \le \epsilon\|A\|_F$ and suppose $A_k$ and $B_k$ are the best rank-$k$ approximations of $A$ and $B$ respectively in the Frobenius norm. Then
$$\|A - B_k\|_F \le \|A - A_k\|_F + 2\epsilon\|A\|_F.$$
Proof.
We have
$$\|A - B_k\|_F \le \|A - B\|_F + \|B - B_k\|_F \le \epsilon\|A\|_F + \|B - A_k\|_F \le \epsilon\|A\|_F + \|B - A\|_F + \|A - A_k\|_F \le 2\epsilon\|A\|_F + \|A - A_k\|_F. \;\; \blacksquare$$
Theorem C.6.
Let $A_{ij} = \|x_i - x_j\|_2$ for all $i, j$. We can compute a matrix $Z \in \mathbb{R}^{n \times k}$ with orthonormal columns such that
$$\|A - ZZ^\top A\|_F \le \|A - A_k\|_F + O(\epsilon)\,\|A\|_F$$
with high probability, where $A_k$ denotes the optimal rank-$k$ approximation to $A$ in Frobenius norm. The runtime is the bound of Theorem C.1 with the approximate query time of Theorem A.12.
Proof.
The result follows by combining Theorem C.1, applied to the matrix $B$ given by the approximate matrix-vector queries of Theorem A.12, with Lemma C.5. ∎

The best prior work for additive error low-rank approximation for this case is due to [IVWW19], which obtained such a guarantee with a runtime involving a large unspecified polynomial in $k$ and $1/\epsilon$. Lastly we note that our relative error low-rank approximation guarantee holds for any $f$ in Table 1, as summarized in Table 2.
Theorem C.7.
Suppose we have exact matrix-vector query access to a matrix $A$, with each query taking time $T$. Then we can output a matrix $Z \in \mathbb{R}^{n \times k}$ with orthonormal columns such that
$$\|A - ZZ^\top A\|_F \le (1+\epsilon)\,\|A - A_k\|_F,$$
where $A_k$ denotes the optimal rank-$k$ approximation to $A$ in Frobenius norm. The runtime is the bound of Theorem C.1.
Directly appealing to Theorems C.2 and C.3, in conjunction with our matrix-vector queries, achieves the fastest runtimes that we are aware of for computing the top singular values and solving linear systems for a wide variety of distance matrices.
Theorem C.8.
Suppose we have exact matrix-vector query access to a matrix $A$, with each query taking time $T$. We can compute $(1 \pm \epsilon)$-approximations to the top $k$ singular values of $A$ in the time given by Theorem C.2. Furthermore, we can solve linear systems in $A$ with the same guarantees as any iterative method which only uses matrix-vector queries, with a cost of $O(T)$ per iteration.
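For intuition, the sketch below estimates the top singular value of a symmetric matrix using plain power iteration driven only by a matvec oracle; the paper's applications rely on the more refined block Krylov method of [MM15], so this is only an illustration of how matrix-free methods consume our queries.

```python
import numpy as np

def top_singular_value(matvec, n, iters=200, seed=0):
    """Estimate the top singular value of a symmetric matrix given only a
    matrix-vector oracle, via power iteration (one query per step)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = matvec(v)
        v = w / np.linalg.norm(w)
    return abs(v @ matvec(v))                # Rayleigh quotient magnitude

# Demo with an explicit symmetric matrix standing in for a distance matrix.
A = np.random.rand(100, 100); A = A + A.T
print(top_singular_value(lambda z: A @ z, 100), np.linalg.norm(A, 2))
```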
Finally, we can also perform matrix multiplication faster with distance matrices compared to the general $n^\omega$ runtime. This follows from the following lemma.
Lemma C.9.
Suppose $A$ admits an exact matrix-vector query algorithm running in time $T$. Then for any matrix $B$ with $n$ rows and $m$ columns, we can compute $AB$ in time $O(Tm)$.
Proof.
We can compute $AB$ by computing the product of $A$ with each of the $m$ columns of $B$ separately. ∎
As a corollary, we obtain faster matrix multiplication for all families of matrices for which we have obtained a fast matrix-vector query. We state one such corollary for the $\ell_1$ case.
Corollary C.10.
Let $A_{ij} = \|x_i - x_j\|_1$ for all $i, j$. For any other matrix $B \in \mathbb{R}^{n \times m}$, we can compute $AB$ in time $O(ndm)$ after $O(nd\log n)$ preprocessing.
We can improve upon the above result slightly if we are multiplying two distance matrices in the $\ell_2^2$ case.
Lemma C.11.
Consider the case that $A_{ij} = \|x_i - x_j\|_2^2$ and $B_{ij} = \|y_i - y_j\|_2^2$ for all $i, j$, i.e., both $A$ and $B$ are $n \times n$ distance matrices for the $\ell_2^2$ function, with $d \le n$. We can compute $AB$ in time $O(n^2 d^{\omega - 2})$.
Proof.
By decomposing both $A$ and $B$ as in Section 1.2 (into rank-one parts plus Gram matrices), it suffices to compute the product $XX^\top YY^\top$, where $X$ and $Y$ are the $n \times d$ matrices with the points $x_i$ and $y_i$ in their rows respectively; the remaining terms involve rank-one factors and are easy to handle. $X^\top Y$ can be computed in $O(nd^{\omega - 1})$ time. Then $X(X^\top Y)$ can also be computed in $O(nd^{\omega - 1})$ time. Finally, we need to compute the product of the $n \times d$ matrix $X(X^\top Y)$ with $Y^\top$. This can be done in $O(n^2 d^{\omega - 2})$ time by decomposing both matrices into $d \times d$ square blocks and using the standard matrix multiplication bound on each pair of square blocks. This results in the claimed runtime of $O(n^2 d^{\omega - 2})$. ∎
Appendix D A Fast Algorithm for Creating $\ell_1$ and $\ell_2$ Distance Matrices
We now present a fast algorithm for creating distance matrices which addresses the third contribution stated in the introduction. Given a set of $n$ points in $\mathbb{R}^d$, our goal is to initialize an approximate distance matrix $B$ for the $\ell_1$ distance which satisfies
$$(1-\epsilon)\|x_i - x_j\|_1 \le B_{ij} \le (1+\epsilon)\|x_i - x_j\|_1 \qquad (1)$$
for all entries of $B$, where $\epsilon$ is a precision parameter. The straightforward way to create the exact distance matrix takes $O(n^2 d)$ time, and by using the stability of Cauchy random variables, we can create a $B$ which satisfies (1) in approximately $O(n^2\epsilon^{-2}\log n)$ time. (Note the Johnson-Lindenstrauss lemma implies a similar guarantee for the $\ell_2$ distance matrix.) The goal of this section is to improve upon this ‘baseline’ runtime; the final runtime guarantees of this section are stated in Theorem D.5.
Our improvement holds in the Word-RAM model of computation. Formally, we assume each memory cell (i.e. word) can hold bits and certain computations on words take time. The only assumption we require is that the arithmetic operations of adding or subtracting words, as well as performing left or right bit shifts on words, take constant time.
We first present prior work on metric compression of [IRW17] in Section D.1. Our algorithm description starts in Section D.2, which describes our preprocessing step. Section D.3 then presents our key algorithmic ideas, whose runtime and accuracy are analyzed in Sections D.4 and D.5.
D.1 Metric Compression Tree of [IRW17]
The starting point of our result is the metric compression tree construction of [IRW17], whose properties we summarize below. First we introduce some useful definitions. The aspect ratio of is defined as
Let and .
Theorem of [IRW17] implies the following result. Given a dataset with aspect ratio , there exists a tree data structure which allows for the computation of a compressed representation for the purposes of distance computations. At a high level, is created by enclosing in a large enough and appropriately shifted axis-parallel square and then recursively dividing into smaller squares (also called cells) with half the side-length, until all points of are contained in their own cells. The edges of encode the cell containment relationships. Formally, has the following properties:
•
The leaf nodes of correspond to the points of .
•
The edges of are of two types, short edges and long edges, defined as follows. Short edges have a length bit vector associated with them (one bit per coordinate), whereas long edges have an integer associated with them.
•
Each long edge with associated integer represents a non-branching path of length of short edges, all of whose associated bit vectors are the all-zeros string.
•
Each node of (including the nodes that lie on paths which are compressed into long edges) has an associated level . The level of a node corresponds to an axis-parallel square of side length which contains the axis-parallel squares of all child nodes of .
The notion of a padded point is important for the metric compression properties of .
Definition D.1 (Padded Point).
A point is -padded, if the grid cell of side length that contains also contains the ball of radius centered at , where
We say that is -padded in , if it is -padded for every level .
The following lemma is proven in [IRW17]. First define
(2)
Lemma D.1 (Lemma 1 in [IRW17]).
Consider the construction of defined formally in Section of [IRW17]. Every point is -padded in T with probability .
Now let be any point in our dataset. We can construct , called the decompression of , from with the following procedure: we follow the downward path from the root of to the leaf associated with and collect a bit string for every coordinate of . When going down a short edge with an associated bit vector , we concatenate the th bit of to the end of the bit string that we are keeping track of for the th coordinate of . When going down a long edge, we append to every coordinate’s bit string a number of zeros equal to the integer associated with the long edge. Finally, a binary point is placed in the resulting bit string of each coordinate after the bit corresponding to level . The collected bits then correspond to the binary expansions of the coordinates of . For a more thorough description of the decompression scheme, see Section of [IRW17].
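A small sketch of this decompression, using an illustrative path representation (a list of ("short", bits) and ("long", k) edges read from the root down) and the convention that the bit collected while descending into level t contributes 2^t to its coordinate; the exact placement of the binary point follows the convention of [IRW17].

```python
def decompress(path, top_level):
    """Reconstruct approximate coordinates from a root-to-leaf path.

    `path` is a list of edges: ("short", bits) carries one bit per coordinate,
    while ("long", k) stands for k consecutive all-zero short edges.
    """
    d = next(len(bits) for kind, bits in path if kind == "short")
    coords = [0.0] * d
    level = top_level
    for kind, payload in path:
        if kind == "short":
            level -= 1
            for j, bit in enumerate(payload):
                coords[j] += bit * 2.0 ** level     # bit of level `level` in coordinate j
        else:
            level -= payload                        # long edge: only zero bits, skip levels
    return coords

# Three short edges below level 3: coordinate 0 collects bits at levels 2 and 0.
print(decompress([("short", [1, 0]), ("short", [0, 1]), ("short", [1, 0])], 3))
# -> [5.0, 2.0]
```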
The decompression scheme is useful because it allows approximate distance computations.
Lemma D.2 (Lemma 2 in [IRW17]).
If a point is -padded in , then for every ,
We now cite the runtime and space required for . The following theorem follows from the results of [IRW17].
Theorem D.3.
Let . has edges, height , its total size is bits, and its construction time is .
We contrast the above guarantees with the naive representation of , which stores bits of precision for each coordinate and occupies bits of space, whereas occupies roughly bits.
D.2 Step 1: Preprocessing Metric Compression Trees
We now describe the preprocessing steps needed for our faster distance matrix compression. Let
(3)
and recall our setting of in (2). Note that we assume is an integer which implicitly assumes .
First we describe the preprocessing of . Consider a short edge of with an associated length bit string . We break up into equal chunks, each containing bits. Consider a single chunk . We pad an equal number of zeros after each bit in so that the total number of bits is . We then store each padded chunk in one word. We do this for every chunk, resulting in words for each short edge, and we do this for all short edges in all the trees.
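A sketch of this packing step (the word size and field widths are illustrative): each chunk of the edge's bit string is spread out so that every coordinate owns a fixed-width field inside a word, with the edge's bit at the top of its field and zero slots underneath it to receive the bits of deeper levels later.

```python
def pack_short_edge(bits, bits_per_word, word_size=64):
    """Pack the bit label of a short edge into words, padding zeros after each
    bit so that every coordinate owns a field of word_size // bits_per_word bits."""
    field = word_size // bits_per_word
    words = []
    for start in range(0, len(bits), bits_per_word):
        chunk = bits[start:start + bits_per_word]
        w = 0
        for j, bit in enumerate(chunk):
            # place bit j at the top of its field; the remaining field - 1 slots
            # stay zero and will receive the bits of deeper levels later
            w |= bit << (word_size - 1 - j * field)
        words.append(w)
    return words

# Example: 4 coordinates per 16-bit word, so each coordinate gets a 4-bit field.
print([format(w, "016b") for w in pack_short_edge([1, 0, 1, 1], 4, word_size=16)])
# -> ['1000000010001000']
```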
The second preprocessing step we perform is creating a table . The rows and columns of are indexed by all possible bit strings with bits. The entries of record evaluations of the function defined as follows: given where , consider the partition of each of them into blocks, each with an equal number of bits ( bits per block; note that ). Each block defines an integer. Doing so results in integers derived from and integers derived from . Finally,
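The following sketch builds such a table for the ℓ1 case, under the assumption that the tabulated function is the sum over blocks of the absolute differences of the block integers (the ℓ2 case would store squared differences instead); indexing by the raw integer value of each packed word keeps every lookup O(1). The parameters are illustrative, and the exhaustive enumeration is only feasible for the small word sizes of a sketch.

```python
from itertools import product

def build_block_table(num_blocks, bits_per_block, p=1):
    """Precompute the lookup table T indexed by pairs of packed words.

    Each word is split into num_blocks blocks of bits_per_block bits, each block
    is read as an integer, and the stored value is sum |a_i - b_i|^p over blocks
    (p = 1 for the ell_1 case, p = 2 for squared ell_2 contributions).
    """
    word_bits = num_blocks * bits_per_block
    mask = (1 << bits_per_block) - 1

    def blocks(word):
        return [(word >> (word_bits - (i + 1) * bits_per_block)) & mask
                for i in range(num_blocks)]

    table = {}
    for a, b in product(range(1 << word_bits), repeat=2):
        table[(a, b)] = sum(abs(x - y) ** p for x, y in zip(blocks(a), blocks(b)))
    return table

# Example: 2 blocks of 2 bits; the word 0b1001 has blocks (2, 1).
T = build_block_table(num_blocks=2, bits_per_block=2)
print(T[(0b1001, 0b0011)])   # blocks (2, 1) vs (0, 3): |2 - 0| + |1 - 3| = 4
```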
D.3 Step 2: Depth-First Search
We now calculate one row of the distance matrix, namely the distances from a padded point to all other points in our dataset. Our main subroutine is a tree search procedure. Its input is a node in a tree and it performs a depth-first search on the subtree rooted at as described in Algorithm 11. Given an internal vertex , it calculates all the distances between the padded point and the data points in our dataset which are leaf nodes of the subtree rooted at . A summary of the algorithm follows.
We perform a depth-first search starting at . As we traverse the tree, we keep track of the current embedding of the internal nodes by collecting bits along the edges of : we append bits when we descend the tree and remove them as we move back up edges. However, we only keep track of this embedding up to levels below . After that, we continue traversing the tree but do not update the embedding. The reason is that after levels, the embedding is precise enough for all nodes below with respect to computing the distance to . Towards this end, we also track how many levels below the tree search currently is and update this value appropriately.
The current embedding is tracked using words. Recall that the bit string of every short edge has been repackaged into words, each ‘responsible’ for coordinates. Furthermore, in each word on the edge, we have padded zeros between the bits of the coordinates. When we need to update the current embedding by incorporating the bits along a short edge , we simply perform a bit shift on each of the words on and add them to the words we are keeping track of. We need to make sure we place the bits ‘in order’: for our tracked embedding, within every coordinate, the bits on an edge should precede the bits on the edge directly following it in the tree search. Due to the padding from the preprocessing step, the bit shift ensures that the bits of the short edges after are placed in their appropriate positions, in order, in the embedding representation.
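Concretely, the update for one short edge is a single shift-and-add per tracked word: an edge sitting t levels below the subtree root has its bits shifted t positions deeper inside every coordinate field before being added, and the zero padding from the preprocessing step guarantees the shifted bits land in empty slots. A tiny sketch with illustrative word and field sizes:

```python
def incorporate_short_edge(tracked, edge_words, depth_below_v):
    # `tracked` and `edge_words` are lists of packed words (one bit per coordinate
    # field); the edge's bits belong `depth_below_v` positions deeper in each field.
    return [t + (e >> depth_below_v) for t, e in zip(tracked, edge_words)]

# One 16-bit word with 4-bit coordinate fields: an edge one level below v lands
# its bits one position deeper inside every field.
print(format(incorporate_short_edge([0b1000000000000000],
                                    [0b1000000010000000], 1)[0], "016b"))
# -> '1100000001000000'
```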
While performing the depth-first search, we will encounter both short and long edges. When encountering a short edge, we run the function Process-Short-Edge and similarly, we run Process-Long-Edge when a long edge is encountered. Finally if we arrive at a leaf node, we run Process-Leaf.
When we arrive at a leaf node , we currently have the decompression of computed from the tree. Note that we have only kept track of the bits after node (up to limited precision), since all prior bits are the same for and because they lie in the same subtree. More specifically, we have words . The first word has bits of each of the first coordinates of . For every coordinate, the bits respect the order described in the decompression step in Section D.2. A similar fact is true for the rest of the words . Now to calculate the distance between and , we just have to consider the bits of all coordinates of which come after descending down vertex . We then repackage these total bits into words in the same format as . Note that this preprocessing for only happens once (at the subtree level) and can be used for all leaves in the subtree rooted at .
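With both points packaged into words, the leaf computation is then just one table lookup per word (shown here for the ℓ1 case with the hypothetical table from the preprocessing sketch; for ℓ2 one would sum squared block differences and take a square root at the end):

```python
def distance_via_table(q_words, x_words, table):
    # One O(1) lookup per word pair; the table stores the blockwise contribution.
    return sum(table[(qw, xw)] for qw, xw in zip(q_words, x_words))
```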
Finally, the complete algorithm just calls Algorithm 11 on successive parent nodes of . We mark each subtree that has already been processed (at its root node) so that each subtree is only ever visited once. The number of calls to Algorithm 11 is at most the height of the tree, which is bounded by . We then repeat this for all points in our dataset (using, for each point, the tree in which it is padded) to create the full distance matrix.
D.4 Runtime Analysis
We consider the runtime required to compute the row of the distance matrix corresponding to a padded point . Multiplying by gives the total runtime. Consider the tree in which is padded and which we use for the algorithm described in the previous section, and recall the properties of outlined in Theorem D.3. has edges, each of which is visited at most twice in the tree search (going down and back up). Thus the time to traverse the tree is . There are also at most short edges in . Updating the embedding given by takes time per edge, since it can be done in total word operations. Long edges do not require this time since they represent all-zero bits; for long edges, we just increment the counter for the current level. Altogether, the total runtime for updating across all calls to Algorithm 11 for the padded point is . Finally, calculating the distance from to a fixed point requires time, since we just index into the array times. Thus the runtime for a single row is dominated by , and the total runtime for computing all rows of the distance matrix is
by setting to be a small constant in (2).
D.5 Accuracy Analysis
We now show that the distances we calculate are accurate to within a multiplicative factor. The lemma below shows that if a padded point and another point have their least common ancestor sufficiently high up in , then we can disregard many lower-order bits in the decompression computed from while still guaranteeing accurate distance measurements. The lemma crucially relies on being padded.
Lemma D.4.
Suppose is -padded in . For another point , suppose the least common ancestor of and is at level . Let and denote the sketches of and produced by . Let be a modified version of where for each of the coordinates, we remove all the bits acquired after level . Similarly define . Then
Proof.
Since is padded, we know that by Definition D.1. On the other hand, if we ignore the bits after level for every coordinate of and , the additive approximation error in the distance is bounded by a constant factor times
From our choice of , we can easily verify that . Putting everything together and adjusting the value of , we have
where we have used the fact that from the guarantees of the compression tree of [IRW17]. ∎
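Schematically, and with illustrative notation and constants in place of the exact ones from [IRW17], the two bounds being combined in this proof are a truncation error bound and the padding lower bound:

```latex
% Truncation: dropping every bit strictly below level \ell - \Lambda changes each
% coordinate by at most \sum_{t < \ell - \Lambda} 2^{t} \le 2^{\ell - \Lambda}, so the
% computed distance moves by at most O(\sqrt{d}\cdot 2^{\ell - \Lambda}) in \ell_2
% (and O(d \cdot 2^{\ell - \Lambda}) in \ell_1).
%
% Padding: since the least common ancestor is at level \ell, the other point lies
% outside the padded point's cell one level lower, giving a lower bound of the form
% \|q - x\| \ge \Omega(\varepsilon_p \cdot 2^{\ell}).
%
% Choosing 2^{-\Lambda} \le \varepsilon \, \varepsilon_p / \mathrm{poly}(d) therefore
% makes the truncation error an \varepsilon-fraction of the true distance.
```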
Putting together our results above along with the Johnson-Lindenstrauss Lemma and Theorem A.11 proves the following theorem.
Theorem D.5.
Let be a dataset of points in dimensions with aspect ratio . We can calculate a matrix such that each entry of satisfies
in time
Assuming the aspect ratio is polynomially bounded, we can compute a matrix such that each entry of satisfies
with probability . The construction time is
D.6 A Faster Algorithm for $\ell_\infty$ Distance Matrix Construction Over Bounded Alphabet
In this section, we show how to create the distance matrix. Recall from Section 3 that, assuming SETH, there is no time algorithm to compute a matrix-vector query for the distance matrix, even for points in . This suggests that any algorithm for computing a matrix-vector query needs to initialize the distance matrix. However, there is still a gap between a lower bound for matrix-vector queries and the naive time needed to compute the distance matrix. We make progress towards showing that this gap is not necessary. Our main result is that, surprisingly, we can initialize the distance matrix in time much faster than the naive time.
Theorem D.6.
Given points over , we can initialize the exact distance matrix in time where is the matrix multiplication constant. We can also initialize a matrix such that each entry of satisfies
in time .
Thus for , which is the setting of the lower bound, we can initialize the distance matrix in time and hence compute a matrix-vector query in that time as well.
Proof.
The starting point of the proof is to design an algorithm which constructs a matrix whose entry is an indicator for or . Given this, we can then sum across all choices and construct the full distance matrix. Thus it suffices to solve this intermediate task.
Pick a sufficiently large such that . A choice of suffices. Now in the case that , we have , and otherwise . Thus, the matrix with the entry being is able to distinguish the two cases, so it suffices to create such a matrix. Now we can write as an inner product in variables, i.e., it is a Gram matrix. Thus computing can be done by computing the product of a matrix with a matrix, which can be done in
time by partitioning each matrix into square submatrices of dimension . Plugging in the bound for and considering all possible choices of results in the final runtime bound of , as desired.
Now if we only want to approximate each entry up to a multiplicative factor, it suffices to loop only over ’s which increase by powers of for a small constant . This replaces an factor by an factor. ∎
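A minimal sketch of the thresholding idea behind the proof, for integer points with coordinates in {0, …, M}: for each threshold t, the indicator of max_k |x_ik − x_jk| ≥ t is recovered by comparing a single Gram-type matrix against c^t, and summing these indicators over t gives the ℓ∞ distance. The base c = 2d + 1 and the plain floating-point matmul are illustrative choices; the theorem's runtime comes from fast rectangular matrix multiplication and its specific choice of parameters.

```python
import numpy as np

def linf_distance_matrix(X, M):
    """Exact ell_infinity distance matrix for integer points X in {0,...,M}^d."""
    n, d = X.shape
    c = float(2 * d + 1)                 # if some |x_ik - x_jk| >= t, one term below is
                                         # >= c^t; otherwise the sum is <= 2d * c^(t-1) < c^t
    # Gram-type matrix: S[i, j] = sum_k (c^(x_ik - x_jk) + c^(x_jk - x_ik)).
    U = np.hstack([c ** X, c ** (-X)])   # n x 2d factor
    V = np.hstack([c ** (-X), c ** X])   # n x 2d factor
    S = U @ V.T
    D = np.zeros((n, n), dtype=int)
    for t in range(1, M + 1):
        D += S >= c ** t - 1e-9          # indicator of distance >= t (with float slack)
    return D

# Check against a brute-force computation on a small instance.
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(6, 3))      # alphabet {0,...,3}, so M = 3
brute = np.max(np.abs(X[:, None, :] - X[None, :, :]), axis=2)
assert np.array_equal(linf_distance_matrix(X, 3), brute)
```

Replacing the loop over all thresholds with geometrically spaced thresholds gives the approximate variant described at the end of the proof.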