
Parallel Batch-Dynamic kd-trees

Rahul Yesantharao [email protected] MIT CSAIL    Yiqiu Wang [email protected] MIT CSAIL    Laxman Dhulipala [email protected] MIT CSAIL    Julian Shun [email protected] MIT CSAIL
Abstract

kd-trees are widely used in parallel databases to support efficient neighborhood/similarity queries. Supporting parallel updates to kd-trees is therefore an important operation. In this paper, we present the BDL-tree, a parallel, batch-dynamic implementation of a kd-tree that allows for efficient parallel k-NN queries over dynamically changing point sets. BDL-trees consist of a log-structured set of kd-trees which can be used to efficiently insert or delete batches of points in parallel with polylogarithmic depth. Specifically, given a BDL-tree with $n$ points, each batch of $B$ updates takes $O(B\log^{2}(n+B))$ amortized work and $O(\log(n+B)\log\log(n+B))$ depth (parallel time). We provide an optimized multicore implementation of BDL-trees. Our optimizations include parallel cache-oblivious kd-tree construction and parallel bloom filter construction.

Our experiments on a 36-core machine with two-way hyper-threading using a variety of synthetic and real-world datasets show that our implementation of the BDL-tree achieves a self-relative speedup of up to $34.8\times$ ($28.4\times$ on average) for batch insertions, up to $35.5\times$ ($27.2\times$ on average) for batch deletions, and up to $46.1\times$ ($40.0\times$ on average) for k-nearest neighbor queries. In addition, it achieves throughputs of up to 14.5 million updates/second for batch-parallel updates and 6.7 million queries/second for k-NN queries. We compare to two baseline kd-tree implementations and demonstrate that BDL-trees achieve a good tradeoff between the two baseline options for implementing batch updates.

I Introduction

Nearest neighbor search is used in a wide range of applications, such as databases, machine learning, data compression, and cluster analysis. One popular data structure for supporting k-nearest neighbor (k-NN) search on low-dimensional spatial data is the kd-tree, originally developed by Bentley [1], as it efficiently builds a recursive spatial partition over point sets.

There has been a significant body of work (e.g., [1, 2, 3, 4, 5, 6, 7, 8, 9]) devoted to developing better kd-tree variants, both in terms of parallelization and spatial heuristics. However, none of these approaches tackle the problem of parallelizing batched updates, which is important given that many real-world datasets are frequently updated. In particular, in a scenario where the set of points is being updated in parallel, existing approaches either become imbalanced or require full rebuilds over the new point set. The upper tree in Figure 1 shows the baseline approach that simply rebuilds the kd-tree on every insert and delete, maintaining perfect spatial balance but adding overhead to the update operations. On the other hand, the lower tree in Figure 1 shows the other baseline approach, which never rebuilds and instead only inserts points into the existing spatial partition and marks deleted points as tombstones. This gives fast updates at the cost of potentially skewed spatial partitions.

Figure 1: Difference in baseline update strategies when the 2-dimensional points $(5,5)$ and $(7,7)$ are inserted into a spatial-median kd-tree initially constructed on the points $(1,1)$ and $(3,3)$. The internal nodes are labeled with the splitting dimension and the coordinate of the split in that dimension. Baseline 1 rebuilds the kd-tree on every insertion and deletion, while baseline 2 does not rebuild the tree and instead inserts points into the existing spatial partition.

Procopiuc et al. [9] tackled the problem of a dynamically changing point set with their Bkd-tree, a data structure for maintaining spatial balance in the face of batched updates. However, it was developed for the external-memory setting and is not parallel. Similarly, Agarwal et al. [4] developed a dynamic cache-oblivious kd-tree, but it was not parallel and did not support batch-dynamic operations.

We adapt these approaches and develop the BDL-tree, a new parallel, in-memory kd-tree-based data structure that supports batch-dynamic operations (in particular, batch construction, insertion, and deletion) as well as exact k-NN queries. BDL-trees consist of a set of exponentially growing kd-trees and perform batched updates in parallel. This structure can be seen in Figure 2. Just as in the Bkd-tree and the cache-oblivious tree, our tree structure consists of a small buffer region followed by exponentially growing static kd-trees. Inserts are performed by rebuilding the minimum number of trees necessary to maintain fully constructed static trees. Deletes are performed on the underlying trees, and we rebuild the trees whenever they drop to below half of their original capacity. Our use of parallelism, batched updates, and our approach to maintaining balance after deletion are all distinct from the previous trees that we draw inspiration from. We show that given a BDL-tree with $n$ points, each batch of $B$ updates takes $O(B\log^{2}(n+B))$ amortized work and $O(\log(n+B)\log\log(n+B))$ depth (parallel time). As part of our work, we develop, to our knowledge, the first parallel algorithm for the construction of cache-oblivious kd-trees. Our construction algorithm takes $O(n\log n)$ work and $O(\log n\log\log n)$ depth. In addition, we implement parallel bloom filters to improve the performance of batch updates in practice. We also present a cache-efficient method for performing k-NN queries in the BDL-tree. We show theoretically that BDL-trees have strong asymptotic bounds on the work and depth of their operations.

Figure 2: Logarithmic method used in the BDL-tree, with $N_s$ underlying static kd-trees and a buffer kd-tree of size $X$.

We experimentally evaluate BDL-trees by designing a set of benchmarks to compare their performance against the two baseline approaches described above, which we implemented using similar optimizations. First, we perform scalability tests for each of the four main operations (construction, batch insertion, batch deletion, and k-NN) in order to evaluate the scalability of our data structure on many cores. On a 36-core machine with two-way hyper-threading, we find that our data structure achieves self-relative speedups of up to $35.4\times$ ($30.0\times$ on average) for construction, up to $35.0\times$ ($28.3\times$ on average) for batch insertion, up to $33.1\times$ ($28.5\times$ on average) for batch deletion, and up to $46.1\times$ ($40.0\times$ on average) for full k-NN. The largest dataset we test consists of 321 million 3-dimensional points. Then, we design a set of benchmarks that perform a mixed set of updates and queries in order to better understand the performance of BDL-trees in realistic scenarios. We find that, when faced with a mixed set of batch operations, BDL-trees consistently outperform the two baselines and present the best option for such a mixed dynamic setting. Our source code implementing BDL-trees is publicly available at https://github.com/rahulyesantharao/batch-dynamic-kdtree.

II Related Work

Numerous tree data structures for spatial search have been proposed in the literature, including the kd-tree [1], quadtree [10], ball tree [11], cover tree [12], and R-tree [13]. Among them, the kd-tree is a well-known data structure proposed by Bentley [1] in 1975. It is simple and yet can handle many types of queries efficiently. Bentley showed that the kd-tree supports insertion and deletion of random nodes in $O(\log n)$ time, and logarithmic average query time is observed in practice. The data structure has been extensively studied in the parallel setting. Shevtsov et al. [14] proposed fast construction algorithms for kd-trees for ray tracing, using binning techniques to distribute the workload among parallel processors. Agarwal et al. [5] proposed parallel algorithms for a series of spatial trees, including the kd-tree, under the massively parallel communication (MPC) model. Wehr and Radkowski [7] proposed fast sorting-based algorithms for constructing kd-trees on the GPU. Zellmann [8] proposed CPU and GPU algorithms for kd-trees used in graphics rendering.

The idea of decomposing a data structure into a logarithmic number of structures for the sake of dynamism has been proposed and used in many different scenarios. Bentley [2, 3] first proposed dynamic structures for decomposable search problems. Specifically, he proposed general methods for converting static data structures into dynamic ones with logarithmic multiplicative overhead in cost, using a set of static data structures with sizes that are powers of two. Agarwal et al. [5] designed cache-oblivious data structures for orthogonal range searching using ideas from the log-structured tree. O’Neil et al. [15] proposed the log-structured merge (LSM) tree, an efficient file-indexing data structure that supports efficient dynamic inserts and deletes in systems with multiple storage hierarchies. Specifically, batches of updates are cascaded over time from faster storage media to slower ones. A parallel version of the LSM tree has been implemented [16] by dividing the key space independently across cores and processing it in a data-parallel fashion. Our work, on the other hand, proposes a parallel batch-dynamic kd-tree, where each batch of updates can be processed efficiently in parallel, while supporting efficient k-NN queries.

We have recently discovered that, concurrent with our work, Dobson and Blelloch [17] developed the zd-tree, a data structure that also supports parallel batch-dynamic updates and k-NN queries. Their insight is to modify the basic kd-tree structure by sorting points based on their Morton ordering and splitting at each level based on a bit of the Morton ordering. For k-NN queries, they provide a root-based method, which is the traditional top-down k-NN search algorithm and is what we implement, as well as a leaf-based method, which starts directly from the leaves and searches upward for nearest neighbors. The leaf-based method works, and is faster, when the query points are points in the dataset, as it saves the downward traversal to search for the query points. They prove strong work and depth bounds for construction, batch updates, and k-NN queries assuming datasets with bounded expansion constant and bounded ratio. Without these assumptions, however, our bounds would be at least as good as theirs.

In Dobson and Blelloch’s implementation, they optimize k-NN queries by presorting the query points using the Morton ordering to improve cache locality. Their algorithm assumes that there is a maximum bounding box into which all future data will fit, while our algorithm does not make this assumption. Their implementation discretizes the double coordinates into 64-bit integers in order to perform a Morton sort on them. As a result, they can only use at most $64/d$ bits for each of the $d$ dimensions. Directly extending the implementation to higher dimensions would either lead to larger leaf sizes or require larger-width coordinates for the Morton sort, either of which could add overheads. Their experiments consider only 2D and 3D datasets, while we also test on higher-dimensional datasets. For 2D and 3D datasets, the running times that they report seem to be in the same ballpark as our times (their k-NN queries for constructing the k-NN graph are much faster than ours due to the presorting described above), after adjusting for the number of processors and dataset sizes, but it would be interesting future work to experimentally compare the codes on the same platform and datasets, including higher-dimensional ones. It would also be interesting to integrate some of their optimizations into our code, and to combine the approaches to build a log-structured version of the zd-tree.

III Preliminaries

III-A kd-tree

The kd-tree, first proposed by Bentley [1], is a binary tree data structure that arranges and holds spatial data to speed up spatial queries. Given a set $P$ of $n$ $d$-dimensional points, the kd-tree is a balanced binary tree where each node represents a bounding box of a subset of the input points. The root node represents all of the points (and thus the tightest bounding box that includes all the points in $P$). Each non-leaf node holds a splitting dimension and splitting value that split its bounding box into two halves using an axis-aligned hyperplane in the splitting dimension. Each child node represents the points in one of the two halves. This recursive splitting stops when the nodes hold some small constant number of points—these nodes are the leaves and directly represent the points.
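To make this structure concrete, the following is a minimal C++ sketch of a kd-tree node; the type and field names are illustrative only and are not taken from our implementation (which, as described in Section IV, stores nodes in a contiguous array).

```cpp
#include <vector>

// Illustrative point type: one d-dimensional coordinate.
using Point = std::vector<double>;

// A kd-tree node. Internal nodes store an axis-aligned splitting
// hyperplane (a dimension and a coordinate); leaf nodes store a small,
// constant-size bucket of points.
struct Node {
  bool is_leaf = false;
  int split_dim = 0;          // splitting dimension (internal nodes only)
  double split_val = 0;       // splitting coordinate in that dimension
  Node* left = nullptr;       // points with coordinate < split_val
  Node* right = nullptr;      // points with coordinate >= split_val
  std::vector<Point> bucket;  // points held by a leaf
};
```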

III-A1 k-NN Search using the kd-tree

Given a query coordinate $q$, a k-NN query finds the $k$ nearest neighbors of $q$ amongst the elements of the kd-tree by performing a pruned search on the tree [18]. The canonical approach is to traverse the tree while inserting points in the current node into a buffer that maintains only the $k$ nearest neighbors encountered so far. Entire subtrees can then be pruned during the traversal based on the distance of the $k$-th nearest neighbor found so far.
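Below is a hedged sketch of this canonical pruned search, reusing the illustrative Node type sketched above. It keeps a max-heap of the $k$ smallest squared distances seen so far, and visits the far child only if the splitting hyperplane is closer than the current $k$-th nearest candidate.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

using Point = std::vector<double>;
struct Node { bool is_leaf; int split_dim; double split_val;
              Node *left, *right; std::vector<Point> bucket; };

double sq_dist(const Point& a, const Point& b) {
  double s = 0;
  for (size_t i = 0; i < a.size(); i++) s += (a[i] - b[i]) * (a[i] - b[i]);
  return s;
}

// Max-heap over squared distances: top() is the current k-th nearest.
using Candidates = std::priority_queue<double>;

void knn(const Node* n, const Point& q, size_t k, Candidates& best) {
  if (n == nullptr) return;
  if (n->is_leaf) {
    for (const Point& p : n->bucket) {
      best.push(sq_dist(p, q));
      if (best.size() > k) best.pop();  // keep only the k nearest
    }
    return;
  }
  double diff = q[n->split_dim] - n->split_val;
  const Node* near_side = diff < 0 ? n->left : n->right;
  const Node* far_side = diff < 0 ? n->right : n->left;
  knn(near_side, q, k, best);  // descend toward the query first
  // Prune the far subtree unless the splitting hyperplane is closer
  // than the k-th nearest neighbor found so far.
  if (best.size() < k || diff * diff < best.top()) knn(far_side, q, k, best);
}
```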

III-A2 Dual-Tree k-NN

Besides the canonical approach explained above, there has been a lot of prior work on “dual-tree” approaches [19, 6, 20, 21], which provide speedups for serial batched k-NN queries. In particular, the dual-tree approach involves building a second kd-tree over the set of query points, and then exploring the two trees simultaneously in order to exploit the spatial partitioning provided by the kd-tree structure. We parallelized and tested this dual-tree approach in our work.

III-B Batch-Dynamic Data Structures

The concept of a parallel batch-dynamic data structure has become popular in recent years [22, 23] as an important paradigm, due to the availability of large, dynamic datasets undergoing rapid changes. The idea is to batch together operations of a single type and perform them as a single batched update, rather than one at a time. This approach offers two benefits. First, from a usability perspective, it is often the case (especially in applications with a lot of data) that operations on a data structure can be grouped into phases or batches of a single type, so this restriction does not significantly limit the data structure's usefulness. Second, from a performance perspective, batching together operations of a single type allows us to group together the involved work and derive significantly more parallelism than otherwise might have been possible (while also avoiding the concurrency issues that might arise when batching together operations of varying types).

III-C Baselines

In order to benchmark and test the performance of BDL-trees, we implement two parallel baseline kd-trees that use opposite strategies for providing batch-dynamism. In particular, the first baseline kd-tree simply rebuilds the tree after every batch insertion and batch deletion. With this approach, the tree is able to maintain a fully balanced structure in the face of a dynamically changing dataset, enabling consistently high performance for k-NN queries. This comes at the cost of reduced performance for updates. The second baseline kd-tree never rebuilds the tree and simply maintains the initial spatial partition, inserting points into and deleting points from the existing structure. This allows for extremely fast batch insertions and deletions, but can lead to a skewed structure and reduced k-NN performance. The difference between these two baselines is illustrated in Figure 1. We demonstrate experimentally that the BDL-tree achieves a balanced tradeoff between these two baseline options for batch-dynamic parallel kd-trees: it outperforms both in the dynamic setting where k-NN queries and batched updates are interleaved.

III-D Logarithmic Method

The logarithmic method [2, 3] for converting static data structures into dynamic ones is a very general idea. At a high level, the idea is to partition the static data structure into multiple structures with exponentially growing sizes (powers of 2). Inserts are then performed by rebuilding only the smallest structure necessary to account for the new points. In the specific case of the kd-tree, a set of $N_s$ static kd-trees is allocated, with capacities $[2^{0}, 2^{1}, \ldots, 2^{N_s-1}]$, as well as an extra buffer tree of size $2^{0}$. Then, when an insert is performed, the insert cascades up from the buffer tree, rebuilding into the first empty tree with all the points from the lower trees. If desired, the sizes of all of the trees can be multiplied by a buffer size $X$, a constant that is tuned for performance. This structure is illustrated in Figure 2. In the figure, all of the trees shown are full; one can imagine that the tree of size $2^{3}X$ is empty, so the next insert would cause the buffer and trees 0, 1, and 2 to cascade up to it.

III-E Computational Model

We use the standard work-depth model [24, 25] for multicore algorithms to analyze theoretical efficiency. The work of an algorithm is the total number of operations used, and the depth is the length of the longest sequential dependence (i.e., the parallel running time). An algorithm with work $W$ and depth $D$ can be executed on $p$ processors in $W/p+O(D)$ expected time [26]. Our goal is to come up with parallel algorithms that have low work and depth.

III-E1 Parallel Primitives

We use the following parallel primitives in our algorithms; a short usage sketch follows the list.

  • Prefix sum takes as input a sequence of values $[a_1, a_2, \ldots, a_n]$, an associative binary operator $\oplus$, and an identity $i$, and returns the sequence $[i, a_1, (a_1\oplus a_2), \ldots, (a_1\oplus a_2\oplus\cdots\oplus a_{n-1})]$ as well as the overall sum of the elements. Prefix sum can be implemented in $O(n)$ work and $O(\log n)$ depth [25].

  • Partition takes an array $A$, a predicate function $f$, and a partition value $p$, and outputs a new array such that all the values $a\in A$ with $a<p$ appear before all the values $a\in A$ with $a\geq p$. Partition can be implemented in $O(n)$ work and $O(\log n)$ depth [25].

  • Median partition takes $n$ elements and a comparator and partitions the elements around the median value in $O(n)$ work and $O(\log n\log\log n)$ depth [25].
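As a concrete illustration, the snippet below shows how such primitives look in ParlayLib [30], the library we use for parallelism (Section VI); this is illustrative usage, not code from our benchmarks.

```cpp
#include <parlay/parallel.h>
#include <parlay/primitives.h>
#include <parlay/sequence.h>

int main() {
  parlay::sequence<long> a(10, 1);  // [1, 1, ..., 1]

  // Prefix sum: exclusive scan plus the overall total.
  auto [sums, total] = parlay::scan(a);  // sums = [0,1,...,9], total = 10

  // Filtering (a close cousin of partition): keep the even values.
  auto evens = parlay::filter(parlay::iota<long>(10),
                              [](long x) { return x % 2 == 0; });

  // Data-parallel loop, used throughout our algorithms.
  parlay::parallel_for(0, a.size(), [&](size_t i) { a[i] += sums[i]; });
  return 0;
}
```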

IV BDL-tree

In this section, we introduce the BDL-tree, a parallel batch-dynamic kd-tree implemented using the logarithmic method [2, 3] (discussed in Section III-D). BDL-trees build on ideas from the Bkd-tree by Procopiuc et al. [9] and the cache-oblivious kd-tree by Agarwal et al. [4]. The structure is depicted in Figure 2.

We implement the underlying kd-trees in a BDL-tree as nodes in a contiguous memory array, where the root node is the first element of the array. The kd-trees are built using the van Emde Boas (vEB) [27, 28, 4] recursive layout. Agarwal et al. [4] show that this memory layout can be used with kd-trees to make traversal cache-oblivious, although dynamic updates on a single tree become very complex. However, in the logarithmic method, the underlying kd-trees themselves are static, and so we are able to sidestep the complexity of cache-oblivious updates on these trees while benefiting from the improved cache performance of the vEB layout. For the buffer region of the BDL-tree, we use a regular kd-tree, laid out like a binary heap in memory (i.e., nodes are in a contiguous array, and the children of index $i$ are $2i$ and $2i+1$). We will discuss the key parallel algorithms that we use in our implementation: construction, deletion, and k-NN on the underlying individual kd-trees (Section IV-A), and construction, insertion, deletion, and k-NN on the BDL-tree (Section IV-B). Note that we do not need to support insertions on individual kd-trees, because our BDL-tree simply rebuilds the necessary kd-trees upon insertion. We use subscript S to denote algorithms on the underlying kd-trees, and subscript L to denote algorithms on the full logarithmic data structure.

IV-A Single-Tree Parallel Algorithms

IV-A1 Parallel vEB Construction

Algorithm 1 Parallel vEB-layout kd-tree Construction

Input: Point Set $P$
      Output: kd-tree over $P$, laid out with the vEB layout on a contiguous memory array of size $2|P|-1$.

1:procedure BuildvEB_S($P$)
2:     Allocate $2|P|-1$ nodes in contiguous memory. The tree nodes will be laid out in this space.
3:     BuildvEBRecursive_S($P$, 0, 0, $\lfloor\log(|P|)\rfloor+1$, bottom)
4:procedure BuildvEBRecursive_S($Q$, $idx$, $c$, $l$, $t$)
5:$idx$: current node index in the memory array
6:$c$: current dimension to split on
7:$l$: number of levels to build
8:$t$: whether we are building the top or bottom of a tree
9:     If we hit the base case $l=1$, then we construct a node at $idx$. If $t$ is top, then we perform a parallel median partition on $Q$ in dimension $c$ and record this split as an internal node. Otherwise, we create a leaf node that represents the points in $Q$.
10:     Compute $l_b=\left\lceil\left\lceil\frac{l+1}{2}\right\rceil\right\rceil$ and $l_t=l-l_b$ (vEB layout).
11:     Recursively build the top half of the tree with BuildvEBRecursive_S($Q$, $idx$, $c$, $l_t$, top).
12:     Compute $idx_b=idx+2^{l_t}-1$ as the offset where the top half of the tree was just laid out.
13:     Construct the $2^{l_t}$ lower subtrees in parallel with BuildvEBRecursive_S($Q_i$, $idx_i$, $(c+l_t)\bmod d$, $l_b$, $t$), where $Q_i$ is the subarray of points that are held by the parent of this subtree and $idx_i$ is the index at which this subtree is to be placed (precomputed with a parallel prefix sum).

The algorithm for parallel construction of the cache-oblivious kd-tree is shown in Algorithm 1. The function itself is recursive: the top-level BuildvEB_S function allocates space on line 2 and calls the recursive function BuildvEBRecursive_S. Refer to Figure 3 for a graphical representation of this construction.

Figure 3: Constructing a vEB kd-tree in parallel over 8 two-dimensional points. Note that the top 3 nodes are placed before the remaining 4 bottom subtrees are built in parallel.

The recursive function BuildvEBRecursive_S maintains state with 5 parameters: a point set $Q$, a node index $idx$, a splitting dimension $c$, the number of levels to build $l$, and whether it is building the top or bottom of the tree (indicated by $t$). On line 9, we check for the base case—if the number of levels to build is 1, then we have to construct a node. If this is the top of a tree, then this node will be an internal node, so we perform a parallel median partition in dimension $c$ and save it as an internal node. On the other hand, if this is the bottom of the tree, we construct a leaf node that holds all the points in $Q$. Lines 10–13 form the recursive step. In accordance with the exponential layout [4], we have to first construct the top “half” of the tree and then the bottom “half”. Therefore, on line 10, we compute the number of levels $l_b$ in the bottom portion as the hyperceiling of $\frac{l+1}{2}$ (the hyperceiling of $n$, denoted $\lceil\lceil n\rceil\rceil$, is the smallest power of 2 that is greater than or equal to $n$, i.e., $2^{\lceil\log n\rceil}$), and the remaining number of levels $l_t$ in the top portion of the tree as $l-l_b$. On line 11, we recursively build the top half of the tree. Then, on line 12, we note that because the top half of the tree is a complete binary tree with $l_t$ levels, it will use $2^{l_t}-1$ nodes. Therefore, we compute $idx_b=idx+2^{l_t}-1$, the node index where the bottom half of the tree should start, because the trees are laid out consecutively in memory. Finally, on line 13, we construct each of the $2^{l_t}$ subtrees that fall under the top half of the tree, each with $l_b$ levels. Each of these trees falls into a distinct segment of memory in the array, and so we can perform this construction in parallel across all of the subtrees by precomputing the starting index $idx_i$ for each of the $2^{l_t}$ subtrees.

We trace this process on an example in Figure 3, in which BuildvEB_S is called on a set $P$ of 8 points. This spawns a call to BuildvEBRecursive_S($P[0:8]$, 0, 0, 4, bottom). On line 10, we compute $l_b=2$ and $l_t=2$, and on line 11, we spawn a recursive call to BuildvEBRecursive_S($P[0:8]$, 0, 0, 2, top). This call is shown as the solid box around the top 3 nodes in Figure 3. In this call, we hit one further level of recursion before laying out the 3 nodes in indices 0, 1, 2. Then, the original recursive call proceeds to line 12, where it computes $idx_b=3$ as the index at which to begin laying out the $2^{l_t}=4$ bottom subtrees. Finally, on line 13, we precompute that the starting indices for the 4 bottom subtrees are $(idx_0, idx_1, idx_2, idx_3)=(3,6,9,12)$. This results in 4 parallel recursive calls, shown in the 4 lower dashed boxes in Figure 3. Each of these recursive calls internally has one more level of recursion to lay out its 3 nodes.
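The index arithmetic traced above generalizes as in the following self-contained sketch, which computes only the vEB positions of a complete binary tree; the median partitions are omitted, the loop over bottom subtrees is shown serially (Algorithm 1 runs it in parallel, with offsets precomputed by a prefix sum), and the integer-division reading of the hyperceiling formula is our interpretation, chosen to match the example above.

```cpp
#include <cstdio>
#include <vector>

// Smallest power of two >= x (the hyperceiling).
int hyperceil(int x) { int p = 1; while (p < x) p <<= 1; return p; }

// Assign vEB memory positions to a complete binary tree with l levels
// whose nodes are addressed in heap order (children of v are 2v, 2v+1).
// mem is the next free slot in the node array; returns the first free
// slot after this subtree has been laid out.
int veb(int root, int l, int mem, std::vector<int>& pos) {
  if (l == 1) { pos[root] = mem; return mem + 1; }  // base case: one node
  int lb = hyperceil((l + 1) / 2);  // levels in the bottom half
  int lt = l - lb;                  // levels in the top half
  mem = veb(root, lt, mem, pos);    // lay out the top half first
  for (int i = 0; i < (1 << lt); i++)            // then the 2^lt bottom
    mem = veb((root << lt) + i, lb, mem, pos);   // subtrees, consecutively
  return mem;
}

int main() {
  int l = 4;  // 8 points -> 15 nodes, as in Figure 3
  std::vector<int> pos(1 << l);
  veb(1, l, 0, pos);
  for (int v = 1; v < (1 << l); v++)
    std::printf("heap node %2d -> vEB slot %2d\n", v, pos[v]);
  return 0;
}
```

Running this with $l=4$ reproduces the layout in Figure 3: the top 3 nodes occupy slots 0–2, and the 4 bottom subtrees start at slots 3, 6, 9, and 12.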

Theorem 1.

The cache-oblivious kd-tree with a vEB layout can be constructed over $n$ points in $O(n\log n)$ work and $O(\log n\log\log n)$ depth.

Proof.

The work bound is obtained by observing that there are $O(\log n)$ levels in the fully-constructed tree, and the median partitions at each level take $O(n)$ work, giving a total of $O(n\log n)$ work. For the depth bound, at each recursive step we first build an upper tree of size $O(\sqrt{n})$, and then construct the lower trees in parallel, each of size $O(\sqrt{n})$. Further, we use an $O(\log n)$-depth prefix sum to compute $idx_i$ at every level except the base case, and an $O(\log n\log\log n)$-depth median partition in the base case. Overall, this results in $O(\log n\log\log n)$ depth. ∎

After the vEB-layout kd-tree is constructed, it can be queried as a regular kd-tree—the only difference is the physical layout of the nodes in memory. The correctness of this recursive algorithm can be seen through induction on the number of levels. In particular, we form two inductive hypotheses:

  • BuildvEBRecursive_S($Q$, $idx$, $c$, $l$, top) creates a contiguous, fully-balanced binary tree with $l$ levels rooted at memory location $idx$. Furthermore, this binary tree consists of internal kd-tree nodes that equally split the point set $Q$ in half at each level.

  • BuildvEBRecursive_S($Q$, $idx$, $c$, $l=\lceil\log|Q|\rceil+1$, bottom) creates a contiguous kd-tree with $l$ levels rooted at memory location $idx$.

The base cases, with $l=1$, for these inductive hypotheses are explicitly given on line 9. Then, the inductive step follows easily by noting that the definition of hyperceiling implies that the recursive calls on line 13 are all sized such that $l_b=\lceil\log|Q_i|\rceil+1$.

IV-A2 Parallel Deletion

Algorithm 2 Parallel kd-tree Deletion

Input: Point Set $P$

1:procedure Erase_S($P$)
2:     EraseRecursive_S($P$, 0)
3:procedure EraseRecursive_S($Q$, $idx$)
4:$idx$: current node index
5:     If the current node is a leaf node, mark any points in the leaf node that are also in $Q$ as deleted. If all of the points in the current leaf are deleted, return NULL. Otherwise, return the current $idx$.
6:     Otherwise, perform a parallel partition on $Q$ around the split represented by the current node. Let $Q_l$, $Q_r$ be the resulting left and right arrays, respectively, after the partition.
7:     Then, recurse on the children in parallel with EraseRecursive_S($Q_l$, $idx_l$) and EraseRecursive_S($Q_r$, $idx_r$), where $idx_l$ and $idx_r$ are the IDs of the left and right children, respectively.
8:     If neither of the recursive calls returns NULL, reset the left and right children to be the results of these calls and return the current node. If both of the recursive calls return NULL, return NULL. If one of the recursive calls returns NULL and the other does not, return the non-NULL node.

The algorithm for parallel deletion from a single kd-tree is shown in Algorithm 2. The function itself is recursive, so the top-level Erase_S calls the subroutine EraseRecursive_S on the root node on line 2.

The recursive function EraseRecursive_S acts on one node at a time, represented by the index $idx$. On line 5, it checks for the base case—if the current node is a leaf node, it simply performs a linear scan to mark any points in the leaf node that are also in $Q$ as deleted. Then, it returns NULL if the entire leaf was emptied; otherwise, it returns the current node $idx$. Lines 6–8 represent the recursive case. First, on line 6, we perform a parallel partition of $Q$ around the current node’s splitting hyperplane. We refer to the lower partition as $Q_l$ and the upper partition as $Q_r$. On line 7, we recurse on the left and right subtrees in parallel, passing $Q_l$ to the left subtree and $Q_r$ to the right. Finally, line 8 updates the tree structure. We always ensure that every node has 2 children in order to flatten any unnecessary tree traversal. The return value of EraseRecursive_S indicates the node that should take the place of $idx$ in the tree (potentially the same node)—a return value of NULL indicates that the entire subtree rooted at $idx$ was removed. So, if both the left and right child are removed, then we can remove the current node as well by returning NULL. On the other hand, if neither the left nor the right child is removed, then the subtree is still intact, and we simply reset the left and right child pointers of the current node and return the current node $idx$, indicating that it was not removed. Finally, if exactly one of the children was removed, then we remove the current node as well and let the remaining child connect directly to its grandparent—in this way, we remove an unnecessary internal splitting node. We do this by simply returning the non-NULL child, signaling that it will take the place of the current node in the kd-tree.
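The return-value contract of Algorithm 2 is summarized in the hedged C++ sketch below, reusing the illustrative Node type from Section III-A. The equality-based point matching and the serial std::partition are simplifications (our implementation marks points inside coarsened leaves and uses a parallel partition); parlay::par_do runs the two recursive calls in parallel.

```cpp
#include <algorithm>
#include <vector>
#include <parlay/parallel.h>

using Point = std::vector<double>;
struct Node { bool is_leaf; int split_dim; double split_val;
              Node *left, *right; std::vector<Point> bucket; };

// Returns the node that should replace n in the tree, or nullptr if
// the whole subtree rooted at n became empty (Algorithm 2, line 8).
Node* erase_rec(Node* n, std::vector<Point> q) {
  if (n == nullptr || q.empty()) return n;
  if (n->is_leaf) {
    // Base case: drop every bucket point that also appears in q.
    auto& b = n->bucket;
    b.erase(std::remove_if(b.begin(), b.end(),
                           [&](const Point& p) {
                             return std::find(q.begin(), q.end(), p) != q.end();
                           }),
            b.end());
    return b.empty() ? nullptr : n;
  }
  // Partition q around the node's splitting hyperplane.
  auto mid = std::partition(q.begin(), q.end(), [&](const Point& p) {
    return p[n->split_dim] < n->split_val;
  });
  std::vector<Point> ql(q.begin(), mid), qr(mid, q.end());
  Node *l, *r;
  parlay::par_do([&] { l = erase_rec(n->left, ql); },
                 [&] { r = erase_rec(n->right, qr); });
  if (l && r) { n->left = l; n->right = r; return n; }
  return l ? l : r;  // splice out n if a child subtree emptied
}
```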

Theorem 2.

Deleting a batch of $B$ points from a single kd-tree constructed over $n$ points can be done in $O(B\log n)$ work and $O(\log B\log n)$ depth in the worst case.

Proof.

We can see the work bound by noting that each of the $B$ points traverses down $O(\log n)$ levels as part of the algorithm. For the depth, note that in the worst case the parallel partition at each level operates over $O(B)$ points. Because parallel partition has logarithmic depth, this results in a worst-case $O(\log B)$ depth at each of the $O(\log n)$ levels, giving an overall depth of $O(\log B\log n)$. ∎

IV-A3 Data-Parallel k-NN

We execute our k-NN searches in a data-parallel fashion by parallelizing across all of the query points in a batch; the k-NN search for each point is executed serially. We implement a “k-NN buffer”, a data structure that maintains a list of the current $k$ nearest neighbors and provides a quick insert operation that tests whether new points are closer than the existing candidates. The data structure maintains an internal buffer of size $2k$. To insert a point, it simply adds that point to the end of the buffer. When the buffer fills up, it uses a serial selection algorithm to partition the buffer around the $k$-th nearest element and clears out the remaining $k$ elements. This achieves a serial amortized $O(1)$ insertion time (the selection partition step is $O(k)$ and is only performed once every $k$ insertions).
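A minimal sketch of such a buffer, storing squared distances and using std::nth_element for the selection step; the names are illustrative and error handling is omitted.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Tracks the k nearest candidates with amortized O(1) inserts: points
// accumulate in a buffer of capacity 2k, and once it fills, an O(k)
// selection step keeps the k closest and discards the rest.
struct KnnBuffer {
  size_t k;
  std::vector<double> d;  // squared distances of current candidates
  explicit KnnBuffer(size_t k) : k(k) { d.reserve(2 * k); }

  void insert(double sq_dist) {
    d.push_back(sq_dist);
    if (d.size() == 2 * k) compress();  // happens once per k inserts
  }

  // Partition around the k-th smallest distance and drop the rest.
  void compress() {
    std::nth_element(d.begin(), d.begin() + (k - 1), d.end());
    d.resize(k);
  }

  // Current bound on the k-th nearest distance (used for pruning).
  double kth() {
    if (d.size() < k) return std::numeric_limits<double>::infinity();
    compress();  // ensures d[k-1] is the k-th smallest
    return d[k - 1];
  }
};
```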

To implement batched k-NN on the kd-tree, we perform a k-NN search for each individual point in parallel across all the points. We now describe the k-NN method (kNN_S) for a single point $p$. We first allocate a k-NN buffer for the point. Then, we recursively descend through the kd-tree searching for the leaf that $p$ falls into. When we find this leaf, we add all of the points in the leaf to the k-NN buffer. Then, as the recursion unfolds, we check whether the k-NN buffer has $k$ points. If it does not, we add all the points in the sibling of the current node to the k-NN buffer to try to fill up the buffer with nearby points as quickly as possible to improve our estimate of the $k$-th nearest neighbor. Otherwise, we use the current distance of the $k$-th nearest neighbor to prune subtrees in the tree. In particular, if the bounding box of the current subtree is entirely contained within the distance of the $k$-th nearest neighbor, we add all points in the subtree to the k-NN buffer. If the bounding box is entirely disjoint, then we prune the subtree. Finally, if they intersect, we recurse on the subtree.

Theorem 3.

For a constant $k$, k-NN queries over a batch of $B$ points can be performed over a single kd-tree containing $n$ points in worst-case $O(Bn)$ work and worst-case $O(n)$ depth.

Proof.

In the worst case, we have to search the entire tree, of size $O(n)$, resulting in total work of $O(Bn)$ (due to the amortized $O(1)$ insert cost for k-NN buffers) and depth of $O(n)$, as the queries are done in parallel over the batch, but each search is serial. ∎

As noted by Bentley [1] and Friedman et al. [18], the work for a single nearest-neighbor query on a kd-tree is empirically found to be logarithmic in $n$, so the experimental runtime and scalability are much better than suggested by the worst-case bounds.

IV-B Batch-Dynamic Parallel Algorithms

This section describes our algorithms for supporting batch-dynamic updates on BDL-trees.

IV-B1 Parallel Insertion

Algorithm 3 Parallel BDL-tree Batch Insertion

Input: Point Set $P$

1:procedure Insert_L($P$)
2:     Build an integer bitmask $F$ that represents the static trees within the logarithmic tree structure that are currently filled using 1’s, and the trees that are empty using 0’s.
3:     Compute $F_{new}=F+|P|/X$, where $X$ is the buffer tree size. This is the new bitmask of trees that should be filled.
4:     Based on the difference between $F$ and $F_{new}$, determine which trees should be combined into larger trees.
5:     Gather the relevant points and construct all the new trees in parallel using BuildvEB_S (or Build_S for the buffer tree).

Insertions are performed in the style of the logarithmic method [2, 3], with the goal of maintaining the minimum number of full trees within the BDL-tree. Thus, upon inserting a batch $B$ of points, we rebuild larger trees whenever this is possible using the existing points and the newly inserted batch. This is implemented as shown in Algorithm 3, and depicted in Figure 4.

First, on line 2, we build a bitmask $F$ of the current set of full static trees in the logarithmic structure. Then, on line 3, because the buffer kd-tree has size $X$, we can add $|P|/X$ to $F$ to compute a new bitmask $F_{new}$ of full trees that would result if we added $|P|$ points to the tree structure. As an implementation detail, note that we first add $|P|\bmod X$ points to the buffer kd-tree—if we fill up the buffer kd-tree, then we gather the $X$ points from it and treat them as part of $P$, effectively increasing the size of $P$ by $X$. Then, on line 4, taking the bitwise difference between these two bitmasks gives the set of trees that should be consolidated into new larger trees—specifically, any tree that is set in $F_{new}$ but not in $F$ must be constructed from trees that are set in $F$ but not in $F_{new}$. After determining which trees should be combined into new trees, on line 5 we construct all the new trees in parallel—in parallel for each new tree to be constructed, we deconstruct and gather all the points from the trees that are being combined into it, and then we construct the new tree over these points and any additional required points from $P$ using Algorithm 1.

Refer to Figure 4 for an example of this insertion method (suppose for this example that $X>2$). In Figure 4(a), the BDL-tree contains $X$ points, giving a bitmask of $F=1$ (because only the smallest tree is in use). If we insert $X+1$ points, then we put one point in the buffer tree and compute $F_{new}=1+X/X=2$, and so we have to deconstruct static tree 0 and build static tree 1, as shown in Figure 4(b). Then, if we insert $X+1$ points again, we again put one point in the buffer tree and compute $F_{new}=2+X/X=3$, and so we simply construct tree 0 on the $X$ new points (leaving tree 1 intact), as seen in Figure 4(c). Finally, if we then insert $X-1$ points, we note that this would fill the buffer up, so we take 1 point from the buffer and insert $X$ points; then, $F_{new}=3+X/X=4$, and so we deconstruct trees 0 and 1 and construct tree 2, as seen in Figure 4(d).
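The bitmask arithmetic of Algorithm 3 is compact enough to show directly. The sketch below (with hypothetical names) computes, for a batch that is a whole multiple of the buffer size, which trees must be built and which must be drained; it assumes the $|P|\bmod X$ residue has already been placed in the buffer tree as described above.

```cpp
#include <cstdint>
#include <cstdio>

// F: bitmask of full static trees (bit i set = tree of capacity 2^i * X
// is full). B: batch size, assumed here to be a multiple of the buffer
// size X. Binary addition tells us exactly which trees to rebuild.
void plan_insert(uint64_t F, uint64_t B, uint64_t X) {
  uint64_t F_new = F + B / X;   // bitmask of trees full after the insert
  uint64_t build = F_new & ~F;  // trees that must be newly constructed
  uint64_t drain = F & ~F_new;  // trees whose points feed the new trees
  std::printf("build mask: %llx, drain mask: %llx\n",
              (unsigned long long)build, (unsigned long long)drain);
}

int main() {
  // Figure 4(c) -> 4(d): trees 0 and 1 full (F = 0b011), insert X points.
  plan_insert(0b011, /*B=*/1024, /*X=*/1024);  // builds 0b100, drains 0b011
  return 0;
}
```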

Figure 4: A BDL-tree in various configurations with $X>2$: in (a), static tree 0 is full; starting from (a), inserting $X+1$ points gives (b), where static tree 1 is full and the buffer tree has 1 point; inserting $X+1$ more points gives (c), where static trees 0 and 1 are full and the buffer tree has 2 points; and inserting $X-1$ further points gives (d), where static tree 2 is full and the buffer tree has 1 point.

IV-B2 Parallel Deletion

Algorithm 4 Parallel BDL-tree Batch Deletion

Input: Point Set $P$

1:procedure Erase_L($P$)
2:     In parallel, delete $P$ from each of the underlying trees that is nonempty by calling Erase_S($P$) on each of these trees.
3:     In parallel, gather the points from any trees that drop to below half of their original capacity into a set $R$.
4:     Call Insert_L($R$) to reinsert these points into the log-tree structure.

When deleting a batch of points, the goal is to maintain balance within the subtrees. Thus, if any subtree decreases to less than half of its full capacity, we move all the points down to a smaller subtree in order to maintain balance. As seen in Algorithm 4, this is implemented as a three-step process.

On line 2, we call a parallel bulk erase subroutine on each of the individual trees in parallel in order to actually erase the points from the trees. On line 3, we scan the trees in parallel and collect the points from all trees that have been depleted to less than half of their original capacity. Finally, on line 4, we use the Insert_L routine to reinsert these points into the structure.
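A hedged sketch of this three-step process follows; StaticTree, erase_s, extract_points, and insert_l are hypothetical stand-ins for the routines described above (Algorithms 2 and 3), declared but left undefined here.

```cpp
#include <vector>
#include <parlay/parallel.h>

using Point = std::vector<double>;

// Hypothetical stand-in for one static kd-tree in the log structure.
struct StaticTree {
  size_t capacity = 0;
  size_t size = 0;                            // live (non-deleted) points
  void erase_s(const std::vector<Point>& P);  // Algorithm 2 (undefined here)
  std::vector<Point> extract_points();        // drain the tree (undefined)
};

void insert_l(std::vector<Point> R);          // Algorithm 3 (undefined here)

// Algorithm 4: batch deletion over the whole BDL-tree.
void erase_l(std::vector<StaticTree>& trees, const std::vector<Point>& P) {
  // Line 2: delete P from every nonempty tree, in parallel.
  parlay::parallel_for(0, trees.size(), [&](size_t i) {
    if (trees[i].size > 0) trees[i].erase_s(P);
  });
  // Line 3: gather the points of any tree below half capacity...
  std::vector<Point> R;
  for (auto& t : trees)
    if (t.size > 0 && t.size < t.capacity / 2) {
      auto pts = t.extract_points();
      R.insert(R.end(), pts.begin(), pts.end());
    }
  // Line 4: ...and reinsert them to restore balance.
  insert_l(R);
}
```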

Theorem 4.

Given a BDL-tree that was created using only batch insertions and deletions, each batch of $B$ updates takes $O(B\log^{2}(n+B))$ amortized work and $O(\log(n+B)\log\log(n+B))$ depth, where $n$ is the number of points in the tree before applying the updates.

Proof.

We first argue the work bound for performing only insertions, starting from an empty BDL-tree. In the worst case, points are added to the structure one by one. Then, similar to the analysis by Bentley [2], the total work incurred is given by noting that the number of times the $i$-th tree is rebuilt when inserting $m$ points one by one is $O(2^{\log m-i})$. Summing the total work gives $O\left(\sum_{i=0}^{\log m}2^{i}\cdot i\cdot 2^{\log m-i}\right)=O(m\log^{2}m)$, where we use the work bound from Theorem 1. After inserting a batch of size $B$, we have $n+B$ points in the BDL-tree, and so the amortized work for the batch is $O(B\log^{2}(n+B))$. Now, if deletions occurred prior to a batch insertion, and the current BDL-tree has $n$ points, there still must have been $n$ previous insertions (since we started with an empty data structure), and so the work of this batch can still be amortized against those $n$ insertions. We now argue the depth bound. When a batch is inserted into the tree, the points from smaller trees can be gathered in worst-case $O(\log(n+B))$ depth (if all the points must be rebuilt), and the rebuilding process takes worst-case $O(\log(n+B)\log\log(n+B))$ depth, using the result from Theorem 1.

The initial step of deleting the batch of points from each of the underlying kd-trees incurs $O(B\log^{2}n)$ work (there are $O(\log n)$ kd-trees, each taking $O(B\log n)$ work) and $O(\log B\log n)$ depth. Then, collecting the points that need to be reinserted can be done in worst-case $O(\log(n+B))$ depth, and the reinsertion takes $O(\log(n+B)\log\log(n+B))$ depth, from before. Overall, the depth is $O(\log(n+B)\log\log(n+B))$. The amortized work for reinserting points in trees that are less than half full is $O(B\log^{2}n)$, as every point we reinsert can be charged to a deletion of another point, either from this batch or from a previous batch. This is because for a tree that is half full, there must be at least as many deletions from the tree as the number of points remaining in it. ∎

IV-B3 Data-Parallel k-NN

Algorithm 5 Data-Parallel BDL-tree k-NN

Input: Point Set $S$
      Output: An array of arrays of the $k$ nearest neighbors of each point in $S$.

1:procedure kNN_L($S$)
2:     Allocate a k-NN buffer for each of the points in $S$.
3:     For each nonempty tree in the BDL-tree, in serial, call the parallel subroutine kNN_S($S$) on the individual tree, passing the same set of buffers.
4:     After all of the individual kNN_S calls are complete, gather and return the results from the k-NN buffers.

In the data-parallel k-NN implementation, we parallelize over the set of points given to search for nearest neighbors. First, on line 2, we allocate a k-NN buffer for each of the points in $S$. Then, for each of the nonempty trees in the BDL-tree, we call the data-parallel k-NN subroutine on the individual tree, passing in the set $S$ of points and the set of k-NN buffers. Because we reuse the same set of k-NN buffers for each underlying k-NN call, we eventually end up with the $k$ nearest neighbors across all of the individual trees for each point in $S$.
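A sketch of this loop structure is shown below, with hypothetical StaticTree and KnnBuffer types standing in for our implementation's classes; the outer loop over trees is serial, and only the inner loop over query points is parallelized.

```cpp
#include <vector>
#include <parlay/parallel.h>

using Point = std::vector<double>;
struct KnnBuffer { std::vector<double> d; };  // as sketched in IV-A3

// Hypothetical stand-in for one static kd-tree.
struct StaticTree {
  bool empty() const;
  // Serial single-point search (kNN_S) that folds results into buf.
  void knn_s(const Point& q, KnnBuffer& buf) const;
};

// Algorithm 5: one shared buffer per query point, so each point's
// candidates accumulate across all of the underlying trees.
void knn_l(const std::vector<const StaticTree*>& trees,
           const std::vector<Point>& S, std::vector<KnnBuffer>& bufs) {
  for (const StaticTree* t : trees) {  // serial over trees
    if (t->empty()) continue;
    parlay::parallel_for(0, S.size(), [&](size_t i) {
      t->knn_s(S[i], bufs[i]);         // data-parallel over points
    });
  }
}
```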

Theorem 5.

For a constant $k$, k-NN queries over a batch of $B$ points over the $n$ points in the BDL-tree can be performed in $O(Bn)$ work and $O(n)$ depth.

Proof.

These bounds follow directly from the bounds of the underlying individual k-NN calls. The k-NN routine on the $i$-th underlying tree, of size $n_i$, has worst-case work $O(Bn_i)$ and depth $O(n_i)$. Summing over all $i$ gives the bounds. ∎

V Implementation and Optimizations

In this section, we describe implementation details and optimizations that we developed to speed up the BDL-tree in practice.

V-A Parallel Bloom Filter

The bloom filter is a probabilistic data structure for testing set membership, which can give false positive matches but no false negative matches [29]. When erasing a batch of points from the BDL-tree, we would otherwise need to search for every point to be deleted within every individual underlying kd-tree. To mitigate this overhead, we use bloom filters to filter out, for each individual kd-tree, the points in the batch that are definitely not contained in it. Specifically, we implemented a parallel bloom filter and added one to each individual kd-tree within the BDL-tree to track the points in that kd-tree. Before erasing an input batch from each individual kd-tree, we filter the batch with that kd-tree’s bloom filter. This optimization adds some overhead to the construction and insertion subroutines, but provides significant benefits for the delete subroutine. To mitigate some of this overhead, we construct the bloom filters only when they are needed, rather than during construction.
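A minimal sketch of a concurrent bloom filter of this flavor is shown below. Parallel insertions only ever set bits, so an atomic fetch_or suffices and no locking is needed; the two hash functions and the sizing are illustrative, not those of our implementation.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

struct BloomFilter {
  std::vector<std::atomic<uint64_t>> bits;  // zero-initialized words
  explicit BloomFilter(size_t words) : bits(words) {}

  // Illustrative multiplicative hashes (not our implementation's).
  static uint64_t h1(uint64_t x) { return x * 0x9e3779b97f4a7c15ULL; }
  static uint64_t h2(uint64_t x) { return x * 0xc2b2ae3d27d4eb4fULL + 1; }

  // Thread-safe: concurrent inserts only set bits.
  void insert(uint64_t key) {
    for (uint64_t h : {h1(key), h2(key)})
      bits[(h >> 6) % bits.size()].fetch_or(1ULL << (h & 63),
                                            std::memory_order_relaxed);
  }

  // May report false positives, never false negatives.
  bool may_contain(uint64_t key) const {
    for (uint64_t h : {h1(key), h2(key)})
      if (!(bits[(h >> 6) % bits.size()].load(std::memory_order_relaxed) &
            (1ULL << (h & 63))))
        return false;
    return true;
  }
};
```

In this setting, a batch to be erased would be filtered with may_contain (on hashed point keys) before calling Erase_S on a given tree, so that only points that might reside in that tree are searched for.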

V-B Data-Parallel k-NN Structure

We tested three different variants of the data-parallel k-NN search in order to determine the best parallelization scheme. Each scheme consists of two nested for loops: one over the input points and one over the individual kd-trees. In the first variant, the outer loop is over the points and the inner loop is over the kd-trees. We only parallelize the outer loop, so we perform a k-NN search over all the trees for each point in parallel. In the second variant, we switch the loop order, so the outer loop is over the kd-trees and the inner loop is over the points. We parallelize both the inner and outer loops, so we perform a k-NN search over each of the trees in a data-parallel fashion (i.e., search for the nearest neighbors of each point in parallel), but we also parallelize over the trees. In this case, because our k-NN buffer is not thread-safe, for each point we have to allocate a separate buffer for each tree and combine the results at the end, which adds some extra overhead. In the third variant, described in Section IV-B3, the outer loop is over the kd-trees and the inner loop is over the points, but in contrast to the second scheme, we only parallelize the inner loop over the points. So, we perform a data-parallel k-NN over each of the individual kd-trees, one at a time. We found that the second scheme gives the most parallelism, but the third scheme has the fastest end-to-end running time, both serially and in parallel. This is due to the better cache locality of the third method—each tree is completely processed before moving on to the next one. One other optimization we used in this third scheme was to process the underlying trees in order from largest to smallest. We experimentally found that the data-parallel k-NN search scaled better on large trees, and so this gave better radius estimates earlier (as opposed to processing the trees from smallest to largest).

V-C Parallel Splitting Heuristic

We implemented two different splitting heuristics for our kd-trees: splitting by object median and splitting by spatial median. Object median refers to a true median—at each split, we split around the median of the coordinates of the points in that dimension. For spatial median, on the other hand, we compute the average of the minimum and maximum of the coordinates of the points in the current dimension and use this as the splitting hyperplane. The spatial median is faster to compute, but leads to potentially less balanced trees, as it does not strictly split the points in half at every level. We implement the object median with a parallel in-place sort [25, 30], and we implement the spatial median in two steps. First, we perform two parallel prefix sums, using $\min$ and $\max$ as the binary operators, to compute the minimum and maximum of the coordinate values, respectively. Then, we compute the spatial median as the average of these values and perform a parallel partition around this value.
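A hedged sketch of the spatial median split follows; for brevity it uses parlay::min_element, parlay::max_element, and parlay::filter in place of the prefix-sum and in-place parallel partition described above.

```cpp
#include <utility>
#include <parlay/primitives.h>
#include <parlay/sequence.h>

using Point = parlay::sequence<double>;
using PointSet = parlay::sequence<Point>;

// Split pts around the spatial median in dimension dim: the midpoint
// of the coordinate range, rather than the median point itself.
std::pair<PointSet, PointSet> spatial_median_split(const PointSet& pts,
                                                   int dim) {
  auto coords = parlay::map(pts, [&](const Point& p) { return p[dim]; });
  double lo = *parlay::min_element(coords);
  double hi = *parlay::max_element(coords);
  double split = (lo + hi) / 2;  // average of min and max
  auto left =
      parlay::filter(pts, [&](const Point& p) { return p[dim] < split; });
  auto right =
      parlay::filter(pts, [&](const Point& p) { return p[dim] >= split; });
  return {std::move(left), std::move(right)};
}
```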

V-D Coarsening

We introduced a number of optimizations that were controlled by tunable parameters in our implementation. One key optimization was the coarsening of leaves, in which we let each leaf node represent up to 16 points rather than 1. This reduces the amount of space needed for node pointers and also improves cache locality when traversing the tree, as the points in a leaf node are stored contiguously. A second key optimization was the coarsening of the serial base cases of our algorithms, in which we switch from a parallel algorithm to a serial version for recursive calls involving subtrees with fewer than 1000 points, in order to mitigate the overhead of spawning new threads. A third parameter was the buffer tree size, which we set to 1024.

VI Experiments

We designed a set of experiments to investigate the performance and scalability of BDL-tree and compare it to the two baselines described earlier.

  1. B1 is a baseline described in Section III-C, where the kd-tree is rebuilt on each batch insertion and deletion in order to maintain balance. This allows for improved query performance (as the tree is always perfectly balanced) at the cost of slowing down dynamic operations.

  2. B2 is another baseline described in Section III-C. It inserts points directly into the existing tree structure without recalculating spatial splits. This results in very fast inserts and deletes at the cost of potentially skewed trees (which would slow down query performance).

  3. BDL is our BDL-tree described in Section IV. It represents a tradeoff between batch update performance and query performance. By maintaining a set of balanced trees, it is able to achieve good performance on updates without sacrificing the quality of the spatial partition.

We use the following set of experiments to measure the scalability of BDL-tree and compare its performance characteristics to the baselines.

  1. Construction (Section VI-A): construct the tree over a dataset.

  2. Insertion (Section VI-B1): insert fixed-size batches of points into an empty tree until the entire dataset is inserted.

  3. Deletion (Section VI-C1): delete fixed-size batches of points from a tree constructed over the entire dataset until the entire dataset is deleted.

  4. k-NN (Section VI-D1): find the $k$ nearest neighbors of all the points in the dataset (i.e., compute the k-NN graph of the dataset).

We also designed the following microbenchmarks in order to better explore the design tradeoffs that BDL-tree makes when compared to the baselines.

  1. Varying Batch Size (Sections VI-B2 and VI-C2): we vary the batch size of the operations used to fully insert or delete the point set to measure the impact of batch size on throughput.

  2. Varying $k$ after Batched Inserts (Section VI-D2): we build a tree using a set of batched insertions and then measure k-NN search performance for a range of $k$ values, to measure the impact of $k$ on throughput and the impact of dynamic updates on k-NN performance.

  3. Mixed Insert, Delete, and k-NN Searches (Section VI-D3): we perform a series of batch updates (insertions and deletions) interspersed with k-NN queries to measure the performance of the data structures over time.

We run all of the experiments over BDL-tree and the two baselines. In addition, for the scalability and batch size experiments, we also compare the object median and spatial median splitting heuristics described in Section V-C.

The experiments are all run on an AWS c5.18xlarge instance with 2 Intel Xeon Platinum 8124M CPUs (3.00 GHz), for a total of 36 two-way hyper-threaded cores and 144 GB RAM. Our experiments use all hyper-threads unless specified otherwise. We compile our benchmarks with the g++ compiler (version 9.3.0) with the -O3 flag, and use ParlayLib [30] for parallelism. All reported running times are the medians of 3 runs, after one extra warm-up run for each experiment.

We run the experiments over 9 datasets, consisting of 6 synthetic datasets and 3 real-world datasets. We use two types of synthetic datasets. The first is Uniform (U), consisting of points distributed uniformly at random inside a bounding hypercube with side length $\sqrt{n}$, where $n$ is the number of points. The second is VisualVar (V), a clustered dataset with variable density, produced by Gan and Tao’s generator [31]. The generator produces points by performing a random walk in a local region, but jumping to random locations with some probability. For each of these two types, we generate datasets in 2D, 5D, and 7D, each with 10,000,000 points. We also use 3 real-world datasets: 10D-H-1M [32, 33] is a 10-dimensional dataset consisting of 928,991 points of home sensor data; 16D-C-4M [34, 35] is a 16-dimensional dataset consisting of 4,208,261 points of chemical sensor data; and 3D-C-321M [36] is a 3-dimensional dataset consisting of 321,065,547 points of astronomy data. Due to time constraints, we only ran experiments on 3D-C-321M using the BDL-tree in parallel, to demonstrate that the BDL-tree can scale to large datasets.

VI-A Construction

In this benchmark, we measure the time required to construct a tree over each of the datasets. The results using the object median splitting heuristic are shown in Table I(a), and the results using the spatial median splitting heuristic are shown in Table I(b). Figure 5(a) shows the scalability of the throughput on the 10M-point 7D uniform dataset (7D-U-10M).

As we can see from the results, BDL achieves similar or better performance than both B1 and B2, both serially and in parallel, and has similar or better scalability. With the object median splitting heuristic, it achieves up to $34.8\times$ speedup, with an average speedup of $28.4\times$. We also note that the single-threaded runtimes are faster with the spatial median splitting heuristic than with the object median splitting heuristic. This is expected, because the spatial median only requires partitioning the points at each level rather than finding their median, and is hence less expensive to compute; however, the scalability for spatial median is lower because there is less work to distribute among parallel threads.

             --------- 1 ---------     ------------------ 36h ------------------
             B1      B2      BDL       B1             B2            BDL
2D-U-10M     20.6s   16.3s   14.4s     0.5s (40.0x)   3.7s (4.5x)   0.4s (34.5x)
2D-V-10M     20.5s   16.2s   14.2s     0.5s (40.3x)   3.6s (4.5x)   0.4s (34.8x)
5D-U-10M     23.3s   20.8s   16.3s     0.7s (35.2x)   4.8s (4.3x)   0.5s (30.3x)
5D-V-10M     22.8s   20.2s   15.8s     0.7s (34.9x)   4.6s (4.4x)   0.5s (29.2x)
7D-U-10M     27.0s   24.0s   17.1s     0.8s (33.0x)   5.4s (4.4x)   0.6s (27.5x)
7D-V-10M     26.2s   23.2s   16.5s     0.8s (33.5x)   5.3s (4.4x)   0.6s (26.6x)
10D-H-1M     1.5s    1.5s    2.8s      0.1s (23.7x)   0.5s (3.4x)   0.1s (23.8x)
16D-C-4M     13.5s   14.5s   11.0s     0.5s (25.2x)   3.8s (3.8x)   0.5s (20.4x)
3D-C-321M    -       -       -         -              -             20.4s
(a) Object median.
             --------- 1 ---------     ------------------ 36h ------------------
             B1      B2      BDL       B1             B2            BDL
2D-U-10M     10.7s   2.3s    5.3s      0.5s (23.8x)   1.6s (1.4x)   0.4s (13.5x)
2D-V-10M     10.9s   2.5s    5.5s      0.5s (22.8x)   1.7s (1.5x)   0.4s (13.5x)
5D-U-10M     13.7s   3.4s    6.7s      0.8s (18.0x)   2.3s (1.5x)   0.6s (10.7x)
5D-V-10M     14.4s   4.0s    7.1s      0.8s (17.4x)   2.6s (1.5x)   0.7s (10.3x)
7D-U-10M     16.8s   4.4s    7.1s      1.0s (17.2x)   2.9s (1.5x)   0.8s (9.3x)
7D-V-10M     17.3s   5.2s    8.3s      1.0s (16.9x)   3.3s (1.6x)   0.9s (9.2x)
10D-H-1M     1.4s    0.8s    3.2s      0.1s (11.0x)   0.5s (1.6x)   0.2s (15.9x)
16D-C-4M     10.0s   5.3s    7.3s      0.7s (14.0x)   3.0s (1.8x)   0.7s (10.2x)
3D-C-321M    -       -       -         -              -             20.8s
(b) Spatial median.
TABLE I: Construction times (seconds) for a single thread (1) and 36 cores with hyper-threading (36h). The self-relative speedup for each implementation and dataset is shown in parentheses.
Figure 5: Plot of throughput (operations per second) of batch operations vs. thread count for both object and spatial median implementations on the 7D-U-10M dataset. (a) Construction. (b) 10% (1M points) batch insertion. (c) 10% (1M points) batch deletion. (d) Full (10M points) k-NN for $k=5$.

VI-B Insertion

In this benchmark, we measure the performance of our batch insertion implementation as compared to the baselines. We split this experiment into two separate benchmarks, one to measure the full scalability and performance of our implementation, and one to measure the impact of varying batch sizes.

VI-B1 Scalability

In this benchmark, we measure the time required to insert 10 batches, each containing 10% of the points in the dataset, into an initially empty tree, for each of our two baselines as well as our BDL-tree. The results using the object median and spatial median splitting heuristics are shown in Tables II(a) and II(b), respectively. Figure 5(b) shows the scalability of the throughput on the 10M-point 7D uniform dataset (7D-U-10M).

We see that B2 achieves the best performance on batched inserts—this is due to the fact that it does not perform any extra work to maintain balance and simply inserts points directly into the existing spatial structure. BDL achieves the second-best performance—this is due to the fact that it does not have to rebuild the entire tree on every insert, but instead amortizes the rebuilding work across the batches. Finally, B1 has the worst performance, as it must fully rebuild on every insertion. Similar to construction, we note that the spatial median heuristic performs better in the serial case but has lower scalability. With the object median splitting heuristic, BDL achieves parallel speedup of up to $35.5\times$, with an average speedup of $27.2\times$.

            |             1              |                      36h
Dataset     |    B1      B2      BDL     |      B1             B2            BDL
------------+----------------------------+---------------------------------------------
2D-U-10M    |  86.1s    5.9s   24.8s     |  2.2s (39.5x)   0.6s (10.1x)  0.7s (35.4x)
2D-V-10M    |  86.1s    5.9s   24.4s     |  2.2s (39.6x)   0.6s (10.2x)  0.7s (35.5x)
5D-U-10M    |  97.9s    7.6s   29.2s     |  2.9s (33.6x)   0.9s (8.7x)   1.0s (30.3x)
5D-V-10M    |  94.5s    7.5s   28.1s     |  2.8s (33.2x)   0.9s (8.8x)   1.0s (29.3x)
7D-U-10M    | 109.7s    8.8s   33.0s     |  3.5s (31.1x)   1.1s (8.1x)   1.2s (26.9x)
7D-V-10M    | 106.1s    8.7s   31.7s     |  3.5s (30.7x)   1.1s (8.2x)   1.2s (25.6x)
10D-H-1M    |   7.9s    0.7s    1.7s     |  0.4s (22.1x)   0.1s (5.7x)   0.1s (16.3x)
16D-C-4M    |  66.2s    5.5s   21.1s     |  3.0s (22.2x)   0.9s (6.4x)   1.1s (18.3x)
3D-C-321M   |    --      --      --      |      --             --        20.7s
(a) Object median.
            |             1              |                      36h
Dataset     |    B1      B2      BDL     |      B1             B2            BDL
------------+----------------------------+---------------------------------------------
2D-U-10M    |  32.5s    4.9s    5.2s     |  1.2s (26.1x)   0.4s (11.9x)  0.6s (9.2x)
2D-V-10M    |  33.9s    5.0s    5.8s     |  1.4s (24.0x)   0.5s (10.2x)  0.6s (9.2x)
5D-U-10M    |  40.7s    6.1s    8.8s     |  2.3s (17.5x)   0.7s (9.4x)   1.1s (8.1x)
5D-V-10M    |  44.5s    6.7s   10.0s     |  2.8s (15.6x)   0.8s (8.0x)   1.2s (8.0x)
7D-U-10M    |  48.1s    7.0s   11.2s     |  3.1s (15.6x)   0.8s (8.5x)   1.5s (7.4x)
7D-V-10M    |  52.6s    7.6s   12.5s     |  3.7s (14.4x)   1.0s (7.8x)   1.6s (7.6x)
10D-H-1M    |   6.3s    0.9s    1.3s     |  0.6s (10.2x)   0.2s (3.8x)   0.2s (6.6x)
16D-C-4M    |  42.9s    6.0s   12.9s     |  3.6s (11.8x)   1.0s (5.9x)   1.5s (8.5x)
3D-C-321M   |    --      --      --      |      --             --        15.7s
(b) Spatial median.
TABLE II: Batch insertion times (seconds) for a single thread (1) and 36 cores with hyper-threading (36h). The self-relative speedup for each implementation and dataset is shown in parentheses. We insert batches of 10% of each dataset, starting from an empty tree until the entire dataset has been inserted. Entries marked -- are not reported.

VI-B2 Batch Size

In this benchmark, we measure the performance of our batch insertion implementation as the batch size varies from 1M to 5M points. We repeatedly perform batched inserts of the specified size until the entire dataset has been inserted. Figures 6(a) and 6(b) show the results for the 2D VisualVar and 7D Uniform datasets, respectively. The first striking result is that throughput decreases for B2 as the batch size increases, while it increases for B1 and BDL. The work that B2 performs grows with the batch size: it must compute spatial splits over each larger batch, whereas with small batches it computes the upper spatial splits once on the first batch and never recomputes them. As a result, B2 has the best throughput at small batch sizes but the worst at large ones. For B1 and BDL, each insertion carries an overhead of recomputing spatial partitions, so larger batches pay this overhead fewer times and throughput increases with batch size. At larger batch sizes, BDL has the best throughput of the three implementations under object median, again because it amortizes the rebuilding work across insertions rather than recomputing spatial splits over the entire dataset on each one. Finally, in most cases the spatial-median heuristic yields better throughput than its object-median counterpart, as it requires less work to compute.
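A rough calculation illustrates the batch-size effect for the rebuilding strategies (a sketch with constants and lower-order terms elided, reusing the per-batch amortized bound stated for BDL earlier in the paper). Inserting n points in batches of size B costs

\[
\underbrace{\sum_{i=1}^{n/B} O\bigl(iB\log(iB)\bigr)}_{\text{B1: rebuild after every batch}} = O\!\left(\frac{n^{2}}{B}\log n\right),
\qquad
\underbrace{\frac{n}{B}\cdot O\bigl(B\log^{2}(n+B)\bigr)}_{\text{BDL (amortized)}} = O\bigl(n\log^{2}n\bigr),
\]

so B1's total rebuild work shrinks roughly in proportion to B, while BDL's total work is essentially independent of B; its throughput gains at larger batch sizes come from paying the fixed per-batch overheads fewer times.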

Figure 6: Throughput (operations per second) of batch insertions and batch deletions vs. batch size for the 2D-V-10M and 7D-U-10M datasets: (a) insertions on 2D-V-10M; (b) insertions on 7D-U-10M; (c) deletions on 2D-V-10M; (d) deletions on 7D-U-10M.

VI-C Deletion

In this benchmark, we measure the performance of our deletion implementation as compared to the baselines. Similar to the insertion experiment, we split this experiment into two separate benchmarks, one to measure the scalability and performance of our implementation, and one to measure the impact of varying batch size.

VI-C1 Scalability

In this benchmark, we measure the time required to delete 10 batches each containing 10% of the points in the dataset from an initially full tree for each of our two baselines as well as the BDL-tree. The results using an object median splitting heuristic are shown in Table III(a) and the results using a spatial median splitting heuristic are shown in Table III(b). Figure 5(c) shows the scalability of the throughput on the 10M point 7D uniform dataset.

We observe that B2 performs by far the best: it does almost no work beyond tombstoning the deleted points, so it is extremely efficient. BDL has the second-best performance, as it amortizes the rebuilding work across batches rather than rebuilding over the entire point set on every deletion. B1 performs worst, as it rebuilds on every deletion. With the object-median splitting heuristic, BDL achieves a parallel speedup of up to 33.1× (28.5× on average).
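One ingredient in keeping BDL's deletions cheap is the per-tree bloom filter mentioned among our optimizations: a deletion batch only needs to search the trees whose filter reports a possible hit. Below is a minimal sketch (our own illustrative code with hypothetical names such as mayContain and treesToSearch; the actual implementation constructs its filters in parallel):

#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Point = std::array<double, 3>;

// A two-hash bloom filter over points: FNV-style mixing of the raw
// coordinate bits with two different seeds.
struct Bloom {
  std::vector<bool> bits;
  explicit Bloom(std::size_t m) : bits(m, false) {}
  static std::size_t mix(const Point& p, std::uint64_t seed) {
    std::uint64_t h = seed;
    for (double c : p) {
      std::uint64_t w;
      std::memcpy(&w, &c, sizeof w);
      h = (h ^ w) * 0x100000001b3ULL;
    }
    return static_cast<std::size_t>(h);
  }
  void add(const Point& p) {
    bits[mix(p, 0xcbf29ce484222325ULL) % bits.size()] = true;
    bits[mix(p, 0x9e3779b97f4a7c15ULL) % bits.size()] = true;
  }
  bool mayContain(const Point& p) const {
    return bits[mix(p, 0xcbf29ce484222325ULL) % bits.size()] &&
           bits[mix(p, 0x9e3779b97f4a7c15ULL) % bits.size()];
  }
};

// For a point to delete, return the indices of the trees whose filters
// report a possible hit; only those trees are actually searched. A
// false positive only costs one unnecessary (unsuccessful) search.
std::vector<std::size_t> treesToSearch(const Point& p,
                                       const std::vector<Bloom>& filters) {
  std::vector<std::size_t> hits;
  for (std::size_t i = 0; i < filters.size(); ++i)
    if (filters[i].mayContain(p)) hits.push_back(i);
  return hits;
}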

            |             1              |                      36h
Dataset     |    B1      B2      BDL     |      B1             B2            BDL
------------+----------------------------+---------------------------------------------
2D-U-10M    | 114.0s    1.1s   29.2s     |  2.7s (43.0x)   0.1s (10.4x)  0.9s (33.0x)
2D-V-10M    | 114.2s    1.1s   29.0s     |  2.7s (43.0x)   0.1s (10.5x)  0.9s (33.1x)
5D-U-10M    | 130.2s    2.2s   40.1s     |  3.5s (37.2x)   0.2s (9.1x)   1.4s (28.1x)
5D-V-10M    | 127.0s    2.2s   39.3s     |  3.5s (36.8x)   0.2s (9.1x)   1.4s (27.8x)
7D-U-10M    | 146.7s    3.0s   37.3s     |  4.1s (36.0x)   0.3s (8.9x)   1.4s (27.0x)
7D-V-10M    | 144.4s    3.0s   36.9s     |  4.1s (35.6x)   0.3s (8.9x)   1.4s (27.2x)
10D-H-1M    |  24.7s    0.2s   21.9s     |  1.0s (24.7x)   0.0s (5.6x)   0.8s (28.5x)
16D-C-4M    |  74.8s    2.9s   33.1s     |  2.7s (27.4x)   0.3s (8.6x)   1.4s (23.6x)
3D-C-321M   |    --      --      --      |      --             --        17.3s
(a) Object median.
            |             1              |                      36h
Dataset     |    B1      B2      BDL     |      B1             B2            BDL
------------+----------------------------+---------------------------------------------
2D-U-10M    |  70.2s    1.3s   30.2s     |  1.8s (38.8x)   0.1s (11.8x)  1.1s (26.3x)
2D-V-10M    |  71.7s    1.5s   30.7s     |  2.0s (36.0x)   0.1s (10.7x)  1.2s (25.4x)
5D-U-10M    |  82.9s    2.3s   29.2s     |  2.9s (28.9x)   0.2s (9.6x)   1.2s (23.9x)
5D-V-10M    |  85.9s    2.9s   29.6s     |  3.4s (25.4x)   0.3s (9.0x)   1.3s (22.2x)
7D-U-10M    |  97.1s    3.2s   38.2s     |  3.6s (27.1x)   0.3s (9.4x)   1.8s (21.4x)
7D-V-10M    | 100.8s    3.9s   38.3s     |  4.2s (24.0x)   0.4s (8.8x)   1.8s (20.9x)
10D-H-1M    |  27.6s    0.4s   22.1s     |  1.1s (25.1x)   0.1s (4.8x)   0.8s (26.2x)
16D-C-4M    |  60.5s    4.0s   32.5s     |  3.3s (18.3x)   0.5s (8.8x)   1.7s (19.5x)
3D-C-321M   |    --      --      --      |      --             --          --
(b) Spatial median.
TABLE III: Batch deletion times (seconds) for a single thread (1) and 36 cores with hyper-threading (36h). The self-relative speedup for each implementation and dataset is shown in parentheses. We delete batches of 10% of each dataset, starting from a full tree until the entire dataset has been deleted. Entries marked -- are not reported.

VI-C2 Batch Size

In this benchmark, we measure the performance of our batch deletion implementation as the batch size varies from 1M to 5M points. Figures 6(c) and 6(d) show the results for the 2D VisualVar and 7D Uniform datasets, respectively. B2 again consistently has the highest throughput, with BDL second and B1 lowest. Furthermore, the throughput of B1 and BDL increases with the batch size, for the same reasons as with insertions, while B2's throughput is consistent across batch sizes.

VI-D Data-Parallel k-NN

In this benchmark, we measure the performance and scalability of our k-NN implementation as compared to the baselines. We split this into three separate experiments.

VI-D1 Scalability

In this experiment, we measure the scalability of the k-NN operation after constructing each data structure over the entire dataset (in a single batch). The results using the object-median and spatial-median splitting heuristics are shown in Tables IV(a) and IV(b), respectively. Figure 5(d) shows the throughput scalability on the 7D-U-10M dataset. With the object-median heuristic, BDL achieves a parallel speedup of up to 46.1× (40.0× on average).

The results show that B1 and B2 have similar performance (B2 is slightly faster due to implementation differences), and both are faster than the BDL-tree. This is expected: the k-NN operation is performed directly over a tree built over the entire dataset in a single batch, so both baselines consist of fully balanced trees and can answer k-NN queries very efficiently. BDL, on the other hand, consists of a set of balanced trees, which adds overhead to the k-NN operation, since the query must be run separately on each individual tree. However, as the next two benchmarks show, BDL provides superior performance for mixed workloads of dynamic batch insertions and deletions interspersed with k-NN queries.
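To see where this per-query overhead comes from, the following is a minimal sketch of one k-NN query over a BDL-style collection of trees (our own illustrative code; knnInto is a hypothetical per-tree search, and its linear-scan body stands in for the actual pruned kd-tree traversal):

#include <array>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

using Point = std::array<double, 3>;
using Neighbor = std::pair<double, std::size_t>;  // (squared distance, local index)

struct KdTree {
  std::vector<Point> pts;
  // Sketch body: a linear scan stands in for the pruned tree traversal;
  // the real search prunes subtrees whose bounding box lies farther
  // than best.top().first once `best` already holds k candidates.
  // (Indices are local to each tree; a real implementation stores ids.)
  void knnInto(const Point& q, std::size_t k,
               std::priority_queue<Neighbor>& best) const {
    for (std::size_t i = 0; i < pts.size(); ++i) {
      double d = 0;
      for (std::size_t j = 0; j < q.size(); ++j) {
        double diff = pts[i][j] - q[j];
        d += diff * diff;
      }
      if (best.size() < k) best.emplace(d, i);
      else if (d < best.top().first) { best.pop(); best.emplace(d, i); }
    }
  }
};

// One query: a single max-heap of the k best candidates is threaded
// through all O(log n) trees in the collection.
std::priority_queue<Neighbor> knnOverTrees(const Point& q, std::size_t k,
                                           const std::vector<KdTree>& trees) {
  std::priority_queue<Neighbor> best;
  for (const KdTree& t : trees) t.knnInto(q, k, best);
  return best;
}

The shared heap carries the current k-th best distance from one tree to the next, so later trees can prune more aggressively, but every tree must still be visited, which is the overhead relative to a single balanced tree.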

            |              1                 |                      36h
Dataset     |     B1       B2       BDL      |      B1             B2            BDL
------------+--------------------------------+-----------------------------------------------
2D-U-10M    |   34.9s    11.9s    64.5s      |  0.6s (57.2x)   0.3s (40.3x)   1.5s (43.1x)
2D-V-10M    |   37.2s    12.7s    67.0s      |  0.7s (57.0x)   0.3s (40.7x)   1.5s (43.7x)
5D-U-10M    |  302.3s   178.0s   339.4s      |  6.3s (47.6x)   3.5s (51.4x)   8.6s (39.4x)
5D-V-10M    |  109.5s    64.1s   145.8s      |  2.1s (51.1x)   1.2s (52.7x)   3.2s (46.1x)
7D-U-10M    | 1520.0s  1239.4s  1621.6s      | 44.3s (34.3x)  34.1s (36.4x)  52.9s (30.7x)
7D-V-10M    |  133.6s    86.3s   173.1s      |  2.8s (48.1x)   1.7s (50.7x)   4.0s (43.8x)
10D-H-1M    |    5.3s     5.5s    11.9s      |  0.1s (54.5x)   0.1s (50.5x)   0.3s (45.8x)
16D-C-4M    |  464.4s   612.2s   468.3s      | 16.5s (28.1x)  16.1s (38.1x)  17.3s (27.1x)
3D-C-321M   |     --       --       --       |      --             --        15.6s
(a) Object median.
            |              1                 |                      36h
Dataset     |     B1       B2       BDL      |      B1             B2            BDL
------------+--------------------------------+-----------------------------------------------
2D-U-10M    |   33.6s    13.0s    69.9s      |  0.6s (55.6x)   0.3s (40.0x)   1.6s (44.9x)
2D-V-10M    |   35.7s    13.5s    72.7s      |  0.6s (56.8x)   0.3s (39.5x)   1.6s (44.9x)
5D-U-10M    |  348.8s   216.7s   503.1s      |  6.7s (52.2x)   4.0s (54.1x)   9.6s (52.2x)
5D-V-10M    |  103.1s    65.2s   174.2s      |  2.0s (52.8x)   1.4s (45.4x)   3.5s (49.5x)
7D-U-10M    | 2142.0s  1494.0s  2804.0s      | 45.5s (47.1x)  36.7s (40.7x)  57.3s (48.9x)
7D-V-10M    |  116.9s    81.3s   203.1s      |  2.3s (50.5x)   1.7s (48.0x)   4.1s (49.3x)
10D-H-1M    |    5.7s     4.7s    15.5s      |  0.1s (54.6x)   0.1s (45.2x)   0.3s (47.3x)
16D-C-4M    |  606.7s   557.0s   623.4s      | 19.9s (30.5x)  11.9s (47.0x)  20.4s (30.5x)
3D-C-321M   |     --       --       --       |      --             --        17.4s
(b) Spatial median.
TABLE IV: k-NN times (seconds) for a single thread (1) and 36 cores with hyper-threading (36h). The self-relative speedup for each implementation and dataset is shown in parentheses. We use 100% of each dataset, except for 3D-C-321M, which was run with 10% of the dataset because the full k-NN results could not fit in memory. Entries marked -- are not reported.

VI-D2 Effect of Varying k

In this experiment, we benchmark the throughput of the k-NN operation on 36 cores with hyper-threading as k varies from 2 to 11. For all three trees, we build the tree with a sequence of batch insertions, each of 5% of the dataset, until the entire dataset is inserted, and then perform the k-NN operation. Figures 7(a) and 7(b) show the results for the 2D VisualVar and 7D Uniform datasets, respectively. In both cases, B1 has the best k-NN performance, followed closely by BDL. B2 performs significantly worse: because its tree was built through a sequence of batch insertions rather than a single construction over the entire dataset, the tree ends up imbalanced and query performance suffers.

Figure 7: k-NN throughput (operations per second) vs. k on 36 cores with hyper-threading for the 2D-V-10M and 7D-U-10M datasets: (a) k-NN on 2D-V-10M; (b) k-NN on 7D-U-10M.

VI-D3 Mixed Operations

In this final experiment, we measure the k-NN and overall performance of the trees under a mixed sequence of batch insertions, batch deletions, and batch k-NN queries. In particular, we perform 20 batch insertions, each consisting of 5% of the dataset. After every 5 batch insertions, we perform a k-NN query with k=5 (over the entire dataset) to measure the current query performance. We then perform 15 batch deletions, each consisting of a random 5% of the dataset (with no repeats), again performing a k-NN query (k=5) after every 5 batch deletions. Overall, there are 7 k-NN queries, and we split the experiment into 7 sections demarcated by them (labeled INS0, INS1, INS2, INS3, DEL0, DEL1, DEL2); for each section, we measure the k-NN runtime as well as the time for the 5 batched insertion/deletion operations. The results are shown for the 2D VisualVar dataset in Figure 8(a) and for the 7D Uniform dataset in Figure 8(b) (we observed similar results on other datasets). The x-axis shows the 7 sections, and the y-axis shows the time for each section. There are two lines for each tree: a dashed line showing just the k-NN time in each section, and a solid line showing the total time for the batched updates plus the k-NN query.

In the Uniform case, B2 performs the worst overall, due almost entirely to its poor k-NN performance after batched updates. The batched updates themselves contribute minimally to its total runtime: they are very fast, but they cause significant imbalance in the tree structure, degrading query performance. B1 has the best raw k-NN query times, but its overall runtime is the second worst, since the batched updates needed to maintain the balance that yields fast queries are expensive. Finally, BDL represents the best tradeoff between dynamic batch updates and k-NN performance; in particular, it has the best total runtime after every operation. The results for the VisualVar case are similar, except that B1 has the worst overall runtime there, due to its very expensive updates.

Figure 8: Running times (seconds) of updates and queries vs. progressive batch updates on the tree using all 36 cores with hyper-threading, for the 2D-V-10M and 7D-U-10M datasets: (a) batch updates and k-NN queries on 2D-V-10M; (b) batch updates and k-NN queries on 7D-U-10M. “knn” represents the k-NN query performance after cumulative updates on the trees, and “total” represents the combined running time of both updates and queries since the previous batch.

VI-D4 Dual-Tree k-NN

In addition to the data-parallel approach discussed above, we also implemented a k-NN query based on the dual-tree traversal proposed by March et al. [21]. While our dual-tree k-NN for BDL was faster on a single thread due to more efficient pruning, it was less scalable than the data-parallel approach and thus did not perform as well in parallel. We therefore omit the experimental results for dual-tree k-NN.

VII Conclusion

We have presented the BDL-tree, a parallel, batch-dynamic kd-tree that supports batched construction, insertions, deletions, and k-NN queries. We show that our data structure has strong theoretical work and depth bounds. Furthermore, our experiments show that the BDL-tree achieves good parallel speedup and offers a useful tradeoff between the two baseline implementations: in particular, it delivers the best overall performance in a dynamic setting where batched updates to the underlying dataset are interspersed with k-NN queries.

VIII Acknowledgements

This research was supported by DOE Early Career Award #DE-SC0018947, NSF CAREER Award #CCF-1845763, a Google Faculty Research Award, a Google Research Scholar Award, DARPA SDH Award #HR0011-18-3-0007, and the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA.

References

[1] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Commun. ACM, vol. 18, no. 9, pp. 509–517, Sep. 1975.
[2] ——, “Decomposable searching problems,” Information Processing Letters, vol. 8, no. 5, pp. 244–251, 1979.
[3] J. L. Bentley and J. B. Saxe, “Decomposable searching problems I. static-to-dynamic transformation,” Journal of Algorithms, vol. 1, no. 4, pp. 301–358, 1980.
[4] P. K. Agarwal, L. Arge, A. Danner, and B. Holland-Minkley, “Cache-oblivious data structures for orthogonal range searching,” in Proceedings of the Nineteenth Annual Symposium on Computational Geometry, 2003, pp. 237–245.
[5] P. K. Agarwal, K. Fox, K. Munagala, and A. Nath, “Parallel algorithms for constructing range and nearest-neighbor searching data structures,” in Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2016, pp. 429–440.
[6] R. R. Curtin, “Faster dual-tree traversal for nearest neighbor search,” in Proceedings of the 8th International Conference on Similarity Search and Applications, 2015, pp. 77–89.
[7] D. Wehr and R. Radkowski, “Parallel kd-tree construction on the GPU with an adaptive split and sort strategy,” Int. J. Parallel Program., vol. 46, no. 6, pp. 1139–1156, Dec. 2018.
[8] S. Zellmann, J. P. Schulze, and U. Lang, “Binned k-d tree construction for sparse volume data on multi-core and GPU systems,” IEEE Transactions on Visualization and Computer Graphics, vol. 27, no. 3, pp. 1904–1915, 2021.
[9] O. Procopiuc, P. K. Agarwal, L. Arge, and J. S. Vitter, “Bkd-tree: A dynamic scalable kd-tree,” in International Symposium on Spatial and Temporal Databases. Springer, 2003, pp. 46–65.
[10] R. A. Finkel and J. L. Bentley, “Quad trees: A data structure for retrieval on composite keys,” Acta Informatica, vol. 4, no. 1, pp. 1–9, 1974.
[11] S. M. Omohundro, Five Balltree Construction Algorithms. International Computer Science Institute, Berkeley, 1989.
[12] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 97–104.
[13] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, 1984, pp. 47–57.
[14] M. Shevtsov, A. Soupikov, and A. Kapustin, “Highly parallel fast KD-tree construction for interactive ray tracing of dynamic scenes,” Computer Graphics Forum, vol. 26, no. 3, pp. 395–404, 2007.
[15] P. O’Neil, E. Cheng, D. Gawlick, and E. O’Neil, “The log-structured merge-tree (LSM-tree),” Acta Informatica, vol. 33, no. 4, pp. 351–385, 1996.
[16] R. A. Vitillo, “Parallel log-structure merge tree for key-value stores,” https://github.com/vitillo/kvstore.
[17] M. Dobson and G. E. Blelloch, “Parallel nearest neighbors in low dimensions with batch updates,” CoRR, vol. abs/2111.04182, 2021. [Online]. Available: https://arxiv.org/abs/2111.04182
[18] J. H. Friedman, J. L. Bentley, and R. A. Finkel, “An algorithm for finding best matches in logarithmic expected time,” ACM Transactions on Mathematical Software (TOMS), vol. 3, no. 3, pp. 209–226, 1977.
[19] R. R. Curtin, W. B. March, P. Ram, D. V. Anderson, A. G. Gray, and C. L. Isbell, “Tree-independent dual-tree algorithms,” in Proceedings of the 30th International Conference on International Conference on Machine Learning, 2013, pp. 1435–1443.
[20] A. G. Gray and A. W. Moore, “‘N-body’ problems in statistical learning,” in Proceedings of the 13th International Conference on Neural Information Processing Systems, 2000, pp. 500–506.
[21] W. B. March, P. Ram, and A. G. Gray, “Fast Euclidean minimum spanning tree: Algorithm, analysis, and applications,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 603–612.
[22] Y. Wang, S. Yu, Y. Gu, and J. Shun, “A parallel batch-dynamic data structure for the closest pair problem,” in 37th International Symposium on Computational Geometry, vol. 189, 2021, pp. 60:1–60:16.
[23] L. Dhulipala, Q. C. Liu, J. Shun, and S. Yu, “Parallel batch-dynamic k-clique counting,” in Symposium on Algorithmic Principles of Computer Systems (APOCS), 2021, pp. 129–143.
[24] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. MIT Press, 2009.
[25] J. JaJa, Introduction to Parallel Algorithms. Addison-Wesley Professional, 1992.
[26] R. D. Blumofe and C. E. Leiserson, “Scheduling multithreaded computations by work stealing,” J. ACM, vol. 46, no. 5, pp. 720–748, Sep. 1999.
[27] L. Arge, G. S. Brodal, and R. Fagerberg, “Cache-oblivious data structures,” in Handbook of Data Structures and Applications. Chapman and Hall/CRC, 2018, pp. 545–565.
[28] E. D. Demaine, “Cache-oblivious algorithms and data structures,” Lecture Notes from the EEF Summer School on Massive Data Sets, vol. 8, no. 4, pp. 1–249, 2002.
[29] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Communications of the ACM, vol. 13, no. 7, pp. 422–426, 1970.
[30] G. E. Blelloch, D. Anderson, and L. Dhulipala, “ParlayLib - a toolkit for parallel algorithms on shared-memory multicore machines,” in Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, 2020, pp. 507–509.
[31] J. Gan and Y. Tao, “DBSCAN revisited: Mis-claim, un-fixability, and approximation,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2015, pp. 519–530.
[32] “HT dataset,” https://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring.
[33] R. Huerta, T. Mosqueiro, J. Fonollosa, N. F. Rulkov, and I. Rodriguez-Lujan, “Online decorrelation of humidity and temperature in chemical sensors for continuous monitoring,” Chemometrics and Intelligent Laboratory Systems, vol. 157, pp. 169–176, 2016.
[34] “Chem dataset,” https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+dynamic+gas+mixtures.
[35] J. Fonollosa, S. Sheik, R. Huerta, and S. Marco, “Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring,” Sensors and Actuators B: Chemical, vol. 215, pp. 618–629, 2015.
[36] Y. Kwon, D. Nunley, J. P. Gardner, M. Balazinska, B. Howe, and S. Loebman, “Scalable clustering algorithm for N-body simulations in a shared-nothing cluster,” in International Conference on Scientific and Statistical Database Management, 2010, pp. 132–150.