Tao: A Learning Framework for Adaptive Nearest Neighbor Search using Static Features Only††thanks: Hongya Wang is the corresponding author.
Abstract
Approximate nearest neighbor (ANN) search is a fundamental problem in areas such as data management, information retrieval and machine learning. Recently, Li et al. proposed a learned approach named AdaptNN to support adaptive ANN query processing. In the middle of query execution, AdaptNN collects a number of runtime features and predicts the termination condition for each individual query, thereby attaining better end-to-end latency. Despite its efficiency, using runtime features complicates the learning process and leads to performance degradation.
Radically different from AdaptNN, we argue that it is promising to predict the termination condition before query execution. Particularly, we developed Tao, a general learning framework for Terminating ANN queries Adaptively using Only static features. Upon the arrival of a query, Tao first maps the query to a local intrinsic dimension (LID) number, and then predicts the termination condition using LID instead of runtime features. By decoupling the prediction procedure from query execution, Tao eliminates the laborious feature selection process involved in AdaptNN. Besides, two design principles are formulated to guide the application of Tao and improve the explainability of the prediction model. We integrate two state-of-the-art indexing approaches, i.e., IMI and HNSW, into Tao, and evaluate the performance over several million- to billion-scale datasets. Experimental results show that, in addition to its simplicity and generality, Tao achieves up to 2.69x speedup over its counterpart at the same high accuracy targets.
Index Terms:
approximate nearest neighbor search, local intrinsic dimension, neural networks
I Introduction
Nearest neighbor search is a fundamental problem in domains such as large-scale image search and information retrieval [39, 38], recommendation [13], entity resolution [22], and sequence matching [6]. Constrained by the curse of dimensionality, exact nearest neighbor search becomes unaffordable for the rapidly increasing amount of unstructured data (images, documents, video clips, etc.) [57]. As a workaround, approximate nearest neighbor (ANN) search is widely used to provide an appropriate tradeoff between accuracy and latency.
Currently, the quantization-based and graph-based approaches are the two mainstream ANN search paradigms [43]. Vector quantization shrinks database vectors into compact codes based on various quantization methods, and reduces computation latency and memory requirements by nearly an order of magnitude [27]. Graph-based approaches [41] first build a graph that captures proximity among data points and then perform ANN search by traversing the graph in a greedy or heuristic fashion. A number of empirical studies show that graph-based approaches offer the most appealing tradeoff between accuracy and search cost [56].
Both quantization-based and graph-based approaches use fixed configurations that impose the same termination condition (e.g., the number of candidates to examine) for all queries, which penalizes “easy” queries and incurs higher-than-necessary running time. To address this issue, Li et al. proposed AdaptNN, which predicts and then applies a termination condition for each individual query [35]. Figure 1(a) illustrates the workflow of AdaptNN. After receiving a query, AdaptNN first invokes a specific ANN algorithm to execute the query for a while, then collects a number of runtime features and predicts the termination condition based on them. Taking the estimated termination condition as an additional input, the ANN algorithm continues to run until the condition is met, and finally outputs the query results.


However, AdaptNN has several limitations despite its efficiency:
• Firstly, the runtime features require heavy manual engineering: the hand-crafted features need to be designed for each algorithm individually and no systematic design strategy is available. For example, AdaptNN uses six input features for the HNSW index whereas it adopts 14 runtime statistics for the IMI index. Hence, it is not easy to adapt AdaptNN to other indexing approaches that are not discussed in [35].
• Secondly, the end-to-end latency is sensitive to the time interval between when query processing starts and when AdaptNN makes the prediction (dubbed the prediction time). [35] provides no principled method to set the prediction time. As will be discussed in detail in Section III-B, the optimal prediction time depends on the specific algorithm and dataset, and the tuning procedure is very tedious. In view of this, AdaptNN simply sets it to a fixed default value, leading to undesirable performance degradation.
Both of the above-mentioned limitations are rooted in the belief that only runtime features provide sufficient prediction power for the termination condition. In this paper, we challenge this belief and propose a simple and principled static feature that performs the prediction accurately. Specifically, we propose Tao, a general learning framework for Terminating ANN queries Adaptively using Only static features. Figure 1(b) illustrates the workflow of Tao. Upon the arrival of a query, Tao first maps the query to a local intrinsic dimension (LID) number, and then predicts the termination condition using LID only, before query execution. The specific ANN search algorithm takes the termination condition as an additional input and runs until the condition is met.
As such, Tao has some advantages worth mentioning:
• Tao decouples the prediction procedure from query execution and eliminates the requirement for tuning the prediction time, making it simple to use in practice.
• Tao employs query vectors and the associated LID, instead of a set of hand-crafted runtime features, as model inputs to accomplish the goal of adaptive query processing, making it general enough to accommodate new ANN search algorithms.
• Tao improves the explainability of the prediction model by revealing the correlation between LID and search cost and by formulating guidelines for using Tao, which is a highly desirable feature in AI-powered applications nowadays.
To demonstrate the aforementioned advantages, we instantiated two instances of our framework based on two state-of-the-art indexing approaches (HNSW [41] and IMI [3]). Similar to [35], we implemented them on top of the Faiss similarity search library [29]. Comprehensive experiments are conducted to evaluate the end-to-end performance on a collection of real datasets, including two billion-scale datasets, Deep1B [4] and Sift1B [28]. Empirical results demonstrate that, in addition to its simplicity and generality, Tao offers up to 2.69x speedup compared with the runtime prediction strategy.
The contributions of the paper are summarized as follows: (1) We identify the limitations of the existing approach. (2) We propose and develop a general learning framework for adaptive ANN search using only static features. (3) We conduct extensive experiments on various datasets to verify the effectiveness and efficiency of our framework.
Roadmap. Section 2 presents the necessary preliminaries. Section 3 gives a brief review of AdaptNN and discusses its limitations. Section 4 demonstrates the feasibility of predicting with static features. Section 5 sketches the workflow of Tao. Section 6 describes the experimental methodology and reports the results. Section 7 reviews the related work, and Section 8 concludes the paper.
II Preliminaries
In this section, we present necessary preliminaries for ANN search and relevant indexing algorithms used in this paper.
II-A ANN Search Problem
Nearest neighbor (NN) search in high dimensional spaces is a challenging problem that has been studied extensively in the past two decades. Formally, let $\mathcal{X} \subset \mathbb{R}^d$ be a finite set of vectors. Given a query $q \in \mathbb{R}^d$, $k$-NN search returns $k$ results $o_1, \ldots, o_k \in \mathcal{X}$, where $o_i$ is the $i$-th nearest neighbor of $q$. In this paper we use the Euclidean distance as the distance function to measure the (dis)similarity between vectors.
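For reference, a minimal brute-force sketch of exact $k$-NN under the Euclidean distance is shown below; `X` and `q` are assumed to be a numpy matrix of database vectors and a query vector, respectively.

```python
# Brute-force exact k-NN under the Euclidean distance (reference only).
import numpy as np

def knn_bruteforce(X: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    dists = np.linalg.norm(X - q, axis=1)   # Euclidean distance to every database vector
    return np.argsort(dists)[:k]            # indices of the k nearest neighbors
```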
As database sizes reach millions or billions of entries and the vector dimension grows into the hundreds, any indexing method degenerates to a linear scan of the whole dataset, which is prohibitively expensive in practice [28]. Therefore, practitioners often resort to approximate nearest neighbor search. Instead of always identifying the exact NNs of a query, ANN search may return neighbors that are only close enough to the query. By trading precision for efficiency, ANN search provides better latency than the exact version.
There is a large body of literature on algorithms for approximate nearest neighbor search, which can be roughly divided into four categories: tree-based, hashing-based, quantization-based and graph-based approaches. In this paper, we focus on quantization-based and graph-based methods due to their appealing performance and popularity in real-life large-scale applications [55].
II-B Product Quantization and Inverted Multi-Index
The product quantization (PQ) algorithm is a lossy compression scheme developed on the basis of vector quantization [27]. By quantizing the vectors of the dataset, it can effectively reduce the storage needed for the original vectors. To the best of our knowledge, the quantization-based method is the only one that can support billion-scale datasets on a commodity computer. However, the search is exhaustive for the original PQ: all database vectors must be evaluated.
To avoid exhaustive search, IVFADC [27] uses two quantization levels to organize large datasets. In particular, IVFADC groups database vectors into different clusters. When building the index, a list of cluster centroids is trained by $k$-means clustering, and each database vector is assigned to the cluster with the closest centroid. During search, the index first computes the distances between the query and all cluster centroids, then evaluates the database vectors belonging to the nprobe nearest clusters, where nprobe is a hyperparameter that must be determined a priori.
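As an illustration, a minimal Faiss sketch of the IVFADC scheme just described is given below (not the exact configuration used in the paper); `xb` and `xq` are assumed to be float32 numpy arrays holding the database and query vectors.

```python
# IVFADC in Faiss: a coarse k-means quantizer plus PQ codes per cluster.
import faiss

d, nlist, m = 128, 1024, 16                          # dimension, #clusters, #PQ subvectors
quantizer = faiss.IndexFlatL2(d)                     # coarse quantizer (cluster centroids)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per sub-codeword

index.train(xb)                                      # k-means training of centroids and codebooks
index.add(xb)                                        # encode and store database vectors

index.nprobe = 32                                    # number of nearest clusters to scan per query
D, I = index.search(xq, 10)                          # distances and ids of top-10 neighbors
```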
A number of variants of IVFADC have been proposed, among which the Inverted Multi-Index (IMI) [3] offers state-of-the-art performance thanks to its efficient space-partitioning strategy. To be specific, IMI decomposes the vectors into several subspaces and trains a separate list of centroids in each subspace, which leads to a fine-grained space partition.
For example, Figure 2 illustrates how IMI generates two product codebooks, one for each half of the vector, at the first level. Each codebook contains $K$ sub-codewords, leading to $K^2$ clusters in total. By subtracting the corresponding cluster centroid from a vector, one obtains the residual for the second quantization level. Similar to IVFADC, IMI searches the same fixed number nprobe of nearest clusters for all queries.

II-C Graph-based Methods
Graph-based methods such as hierarchical navigable small world (HNSW) graphs are currently among the most efficient ANN search paradigms [41]. HNSW employs hierarchical network structures to organize vectors, and such networks are essentially approximate NN-graphs [54]. To process a query, the graph is traversed starting from the entry point using beam search, and the parameter efSearch is used to control the extent of graph traversal. The greater efSearch is, the more accurate the answers will be.
For almost all graph-based methods, the ANN search procedure is based on the same principle, as illustrated in Figure 3. For a query $q$, start at an initial vertex $v$ chosen arbitrarily or using some sophisticated selection rule. Move along an edge to the adjacent vertex with minimum distance to $q$. Repeat this step until the current vertex is closer to $q$ than all of its neighbors, and then report it as the NN of $q$.
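A minimal sketch of this greedy traversal is shown below; `graph` (adjacency lists), `vectors` (the database vectors) and `entry` (the starting vertex) are assumed inputs.

```python
# Greedy graph traversal: hop to the closest neighbor until a local minimum is reached.
import numpy as np

def greedy_search(graph, vectors, q, entry):
    current = entry
    current_dist = np.linalg.norm(vectors[current] - q)
    while True:
        best_next, best_dist = None, current_dist
        for u in graph[current]:                    # examine neighbors of the current vertex
            d = np.linalg.norm(vectors[u] - q)
            if d < best_dist:
                best_next, best_dist = u, d
        if best_next is None:                       # no neighbor is closer: report current vertex
            return current
        current, current_dist = best_next, best_dist
```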

III Motivations
III-A Brief Review of AdaptNN
Most existing ANN approaches use fixed configurations that apply the same termination condition to all queries. Li et al. conducted an empirical study over three datasets (Deep/Sift/Gist; detailed descriptions of these datasets are given in Section VI-A) using the IMI and HNSW implementations in the Faiss library [35]. They observed that, due to the index structures and the vector distributions, the number of database vectors that must be searched to find the ground-truth nearest neighbor varies significantly among queries. Take HNSW as an example: 80% of queries (dubbed “easy queries”) only need to perform at most 547/481/1260 distance evaluations for Deep10M/Sift10M/Gist, respectively, while the other 20% of queries (dubbed “hard queries”) require up to 88696/16618/118277 distance evaluations.
Fixed configurations lead to high average latency because easy queries are forced to conduct as lengthy a search as is required to deliver reasonable performance for hard queries. To address this issue, Li et al. propose to predict a different termination condition for each individual query, making easy queries do less computation than hard ones and thus reducing the end-to-end average latency dramatically.
One central proposition in AdaptNN is that static features alone (e.g., the query vectors themselves) cannot offer good prediction power. To realize highly accurate predictions, it elaborately identifies a set of runtime features of the intermediate search results, and it adopts gradient boosting decision trees to predict the minimum amount of search to perform. As an example, AdaptNN uses the distances between the query and its 1st and 10th neighbors as two runtime features for HNSW.
Figure 1(a) depicts the workflow of AdaptNN. When receiving a query, AdaptNN first invokes the ANN search algorithm and performs a fixed amount of search to obtain the intermediate search results. Thereafter, AdaptNN computes the runtime features, coalesces them with static features, and predicts the termination condition using gradient boosting decision trees. The termination condition is passed down to the ANN search algorithm, which continues to run until the condition is met. Experimental results show that adaptive query processing based on runtime prediction consistently reduces the average end-to-end latency.
Table I: Search cost (total number of points examined) under different prediction times (recall already reached) for AdaptNN with the HNSW method.

Prediction Time | Deep10M | Sift10M | Gist | ImageNet | Msong | Trevi | Glove
---|---|---|---|---|---|---|---
0.5 | 18,483 | 11,017 | 37,334 | 71,897 | 5,337 | 30,621 | 60,190 |
0.6 | 22,821 | 9,453 | 34,072 | 86,030 | 4,521 | 38,213 | 67,465 |
0.7 | 18,882 | 10,481 | 34,508 | 87,176 | 8,818 | 22,895 | 66,905 |
0.8 | 17,594 | 16,787 | 44,390 | 78,641 | 7,182 | 22,891 | 66,406 |
0.9 | 17,033 | 10,616 | 47,442 | 122,697 | 7,090 | 28,573 | 108,983 |
III-B Limitations
AdaptNN, however, faces a number of limitations. Firstly, the input features are chosen in an ad-hoc fashion and one has to try different hand-crafted features for different algorithms. For example, HNSW uses six input features whereas IMI adopts 14 runtime statistics to predict the minimum amount of search. While the importance of different features is evaluated using the per-feature gain statistics from gradient boosting decision tree models (the importance of a feature is proportional to the total error reduction contributed by the feature), this only alleviates the feature engineering effort and does not address issues such as correlation among features. Hand-crafted feature engineering impedes the application of AdaptNN to other ANN search algorithms not addressed in [35].
Secondly, as noted in [35], if AdaptNN searches less before the feature generation, the intermediate result features may provide less information gain, reducing the prediction accuracy. If it searches more before the feature generation, all queries must search more, increasing the end-to-end average latency. As will be discussed shortly, our preliminary experiments show that the end-to-end latency is quite sensitive to the prediction time. Unfortunately, again, AdaptNN provides no principled method or guideline to determine when to predict at runtime.
The pragmatic (currently practiced) way to choose the prediction time adopted by AdaptNN is as follows. Given a dataset, manually choose a set of accuracy targets with a fixed step, and then perform binary search to obtain the corresponding search costs (nprobe or efSearch) for all accuracy targets. To determine the optimal prediction time, one has to run AdaptNN using different prediction times over the training datasets and pick the one with the least average latency. Since the tuning process is tedious, time-consuming and lossy in nature, AdaptNN simply uses a fixed prediction time, i.e., the time instant at which 0.8 recall is reached, for all datasets by default in practice [35].
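For concreteness, a hedged sketch of the binary-search step is given below; `run_queries` is a hypothetical helper that executes the query set with a given efSearch (or nprobe) value and returns the achieved recall.

```python
# Find the smallest search parameter whose measured recall reaches the target,
# assuming recall is (approximately) monotone in the parameter.
def tune_parameter(run_queries, target_recall, lo=1, hi=100_000):
    best = hi
    while lo <= hi:
        mid = (lo + hi) // 2
        if run_queries(mid) >= target_recall:   # parameter large enough: try smaller values
            best, hi = mid, mid - 1
        else:                                   # recall too low: search larger values
            lo = mid + 1
    return best
```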
Unsurprisingly, we found that the best prediction time is clearly data-dependent. Table I lists the total number of points examined (search cost) under different prediction times (recall already obtained) for a collection of datasets (detailed descriptions are given in Section VI-A) using AdaptNN with the HNSW method. As one can see, 1) the optimal prediction time varies across datasets, and 2) choosing an inappropriate prediction time causes significant performance degradation. For example, the maximum search cost is twice as much as the minimum for Msong.
The limitations of AdaptNN motivate us to look for a radically different way to perform adaptive ANN query processing. It is claimed in [35] that “static features such as the query vector itself are not sufficient to predict this termination condition”. Counter-intuitively, however, we observe that query vectors themselves, with the help of an intermediate feature, the local intrinsic dimension, are sufficient to fulfill this goal. Next, we present the core ideas and workflow of Tao.
IV Prediction with Static Features Made Possible
In this section, we first introduce the core notion of our method, i.e., the local intrinsic dimension, and then explore the feasibility of predicting LID using a regression model. The correlation between LID and search cost is then examined based on two representative indexing schemes, i.e., IMI and HNSW.
IV-A Local Intrinsic Dimension
Many learning tasks involve data represented as vectors of dimension $d$. For example, an image representation is an embedding function that transforms the raw pixel representation of the image to a point in a high-dimensional vector space. While the data are embedded in the space $\mathbb{R}^d$, this does not necessarily mean that their intrinsic dimension (ID) is $d$.
The intrinsic dimensionality of a representation refers to the minimum number of parameters (or degrees of freedom) necessary to capture the entire information present in the representation [8]. Equivalently, it refers to the dimensionality of the $m$-dimensional manifold embedded within the $d$-dimensional ambient (representation) space, where $m \le d$.
ID has wide applications in many machine learning and data mining contexts. For example, most dimension reduction techniques require that a target dimension be provided by the user. Ideally, the supplied dimension should depend on the intrinsic dimensionality of the data. This has served to motivate the development of models of ID, as well as accurate estimators.
Over the past few decades, many practical models of the intrinsic dimensionality of datasets have been proposed. Examples include Principal Component Analysis and its variants [7], as well as several manifold learning techniques [31]. Topological approaches to ID estimate the basis dimension of the tangent space of the data manifold from local samples. Fractal methods such as the Correlation Dimension estimate an intrinsic dimension from the space-filling capacity of the data [9]. Graph-based methods use the $k$-nearest neighbor graph along with density in order to estimate ID [11].
The aforementioned intrinsic dimensionality measures can be described as ‘global’, in that they consider the dimensionality of a given set as a whole, without any individual object being given a special role. In other words, all vectors share the same intrinsic dimension. In contrast, ‘local’ ID measures are defined as those that involve only the $k$-nearest neighbor distances of a specific location in the space.
Several local intrinsic dimensionality models have been proposed recently, such as the expansion dimension [30], the generalized expansion dimension [25], the minimum neighbor distance [45], and the local continuous intrinsic dimension (LID) [24]. In this paper we focus on LID, which is defined formally as follows:
Definition 1
([24]) Given an absolutely continuous random distance variable $X$, for any distance threshold $x$ such that the cumulative distribution function $F_X(x) > 0$, the local continuous intrinsic dimension of $X$ at distance $x$ is given by
$$\mathrm{LID}_X(x) = \lim_{\epsilon \to 0^{+}} \frac{\ln F_X\big((1+\epsilon)\,x\big) - \ln F_X(x)}{\ln (1+\epsilon)},$$
wherever the limit exists.
LID quantifies intrinsic dimension in terms of the rate at which the number of encountered objects grows as the considered range of distances expands from a reference location. Estimates of local ID constitute a measure of the complexity of data and LID could give researchers and practitioners more insight into the nature of their data, and therefore help them improve the efficiency and efficacy of their applications [47].
Amsaleg et al. studied several estimators of LID based on maximum likelihood estimation (MLE), the method of moments, probability weighted moments and regularly varying functions [1]. Experimental results show that the performance of the different estimators is largely in agreement with one another, and the faster initial convergence of MLE favors its choice for applications where the number of available query-to-neighbor distances is limited, or where time complexity is an issue.
Assume that we are given a sequence $r_1, \ldots, r_k$ of observations of a random distance variable with support $[0, \infty)$, in ascending order, that is, $r_1 \le r_2 \le \cdots \le r_k$. Then, the MLE estimator of LID can be calculated as
$$\widehat{\mathrm{LID}} = -\left(\frac{1}{k}\sum_{i=1}^{k}\ln\frac{r_i}{r_k}\right)^{-1}. \qquad (1)$$
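A minimal numpy sketch of this estimator is given below; `dists` is assumed to hold the distances from a point to its $k$ nearest neighbors in ascending order.

```python
# MLE estimator of LID from Equation (1).
import numpy as np

def lid_mle(dists: np.ndarray) -> float:
    r_k = dists[-1]                              # distance to the k-th (farthest) neighbor
    return -1.0 / np.mean(np.log(dists / r_k))   # Equation (1)
```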


Table II: Quality of the vector-to-LID regression (lower MAE/RMSE is better; higher R2 score is better).

Dataset | MAE | RMSE | R2 Score
---|---|---|---
Deep10M | 1.15 | 1.88 | 0.68 |
Sift10M | 0.65 | 0.89 | 0.89 |
Gist | 1.24 | 1.72 | 0.89 |
ImageNet | 0.77 | 1.07 | 0.91 |
Glove | 1.80 | 2.48 | 0.88 |
MSong | 0.67 | 1.22 | 0.80 |
Trevi | 3.17 | 4.41 | 0.82 |
IV-B Vector to LID Mapping
By definition, the LID number of a vector $x$ is closely related to the distribution of its near neighbors, which suggests that LID might be a promising indicator of the difficulty of finding the NN of $x$. Unfortunately, for an unseen query, there is no way to calculate the LID value without knowing its NNs. To circumvent this dilemma, we explore the possibility of predicting LID for unseen vectors using neural networks, considering that they are extremely good at capturing complex relationships in high dimensional spaces [33].
Suppose dataset $\mathcal{X}$ is a sample drawn from a population that follows some unknown probability distribution $P$. For any point $x \in \mathcal{X}$, we can calculate the LID estimate of $x$ using Equation (1), where $r_i$ is the distance between $x$ and its $i$-th NN in $\mathcal{X}$. For unseen vectors, assume there exists a function $f$ that relates any sample vector drawn from $P$ to a single LID value. The universal approximation theorem tells us that a neural network can approximate $f$, given that $f$ is continuous on a compact set and the neural network has sufficiently many hidden neurons with suitable activation functions [12, 23].
Verifying the continuity of $f$ analytically is impossible because we do not even have a closed-form expression for $f$. Instead, we demonstrate the feasibility by empirically evaluating the approximation errors on several real datasets. Particularly, we trained an MLP network with two hidden fully-connected layers of 200 neurons each (i.e., width 200) and ReLU activation functions. The inputs to the neural network are exclusively the original vectors and the outputs are the predicted LID values. We use the training vectors listed in Table III to generate training data and the query vectors to generate testing data.
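A minimal Keras sketch of this regressor is shown below, assuming the training vectors `X_train` and their LID targets `y_train` (computed with Equation (1)) have already been prepared; the layer sizes follow the description above.

```python
# Vector-to-LID regressor: two fully-connected hidden layers of 200 ReLU units.
import numpy as np
from tensorflow import keras

def build_lid_regressor(dim: int) -> keras.Model:
    model = keras.Sequential([
        keras.Input(shape=(dim,)),
        keras.layers.Dense(200, activation="relu"),
        keras.layers.Dense(200, activation="relu"),
        keras.layers.Dense(1),          # predicted LID value
    ])
    model.compile(optimizer="adam", loss="mae")
    return model

# model = build_lid_regressor(X_train.shape[1])
# model.fit(X_train, y_train, epochs=200, batch_size=500, validation_split=0.1)
# lid_pred = model.predict(X_query)
```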
Table II summarizes common evaluation metrics for the regression model. For mean absolute error (MAE) and RMSE, lower is better; for the R2 score, higher is better. As we can see from the table, the LID of query vectors can be estimated reliably using our regression model: for example, the MAE of the prediction is at most 1.24 on five of the seven datasets. We formulate this observation as Property 1:
Property 1
Given a dataset $\mathcal{X}$, the LID values of the high-dimensional vectors in $\mathcal{X}$ can be estimated using a practical regression model with small approximation error.
We further illustrate the real and the predicted LID values in Figure 4 over Deep10M and Sift10M. The $x$-axis represents the LID values of vectors calculated using Equation (1), which are partitioned into bins of size 2. The average of the predicted LID values in each bin is marked on top of the bars. As we can see, the estimates match rather well with the real values for small LID on both Deep10M and Sift10M. For large LID, the regression model somewhat underestimates the target values, mainly due to the sparsity of training vectors with large LID numbers. Please note that similar trends are observed for other datasets; we do not report them here due to space limitation.
We also adopted gradient boosting decision tree models (using the LightGBM library [32]) to perform the vector-to-LID mapping, which achieves results similar to the MLP. This suggests that there exists an inherent correlation between vectors and LID for all the real datasets we have experimented with, and such correlation may be model-independent to some extent. We conjecture that the reason might be that these datasets, while represented in high-dimensional spaces, inherently reside on low-dimensional manifolds [21, 2]. As a result, the relation between vectors and LID can be captured easily by a reasonable regression model. We believe that a more in-depth analysis of this correlation deserves further study in its own right and is thus out of the scope of this paper.




IV-C Correlation between LID and Search Cost
LID aims to quantify the local intrinsic dimensionality of a feature space exclusively in terms of the distance distribution of neighbors of a specific location. It is well known that the dimensionality of the space in which a point resides has a significant impact on the difficulty of NN search [20]. Intuitively, the overhead of NN search is therefore likely correlated with the LID of the query vector.
The way to measure search cost varies over different indexing methods. For IMI, the amount of search is represented by the number of nearest clusters that need to be examined (nprobe). For HNSW, we use the number of distance evaluations to represent the amount of search because 1) the distance evaluation between a query and a database vector is a time-consuming task, and 2) the number of distance evaluations fluctuates greatly even with the same number of hops in the graph (efSearch).
Figure 5 illustrates the mean minimum search cost to find the ground truth nearest neighbor for query vectors with different LID using HNSW. The bin size is 2 and the search cost is averaged over each bin. A base-2 logarithmic scale is used for the $y$-axis. As one can see, for both Deep10M and Sift10M, the search cost increases roughly exponentially as LID grows, suggesting that LID is a promising static feature for predicting the termination condition.
Figure 6 shows the histogram of the mean minimum number of centroids examined to find the ground truth nearest neighbor using IMI. The LID bin size is set to 2 and the search cost is averaged over each bin. As with HNSW, there exists a strong correlation between LID and the number of centroids examined. It is worth mentioning that IMI and HNSW exhibit similar trends for the remaining datasets that have been experimented with, which are not reported here due to space limitation. We highlight such correlations as Property 2:
Property 2
Given an ANN search algorithm $\mathcal{A}$ and a dataset $\mathcal{X}$, the LID values of query vectors are positively correlated with the minimum amount of search needed by $\mathcal{A}$ to identify the true nearest neighbors in $\mathcal{X}$.
IV-D Remarks
As discussed above, Property 1 is data-dependent and Property 2 depends on both the data and the algorithm. Given an ANN search algorithm and a dataset, it is possible to accomplish the goal of adaptive query processing as long as these two properties hold. Such simple guidelines make the learning black box transparent to users to some extent, eliminate the hand-crafted feature selection process, and ease the application of Tao to other ANN algorithms and datasets.
V Adaptive ANN Search with Tao
In this section, we describe the prediction pipeline of the proposed learning framework and how ANN search algorithms, exemplified by IMI and HNSW, are integrated into Tao.
The general workflow. Our prediction pipeline consists of two phases, as illustrated in Figure 1. In Phase 1, a regression model takes query vectors as inputs and outputs predicted LID numbers. In Phase 2, a second regression model accepts a LID value and reports a numerical indicator suggesting how much search should be done for the query.
The inputs. Instead of combining a number of hand-crafted static and runtime features as inputs, Tao needs only a single independent variable, that is, the query vector itself.
The output. For each query, we expect to predict the minimum amount of search needed to obtain the ground truth nearest neighbor. Different indexing approaches may have different metrics to quantify the search cost, but what we need is simply a numerical value that is proportional to the search latency.
Model Selection and Training. According to the two properties we formulated, Tao is actually model-independent, meaning that any reasonable regression model can be used to fulfill the vector → LID → termination-condition prediction pipeline. In this paper, we choose two popular models, i.e., MLP and gradient boosting decision trees, to show the effectiveness of Tao. For MLP, we employ two distinct neural networks to fulfill the predictive tasks of the two phases, where the standard feed-forward structure with two fully-connected hidden layers is adopted and the ReLU activation function is used across all layers. The parameters of these two neural networks are discussed in detail in Section VI-B.
For the second one, we build and train two distinct gradient boosting decision tree models using the LightGBM library [32] to accomplish the predictive tasks needed by Tao. Since the prediction performance of LightGBM is similar to that of MLP, we only report the experimental results using MLP in this paper due to space limitation.
We use the training vectors in Table III to generate training/validation data and the query vectors to generate testing data. Each vector contributes one row of data, which includes both the output target value and the true LID value. To obtain the output target values, we first perform an exhaustive search to find the ground truth nearest neighbor(s), and then find the minimum amount of search needed to reach (one of) them. To calculate LID, we identify the top-1000 neighbors of each vector in the training set and compute LID values using Equation (1). We train the two neural networks independently, i.e., no joint learning techniques are used. For the first one, we choose a fixed number of training epochs (200) and batch sizes varying from 200 to 1000, depending on the size of the training set. For the second one, a smaller number of training epochs (20) is used and the batch sizes are the same as for the first.
Integration with IMI and HNSW. The integration of Tao with existing indexing algorithms such as IMI and HNSW is noninvasive, in that very few modifications have to be made to them. In particular, since Tao decouples the prediction models from query execution, we only need to change the termination conditions in the code bases, involving fewer than 5 lines of the core code for IMI and HNSW.
For IMI, the number of nearest clusters to search equals Max(thresh, multiplier × predicted search cost), where thresh equals the maximum target value in the training data. For HNSW, the number of distance evaluations equals Max(thresh, multiplier × predicted search cost). Similar to [35], multiplier is a scale factor that needs to be tuned in order to achieve a given accuracy target.
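A hedged sketch of how such a termination budget could be computed is given below (an illustration, not the paper's exact Algorithm 1); `lid_model` and `cost_model` stand for the Phase-1 and Phase-2 regressors, and `thresh` / `multiplier` follow the rule described above.

```python
# Compute the adaptive termination condition for one query before execution.
import numpy as np

def termination_budget(query, lid_model, cost_model, thresh, multiplier):
    lid = float(lid_model.predict(query[np.newaxis, :]))   # Phase 1: vector -> LID
    cost = float(cost_model.predict(np.array([[lid]])))    # Phase 2: LID -> search cost
    return max(thresh, int(multiplier * cost))             # Max(thresh, multiplier * pred)

# During search, HNSW would stop once the number of distance evaluations reaches
# the returned budget (similarly, IMI would stop after examining that many
# nearest clusters).
```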
Algorithm 1 sketches how Tao is integrated with HNSW.
VI Evaluation
In this section, we evaluate the performance of Tao by implementing it in the Faiss ANN search library (CPU version), as AdaptNN [35] does, which makes the comparison fair. All experiments are carried out on a server with an Intel Xeon E5-2620 CPU and 256GB of memory.
VI-A Datasets
We used seven publicly available real datasets, which are of different data dimensions, cardinalities and types. Table III lists the statistics of datasets we have experimented with. The sizes of these datasets range from millions to one billion.
Dataset | Dim. | Base | Training | Query | Type |
---|---|---|---|---|---|
Deep | 96 | 10M,1B | 1M | 10,000 | Image |
Sift | 128 | 10M,1B | 1M | 10,000 | Image |
Gist | 960 | 1M | 0.5M | 1,000 | Image |
ImageNet | 150 | 2.34M | 200,000 | 10,000 | Image |
Msong | 420 | 9.22M | 100,000 | 10,000 | Audio |
Trevi | 4096 | 1M | 20,000 | 1,000 | Image |
Glove | 100 | 1.19M | 100,000 | 10,000 | Text |
• Deep1B (https://github.com/facebookresearch/faiss/tree/master/) consists of 1 billion deep neural codes of natural images obtained from the activations of a convolutional neural network.
• Deep10M (https://github.com/facebookresearch/faiss/tree/master/) is a subset of Deep1B.
• Sift1B (http://corpus-texmex.irisa.fr/) consists of 1 billion 128-dimensional SIFT vectors.
• Sift10M (https://archive.ics.uci.edu/ml/datasets/SIFT10M) consists of 10 million 128-dimensional SIFT vectors.
• Gist (http://corpus-texmex.irisa.fr/) is an image dataset which contains about 1 million data points.
• ImageNet (http://cloudcv.org/objdetect/) contains about 2.4 million data points with 150-dimensional dense SIFT features.
• Msong (http://www.ifs.tuwien.ac.at/mir/msd/download.html) is a collection of audio features and metadata for millions of contemporary popular music tracks.
• Trevi (http://phototour.cs.washington.edu/patches/default.htm) consists of around 1,000,000 4096-dimensional vectors.
• Glove (http://nlp.stanford.edu/projects/glove/) contains 1.2 million 100-dimensional word feature vectors extracted from Tweets.
For the ImageNet/Msong/Trevi/Glove datasets, which either lack a training set or have only a small query set, we split a subset of the database to serve as the training set or query set. Taking ImageNet as an example, we take the first 200,000 vectors of the database as the training set, vectors 200,001 to 210,000 as the query set, and the rest as the new ImageNet base set.
VI-B Benchmark Methods
• Fixed Configuration. The implementations of IMI and HNSW in the Faiss ANN search library (CPU version) [35] are used. The parameters that control the answer quality, that is, efSearch (HNSW) and nprobe (IMI), are tuned to achieve the given target accuracy and fixed for all queries.
• AdaptNN. We use the training and parameter tuning methods described in [35] to obtain the optimal prediction time, which is both algorithm- and dataset-dependent. In other words, we report the best end-to-end search performance that AdaptNN can deliver for any given algorithm and dataset.
• Tao. Tao employs two neural networks as the regression models. Both neural networks employ two fully connected layers with ReLU activation functions. The first uses 200 neurons per fully connected layer; the second uses 10 neurons per fully connected layer.
• Vector Only (VO). VO combines the two neural networks of Tao into a single MLP with five hidden layers. To make the comparison fair, the structure of VO is the same as that of Tao, i.e., the five hidden layers are fully connected and the numbers of neurons in each layer are set to 200, 200, 1, 10 and 10, respectively. The only difference between VO and Tao is the training process: LID is not used explicitly in VO, and the input and output of VO's neural network are the query vectors and the amount of search needed to find the true NN, respectively.
VI-C Performance Metrics
The end-to-end latency and accuracy are two important metrics for evaluating ANN algorithms. To compare the performance of the baselines and the proposed approach, we perform controlled experiments that keep the accuracy achieved by all approaches at the same level, and then compare the average latency. Given an accuracy target, we perform binary search to find the minimum average latency for the fixed configuration baseline. For AdaptNN and Tao, we multiply the predicted search cost by a coefficient (tuning knob) to reach the desired accuracy. We then compare their performance at each accuracy target, where the prediction overhead is included in the end-to-end latency.
For the accuracy target, we use recall1 (the fraction of queries for which the top-1 nearest neighbor returned from the search is (one of) the ground truth nearest neighbors) for HNSW. For IMI, we use recall100 (the fraction of queries for which the top-100 nearest neighbors returned from the search include (one of) the ground truth nearest neighbors), since it is challenging for quantization-based approaches to reach a high recall1. We measure the average latency in the single-threaded setting as in previous work [35].
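A small sketch of these metrics is given below; `results` is assumed to be an (nq × k) matrix of returned ids and `ground_truth` a list holding, for each query, the set of ids of its ground truth nearest neighbor(s).

```python
# recall1 / recall100: fraction of queries whose top-k results hit a ground truth NN.
import numpy as np

def recall_at_k(results: np.ndarray, ground_truth, k: int) -> float:
    hits = sum(1 for i, gt in enumerate(ground_truth)
               if set(results[i, :k]) & set(gt))
    return hits / len(ground_truth)

# recall1   = recall_at_k(results, ground_truth, 1)     # used for HNSW
# recall100 = recall_at_k(results, ground_truth, 100)   # used for IMI
```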
VI-D Prediction Overhead
Since the parameters of the neural networks are fixed for all datasets, the model size is also a constant (560KB) in our experimental setting, which is similar to that of AdaptNN (274KB to 310KB). Making a prediction takes around 65us with the Keras framework for Tao (7us to 47us for AdaptNN). While the prediction overhead is already very small, a significant reduction to a few nanoseconds is possible with a plain C++ implementation [33]. We leave this as future work since it is not a dominant factor in the end-to-end latency.

Table IV: End-to-end latency at different recall100 targets with IMI.

Dataset | Recall | IMI | AdaptNN | Tao | Reduction over IMI | Reduction over AdaptNN
---|---|---|---|---|---|---
Deep1B | 0.95 | 39.84 ms | 33.69 ms | 16.18 ms | 59% | 52% |
0.96 | 51.21 ms | 36.65 ms | 17.77 ms | 65% | 52% | |
0.97 | 69.22 ms | 44.85 ms | 21.08 ms | 70% | 53% | |
0.98 | 95.90 ms | 69.09 ms | 26.68 ms | 72% | 61% | |
0.99 | 195.45 ms | 117.94 ms | 39.61 ms | 80% | 66% | |
0.995 | 275.20 ms | 127.40 ms | 47.39 ms | 83% | 63% | |
Sift1B | 0.95 | 38.69 ms | 35.53 ms | 30.93 ms | 20% | 13% |
0.96 | 46.28 ms | 39.98 ms | 32.27 ms | 30% | 19% | |
0.97 | 57.59 ms | 46.63 ms | 39.13 ms | 32% | 16% | |
0.98 | 70.78 ms | 62.79 ms | 50.71 ms | 28% | 19% | |
0.99 | 117.68 ms | 98.30 ms | 69.49 ms | 41% | 29% | |
0.995 | 181.21 ms | 100.73 ms | 86.71 ms | 52% | 14% |
VI-E Empirical evaluation with IMI
Figure 7 compares the average end-to-end latency of plain IMI, AdaptNN and Tao. We choose the IMI index with OPQ compression as the baseline, which is one of the state-of-the-art approaches for billion-scale ANN search. We build the IMI index with $K^2$ = 268,435,456 clusters.
All three methods achieve the highest accuracy, i.e., 0.995, for both Deep1B and Sift1B. We stop at this recall target because it takes too long to reach 1.0 recall on billion-scale databases. Overall, our approach provides up to 1.16x and 2.69x speedup on Sift1B and Deep1B, respectively.
To see the big picture, Table IV lists the latencies of the three methods at recall100 targets between 0.95 and 0.995. As the target accuracy increases, the gain of Tao over AdaptNN becomes more and more significant and seems to saturate at the highest recall. The reasons may be two-fold: 1) LID is more informative than the features collected by AdaptNN; 2) Tao uses only the training data obtained at 1.0 recall, whereas AdaptNN is trained at a number of accuracy targets ranging from 0.95 to 0.995. It is worth noting that Tao does not take advantage of any runtime feature as AdaptNN does, and still outperforms its counterpart with the optimal configuration.


Table V: Ablation study over Sift10M: Tao with predicted LID vs. real LID.

Recall | HNSW | Tao | Real LID
---|---|---|---
0.95 | 0.82 ms | 0.79 ms | 0.69 ms |
0.96 | 1.01 ms | 0.95 ms | 0.75 ms |
0.97 | 1.08 ms | 1.02 ms | 0.83 ms |
0.98 | 1.12 ms | 1.05 ms | 0.95 ms |
0.99 | 1.36 ms | 1.28 ms | 1.24 ms |
1.0 | 17.48 ms | 4.28 ms | 3.81 ms |
VI-F Empirical evaluation with HNSW
Figure 8 plots the average end-to-end latencies of plain HNSW, AdaptNN and Tao at the highest recalls we have reached. Because of graph connectivity issues, it is impossible to find the nearest neighbor for a few queries within a reasonable time budget; hence, we stop at different recall targets for different datasets. Overall, Tao provides a consistent speedup over AdaptNN, from 1.05x to 1.2x, for the recall1 measure.
Table VIII presents the detailed numbers for different accuracy targets. As one can see, the performance gaps among the three methods are less significant at relatively low recall, say 0.9 for Glove and 0.95 for the other datasets. As the target recall increases, both AdaptNN and Tao perform better than the original algorithms thanks to their ability to distinguish ‘easy’ queries from ‘hard’ ones.
Table V lists the results of an ablation study over Sift10M: the performance of Tao after removing the first regression model. We manually calculate LIDs using Equation (1) for all queries and pass them to the second neural network. As one can see, the results using the predicted LID are only slightly inferior to those using the real LID, which demonstrates the efficacy of the vector-to-LID mapping.
Figure 9 depicts, in a semi-log plot over Deep10M, the average minimum search cost (obtained by the Oracle) and the overheads incurred by Tao and by the fixed configuration. Partitioning the 10,000 queries, sorted in ascending order of minimum search cost, into 10 equal bins, we compute the average minimum search cost in each bin. As we can see, there is significant variation in the optimal search cost. As discussed before, HNSW with a fixed configuration cannot exploit this fact and thus incurs the largest latency. Tao, instead, can assign smaller search budgets to ‘easy’ queries, by which better performance is obtained. Note that the room for optimization is still large, since what Tao predicts is still far from the Oracle, calling for more effective features and/or prediction models to shrink this gap.
VI-G Comparison between Tao and Vector Only Method


Table VI: End-to-end latency at the highest recall with IMI.

Dataset | Recall | IMI | AdaptNN | Vector Only | Tao
---|---|---|---|---|---
Deep1B | 0.995 | 275.20 ms | 127.40 ms | 260.29 ms | 47.39 ms |
Sift1B | 0.995 | 181.21 ms | 100.73 ms | 108.58 ms | 86.71 ms |
Table VII: End-to-end latency at the highest recall with HNSW.

Dataset | Recall | HNSW | AdaptNN | Vector Only | Tao
---|---|---|---|---|---
Deep10M | 0.999 | 19.47 ms | 5.88 ms | 48.21 ms | 4.87 ms |
Sift10M | 1.0 | 17.48 ms | 5.41 ms | 40.11 ms | 4.28 ms |
Gist | 0.999 | 264.12 ms | 40.93 ms | 314.35 ms | 34.32 ms |
ImageNet | 0.998 | 303.34 ms | 36.46 ms | 24.52 ms | 26.56 ms |
Glove | 0.9668 | 1486.82 ms | 65.24 ms | 76.07 ms | 61.46 ms |
MSong | 1.0 | 5.48 ms | 4.26 ms | 4.67 ms | 4.04 ms |
Trevi | 0.997 | 45.85 ms | 19.41 ms | 67.51 ms | 18.54 ms |
Figure 10 compares the latency of IMI, Tao and VO at the highest recall we achieved, and Table VI lists the detailed numbers, including the results of AdaptNN. As we can see: 1) by leveraging the information drawn from query vectors, VO performs better than plain IMI; 2) without using any runtime features, VO is inferior to AdaptNN, validating the claim made in [35]; 3) LID does matter and helps Tao obtain the best latency among all approaches. The experimental results for HNSW are in agreement with those for IMI, as shown in Figure 11 and Table VII. One interesting observation is that, in some cases, the latency of VO is even worse than that of the original HNSW, which is caused by incorrect predictions.
This set of experimental results indicates that 1) the neural network of VO cannot learn the correlation between query vectors and the amount of NN search without the help of LID, and 2) the explicit use of LID not only delivers better performance and makes Tao easy to use in practice, but also makes the learning process more explainable for users compared with AdaptNN and the Vector Only method. This advantage is highly desirable nowadays, since explainability is key for users to understand the conclusions and recommendations of a prediction model.
VII Related Work
A large number of ANN search algorithms are available in the literature, so this section cannot be exhaustive due to space limitation. The latest benchmarks [36] show that no algorithm dominates the others in all scenarios, and each index comes with different tradeoffs in performance, accuracy and space overhead.
VII-A Hashing-based approaches
For high-dimensional approximate search, one of the most well-known indexing methods is locality sensitive hashing (LSH) [20]. The main idea is to use a family of locality-sensitive hash functions to hash nearby data points into the same bucket. The query point is passed through the same hash functions to obtain the corresponding bucket numbers, and distances are computed only between the query point and the points in those buckets. In the end, the approximate nearest neighbors closest to the query point are returned. In the past two decades, many LSH-based variants have been proposed, such as E2LSH [14], QALSH [26], Multi-Probe LSH [40] and BayesLSH [46].
E2LSH, the classical LSH implementation for the $\ell_2$ norm, cannot solve the $c$-ANN search problem directly. In practice, one has to either assume there exists a “magical” radius $r$, which can lead to arbitrarily bad outputs, or use multiple hash tables tailored to different radii, which may lead to prohibitively large space consumption for the index. To reduce the storage cost, LSB-Forest [51] and C2LSH [18] use the so-called virtual rehashing technique, implicitly or explicitly, to avoid building physical hash tables for each search radius.
Based on the idea of query-aware hashing, the two state-of-the-art algorithms QALSH [26] and SRS [49] further improve the efficiency over C2LSH by using different index structures and search methods, respectively. SRS reduces the ANN search problem in a $d$-dimensional space to a range query in an $m$-dimensional projection space (with $m$ typically a small constant), and uses an R-tree to fulfill this purpose. Recently, Lv et al. proposed VHP, which achieves better efficiency by ingeniously restricting the search space to the intersection of those of QALSH and SRS. All of the aforementioned LSH algorithms provide probability guarantees on the result quality (recall and/or precision).
Other LSH extensions such as Multi-probe LSH [40], SK-LSH [37], LSH-forest [5] and Selective hashing [19] use heuristics to access more plausible buckets or re-organize datasets, and do not ensure any LSH-like theoretical guarantee. A recent paper discussed how to select better hash functions using a deep learning approach [50].
VII-B Quantization-based approaches
The most influential vector quantization method for ANN search is Product Quantization (PQ) [27]. It seeks to perform a dimension reduction similar to hashing algorithms, but in a way that better retains information about the relative distances between points in the original vector space. Formally, a quantizer is a function $q$ mapping a $d$-dimensional vector $x \in \mathbb{R}^d$ to a vector $q(x) \in \mathcal{C} = \{c_i \mid i \in \mathcal{I}\}$, where the index set $\mathcal{I}$ is assumed to be finite: $\mathcal{I} = \{0, \ldots, k-1\}$. The reproduction values $c_i$ are called centroids. The set of vectors mapped to a given index $i$ is referred to as a cell, and defined as
$$V_i = \{x \in \mathbb{R}^d : q(x) = c_i\}.$$
The cells of a quantizer form a partition of $\mathbb{R}^d$, so all the vectors lying in the same cell $V_i$ are reconstructed by the same centroid $c_i$. Due to the huge number of samples required and the complexity of learning a quantizer with many centroids, PQ uses $m$ distinct quantizers to quantize the subvectors separately. An input vector $x$ is divided into $m$ distinct subvectors $u_1(x), \ldots, u_m(x)$, each of dimension $d^{*} = d/m$, and is mapped as follows:
$$x \mapsto \big(q_1(u_1(x)), \ldots, q_m(u_m(x))\big),$$
where $q_j$ is a low-complexity quantizer associated with the $j$-th subvector. The codebook is then defined as the Cartesian product
$$\mathcal{C} = \mathcal{C}_1 \times \cdots \times \mathcal{C}_m,$$
and a centroid of this set is the concatenation of the centroids of the $m$ subquantizers. Assuming all subquantizers have the same finite number $k^{*}$ of reproduction values, the total number of centroids is $k = (k^{*})^{m}$.
PQ offers three attractive properties: (1) PQ compresses an input vector into a short code (e.g., 64-bits), which enables it to handle typically one billion data points in memory; (2) the approximate distance between a raw vector and a compressed PQ code is computed efficiently (the so-called asymmetric distance computation (ADC) and the symmetric distance computation (SDC)), which is a good estimation of the original Euclidean distance; and (3) the data structure and coding algorithm are simple, which allow it to hybridize with other indexing structures.
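To make the encoding and ADC steps concrete, a compact numpy sketch is given below, assuming the $m$ sub-codebooks have already been trained (e.g., by $k$-means on each subvector block).

```python
# PQ encoding and asymmetric distance computation (ADC), numpy sketch.
import numpy as np

def pq_encode(x, codebooks):
    """codebooks: list of m arrays of shape (k_star, d/m); returns m sub-codeword ids."""
    subs = np.split(x, len(codebooks))
    return np.array([np.argmin(np.linalg.norm(cb - u, axis=1))
                     for u, cb in zip(subs, codebooks)])

def adc_distance(query, code, codebooks):
    """Approximate squared distance between a raw query and a PQ code."""
    subs = np.split(query, len(codebooks))
    return sum(np.sum((u - cb[c]) ** 2)
               for u, cb, c in zip(subs, codebooks, code))
```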
Original PQ requires examining all vectors during ANN search. To handle billion-scale datasets, advanced indexing structures such as IMI and IVF are developed [3, 27].
Table VIII: End-to-end latency at different recall1 targets with HNSW.

Dataset | Recall | HNSW | AdaptNN | Tao | Reduction over HNSW | Reduction over AdaptNN
---|---|---|---|---|---|---
Deep10M | 0.95 | 0.77 ms | 0.69 ms | 0.76 ms | 1% | / |
0.96 | 0.82 ms | 0.77 ms | 0.79 ms | 4% | / | |
0.97 | 1.01 ms | 0.91 ms | 0.95 ms | 6% | / | |
0.98 | 1.27 ms | 1.15 ms | 1.21 ms | 6% | / | |
0.99 | 1.77 ms | 1.45 ms | 1.53 ms | 14% | / | |
0.999 | 19.47 ms | 5.88 ms | 4.87 ms | 75% | 17% | |
Sift10M | 0.95 | 0.82 ms | 0.73 ms | 0.79 ms | 4% | / |
0.96 | 1.01 ms | 0.89 ms | 0.95 ms | 5% | / | |
0.97 | 1.08 ms | 0.96 ms | 1.02 ms | 6% | / | |
0.98 | 1.12 ms | 1.03 ms | 1.05 ms | 6% | / | |
0.99 | 1.36 ms | 1.24 ms | 1.28 ms | 6% | / | |
1.0 | 17.48 ms | 5.41 ms | 4.28 ms | 76% | 21% | |
Gist | 0.95 | 3.90 ms | 3.57 ms | 3.69 ms | 5% | / |
0.96 | 4.86 ms | 4.23 ms | 4.56 ms | 6% | / | |
0.97 | 6.17 ms | 5.41 ms | 5.71 ms | 7% | / | |
0.98 | 7.66 ms | 7.06 ms | 7.13 ms | 7% | / | |
0.99 | 14.40 ms | 10.92 ms | 10.23 ms | 29% | 6% | |
0.999 | 264.12 ms | 40.93 ms | 34.32 ms | 87% | 16% | |
ImageNet | 0.95 | 1.23 ms | 0.66 ms | 0.86 ms | 30% | / |
0.96 | 1.45 ms | 0.76 ms | 0.93 ms | 36% | / | |
0.97 | 1.79 ms | 0.91 ms | 1.14 ms | 36% | / | |
0.98 | 2.72 ms | 1.28 ms | 1.46 ms | 46% | / | |
0.99 | 3.82 ms | 2.70 ms | 2.36 ms | 38% | 13% | |
0.998 | 303.34 ms | 36.46 ms | 26.56 ms | 91% | 27% | |
Glove | 0.9 | 5.08 ms | 3.79 ms | 4.49 ms | 12% | / |
0.95 | 94.93 ms | 13.10 ms | 16.09 ms | 83% | / | |
0.96 | 335.25 ms | 23.01 ms | 22.12 ms | 93% | 4% | |
0.9668 | 1486.82 ms | 65.24 ms | 61.46 ms | 96% | 6% | |
Msong | 0.95 | 0.60 ms | 0.53 ms | 0.58 ms | 3% | / |
0.96 | 0.78 ms | 0.67 ms | 0.74 ms | 5% | / | |
0.97 | 0.89 ms | 0.73 ms | 0.76 ms | 15% | / | |
0.98 | 1.01 ms | 0.65 ms | 0.83 ms | 17% | / | |
0.99 | 1.56 ms | 0.82 ms | 1.06 ms | 32% | / | |
1.0 | 5.48 ms | 4.26 ms | 4.04 ms | 26% | 5% | |
Trevi | 0.95 | 2.20 ms | 2.03 ms | 1.77 ms | 20% | / |
0.96 | 2.66 ms | 2.01 ms | 2.09 ms | 21% | / | |
0.97 | 3.81 ms | 2.64 ms | 2.88 ms | 24% | / | |
0.98 | 6.07 ms | 3.76 ms | 4.18 ms | 31% | / | |
0.99 | 8.99 ms | 5.33 ms | 6.23 ms | 31% | / | |
0.997 | 45.85 ms | 19.41 ms | 18.54 ms | 60% | 4% |
VII-C Graph-based approaches
The vast majority of graph-based indexing schemes are designed based on three types of theoretical graph models, that is, Delaunay Graphs [34], Relative Neighbor Graphs [52] and $k$-Nearest Neighbor Graphs [42, 15]. In practice, however, most search graphs are essentially approximate $k$-nearest neighbor graphs, which mainly distinguish themselves from one another in their edge selection policies.
Inspired by HNSW, several follow-up projects aim to improve the proximity graph-based approach. Douze et al. propose to combine HNSW with quantization [16]. Navigating Spreading-out Graph (NSG) aims to reduce the graph edge density while keeping the search accuracy [17]. SPTAG combines the IVF index and proximity graph for distributed ANN search [10]. As with HNSW, all of these proximity graph variants employ fixed configurations to perform a fixed amount of graph traversal for all queries. Interested readers are referred to a recent survey for more detailed discussion [56], where useful recommendations and guidelines are given for users to choose appropriate graph-based algorithms.
In this subsection, we would like to highlight a few studies focusing on external-memory graph-based indices, which may be of great interest to database researchers. This line of research is imperative since the size of a dataset can easily exceed the capacity of the available main memory in a commodity server, as more and more vectors are produced with the ubiquity of machine learning techniques.
To the best of our knowledge, Bin Wang et al. first presented ideas for exploiting slow storage to achieve billion-scale ANNS on a single machine [53]. HM-ANN maps the hierarchical design of graph-based ANNS onto heterogeneous memory (HM), so that most accesses happen in the upper layers stored in fast memory, avoiding expensive accesses to slow memory without sacrificing accuracy [44]. Vamana analyzes the construction details of HNSW and NSG, and uses the random neighbor selection strategy of HNSW to increase the flexibility of edge selection; as a result, it is able to construct a high-recall SSD-resident index on a single server with 64GB of memory [48]. Minjia Zhang et al. presented an ANN solution, ZOOM, that uses SSD and employs a multi-view approach; through efficient routing and optimized distance computation, ZOOM greatly enhances effectiveness and efficiency while attaining equal or higher accuracy [58].
VIII Conclusion and Future Work
In this paper, we propose Tao, a general learning framework for adaptive ANN search in high-dimensional spaces. Using LID as an intermediate feature, Tao decouples the prediction phase from query processing and eliminates the laborious parameter tuning work involved in AdaptNN. Experimental results show that Tao achieves comparable or even better performance compared to AdaptNN. A few interesting problems are worth further investigation. Tao unveils a hidden correlation between query vectors and their LID values, which might deserve further study in its own right. Given a specific ANN search algorithm, how LID affects the difficulty of ANN search in high-dimensional spaces is also an open problem of both theoretical and practical importance.
Acknowledgment
This work is supported partially by the Fundamental Research Funds for the Central Universities (No. 2232021A-08), the NSF of Xinjiang Key Laboratory (No. 2019D04024), and the Tianjin “Project + Team” Key Training Project (No. XC202022). Wei Wang was supported by ARC DPs 170103710 and 180103411, and D2DCRC grants DC25002 and DC25003.
References
- [1] Laurent Amsaleg, Oussama Chelly, Teddy Furon, Stéphane Girard, Michael E. Houle, Ken-ichi Kawarabayashi, and Michael Nett. Estimating local intrinsic dimensionality. In SIGKDD, pages 29–38, 2015.
- [2] Peter Andras. Function approximation using combined unsupervised and supervised learning. TNNLS, 25(2):300–315, 2014.
- [3] Artem Babenko and Victor Lempitsky. The inverted multi-index. TPAMI, 37(6):1247–1260, 2014.
- [4] Artem Babenko and Victor Lempitsky. Efficient indexing of billion-scale datasets of deep descriptors. In CVPR, pages 2055–2063, 2016.
- [5] Mayank Bawa, Tyson Condie, and Prasanna Ganesan. LSH forest: self-tuning indexes for similarity search. In WWW, pages 651–660, 2005.
- [6] Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James P Drake, Jane M Landolin, and Adam M Phillippy. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature biotechnology, 33(6):623–630, 2015.
- [7] Charles Bouveyron, Gilles Celeux, and Stéphane Girard. Intrinsic dimension estimation by maximum likelihood in isotropic probabilistic PCA. PRL, 32(14):1706–1713, 2011.
- [8] Francesco Camastra and Antonino Staiano. Intrinsic dimension estimation: Advances and open problems. ISCI, 328:26–41, 2016.
- [9] Francesco Camastra and Alessandro Vinciarelli. Estimating the intrinsic dimension of data with a fractal-based method. TPAMI, 24, 2002.
- [10] Qi Chen, Haidong Wang, Mingqin Li, Gang Ren, Scarlett Li, Jeffery Zhu, Jason Li, Chuanjie Liu, Lintao Zhang, and Jingdong Wang. SPTAG: A library for fast approximate nearest neighbor search, 2018.
- [11] Jose A. Costa and Alfred O. Hero III. Entropic graphs for intrinsic dimension estimation in manifold learning. In ISIT, page 466, 2004.
- [12] George Cybenko. Approximation by superpositions of a sigmoidal function. MCSS, 2(4):303–314, 1989.
- [13] Abhinandan Das, Mayur Datar, Ashutosh Garg, and Shyamsundar Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271–280, 2007.
- [14] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253–262, 2004.
- [15] Wei Dong, Moses Charikar, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In WWW, pages 577–586, 2011.
- [16] Matthijs Douze, Alexandre Sablayrolles, and Hervé Jégou. Link and code: Fast indexing with graphs and compact regression codes. In CVPR, pages 3646–3654, 2018.
- [17] Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. Fast approximate nearest neighbor search with the navigating spreading-out graph. PVLDB, 12(5):461–474, 2019.
- [18] Junhao Gan, Jianlin Feng, Qiong Fang, and Wilfred Ng. Locality-sensitive hashing scheme based on dynamic collision counting. In SIGMOD, pages 541–552, 2012.
- [19] Jinyang Gao, H. V. Jagadish, Beng Chin Ooi, and Sheng Wang. Selective hashing: Closing the gap between radius search and k-nn search. In SIGKDD, pages 349–358, 2015.
- [20] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518–529, 1999.
- [21] S. Haykin. Neural networks and learning machines. Pearson Schweiz Ag, 2008.
- [22] Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. KORE: keyphrase overlap relatedness for entity disambiguation. In CIKM, pages 545–554, 2012.
- [23] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
- [24] Michael E. Houle. Dimensionality, discriminability, density and distance distributions. In ICDM, pages 468–473, 2013.
- [25] Michael E. Houle, Hisashi Kashima, and Michael Nett. Generalized expansion dimension. In ICDM, pages 587–594, 2012.
- [26] Qiang Huang, Jianlin Feng, Yikai Zhang, Qiong Fang, and Wilfred Ng. Query-aware locality-sensitive hashing for approximate nearest neighbor search. PVLDB, 9(1):1–12, 2015.
- [27] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. TPAMI, 33:117–128, 2010.
- [28] Hervé Jégou, Romain Tavenard, Matthijs Douze, and Laurent Amsaleg. Searching in one billion vectors: re-rank with source coding. In ICASSP, pages 861–864, 2011.
- [29] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Trans. Big Data, 7(3):535–547, 2021.
- [30] David R. Karger and Matthias Ruhl. Finding nearest neighbors in growth-restricted metrics. In STOC, pages 741–750, 2002.
- [31] Juha Karhunen and Jyrki Joutsensalo. Representation and separation of signals using nonlinear PCA type learning. NN, 7, 1994.
- [32] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In NIPS, pages 3146–3154, 2017.
- [33] Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. In SIGMOD, pages 489–504, 2018.
- [34] D. T. Lee and Bruce J. Schachter. Two algorithms for constructing a delaunay triangulation. IJPP, 9(3):219–242, 1980.
- [35] Conglong Li, Minjia Zhang, David G. Andersen, and Yuxiong He. Improving approximate nearest neighbor search through learned adaptive early termination. In SIGMOD, pages 2539–2554, 2020.
- [36] Wen Li, Ying Zhang, Yifang Sun, Wei Wang, Mingjie Li, Wenjie Zhang, and Xuemin Lin. Approximate nearest neighbor search on high dimensional data - experiments, analyses, and improvement. IEEE Trans. Knowl. Data Eng., 32(8):1475–1488, 2020.
- [37] Yingfan Liu, Jiangtao Cui, Zi Huang, Hui Li, and Heng Tao Shen. SK-LSH: an efficient index structure for approximate nearest neighbor search. PVLDB, 7(9):745–756, 2014.
- [38] Kejing Lu, Hongya Wang, Wei Wang, and Mineichi Kudo. VHP: approximate nearest neighbor search via virtual hypersphere partitioning. VLDB, 13(9):1443–1455, 2020.
- [39] Qin Lv, Moses Charikar, and Kai Li. Image similarity search with compact data structures. In CIKM, pages 208–217, 2004.
- [40] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In VLDB, pages 950–961, 2007.
- [41] Yury A Malkov and Dmitry A Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. TPAMI, 42(4):824–836, 2018.
- [42] Rodrigo Paredes, Edgar Chávez, Karina Figueroa, and Gonzalo Navarro. Practical construction of k-nearest neighbor graphs in metric spaces. In WEA, volume 4007, pages 85–97, 2006.
- [43] Jianbin Qin, Wei Wang, Chuan Xiao, and Ying Zhang. Similarity query processing for high-dimensional data. VLDB, 13(12):3437–3440, 2020.
- [44] Jie Ren, Minjia Zhang, and Dong Li. HM-ANN: efficient billion-point nearest neighbor search on heterogeneous memory. In NIPS, 2020.
- [45] Alessandro Rozza, Gabriele Lombardi, Claudio Ceruti, Elena Casiraghi, and Paola Campadelli. Novel high intrinsic dimensionality estimators. Mach. Learn., 89(1-2):37–65, 2012.
- [46] Venu Satuluri and Srinivasan Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. PVLDB, 5(5):430–441, 2012.
- [47] Uri Shaft and Raghu Ramakrishnan. Theory of nearest neighbors indexability. TODS, 31(3):814–838, 2006.
- [48] Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnaswamy, and Rohan Kadekodi. Rand-nsg: Fast accurate billion-point nearest neighbor search on a single node. In NIPS, pages 13748–13758, 2019.
- [49] Yifang Sun, Wei Wang, Jianbin Qin, Ying Zhang, and Xuemin Lin. SRS: solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index. PVLDB, pages 1–12, 2014.
- [50] Xiu Tang, Sai Wu, Gang Chen, Jinyang Gao, Wei Cao, and Zhifei Pang. A learning to tune framework for LSH. In ICDE, pages 2201–2206, 2021.
- [51] Yufei Tao, Ke Yi, Cheng Sheng, and Panos Kalnis. Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. TODS, 35(3):20:1–20:46, 2010.
- [52] Godfried T. Toussaint. The relative neighbourhood graph of a finite planar set. PR, 12, 1980.
- [53] Bin Wang, Bo Wu, Dong Li, Xipeng Shen, Weikuan Yu, Yizheng Jiao, and Jeffrey S. Vetter. Exploring hybrid memory for GPU energy efficiency through software-hardware co-design. In PACT, pages 93–102, 2013.
- [54] Hongya Wang, Zhizheng Wang, Wei Wang, Yingyuan Xiao, Zeng Zhao, and Kaixiang Yang. A note on graph-based nearest neighbor search. CoRR, abs/2012.11083, 2020.
- [55] Jianguo Wang, Xiaomeng Yi, Rentong Guo, Hai Jin, Peng Xu, Shengjun Li, Xiangyu Wang, Xiangzhou Guo, Chengming Li, Xiaohai Xu, Kun Yu, Yuxing Yuan, Yinghao Zou, Jiquan Long, Yudong Cai, Zhenxiang Li, Zhifeng Zhang, Yihua Mo, Jun Gu, Ruiyi Jiang, Yi Wei, and Charles Xie. Milvus: A purpose-built vector data management system. In SIGMOD, pages 2614–2627, 2021.
- [56] Mengzhao Wang, Xiaoliang Xu, Qiang Yue, and Yuxiang Wang. A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. CoRR, abs/2101.12631, 2021.
- [57] Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194–205, 1998.
- [58] Minjia Zhang and Yuxiong He. ZOOM: SSD-based vector search for optimizing accuracy, latency and memory. CoRR, abs/1809.04067, 2018.