

Department of Computer Science & Engineering, Indian Institute of Technology Bombay

Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection

Harsh Shah (ORCID 0009-0008-0688-9433), Kashish Mittal (ORCID 0009-0004-7003-6444), Ajit Rajwade (ORCID 0000-0001-6463-3315)
Abstract

This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem. Our method efficiently marks each item in a database as neighbor or non-neighbor of a query point, based on a cosine distance threshold, without exhaustive search. Like other methods for large scale retrieval, our approach exploits the assumption that most of the items in the database are unrelated to the query. Unlike other methods, it does not assume a large gap between the cosine similarity of the query vector with the least similar true neighbor and that with the most similar non-neighbor. Following a multi-stage adaptive group testing algorithm based on binary splitting, we divide the set of items to be searched into halves at each step, and perform dot product tests on smaller and smaller subsets, many of which we are able to prune away. We experimentally show that, using softmax-based features, our method achieves a more than ten-fold speed-up over exhaustive search with no loss of accuracy, on a variety of large datasets. Based on empirically verified models for the distribution of cosine distances, we present a theoretical analysis of the expected number of distance computations per query and the probability that a pool with a certain number of members will be pruned. Our method has the following features: (i) It implicitly exploits useful distributional properties of cosine distances, unlike other methods; (ii) All required data structures are created purely offline; (iii) It does not impose any strong assumptions on the number of true near neighbors; (iv) It is adaptable to streaming settings where new vectors are dynamically added to the database; and (v) It does not require any parameter tuning. The high recall of our technique makes it particularly suited to plagiarism detection scenarios where it is important to report every database item that is sufficiently similar to the query.

Keywords:
near neighbor search, group testing, image retrieval

1 Introduction

Near neighbor (NN) search is a fundamental problem in machine learning, computer vision, image analysis and recommender systems. Consider a database $\mathcal{D}$ of $N$ vectors $\{\boldsymbol{f_1},\boldsymbol{f_2},\cdots,\boldsymbol{f_N}\}$, where each $\boldsymbol{f_i}\in\mathbb{R}^d$. Given a query vector $\boldsymbol{q}\in\mathbb{R}^d$, our aim is to determine all those vectors in $\mathcal{D}$ such that $\textrm{sim}(\boldsymbol{q},\boldsymbol{f_i})\geq\rho$, where $\textrm{sim}(.,.)$ denotes some similarity measure, and $\rho$ denotes some threshold on sim. This is called a similarity measure based NN query. Another variant is the $K$ nearest neighbor (KNN) search, where the aim is to find the $K$ most similar neighbors to $\boldsymbol{q}$.
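As a point of reference for what follows, the range-based query above can be written as a brute-force sketch (using NumPy; the function name and toy data are ours, not from the paper):

```python
import numpy as np

def range_query(D, q, rho):
    """Exhaustive range-based NN search: return indices i with sim(q, f_i) >= rho.

    D: (N, d) array of unit-normalized database vectors (rows f_i).
    q: (d,) unit-normalized query vector.
    rho: cosine-similarity threshold.
    """
    sims = D @ q                      # all N dot products q^t f_i
    return np.flatnonzero(sims >= rho)

# Toy usage: 3 database vectors, threshold 0.9.
D = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
q = np.array([1.0, 0.0])
print(range_query(D, q, 0.9))  # only f_1 (index 0) qualifies
```

This exhaustive baseline costs $N$ dot products per query; the rest of the paper is about avoiding most of these tests.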

A significant portion of the literature on NN search focuses on approximate methods, often resulting in less than perfect recall rates, i.e. many database images that are similar to the query image are not retrieved. This deficiency renders them unsuitable for applications requiring high recall, such as plagiarism detection. In such scenarios, query vectors, such as images or music files, are compared to an existing database of similar entities, and all vectors exhibiting high similarity with the query need to be retrieved for plagiarism assessment. The failure to retrieve a near neighbor could potentially lead to unfair evaluation of submissions to photography or music competitions, or online forums.

Related Work on NN Search: There exists a large body of literature on approximate NN search in higher dimensions. One of the major contributions in this area is called Locality Sensitive Hashing (LSH) [30]. Here, a hashing function partitions the dataset into buckets, such that only close vectors are likely to be hashed to the same bucket. The LSH family of algorithms consists of many variants such as those involving multiple buckets per query vector, data-driven or machine learning based construction of LSH functions [15, 20], and count-based methods [34]. In general, LSH based methods are difficult to implement for very high dimensions owing to the curse of dimensionality in constructing effective hash tables. Dimensionality reduction algorithms can mitigate this issue to some extent, but they are themselves expensive to implement and incur some inevitable data loss. There is also significant research in graph-based methods for nearest neighbor (NN) search, as explored in [36, 24, 27]. These methods typically achieve high accuracy and efficient query times. However, they often need relatively high pre-processing times, making them less suitable for dynamic datasets that require frequent addition of new points. In addition, various randomized variants of KD-trees and priority search K-means trees have been investigated for efficient approximate NN search in high-dimensional spaces, as seen in [40] and [37]. Despite their efficiency, these approaches involve numerous parameters, which can complicate tuning, especially for large datasets.

Overview of Group Testing: In this work, we present a group testing (GT) based approach for the NN problem. We start by presenting a brief overview of the field of GT. Consider $N$ items $\{x_i\}_{i=1}^N$ (also called ‘samples’ in many references), out of which only a small number $s\ll N$ are ‘defective’. We consider the $i$th item to be defective if $x_i=1$ and non-defective if $x_i=0$. GT involves testing a certain number $m\ll N$ of ‘pools’ to identify the defective items from $\{x_i\}_{i=1}^N$. Each pool (also called ‘group’) is created by mixing together or combining subsets of the $N$ items. Thus the $j$th pool ($j\in[m]$) is represented as $y_j=\bigvee_{i=1}^N A_{ji}x_i$, where $\bigvee$ stands for bit-wise OR, and $\boldsymbol{A}\in\{0,1\}^{m\times N}$ is a binary pooling matrix with $A_{ji}=1$ if the $i$th item participates in the $j$th pool and $A_{ji}=0$ otherwise. GT has a rich literature dating back to the seminal ‘Dorfman’s method’ [21], which is a popular two-round testing strategy. In the first round of Dorfman’s method, some $N/K$ pools, each consisting of $K$ items, are tested. The items that participated in pools that test negative (i.e., pools deemed to contain no defective item) are declared non-defective, whereas all items that participated in positive pools (i.e., pools deemed to contain at least one defective item) are tested individually in the second round. Theoretical analysis shows that for $s\ll N$, this method requires much fewer than $N$ tests on average. This algorithm was extended to $r$ rounds in [33], where positive groups are further divided into smaller groups, and individual testing is performed only in the $r$th round. The methods in [21, 33] are said to be adaptive in nature, as the results of one round are given as input to the next one. The GT literature also contains a large number of non-adaptive algorithms that make predictions in a single round.
A popular non-adaptive algorithm is Comp (Combinatorial Orthogonal Matching Pursuit), which declares all items participating in any negative pool to be negative, and all the rest to be positive [14]. Many variants of Comp such as Noisy Comp (NComp) or Definite Defectives (Dd) are also popular [14]. A popular class of non-adaptive GT algorithms [25] draws heavily from the compressed sensing (CS) literature [18, 43]. Such algorithms can consider quantitative information for each item, so that each $x_i\in\mathbb{R}$, and every pool $y_j$ also gives a real-valued result. The pooling matrix $\boldsymbol{A}$ remains binary, and we have the relation $\forall j\in[m],\ y_j=\sum_{i=1}^N A_{ji}x_i$. Given $\{y_j\}_{j=1}^m$ and $\boldsymbol{A}$, the unknown vector $\boldsymbol{x}$ can be recovered using efficient convex optimization methods such as the Lasso, given by $\boldsymbol{\hat{x}}=\textrm{argmin}_{\boldsymbol{x}}\|\boldsymbol{y}-\boldsymbol{Ax}\|^2_2+\lambda\|\boldsymbol{x}\|_1$, where $\lambda$ is a regularization parameter [28]. As proved in [28, 19], the Lasso estimator produces provably accurate estimates of $\boldsymbol{x}$ given a sufficient number of measurements (or pools) in $\boldsymbol{y}$ and a carefully designed matrix $\boldsymbol{A}$.
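To make the Comp decoding rule concrete, here is a minimal illustrative sketch (the pooling matrix, pool results, and function name below are hypothetical examples of ours, not from the paper):

```python
def comp_decode(A, y):
    """Comp: every item appearing in at least one negative pool is declared
    non-defective; all remaining items are declared defective.

    A: m x N binary pooling matrix (list of lists).
    y: length-m list of pool results (0 = negative, 1 = positive).
    Returns a 0/1 defectiveness estimate per item.
    """
    m, N = len(A), len(A[0])
    defective = [1] * N               # assume defective until cleared
    for j in range(m):
        if y[j] == 0:                 # a negative pool clears all its members
            for i in range(N):
                if A[j][i] == 1:
                    defective[i] = 0
    return defective

# Items 0..4, pools {0,1,2}, {2,3}, {3,4}; only pool 3 tests negative.
A = [[1, 1, 1, 0, 0], [0, 0, 1, 1, 0], [0, 0, 0, 1, 1]]
y = [1, 1, 0]
print(comp_decode(A, y))   # items 3,4 cleared; 0,1,2 remain flagged
```

Note that Comp can produce false positives (items 0 and 1 above are flagged merely because they never appear in a negative pool), which is one reason the paper pursues an exact, adaptive scheme instead.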

Overview of our work: GT algorithms have recently been applied to the NN search problem in [41, 31, 22]. However, in these papers, the nearest neighbor queries are of an approximate nature, i.e. the results may contain a number of vectors which are not near neighbors of the query vector $\boldsymbol{q}$, and/or may miss some genuine near neighbors. On the other hand, we focus on accurate query results with good computational efficiency in practice. Our approach is based on an adaptive GT technique called binary splitting. Like existing GT based algorithms such as [41, 31, 22], our approach is also based on the assumption that in high-dimensional NN search, most vectors in the database $\mathcal{D}$ are dissimilar to a query vector $\boldsymbol{q}$, and very few vectors qualify as near neighbors. However, our method does not require the gap in similarity between the least similar true neighbor and the most similar non-neighbor to be large, unlike [22]. Additionally, many existing approaches [22, 31, 41, 37, 15, 39, 26, 1, 16] require parameter tuning in a manner that is not fully data driven. Such tuning is also tedious for large databases, particularly for algorithms with large index building times. Moreover, algorithms with large index building times, or those unable to efficiently handle the addition of new points, are ill-suited for streaming settings. In this work, we present a range-based near neighbor search algorithm in high dimensions built on a group testing framework that has the following appealing properties for softmax-based feature descriptors: (i) low index building time, (ii) perfect precision and recall without exhaustive search, (iii) no parameter tuning, (iv) low time complexity ($\mathcal{O}(1)$) for addition of new vectors, making the algorithm suitable for streaming tasks, and (v) a unique method of incorporating distributional properties of the similarity values between query and database vectors into the search process.

Organization of the paper: The core method is presented in Sec. 2. We present detailed analytical comparisons with existing GT-based techniques for NN search in Sec. 3. Experimental results and comparisons to a large number of recent NN search methods are presented in Sec. 4. A theoretical analysis of our approach is presented in Sec. 5. We conclude in Sec. 6.

2 Description of Method

In our technique, a pool of high-dimensional vectors is created by simply adding together the participating vectors, which may for example be (vectorized) images or features extracted from them. Consider a query vector $\boldsymbol{q}\in\mathbb{R}^d$ and the database $\mathcal{D}$ consisting of $\{\boldsymbol{f_1},\boldsymbol{f_2},\cdots,\boldsymbol{f_N}\}$, where each $\boldsymbol{f_i}\in\mathbb{R}^d$. Throughout this paper, we consider the similarity measure $\textrm{sim}(\boldsymbol{q},\boldsymbol{f_i})\triangleq\boldsymbol{q}^t\boldsymbol{f_i}$. For simplicity, we consider the query vector $\boldsymbol{q}$ as well as the vectors in $\mathcal{D}$ to be element-wise non-negative. This constraint is satisfied by images or their features based on ReLU or softmax non-linearities (see the end of this section for a method to handle negative-valued descriptors). We also consider the vectors in question to have unit magnitude, without any loss of generality. Due to this, $\boldsymbol{q}^t\boldsymbol{f_i}$ acts as a cosine similarity measure.

Our aim is to retrieve all ‘defective’ members of $\mathcal{D}$, i.e. each member $\boldsymbol{f_i}$ for which $\boldsymbol{q}^t\boldsymbol{f_i}\geq\rho$, where $\rho$ is some pre-specified threshold. With this aim in mind, we seek to efficiently prune away any vector $\boldsymbol{f_i}$ from $\mathcal{D}$ for which it is impossible to have $\boldsymbol{q}^t\boldsymbol{f_i}\geq\rho$. Vector pooling using summations proves to be beneficial for this purpose. Consider the toy example of a pool $\boldsymbol{y}=\boldsymbol{f_1}+\boldsymbol{f_2}+\boldsymbol{f_3}+\boldsymbol{f_4}$. If $\boldsymbol{q}^t\boldsymbol{y}<\rho$ (i.e. the pooled result is negative), then one can clearly conclude that $\boldsymbol{q}^t\boldsymbol{f_i}<\rho$ for each $i\in\{1,2,3,4\}$, since all the individual dot products are non-negative. However, if $\boldsymbol{q}^t\boldsymbol{y}\geq\rho$, i.e. it yields a positive result, then we need to do further work to identify the defective members (if any). For this, we split the original pool and create two pools $\boldsymbol{y_1}$ and $\boldsymbol{y_2}$, each with approximately half the members of the original pool. We compute $\boldsymbol{q}^t\boldsymbol{y_2}$, and then obtain $\boldsymbol{q}^t\boldsymbol{y_1}=\boldsymbol{q}^t\boldsymbol{y}-\boldsymbol{q}^t\boldsymbol{y_2}$ (both $\boldsymbol{q}^t\boldsymbol{y}$ and $\boldsymbol{q}^t\boldsymbol{y_2}$ have already been computed, and hence $\boldsymbol{q}^t\boldsymbol{y_1}$ is directly obtained from their difference). Thereafter, we proceed in a similar fashion for both branches.

We now consider the case of searching within the entire database $\mathcal{D}$ defined earlier. We initiate this recursive process with a pool $\boldsymbol{y_0}$ obtained from the summation of all $N$ vectors. If $\boldsymbol{y_0}$ yields a positive result, we test the query vector with two pools: $\boldsymbol{y_{11}}$, obtained from the summation of vectors $\boldsymbol{f_1}$ through $\boldsymbol{f_{\lfloor N/2\rfloor}}$, and $\boldsymbol{y_{12}}$, obtained from the summation of vectors $\boldsymbol{f_{\lfloor N/2\rfloor+1}}$ through $\boldsymbol{f_N}$. If $\boldsymbol{y_{11}}$ tests negative, then all its members are declared negative and no further tests are required on them, which greatly reduces the total number of tests. If $\boldsymbol{y_{11}}$ tests positive, then it is split into two pools $\boldsymbol{y_{21}},\boldsymbol{y_{22}}$ with roughly equal numbers of members. Tests are carried out on these two pools recursively. The same treatment is accorded to $\boldsymbol{y_{12}}$, which would be split into pools $\boldsymbol{y_{23}},\boldsymbol{y_{24}}$ if it were to test positive. This recursive process stops when the pools obtained after splitting contain just a single vector each, in which case each such single vector is tested against $\boldsymbol{q}$. Clearly, the depth of this recursive process is $\lceil\log_2 N\rceil$ given $N$ vectors in $\mathcal{D}$. This procedure is called binary splitting, and it is a multi-round adaptive group testing approach, applied here to NN search. For efficiency of implementation, it is important to be able to create the different pools such as $\boldsymbol{y_0},\boldsymbol{y_{11}},\boldsymbol{y_{12}},\boldsymbol{y_{21}},\boldsymbol{y_{22}},\boldsymbol{y_{23}},\boldsymbol{y_{24}}$, etc., efficiently. For this purpose, we maintain cumulative sums of the following form in memory:

$$\boldsymbol{\widetilde{f}_1}=\boldsymbol{f_1},\quad \boldsymbol{\widetilde{f}_2}=\boldsymbol{\widetilde{f}_1}+\boldsymbol{f_2},\quad \ldots,\quad \boldsymbol{\widetilde{f}_N}=\boldsymbol{\widetilde{f}_{N-1}}+\boldsymbol{f_N}. \qquad (1)$$

The cumulative sums are useful for efficient creation of the various pools. They are created purely offline, independent of any query vector. Given these cumulative sums, the pools are created using simple vector difference operations. For example, notice that $\boldsymbol{y_0}=\boldsymbol{\widetilde{f}_N}$, $\boldsymbol{y_{11}}=\boldsymbol{\widetilde{f}_{\lfloor N/2\rfloor}}$ and $\boldsymbol{y_{12}}=\boldsymbol{\widetilde{f}_N}-\boldsymbol{\widetilde{f}_{\lfloor N/2\rfloor}}$. The overall binary splitting procedure is presented in Alg. 1 (a schematic diagram is provided in [13]). For the sake of greater efficiency, it is implemented non-recursively using a simple stack data structure.
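A minimal sketch of this prefix-sum construction and pool formation (assuming NumPy and the 1-based pool indices used in the text; the function names are ours):

```python
import numpy as np

def build_prefix_sums(F):
    """Offline step: cumulative sums f~_i = f_1 + ... + f_i (Eqn. 1). F is (N, d)."""
    return np.cumsum(F, axis=0)

def pool_dot(q, Fs, si, ei):
    """Dot product of q with the pool f_si + ... + f_ei (1-based indices),
    obtained from at most two prefix-sum lookups and one subtraction."""
    total = q @ Fs[ei - 1]                    # q^t f~_ei (0-based row ei-1)
    if si > 1:
        total -= q @ Fs[si - 2]               # subtract q^t f~_{si-1}
    return total

F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
Fs = build_prefix_sums(F)
q = np.array([1.0, 2.0])
# Pool over f_2..f_4: q.(f_2 + f_3 + f_4) = 2 + 3 + 1.5 = 6.5
print(pool_dot(q, Fs, 2, 4))
```

Note that in Alg. 1 even this subtraction is avoided for left-half pools, since their dot product is recovered as $sim - rsim$ from already computed quantities.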

Algorithm 1 Iterative Binary Splitting (see [13, Fig. 1] for a diagram)
1: $Fs\leftarrow[\boldsymbol{\widetilde{f}_1},\boldsymbol{\widetilde{f}_2},\ldots,\boldsymbol{\widetilde{f}_N}]$
2: $S\leftarrow$ stack for storing triples $\{si,ei,sim\}$, where $si,ei$ are the start and end indices of a pool, and $sim$ is the dot product between the query vector $\boldsymbol{q}$ and the pool vector
3: $Flg\leftarrow$ array of $N$ flags, all 0 initially ($0=$ non-neighbor, $1=$ neighbor of $\boldsymbol{q}$)
4: $S.\textrm{push}\,\{1,N,\boldsymbol{q}^t\boldsymbol{\widetilde{f}_N}\}$, i.e. $si=1$, $ei=N$
5: while $S.\textrm{size}()\neq 0$ do
6:     $\{si,ei,sim\}\leftarrow S.\textrm{top}()$; $S.\textrm{pop}()$
7:     if $sim\geq\rho$ then
8:         $n\leftarrow ei-si+1$
9:         if $n>2$ then
10:             $rsim\leftarrow\boldsymbol{q}^t(\boldsymbol{\widetilde{f}_{ei}}-\boldsymbol{\widetilde{f}_{si+\lfloor n/2\rfloor-1}})$ ▷ test right half
11:             $S.\textrm{push}\,\{si+\lfloor n/2\rfloor,\,ei,\,rsim\}$ ▷ push right branch onto stack
12:             $S.\textrm{push}\,\{si,\,si+\lfloor n/2\rfloor-1,\,sim-rsim\}$ ▷ push left branch onto stack
13:         else
14:             if $n=2$ then
15:                 $rsim\leftarrow\boldsymbol{q}^t(\boldsymbol{\widetilde{f}_{ei}}-\boldsymbol{\widetilde{f}_{si}})$
16:                 set $Flg[ei]=1$ if $rsim\geq\rho$
17:                 set $Flg[si]=1$ if $sim-rsim\geq\rho$
18:             else
19:                 $Flg[si]=1$ ▷ mark $si$ a neighbor
20:             end if
21:         end if
22:     end if
23: end while
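For concreteness, Alg. 1 admits the following compact Python rendering (a sketch under stated assumptions: NumPy prefix sums, with the 1-based pool indices of the pseudocode mapped onto 0-based arrays; the function name and toy data are ours):

```python
import numpy as np

def binary_splitting_search(q, Fs, rho):
    """Range query via iterative binary splitting (Alg. 1).
    Fs[i] = f_1 + ... + f_{i+1} (prefix sums, 0-based rows).
    Returns a 0/1 flag per database vector (1 = neighbor of q)."""
    N = len(Fs)
    flags = [0] * N
    # Stack of (si, ei, sim): 1-based pool range and its dot product with q.
    stack = [(1, N, float(q @ Fs[N - 1]))]
    while stack:
        si, ei, sim = stack.pop()
        if sim < rho:
            continue                          # negative pool: prune all members
        n = ei - si + 1
        if n > 2:
            mid = si + n // 2 - 1             # left pool covers [si, mid]
            rsim = float(q @ (Fs[ei - 1] - Fs[mid - 1]))   # test right half
            stack.append((mid + 1, ei, rsim))              # right branch
            stack.append((si, mid, sim - rsim))            # left branch (free)
        elif n == 2:
            rsim = float(q @ (Fs[ei - 1] - Fs[si - 1]))
            if rsim >= rho:
                flags[ei - 1] = 1
            if sim - rsim >= rho:
                flags[si - 1] = 1
        else:
            flags[si - 1] = 1                 # single positive vector
    return flags

F = np.array([[1.0, 0, 0], [0, 1.0, 0], [0.9, 0.1, 0], [0, 0, 1.0]])
F = F / np.linalg.norm(F, axis=1, keepdims=True)   # unit-normalize rows
Fs = np.cumsum(F, axis=0)
q = np.array([1.0, 0, 0])
print(binary_splitting_search(q, Fs, 0.8))  # f_1 and f_3 are neighbors
```

On this toy input the output agrees exactly with exhaustive thresholding of $\boldsymbol{q}^t\boldsymbol{f_i}$, as it must, since every pruned pool provably contains no neighbor.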

This binary splitting approach is quite different from Dorfman’s algorithm [21], which performs individual testing in the second round. Our approach is more similar to Hwang’s generalized binary splitting approach, a GT procedure proposed in [29] (see also [8], Alg. 1.1 of [14]). However, compared to Hwang’s technique, our approach allows for the creation of all pools either purely offline or with simple updates involving a single vector-subtraction operation (for example, $\boldsymbol{y_{12}}=\boldsymbol{\widetilde{f}_N}-\boldsymbol{\widetilde{f}_{\lfloor N/2\rfloor}}$ in the earlier example). On the other hand, the method in [29] detects defectives one at a time, after going through all $\log_2 N$ levels of recursion. Once a defective item is detected, it is discarded and all pools are created again. The entire set of $\log_2 N$ levels of recursion is again executed. This process is repeated until all defectives have been detected and discarded one at a time. The method from [29] is computationally inefficient for high-dimensional NN search, as the pools must be created afresh at each level of recursion: if a vector at index $i$ is discarded, all cumulative sums from index $i+1$ to $N$ need to be updated. Such updates are not required in our method. We also emphasize that the techniques in [21, 14, 29, 33] have all been applied only to binary GT and not to NN search (see the beginning of Sec. 5 for an important difference between binary GT and NN search). The binary splitting procedure saves a large number of tests because, in high-dimensional search, we observe that most vectors in $\mathcal{D}$ are dissimilar to a given query vector $\boldsymbol{q}$, as will be demonstrated in Sec. 4. After the first round, where pool $\boldsymbol{y_0}$ is created, all other pools are essentially created randomly (since the initial arrangement of the vectors in $\mathcal{D}$ is random). As a result, across the different rounds, many pools test negative and are dropped, especially in later rounds. In particular, we note that there is no loss of accuracy in this method, although we reduce the number of tests by a factor of more than ten on some large datasets (as will be shown in Sec. 4).

Pooling using Element-wise Maximum: We have so far described pool creation using vector summation. However, pools can also be created by computing the element-wise maximum of all vectors participating in the pool. Given a subset of vectors $\{\boldsymbol{f_k}\}_{k=1}^K$, such a pool $\boldsymbol{y_m}$ is given by: $\forall j\in\{1,2,...,d\},\ y_{m,j}=\max_k(f_{k,j})$, where $f_{k,j}$ stands for the $j$th element of $\boldsymbol{f_k}$. We denote this operation as $\boldsymbol{y_m}=\max(\{\boldsymbol{f_k}\}_{k=1}^K)$. If $\boldsymbol{q}^t\boldsymbol{y_m}<\rho$, it follows that $\boldsymbol{q}^t\boldsymbol{f_k}<\rho$ for every $\boldsymbol{f_k}$ that contributed to $\boldsymbol{y_m}$. Hence pools created using an element-wise maximum are also well suited for iterative binary splitting, in a fashion similar to Alg. 1. A separate element-wise maximum vector needs to be stored in memory for each pool. In case of negative values in the query and/or database vectors, the element-wise maximum can be replaced by an element-wise minimum for every index $j$ for which $q_j<0$. Unlike the summation technique, this handles negative-valued descriptors, albeit at the cost of more memory, in order to store both element-wise maximum and element-wise minimum values for each pool.
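The max-pool pruning test for the non-negative case can be sketched as follows (hypothetical toy data; NumPy assumed; function names are ours):

```python
import numpy as np

def max_pool(vectors):
    """Element-wise maximum over a set of vectors, one pool vector per pool."""
    return np.max(vectors, axis=0)

def can_prune(q, y_max, rho):
    """If q.y_max < rho, then q.f_k < rho for every pool member f_k
    (valid when q and all f_k are element-wise non-negative, since
    f_{k,j} <= y_{max,j} implies q.f_k <= q.y_max)."""
    return float(q @ y_max) < rho

pool = np.array([[0.1, 0.2, 0.0], [0.0, 0.3, 0.1]])
q = np.array([0.5, 0.5, 0.5])
y_max = max_pool(pool)                 # [0.1, 0.3, 0.1]
print(can_prune(q, y_max, 0.8))        # q.y_max = 0.25 < 0.8: pool safely pruned
```

Unlike summation pools, a positive max-pool cannot be split by subtraction; each child pool's maximum vector must be stored separately, which is the memory cost noted above.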

3 Comparison to other Group Testing Methods for NN Search

GT algorithms of different flavors have been applied to the NN search problem in [41, 31, 22]. A detailed description of these techniques is presented in the supplemental material at [13, Sec. 2]. Compared to these three afore-mentioned techniques, our method is exact, with the same accuracy as exhaustive search. The methods in [41, 31, 22] require the choice of various hyper-parameters ($R$, the number of backpropagation steps, for [41]; sparse coding parameters in [31]; bloom filter and hash function parameters ($B,R,m,L$) in [22] – see [13, Sec. 2]) for which there is no clear data-driven selection procedure. Our method, however, requires no parameter tuning. In experiments, we have observed a speed-up of more than ten-fold in querying time with our method as compared to exhaustive search on some datasets. Like [41], and unlike [22, 31], our method does have a large memory requirement, as we require all pools to be in memory. Some techniques such as [22] require the nearest neighbors to be significantly more similar to the query than all other members of $\mathcal{D}$ (let us call this condition $\mathscr{C}1$). Our method does not have such a requirement. However, our method will perform more efficiently for queries for which a large number of similarity values turn out to be small, and only a minority are above the threshold $\rho$ (let us call this condition $\mathscr{C}2$). In our experimental study on diverse datasets, we have observed that $\mathscr{C}2$ always holds. On the other hand, $\mathscr{C}1$ does not hold in a large number of cases, as also reported in [22, Sec. 5.3].

We also experimented with applying the latest developments in the compressed sensing (CS) literature [43] (which is closely related to GT) to the approximate NN search problem, albeit unsuccessfully. The theoretical reasons for the failure of the latest CS techniques for NN search are described in [13, Sec. 3].

4 Experimental Results

In this section, we present experimental results in two settings: (i) a non-streaming setting with a static database, and (ii) a streaming setting with incremental additions of new vectors to the database. In both settings, we tested our algorithms on the following publicly available image datasets, of which the first two are widely used in retrieval and NN search problems: (i) MIRFLICKR, a set of 1 million (1M) images from the MIRFLICKR dataset (a subset of images from Flickr) available at [11], with an additional 6K images from the Paris dataset available at [12], as followed in [41]; (ii) ImageNet, a set of 1M images used in the ImageNet Object Localization Challenge, available at [2]; (iii) IMDB-Wiki, a set of all 500K images from the IMDB-Wiki face image dataset from [9]; (iv) InstaCities, a set of 1M images of 10 large cities [10]. Although we have worked with image datasets, our method is equally applicable to other modalities such as speech, genomics, etc.

In the following, we denote a query image by $\boldsymbol{I_q}$. For every dataset containing images $\{\boldsymbol{I_i}\}_{i=1}^N$, feature vectors $\{\boldsymbol{f_i}\}_{i=1}^N$ were extracted as follows: First, each image $\boldsymbol{I_i}$ was passed through the popular VGGNet [42] (as its features are known to be useful for building good perceptual metrics [44]) to obtain an intermediate feature vector $\boldsymbol{f'_i}$. Second, a softmax-based feature vector $\boldsymbol{f_i}$ of dimension $d=1000$, corresponding to the 1000 classes in ImageNet, was further extracted from the VGGNet output vector $\boldsymbol{f'_i}$. In our experiments, we observed that the softmax-based features yielded superior retrieval recall (defined as the number of retrieved points belonging to the same class as the query point, divided by the number of points of the same class as the query point) as compared to the baseline VGGNet features, with exhaustive search on different datasets (see [13, Table 1] for more details), even on those datasets which differ from ImageNet. Higher recall is particularly beneficial for NN search in scenarios such as image plagiarism detection. The feature vectors were all non-negative and were subsequently unit-normalized. For every dataset, NN search was carried out using different methods (see below) on the same unit-normalized feature vectors extracted from 10,000 query images. That is, the aim was to efficiently identify those indices $i\in\{1,2,...,N\}$ for which the dot product between $\boldsymbol{q}$ (the feature vector of query image $\boldsymbol{I_q}$) and $\boldsymbol{f_i}$ (both unit-normalized) exceeded some specified threshold $\rho$. For all datasets, the intrinsic dimensionality was computed as the number of singular values of the $d\times N$ data matrix that accounted for 99% of its Frobenius norm. The intrinsic dimensionality for all datasets was more than 200 (for $d=1000$).

Our binary splitting approach, using pool creation with summation (Our-Sum) and element-wise maximum (Our-Max), was compared to the following recent and popular techniques: (i) Exhaustive search (Exh) through the entire database, for every query image; (ii) The recent GT-based technique from [22], known as Flinng (Filters to Identify Near-Neighbor Groups), using code from [3]; (iii) The popular randomized KD-tree and K-means based technique Flann for approximate NN search [37], using code from [7] with default parameters; (iv) The Falconn technique from [15] which uses multi-probe LSH, using the code from [4]; (v) Falconn++ [39], an enhanced version of Falconn which filters out potentially distant points from any hash bucket before querying using code from [5]; (vi) Ivf, the speedy inverted multi-index technique implemented in the highly popular Faiss library from Meta/Facebook [1, 16]; (vii) Hnsw, a graph-based technique also used in [1]; (viii) The Scann library [6, 26] which implements an anisotropic vector quantization (KMeans) algorithm. Our methods have no hyperparameters, but for all competing methods, results are reported by tuning hyperparameters so as to maximize the recall (defined below) – see [13, Sec. 5] for more details.

We did not compare with the GT-based techniques from [41] and [31], as we could not find full publicly available implementations for them. Moreover, the technique from [22] is claimed to outperform those in [41] and [31]. We did implement the algorithm from [41], but it yielded significantly larger query times. This may be due to subtle implementation differences in the way image lists are re-ranked in each of the $R$ iterations of their algorithm (see Sec. 3). In any case, our method produces 100% recall and precision, unlike [41, 31].

Non-streaming Setting: The different methods are compared in terms of Mean Query Time (MQT), Mean Precision (MP) and Mean Recall (MR) for 10K queries, as reported in Table 1. The similarity threshold $\rho$ chosen for this setting is $0.8$. The recall is defined as $\textrm{\#true positives}/(\textrm{\#true positives}+\textrm{\#false negatives})$, and the precision as $\textrm{\#true positives}/(\textrm{\#true positives}+\textrm{\#false positives})$. For each dataset, we also report the average number of neighbours ($\hat{k}$) having similarity greater than $\rho=0.8$. The cumulative sums in Eqn. 1 for Our-Sum and the cumulative element-wise maximums for Our-Max were computed offline. The query implementation was done in C++ with the highest level of optimization in the g++ compiler. The codes for Flinng, Falconn++, Scann and Hnsw operate in the KNN search mode, as opposed to a range-based query for Our-Sum, Our-Max, Falconn and Ivf, for which we set $\rho=0.8$. The Flann method has code support for both KNN and range-based queries, which we refer to as Flann-K and Flann-R respectively. For all KNN-based methods, we gave an appropriate $K$ as input, such that $K$ ground truth neighbors satisfy the similarity threshold $\rho$. The time taken to compute $K$ was not included in the query times reported for the KNN-based methods. All query times are reported for an Intel(R) Xeon(R) server with CPU speed 2.10GHz and 128 GB RAM. For all methods, the reported query times do not include the time taken to compute the feature vector for the query image.

Streaming Setting: We now consider experiments in a streaming setting, where new vectors are added to the database on the fly, i.e. the database is modified incrementally. In our experimental setting, 80% of the full database was initially available to all algorithms for index creation. New points were incrementally added to the database, and after the addition of every 100 new points, a search query was fired. This process was carried out until the entire database was loaded into memory. The query images were created by introducing controlled perturbations such as dither and blur to images existing in the database (refer to [13] for more details). The similarity threshold for this setting was chosen to be $\rho=0.9$. Note that this setting is analogous to a typical image plagiarism detection setting, wherein (a) the database is updated with the insertion of new points (without deletion) and requires fetching high-similarity vectors, and (b) the submitted images could be subtly manipulated to escape plagiarism detection. High recall is desirable in such applications.

The only update Our-Sum requires upon insertion of $n_s$ new points is to produce $n_s$ extra cumulative sums as in Eqn. 1, at a cost of $O(dn_s)$. For comparison, we selected only those algorithms that support addition of new points without rebuilding the entire index from scratch: Exh, Ivf, Hnsw and Flann-R. Table 2 reports the initial Index Build Time (IBT) for 80% of the database and the Mean Insertion Time (MIT) for new points, in addition to MQT, MP and MR.
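As a concrete illustration, this insertion update can be sketched as follows (a minimal Python sketch, not the paper's C++ implementation; we assume Eqn. 1 denotes element-wise cumulative sums of the gallery feature vectors):

```python
def insert_points(prefix, new_vectors):
    """Append cumulative sums for newly streamed gallery vectors.

    prefix[i] holds f_1 + ... + f_{i+1} (element-wise). Extending it by
    n_s new d-dimensional vectors costs O(d * n_s), since each new
    cumulative sum builds on the previous one.
    """
    last = prefix[-1] if prefix else None
    for f in new_vectors:
        last = list(f) if last is None else [a + b for a, b in zip(last, f)]
        prefix.append(last)
    return prefix
```

No other part of the index needs to be touched, which is why the insertion time reported in Table 2 is so low.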

Discussion of Results on Static Datasets (Table 1): We observe that Our-Sum yields lower MQT on most datasets than most other techniques, while guaranteeing 100% precision and recall. Falconn produces higher query times than Our-Sum or Our-Max, and with some loss of recall. Falconn++ produces 100% precision and recall, but with much higher query times than Our-Sum or Our-Max. Flinng and Flann-K are efficient, but have significantly lower precision and recall than Our-Sum, with many of the retrieved dot products falling significantly below the chosen $\rho$. Flann-R yields very low recall despite its excellent precision and query time. Hnsw produces higher query times in cases where the number of neighbors is large, with some loss of recall. The closest competitors to Our-Sum are Scann and Ivf, but they do not guarantee 100% precision and recall. In addition, we note that, as reported in [26, Fig. 3a], the accuracy of the dot products retrieved by Scann depends on the chosen bit-rate, i.e., the chosen number of cluster centers in KMeans, whereas Our-Sum has no such dependencies or parameter choices. This is a major conceptual advantage of Our-Sum over Scann (also see the 'Streaming' setting below). Compared to exhaustive search, Our-Sum produces a speed-up of between 4 and 120 depending on the dataset. Our-Sum outperforms Our-Max in terms of query time, but it is important to note that Our-Max is capable of handling negative-valued descriptors, although we have not experimented with that in this paper.

Db. Our-S Our-M Ivf Falc Flann-R Falc++ Flinng Flann-K Scann Hnsw Exh
MQT MIRFL. 54.0 81 94 302 1 509 62 54 23 66 1222
MP ($\hat{k}\approx 1.7$K) 1.0 1.0 1.0 $\approx$1.0 1.0 1.0 0.268 0.743 0.996 0.959 1.0
MR 1.0 1.0 0.997 0.954 0.323 1.0 0.268 0.743 0.996 0.959 1.0
MQT ImgN. 4.9 7 103 179 2 385 119 40 23 9 1276
MP ($\hat{k}\approx 0.9$K) 1.0 1.0 1.0 1.0 1.0 1.0 0.372 0.310 0.998 0.965 1.0
MR 1.0 1.0 0.997 0.958 0.217 1.0 0.372 0.310 0.998 0.965 1.0
MQT IMDB. 98.8 249 49 415 1.0 739 46 98 60 1076 800
MP ($\hat{k}\approx 7$K) 1.0 1.0 1.0 $\approx$1.0 1.0 1.0 0.351 0.656 0.997 0.985 1.0
MR 1.0 1.0 0.995 0.995 0.191 1.0 0.351 0.656 0.997 0.985 1.0
MQT InstaC. 58.0 110 90 330 1 601 57 61 25 298 1174
MP ($\hat{k}\approx 3$K) 1.0 1.0 1.0 $\approx$1.0 $\approx$1.0 1.0 0.293 0.745 0.996 0.952 1.0
MR 1.0 1.0 0.990 0.964 0.324 1.0 0.293 0.745 0.996 0.952 1.0
Table 1: Comparison of mean query time (MQT, $\downarrow$) in milliseconds, and mean precision (MP, $\uparrow$) and mean recall (MR, $\uparrow$) values for our method with summation pooling (Our-S.); our method with max pooling (Our-M.); Ivf from the Faiss package [1, 16]; Hnsw from the Faiss package [1, 36]; Falconn [15, 4]; Falconn++ [39]; Flinng [22]; Flann [37] with KNN-based (Flann-K) and range-based queries (Flann-R); Scann [6]; and exhaustive search (Exh). For all methods $\rho=0.8$; see the text for more details. $\hat{k}\triangleq$ avg. number of database vectors satisfying $\boldsymbol{q}^{t}\boldsymbol{f_{i}}\geq\rho$.
Index Build Time (in ms) \downarrow
Alg.\downarrow, Db. \rightarrow ImgN. IMDB. MIRFL. InstaC.
Our-Sum 3789 2000 3774 3765
Exh. NA NA NA NA
Ivf 15000 9960 15410 18722
Hnsw 7e5 2e5 4e5 4e5
Flann-R 2e6 9e5 2e6 2e6
Average Insertion Time (in ms) \downarrow
Alg.\downarrow, Db. \rightarrow ImgN. IMDB. MIRFL. InstaC.
Our-Sum 0.28 0.29 0.28 0.29
Exh. $\sim$0 $\sim$0 $\sim$0 $\sim$0
Ivf 2.2 2.7 2.2 2.7
Hnsw 236 239 222 219
Flann-R 42 53 27 26
Mean Query Time (in ms) \downarrow
Alg.\downarrow, Db. \rightarrow ImgN. IMDB. MIRFL. InstaC.
Our-Sum 18 121 86 115
Exh. 609 315 607 605
Ivf 183 67 145 132
Hnsw 10 168 13 70
Flann-R 6 4 5 5
Mean Precision \uparrow, Mean Recall \uparrow
Alg.\downarrow, Db. \rightarrow ImgN. IMDB. MIRFL. InstaC.
Our-Sum 1,1 1,1 1,1 1,1
Exh. 1,1 1,1 1,1 1,1
Ivf 1,0.99 1,0.99 1,0.99 1,0.99
Hnsw 0.96,0.96 0.98,0.98 0.97,0.97 0.97,0.97
Flann-R 1,0.31 1,0.59 1,0.49 1,0.54
Table 2: Streaming results: Index Build Time (IBT) for 80% of the database, mean insertion time (MIT) for new points, MQT, MP and MR. For IMDB, the number of insertion operations and search queries are $\sim$700 and $\sim$1350 respectively. For all other datasets, the numbers are $\sim$1400 and $\sim$2600 respectively.

Discussion of Results on Datasets with Streaming (Table 2): The Index Build Time (IBT) and Mean Insertion Time (MIT) of our method clearly surpass those of all other indexing methods across all datasets. Furthermore, the space taken to store the index is only $O(Nd)$ in our method. Since Our-Sum only has to compute cumulative sums for the newly inserted points, its index update time is very low compared with the other methods. Flann-R consistently produces lower mean query times, but this comes at the cost of low recall and high insertion time, making it unsuitable for streaming applications. Similarly, Hnsw shows significant gains in MQT on the MIRFLICKR, ImageNet and InstaCities datasets, but exhibits a large insertion time across all datasets.

Analysis of Pruning (Non-Streaming): For each dataset, we also computed the number of pools that got rejected in every 'round' of binary splitting (because the dot product with the query image fell below threshold $\rho$), from round 0 to $\lceil\log_2 N\rceil$. (Referring to Alg. 1, the procedure implements an inorder traversal of the binary tree implicitly created during binary splitting. The same tree can, however, be traversed level by level, and each level of the tree represents one 'round' of binary splitting.) The corresponding histograms for $\rho\geq 0.7$ are plotted in Fig. 4 of [13] for Our-Sum. As can be seen, a large number of rejections occur from round 14 onward in MIRFLICKR and InstaCities, from round 11 onward in ImageNet, and from round 16 onward in IMDB-Wiki. As a result, the MQT is lower for ImageNet than for the other datasets, and also much lower for MIRFLICKR and InstaCities than for IMDB-Wiki. A similar set of histograms is plotted for Our-Max in Fig. 5 of [13]. Furthermore, we plotted histograms of dot products between every query feature vector and every gallery feature vector, across 10,000 queries for all datasets. These are shown in Fig. 1, and clearly indicate that the vast majority of dot products lie in the bin $(0,0.1]$ or $(0.1,0.2]$ (also see Fig. 3 of [13], which shows negative logarithms of the histograms for clearer visualization of values close to 0). This supports the hypothesis that most items in $\mathcal{D}$ are dissimilar to a query vector. However, as seen in Fig. 1, the percentage of dot products in the $(0,0.1]$ bin is much larger for ImageNet, MIRFLICKR and InstaCities than for IMDB-Wiki. This tallies with the significantly larger speedup obtained by binary splitting (both Our-Sum and Our-Max) for ImageNet, MIRFLICKR and InstaCities as compared to IMDB-Wiki; see Table 1.

5 Theoretical Analysis

For the problem of binary GT, different variants of the binary splitting approach have been shown to require just $s\log_2 N+O(s)$ or $s\log_2(N/s)+O(s)$ tests, depending on some implementation details [14, Theorems 1.2 and 1.3]. Here $s$ stands for the number of defectives out of $N$. However, in binary GT a positive result on a pool necessarily implies that at least one of the participants was defective. In our application to NN search, a pool $\boldsymbol{y}$ can produce $\boldsymbol{q}^{t}\boldsymbol{y}\geq\rho$ for query vector $\boldsymbol{q}$ even if none of the vectors that contributed to $\boldsymbol{y}$ produces a dot product with $\boldsymbol{q}$ that equals or exceeds $\rho$. Due to this important difference, the analysis from [14] does not apply in our situation. Furthermore, it is difficult to pin down a precise deterministic number of tests for our technique. Instead, we resort to a distributional assumption on each dot product $\boldsymbol{q}^{t}\boldsymbol{f_{i}}$. We assume that these dot products are independently distributed as per a truncated normalized exponential (TNE) distribution:

p(x;\lambda)=\begin{cases}\dfrac{\lambda e^{-\lambda x}}{1-e^{-\lambda}}, & \textrm{for }0\leq x\leq 1,\\ 0, & \textrm{otherwise,}\end{cases}\qquad(2)

This distribution is obtained by truncating the well-known exponential distribution to the $[0,1]$ range and normalizing it so that it integrates to one. The larger the value of $\lambda$, the more rapid the 'decay' in the probability of dot product values from 0 towards 1. Our empirical studies show that for softmax features this is a reasonable model for the dot product values commonly encountered in image retrieval queries, as can be seen in Fig. 1 for all datasets we experimented with. The best-fit $\lambda$ values for MIRFLICKR, ImageNet, IMDB-Wiki and InstaCities are respectively $34, 57, 10, 30$. The TNE model fit may not be perfect, but imperfect adherence to the TNE model does not change the key message conveyed in this section.
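The density of Eqn. 2 and the closed-form mean quoted in Sec. 5 can be checked numerically as follows (a small Python sketch; the midpoint-rule grid size is an arbitrary choice of ours):

```python
import math

def tne_pdf(x, lam):
    """Density of Eqn. 2: an exponential truncated to [0, 1], renormalized."""
    if 0.0 <= x <= 1.0:
        return lam * math.exp(-lam * x) / (1.0 - math.exp(-lam))
    return 0.0

def tne_mean(lam):
    # Closed-form mean of a TNE variable: 1/lambda + 1/(1 - e^lambda).
    return 1.0 / lam + 1.0 / (1.0 - math.exp(lam))
```

A midpoint-rule integration of tne_pdf over $[0,1]$ returns 1 up to quadrature error, and the numerical mean for $\lambda=34$ (the MIRFLICKR fit) is about 0.03, reflecting the sharp decay visible in Fig. 1.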

Figure 1: Normalized histograms (bar graphs in blue), and best-fit exponential distribution approximations (red curves), of dot products between feature vectors of query images (10,000 in number) and all images of the MIRFLICKR, ImageNet, IMDB-Wiki and InstaCities databases (left to right, top to bottom). Also see Fig. 3 of [13] for a plot of the negative logarithm of the histograms for clearer visualization of the smaller probability values.

Our aim now is to determine the probability that a pool $\boldsymbol{y}$ with some $L$ participating vectors will produce a dot product value $\boldsymbol{q}^{t}\boldsymbol{y}$ below a threshold $\rho$. Note again that such a pool gets pruned away in binary splitting, and therefore we would like this probability to be high. Now, the $L$ individual dot products (of $\boldsymbol{q}$ with the vectors that contributed to $\boldsymbol{y}$) are independent, and they are all distributed as per the TNE of Eqn. 2. The mean and variance of a TNE variable with parameter $\lambda$ are respectively $\left(1/\lambda+1/(1-e^{\lambda})\right)$ and $\left(1/\lambda^{2}-e^{\lambda}/(e^{\lambda}-1)^{2}\right)$. Applying the central limit theorem (CLT), we conclude that $\boldsymbol{q}^{t}\boldsymbol{y}$ is (approximately) Gaussian distributed with mean $\mu\triangleq L\left(1/\lambda+1/(1-e^{\lambda})\right)$ and variance $\sigma^{2}\triangleq L\left(1/\lambda^{2}-e^{\lambda}/(e^{\lambda}-1)^{2}\right)$. With this, we can easily compute $P(\boldsymbol{q}^{t}\boldsymbol{y}\leq\rho)$ for given values of $L$ and $\rho$. Our numerical experiments, illustrated in Fig. 6 of [13], reveal that this probability increases with $\lambda$ and decreases as $L$ increases. This tallies with intuition, and supports the hypothesis that more pools will be pruned away in a given round when there is a more rapid decrease in the probability of dot product values from 0 to 1.
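This computation can be sketched as follows (a Python sketch with our own function names; the standard normal CDF is expressed via the error function):

```python
import math

def tne_moments(lam):
    # Mean and variance of a TNE(lambda) variable (closed forms above).
    mean = 1.0 / lam + 1.0 / (1.0 - math.exp(lam))
    var = 1.0 / lam ** 2 - math.exp(lam) / (math.exp(lam) - 1.0) ** 2
    return mean, var

def prune_probability(L, lam, rho):
    """CLT approximation of P(q^T y <= rho) for a pool of L members whose
    individual dot products with q are i.i.d. TNE(lambda)."""
    m, v = tne_moments(lam)
    mu, sigma = L * m, math.sqrt(L * v)
    z = (rho - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Evaluating this over a grid of $\lambda$ and $L$ reproduces the monotonic trends described above: the pruning probability rises with $\lambda$ and falls with $L$.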

We are now equipped to determine the expected number of tests (dot products) to be computed per query, given such a TNE distribution for the dot products. Given a value of $\rho$, let $p_{L}(\rho)$ denote the probability that a pool with $L$ participants gets pruned away. For Our-Sum, for any $L\geq 6$, the probability $p_{L}(\rho)$ is computed using the TNE- and CLT-based method described earlier. For $L<6$, the Gaussian distribution can be replaced by a truncated Erlang distribution, since the sum of independent exponential random variables follows an Erlang distribution. Even in this case, $P(\boldsymbol{q}^{t}\boldsymbol{y}\leq\rho)$ increases with $\lambda$ and decreases with $L$.

In round 1, there is only one pool, and it contains all $N$ vectors. It gets pruned away with probability $p_{N}(\rho)$, for a given fixed threshold $\rho$ and fixed $\lambda$. The expected number of pools in round 2 is $Q_{2}\triangleq 2(1-p_{N}(\rho))$, each of which is pruned away with probability $p_{N/2}(\rho)$. The factor 2 arises because each surviving pool is split into two before being sent to the next round of binary splitting. The expected number of pools in round 3 is $Q_{3}\triangleq 2(1-p_{N/2}(\rho))Q_{2}$. Likewise, in the $k$th round, the expected number of pools is $Q_{k}=2(1-p_{N/2^{k-2}}(\rho))Q_{k-1}$. Hence the expected number of tests per query in our approach is given by:

E(\rho)\triangleq 1+\frac{1}{2}\sum_{i=2}^{\lceil\log_{2}N+1\rceil}Q_{i},\qquad(3)

as there are at most $\lceil\log_{2}N+1\rceil$ rounds. The factor $\frac{1}{2}$ is due to the fact that the number of tests (i.e., dot product computations) is always half the number of pools (see the second paragraph of Sec. 2). The value of $E(\rho)$ decreases with $\lambda$, as $p_{N}(\rho), p_{N/2}(\rho), \ldots, p_{N/2^{k-2}}(\rho)$ all increase with $\lambda$. A similar analysis for Our-Max is presented in [13, Sec. 10].
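Eqn. 3 can be evaluated numerically as follows (a Python sketch; for simplicity it uses the CLT-based prune probability at every pool size, whereas the analysis above switches to a truncated Erlang distribution for $L<6$, so the smallest-pool terms are cruder approximations):

```python
import math

def prune_probability(L, lam, rho):
    # CLT approximation of p_L(rho) from the TNE mean and variance.
    mean = 1.0 / lam + 1.0 / (1.0 - math.exp(lam))
    var = 1.0 / lam ** 2 - math.exp(lam) / (math.exp(lam) - 1.0) ** 2
    z = (rho - L * mean) / math.sqrt(L * var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_tests(N, lam, rho):
    """E(rho) = 1 + (1/2) * sum_{i>=2} Q_i, with Q_k as in Sec. 5."""
    E, Q, L = 1.0, 1.0, float(N)
    for _ in range(math.ceil(math.log2(N))):   # rounds 2 .. ceil(log2 N) + 1
        Q = 2.0 * Q * (1.0 - prune_probability(L, lam, rho))
        E += 0.5 * Q
        L /= 2.0
    return E
```

Consistent with the discussion above, a larger $\lambda$ (faster decay of the dot products) yields a smaller expected number of tests.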

Database
MIRFLICKR, $N=1$M $\rho=0.7$ $\rho=0.8$ $\rho=0.9$
Empirical (Sum) 51690 44194 37788
Empirical (Max) 75782 60752 47531
Theoretical (Sum) 63838 57553 50094
Theoretical U.B. (Max) 371672 270408 195140
ImageNet, $N=1$M $\rho=0.7$ $\rho=0.8$ $\rho=0.9$
Empirical (Sum) 13852 12339 10801
Empirical (Max) 6946 5713 4551
Theoretical (Sum) 34524 32603 31345
Theoretical U.B. (Max) 14160 6267 2775
IMDB-Wiki, $N=500$K $\rho=0.7$ $\rho=0.8$ $\rho=0.9$
Empirical (Sum) 160020 141290 125050
Empirical (Max) 240947 197370 158284
Theoretical (Sum) 113600 99552 87835
Theoretical U.B. (Max) 1426259 1336543 1249448
InstaCities, $N=1$M $\rho=0.7$ $\rho=0.8$ $\rho=0.9$
Empirical (Sum) 72729 62741 54125
Empirical (Max) 105122 85188 67441
Theoretical (Sum) 71021 63453 57944
Theoretical U.B. (Max) 566498 445181 346927
Table 3: Empirically observed average and theoretically predicted expected number of dot products (Eqn. 3) for queries with $\rho\in\{0.7,0.8,0.9\}$ for different datasets. See the explanation at the end of Sec. 5. Note that 'Sum' and 'Max' denote the algorithms Our-Sum and Our-Max respectively.

Our theoretical analysis depends on $\lambda$, but note that our algorithm is completely agnostic to $\lambda$ and does not use it in any manner. For values of $\rho\in\{0.7,0.8,0.9\}$, we compared the theoretically predicted average number of dot products ($E(\rho)$) to the experimentally observed number for all datasets. These are presented in Tab. 3, where we observe that for MIRFLICKR, ImageNet and InstaCities, the theoretical number for Our-Sum is generally greater than the empirical one. This is because the dot products for these datasets decay slightly faster than predicted by a TNE distribution, as can be seen in Fig. 1. Conversely, for IMDB-Wiki, the theoretical number for Our-Sum is less than the empirical one, because the dot products for this dataset decay more slowly than predicted by a TNE distribution. For Our-Max, the difference between empirical and theoretical values is larger, as explained in [13].

6 Conclusion

We have presented a simple and intuitive multi-round group testing approach for range-based near neighbor search. Our method outperforms exhaustive search by a large factor in terms of query time, without losing accuracy, on datasets where query vectors are dissimilar to the vast majority of vectors in the gallery. The larger the percentage of dissimilar vectors, the greater the speedup our algorithm produces. Our technique has no tunable parameters. It is also easily applicable to situations where new vectors are dynamically added to an existing gallery, without requiring any expensive updates. Apart from this, our technique also has advantages in terms of accuracy of query results (especially recall) over many recent, state-of-the-art methods. A significant limitation of our method is that it currently supports only cosine distances and is applicable only to databases with a sharp decay in the feature similarity distribution, as is the case with softmax features. At present, a pool may produce a large dot product value with a query vector even though many or even all members of the pool are dissimilar to the query vector. Developing techniques to reduce the frequency of such situations, possibly using various LSH techniques, as well as testing on billion-scale datasets, are important avenues for future work. The implementation of our algorithms can be found at https://github.com/Harsh-Sensei/GroupTestingNN.

Acknowledgement: The authors thank the Amazon IIT Bombay AI-ML Initiative for support for the work in this paper.

References

  • [1] https://github.com/facebookresearch/faiss
  • [2] https://www.kaggle.com/competitions/imagenet-object-localization-challenge/data
  • [3] https://github.com/JoshEngels/FLINNG
  • [4] https://github.com/FALCONN-LIB/FALCONN
  • [5] https://github.com/NinhPham/FalconnLSF
  • [6] https://github.com/google-research/google-research/tree/master/scann
  • [7] FLANN - fast library for approximate nearest neighbors: wbia-pyflann 4.0.4. https://pypi.org/project/wbia-pyflann/
  • [8] Generalized binary splitting algorithm. https://en.wikipedia.org/wiki/Group_testing#Generalised_binary-splitting_algorithm
  • [9] Imdb-wiki – 500k+ face images with age and gender labels. https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/
  • [10] The InstaCities1M Dataset. https://gombru.github.io/2018/08/01/InstaCities1M/
  • [11] MIRFLICKR download. https://press.liacs.nl/mirflickr/mirdownload.html
  • [12] Oxford and paris buildings dataset. https://www.kaggle.com/datasets/skylord/oxbuildings
  • [13] Supplemental material. Uploaded on conference portal
  • [14] Aldridge, M., Johnson, O., Scarlett, J.: Group testing: an information theory perspective. Foundations and Trends in Communications and Information Theory 15(3-4), 196–392 (2019)
  • [15] Andoni, A., Razenshteyn, I.: Optimal data-dependent hashing for approximate near neighbors. In: Proceedings of the forty-seventh annual ACM symposium on Theory of computing. pp. 793–801 (2015)
  • [16] Babenko, A., Lempitsky, V.: The inverted multi-index. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(6), 1247–1260 (2014)
  • [17] Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13(7), 422–426 (1970)
  • [18] Candès, E.J., Wakin, M.B.: An introduction to compressive sampling. IEEE signal processing magazine 25(2), 21–30 (2008)
  • [19] Davenport, M., Duarte, M., Eldar, Y., Kutyniok, G.: Introduction to compressed sensing. (2012)
  • [20] Dong, Y., Indyk, P., Razenshteyn, I., Wagner, T.: Learning space partitions for nearest neighbor search. In: ICLR (2019)
  • [21] Dorfman, R.: The detection of defective members of large populations. The Annals of mathematical statistics 14(4), 436–440 (1943)
  • [22] Engels, J., Coleman, B., Shrivastava, A.: Practical near neighbor search via group testing. In: Advances in Neural Information Processing Systems. vol. 34, pp. 9950–9962 (2021)
  • [23] Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop. pp. 178–178 (2004). https://doi.org/10.1109/CVPR.2004.383
  • [24] Fu, C., Xiang, C., Wang, C., Cai, D.: Fast approximate nearest neighbor search with the navigating spreading-out graph 12(5), 461–474 (2019)
  • [25] Ghosh, S., Agarwal, R., Rehan, M.A., Pathak, S., Agarwal, P., Gupta, Y., Consul, S., Gupta, N., Goenka, R., Rajwade, A., Gopalkrishnan, M.: A compressed sensing approach to pooled RT-PCR testing for COVID-19 detection. IEEE Open Journal of Signal Processing 2, 248–264 (2021)
  • [26] Guo, R., Sun, P., Lindgren, E., Geng, Q., Simcha, D., Chern, F., Kumar, S.: Accelerating large-scale inference with anisotropic vector quantization. In: International Conference on Machine Learning. pp. 3887–3896 (2020)
  • [27] Hajebi, K., Abbasi-Yadkori, Y., Shahbazi, H., Zhang, H.: Fast approximate nearest-neighbor search with k-nearest neighbor graph. In: IJCAI. pp. 1312–1317 (2011)
  • [28] Hastie, T., Tibshirani, R., Wainwright, M.: Statistical learning with sparsity: the lasso and generalizations. CRC press (2015)
  • [29] Hwang, F.: A method for detecting all defective members in a population by group testing. Journal of the American Statistical Association 67(339), 605–608 (1972)
  • [30] Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC. p. 604–613 (1998)
  • [31] Iscen, A., Chum, O.: Local orthogonal-group testing. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 449–465 (2018)
  • [32] Kirsch, A., Mitzenmacher, M.: Distance-sensitive bloom filters. In: Proceedings of the Eighth Workshop on Algorithm Engineering and Experiments (ALENEX). pp. 41–50 (2006)
  • [33] Li, C.H.: A sequential method for screening experimental variables. Journal of the American Statistical Association 57(298), 455–477
  • [34] Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K.: Multi-probe LSH: efficient indexing for high-dimensional similarity search. In: Proceedings of the 33rd international conference on Very large data bases. pp. 950–961 (2007)
  • [35] Maji, S., Bose, S.: CBIR using features derived by deep learning. ACM/IMS Trans. Data Sci. 2(3) (Sep 2021). https://doi.org/10.1145/3470568
  • [36] Malkov, Y., Yashunin, D.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(4), 824–836 (2018)
  • [37] Muja, M., Lowe, D.: Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(11), 2227–2240 (2014)
  • [38] Ortega-Binderberger, M.: Corel Image Features. UCI Machine Learning Repository (1999), DOI: https://doi.org/10.24432/C5K599
  • [39] Pham, N., Liu, T.: Falconn++: A locality-sensitive filtering approach for approximate nearest neighbor search. In: Advances in Neural Information Processing Systems. vol. 35, pp. 31186–31198 (2022)
  • [40] Ram, P., Sinha, K.: Revisiting kd-tree for nearest neighbor search. In: SIGKDD. pp. 1378–1388 (2019)
  • [41] Shi, M., Furon, T., Jégou, H.: A group testing framework for similarity search in high-dimensional spaces. In: ACMMM. pp. 407–416 (2014)
  • [42] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
  • [43] Vidyasagar, M.: An introduction to compressed sensing. SIAM (2019)
  • [44] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR. pp. 586–595 (2018)

Supplementary Material

7 Contents

This supplemental material contains the following:

  1. A figure showing the negative logarithm of the normalized histograms of dot products between feature vectors of query images and those of all gallery images from each of the four datasets we experimented with in the main paper. These are shown in Fig. 3. Compare with Fig. 1 of the main paper, where the normalized histograms are shown.

  2. A comparison of the techniques Our-Sum and Our-Max proposed in this paper to recent group-testing-based NN search methods such as [22, 41, 31], presented in Sec. 8.

  3. A discussion of the limitations of applying recent advances from the compressed sensing literature to the NN search problem, presented in Sec. 9.

  4. Statistics for the average number of points pruned away in each round of binary splitting in our method, on each of the four databases, presented in Fig. 4 for Our-Sum and Fig. 5 for Our-Max. Notice that for both proposed algorithms (Our-Sum and Our-Max) there is almost no pruning during the initial rounds. As the adaptive tests continue, negative tests are encountered in later rounds, and hence pools get pruned away. Surprisingly, most negative tests for Our-Max are encountered in the last round, in contrast to Our-Sum.

  5. Numerical experiments, illustrated in Fig. 6, revealing that $P(\boldsymbol{q}^{t}\boldsymbol{y}\leq\rho)$ increases with $\lambda$ and decreases as $L$ increases. This tallies with intuition, and supports the hypothesis that more pools will be pruned away in a given round when there is a more rapid decrease in the probability of dot product values from 0 to 1.

  6. A theoretical analysis of the Our-Max algorithm, presented in Sec. 10.

  7. The hyper-parameters for the various competing algorithms, presented in Sec. 11.

  8. Details of the augmentations applied to images for generating queries in the streaming setting, presented in Sec. 12 and Fig. 8.

  9. A comparison of softmax features (on top of VGGNet features) versus VGGNet features in terms of retrieval precision and recall, presented in Table 4 and Sec. 13.

  10. Experiments with a low similarity value, i.e. $\rho=0.3$, and experiments on a dataset with 12 million points, presented in Sec. 14.

Figure 2: Schematic of our GT algorithm for NN search
Figure 3: Negative logarithm of normalized histograms of dot products between feature vectors of query images (10K in number) and those of all gallery images of MIRFLICKR, ImageNet, IMDB-Wiki and InstaCities (left to right, top to bottom). Compare to Fig. 1 of the main paper.
Figure 4: Histograms of the number of pools pruned in every round of binary splitting for the Our-Sum method, for MIRFLICKR, ImageNet, IMDB-Wiki and InstaCities (left to right, top to bottom), all for $\rho\geq 0.7$.
Figure 5: Histograms of the number of pools pruned in every round of binary splitting for the Our-Max technique, for MIRFLICKR, ImageNet, IMDB-Wiki and InstaCities (left to right, top to bottom), all for $\rho\geq 0.7$.
Figure 6: The probability that the dot product of a query vector $\boldsymbol{q}$ with a pool created from $L$ participating vectors falls below $\rho=0.7$, as a function of $\lambda$, for $L\in\{8,16,32\}$.

8 Comparison to other Group Testing Methods for NN Search

The oldest work on GT for NN search was presented in [41]. In this method, the original data vectors are represented as a matrix $\boldsymbol{F}\in\mathbb{R}^{N\times d}$. The pools are represented as a matrix $\boldsymbol{Y}\in\mathbb{R}^{m\times d}$ obtained by pre-multiplying $\boldsymbol{F}$ by a randomly generated balanced binary matrix $\boldsymbol{A}$ of size $m\times N$, where $m<N$. Given a query vector $\boldsymbol{q}$, its similarity with the $j$th pool is computed, yielding a score $v_{yj}=\boldsymbol{q}^{t}\boldsymbol{Y^{j}}$, where $\boldsymbol{Y^{j}}$ is the $j$th group vector (i.e., the $j$th row of $\boldsymbol{Y}$). This is repeated for every $j\in[m]$. Let $\mathcal{P}_{i}$ be the set of pools to which the $i$th vector $\boldsymbol{f_{i}}$ belongs. Then a likelihood score $L_{i}\triangleq\sum_{j\in\mathcal{P}_{i}}v_{yj}$ is computed for every vector. The vector from $\boldsymbol{F}$ with the largest likelihood score (denote it $\boldsymbol{f_{k0}}$) is considered to be one of the nearest neighbors of $\boldsymbol{q}$. Next, the value $\boldsymbol{q}^{t}\boldsymbol{f_{k0}}$ is computed, and the pool-level similarity scores of all pools containing $\boldsymbol{f_{k0}}$ are updated to yield new scores of the form $v_{yj}=v_{yj}-\boldsymbol{q}^{t}\boldsymbol{f_{k0}}$. Thereafter, all likelihood scores are updated. This process is called back-propagation in [41]. Again, the vector from $\boldsymbol{F}$ with the highest likelihood is identified; this procedure is repeated some $R$ times, and finally a sort operation is performed to rank the retrieved vectors. In [41], it is argued that the time complexity of this procedure is $O(d(m+R))$. However, this is an approximate search method, and there is no guarantee that the $R$ neighbors thus derived are indeed the nearest ones. As a result, a large number of false positives may be generated by this method.
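The decoder just described can be sketched as follows (a simplified Python sketch with our own variable names; the final sort of the $R$ retrieved candidates is omitted):

```python
def pooled_likelihood_search(F, q, A, R):
    """Sketch of the decoder of [41]: pool scores, likelihoods, and
    'back-propagation' of already-reported candidates. A is an m x N
    binary pooling matrix, F the N x d gallery."""
    m, N = len(A), len(F)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # Pool vectors Y = A F, and pool-level scores v_y = Y q.
    Y = [[sum(A[j][i] * F[i][k] for i in range(N)) for k in range(len(F[0]))]
         for j in range(m)]
    v_y = [dot(Y[j], q) for j in range(m)]
    pools_of = [[j for j in range(m) if A[j][i]] for i in range(N)]
    found = []
    for _ in range(R):
        # Likelihood L_i = sum of scores of the pools containing f_i.
        L = [sum(v_y[j] for j in pools_of[i]) if i not in found else float("-inf")
             for i in range(N)]
        k0 = max(range(N), key=lambda i: L[i])
        found.append(k0)
        s = dot(F[k0], q)
        for j in pools_of[k0]:   # back-propagation: subtract k0's contribution
            v_y[j] -= s
    return found
```

On a toy gallery where one vector equals the query, the first reported candidate is the true nearest neighbor; but a vector that merely shares a pool with a strong match can inherit a high likelihood, illustrating how false positives arise.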

The work in [31] clusters the collection $\mathcal{D}$ into disjoint groups such that the members of a group are as orthogonal to each other as possible. As a result, the similarity measure $\boldsymbol{q}^{t}\boldsymbol{y_{j}}$ between $\boldsymbol{q}$ and a group vector $\boldsymbol{y_{j}}$ is almost completely dominated by the similarity between $\boldsymbol{q}$ and exactly one member of the group. A careful sparse coding and decoder correction step is also used, and the sparsity of the codes plays a key role in the speedup achieved by this method. However, this method is again approximate, and it relies heavily on the near-orthogonality of each group.

The work in [22] randomly distributes the collection $\mathcal{D}$ of $N$ vectors over some $B$ cells. This is independently repeated $R$ times, creating a grid of $B\times R$ cells. Each cell holds a collection of participating members $M_{r,b}$ and a corresponding group test $C_{r,b}$. The group test is essentially a binary classifier which determines whether $M_{r,b}$ contains at least one member similar to the query vector $\boldsymbol{q}$. This is an approximate query, with a true positive rate $p$ and false positive rate $q$. It is implemented via a distance-sensitive bloom filter [17, 32], which allows for very fast querying. The bloom filter is constructed with $m$ binary arrays, with some $L$ concatenated hash functions used in each array. For all positive tests $C_{r,b}$, the union of their member sets is computed. This is repeated $R$ times to create $R$ candidate sets, and the intersection of these $R$ sets is taken. Each union can contain many false positives, but this intersection filters out non-members effectively. Precise bounds on $p,q$ are derived in [22], and the time complexity of a single query is proved to be $O(N^{1/2+\gamma}\log^{3}N)$, where $\gamma\triangleq\log s_{|K|}/(\log s_{|K|+1}-\log s_{|K|})$ and $s_{|K|}$ stands for the similarity between $\boldsymbol{q}$ and the $K$th most similar vector in $\mathcal{D}$. The query time is provably sub-linear for queries for which $\gamma<1/2$. This is obeyed by data distributed as per a Gaussian mixture model with components that have well spread out mean vectors and small variances. But for many distributions this condition can be violated, leading to arbitrarily high query times in the worst case. This method, too, produces a large number of false negatives and false positives.

Compared to the three aforementioned techniques, our method is exact, with the same accuracy as exhaustive search. The methods [41, 31, 22] require hyper-parameters ($R$ for [41], sparse coding parameters for [31], and $B,R,m,L$ for [22]) for which there is no clear data-driven selection procedure. Our method, however, requires no parameter tuning. In experiments, we have observed a more than ten-fold speed-up in querying time with our method as compared to exhaustive search on some datasets. Like [41], and unlike [22, 31], our method does have a large memory requirement, as we require all pools to be in memory. Some techniques such as [22] require the nearest neighbors to be significantly more similar to the query than all other members of $\mathcal{D}$ (call this condition $\mathscr{C}1$). Our method has no such requirement. However, our method performs more efficiently for queries for which a large number of similarity values are small and only a minority lie above the threshold $\rho$ (call this condition $\mathscr{C}2$). In our experimental study on diverse datasets, we have observed that $\mathscr{C}2$ always holds. On the other hand, $\mathscr{C}1$ does not hold in a large number of cases, as also reported in [22, Sec. 5.3].

9 A Discussion on using Compressed Sensing Techniques for NN Search

Sec. 3 of the main paper describes three recent papers [41, 31, 22] which apply the principles of group testing to near neighbor search, and compares them to our technique. In this section, we describe our attempts to apply the latest developments from the compressed sensing (CS) literature to this problem. Note that CS and GT are allied problems, and hence it makes sense to comment on the application of CS to near neighbor search.

Consider that the original data vectors are represented as a matrix 𝑭 ∈ ℝ^{N×d}. The pools are represented as a matrix 𝒀 ∈ ℝ^{m×d}, obtained by pre-multiplying 𝑭 by a randomly generated balanced binary matrix 𝑨 of size m×N, where m < N. We considered the relation 𝒗_y = 𝑨𝒗_x, where 𝒗_y ∈ ℝ^m contains the dot products of the query vector 𝒒 with each of the pool vectors in 𝒀, i.e. 𝒗_y = 𝒀𝒒. Likewise, 𝒗_x ∈ ℝ^N contains the dot products of 𝒒 with each of the vectors in the collection 𝒟, i.e. 𝒗_x = 𝑭𝒒. The aim is to efficiently recover the largest elements of 𝒗_x from 𝑨 and 𝒗_y. Since algorithms such as Lasso [28] are iterative in nature, we considered efficient non-iterative algorithms from the recent literature for the recovery of nearly sparse vectors [43, Alg. 8.3], known as expander recovery algorithms. Theorem 8.4 of [43] guarantees recovery of the k largest elements of 𝒗_x using this algorithm, but the bound on the recovery error is too high to be useful, i.e. ‖𝒗_x^{(k)} − 𝒗_{x,est}‖_∞ ≤ δ ≜ ‖𝒗_x − 𝒗_x^{(k)}‖_1. Here, the vector 𝒗_x^{(k)} is constructed as follows: the k largest elements of 𝒗_x are copied into 𝒗_x^{(k)} at the corresponding indices, and all other elements of 𝒗_x^{(k)} are set to 0. Clearly, for most situations, δ will be too large to be useful.
Indeed, in some of our numerical simulations, we observed δ to be several times larger than the sum of the k largest elements of 𝒗_x, rendering Alg. 8.3 of [43] ineffective for our application. Moreover, for Theorem 8.4 of [43] to hold, the matrix 𝑨 requires each item in 𝒟 to participate in a very large number of pools, making the procedure inefficient for our application. Given these restrictive conditions, we did not pursue this approach further.
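The measurement relation underlying this attempt can be sketched as follows. For simplicity we use a plain random binary matrix rather than a balanced/expander one; toy sizes and variable names are our own. By linearity of the dot product, the pooled similarities satisfy 𝒗_y = 𝑨𝒗_x exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 64, 16, 24                 # toy sizes; in practice N is very large
F = rng.random((N, d))               # database vectors as rows of F
A = rng.integers(0, 2, size=(m, N)).astype(float)  # random binary pooling matrix
Y = A @ F                            # pool vectors (m x d)

q = rng.random(d)                    # query vector
v_x = F @ q                          # all N query-database dot products
v_y = Y @ q                          # only m pooled dot products

# By linearity of the dot product, v_y = A v_x holds exactly, so recovering
# the largest entries of v_x from (A, v_y) is a sparse-recovery problem.
assert np.allclose(v_y, A @ v_x)
```

The difficulty noted above is that 𝒗_x is typically far from sparse, so the tail ‖𝒗_x − 𝒗_x^{(k)}‖_1, and hence the error bound δ, is large.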

10 Theoretical Analysis for Our-Max

In Sec. 4 of the main paper, we presented a theoretical analysis of the expected number of dot product computations required for Our-Sum, the binary splitting procedure that uses summation for pool creation. A similar analysis for Our-Max using solely the distributional assumption on dot products (truncated normalized exponential, or TNE) is challenging. Consider a pool vector 𝒚 = Σ_{i=1}^{K} 𝒇_i created by the summation of K participating vectors {𝒇_i}_{i=1}^{K}. In Our-Sum, the dot product between the query vector 𝒒 and the pool vector 𝒚 is exactly equal to the sum of the dot products between 𝒒 and the individual members of the pool. Hence an analysis via the central limit theorem or the truncated Erlang distribution is possible, as explained in Sec. 5 of the main paper. For Our-Max, the pool vector 𝒚 is instead constructed as follows: ∀j ∈ {1, 2, ..., d}, y_j = max_{i∈{1,2,...,K}} f_{i,j}, where f_{i,j} stands for the jth element of vector 𝒇_i. The dot product 𝒒ᵗ𝒚 is an upper bound on max_{i∈{1,2,...,K}} 𝒒ᵗ𝒇_i (assuming that all values in 𝒒 and in each 𝒇_i are non-negative). One can of course invoke the TNE assumption on the individual dot products 𝒒ᵗ𝒇_i and use analytical expressions for the maximum of TNE random variables. However, the fact that 𝒒ᵗ𝒚 is only an upper bound on max_{i∈{1,2,...,K}} 𝒒ᵗ𝒇_i complicates the analysis compared to the case of Our-Sum.
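The upper-bound property that complicates this analysis is easy to verify numerically. A minimal sketch (our own illustrative code), assuming non-negative features:

```python
import numpy as np

def max_pool(vectors):
    """Our-Max pool vector: elementwise maximum over the participating vectors."""
    return np.max(vectors, axis=0)

rng = np.random.default_rng(0)
K, d = 8, 16
f = rng.random((K, d))    # non-negative member vectors f_1, ..., f_K
q = rng.random(d)         # non-negative query vector

y = max_pool(f)
# Since y >= f_i elementwise and q >= 0, q.y upper-bounds every q.f_i:
assert q @ y >= np.max(f @ q)
```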
Hence, it is difficult to determine the expected number of dot products per query for Our-Max, but it is possible to derive an upper bound on this quantity, as described below.

Let X_i ≜ 𝒒ᵗ𝒇_i. Let D_P denote the similarity of the query vector 𝒒 with the pool vector 𝒚, and let P denote the set of vectors that participated in that pool. We can bound D_P with a constant c such that D_P ≤ c·max_{i∈P}(X_i) (see later in this section regarding the value of c). Let n := |P|. Using this bound, we have the following relation:

p_n(ρ) = P[D_P < ρ] ≥ P[c·max_{i∈P}(X_i) < ρ] ≥ [F_X(ρ/c)]^n,   (4)

where the last inequality is based on order statistics, and where F_X(·) denotes the CDF of the TNE. From the main paper, recall that Q_k stands for the expected number of pools in round k. Using the recursive relation for Q_k in terms of Q_{k−1}, we have the following:

Q_k ≤ 2(1 − [F_X(ρ/c)]^{N/2^{k−2}}) Q_{k−1}.   (5)

Using this, an upper bound on E(ρ) can be obtained in the following manner:

E(ρ) ≜ 1 + Σ_{i=2}^{⌈log₂N+1⌉} Q_i.   (6)

Notice that there is no factor of 1/2 here (in contrast to E(ρ) for Our-Sum). This is because, even after splitting, the similarity must be computed for both resulting pools, unlike in the Our-Sum algorithm. The constant c can theoretically be as large as the dimension of the dataset (that is, d = 1000 for the datasets used here), giving a very loose upper bound. However, to obtain a sharper upper bound, we find the distribution of c for each pool in a given dataset and use the 90th percentile value (denoted by c_90) to evaluate the bound. Details for this are provided next.
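Given a fitted TNE CDF, the bound in Eqs. (5) and (6) can be evaluated numerically. The sketch below is our own illustration: it assumes a truncated-exponential form for F_X with a hypothetical rate parameter lam, and hypothetical values of N, ρ and c:

```python
import numpy as np

def tne_cdf(x, lam):
    """Assumed CDF of the TNE on [0, 1]: a truncated exponential with rate lam."""
    x = np.clip(x, 0.0, 1.0)
    return (1.0 - np.exp(-lam * x)) / (1.0 - np.exp(-lam))

def upper_bound_E(N, rho, c, lam):
    """Evaluate Q_k <= 2(1 - F_X(rho/c)^(N/2^(k-2))) Q_(k-1) with Q_1 = 1,
    returning the bound E(rho) = 1 + sum over k = 2, ..., ceil(log2 N) + 1."""
    F = tne_cdf(rho / c, lam)
    Q_prev, E = 1.0, 1.0              # the root pool is always tested
    for k in range(2, int(np.ceil(np.log2(N))) + 2):
        n = N / 2 ** (k - 2)          # size of pools entering round k
        Q_k = 2.0 * (1.0 - F ** n) * Q_prev
        E += Q_k
        Q_prev = Q_k
    return E

# Hypothetical setting: N = 2^20 items, rho = 0.8, c_90 = 10, rate lam = 50.
print(upper_bound_E(N=2 ** 20, rho=0.8, c=10, lam=50.0))
```

With these hypothetical numbers the bound comes out well below the 2N − 1 dot products needed to test every pool, mirroring the pruning behaviour analysed above.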

Distribution of ratios c = D_P / max_{i∈P}(X_i): Any upper bound on the expected number of dot products required by Our-Max depends on the constant c. Although c can theoretically be as large as the dimension of the database vectors, the distribution of c for each dataset studied, and its 90th percentile value c_90, can be used to obtain sharper upper bounds on E(ρ). Fig. 7 shows the distribution of c for the datasets used in our experiments. Below, we list the c_90 values used in evaluating the theoretical upper bound reported in Table 2 of the main paper.

  • MIRFLICKR: c_90 ≈ 10.

  • ImageNet: c_90 ≈ 7.

  • IMDB-Wiki: c_90 ≈ 10.

  • InstaCities: c_90 ≈ 11.

Clearly, these values are significantly lower than d, which facilitates better upper bounds, albeit with some probability of error.
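The c_90 values above can be estimated with a simple Monte-Carlo procedure. The sketch below is our own illustrative code with a hypothetical sampling scheme (one random max-pool per query):

```python
import numpy as np

def c90_estimate(D, queries, pool_size, seed=0):
    """Empirical 90th percentile of c = D_P / max_{i in P}(X_i) over random
    max-pools, drawing one pool per query (hypothetical sampling scheme)."""
    rng = np.random.default_rng(seed)
    ratios = []
    for q in queries:
        idx = rng.choice(len(D), size=pool_size, replace=False)
        members = D[idx]
        y = members.max(axis=0)                 # Our-Max pool vector
        ratios.append((q @ y) / np.max(members @ q))
    return np.percentile(ratios, 90)
```

For non-negative features the ratio is at least 1 and at most the pool size, so the estimate stays far below d for moderately sized pools.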

Figure 7: Histograms showing the distribution of the ratio c = D_P / max_{i∈P}(X_i) for pools in the datasets MIRFLICKR, ImageNet, IMDB-Wiki and InstaCities (left to right, top to bottom). The histograms are computed over ∼1000 queries, with 100 bins per histogram.

11 Hyper-parameters for the algorithms

To arrive at comparable precision and recall values for the competing algorithms, we performed experiments with many hyper-parameter settings for each algorithm. Below are the hyper-parameters used in the experiments with the competing methods. For each algorithm, we chose the setting that gives high recall without significantly increasing the query time; other settings may yield higher recall, but at the cost of a disproportionate increase in query time. Both the streaming and non-streaming settings use the same hyperparameters for each algorithm, except Flann-R (details mentioned below).

  1. Flinng [22]: For each dataset, we ran this code with hyper-parameters B ∈ {2^12, 2^13, …, 2^18, 2^19}, m ∈ {2^2, 2^3, …, 2^10}, R ∈ {2, 3, 4} and L = 14, as recommended in the suppl. mat. of [22] (see Sec. 3 of the main paper for their precise meaning), and chose the result that yielded the best recall.

  2. Falconn [15]: Number of tables ∈ {100, 150, 200, 250, 300}, number of probes ∈ {40, 50, 70, 100}. Parameters used for reporting: (250, 70), respectively.

  3. Ivf from the Faiss package [1, 16]: Number of lists ∈ {32, 64, 128}, number of probes ∈ {2, 16}. Parameters used for reporting: (32, 16), respectively.

  4. Hnsw from the Faiss package [1]: M = 32, efConstruction ∈ {32, 64}, efSearch = 2 × number of neighbors to retrieve, i.e. K. Parameters used for reporting: (32, 2), respectively.

  5. Scann [6, 26]: Number of leaves ∈ {1, 2} × √(number of elements in the dataset), number of leaves to search ∈ {1/2, 1/4, 1/8} × number of leaves, reorder number of neighbors ∈ {4, 8, 16} × K. Parameters used for reporting: (2, 1/4, 16), respectively.

  6. Flann [37, 7]: We used AutotunedIndexParams with a high target precision (0.85) and other parameters set to their default values. For the IMDB-Wiki dataset, we used KDTreeIndexParams(32) to build the index with 32 parallel kd-trees, and SearchParams(128) during the search phase, because AutotunedIndexParams took a long time (more than 20 hours) to build the index on IMDB-Wiki. For the streaming setting, due to the excessive index build time of AutotunedIndexParams, we used KDTreeIndexParams(16) to build the index with 16 parallel kd-trees, and SearchParams(256).

  7. Falconn++ [5, 39]: We used the parameters suggested in the examples of Falconn++: number of random vectors = 256, number of tables = 350, α = 0.01, index probes (iProbes) = 4, and query probes qProbes = min(40k, 2·max(k among all queries)). Since we have a different value of k for each query, which can be as large as N/10, we cap qProbes at 2·max(k among all queries) to prevent it from exceeding the size of the dataset.

ImageNet
Precision ↑ | Recall ↑
Feature↓, ρ→ | 0.6 0.7 0.8 0.9 | 0.6 0.7 0.8 0.9
Softmax | 0.70 0.73 0.76 0.81 | 0.74 0.70 0.66 0.60
Penultimate | 0.66 0.85 0.96 0.99 | 0.20 0.07 0.015 0.001

ImageDBCorel
Precision ↑ | Recall ↑
Feature↓, ρ→ | 0.6 0.7 0.8 0.9 | 0.6 0.7 0.8 0.9
Softmax | 0.95 0.96 0.97 0.98 | 0.32 0.27 0.23 0.18
Penultimate | 0.99 1 1 1 | 0.29 0.11 0.028 0.01

ImageDBCaltech
Precision ↑ | Recall ↑
Feature↓, ρ→ | 0.6 0.7 0.8 0.9 | 0.6 0.7 0.8 0.9
Softmax | 0.58 0.66 0.73 0.82 | 0.32 0.29 0.25 0.21
Penultimate | 0.89 0.97 0.99 0.99 | 0.19 0.08 0.03 0.02

Table 4: Comparison of retrieval accuracy (precision and recall) using cosine distance with (i) softmax features of VGGNet (1000-dimensional) and (ii) penultimate-layer VGGNet features (4096-dimensional), for cosine distance thresholds ρ ∈ {0.6, 0.7, 0.8, 0.9}. Note the higher recall of the softmax features.

12 Image Augmentations

To create queries suitable for the streaming environment for plagiarism detection, images from the database underwent specially tailored augmentations. The augmentations introduce artifacts that are somewhat subtle and do not alter the image content very significantly, but are designed to trick the detection system: Gaussian blur (with a 3×3 kernel), color jitter (in hue, saturation and value), horizontal flip, downscaling, and image rotation (by 3 degrees). Figure 8 shows an example of these augmentations.
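A few of these augmentations can be sketched with NumPy alone (our own illustrative code; a grayscale image is assumed for brevity, and color jitter and the small-angle rotation are omitted):

```python
import numpy as np

def gaussian_blur3(img):
    """3x3 Gaussian blur via a separable [1, 2, 1]/4 kernel, edges replicated."""
    k = np.array([0.25, 0.5, 0.25])
    p = np.pad(img, 1, mode='edge')
    rows = k[0] * p[:-2, :] + k[1] * p[1:-1, :] + k[2] * p[2:, :]
    return k[0] * rows[:, :-2] + k[1] * rows[:, 1:-1] + k[2] * rows[:, 2:]

def augment(img, rng):
    """Apply a random subset of the plagiarism-style augmentations."""
    if rng.random() < 0.5:
        img = np.fliplr(img)          # horizontal flip
    if rng.random() < 0.5:
        img = gaussian_blur3(img)     # subtle 3x3 Gaussian blur
    if rng.random() < 0.5:
        img = img[::2, ::2]           # 2x downscaling by subsampling
    return img
```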

Figure 8: Example of augmentations performed on an image from the ImageNet dataset.

13 Comparison of VGG16 features

In this section, we compare features extracted from the penultimate layer of VGG16 (4096-dimensional) with features extracted after the last layer (i.e., after applying the softmax function, yielding a 1000-dimensional feature vector, as there are 1000 classes). For comparison, we use a setting akin to [35], in which database images similar to a given query image are retrieved. However, instead of retrieving a fixed number k of images, we perform range-based retrieval with different similarity thresholds ρ ∈ {0.6, 0.7, 0.8, 0.9} on ImageNet [2], ImageDBCaltech (Caltech101) [23] and ImageDBCorel [38]. Table 4 shows the retrieval recall and retrieval precision values for these datasets. We define retrieval precision = #(images retrieved belonging to the same class as the query image) / #(images retrieved), and retrieval recall = #(images retrieved belonging to the same class as the query image) / #(images in the database having the same class as the query image). Note that the retrieval precision and retrieval recall as defined here pertain to the identification of class labels. They differ from the precision and recall reported in the main paper, which compare the near neighbors retrieved by a possibly approximate search algorithm against exhaustive nearest neighbor search. Note that though the main algorithm proposed in our paper is fully accurate, the other algorithms we compare to (i.e., [22, 4, 37]) perform only approximate near neighbor search. Observe from Table 4 that the recall rates of the penultimate features noticeably lag behind those of the softmax features, indicating that softmax features are better suited for plagiarism detection and other applications requiring high recall.
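The class-based retrieval precision and recall defined above can be computed as in the following sketch (our own illustrative helper, not the paper's code):

```python
import numpy as np

def class_precision_recall(sims, labels, query_label, rho):
    """Range-based retrieval at similarity threshold rho, scored against
    class labels as in the definitions used for Table 4."""
    retrieved = sims >= rho                    # cosine similarities vs. database
    relevant = labels == query_label
    tp = np.sum(retrieved & relevant)          # retrieved items from query's class
    precision = tp / max(np.sum(retrieved), 1)
    recall = tp / max(np.sum(relevant), 1)
    return precision, recall
```

For example, with similarities [0.95, 0.7, 0.85, 0.2], labels [0, 0, 1, 1], query class 0 and ρ = 0.8, both precision and recall come out to 0.5.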

14 Other Experiments

Here we present results on two experiments with a static database: (i) one on retrieval with a low similarity threshold, i.e., ρ = 0.3, and (ii) one on the complete ImageNet database containing 12 million data points. These results are reported in Table 5.

ρ = 0.3 | 10M scale, ρ = 0.8
Alg./DB. | ImgN. | IMDBW. | InstaC. | MIRFL. | ImgN. 12M
Ivf (Faiss) | 157, 0.99, 1 | 51, 0.91, 1 | 97, 0.97, 1 | 98, 0.98, 1 | 1342, 1, 1
Falconn | 128, 0.93, 0.99 | 184, 0.95, 1 | 163, 0.92, 1 | 141, 0.89, 1 | 1521, 0.96, 1
Hnsw | 15, 0.96, 0.96 | 3126, 0.97, 0.97 | 432, 0.97, 0.97 | 223, 0.98, 0.98 | 561, 0.98, 0.98
Our-Sum | 7, 1, 1 | 133, 1, 1 | 90, 1, 1 | 67, 1, 1 | 269, 1, 1

Table 5: Comparison of mean query times (ms), mean recall and mean precision (respectively) for different methods over 10K queries: (i) for ρ = 0.3 on different databases; and (ii) for the full ImageNet database with 12 million images at ρ = 0.8. Hyperparameters for all algorithms are the same as mentioned earlier in this supplementary material. Our-Sum gives better precision and recall than all other methods, and generally better query times.