
On the adversarial robustness of Locality-Sensitive Hashing in Hamming space

Michael Kapralov
   Mikhail Makarov
   Christian Sohler
Abstract

Locality-sensitive hashing [22] is a classical data structure for approximate nearest neighbor search. After close to linear time preprocessing of the input dataset, it allows one to find an approximately nearest neighbor of any fixed query in time sublinear in the dataset size. The resulting data structure is randomized and succeeds with high probability for every fixed query. In many modern applications of nearest neighbor search, however, the queries are chosen adaptively. In this paper, we study the robustness of locality-sensitive hashing to adaptive queries in Hamming space. We present a simple adversary that can, under mild assumptions on the initial point set, provably find a query on which the approximate near neighbor search data structure fails. Crucially, our adaptive algorithm finds the hard query exponentially faster than random sampling.

1 Introduction

Nearest neighbor search is one of the most fundamental data structure problems in machine learning. The development of efficient data structures for nearest neighbor search in high dimensions is particularly challenging, and finding the exact nearest neighbor often requires a linear scan over the data [35]. Therefore, approximate nearest neighbor search has received considerable attention. One important technique for designing approximate nearest neighbor data structures is locality sensitive hashing (LSH), introduced in the seminal paper by Indyk and Motwani [22]. LSH is widely used in many applications including music retrieval [33], video anomaly detection [39], and clustering [26] (see also the survey [23] for further examples).

LSH is a randomized data structure and as such its performance guarantees depend on its internal randomness. The standard analysis assumes an oblivious adversary, i.e. queries may not depend on answers to earlier queries for the guarantees to apply. What are the consequences if we would like to retain the LSH performance guarantees when LSH is used as a data structure inside an algorithm? Essentially, for the guarantees to hold, the algorithm must specify at a single point in time a set $Q$ of query points, and it then receives the answers to all of these queries at once. It can do any computation with the answers, but it may not ask further queries. Such a restricted functionality significantly limits the strength of the guarantees provided by LSH, as more often than not future queries depend on the results of earlier queries. Therefore, it would be valuable to better understand the LSH guarantees in the adaptive setting. Even though some work on LSH, such as constructions without false negatives [31, 1] (see also Related Work for further discussion), is motivated by this question, there is still a lack of knowledge about how resilient standard LSH data structures are if we allow adaptive adversaries.

In this paper, we address this question by developing a randomized query scheme (adversarial algorithm) for LSH under the Hamming distance with the goal of efficiently generating a query that makes LSH return an incorrect answer. We show that for a wide range of parameters, if the point set has a somewhat isolated point, then we can make LSH produce a false negative (that is, return no point while there is a near point) much faster than this would happen with non-adaptive queries.

Let us discuss this further on a simple example. We may use LSH for an approximate version of a 1-nearest neighbor classifier (see [2] for a related study): we store the training set in an LSH data structure and on a query we return the label of the (approximate) nearest point in the training set that is returned by LSH. In this setting LSH satisfies its standard guarantees when the choice of queries does not depend on previous answers. This happens, for example, when queries are independently sampled from a fixed query distribution. However, one could potentially engineer a sequence of adaptive queries that leads to a query on which the algorithm returns an incorrect answer. This is particularly problematic if the algorithm is used for a sensitive task and relies on strong guarantees.

Our results. Our main result is an algorithm for generating an adversarial query:

Theorem 1.1 (Informal version of Theorem A.6).

Given an ‘isolated point’ $z$ in a dataset $P$, one can find a point $q$ at a distance at most $r$ from $z$ such that querying LSH with $q$ returns no point. The number of queries to the LSH data structure needed to generate the point $q$ is bounded by $O(\log(cr)\cdot\log(1/\delta))$, where $\delta$ is the failure probability of LSH.

Note that to find such a point through random sampling, one would have to make $\Omega(1/\delta)$ queries in expectation (see Theorem 3.3), which is exponentially slower in $\delta$ than the proposed adversary. We then evaluate our adversarial algorithm experimentally in various settings to understand how the efficiency and effectiveness of our approach depend on the various parameters.

2 Related Work

Monte-Carlo Locality-Sensitive Hashing. Most of the work dedicated to Locality-Sensitive Hashing was done in the Monte-Carlo model, which we consider to be the standard model in this paper. That is, the goal is to construct a data structure with deterministically bounded query time that correctly answers the queries with high probability. Constructions of locality-sensitive hash families are known for several different metrics, including Hamming distance [22, 7], $\ell_1$ [19], the Euclidean metric [3], Jaccard similarity [13, 18], angular distance [4, 25, 29] and more. Asymmetric LSH solutions, which allow one to reduce query time at the expense of an increase in space, or to reduce space at the expense of increasing query time, have received significant attention in the literature and are quite well understood by now [32, 24, 15, 6]. Distance-sensitive hash families, where collision probabilities are not necessarily monotone in the distance between the points being hashed, have also been constructed [9]. Data-dependent results, which adapt the locality-sensitive hash family to the dataset to achieve better exponents, are also known [5, 6].

Las-Vegas Locality Sensitive Hashing. In an attempt to avoid the problem of false negatives, another line of work seeks to construct LSH instances that answer the queries deterministically correctly, while the running time is a random variable and the guarantees are usually on its expectation. It was first shown by [31] that such a scheme exists; however, for general parameter settings this scheme requires more space than standard LSH. A later work by [1] achieves the same space-time trade-off as [22]. [36] further improves on it for $\ell_p$, and offers a data-dependent construction following [3]. Although Las Vegas constructions avoid the problem of false negatives, their guarantees still hold only against an oblivious adversary. It is entirely possible that an adaptive adversary can find a sequence of queries with an average response time much larger than what is guaranteed against an oblivious adversary. In any case, the investigation of such adversaries lies outside the scope of this paper.

Robustness to adaptive adversaries. One of the main motivations of this paper is to study the behavior of a classic data structure against an adaptive adversary. One of the most studied classes of algorithms relevant to our paper is linear sketches. Some of the notable early works here include [20], which proposed an efficient attack on the linear sketch for norm estimation, and [30], which proposed deterministic sketch constructions for a range of problems that work for all inputs and provided lower bounds on their size. In [27] an adversarially robust approach to solving approximate nearest neighbor in Hamming space is proposed, which does not use LSH. It is improved upon by [14], which gives a robust construction for vector $\ell_p$-norm estimation, for $0<p\leq 2$. On the other side of the spectrum, an adaptive attack on CountSketch was proposed by [17, 16], with the latter also proposing a robustification of this sketch.

Another fruitful line of work covers the use of differential privacy to make existing algorithms more robust against adaptive adversaries [10, 37, 21, 11, 8]. The main idea is to use differentially private mechanisms to hide the random bits of the algorithm from the adversary. This approach turned out to be applicable to a wide range of estimation problems.

3 Preliminaries

For $x>0$, $\log x:=\log_{2}x$ will denote the binary logarithm of $x$. For $n\in\mathbb{N}$, $[n]:=\{1,\ldots,n\}$ will be the set of integers from $1$ to $n$. For a point $p\in\{0,1\}^{d}$ and $i\in[d]$, $p_{i}$ is the value of the point $p$ along the $i$-th dimension. For two points $p,q\in\{0,1\}^{d}$, $\operatorname{dist}(p,q)=\sum_{i\in[d]}|p_{i}-q_{i}|$ is their Hamming distance.

Basics of Locality Sensitive Hashing. Locality sensitive hashing is a technique to derive randomized data structures for near neighbor search. This can be formalized as follows. A (randomized) data structure for the $(r,cr)$-approximate near neighbor problem in a metric space $(X,\operatorname{dist})$ preprocesses a point set $P\subset X$ in such a way that for a query point $q\in X$, with probability at least $1-\delta$ it (a) returns $p\in P$ with $\operatorname{dist}(p,q)\leq cr$ if there is $p^{\prime}\in P$ with $\operatorname{dist}(p^{\prime},q)\leq r$, (b) returns $\bot$ if there is no point $p\in P$ with $\operatorname{dist}(p,q)\leq cr$, and (c) returns either $\bot$ or $p\in P$ with $\operatorname{dist}(p,q)\leq cr$ otherwise. Here $\delta$ is the failure probability of the data structure. In this paper, we consider the Hamming space, i.e. $X=\{0,1\}^{d}$ and $\operatorname{dist}$ is the Hamming distance, and in the following discussion we focus on this setting.

The key component of LSH schemes is a locality-sensitive hash family. The idea behind them is that a function from such a family is more likely to map close points to the same bucket than far ones.

Definition 3.1 ([22]).

A hash family $\mathcal{H}$ is $(r,cr,p_{1},p_{2})$-sensitive if for every $p,q\in\{0,1\}^{d}$ the following holds. Let $h$ be chosen uniformly at random from $\mathcal{H}$. If $\operatorname{dist}(p,q)\leq r$, then $\Pr[h(p)=h(q)]\geq p_{1}$. Otherwise, if $\operatorname{dist}(p,q)\geq cr$, then $\Pr[h(p)=h(q)]\leq p_{2}$.

For the $d$-dimensional Hamming space $\{0,1\}^{d}$, the locality-sensitive hash family has a very simple form [22]. Let $\mathcal{H}=\{h^{(i)},i\in[d]:\forall p\in\{0,1\}^{d},\,h^{(i)}(p)=p_{i}\}$ be the elementary hash function family. That is, it consists of hash functions $h^{(i)}:\{0,1\}^{d}\to\{0,1\}$, which project points from $\{0,1\}^{d}$ to their $i$-th bit. It is known that

Fact 3.2 ([22]).

For any $c$, $r$ such that $r<cr\leq d$, the elementary hash function family $\mathcal{H}$ is $(r,cr,1-\frac{r}{d},1-\frac{cr}{d})$-sensitive.
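
To make the bit-sampling family concrete, here is a minimal Python sketch (the function names are ours, for illustration only): for two points $p,q$, a uniformly random $h^{(i)}$ collides on them exactly when they agree on coordinate $i$, which happens with probability $1-\operatorname{dist}(p,q)/d$ and yields Fact 3.2.

```python
import random

def hamming_dist(p, q):
    # Hamming distance between two 0/1 vectors of equal length
    return sum(pi != qi for pi, qi in zip(p, q))

def sample_elementary_hash(d, rng=random):
    # h^{(i)}(p) = p_i for a uniformly random coordinate i
    i = rng.randrange(d)
    return lambda p: p[i]

# p and q collide under h^{(i)} iff they agree on coordinate i,
# so Pr[h(p) = h(q)] = 1 - hamming_dist(p, q) / d over the random choice of i.
```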

Indyk and Motwani give the following guarantee (assuming that the points come from a $d$-dimensional space):

Theorem 3.3 ([22]).

Suppose there exists an $(r,cr,p_{1},p_{2})$-sensitive family of hash functions. Then there is a data structure for the $(r,cr)$-approximate near neighbor problem (with $\delta=1/2$) that uses $O(dn+n^{1+\rho})$ space and requires $O(n^{\rho})$ evaluations of hash functions, where $\rho=\frac{\log(1-r/d)}{\log(1-cr/d)}$.

In the following, we briefly describe the construction of an LSH data structure which can be used to prove the above theorem. An LSH data structure consists of a multiset $G$ of $L$ hash functions from the product family $\mathcal{H}^{k}$, i.e. $G\subseteq\mathcal{H}^{k}$. In other words, each element of $G$ is constructed by sampling a sequence of $k$ elementary hash functions $h_{1},\ldots,h_{k}$ uniformly at random from $\mathcal{H}$ and setting $g:\{0,1\}^{d}\to\{0,1\}^{k}$, $g(x)=(h_{1}(x),\ldots,h_{k}(x))$ to be their concatenation. That is, for all $p\in\{0,1\}^{d}$, $g(p)=(h_{1}(p),\ldots,h_{k}(p))$. The intuition is that concatenating hash functions from $\mathcal{H}$ exponentially amplifies the gap between the probabilities $p_{1}$ and $p_{2}$, in such a way that if we sample $L$ hash functions then two points $p,q$ at distance at most $r$ are likely to have at least one $g_{i}$ with $g_{i}(p)=g_{i}(q)$, so we can recover the point. At the same time, the number of points $p$ with $\operatorname{dist}(p,q)>cr$ and $g_{i}(p)=g_{i}(q)$ will be small, so that the recovery can be done efficiently.
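
A minimal Python sketch of this construction (our own illustration; names such as `sample_g` and `build_lsh_tables` are not from the paper, and the bucket layout is one possible implementation choice):

```python
import random

def sample_g(d, k, rng=random):
    # g(x) = (h_1(x), ..., h_k(x)) for k independent uniformly random coordinates
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda x: tuple(x[i] for i in coords)

def build_lsh_tables(points, d, k, L, rng=random):
    # For each of the L concatenated hash functions, bucket the dataset by g(p)
    tables = []
    for _ in range(L):
        g = sample_g(d, k, rng)
        buckets = {}
        for p in points:
            buckets.setdefault(g(p), []).append(p)
        tables.append((g, buckets))
    return tables
```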

The proof of [22] essentially follows by choosing the parameters $k$ and $L$ appropriately, which is done as follows. We set $p_{1}=1-r/d$ and $p_{2}=1-cr/d$, so that $\mathcal{H}$ is $(r,cr,p_{1},p_{2})$-sensitive. We define $\rho=\frac{\log 1/p_{1}}{\log 1/p_{2}}$ and $\ell=n^{\rho}$. For an input set $P\subseteq\{0,1\}^{d}$ of $n$ points we set $k=\lceil\log_{1/p_{2}}n\rceil$. For this setting of parameters, we get the guarantees provided in the above theorem. In order to get a smaller failure probability we set $L=\lceil\lambda\ell\rceil$, which results in a failure probability of $\exp(-\Omega(\lambda))$.
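
For concreteness, the parameter choice just described can be computed as follows (a sketch with our own rounding conventions; the paper does not prescribe an implementation):

```python
import math

def lsh_parameters(n, d, r, c, lam):
    # p_1, p_2 for the bit-sampling family and the derived k, rho, ell, L
    p1 = 1.0 - r / d
    p2 = 1.0 - c * r / d
    k = math.ceil(math.log(n) / -math.log(p2))   # k = ceil(log_{1/p_2} n)
    rho = math.log(1.0 / p1) / math.log(1.0 / p2)
    ell = n ** rho
    L = math.ceil(lam * ell)                     # L = ceil(lambda * ell)
    return k, rho, ell, L
```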

One particular property of the above LSH construction is that there is a chance of incorrectly returning that there is no near point (i.e. returning $\bot$). Such an incorrect answer will be called a false negative.

Definition 3.4 (False negative).

Consider an LSH data structure initialized for the point set $P\subseteq\{0,1\}^{d}$. We say that a query $q\in\{0,1\}^{d}$ is a false negative for this LSH instance if $\textsc{QueryLSH}(q,G)=\bot$ but there exists a point $p\in P$ such that $\operatorname{dist}(p,q)\leq r$.

Defining the adversary. The goal of this paper is to study how one can force LSH to return a false negative by making adaptive queries. We take the role of an adversary, and our goal is to design an adversarial algorithm that has access to an LSH instance and quickly forces it to return a false negative. In particular, we provide guarantees on the number of queries required until LSH produces a false negative.

In the following, we describe our formal setting. Our LSH instance is constructed according to the scheme described in Section 3. Recall that this LSH scheme samples a multiset $G$ of $L$ hash functions $g_{1},\dots,g_{L}$ uniformly at random from the set of allowed hash functions of the form $g:\{0,1\}^{d}\rightarrow\{0,1\}^{k}$ with $g(p)=(h_{1}(p),\dots,h_{k}(p))$ for elementary hash functions $h_{j}$. The LSH scheme receives as input a point set $P\subseteq\{0,1\}^{d}$ of size $n$ and parameters $n,d,c,r$ and $\lambda$, and calculates $\ell,\rho,L$ and $k$ as described earlier. Our adversarial algorithm may use the input point set $P$, its size $n$ and dimension $d$, and the parameters $r,c$ and $\lambda$ of the LSH scheme, and can interact with the LSH data structure only via Algorithm 1 described below. That is, the LSH data structure is a black box and the only interaction is through Algorithm 1. In particular, the randomness used to construct the LSH instance, and hence the hash functions, is unknown to the adversarial algorithm.

Algorithm 1 QueryLSH($q,r,c$)
1:  $S\leftarrow\emptyset$
2:  for all $g\in G$ do
3:     $S\leftarrow S\cup\{p\in P:g(p)=g(q)\text{ and }\operatorname{dist}(p,q)\leq cr\}$
4:  return any point in $S$, or $\bot$ if $S$ is empty

Note that in an LSH implementation one would store all buckets of each $g$ in a hash table. This way, line 3 can be implemented without going through all points. We also remark that since we do not have control of the parameters and the point set, our adversarial algorithm cannot be expected to work in all settings (for example, if the number of hash functions is very large, then no false negatives exist).
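
A sketch of how QueryLSH looks when the buckets are stored in hash tables, using the `build_lsh_tables` sketch above (our illustration, not the paper's reference implementation):

```python
def query_lsh(q, tables, r, c):
    # Return some point p with dist(p, q) <= cr found in one of q's buckets,
    # or None (playing the role of bottom) if no such point exists there.
    for g, buckets in tables:
        for p in buckets.get(g(q), []):
            if hamming_dist(p, q) <= c * r:
                return p
    return None
```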

4 Adversarial Algorithm

Define the set $\textsc{Coll}(p,q)=\{g\in G:g(p)=g(q)\}$ to be the set of hash functions in $G$ that collide on $p$ and $q$. Our algorithm starts by finding a point $z$ in the dataset such that any other point is at distance at least $2cr$ from $z$. We assume that such a point exists, and we experimentally explore the case when it does not in Section 6. We also show that this assumption is reasonable for a randomly chosen point set in Lemma A.7. We refer to $z$ as the origin point.

Algorithm outline. The idea of the algorithm is to query a random point $q$ at distance $r-t$ from $z$ such that the expected number of hash functions hashing $q$ to the same bucket as $z$ is at most $t/2$. We show that this is true when $t=2e^{2}(\lambda+1)$, which is the value we use throughout this section. This implies that with constant probability $|\textsc{Coll}(q,z)|\leq t$. The algorithm now does $t$ iterations, and in each of them it moves the point $q$ away from $z$ by flipping one bit in it in such a way that the size of the set $\textsc{Coll}(q,z)$ also decreases by at least $1$.

To find this bit, the algorithm flips bits in $q$ at $cr-\operatorname{dist}(q,z)$ randomly chosen positions among those where $q$ is equal to $z$, obtaining a point $\tilde{q}$. The point $\tilde{q}$ is located at distance $cr$ from $z$, so with high probability $\textsc{QueryLSH}(\tilde{q})=\bot$. Now consider a path between $q$ and $\tilde{q}$, i.e. a sequence of points starting with $q$ and ending with $\tilde{q}$, where each two consecutive points differ in one bit and each point is further away from $z$ than its predecessor. Because $\tilde{q}$ does not hash with $z$ but $q$ does, there are two consecutive points $q^{\prime}$, $q^{\prime\prime}$ on this path such that $q^{\prime}$ still collides with $z$ while $q^{\prime\prime}$ does not. Consider the bit where they differ. Because $\textsc{Coll}(q^{\prime},z)\neq\emptyset$ and $\textsc{Coll}(q^{\prime\prime},z)=\emptyset$, this bit must lie in the support of all functions in $\textsc{Coll}(q^{\prime},z)$ (for an elementary hash function $h^{(i)}\in\mathcal{H}$ its support is defined as the set $\{i\}$; for a hash function $g\in G$, $g=(h_{1},\ldots,h_{k})$, its support is the union of the supports of $h_{1},\ldots,h_{k}$). Therefore, by flipping this bit in $q$ the algorithm removes these functions from $\textsc{Coll}(q,z)$. It is easy to see that at the end we get a point $q$ at distance at most $r$ from $z$ such that $\textsc{Coll}(q,z)=\emptyset$, so $\textsc{QueryLSH}(q)=\bot$ and $q$ is a false negative.

In the following we prove Lemmas 4.3 and 4.4, which are the key results underpinning the correctness of our algorithm. Lemma 4.3 essentially shows that a point $q$ sampled at distance $r-t$ collides with $z$ in at most $O(t)$ hashings with constant probability. Lemma 4.4 says that if we move this point randomly and monotonically away from $z$ to distance $cr$, it will not hash with $z$ at all.

A key property of these lemmas is that, unlike in the standard analysis of LSH, we fix the hash functions and sample points at random. This gives us a different perspective on how LSH behaves when queried with random points, which reflects the fact that our goal is to construct an adversary for LSH.

First, we state some intermediate results bounding the size of $k$ and the support of the hash functions. Their proofs are given in Appendix A.

Lemma 4.1 (Bounds on $k$).

If $cr/d\leq 1/5$ and $n>e$, one has $2\frac{d}{cr}\ln n\geq k\geq 0.5\frac{d}{cr}\ln n$.

Algorithm 2 Adaptive adversarial walk from $z$.
0:  Origin point $z$, parameters $r$, $c$, and $\lambda$
1:  $t\leftarrow 2e^{2}(\lambda+1)$
2:  Sample $q$ uniformly at random from all points at distance $r-t$ from $z$
3:  while $\textsc{QueryLSH}(q)=z$ do
4:     if $\operatorname{dist}(q,z)\geq r$ then
5:        return $\bot$
6:     $q^{\prime}\leftarrow q$
7:     while $\textsc{QueryLSH}(q^{\prime})=z$ do
8:        Let $I=\{i\in[d]:q_{i}^{\prime}=z_{i}\}$
9:        Select $j\in I$ uniformly at random
10:       Set $q_{j}^{\prime}=1-q_{j}^{\prime}$
11:       if $\operatorname{dist}(q^{\prime},z)>cr$ then
12:          return $\bot$
13:    Set $q_{j}=1-q_{j}$
14: return $q$
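
The following Python sketch mirrors Algorithm 2, using the `query_lsh` and `hamming_dist` sketches from Section 3 (our own rendering under the isolation assumption on $z$; points are 0/1 lists, $z$ is stored in the tables, and $r-t\geq 0$):

```python
import math
import random

def adversarial_walk(z, tables, r, c, lam, d, rng=random):
    # Adaptive adversarial walk from z (Algorithm 2).
    t = int(2 * math.e ** 2 * (lam + 1))
    q = list(z)
    for i in rng.sample(range(d), r - t):   # random point at distance r - t from z
        q[i] = 1 - q[i]
    while query_lsh(q, tables, r, c) == z:  # under isolation, the answer is z or None
        if hamming_dist(q, z) >= r:
            return None                     # walk failed
        qp = list(q)
        while query_lsh(qp, tables, r, c) == z:
            I = [i for i in range(d) if qp[i] == z[i]]
            j = rng.choice(I)
            qp[j] = 1 - qp[j]               # move qp one bit further away from z
            if hamming_dist(qp, z) > c * r:
                return None
        q[j] = 1 - q[j]                     # flip the last flipped bit in q as well
    return q                                # QueryLSH(q) != z: a false negative

```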
Lemma 4.2 (Support lower bound).

Let $\lambda\leq n$ and $n>e$. Then with probability at least $1-\frac{1}{n}$ the size of the support of each of the functions $g\in G$ is at least $k-7\ln n\cdot\max\{1,\frac{k^{2}}{2d}\}$.

The two lemmas above bound $k$ and the support size of the hash functions when they are chosen as described in Section 3. We now show the first of our two main lemmas, which bounds the number of hash functions hashing a point at distance $r-t$ from $z$ into the same bucket as $z$. We outline the proof below, and present the full proof in Section A.1.

Lemma 4.3.

Let $n>e$, $8e^{2}(\lambda+1)\ln n\leq cr$, $28\ln^{3}n/c^{2}\leq r\leq d/(14\ln n)$, $r\leq d/(5c)$ and $c<\ln n$. Let the hash functions $G$ be such that the size of the support of each is at least $k-7\ln n\cdot\max\{1,\frac{k^{2}}{2d}\}$. Let $t=2e^{2}(\lambda+1)$. Let $q$ be a point at distance $r-t$ from $z$ chosen uniformly at random. Then the expected size of $\textsc{Coll}(q,z)$ is at most $e^{2}(\lambda+1)$, where the expectation is taken with respect to the randomness of $q$.

Proof outline.

We fix an arbitrary set $G$ of $L$ hash functions that satisfy the condition of the lemma. By linearity of expectation, $\mathbb{E}[|\textsc{Coll}(q,z)|]=\sum_{g\in G}\mathbb{E}[X_{g}]=\sum_{g\in G}\Pr[g(q)=g(z)]$, where $X_{g}$ is the indicator random variable of the event $g(q)=g(z)$ and the randomness is over the choice of $q$ as in the lemma. In the following we consider an arbitrary $g\in G$ and derive an upper bound on the collision probability of $q$ and $z$. Let $x=7\ln n\cdot\max\{1,\frac{k^{2}}{2d}\}$. We first observe (see Section A.1 for more details) that $\Pr[g(q)=g(z)]\leq\left(1-\frac{k-x}{d}\right)^{r-t}$. The basic argument is that sampling a uniformly random point at distance $r-t$ from $z$ is not worse than picking $r-t$ bits to flip in $z$ independently at random, and then bounding the probability that none of the chosen bits lie in the support of $g$.

Next we can use this upper bound to argue that $\sum_{g\in G}\Pr[g(q)=g(z)]\leq L\cdot\left(1-\frac{k-x}{d}\right)^{r-t}$. Using $L=\lceil\lambda\ell\rceil\leq(\lambda+1)\ell$ and $\ell=n^{\rho}=\left(1-\frac{r}{d}\right)^{-\log_{1/p_{2}}n}\leq\left(1-\frac{r}{d}\right)^{-k}$, we obtain $\mathbb{E}_{q}\left[|\textsc{Coll}(q,z)|\right]\leq(\lambda+1)\left(1-\frac{r}{d}\right)^{-k}\left(1-\frac{k-x}{d}\right)^{r-t}$. It remains to show that the right-hand side is at most $e^{2}(\lambda+1)$. Because $x$ and $t$ are considerably smaller than $k$ and $r$ respectively, this essentially boils down to showing that $\left(1-\frac{r}{d}\right)^{-k}\cdot\left(1-\frac{k}{d}\right)^{r}\leq O(1)$. This is simple to bound using the Taylor expansion of both terms (see Section A.1 for detailed calculations). ∎
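
For intuition, a simplified version of that final calculation (ignoring the lower-order terms $x$ and $t$, and using Fact A.1, Lemma 4.1 and the assumption $r\leq d/(14\ln n)$) reads:

$$\left(1-\frac{r}{d}\right)^{-k}\left(1-\frac{k}{d}\right)^{r}\;\leq\;e^{k\left(\frac{r}{d}+\frac{r^{2}}{d^{2}}\right)}\cdot e^{-\frac{kr}{d}}\;=\;e^{\frac{kr^{2}}{d^{2}}}\;\leq\;e^{\frac{2r\ln n}{cd}}\;\leq\;e^{\frac{1}{7c}}\;=\;O(1).$$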

We now prove the second lemma, which bounds the probability that a point at distance $cr$ will be hashed with the origin.

Lemma 4.4.

Let $n>e$, $cr/d\leq 1/5$, $c\geq 1+\frac{80+16\ln(\lambda+1)}{\ln n}$, and $8e^{2}(\lambda+1)\leq n^{1/8}$. Let $G$ be such that no hash function $g\in G$ has support smaller than $k/2$. Let $q$ be a point at distance at most $r$ from $z$ such that $|\textsc{Coll}(q,z)|$ is at most $2e^{2}(\lambda+1)$. Let $q^{\prime}$ be a point chosen uniformly at random among all points at distance $cr$ from $z$ such that for all $i\in[d]$, if $q_{i}\neq z_{i}$, then $q^{\prime}_{i}\neq z_{i}$. Then with probability at least $1-1/(32e^{2}(\lambda+1))$, $\textsc{Coll}(q^{\prime},z)=\emptyset$, i.e. no function $g$ hashes $q^{\prime}$ and $z$ to the same bucket.

Proof.

Fix $g\in\textsc{Coll}(q,z)$. First notice that $q^{\prime}$ as defined in the lemma statement can be generated as follows. One first selects a uniformly random subset $S$ of $cr-\operatorname{dist}(q,z)$ coordinates in $I=\{i\in[d]:q_{i}=z_{i}\}$. Then one lets $q^{\prime}_{i}=q_{i}\oplus 1$ (the negation of $q_{i}$) for $i\in S$ and $q^{\prime}_{i}=q_{i}$ for $i\in[d]\setminus S$. We denote this distribution by $\mathcal{D}(q)$. Note that because $g(q)=g(z)$ (since $g\in\textsc{Coll}(q,z)$ by assumption), the support of $g$ is a subset of $I$. We will show that $\Pr_{q^{\prime}\sim\mathcal{D}(q)}[g(q^{\prime})=g(z)]\leq\left(1-\frac{k}{2d}\right)^{(c-1)r}$. In order to establish this, we view the generation of $q^{\prime}$ as a two step process. First we uniformly sample $cr-\operatorname{dist}(q,z)\geq cr-r$ bits from $I$ with replacement into a set $B$ (removing duplicates) and flip them in $q$; denote the result by $\tilde{q}$. Next, we choose $cr-\operatorname{dist}(\tilde{q},z)$ more bits from $I\setminus B$ and flip them in $\tilde{q}$ to get $q^{\prime}$. It holds that $g(\tilde{q})\neq g(z)$ implies $g(q^{\prime})\neq g(z)$, so $\Pr_{q^{\prime}}(g(q^{\prime})\neq g(z))\geq\Pr_{\tilde{q}}(g(\tilde{q})\neq g(z))$.

The probability that each bit chosen in the first step lies in the support of $g$ is at least $\frac{k/2}{d-\operatorname{dist}(q,z)}\geq\frac{k}{2d}$. Hence, the probability that none of them lie in the support, i.e. $\Pr_{\tilde{q}}(g(\tilde{q})=g(z))$, is at most $(1-\frac{k}{2d})^{cr-r}$. This establishes the inequality.

By Lemma 4.1, $k\geq 0.5\frac{d}{cr}\ln n$. By applying Fact A.1 to $(1-\frac{k}{2d})^{cr-r}$, we obtain $\Pr[g(q^{\prime})=g(z)]\leq\left(1-\frac{k}{2d}\right)^{(c-1)r}\leq n^{-\frac{c-1}{4c}}$. Now consider two cases. First, if $c\geq 2$, then the right-hand side of the above inequality satisfies $n^{-\frac{c-1}{4c}}\leq n^{-1/4}\leq\frac{1}{(8e^{2}(\lambda+1))^{2}}$, since $8e^{2}(\lambda+1)\leq n^{1/8}$ by the assumption of the lemma. Otherwise, using that $c\geq 1+\frac{80+16\ln(\lambda+1)}{\ln n}$ by the assumption of the lemma, we get $\frac{c-1}{4c}\geq\frac{c-1}{8}\geq\frac{10+2\ln(\lambda+1)}{\ln n}\geq\ln(8e^{2}(\lambda+1))/\ln n$. Thus, in both cases one has $n^{-\frac{c-1}{4c}}\leq\frac{1}{(8e^{2}(\lambda+1))^{2}}$. Finally, since $|\textsc{Coll}(q,z)|\leq 2e^{2}(\lambda+1)$ by the assumption of the lemma, by a union bound over all $g\in\textsc{Coll}(q,z)$, $\Pr[\textsc{Coll}(q^{\prime},z)\neq\emptyset]\leq 1/(32e^{2}(\lambda+1))$, as required. ∎

4.1 Analysis of the algorithm

Equipped with Lemma 4.3 and Lemma 4.4, we are now ready to analyse Algorithm 2. Throughout this section we make a strong assumption that the input point $z$ is isolated, that is, the distance from $z$ to any other point in $P$ that is not equal to $z$ is at least $2cr$. This means that as long as we only make queries $q$ at distance at most $cr$ to $z$, the only two possible results of $\textsc{QueryLSH}$ are $z$ and $\bot$. This in turn means that $\textsc{QueryLSH}(q)=z$ iff $\textsc{Coll}(q,z)\neq\emptyset$.

First, we analyze the inner loop of the algorithm (lines 6 to 12). Its goal is to find a bit to flip in $q$ such that the size of $\textsc{Coll}(q,z)$ decreases afterwards. Of course, our algorithm only gets black-box access to $\textsc{QueryLSH}$ and cannot test the size of $|\textsc{Coll}(q,z)|$ in general. However, for a point $q^{\prime}$ one can, using black-box access to $\textsc{QueryLSH}$, distinguish between $|\textsc{Coll}(q^{\prime},z)|=0$ and $|\textsc{Coll}(q^{\prime},z)|>0$: due to the assumption that $z$ is an isolated point we have $|\textsc{Coll}(q^{\prime},z)|>0$ iff $\textsc{QueryLSH}(q^{\prime})\neq\bot$. As we show, this ability suffices. Algorithm 2 simply keeps flipping bits in $q^{\prime}$ at random (but restricted to coordinates where $q^{\prime}$ and $z$ agree) until $z$ is no longer returned by $\textsc{QueryLSH}$ on the modified query: the last flipped bit must lead to a decrease in $|\textsc{Coll}(q,z)|$ even in the original query $q$! More formally, our algorithm finds two points $q^{\prime}$ and $q^{\prime\prime}$ that differ in exactly one bit, such that $q^{\prime}$ lies on a path between $q^{\prime\prime}$ and $q$ (recall that by a path between $a$ and $b$ we mean a sequence of points starting with $a$ and ending with $b$, where each two consecutive points differ in one bit and each point is further away from $a$ than its predecessor), and such that $\textsc{Coll}(q^{\prime\prime},z)=\emptyset$ and $\textsc{Coll}(q^{\prime},z)\neq\emptyset$. This is enough to conclude that this bit lies in the support of all hash functions in $\textsc{Coll}(q^{\prime},z)$, so by flipping it in $q$ we remove all of those hash functions from $\textsc{Coll}(q,z)$.

Lemma 4.5 (Inner loop of the adversarial walk).

Let $n>e$, $cr/d\leq 1/5$, $c\geq 1+\frac{80+16\ln(\lambda+1)}{\ln n}$, and $8e^{2}(\lambda+1)\leq n^{1/8}$. Let $G$ be such that no hash function $g\in G$ has support smaller than $k/2$. Suppose that the point $z$ passed to Algorithm 2 is located at distance at least $2cr$ from any other point in $P$. Let the point $q$ sampled in line 2 be such that $|\textsc{Coll}(q,z)|\leq 2e^{2}(\lambda+1)$. For a fixed iteration of the outer loop in line 3 of Algorithm 2, with probability at least $1-1/(32e^{2}(\lambda+1))$ the inner loop between lines 6 and 12 finds two points $q^{\prime}$ and $q^{\prime\prime}$ such that $\textsc{QueryLSH}(q^{\prime})\neq z$, $\textsc{QueryLSH}(q^{\prime\prime})=z$, and $q^{\prime}$ and $q^{\prime\prime}$ differ in one bit.

Proof.

Fix an iteration of the outer loop. The inner loop at line 7 essentially performs a random walk away from $z$, increasing the distance from $z$ after every flip. There are two possibilities: either it finds a point $q^{\prime}$ at distance at most $cr$ from $z$ such that $\textsc{QueryLSH}(q^{\prime})\neq z$, i.e. $\textsc{Coll}(q^{\prime},z)=\emptyset$, or it reaches a point $q^{\prime\prime}$ at distance $cr$ such that $\textsc{QueryLSH}(q^{\prime\prime})=z$, in which case the algorithm returns $\bot$.

Assume that the latter happens. Since the flipped bits are chosen at random, this process is equivalent to flipping $cr-\operatorname{dist}(q,z)$ distinct random bits of $q$ in positions where $q$ is equal to $z$. This is the same process as in the statement of Lemma 4.4, so by that lemma the probability that this case occurs is at most $1/(32e^{2}(\lambda+1))$. ∎

We are now ready to analyze Algorithm 2. Here we present only the conceptual part of the proof of Theorem A.4; the full proof can be found in Section A.2.

Theorem 4.6.

Let $n>e$, $28\ln^{3}n\leq r\leq d/(28\ln n)$, $\lambda\leq\frac{1}{8e^{2}}\min\{\frac{r}{\ln n},n^{1/8}\}-1$, and $1+\frac{80+16\ln(\lambda+1)}{\ln n}\leq c\leq\ln n$. Let $t=2e^{2}(\lambda+1)$ be the parameter from Algorithm 2 (Algorithm 3). Suppose that the point $z$ passed to Algorithm 2 (Algorithm 3) is located at distance at least $2cr$ from any other point in $P$. With probability at least $1/4-1/n$ the algorithm finds a point $q$ at distance at most $r$ from $z$ such that querying LSH with $q$ returns no point. It uses at most $O(cr\cdot\lambda)$ queries to the LSH data structure ($O(\log(cr)\cdot\lambda)$ for Algorithm 3).

Proof outline.

We start with Algorithm 2. Let $q^{(j)}$ be the value of $q$ at the end of iteration $j$ of the outer loop (each iteration runs the inner loop analyzed in Lemma 4.5), and let $q^{(0)}$ be its value at the beginning of the first iteration. Let $j^{*}$ be the number of iterations.

For each iteration $j$ of the outer loop, by Lemma 4.5 two points $q^{\prime}$ and $q^{\prime\prime}$ are found. The bit where they differ must lie in the support of every hash function in $\textsc{Coll}(q^{\prime\prime},z)$, since $\textsc{QueryLSH}(q^{\prime})\neq z$. Since $q^{(j)}$ and $q^{(j-1)}$ differ in only this bit, and because $\emptyset\neq\textsc{Coll}(q^{\prime\prime},z)\subseteq\textsc{Coll}(q^{(j-1)},z)$, it holds that $|\textsc{Coll}(q^{(j)},z)|<|\textsc{Coll}(q^{(j-1)},z)|$, so the size of $\textsc{Coll}(q^{(j)},z)$ decreases by at least $1$ after each execution of the loop.

By Lemma 4.3 and Markov’s inequality, $|\textsc{Coll}(q^{(0)},z)|$ is at most $2e^{2}(\lambda+1)$ with constant probability, so at most $2e^{2}(\lambda+1)$ iterations will be made until $\textsc{Coll}(q^{(j^{*})},z)=\emptyset$, which also means that $\textsc{QueryLSH}(q^{(j^{*})})=\bot$, so the loop stops. On the other hand, $\operatorname{dist}(q^{(j^{*})},z)\leq r$, which makes $q^{(j^{*})}$ a false negative query.

With a simple modification of the inner loop of Algorithm 2, we can exponentially reduce the dependence on $cr$ in the running time and the number of queries. We state the algorithm and give the proof of its correctness in Section A.3. At a high level, instead of using a linear search as in Algorithm 2 to find a pair of points $q^{\prime\prime}$, $q^{\prime}$ such that $\textsc{Coll}(q^{\prime\prime},z)\neq\emptyset$ and $\textsc{Coll}(q^{\prime},z)=\emptyset$, we use a binary search in the inner loop. We still start with the same point $q$. However, we now flip $cr-\operatorname{dist}(q,z)$ randomly chosen bits of $q$ in positions where $q$ equals $z$ to get $q^{right}$. Then by Lemma 4.4, with high probability $q^{right}$ does not hash with $z$. Now we use a binary search on the path from $q$ to $q^{right}$ to find the aforementioned pair of points. The analysis is analogous to that of Lemma 4.5. ∎
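
A Python sketch of that binary-search inner loop (our own rendering of the idea, reusing the `query_lsh` and `hamming_dist` sketches above; the path from $q$ to $q^{right}$ is parameterized by how many of the selected bits have been flipped):

```python
import random

def find_bit_by_binary_search(q, z, tables, r, c, d, rng=random):
    # Random order of the positions where q agrees with z; flipping a prefix of
    # `order` traces a monotone path from q towards a point at distance cr from z.
    I = [i for i in range(d) if q[i] == z[i]]
    rng.shuffle(I)
    order = I[: c * r - hamming_dist(q, z)]

    def point_on_path(m):
        # the point obtained from q by flipping the first m selected bits
        p = list(q)
        for i in order[:m]:
            p[i] = 1 - p[i]
        return p

    lo, hi = 0, len(order)  # invariant: path(lo) collides with z, path(hi) does not (w.h.p.)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if query_lsh(point_on_path(mid), tables, r, c) == z:
            lo = mid
        else:
            hi = mid
    return order[lo]        # flipping this bit in q removes all of Coll(path(lo), z)
```

Each call makes $O(\log(cr))$ queries to the LSH data structure, which is where the exponential improvement over the linear search of Algorithm 2 comes from.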

5 Discussion of potential defenses against our attacking scheme

As mentioned in Section 2, there already exists a large literature on the robustification of LSH, much of which works against our adversary. In particular, we highlight [1], which provides a false-negative-free Las-Vegas type data structure with query time $dn^{1/c+o(1)}$ and space $dn^{1+1/c+o(1)}$, achieving the same asymptotic performance as [22], as well as [14], which can be applied to LSH to guarantee perfect robustness at the cost of $O(d)$ times more space.

However, neither approach is perfect within the same parameter bounds: the approach of [1] does not give any guarantees on query time in this setting, and the space usage of [14] is prohibitive in settings with large $d$. We experimentally investigate how the latter, and a construction based on differential privacy, fare against our adaptive adversary when the available space is limited in Experiment 6 in Appendix C. Interestingly, these defenses perform well in this setting, which suggests that they might be reasonably effective even against a general adversary.

6 Experiments

The goal of this section is to verify the efficacy of our adversary on real datasets, where an isolated point does not necessarily exist, and for parameter configurations outside of the guarantees of Theorems A.4 and A.6.

Figure 1: Dependence of success probability on various parameters. All experiments are done on the Random dataset, with the third one also featuring the Zero dataset.
Figure 2: Dependence of the success probability on the value of $t$ and on the desired distance of the false negative query from the origin.

Implementation details. Our implementation of LSH follows Algorithm 1 in Section 3 [22]. To summarize, given a query point $q$, we iterate through all points in the dataset that hash together with $q$ under some hash function, return the first one which lies at distance at most $cr$ from $q$, and return nothing if there is no such point. The adversary is implemented according to Algorithm 3.

Experiment 1: Influence of parameters on success probability. The goal of this experiment is to explore for which parameter settings the algorithm works. For this experiment we use synthetic datasets, since for them we can easily vary the number of points and dimensions. We use two datasets:

Zero. This dataset consists entirely of $n$ copies of the all-zero point. This dataset is the easiest for our algorithm to find a false negative in, as it is equivalent to a dataset containing only a single point, which would automatically be isolated.

Random. Each point is sampled i.i.d. uniformly, with each bit having probability $1/2$ of being set to $1$.

For each point on the plots, the experiments were run $1000$ times, and the result is the mean of the associated values. Plots contain error bars computed as standard error of the mean, but sometimes they are too small to be visible.

The parameters that we vary are $n$, $d$, $r$, $\lambda$, $c$ and $t$. Each parameter is changed individually. The default setting of parameters is $n=1000$, $d=300$, $\lambda=4$, $r=30$, $c=2$. The origin point $z$ is picked at random. The offset $t$ is set to $r$, or, in other words, the initial point $q$ is initialized to be equal to $z$. This differs from what was used in Algorithms 2 and 3. However, the intuition is that if instead of setting $q$ to be a random point at distance $r-2e^{2}(\lambda+1)$ from $z$ we set $q=z$ and then move it away from $z$ in a way that ensures that the size of $\textsc{Coll}(q,z)$ decreases for every changed bit, the probability that the algorithm finds a false negative query can only increase. This intuition ends up being supported by the experimental results.

The results are presented in Figure 1. The sixth subplot, counting left-to-right and top-to-bottom, supports the claim that we just made: the lower the starting distance of the point $q$, which is equal to $r-t$, the higher the success probability of the algorithm. The first and fifth subplots are straightforward. The number of points $n$ has a very weak effect on the success probability. The parameter $c$ also has a weak influence if it is large enough; when it is small, it is unlikely that a point at distance $cr$ does not hash with $z$, see Lemma 4.4. The drop in probability on the left-hand side of the third subplot, where the near radius $r$ is varied, and on the right-hand side of the fourth, where $\lambda$ is varied, can be explained by the fact that a random point at distance $r$ has in expectation $\Omega(\lambda)$ hash functions hashing it with $z$. Hence, intuitively, for it to be possible to eliminate all collisions with $z$, it must be that $t\geq\Omega(\lambda)$, which also implies that $r\geq\Omega(\lambda)$ is necessary.

Experiment 2: Efficiency across datasets. In this experiment we measure how well our algorithm performs on a variety of different datasets. We use the datasets from Experiment 1, as well as one additional synthetic dataset and three datasets with real data. The new synthetic dataset is Sparse random: each point is generated by independently sampling each bit to be $1$ with probability $1/15$, and $0$ otherwise. The remaining datasets are embeddings of MNIST [28], Anonymous Microsoft Web Data (MSWeb) [12] and Mushroom [34, 38] into Hamming space. For MNIST we flatten the images into vectors whose Boolean variables indicate the presence of a pixel. For MSWeb and Mushroom we use one-hot encoding to map the data into Hamming space, as sketched below. A more detailed description can be found in Appendix C.
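
For illustration, one-hot encoding a categorical dataset such as Mushroom into Hamming space can be done roughly as follows (a sketch; the exact preprocessing used in our experiments may differ in details such as the handling of missing values):

```python
def one_hot_encode(rows):
    # rows: list of tuples of categorical values, one tuple per record.
    # Returns a list of 0/1 vectors, with one block of bits per attribute.
    values = [sorted({row[j] for row in rows}) for j in range(len(rows[0]))]

    def encode(row):
        bits = []
        for j, v in enumerate(row):
            bits.extend(1 if v == u else 0 for u in values[j])
        return bits

    return [encode(row) for row in rows]
```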

Due to computational constraints, we limit the size of each dataset to at most $10000$ points. For MNIST and MSWeb we only use the first $10000$ points; we use all $8124$ points in Mushroom. For the synthetic datasets we generate $10000$ points with dimension $300$, which is about the same order of magnitude as the dimensions of the real datasets. We choose a setting of parameters with which our algorithm has a good chance of success as per the results of Experiment 1. As the number of dimensions varies depending on the dataset, we set $r=0.15d$, $c=2$, $\lambda=4$, and the starting distance is $0$ on the second subplot. Otherwise, we use the same setup as in Experiment 1, except that for the fixed datasets the origin point is the first one in the dataset. On the first subplot the starting distance is measured as a fraction of $r$. On the second subplot the goal of the algorithm is to find a query to the LSH data structure that returns $\bot$ at a distance from the origin specified by the $x$ axis as a fraction of $cr$; the dashed black vertical line represents the radius $r$ and is located at $0.5$ since $c=2$. The results in the first subplot of Figure 2 show that the success probability drops monotonically for all datasets as the starting distance grows. This is natural to expect given that similar behaviour was observed in Experiment 1. On the second subplot we can see that for most datasets we can reliably find a negative query only at a distance much larger than $r$. This, in a way, allows us to compare the “hardness” of datasets: the further away from the origin we move, the fewer functions hash the origin point $z$ together with our current point, and the fewer bits we need to change for the query to become negative. As one can see, all of the real datasets lie between Random and Sparse random on the second subplot. This can be explained by the fact that the points in the real datasets are much sparser than in Random, and the maximum distance from the most isolated point to its nearest neighbor is smaller than the ideal $2cr$.

References

  • [1] Thomas Dybdahl Ahle. Optimal Las Vegas locality sensitive data structures. In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 938–949. IEEE Computer Society, 2017.
  • [2] Panagiotis Anagnostou, Petros Barbas, Aristidis G. Vrahatis, and Sotiris K. Tasoulis. Approximate knn classification for biomedical data. In 2020 IEEE International Conference on Big Data (Big Data), pages 3602–3607, 2020.
  • [3] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21-24 October 2006, Berkeley, California, USA, Proceedings, pages 459–468. IEEE Computer Society, 2006.
  • [4] Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya P. Razenshteyn, and Ludwig Schmidt. Practical and optimal LSH for angular distance. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1225–1233, 2015.
  • [5] Alexandr Andoni, Piotr Indyk, Huy L. Nguyen, and Ilya P. Razenshteyn. Beyond locality-sensitive hashing. In Chandra Chekuri, editor, Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 1018–1028. SIAM, 2014.
  • [6] Alexandr Andoni, Thijs Laarhoven, Ilya P. Razenshteyn, and Erik Waingarten. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Philip N. Klein, editor, Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 47–66. SIAM, 2017.
  • [7] Alexandr Andoni and Ilya P. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Rocco A. Servedio and Ronitt Rubinfeld, editors, Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 793–801. ACM, 2015.
  • [8] Idan Attias, Edith Cohen, Moshe Shechner, and Uri Stemmer. A framework for adversarial streaming via differential privacy and difference estimators. In Yael Tauman Kalai, editor, 14th Innovations in Theoretical Computer Science Conference, ITCS 2023, January 10-13, 2023, MIT, Cambridge, Massachusetts, USA, volume 251 of LIPIcs, pages 8:1–8:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023.
  • [9] Martin Aumüller, Tobias Christiani, Rasmus Pagh, and Francesco Silvestri. Distance-sensitive hashing. In Jan Van den Bussche and Marcelo Arenas, editors, Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 89–104. ACM, 2018.
  • [10] Amos Beimel, Haim Kaplan, Yishay Mansour, Kobbi Nissim, Thatchaphol Saranurak, and Uri Stemmer. Dynamic algorithms against an adaptive adversary: generic constructions and lower bounds. In Stefano Leonardi and Anupam Gupta, editors, STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 1671–1684. ACM, 2022.
  • [11] Omri Ben-Eliezer, Rajesh Jayaram, David P. Woodruff, and Eylon Yogev. A framework for adversarially robust streaming algorithms. J. ACM, 69(2):17:1–17:33, 2022.
  • [12] Jack Breese, David Heckerman, and Carl Kadie. Anonymous Microsoft Web Data. UCI Machine Learning Repository, 1998. DOI: https://doi.org/10.24432/C5VS3Q.
  • [13] Moses Charikar. Similarity estimation techniques from rounding algorithms. In John H. Reif, editor, Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montréal, Québec, Canada, pages 380–388. ACM, 2002.
  • [14] Yeshwanth Cherapanamjeri and Jelani Nelson. On adaptive distance estimation. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • [15] Tobias Christiani. A framework for similarity search with space-time tradeoffs using locality-sensitive filtering. In Philip N. Klein, editor, Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 31–46. SIAM, 2017.
  • [16] Edith Cohen, Xin Lyu, Jelani Nelson, Tamás Sarlós, Moshe Shechner, and Uri Stemmer. On the robustness of countsketch to adaptive inputs. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 4112–4140. PMLR, 2022.
  • [17] Edith Cohen, Jelani Nelson, Tamás Sarlós, and Uri Stemmer. Tricking the hashing trick: A tight lower bound on the robustness of countsketch to adaptive inputs. In Brian Williams, Yiling Chen, and Jennifer Neville, editors, Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 7235–7243. AAAI Press, 2023.
  • [18] Søren Dahlgaard, Mathias Bæk Tejs Knudsen, and Mikkel Thorup. Practical hash functions for similarity estimation and dimensionality reduction. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 6615–6625, 2017.
  • [19] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253–262, 2004.
  • [20] Moritz Hardt and David P. Woodruff. How robust are linear sketches to adaptive inputs? In Dan Boneh, Tim Roughgarden, and Joan Feigenbaum, editors, Symposium on Theory of Computing Conference, STOC’13, Palo Alto, CA, USA, June 1-4, 2013, pages 121–130. ACM, 2013.
  • [21] Avinatan Hassidim, Haim Kaplan, Yishay Mansour, Yossi Matias, and Uri Stemmer. Adversarially robust streaming algorithms via differential privacy. J. ACM, 69(6):42:1–42:14, 2022.
  • [22] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Jeffrey Scott Vitter, editor, Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, May 23-26, 1998, pages 604–613. ACM, 1998.
  • [23] Omid Jafari, Preeti Maurya, Parth Nagarkar, Khandker Mushfiqul Islam, and Chidambaram Crushev. A survey on locality sensitive hashing algorithms and their applications. arXiv preprint arXiv:2102.08942, 2021.
  • [24] Michael Kapralov. Smooth tradeoffs between insert and query complexity in nearest neighbor search. In Tova Milo and Diego Calvanese, editors, Proceedings of the 34th ACM Symposium on Principles of Database Systems, PODS 2015, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 329–342. ACM, 2015.
  • [25] Christopher Kennedy and Rachel A. Ward. Fast cross-polytope locality-sensitive hashing. In Christos H. Papadimitriou, editor, 8th Innovations in Theoretical Computer Science Conference, ITCS 2017, January 9-11, 2017, Berkeley, CA, USA, volume 67 of LIPIcs, pages 53:1–53:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017.
  • [26] Hisashi Koga, Tetsuo Ishibashi, and Toshinori Watanabe. Fast agglomerative hierarchical clustering algorithm using locality-sensitive hashing. Knowledge and Information Systems, 12:25–53, 2007.
  • [27] Eyal Kushilevitz, Rafail Ostrovsky, and Yuval Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput., 30(2):457–474, 2000.
  • [28] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • [29] Ping Li, Art B. Owen, and Cun-Hui Zhang. One permutation hashing. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pages 3122–3130, 2012.
  • [30] Jelani Nelson, Huy L. Nguyên, and David P. Woodruff. On deterministic sketching and streaming for sparse recovery and norm estimation. In Anupam Gupta, Klaus Jansen, José D. P. Rolim, and Rocco A. Servedio, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques - 15th International Workshop, APPROX 2012, and 16th International Workshop, RANDOM 2012, Cambridge, MA, USA, August 15-17, 2012. Proceedings, volume 7408 of Lecture Notes in Computer Science, pages 627–638. Springer, 2012.
  • [31] Rasmus Pagh. Locality-sensitive hashing without false negatives. In Robert Krauthgamer, editor, Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 1–9. SIAM, 2016.
  • [32] Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006, pages 1186–1195. ACM Press, 2006.
  • [33] Matti Ryynanen and Anssi Klapuri. Query by humming of midi and audio using locality sensitive hashing. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2249–2252, 2008.
  • [34] Mushroom. UCI Machine Learning Repository, 1987. DOI: https://doi.org/10.24432/C5959T.
  • [35] Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Ashish Gupta, Oded Shmueli, and Jennifer Widom, editors, VLDB’98, Proceedings of 24rd International Conference on Very Large Data Bases, August 24-27, 1998, New York City, New York, USA, pages 194–205. Morgan Kaufmann, 1998.
  • [36] Alexander Wei. Optimal Las Vegas approximate near neighbors in $\ell_p$. ACM Trans. Algorithms, 18(1):7:1–7:27, 2022.
  • [37] David P. Woodruff and Samson Zhou. Tight bounds for adversarially robust streams and sliding windows via difference estimators. In 62nd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2021, Denver, CO, USA, February 7-10, 2022, pages 1183–1196. IEEE, 2021.
  • [38] Barry L Wulff. The audubon society field guide to north american mushrooms, by gary lincoff, 1982.
  • [39] Ying Zhang, Huchuan Lu, Lihe Zhang, Xiang Ruan, and Shun Sakai. Video anomaly detection based on locality sensitive hashing filters. Pattern Recognition, 59:302–311, 2016.

Appendix A Omitted proofs and statements

This appendix contains lemmas and proofs omitted from Section 4.

A.1 Proofs from Section 4

Fact A.1.

For $x\in\left[0,+\infty\right)$, $1-x\leq e^{-x}$. For $x\in\left[0,1/5\right]$, $e^{-x-x^{2}}\leq 1-x$.

Proof.

First we show $e^{-x}-(1-x)\geq 0$. Taking a derivative, we get $-e^{-x}+1$, which is non-negative on $\left[0,+\infty\right)$. Therefore, since the inequality holds at $0$, it holds on $\left[0,+\infty\right)$.

For the second inequality, consider $f(x):=e^{-x-x^{2}}-(1-x)$. Since $f(0)=0$, it is enough to show that its first derivative $f^{\prime}(x)=-(2x+1)e^{-x-x^{2}}+1$ is non-positive on $\left[0,1/5\right]$. For this it is sufficient to show that $g(x):=(2x+1)-e^{x+x^{2}}\geq 0$ on this interval. $g^{\prime}(x)=2-(2x+1)e^{x+x^{2}}$ is a decreasing function. Since $g^{\prime}(0)>0$ and $g^{\prime}(1/5)\approx 0.22>0$, $g^{\prime}(x)$ is positive on the interval $\left[0,1/5\right]$. This means that $g(x)$ is non-negative on this interval since $g(0)=0$. ∎

See Lemma 4.1.

Proof.

Recall that $k=\lceil\frac{\ln n}{-\ln p_{2}}\rceil=\lceil\ln n\cdot(-\ln(1-\frac{cr}{d}))^{-1}\rceil$. By taking logarithms in the inequalities of Fact A.1, we get that for $x\leq 1/5$, $x\leq-\ln(1-x)\leq x+x^{2}$. Setting $x=cr/d$, we obtain

$$\ln n\cdot\left(\frac{cr}{d}\right)^{-1}\geq\frac{\ln n}{-\ln(1-\frac{cr}{d})}\geq\ln n\cdot\left(\frac{cr}{d}+\left(\frac{cr}{d}\right)^{2}\right)^{-1}.$$

The required inequalities now follow immediately, since $0\leq cr/d\leq 1$ and $\ln n\geq 1$, so $\frac{d}{cr}\ln n\geq 1$ and $\frac{1}{1+cr/d}\geq 1/2$. ∎

Fact A.2 (Chernoff bound).

Let $X_{1},\ldots,X_{n}$ be independent Bernoulli random variables. Let $S=\sum_{i=1}^{n}X_{i}$. Let $\delta>0$. Then

$$\Pr[S\geq(1+\delta)\mathbb{E}(S)]\leq\exp\left(-\frac{\delta^{2}\mathbb{E}(S)}{2+\delta}\right).$$
Lemma A.3.

Suppose that a hash function $g$ was constructed according to the procedure in Section 3 and $n>e$. Then with probability at least $1-1/n^{3}$ the size of the support of $g$ is at least $k-7\ln n\cdot\max\{1,\frac{k(k-1)}{2d}\}$.

Proof.

Recall that a hash function $g$ is composed of hash functions $h_{1},\dots,h_{k}$, each of which is supported on one dimension from $\{1,\dots,d\}$. The dimension on which each function $h_{i}$ is supported is picked uniformly and independently from $\{1,\ldots,d\}$. We call a hash function $h_{i}$ duplicated if $h_{i}=h_{j}$ for some $j<i$. Note that the first instance of each hash function in the sequence is, according to this definition, not duplicated. Then, because $k=|\text{support of }g|+\#\text{duplicated hash functions}$, it is enough to bound the number of duplicated hash functions, which we denote by $S$, to conclude the statement of the lemma.

Consider the following equivalent process of constructing g. We pick the h_{i} in order of increasing i. Denote by t_{i} the size of the union of the supports of h_{1},\ldots,h_{i-1}. For each i we first reorder the range \{1,\ldots,d\} so that its first t_{i} entries are the dimension indices already in the support of h_{1},\ldots,h_{i-1}, and the rest is everything else. We then pick a number a_{i}\in[d] uniformly at random; this number indicates the position, in the reordered range, of the dimension index on which h_{i} is supported.

It is now easy to see that the hash function hih_{i} is duplicated iff aitia_{i}\leq t_{i}.

Since the support of the first i-1 hash functions has size at most i-1, it always holds that t_{i}\leq i-1. Let \sigma_{i} be the indicator of the event a_{i}\leq t_{i}, and let \xi_{i} be the indicator of the event a_{i}\leq i-1. Then \sum_{i=1}^{k}\sigma_{i}=S and \sigma_{i}\leq\xi_{i} for each i.

Hence we will upper bound SS by S:=i=1kξiS^{\prime}:=\sum_{i=1}^{k}\xi_{i}, and upper bound SS^{\prime} using the Chernoff bound. Indeed 𝔼(S)=i=1k𝔼(ξi)=i=1ki1d=k(k1)2d\operatorname*{\mathbb{E}}(S^{\prime})=\sum_{i=1}^{k}\operatorname*{\mathbb{E}}(\xi_{i})=\sum_{i=1}^{k}\frac{i-1}{d}=\frac{k(k-1)}{2d}. Note that δ2/(2+δ)δ/2\delta^{2}/(2+\delta)\geq\delta/2 when δ>2\delta>2, which is the case for the values we use below.

Now, we first consider the case when k(k1)2d1\frac{k(k-1)}{2d}\geq 1.

By applying A.2 to ξi\xi_{i} and setting δ=6lnn\delta=6\ln n, we get

Pr[S7lnnk(k1)2d]Pr[S(1+6lnn)k(k1)2d]exp(36ln2n2+6lnnk(k1)2d)exp(3lnnk(k1)2d)1/n3.\begin{split}\Pr\left[S^{\prime}\geq 7\ln n\frac{k(k-1)}{2d}\right]&\leq\Pr\left[S^{\prime}\geq(1+6\ln n)\frac{k(k-1)}{2d}\right]\\ &\leq\exp\left(-\frac{36\ln^{2}n}{2+6\ln n}\cdot\frac{k(k-1)}{2d}\right)\\ &\leq\exp\left(-3\ln n\frac{k(k-1)}{2d}\right)\\ &\leq 1/n^{3}.\end{split}

When k(k1)2d<1\frac{k(k-1)}{2d}<1, set δ=6lnn2dk(k1)\delta=6\ln n\frac{2d}{k(k-1)}.

Pr[S7lnn]Pr[S(1+6lnn2dk(k1))k(k1)2d]exp(3lnn2dk(k1)k(k1)2d)exp(3lnn)1/n3.\begin{split}\Pr\left[S^{\prime}\geq 7\ln n\right]&\leq\Pr\left[S^{\prime}\geq(1+6\ln n\frac{2d}{k(k-1)})\frac{k(k-1)}{2d}\right]\\ &\leq\exp\left(-3\ln n\frac{2d}{k(k-1)}\cdot\frac{k(k-1)}{2d}\right)\\ &\leq\exp\left(-3\ln n\right)\\ &\leq 1/n^{3}.\end{split}

Therefore,

\Pr\left[|\text{support of }g|<k-7\ln n\cdot\max\left\{1,\frac{k(k-1)}{2d}\right\}\right]=\Pr\left[S>7\ln n\cdot\max\left\{1,\frac{k(k-1)}{2d}\right\}\right]\\ \leq\Pr\left[S^{\prime}\geq 7\ln n\cdot\max\left\{1,\frac{k(k-1)}{2d}\right\}\right]\leq 1/n^{3},

which concludes the proof. ∎
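
The bound of Lemma A.3 can also be checked empirically with a short Monte Carlo simulation; the sketch below (an illustration only, with arbitrary parameter values) draws the k dimensions of g with replacement and compares the resulting support size against the bound.

```python
import math
import random

def support_size(k, d):
    # The support of g is the set of distinct dimensions among the k
    # uniformly chosen (with replacement) dimensions of h_1, ..., h_k.
    return len({random.randrange(d) for _ in range(k)})

# Arbitrary illustrative parameters (the bound is loose, so no violations are expected).
n, d, k, trials = 1000, 5000, 60, 10_000
bound = k - 7 * math.log(n) * max(1.0, k * (k - 1) / (2 * d))
violations = sum(support_size(k, d) < bound for _ in range(trials))
print(f"bound = {bound:.1f}, empirical violation rate = {violations / trials:.4f} "
      f"(the lemma allows at most {1 / n**3:.1e})")
```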

See Lemma 4.2.

Proof.

By Lemma A.3, the probability that the support of any fixed hash function in G is at least k-7\ln n\cdot\max\{1,\frac{k(k-1)}{2d}\}\geq k-7\ln n\cdot\max\{1,\frac{k^{2}}{2d}\} is at least 1-1/n^{3}. The number of hash functions in G is \lceil\lambda n^{\rho}\rceil\leq n^{2}, and so by the union bound, the probability that there exists a hash function g\in G with support less than k-7\ln n\cdot\max\{1,\frac{k^{2}}{2d}\} is at most \lceil\lambda n^{\rho}\rceil\cdot\frac{1}{n^{3}}\leq 1/n. ∎

See Lemma 4.3.

Proof.

We fix an arbitrary set GG of LL hash functions that satisfy the condition of the lemma. By linearity of expectation,

\operatorname*{\mathbb{E}}[|\textsc{Coll}(q,z)|]=\sum_{g\in G}\Pr[g(q)=g(z)],

where the randomness is over the choice of qq as in the lemma.

Let x=7\ln n\cdot\max\{1,\frac{k^{2}}{2d}\}.

In the following we consider an arbitrary gGg\in G and derive an upper bound for the collision probability of qq and zz.

PrqUNIF(𝕊(z,rt))[g(q)=g(z)](1kxd)rt,\Pr_{q\sim UNIF(\mathbb{S}(z,r-t))}[g(q)=g(z)]\leq\left(1-\frac{k-x}{d}\right)^{r-t}, (1)

where 𝕊(z,rt)={p{0,1}d:dist(z,p)=rt}\mathbb{S}(z,r-t)=\{p\in\{0,1\}^{d}:\operatorname*{dist}(z,p)=r-t\}.

To see this, consider the following way to sample q: first sample r-t indices uniformly at random with replacement from [d] to form a set B, |B|\leq r-t, and flip the bits of z at the positions in B to obtain a point q^{\prime}. After that, choose the remaining r-t-\operatorname*{dist}(z,q^{\prime}) positions uniformly at random without replacement from [d]\setminus B and flip them, obtaining q. Because the bits flipped in the second step are different from the bits flipped in the first step, if g(q^{\prime})\neq g(z) then g(q)\neq g(z). Therefore, \Pr[g(q^{\prime})=g(z)]\geq\Pr[g(q)=g(z)].

On the other hand, g(q^{\prime})=g(z) if and only if none of the indices sampled into B lie in the support of g. Because the support of g has size at least k-x, each sampled index avoids it with probability at most 1-\frac{k-x}{d}. Since the indices are sampled independently, we have \Pr[g(q^{\prime})=g(z)]\leq(1-\frac{k-x}{d})^{r-t}. This establishes (1).

We are now in a position to upper bound the expected size of \textsc{Coll}(q,z), i.e. the number of hash functions under which q and z collide, obtaining the result of the lemma. Using (1) we get

𝔼q[|Coll(q,z)|]=𝔼q[gGI(g(q)=g(z))]=gGPr[g(q)=g(z)]L(1kxd)rt.\operatorname*{\mathbb{E}}_{q}\left[|\textsc{Coll}(q,z)|\right]=\operatorname*{\mathbb{E}}_{q}\left[\sum_{g\in G}I(g(q)=g(z))\right]=\\ \sum_{g\in G}\Pr[g(q)=g(z)]\leq L\cdot\left(1-\frac{k-x}{d}\right)^{r-t}. (2)

Recall that by our parameter setting the number of hashings LL satisfies L=λ(λ+1)L=\lceil\lambda\ell\rceil\leq(\lambda+1)\ell, where

=nρ=(1rd)log1/p2n(1rd)k.\ell=n^{\rho}=\left(1-\frac{r}{d}\right)^{-\log_{1/p_{2}}n}\leq\left(1-\frac{r}{d}\right)^{-k}. (3)

Hence, by (3), we have the following upper bound on the expectation.

𝔼q[|Coll(q,z)|](λ+1)(1rd)k(1kxd)rt\operatorname*{\mathbb{E}}_{q}\left[|\textsc{Coll}(q,z)|\right]\leq(\lambda+1)\left(1-\frac{r}{d}\right)^{-k}\left(1-\frac{k-x}{d}\right)^{r-t} (4)

The only step left is to upper bound the right-hand side of (4), for which it suffices to upper bound \left(1-\frac{r}{d}\right)^{-k}\left(1-\frac{k-x}{d}\right)^{r-t}. By A.1 we have

\begin{split}\left(1-\frac{k-x}{d}\right)^{r-t}&\leq\exp\left(-\frac{k-x}{d}\right)^{r-t}=\exp\left(-\frac{kr}{d}+\frac{x(r-t)}{d}+\frac{tk}{d}\right),\\ \left(1-\frac{r}{d}\right)^{k}&\geq\exp\left(-\frac{r}{d}-\frac{r^{2}}{d^{2}}\right)^{k}=\exp\left(-\frac{kr}{d}-\frac{kr^{2}}{d^{2}}\right).\end{split}

The largest term krd\frac{kr}{d} cancels out in the fraction, so we are only left with a sum of smaller terms.

(1kxd)rt(1rd)kexp(tkd+kr2d2+x(rt)d).\frac{(1-\frac{k-x}{d})^{r-t}}{(1-\frac{r}{d})^{k}}\leq\exp\left(\frac{tk}{d}+\frac{kr^{2}}{d^{2}}+\frac{x(r-t)}{d}\right). (5)

Now we will establish upper bounds on each of the summands in the right-hand side of (5). First, by Lemma 4.1,

k2dcrlnn.k\leq 2\frac{d}{cr}\ln n.

This implies:

  • k\leq d/(2t), since 8e^{2}(\lambda+1)\ln n=4t\ln n\leq cr (by the assumption on \lambda and c\geq 1), so 2\frac{d}{cr}\ln n\leq d/(2t).

  • kd2/r2k\leq d^{2}/r^{2}, since rd/(14lnn)dc2lnnr\leq d/(14\ln n)\leq\frac{dc}{2\ln n}, so 21clnnd/r2\frac{1}{c}\ln n\leq d/r and 2dcrlnnd2/r22\frac{d}{cr}\ln n\leq d^{2}/r^{2}.

  • k2d27rlnnk^{2}\leq\frac{d^{2}}{7r\ln n}, since r28ln3n/c2r\geq 28\ln^{3}n/c^{2}, so 17lnn41c2rln2n\frac{1}{7\ln n}\geq 4\frac{1}{c^{2}r}\ln^{2}n and 4d2c2r2ln2nd27rlnn4\frac{d^{2}}{c^{2}r^{2}}\ln^{2}n\leq\frac{d^{2}}{7r\ln n}

The last inequality implies that 7\ln n\cdot r\cdot\frac{k^{2}}{2d^{2}}\leq 1/2, and r\leq\frac{d}{14\ln n} implies 7\ln n\cdot r/d\leq 1/2, so

x(rt)dxr/d1/2.\frac{x(r-t)}{d}\leq xr/d\leq 1/2.

On the other hand, the first two bounds on k imply tk/d\leq 1/2 and kr^{2}/d^{2}\leq 1, which finally allows us to bound the right-hand side of (5):

exp(tkd+kr2d2+x(rt)d)exp(0.5+1+0.5)=e2.\exp\left(\frac{tk}{d}+\frac{kr^{2}}{d^{2}}+\frac{x(r-t)}{d}\right)\leq\exp\left(0.5+1+0.5\right)=e^{2}.

Putting the above together with (4) we therefore conclude that

\operatorname*{\mathbb{E}}_{q}[|\textsc{Coll}(q,z)|]\leq e^{2}(\lambda+1)

as required. ∎
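
The collision bound (1) can also be checked numerically. The sketch below (an illustration only, with arbitrary parameters; not the experimental code of the paper) samples a coordinate hash function g, draws q uniformly at distance r-t from z, and compares the empirical collision frequency with (1-s/d)^{r-t}, where s is the actual support size of g (the argument above gives the bound with s in place of k-x).

```python
import random

def g_hash(dims, point):
    # g(p) is the restriction of p to the dimensions defining g.
    return tuple(point[i] for i in dims)

# Arbitrary illustrative parameters; dist plays the role of r - t.
d, k, dist, trials = 400, 30, 80, 20_000
random.seed(0)

z = [random.randint(0, 1) for _ in range(d)]
dims = [random.randrange(d) for _ in range(k)]   # k dimensions, drawn with replacement
s = len(set(dims))                               # actual support size of g

collisions = 0
for _ in range(trials):
    q = z[:]
    for i in random.sample(range(d), dist):      # q uniform at distance `dist` from z
        q[i] ^= 1
    collisions += g_hash(dims, q) == g_hash(dims, z)

print(f"empirical collision rate = {collisions / trials:.4f}, "
      f"bound (1 - s/d)^dist = {(1 - s / d) ** dist:.4f}")
```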

A.2 Proofs from Section 4.1

We split Theorem 4.6 into two separate Theorems A.4 and A.6 for each of Algorithms 2 and 3. In this subsection we formally prove Theorem A.4.

Theorem A.4 (Simplest algorithm).

Let n>en>e, 28ln3nrd/(28lnn)28\ln^{3}n\leq r\leq d/(28\ln n), λ18e2min{rlnn,n1/8}1\lambda\leq\frac{1}{8e^{2}}\min\{\frac{r}{\ln n},n^{1/8}\}-1, and 1+80+16ln(λ+1)lnnclnn1+\frac{80+16\ln(\lambda+1)}{\ln n}\leq c\leq\ln n. Let t=2e2(λ+1)t=2e^{2}(\lambda+1) be the parameter from Algorithm 2. Suppose that the point zz passed to Algorithm 2 is located at a distance of at least 2cr2cr from any other point in PP. With probability at least 1/41/n1/4-1/n the algorithm finds a point qq at a distance of at most rr from zz such that querying LSH with qq returns no point. It uses at most O(crλ)O(cr\cdot\lambda) queries to the LSH data structure.

Proof.

First we show that under this parameter setting the requirements of Lemmas 4.4 and 4.3 are satisfied. For this we need to show that rd/(5c)r\leq d/(5c) and that the support of each hash function gGg\in G is at least k/2k/2.

The inequality r\leq d/(5c) immediately follows from c\leq\ln n and r\leq d/(28\ln n).

By Lemma 4.2, with probability at least 1-1/n every hash function g\in G has support at least k-7\ln n\cdot\max\{1,\frac{k^{2}}{2d}\}. By Lemma 4.1 and the parameter setting, r\leq d/(28\ln n)\leq d/(28c), hence 14\ln n\leq\frac{d}{2cr}\ln n\leq k. On the other hand, r\geq 28\ln^{3}n\geq 14\ln^{2}n/c, so k\leq\frac{2d}{cr}\ln n\leq\frac{d}{7\ln n} and \frac{k^{2}}{2d}\leq\frac{k}{14\ln n}. This means that k/2\geq 7\ln n\cdot\max\{1,\frac{k^{2}}{2d}\}, so every g\in G has support at least k/2 with probability at least 1-1/n.

To conclude the theorem, it therefore suffices to show that Algorithm 2 returns a point q within distance r of z such that \textsc{Coll}(q,z)=\emptyset: since every other point of P is at distance at least 2cr from z and hence at distance more than cr from q, querying LSH with such a q returns no point.

Let q be the point sampled in line 2. By Lemma 4.3 and Markov’s inequality, |\textsc{Coll}(q,z)|\leq 2e^{2}(\lambda+1) with probability at least 1/2. Conditioned on this inequality holding, we will show that with probability at least 1-\frac{1}{8e^{2}(\lambda+1)} the algorithm finds, in each execution of the outer loop at line 3, a bit i such that after flipping this bit in q, |\textsc{Coll}(q,z)| decreases by at least 1. This means that after at most 2e^{2}(\lambda+1) iterations \textsc{Coll}(q,z)=\emptyset. By a union bound, all of this happens with probability at least 1-(1/2+2e^{2}(\lambda+1)\cdot\frac{1}{8e^{2}(\lambda+1)})=1/4.

By taking a further union bound with the event that every g\in G has support at least k/2, which holds with probability at least 1-1/n, we obtain the final success probability of at least 1/4-1/n.

By Lemma 4.5, in each iteration the inner loop at line 6 finds, with probability at least 1-1/(32e^{2}(\lambda+1)), two points q^{\prime} and q^{\prime\prime} such that \textsc{Coll}(q^{\prime\prime},z)\neq\emptyset, \textsc{Coll}(q^{\prime},z)=\emptyset, and q^{\prime} and q^{\prime\prime} differ in exactly one bit.

Assuming such a pair of points is found, let i be the position of the bit on which they differ. Since \textsc{Coll}(q^{\prime\prime},z)\neq\emptyset, there is a function g with g(q^{\prime\prime})=g(z); because q^{\prime\prime} is obtained from q by flipping only bits on which q agrees with z, this function also satisfies g(q)=g(z), i.e. g\in\textsc{Coll}(q,z). On the other hand, because \textsc{Coll}(q^{\prime},z)=\emptyset, we have g(q^{\prime})\neq g(z), so bit i must lie in the support of g. Hence after we flip this bit in q, g(q)\neq g(z) and |\textsc{Coll}(q,z)| decreases.

Note that because the outer loop is executed at most t times and q is sampled at distance r-t from z, we always have \operatorname*{dist}(q,z)\leq r, so the prerequisites of Lemma 4.5 always hold.

The outer loop is executed at most 2e^{2}(\lambda+1) times, and the inner loop makes at most cr-(r-t) iterations, as each iteration moves q^{\prime} further away from z. Therefore, the total number of queries is bounded by O((cr-r+t)\cdot\lambda)=O(cr\cdot\lambda). ∎

A.3 Proofs for Algorithm 3

Here we present Algorithm 3 and its analysis. The key difference compared to Algorithm 2 is the inner loop, which we analyze in a separate lemma.

Algorithm 3 Fast adaptive adversarial walk from zz.
0:  Origin point z, parameters r, cr, \lambda, access to a QueryLSH data structure
1:  t2e2(λ+1)t\leftarrow 2e^{2}(\lambda+1)
2:  Sample qq at distance rtr-t from zz
3:  while QueryLSH(q)=z\textsc{QueryLSH}(q)=z do
4:     if dist(q,z)r\operatorname*{dist}(q,z)\geq r then
5:        return  \bot
6:     qleft,qrightqq^{left},q^{right}\leftarrow q
7:     Choose cr-\operatorname*{dist}(q^{right},z) random bits i of q^{right} with q^{right}_{i}=z_{i} and flip them
8:     if QueryLSH(qright)=z\textsc{QueryLSH}(q^{right})=z then
9:        return  \bot
10:     while dist(qleft,qright)>1\operatorname*{dist}(q^{left},q^{right})>1 do
11:        qmidqleftq^{mid}\leftarrow q^{left}
12:        Flip any \lceil\frac{1}{2}\operatorname*{dist}(q^{left},q^{right})\rceil bits i of q^{mid} on which q^{left} and q^{right} differ
13:        if QueryLSH(qmid)=z\textsc{QueryLSH}(q^{mid})=z then
14:           qleftqmidq^{left}\leftarrow q^{mid}
15:        else
16:           qrightqmidq^{right}\leftarrow q^{mid}
17:     jj\leftarrow the bit where qleftq^{left} differs from qrightq^{right}
18:     Flip bit jj in qq
19:  return  qq
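
For concreteness, the following Python sketch mirrors Algorithm 3. It is an illustration only: the oracle query_lsh (returning the point reported by the LSH data structure, or None) and the way points are represented are our assumptions, not part of the pseudocode above.

```python
import math
import random

def fast_adversarial_walk(z, r, cr, lam, query_lsh):
    """Sketch of Algorithm 3: adaptive walk looking for a false negative query.

    z is a list of 0/1 values, query_lsh(q) returns the point reported by the
    LSH data structure for query q, or None.
    """
    d = len(z)
    t = math.ceil(2 * math.e ** 2 * (lam + 1))
    hamming = lambda a, b: sum(x != y for x, y in zip(a, b))

    # Line 2: sample q uniformly at distance r - t from z.
    q = z[:]
    for i in random.sample(range(d), r - t):
        q[i] ^= 1

    while query_lsh(q) == z:                                      # line 3
        if hamming(q, z) >= r:                                    # lines 4-5
            return None
        q_left, q_right = q[:], q[:]                              # line 6
        # Line 7 (with the corrected condition): move q_right to distance cr from z.
        agree = [i for i in range(d) if q_right[i] == z[i]]
        for i in random.sample(agree, cr - hamming(q_right, z)):
            q_right[i] ^= 1
        if query_lsh(q_right) == z:                               # lines 8-9
            return None
        while hamming(q_left, q_right) > 1:                       # line 10
            diff = [i for i in range(d) if q_left[i] != q_right[i]]
            q_mid = q_left[:]                                     # line 11
            for i in diff[:(len(diff) + 1) // 2]:                 # line 12
                q_mid[i] ^= 1
            if query_lsh(q_mid) == z:                             # lines 13-16
                q_left = q_mid
            else:
                q_right = q_mid
        j = next(i for i in range(d) if q_left[i] != q_right[i])  # line 17
        q[j] ^= 1                                                 # line 18
    return q                                                      # line 19
```

The binary search in lines 10–16 issues one query per halving of the distance between q^{left} and q^{right}, which is where the logarithmic number of queries per outer iteration comes from.
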
Lemma A.5 (Inner loop of the fast adversarial walk).

Let n>e, cr/d\leq 1/5, c\geq 1+\frac{80+16\ln(\lambda+1)}{\ln n}, and 8e^{2}(\lambda+1)\leq n^{1/8}. Let G be such that no hash function g\in G has support smaller than k-4\log n. Suppose that the point z passed to Algorithm 3 is located at distance at least 2cr from any other point in P. Let the point q sampled in line 2 be such that |\textsc{Coll}(q,z)|\leq 2e^{2}(\lambda+1). For a fixed iteration of the outer loop at line 3 of Algorithm 3, with probability at least 1-1/(32e^{2}(\lambda+1)) the inner loop between lines 6 and 17 finds two points q^{left} and q^{right} such that:

  • QueryLSH(qright)z\textsc{QueryLSH}(q^{right})\neq z,

  • QueryLSH(qleft)=z\textsc{QueryLSH}(q^{left})=z,

  • qleftq^{left} and qrightq^{right} differ in one bit.

The inner loop takes at most O(\log(cr-r+t)) time.

Proof.

Fix an iteration of the outer loop. The idea of the inner loop is to sample two points qleftq^{left} and qrightq^{right} such that there is a path between them that goes strictly away from zz and such that QueryLSH(qleft)=z\textsc{QueryLSH}(q^{left})=z and QueryLSH(qright)z\textsc{QueryLSH}(q^{right})\neq z. Then we move those points towards each other by binary search while preserving those properties.

The initial point q^{right} is sampled through the same process as in Lemma 4.4, so with probability at least 1-1/(32e^{2}(\lambda+1)) the condition \textsc{QueryLSH}(q^{right})\neq z holds and the return in line 9 is not reached. The condition \textsc{QueryLSH}(q^{left})=z holds at the start because \textsc{QueryLSH}(q)=z by the while condition of the outer loop.

During the execution, both guarantees continue to hold because of the way we reassign variables in the binary search. At the end, the distance between the two points becomes 1, which means that they differ in exactly one bit. This concludes the correctness proof.

The running time is O(\log(cr-(r-t))) because the distance between q^{left} and q^{right} is halved after each iteration. ∎

The proof for Algorithm 3 is almost the same as that for Algorithm 2, so we only outline the differences.

Theorem A.6 (Fast algorithm).

Let n>e, 28\ln^{3}n\leq r\leq d/(28\ln n), \lambda\leq\frac{1}{8e^{2}}\min\{\frac{r}{\ln n},n^{1/8}\}-1, and 1+\frac{80+16\ln(\lambda+1)}{\ln n}\leq c\leq\ln n. Let t=2e^{2}(\lambda+1) be the parameter from Algorithm 3. Suppose that the point z passed to Algorithm 3 is located at a distance of at least 2cr from any other point in P. With probability at least 1/4-1/n the algorithm finds a point q at a distance of at most r from z such that querying LSH with q returns no point. It uses at most O(\log(cr)\cdot\lambda) queries to the LSH data structure.

Proof.

Algorithms 2 and 3 differ only in the implementation of the inner loop, located between lines 6 and 12 and between lines 6 and 17, respectively. Because of this, the proof is exactly the same as that of Theorem A.4, except that Lemma A.5 is used instead of Lemma 4.5.

The final number of queries is O(\log(cr-r+t)\cdot\lambda)=O(\log(cr)\cdot\lambda), since the inner loop makes at most O(\log(cr-r+t)) queries. ∎

A.4 Existence of an isolated point

Not all data sets contain a sufficiently isolated point as required by Theorems A.4 and A.6, i.e. a point at distance at least 2cr from any other point in the set. However, when the point set is sampled uniformly at random, this assumption holds with high probability for all of the points whenever 2cr<d/4.

Lemma A.7.

Let P be a set of n points sampled uniformly and independently at random from \{0,1\}^{d}. If d\geq 48\ln n, then with probability at least 1-1/n the distance between any two of them is at least d/4.

Proof.

Fix an arbitrary point zz. We will first show that the probability that dist(z,x)<d/4\operatorname*{dist}(z,x)<d/4 is at most 1/n31/n^{3} for a uniformly random point xx.

Notice that \operatorname*{dist}(z,x)=\sum_{i=1}^{d}y_{i}, where y_{i}=|z_{i}-x_{i}|, and the y_{i} are mutually independent, each equal to 1 with probability 1/2 and to 0 otherwise. Hence \operatorname*{\mathbb{E}}[\sum_{i=1}^{d}y_{i}]=d/2. Therefore, by the (lower tail) Chernoff bound,

Pr[dist(z,x)0.50.5d]e0.250.5d2e3lnn=1/n3.\Pr[\operatorname*{dist}(z,x)\leq 0.5\cdot 0.5d]\leq e^{-\frac{0.25\cdot 0.5d}{2}}\leq e^{-3\ln n}=1/n^{3}.

Now, by a union bound over all pairs of points, the probability that some pair of points is at distance less than d/4 is at most n^{2}\cdot 1/n^{3}=1/n. ∎
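
This can be checked quickly in simulation (an illustration only, with small arbitrary parameters):

```python
import math
import random

def min_pairwise_distance(points):
    # Minimum Hamming distance over all pairs of points.
    return min(
        sum(a != b for a, b in zip(p, q))
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )

n = 50
d = math.ceil(48 * math.log(n))                     # ensures d >= 48 ln n
points = [[random.randint(0, 1) for _ in range(d)] for _ in range(n)]
print(f"d = {d}, min pairwise distance = {min_pairwise_distance(points)}, "
      f"threshold d/4 = {d / 4:.1f}")
```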

Appendix B Discussion of impact

This paper presents work whose goal is to advance the field of Machine Learning.

The main contribution of the paper is an improved understanding of the parameter choices for which LSH data structures are unsafe with respect to adversarial attacks. We show that, in principle, attacks on standard LSH constructions are possible if the parameters are not chosen carefully. As a consequence, false-negative-free LSH data structures should be given preference in sensitive applications. Our main contribution is the theoretical analysis of the attack, and our experiments focus on understanding the parameter settings for which the attack may work. Therefore, we believe that there is very limited risk of abuse of our results, and we think that the benefit of understanding such potential attacks clearly outweighs this limited risk.

Appendix C Additional experiments and details

Resources used.

The experiments were run on a server in an internal cluster, where a typical worker has two AMD EPYC 7302 16-core processors and up to 500 GiB of RAM. Experiments were run in parallel on multiple cores. In total, running all of the experiments, including those in the appendix, takes at most 100000 CPU hours and uses 21 GiB of disk space.

Data sets used in experiment 2.

MNIST [28]. This dataset consists of grayscale pictures of handwritten digits of size 28\times 28. We first flatten the images into vectors of size 784, and then we project those vectors into Hamming space by setting each entry to 0 if the corresponding grayscale value is 0 (a background pixel) and to 1 otherwise. Note that this transformation preserves the dataset fairly well, as the background pixels in each image have grayscale value exactly 0.

Anonymous Microsoft Web Data (MSWeb) [12]. Entries in this dataset are anonymous web user IDs and the paths on the Microsoft website that they have visited. There are in total 294 different paths. We represent each user by a point in 294-dimensional Hamming space, where an entry is equal to 1 if the user has visited the corresponding path.

Mushroom [34, 38]. This dataset consists of descriptions of mushroom samples from 23 different species. Each mushroom is described by 22 categorical features. We one-hot encode each feature to map the descriptions into Hamming space, with the final number of dimensions being 116.
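
The mappings into Hamming space described above are straightforward; the following sketch illustrates them in Python (illustrative only; array names and shapes are our assumptions, not the paper's preprocessing code).

```python
import numpy as np

def binarize_images(images):
    # MNIST-style mapping: flatten each 28x28 grayscale image to a 784-dim
    # vector and set an entry to 1 iff the pixel value is nonzero.
    flat = images.reshape(images.shape[0], -1)
    return (flat != 0).astype(np.uint8)

def one_hot_categorical(rows, categories_per_feature):
    # Mushroom-style mapping: one-hot encode each categorical feature and
    # concatenate, yielding points in Hamming space.
    encoded = []
    for row in rows:
        bits = []
        for value, categories in zip(row, categories_per_feature):
            v = [0] * len(categories)
            v[categories.index(value)] = 1
            bits.extend(v)
        encoded.append(bits)
    return np.array(encoded, dtype=np.uint8)

# Tiny illustrative example for the one-hot mapping.
cats = [["a", "b"], ["x", "y", "z"]]
print(one_hot_categorical([["a", "z"], ["b", "x"]], cats))
# [[1 0 0 0 1]
#  [0 1 1 0 0]]
```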

Experiment 3: Influence of λ\lambda on the algorithm efficiency.

In this experiment the goal was to measure the impact of \lambda on the probability of success of the algorithm. The setup is the same as in Experiment 2. All runs are done on the Random dataset. The results are shown in Figure 3.

Similarly to Experiment 2, we compare the different values of \lambda by varying the distance at which we want to find the negative query. As is evident from the plots, the larger \lambda is, the harder it is for the algorithm to succeed.

Note that the values of \lambda at which the algorithm no longer succeeds at distance r are rather modest. It may be that simply increasing \lambda is enough to defend against the type of attack described in this paper.

Experiment 4: Number of queries as a function of starting distance.

In this experiment we confirm empirically that the number of queries the algorithm needs depends linearly on t. We use the same datasets and experimental setting as in Experiment 2. The results are shown in Figure 4.

Combined with the results of Experiment 2, this shows that lowering the starting distance provides a trade-off between the running time and the probability of success.

Experiment 5: Efficiency gain over random sampling.

Finally, we investigated the efficiency gain that our adaptive algorithm achieved over random point sampling. The experimental setup is the same as in Experiment 2. Here we use the Random dataset. For both approaches, we continued rerunning them until a false negative query was found. The results are presented in Figure 5.

As the figure shows, when \lambda and, accordingly, the number of hash functions L are very small, random sampling is more efficient. However, already for modest values of \lambda we obtain a significant exponential speed-up.

Experiment 6: Comparison of defense mechanisms.

In this experiment we compare two different defense mechanisms, based on ideas from differential privacy [10] and from distance estimation [14].

Theorem 1.3 of [14] presents a general robustification mechanism which, when applied to the LSH of [22], shows that it can be made robust with blow-ups of O(d\log 1/\delta) in memory and O(\log 1/\delta) in query time, where \delta is the failure probability per adaptive query. This is achieved by creating O(d\log 1/\delta) copies of plain LSH. Querying is done by selecting O(\log 1/\delta) of those LSH data structures, relaying the query to them, and then choosing any of the returned points as the answer, or \bot if all of the data structures failed to find a point.
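
Schematically, the query procedure of this robustified variant can be sketched as follows (an illustration of the mechanism just described, not the construction of [14]; the class and parameter names are ours). Here copies would hold the O(d\log 1/\delta) independently built plain LSH data structures and samples_per_query would be O(\log 1/\delta).

```python
import random

class SampledCopiesLSH:
    """Keep many independent plain LSH copies; answer each query from a random subset."""

    def __init__(self, copies, samples_per_query):
        self.copies = copies                      # independently built plain LSH structures
        self.samples_per_query = samples_per_query

    def query(self, q):
        for lsh in random.sample(self.copies, self.samples_per_query):
            answer = lsh.query(q)                 # a point within distance cr, or None
            if answer is not None:
                return answer                     # any returned point is a valid answer
        return None                               # corresponds to returning "bottom"
```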

As a side note, the original ADE construction of [14] would also allow solving the ANN problem. We do not study it in this experiment for two reasons: it would require \Omega(n) time per query, which is too slow in our parameter setting, and it is based on linear sketches, against which our adaptive adversary is not effective.

The generic differentially private conversion of [10] is similar. For a generic estimation data structure, to be able to process T queries against an adaptive adversary, they create \widetilde{O}(\sqrt{T}) copies of the data structure. For each query, they select \widetilde{O}(1) copies of the data structure and aggregate their outputs using a differentially private median. Although their construction is not directly applicable to ANN, as ANN is a search problem, it would nevertheless be effective against our adversary, since for the adversary it is not important which exact point is returned, but only whether anything is returned at all.

As both of the data structures provably defend against an adaptive adversary, and because both of them require a lot of additional memory, in this experiment we consider their performance against our adversary when the available memory is limited. Moreover, because a rigorous implementation of a differentially private median would be computationally expensive, we instead use a slightly simplified version adapted to our setting. Namely, after sampling and querying the data structures, we count the number of them that returned a point and the number that did not. We add to both counts two-sided geometric noise with probability mass function \Pr[z]=\frac{1-\alpha}{1+\alpha}\alpha^{|z|} and parameter \alpha=e^{-1/4}. If after this perturbation the number of data structures that returned \bot is larger than the number that returned a point, we return \bot. Otherwise, one of the outputs of the successful data structures is returned.
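
A sketch of this simplified noisy aggregation is given below (illustrative only). It uses the fact that a two-sided geometric variable with parameter \alpha can be sampled as the difference of two independent geometric variables with success probability 1-\alpha.

```python
import math
import random

ALPHA = math.exp(-1 / 4)

def geometric_failures(alpha):
    # Number of failures before the first success (success probability 1 - alpha),
    # sampled by inversion: floor(ln U / ln alpha) with U uniform on (0, 1].
    return math.floor(math.log(1.0 - random.random()) / math.log(alpha))

def two_sided_geometric(alpha=ALPHA):
    # The difference of two such variables has pmf ((1 - alpha) / (1 + alpha)) * alpha**|z|,
    # the noise distribution used in this experiment.
    return geometric_failures(alpha) - geometric_failures(alpha)

def noisy_aggregate(answers):
    """answers: query results from the sampled copies (a point, or None for 'bottom')."""
    found = sum(a is not None for a in answers) + two_sided_geometric()
    missed = sum(a is None for a in answers) + two_sided_geometric()
    if missed > found:
        return None                               # report "bottom"
    return next((a for a in answers if a is not None), None)
```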

The results are depicted in Figures 6 and 7. In both figures, plots annotated with both \lambda and “q.s.” correspond to one of the two robustified variants: \lambda denotes the number of copies used, and “q.s.” stands for “query samples”, the number of data structures randomly sampled to answer each query. Each copy is built with the same parameters as in Experiment 2 and with \lambda=1. The plots annotated with only \lambda are the plain LSH constructions with different values of \lambda. The idea is that plots with the same value of \lambda have the same number of initialized hash functions. The dataset used is Random.

Both figures contain 4 subplots. The first subplot shows the probability that our adaptive adversary reports that it has found a false negative query at the desired target distance. Note that here, unlike in our previous experiments, the output of each query is non-deterministic even after the randomness of the hash functions is fixed. Hence we also test how “good” a false negative the reported point is. For each found point, we repeatedly query it to the data structure 100 times, and depict on the other three subplots the probability that a point discovered by the adversary is still a negative query for at least 90%, 50% and 10% of the repeated queries.

As Figure 6 shows, the defense based on [14] performs better than querying plain LSH with the same, and often even a greater, number of hash functions. Similarly, the probability that a discovered false negative remains a false negative in at least 50% or 90% of cases is fairly low.

On the other hand, the results for the algorithm based on differential privacy in Figure 7 show that the probability of reporting a false negative is considerably higher than for plain LSH, even for low values of the target distance. This is due to the aggregation mechanism that we use, as it has a chance of reporting nothing even if all sampled data structures returned a positive answer. However, this also means that the adversary has fewer chances to learn the underlying data structure, which can be seen from the fact that the probability that a repeated query is a negative one in at least 10% of cases stays below 0.5.

Based on this, it appears that unless your application can tolerate a significantly higher chance of failure for each query, it is better to use the construction based on [14]. Otherwise, the differentially private construction leaks less information to the adversary. Note that both constructions are still computationally faster than directly querying each data structure.

Figure 3: Desired distance from the origin point z of the query that does not hash together with z, for different values of \lambda. The experiments are conducted on the Zero dataset with the default parameter setting. The distance r is marked by a dashed vertical line.
Figure 4: The number of queries as a function of the distance of qq from origin point zz on all tested datasets.
Figure 5: Mean number of queries required by our adaptive adversary and by random sampling of query points to find a false negative, as a function of the parameter \lambda.
Figure 6: Performance of the robustification based on [14]. The first subplot depicts the probability that our adaptive adversary finds a false negative query versus the desired distance of that query from the origin. The remaining subplots depict the probability that, if such a query was found, querying it again would still give a false negative in at least 90%, 50% or 10% of cases. In the legend, plots annotated with both \lambda and “q.s.” correspond to the robust algorithm, with “q.s.” standing for “query samples”, the number of data structures sampled to answer each query, and \lambda representing the total number of data structures used. The rest are the plain LSH constructions with different values of \lambda.
Figure 7: Performance of robustification based on differential privacy [10]. First subplot depicts probability that our adaptive adversary finds a false negative query versus the desired distance to that query from origin. The rest of subplots depict probabilities that, if such a query was found, if it was queried again it would still be a false negative query with probability at least 90%90\%, 50%50\% or 10%10\%. In the legend, plots annotated with both λ\lambda and “q.s” correspond to the robust algorithm, with “q.s.” standing for “query samples”, the number of data structures sampled to answer each query, and λ\lambda representing the total number of used data structures. The rest are the plain LSH construction with different values of λ\lambda.