
Dynamic Maintenance of Kernel Density Estimation Data Structure: From Practice to Theory

Jiehao Liang [email protected]. UC Berkeley.    Zhao Song [email protected]. Adobe Research.    Zhaozhuo Xu [email protected]. Rice University.    Junze Yin [email protected]. Boston University.    Danyang Zhuo [email protected]. Duke University.

Kernel density estimation (KDE) stands out as a challenging task in machine learning. The problem is defined in the following way: given a kernel function $f(x,y)$ and a set of points $\{x_{1},x_{2},\cdots,x_{n}\}\subset\mathbb{R}^{d}$, we would like to compute $\frac{1}{n}\sum_{i=1}^{n}f(x_{i},y)$ for any query point $y\in\mathbb{R}^{d}$. Recently, there has been a growing trend of using data structures for efficient KDE. However, the proposed KDE data structures focus on static settings. The robustness of KDE data structures over dynamically changing data distributions is not addressed. In this work, we focus on the dynamic maintenance of KDE data structures with robustness to adversarial queries. Specifically, we provide a theoretical framework for KDE data structures. In our framework, the KDE data structures only require subquadratic space. Moreover, our data structure supports dynamic updates of the dataset in sublinear time. Furthermore, it can answer adaptive queries from a potential adversary in sublinear time.

1 Introduction

Kernel density estimation (KDE) is a well-known machine learning approach with wide applications in biology [20, 13], physics [21, 12] and law [7]. KDE is defined as follows: given a kernel function $f(x,y)$ and a dataset $\{x_{1},x_{2},\cdots,x_{n}\}$, we would like to estimate

\frac{1}{n}\sum_{i=1}^{n}f(x_{i},y)

for a query $y$. It is standard to assume the kernel $f$ to be positive semi-definite. From a statistics perspective, we regard KDE as estimating the density of a probability distribution provided by a mapping.

Recently, there has been a growing trend in applying probabilistic data structures for KDE [15, 4, 30, 1, 9, 16, 25]. The general idea is to transform the kernel function into a distance measure and then apply similarity search data structures such as locality sensitive hashing and sketching. This co-design of data structure and KDE is of practical importance: (1) the computational efficiency is taken into consideration when we design KDE algorithms; (2) the capacity of traditional probabilistic data structures is extended from search to sampling. As a result, we obtain a series of KDE algorithms with both sample efficiency and running time efficiency.

However, current KDE data structures focus on static settings where the dataset is fixed and the queries are independent of each other. More practical settings should be taken into consideration. In some applications of KDE, the dataset changes dynamically. For instance, in time series modeling [28, 22], the KDE estimators should adapt to insertions and deletions in the dataset. In semi-supervised learning [36, 37], KDE data structures should handle updates of the kernel function. Moreover, in works that apply KDE in optimization, the data structures should be robust to adversarial queries. As a result, the dynamic maintenance of KDE data structures should be emphasized in machine learning research.

In this paper, we argue that there exists a practice-to-theory gap for the dynamic maintenance of KDE data structures. Although there is existing work [8] that supports insertion and deletion in KDE data structures, the impact of these operations on the quality of KDE is not well addressed. Moreover, the robustness of KDE data structures against adversaries is a recently raised concern. Thus, a formal theoretical analysis is required to discuss the robustness of KDE data structures in a dynamic setting.

In this paper, we present a theoretical analysis of the efficient maintenance of KDE for dynamic datasets and adversarial queries. Specifically, we present the first data structure design that can quickly adapt to updated input data and is robust to adversarial queries. We call our data structure and the corresponding algorithms adaptive kernel density estimation. Our data structure only requires subquadratic space, each update to the input data takes sublinear time, and each query finishes in sublinear time.

Notations.

We use $\mathbb{R}$, $\mathbb{R}_{+}$, $\mathbb{N}_{+}$ to denote the set of real numbers, positive real numbers, and positive integers, respectively. For a set $X$, we use $|X|$ to denote its cardinality. Let $n\in\mathbb{N}_{+}$ and $r\in\mathbb{R}$. We define $[n]:=\{1,2,3,\dots,n\}$ and $\lceil r\rceil$ to be the smallest integer greater than or equal to $r$. $|r|$ is the absolute value of $r$. Let $\mathbb{R}^{n}$ be the set of all $n$-dimensional vectors whose entries are real numbers. $\|x\|_{2}$ represents the $\ell_{2}$ norm of $x$. $\Pr[\cdot]$ represents probability, and $\mathbb{E}[\cdot]$ represents expectation. We define $\exp_{2}(r)$ as $2^{r}$.

1.1 Related Work

Efficient Kernel Density Estimation

The naive KDE procedure takes a linear scan of the data points. This is prohibitively expensive for large-scale datasets. Therefore, it is of practical significance to develop efficient KDE algorithms. A family of traditional KDE algorithms, namely kernel merging [23, 6], performs clustering on the dataset so that the KDE is approximated by a weighted combination of centroids. However, these algorithms do not scale to high-dimensional datasets.

On the other hand, there is a trend of sampling-based KDE algorithms. The focus of this line of work is to develop efficient procedures that approximate KDE with fewer data samples. Starting from random sampling [27], sampling procedures such as Herding [17] and $k$-centers [14] have been introduced in KDE. Some work also combines sampling with the coreset concept [29] and provides a KDE algorithm that samples from an optimized subset of data points. Recently, there has been a growing interest in applying hash-based estimators (HBE) [15, 4, 30, 16, 5, 31] for KDE. HBEs use Locality Sensitive Hashing (LSH) functions: the collision probability of two vectors under an LSH function is monotone in their distance. Using this property, HBE performs efficient importance sampling via LSH functions and hash-table-type data structures. However, current HBEs are built for static settings and thus are not robust to incremental changes in the input data. As a result, their application to large-scale online learning is limited. Besides the LSH-based KDE literature, there is also KDE work based on polynomial methods [1, 3]. Dynamic versions of KDE have also been considered in [18, 35]. The work [19] presents both randomized and deterministic algorithms for approximating a symmetric KDE computation.

Adaptive Data Structure

Recently, there has been a growing trend of applying data structures [10, 11, 32, 40, 39, 34, 41, 33, 38, 26] to improve running time efficiency in machine learning. However, there exists a practice-to-theory gap between data structures and learning algorithms. Most data structures assume queries to be independent and provide theoretical guarantees based on this assumption. On the contrary, the queries to data structures across iterations of learning algorithms are mutually dependent. As a result, the existing analysis framework for data structures cannot provide guarantees in optimization. To bridge this gap, quantization strategies [32, 40, 34, 33] have been developed for adaptive queries in machine learning. The idea of these approaches is to quantize each query to its nearest vertex on an $\epsilon$-net. Therefore, the failure probability of the data structures can be upper bounded by a standard $\epsilon$-net argument. Although quantization methods demonstrate their success in machine learning, this direct combination does not fully unleash the power of both data structures and learning algorithms. In our work, we aim at a co-design of data structure and machine learning for efficiency improvements in adaptive KDE.

1.2 Problem Formulation

In this work, we would like to study the following problem.

Definition 1.1 (Dynamic Kernel Density Estimation).

Let $f:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow[0,1]$ denote a kernel function. Let $X=\{x_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d}$ denote a dataset. Let

f_{\mathsf{KDE}}^{*}:=f(X,q):=\frac{1}{|X|}\sum_{x\in X}f(x,q)

define the kernel density estimate of a query $q\in\mathbb{R}^{d}$ with respect to $X$. Our goal is to design a data structure that efficiently supports any sequence of the following operations:

  • Initialize$(f:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow[0,1],X\subset\mathbb{R}^{d},\epsilon\in(0,1),f_{\mathsf{KDE}}\in[0,1])$. The data structure takes a kernel function $f$, a data set $X=\{x_{1},x_{2},\dots,x_{n}\}$, an accuracy parameter $\epsilon$ and a known quantity $f_{\mathsf{KDE}}$ satisfying $f_{\mathsf{KDE}}\geq f_{\mathsf{KDE}}^{*}$ as input for initialization.

  • Update$(z\in\mathbb{R}^{d},i\in[n])$. Replace the $i$-th data point of the data set $X$ with $z$.

  • Query$(q\in\mathbb{R}^{d})$. Output $\widetilde{d}\in\mathbb{R}$ such that

    (1-\epsilon)f_{\mathsf{KDE}}^{*}(X,q)\leq\widetilde{d}\leq(1+\epsilon)f_{\mathsf{KDE}}^{*}(X,q).

We note that the Query procedure does not assume i.i.d. queries. Instead, we allow adaptive queries and provide theoretical guarantees.
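To make the interface concrete, here is a minimal Python sketch of the operations in Definition 1.1; the class name and the brute-force query are illustrative assumptions only (an exact reference implementation, not the sublinear data structure of Theorem 1.2).

```python
import numpy as np

class DynamicKDESketch:
    """Brute-force reference for the interface of Definition 1.1 (not the sublinear structure)."""

    def initialize(self, f, X, eps, f_kde):
        self.f = f                          # kernel f : R^d x R^d -> [0, 1]
        self.X = np.array(X, dtype=float)   # dataset {x_1, ..., x_n}
        self.eps = eps                      # accuracy parameter (unused by this exact reference)
        self.f_kde = f_kde                  # known quantity with f_kde >= f_KDE^*

    def update(self, z, i):
        # Replace the i-th data point with z.
        self.X[i] = np.asarray(z, dtype=float)

    def query(self, q):
        # Exact KDE value; the structure of Theorem 1.2 returns a (1 +/- eps)-approximation of this.
        return float(np.mean([self.f(x, q) for x in self.X]))
```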

1.3 Our Result

In this work, we provide theoretical guarantees for the dynamic KDE data structures defined in Definition 1.1. We summarize our main result as below:

Theorem 1.2 (Main result).

Given a kernel function $f$ and a set of points $X\subset\mathbb{R}^{d}$. Let $\operatorname{cost}(f)$ be defined as in Definition 2.7. For any accuracy parameter $\epsilon\in(0,0.1)$, there is a data structure using space $O(\epsilon^{-2}n\cdot\operatorname{cost}(f))$ (Algorithms 4, 5 and 6) for the Dynamic Kernel Density Estimation problem (Definition 1.1) with the following procedures:

  • Initialize$(f:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow[0,1],X\subset\mathbb{R}^{d},\epsilon\in(0,1),f_{\mathsf{KDE}}\in[0,1])$. Given a kernel function $f$, a dataset $X$, an accuracy parameter $\epsilon$ and a quantity $f_{\mathsf{KDE}}$ as input, the data structure DynamicKDE preprocesses in time

    O(\epsilon^{-2}n^{1+o(1)}\operatorname{cost}(f)\cdot(\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\log(1/f_{\mathsf{KDE}})\cdot\log^{3}n)

  • Update$(z\in\mathbb{R}^{d},i\in[n])$. Given a new data point $z\in\mathbb{R}^{d}$ and an index $i\in[n]$, the Update operation takes $z$ and $i$ as input and updates the data structure in time

    O(\epsilon^{-2}n^{o(1)}\operatorname{cost}(f)\cdot(\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\log(1/f_{\mathsf{KDE}})\cdot\log^{3}n)

  • Query$(q\in\mathbb{R}^{d})$. Given a query point $q\in\mathbb{R}^{d}$, the Query operation takes $q$ as input, approximately estimates the kernel density at $q$ in time

    O(\epsilon^{-2}n^{o(1)}\operatorname{cost}(f)\cdot(\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\log(1/f_{\mathsf{KDE}})\cdot\log^{3}n),

    and outputs $\widetilde{d}$ such that

    (1-\epsilon)f(X,q)\leq\widetilde{d}\leq(1+\epsilon)f(X,q).

We will prove the main result in Lemma 4.4, Lemma 4.6 and Lemma 4.9.

1.4 Technical Overview

In this section, we give an overview of the techniques that lead to our main result.

Density Constraint. We impose an upper bound on the true kernel density for the query $q$. We also introduce geometric level sets so that the number of data points that fall into each level is upper bounded.

Importance Sampling. To approximate the kernel density efficiently, we adopt the importance sampling technique. We sample each data point with a different probability according to its contribution to the estimation: the higher the contribution, the higher the probability of being sampled. Then we construct an unbiased estimator based on the sampled points and the sampling probabilities. The main problem is how to evaluate the contribution of each point to the KDE. We exploit the geometric properties of the kernel function and estimate the contribution of each point based on its distance from the query point.

Locality Sensitive Hashing. One problem with importance sampling is that during preprocessing we have no access to the query point, so it is impossible to estimate the contribution of each point to a future query. We make use of LSH to address this issue. Specifically, LSH preprocesses the data points and finds the near neighbors of a query with high probability. With this property, we first sample all data points in several rounds with geometrically decaying sampling probability. We design an LSH structure for each round that satisfies the following property: given a query, the LSH recovers points whose contribution is proportional to the sampling probability of that round. Then we can find the sampled points and the proper sampling probabilities when a query comes.

Dynamic Update. The previous techniques, i.e., importance sampling and LSH, are independent of the coordinate values of the data points themselves. This motivates us to support updating data points dynamically. Since LSH is a hash-based structure, given a new data point $z\in\mathbb{R}^{d}$ and an index $i$ indicating the data point to be replaced, we search for the bucket where $x_{i}$ was hashed, replace it with the new data point $z$, and update the hash table. Such an update does not affect the data structure for estimating the kernel density.

Robustness. To make our data structure robust to adaptive queries, we take the following steps. Starting from a constant success probability, we first repeat the estimation several times and take the median. This yields a correct answer for a fixed point with high probability. Then we extend this success probability to the net points of a unit ball. Finally, we generalize to all query points in $\mathbb{R}^{d}$. Thus we obtain a data structure that is robust to adversarial queries.

Roadmap

The rest of our paper is organized as follows: In Section 2, we describe some basic definitions and lemmas that are frequently used. In Section 3, we list some technical claims for our results. In Section 4, we demonstrate our data structure in detail, including the algorithms and the running time analysis. We analyze the robustness of our data structures against an adversary in Section 5. Finally, we draw our conclusion in Section 6.

2 Preliminary

The goal of this section is to introduce the basic definitions and lemmas that will be used to prove our main result.

We first introduce a collection of subsets called geometric weight levels.

Definition 2.1 (Geometric Weight Levels).

Fix $R\in\mathbb{N}_{+}$ and $q\in\mathbb{R}^{d}$. We define $w_{i}:=f(x_{i},q)$.

For any fixed $r\in[R]:=\{1,2,\cdots,R\}$, we define

L_{r}:=\{x_{i}\in X~|~w_{i}\in(2^{-r},2^{-r+1}]\}.

We define the corresponding distance levels as

z_{r}:=\max_{z:~f(z)\in(2^{-r},2^{-r+1}]}z,

where $f(z):=f(x,q)$ for $z=\|x-q\|_{2}$.

In addition, we define $L_{R+1}:=X\setminus\bigcup_{r\in[R]}L_{r}$.

Geometric weight levels can be visualized as a sequence of circular rings centered at the query $q$. For example, a point $x_{i}$ with weight $w_{i}=0.3\in(2^{-2},2^{-1}]$ belongs to $L_{2}$. The contribution of each level to the kernel density at $q$ is mainly determined by the corresponding distance level.

Next, we introduce the importance sampling technique which is the key idea to accelerate the query procedure.

Definition 2.2 (Importance Sampling).

Given data points $\{x_{1},\cdots,x_{n}\}\subset\mathbb{R}^{d}$, we sample each point $x_{i}$ with probability $p_{i}$ and construct the estimator as follows: $T:=\sum_{i=1}^{n}\frac{\chi_{i}}{p_{i}}x_{i}$, where $\chi_{i}$ is the indicator of the event that $x_{i}$ is sampled.

To apply importance sampling, we need to evaluate the contribution of each point. We sample each point that has a high contribution with a high probability.
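As a quick illustration, the following Python sketch (the weights and probabilities are synthetic and purely illustrative) checks numerically that the importance-sampling estimator is unbiased when applied to scalar contributions playing the role of $w_{i}=f(x_{i},q)$.

```python
import numpy as np

def importance_sampling_estimate(weights, probs, rng):
    """Unbiased estimate of sum(weights): keep item i w.p. probs[i], reweight by 1/probs[i]."""
    chi = rng.random(len(weights)) < probs      # chi_i ~ Bernoulli(p_i)
    return np.sum(weights[chi] / probs[chi])    # T = sum_i chi_i * w_i / p_i

rng = np.random.default_rng(0)
w = rng.random(1000)                    # synthetic contributions, playing the role of f(x_i, q)
p = np.clip(10 * w, 0.01, 1.0)          # higher contribution -> higher sampling probability
runs = [importance_sampling_estimate(w, p, rng) for _ in range(2000)]
print(np.mean(runs), w.sum())           # the empirical mean is close to sum_i w_i (unbiasedness)
```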

A natural question arises: during preprocessing, we have no access to the query, so we cannot calculate the distances directly. Locality Sensitive Hashing is a practical tool to address this problem.

Definition 2.3 (Locality Sensitive Hashing [24]).

A family $\mathcal{H}$ is called $(p_{\mathrm{near}},p_{\mathrm{far}},z,c)$-sensitive, where $p_{\mathrm{near}},p_{\mathrm{far}}\in[0,1]$, $z\in\mathbb{R}$, $c\geq 1$, if for any $x,q\in\mathbb{R}^{d}$:

  • $\Pr_{h\sim\mathcal{H}}[h(x)=h(q)~|~\|x-q\|_{2}\leq z]\geq p_{\mathrm{near}}$

  • $\Pr_{h\sim\mathcal{H}}[h(x)=h(q)~|~\|x-q\|_{2}\geq cz]\leq p_{\mathrm{far}}$

The next lemma shows the existence of the LSH family and its evaluation time.

Lemma 2.4 (Lemma 3.2 in page 6 of [2]).

Let $(x,q)\in\mathbb{R}^{d}\times\mathbb{R}^{d}$. Define

p_{\mathrm{near}}:=p_{1}(z):=\Pr_{h\sim\mathcal{H}}[h(x)=h(q)~|~\|x-q\|_{2}\leq z]

and

p_{\mathrm{far}}:=p_{2}(z,c):=\Pr_{h\sim\mathcal{H}}[h(x)=h(q)~|~\|x-q\|_{2}\geq cz].

Then, if we fix $z$ to be positive, there is a hash family $\mathcal{H}$ satisfying

\rho:=\frac{\log(1/p_{\mathrm{near}})}{\log(1/p_{\mathrm{far}})}\leq\frac{1}{c^{2}}+O(\frac{\log t}{t^{1/2}}),

for any $c\geq 1$, $t>0$, where $p_{\mathrm{near}}\geq e^{-O(\sqrt{t})}$ and each hash evaluation requires $dt^{O(t)}$ time.

Remark 2.5.

We set $t=\log^{2/3}n$, which results in $n^{o(1)}$ evaluation time and $\rho=\frac{1}{c^{2}}+o(1)$. Note that if $c=O(\log^{1/7}n)$, then

\frac{1}{\frac{1}{c^{2}}+O(\frac{\log t}{t^{1/2}})}=c^{2}(1-o(1)).
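To see why, note that with $t=\log^{2/3}n$ we have $\sqrt{t}=\log^{1/3}n$ and $\log t=\Theta(\log\log n)$, so for $c=O(\log^{1/7}n)$,

c^{2}\cdot O(\frac{\log t}{t^{1/2}})=O(\frac{\log^{2/7}n\cdot\log\log n}{\log^{1/3}n})=o(1)

since $2/7<1/3$; dividing through gives $\frac{1}{\frac{1}{c^{2}}+O(\frac{\log t}{t^{1/2}})}=\frac{c^{2}}{1+c^{2}\cdot O(\frac{\log t}{t^{1/2}})}=c^{2}(1-o(1))$.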

Next, we assign the LSH family to each geometric weight level (Definition 2.1) and show how well these families can distinguish points from different levels.

Lemma 2.6 (probability bound for separating points in different level sets, informal version of Lemma A.5).

Given a kernel function $f$ and $r\in[R]$, let $L_{r}$ be the weight level set and $z_{r}$ be the corresponding distance level (Definition 2.1). For any query $q\in\mathbb{R}^{d}$ and any integer pair $(i,r)\in[R+1]\times[R]$ satisfying $i>r$, let $x\in L_{r}$ and $x^{\prime}\in L_{i}$. Let

c_{i,r}:=\min\{\frac{z_{i-1}}{z_{r}},\log^{1/7}n\}.

We set up the Andoni-Indyk LSH family (Definition 2.3) $\mathcal{H}$ with near distance $z_{r}$ and define

p_{\mathrm{near},r}:=\Pr_{h\sim\mathcal{H}}[h(x)=h(q)~|~\|x-q\|_{2}\leq z_{r}]

and

p_{\mathrm{far},r}:=\Pr_{h\sim\mathcal{H}}[h(x)=h(q)~|~\|x-q\|_{2}\geq c_{i,r}z_{r}].

Then, for any integer $k\geq 1$, we have:

  1. $\Pr_{h^{*}\sim\mathcal{H}^{k}}[h^{*}(x)=h^{*}(q)]\geq p_{\mathrm{near},r}^{k}$

  2. $\Pr_{h^{*}\sim\mathcal{H}^{k}}[h^{*}(x^{\prime})=h^{*}(q)]\leq p_{\mathrm{near},r}^{kc_{i,r}^{2}(1-o(1))}$

This lemma suggests that we can apply LSH several times to separate points in different level sets. It is useful for recovering points in a specific level set when estimating the “importance” of a point based on its distance from the query point. We discuss this further in Section 4. We use a definition for the cost of the kernel similar to that in [9].

Definition 2.7 (Kernel cost).

Given a kernel $f$ with geometric weight levels $L_{r}$ and distance levels $z_{r}$ as defined in Definition 2.1. For any $r\in[R]$, we define the kernel cost of $f$ for $L_{r}$ as

\operatorname{cost}(f,r):=\exp_{2}(\max_{i\in\{r+1,\cdots,R+1\}}\lceil\frac{i-r}{c_{i,r}(1-o(1))}\rceil),

where $c_{i,r}:=\min\{\frac{z_{i-1}}{z_{r}},\log^{1/7}n\}$.

Then we define the general cost of a kernel $f$ as

\operatorname{cost}(f):=\max_{r\in[R]}\operatorname{cost}(f,r).

Note that when $f$ is the Gaussian kernel, $\operatorname{cost}(f)$ is $(\frac{1}{f_{\mathsf{KDE}}})^{(1+o(1))\frac{1}{4}}$ [9].

3 Technical Claims

In this section, we list some technical claims that are useful for our main results.

We start by giving an upper bound on sizes of geometric weight levels (Definition 2.1).

Lemma 3.1 (Sizes of geometric weight levels).

Given $r\in[R]$, we have

|L_{r}|\leq 2^{r}nf_{\mathsf{KDE}}^{*}\leq 2^{r}nf_{\mathsf{KDE}}.

Next, we show a lemma for the probability of recovering a point in the query procedure, given that this point is sampled in the preprocessing stage.

Lemma 3.2 (Probability for sampled point recovery).

Suppose that we invoke DynamicKDE.Initialize. Suppose that when $a=a^{*}$ and $r=r^{*}$, we sample a point $x\in L_{r^{*}}$. Given a query $q$, we invoke DynamicKDE.Query. With probability at least $1-\frac{1}{n^{10}}$, $H_{a^{*},r^{*}}$ recovers $x$.

With the above lemma, we can bound the number of recovered points in expectation. We show that there are only $O(1)$ points recovered by LSH in each geometric weight level (Definition 2.1).

Lemma 3.3 (Upper bound on number of recovered points in expectation).

Fix a query $q\in\mathbb{R}^{d}$. We define $R:=\lceil\log\frac{1}{f_{\mathsf{KDE}}}\rceil$. Fix $r\in[R]$ and define $p:=p_{\mathrm{near},r}$. For each $(i,r)\in[R]\times[R]$, we define

c_{i,r}:=\min\{\frac{z_{i-1}}{z_{r}},\log^{1/7}n\}.

There exists $k\in\mathbb{N}_{+}$,

k:=k_{r}:=\frac{1}{\log(1/p)}\max_{l\in\{r+1,\cdots,R+1\}}\lceil\frac{l-r}{c_{l,r}^{2}(1-o(1))}\rceil,

such that for any $i>r$,

\mathbb{E}_{h^{*}\sim\mathcal{H}^{k}}[|\{x^{\prime}\in L_{i}:h^{*}(x^{\prime})=h^{*}(q)\}|]=O(1).

Finally, we show that the KDE function inherits the Lipschitz property of the kernel function. This is an important property for designing robust algorithms.

Lemma 3.4 (Lipschitz property of KDE function).

Suppose the kernel function $f:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow[0,1]$ satisfies the following properties:

  • Radial: there exists a function $f:\mathbb{R}\rightarrow[0,1]$ such that $f(p,q)=f(\|p-q\|_{2})$ for all $p,q\in\mathbb{R}^{d}$.

  • Decreasing: $f$ is decreasing.

  • Lipschitz: $f$ is $\mathcal{L}$-Lipschitz.

Then the KDE function $f_{\mathsf{KDE}}^{*}:\mathbb{R}^{d}\rightarrow[0,1]$,

f_{\mathsf{KDE}}^{*}(q):=\frac{1}{|P|}\sum_{p\in P}f(p,q),

is $\mathcal{L}$-Lipschitz, i.e.,

|f_{\mathsf{KDE}}^{*}(q)-f_{\mathsf{KDE}}^{*}(q^{\prime})|\leq\mathcal{L}\cdot\|q-q^{\prime}\|_{2}.
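A one-line sketch of why this holds (the full proof is deferred to Section G): for any $q,q^{\prime}\in\mathbb{R}^{d}$,

|f_{\mathsf{KDE}}^{*}(q)-f_{\mathsf{KDE}}^{*}(q^{\prime})|\leq\frac{1}{|P|}\sum_{p\in P}|f(\|p-q\|_{2})-f(\|p-q^{\prime}\|_{2})|\leq\frac{1}{|P|}\sum_{p\in P}\mathcal{L}\cdot\big|\|p-q\|_{2}-\|p-q^{\prime}\|_{2}\big|\leq\mathcal{L}\cdot\|q-q^{\prime}\|_{2},

where the second step uses that $f$ is $\mathcal{L}$-Lipschitz and the last step uses the reverse triangle inequality.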

4 Our Data Structures

Our algorithm's high-level idea is the following. We apply importance sampling (Definition 2.2) to approximate the kernel density at a query point $q$ efficiently. We want to sample data points with probability proportional to their contribution to the estimation. Ideally, given a query point $q\in\mathbb{R}^{d}$ and a data set $X\subset\mathbb{R}^{d}$, we could sample each data point in $X$ with probability proportional to the inverse of its distance from the query $q$. Unfortunately, we have no access to query points during preprocessing. Hence we make use of LSH (Definition 2.3) to overcome this problem.

In general, given a query $q$, LSH can recover its near points with high probability while the probability of recovering far points is bounded by a small quantity. To apply LSH, we first run

R=\lceil\log\frac{1}{f_{\mathsf{KDE}}}\rceil

rounds of sampling, in which we sample each data point with probability $\frac{1}{2^{r}nf_{\mathsf{KDE}}}$ in the $r$-th round. Then we obtain $R$ subsampled data sets. Given a query $q$, we use LSH to recover those points that are both in the level set $L_{r}$ and in the $r$-th subsampled data set. Hence we obtain the sampled data points and the corresponding sampling rates (in other words, their “importance”). Then we construct the estimator as in Definition 2.2.

Finally, we repeat the estimation process independently and take the median to obtain a $(1\pm\epsilon)$ approximation with high probability.
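The following minimal Python sketch (names and interface are illustrative, not taken from the paper's algorithms) shows the geometrically decaying subsampling used in one preprocessing repetition; at query time, a point recovered from the $r$-th LSH structure that falls in level $L_{r}$ is reweighted by $1/p_{r}$, as in the estimator of Definition 2.2.

```python
import numpy as np

def geometric_subsample(X, f_kde, rng):
    """One preprocessing repetition: round r keeps each point of the (n, d) array X
    with probability min(1 / (2^r * n * f_kde), 1), for r = 1, ..., R."""
    n = len(X)
    R = int(np.ceil(np.log2(1.0 / f_kde)))            # R = ceil(log(1 / f_KDE)) rounds
    rounds = []
    for r in range(1, R + 1):
        p_r = min(1.0 / (2.0 ** r * n * f_kde), 1.0)  # sampling rate matched to level L_r
        mask = rng.random(n) < p_r
        rounds.append((p_r, X[mask]))                 # subsampled set P_r, fed to an LSH structure
    return rounds
```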

4.1 LSH Data Structure

In this section, we present our LSH data structure with the following procedures:

Initialize. Given a data set $\{x_{1},\cdots,x_{n}\}\subset\mathbb{R}^{d}$ and integer parameters $k,L$, it first invokes the private procedure ChooseHashFunc. The idea is to amplify the “sensitivity” of the hashing by concatenating $k$ basic hash functions from the family $\mathcal{H}$ (Algorithm 7, line 9) into new functions. Thus we obtain a family of “augmented” hash functions $\mathcal{H}_{l}$, $l\in[L]$ (Algorithm 8, line 7). We follow with ConstructHashTable, in which we hash each point $x_{i}$ using the hash function $\mathcal{H}_{l}$. Then we obtain $L$ hash tables corresponding to the $L$ hash functions, which can be updated quickly.

Recover. Given a query $q\in\mathbb{R}^{d}$, it finds the bucket where $q$ is hashed by $\mathcal{H}_{l}$ and retrieves all the points in that bucket according to the hash table $\mathcal{T}_{l}$. This operation applies to all $L$ hash tables.

UpdateHashTable. Given a new data point $z\in\mathbb{R}^{d}$ and an index $i\in[n]$, it repeats the following operations for all $l\in[L]$: find the bucket $\mathcal{H}_{l}(z)$ and insert point $z$; find the bucket $\mathcal{H}_{l}(x_{i})$ and delete point $x_{i}$.

Note that the traditional LSH data structure only has the Initialize and Recover procedures. To make it a dynamic structure, we exploit its hash storage. We design the UpdateHashTable procedure so that we can update the hash tables on the fly. This procedure provides the guarantee for dynamic kernel density estimation.
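Below is a minimal Python sketch of these three procedures; a generic hash_family callable (assumed to return a randomly drawn hash function mapping a point to a hashable value) stands in for the Andoni-Indyk family of Definition 2.3, and the bucket layout is illustrative.

```python
import collections

class LSHTableSketch:
    """L hash tables, each keyed by the concatenation of k hashes drawn from hash_family."""

    def __init__(self, points, hash_family, k, L, rng):
        self.points = list(points)
        # ChooseHashFunc: for each table l, draw k basic functions and concatenate them.
        self.funcs = [[hash_family(rng) for _ in range(k)] for _ in range(L)]
        self.tables = [collections.defaultdict(set) for _ in range(L)]
        # ConstructHashTable: hash every point into every table.
        for i, x in enumerate(self.points):
            for l in range(L):
                self.tables[l][self._key(l, x)].add(i)

    def _key(self, l, x):
        return tuple(h(x) for h in self.funcs[l])

    def recover(self, q):
        # Return indices of all points colliding with q in at least one table.
        out = set()
        for l in range(len(self.tables)):
            out |= self.tables[l].get(self._key(l, q), set())
        return out

    def update(self, z, i):
        # Move index i from the bucket of the old x_i to the bucket of z in every table.
        for l in range(len(self.tables)):
            self.tables[l][self._key(l, self.points[i])].discard(i)
            self.tables[l][self._key(l, z)].add(i)
        self.points[i] = z
```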

4.2 Initialize Part of Data Structure

Algorithm 1 Dynamic KDE, members and initialize part, informal version of Algorithm 4
1:data structure DynamicKDE \triangleright Theorem 1.2
2:
3:members
4:     For $i\in[n]$, $x_{i}\in\mathbb{R}^{d}$ \triangleright dataset $X$
5:     $K_{1},R,K_{2}\in\mathbb{N}_{+}$ \triangleright Number of repetitions
6:     For $a\in[K_{1}]$, $r\in[R]$, $H_{a,r}\in\textsc{LSH}$ \triangleright Instances of the LSH class
7:end members
8:
9:procedure Initialize($X\subset\mathbb{R}^{d},\epsilon\in(0,1),f_{\mathsf{KDE}}\in[0,1]$) \triangleright Lemma 4.2
10:     Initialize $K_{1},R$ as in Section C
11:     for $a=1,2,\cdots,K_{1}$ do
12:         for $r=1,2,\cdots,R$ do
13:              Compute $K_{2,r},k_{r}$ as in Section C
14:              $P_{r}\leftarrow$ sample each element in $X$ with probability $\min\{\frac{1}{2^{r}nf_{\mathsf{KDE}}},1\}$
15:              $\mathcal{H}_{a,r}.\textsc{Initialize}(P_{r},k_{r},K_{2,r})$
16:         end for
17:         $\widetilde{P}_{a}\leftarrow$ sample elements from $X$, each with sampling probability $\frac{1}{n}$ \triangleright Store $\widetilde{P}_{a}$
18:     end for
19:end procedure
20:
21:end data structure

In this section, we present the Initialize part of our data structure. We start by analyzing the space usage of LSH and DynamicKDE. Then we state the running time results for LSH in Lemma 4.3 and for DynamicKDE in Lemma 4.4. We first show the space usage of the LSH part of our data structure.

Lemma 4.1 (Space storage of LSH, informal version of Lemma C.1).

Given a data set $\{x_{i}\}_{i\in[n]}\subset\mathbb{R}^{d}$ and parameters $L,k\in\mathbb{N}_{+}$, the Initialize procedure (Algorithm 7) of the data structure LSH uses space

O(Lkdn^{o(1)}+Ln).

Using the lemma above, we can prove the total space storage of DynamicKDE structure in the following lemma.

Lemma 4.2 (Space storage part of Theorem 1.2, informal version of Lemma C.2).

The Initialize of the data structure DynamicKDE (Algorithm 4) uses space

O(\epsilon^{-2}(\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\cdot\log(1/f_{\mathsf{KDE}})\cdot\operatorname{cost}(f)\cdot\log^{2}n\cdot(\frac{1}{f_{\mathsf{KDE}}}+n^{o(1)}\cdot\log^{2}n)).

For the running time, we again start with the LSH part.

Lemma 4.3 (Upper bound on running time of Initialize of the data-structure LSH, informal version of Lemma C.3 ).

Given input data points $\{x_{i}\}_{i\in[n]}\subset\mathbb{R}^{d}$, parameters $k,L\in\mathbb{N}_{+}$, LSH parameters $p_{\mathrm{near}},p_{\mathrm{far}}\in[0,1]$, $c\in[1,\infty)$, $r\in\mathbb{R}_{+}$ and a kernel $f$, the Initialize procedure of the data structure LSH (Algorithm 7) runs in time

O(L\cdot(kdn^{o(1)}+dn^{1+o(1)}+n\log n)).

Having shown the running time of LSH, we now prove the total running time of Initialize in our data structure by combining the above result on the LSH part.

Lemma 4.4 (The initialize part of Theorem 1.2, informal version of Lemma C.4).

Given $(f:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow[0,1],X\subset\mathbb{R}^{d},\epsilon\in(0,1),f_{\mathsf{KDE}}\in[0,1])$, the Initialize procedure of the data structure DynamicKDE (Algorithm 4) runs in time

O(\epsilon^{-2}n^{1+o(1)}\operatorname{cost}(f)\cdot(\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\log(1/f_{\mathsf{KDE}})\cdot\log^{2}n).
Algorithm 2 Dynamic KDE, update part, informal version of Algorithm 5
1:data structure DynamicKDE \triangleright Theorem 1.2
2:
3:procedure Update($v\in\mathbb{R}^{d},f_{\mathsf{KDE}}\in[0,1],i\in[n]$) \triangleright Lemma 4.6
4:     for $a=1,2,\cdots,K_{1}$ do
5:         for $r=1,2,\cdots,R$ do
6:              Update the hash tables in $\mathcal{H}_{a,r}$ with $v$
7:         end for
8:     end for
9:     Replace $x_{i}$ with $v$
10:end procedure
11:
12:end data structure

4.3 Update Part of Data Structure

In this section, we move to the update part of our data structure. We first prove how to update the LSH data structure. Then we can extend the update procedure to DynamicKDE structure so that we can prove Lemma 4.6.

Lemma 4.5 (Update time of LSH, informal version of Lemma C.5).

Given a data point $v\in\mathbb{R}^{d}$ and an index $i\in[n]$, the UpdateHashTable procedure of the data structure LSH (Algorithm 7) runs in (expected) time

O(n^{o(1)}\log(n)\cdot\operatorname{cost}(f)).

The update procedure in the LSH structure is a key ingredient of Update for DynamicKDE. Next, we show the running time of Update for DynamicKDE by combining the above results.

Lemma 4.6 (The update part of Theorem 1.2, informal version of Lemma C.6).

Given an update $v\in\mathbb{R}^{d}$ and an index $i\in[n]$, the Update procedure of the data structure DynamicKDE (Algorithm 5) runs in (expected) time

O(\epsilon^{-2}n^{o(1)}\operatorname{cost}(f)\cdot(\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\log(1/f_{\mathsf{KDE}})\cdot\log^{3}n).

4.4 Query Part of Data Structure

Finally, we come to the query part. The goal of this section is to prove Lemma 4.9, which shows the running time of Query procedure for DynamicKDE.

Algorithm 3 Dynamic KDE, query part, informal version of Algorithm 6
1:data structure DynamicKDE \triangleright Theorem 1.2
2:
3:procedure Query($q\in\mathbb{R}^{d},\epsilon\in(0,1),f_{\mathsf{KDE}}\in[0,1]$)
4:     for $a=1,2,\cdots,K_{1}$ do
5:         for $r=1,2,\cdots,R$ do
6:              Recover near neighbors of $q$ using $\mathcal{H}_{a,r}$
7:              Store them into $\mathcal{S}$
8:         end for
9:         for $x_{i}\in\mathcal{S}$ do
10:              $w_{i}\leftarrow f(x_{i},q)$
11:              if $x_{i}\in L_{r}$ for some $r\in[R]$ then
12:                  $p_{i}\leftarrow\min\{\frac{1}{2^{r}nf_{\mathsf{KDE}}},1\}$
13:              end if
14:         end for
15:         $T_{a}\leftarrow\sum_{x_{i}\in\mathcal{S}}\frac{w_{i}}{p_{i}}$
16:     end for
17:     return $\mathrm{Median}_{a\in[K_{1}]}\{T_{a}\}$
18:end procedure
19:
20:end data structure

The running time of the Query procedure depends on two parts: the number of recovered points in each weight level and the recovery time of LSH.

We first bound the expected number of recovered points.

Lemma 4.7 (expected number of points in level sets, informal version of Lemma C.7).

Given a query $q\in\mathbb{R}^{d}$ and a fixed $r\in[R]$, for any $i\in[R+1]$, weight level $L_{i}$ contributes at most $1$ point in expectation to the hash bucket of query $q$.

Next, we show the running time for LSH to recover points.

Lemma 4.8 (running time for recovering points given a query, informal version of Lemma C.8).

Given a query $q\in\mathbb{R}^{d}$ and $L,R,k\in\mathbb{N}_{+}$, the Recover procedure of the data structure LSH runs in (expected) time

O(Lkn^{o(1)}+LR).

Combining the two lemmas above, we can prove the total running time of Query in DynamicKDE structure.

Lemma 4.9 (Query part of Theorem 1.2, informal version of Lemma C.9).

Given a query $q\in\mathbb{R}^{d}$, the Query procedure of the data structure DynamicKDE (Algorithm 6) runs in (expected) time

O(\epsilon^{-2}n^{o(1)}\log(1/f_{\mathsf{KDE}})\cdot f_{\mathsf{KDE}}^{-o(1)}\cdot\operatorname{cost}(f)\cdot\log^{3}n).

5 Robustness to Adversary

In this section, we turn the Query algorithm into a robust one. In other words, we want the following to happen with high probability: the algorithm responds to all query points correctly. We achieve this goal in three steps. We start with the constant success probability of the Query procedure, which we proved in the previous section.

In the first step, we boost this constant probability to a high probability by applying the median technique. We note that the resulting algorithm succeeds with high probability only for one fixed point, but we want it to respond to arbitrary query points correctly.

It is not an easy task to generalize directly from a fixed point to infinitely many points in the whole space. Thus we take an intermediate step by introducing the unit ball and the $\epsilon$-net. A unit ball in $\mathbb{R}^{d}$ is the collection of points whose norm is at most $1$. An $\epsilon$-net is a finite collection of points, called net points, with the “covering” property: the union of the balls of radius $\epsilon$ centered at the net points covers the unit ball. In the second step, we show that, given a net of the unit ball, we have correctness on all net points.

Finally, we extend the correctness of the algorithm from the net points to all points in the whole space. Then we obtain a robust algorithm.

Starting Point. In Section D, we have already obtained a query algorithm with constant success probability for a fixed query point.

Lemma 5.1 (Starting with constant probability).

Given $\epsilon\in(0,0.1)$, a query point $q\in\mathbb{R}^{d}$ and a set of data points $X=\{x_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d}$, let $f_{\mathsf{KDE}}^{*}(q):=\frac{1}{|X|}\sum_{x\in X}f(x,q)$. There is an estimator $\mathcal{D}$ that can answer the query and satisfies

(1-\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)\leq\mathcal{D}.\textsc{query}(q,\epsilon)\leq(1+\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)

with probability $0.9$.

Boost the constant probability to high probability.

Next, we begin to boost the success probability by repeating the query procedure and taking the median output.

Lemma 5.2 (Boost the constant probability to high probability).

Let $\delta_{1}\in(0,0.1)$ denote the failure probability. Let $\epsilon\in(0,0.1)$ denote the accuracy parameter. Given $L=O(\log(1/\delta_{1}))$ estimators $\{\mathcal{D}_{j}\}_{j=1}^{L}$, for each fixed query point $q\in\mathbb{R}^{d}$, the median of the queries from the $L$ estimators satisfies

(1-\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)\leq\mathrm{Median}(\{\mathcal{D}_{j}.\textsc{query}(q,\epsilon)\}_{j=1}^{L})\leq(1+\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)

with probability $1-\delta_{1}$.
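A minimal Python sketch of this boosting step (the estimators $\mathcal{D}_{j}$ are assumed to expose a query method as in Lemma 5.1):

```python
import numpy as np

def boosted_query(estimators, q, eps):
    """Median of L independent (1 +/- eps)-accurate estimates (the boosting of Lemma 5.2)."""
    return float(np.median([D.query(q, eps) for D in estimators]))
```

The median lies outside $[(1-\epsilon)f_{\mathsf{KDE}}^{*}(q),(1+\epsilon)f_{\mathsf{KDE}}^{*}(q)]$ only if at least half of the $L=O(\log(1/\delta_{1}))$ independent copies fail, which by a Chernoff bound happens with probability at most $\delta_{1}$.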

From each fixed point to all the net points.

So far, the success probability of our algorithm is still for a fixed point. We will introduce an $\epsilon$-net on a unit ball and show the high success probability for all the net points.

Fact 5.3.

Let $N$ denote the $\epsilon_{0}$-net of $\{x\in\mathbb{R}^{d}~|~\|x\|_{2}\leq 1\}$. We use $|N|$ to denote the number of points in $N$. Then $|N|\leq(10/\epsilon_{0})^{d}$.

This fact shows that we can bound the size of an $\epsilon$-net by an inverse power of $\epsilon$. We use this fact to determine the number of repetitions we need to obtain the correctness of Query on all net points.
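Concretely, combining Fact 5.3 with Lemma 5.2 and a union bound over the net points, it suffices to take

L=O(\log(|N|/\delta))=O(\log((10/\epsilon_{0})^{d}/\delta))=O(d\log(10/\epsilon_{0})+\log(1/\delta))

repetitions for the median to be correct on every point of $N$ simultaneously with probability $1-\delta$.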

Lemma 5.4 (From each fixed point to all the net points).

Let $N$ denote the $\epsilon_{0}$-net of $\{x\in\mathbb{R}^{d}~|~\|x\|_{2}\leq 1\}$. We use $|N|$ to denote the number of points in $N$. Given $L=\log(|N|/\delta)$ estimators $\{\mathcal{D}_{j}\}_{j=1}^{L}$, with probability $1-\delta$ we have: for all $q\in N$, the median of the queries from the $L$ estimators satisfies

(1-\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)\leq\mathrm{Median}(\{\mathcal{D}_{j}.\textsc{query}(q,\epsilon)\}_{j=1}^{L})\leq(1+\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q).

From net points to all points.

With Lemma 5.4, we are ready to extend the correctness from the net points to the whole unit ball. We show in the following lemma that all query points with $\|q\|_{2}\leq 1$ can be answered approximately with high probability.

Lemma 5.5 (From net points to all points).

Let $\epsilon\in(0,0.1)$, $\mathcal{L}\geq 1$, $\delta\in(0,0.1)$ and $\tau\in[0,1]$. Given $L=O(\log((\mathcal{L}/(\epsilon\tau))^{d}/\delta))$ estimators $\{\mathcal{D}_{j}\}_{j=1}^{L}$, with probability $1-\delta$, for all query points $p$ with $\|p\|_{2}\leq 1$, the median of the queries from the $L$ estimators satisfies

(1-\epsilon)\cdot f_{\mathsf{KDE}}^{*}(p)\leq\mathrm{Median}(\{\mathcal{D}_{j}.\textsc{query}(q,\epsilon)\}_{j=1}^{L})\leq(1+\epsilon)\cdot f_{\mathsf{KDE}}^{*}(p),

where $q$ is the closest net point to $p$.

Thus, we obtain an algorithm that responds to adversarial queries robustly.

6 Conclusion

Kernel density estimation is an important problem in machine learning. It has wide applications in similarity search and nearest neighbor clustering. Meanwhile, in many modern scenarios, input data can change over time, and queries can be provided by adversaries. In these scenarios, we need to build adaptive data structures such that incremental changes in the input data do not require our data structure to go through costly re-initialization. Also, queries provided by adversaries do not reduce the accuracy of the estimation. We call this problem the adaptive kernel density estimation.

We present the first such adaptive data structure design for kernel density estimation. Our data structure is efficient: it requires only subquadratic space, each update to the input data takes sublinear time, and each query finishes in sublinear time. It should be noted that a trade-off between efficiency and effectiveness persists in our proposed algorithms: to increase the execution speed, a slight compromise in accuracy becomes inevitable. Nevertheless, we believe that our data structures push the boundary of this trade-off.

7 Impact Statement

We note that the implementation of our algorithm could contribute to environmental carbon emissions. That said, we aim to decrease the repetition of experiments by focusing our efforts on theoretical analysis.

References

  • ACSS [20] Josh Alman, Timothy Chu, Aaron Schild, and Zhao Song. Algorithms and hardness for linear algebra on geometric graphs. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 541–552. IEEE, 2020.
  • AI [06] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 2006 47th annual IEEE symposium on foundations of computer science (FOCS’06), pages 459–468. IEEE, 2006.
  • AS [23] Josh Alman and Zhao Song. Fast attention requires bounded entries. arXiv preprint arXiv:2302.13214, 2023.
  • BIW [19] Arturs Backurs, Piotr Indyk, and Tal Wagner. Space and time efficient kernel density estimation in high dimensions. Advances in Neural Information Processing Systems, 32, 2019.
  • CGC+ [22] Benjamin Coleman, Benito Geordie, Li Chou, RA Leo Elworth, Todd Treangen, and Anshumali Shrivastava. One-pass diversified sampling with application to terabyte-scale genomic sequence streams. In International Conference on Machine Learning, pages 4202–4218. PMLR, 2022.
  • CHM [12] Yuan Cao, Haibo He, and Hong Man. Somke: Kernel density estimation over data streams by sequences of self-organizing maps. IEEE transactions on neural networks and learning systems, 23(8):1254–1268, 2012.
  • CHS [20] Jaewoong Cho, Gyeongjo Hwang, and Changho Suh. A fair classifier using kernel density estimation. Advances in Neural Information Processing Systems, 33:15088–15099, 2020.
  • CIU+ [21] Tsz Nam Chan, Pak Lon Ip, Leong Hou U, Byron Choi, and Jianliang Xu. Sws: a complexity-optimized solution for spatial-temporal kernel density visualization. Proceedings of the VLDB Endowment, 15(4):814–827, 2021.
  • CKNS [20] Moses Charikar, Michael Kapralov, Navid Nouri, and Paris Siminelakis. Kernel density estimation through density constrained near neighbor search. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 172–183. IEEE, 2020.
  • CLP+ [20] Beidi Chen, Zichang Liu, Binghui Peng, Zhaozhuo Xu, Jonathan Lingjie Li, Tri Dao, Zhao Song, Anshumali Shrivastava, and Christopher Re. Mongoose: A learnable lsh framework for efficient neural network training. In International Conference on Learning Representations, 2020.
  • CMF+ [20] Beidi Chen, Tharun Medini, James Farwell, Charlie Tai, Anshumali Shrivastava, et al. Slide: In defense of smart algorithms over hardware acceleration for large-scale deep learning systems. Proceedings of Machine Learning and Systems, 2:291–306, 2020.
  • Cra [01] Kyle Cranmer. Kernel estimation in high-energy physics. Computer Physics Communications, 136(3):198–207, 2001.
  • CRV+ [18] Danielle L Cantrell, Erin E Rees, Raphael Vanderstichel, Jon Grant, Ramón Filgueira, and Crawford W Revie. The use of kernel density estimation with a bio-physical model provides a method to quantify connectivity among salmon farms: spatial planning and management with epidemiological relevance. Frontiers in Veterinary Science, page 269, 2018.
  • CS [16] Efren Cruz Cortes and Clayton Scott. Sparse approximation of a kernel mean. IEEE Transactions on Signal Processing, 65(5):1310–1323, 2016.
  • CS [17] Moses Charikar and Paris Siminelakis. Hashing-based-estimators for kernel density in high dimensions. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 1032–1043. IEEE, 2017.
  • CS [20] Benjamin Coleman and Anshumali Shrivastava. Sub-linear race sketches for approximate kernel density estimation on streaming data. In Proceedings of The Web Conference 2020, pages 1739–1749, 2020.
  • CWS [10] Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 109–116, 2010.
  • DJS+ [22] Yichuan Deng, Wenyu Jin, Zhao Song, Xiaorui Sun, and Omri Weinstein. Dynamic kernel sparsifiers. arXiv preprint arXiv:2211.14825, 2022.
  • DMS [23] Yichuan Deng, Sridhar Mahadevan, and Zhao Song. Randomized and deterministic attention sparsification algorithms for over-parameterized feature dimension. arXiv preprint arXiv:2304.04397, 2023.
  • FC [17] Christen H Fleming and Justin M Calabrese. A new kernel density estimator for accurate home-range and species-range area estimation. Methods in Ecology and Evolution, 8(5):571–579, 2017.
  • HIK+ [21] Anna Hallin, Joshua Isaacson, Gregor Kasieczka, Claudius Krause, Benjamin Nachman, Tobias Quadfasel, Matthias Schlaffer, David Shih, and Manuel Sommerhalder. Classifying anomalies through outer density estimation (cathode). arXiv preprint arXiv:2109.00546, 2021.
  • HL [18] Yaoyao He and Haiyan Li. Probability density forecasting of wind power using quantile regression neural network and kernel density estimation. Energy conversion and management, 164:374–384, 2018.
  • HS [08] Christoph Heinz and Bernhard Seeger. Cluster kernels: Resource-aware kernel density estimators over streaming data. IEEE Transactions on Knowledge and Data Engineering, 20(7):880–893, 2008.
  • IM [98] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613, 1998.
  • KAP [22] Matti Karppa, Martin Aumüller, and Rasmus Pagh. Deann: Speeding up kernel-density estimation using approximate nearest neighbor search. In International Conference on Artificial Intelligence and Statistics, pages 3108–3137. PMLR, 2022.
  • LWZ+ [23] Zirui Liu, Guanchu Wang, Shaochen Zhong, Zhaozhuo Xu, Daochen Zha, Ruixiang Tang, Zhimeng Jiang, Kaixiong Zhou, Vipin Chaudhary, Shuai Xu, et al. Winner-take-all column row sampling for memory efficient adaptation of language model. arXiv preprint arXiv:2305.15265, 2023.
  • MFS+ [17] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
  • MRL [95] Young-Il Moon, Balaji Rajagopalan, and Upmanu Lall. Estimation of mutual information using kernel density estimators. Physical Review E, 52(3):2318, 1995.
  • PT [20] Jeff M Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. Discrete & Computational Geometry, 63(4):867–887, 2020.
  • SRB+ [19] Paris Siminelakis, Kexin Rong, Peter Bailis, Moses Charikar, and Philip Levis. Rehashing kernel evaluation in high dimensions. In International Conference on Machine Learning, pages 5789–5798. PMLR, 2019.
  • SS [21] Ryan Spring and Anshumali Shrivastava. Mutual information estimation using lsh sampling. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 2807–2815, 2021.
  • SSX [21] Anshumali Shrivastava, Zhao Song, and Zhaozhuo Xu. Sublinear least-squares value iteration via locality sensitive hashing. arXiv preprint arXiv:2105.08285, 2021.
  • SXYZ [22] Zhao Song, Zhaozhuo Xu, Yuanyuan Yang, and Lichen Zhang. Accelerating frank-wolfe algorithm using low-dimensional and adaptive data structures. arXiv preprint arXiv:2207.09002, 2022.
  • SXZ [22] Zhao Song, Zhaozhuo Xu, and Lichen Zhang. Speeding up sparsification using inner product search data structures. arXiv preprint arXiv:2204.03209, 2022.
  • vdBSZ [23] Jan van den Brand, Zhao Song, and Tianyi Zhou. Algorithm and hardness for dynamic attention maintenance in large language models. arXiv e-prints, pages arXiv–2304, 2023.
  • WHM+ [09] Meng Wang, Xian-Sheng Hua, Tao Mei, Richang Hong, Guojun Qi, Yan Song, and Li-Rong Dai. Semi-supervised kernel density estimation for video annotation. Computer Vision and Image Understanding, 113(3):384–396, 2009.
  • WLG [19] Qin Wang, Wen Li, and Luc Van Gool. Semi-supervised learning by augmented distribution alignment. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1466–1475, 2019.
  • WXW+ [22] Zhuang Wang, Zhaozhuo Xu, Xinyu Wu, Anshumali Shrivastava, and TS Eugene Ng. Dragonn: Distributed randomized approximate gradients of neural networks. In International Conference on Machine Learning, pages 23274–23291. PMLR, 2022.
  • XCL+ [21] Zhaozhuo Xu, Beidi Chen, Chaojian Li, Weiyang Liu, Le Song, Yingyan Lin, and Anshumali Shrivastava. Locality sensitive teaching. Advances in Neural Information Processing Systems, 34, 2021.
  • XSS [21] Zhaozhuo Xu, Zhao Song, and Anshumali Shrivastava. Breaking the linear iteration cost barrier for some well-known conditional gradient methods using maxip data-structures. Advances in Neural Information Processing Systems, 34, 2021.
  • Zha [22] Lichen Zhang. Speeding up optimizations via data structures: Faster search, sample and maintenance. Master’s thesis, Carnegie Mellon University, 2022.

Appendix

Roadmap.

Section A presents the basic definitions and lemmas. Section B presents the technical claims for the proof of our main result. Section C consists of four subsections: Initialize part, Update part, Query part, and LSH part of our data structure. Each section presents the corresponding algorithm and proof of running time. We provide a proof sketch for correctness in Section D. Section E presents the detailed proof of the correctness of our data structure. Section F presents how to change our algorithm into an adaptive algorithm. Section G presents the Lipschitz property of the kernel function.

Appendix A Preliminary

The goal of this subsection is to introduce some basic Definitions (Section A.1) and Lemmas (Section A.2) that will be used to prove the main result.

A.1 Definitions

We start by recalling the definition of geometric weight level.

Definition A.1 (Restatement of Definition 2.1, Geometric Weight Levels).

Fix $R\in\mathbb{N}_{+}$ and $q\in\mathbb{R}^{d}$. We define

w_{i}:=f(x_{i},q).

For any fixed $r\in[R]:=\{1,2,\cdots,R\}$, we define

L_{r}:=\{x_{i}\in X~|~w_{i}\in(2^{-r},2^{-r+1}]\}.

We define the corresponding distance levels as

z_{r}:=\max_{z:~f(z)\in(2^{-r},2^{-r+1}]}z,

where $f(z):=f(x,q)$ for $z=\|x-q\|_{2}$.

In addition, we define $L_{R+1}:=X\setminus\bigcup_{r\in[R]}L_{r}$.

We restate the definition and some properties of Locality Sensitive Hashing.

Definition A.2 (Restatement of Definition 2.3, Locality Sensitive Hashing).

A family $\mathcal{H}$ is called $(p_{\mathrm{near}},p_{\mathrm{far}},z,c)$-sensitive, where $p_{\mathrm{near}},p_{\mathrm{far}}\in[0,1]$, $z\in\mathbb{R}$, $c\geq 1$, if for any $x,q\in\mathbb{R}^{d}$:

  • $\Pr_{h\sim\mathcal{H}}[h(x)=h(q)~|~\|x-q\|_{2}\leq z]\geq p_{\mathrm{near}}$

  • $\Pr_{h\sim\mathcal{H}}[h(x)=h(q)~|~\|x-q\|_{2}\geq cz]\leq p_{\mathrm{far}}$

A.2 Lemmas

Lemma A.3 (Lemma 3.2 in page 6 of [2]).

Let $(a,b)\in\mathbb{R}^{d}\times\mathbb{R}^{d}$. Fix $z>0$. There is a hash family $\mathcal{H}$ such that, if $p_{\mathrm{near}}:=p_{1}(z):=\Pr_{h\sim\mathcal{H}}[h(a)=h(b)~|~\|a-b\|_{2}\leq z]$ and $p_{\mathrm{far}}:=p_{2}(z,c):=\Pr_{h\sim\mathcal{H}}[h(a)=h(b)~|~\|a-b\|_{2}\geq cz]$, then

\rho:=\frac{\log(1/p_{\mathrm{near}})}{\log(1/p_{\mathrm{far}})}\leq\frac{1}{c^{2}}+O(\frac{\log t}{t^{1/2}})

for any $c\geq 1$, $t>0$, where $p_{\mathrm{near}}\geq e^{-O(\sqrt{t})}$ and each hash evaluation takes $dt^{O(t)}$ time.

Remark A.4.

We give an upper bound on our definition of $\rho$ and on the evaluation time of the hashing. In the rest of the paper, we set $t=\log^{2/3}n$. Thus we obtain $n^{o(1)}$ evaluation time and $\rho=\frac{1}{c^{2}}+o(1)$. Since $c=O(\log^{1/7}n)$, we have

\frac{1}{\frac{1}{c^{2}}+O(\frac{\log t}{t^{1/2}})}=c^{2}(1-o(1)).

In the next lemma, we show the existence of an LSH family that can separate the near points and the far points from the query with high probability.

Lemma A.5 (probability bound for separating points in different level sets, formal version of Lemma 2.6).

Given a kernel function $f$, we have the corresponding weight level sets $L_{r}$ and distance levels $z_{r}$ (Definition 2.1). Given a query $q\in\mathbb{R}^{d}$ and integers $i\in[R+1]$, $r\in[R]$ satisfying $i>r$, let $x\in L_{r}$, $x^{\prime}\in L_{i}$, and $c_{i,r}:=\min\{\frac{z_{i-1}}{z_{r}},\log^{1/7}n\}$. We set up an Andoni-Indyk LSH family $\mathcal{H}$ (Definition 2.3) with near distance $z_{r}$.

We define

p_{\mathrm{near},r}:=\Pr_{h\sim\mathcal{H}}[h(x)=h(q)~|~\|x-q\|_{2}\leq z_{r}]
p_{\mathrm{far},r}:=\Pr_{h\sim\mathcal{H}}[h(x)=h(q)~|~\|x-q\|_{2}\geq c_{i,r}z_{r}]

Then the following inequalities hold for any integer $k\in\mathbb{N}_{+}$:

  1. $\Pr_{h^{*}\sim\mathcal{H}^{k}}[h^{*}(x)=h^{*}(q)]\geq p_{\mathrm{near},r}^{k}$

  2. $\Pr_{h^{*}\sim\mathcal{H}^{k}}[h^{*}(x^{\prime})=h^{*}(q)]\leq p_{\mathrm{near},r}^{kc_{i,r}^{2}(1-o(1))}$

Proof.

Since $x\in L_{r}$, by Definition 2.1, we have

\|x-q\|_{2}\leq z_{r} \qquad (1)

For $x^{\prime}\in L_{i}$, since we assume $f$ is a decreasing radial kernel, we have

\|x^{\prime}-q\|_{2}\geq z_{i-1}\geq c_{i,r}z_{r} \qquad (2)

where the first step follows from Definition 2.1 and the last step follows from the definition of $c_{i,r}$, i.e., $c_{i,r}\leq\frac{z_{i-1}}{z_{r}}$.

By Lemma 2.4 and Eq. (1), Eq. (2), we have

  1. $\Pr_{h\sim\mathcal{H}}[h(x)=h(q)]\geq p_{\mathrm{near},r}$

  2. $\Pr_{h\sim\mathcal{H}}[h(x^{\prime})=h(q)]\leq p_{\mathrm{far},r}$

By Remark 2.5, we have

p_{\mathrm{far},r}\leq p_{\mathrm{near},r}^{c_{i,r}^{2}(1-o(1))} \qquad (3)

Then for any integer $k\geq 1$, we have

\Pr_{h^{*}\sim\mathcal{H}^{k}}[h^{*}(x)=h^{*}(q)]\geq p_{\mathrm{near},r}^{k}
\Pr_{h^{*}\sim\mathcal{H}^{k}}[h^{*}(x^{\prime})=h^{*}(q)]\leq p_{\mathrm{far},r}^{k}

By Eq. (3), we obtain the final result

  1. $\Pr_{h^{*}\sim\mathcal{H}^{k}}[h^{*}(x)=h^{*}(q)]\geq p_{\mathrm{near},r}^{k}$

  2. $\Pr_{h^{*}\sim\mathcal{H}^{k}}[h^{*}(x^{\prime})=h^{*}(q)]\leq p_{\mathrm{near},r}^{kc_{i,r}^{2}(1-o(1))}$

Thus, we complete the proof. ∎

Appendix B Technical Claims

In Section B.1, we show how to bound the size of the geometric weight levels. In Section B.2, we show the probability of recovering a sampled point. In Section B.3, we give a bound on the expected number of recovered points.

B.1 Size of geometric weight levels

The goal of this section is to prove Lemma B.1.

Lemma B.1 (Sizes of geometric weight levels).

Given $r\in[R]$, we have

|L_{r}|\leq 2^{r}nf_{\mathsf{KDE}}^{*}\leq 2^{r}nf_{\mathsf{KDE}}.
Proof.

For any fixed $r\in[R]$ and $x,q\in\mathbb{R}^{d}$, we have

n\cdot f_{\mathsf{KDE}}\geq n\cdot f_{\mathsf{KDE}}^{*}=\sum_{x\in X}f(x,q)\geq\sum_{x\in L_{r}}f(x,q)\geq|L_{r}|\cdot 2^{-r},

where the first step follows from $f_{\mathsf{KDE}}\geq f_{\mathsf{KDE}}^{*}$, the second step follows from Definition 1.1, the third step follows from shrinking the number of summands, and the last step follows from Definition 2.1.

Thus, we complete the proof. ∎
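As an illustrative instance with hypothetical numbers: for n=10^{6}, f_{\mathsf{KDE}}=10^{-4} and r=5, Lemma B.1 gives |L_{5}|\leq 2^{5}\cdot 10^{6}\cdot 10^{-4}=3200, so the high-weight levels (small r) contain relatively few points.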

B.2 Probability for sampled point recovery

The goal of this section is to prove Lemma B.2.

Lemma B.2 (Probability for sampled point recovery).

Suppose that we invoke DynamicKDE.Initialize, and suppose that in the iteration with a=a^{*} and r=r^{*} we sample a point x\in L_{r^{*}}. Given a query q, we invoke DynamicKDE.Query. Then with probability at least 1-\frac{1}{n^{10}}, H_{a^{*},r^{*}} recovers x.

Proof.

By Lemma A.5 we have

Prhk[h(x)=h(q)]pnear,rk\displaystyle\Pr_{h^{*}\sim\mathcal{H}^{k}}[h^{*}(x)=h^{*}(q)]\geq p_{\mathrm{near},r^{*}}^{k}

Now note that the LSH.Recover procedure (line 6) repeats this process

K_{2,r^{*}}=100\log(n)\cdot p_{\mathrm{near},r^{*}}^{-k}

times, with an independent h^{*}\sim\mathcal{H}^{k} in each repetition. Hence the probability that x is missed in every repetition is at most

(1-p_{\mathrm{near},r^{*}}^{k})^{K_{2,r^{*}}}\leq\exp(-100\log n)\leq n^{-10}.

Thus, the sampled point x\in L_{r^{*}} is recovered in at least one of the repetitions of phase r=r^{*} with probability at least 1-\frac{1}{n^{10}}. ∎

B.3 Number of recovered points in expectation

The goal of this section is to prove Lemma B.3.

Lemma B.3 (Upper bound on number of recovered points in expectation).

Fix a query q\in\mathbb{R}^{d}. We define R:=\lceil\log\frac{1}{f_{\mathsf{KDE}}}\rceil. Fix r\in[R] and define p:=p_{\mathrm{near},r}. For each i\in\{r+1,\cdots,R+1\}, we define c_{i,r}:=\min\{\frac{z_{i-1}}{z_{r}},\log^{\frac{1}{7}}n\}. There exists k\in\mathbb{N}_{+},

k:=kr:=1log(1/p)maxl{r+1,,R+1}lrcl,r2(1o(1)).\displaystyle k:=k_{r}:=\frac{1}{\log(1/p)}\max\limits_{l\in\{r+1,\cdots,R+1\}}\lceil\frac{l-r}{c_{l,r}^{2}(1-o(1))}\rceil.

such that for any i>r,

𝔼hk[|{xLi:h(x)=h(q)}|]=O(1)\displaystyle\operatorname*{{\mathbb{E}}}_{h^{*}\sim\mathcal{H}^{k}}[|\{x^{\prime}\in L_{i}:h^{*}(x^{\prime})=h^{*}(q)\}|]=O(1)
Proof.

By Lemma A.5 we have

\Pr_{h^{*}\sim\mathcal{H}^{k}}[h^{*}(x^{\prime})=h^{*}(q)]\leq p^{kc_{i,r}^{2}(1-o(1))}

where x^{\prime}\in L_{i}, c_{i,r}:=\min\{\frac{z_{i-1}}{z_{r}},\log^{\frac{1}{7}}n\} and p:=p_{\mathrm{near},r}\in(0,1) (Remark 2.5).

\operatorname*{{\mathbb{E}}}_{h^{*}\sim\mathcal{H}^{k}}[|\{x^{\prime}\in L_{i}:h^{*}(x^{\prime})=h^{*}(q)\}|]
\leq 2^{i}nf_{\mathsf{KDE}}\cdot\frac{1}{2^{r}nf_{\mathsf{KDE}}}\cdot\Pr_{h^{*}\sim\mathcal{H}^{k}}[h^{*}(x^{\prime})=h^{*}(q)]
\leq 2^{i-r}\cdot p^{kc_{i,r}^{2}(1-o(1))}

where the first step follows from Lemma B.1 and the sampling probability (Algorithm 4, line 25), and the second step follows from canceling nf_{\mathsf{KDE}} and applying the collision probability bound above.

Note that for i>r, we have

k1log1pirci,r2(1o(1))\displaystyle k\geq\frac{1}{\log\frac{1}{p}}\lceil\frac{i-r}{c_{i,r}^{2}(1-o(1))}\rceil (4)

Then

2irpkci,r2(1o(1))\displaystyle~{}2^{i-r}\cdot p^{kc_{i,r}^{2}(1-o(1))}
=\displaystyle= 2ir2log(p)kci,r2(1o(1))\displaystyle~{}2^{i-r}\cdot 2^{\log(p)\cdot kc_{i,r}^{2}(1-o(1))}
\displaystyle\leq 2ir2log(p)1log(1p)irci,r2(1o(1))ci,r2(1o(1))\displaystyle~{}2^{i-r}2^{\log(p)\cdot\frac{1}{\log(\frac{1}{p})}\lceil\frac{i-r}{c_{i,r}^{2}(1-o(1))}\rceil c_{i,r}^{2}(1-o(1))}
\displaystyle\leq 2ir+ri\displaystyle~{}2^{i-r+r-i}
=\displaystyle= 1\displaystyle~{}1

where the first step follows from rewriting in exponential form, the second step follows from Eq. (4), the third step follows from \log(p)\cdot\frac{1}{\log(1/p)}=-1 together with \lceil\frac{i-r}{c_{i,r}^{2}(1-o(1))}\rceil\cdot c_{i,r}^{2}(1-o(1))\geq i-r, and the last step follows from canceling i and r.

Thus, we complete the proof. ∎
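The choice of k_{r} in Lemma B.3 (and of K_{2,r} in Algorithm 4, line 24) is a concrete computation once the distance levels and the near collision probability are known. The following is a minimal Python sketch of that computation; the list z, the probability p, the use of the natural logarithm, and the dropped (1-o(1)) factor are assumptions of the sketch rather than part of the paper.

import math

def concat_count(z, r, p, n):
    """Sketch of k_r from Lemma B.3: number of concatenated hash functions.

    z -- distance levels, assumed indexed so that z[1], ..., z[R] are meaningful
    r -- current phase index, 1 <= r <= R
    p -- near collision probability p_{near, r}, assumed to lie in (0, 1)
    n -- number of data points
    The (1 - o(1)) correction is dropped in this sketch.
    """
    R = len(z) - 1
    best = 0
    for i in range(r + 1, R + 2):
        c_ir = min(z[i - 1] / z[r], math.log(n) ** (1.0 / 7.0))
        best = max(best, math.ceil((i - r) / c_ir ** 2))
    return max(1, math.ceil(best / math.log(1.0 / p)))

def repetition_count(k_r, p, n):
    """Sketch of K_{2,r} from Algorithm 4, line 24: number of hash tables."""
    return math.ceil(100 * math.log(n) * p ** (-k_r))

For example, with hypothetical values z = [None, 1.0, 2.0, 4.0, 8.0], r = 1, p = 0.5 and n = 10**6, concat_count returns the number of concatenations k_1, and repetition_count(k_1, 0.5, 10**6) the corresponding table count K_{2,1}.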

Appendix C Our Data Structures

In this section, we describe our data structures in detail. Starting with the initialization in Section C.1, we state the space usage and running time of Initialize in DynamicKDE. In Section C.2, we analyze the running time of the update procedure. Section C.3 presents the running time of the query procedure. Finally, we study the LSH data structure in Section C.4; it is an important component of the DynamicKDE structure and fundamental to the implementation of all three procedures above.

C.1 Initialize part of data structure

In this section, we describe the space storage and running time of Initialize part of our data structure DynamicKDE.

We start by showing the space storage of LSH structure.

Lemma C.1 (Space storage of LSH, formal version of Lemma 4.1).

Given a data set \{x_{i}\}_{i\in[n]}\subset\mathbb{R}^{d} and parameters L,k\in\mathbb{N}_{+}, the Initialize procedure (Algorithm 7) of the data structure LSH uses space

O(Lkdno(1)+Ln)\displaystyle O(Lkdn^{o(1)}+Ln)
Proof.

The space storage comes from two parts: ChooseHashFunc and ConstructHashTable.

Part 1. ChooseHashFunc(line 4) takes L,kL,k as input.

It has a for loop with LL iterations.

In each iteration, it samples kk functions(line 7) from hash family \mathcal{H} to create l\mathcal{H}_{l}, which uses O(kdno(1))O(kdn^{o(1)}) space.

Thus the total space usage of ChooseHashFunc is LO(kdno(1))=O(Lkdno(1))L\cdot O(kdn^{o(1)})=O(Lkdn^{o(1)}).

Part 2. ConstructHashTable(line 11) takes data set {xi}i[n]\{x_{i}\}_{i\in[n]} and parameter LL as input.

It has two nested for loops.

  • The first for loop repeats LL iterations.

  • The second for loop repeats nn iterations.

The space storage of the inner loop comes from line 14 and line 15 (Algorithm 8), which is O(1) per point.

Thus the total space storage of ConstructHashTable is LnO(1)=O(Ln)L\cdot n\cdot O(1)=O(Ln).

The final space storage of Initialize is

𝐏𝐚𝐫𝐭𝟏+𝐏𝐚𝐫𝐭𝟐\displaystyle~{}{\bf Part1}+{\bf Part2}
=\displaystyle= O(Lkdno(1)+Ln)\displaystyle~{}O(Lkdn^{o(1)}+Ln)

Thus, we complete the proof. ∎

Using the above lemma, we state the space storage of our DynamicKDE structure.

Lemma C.2 (Space storage part of Theorem 1.2, formal version of Lemma 4.2).

The Initialize of the data structure DynamicKDE (Algorithm 4) uses space

O(\displaystyle O( ϵ2(1f𝖪𝖣𝖤)o(1)log(1/f𝖪𝖣𝖤)cost(K)\displaystyle\epsilon^{-2}(\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\cdot\log(1/f_{\mathsf{KDE}})\cdot\mathrm{cost}(K)
log2n(1f𝖪𝖣𝖤+no(1)log2n))\displaystyle\cdot\log^{2}n\cdot(\frac{1}{f_{\mathsf{KDE}}}+n^{o(1)}\cdot\log^{2}n))
Proof.

The space storage mainly comes from K1JK_{1}\cdot J copies of \mathcal{H}.

Now let’s consider the space storage of \mathcal{H}. By Lemma 4.1, we replace {xi}i[n],L,k\{x_{i}\}_{i\in[n]},L,k with PjP_{j} (line 26), K2,jK_{2,j} (line 24), kjk_{j} (line 22) respectively. We have |Pj|=O(1f𝖪𝖣𝖤),K2,j=O(cost(K)logn)|P_{j}|=O(\frac{1}{f_{\mathsf{KDE}}}),K_{2,j}=O(\mathrm{cost}(K)\cdot\log n) and kj=O(logn)k_{j}=O(\log n). Thus the total space usage of \mathcal{H} is

O(Lkdno(1)+Ln)\displaystyle~{}O(Lkdn^{o(1)}+Ln) (5)
=\displaystyle= O(cost(f)no(1)log3n+cost(f)(1f𝖪𝖣𝖤)logn)\displaystyle~{}O(\mathrm{cost}(f)n^{o(1)}\cdot\log^{3}n+\mathrm{cost}(f)(\frac{1}{f_{\mathsf{KDE}}})\cdot\log n) (6)

The total space storage of Initialize of the data structure DynamicKDE is

K_{1}\cdot J\cdot O(Lkdn^{o(1)}+Ln)
=O(K_{1}\cdot J\cdot\mathrm{cost}(K)\cdot\log n\cdot(\frac{1}{f_{\mathsf{KDE}}}+n^{o(1)}\cdot\log^{2}n))
=\displaystyle= O(ϵ2(1f𝖪𝖣𝖤)o(1)log(1/f𝖪𝖣𝖤)cost(K)\displaystyle~{}O(\epsilon^{-2}(\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\cdot\log(1/f_{\mathsf{KDE}})\cdot\mathrm{cost}(K)
log2n(1f𝖪𝖣𝖤+no(1)log2n))\displaystyle\cdot\log^{2}n\cdot(\frac{1}{f_{\mathsf{KDE}}}+n^{o(1)}\cdot\log^{2}n))

where the first step follows from substituting L=K_{2,j}=O(\mathrm{cost}(K)\cdot\log n), k=k_{j}=O(\log n), d=O(\log n) and |P_{j}|=O(\frac{1}{f_{\mathsf{KDE}}}) into Eq. (5), and the last step follows from K_{1}=O(\epsilon^{-2}(\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\cdot\log n) and J=O(\log(1/f_{\mathsf{KDE}})).

Thus, we complete the proof. ∎

Next, we show an upper bound on the running time of Initialize in the LSH data structure.

Lemma C.3 (Upper bound on running time of Initialize of the data-structure LSH, formal version of Lemma 4.3 ).

Given input data points {xi}i[n]d\{x_{i}\}_{i\in[n]}\subset\mathbb{R}^{d}, parameters k,L+k,L\in\mathbb{N}_{+}, LSH parameters pnear,pfar[0,1],c[1,),r+p_{\mathrm{near}},p_{\mathrm{far}}\in[0,1],c\in[1,\infty),r\in\mathbb{R}_{+} and kernel ff, the Initialize of the data-structure LSH(Algorithm 7) runs in time

O(L(kdno(1)+dn1+o(1)+nlogn))\displaystyle O(L\cdot(kdn^{o(1)}+dn^{1+o(1)}+n\log n))
Proof.

This procedure consists of three parts

Part 1. We invoke the ChooseHashFunc procedure with parameters k,L (line 15). The ChooseHashFunc procedure has one for loop with L iterations.

Now let's consider the running time of line 7, which is the running time of each iteration. In line 7, we sample k hash functions from the hash family \mathcal{H}, which takes O(k\cdot dn^{o(1)}) time.

Thus the total running time for Part 1 is

O(Lkdno(1))\displaystyle O(Lkdn^{o(1)})

Part 2. We invoke the ConstructHashTable procedure with the data set \{x_{i}\}_{i\in[n]}. This procedure has two nested for loops.

  • The first loop repeats L iterations.

  • The second loop repeats n iterations.

Now let’s consider the running time from line 14 to line 15, which is the time for each inner loop.

  • line 14: We first evaluate \mathcal{H}_{l}(x_{i}), which takes O(dn^{o(1)}) time. Then we insert x_{i} into the bucket \mathcal{H}_{l}(x_{i}), which takes O(\log n) time.

  • line 15 takes O(1)O(1) time.

The running time from line 14 to line 15 is O(dn^{o(1)}+\log n).

The total running time for Part 2 is

O(Ln(dno(1)+logn))\displaystyle O(Ln\cdot(dn^{o(1)}+\log n))

Putting it all together, we prove that the Initialize of the data structure LSH (Algorithm 7) runs in time

𝐏𝐚𝐫𝐭𝟏+𝐏𝐚𝐫𝐭𝟐\displaystyle~{}{\bf Part1}+{\bf Part2}
=\displaystyle= O(Lkdno(1))+O(Ln(dno(1)+logn))\displaystyle~{}O(Lkdn^{o(1)})+O(Ln\cdot(dn^{o(1)}+\log n))
=\displaystyle= O(L(kdno(1)+dn1+o(1)+nlogn))\displaystyle~{}O(L\cdot(kdn^{o(1)}+dn^{1+o(1)}+n\log n))

Thus, we complete the proof. ∎

Combining the results above, we can demonstrate the running time of Initialize in DynamicKDE in the following lemma.

Lemma C.4 (The initialize part of Theorem 1.2, formal version of Lemma 4.4).

Given (f:d×d[0,1],Pd,ϵ(0,1),f𝖪𝖣𝖤[0,1])(f:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow[0,1],P\subset\mathbb{R}^{d},\epsilon\in(0,1),f_{\mathsf{KDE}}\in[0,1]), the Initialize of the data-structure DynamicKDE (Algorithm 4) runs in time

O(ϵ2n1+o(1)cost(f)(1f𝖪𝖣𝖤)o(1)log(1/f𝖪𝖣𝖤)log2n)\displaystyle O(\epsilon^{-2}n^{1+o(1)}\operatorname{cost}(f)\cdot(\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\log(1/f_{\mathsf{KDE}})\cdot\log^{2}n)
Proof.

The Initialize procedure has two nested for loops.

  • The first loop repeats K_{1}=O(\epsilon^{-2}\log(n)\cdot f_{\mathsf{KDE}}^{-o(1)}) iterations.

  • The second loop repeats J=O(\log\frac{1}{f_{\mathsf{KDE}}}) iterations.

Now let’s consider the running time from line 20 to line 27, which is the running time of the inner loop.

  • line 20 to line 25 takes O(log(1/f𝖪𝖣𝖤))O(\log(1/f_{\mathsf{KDE}})) time.

  • line 26 takes O(n)O(n) time.

  • line 27: By Lemma 4.3, we replace LL with K2,j=O(cost(K)logn)K_{2,j}=O(\mathrm{cost}(K)\cdot\log n) and kk with kj=O(logn)k_{j}=O(\log n) .

    Thus the running time of this line is

    O(L\cdot(kdn^{o(1)}+dn^{1+o(1)}+n\log n))
    =\displaystyle= O(cost(K)logn(no(1)log2n+n1+o(1)logn+nlogn))\displaystyle~{}O(\mathrm{cost}(K)\cdot\log n\cdot(n^{o(1)}\cdot\log^{2}n+n^{1+o(1)}\cdot\log n+n\cdot\log n))
    =\displaystyle= O(n1+o(1)cost(K)log2n)\displaystyle~{}O(n^{1+o(1)}\mathrm{cost}(K)\cdot\log^{2}n)

where the first step follows from K2,j=O(cost(K)logn)K_{2,j}=O(\mathrm{cost}(K)\cdot\log n), d=O(logn)d=O(\log n) and kj=O(logn)k_{j}=O(\log n), the second step follows from O(nlogn)=O(n1+o(1))O(n\log n)=O(n^{1+o(1)}).

The running time from line 20 to line 27 is

O(\log(1/f_{\mathsf{KDE}}))+O(n)+O(n^{1+o(1)}\mathrm{cost}(K)\log^{2}n)
=O(n^{1+o(1)}\mathrm{cost}(K)\log^{2}n)

The final running time for Initialize procedure is

K1JO(n1+o(1)cost(K)log2n)\displaystyle~{}K_{1}\cdot J\cdot O(n^{1+o(1)}\mathrm{cost}(K)\cdot\log^{2}n)
=\displaystyle= O(ϵ2n1+o(1)cost(f)(1f𝖪𝖣𝖤)o(1)log(1/f𝖪𝖣𝖤)log3n)\displaystyle~{}O(\epsilon^{-2}n^{1+o(1)}\operatorname{cost}(f)\cdot(\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\cdot\log(1/f_{\mathsf{KDE}})\cdot\log^{3}n)

where we use K1=O(ϵ2f𝖪𝖣𝖤o(1)logn)K_{1}=O(\epsilon^{-2}\cdot f_{\mathsf{KDE}}^{-o(1)}\cdot\log n) and J=O(log(1/f𝖪𝖣𝖤))J=O(\log(1/f_{\mathsf{KDE}})).

Thus, we complete the proof. ∎

Algorithm 4 Dynamic KDE, members and initialize part
1:data structure DynamicKDE \triangleright Theorem 1.2
2:members
3:     d+d\in\mathbb{N}_{+} \triangleright Dimension of data point
4:     For i[n]{i\in[n]},xidx_{i}\in\mathbb{R}^{d} \triangleright dataset XX
5:     K1+K_{1}\in\mathbb{N}_{+} \triangleright Number of repetitions
6:     R+R\in\mathbb{N}_{+}
7:     For a[K1]a\in[K_{1}], P~ad\widetilde{P}_{a}\subset\mathbb{R}^{d} \triangleright Sampled data points
8:     K2+K_{2}\in\mathbb{N}_{+}
9:     For a[K1]a\in[K_{1}], r[R]r\in[R], Ha,rLSHH_{a,r}\in\textsc{LSH} \triangleright Instances from LSH class
10:end members
11:
12:procedure Initialize(Xd,ϵ(0,1),f𝖪𝖣𝖤[0,1]X\subset\mathbb{R}^{d},\epsilon\in(0,1),f_{\mathsf{KDE}}\in[0,1]) \triangleright Lemma 4.2
13:     \triangleright f𝖪𝖣𝖤f_{\mathsf{KDE}} is a known quantity satisfy f𝖪𝖣𝖤f𝖪𝖣𝖤f_{\mathsf{KDE}}\geq f_{\mathsf{KDE}}^{*}
14:     \triangleright ϵ\epsilon represents the precision of estimation
15:     K1Cϵ2lognf𝖪𝖣𝖤o(1)K_{1}\leftarrow C\cdot\epsilon^{-2}\log{n}\cdot f_{\mathsf{KDE}}^{-o(1)}
16:     Rlog1/f𝖪𝖣𝖤R\leftarrow\left\lceil\log{{1}/{f_{\mathsf{KDE}}}}\right\rceil
17:     for a=1,2,,K1a=1,2,\cdots,K_{1} do
18:         for r=1,2,,Rr=1,2,\cdots,R do
19:              for i=r+1,,R+1i=r+1,\cdots,R+1 do
20:                  ci,rmin{zi1zr,log17n}c_{i,r}\leftarrow\min\{\frac{z_{i-1}}{z_{r}},\log^{\frac{1}{7}}n\} \triangleright zrz_{r} is defined in Definition 2.1
21:              end for
22:              krmaxi{r+1,,R+1}1log1pirc~i,r(1o(1))k_{r}\leftarrow\max_{i\in\{r+1,\cdots,R+1\}}\frac{1}{\log\frac{1}{p}}\lceil\frac{i-r}{\widetilde{c}_{i,r}(1-o(1))}\rceil
23:              pnear,rp(zr)p_{\mathrm{near},r}\leftarrow p(z_{r})
24:              K2,r100lognpnear,rkrK_{2,r}\leftarrow 100\log{n}\cdot p_{\mathrm{near},r}^{-k_{r}} \triangleright pnear,r,pfar,rp_{\mathrm{near},r},p_{\mathrm{far},r} are defined in Lemma 2.4
25:              psamplingmin{12rnf𝖪𝖣𝖤,1}p_{\mathrm{sampling}}\leftarrow\min\{\frac{1}{2^{r}nf_{\mathsf{KDE}}},1\}
26:              PrP_{r}\leftarrow sample each element in XX with probability psamplingp_{\mathrm{sampling}}.
27:              a,r.Initialize(Pr,kr,K2,r)\mathcal{H}_{a,r}.\textsc{Initialize}(P_{r},k_{r},K_{2,r})
28:         end for
29:         P~a\widetilde{P}_{a}\leftarrow sample each element in XX with probability 1n\frac{1}{n} \triangleright Store P~a\widetilde{P}_{a}
30:     end for
31:end procedure
32:end data structure
Algorithm 5 Dynamic KDE, update part
1:data structure DynamicKDE \triangleright Theorem 1.2
2:
3:procedure Update(vd,f𝖪𝖣𝖤[0,1],i[n]v\in\mathbb{R}^{d},f_{\mathsf{KDE}}\in[0,1],i\in[n]) \triangleright Lemma 4.6
4:     for a=1,2,,K1a=1,2,\cdots,K_{1} do
5:         for r=1,2,,Rr=1,2,\cdots,R do
6:              a,r.UpdateHashTable(v,i)\mathcal{H}_{a,r}.\textsc{UpdateHashTable}(v,i)
7:         end for
8:     end for
9:     xivx_{i}\leftarrow v
10:end procedure
11:end data structure

C.2 Update part of data structure

The goal of this section is to prove Lemma 4.6. Our Lemma C.6 in this section is the formal version of Lemma 4.6. We present an auxiliary Lemma C.5 and then show how to use this auxiliary lemma to prove Lemma C.6.

Lemma C.5 (Update time of LSH, formal version of Lemma 4.5).

Given a data point z\in\mathbb{R}^{d} and an index i\in[n], the UpdateHashTable of the data structure LSH runs in (expected) time

O(n^{o(1)}\log^{2}(n)\cdot\operatorname{cost}(f)).
Proof.

This procedure has one for loop which repeats L=O(logn)L=O(\log n) iterations. Now let us consider the running time from line 28 to line 29, which is the time for each iteration.

  • line 28 takes O(dno(1)cost(f))O(dn^{o(1)}\operatorname{cost}(f))

  • line 29 takes the same time as line 28

The final running time

LO(dno(1)cost(f))\displaystyle~{}L\cdot O(dn^{o(1)}\operatorname{cost}(f))
=\displaystyle= O(no(1)cost(f)log2n).\displaystyle~{}O(n^{o(1)}\operatorname{cost}(f)\cdot\log^{2}n).

where we use L=O(logn)L=O(\log n) and d=O(logn)d=O(\log n).

Thus, we complete the proof. ∎

Lemma C.6 (The update part of Theorem 1.2, formal version of Lemma 4.6).

Given an update zdz\in\mathbb{R}^{d} and index i[n]i\in[n], the Update of the data-structure DynamicKDE (Algorithm 5) runs in (expected) time

O(ϵ2no(1)cost(f)(1f𝖪𝖣𝖤)o(1)log(1/f𝖪𝖣𝖤)log3n).\displaystyle O(\epsilon^{-2}n^{o(1)}\operatorname{cost}(f)\cdot(\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\log(1/f_{\mathsf{KDE}})\cdot\log^{3}n).
Proof.

This algorithm has two nested for loops.

  • The first loop repeats K_{1}=O(\epsilon^{-2}\log(n)\cdot f_{\mathsf{KDE}}^{-o(1)}) iterations.

  • The second loop repeats J=O(\log\frac{1}{f_{\mathsf{KDE}}}) iterations.

Now let’s consider the running time in line 6, which is the time for each inner loop.

By Lemma 4.5, line 6 takes O(n^{o(1)}\log^{2}(n)\cdot\operatorname{cost}(f)) time.

The final running time

K1JO(no(1)cost(f)log2n)\displaystyle~{}K_{1}\cdot J\cdot O(n^{o(1)}\operatorname{cost}(f)\cdot\log^{2}n)
=\displaystyle= O(ϵ2no(1)cost(f)(1f𝖪𝖣𝖤)o(1)log(1/f𝖪𝖣𝖤)log3n)\displaystyle~{}O(\epsilon^{-2}n^{o(1)}\operatorname{cost}(f)\cdot(\frac{1}{f_{\mathsf{KDE}}})^{o(1)}\log(1/f_{\mathsf{KDE}})\cdot\log^{3}n)

where we use K1=O(ϵ2log(n)f𝖪𝖣𝖤o(1))K_{1}=O(\epsilon^{-2}\log(n)\cdot f_{\mathsf{KDE}}^{-o(1)}) and J=O(log1f𝖪𝖣𝖤)J=O(\log\frac{1}{f_{\mathsf{KDE}}}).

Thus, we complete the proof. ∎

C.3 Query part of data structure

The goal of this section is to prove Lemma C.9. Our Algorithm 6 is for querying the approximated kernel density at point qq, and Lemma C.9 specifies the running time for the query operation. In order to prove this lemma, we list and prove a few auxiliary lemmas.

Algorithm 6 Dynamic KDE, query part
1:data structure DynamicKDE \triangleright Theorem 1.2
2:
3:procedure Query(qd,ϵ(0,1),f𝖪𝖣𝖤[0,1]q\in\mathbb{R}^{d},\epsilon\in(0,1),f_{\mathsf{KDE}}\in[0,1])
4:     for a=1,2,,K1a=1,2,\cdots,K_{1} do
5:         for r=1,2,,Rr=1,2,\cdots,R do
6:              a,r.Recover(q)\mathcal{H}_{a,r}.\textsc{Recover}(q)
7:              𝒮𝒮(a,r.Lr)\mathcal{S}\leftarrow\mathcal{S}\cup(\mathcal{H}_{a,r}.\mathcal{R}\cap L_{r})
8:         end for
9:         R+1\mathcal{R}_{R+1}\leftarrow recover points in LR+1P~aL_{R+1}\cap\widetilde{P}_{a} \triangleright Recover by calculating ww directly.
10:         𝒮𝒮R+1\mathcal{S}\leftarrow\mathcal{S}\cup\mathcal{R}_{R+1}
11:         for xi𝒮x_{i}\in\mathcal{S} do
12:              wif(xi,q)w_{i}\leftarrow f(x_{i},q)
13:              if xiLrx_{i}\in L_{r} for some r[R]r\in[R] then
14:                  pimin{12rnf𝖪𝖣𝖤,1}p_{i}\leftarrow\min\{\frac{1}{2^{r}nf_{\mathsf{KDE}}},1\}
15:              else if xiXr[R]Lrx_{i}\in X\setminus\bigcup_{r\in[R]}L_{r} then
16:                  pi1np_{i}\leftarrow\frac{1}{n}
17:              end if
18:         end for
19:         Taxi𝒮wipiT_{a}\leftarrow\sum_{x_{i}\in\mathcal{S}}\frac{w_{i}}{p_{i}}
20:     end for
21:     return Median{Ta}\mathrm{Median}\{T_{a}\}
22:end procedure
23:end data structure
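Read procedurally, Algorithm 6 is an importance-sampling estimator followed by a median over the K_{1} repetitions. The following is a minimal Python sketch of that flow. The helpers level, recover and recover_far, and the assumption that data points are hashable, are stand-ins introduced for the sketch and are not part of the paper's interface.

import statistics

def query_estimate(q, f, level, recover, recover_far, n, f_kde, K1, R):
    # Sketch of DynamicKDE.Query (Algorithm 6).
    #   level(x, q)       -> weight-level index of x for query q (1..R+1), or None
    #   recover(a, r, q)  -> points returned by the LSH instance H_{a,r} on query q
    #   recover_far(a, q) -> points of the outermost level recovered from the
    #                        uniformly sampled set P~_a by direct evaluation
    # Returns the median of the estimators T_a, which estimates n * f*_KDE(q).
    estimates = []
    for a in range(1, K1 + 1):
        S = set()
        for r in range(1, R + 1):
            # keep only recovered points whose weight level matches the phase
            S |= {x for x in recover(a, r, q) if level(x, q) == r}
        S |= set(recover_far(a, q))
        T = 0.0
        for x in S:
            w = f(x, q)
            r = level(x, q)
            if r is not None and r <= R:
                p = min(1.0 / (2 ** r * n * f_kde), 1.0)
            else:
                p = 1.0 / n      # points outside the first R weight levels
            T += w / p
        estimates.append(T)
    return statistics.median(estimates)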

We start by showing a lemma that states the expected number of points in each level set.

Lemma C.7 (expected number of points in level sets, formal version of Lemma 4.7).

Given a query q\in\mathbb{R}^{d} and a fixed r\in[R]. For any i\in[R+1], the weight level L_{i} contributes at most 1 point to the hash bucket of query q in expectation.

Proof.

We consider 2 cases:

Case 1. i\leq r: By Lemma B.1, we have |L_{i}|\leq 2^{i}nf_{\mathsf{KDE}}. In the r-th phase, we sample each point in the whole data set with probability \min\{\frac{1}{2^{r}nf_{\mathsf{KDE}}},1\} to obtain a subset X_{r} (Algorithm 4, line 26). Then

𝔼[|{x:xLiXr}|]\displaystyle~{}\operatorname*{{\mathbb{E}}}[|\{x:x\in L_{i}\cap X_{r}\}|]
\displaystyle\leq |Li|12rnf𝖪𝖣𝖤\displaystyle~{}|L_{i}|\cdot\frac{1}{2^{r}nf_{\mathsf{KDE}}}
\displaystyle\leq 2inf𝖪𝖣𝖤12rnf𝖪𝖣𝖤\displaystyle~{}2^{i}nf_{\mathsf{KDE}}\cdot\frac{1}{2^{r}nf_{\mathsf{KDE}}}
=\displaystyle= 2ir\displaystyle~{}2^{i-r}
\displaystyle\ \leq 1\displaystyle~{}1

where the first step follows from sampling probability min{12rnf𝖪𝖣𝖤,1}\min\{\frac{1}{2^{r}nf_{\mathsf{KDE}}},1\}, the second step follows from |Li|2inf𝖪𝖣𝖤|L_{i}|\leq 2^{i}nf_{\mathsf{KDE}}, the third step follows from canceling nf𝖪𝖣𝖤nf_{\mathsf{KDE}}, the last step follows from iri\leq r.

Thus, there is at most 11 sampled point from LiL_{i} in expectation.

Then LiL_{i} contributes at most 11 point in the bucket of query qq in expectation.

Case 2. i=r+1,,R+1i=r+1,\cdots,R+1: By Lemma 2.6, we have |Li|2inf𝖪𝖣𝖤|L_{i}|\leq 2^{i}nf_{\mathsf{KDE}}. The sampling rate in rr’th phase is min{12rnf𝖪𝖣𝖤,1}\min\{\frac{1}{2^{r}nf_{\mathsf{KDE}}},1\}(Algorithm 4 line 26). Then there are at most 2ir2^{i-r} sampled points from LiL_{i} in expectation. We set up LSH function such that the near distance is zrz_{r} (Definition 2.1). Also, we use (Algorithm 4 line 22)

k:=kr:=1log1pmaxi=r+1,,R+1irci,r(1o(1)).\displaystyle k:=k_{r}:=\frac{1}{\log\frac{1}{p}}\max\limits_{i=r+1,\cdots,R+1}\lceil\frac{i-r}{c_{i,r}(1-o(1))}\rceil.

as the number of concatenations. By Lemma B.3, LiL_{i} contributes at most 11 point in the bucket of query qq in expectation.

Therefore, in both cases, L_{i} contributes at most 1 point to the hash bucket of q in expectation.

Thus, we complete the proof. ∎

Next, we present the running time of Recover procedure in the LSH data structure, which is an important part of Query procedure in DynamicKDE.

Lemma C.8 (running time for recover points given a query, formal version of Lemma 4.8).

Given a query qdq\in\mathbb{R}^{d} and L,R,k+L,R,k\in\mathbb{N}_{+}, the Recover of the data-structure LSH runs in (expected) time

O(Lkno(1)+LR)\displaystyle O(Lkn^{o(1)}+LR)
Proof.

The procedure has one for loop with LL iterations. In each iteration, the running time consists of two parts

  • The evaluation of l(q)\mathcal{H}_{l}(q) takes O(kno(1))O(kn^{o(1)}) time

  • The Retrieve operation takes O(|l(q)|)O(|\mathcal{H}_{l}(q)|) time. By Lemma C.7, |l(q)|=O(R)|\mathcal{H}_{l}(q)|=O(R)

The running time of one iteration is O(kno(1)+R)O(kn^{o(1)}+R)

The final running time of this procedure is L\cdot O(kn^{o(1)}+R)=O(Lkn^{o(1)}+LR).

Thus, we complete the proof. ∎

Based on the running time of Recover in LSH above, we prove the running time of Query procedure in DynamicKDE.

Lemma C.9 (Query part of Theorem 1.2, formal version of Lemma 4.9).

Given a query qdq\in\mathbb{R}^{d}, the Query of the data-structure DynamicKDE (Algorithm 6) runs in (expected) time

O(ϵ2no(1)log(1/f𝖪𝖣𝖤)f𝖪𝖣𝖤o(1)cost(K)log3n).\displaystyle O(\epsilon^{-2}n^{o(1)}\log(1/f_{\mathsf{KDE}})\cdot f_{\mathsf{KDE}}^{-o(1)}\cdot\mathrm{cost}(K)\log^{3}n).
Proof.

First, the algorithm runs a for loop with K_{1}=O(\epsilon^{-2}\log n\cdot f_{\mathsf{KDE}}^{-o(1)}) iterations.

In each iteration, the running time consists of three parts.

Part 1. The running time from line 5 to line 8, which is a for loop with RR iterations. In each iteration, the running time comes from

  • By Lemma 4.8, we replace L with K_{2,j}=O(\mathrm{cost}(K)\log n), R with O(\log(1/f_{\mathsf{KDE}})), and k with k_{j}=O(\log n). Thus line 6 takes O(n^{o(1)}\mathrm{cost}(K)\log^{2}n) time.

  • line 7 takes O(|\mathcal{H}_{a,r}.\mathcal{R}|) time. By Lemma B.3, |\mathcal{H}_{a,r}.\mathcal{R}|=O(1) in expectation. Thus the expected running time of line 7 is O(1).

Thus the running time of this for loop is RO(no(1)cost(K)log2n)=O(no(1)log(1/f𝖪𝖣𝖤)cost(K)log2n)R\cdot O(n^{o(1)}\mathrm{cost}(K)\log^{2}n)=O(n^{o(1)}\log(1/f_{\mathsf{KDE}})\cdot\mathrm{cost}(K)\log^{2}n), where we use R=O(log(1/f𝖪𝖣𝖤))R=O(\log(1/f_{\mathsf{KDE}})).

Part 2. The running time from line 11 to line 18, which is a for loop with |\mathcal{S}| iterations. In each iteration, the running time is O(1). By Lemma B.3, |\mathcal{S}|=O(R). Thus the running time of this for loop is O(\log(1/f_{\mathsf{KDE}})), where we use R=O(\log(1/f_{\mathsf{KDE}})).

Part 3. The running time of lines 9, 10 and 19 is O(1).

The final running time of Query is

K1(𝐏𝐚𝐫𝐭𝟏+𝐏𝐚𝐫𝐭𝟐+𝐏𝐚𝐫𝐭𝟑)\displaystyle~{}K_{1}\cdot({\bf Part1}+{\bf Part2}+{\bf Part3})
=\displaystyle= K1O(no(1)log(1/f𝖪𝖣𝖤)cost(K)log2n\displaystyle~{}K_{1}\cdot O(n^{o(1)}\log(1/f_{\mathsf{KDE}})\cdot\mathrm{cost}(K)\log^{2}n
+O(log(1/f𝖪𝖣𝖤))+O(1))\displaystyle~{}+O(\log(1/f_{\mathsf{KDE}}))+O(1))
=\displaystyle= O(ϵ2no(1)log(1/f𝖪𝖣𝖤)f𝖪𝖣𝖤o(1)cost(K)log3n)\displaystyle~{}O(\epsilon^{-2}n^{o(1)}\log(1/f_{\mathsf{KDE}})\cdot f_{\mathsf{KDE}}^{-o(1)}\cdot\mathrm{cost}(K)\log^{3}n)

where the first step follows directly from the running times of the three parts, and the last step follows from K_{1}=O(\epsilon^{-2}\log n\cdot f_{\mathsf{KDE}}^{-o(1)}).

Thus, we complete the proof. ∎

C.4 LSH data structure

In this section, we present the LSH data structures with the following procedures:

Initialize Given a data set \{x_{1},\cdots,x_{n}\}\subset\mathbb{R}^{d} and integer parameters k,L, it first invokes the private procedure ChooseHashFunc. The idea behind this is to amplify the ”sensitivity” of hashing by concatenating k basic hash functions from the family \mathcal{H} (Algorithm 7, line 9) into a new function. Thus we obtain a family of ”augmented” hash functions \mathcal{H}_{l}, l\in[L] (Algorithm 8, line 7). We then invoke ConstructHashTable, which hashes each point x_{i} using the hash function \mathcal{H}_{l}. We obtain L hash tables corresponding to the L hash functions, which can be updated quickly.

Recover Given a query q\in\mathbb{R}^{d}, it finds the bucket where q is hashed by \mathcal{H}_{l} and retrieves all the points in that bucket according to the hash table \mathcal{T}_{l}. This operation is applied to all L hash tables.

UpdateHashTable Given a new data point zdz\in\mathbb{R}^{d} and index i[n]i\in[n], it repeats the following operations for all l[L]l\in[L]: find bucket l(z)\mathcal{H}_{l}(z) and insert point zz; find bucket l(xi)\mathcal{H}_{l}(x_{i}) and delete point xix_{i}.
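Before the pseudocode (Algorithms 7 and 8 below), here is a minimal Python sketch of these three procedures. The factory sample_hash, which should return one basic hash function mapping a point to a bucket id (in the paper, drawn from an Andoni-Indyk family), is an assumption of the sketch and treated as a black box; buckets are plain hash maps rather than the AVL trees of Algorithm 7.

from collections import defaultdict

class SimpleLSH:
    # Sketch of the LSH wrapper: L tables, each keyed by the concatenation
    # of k basic hash functions drawn via sample_hash().
    def __init__(self, points, k, L, sample_hash):
        self.k, self.L = k, L
        # ChooseHashFunc: amplify by concatenating k basic functions per table.
        self.funcs = [[sample_hash() for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(set) for _ in range(L)]
        self.points = dict(enumerate(points))
        # ConstructHashTable: hash every point into each of the L tables.
        for i, x in self.points.items():
            for l in range(L):
                self.tables[l][self._key(l, x)].add(i)

    def _key(self, l, x):
        # Concatenated ("amplified") hash value used as the bucket key.
        return tuple(h(x) for h in self.funcs[l])

    def recover(self, q):
        # Return indices of points colliding with q in at least one table.
        out = set()
        for l in range(self.L):
            out |= self.tables[l].get(self._key(l, q), set())
        return out

    def update(self, z, i):
        # UpdateHashTable: replace point i by z in every table.
        old = self.points[i]
        for l in range(self.L):
            self.tables[l][self._key(l, old)].discard(i)
            self.tables[l][self._key(l, z)].add(i)
        self.points[i] = z

A hypothetical usage, with my_family standing in for an actual LSH family: lsh = SimpleLSH(points, k=3, L=10, sample_hash=my_family); lsh.update(z, i) touches only the 2L buckets containing x_i and z; lsh.recover(q) returns the colliding indices.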

Algorithm 7 LSH, members and public procedures
1:data structure LSH
2:members
3:     d,n+d,n\in\mathbb{N}_{+} \triangleright dd is dimension, nn is number of data points
4:     K,L+K,L\in\mathbb{N}_{+} \triangleright KK is amplification factor, LL is number of repetition for hashing
5:     pnear,pfar(0,1)p_{\mathrm{near}},p_{\mathrm{far}}\in(0,1) \triangleright Collision probability
6:     For l\in[L], \mathcal{T}_{l}:=[n] \triangleright Hash table recording the data points hashed by \mathcal{H}_{l}
7:     :=[n]\mathcal{R}:=[n] \triangleright retrieved points
8:     \mathcal{H}:=\{h:\mathbb{R}^{d}\rightarrow[M]\} \triangleright M is the number of buckets of the hash family \mathcal{H}
9:     For l[L]l\in[L], lK\mathcal{H}_{l}\in\mathcal{H}^{K} \triangleright Family of amplified hash functions with at most MKM^{K} non-empty buckets
10:     For b[MK]b\in[M^{K}], 𝒮b:=\mathcal{S}_{b}:=AVL tree \triangleright Use AVL tree to store points in bucket
11:end members
12:
13:public
14:procedure Initialize({xi}i[n]d,k,L+\{x_{i}\}_{i\in[n]}\subset\mathbb{R}^{d},k,L\in\mathbb{N}_{+})
15:     ChooseHashFunc(k,Lk,L)
16:     ConstructHashTable({xi}i[n]\{x_{i}\}_{i\in[n]})
17:end procedure
18:
19:procedure Recover(qdq\in\mathbb{R}^{d})
20:     \mathcal{R}\leftarrow\emptyset
21:     for l[L]l\in[L] do
22:         𝒯l\mathcal{R}\leftarrow\mathcal{R}\cup\mathcal{T}_{l}.Retrieve(l(q)\mathcal{H}_{l}(q)) \triangleright Find the bucket l(q)\mathcal{H}_{l}(q) in 𝒯l\mathcal{T}_{l} and retrieve all points
23:     end for
24:end procedure
25:
26:procedure UpdateHashTable(zd,i[n]z\in\mathbb{R}^{d},i\in[n])
27:     for l[L]l\in[L] do
28:         l(z)\mathcal{H}_{l}(z).Insert(zz) \triangleright l(z)\mathcal{H}_{l}(z) denotes the bucket that zz is mapped to
29:         l(xi)\mathcal{H}_{l}(x_{i}).Delete(xix_{i})
30:     end for
31:end procedure
32:end data structure

Next, we provide a private procedure of LSH in Algorithm 8.

Algorithm 8 LSH, private procedures
1:data structure LSH
2:
3:private
4:procedure ChooseHashFunc(k,L+k,L\in\mathbb{N}_{+})
5:     for l[L]l\in[L] do
6:         \triangleright Amplify hash functions by concatenating
7:         l\mathcal{H}_{l}\leftarrow sample kk hash functions (f1,l,f2,l,,fk,l)(f_{1,l},f_{2,l},\cdots,f_{k,l}) from \mathcal{H}
8:     end for
9:end procedure
10:
11:procedure ConstructHashTable({xi}i[n]d\{x_{i}\}_{i\in[n]}\subset\mathbb{R}^{d})
12:     for l[L]l\in[L] do
13:         for i[n]i\in[n] do
14:              l(xi)\mathcal{H}_{l}(x_{i}).Insert(xix_{i})
15:              \mathcal{T}_{l}\leftarrow\mathcal{T}_{l}\cup\mathcal{H}_{l}(x_{i}) \triangleright Create the hash table by aggregating buckets
16:         end for
17:     end for
18:end procedure
19:end data structure

Appendix D Correctness: A Brief Summary

In this section, we briefly discuss the correctness of the algorithm from the aspects of data independence (Section D.1), unbiasedness (Section D.2), and variance (Section D.3).

D.1 Data Independence of Initialize and Update.

We will show that the randomness used by procedure Initialize (the sampled subsets and the hash functions) is chosen independently of the input data. Hence, it is convenient for procedure Update to change only a small part of the stored data, without re-initializing the whole data structure. Initialize first samples data points by performing an independent Bernoulli trial on each of the n data points (Algorithm 4, line 26). Then it invokes LSH.Initialize to hash the sampled data points into buckets, which are stored in an LSH instance \mathcal{H}_{a,r} (Algorithm 4, line 27). For Update, it finds the buckets where the old data point is hashed, deletes it, and inserts the new point (Algorithm 7, lines 28 and 29). Thus procedure Update maintains the initialized structure.

D.2 Unbiasedness of Query

We now present the unbiasedness of the estimator (up to some inverse polynomial error).

Lemma D.1 (unbiasedness of the Query, informal version of Lemma E.1).

Given f𝖪𝖣𝖤(0,1)f_{\mathsf{KDE}}^{*}\in(0,1), f𝖪𝖣𝖤f𝖪𝖣𝖤f_{\mathsf{KDE}}\geq f_{\mathsf{KDE}}^{*}, ϵ(f𝖪𝖣𝖤10,1)\epsilon\in(f_{\mathsf{KDE}}^{10},1) and qdq\in\mathbb{R}^{d}, we claim that estimator TaT_{a} for any a[K1]a\in[K_{1}] constructed in line 19 Algorithm 6 satisfies (1n9)nf𝖪𝖣𝖤𝔼[Ta]nf𝖪𝖣𝖤(1-n^{-9})nf_{\mathsf{KDE}}^{*}\leq\operatorname*{{\mathbb{E}}}[T_{a}]\leq nf_{\mathsf{KDE}}^{*}.

D.3 Variance Bound for Query

We present the variance bound of our estimator.

Lemma D.2 (Variance bound for Query, informal version of Lemma E.2).

Given f𝖪𝖣𝖤(0,1)f_{\mathsf{KDE}}^{*}\in(0,1), f𝖪𝖣𝖤/4f𝖪𝖣𝖤f𝖪𝖣𝖤f_{\mathsf{KDE}}/4\leq f_{\mathsf{KDE}}^{*}\leq f_{\mathsf{KDE}}, ϵ(f𝖪𝖣𝖤10,1)\epsilon\in(f_{\mathsf{KDE}}^{10},1) and qdq\in\mathbb{R}^{d},

Query can output a (1±ϵ)(1\pm\epsilon)-factor approximation to f𝖪𝖣𝖤f_{\mathsf{KDE}}^{*}.

Appendix E Correctness: Details

The goal of this section is to prove the correctness of our algorithms. In Section E.1, we provide the proof of unbiasedness for the query. In Section E.2, we provide the proof of variance bound for the query.

E.1 Unbiasedness of Query

In this section, we prove that the estimator returned by Query is unbiased.

Lemma E.1 (unbiasedness of the Query, formal version of Lemma D.1).

For every f𝖪𝖣𝖤(0,1)f_{\mathsf{KDE}}^{*}\in(0,1), every f𝖪𝖣𝖤f𝖪𝖣𝖤f_{\mathsf{KDE}}\geq f_{\mathsf{KDE}}^{*}, every ϵ(f𝖪𝖣𝖤10,1)\epsilon\in(f_{\mathsf{KDE}}^{10},1), every qdq\in\mathbb{R}^{d}, estimator Ta=xi𝒮wipiT_{a}=\sum_{x_{i}\in\mathcal{S}}\frac{w_{i}}{p_{i}} for any a[K1]a\in[K_{1}] constructed in line 19 Algorithm 6 satisfies the following:

(1n9)nf𝖪𝖣𝖤𝔼[Ta]nf𝖪𝖣𝖤\displaystyle(1-n^{-9})nf_{\mathsf{KDE}}^{*}\leq\operatorname*{{\mathbb{E}}}[T_{a}]\leq nf_{\mathsf{KDE}}^{*}
Proof.

Let \mathcal{E} be the event that every sampled point is recovered in the phase corresponding to its weight level. Let T:=T_{a} (see line 19 of Algorithm 6). By Lemma B.2 and a union bound over the at most n sampled points, we have

Pr[]1n9\displaystyle\Pr[\mathcal{E}]\geq 1-n^{-9}

For i\in[n], define \chi_{i}=1 to be the event that point x_{i} is sampled and recovered in the phase corresponding to its weight level, and \chi_{i}=0 otherwise. Then \operatorname*{{\mathbb{E}}}[T]=\sum_{i=1}^{n}\frac{\operatorname*{{\mathbb{E}}}[\chi_{i}]}{p_{i}}w_{i} and (1-n^{-9})p_{i}\leq\operatorname*{{\mathbb{E}}}[\chi_{i}]\leq p_{i}. Since \sum_{i=1}^{n}w_{i}=nf_{\mathsf{KDE}}^{*}, we obtain

(1n9)nf𝖪𝖣𝖤𝔼[T]nf𝖪𝖣𝖤\displaystyle(1-n^{-9})nf_{\mathsf{KDE}}^{*}\leq\operatorname*{{\mathbb{E}}}[T]\leq nf_{\mathsf{KDE}}^{*}

E.2 Variance Bound for Query

The goal of this section is to prove the variance bound of the estimator.

Lemma E.2 (Variance bound for Query, formal version of Lemma D.2 ).

For every f𝖪𝖣𝖤(0,1)f_{\mathsf{KDE}}^{*}\in(0,1), every ϵ(f𝖪𝖣𝖤10,1)\epsilon\in(f_{\mathsf{KDE}}^{10},1), every qdq\in\mathbb{R}^{d}, using estimators Ta=xi𝒮wipiT_{a}=\sum_{x_{i}\in\mathcal{S}}\frac{w_{i}}{p_{i}}, for a[K1]a\in[K_{1}] constructed in line 19 Algorithm 6, where f𝖪𝖣𝖤/4f𝖪𝖣𝖤f𝖪𝖣𝖤f_{\mathsf{KDE}}/4\leq f_{\mathsf{KDE}}^{*}\leq f_{\mathsf{KDE}}, one can output a (1±ϵ)(1\pm\epsilon)-factor approximation to f𝖪𝖣𝖤f_{\mathsf{KDE}}^{*}.

Proof.

First, we have T\leq n^{2}f_{\mathsf{KDE}}^{*}, since p_{i}\geq\frac{1}{n} for every i and \sum_{i}w_{i}=nf_{\mathsf{KDE}}^{*}; the extreme case is that every point is sampled and recovered in the phase of its weight level with p_{i}=\frac{1}{n}. By Lemma D.1, we have

𝔼[T|]Pr[]+n2f𝖪𝖣𝖤(1Pr[])𝔼[T]\displaystyle\operatorname*{{\mathbb{E}}}[T~{}|~{}\mathcal{E}]\cdot\Pr[\mathcal{E}]+n^{2}f_{\mathsf{KDE}}^{*}(1-\Pr[\mathcal{E}])\geq\operatorname*{{\mathbb{E}}}[T]

Also, we have

𝔼[T|]𝔼[T]Pr[]nf𝖪𝖣𝖤Pr[]=nf𝖪𝖣𝖤(1+o(1/n9))\displaystyle\operatorname*{{\mathbb{E}}}[T~{}|~{}\mathcal{E}]\leq\frac{\operatorname*{{\mathbb{E}}}[T]}{\Pr[\mathcal{E}]}\leq\frac{nf_{\mathsf{KDE}}^{*}}{\Pr[\mathcal{E}]}=nf_{\mathsf{KDE}}^{*}(1+o(1/n^{9}))

Then, we have

𝔼[T2]\displaystyle\operatorname*{{\mathbb{E}}}[T^{2}] =𝔼[(piPχiwipi)2]\displaystyle=\operatorname*{{\mathbb{E}}}[(\sum_{p_{i}\in P}\chi_{i}\frac{w_{i}}{p_{i}})^{2}]
=ij𝔼[χiχjwiwjpipj]+i[n]𝔼[χiwi2pi2]\displaystyle=\sum_{i\neq j}\operatorname*{{\mathbb{E}}}[\chi_{i}\chi_{j}\frac{w_{i}w_{j}}{p_{i}p_{j}}]+\sum_{i\in[n]}\operatorname*{{\mathbb{E}}}[\chi_{i}\frac{w_{i}^{2}}{p_{i}^{2}}]
ijwiwj+i[n]wi2pi𝕀[pi=1]+i[n]wi2pi𝕀[pi1]\displaystyle\leq\sum_{i\neq j}w_{i}w_{j}+\sum_{i\in[n]}\frac{w_{i}^{2}}{p_{i}}\mathbb{I}[p_{i}=1]+\sum_{i\in[n]}\frac{w_{i}^{2}}{p_{i}}\mathbb{I}[p_{i}\neq 1]
(iwi)2+maxi{wipi𝕀[pi1]}i[n]wi\displaystyle\leq(\sum_{i}w_{i})^{2}+\max_{i}\{\frac{w_{i}}{p_{i}}\mathbb{I}[p_{i}\neq 1]\}\sum_{i\in[n]}w_{i}
2n2(f𝖪𝖣𝖤)2+maxj[J],piLj{wi2j+1}nf𝖪𝖣𝖤nf𝖪𝖣𝖤\displaystyle\leq 2n^{2}(f_{\mathsf{KDE}}^{*})^{2}+\max_{j\in[J],p_{i}\in L_{j}}\{w_{i}2^{j+1}\}nf_{\mathsf{KDE}}\cdot nf_{\mathsf{KDE}}^{*}
4n2f𝖪𝖣𝖤2\displaystyle\leq 4n^{2}f_{\mathsf{KDE}}^{2}

where the first step follows from the definition of T, the second step follows from expanding the square of the summation, the third step follows from \chi_{i}\chi_{j}\leq 1, the fourth step follows from \frac{w_{i}^{2}}{p_{i}}\mathbb{I}[p_{i}=1]\leq w_{i}^{2} and (\sum_{i}w_{i})^{2}=\sum_{i\neq j}w_{i}w_{j}+\sum_{i\in[n]}w_{i}^{2}, and the fifth step follows from \sum_{i}w_{i}=nf_{\mathsf{KDE}}^{*}, p_{i}\geq 1/(2^{j+1}nf_{\mathsf{KDE}}) for x_{i}\in L_{j}, and f_{\mathsf{KDE}}\geq f_{\mathsf{KDE}}^{*}.

\operatorname*{{\mathbb{E}}}[T^{2}~|~\mathcal{E}]\leq\frac{\operatorname*{{\mathbb{E}}}[T^{2}]}{\Pr[\mathcal{E}]}\leq n^{2}f_{\mathsf{KDE}}^{2-o(1)}(1+o(1/n^{9}))

Now, we repeat this process K_{1}=\frac{C\log n}{\epsilon^{2}}\cdot f_{\mathsf{KDE}}^{-o(1)} times, where C is a constant, so that we obtain a (1\pm\epsilon)-factor approximation with high success probability. We show that if we repeat the procedure m times and take the average, denoted \bar{T}, then:

\Pr[|\bar{T}-nf_{\mathsf{KDE}}^{*}|\geq\epsilon nf_{\mathsf{KDE}}^{*}]
\leq\Pr[|\bar{T}-\operatorname*{{\mathbb{E}}}[T]|\geq\epsilon nf_{\mathsf{KDE}}^{*}-|\operatorname*{{\mathbb{E}}}[T]-nf_{\mathsf{KDE}}^{*}|]
\leq\Pr[|\bar{T}-\operatorname*{{\mathbb{E}}}[T]|\geq(\epsilon-n^{-9})nf_{\mathsf{KDE}}^{*}]
\leq\frac{\operatorname{Var}[\bar{T}]}{(\epsilon-n^{-9})^{2}(nf_{\mathsf{KDE}}^{*})^{2}}
\leq\frac{1}{m}\cdot\frac{64n^{2}(f_{\mathsf{KDE}}^{*})^{2}}{(\epsilon-n^{-9})^{2}(nf_{\mathsf{KDE}}^{*})^{2}}

where the first step follows from |\bar{T}-nf_{\mathsf{KDE}}^{*}|\leq|\bar{T}-\operatorname*{{\mathbb{E}}}[T]|+|\operatorname*{{\mathbb{E}}}[T]-nf_{\mathsf{KDE}}^{*}|, the second step follows from \operatorname*{{\mathbb{E}}}[T]\geq(1-n^{-9})nf_{\mathsf{KDE}}^{*}, the third step follows from Chebyshev's inequality (and \operatorname*{{\mathbb{E}}}[\bar{T}]=\operatorname*{{\mathbb{E}}}[T]), and the last step follows from \operatorname{Var}[\bar{T}]=\operatorname{Var}[T]/m\leq\operatorname*{{\mathbb{E}}}[T^{2}]/m\leq 4n^{2}f_{\mathsf{KDE}}^{2}/m and f_{\mathsf{KDE}}\leq 4f_{\mathsf{KDE}}^{*}.

Taking m=O(\frac{1}{\epsilon^{2}}) bounds the failure probability by a small constant; we then take the median of O(\log(1/\delta)) independent means to drive the failure probability down to \delta, where we assume \delta=\frac{1}{\mathrm{poly}(n)}. Thus, we complete the proof. ∎
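The amplification just described is the standard median-of-means construction. Below is a minimal Python sketch of it, with illustrative constants and an assumed callable estimate() that draws one independent copy of the basic estimator T; neither is prescribed by the paper.

import math
import statistics

def median_of_means(estimate, eps, delta):
    # Average m = O(1/eps^2) independent copies to shrink the variance,
    # then take the median of O(log(1/delta)) such averages to push the
    # failure probability down to delta.
    m = math.ceil(64 / eps ** 2)              # illustrative constant
    groups = max(1, math.ceil(math.log(1.0 / delta)))
    means = [sum(estimate() for _ in range(m)) / m for _ in range(groups)]
    return statistics.median(means)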

Appendix F Adversary

In this section, we provide the detailed proofs for the lemmas in Section 5.

Starting point. In Section D, we have already obtained a query algorithm with constant success probability for a fixed query point.

Lemma F.1 (Starting with constant probability, restatement of Lemma 5.1).

Given \epsilon\in(0,0.1), a query point q\in\mathbb{R}^{d} and a set of data points X=\{x_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d}, let f_{\mathsf{KDE}}^{*}(q):=\frac{1}{|X|}\sum_{x\in X}f(x,q). There is an estimator \mathcal{D} that can answer the query and satisfies

(1-\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)\leq\mathcal{D}.\textsc{query}(q,\epsilon)\leq(1+\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)

with probability 0.9.

Proof.

By Lemma E.1 and Lemma E.2, our Query procedure can provide an estimator that answers kernel density estimation correctly with constant probability. ∎

Boost the constant probability to high probability.

Next, we begin to boost the success probability by repeating the query procedure and taking the median output.

Lemma F.2 (Boost the constant probability to high probability, restatement of Lemma 5.2).

Let δ1(0,0.1)\delta_{1}\in(0,0.1) denote the failure probability. Let ϵ(0,0.1)\epsilon\in(0,0.1) denote the accuracy parameter. Given L=O(log(1/δ1))L=O(\log(1/\delta_{1})) estimators {𝒟j}j=1L\{\mathcal{D}_{j}\}_{j=1}^{L}. For each fixed query point qdq\in\mathbb{R}^{d}, the median of queries from LL estimators satisfies that:

(1ϵ)f𝖪𝖣𝖤(q)\displaystyle(1-\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)\leq Median({𝒟j.query(q,ϵ)}j=1L)\displaystyle~{}\mathrm{Median}(\{\mathcal{D}_{j}.\textsc{query}(q,\epsilon)\}_{j=1}^{L})
\displaystyle\leq (1+ϵ)f𝖪𝖣𝖤(q)\displaystyle~{}(1+\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)

with probability 1δ11-\delta_{1}.

Proof.

From Lemma 5.1 we know each estimator 𝒟j\mathcal{D}_{j} can answer the query that satisfies:

(1ϵ)f𝖪𝖣𝖤(q)𝒟.query(q,ϵ)(1+ϵ)f𝖪𝖣𝖤(q)\displaystyle(1-\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)\leq\mathcal{D}.\textsc{query}(q,\epsilon)\leq(1+\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)

with probability 0.90.9.

Each estimator fails independently with probability at most 0.1, and the median can fall outside the interval only if at least half of the L estimators fail; by the Chernoff bound, this happens with probability at most \exp(-\Omega(L))\leq\delta_{1} for L=O(\log(1/\delta_{1})). Hence the median of the L=O(\log(1/\delta_{1})) queries from \{\mathcal{D}_{j}\}_{j=1}^{L} satisfies:

(1ϵ)f𝖪𝖣𝖤(q)\displaystyle(1-\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)\leq Median({𝒟j.query(q,ϵ)}j=1L)\displaystyle~{}\mathrm{Median}(\{\mathcal{D}_{j}.\textsc{query}(q,\epsilon)\}_{j=1}^{L})
\displaystyle\leq (1+ϵ)f𝖪𝖣𝖤(q)\displaystyle~{}(1+\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)

with probability 1δ11-\delta_{1}.

Therefore, we complete the proof. ∎

From each fixed point to all the net points.

So far, the success probability of our algorithm holds only for a fixed query point. We will introduce an \epsilon-net on the unit ball and show that the algorithm succeeds with high probability simultaneously on all the net points.

Fact F.3.

Let NN denote the ϵ0\epsilon_{0}-net of {xd|x21}\{x\in\mathbb{R}^{d}~{}|~{}\|x\|_{2}\leq 1\}. We use |N||N| to denote the number of points in NN. Then |N|(10/ϵ0)d|N|\leq(10/\epsilon_{0})^{d}.

This fact bounds the size of an \epsilon_{0}-net by (10/\epsilon_{0})^{d}. We use it to determine the number of repetitions needed to obtain the correctness of Query on all net points.

Lemma F.4 (From each fixed point to all the net points, restatement of Lemma 5.4).

Let NN denote the ϵ0\epsilon_{0}-net of {xd|x21}\{x\in\mathbb{R}^{d}~{}|~{}\|x\|_{2}\leq 1\}. We use |N||N| to denote the number of points in NN. Given L=log(|N|/δ)L=\log(|N|/\delta) estimators {𝒟j}j=1L\{\mathcal{D}_{j}\}_{j=1}^{L}.

With probability 1δ1-\delta, we have: for all qNq\in N, the median of queries from LL estimators satisfies that:

(1ϵ)f𝖪𝖣𝖤(q)\displaystyle(1-\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)\leq Median({𝒟j.query(q,ϵ)}j=1L)\displaystyle~{}\mathrm{Median}(\{\mathcal{D}_{j}.\textsc{query}(q,\epsilon)\}_{j=1}^{L})
(1+ϵ)f𝖪𝖣𝖤(q).\displaystyle~{}\leq(1+\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q).
Proof.

There are |N| points in the d-dimensional \epsilon_{0}-net N of the unit ball \{q\in\mathbb{R}^{d}~|~\|q\|_{2}\leq 1\}. From Lemma F.2 we know that for each query point q\in N, we have:

(1ϵ)f𝖪𝖣𝖤(q)\displaystyle(1-\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)\leq Median({𝒟j.query(q,ϵ)}j=1L)\displaystyle~{}\mathrm{Median}(\{\mathcal{D}_{j}.\textsc{query}(q,\epsilon)\}_{j=1}^{L})
\displaystyle\leq (1+ϵ)f𝖪𝖣𝖤(q)\displaystyle~{}(1+\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)

with probability 1δ/|N|1-\delta/|N|.

By a union bound over all |N| points of N, we have:

\forall q\in N:~(1-\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)\leq\mathrm{Median}(\{\mathcal{D}_{j}.\textsc{query}(q,\epsilon)\}_{j=1}^{L})\leq(1+\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)

with probability 1δ1-\delta. ∎

From net points to all points.

With Lemma F.4, we are ready to extend the correctness for net points to the whole unit ball. We demonstrate that all query points q21\|q\|_{2}\leq 1 can be answered approximately with high probability in the following lemma.

Lemma F.5 (From net points to all points, restatement of Lemma 5.5 ).

Let \epsilon\in(0,0.1), {\cal L}\geq 1, \delta\in(0,0.1) and \tau\in[0,1]. Given L=O(\log((\mathcal{L}/(\epsilon\tau))^{d}/\delta)) estimators \{\mathcal{D}_{j}\}_{j=1}^{L}, with probability 1-\delta, for all query points p with \|p\|_{2}\leq 1, the median of the queries from the L estimators satisfies:

p21:\displaystyle\forall\|p\|_{2}\leq 1:
(1ϵ)f𝖪𝖣𝖤(p)\displaystyle~{}(1-\epsilon)\cdot f_{\mathsf{KDE}}^{*}(p)
\displaystyle\leq Median({𝒟j.query(q,ϵ)}j=1L)\displaystyle~{}\mathrm{Median}(\{\mathcal{D}_{j}.\textsc{query}(q,\epsilon)\}_{j=1}^{L})
\displaystyle\leq (1+ϵ)f𝖪𝖣𝖤(p).\displaystyle~{}(1+\epsilon)\cdot f_{\mathsf{KDE}}^{*}(p).

where qq is the closest net point of pp.

Proof.

We define an event ξ\xi to be the following,

qN,(1ϵ)f𝖪𝖣𝖤(q)\displaystyle\forall q\in N,~{}~{}~{}(1-\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)\leq Median({𝒟j.query(q,ϵ)}j=1L)\displaystyle~{}\mathrm{Median}(\{\mathcal{D}_{j}.\textsc{query}(q,\epsilon)\}_{j=1}^{L})
\displaystyle\leq (1+ϵ)f𝖪𝖣𝖤(q)\displaystyle~{}(1+\epsilon)\cdot f_{\mathsf{KDE}}^{*}(q)

Using Lemma F.4 with L=log(|N|/δ)L=\log(|N|/\delta), we know that

Pr[ event ξ holds ]1δ\displaystyle\Pr[~{}\text{~{}event~{}}\xi\text{~{}holds~{}}]\geq 1-\delta

Using Fact F.3, we know that

L=\displaystyle L= log(|N|/δ)\displaystyle~{}\log(|N|/\delta)
=\displaystyle= log((10/ϵ0)d/δ)\displaystyle~{}\log((10/\epsilon_{0})^{d}/\delta)
=\displaystyle= log((10/ϵτ)d/δ)\displaystyle~{}\log((10\mathcal{L}/\epsilon\tau)^{d}/\delta)

where the last step follows from ϵ0=ϵτ/\epsilon_{0}=\epsilon\tau/\mathcal{L}.

We condition on the event \xi holding. (Then the remaining proof does not depend on any randomness, so "for each" and "for all" coincide.)

For each point pNp\notin N, there exists a qNq\in N such that

pq2ϵ0\displaystyle\|p-q\|_{2}\leq\epsilon_{0} (7)

For each pp, we know

|Medianj𝒟j.query(q,ϵ)f𝖪𝖣𝖤(p)|\displaystyle~{}|\mathrm{Median}_{j}\mathcal{D}_{j}.\textsc{query}(q,\epsilon)-f_{\mathsf{KDE}}^{*}(p)|
\displaystyle\leq |Medianj𝒟j.query(q,ϵ)f𝖪𝖣𝖤(q)|+|f𝖪𝖣𝖤(q)f𝖪𝖣𝖤(p)|\displaystyle~{}|\mathrm{Median}_{j}\mathcal{D}_{j}.\textsc{query}(q,\epsilon)-f_{\mathsf{KDE}}^{*}(q)|+|f_{\mathsf{KDE}}^{*}(q)-f_{\mathsf{KDE}}^{*}(p)|
\displaystyle\leq ϵf𝖪𝖣𝖤(q)+pq2\displaystyle~{}\epsilon f_{\mathsf{KDE}}^{*}(q)+\mathcal{L}\cdot\|p-q\|_{2}
\displaystyle\leq ϵ(f𝖪𝖣𝖤(p)+ϵ0)+ϵ0\displaystyle~{}\epsilon(f_{\mathsf{KDE}}^{*}(p)+\mathcal{L}\epsilon_{0})+\mathcal{L}\cdot\epsilon_{0}
\displaystyle\leq ϵ(f𝖪𝖣𝖤(p)+2τ)\displaystyle~{}\epsilon(f_{\mathsf{KDE}}^{*}(p)+2\tau)

where the first step follows from the triangle inequality, the second step follows from the event \xi and the Lipschitz property of f_{\mathsf{KDE}}^{*} (Lemma G.1), the third step follows from Eq. (7) and the Lipschitz property again, and the last step follows from \epsilon_{0}\leq\epsilon\tau/\mathcal{L} and \epsilon\leq 1.

Using j[L]:𝒟j.query(p,ϵ)τ\forall j\in[L]:\mathcal{D}_{j}.\textsc{query}(p,\epsilon)\geq\tau, we have

(13ϵ)f𝖪𝖣𝖤(p)\displaystyle(1-3\epsilon)\cdot f_{\mathsf{KDE}}^{*}(p)\leq Median({𝒟j.query(q,ϵ)}j=1L)\displaystyle~{}\mathrm{Median}(\{\mathcal{D}_{j}.\textsc{query}(q,\epsilon)\}_{j=1}^{L})
\displaystyle\leq (1+3ϵ)f𝖪𝖣𝖤(p).\displaystyle~{}(1+3\epsilon)\cdot f_{\mathsf{KDE}}^{*}(p).

Rescaling the ϵ\epsilon completes the proof. ∎
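To make the repetition count concrete, expanding the bound used in the proof (with no assumptions beyond those of Lemma F.5) gives

L=\log((10\mathcal{L}/(\epsilon\tau))^{d}/\delta)=d\cdot\log(10\mathcal{L}/(\epsilon\tau))+\log(1/\delta),

so the number of estimators grows only linearly in the dimension d and logarithmically in \mathcal{L}, 1/\epsilon, 1/\tau and 1/\delta.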

Thus, we obtain an algorithm that responds to adversarial queries robustly.

Appendix G Lipschitz

The goal of this section is to prove the Lipschitz property of the KDE function.

G.1 Lipschitz property of KDE function

Lemma G.1.

Suppose the kernel function f:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow[0,1] satisfies the following properties:

  • Radial: there exists a function f:\mathbb{R}\rightarrow[0,1] such that f(p,q)=f(\|p-q\|_{2}) for all p,q\in\mathbb{R}^{d}.

  • Decreasing: f is decreasing.

  • Lipschitz: f is \mathcal{L}-Lipschitz.

Then the KDE function f_{\mathsf{KDE}}^{*}:\mathbb{R}^{d}\rightarrow[0,1], f_{\mathsf{KDE}}^{*}(q):=\frac{1}{|P|}\sum_{p\in P}f(p,q), is \mathcal{L}-Lipschitz, i.e.,

|f𝖪𝖣𝖤(q)f𝖪𝖣𝖤(q)|qq2\displaystyle|f_{\mathsf{KDE}}^{*}(q)-f_{\mathsf{KDE}}^{*}(q^{\prime})|\leq\mathcal{L}\cdot\|q-q^{\prime}\|_{2}
Proof.

For any q,qdq,q^{\prime}\in\mathbb{R}^{d}, we have:

|f𝖪𝖣𝖤(q)f𝖪𝖣𝖤(q)|\displaystyle~{}|f_{\mathsf{KDE}}^{*}(q)-f_{\mathsf{KDE}}^{*}(q^{\prime})|
=\displaystyle= |1|P|pPf(p,q)1|P|pPf(p,q)|\displaystyle~{}|\frac{1}{|P|}\sum_{p\in P}f(p,q)-\frac{1}{|P|}\sum_{p\in P}f(p,q^{\prime})|
\displaystyle\leq 1|P|pP|f(p,q)f(p,q)|\displaystyle~{}\frac{1}{|P|}\sum_{p\in P}|f(p,q)-f(p,q^{\prime})|
=\displaystyle= 1|P|pP|f(pq2)f(pq2)|\displaystyle~{}\frac{1}{|P|}\sum_{p\in P}|f(\|p-q\|_{2})-f(\|p-q^{\prime}\|_{2})|
\displaystyle\leq 1|P|pP|pq2pq2|\displaystyle~{}\frac{1}{|P|}\sum_{p\in P}\mathcal{L}\cdot|~{}\|p-q\|_{2}-\|p-q^{\prime}\|_{2}~{}|
\displaystyle\leq 1|P|pPqq2\displaystyle~{}\frac{1}{|P|}\sum_{p\in P}\mathcal{L}\cdot\|q-q^{\prime}\|_{2}
=\displaystyle= qq2\displaystyle~{}\mathcal{L}\cdot\|q-q^{\prime}\|_{2}

where the first step follows from the definition of f_{\mathsf{KDE}}^{*}, the second step follows from the triangle inequality for absolute values, the third step follows from the radial property of the kernel, the fourth step follows from the Lipschitz property of f, the fifth step follows from the reverse triangle inequality for the \ell_{2} norm, and the last step follows from canceling |P|.

Thus, we complete the proof. ∎
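As an illustrative numerical check (not part of the paper's argument), the following Python snippet instantiates Lemma G.1 for the Gaussian-type kernel f(x,q)=\exp(-\|x-q\|_{2}^{2}), whose radial profile t\mapsto e^{-t^{2}} is \mathcal{L}-Lipschitz with \mathcal{L}=\sqrt{2/e}; the data distribution and perturbation scale below are arbitrary choices of the sketch.

import math
import random

def kde(points, q):
    # Empirical KDE with the kernel f(x, q) = exp(-||x - q||_2^2).
    def f(x):
        return math.exp(-sum((xi - qi) ** 2 for xi, qi in zip(x, q)))
    return sum(f(x) for x in points) / len(points)

random.seed(0)
d, n = 3, 200
points = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
lip = math.sqrt(2 / math.e)          # Lipschitz constant of t -> exp(-t^2)

for _ in range(5):
    q = [random.gauss(0, 1) for _ in range(d)]
    qp = [qi + random.gauss(0, 0.1) for qi in q]
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, qp)))
    assert abs(kde(points, q) - kde(points, qp)) <= lip * dist + 1e-12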