Dynamic Maintenance of Kernel Density Estimation Data Structure: From Practice to Theory
Kernel density estimation (KDE) stands out as a challenging task in machine learning. The problem is defined in the following way: given a kernel function and a set of points , we would like to compute for any query point . Recently, there has been a growing trend of using data structures for efficient KDE. However, the proposed KDE data structures focus on static settings. The robustness of KDE data structures over dynamically changing data distributions is not addressed. In this work, we focus on the dynamic maintenance of KDE data structures with robustness to adversarial queries. In particular, we provide a theoretical framework for KDE data structures. In our framework, the KDE data structures require only subquadratic space. Moreover, our data structure supports dynamic updates of the dataset in sublinear time. Furthermore, it answers adaptive queries, even in the presence of a potential adversary, in sublinear time.
1 Introduction
Kernel density estimation (KDE) is a well-known machine learning approach with wide applications in biology [20, 13], physics [21, 12] and law [7]. The KDE is defined as follows: given a kernel function and a dataset , we would like to estimate
for a query . It is standard to assume the kernel to be positive semi-definite. From a statistics perspective, we regard KDE as estimating the density of a probability distribution provided by a mapping.
Recently, there has been a growing trend in applying probabilistic data structures for KDE [15, 4, 30, 1, 9, 16, 25]. The general idea is to transform the kernel function into a distance measure and then apply similarity search data structures such as locality sensitive hashing and sketching. This co-design of data structure and KDE is of practical importance: (1) the computational efficiency is taken into consideration when we design KDE algorithms; (2) the capacity of traditional probabilistic data structures is extended from search to sampling. As a result, we obtain a series of KDE algorithms with both sample efficiency and running time efficiency.
However, current KDE data structures focus on static settings where the dataset is fixed and the queries are independent of each other. More practical settings should be taken into consideration. In some applications of KDE, the dataset is dynamically changing. For instance, in time series modeling [28, 22], the KDE estimators should adapt to insertions and deletions in the dataset. In semi-supervised learning [36, 37], KDE data structures should handle updates of the kernel function. Moreover, in works that apply KDE in optimization, the data structures should be robust to adversarial queries. As a result, the dynamic maintenance of KDE data structures deserves emphasis in machine learning research.
In this paper, we argue that there exists a practice-to-theory gap for the dynamic maintenance of KDE data structures. Although there is existing work [8] that supports insertion and deletion in KDE data structures, the impact of these operations on the quality of KDE is not well addressed. Moreover, the robustness of KDE data structures against adversaries is a recently raised concern. Thus, a formal theoretical analysis is required to discuss the robustness of KDE data structures in a dynamic setting.
In this paper, we present a theoretical analysis of the efficient maintenance of KDE for dynamic datasets and adversarial queries. Specifically, we present the first data structure design that can quickly adapt to updated input data and is robust to adversarial queries. We call our data structure and the corresponding algorithms adaptive kernel density estimation. Our data structure requires only subquadratic space, each update to the input data requires only sublinear time, and each query finishes in sublinear time.
Notations.
We use , , to denote the set of real numbers, positive real numbers, and positive integers. For a set , we use to denote its cardinality. Let and . We define and to be the smallest integer greater than or equal to . is the absolute value of . Let be the set of all -dimensional vectors whose entries are all real numbers. represents the norm of . represents the probability, and represents the expectation. We define as .
1.1 Related Work
Efficient Kernel Density Estimation
The naive KDE procedure takes a linear scan over the data points, which is prohibitively expensive for large-scale datasets. Therefore, it is of practical significance to develop efficient KDE algorithms. A line of traditional KDE algorithms, known as kernel merging [23, 6], performs clustering on the dataset so that the KDE is approximated by a weighted combination of centroids. However, these algorithms do not scale to high-dimensional datasets.
On the other hand, there is a trend of sampling-based KDE algorithms. The focus of this line of work is to develop efficient procedures that approximate KDE with fewer data samples. Starting from random sampling [27], sampling procedures such as Herding [17] and -centers [14] have been introduced for KDE. Some work also combines sampling with the coreset concept [29] and provides a KDE algorithm that samples from an optimized subset of data points. Recently, there has been a growing interest in applying hash-based estimators (HBE) [15, 4, 30, 16, 5, 31] to KDE. HBEs use Locality Sensitive Hashing (LSH) functions, for which the collision probability of two vectors is monotone in their distance. Using this property, HBEs perform efficient importance sampling via LSH functions and hash-table-style data structures. However, current HBEs are built for static settings and thus are not robust to incremental changes in the input data. As a result, their application in large-scale online learning is limited. Besides the LSH-based KDE literature, there is also KDE work based on polynomial methods [1, 3]. Dynamic variants of KDE have also been considered in [18, 35]. The work [19] presents both randomized and deterministic algorithms for approximating a symmetric KDE computation.
Adaptive Data Structure
Recently, there has been a growing trend of applying data structures [10, 11, 32, 40, 39, 34, 41, 33, 38, 26] to improve the running time efficiency of machine learning. However, there exists a practice-to-theory gap between data structures and learning algorithms. Most data structures assume queries to be independent and provide theoretical guarantees based on this assumption. In contrast, the queries made to data structures across iterations of a learning algorithm are mutually dependent. As a result, the existing analysis framework for data structures cannot provide guarantees in optimization. To bridge this gap, quantization strategies [32, 40, 34, 33] have been developed for adaptive queries in machine learning. The idea of these approaches is to quantize each query to its nearest vertex on the -net, so that the failure probability of the data structure can be upper bounded by a standard -net argument. Although quantization methods demonstrate their success in machine learning, this direct combination does not fully unleash the power of both data structures and learning algorithms. In our work, we aim at a co-design of data structure and machine learning for efficiency improvements in adaptive KDE.
1.2 Problem Formulation
In this work, we would like to study the following problem.
Definition 1.1 (Dynamic Kernel Density Estimation).
Let denote a kernel function. Let denote a dataset. Let
define the kernel density estimate of a query with respect to . Our goal is to design a data structure that efficiently supports any sequence of the following operations:
• Initialize. The data structure takes kernel function , data set , accuracy parameter and a known quantity satisfying as input for initialization.
• Update. Replace the ’th data point of data set with .
• Query. Output such that
We note that the Query procedure does not assume i.i.d. queries. Instead, we allow adaptive queries and still provide theoretical guarantees.
1.3 Our Result
In this work, we provide theoretical guarantees for the dynamic KDE data structures defined in Definition 1.1. We summarize our main result as below:
Theorem 1.2 (Main result).
Given a function and a set of points . Let be defined as in Definition 2.7. For any accuracy parameter , there is a data structure using space (Algorithms 4, 5 and 6) for the Dynamic Kernel Density Estimation Problem (Definition 1.1) with the following procedures:
• Initialize. Given a kernel function , a dataset , an accuracy parameter and a quantity as input, the data structure DynamicKDE preprocesses in time
• Update. Given a new data point and index , the Update operation takes and as input and updates the data structure in time
• Query. Given a query point , the Query operation takes as input and approximately estimates the kernel density at in time
and output such that:
1.4 Technical Overview
In this section, we give an overview of the techniques that lead to our main result.
Density Constraint. We impose an upper bound on the true kernel density for query . We also introduce geometric level sets so that the number of data points that fall into each level will be upper bounded.
Importance Sampling. To approximate the kernel density efficiently, we adopt the importance sampling technique. We sample each data point with a different probability according to its contribution to the estimation: the higher the contribution, the higher the sampling probability. Then, we construct an unbiased estimator based on the sampled points and the sampling probabilities. The main problem is how to evaluate the contribution of each point to the KDE. We exploit the geometric properties of the kernel function and estimate the contribution of each point based on its distance from the query point.
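To make this concrete, here is a minimal Python sketch of the importance sampling estimator (our own illustration, not the paper's Algorithm 6; the Gaussian kernel and this particular choice of sampling probabilities are assumptions made only for this example). Weighting each kept point by the inverse of its sampling probability makes the estimate unbiased.

```python
import numpy as np

def gaussian_kernel(x, q):
    # illustrative kernel choice
    return np.exp(-np.linalg.norm(x - q) ** 2)

def importance_sampled_kde(X, q, probs, rng):
    """Keep point i with probability probs[i] and weight it by 1/(n * probs[i]);
    the expectation of the returned value is (1/n) * sum_i k(x_i, q)."""
    n = len(X)
    kept = rng.random(n) < probs
    return sum(gaussian_kernel(X[i], q) / (n * probs[i])
               for i in np.flatnonzero(kept))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
q = np.zeros(5)
# higher-contribution (closer) points get higher sampling probability
probs = np.clip([gaussian_kernel(x, q) for x in X], 0.01, 1.0)
print(importance_sampled_kde(X, q, probs, rng))
```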
Locality Sensitive Hashing. One problem with importance sampling is that at preprocessing time we have no access to the query point, so it is impossible to estimate the contribution of each point to a future query. We make use of LSH to address this issue. Specifically, LSH preprocesses the data points and, given a query, finds its near neighbors with high probability. With this property, we first sample all data points in several rounds with geometrically decaying sampling probabilities. We design an LSH for each round that satisfies the following property: given a query, the LSH recovers points whose contributions are proportional to the sampling probability of that round. Then we can find the sampled points together with the proper sampling probabilities when a query arrives.
Dynamic Update. The previous techniques, i.e., importance sampling and LSH, are independent of the coordinate values of any particular data point. This motivates us to support updating data points dynamically. Since LSH is a hash-based structure, given a new data point and an index indicating the data point to be replaced, we search for the bucket where the old point was hashed, replace it with the new data point, and update the hash table. Such an update does not affect the data structure's ability to estimate the kernel density.
Robustness. To make our data structure robust to adaptive queries, we take the following steps. Starting from a constant success probability, we first repeat the estimation several times and take the median, which yields a high-probability guarantee of a correct answer for a fixed point. Then we push this success probability to the net points of a unit ball. Finally, we generalize to all query points in . Thus we have a data structure that is robust to adversarial queries.
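The boosting step is the standard median trick; the sketch below (a toy illustration under the assumption that a single estimator is correct with probability 9/10) shows how repeating and taking the median drives the failure probability down exponentially in the number of repetitions.

```python
import numpy as np

def boosted_estimate(single_estimator, num_repetitions, rng):
    """Run the estimator independently and return the median; if each run is
    correct with probability > 1/2, a Chernoff bound makes the median correct
    with probability 1 - exp(-Omega(num_repetitions))."""
    return np.median([single_estimator(rng) for _ in range(num_repetitions)])

def toy_estimator(rng):
    # correct answer 1.0 with probability 0.9, garbage otherwise
    return 1.0 if rng.random() < 0.9 else 100.0 * rng.random()

rng = np.random.default_rng(1)
print(boosted_estimate(toy_estimator, num_repetitions=31, rng=rng))
```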
Roadmap
The rest of our paper is organized as follows: In Section 2, we describe some basic definitions and lemmas that are frequently used. In Section 3, we list some technical claims for our results. In Section 4, we demonstrate our data structure in detail, including the algorithms and the running time analysis. We analyze the robustness of our data structure against adversarial queries in Section 5. Finally, we draw our conclusion in Section 6.
2 Preliminary
The goal of this section is to introduce the basic definitions and lemmas that will be used to prove our main result.
We first introduce a collection of subsets called geometric weight levels.
Definition 2.1 (Geometric Weight Levels).
Fix and . We define .
For any fix , we define
We define the corresponding distance levels as
where for .
In addition, we define .
Geometric weight levels can be visualized as a sequence of circular rings centered at query . The contribution of each level to kernel density at is mainly determined by the corresponding distance level.
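For intuition, the sketch below computes the level index of a point for a Gaussian kernel, under the assumption (in the spirit of the construction of [9]; not spelled out in the elided formulas above) that level j collects the points whose kernel value with the query lies in (2^{-j}, 2^{-j+1}], and that the remaining points go to a residual level. Since the kernel is radial and decreasing, membership in a level is equivalent to the distance from the query lying in the corresponding distance-level ring.

```python
import numpy as np

def weight_level(x, q, num_levels):
    """Return j such that k(x, q) lies in (2^{-j}, 2^{-j+1}] for the Gaussian
    kernel k(x, q) = exp(-||x - q||^2); kernel values at most 2^{-num_levels}
    are assigned to the residual level num_levels + 1."""
    k = np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(q)) ** 2)
    if k <= 2.0 ** (-num_levels):
        return num_levels + 1
    return int(np.floor(-np.log2(k))) + 1
```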
Next, we introduce the importance sampling technique which is the key idea to accelerate the query procedure.
Definition 2.2 (Importance Sampling).
Given data points , we sample each point with probability and construct estimator as follows: .
To apply importance sampling, we need to evaluate the contribution of each point. We sample each point that has a high contribution with a high probability.
A natural question arises: at preprocessing time, we have no access to the query, so we cannot calculate distances directly. Locality sensitive hashing is a practical tool for addressing this problem.
Definition 2.3 (Locality Sensitive Hashing [24]).
A family is called -sensitive where , if for any :
•
•
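For concreteness, one classical instantiation of such a family is the p-stable (Euclidean) LSH of Datar, Immorlica, Indyk, and Mirrokni. The sketch below is an illustration only and is not the Andoni-Indyk family [2] used in our analysis; concatenating several such functions and keeping many independent hash tables, as in Algorithms 7 and 8, amplifies the gap between the collision probabilities of near and far points.

```python
import numpy as np

class PStableLSH:
    """One hash function h(x) = floor((a . x + b) / w) with Gaussian a;
    nearby points collide with higher probability than far ones."""
    def __init__(self, dim, bucket_width=4.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.a = rng.normal(size=dim)             # 2-stable random projection
        self.b = rng.uniform(0.0, bucket_width)   # random offset
        self.w = bucket_width

    def hash(self, x):
        return int(np.floor((self.a @ x + self.b) / self.w))
```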
The next lemma shows the existence of the LSH family and its evaluation time.
Lemma 2.4 (Lemma 3.2 in page 6 of [2]).
Let . Define
and
Then, if we fix to be positive, we can have a hash family satisfying
for any , where and it requires time for every evaluation.
Remark 2.5.
We set , which results in evaluation time and . Note that if , then
Next, we assign the LSH family to each geometric weight level (Definition 2.1) and show how well these families can distinguish points from different levels.
Lemma 2.6 (probability bound for separating points in different level sets, informal version of Lemma A.5).
Given kernel function and , let be the weight level set and be the corresponding distance level (Definition 2.1). For any query , any integer pair , satisfying , let and . Let
We set up an Andoni-Indyk LSH family (Definition 2.3) with near distance and define
and
Then, for any , the following holds:
1.
2.
This lemma suggests that we can apply LSH several times to separate points in different level sets. It is useful for recovering points in a specific level set when estimating the “importance” of a point based on its distance from the query point. We will discuss this further in Section 4. We use a definition of kernel cost similar to the one in [9].
Definition 2.7 (Kernel cost).
Given a kernel with geometric weight levels ’s and distance levels ’s as defined in Definition 2.1, for any , we define the kernel cost for as
where .
Then we define the general cost of a kernel as
Note that when is the Gaussian kernel, the general cost is [9].
3 Technical Claims
In this section, we list some technical claims that are useful for our main results.
We start by giving an upper bound on sizes of geometric weight levels (Definition 2.1).
Lemma 3.1 (Sizes of geometric weight levels).
Given , we have
Next, we show a lemma for the probability of recovering a point in the query procedure, given that this point is sampled in the preprocessing stage.
Lemma 3.2 (Probability for sampled point recovery).
Suppose that we invoke DynamicKDE.Initialize. Suppose that when and , we sample a point . Given a query , we invoke DynamicKDE.Query. With probability at least , the sampled point is recovered.
With the above lemma, we can bound the number of recovered points in expectation. We show that there are only points recovered by LSH in each geometric weight level (Definition 2.1).
Lemma 3.3 (Upper bound on number of recovered points in expectation).
Fix a query . We define . Fix , we define . For each , we define
There exists
such that for any
Finally, we claim that the KDE function is Lipschitz under mild assumptions on the kernel. This is an important property for designing robust algorithms.
Lemma 3.4 (Lipschitz property of KDE function).
Suppose kernel function satisfies the following properties:
• Radial: there exists a function such that , for all .
• Decreasing: is decreasing.
• Lipschitz: is -Lipschitz.
Then KDE function ,
is -Lipschitz, i.e.
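As a quick numerical sanity check of this lemma (our own illustration using a Gaussian kernel, whose profile is radial, decreasing, and Lipschitz), the finite-difference ratio of the KDE function stays bounded by the Lipschitz constant of the profile:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))

def kde(q):
    # Gaussian kernel: radial, decreasing, Lipschitz profile
    return np.mean(np.exp(-np.linalg.norm(X - q, axis=1) ** 2))

q1 = rng.normal(size=3)
q2 = q1 + 1e-3 * rng.normal(size=3)
ratio = abs(kde(q1) - kde(q2)) / np.linalg.norm(q1 - q2)
print(ratio)  # bounded by the Lipschitz constant of the kernel profile
```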
4 Our Data Structures
The high-level idea of our algorithm is the following. We apply importance sampling (Definition 2.2) to approximate the kernel density at a query point efficiently. We want to sample data points with probability according to their contribution to the estimation. Ideally, given a query point and a data set , we could sample each data point in with probability proportional to the inverse of its distance from the query . Unfortunately, we have no access to the query points at preprocessing time. Hence we make use of LSH (Definition 2.3) to overcome this problem.
In general, given a query , LSH can recover its near points with high probability while the probability of recovering far points is bounded by a small quantity. To apply LSH, we first run
rounds of sampling, in which we sample each data point with probability in the -th round. Then we obtain subsampled data sets. Given a query , we use LSH to recover the points that lie both in level set and in the -th subsampled data set. Hence we obtain the sampled data points and the corresponding sampling rates (in other words, their “importance”). Then we construct the estimator as in Definition 2.2.
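A minimal sketch of the subsampling stage (the geometrically decaying schedule below, halving the rate each round, is an illustrative assumption; the paper's exact rates appear in Algorithm 4):

```python
import numpy as np

def geometric_subsamples(X, num_rounds, rng):
    """Round j keeps each point independently with probability 2^{-(j-1)},
    so round 1 keeps all points. Returns (sampling_rate, subsample) pairs."""
    subsets = []
    for j in range(1, num_rounds + 1):
        rate = 2.0 ** (-(j - 1))
        mask = rng.random(len(X)) < rate
        subsets.append((rate, X[mask]))
    return subsets
```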
Finally, we repeat the estimation process independently and take the median to obtain approximation with high probability.
4.1 LSH Data Structure
In this section, we present our LSH data structure with the following procedures:
Initialize. Given a data set and integral parameters , it first invokes the private procedure ChooseHashFunc. The idea behind this is to amplify the “sensitivity” of hashing by concatenating basic hashing functions from the family (Algorithm 7, line 9) into new functions. Thus we obtain a family of “augmented” hash functions (Algorithm 8, line 7). It then calls ConstructHashTable, in which each point is hashed using the hash functions . We obtain hash tables corresponding to the hash functions, which can be updated quickly.
Recover. Given a query , it finds the bucket where is hashed by and retrieves all the points in the bucket according to hash table . This operation applies to all hash tables.
UpdateHashTable. Given a new data point and index , it repeats the following operations for all : find bucket and insert point ; find bucket and delete point .
Note that the traditional LSH data structure only has the Initialize and Recover procedures. To make it a dynamic structure, we exploit its hash storage. We design the UpdateHashTable procedure so that we can update the hash tables on the fly. This procedure underpins dynamic kernel density estimation.
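A minimal sketch of the bucket bookkeeping behind these procedures (a hypothetical single-table layout; the paper's Algorithm 7 maintains one such table per concatenated hash function): replacing a point touches only the two affected buckets, which is what keeps the update sublinear.

```python
from collections import defaultdict

class DynamicHashTable:
    """One hash table of an LSH instance; points live in the bucket indexed
    by their hash value and can be swapped without rebuilding the table."""
    def __init__(self, hash_fn, points):
        self.hash_fn = hash_fn
        self.points = list(points)
        self.buckets = defaultdict(list)
        for i, x in enumerate(self.points):
            self.buckets[self.hash_fn(x)].append(i)

    def recover(self, q):
        # all stored points that collide with the query
        return [self.points[i] for i in self.buckets[self.hash_fn(q)]]

    def update(self, i, new_point):
        self.buckets[self.hash_fn(self.points[i])].remove(i)  # leave old bucket
        self.points[i] = new_point
        self.buckets[self.hash_fn(new_point)].append(i)       # join new bucket
```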
4.2 Initialize Part of Data Structure
In this section, we present the initialize part of our data structure. We start by analyzing the space storage of LSH and DynamicKDE. Then we state the running time of LSH in Lemma 4.3 and of DynamicKDE in Lemma 4.4. We first show the space storage of the LSH part of our data structure.
Lemma 4.1 (Space storage of LSH, informal version of Lemma C.1).
Given data set , parameter , the Initialize (Algorithm 7) of the data-structure LSH uses space
Using the lemma above, we can prove the total space storage of DynamicKDE structure in the following lemma.
Lemma 4.2 (Space storage part of Theorem 1.2, informal version of Lemma C.2).
The Initialize of the data structure DynamicKDE (Algorithm 4) uses space
For the running time, we again start with the LSH part.
Lemma 4.3 (Upper bound on running time of Initialize of the data-structure LSH, informal version of Lemma C.3 ).
Given input data points , parameters , LSH parameters and kernel , the Initialize of the data-structure LSH (Algorithm 7) runs in time
Having shown the running time of LSH, we now move on to prove the total running time of Initialize in our data structure by combining the above result for the LSH part.
Lemma 4.4 (The initialize part of Theorem 1.2, informal version of Lemma C.4).
Given , the Initialize of the data-structure DynamicKDE (Algorithm 4) runs in time
4.3 Update Part of Data Structure
In this section, we move to the update part of our data structure. We first show how to update the LSH data structure. Then we extend the update procedure to the DynamicKDE structure so that we can prove Lemma 4.6.
Lemma 4.5 (Update time of LSH, informal version of Lemma C.5).
Given a data point and index , the UpdateHashTable of the data-structure LSH (Algorithm 7) runs in (expected) time
The update of the LSH structure is a key part of Update for DynamicKDE. Next, we show the running time of Update for DynamicKDE by combining the above results.
4.4 Query Part of Data Structure
Finally, we come to the query part. The goal of this section is to prove Lemma 4.9, which shows the running time of Query procedure for DynamicKDE.
The running time of Query procedure depends on two parts: the number of recovered points in each weight level and the recovery time of LSH.
We first prove the expected number of recovered points.
Lemma 4.7 (expected number of points in level sets, informal version of Lemma C.7).
Given a query and fix . For any , weight level contributes at most point to the hash bucket of query .
Next, we show the running time for LSH to recover points.
Lemma 4.8 (running time for recovering points given a query, informal version of Lemma C.8).
Given a query and , the Recover of the data-structure LSH runs in (expected) time
Combining the two lemmas above, we can prove the total running time of Query in DynamicKDE structure.
5 Robustness to Adversary
In this section, we will turn the Query algorithm into a robust one. In other words, we want the following thing to happen with high probability: the algorithm responds to all query points correctly. We achieve this goal by taking three steps. We start with constant success probability for the Query procedure, which we have proved in the previous section.
In the first step, we boost this constant probability to a high probability by applying the median technique. We note that the current algorithm succeeds with high probability only for one fixed point but we want it to respond to arbitrary query points correctly.
It is not an easy task to generalize directly from a fixed point to infinitely many points in the whole space. Thus we take an intermediate step by introducing the unit ball and an -net. The unit ball in is the collection of points whose norm is at most . An -net is a finite collection of points, called net points, that has the “covering” property: the union of balls of radius centered at the net points covers the unit ball. In the second step, we show that, given a net of the unit ball, we have correctness on all net points.
Finally, we show the correctness of the algorithm from net points to all points in the whole space. Then we obtain a robust algorithm.
Starting Point. In Section D, we have already obtained a query algorithm with constant success probability for a fixed query point.
Lemma 5.1 (Starting with constant probability).
Given , a query point and a set of data points , let be an estimator that can answer the query and satisfies:
with probability .
Boost the constant probability to high probability.
Next, we begin to boost the success probability by repeating the query procedure and taking the median output.
Lemma 5.2 (Boost the constant probability to high probability).
Let denote the failure probability. Let denote the accuracy parameter. Given estimators . For each fixed query point , the median of queries from estimators satisfies that:
with probability .
From each fixed point to all the net points.
So far, the success probability of our algorithm holds only for a fixed point. We will introduce an -net on the unit ball and show a high success probability for all the net points.
Fact 5.3.
Let denote the -net of . We use to denote the number of points in . Then .
This fact shows that we can bound the size of an -net in terms of the inverse of . We use this fact to determine the number of repetitions we need to obtain the correctness of Query on all net points.
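For completeness, the standard volumetric bound behind this fact (a textbook estimate, stated under the convention that the net covers the unit Euclidean ball in dimensions) is
\[
  |N_\epsilon| \le \Big(1 + \frac{2}{\epsilon}\Big)^{d} \le \Big(\frac{3}{\epsilon}\Big)^{d}
  \quad \text{for } \epsilon \in (0,1],
  \qquad \text{so} \qquad
  \log |N_\epsilon| = O\big(d \log(1/\epsilon)\big).
\]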
Lemma 5.4 (From each fixed point to all the net points).
Let denote the -net of . We use to denote the number of points in . Given estimators . With probability , we have: for all , the median of queries from estimators satisfies that:
From net points to all points.
With Lemma 5.4, we are ready to extend the correctness for net points to the whole unit ball. We demonstrate that all query points can be answered approximately with high probability in the following lemma.
Lemma 5.5 (From net points to all points).
Let . Let . Let . Let . Given estimators , with probability , for all query points , we have the median of queries from estimators satisfies that:
where is the closest net point of .
Thus, we obtain an algorithm that could respond to adversary queries robustly.
6 Conclusion
Kernel density estimation is an important problem in machine learning. It has wide applications in similarity search and nearest neighbor clustering. Meanwhile, in many modern scenarios, input data can change over time, and queries can be provided by adversaries. In these scenarios, we need to build adaptive data structures such that incremental changes in the input data do not require our data structure to go through costly re-initialization. Also, queries provided by adversaries do not reduce the accuracy of the estimation. We call this problem the adaptive kernel density estimation.
We present the first such adaptive data structure design for kernel density estimation. Our data structure is efficient: it requires only subquadratic space, each update to the input data requires only sublinear time, and each query finishes in sublinear time. We note that the trade-off between efficiency and accuracy persists in our algorithms: gaining speed typically requires a slight compromise in estimation quality. Nevertheless, we believe that our data structures push the boundary of this trade-off.
7 Impact Statement
We note that implementing and running our algorithm could contribute to carbon emissions. That said, by focusing our efforts on theoretical analysis, we aim to reduce the need for repeated experiments.
References
- ACSS [20] Josh Alman, Timothy Chu, Aaron Schild, and Zhao Song. Algorithms and hardness for linear algebra on geometric graphs. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 541–552. IEEE, 2020.
- AI [06] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 459–468. IEEE, 2006.
- AS [23] Josh Alman and Zhao Song. Fast attention requires bounded entries. arXiv preprint arXiv:2302.13214, 2023.
- BIW [19] Arturs Backurs, Piotr Indyk, and Tal Wagner. Space and time efficient kernel density estimation in high dimensions. Advances in Neural Information Processing Systems, 32, 2019.
- CGC+ [22] Benjamin Coleman, Benito Geordie, Li Chou, RA Leo Elworth, Todd Treangen, and Anshumali Shrivastava. One-pass diversified sampling with application to terabyte-scale genomic sequence streams. In International Conference on Machine Learning, pages 4202–4218. PMLR, 2022.
- CHM [12] Yuan Cao, Haibo He, and Hong Man. Somke: Kernel density estimation over data streams by sequences of self-organizing maps. IEEE transactions on neural networks and learning systems, 23(8):1254–1268, 2012.
- CHS [20] Jaewoong Cho, Gyeongjo Hwang, and Changho Suh. A fair classifier using kernel density estimation. Advances in Neural Information Processing Systems, 33:15088–15099, 2020.
- CIU+ [21] Tsz Nam Chan, Pak Lon Ip, Leong Hou U, Byron Choi, and Jianliang Xu. Sws: a complexity-optimized solution for spatial-temporal kernel density visualization. Proceedings of the VLDB Endowment, 15(4):814–827, 2021.
- CKNS [20] Moses Charikar, Michael Kapralov, Navid Nouri, and Paris Siminelakis. Kernel density estimation through density constrained near neighbor search. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 172–183. IEEE, 2020.
- CLP+ [20] Beidi Chen, Zichang Liu, Binghui Peng, Zhaozhuo Xu, Jonathan Lingjie Li, Tri Dao, Zhao Song, Anshumali Shrivastava, and Christopher Re. Mongoose: A learnable lsh framework for efficient neural network training. In International Conference on Learning Representations, 2020.
- CMF+ [20] Beidi Chen, Tharun Medini, James Farwell, Charlie Tai, Anshumali Shrivastava, et al. Slide: In defense of smart algorithms over hardware acceleration for large-scale deep learning systems. Proceedings of Machine Learning and Systems, 2:291–306, 2020.
- Cra [01] Kyle Cranmer. Kernel estimation in high-energy physics. Computer Physics Communications, 136(3):198–207, 2001.
- CRV+ [18] Danielle L Cantrell, Erin E Rees, Raphael Vanderstichel, Jon Grant, Ramón Filgueira, and Crawford W Revie. The use of kernel density estimation with a bio-physical model provides a method to quantify connectivity among salmon farms: spatial planning and management with epidemiological relevance. Frontiers in Veterinary Science, page 269, 2018.
- CS [16] Efren Cruz Cortes and Clayton Scott. Sparse approximation of a kernel mean. IEEE Transactions on Signal Processing, 65(5):1310–1323, 2016.
- CS [17] Moses Charikar and Paris Siminelakis. Hashing-based-estimators for kernel density in high dimensions. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 1032–1043. IEEE, 2017.
- CS [20] Benjamin Coleman and Anshumali Shrivastava. Sub-linear race sketches for approximate kernel density estimation on streaming data. In Proceedings of The Web Conference 2020, pages 1739–1749, 2020.
- CWS [10] Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 109–116, 2010.
- DJS+ [22] Yichuan Deng, Wenyu Jin, Zhao Song, Xiaorui Sun, and Omri Weinstein. Dynamic kernel sparsifiers. arXiv preprint arXiv:2211.14825, 2022.
- DMS [23] Yichuan Deng, Sridhar Mahadevan, and Zhao Song. Randomized and deterministic attention sparsification algorithms for over-parameterized feature dimension. arXiv preprint arXiv:2304.04397, 2023.
- FC [17] Christen H Fleming and Justin M Calabrese. A new kernel density estimator for accurate home-range and species-range area estimation. Methods in Ecology and Evolution, 8(5):571–579, 2017.
- HIK+ [21] Anna Hallin, Joshua Isaacson, Gregor Kasieczka, Claudius Krause, Benjamin Nachman, Tobias Quadfasel, Matthias Schlaffer, David Shih, and Manuel Sommerhalder. Classifying anomalies through outer density estimation (cathode). arXiv preprint arXiv:2109.00546, 2021.
- HL [18] Yaoyao He and Haiyan Li. Probability density forecasting of wind power using quantile regression neural network and kernel density estimation. Energy conversion and management, 164:374–384, 2018.
- HS [08] Christoph Heinz and Bernhard Seeger. Cluster kernels: Resource-aware kernel density estimators over streaming data. IEEE Transactions on Knowledge and Data Engineering, 20(7):880–893, 2008.
- IM [98] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613, 1998.
- KAP [22] Matti Karppa, Martin Aumüller, and Rasmus Pagh. Deann: Speeding up kernel-density estimation using approximate nearest neighbor search. In International Conference on Artificial Intelligence and Statistics, pages 3108–3137. PMLR, 2022.
- LWZ+ [23] Zirui Liu, Guanchu Wang, Shaochen Zhong, Zhaozhuo Xu, Daochen Zha, Ruixiang Tang, Zhimeng Jiang, Kaixiong Zhou, Vipin Chaudhary, Shuai Xu, et al. Winner-take-all column row sampling for memory efficient adaptation of language model. arXiv preprint arXiv:2305.15265, 2023.
- MFS+ [17] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
- MRL [95] Young-Il Moon, Balaji Rajagopalan, and Upmanu Lall. Estimation of mutual information using kernel density estimators. Physical Review E, 52(3):2318, 1995.
- PT [20] Jeff M Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. Discrete & Computational Geometry, 63(4):867–887, 2020.
- SRB+ [19] Paris Siminelakis, Kexin Rong, Peter Bailis, Moses Charikar, and Philip Levis. Rehashing kernel evaluation in high dimensions. In International Conference on Machine Learning, pages 5789–5798. PMLR, 2019.
- SS [21] Ryan Spring and Anshumali Shrivastava. Mutual information estimation using lsh sampling. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 2807–2815, 2021.
- SSX [21] Anshumali Shrivastava, Zhao Song, and Zhaozhuo Xu. Sublinear least-squares value iteration via locality sensitive hashing. arXiv preprint arXiv:2105.08285, 2021.
- SXYZ [22] Zhao Song, Zhaozhuo Xu, Yuanyuan Yang, and Lichen Zhang. Accelerating frank-wolfe algorithm using low-dimensional and adaptive data structures. arXiv preprint arXiv:2207.09002, 2022.
- SXZ [22] Zhao Song, Zhaozhuo Xu, and Lichen Zhang. Speeding up sparsification using inner product search data structures. arXiv preprint arXiv:2204.03209, 2022.
- vdBSZ [23] Jan van den Brand, Zhao Song, and Tianyi Zhou. Algorithm and hardness for dynamic attention maintenance in large language models. arXiv e-prints, pages arXiv–2304, 2023.
- WHM+ [09] Meng Wang, Xian-Sheng Hua, Tao Mei, Richang Hong, Guojun Qi, Yan Song, and Li-Rong Dai. Semi-supervised kernel density estimation for video annotation. Computer Vision and Image Understanding, 113(3):384–396, 2009.
- WLG [19] Qin Wang, Wen Li, and Luc Van Gool. Semi-supervised learning by augmented distribution alignment. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1466–1475, 2019.
- WXW+ [22] Zhuang Wang, Zhaozhuo Xu, Xinyu Wu, Anshumali Shrivastava, and TS Eugene Ng. Dragonn: Distributed randomized approximate gradients of neural networks. In International Conference on Machine Learning, pages 23274–23291. PMLR, 2022.
- XCL+ [21] Zhaozhuo Xu, Beidi Chen, Chaojian Li, Weiyang Liu, Le Song, Yingyan Lin, and Anshumali Shrivastava. Locality sensitive teaching. Advances in Neural Information Processing Systems, 34, 2021.
- XSS [21] Zhaozhuo Xu, Zhao Song, and Anshumali Shrivastava. Breaking the linear iteration cost barrier for some well-known conditional gradient methods using maxip data-structures. Advances in Neural Information Processing Systems, 34, 2021.
- Zha [22] Lichen Zhang. Speeding up optimizations via data structures: Faster search, sample and maintenance. Master’s thesis, Carnegie Mellon University, 2022.
Appendix
Roadmap.
Section A presents the basic definitions and lemmas. Section B presents the technical claims for the proof of our main result. Section C consists of four subsections: Initialize part, Update part, Query part, and LSH part of our data structure. Each section presents the corresponding algorithm and proof of running time. We provide a proof sketch for correctness in Section D. Section E presents the detailed proof of the correctness of our data structure. Section F presents how to change our algorithm into an adaptive algorithm. Section G presents the Lipschitz property of the kernel function.
Appendix A Preliminary
The goal of this subsection is to introduce some basic Definitions (Section A.1) and Lemmas (Section A.2) that will be used to prove the main result.
A.1 Definitions
We start by recalling the definition of geometric weight level.
Definition A.1 (Restatement of Definition 2.1 Geometric Weight Levels).
Fix and . We define
For any fix , we define
We define the corresponding distance levels as
where for .
In addition, we define
We restate the definition and some properties of Locality Sensitive Hashing.
Definition A.2 (Restatement of Definition 2.3, Locality Sensitive Hashing).
A family is called -sensitive where , if for any :
•
•
A.2 Lemmas
Lemma A.3 (Lemma 3.2 in page 6 of [2]).
Let . For a fixed , there is a hash family such that, if and , then
for any , where and it requires time to evaluate.
Remark A.4.
We find an upper bound for our definition of and for the evaluation time of hashing. For the rest of the paper, we denote . Thus we obtain evaluation time and . Since , we have
In the next lemma, we show the existence of an LSH family that can separate the near points and the far points from the query with high probability.
Lemma A.5 (probability bound for separating points in different level sets, formal version of Lemma 2.6).
Given kernel function , we have corresponding weight level sets ’s and distance levels ’s (Definition 2.1). Given query and integer , satisfying , let , . We set up an Andoni-Indyk LSH family (Definition 2.3) with near distance .
We define
Then the following inequalities hold for any integer
1.
2.
Appendix B Technical Claims
In Section B.1, we show how to bound the size of the geometric weight levels. In Section B.2, we show the probability of recovering a sampled point. In Section B.3, we give a bound on the expected number of recovered points.
B.1 Size of geometric weight levels
The goal of this section is to prove Lemma B.1.
Lemma B.1 (Sizes of geometric weight levels).
Given , we have
B.2 Probability for sampled point recovery
The goal of this section is to prove Lemma B.2.
Lemma B.2 (Probability for sampled point recovery).
Suppose that we invoke DynamicKDE.Initialize. Suppose that when and , we sample a point . Given a query , we invoke DynamicKDE.Query. With probability at least , the sampled point is recovered.
B.3 Number of recovered points in expectation
The goal of this section is to prove Lemma B.3.
Lemma B.3 (Upper bound on number of recovered points in expectation).
Fix a query . We define . Fix , we define . For each , we define . There exists
such that for any
Proof.
By Lemma A.5 we have
where (Remark 2.5).
where the first step follows from Lemma B.1 and the sampling probability (Algorithm 4, line 25), and the second step follows from Lemma 2.6.
Note that for , we have
(4)
Then
where the first step follows from rewriting in exponential form, the second step follows from Eq. (4), the third step follows from canceling the same term in both numerator and denominator, and the last step follows from canceling and .
Thus, we complete the proof. ∎
Appendix C Our Data Structures
In this section, we describe our data structures in detail. Starting with the initialize part in Section C.1, we state the result of space storage and running time for Initialize in DynamicKDE. In Section C.2, we demonstrate the running time for the update part in our data structure. Section C.3 presents the running time for the query procedure. Finally, we study the LSH data structure in Section C.4. It is an important member in the DynamicKDE structure and fundamental to the implementation of all three procedures above.
C.1 Initialize part of data structure
In this section, we describe the space storage and running time of Initialize part of our data structure DynamicKDE.
We start by showing the space storage of LSH structure.
Lemma C.1 (Space storage of LSH, formal version of Lemma 4.1).
Given data set , parameter , the Initialize (Algorithm 7) of the data-structure LSH uses space
Proof.
The space storage comes from two parts: ChooseHashFunc and ConstructHashTable.
Part 1. ChooseHashFunc (line 4) takes as input.
It has a for loop with iterations.
In each iteration, it samples functions (line 7) from hash family to create , which uses space.
Thus the total space usage of ChooseHashFunc is .
Part 2. ConstructHashTable (line 11) takes data set and parameter as input.
It has two nested for loops.
• The first for loop repeats iterations.
• The second for loop repeats iterations.
Thus the total space storage of ConstructHashTable is .
The final space storage of Initialize is
Thus, we complete the proof.
∎
Using the above lemma, we state the space storage of our DynamicKDE structure.
Lemma C.2 (Space storage part of Theorem 1.2, formal version of Lemma 4.2).
The Initialize of the data structure DynamicKDE (Algorithm 4) uses space
Proof.
The space storage mainly comes from copies of .
Now let’s consider the space storage of . By Lemma 4.1, we replace with (line 26), (line 24), (line 22) respectively. We have and . Thus the total space usage of is
(5)
(6)
The total space storage of Initialize of the data structure DynamicKDE is
where the first step follows from Eq. (5), the last step follows from and .
Thus, we complete the proof. ∎
Next, we show an upper bound on running time for Initilize in LSH data structure.
Lemma C.3 (Upper bound on running time of Initialize of the data-structure LSH, formal version of Lemma 4.3 ).
Given input data points , parameters , LSH parameters and kernel , the Initialize of the data-structure LSH (Algorithm 7) runs in time
Proof.
This procedure consists of three parts
Part 1. We invoke the ChooseHashFunc procedure with parameters (line 15). The ChooseHashFunc procedure has one for loop with L iterations.
Now let's consider the running time of line 7, which is the running time of each iteration. In line 7, we sample hash functions from hash family , which takes time.
Thus the total running time for Part 1 is
Part 2. We invoke the ConstructHashTable procedure with data set . This procedure has two nested for loops.
• The first loop repeats iterations.
• The second loop repeats iterations.
The total running time for Part 2 is
Putting it all together. We prove that the Initialize of the data-structure LSH (Algorithm 7) runs in time
Thus, we complete the proof.
∎
Combining the results above, we can demonstrate the running time of Initialize in DynamicKDE in the following lemma.
Lemma C.4 (The initialize part of Theorem 1.2, formal version of Lemma 4.4).
Given , the Initialize of the data-structure DynamicKDE (Algorithm 4) runs in time
Proof.
The Initialize procedure has two nested for loops.
• The first loop repeats iterations.
• The second loop repeats iterations.
Now let’s consider the running time from line 20 to line 27, which is the running time of the inner loop.
•
• line 26 takes time.
Thus the running time of the inner loop is
where the first step follows from , and , the second step follows from .
The final running time for Initialize procedure is
where we use and .
Thus, we complete the proof. ∎
C.2 Update part of data structure
The goal of this section is to prove Lemma 4.6. Our Lemma C.6 in this section is the formal version of Lemma 4.6. We present an auxiliary Lemma C.5 and then show how to use this auxiliary lemma to prove Lemma C.6.
Lemma C.5 (Update time of LSH, formal version of Lemma 4.5).
Given a data point and index , the UpdateHashTable of the data-structure LSH runs in (expected) time
Proof.
This procedure has one for loop which repeats iterations. Now let us consider the running time from line 28 to line 29, which is the time for each iteration.
The final running time
where we use and .
Thus, we complete the proof.
∎
Lemma C.6 (The update part of Theorem 1.2, formal version of Lemma 4.6).
Given an update and index , the Update of the data-structure DynamicKDE (Algorithm 5) runs in (expected) time
Proof.
This algorithm has two nested for loops.
• The first loop repeats iterations.
• The second loop repeats iterations.
Now let’s consider the running time in line 6, which is the time for each inner loop.
The final running time
where we use and .
Thus, we complete the proof. ∎
C.3 Query part of data structure
The goal of this section is to prove Lemma C.9. Our Algorithm 6 is for querying the approximated kernel density at point , and Lemma C.9 specifies the running time for the query operation. In order to prove this lemma, we list and prove a few auxiliary lemmas.
We start by showing a lemma that states the expected number of points in each level set.
Lemma C.7 (expected number of points in level sets, formal version of Lemma 4.7).
Given a query and fix . For any , weight level contributes at most point to the hash bucket of query .
Proof.
We consider 2 cases:
Case 1. : By Lemma B.1, we have . In the ’th phase, we sample each point in the whole data set with probability to obtain a subset (Algorithm 4, line 26). Then
where the first step follows from sampling probability , the second step follows from , the third step follows from canceling , the last step follows from .
Thus, there is at most sampled point from in expectation.
Then contributes at most point in the bucket of query in expectation.
Case 2. : By Lemma 2.6, we have . The sampling rate in the ’th phase is (Algorithm 4, line 26). Then there are at most sampled points from in expectation. We set up the LSH function such that the near distance is (Definition 2.1). Also, we use (Algorithm 4, line 22)
as the number of concatenations. By Lemma B.3, contributes at most point in the bucket of query in expectation.
The total expected number of points that contributes to the hash bucket of is the sum of the contributions from Case 1 and Case 2.
Thus, we complete the proof. ∎
Next, we present the running time of Recover procedure in the LSH data structure, which is an important part of Query procedure in DynamicKDE.
Lemma C.8 (running time for recovering points given a query, formal version of Lemma 4.8).
Given a query and , the Recover of the data-structure LSH runs in (expected) time
Proof.
The procedure has one for loop with iterations. In each iteration, the running time consists of two parts
• The evaluation of takes time.
• The Retrieve operation takes time. By Lemma C.7,
The running time of one iteration is
The final running time of this procedure is .
Thus, we complete the proof. ∎
Based on the running time of Recover in LSH above, we prove the running time of Query procedure in DynamicKDE.
Lemma C.9 (Query part of Theorem 1.2, formal version of Lemma 4.9).
Given a query , the Query of the data-structure DynamicKDE (Algorithm 6) runs in (expected) time
Proof.
First, the algorithm runs a for loop with iterations.
In each iteration, the running time consists of three parts.
Part 1. The running time from line 5 to line 8, which is a for loop with iterations. In each iteration, the running time comes from
•
•
Thus the running time of this for loop is , where we use .
Part 2. The running time from line 11 to line 18, which is a for loop with iterations. In each iteration, the running time is . By Lemma B.3, . Thus the running time of this for loop is , where we use .
The final running time of Query is
where the first step follows directly from the running time of three parts, and the last step follows from
Thus, we complete the proof.
∎
C.4 LSH data structure
In this section, we present the LSH data structures with the following procedures:
Initialize. Given a data set and integral parameters , it first invokes the private procedure ChooseHashFunc. The idea behind this is to amplify the “sensitivity” of hashing by concatenating basic hashing functions from the family (Algorithm 7, line 9) into new functions. Thus we obtain a family of “augmented” hash functions (Algorithm 8, line 7). It then calls ConstructHashTable, in which each point is hashed using the hash functions . We obtain hash tables corresponding to the hash functions, which can be updated quickly.
Recover. Given a query , it finds the bucket where is hashed by and retrieves all the points in the bucket according to hash table . This operation applies to all hash tables.
UpdateHashTable. Given a new data point and index , it repeats the following operations for all : find bucket and insert point ; find bucket and delete point .
Next, we provide a private procedure of LSH in Algorithm 8.
Appendix D Correctness: A Brief Summary
In this section, we briefly discuss the correctness of the algorithm from the aspects of data independence (Section D.1), unbiasedness (Section D.2), and variance (Section D.3).
D.1 Data Independence of Initialize and Update.
We will show that procedure Initialize is independent of the input data. Hence, it is convenient for procedure Update to change only a small part of the stored data, without re-initializing the whole data structure. For Initialize, it first samples data points by doing rounds of Bernoulli trials on each data point (Algorithm 4, line 26). Then it invokes LSH.Initialize to hash the sampled data points into some buckets, which will be stored into an LSH instance (Algorithm 4, Line 6). For Update, it finds the bucket where the old data point is hashed (Algorithm 7, Line 26 and Line 28) and replaces it with the new one. Thus procedure Update maintains the initialized structure.
D.2 Unbiasedness of Query
We now present the unbiasedness of the estimator (up to some inverse polynomial error).
D.3 Variance Bound for Query
We present the variance bound of our estimator.
Lemma D.2 (Variance bound for Query, informal version of Lemma E.2).
Given , , and ,
Query can output a -factor approximation to .
Appendix E Correctness: Details
The goal of this section is to prove the correctness of our algorithms. In Section E.1, we provide the proof of unbiasedness for the query. In Section E.2, we provide the proof of variance bound for the query.
E.1 Unbiasedness of Query
In this section, we prove that the estimator returned by Query is unbiased.
Lemma E.1 (unbiasedness of the Query, formal version of Lemma D.1).
For every , every , every , every , the estimator for any constructed in line 19 of Algorithm 6 satisfies the following:
Proof.
Let be the event that all the points are sampled. Let (see line 19 of Algorithm 6). By Lemma B.2 and a union bound, we have
Thus we obtain and , where is defined to be the event that point gets sampled and recovered in the phase corresponding to its weight level, and is defined to be the complementary event. Thus
∎
E.2 Variance Bound for Query
The goal of this section is to prove the variance bound of the estimator.
Lemma E.2 (Variance bound for Query, formal version of Lemma D.2).
Proof.
First, we have , where equality holds when all the points are sampled and recovered in the phase of their weight levels. By Lemma D.1, we have
Also, we have
Then, we have
where the first step follows from the definition of , the second step follows from expanding the square of the summation, the third step follows from , the fourth step follows from and , and the fifth step follows from and and .
Now, we repeat this process times with constant . Then, we have a -factor approximation with higher success probability. We show that if we repeat the procedure times and take the average, denoted , we have:
where the first step follows from , the second step follows from , the third step follows from Markov inequality and the last step follows from and .
We can repeat this times to upper bound the failure probability by and then take the median of the means, where we assume .
∎
Appendix F Adversary
In this section, we provide the detailed proofs for the lemmas in Section 5.
Starting Point. In Section D, we have already obtained a query algorithm with constant success probability for a fixed query point.
Lemma F.1 (Starting with constant probability, restatement of Lemma 5.1).
Given , a query point and a set of data points , let be an estimator that can answer the query and satisfies:
with probability .
Proof.
Boost the constant probability to high probability.
Next, we begin to boost the success probability by repeating the query procedure and taking the median output.
Lemma F.2 (Boost the constant probability to high probability, restatement of Lemma 5.2).
Let denote the failure probability. Let denote the accuracy parameter. Given estimators . For each fixed query point , the median of queries from estimators satisfies that:
with probability .
Proof.
From the Chernoff bound, we know that the median of queries from satisfies:
with probability .
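Concretely, the Chernoff step can be written as follows (under the illustrative assumption that each independent estimator fails with probability at most 1/10): the median can be wrong only if at least half of the estimators fail, hence
\[
  \Pr\Big[\sum_{i=1}^{R} \mathbf{1}\{\text{estimator $i$ fails}\} \ge R/2\Big]
  \le e^{-cR} \le \delta
  \quad \text{for } R = O(\log(1/\delta)),
\]
for some absolute constant $c > 0$.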
Therefore, we complete the proof. ∎
From each fixed point to all the net points.
So far, the success probability of our algorithm holds only for a fixed point. We will introduce an -net on the unit ball and show the high success probability for all the net points.
Fact F.3.
Let denote the -net of . We use to denote the number of points in . Then .
This fact shows that we can bound the size of an -net with an inverse of . We use this fact to conclude the number of repetitions we need to obtain the correctness of Query on all net points.
Lemma F.4 (From each fixed point to all the net points, restatement of Lemma 5.4).
Let denote the -net of . We use to denote the number of points in . Given estimators .
With probability , we have: for all , the median of queries from estimators satisfies that:
Proof.
There are points on the dimension -net when . From Lemma F.2, we know that for each query point on , we have:
with probability .
By a union bound over all points on , we have:
with probability . ∎
From net points to all points.
With Lemma F.4, we are ready to extend the correctness for net points to the whole unit ball. We demonstrate that all query points can be answered approximately with high probability in the following lemma.
Lemma F.5 (From net points to all points, restatement of Lemma 5.5 ).
Let . Let . Let . Let . Given estimators , with probability , for all query points , we have the median of queries from estimators satisfies that:
where is the closest net point of .
Proof.
We define an event to be the following,
Using Lemma F.4 with , we know that
We condition on the above event holding. (Then the remaining proof does not depend on any randomness; “for each ” and “for all ” become the same.)
For each point , there exists a such that
(7)
For each , we know
where the first step follows from the Lipschitz property, the second step follows from Eq. (7), and the third step follows from .
Using , we have
Rescaling the completes the proof. ∎
Thus, we obtain an algorithm that could respond to adversary queries robustly.
Appendix G Lipschitz
The goal of this section is to prove the Lipschitz property of the KDE function.
G.1 Lipschitz property of KDE function
Lemma G.1.
Suppose kernel function satisfies the following properties:
• Radial: there exists a function such that , for all .
• Decreasing: is decreasing.
• Lipschitz: is -Lipschitz.
Then KDE function , is -Lipschitz, i.e.
Proof.
For any , we have:
where the first step follows from the definition of , the second step follows from the triangle inequality for the absolute value, the third step follows from the property of the radial kernel, the fourth step follows from the Lipschitz property of , the fifth step follows from the triangle inequality for the norm, and the last step follows from canceling .
Thus, we complete the proof. ∎