The SpaceSaving± Family of Algorithms for Data Streams with Bounded Deletions

Fuheng Zhao UC Santa Barbara [email protected] Divyakant Agrawal UC Santa Barbara [email protected] Amr El Abbadi UC Santa Barbara [email protected] Claire Mathieu CNRS and IRIF [email protected] Ahmed Metwally Uber Inc. [email protected]  and  Michel de Rougemont University Paris II and IRIF [email protected]
Abstract.

In this paper, we present an advanced analysis of near-optimal algorithms that use limited space to solve the frequency estimation, heavy hitters, frequent items, and top-k approximation problems in the bounded deletion model. We define the family of SpaceSaving± algorithms and explain why the original SpaceSaving± algorithm only works when insertions and deletions are not interleaved. Next, we propose the new Double SpaceSaving±, Unbiased Double SpaceSaving±, and Integrated SpaceSaving± algorithms and prove their correctness. The three proposed algorithms represent different trade-offs: Double SpaceSaving± can be extended to provide unbiased estimations, while Integrated SpaceSaving± uses less space. Since data streams are often skewed, we present an improved analysis of these algorithms and show that their errors do not depend on the hot items. We also demonstrate how to achieve relative error guarantees under mild assumptions. Moreover, we establish that the important mergeability property is satisfied by all three algorithms, which is essential for running them in distributed settings.

Data Mining, Streaming Algorithms, Data Summary, Data Sketch, Frequency Estimation, Heavy Hitters, Frequent Items, Top-K.

1. Introduction

Streaming algorithms (Cormode and Yi, 2020; Han et al., 2006) aim to process a stream of data in a single pass with small space. In particular, for a stream of updates of items, these algorithms often only store a compact summary which uses space at most logarithmic in the data stream length or in the cardinality of the input domain. The streaming paradigm provides essential analysis and statistical measurements with strong accuracy guarantees for many big data applications (Agrawal et al., 2011). For instance, HyperLogLog (Flajolet et al., 2007) answers queries about cardinality; Bloom Filters (Bloom, 1970) answer membership queries; KLL (Karnin et al., 2016; Ivkin et al., 2019; Zhao et al., 2021b) gives accurate quantile approximations, such as finding the median. These data summaries (sketches) are now the cornerstones of many real-world systems and applications, including network monitoring (Yang et al., 2018; Zhao et al., 2023b; Xu et al., 2023), storage engines (Dayan et al., 2017; Zhao et al., 2023c), data mining (Wang et al., 2019; Li et al., 2020), federated learning (Rothchild et al., 2020; Jiang et al., 2018), privacy-preserving analytics (Chen et al., 2023; Zhao et al., 2022; Lebeda and Tetek, 2023; Pagh and Thorup, 2022; Wang et al., 2024), vector search (Broder, 1997; Bruch et al., 2023), and many more. We refer the reader to a recent paper (Cormode, 2023) for a more thorough discussion of the applications.

In this paper, we focus on the frequency estimation and heavy hitters problems in the presence of deletions. Given a data stream $\sigma$ from some universe $U$, the Frequency Estimation problem constructs a summary providing, for every item $x$ in $U$, an estimate $\hat{f}(x)$ of the true frequency $f(x)$. Since resources are limited (small memory footprint), the data summary cannot track every item and hence there is an inherent trade-off between the space used and the accuracy of the estimate. In the closely related approximate Heavy Hitters (Frequent Items) problem, there is an additional input frequency threshold $T$, and the goal is to identify all items with frequency larger than $T$. This applies to settings where data scientists and system operators are interested in knowing which items are hot (appear very often). Frequency estimation, heavy hitters, and approximate top-k have found many applications in big data systems (Bailis et al., 2017; Zakhary et al., 2021; Liu et al., 2023; Chiosa et al., 2021).

| Algorithm | Space | Model | Error Bound | Note |
| SpaceSaving & MG (Metwally et al., 2005; Misra and Gries, 1982; Berinde et al., 2010; Mathieu and de Rougemont, 2023) | $O(\frac{1}{\epsilon})$ | Insertion-Only | $|f_{i}-\hat{f}_{i}|\leq\epsilon F_{1}$ | Lemma 2.3 |
| | $O(\frac{k}{\epsilon})$ | Insertion-Only | $|f_{i}-\hat{f}_{i}|\leq\frac{\epsilon}{k}F_{1}^{res(k)}$ | Lemma 2.4 |
| | $O(\frac{k}{\epsilon}\frac{2-\gamma}{2(\gamma-1)})$ ($\gamma$-decreasing) | Insertion-Only | $|f_{i}-\hat{f}_{i}|\leq\epsilon f_{i},\forall i\leq k$ | Lemma 2.6 |
| Count-Min (Cormode and Muthukrishnan, 2005) | $O(\frac{k}{\epsilon}\log n)$ | Turnstile | $|f_{i}-\hat{f}_{i}|\leq\frac{\epsilon}{k}F_{1}^{res(k)}$ | Never underestimates |
| CountSketch (Charikar et al., 2002) | $O(\frac{k}{\epsilon}\log n)$ | Turnstile | $(f_{i}-\hat{f}_{i})^{2}\leq\frac{\epsilon}{k}F_{2}^{res(k)}$ | Unbiased |
| DoubleSS± & IntegratedSS± | $O(\frac{\alpha}{\epsilon})$ | Bounded Deletion | $|f_{i}-\hat{f}_{i}|\leq\epsilon F_{1}$ | Theorems 3.1, 3.9 |
| Unbiased DoubleSS± | $O(\frac{\alpha}{\epsilon})$ | Bounded Deletion | unbiased & Var $<\epsilon^{2}F_{1}^{2}$ | Theorem 3.3 |
| DoubleSS± & IntegratedSS± | $O(\frac{\alpha k}{\epsilon})$ | Bounded Deletion | $|f_{i}-\hat{f}_{i}|\leq\frac{\epsilon}{k}F_{1,\alpha}^{res(k)}$ | Theorems 4.1, 4.3 |
| DoubleSS± & IntegratedSS± | $O(\frac{\alpha k}{\epsilon}\frac{(2-\gamma)k^{\beta}}{2(\gamma-1)2^{\log_{\gamma}k}})$ ($\gamma$-decreasing and $\beta$-Zipf) | Bounded Deletion | $|f_{i}-\hat{f}_{i}|\leq\epsilon f_{i},\forall i\leq k$ | Theorems 4.6, 4.7 |
Table 1. Comparison between different frequency estimation and approximate heavy hitters algorithms.

1.1. Related Work

There is a large body of algorithms proposed to solve the frequency estimation and heavy hitters problems (Misra and Gries, 1982; Metwally et al., 2005; Charikar et al., 2002; Cormode and Muthukrishnan, 2005; Karp et al., 2003; Demaine et al., 2002; Braverman et al., 2017; Jayaram and Woodruff, 2018; Zhao et al., 2021a, 2022; Manku and Motwani, 2002). The existing algorithms can be classified into three models: insertion-only, turnstile, and bounded-deletion. In the insertion-only model, all arriving stream items are insertions and the frequency $f(x)$ of $x$ is the number of times $x$ has been inserted. SpaceSaving (Metwally et al., 2005) and MG (Misra and Gries, 1982) are two popular algorithms that operate in the insertion-only model; they provide strong deterministic guarantees with minimal space. Moreover, since these algorithms are deterministic, they are inherently adversarially robust (Hasidim et al., 2020; Ben-Eliezer et al., 2022; Cohen et al., 2022). In the turnstile model, arriving stream items can be either insertions or deletions and the (net) frequency $f(x)$ of $x$ is the difference between the number of times $x$ has been inserted and deleted, with the proviso that no item's frequency is negative. Count-Min (Cormode and Muthukrishnan, 2005) and CountSketch (Charikar et al., 2002) are two popular algorithms that operate in the turnstile model. Supporting deletions is a harder task, and as a result Count-Min and CountSketch require more space, with a poly-logarithmic dependence on the input domain size, to offer strong error bounds with high probability (Alon et al., 1996). The bounded deletion model (Jayaram and Woodruff, 2018; Zhao et al., 2021b, a) assumes that the number of deletions is bounded by a fraction of the number of insertions. For instance, in standardized testing, if students are allowed to submit only one regrade request per test, then, since updating a student's grade is a deletion followed by an insertion, the number of deletions is at most half the number of insertions.

Recently, Zhao et al. (Zhao et al., 2021a) proposed SpaceSaving±, an algorithm extending the original SpaceSaving (Metwally et al., 2005) to handle bounded deletions efficiently, (implicitly) assuming that the data stream has no interleaving between insertions and deletions: all insertions precede all deletions (Zhao et al., 2021a, 2023a). In addition, Berinde et al. (Berinde et al., 2010) showed that data summaries in the insertion-only model can achieve an even stronger error bound using residual error analysis. The differences among these algorithms and their corresponding error guarantees are listed in Table 1. In this work, we aim to extend the SpaceSaving± family of algorithms to support interleavings between insertions and deletions and to provide a tighter error bound analysis.

1.2. Our Contributions

In this paper, we present a detailed analysis of the SpaceSaving± family of algorithms with bounded deletions. In particular, we propose three new algorithms: Double SpaceSaving±, Unbiased Double SpaceSaving±, and Integrated SpaceSaving±, which operate in the general bounded deletion model (with interleavings of insertions and deletions) and provide strong theoretical guarantees with small space, while having distinct characteristics. In addition, we introduce the residual error bound guarantee in the bounded deletion model and prove that the proposed algorithms satisfy this stronger residual error guarantee. Moreover, we demonstrate that the proposed algorithms also achieve the relative error bound (Mathieu and de Rougemont, 2023; Cormode et al., 2021) under mild assumptions on the skewness of the data stream. Finally, we provide a merge algorithm for our proposed algorithms and prove that all of them are mergeable in the bounded deletion model. Mergeability is critical for distributed applications (Agarwal et al., 2012).

2. Background

2.1. Preliminaries

Consider a data stream consisting of a sequence of $N$ operations (insertions or deletions) on items drawn from a universe $U$ of size $|U|=n$. We denote the stream by $\sigma=\{(item_{t},op_{t})\}_{t=1,2,\ldots,N}$, where $item_{t}\in U$ is an element of the universe and $op_{t}\in\{insertion,deletion\}$. Let $I$ denote the total number of insertions and $D$ the total number of deletions in the stream, so that $I+D=N$. In the bounded deletion model (Jayaram and Woodruff, 2018; Zhao et al., 2021b), it is assumed that, for some parameter $\alpha\geq 1$, the stream satisfies $D\leq(1-1/\alpha)I$. Observe that when $\alpha=1$ there can be no deletions, so the model includes the insertion-only model as a special case. Let $f$ denote the frequency function: the frequency of an item $x$ is $f(x)=I(x)-D(x)$, where $I(x)$ denotes the number of insertions of $x$ in the stream and $D(x)$ the number of deletions of $x$. In this paper, we use a summary data structure from which we infer an estimate of the frequency function; we use $\hat{f}(x)$ to denote the estimated frequency of $x$.
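To make the definitions concrete, the following minimal Python sketch (with a hypothetical toy stream of our own) computes $I$, $D$, $f$, and $F_{1}$ and checks the $\alpha$-bounded-deletion constraint:

```python
from collections import Counter

# Toy stream of (item, op) pairs; "+" is an insertion, "-" a deletion.
stream = [("a", "+"), ("b", "+"), ("a", "+"), ("a", "-"), ("c", "+")]

inserts, deletes = Counter(), Counter()
for item, op in stream:
    (inserts if op == "+" else deletes)[item] += 1

I, D = sum(inserts.values()), sum(deletes.values())
f = {x: inserts[x] - deletes[x] for x in inserts}   # f(x) = I(x) - D(x)
F1 = I - D                                          # F1 = sum_x f(x)

alpha = 2
assert D <= (1 - 1 / alpha) * I   # the alpha-bounded-deletion constraint
print(f, F1)                      # {'a': 1, 'b': 1, 'c': 1} 3
```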

Let $\{f_{i}\}_{1}^{n}$ denote the frequencies of the $n$ items, sorted in non-increasing order, so that $f_{1}\geq f_{2}\geq\ldots\geq f_{n}$. Similarly, sorting $\{I(x):x\in U\}$ in non-increasing order gives $\{I_{i}\}_{1}^{n}$ with $I_{1}\geq I_{2}\geq\ldots\geq I_{n}$, and sorting $\{D(x):x\in U\}$ in non-increasing order gives $\{D_{i}\}_{1}^{n}$ with $D_{1}\geq D_{2}\geq\ldots\geq D_{n}$. Notice that, for the same $i$, the item corresponding to $f_{i}$ may differ from the item corresponding to $I_{i}$ and from the item corresponding to $D_{i}$. In the special case of the insertion-only model, the frequency vector and the insertion vector are identical. Moreover, we use $F_{1}$ to denote the total number of items resulting from the stream operations, so that $F_{1}=I-D=\sum_{i=1}^{n}f_{i}$. In this paper, we focus on the unit update model (an operation inserts or deletes only one count of an item) and assume that at any point in time no item's frequency is below zero (i.e., items cannot be deleted if they were not previously inserted).

Definition 2.1 (Deterministic Frequency Estimation Problem).

Given a stream of operations (insertions or deletions) of items from a universe $U$, the Frequency Estimation Problem asks for an (implicitly defined) function $\hat{f}$ that, when queried for any element $x\in U$, returns a value $\hat{f}(x)$ such that

$\forall x\in U,\ |f(x)-\hat{f}(x)|\leq\epsilon F_{1},$

where $F_{1}=\sum_{x\in U}f(x)$.

Definition 2.2 (Heavy Hitters problem).

Given a parameter $\epsilon>0$, a heavy hitter is an item with frequency at least $\epsilon F_{1}$. Given a stream of operations (insertions or deletions) of items from a universe $U$, the heavy hitters problem asks for an (explicitly defined) subset of $U$ that contains all heavy hitters.

All of our streaming algorithms use a summary of the information seen in the stream so far, consisting of a collection of items and some counts for each of them. The size of the summary is much smaller than $|U|=n$, and we say that an item is monitored if it is present in the summary.

2.2. The SpaceSaving± Family of Algorithms

Input: Insertion-only data stream $\sigma$ and SpaceSaving summary $S$
Output: SpaceSaving summary $S$
1 for (item $e$, insertion) from $\sigma$ do
2     if $e\in S$ then
3         $count_{e}$ += 1;
4     else if $S$ not full then
5         add $e$ to $S$ with $count_{e}=1$;
6     else
7         $minItem$ = item $x\in S$ minimizing $count_{x}$;
8         $w$ = $count_{minItem}$;
9         evict $minItem$ from summary;
10        add $e$ to summary with $count_{e}=w+1$;
11 end for
Algorithm 1 SpaceSaving Update Algorithm (insertion-only, one count per item)

The SpaceSaving± family of algorithms are data summaries that build on top of the original SpaceSaving (Metwally et al., 2005) algorithm. In 2005, Metwally, Agrawal, and El Abbadi (Metwally et al., 2005) proposed the popular SpaceSaving algorithm, which provides highly accurate estimates of item frequencies and heavy hitters among many competitors (Cormode and Hadjieleftheriou, 2008). The SpaceSaving algorithm operates in the insertion-only model (so that the stream is simply a sequence of items from $U$) and uses optimal space (Berinde et al., 2010). The update algorithm, shown in Algorithm 1 (Metwally et al., 2005), uses $m$ counters, each storing a monitored item's identity and its estimated frequency. When a new $item$ arrives: if $item$ is monitored, increment its counter; if $item$ is not monitored and the summary has unused counters, start to monitor $item$ and set $count_{item}$ to 1; otherwise, i) find $minItem$, the item with the minimum counter ($minCount$), ii) evict $minItem$ from the summary, and iii) monitor the new $item$ and set $count_{item}$ to $minCount+1$.

After the stream has been read, the SpaceSaving Update algorithm has produced a summary. To solve the Frequency Estimation Problem, the algorithm then needs to estimate the frequency $f(e)$ of a queried item $e$: the SpaceSaving Query algorithm returns $\hat{f}(e)=count_{e}$ if $e$ is monitored and $\hat{f}(e)=0$ otherwise, as shown in Algorithm 2. (To solve the heavy hitters problem, the algorithm outputs all items in the summary whose count is greater than or equal to $\epsilon$ times the sum of the counts.)
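For reference, here is a minimal Python sketch of Algorithms 1 and 2. It is a dictionary-based illustration of ours, not the paper's implementation; a production version would use the Stream-Summary structure of Metwally et al. (or a min-heap) so that finding $minItem$ takes $O(1)$ time rather than a linear scan.

```python
class SpaceSaving:
    """Insertion-only SpaceSaving with m counters (Algorithms 1 and 2)."""

    def __init__(self, m):
        self.m = m
        self.count = {}  # monitored item -> estimated frequency

    def insert(self, e):
        if e in self.count:                # monitored: increment its counter
            self.count[e] += 1
        elif len(self.count) < self.m:     # spare counter: start monitoring e
            self.count[e] = 1
        else:                              # full: evict the minimum item
            min_item = min(self.count, key=self.count.get)
            w = self.count.pop(min_item)
            self.count[e] = w + 1          # new item inherits min count + 1

    def query(self, e):
        return self.count.get(e, 0)
```

With $m=\lceil 1/\epsilon\rceil$ counters, this sketch realizes the guarantee of Lemma 2.3 below.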

When the number of counters (the size of the summary) is $m=\frac{1}{\epsilon}$, SpaceSaving solves both the frequency estimation and heavy hitters problems in the insertion-only model, as shown in Lemma 2.3. Moreover, SpaceSaving satisfies a more refined residual error bound, shown in Lemma 2.4, in which "hot" items do not contribute error and the error bound depends only on the "cold" and "warm" items (Berinde et al., 2010).

The $\epsilon$ error guarantee is shown in Lemma 2.3: with $m=\frac{1}{\epsilon}$ counters, SpaceSaving ensures $\forall x\in U,\ |f(x)-\hat{f}(x)|\leq\epsilon F_{1}$.

Lemma 2.3.

(Metwally et al., 2005) After processing an insertion-only data stream $\sigma$ with the SpaceSaving Update algorithm (Algorithm 1) with $m$ counters, the SpaceSaving Query algorithm (Algorithm 2) provides an estimate $\hat{f}$ such that $\forall x\in U,\ |f(x)-\hat{f}(x)|\leq\frac{F_{1}}{m}$, where $F_{1}=\sum_{i=1}^{n}f_{i}$.

The residual error guarantee is shown in Lemma 2.4.

Lemma 2.4.

(Berinde et al., 2010) After processing an insertion-only data stream $\sigma$, SpaceSaving with $O(\frac{k}{\epsilon})$ counters ensures $\forall x\in U,\ |f(x)-\hat{f}(x)|\leq\frac{\epsilon}{k}F_{1}^{res(k)}$, where $F_{1}^{res(k)}=F_{1}-\sum_{i=1}^{k}f_{i}$.

The relative error guarantee is shown in Lemma 2.6.

Definition 2.5 ($\gamma$-decreasing).

Let $\gamma>1$. A non-increasing function with domain $\{1,2,\ldots,n\}$ is $\gamma$-decreasing if, for all $t$ such that $1\leq\gamma t\leq n$, $f_{\lceil\gamma t\rceil}\leq f_{t}/2$.

Lemma 2.6.

(Mathieu and de Rougemont, 2023) After processing an insertion-only data stream $\sigma$ whose exact frequencies are $\gamma$-decreasing (Definition 2.5), SpaceSaving with $O(\frac{k}{\epsilon}\frac{2-\gamma}{2(\gamma-1)})$ counters ensures $|f_{i}-\hat{f}_{i}|\leq\epsilon f_{i},\ \forall i\leq k$.

Input: SpaceSaving summary $S$ and an item $e$
Output: $e$'s estimated frequency
if $e\in S$ then
    return $count_{e}$;
return 0;
Algorithm 2 Query Algorithm

In 2021, Zhao et al. (Zhao et al., 2021a) proposed SpaceSaving± to extend the SpaceSaving algorithm from the insertion-only model to the bounded deletion model. In that model, they established a space lower bound of $\frac{\alpha}{\epsilon}$ counters for solving the frequency estimation and heavy hitters problems, and proposed an algorithm, SpaceSaving± (see Algorithm 3), whose space matches the lower bound. The update algorithm is shown in Algorithm 3: insertions are dealt with as in Algorithm 1, deletions of monitored items decrement the corresponding counters, and deletions of unmonitored items are ignored. The query algorithm is identical to Algorithm 2. (There are a couple of variants of Algorithm 2; we adopt the most commonly used approach.)

Input: Bounded deletion data stream $\sigma$, and SpaceSaving± summary $S$
Output: SpaceSaving± summary $S$
for (item $e$, operation $op$) from $\sigma$ do
    if $op$ is an insertion then
        follow Algorithm 1 to process $e$;
    else
        if $e\in S$ then
            $count_{e}$ -= 1;
end for
Algorithm 3 SpaceSaving± Update Algorithm (bounded deletion)
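For concreteness, a minimal Python sketch of Algorithm 3, built on the SpaceSaving sketch above (again an illustration of ours, not the original implementation):

```python
class SpaceSavingPM(SpaceSaving):
    """Original SpaceSaving± (Algorithm 3). As discussed below, this is
    only correct when all insertions precede all deletions."""

    def update(self, e, is_insertion):
        if is_insertion:
            self.insert(e)          # insertions follow Algorithm 1
        elif e in self.count:
            self.count[e] -= 1      # deletions of unmonitored items are ignored
```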

However, the SpaceSaving± algorithm is only correct if an implicit assumption holds (Zhao et al., 2023a): deletions can only happen after all the insertions, so interleaved insertions and deletions are not allowed, as reflected in Lemma 2.7. In this paper, we present new algorithms for the general bounded deletion model that allow interleavings between insertions and deletions while still using only $O(\frac{\alpha}{\epsilon})$ space.

This lemma corresponds to Theorem 2 in (Zhao et al., 2021a). Consider the bounded deletion model and assume, in addition, that the stream consists of an insertion phase with $I$ insertions followed by a deletion phase with $D$ deletions (no interleaving).

Lemma 2.7.

After processing data stream $\sigma$ with the SpaceSaving± Update algorithm (Algorithm 3) with $m=\frac{\alpha}{\epsilon}$ counters, the SpaceSaving Query algorithm (Algorithm 2) provides an estimate $\hat{f}$ such that $\forall x\in U,\ |f(x)-\hat{f}(x)|\leq\frac{F_{1}}{m}$, where $F_{1}=\sum_{i=1}^{n}f_{i}$.

3. New Algorithms

In this section, we propose new algorithms supporting the bounded deletion model (with interleavings between insertions and deletions, while an item's frequency is never negative). Double SpaceSaving± and Integrated SpaceSaving± both use $O(\frac{\alpha}{\epsilon})$ space to solve the deterministic frequency estimation and heavy hitters problems in the bounded deletion model. We extend Double SpaceSaving± to support unbiased estimations. We also establish correctness proofs for these algorithms. While these algorithms share similar characteristics, they represent different trade-offs. In particular, Double SpaceSaving± uses two disjoint summaries, one to track insertions and the other to track deletions, while Integrated SpaceSaving± uses a single summary to track both deletions and insertions.

3.1. Double SpaceSaving±

Input: Bounded deletion data stream $\sigma$, and Double SpaceSaving± summary $(S_{insert},S_{delete})$
Output: Double SpaceSaving± summary $S=(S_{insert},S_{delete})$
for (item $e$, operation $op$) from $\sigma$ do
    if $op$ is an insertion then
        apply Algorithm 1 to insert $e$ in $S_{insert}$;
    else
        apply Algorithm 1 to insert $e$ in $S_{delete}$;
end for
Algorithm 4 Double SpaceSaving± Update Algorithm

Input: Double SpaceSaving± summary $S$ and an item $e$
Output: $e$'s estimated frequency
inserts = query $S_{insert}$ for $e$ with Algorithm 2;
deletes = query $S_{delete}$ for $e$ with Algorithm 2;
return max(inserts - deletes, 0);
Algorithm 5 Double SpaceSaving± Query Algorithm

In Double SpaceSaving±, we employ two SpaceSaving summaries to track insertions ($S_{insert}$) and deletions ($S_{delete}$) separately, using $m_{I}$ and $m_{D}$ counters respectively (the algorithm is parameterized by the choice of $m_{I}$ and $m_{D}$). As shown in Algorithm 4, Double SpaceSaving± completely partitions the update operations between two independent summaries. To estimate the frequency of an item $e$, both summaries are queried, as shown in Algorithm 5.
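A minimal Python sketch of Algorithms 4 and 5, reusing the SpaceSaving sketch from Section 2.2 (class and method names are ours):

```python
class DoubleSpaceSavingPM:
    """Double SpaceSaving±: two disjoint SpaceSaving summaries (Algorithms 4-5)."""

    def __init__(self, m_insert, m_delete):
        self.s_insert = SpaceSaving(m_insert)   # tracks insertions
        self.s_delete = SpaceSaving(m_delete)   # tracks deletions

    def update(self, e, is_insertion):
        # Each operation is routed to exactly one of the two summaries.
        (self.s_insert if is_insertion else self.s_delete).insert(e)

    def query(self, e):
        # Estimated frequency, clamped at 0 as in Algorithm 5.
        return max(self.s_insert.query(e) - self.s_delete.query(e), 0)
```

Theorem 3.1 below suggests the sizing $m_{I}=\frac{2\alpha}{\epsilon}$ and $m_{D}=\frac{2(\alpha-1)}{\epsilon}$.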

Theorem 3.1.

After processing a bounded deletion data stream $\sigma$, Double SpaceSaving± following Update Algorithm 4 with $m_{I}=O(\frac{\alpha}{\epsilon})$ and $m_{D}=O(\frac{\alpha-1}{\epsilon})$ and Query Algorithm 5 provides an estimate $\hat{f}$ such that $\forall x\in U,\ |f(x)-\hat{f}(x)|<\epsilon F_{1}$.

Proof.

For every item in the universe, the frequency estimation error is upper bounded by $\frac{I}{m_{I}}+\frac{D}{m_{D}}$, by the property of SpaceSaving shown in Lemma 2.3 and the triangle inequality. In addition, from the definition of bounded deletion, we can derive the relationship between $I$, $D$, and $F_{1}$: namely, $I\leq\alpha F_{1}$ and $D\leq(\alpha-1)F_{1}$. Setting $m_{I}=\frac{2\alpha}{\epsilon}$ and $m_{D}=\frac{2(\alpha-1)}{\epsilon}$ gives $\frac{I}{m_{I}}+\frac{D}{m_{D}}\leq\epsilon F_{1}$. Hence, Double SpaceSaving± solves the deterministic frequency estimation problem in the bounded deletion model with $O(\frac{\alpha}{\epsilon})$ space. ∎

Theorem 3.2.

Double SpaceSaving± solves the heavy hitters problem in the bounded deletion model using $O(\frac{\alpha}{\epsilon})$ space by reporting all items monitored in $S_{insert}$.

Proof.

Assume a heavy hitter $x$ is not monitored in the summary. By definition of heavy hitters, $f(x)\geq\epsilon F_{1}$. Since $x$ is not monitored, its estimated frequency is 0. As a result, the estimation error for item $x$ is at least $\epsilon F_{1}$, which contradicts Theorem 3.1.

Hence, by contradiction, Double SpaceSaving± solves the heavy hitters problem. ∎

Being more careful and using variants of Algorithm 2 would lead to improvements but would not change the theoretical bound of Theorem 3.2.

3.1.1. Unbiased Double SpaceSaving±

Unbiased estimation is a desirable property for robust estimation (Charikar et al., 2002; Duffield et al., 2007). Research has shown that SpaceSaving can be extended to provide unbiased estimation, i.e., $E[\hat{f}(x)]=f(x)$ (Ting, 2018). The extension is obtained from Algorithm 1 by the following simple modification: after line 8 of the code, with probability $1/(w+1)$ lines 9 and 10 are executed; with the complementary probability $1-1/(w+1)$, instead of lines 9 and 10 the algorithm simply increments $count_{minItem}$. Consequently, we can replace the two independent deterministic SpaceSaving summaries with two independent unbiased SpaceSaving summaries in Algorithm 5 to support unbiased frequency estimation in the bounded deletion model.
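A sketch of this randomized eviction rule on top of the SpaceSaving class above; only the full-summary branch differs from Algorithm 1:

```python
import random

class UnbiasedSpaceSaving(SpaceSaving):
    """SpaceSaving with randomized eviction (Ting, 2018): E[f_hat(x)] = f(x)."""

    def insert(self, e):
        if e in self.count:
            self.count[e] += 1
        elif len(self.count) < self.m:
            self.count[e] = 1
        else:
            min_item = min(self.count, key=self.count.get)
            w = self.count[min_item]
            if random.random() < 1.0 / (w + 1):
                # With probability 1/(w+1), evict minItem and monitor e
                # (lines 9 and 10 of Algorithm 1).
                del self.count[min_item]
                self.count[e] = w + 1
            else:
                # Otherwise, credit the arrival to the current minimum item.
                self.count[min_item] = w + 1
```

Unbiased Double SpaceSaving± is then Algorithm 4 run with two UnbiasedSpaceSaving summaries.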

Theorem 3.3.

After processing a bounded deletion data stream $\sigma$, Unbiased Double SpaceSaving± with $O(\frac{\alpha}{\epsilon})$ space, using two independent unbiased SpaceSaving summaries in Algorithm 5, provides unbiased frequency estimation, and the estimation variance is upper bounded by $\epsilon^{2}F_{1}^{2}$.

Proof.

The variance of an Unbiased SpaceSaving estimate is upper bounded by $O((\frac{F_{1}}{m})^{2})$ (Ting, 2018). We know that $Var[X-Y]=Var[X]+Var[Y]-2Cov[X,Y]$, where $X$ is the estimate from the insert summary and $Y$ is the estimate from the delete summary. Since the two summaries are independent, the covariance is 0; hence, the variance is upper bounded by $O((\frac{I}{m_{I}})^{2}+(\frac{D}{m_{D}})^{2})$. In Theorem 3.1, we proved that $\frac{I}{m_{I}}+\frac{D}{m_{D}}\leq\epsilon F_{1}$. As a result, the variance of the unbiased estimate of any item's frequency is upper bounded by $\epsilon^{2}F_{1}^{2}$. ∎

3.2. Integrated SpaceSaving±

Input: Bounded deletion data stream $\sigma$, and Integrated SpaceSaving± summary $S$
Output: Integrated SpaceSaving± summary $S$
for (item $e$, operation $op$) from $\sigma$ do
    if $e\in S$ then
        if $op$ is an insertion then
            $insert_{e}$ += 1;
        else
            $delete_{e}$ += 1;
    else
        if $S$ not full then
            add $e$ to $S$ with $insert_{e}=1$ and $delete_{e}=0$;
        else if $op$ is an insertion then
            $minItem$ = item $x\in S$ minimizing $insert_{x}$;
            $w$ = $insert_{minItem}$;
            evict $minItem$ from summary;
            add $e$ to summary and set $insert_{e}=w+1$ and $delete_{e}=0$;
end for
Algorithm 6 Integrated SpaceSaving± Update Algorithm

Unlike Double SpaceSaving±, which uses two independent summaries, Integrated SpaceSaving± completely integrates deletions and insertions in one data structure, as shown in Algorithm 6. The main difference between Integrated SpaceSaving± and SpaceSaving± is that Integrated SpaceSaving± tracks the insert and delete counts separately. This ensures that the minimum insert count monotonically increases and never decreases. The original SpaceSaving±, by contrast, uses one single count to track inserts and deletes together; when insertions and deletions are interleaved, the minimum count in SpaceSaving± may decrease, which can lead to severe underestimation of a frequent item (Zhao et al., 2023a).

Input: Integrated SpaceSaving± summary $S$ and an item $e$
Output: $e$'s estimated frequency
if $e\in S$ then
    return $insert_{e}-delete_{e}$;
return 0;
Algorithm 7 Integrated SpaceSaving± Query Algorithm

More specifically, the Integrated SpaceSaving± data structure maintains a record estimating the number of insertions and deletions of each monitored item. When a new item arrives, if it is already being monitored, then depending on the type of operation, either the insertion or the deletion count of that item is incremented. If the summary is not full, then no item could ever have been evicted from the summary, and since no delete operation arrives before the corresponding item is inserted, any new item in the stream must be an insertion. Hence, when a new item arrives and the summary is not full, the new item is added to the summary and initialized with one insertion and zero deletions. If a delete arrives and the summary is full, the delete is simply ignored. The interesting case is when an insert arrives and the summary is full. In this case, Integrated SpaceSaving± mimics the behavior of the original SpaceSaving± by evicting the element with the least number of inserts and replacing it with the new element. The insertion count of the new element is overestimated by initializing it to the insertion count of the evicted element (plus one for the arriving insertion). However, the deletion count is underestimated by initializing it to zero.

To estimate an item's frequency, if the item is monitored, the algorithm subtracts the item's delete count from its insert count; otherwise the estimated frequency is 0, as shown in Algorithm 7.
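A minimal Python sketch of Algorithms 6 and 7 (a dictionary-based illustration of ours):

```python
class IntegratedSpaceSavingPM:
    """Integrated SpaceSaving±: one summary with separate insert/delete counts."""

    def __init__(self, m):
        self.m = m
        self.entries = {}  # item -> [insert_count, delete_count]

    def update(self, e, is_insertion):
        if e in self.entries:
            self.entries[e][0 if is_insertion else 1] += 1
        elif len(self.entries) < self.m:
            # Not full: nothing was ever evicted, so this must be an insertion.
            self.entries[e] = [1, 0]
        elif is_insertion:
            # Full: evict the item with the minimum insert count.
            min_item = min(self.entries, key=lambda x: self.entries[x][0])
            w = self.entries.pop(min_item)[0]
            self.entries[e] = [w + 1, 0]  # overestimate inserts, reset deletes
        # A deletion of an unmonitored item is ignored.

    def query(self, e):
        if e in self.entries:
            inserts, deletes = self.entries[e]
            return inserts - deletes
        return 0
```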

We now establish the correctness of Integrated SpaceSaving±. First, we establish several helper lemmas about Integrated SpaceSaving±.

Lemma 3.4.

After processing a bounded deletion data stream $\sigma$, Integrated SpaceSaving± following update Algorithm 6 ensures that the sum of all insert counts equals $I$.

Proof.

This holds by induction over time, as verified by inspection of the algorithm. ∎

Lemma 3.5.

After processing a bounded deletion data stream $\sigma$, Integrated SpaceSaving± with $m$ counters following update Algorithm 6 ensures that the minimum insert count, $insert_{minItem}$, is less than or equal to $\frac{I}{m}$.

Proof.

Since deletions never increment any insert count, the insert counts are only affected by insertions. The sum of all insert counts in the summary is exactly $I$, since it increments by 1 after each insertion. As a result, $insert_{minItem}$ is maximized when all insert counts are equal. Hence, $insert_{minItem}$ is less than or equal to $\frac{I}{m}$. ∎

Lemma 3.6.

The frequencies of monitored items in Integrated SpaceSaving± are never underestimated by query Algorithm 7.

Proof.

The $insert_{e}$ counts of the summary are dealt with exactly as in the SpaceSaving algorithm, applied to the substream of $\sigma$ consisting of the insertion operations only. By the analysis of SpaceSaving, it is known that for an item $e$ in the summary, $insert_{e}$ is an overestimate of the true number of insertions of $e$ since the beginning of the stream.

The $delete_{e}$ count of element $e$ in the summary equals the number of deletions of item $e$ since the last time it was inserted into the summary, so it is an underestimate of the number of deletions of $e$ since the beginning of the stream.

Thus, for any $e$ in the summary, $insert_{e}-delete_{e}$ is an overestimate. ∎

Lemma 3.7.

In Integrated SpaceSaving±, any item inserted more than $insert_{minItem}$ times must be monitored in the summary; moreover, for any item $e$ in the summary, its count $insert_{e}$ exceeds its true number of insertions by at most $insert_{minItem}$.

Proof.

The $insert_{e}$ counts of the summary are dealt with exactly as in the SpaceSaving algorithm, applied to the substream of $\sigma$ consisting of the insertion operations only, and these properties are well known to hold for SpaceSaving. ∎

Lemma 3.8.

In Integrated SpaceSaving± following update Algorithm 6 and query Algorithm 7, the estimation error of any item is upper bounded by $insert_{minItem}$: $\forall x\in U,\ |f(x)-\hat{f}(x)|\leq insert_{minItem}$.

Proof.

First consider the case where $e$ is unmonitored, so the estimate is 0. Its true frequency is at most its number of insertions, which by Lemma 3.7 is bounded by $insert_{minItem}$, hence the lemma holds for $e$.

Now consider the case where $e$ is in the summary, and consider the last time $t$ at which $e$ was inserted into the summary. Let $I_{\geq t}(e)$ and $D_{\geq t}(e)$ denote the number of insertions and deletions of $e$ since time $t$, and let $min$ denote the value of $insert_{minItem}$ at time $t$. Since the algorithm updates $insert_{e}$ and $delete_{e}$ correctly while $e$ is in the summary, the query algorithm outputs the estimate $\hat{f}(e)=insert_{e}-delete_{e}=min+I_{\geq t}(e)-D_{\geq t}(e)$. On the other hand, the frequency $f(e)$ of $e$ equals the net number of occurrences of $e$ prior to time $t$, plus $I_{\geq t}(e)-D_{\geq t}(e)$. By Lemma 3.7, the net number of occurrences of $e$ prior to time $t$ is at most $min$, so $\hat{f}(e)-min\leq f(e)\leq\hat{f}(e)$, implying the lemma. ∎

Theorem 3.9.

After processing a bounded deletion data stream $\sigma$, Integrated SpaceSaving± with $O(\frac{\alpha}{\epsilon})$ space following update Algorithm 6 and query Algorithm 7 provides an estimate $\hat{f}$ such that $\forall i,\ |f(i)-\hat{f}(i)|<\epsilon F_{1}$.

Proof.

By Lemma 3.5 and Lemma 3.8, we know that $\forall x\in U,\ |f(x)-\hat{f}(x)|\leq insert_{minItem}\leq\frac{I}{m}$. We also know that $I\leq\alpha F_{1}$. Setting $m=\frac{\alpha}{\epsilon}$ gives $\frac{I}{m}\leq\epsilon F_{1}$. As a result, Integrated SpaceSaving± with $O(\frac{\alpha}{\epsilon})$ space ensures $\forall i,\ |f(i)-\hat{f}(i)|<\epsilon F_{1}$. ∎

Integrated SpaceSaving± also solves the heavy hitters problem. Since Integrated SpaceSaving± never underestimates monitored items (Lemma 3.6), reporting all items with frequency estimates greater than or equal to $\epsilon F_{1}$ identifies all heavy hitters, as shown in Theorem 3.10; a sketch of this reporting rule follows.
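On the Python sketch above, the reporting rule is a one-liner (the function name is ours):

```python
def heavy_hitters(summary, epsilon, F1):
    """All monitored items whose estimate is at least epsilon * F1.
    By Lemma 3.6, this set contains every true heavy hitter."""
    return [e for e in summary.entries if summary.query(e) >= epsilon * F1]
```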

Theorem 3.10.

Integrated SpaceSaving± solves the heavy hitters problem in the bounded deletion model using $O(\frac{\alpha}{\epsilon})$ space.

Proof.

Assume a heavy hitter $x$ is not contained in the set of all items with estimated frequency greater than or equal to $\epsilon F_{1}$. Recall that, by definition of heavy hitters, $f(x)\geq\epsilon F_{1}$. Since $x$ is not contained, $x$'s frequency estimate $\hat{f}(x)$ must be less than $\epsilon F_{1}$. However, by Lemma 3.6, Integrated SpaceSaving± never underestimates the frequency of monitored items. (If $x$ is unmonitored, its estimate is 0 and the estimation error is at least $\epsilon F_{1}$, contradicting Theorem 3.9.)

Hence, by contradiction, Integrated SpaceSaving± solves the deterministic heavy hitters problem. ∎

3.3. Comparisons between Double SpaceSaving± and Integrated SpaceSaving±

We assume each field uses the same number of bits (32 bits in the implementation). Following the proof of Theorem 3.9, and since each entry in the summary has three fields $(i,insert_{i},delete_{i})$, Integrated SpaceSaving± uses $\frac{3\alpha}{\epsilon}$ fields. On the other hand, following the proof of Theorem 3.1, and since each entry in the summary uses two fields, Double SpaceSaving± uses $2\cdot\frac{2\alpha}{\epsilon}+2\cdot\frac{2(\alpha-1)}{\epsilon}=\frac{8\alpha-4}{\epsilon}$ fields. For example, with $\alpha=2$ and $\epsilon=0.01$, Integrated SpaceSaving± uses 600 fields while Double SpaceSaving± uses 1200. Integrated SpaceSaving± thus requires the smaller memory footprint. Double SpaceSaving± has its own advantage: it can be extended to support unbiased estimations (Unbiased Double SpaceSaving±).

4. Tighter Analysis on Error Bounds

In this section, we present a tighter analysis of the proposed algorithms. In Section 3, we showed that the proposed algorithms achieve the desired error bound $\epsilon F_{1}$, in which the error depends on all items in the data stream $\sigma$. In this section, we first prove that both algorithms guarantee the residual error bound and then prove that they also ensure the relative error bound under relatively mild assumptions. We take inspiration from (Berinde et al., 2010; Metwally et al., 2005; Zhao et al., 2021a; Mathieu and de Rougemont, 2023).

4.1. Residual Error Bound

We define $F_{1,\alpha}^{res(k)}=F_{1}-\frac{1}{\alpha}\sum_{i=1}^{k}f_{i}$. Observe that $F_{1,\alpha}^{res(k)}$ reduces to the residual error bound of Lemma 2.4, $F_{1}^{res(k)}$, for the insertion-only model when $\alpha$ is set to 1. The residual error bound ($F_{1,\alpha}^{res(k)}$) is much tighter than the original error bound ($F_{1}$), since the residual error depends less on the heavy hitters.

First, we derive the residual error bound for Double SpaceSaving±.

Theorem 4.1.

Double SpaceSaving± with $m_{I}+m_{D}=O(\alpha k/\epsilon)$ space achieves the residual error bound guarantee $\forall x\in U,\ |f(x)-\hat{f}(x)|\leq\frac{\epsilon}{k}F_{1,\alpha}^{res(k)}$, where $k<\min(m_{I},m_{D})$.

Proof.

First, define the maximum error of Double SpaceSaving± as $\delta=\max_{x\in U}|f(x)-\hat{f}(x)|$. We know that $\delta\leq\frac{I-\sum_{i=1}^{k}I_{i}}{m_{I}-k}+\frac{D-\sum_{i=1}^{k}D_{i}}{m_{D}-k}$ by Lemma 2.4 and the triangle inequality. By definition of the bounded deletion model, $I\leq\alpha F_{1}$ and $D\leq(\alpha-1)F_{1}$. Hence, $\delta\leq\alpha\frac{F_{1}-1/\alpha\sum_{i=1}^{k}I_{i}}{m_{I}-k}+(\alpha-1)\frac{F_{1}-1/(\alpha-1)\sum_{i=1}^{k}D_{i}}{m_{D}-k}$. We know that $\sum_{i=1}^{k}I_{i}\geq\sum_{i=1}^{k}f_{i}$ (the sum of the $k$ largest insert counts must be at least the sum of the $k$ largest frequencies) and $\sum_{i=1}^{k}D_{i}\geq 0$. As a result, $\delta<\alpha\frac{F_{1}-1/\alpha\sum_{i=1}^{k}f_{i}}{m_{I}-k}+(\alpha-1)\frac{F_{1}}{m_{D}-k}$. Let $m_{I}=k(\frac{2\alpha}{\epsilon}+1)$ and $m_{D}=k(\frac{2(\alpha-1)}{\epsilon}+1)$. Then $\delta<\frac{\epsilon}{k}(F_{1}-1/(2\alpha)\sum_{i=1}^{k}f_{i})\approx\frac{\epsilon}{k}F_{1,\alpha}^{res(k)}$. ∎

Next, we present the residual error bound for Integrated SpaceSaving±. Before the proof, we first establish a helpful lemma.

Lemma 4.2.

The minimum insert count ($insert_{minItem}$) in Integrated SpaceSaving± with $m$ counters is less than or equal to $\alpha\frac{F_{1}-1/\alpha\sum_{i=1}^{k}f_{i}}{m-k}$, where $k<m$.

Proof.

The sum of all insert counts equals $I$, as it increases by 1 after each insertion. Zooming in on the $k$ largest insert counts in the summary, their sum is no less than $\sum_{i=1}^{k}f_{i}$, since no monitored item is underestimated (Lemma 3.6). As a result, $insert_{minItem}\leq\frac{I-\sum_{i=1}^{k}f_{i}}{m-k}$. By definition of bounded deletion, $I\leq\alpha F_{1}$. Hence, $insert_{minItem}\leq\alpha\frac{F_{1}-1/\alpha\sum_{i=1}^{k}f_{i}}{m-k}$. ∎

Theorem 4.3.

Integrated SpaceSaving± with $O(\alpha k/\epsilon)$ space achieves the residual error bound guarantee $\forall x\in U,\ |f(x)-\hat{f}(x)|\leq\frac{\epsilon}{k}F_{1,\alpha}^{res(k)}$, where $k<m$.

Proof.

From Lemma 3.8 and Lemma 4.2, the estimation error of every item is upper bounded by $insert_{minItem}$, and $insert_{minItem}\leq\alpha\frac{F_{1}-1/\alpha\sum_{i=1}^{k}f_{i}}{m-k}$. Let $m=k(\frac{\alpha}{\epsilon}+1)$. We then achieve the desired residual error bound: $\forall x\in U,\ |f(x)-\hat{f}(x)|\leq\frac{\epsilon}{k}F_{1,\alpha}^{res(k)}$. ∎

4.2. Relative Error Bound

The $\epsilon$ relative error guarantee requires that $\forall x\in U,\ |f(x)-\hat{f}(x)|\leq\epsilon f(x)$. The relative error has been shown to be much harder to achieve, but it is of practical importance (Gupta and Zane, 2003; Cormode et al., 2021). In a recent work, Mathieu and de Rougemont (Mathieu and de Rougemont, 2023) showed that the original SpaceSaving algorithm achieves relative error guarantees under mild assumptions. Accordingly, we demonstrate that the proposed algorithms in the bounded deletion model also achieve relative error guarantees under similar assumptions.

Lemma 4.4.

Let $f$ be $\gamma$-decreasing with $1<\gamma<2$. Then $\sum_{j=i}^{n}f_{j}\leq(i-1)f_{i-1}\frac{2(\gamma-1)}{2-\gamma}$ and hence $F_{1}=\sum_{j=1}^{n}f_{j}\leq f_{1}(1+\frac{2(\gamma-1)}{2-\gamma})=f_{1}\frac{\gamma}{2-\gamma}$.

Proof.

We omit the proof; it is proven as Lemma 5 in (Mathieu and de Rougemont, 2023). ∎

Definition 4.5 (Zipfian Distribution (Zipf, 2016)).

Let a function $f$ with domain $\{1,2,\ldots,n\}$ follow the Zipf distribution with parameter $\beta$; then $f_{i}=F_{1}\frac{1}{i^{\beta}\xi(\beta)}$, where $F_{1}=\sum_{i=1}^{n}f_{i}$ and $\xi(\beta)=\sum_{i=1}^{n}i^{-\beta}$.

By the definition of the Zipf distribution, $\xi(\beta)$ converges to a small constant when $\beta>1$, and $f_{1}=F_{1}/\xi(\beta)$.

Theorem 4.6.

Assuming insertions and deletions in the $\alpha$-bounded deletion model follow the $\gamma$-decreasing property ($1<\gamma<2$) and the exact frequencies follow the Zipf distribution with parameter $\beta$, Double SpaceSaving± with $O(\frac{\alpha k^{\beta+1}}{\epsilon 2^{\log_{\gamma}k}})$ space achieves the relative error bound guarantee $|f_{i}-\hat{f}_{i}|\leq\epsilon f_{i}$ for $i\leq k$.

Proof.

Recall that the maximum estimation error $\delta$ is upper bounded by $\frac{I-\sum_{i=1}^{k}I_{i}}{m_{I}-k}+\frac{D-\sum_{i=1}^{k}D_{i}}{m_{D}-k}$. Since insertions and deletions follow the $\gamma$-decreasing property, $\sum_{i=k+1}^{n}I_{i}\leq kI_{k}\frac{2(\gamma-1)}{2-\gamma}$ and $\sum_{i=k+1}^{n}D_{i}\leq kD_{k}\frac{2(\gamma-1)}{2-\gamma}$. Hence, $\delta\leq\frac{kI_{k}\frac{2(\gamma-1)}{2-\gamma}}{m_{I}-k}+\frac{kD_{k}\frac{2(\gamma-1)}{2-\gamma}}{m_{D}-k}$.

For insertions, we know i) $I_{1}\leq I\leq\alpha F_{1}=\alpha\xi(\beta)f_{1}$, ii) $f_{i}=\frac{f_{1}}{i^{\beta}}$, and iii) $I_{i}\leq I_{1}/2^{\log_{\gamma}(i)}$. For deletions, we know i) $D_{1}\leq D\leq(\alpha-1)F_{1}=(\alpha-1)\xi(\beta)f_{1}$, ii) $f_{i}=\frac{f_{1}}{i^{\beta}}$, and iii) $D_{i}\leq D_{1}/2^{\log_{\gamma}(i)}$.

Hence, $I_{k}\leq\frac{I_{1}}{2^{\log_{\gamma}(k)}}\leq\alpha\xi(\beta)f_{1}/2^{\log_{\gamma}(k)}=\alpha\xi(\beta)f_{k}k^{\beta}/2^{\log_{\gamma}(k)}$, and $D_{k}\leq D_{1}/2^{\log_{\gamma}(k)}\leq(\alpha-1)\xi(\beta)f_{k}k^{\beta}/2^{\log_{\gamma}(k)}$. Let $m_{I}=m_{D}=k+\frac{2(\gamma-1)}{2-\gamma}\frac{k^{\beta+1}}{2^{\log_{\gamma}k}}\frac{(2\alpha-1)}{\epsilon}$; then $\delta\leq\epsilon f_{k}$ and hence, for all $i\leq k$, $|f_{i}-\hat{f}_{i}|\leq\epsilon f_{i}$. ∎

Theorem 4.7.

Assuming insertions in the $\alpha$-bounded deletion model follow the $\gamma$-decreasing property ($1<\gamma<2$) and the exact frequencies follow the Zipf distribution with parameter $\beta$, Integrated SpaceSaving± with $O(\frac{\alpha k^{\beta+1}}{\epsilon 2^{\log_{\gamma}k}})$ space achieves the relative error bound guarantee $|f_{i}-\hat{f}_{i}|\leq\epsilon f_{i}$ for $i\leq k$.

Proof.

We know that the estimation error is at most $insert_{minItem}$ and that $insert_{minItem}\leq\frac{I-\sum_{i=1}^{k}I_{i}}{m-k}$. Since insertions follow the $\gamma$-decreasing property, $\sum_{i=k+1}^{n}I_{i}\leq kI_{k}\frac{2(\gamma-1)}{2-\gamma}$. Hence, $insert_{minItem}\leq\frac{\sum_{i=k+1}^{n}I_{i}}{m-k}\leq\frac{kI_{k}\frac{2(\gamma-1)}{2-\gamma}}{m-k}$. We know i) $I_{1}\leq I\leq\alpha F_{1}=\alpha\xi(\beta)f_{1}$, ii) $f_{i}=\frac{f_{1}}{i^{\beta}}$, and iii) $I_{g}\leq I_{1}/2^{\log_{\gamma}(g)}$. Hence, $I_{k}\leq I_{1}/2^{\log_{\gamma}(k)}\leq\alpha\xi(\beta)f_{1}/2^{\log_{\gamma}(k)}=\alpha\xi(\beta)f_{k}k^{\beta}/2^{\log_{\gamma}(k)}$. Let $m=k+\frac{2(\gamma-1)}{2-\gamma}\frac{k^{\beta+1}}{2^{\log_{\gamma}k}}\frac{\alpha}{\epsilon}$; then $insert_{minItem}\leq\epsilon f_{k}$ and hence, for all $i\leq k$, $|f_{i}-\hat{f}_{i}|\leq\epsilon f_{i}$. ∎

5. Mergeability

Mergeability is a desirable property in distributed settings: given two summaries built over two data sets, after merging the two summaries, the merged summary should have error and size guarantees equivalent to those of a single summary that processed the union of the two data sets.

Definition 5.1 (Mergeable (Agarwal et al., 2013)).

A summary $S$ is mergeable if there exists a merge algorithm $A$ that produces a summary $S(\sigma_{1}\cup\sigma_{2},\epsilon)$ from two input summaries $S(\sigma_{1},\epsilon)$ and $S(\sigma_{2},\epsilon)$. Note that the sizes of all three summaries should be the same while achieving the $\epsilon$ error guarantee.

Both the original SpaceSaving and Unbiased SpaceSaving have been shown to be mergeable (Agarwal et al., 2013; Ting, 2018). Since Double SpaceSaving± relies on two independent SpaceSaving summaries, Double SpaceSaving± is also mergeable.

For Integrated SpaceSaving±, we define a merge algorithm inspired by (Agarwal et al., 2013), shown in Algorithm 8 (see also the sketch after it). Given two Integrated SpaceSaving± summaries of $m$ counters, $S_{1}$ and $S_{2}$, we first union the two summaries (i.e., if both summaries monitor item $x$, then $x$'s insert and delete counts in the merged summary are the sums of $x$'s insert and delete counts from $S_{1}$ and $S_{2}$). After the union, the merged summary contains at most $2m$ counters. The merged summary then keeps the $m$ largest items among these $2m$ counters based on their insert counts.

Input: Two Integrated SpaceSaving± summaries of size $m$: $S_{1}$ and $S_{2}$
Output: Merged summary $S_{merge}$ with size $m$
$S_{3}$ = $S_{1}$ UNION $S_{2}$;
select the $m$ largest items from $S_{3}$ based on insert counts;
add these $m$ items to $S_{merge}$, keeping their insert/delete counts;
Algorithm 8 Integrated SpaceSaving± Merge Algorithm
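A minimal Python sketch of Algorithm 8 on the IntegratedSpaceSavingPM sketch from Section 3.2:

```python
def merge(s1, s2):
    """Merge two IntegratedSpaceSavingPM summaries of equal size m (Algorithm 8)."""
    union = {}
    for s in (s1, s2):
        for item, (ins, dels) in s.entries.items():
            u = union.setdefault(item, [0, 0])
            u[0] += ins    # sum insert counts of items monitored in both
            u[1] += dels   # sum delete counts of items monitored in both
    merged = IntegratedSpaceSavingPM(s1.m)
    # Keep the m items with the largest insert counts.
    top = sorted(union, key=lambda x: union[x][0], reverse=True)[:s1.m]
    merged.entries = {x: union[x] for x in top}
    return merged
```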
Figure 1. Comparison between our proposed algorithms and linear sketches on synthetic and real-world datasets: (a) Zipf $\beta=1.0$, frequency estimation; (b) YCSB, frequency estimation; (c) Zipf $\beta=1.0$, top 100; (d) YCSB, top 100.
Theorem 5.2.

The Integrated SpaceSaving± summaries are mergeable for $\alpha$-bounded deletion data streams following Algorithm 8 with $m=\frac{\alpha}{\epsilon}$ counters.

Proof.

Assume the given Integrated SpaceSaving± summaries with $m=\frac{\alpha}{\epsilon}$ counters, $S_{1}$ and $S_{2}$, have processed two $\alpha$-bounded deletion data streams $\sigma^{1}$ and $\sigma^{2}$ with $I^{1}$ and $I^{2}$ insertions and $F_{1}^{1}$ and $F_{1}^{2}$ total frequencies, respectively. Let $F_{1}^{final}=F_{1}^{1}+F_{1}^{2}$. First, we observe that the union step does not introduce any additional error. As a result, the error for the unioned items is at most $\epsilon F_{1}^{1}+\epsilon F_{1}^{2}\leq\epsilon F_{1}^{final}$.

Since we do not want to underestimate any monitored items, Algorithm 8 keeps the $m$ largest items from the unioned summary based on insert counts. We need to show that evicting all items except the largest $m$ does not lead to an estimation error larger than $\frac{I^{1}+I^{2}}{m}$, and we know that $\frac{I^{1}+I^{2}}{m}\leq\epsilon F_{1}^{final}$. This can be shown by contradiction. Assume there exists an item $x$ in the union of $S_{1}$ and $S_{2}$ with true frequency larger than or equal to $\frac{I^{1}+I^{2}}{m}$ that is not included in the $m$ largest items. Note that $x$'s insert count must be at least its true frequency, by Lemma 3.6. This would imply that the sum of the insert counts of the $m$ largest items and $x$ is at least $(m+1)\cdot\frac{I^{1}+I^{2}}{m}>I^{1}+I^{2}$. However, by Lemma 3.4, the sum of insert counts cannot exceed $I^{1}+I^{2}$. As a result, evicting all items except the $m$ largest (by insert count) from the unioned summary does not lead to an estimation error larger than $\epsilon F_{1}^{final}$. Therefore, Integrated SpaceSaving± is mergeable. ∎

6. Evaluation

This section evaluates and compares the performance of our proposed algorithms with linear sketches from the turnstile model (Li et al., 2014). Our proposed counter-based summaries make no assumption on the input universe but operate in the bounded deletion model, whereas linear sketches depend on the universe size but allow the entire data set to be deleted. The experiments aim to better understand the advantages and disadvantages of our proposed algorithms. The linear sketches that we compare with are:

  • Count Min (Cormode and Muthukrishnan, 2005): Item’s frequency is never underestimated.

  • Count Sketch (Charikar et al., 2002): Provides an unbiased estimation, such that $E[\hat{f}(x)]=f(x)$, where $E$ denotes expectation.

Experiment Setup. We implemented all of the algorithms in Python. Throughout all experiments, we set $\delta=|U|^{-1}$ for the linear sketches to align the experiments with the theoretical literature (Bhattacharyya et al., 2018).

Data Sets. We use both synthetic and real-world data consisting of items that are inserted and deleted. Zipf Stream: the insertions ($10^{5}$) are drawn from the Zipf distribution (Zipf, 2016) and the deletions ($10^{5}/2$) are chosen uniformly from the insertions. Deletions happen after all the insertions. The cardinality is 22k. YCSB: we use 60% insertions and 40% updates (a delete followed by an insert) with the request distribution set to zipf in the YCSB benchmark (Cooper et al., 2010), with 116645 insertions and 39825 deletions interleaved. The cardinality is 65k.

Metrics. We use the average relative error (ARE) as the metric for frequency estimation. Let the final data set be $D=\{x\,|\,f(x)>0\}$. The ARE is calculated as $\frac{1}{|D|}\sum_{x\in D}\frac{|f(x)-\hat{f}(x)|}{f(x)}$. A lower ARE indicates more accurate approximations. For identifying the top 100 heavy hitters, we query all items in $D$ and report the set of 100 items with the largest estimated frequencies; the metric is the F1 score.
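As a small sketch (with our own function and variable names), the ARE computation over the final data set:

```python
def average_relative_error(true_f, summary):
    """ARE over the final data set D = {x : f(x) > 0}."""
    D = [x for x, fx in true_f.items() if fx > 0]
    return sum(abs(true_f[x] - summary.query(x)) / true_f[x] for x in D) / len(D)
```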

6.1. Main Results

Our proposed algorithms perform very well on the frequency estimation task, as shown in Figure 1(a)-(b). The y-axis is the average relative error and the x-axis is the number of counters used. For fair comparisons, we let Integrated SpaceSaving± use three counters per entry, SpaceSaving and Unbiased SpaceSaving two counters per entry, and the linear sketches one counter per entry. We find that Integrated SpaceSaving± is always the best, followed by Double SpaceSaving± (DSS±) or Unbiased Double SpaceSaving± (UDSS±), and Count Sketch always ranks fourth. For instance, when using 16384 counters over the YCSB dataset, the average relative error of Integrated SpaceSaving± is 2.20, DSS± is 2.25, UDSS± is 2.26, Count Sketch is 2.29, and Count-Min is 35.

In identifying the top 100 items, linear sketches are more competitive with our proposed algorithms. As shown in Figure 1(c)-(d), the y-axis is the F1 score and the x-axis is the number of counters. When a larger memory budget is allowed, Integrated SpaceSaving± is always the best. With a space budget of up to 8192 counters, DSS± and UDSS± always outperform the linear sketches. For instance, using 8192 counters over YCSB, DSS± and UDSS± both achieve a 0.95 F1 score, whereas Count Sketch scores 0.91 and Count-Min scores 0.78.

7. Conclusion

This paper presents a detailed analysis of the SpaceSaving± family of algorithms with bounded deletions. We proposed new algorithms built on top of the original SpaceSaving and SpaceSaving±. They exhibit many desirable properties, such as no underestimation, unbiased estimation, residual/relative error guarantees, and mergeability. We believe these characteristics make our proposed algorithms strong candidates for real-world applications and systems.

References

  • Agarwal et al. (2012) Pankaj K Agarwal, Graham Cormode, Zengfeng Huang, Jeff Phillips, Zhewei Wei, and Ke Yi. 2012. Mergeable summaries. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems. 23–34.
  • Agarwal et al. (2013) Pankaj K Agarwal, Graham Cormode, Zengfeng Huang, Jeff M Phillips, Zhewei Wei, and Ke Yi. 2013. Mergeable summaries. ACM Transactions on Database Systems (TODS) 38, 4 (2013), 1–28.
  • Agrawal et al. (2011) Divyakant Agrawal, Philip Bernstein, Elisa Bertino, Susan Davidson, Umeshwas Dayal, Michael Franklin, Johannes Gehrke, Laura Haas, Alon Halevy, Jiawei Han, et al. 2011. Challenges and opportunities with Big Data 2011-1. (2011).
  • Alon et al. (1996) Noga Alon, Yossi Matias, and Mario Szegedy. 1996. The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing. 20–29.
  • Bailis et al. (2017) Peter Bailis, Edward Gan, Samuel Madden, Deepak Narayanan, Kexin Rong, and Sahaana Suri. 2017. Macrobase: Prioritizing attention in fast data. In Proceedings of the 2017 ACM International Conference on Management of Data. 541–556.
  • Ben-Eliezer et al. (2022) Omri Ben-Eliezer, Rajesh Jayaram, David P Woodruff, and Eylon Yogev. 2022. A framework for adversarially robust streaming algorithms. ACM Journal of the ACM (JACM) 69, 2 (2022), 1–33.
  • Berinde et al. (2010) Radu Berinde, Piotr Indyk, Graham Cormode, and Martin J Strauss. 2010. Space-optimal heavy hitters with strong error bounds. ACM Transactions on Database Systems (TODS) 35, 4 (2010), 1–28.
  • Bhattacharyya et al. (2018) Arnab Bhattacharyya, Palash Dey, and David P Woodruff. 2018. An optimal algorithm for l1-heavy hitters in insertion streams and related problems. ACM Transactions on Algorithms (TALG) 15, 1 (2018), 1–27.
  • Bloom (1970) Burton H Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422–426.
  • Braverman et al. (2017) Vladimir Braverman, Stephen R Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P Woodruff. 2017. BPTree: an l2 heavy hitters algorithm using constant memory. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 361–376.
  • Broder (1997) Andrei Z Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171). IEEE, 21–29.
  • Bruch et al. (2023) Sebastian Bruch, Franco Maria Nardini, Amir Ingber, and Edo Liberty. 2023. An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors. arXiv preprint arXiv:2301.10622 (2023).
  • Charikar et al. (2002) Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming. Springer, 693–703.
  • Chen et al. (2023) Wei-Ning Chen, Ayfer Ozgur, Graham Cormode, and Akash Bharadwaj. 2023. The communication cost of security and privacy in federated frequency estimation. In International Conference on Artificial Intelligence and Statistics. PMLR, 4247–4274.
  • Chiosa et al. (2021) Monica Chiosa, Thomas B Preußer, and Gustavo Alonso. 2021. Skt: A one-pass multi-sketch data analytics accelerator. Proceedings of the VLDB Endowment 14, 11 (2021), 2369–2382.
  • Cohen et al. (2022) Edith Cohen, Xin Lyu, Jelani Nelson, Tamás Sarlós, Moshe Shechner, and Uri Stemmer. 2022. On the robustness of countsketch to adaptive inputs. In International Conference on Machine Learning. PMLR, 4112–4140.
  • Cooper et al. (2010) Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing. 143–154.
  • Cormode (2023) Graham Cormode. 2023. Applications of Sketching and Pathways to Impact. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (Seattle, WA, USA) (PODS ’23). Association for Computing Machinery, New York, NY, USA, 5–10. https://doi.org/10.1145/3584372.3589937
  • Cormode and Hadjieleftheriou (2008) Graham Cormode and Marios Hadjieleftheriou. 2008. Finding frequent items in data streams. Proceedings of the VLDB Endowment 1, 2 (2008), 1530–1541.
  • Cormode et al. (2021) Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, and Pavel Veselỳ. 2021. Relative error streaming quantiles. In Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 96–108.
  • Cormode and Muthukrishnan (2005) Graham Cormode and Shan Muthukrishnan. 2005. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55, 1 (2005), 58–75.
  • Cormode and Yi (2020) Graham Cormode and Ke Yi. 2020. Small summaries for big data. Cambridge University Press.
  • Dayan et al. (2017) Niv Dayan, Manos Athanassoulis, and Stratos Idreos. 2017. Monkey: Optimal navigable key-value store. In Proceedings of the 2017 ACM International Conference on Management of Data. 79–94.
  • Demaine et al. (2002) Erik D Demaine, Alejandro López-Ortiz, and J Ian Munro. 2002. Frequency estimation of internet packet streams with limited space. In European Symposium on Algorithms. Springer, 348–360.
  • Duffield et al. (2007) Nick Duffield, Carsten Lund, and Mikkel Thorup. 2007. Priority sampling for estimation of arbitrary subset sums. Journal of the ACM (JACM) 54, 6 (2007), 32–es.
  • Flajolet et al. (2007) Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. Discrete Mathematics & Theoretical Computer Science Proceedings (2007).
  • Gupta and Zane (2003) Anupam Gupta and Francis Zane. 2003. Counting inversions in lists. In SODA, Vol. 3. 253–254.
  • Han et al. (2006) Jiawei Han, Jian Pei, and Hanghang Tong. 2006. Data mining: Concepts and techniques. Morgan Kaufmann.
  • Hassidim et al. (2020) Avinatan Hassidim, Haim Kaplan, Yishay Mansour, Yossi Matias, and Uri Stemmer. 2020. Adversarially robust streaming algorithms via differential privacy. Advances in Neural Information Processing Systems 33 (2020), 147–158.
  • Ivkin et al. (2019) Nikita Ivkin, Edo Liberty, Kevin Lang, Zohar Karnin, and Vladimir Braverman. 2019. Streaming quantiles algorithms with small space and update time. arXiv preprint arXiv:1907.00236 (2019).
  • Jayaram and Woodruff (2018) Rajesh Jayaram and David P Woodruff. 2018. Data streams with bounded deletions. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 341–354.
  • Jiang et al. (2018) Jiawei Jiang, Fangcheng Fu, Tong Yang, and Bin Cui. 2018. Sketchml: Accelerating distributed machine learning with data sketches. In Proceedings of the 2018 International Conference on Management of Data. 1269–1284.
  • Karnin et al. (2016) Zohar Karnin, Kevin Lang, and Edo Liberty. 2016. Optimal quantile approximation in streams. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 71–78.
  • Karp et al. (2003) Richard M Karp, Scott Shenker, and Christos H Papadimitriou. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems (TODS) 28, 1 (2003), 51–55.
  • Lebeda and Tetek (2023) Christian Janos Lebeda and Jakub Tetek. 2023. Better differentially private approximate histograms and heavy hitters using the Misra-Gries sketch. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 79–88.
  • Li et al. (2020) Jizhou Li, Zikun Li, Yifei Xu, Shiqi Jiang, Tong Yang, Bin Cui, Yafei Dai, and Gong Zhang. 2020. Wavingsketch: An unbiased and generic sketch for finding top-k items in data streams. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1574–1584.
  • Li et al. (2014) Yi Li, Huy L Nguyen, and David P Woodruff. 2014. Turnstile streaming algorithms might as well be linear sketches. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing. 174–183.
  • Liu et al. (2023) Zirui Liu, Yixin Zhang, Yifan Zhu, Ruwen Zhang, Tong Yang, Kun Xie, Sha Wang, Tao Li, and Bin Cui. 2023. TreeSensing: Linearly Compressing Sketches with Flexibility. Proceedings of the ACM on Management of Data 1, 1 (2023), 1–28.
  • Manku and Motwani (2002) Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency counts over data streams. In VLDB’02: Proceedings of the 28th International Conference on Very Large Databases. Elsevier, 346–357.
  • Mathieu and de Rougemont (2023) Claire Mathieu and Michel de Rougemont. 2023. Testing frequency distributions in a stream. arXiv preprint arXiv:2309.11175 (2023).
  • Metwally et al. (2005) Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2005. Efficient computation of frequent and top-k elements in data streams. In Database Theory-ICDT 2005: 10th International Conference, Edinburgh, UK, January 5-7, 2005. Proceedings 10. Springer, 398–412.
  • Misra and Gries (1982) Jayadev Misra and David Gries. 1982. Finding repeated elements. Science of computer programming 2, 2 (1982), 143–152.
  • Pagh and Thorup (2022) Rasmus Pagh and Mikkel Thorup. 2022. Improved utility analysis of private countsketch. Advances in Neural Information Processing Systems 35 (2022), 25631–25643.
  • Rothchild et al. (2020) Daniel Rothchild, Ashwinee Panda, Enayat Ullah, Nikita Ivkin, Ion Stoica, Vladimir Braverman, Joseph Gonzalez, and Raman Arora. 2020. Fetchsgd: Communication-efficient federated learning with sketching. In International Conference on Machine Learning. PMLR, 8253–8265.
  • Ting (2018) Daniel Ting. 2018. Data sketches for disaggregated subset sum and frequent item estimation. In Proceedings of the 2018 International Conference on Management of Data. 1129–1140.
  • Wang et al. (2019) Pinghui Wang, Yiyan Qi, Yuanming Zhang, Qiaozhu Zhai, Chenxu Wang, John CS Lui, and Xiaohong Guan. 2019. A memory-efficient sketch method for estimating high similarities in streaming sets. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 25–33.
  • Wang et al. (2024) Yiping Wang, Yanhao Wang, and Cen Chen. 2024. DPSW-Sketch: A Differentially Private Sketch Framework for Frequency Estimation over Sliding Windows (Technical Report). arXiv preprint arXiv:2406.07953 (2024).
  • Xu et al. (2023) Yuchen Xu, Wenfei Wu, Bohan Zhao, Tong Yang, and Yikai Zhao. 2023. MimoSketch: A Framework to Mine Item Frequency on Multiple Nodes with Sketches. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2838–2849.
  • Yang et al. (2018) Tong Yang, Jie Jiang, Peng Liu, Qun Huang, Junzhi Gong, Yang Zhou, Rui Miao, Xiaoming Li, and Steve Uhlig. 2018. Elastic sketch: Adaptive and fast network-wide measurements. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. 561–575.
  • Zakhary et al. (2021) Victor Zakhary, Lawrence Lim, Divy Agrawal, and Amr El Abbadi. 2021. Cache on Track (CoT): Decentralized Elastic Caches for Cloud Environments. In International Conference on Extending Database Technology.
  • Zhao et al. (2021a) Fuheng Zhao, Divyakant Agrawal, Amr El Abbadi, and Ahmed Metwally. 2021a. SpaceSaving±: An Optimal Algorithm for Frequency Estimation and Frequent Items in the Bounded Deletion Model. arXiv preprint arXiv:2112.03462 (2021).
  • Zhao et al. (2023a) Fuheng Zhao, Divyakant Agrawal, Amr El Abbadi, Ahmed Metwally, Claire Mathieu, and Michel de Rougemont. 2023a. Errata for "SpaceSaving±: An Optimal Algorithm for Frequency Estimation and Frequent Items in the Bounded-Deletion Model". Proceedings of the VLDB Endowment 17, 4 (2023), 643–643.
  • Zhao et al. (2023b) Fuheng Zhao, Punnal Ismail Khan, Divyakant Agrawal, Amr El Abbadi, Arpit Gupta, and Zaoxing Liu. 2023b. Panakos: Chasing the Tails for Multidimensional Data Streams. Proceedings of the VLDB Endowment 16, 6 (2023), 1291–1304.
  • Zhao et al. (2021b) Fuheng Zhao, Sujaya Maiyya, Ryan Wiener, Divyakant Agrawal, and Amr El Abbadi. 2021b. KLL± Approximate Quantile Sketches over Dynamic Datasets. Proceedings of the VLDB Endowment 14, 7 (2021), 1215–1227.
  • Zhao et al. (2022) Fuheng Zhao, Dan Qiao, Rachel Redberg, Divyakant Agrawal, Amr El Abbadi, and Yu-Xiang Wang. 2022. Differentially private linear sketches: Efficient implementations and applications. Advances in Neural Information Processing Systems 35 (2022), 12691–12704.
  • Zhao et al. (2023c) Fuheng Zhao, Leron Reznikov, Divyakant Agrawal, and Amr El Abbadi. 2023c. Autumn: A Scalable Read Optimized LSM-tree based Key-Value Stores with Fast Point and Range Read Speed. arXiv preprint arXiv:2305.05074 (2023).
  • Zipf (2016) George Kingsley Zipf. 2016. Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books.