The SpaceSaving± Family of Algorithms for Data Streams with Bounded Deletions
Abstract.
In this paper, we present an advanced analysis of near-optimal algorithms that use limited space to solve the frequency estimation, heavy hitters, frequent items, and approximate top-k problems in the bounded deletion model. We define the family of SpaceSaving± algorithms and explain why the original SpaceSaving± algorithm only works when insertions and deletions are not interleaved. Next, we propose the new Double SpaceSaving±, Unbiased Double SpaceSaving±, and Integrated SpaceSaving± algorithms and prove their correctness. The three proposed algorithms represent different trade-offs: Double SpaceSaving± can be extended to provide unbiased estimations, while Integrated SpaceSaving± uses less space. Since data streams are often skewed, we present an improved analysis of these algorithms and show that the error does not depend on the hot items. We also demonstrate how to achieve relative error guarantees under mild assumptions. Moreover, we establish that all three algorithms satisfy the important mergeability property, which is essential for running the algorithms in distributed settings.
1. Introduction
Streaming algorithms (Cormode and Yi, 2020; Han et al., 2006) aim to process a stream of data in a single pass with small space. In particular, for a stream of updates of items, these algorithms often only store a compact summary which uses space at most logarithmic in the data stream length or in the cardinality of the input domain. The streaming paradigm provides essential analysis and statistical measurements with strong accuracy guarantees for many big data applications (Agrawal et al., 2011). For instance, Hyperloglog (Flajolet et al., 2007) answers queries about cardinality; Bloom Filters (Bloom, 1970) answer membership queries; KLL (Karnin et al., 2016; Ivkin et al., 2019; Zhao et al., 2021b) gives accurate quantile approximations such as finding the median. These data summaries (sketches) are now the cornerstones for many real-world systems and applications including network monitoring (Yang et al., 2018; Zhao et al., 2023b; Xu et al., 2023), storage engines (Dayan et al., 2017; Zhao et al., 2023c), data mining (Wang et al., 2019; Li et al., 2020), federated learning (Rothchild et al., 2020; Jiang et al., 2018), privacy-preserving analytics (Chen et al., 2023; Zhao et al., 2022; Lebeda and Tetek, 2023; Pagh and Thorup, 2022; Wang et al., 2024), vector search (Broder, 1997; Bruch et al., 2023), and many more. We refer the reader to a recent paper (Cormode, 2023) for a more thorough discussion of the applications.
In this paper, we focus on the frequency estimation and heavy hitters problems in the presence of deletions. Given a data stream $\sigma$ of items from some universe $[n]$, the Frequency Estimation problem constructs a summary providing, for every item $x \in [n]$, an estimate $\hat{f}(x)$ of the true frequency $f(x)$. Since resources are limited (small memory footprint), the data summary cannot track every item, and hence there is an inherent trade-off between the space used and the accuracy of the estimate. In the closely related approximate Heavy Hitters (Frequent Items) problem, there is an additional input frequency threshold, and the goal is to identify all items whose frequency exceeds that threshold. This applies to settings where data scientists and system operators are interested in knowing which items are hot (appear very often). Frequency estimation, heavy hitters, and approximate top-k have found many applications in big data systems (Bailis et al., 2017; Zakhary et al., 2021; Liu et al., 2023; Chiosa et al., 2021).
Algorithm | Space | Model | Error Bound | Note |
SpaceSaving / MG (Metwally et al., 2005; Misra and Gries, 1982) | $O(1/\epsilon)$ | Insertion-Only | $\epsilon F$ | Lemma 2.3 |
SpaceSaving, residual analysis (Berinde et al., 2010) | $O(k + 1/\epsilon)$ | Insertion-Only | $\epsilon F^{res(k)}$ | Lemma 2.4 |
SpaceSaving, relative analysis (Mathieu and de Rougemont, 2023) | | Insertion-Only | $\epsilon f(x)$ ($\beta$-decreasing) | Lemma 2.6 |
Count-Min (Cormode and Muthukrishnan, 2005) | $O(\epsilon^{-1} \log n)$ | Turnstile | $\epsilon \lVert f \rVert_1$ | Never underestimates |
CountSketch (Charikar et al., 2002) | $O(\epsilon^{-2} \log n)$ | Turnstile | $\epsilon \lVert f \rVert_2$ | Unbiased |
DoubleSS± & IntegratedSS± | $O(\alpha/\epsilon)$ | Bounded Deletion | $\epsilon F$ | Theorems 3.1 & 3.9 |
Unbiased DoubleSS± | $O(\alpha/\epsilon)$ | Bounded Deletion | Unbiased & Var $\le \epsilon^2 F^2$ | Theorem 3.3 |
DoubleSS± & IntegratedSS± | $O(k + 1/\epsilon)$ | Bounded Deletion | $\epsilon F_{\alpha}^{res(k)}$ | Theorems 4.1 & 4.3 |
DoubleSS± & IntegratedSS± | | Bounded Deletion | $\epsilon f(x)$ ($\beta$-decreasing and $\zeta > 1$ Zipf) | Theorems 4.6 & 4.7 |
1.1. Related Work
There is a large body of algorithms proposed to solve the frequency estimation and heavy hitters problems (Misra and Gries, 1982; Metwally et al., 2005; Charikar et al., 2002; Cormode and Muthukrishnan, 2005; Karp et al., 2003; Demaine et al., 2002; Braverman et al., 2017; Jayaram and Woodruff, 2018; Zhao et al., 2021a, 2022; Manku and Motwani, 2002). The existing algorithms can be classified into three models: insertion-only, turnstile, and bounded-deletion. In the insertion-only model, all arriving stream items are insertions, and the frequency of $x$ is the number of times $x$ has been inserted. SpaceSaving (Metwally et al., 2005) and MG (Misra and Gries, 1982) are two popular algorithms that operate in the insertion-only model; they provide strong deterministic guarantees with minimal space. Moreover, since these algorithms are deterministic, they are inherently adversarially robust (Hasidim et al., 2020; Ben-Eliezer et al., 2022; Cohen et al., 2022). In the turnstile model, arriving stream items can be either insertions or deletions, and the (net) frequency of $x$ is the difference between the number of times $x$ has been inserted and deleted, with the proviso that no item's frequency is negative. Count-Min (Cormode and Muthukrishnan, 2005) and CountSketch (Charikar et al., 2002) are two popular algorithms that operate in the turnstile model. Supporting deletions is a harder task, and as a result Count-Min and CountSketch require more space, with a poly-logarithmic dependence on the input domain size, to offer strong error bounds with high probability (Alon et al., 1996). The bounded deletion model (Jayaram and Woodruff, 2018; Zhao et al., 2021b, a) assumes the number of deletions is bounded by a fraction of the number of insertions. For instance, in standardized testing, if students are only allowed to submit one regrade request on their tests, then the number of deletions is at most half of the number of insertions, where updating a student's grade is a deletion followed by an insertion.
Recently, Zhao et al. (Zhao et al., 2021a) proposed SpaceSaving±, an algorithm extending the original SpaceSaving (Metwally et al., 2005) to handle bounded deletions efficiently, (implicitly) assuming that the data stream has no interleaving between insertions and deletions: all insertions precede all deletions (Zhao et al., 2023a). In addition, Berinde et al. (Berinde et al., 2010) showed that data summaries in the insertion-only model can achieve an even stronger error bound using residual error analysis. The differences among these algorithms and their corresponding error guarantees are listed in Table 1. In this work, we aim to extend the SpaceSaving± family of algorithms to support interleavings between insertions and deletions and to provide a tighter error bound analysis.
1.2. Our Contributions
In this paper, we present a detailed analysis of the SpaceSaving± family of algorithms with bounded deletions. In particular, we propose three new algorithms: Double SpaceSaving±, Unbiased Double SpaceSaving±, and Integrated SpaceSaving±, which operate in the general bounded deletion model (with interleavings of insertions and deletions) and provide strong theoretical guarantees with small space, while having distinct characteristics. In addition, we introduce the residual error bound guarantee in the bounded deletion model and prove that the proposed algorithms satisfy this stronger residual error guarantee. Moreover, we demonstrate that the proposed algorithms also achieve the relative error bound (Mathieu and de Rougemont, 2023; Cormode et al., 2021) under mild assumptions on the skewness of the data stream. Finally, we provide a merge algorithm for our proposed algorithms and prove that all of them are mergeable in the bounded deletion model. Mergeability is critical for distributed applications (Agarwal et al., 2012).
2. Background
2.1. Preliminaries
Consider a data stream consisting of a sequence of $N$ operations (insertions or deletions) on items drawn from a universe $[n]$ of size $n$. We denote the stream by $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_N)$, where each $\sigma_t$ inserts or deletes one item $x_t \in [n]$. Let $I$ denote the total number of insertions and $D$ denote the total number of deletions in the stream, so that $N = I + D$. In the bounded deletion model (Jayaram and Woodruff, 2018; Zhao et al., 2021b), it is assumed that, for some parameter $\alpha \ge 1$, the stream is constrained to satisfy $D \le (1 - 1/\alpha) I$. Observe that when $\alpha = 1$, there can be no deletions, and so the model includes the insertion-only model as a special case. Let $f$ denote the frequency function: the frequency of an item $x$ is $f(x) = I_x - D_x$, where $I_x$ denotes the number of insertions of $x$ in the stream and $D_x$ denotes the number of deletions of $x$ in the stream. In this paper, we use a summary data structure from which we infer an estimation of the frequency function. We use $\hat{f}(x)$ to denote the estimated frequency of $x$.
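As a concrete illustration of the bounded deletion constraint above (the numbers are our own example), take $\alpha = 2$:

```latex
% With \alpha = 2, the constraint D \le (1 - 1/\alpha) I gives
D \le \left(1 - \tfrac{1}{2}\right) I = \tfrac{I}{2},
\qquad
F = I - D \ge I - \tfrac{I}{2} = \tfrac{I}{\alpha}.
% At least half of the inserted mass survives, matching the
% one-regrade-per-test example from Section 1.1.
```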
Let $f_1 \ge f_2 \ge \cdots \ge f_n$ denote the frequencies of the items, sorted in non-increasing order. Similarly, sorting the insertion counts in non-increasing order leads to $f^I_1 \ge f^I_2 \ge \cdots \ge f^I_n$, and sorting the deletion counts in non-increasing order leads to $f^D_1 \ge f^D_2 \ge \cdots \ge f^D_n$. Notice that, for the same $i$, the item corresponding to $f_i$ may differ from the item corresponding to $f^I_i$ and from the item corresponding to $f^D_i$. For the special case of the insertion-only model, the frequency vector and the insertion vector are identical. Moreover, we use $F$ to denote the total frequency resulting from the stream operations, such that $F = \sum_{x \in [n]} f(x) = I - D$. In this paper, we focus on the unit update model (an operation only inserts or deletes one count of an item) and assume that at any point of time, no item frequency is below zero (i.e., items cannot be deleted if they were not previously inserted).
Definition 2.1 (Deterministic Frequency Estimation Problem).
Given a stream $\sigma$ of operations (insertions or deletions) of items from a universe $[n]$, the Frequency Estimation Problem asks for an (implicitly defined) function $\hat{f}$ that, when queried for any element $x \in [n]$, returns a value $\hat{f}(x)$ such that
$|\hat{f}(x) - f(x)| \le \epsilon F$, where $F = \sum_{x \in [n]} f(x)$.
Definition 2.2 (Heavy Hitters problem).
Given a parameter $\phi \in (0, 1)$, a heavy hitter is an item $x$ with frequency $f(x) \ge \phi F$. Given a stream of operations (insertions or deletions) of items from a universe $[n]$, the heavy hitters problem asks for an (explicitly defined) subset of $[n]$ that contains all heavy hitters.
All of our streaming algorithms use a summary of the information seen in the stream so far, consisting of a collection of items and some counts for each of them. The size of the summary is much smaller than $n$, and we say that an item is monitored if it is present in the summary.
2.2. The SpaceSaving± Family of Algorithms
The SpaceSaving± family of algorithms are data summaries that build on top of the original SpaceSaving (Metwally et al., 2005) algorithm. In 2005, Metwally, Agrawal, and El Abbadi (Metwally et al., 2005) proposed the popular SpaceSaving algorithm that provides highly accurate estimates of item frequencies and heavy hitters among many competitors (Cormode and Hadjieleftheriou, 2008). The SpaceSaving algorithm operates in the insertion-only model (so that the stream is simply a sequence of items from $[n]$) and uses the optimal space $O(1/\epsilon)$ (Berinde et al., 2010). The update algorithm, as shown in Algorithm 1 (Metwally et al., 2005), uses $m$ counters to store a monitored item's identity and its estimated frequency. When a new item $x$ arrives: if $x$ is monitored, then increment its counter; if $x$ is not monitored and the summary has unused counters, then start to monitor $x$ and set $\hat{f}(x)$ to 1; otherwise, i) find $e$, the item with the minimum counter value ($min$), ii) evict this $e$ from the summary, and iii) monitor the new item $x$ and set $\hat{f}(x)$ to $min + 1$. See Algorithm 1.
After the stream has been read, the SpaceSaving Update algorithm has produced a summary. To solve the Frequency Estimation Problem, the algorithm then needs to estimate the frequency of an item being queried: the SpaceSaving Query algorithm returns $\hat{f}(x)$ if $x$ is monitored and 0 otherwise, as shown in Algorithm 2. (To solve the heavy hitters problem, the algorithm outputs all items in the summary whose count is greater than or equal to $\phi$ times the sum of the counts.)
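To make Algorithms 1 and 2 concrete, here is a minimal Python sketch of SpaceSaving (the class name and the dictionary-based layout are ours, chosen for clarity; production implementations use the constant-time stream-summary structure of Metwally et al.):

```python
# A minimal sketch of SpaceSaving update (Algorithm 1) and query (Algorithm 2)
# in the insertion-only model. Dictionary-based for readability.
class SpaceSaving:
    def __init__(self, m):
        self.m = m            # number of counters
        self.counters = {}    # monitored item -> estimated frequency

    def insert(self, x):
        if x in self.counters:              # monitored: increment its counter
            self.counters[x] += 1
        elif len(self.counters) < self.m:   # unused counter: start monitoring x
            self.counters[x] = 1
        else:                               # full: evict the minimum counter
            e, c_min = min(self.counters.items(), key=lambda kv: kv[1])
            del self.counters[e]
            self.counters[x] = c_min + 1    # new item inherits min + 1

    def query(self, x):
        return self.counters.get(x, 0)      # 0 if x is not monitored
```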
When the number of counters (the size of the summary) is $m \ge 1/\epsilon$, SpaceSaving solves both the frequency estimation and heavy hitters problems in the insertion-only model, as shown in Lemma 2.3. Moreover, SpaceSaving satisfies a more refined residual error bound, shown in Lemma 2.4, in which "hot" items do not contribute error and the error bound only depends on the "cold" and "warm" items (Berinde et al., 2010).
The error guarantee is shown in Lemma 2.3: when $m \ge 1/\epsilon$, SpaceSaving ensures $|\hat{f}(x) - f(x)| \le \epsilon F$.
Lemma 2.3.
(Metwally et al., 2005) After processing an insertion-only data stream $\sigma$, SpaceSaving with $m$ counters ensures, for every item $x$, $0 \le \hat{f}(x) - f(x) \le min \le F/m$.
The residual error guarantee is shown in Lemma 2.4.
Lemma 2.4.
(Berinde et al., 2010) After processing an insertion-only data stream $\sigma$, SpaceSaving with $m > k$ counters ensures $|\hat{f}(x) - f(x)| \le F^{res(k)}/(m - k)$, where $F^{res(k)} = F - \sum_{i=1}^{k} f_i$.
The relative error guarantee is shown in Lemma 2.6.
Definition 2.5 ($\beta$-decreasing).
Let $\beta > 1$; then a non-increasing function $f$ with domain $[n]$ is $\beta$-decreasing if, for all $t$ such that $2t \le n$, $f(2t) \le f(t)/\beta$.
Lemma 2.6.
(Mathieu and de Rougemont, 2023) After processing an insertion-only data stream whose frequency function is $\beta$-decreasing with $\beta > 2$, SpaceSaving achieves the relative error guarantee $|\hat{f}(x) - f(x)| \le \epsilon f(x)$ for the highest-frequency items, with a number of counters depending on $\beta$ and $\epsilon$ (see (Mathieu and de Rougemont, 2023) for the precise space bound).
In 2021, Zhao et al. (Zhao et al., 2021a) proposed SpaceSaving± to extend the SpaceSaving algorithm from the insertion-only model to the bounded deletion model. In that model, they established a space lower bound of $\Omega(\alpha/\epsilon)$ counters required to solve the frequency estimation and heavy hitters problems, and proposed an algorithm, SpaceSaving± (see Algorithm 3), whose space matches the lower bound. The update algorithm is shown in Algorithm 3, in which insertions are dealt with as in Algorithm 1, deletions of monitored items decrement the corresponding counters, and deletions of unmonitored items are ignored. The query algorithm is identical to Algorithm 2 (there are a couple of variants of Algorithm 2; we adopt the most commonly used approach).
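On top of the sketch above, the SpaceSaving± update rule of Algorithm 3 only adds a deletion case; a minimal sketch (the class name is ours):

```python
# A sketch of the SpaceSaving± update rule (Algorithm 3): insertions are handled
# exactly as in SpaceSaving; deletions decrement a monitored item's counter and
# are ignored for unmonitored items. Builds on the SpaceSaving class above.
class SpaceSavingPM(SpaceSaving):
    def delete(self, x):
        if x in self.counters:
            self.counters[x] -= 1   # deletions of unmonitored items are ignored
```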
However, the SpaceSaving± algorithm is only correct if an implicit assumption holds (Zhao et al., 2023a), namely, that deletions can only happen after all the insertions, so that interleaving insertions and deletions is not allowed, as shown in Lemma 2.7. In this paper, we present new algorithms in the general bounded deletion model that allow interleavings between insertions and deletions while still using only $O(\alpha/\epsilon)$ space.
Lemma 2.7 corresponds to Theorem 2 in (Zhao et al., 2021a): consider the bounded deletion model and assume, in addition, that the stream consists of an insertion phase with $I$ insertions followed by a deletion phase with $D$ deletions (no interleaving); then SpaceSaving± with $O(\alpha/\epsilon)$ counters ensures $|\hat{f}(x) - f(x)| \le \epsilon F$ for every item $x$.
3. New Algorithms
In this section, we propose new algorithms supporting the bounded deletion model (with interleavings between insertions and deletions, while an item's frequency is never negative). Double SpaceSaving± and Integrated SpaceSaving± both use $O(\alpha/\epsilon)$ space to solve the deterministic frequency estimation and heavy hitters problems in the bounded deletion model. We extend Double SpaceSaving± to support unbiased estimations. Also, we establish the correctness proofs for these algorithms. While these algorithms share similar characteristics, they represent different trade-offs. In particular, Double SpaceSaving± uses two disjoint summaries, one to track insertions and the other to track deletions, while Integrated SpaceSaving± uses a single summary to track both deletions and insertions.
3.1. Double SpaceSaving
In Double SpaceSaving±, we employ two SpaceSaving summaries to track insertions ($S_I$) and deletions ($S_D$) separately, using $m_I$ and $m_D$ counters respectively (the algorithm is parameterized by the choice of $m_I$ and $m_D$). As shown in Algorithm 4, Double SpaceSaving± completely partitions the update operations into two independent summaries. To query for the frequency of an item $x$, both summaries need to be queried, and the frequency of item $x$ is estimated as $\hat{f}(x) = \hat{f}_I(x) - \hat{f}_D(x)$, as shown in Algorithm 5.
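A minimal sketch of Algorithms 4 and 5, reusing the SpaceSaving class from Section 2.2 (the class name is ours; the counter sizes follow the choices in the proof of Theorem 3.1):

```python
# A sketch of Double SpaceSaving± (Algorithms 4 and 5): two independent
# SpaceSaving summaries, one fed by insertions and one by deletions.
class DoubleSpaceSavingPM:
    def __init__(self, m_I, m_D):
        self.S_I = SpaceSaving(m_I)   # tracks insertions
        self.S_D = SpaceSaving(m_D)   # tracks deletions

    def insert(self, x):
        self.S_I.insert(x)

    def delete(self, x):
        self.S_D.insert(x)            # a deletion is an insertion into S_D

    def query(self, x):               # Algorithm 5: combine both summaries
        return self.S_I.query(x) - self.S_D.query(x)
```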
Theorem 3.1.
After processing an $\alpha$-bounded deletion data stream $\sigma$, Double SpaceSaving± with $m_I = 2\alpha/\epsilon$ and $m_D = 2(\alpha-1)/\epsilon$ counters ($O(\alpha/\epsilon)$ space) solves the deterministic frequency estimation problem: for every item $x$, $|\hat{f}(x) - f(x)| \le \epsilon F$.
Proof.
For every item $x$ in the universe, the frequency estimation error is upper bounded by $I/m_I + D/m_D$, due to the property of SpaceSaving shown in Lemma 2.3 and the triangle inequality. In addition, from the definition of bounded deletion, we can derive the relationship between $I$, $D$, and $F$. Namely, $I \le \alpha F$ and $D \le (\alpha - 1) F$. Set $m_I = 2\alpha/\epsilon$ and $m_D = 2(\alpha-1)/\epsilon$; then $I/m_I + D/m_D \le \epsilon F/2 + \epsilon F/2 = \epsilon F$. Hence, Double SpaceSaving± solves the deterministic frequency estimation problem in the bounded deletion model with $O(\alpha/\epsilon)$ space. ∎
Theorem 3.2.
Double SpaceSaving± solves the heavy hitters problem in the bounded deletion model using $O(\alpha/\epsilon)$ space, by reporting all items monitored in the insertion summary $S_I$.
Proof.
Assume a heavy hitter $x$ is not monitored in the insertion summary $S_I$. By the definition of heavy hitters, $f(x) \ge \phi F > \epsilon F$ (for $\phi > \epsilon$). Since $x$ is not monitored in $S_I$, its estimated frequency is at most 0. As a result, the estimation error for item $x$ is larger than $\epsilon F$, which contradicts Theorem 3.1.
Hence, by contradiction, Double SpaceSaving± solves the heavy hitters problem. ∎
Being more careful and using variants of Algorithm 2 would lead to improvements but would not change the theoretical bound of Theorem 3.2.
3.1.1. Unbiased Double SpaceSaving
Unbiased estimation is a desirable property for robust estimation (Charikar et al., 2002; Duffield et al., 2007). Research has shown that SpaceSaving can be extended to provide unbiased estimations (i.e., $\mathbb{E}[\hat{f}(x)] = f(x)$) (Ting, 2018). The extension is obtained from Algorithm 1 by the following simple modification: after line 8 of the code, with probability $1/(min+1)$ lines 9 and 10 are executed; with the complementary probability $1 - 1/(min+1)$, instead of lines 9 and 10 the algorithm simply increments the minimum counter while keeping the currently monitored item. This means that we can replace the two independent deterministic SpaceSaving summaries in Algorithm 5 with two independent unbiased SpaceSaving summaries to support unbiased frequency estimation in the bounded deletion model.
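A sketch of this probabilistic eviction applied to the SpaceSaving class above (the class name is ours; the rule follows Ting's modification as just described):

```python
import random

# Unbiased SpaceSaving: when the summary is full and an unmonitored item
# arrives, the minimum counter is always incremented, but the new item takes
# over that counter only with probability 1 / (c_min + 1).
class UnbiasedSpaceSaving(SpaceSaving):
    def insert(self, x):
        if x in self.counters:
            self.counters[x] += 1
        elif len(self.counters) < self.m:
            self.counters[x] = 1
        else:
            e, c_min = min(self.counters.items(), key=lambda kv: kv[1])
            if random.random() < 1.0 / (c_min + 1):   # evict and replace
                del self.counters[e]
                self.counters[x] = c_min + 1
            else:                                     # keep the old identity
                self.counters[e] = c_min + 1
```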
Theorem 3.3.
After processing an $\alpha$-bounded deletion data stream $\sigma$, Unbiased Double SpaceSaving± of space $O(\alpha/\epsilon)$, with two independent unbiased SpaceSaving summaries in Algorithm 5, provides unbiased frequency estimation, and the estimation variance is upper bounded by $\epsilon^2 F^2$.
Proof.
The variance of an Unbiased SpaceSaving estimate is upper bounded by the square of the minimum counter value (Ting, 2018), which is at most $(I/m_I)^2$ for the insertion summary and $(D/m_D)^2$ for the deletion summary. We know that $\hat{f}(x) = \hat{f}_I(x) - \hat{f}_D(x)$, where $\hat{f}_I$ is the estimation of inserts and $\hat{f}_D$ is the estimation of deletes. Since the two summaries are independent, the covariance is 0; hence, the variance of $\hat{f}(x)$ is upper bounded by the sum of the variances of $\hat{f}_I(x)$ and $\hat{f}_D(x)$. In Theorem 3.1, we proved that $I/m_I + D/m_D \le \epsilon F$. As a result, the variance of the unbiased estimation for any item's frequency is upper bounded by $(I/m_I)^2 + (D/m_D)^2 \le (I/m_I + D/m_D)^2 \le \epsilon^2 F^2$.
∎
3.2. Integrated SpaceSaving
Unlike Double SpaceSaving±, which uses two independent summaries, Integrated SpaceSaving± completely integrates deletions and insertions in one data structure, as shown in Algorithm 6. The main difference between Integrated SpaceSaving± and SpaceSaving± is that Integrated SpaceSaving± tracks the insert and delete counts separately. This ensures that the minimum insert count monotonically increases and never decreases; in contrast, the original SpaceSaving± uses one single count to track inserts and deletes together. When insertions and deletions are interleaved, the minimum count in SpaceSaving± may decrease. This can lead to severe underestimation of a frequent item (Zhao et al., 2023a).
More specifically, the Integrated SpaceSaving± data structure maintains a record estimating the number of insertions and deletions for each monitored item. When a new stream operation on item $x$ arrives, if $x$ is already being monitored, then depending on the type of operation, either the insertion or the deletion count of $x$ is incremented. If the summary is not full, then no item could ever have been evicted from the summary, and since no delete operation arrives before the corresponding item is inserted, any arriving operation on an unmonitored item must be an insertion. Hence, when a new item arrives and the summary is not full, the new item is added to the summary and initialized with one insertion and zero deletions. If a delete arrives for an unmonitored item, the delete is simply ignored. The interesting case is when an insert of an unmonitored item arrives and the summary is full. In this case, Integrated SpaceSaving± mimics the behaviour of the original SpaceSaving by evicting the element with the least number of inserts and replacing it with the new element. The insertion count of the new element is overestimated by initializing it to the insertion count of the evicted element plus one, while its deletion count is underestimated by initializing it to zero.
To estimate an item's frequency, if the item is monitored, Integrated SpaceSaving± subtracts the item's delete count from the item's insert count; otherwise, the estimated frequency is 0, as shown in Algorithm 7.
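A minimal sketch of Algorithms 6 and 7 (dictionary-based; the class name is ours):

```python
# Integrated SpaceSaving±: each entry keeps a separate insert count and delete
# count; eviction compares and inherits insert counts only, so the minimum
# insert count never decreases.
class IntegratedSpaceSavingPM:
    def __init__(self, m):
        self.m = m
        self.entries = {}   # item -> [insert_count, delete_count]

    def insert(self, x):
        if x in self.entries:
            self.entries[x][0] += 1
        elif len(self.entries) < self.m:
            self.entries[x] = [1, 0]
        else:   # full: evict the entry with the minimum insert count
            e, (ci_min, _) = min(self.entries.items(), key=lambda kv: kv[1][0])
            del self.entries[e]
            self.entries[x] = [ci_min + 1, 0]   # inherit inserts, reset deletes

    def delete(self, x):
        if x in self.entries:
            self.entries[x][1] += 1   # deletions of unmonitored items are ignored

    def query(self, x):               # Algorithm 7
        if x in self.entries:
            ci, cd = self.entries[x]
            return ci - cd
        return 0
```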
We now establish the correctness of Integrated SpaceSaving±. First, we establish several lemmas about Integrated SpaceSaving± to help us prove the correctness of the algorithm.
Lemma 3.4.
After processing an $\alpha$-bounded deletion data stream $\sigma$, Integrated SpaceSaving± following update Algorithm 6 ensures that the sum of all insert counts in the summary equals $I$.
Proof.
This holds by induction over time, as verified by inspection of the algorithm: every insertion increases exactly one insert count by one, either by incrementing a monitored item's insert count or by replacing the evicted minimum $min_I$ with $min_I + 1$. ∎
Lemma 3.5.
After processing an $\alpha$-bounded deletion data stream $\sigma$, Integrated SpaceSaving± with $m$ counters following update Algorithm 6 ensures that the minimum insert count, $min_I$, is less than or equal to $I/m$.
Proof.
Since deletions never increment any insert counts, the insert counts are only affected by insertions. We observe that the sum of all insert counts in the summary is exactly equal to $I$, since the sum of all insert counts increments by 1 after processing one insertion (Lemma 3.4). As a result, $min_I$ is maximized when all insert counts are equal. Hence, $min_I$ is less than or equal to $I/m$. ∎
Lemma 3.6.
The frequencies of monitored items in Integrated SpaceSaving± are never underestimated by query Algorithm 7.
Proof.
The insert counts of the summary are maintained exactly as in the SpaceSaving algorithm run on the substream of $\sigma$ consisting of the insertion operations only. By the analysis of SpaceSaving, it is known that for an item $x$ in the summary, the insert count $c_I(x)$ is an overestimate of the true number of insertions $I_x$ since the beginning of the stream.
The delete count $c_D(x)$ of element $x$ in the summary is equal to the number of deletions of item $x$ since the last time that it was inserted into the summary, so it is an underestimate of the number of deletions $D_x$ since the beginning of the stream.
Thus, for any $x$ in the summary, $\hat{f}(x) = c_I(x) - c_D(x) \ge I_x - D_x = f(x)$ is an overestimate. ∎
Lemma 3.7.
In Integrated SpaceSaving±, any item inserted more than $I/m$ times must be monitored in the summary; moreover, for any item $x$ in the summary, its insert count $c_I(x)$ exceeds its true number of insertions $I_x$ by at most $min_I \le I/m$.
Proof.
The insert counts of the summary are maintained exactly as in the SpaceSaving algorithm run on the substream of $\sigma$ consisting of the insertion operations only, and both properties are well known to hold for SpaceSaving (Metwally et al., 2005; Berinde et al., 2010). ∎
Lemma 3.8.
After processing an $\alpha$-bounded deletion data stream $\sigma$, Integrated SpaceSaving± with $m$ counters ensures that, for every item $x$, $|\hat{f}(x) - f(x)| \le I/m$.
Proof.
First consider the case when $x$ is unmonitored, so the estimate is $\hat{f}(x) = 0$. Its true frequency is at most its number of insertions $I_x$, and by Lemma 3.7 that is bounded by $I/m$; hence the Lemma holds for $x$.
Now, consider the case when $x$ is in the summary, and consider the last time $t$ at which $x$ was inserted into the summary. Let $i_t$ and $d_t$ denote the number of insertions and of deletions of $x$ since time $t$, and let $min_t$ denote the value of $min_I$ at time $t$. Since the algorithm updates $c_I(x)$ and $c_D(x)$ correctly while $x$ is in the summary, the query algorithm outputs the estimate $\hat{f}(x) = min_t + i_t - d_t$. On the other hand, the frequency of $x$ equals the net number of occurrences of $x$ prior to time $t$, plus $i_t - d_t$. By Lemma 3.7, the net number of occurrences of $x$ prior to time $t$ is at most $min_t \le I/m$, so $0 \le \hat{f}(x) - f(x) \le I/m$, implying the Lemma. ∎
Theorem 3.9.
After processing an $\alpha$-bounded deletion data stream $\sigma$, Integrated SpaceSaving± with $m = \alpha/\epsilon$ counters ($O(\alpha/\epsilon)$ space) solves the deterministic frequency estimation problem: for every item $x$, $|\hat{f}(x) - f(x)| \le \epsilon F$.
Proof.
By Lemma 3.8, the estimation error of every item is at most $I/m$. By the definition of bounded deletion, $I \le \alpha F$. Hence, with $m = \alpha/\epsilon$ counters, the error is at most $\epsilon F$. ∎
Integrated SpaceSaving± also solves the heavy hitters problem. Since Integrated SpaceSaving± never underestimates monitored items (Lemma 3.6), reporting all items with frequency estimations greater than or equal to $\epsilon F$ identifies all heavy hitters, as shown in Theorem 3.10.
Theorem 3.10.
Integrated SpaceSaving± solves the heavy hitters problem in the bounded deletion model using $O(\alpha/\epsilon)$ space.
Proof.
Assume a heavy hitter $x$ is not contained in the reported set of all items with estimated frequency larger than or equal to $\epsilon F$. Recall, by the definition of heavy hitters, $f(x) \ge \phi F > \epsilon F$ (for $\phi > \epsilon$). Since $x$ is not reported, its frequency estimation $\hat{f}(x)$ must be less than $\epsilon F < f(x)$, i.e., $x$ is underestimated. However, a heavy hitter must be monitored, since otherwise, by Lemma 3.7, its number of insertions and hence its frequency would be at most $I/m \le \epsilon F$; and by Lemma 3.6, Integrated SpaceSaving± never underestimates the frequency of monitored items.
Hence, by contradiction, Integrated SpaceSaving± solves the deterministic heavy hitters problem. ∎
3.3. Comparisons between Double SpaceSaving and Integrated SpaceSaving
We assume each field uses the same number of bits (32 bits in our implementation). Following the proof of Theorem 3.9, and since each entry in the summary has three fields (item, insert count, delete count), Integrated SpaceSaving± uses $3 \cdot \alpha/\epsilon$ fields. On the other hand, following the proof of Theorem 3.1, and since each entry in the summary uses two fields (item, count), Double SpaceSaving± uses $2 \cdot (2\alpha/\epsilon + 2(\alpha-1)/\epsilon)$ fields. Hence, Integrated SpaceSaving± requires a smaller memory footprint. Double SpaceSaving± has its own advantage: it can be extended to support unbiased estimations (Unbiased Double SpaceSaving±).
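For concreteness, a worked instance of this comparison with $\alpha = 2$ and $\epsilon = 0.01$ (illustrative values of our choosing):

```latex
% Integrated SpaceSaving±: m = \alpha/\epsilon entries, 3 fields each.
3 \cdot \frac{\alpha}{\epsilon} = 3 \cdot 200 = 600 \ \text{fields}
% Double SpaceSaving±: m_I = 2\alpha/\epsilon and m_D = 2(\alpha-1)/\epsilon
% entries, 2 fields each.
2 \left( \frac{2\alpha}{\epsilon} + \frac{2(\alpha - 1)}{\epsilon} \right)
  = 2\,(400 + 200) = 1200 \ \text{fields}
```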
4. Tighter Analysis on Error Bounds
In this section, we present a tighter analysis of the proposed algorithms. In Section 3, we showed that the proposed algorithms achieve the desired error bound $\epsilon F$, in which the error bound depends on all items in the data stream $\sigma$. In this section, we first prove that both algorithms guarantee the residual error bound, and then prove that they both also ensure the relative error bound under relatively mild assumptions. We take inspiration from (Berinde et al., 2010; Metwally et al., 2005; Zhao et al., 2021a; Mathieu and de Rougemont, 2023).
4.1. Residual Error Bound
We define $F_\alpha^{res(k)} = I^{res(k)} + D^{res(k)}$, where $I^{res(k)} = I - \sum_{i=1}^{k} f^I_i$ and $D^{res(k)} = D - \sum_{i=1}^{k} f^D_i$. Observe that $F_\alpha^{res(k)}$ is equivalent to the residual quantity $F^{res(k)}$ of Lemma 2.4 for the insertion-only model, as $\alpha$ is set to 1 (in which case $D^{res(k)} = 0$ and $f^I_i = f_i$). For skewed streams, the residual error bound ($\epsilon F_\alpha^{res(k)}$) is much tighter than the original error bound ($\epsilon F$), since the residual error depends less on the heavy hitters.
First, we derive the residual error bound for Double SpaceSaving±.
Theorem 4.1.
Double SpaceSaving± with $m_I = m_D = k + 1/\epsilon$ counters achieves the residual error bound guarantee, in which $|\hat{f}(x) - f(x)| \le \epsilon F_\alpha^{res(k)}$ for every item $x$, where $F_\alpha^{res(k)} = I^{res(k)} + D^{res(k)}$.
Proof.
First, let us define the maximum error for Double SpaceSaving± as $E$. We know that $E \le I^{res(k)}/(m_I - k) + D^{res(k)}/(m_D - k)$ by Lemma 2.4 applied to each summary and the triangle inequality. By the definition of the bounded deletion model, $I \le \alpha F$ and $D \le (\alpha - 1)F$. We also know that $\sum_{i=1}^{k} f^I_i \ge \sum_{i=1}^{k} f_i$ (this is because the sum of the $k$ largest insert counts has to be larger than or equal to the sum of the $k$ largest frequencies), and hence $I^{res(k)} \le D + F^{res(k)} \le (\alpha - 1)F + F^{res(k)}$; together with $D^{res(k)} \le D \le (\alpha - 1)F$, this gives $F_\alpha^{res(k)} \le F^{res(k)} + 2(\alpha - 1)F$, which recovers the insertion-only residual quantity when $\alpha = 1$. Let $m_I = k + 1/\epsilon$ and $m_D = k + 1/\epsilon$. Then, $E \le \epsilon (I^{res(k)} + D^{res(k)}) = \epsilon F_\alpha^{res(k)}$. ∎
We now present the residual error bound for Integrated SpaceSaving±. Before presenting the proof, we first establish a helpful Lemma.
Lemma 4.2.
The minimum insert count ($min_I$) in Integrated SpaceSaving± with $m$ counters is less than or equal to $I^{res(k)}/(m - k)$, where $k < m$.
Proof.
The sum of all insert counts equals $I$, as the sum of all insert counts always increases by 1 after processing an insertion (Lemma 3.4). If we zoom in on the $k$ largest insert counts in the summary, the sum of their insert counts is no less than $\sum_{i=1}^{k} f^I_i$, since no monitored item's count is underestimated, as shown in Lemma 3.6. As a result, the remaining $m - k$ insert counts sum to at most $I - \sum_{i=1}^{k} f^I_i = I^{res(k)}$, so $min_I \le I^{res(k)}/(m - k)$. We know that, by the definition of bounded deletion, $I \le \alpha F$. Hence, $min_I \le I^{res(k)}/(m - k) \le \alpha F/(m - k)$. ∎
Theorem 4.3.
Integrated SpaceSaving± with $m = k + 1/\epsilon$ counters achieves the residual error bound guarantee, in which $|\hat{f}(x) - f(x)| \le \epsilon I^{res(k)} \le \epsilon F_\alpha^{res(k)}$ for every item $x$. This follows from Lemma 4.2, since the estimation error of every item is at most $min_I$ (see the proof of Lemma 3.8).
4.2. Relative Error Bound
The relative error guarantee is defined such that, for every item $x$, $|\hat{f}(x) - f(x)| \le \epsilon f(x)$. The relative error has been shown to be much harder to achieve, but it is of practical importance (Gupta and Zane, 2003; Cormode et al., 2021). In a recent work, Mathieu and de Rougemont (Mathieu and de Rougemont, 2023) showed that the original SpaceSaving algorithm achieves relative error guarantees under mild assumptions. Accordingly, we demonstrate that the proposed algorithms in the bounded deletion model also achieve relative error guarantees under similar assumptions.
Lemma 4.4.
Let $f$ be $\beta$-decreasing with $\beta > 2$. Then $\sum_{i > t} f(i) \le \frac{\beta}{\beta - 2}\, t\, f(t)$, and hence $F^{res(t)} \le \frac{\beta}{\beta - 2}\, t\, f(t)$.
Proof.
We omit the proof; it is proven in Lemma 5 of (Mathieu and de Rougemont, 2023). ∎
Definition 4.5 (Zipfian Distribution (Zipf, 2016)).
Let a function $f$ with domain $[n]$ follow the Zipf distribution with parameter $\zeta$; then $f(i) = \frac{F}{i^{\zeta}\, H_{n,\zeta}}$, where $H_{n,\zeta} = \sum_{j=1}^{n} j^{-\zeta}$ and $F = \sum_{i=1}^{n} f(i)$.
By the definition of the Zipf distribution, we know that $H_{n,\zeta}$ converges to a small constant when $\zeta > 1$ and $n \to \infty$.
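This convergence is easy to check numerically (an illustrative script; the choice $\zeta = 1.5$ is ours):

```python
# H_{n,zeta} = sum_{j=1}^{n} j^{-zeta} approaches the Riemann zeta value
# (about 2.612 for zeta = 1.5) as n grows, so it is a small constant.
def H(n, zeta):
    return sum(j ** -zeta for j in range(1, n + 1))

for n in (10**3, 10**5, 10**6):
    print(n, round(H(n, zeta=1.5), 4))
```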
Theorem 4.6.
Assuming the insertion and deletion counts in the $\alpha$-bounded deletion model follow the $\beta$-decreasing property ($\beta > 2$) and the exact frequencies follow the Zipf distribution with parameter $\zeta > 1$, Double SpaceSaving± with $m_I = m_D = k + 2c\,\alpha H_{n,\zeta} k^{\zeta}/\epsilon$ counters, where $c = \beta/(\beta - 2)$, achieves the relative error bound guarantee $|\hat{f}(x) - f(x)| \le \epsilon f(x)$ for every item $x$ of rank at most $k$.
Proof.
Recall that the maximum estimation error is upper bounded by $E \le I^{res(k)}/(m_I - k) + D^{res(k)}/(m_D - k)$. Since the insertion and deletion counts follow the $\beta$-decreasing property, Lemma 4.4 gives $I^{res(k)} \le c\,k\,f^I_k$ and $D^{res(k)} \le c\,k\,f^D_k$, with $c = \beta/(\beta - 2)$. Hence, $E \le c\,k\,f^I_k/(m_I - k) + c\,k\,f^D_k/(m_D - k)$.
For insertions, we know i) $k\,f^I_k \le \sum_{i=1}^{k} f^I_i \le I$, ii) $I \le \alpha F$, and iii) $F = H_{n,\zeta}\, k^{\zeta} f_k$ by the Zipf assumption; for deletions, we know i) $k\,f^D_k \le \sum_{i=1}^{k} f^D_i \le D$, ii) $D \le (\alpha - 1) F$, and iii) the same Zipf identity.
Hence, $f^I_k \le \alpha H_{n,\zeta}\, k^{\zeta - 1} f_k$ and $f^D_k \le \alpha H_{n,\zeta}\, k^{\zeta - 1} f_k$. Let $m_I = m_D = k + 2c\,\alpha H_{n,\zeta} k^{\zeta}/\epsilon$; then $E \le \epsilon f_k$, and hence for all items $x$ of rank at most $k$, $|\hat{f}(x) - f(x)| \le \epsilon f(x)$.
∎
Theorem 4.7.
Assuming the insertion counts in the $\alpha$-bounded deletion model follow the $\beta$-decreasing property ($\beta > 2$) and the exact frequencies follow the Zipf distribution with parameter $\zeta > 1$, Integrated SpaceSaving± with $m = k + c\,\alpha H_{n,\zeta} k^{\zeta}/\epsilon$ counters, where $c = \beta/(\beta - 2)$, achieves the relative error bound guarantee $|\hat{f}(x) - f(x)| \le \epsilon f(x)$ for every item $x$ of rank at most $k$.
Proof.
We know that the estimation error is at most $min_I$, and $min_I \le I^{res(k)}/(m - k)$ by Lemma 4.2. Since the insertion counts follow the $\beta$-decreasing property, $I^{res(k)} \le c\,k\,f^I_k$ with $c = \beta/(\beta - 2)$ by Lemma 4.4. Hence, the error is at most $c\,k\,f^I_k/(m - k)$. We know i) $k\,f^I_k \le I$, ii) $I \le \alpha F$, and iii) $F = H_{n,\zeta}\, k^{\zeta} f_k$ by the Zipf assumption. Hence, the error is at most $c\,\alpha H_{n,\zeta}\, k^{\zeta} f_k/(m - k)$. Let $m = k + c\,\alpha H_{n,\zeta} k^{\zeta}/\epsilon$; then the error is at most $\epsilon f_k$, and hence for all items $x$ of rank at most $k$, $|\hat{f}(x) - f(x)| \le \epsilon f(x)$. ∎
5. Mergeability
Mergeability is a desirable property in distributed settings: given two summaries built over two data sets, the merged summary must have error and size guarantees equivalent to those of a single summary that processed the union of the two data sets.
Definition 5.1 (Mergeable (Agarwal et al., 2013)).
A summary is mergeable if there exists a merge algorithm that produces a summary $S(\sigma_1 \uplus \sigma_2, \epsilon)$ from two input summaries $S(\sigma_1, \epsilon)$ and $S(\sigma_2, \epsilon)$. Note that the size of all three summaries must be the same while achieving the same error guarantee $\epsilon$.
Both the original SpaceSaving and the Unbiased SpaceSaving have been shown to be mergeable (Agarwal et al., 2013; Ting, 2018). Since Double SpaceSaving± relies on two independent SpaceSaving summaries, Double SpaceSaving± is also mergeable.
For Integrated SpaceSaving±, to define a merge algorithm, we take inspiration from (Agarwal et al., 2013). The merge algorithm is shown in Algorithm 8 and sketched below. Given two Integrated SpaceSaving± summaries of $m$ counters, $S_1$ and $S_2$, we first union the two summaries (i.e., if both summaries monitor item $x$, then $x$'s insert and delete counts in the merged summary are the sums of $x$'s insert and delete counts from $S_1$ and $S_2$). After the union, the merged summary contains at most $2m$ counters. The merged summary then consists of the $m$ items with the largest insert counts among these $2m$ counters.
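A minimal sketch of Algorithm 8, reusing the IntegratedSpaceSavingPM class from Section 3.2 (the function name is ours):

```python
# Merge two Integrated SpaceSaving± summaries of m counters each: union the
# per-item insert/delete counts, then keep the m entries with the largest
# insert counts, so that no kept item is underestimated.
def merge(s1, s2):
    merged = IntegratedSpaceSavingPM(s1.m)
    union = {}
    for s in (s1, s2):
        for x, (ci, cd) in s.entries.items():
            ci0, cd0 = union.get(x, (0, 0))
            union[x] = (ci0 + ci, cd0 + cd)
    top = sorted(union.items(), key=lambda kv: kv[1][0], reverse=True)[:s1.m]
    merged.entries = {x: [ci, cd] for x, (ci, cd) in top}
    return merged
```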
Theorem 5.2.
Integrated SpaceSaving± summaries with $m$ counters are mergeable for bounded deletion data streams, following merge Algorithm 8.
Proof.
Assume the given summaries with $m$ counters, $S_1$ and $S_2$, have processed two bounded deletion data streams $\sigma_1$ and $\sigma_2$ with $I_1$ and $I_2$ insertions and $F_1$ and $F_2$ total frequencies, respectively. Let $I = I_1 + I_2$ and $F = F_1 + F_2$. First, we observe that the union step does not introduce any additional error. As a result, the error for the unioned items is at most $I_1/m + I_2/m = I/m$.
Since we do not want to underestimate any monitored item, Algorithm 8 keeps the $m$ largest items from the unioned summary based on insert counts. We need to show that evicting all items except the $m$ largest will not lead to an estimation error larger than $I/m$ (and we know that $I/m \le \epsilon F$ for $m \ge \alpha/\epsilon$). This can be shown by contradiction. Assume there exists an item $x$ in the union of $S_1$ and $S_2$ with true frequency larger than or equal to $I/m$ that is not included in the $m$ largest items. Note that $x$'s insert count must be at least its true frequency by Lemma 3.6. This would imply that the sum of the insert counts of the $m$ largest items and $x$ is at least $(m + 1) \cdot I/m > I$. However, by Lemma 3.4, we know the sum of the insert counts cannot exceed $I$. As a result, evicting all items except the $m$ largest from the unioned summary based on insert counts does not lead to an estimation error larger than $I/m$. Therefore, Integrated SpaceSaving± is mergeable.
∎
6. Evaluation
This section evaluates and compares the performance of our proposed algorithms with linear sketches from the turnstile model (Li et al., 2014). Our proposed counter-based summaries make no assumption on the input universe but operate in the bounded deletion model, whereas linear sketches depend on the universe size but allow the entire data set to be deleted. The experiments aim to better understand the advantages and disadvantages of our proposed algorithms. The linear sketches that we compare with are Count-Min (Cormode and Muthukrishnan, 2005) and Count Sketch (Charikar et al., 2002).
Experiment Setup. We implemented all of the algorithms in Python. Throughout all experiments, we set the parameters of the linear sketches to align the experiments with the theoretical literature (Bhattacharyya et al., 2018).
Data Sets. We use both synthetic and real-world data consisting of items that are inserted and deleted. Zipf Stream: the insertions are drawn from the Zipf distribution (Zipf, 2016) and the deletions are uniformly chosen from the insertions; deletions happen after all the insertions. The cardinality is 22k. YCSB: we use 60% insertions and 40% updates (a delete followed by an insert) with the request distribution set to zipfian in the YCSB benchmark (Cooper et al., 2010), yielding 116,645 insertions interleaved with 39,825 deletions. The cardinality is 65k.
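A generator in the spirit of the Zipf Stream dataset might look as follows (a sketch: the sizes and the exponent below are placeholders, not the exact values used in our experiments):

```python
import random

# Insertions drawn from a Zipf distribution; deletions chosen uniformly from
# the insertions (so no frequency goes negative) and appended after all
# insertions, respecting the alpha-bounded deletion constraint.
def zipf_stream(n_items=22_000, n_inserts=100_000, alpha=2, zeta=1.5):
    weights = [j ** -zeta for j in range(1, n_items + 1)]
    inserts = random.choices(range(n_items), weights=weights, k=n_inserts)
    n_deletes = random.randint(0, int((1 - 1 / alpha) * n_inserts))
    deletes = random.sample(inserts, n_deletes)   # without replacement over positions
    return [("+", x) for x in inserts] + [("-", x) for x in deletes]
```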
Metrics. We use the average relative error (ARE) as the metric for frequency estimation. Let the final data set be $A = \{x : f(x) > 0\}$. The ARE is calculated as $\frac{1}{|A|} \sum_{x \in A} |\hat{f}(x) - f(x)|/f(x)$. A lower ARE indicates more accurate approximations. For identifying the top 100 heavy hitters, we query all items in $A$ and report the set of 100 items with the largest estimated frequencies; the metric is the F1 score.
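A sketch of the ARE computation, assuming a summary object exposing the query interface from Section 3 (names ours):

```python
# Average relative error over the items with nonzero final frequency.
def average_relative_error(true_freq, summary):
    active = {x: f for x, f in true_freq.items() if f > 0}
    errors = (abs(summary.query(x) - f) / f for x, f in active.items())
    return sum(errors) / len(active)
```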
6.1. Main Results
Our proposed algorithms perform very well on the frequency estimation task, as shown in Figure 1(a)-(b). The y-axis is the average relative error and the x-axis is the number of counters used. For fair comparisons, we let Integrated SpaceSaving± use three counters per entry, Double SpaceSaving± and Unbiased Double SpaceSaving± use two counters per entry, and the linear sketches use one counter per entry. We find that Integrated SpaceSaving± is always the best, followed by Double SpaceSaving± (DSS) or Unbiased Double SpaceSaving± (UDSS), and Count Sketch always ranks fourth. For instance, when using 16384 counters over the YCSB dataset, the average relative error of Integrated SpaceSaving± is 2.20, DSS is 2.25, UDSS is 2.26, Count Sketch is 2.29, and Count-Min is 35.
In identifying the top 100 items, the linear sketches are more competitive with our proposed algorithms. As shown in Figure 1(c)-(d), the y-axis is the F1 score and the x-axis is the number of counters. When a larger memory budget is allowed, Integrated SpaceSaving± is always the best. With a space budget of up to 8192 counters, DSS and UDSS always outperform the linear sketches. For instance, using 8192 counters over YCSB, DSS and UDSS both achieve a 0.95 F1 score, whereas Count Sketch scores 0.91 and Count-Min scores 0.78.
7. Conclusion
This paper presents a detailed analysis of the SpaceSaving± family of algorithms with bounded deletions. We proposed new algorithms built on top of the original SpaceSaving and SpaceSaving±. They exhibit many desirable properties such as no underestimation, unbiased estimation, residual/relative error guarantees, and mergeability. We believe these unique characteristics make our proposed algorithms strong candidates for real-world applications and systems.
References
- Agarwal et al. (2012) Pankaj K Agarwal, Graham Cormode, Zengfeng Huang, Jeff Phillips, Zhewei Wei, and Ke Yi. 2012. Mergeable summaries. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems. 23–34.
- Agarwal et al. (2013) Pankaj K Agarwal, Graham Cormode, Zengfeng Huang, Jeff M Phillips, Zhewei Wei, and Ke Yi. 2013. Mergeable summaries. ACM Transactions on Database Systems (TODS) 38, 4 (2013), 1–28.
- Agrawal et al. (2011) Divyakant Agrawal, Philip Bernstein, Elisa Bertino, Susan Davidson, Umeshwas Dayal, Michael Franklin, Johannes Gehrke, Laura Haas, Alon Halevy, Jiawei Han, et al. 2011. Challenges and opportunities with Big Data 2011-1. (2011).
- Alon et al. (1996) Noga Alon, Yossi Matias, and Mario Szegedy. 1996. The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing. 20–29.
- Bailis et al. (2017) Peter Bailis, Edward Gan, Samuel Madden, Deepak Narayanan, Kexin Rong, and Sahaana Suri. 2017. Macrobase: Prioritizing attention in fast data. In Proceedings of the 2017 ACM International Conference on Management of Data. 541–556.
- Ben-Eliezer et al. (2022) Omri Ben-Eliezer, Rajesh Jayaram, David P Woodruff, and Eylon Yogev. 2022. A framework for adversarially robust streaming algorithms. ACM Journal of the ACM (JACM) 69, 2 (2022), 1–33.
- Berinde et al. (2010) Radu Berinde, Piotr Indyk, Graham Cormode, and Martin J Strauss. 2010. Space-optimal heavy hitters with strong error bounds. ACM Transactions on Database Systems (TODS) 35, 4 (2010), 1–28.
- Bhattacharyya et al. (2018) Arnab Bhattacharyya, Palash Dey, and David P Woodruff. 2018. An optimal algorithm for l1-heavy hitters in insertion streams and related problems. ACM Transactions on Algorithms (TALG) 15, 1 (2018), 1–27.
- Bloom (1970) Burton H Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422–426.
- Braverman et al. (2017) Vladimir Braverman, Stephen R Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P Woodruff. 2017. BPTree: an l2 heavy hitters algorithm using constant memory. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 361–376.
- Broder (1997) Andrei Z Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171). IEEE, 21–29.
- Bruch et al. (2023) Sebastian Bruch, Franco Maria Nardini, Amir Ingber, and Edo Liberty. 2023. An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors. arXiv preprint arXiv:2301.10622 (2023).
- Charikar et al. (2002) Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming. Springer, 693–703.
- Chen et al. (2023) Wei-Ning Chen, Ayfer Ozgur, Graham Cormode, and Akash Bharadwaj. 2023. The communication cost of security and privacy in federated frequency estimation. In International Conference on Artificial Intelligence and Statistics. PMLR, 4247–4274.
- Chiosa et al. (2021) Monica Chiosa, Thomas B Preußer, and Gustavo Alonso. 2021. Skt: A one-pass multi-sketch data analytics accelerator. Proceedings of the VLDB Endowment 14, 11 (2021), 2369–2382.
- Cohen et al. (2022) Edith Cohen, Xin Lyu, Jelani Nelson, Tamás Sarlós, Moshe Shechner, and Uri Stemmer. 2022. On the robustness of countsketch to adaptive inputs. In International Conference on Machine Learning. PMLR, 4112–4140.
- Cooper et al. (2010) Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing. 143–154.
- Cormode (2023) Graham Cormode. 2023. Applications of Sketching and Pathways to Impact. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (Seattle, WA, USA) (PODS ’23). Association for Computing Machinery, New York, NY, USA, 5–10. https://doi.org/10.1145/3584372.3589937
- Cormode and Hadjieleftheriou (2008) Graham Cormode and Marios Hadjieleftheriou. 2008. Finding frequent items in data streams. Proceedings of the VLDB Endowment 1, 2 (2008), 1530–1541.
- Cormode et al. (2021) Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, and Pavel Veselỳ. 2021. Relative error streaming quantiles. In Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 96–108.
- Cormode and Muthukrishnan (2005) Graham Cormode and Shan Muthukrishnan. 2005. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55, 1 (2005), 58–75.
- Cormode and Yi (2020) Graham Cormode and Ke Yi. 2020. Small summaries for big data. Cambridge University Press.
- Dayan et al. (2017) Niv Dayan, Manos Athanassoulis, and Stratos Idreos. 2017. Monkey: Optimal navigable key-value store. In Proceedings of the 2017 ACM International Conference on Management of Data. 79–94.
- Demaine et al. (2002) Erik D Demaine, Alejandro López-Ortiz, and J Ian Munro. 2002. Frequency estimation of internet packet streams with limited space. In European Symposium on Algorithms. Springer, 348–360.
- Duffield et al. (2007) Nick Duffield, Carsten Lund, and Mikkel Thorup. 2007. Priority sampling for estimation of arbitrary subset sums. Journal of the ACM (JACM) 54, 6 (2007), 32–es.
- Flajolet et al. (2007) Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Discrete mathematics & theoretical computer science Proceedings (2007).
- Gupta and Zane (2003) Anupam Gupta and Francis Zane. 2003. Counting inversions in lists. In SODA, Vol. 3. 253–254.
- Han et al. (2006) Jiawei Han, Jian Pei, and Hanghang Tong. 2006. Data mining: Concepts and techniques. Morgan Kaufmann 10, 559-569 (2006), 4.
- Hasidim et al. (2020) Avinatan Hasidim, Haim Kaplan, Yishay Mansour, Yossi Matias, and Uri Stemmer. 2020. Adversarially robust streaming algorithms via differential privacy. Advances in Neural Information Processing Systems 33 (2020), 147–158.
- Ivkin et al. (2019) Nikita Ivkin, Edo Liberty, Kevin Lang, Zohar Karnin, and Vladimir Braverman. 2019. Streaming quantiles algorithms with small space and update time. arXiv preprint arXiv:1907.00236 (2019).
- Jayaram and Woodruff (2018) Rajesh Jayaram and David P Woodruff. 2018. Data streams with bounded deletions. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 341–354.
- Jiang et al. (2018) Jiawei Jiang, Fangcheng Fu, Tong Yang, and Bin Cui. 2018. Sketchml: Accelerating distributed machine learning with data sketches. In Proceedings of the 2018 International Conference on Management of Data. 1269–1284.
- Karnin et al. (2016) Zohar Karnin, Kevin Lang, and Edo Liberty. 2016. Optimal quantile approximation in streams. In 2016 ieee 57th annual symposium on foundations of computer science (focs). IEEE, 71–78.
- Karp et al. (2003) Richard M Karp, Scott Shenker, and Christos H Papadimitriou. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems (TODS) 28, 1 (2003), 51–55.
- Lebeda and Tetek (2023) Christian Janos Lebeda and Jakub Tetek. 2023. Better differentially private approximate histograms and heavy hitters using the Misra-Gries sketch. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 79–88.
- Li et al. (2020) Jizhou Li, Zikun Li, Yifei Xu, Shiqi Jiang, Tong Yang, Bin Cui, Yafei Dai, and Gong Zhang. 2020. Wavingsketch: An unbiased and generic sketch for finding top-k items in data streams. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1574–1584.
- Li et al. (2014) Yi Li, Huy L Nguyen, and David P Woodruff. 2014. Turnstile streaming algorithms might as well be linear sketches. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing. 174–183.
- Liu et al. (2023) Zirui Liu, Yixin Zhang, Yifan Zhu, Ruwen Zhang, Tong Yang, Kun Xie, Sha Wang, Tao Li, and Bin Cui. 2023. TreeSensing: Linearly Compressing Sketches with Flexibility. Proceedings of the ACM on Management of Data 1, 1 (2023), 1–28.
- Manku and Motwani (2002) Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency counts over data streams. In VLDB’02: Proceedings of the 28th International Conference on Very Large Databases. Elsevier, 346–357.
- Mathieu and de Rougemont (2023) Claire Mathieu and Michel de Rougemont. 2023. Testing frequency distributions in a stream. arXiv preprint arXiv:2309.11175 (2023).
- Metwally et al. (2005) Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2005. Efficient computation of frequent and top-k elements in data streams. In Database Theory-ICDT 2005: 10th International Conference, Edinburgh, UK, January 5-7, 2005. Proceedings 10. Springer, 398–412.
- Misra and Gries (1982) Jayadev Misra and David Gries. 1982. Finding repeated elements. Science of computer programming 2, 2 (1982), 143–152.
- Pagh and Thorup (2022) Rasmus Pagh and Mikkel Thorup. 2022. Improved utility analysis of private countsketch. Advances in Neural Information Processing Systems 35 (2022), 25631–25643.
- Rothchild et al. (2020) Daniel Rothchild, Ashwinee Panda, Enayat Ullah, Nikita Ivkin, Ion Stoica, Vladimir Braverman, Joseph Gonzalez, and Raman Arora. 2020. Fetchsgd: Communication-efficient federated learning with sketching. In International Conference on Machine Learning. PMLR, 8253–8265.
- Ting (2018) Daniel Ting. 2018. Data sketches for disaggregated subset sum and frequent item estimation. In Proceedings of the 2018 International Conference on Management of Data. 1129–1140.
- Wang et al. (2019) Pinghui Wang, Yiyan Qi, Yuanming Zhang, Qiaozhu Zhai, Chenxu Wang, John CS Lui, and Xiaohong Guan. 2019. A memory-efficient sketch method for estimating high similarities in streaming sets. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 25–33.
- Wang et al. (2024) Yiping Wang, Yanhao Wang, and Cen Chen. 2024. DPSW-Sketch: A Differentially Private Sketch Framework for Frequency Estimation over Sliding Windows (Technical Report). arXiv preprint arXiv:2406.07953 (2024).
- Xu et al. (2023) Yuchen Xu, Wenfei Wu, Bohan Zhao, Tong Yang, and Yikai Zhao. 2023. MimoSketch: A Framework to Mine Item Frequency on Multiple Nodes with Sketches. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2838–2849.
- Yang et al. (2018) Tong Yang, Jie Jiang, Peng Liu, Qun Huang, Junzhi Gong, Yang Zhou, Rui Miao, Xiaoming Li, and Steve Uhlig. 2018. Elastic sketch: Adaptive and fast network-wide measurements. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. 561–575.
- Zakhary et al. (2021) Victor Zakhary, Lawrence Lim, Divy Agrawal, and Amr El Abbadi. 2021. Cache on Track (CoT): Decentralized Elastic Caches for Cloud Environments. In International Conference on Extending Database Technology.
- Zhao et al. (2021a) Fuheng Zhao, Divyakant Agrawal, Amr El Abbadi, and Ahmed Metwally. 2021a. SpaceSaving±: An Optimal Algorithm for Frequency Estimation and Frequent Items in the Bounded Deletion Model. arXiv preprint arXiv:2112.03462 (2021).
- Zhao et al. (2023a) Fuheng Zhao, Divyakant Agrawal, Amr El Abbadi, Ahmed Metwally, Claire Mathieu, and Michel de Rougemont. 2023a. Errata for "SpaceSaving±: An Optimal Algorithm for Frequency Estimation and Frequent Items in the Bounded-Deletion Model". Proceedings of the VLDB Endowment 17, 4 (2023), 643–643.
- Zhao et al. (2023b) Fuheng Zhao, Punnal Ismail Khan, Divyakant Agrawal, Amr El Abbadi, Arpit Gupta, and Zaoxing Liu. 2023b. Panakos: Chasing the Tails for Multidimensional Data Streams. Proceedings of the VLDB Endowment 16, 6 (2023), 1291–1304.
- Zhao et al. (2021b) Fuheng Zhao, Sujaya Maiyya, Ryan Wiener, Divyakant Agrawal, and Amr El Abbadi. 2021b. Kll Approximate Quantile Sketches over Dynamic Datasets. Proceedings of the VLDB Endowment 14, 7 (2021), 1215–1227.
- Zhao et al. (2022) Fuheng Zhao, Dan Qiao, Rachel Redberg, Divyakant Agrawal, Amr El Abbadi, and Yu-Xiang Wang. 2022. Differentially private linear sketches: Efficient implementations and applications. Advances in Neural Information Processing Systems 35 (2022), 12691–12704.
- Zhao et al. (2023c) Fuheng Zhao, Leron Reznikov, Divyakant Agrawal, and Amr El Abbadi. 2023c. Autumn: A Scalable Read Optimized LSM-tree based Key-Value Stores with Fast Point and Range Read Speed. arXiv preprint arXiv:2305.05074 (2023).
- Zipf (2016) George Kingsley Zipf. 2016. Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books.