The SpaceSaving± Family of Algorithms for Data Streams with Bounded Deletions
Abstract.
In this paper, we present an advanced analysis of near-optimal algorithms that use limited space to solve the frequency estimation, heavy hitters, frequent items, and approximate top-k problems in the bounded deletion model. We define the family of SpaceSaving± algorithms and explain why the original SpaceSaving± algorithm only works when insertions and deletions are not interleaved. Next, we propose the new Double SpaceSaving±, Unbiased Double SpaceSaving±, and Integrated SpaceSaving± algorithms and prove their correctness. The three proposed algorithms represent different trade-offs: Double SpaceSaving± can be extended to provide unbiased estimations, while Integrated SpaceSaving± uses less space. Since data streams are often skewed, we present an improved analysis of these algorithms and show that the error does not depend on the hot items. We also demonstrate how to achieve relative error guarantees under mild assumptions. Moreover, we establish that all three algorithms satisfy the important mergeability property, which is essential for running the algorithms in distributed settings.
1. Introduction
Streaming algorithms (Cormode and Yi, 2020; Han et al., 2006) aim to process a stream of data in a single pass with small space. In particular, for a stream of updates of items, these algorithms often only store a compact summary which uses space at most logarithmic in the data stream length or in the cardinality of the input domain. The streaming paradigm provides essential analysis and statistical measurements with strong accuracy guarantees for many big data applications (Agrawal et al., 2011). For instance, Hyperloglog (Flajolet et al., 2007) answers queries about cardinality; Bloom Filters (Bloom, 1970) answer membership queries; KLL (Karnin et al., 2016; Ivkin et al., 2019; Zhao et al., 2021b) gives accurate quantile approximations such as finding the median. These data summaries (sketches) are now the cornerstones for many real-world systems and applications including network monitoring (Yang et al., 2018; Zhao et al., 2023b; Xu et al., 2023), storage engines (Dayan et al., 2017; Zhao et al., 2023c), data mining (Wang et al., 2019; Li et al., 2020), federated learning (Rothchild et al., 2020; Jiang et al., 2018), privacy-preserving analytics (Chen et al., 2023; Zhao et al., 2022; Lebeda and Tetek, 2023; Pagh and Thorup, 2022; Wang et al., 2024), vector search (Broder, 1997; Bruch et al., 2023), and many more. We refer the reader to a recent paper (Cormode, 2023) for a more thorough discussion of the applications.
In this paper, we focus on the frequency estimation and heavy hitters problems in the presence of deletions. Given a data stream $\sigma$ of items from some universe $[n]$, the Frequency Estimation problem constructs a summary providing, for every item $x \in [n]$, an estimate $\hat{f}(x)$ of the true frequency $f(x)$. Since resources are limited (small memory footprint), the data summary cannot track every item, and hence there is an inherent trade-off between the space used and the accuracy of the estimate. In the closely related approximate Heavy Hitters (Frequent Items) problem, there is an additional input frequency threshold, and the goal is to identify all items whose frequency exceeds that threshold. This applies to settings where data scientists and system operators are interested in knowing which items are hot (appear very often). Frequency estimation, heavy hitters, and approximate top-k have found many applications in big data systems (Bailis et al., 2017; Zakhary et al., 2021; Liu et al., 2023; Chiosa et al., 2021).
Algorithm | Space | Model | Error Bound | Note |
SpaceSaving / MG (Metwally et al., 2005; Misra and Gries, 1982) | $O(1/\epsilon)$ | Insertion-Only | $\epsilon F$ | Lemma 2.3 |
SpaceSaving, residual analysis (Berinde et al., 2010) | $O(k + 1/\epsilon)$ | Insertion-Only | $\epsilon F^{res(k)}$ | Lemma 2.4 |
SpaceSaving, relative analysis (Mathieu and de Rougemont, 2023) | | Insertion-Only | $\epsilon f(x)$ ($\beta$-decreasing) | Lemma 2.6 |
Count-Min (Cormode and Muthukrishnan, 2005) | $O(\epsilon^{-1} \log n)$ | Turnstile | $\epsilon \lVert f \rVert_1$ | Never underestimates |
CountSketch (Charikar et al., 2002) | $O(\epsilon^{-2} \log n)$ | Turnstile | $\epsilon \lVert f \rVert_2$ | Unbiased |
DoubleSS± & IntegratedSS± | $O(\alpha/\epsilon)$ | Bounded Deletion | $\epsilon F$ | Theorems 3.1 & 3.9 |
Unbiased DoubleSS± | $O(\alpha/\epsilon)$ | Bounded Deletion | Unbiased & Var $\le \epsilon^2 F^2$ | Theorem 3.3 |
DoubleSS± & IntegratedSS± | $O(k + 1/\epsilon)$ | Bounded Deletion | $\epsilon F_{\alpha}^{res(k)}$ | Theorems 4.1 & 4.3 |
DoubleSS± & IntegratedSS± | | Bounded Deletion | $\epsilon f(x)$ ($\beta$-decreasing and $\zeta > 1$ Zipf) | Theorems 4.6 & 4.7 |
1.1. Related Work
There is a large body of algorithms proposed to solve the frequency estimation and heavy hitters problems (Misra and Gries, 1982; Metwally et al., 2005; Charikar et al., 2002; Cormode and Muthukrishnan, 2005; Karp et al., 2003; Demaine et al., 2002; Braverman et al., 2017; Jayaram and Woodruff, 2018; Zhao et al., 2021a, 2022; Manku and Motwani, 2002). The existing algorithms can be classified into three models: insertion-only, turnstile, and bounded-deletion. In the insertion-only model, all arriving stream items are insertions, and the frequency of $x$ is the number of times $x$ has been inserted. SpaceSaving (Metwally et al., 2005) and MG (Misra and Gries, 1982) are two popular algorithms that operate in the insertion-only model; they provide strong deterministic guarantees with minimal space. Moreover, since these algorithms are deterministic, they are inherently adversarially robust (Hasidim et al., 2020; Ben-Eliezer et al., 2022; Cohen et al., 2022). In the turnstile model, arriving stream items can be either insertions or deletions, and the (net) frequency of $x$ is the difference between the number of times $x$ has been inserted and deleted, with the proviso that no item's frequency is negative. Count-Min (Cormode and Muthukrishnan, 2005) and CountSketch (Charikar et al., 2002) are two popular algorithms that operate in the turnstile model. Supporting deletions is a harder task, and as a result Count-Min and CountSketch require more space, with a poly-logarithmic dependence on the input domain size, to offer strong error bounds with high probability (Alon et al., 1996). The bounded deletion model (Jayaram and Woodruff, 2018; Zhao et al., 2021b, a) assumes the number of deletions is bounded by a fraction of the number of insertions. For instance, in standardized testing, if students are only allowed to submit one regrade request on their tests, then the number of deletions is at most half of the number of insertions, where updating a student's grade is a deletion followed by an insertion.
Recently, Zhao et al. (Zhao et al., 2021a) proposed SpaceSaving±, an algorithm extending the original SpaceSaving (Metwally et al., 2005) to handle bounded deletions efficiently, (implicitly) assuming that the data stream has no interleaving between insertions and deletions: all insertions precede all deletions (Zhao et al., 2023a). In addition, Berinde et al. (Berinde et al., 2010) showed that data summaries in the insertion-only model can achieve an even stronger error bound using residual error analysis. The differences among these algorithms and their corresponding error guarantees are listed in Table 1. In this work, we aim to extend the SpaceSaving± family of algorithms to support interleavings between insertions and deletions and to provide a tighter error bound analysis.
1.2. Our Contributions
In this paper, we present a detailed analysis of the SpaceSaving± family of algorithms with bounded deletions. In particular, we propose three new algorithms: Double SpaceSaving±, Unbiased Double SpaceSaving±, and Integrated SpaceSaving±, which operate in the general bounded deletion model (with interleavings of insertions and deletions) and provide strong theoretical guarantees with small space, while having distinct characteristics. In addition, we introduce the residual error bound guarantee in the bounded deletion model and prove that the proposed algorithms satisfy this stronger residual error guarantee. Moreover, we demonstrate that the proposed algorithms also achieve the relative error bound (Mathieu and de Rougemont, 2023; Cormode et al., 2021) under mild assumptions on the skewness of the data stream. Finally, we provide a merge algorithm for our proposed algorithms and prove that all of them are mergeable in the bounded deletion model. Mergeability is critical for distributed applications (Agarwal et al., 2012).
2. Background
2.1. Preliminaries
Consider a data stream consisting of a sequence of $N$ operations (insertions or deletions) on items drawn from a universe $[n]$ of size $n$. We denote the stream by $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_N)$, where each $\sigma_t$ inserts or deletes one item $x_t \in [n]$. Let $I$ denote the total number of insertions and $D$ denote the total number of deletions in the stream, so that $N = I + D$. In the bounded deletion model (Jayaram and Woodruff, 2018; Zhao et al., 2021b), it is assumed that, for some parameter $\alpha \ge 1$, the stream is constrained to satisfy $D \le (1 - 1/\alpha) I$. Observe that when $\alpha = 1$, there can be no deletions, and so the model includes the insertion-only model as a special case. Let $f$ denote the frequency function: the frequency of an item $x$ is $f(x) = I_x - D_x$, where $I_x$ denotes the number of insertions of $x$ in the stream and $D_x$ denotes the number of deletions of $x$ in the stream. In this paper, we use a summary data structure from which we infer an estimation of the frequency function. We use $\hat{f}(x)$ to denote the estimated frequency of $x$.
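As a concrete illustration of the bounded deletion constraint above (the numbers are our own example), take $\alpha = 2$:

```latex
% With \alpha = 2, the constraint D \le (1 - 1/\alpha) I gives
D \le \left(1 - \tfrac{1}{2}\right) I = \tfrac{I}{2},
\qquad
F = I - D \ge I - \tfrac{I}{2} = \tfrac{I}{\alpha}.
% At least half of the inserted mass survives, matching the
% one-regrade-per-test example from Section 1.1.
```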
Let $f_1 \ge f_2 \ge \cdots \ge f_n$ denote the frequencies of the items, sorted in non-increasing order. Similarly, sorting the insertion counts in non-increasing order leads to $f^I_1 \ge f^I_2 \ge \cdots \ge f^I_n$, and sorting the deletion counts in non-increasing order leads to $f^D_1 \ge f^D_2 \ge \cdots \ge f^D_n$. Notice that, for the same $i$, the item corresponding to $f_i$ may differ from the item corresponding to $f^I_i$ and from the item corresponding to $f^D_i$. For the special case of the insertion-only model, the frequency vector and the insertion vector are identical. Moreover, we use $F$ to denote the total frequency resulting from the stream operations, such that $F = \sum_{x \in [n]} f(x) = I - D$. In this paper, we focus on the unit update model (an operation only inserts or deletes one count of an item) and assume that at any point of time, no item frequency is below zero (i.e., items cannot be deleted if they were not previously inserted).
Definition 2.1 (Deterministic Frequency Estimation Problem).
Given a stream $\sigma$ of operations (insertions or deletions) of items from a universe $[n]$, the Frequency Estimation Problem asks for an (implicitly defined) function $\hat{f}$ that, when queried for any element $x \in [n]$, returns a value $\hat{f}(x)$ such that
$|\hat{f}(x) - f(x)| \le \epsilon F$, where $F = \sum_{x \in [n]} f(x)$.
Definition 2.2 (Heavy Hitters problem).
Given a parameter $\phi \in (0, 1)$, a heavy hitter is an item $x$ with frequency $f(x) \ge \phi F$. Given a stream of operations (insertions or deletions) of items from a universe $[n]$, the heavy hitters problem asks for an (explicitly defined) subset of $[n]$ that contains all heavy hitters.
All of our streaming algorithms use a summary of the information seen in the stream so far, consisting of a collection of items and some counts for each of them. The size of the summary is much smaller than $n$, and we say that an item is monitored if it is present in the summary.
2.2. The SpaceSaving± Family of Algorithms
The SpaceSaving± family of algorithms are data summaries that build on top of the original SpaceSaving (Metwally et al., 2005) algorithm. In 2005, Metwally, Agrawal, and El Abbadi (Metwally et al., 2005) proposed the popular SpaceSaving algorithm that provides highly accurate estimates of item frequencies and heavy hitters among many competitors (Cormode and Hadjieleftheriou, 2008). The SpaceSaving algorithm operates in the insertion-only model (so that the stream is simply a sequence of items from $[n]$) and uses the optimal space $O(1/\epsilon)$ (Berinde et al., 2010). The update algorithm, as shown in Algorithm 1 (Metwally et al., 2005), uses $m$ counters to store a monitored item's identity and its estimated frequency. When a new item $x$ arrives: if $x$ is monitored, then increment its counter; if $x$ is not monitored and the summary has unused counters, then start to monitor $x$ and set $\hat{f}(x)$ to 1; otherwise, i) find $e$, the item with the minimum counter value ($min$), ii) evict this $e$ from the summary, and iii) monitor the new item $x$ and set $\hat{f}(x)$ to $min + 1$. See Algorithm 1.
After the stream has been read, the SpaceSaving Update algorithm has produced a summary. To solve the Frequency Estimation Problem, the algorithm then needs to estimate the frequency of an item being queried: the SpaceSaving Query algorithm returns $\hat{f}(x)$ if $x$ is monitored and 0 otherwise, as shown in Algorithm 2. (To solve the heavy hitters problem, the algorithm outputs all items in the summary whose count is greater than or equal to $\phi$ times the sum of the counts.)
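To make Algorithms 1 and 2 concrete, here is a minimal Python sketch of SpaceSaving (the class name and the dictionary-based layout are ours, chosen for clarity; production implementations use the constant-time stream-summary structure of Metwally et al.):

```python
# A minimal sketch of SpaceSaving update (Algorithm 1) and query (Algorithm 2)
# in the insertion-only model. Dictionary-based for readability.
class SpaceSaving:
    def __init__(self, m):
        self.m = m            # number of counters
        self.counters = {}    # monitored item -> estimated frequency

    def insert(self, x):
        if x in self.counters:              # monitored: increment its counter
            self.counters[x] += 1
        elif len(self.counters) < self.m:   # unused counter: start monitoring x
            self.counters[x] = 1
        else:                               # full: evict the minimum counter
            e, c_min = min(self.counters.items(), key=lambda kv: kv[1])
            del self.counters[e]
            self.counters[x] = c_min + 1    # new item inherits min + 1

    def query(self, x):
        return self.counters.get(x, 0)      # 0 if x is not monitored
```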
When the number of counters (the size of the summary) is $m \ge 1/\epsilon$, SpaceSaving solves both the frequency estimation and heavy hitters problems in the insertion-only model, as shown in Lemma 2.3. Moreover, SpaceSaving satisfies a more refined residual error bound, shown in Lemma 2.4, in which "hot" items do not contribute error and the error bound only depends on the "cold" and "warm" items (Berinde et al., 2010).
The error guarantee is shown in Lemma 2.3: when $m \ge 1/\epsilon$, SpaceSaving ensures $|\hat{f}(x) - f(x)| \le \epsilon F$.
Lemma 2.3.
(Metwally et al., 2005) After processing an insertion-only data stream $\sigma$, SpaceSaving with $m$ counters ensures, for every item $x$, $0 \le \hat{f}(x) - f(x) \le min \le F/m$.
The residual error guarantee is shown in Lemma 2.4.
Lemma 2.4.
(Berinde et al., 2010) After processing an insertion-only data stream $\sigma$, SpaceSaving with $m > k$ counters ensures $|\hat{f}(x) - f(x)| \le F^{res(k)}/(m - k)$, where $F^{res(k)} = F - \sum_{i=1}^{k} f_i$.
The relative error guarantee is shown in Lemma 2.6.
Definition 2.5 ($\beta$-decreasing).
Let $\beta > 1$; then a non-increasing function $f$ with domain $[n]$ is $\beta$-decreasing if, for all $t$ such that $2t \le n$, $f(2t) \le f(t)/\beta$.
Lemma 2.6.
(Mathieu and de Rougemont, 2023) After processing an insertion-only data stream whose frequency function is $\beta$-decreasing with $\beta > 2$, SpaceSaving achieves the relative error guarantee $|\hat{f}(x) - f(x)| \le \epsilon f(x)$ for the highest-frequency items, with a number of counters depending on $\beta$ and $\epsilon$ (see (Mathieu and de Rougemont, 2023) for the precise space bound).
In 2021, Zhao et al. (Zhao et al., 2021a) proposed SpaceSaving± to extend the SpaceSaving algorithm from the insertion-only model to the bounded deletion model. In that model, they established a space lower bound of $\Omega(\alpha/\epsilon)$ counters required to solve the frequency estimation and heavy hitters problems, and proposed an algorithm, SpaceSaving± (see Algorithm 3), whose space matches the lower bound. The update algorithm is shown in Algorithm 3, in which insertions are dealt with as in Algorithm 1, deletions of monitored items decrement the corresponding counters, and deletions of unmonitored items are ignored. The query algorithm is identical to Algorithm 2 (there are a couple of variants of Algorithm 2; we adopt the most commonly used approach).
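On top of the sketch above, the SpaceSaving± update rule of Algorithm 3 only adds a deletion case; a minimal sketch (the class name is ours):

```python
# A sketch of the SpaceSaving± update rule (Algorithm 3): insertions are handled
# exactly as in SpaceSaving; deletions decrement a monitored item's counter and
# are ignored for unmonitored items. Builds on the SpaceSaving class above.
class SpaceSavingPM(SpaceSaving):
    def delete(self, x):
        if x in self.counters:
            self.counters[x] -= 1   # deletions of unmonitored items are ignored
```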
However, the SpaceSaving± algorithm is only correct if an implicit assumption holds (Zhao et al., 2023a), namely, that deletions can only happen after all the insertions, so that interleaving insertions and deletions is not allowed, as shown in Lemma 2.7. In this paper, we present new algorithms in the general bounded deletion model that allow interleavings between insertions and deletions while still using only $O(\alpha/\epsilon)$ space.
Lemma 2.7 corresponds to Theorem 2 in (Zhao et al., 2021a): consider the bounded deletion model and assume, in addition, that the stream consists of an insertion phase with $I$ insertions followed by a deletion phase with $D$ deletions (no interleaving); then SpaceSaving± with $O(\alpha/\epsilon)$ counters ensures $|\hat{f}(x) - f(x)| \le \epsilon F$ for every item $x$.
3. New Algorithms
In this section, we propose new algorithms supporting the bounded deletion model (with interleavings between insertions and deletions, while an item's frequency is never negative). Double SpaceSaving± and Integrated SpaceSaving± both use $O(\alpha/\epsilon)$ space to solve the deterministic frequency estimation and heavy hitters problems in the bounded deletion model. We extend Double SpaceSaving± to support unbiased estimations. Also, we establish the correctness proofs for these algorithms. While these algorithms share similar characteristics, they represent different trade-offs. In particular, Double SpaceSaving± uses two disjoint summaries, one to track insertions and the other to track deletions, while Integrated SpaceSaving± uses a single summary to track both deletions and insertions.
3.1. Double SpaceSaving
In Double SpaceSaving±, we employ two SpaceSaving summaries to track insertions ($S_I$) and deletions ($S_D$) separately, using $m_I$ and $m_D$ counters respectively (the algorithm is parameterized by the choice of $m_I$ and $m_D$). As shown in Algorithm 4, Double SpaceSaving± completely partitions the update operations into two independent summaries. To query for the frequency of an item $x$, both summaries need to be queried, and the frequency of item $x$ is estimated as $\hat{f}(x) = \hat{f}_I(x) - \hat{f}_D(x)$, as shown in Algorithm 5.
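A minimal sketch of Algorithms 4 and 5, reusing the SpaceSaving class from Section 2.2 (the class name is ours; the counter sizes follow the choices in the proof of Theorem 3.1):

```python
# A sketch of Double SpaceSaving± (Algorithms 4 and 5): two independent
# SpaceSaving summaries, one fed by insertions and one by deletions.
class DoubleSpaceSavingPM:
    def __init__(self, m_I, m_D):
        self.S_I = SpaceSaving(m_I)   # tracks insertions
        self.S_D = SpaceSaving(m_D)   # tracks deletions

    def insert(self, x):
        self.S_I.insert(x)

    def delete(self, x):
        self.S_D.insert(x)            # a deletion is an insertion into S_D

    def query(self, x):               # Algorithm 5: combine both summaries
        return self.S_I.query(x) - self.S_D.query(x)
```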
Theorem 3.1.
After processing an $\alpha$-bounded deletion data stream $\sigma$, Double SpaceSaving± with $m_I = 2\alpha/\epsilon$ and $m_D = 2(\alpha-1)/\epsilon$ counters ($O(\alpha/\epsilon)$ space) solves the deterministic frequency estimation problem: for every item $x$, $|\hat{f}(x) - f(x)| \le \epsilon F$.
Proof.
For every item $x$ in the universe, the frequency estimation error is upper bounded by $I/m_I + D/m_D$, due to the property of SpaceSaving shown in Lemma 2.3 and the triangle inequality. In addition, from the definition of bounded deletion, we can derive the relationship between $I$, $D$, and $F$. Namely, $I \le \alpha F$ and $D \le (\alpha - 1) F$. Set $m_I = 2\alpha/\epsilon$ and $m_D = 2(\alpha-1)/\epsilon$; then $I/m_I + D/m_D \le \epsilon F/2 + \epsilon F/2 = \epsilon F$. Hence, Double SpaceSaving± solves the deterministic frequency estimation problem in the bounded deletion model with $O(\alpha/\epsilon)$ space. ∎
Theorem 3.2.
Double SpaceSaving± solves the heavy hitters problem in the bounded deletion model using $O(\alpha/\epsilon)$ space, by reporting all items monitored in the insertion summary $S_I$.
Proof.
Assume a heavy hitter $x$ is not monitored in the insertion summary $S_I$. By the definition of heavy hitters, $f(x) \ge \phi F > \epsilon F$ (for $\phi > \epsilon$). Since $x$ is not monitored in $S_I$, its estimated frequency is at most 0. As a result, the estimation error for item $x$ is larger than $\epsilon F$, which contradicts Theorem 3.1.
Hence, by contradiction, Double SpaceSaving± solves the heavy hitters problem. ∎
Being more careful and using variants of Algorithm 2 would lead to improvements but would not change the theoretical bound of Theorem 3.2.
3.1.1. Unbiased Double SpaceSaving
Unbiased estimation is a desirable property for robust estimation (Charikar et al., 2002; Duffield et al., 2007). Research has shown that SpaceSaving can be extended to provide unbiased estimations (i.e., $\mathbb{E}[\hat{f}(x)] = f(x)$) (Ting, 2018). The extension is obtained from Algorithm 1 by the following simple modification: after line 8 of the code, with probability $1/(min+1)$ lines 9 and 10 are executed; with the complementary probability $1 - 1/(min+1)$, instead of lines 9 and 10 the algorithm simply increments the minimum counter while keeping the currently monitored item. This means that we can replace the two independent deterministic SpaceSaving summaries in Algorithm 5 with two independent unbiased SpaceSaving summaries to support unbiased frequency estimation in the bounded deletion model.
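A sketch of this probabilistic eviction applied to the SpaceSaving class above (the class name is ours; the rule follows Ting's modification as just described):

```python
import random

# Unbiased SpaceSaving: when the summary is full and an unmonitored item
# arrives, the minimum counter is always incremented, but the new item takes
# over that counter only with probability 1 / (c_min + 1).
class UnbiasedSpaceSaving(SpaceSaving):
    def insert(self, x):
        if x in self.counters:
            self.counters[x] += 1
        elif len(self.counters) < self.m:
            self.counters[x] = 1
        else:
            e, c_min = min(self.counters.items(), key=lambda kv: kv[1])
            if random.random() < 1.0 / (c_min + 1):   # evict and replace
                del self.counters[e]
                self.counters[x] = c_min + 1
            else:                                     # keep the old identity
                self.counters[e] = c_min + 1
```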
Theorem 3.3.
After processing an $\alpha$-bounded deletion data stream $\sigma$, Unbiased Double SpaceSaving± of space $O(\alpha/\epsilon)$, with two independent unbiased SpaceSaving summaries in Algorithm 5, provides unbiased frequency estimation, and the estimation variance is upper bounded by $\epsilon^2 F^2$.
Proof.
The variance of an Unbiased SpaceSaving estimate is upper bounded by the square of the minimum counter value (Ting, 2018), which is at most $(I/m_I)^2$ for the insertion summary and $(D/m_D)^2$ for the deletion summary. We know that $\hat{f}(x) = \hat{f}_I(x) - \hat{f}_D(x)$, where $\hat{f}_I$ is the estimation of inserts and $\hat{f}_D$ is the estimation of deletes. Since the two summaries are independent, the covariance is 0; hence, the variance of $\hat{f}(x)$ is upper bounded by the sum of the variances of $\hat{f}_I(x)$ and $\hat{f}_D(x)$. In Theorem 3.1, we proved that $I/m_I + D/m_D \le \epsilon F$. As a result, the variance of the unbiased estimation for any item's frequency is upper bounded by $(I/m_I)^2 + (D/m_D)^2 \le (I/m_I + D/m_D)^2 \le \epsilon^2 F^2$.
∎
3.2. Integrated SpaceSaving
Unlike Double SpaceSaving±, which uses two independent summaries, Integrated SpaceSaving± completely integrates deletions and insertions in one data structure, as shown in Algorithm 6. The main difference between Integrated SpaceSaving± and SpaceSaving± is that Integrated SpaceSaving± tracks the insert and delete counts separately. This ensures that the minimum insert count monotonically increases and never decreases; in contrast, the original SpaceSaving± uses one single count to track inserts and deletes together. When insertions and deletions are interleaved, the minimum count in SpaceSaving± may decrease. This can lead to severe underestimation of a frequent item (Zhao et al., 2023a).
More specifically, the Integrated SpaceSaving± data structure maintains a record estimating the number of insertions and deletions for each monitored item. When a new stream operation on item $x$ arrives, if $x$ is already being monitored, then depending on the type of operation, either the insertion or the deletion count of $x$ is incremented. If the summary is not full, then no item could ever have been evicted from the summary, and since no delete operation arrives before the corresponding item is inserted, any arriving operation on an unmonitored item must be an insertion. Hence, when a new item arrives and the summary is not full, the new item is added to the summary and initialized with one insertion and zero deletions. If a delete arrives for an unmonitored item, the delete is simply ignored. The interesting case is when an insert of an unmonitored item arrives and the summary is full. In this case, Integrated SpaceSaving± mimics the behaviour of the original SpaceSaving by evicting the element with the least number of inserts and replacing it with the new element. The insertion count of the new element is overestimated by initializing it to the insertion count of the evicted element plus one, while its deletion count is underestimated by initializing it to zero.
To estimate an item's frequency, if the item is monitored, Integrated SpaceSaving± subtracts the item's delete count from the item's insert count; otherwise, the estimated frequency is 0, as shown in Algorithm 7.
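A minimal sketch of Algorithms 6 and 7 (dictionary-based; the class name is ours):

```python
# Integrated SpaceSaving±: each entry keeps a separate insert count and delete
# count; eviction compares and inherits insert counts only, so the minimum
# insert count never decreases.
class IntegratedSpaceSavingPM:
    def __init__(self, m):
        self.m = m
        self.entries = {}   # item -> [insert_count, delete_count]

    def insert(self, x):
        if x in self.entries:
            self.entries[x][0] += 1
        elif len(self.entries) < self.m:
            self.entries[x] = [1, 0]
        else:   # full: evict the entry with the minimum insert count
            e, (ci_min, _) = min(self.entries.items(), key=lambda kv: kv[1][0])
            del self.entries[e]
            self.entries[x] = [ci_min + 1, 0]   # inherit inserts, reset deletes

    def delete(self, x):
        if x in self.entries:
            self.entries[x][1] += 1   # deletions of unmonitored items are ignored

    def query(self, x):               # Algorithm 7
        if x in self.entries:
            ci, cd = self.entries[x]
            return ci - cd
        return 0
```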
We now establish the correctness of Integrated SpaceSaving±. First, we establish several lemmas about Integrated SpaceSaving± to help us prove the correctness of the algorithm.
Lemma 3.4.
After processing an $\alpha$-bounded deletion data stream $\sigma$, Integrated SpaceSaving± following update Algorithm 6 ensures that the sum of all insert counts in the summary equals $I$.
Proof.
This holds by induction over time, as verified by inspection of the algorithm: every insertion increases exactly one insert count by one, either by incrementing a monitored item's insert count or by replacing the evicted minimum $min_I$ with $min_I + 1$. ∎
Lemma 3.5.
After processing an $\alpha$-bounded deletion data stream $\sigma$, Integrated SpaceSaving± with $m$ counters following update Algorithm 6 ensures that the minimum insert count, $min_I$, is less than or equal to $I/m$.
Proof.
Since deletions never increment any insert counts, the insert counts are only affected by insertions. We observe that the sum of all insert counts in the summary is exactly equal to $I$, since the sum of all insert counts increments by 1 after processing one insertion (Lemma 3.4). As a result, $min_I$ is maximized when all insert counts are equal. Hence, $min_I$ is less than or equal to $I/m$. ∎
Lemma 3.6.
The frequencies of monitored items in Integrated SpaceSaving± are never underestimated by query Algorithm 7.
Proof.
The insert counts of the summary are maintained exactly as in the SpaceSaving algorithm run on the substream of $\sigma$ consisting of the insertion operations only. By the analysis of SpaceSaving, it is known that for an item $x$ in the summary, the insert count $c_I(x)$ is an overestimate of the true number of insertions $I_x$ since the beginning of the stream.
The delete count $c_D(x)$ of element $x$ in the summary is equal to the number of deletions of item $x$ since the last time that it was inserted into the summary, so it is an underestimate of the number of deletions $D_x$ since the beginning of the stream.
Thus, for any $x$ in the summary, $\hat{f}(x) = c_I(x) - c_D(x) \ge I_x - D_x = f(x)$ is an overestimate. ∎
Lemma 3.7.
In Integrated SpaceSaving±, any item inserted more than $I/m$ times must be monitored in the summary; moreover, for any item $x$ in the summary, its insert count $c_I(x)$ exceeds its true number of insertions $I_x$ by at most $min_I \le I/m$.
Proof.
The insert counts of the summary are maintained exactly as in the SpaceSaving algorithm run on the substream of $\sigma$ consisting of the insertion operations only, and both properties are well known to hold for SpaceSaving (Metwally et al., 2005; Berinde et al., 2010). ∎
Lemma 3.8.
After processing an $\alpha$-bounded deletion data stream $\sigma$, Integrated SpaceSaving± with $m$ counters ensures that, for every item $x$, $|\hat{f}(x) - f(x)| \le I/m$.
Proof.
First consider the case when $x$ is unmonitored, so the estimate is $\hat{f}(x) = 0$. Its true frequency is at most its number of insertions $I_x$, and by Lemma 3.7 that is bounded by $I/m$; hence the Lemma holds for $x$.
Now, consider the case when $x$ is in the summary, and consider the last time $t$ at which $x$ was inserted into the summary. Let $i_t$ and $d_t$ denote the number of insertions and of deletions of $x$ since time $t$, and let $min_t$ denote the value of $min_I$ at time $t$. Since the algorithm updates $c_I(x)$ and $c_D(x)$ correctly while $x$ is in the summary, the query algorithm outputs the estimate $\hat{f}(x) = min_t + i_t - d_t$. On the other hand, the frequency of $x$ equals the net number of occurrences of $x$ prior to time $t$, plus $i_t - d_t$. By Lemma 3.7, the net number of occurrences of $x$ prior to time $t$ is at most $min_t \le I/m$, so $0 \le \hat{f}(x) - f(x) \le I/m$, implying the Lemma. ∎
Theorem 3.9.
After processing an $\alpha$-bounded deletion data stream $\sigma$, Integrated SpaceSaving± with $m = \alpha/\epsilon$ counters ($O(\alpha/\epsilon)$ space) solves the deterministic frequency estimation problem: for every item $x$, $|\hat{f}(x) - f(x)| \le \epsilon F$.
Proof.
By Lemma 3.8, the estimation error of every item is at most $I/m$. By the definition of bounded deletion, $I \le \alpha F$. Hence, with $m = \alpha/\epsilon$ counters, the error is at most $\epsilon F$. ∎
Integrated SpaceSaving± also solves the heavy hitters problem. Since Integrated SpaceSaving± never underestimates monitored items (Lemma 3.6), reporting all items with frequency estimations greater than or equal to $\epsilon F$ identifies all heavy hitters, as shown in Theorem 3.10.
Theorem 3.10.
Integrated SpaceSaving± solves the heavy hitters problem in the bounded deletion model using $O(\alpha/\epsilon)$ space.
Proof.
Assume a heavy hitter $x$ is not contained in the reported set of all items with estimated frequency larger than or equal to $\epsilon F$. Recall, by the definition of heavy hitters, $f(x) \ge \phi F > \epsilon F$ (for $\phi > \epsilon$). Since $x$ is not reported, its frequency estimation $\hat{f}(x)$ must be less than $\epsilon F < f(x)$, i.e., $x$ is underestimated. However, a heavy hitter must be monitored, since otherwise, by Lemma 3.7, its number of insertions and hence its frequency would be at most $I/m \le \epsilon F$; and by Lemma 3.6, Integrated SpaceSaving± never underestimates the frequency of monitored items.
Hence, by contradiction, Integrated SpaceSaving± solves the deterministic heavy hitters problem. ∎
3.3. Comparisons between Double SpaceSaving and Integrated SpaceSaving
We assume each field uses the same number of bits (32 bits in our implementation). Following the proof of Theorem 3.9, and since each entry in the summary has three fields (item, insert count, delete count), Integrated SpaceSaving± uses $3 \cdot \alpha/\epsilon$ fields. On the other hand, following the proof of Theorem 3.1, and since each entry in the summary uses two fields (item, count), Double SpaceSaving± uses $2 \cdot (2\alpha/\epsilon + 2(\alpha-1)/\epsilon)$ fields. Hence, Integrated SpaceSaving± requires a smaller memory footprint. Double SpaceSaving± has its own advantage: it can be extended to support unbiased estimations (Unbiased Double SpaceSaving±).
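For concreteness, a worked instance of this comparison with $\alpha = 2$ and $\epsilon = 0.01$ (illustrative values of our choosing):

```latex
% Integrated SpaceSaving±: m = \alpha/\epsilon entries, 3 fields each.
3 \cdot \frac{\alpha}{\epsilon} = 3 \cdot 200 = 600 \ \text{fields}
% Double SpaceSaving±: m_I = 2\alpha/\epsilon and m_D = 2(\alpha-1)/\epsilon
% entries, 2 fields each.
2 \left( \frac{2\alpha}{\epsilon} + \frac{2(\alpha - 1)}{\epsilon} \right)
  = 2\,(400 + 200) = 1200 \ \text{fields}
```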
4. Tighter Analysis on Error Bounds
In this section, we present a tighter analysis of the proposed algorithms. In Section 3, we showed that the proposed algorithms achieve the desired error bound $\epsilon F$, in which the error bound depends on all items in the data stream $\sigma$. In this section, we first prove that both algorithms guarantee the residual error bound, and then prove that they both also ensure the relative error bound under relatively mild assumptions. We take inspiration from (Berinde et al., 2010; Metwally et al., 2005; Zhao et al., 2021a; Mathieu and de Rougemont, 2023).
4.1. Residual Error Bound
We define $F_\alpha^{res(k)} = I^{res(k)} + D^{res(k)}$, where $I^{res(k)} = I - \sum_{i=1}^{k} f^I_i$ and $D^{res(k)} = D - \sum_{i=1}^{k} f^D_i$. Observe that $F_\alpha^{res(k)}$ is equivalent to the residual quantity $F^{res(k)}$ of Lemma 2.4 for the insertion-only model, as $\alpha$ is set to 1 (in which case $D^{res(k)} = 0$ and $f^I_i = f_i$). For skewed streams, the residual error bound ($\epsilon F_\alpha^{res(k)}$) is much tighter than the original error bound ($\epsilon F$), since the residual error depends less on the heavy hitters.
First, we derive the residual error bound for Double SpaceSaving±.
Theorem 4.1.
Double SpaceSaving± with $m_I = m_D = k + 1/\epsilon$ counters achieves the residual error bound guarantee, in which $|\hat{f}(x) - f(x)| \le \epsilon F_\alpha^{res(k)}$ for every item $x$, where $F_\alpha^{res(k)} = I^{res(k)} + D^{res(k)}$.
Proof.
First, let us define the maximum error for Double SpaceSaving± as $E$. We know that $E \le I^{res(k)}/(m_I - k) + D^{res(k)}/(m_D - k)$ by Lemma 2.4 applied to each summary and the triangle inequality. By the definition of the bounded deletion model, $I \le \alpha F$ and $D \le (\alpha - 1)F$. We also know that $\sum_{i=1}^{k} f^I_i \ge \sum_{i=1}^{k} f_i$ (this is because the sum of the $k$ largest insert counts has to be larger than or equal to the sum of the $k$ largest frequencies), and hence $I^{res(k)} \le D + F^{res(k)} \le (\alpha - 1)F + F^{res(k)}$; together with $D^{res(k)} \le D \le (\alpha - 1)F$, this gives $F_\alpha^{res(k)} \le F^{res(k)} + 2(\alpha - 1)F$, which recovers the insertion-only residual quantity when $\alpha = 1$. Let $m_I = k + 1/\epsilon$ and $m_D = k + 1/\epsilon$. Then, $E \le \epsilon (I^{res(k)} + D^{res(k)}) = \epsilon F_\alpha^{res(k)}$. ∎
We now present the residual error bound for Integrated SpaceSaving±. Before presenting the proof, we first establish a helpful Lemma.
Lemma 4.2.
The minimum insert count ($min_I$) in Integrated SpaceSaving± with $m$ counters is less than or equal to $I^{res(k)}/(m - k)$, where $k < m$.
Proof.
The sum of all insert counts equals $I$, as the sum of all insert counts always increases by 1 after processing an insertion (Lemma 3.4). If we zoom in on the $k$ largest insert counts in the summary, the sum of their insert counts is no less than $\sum_{i=1}^{k} f^I_i$, since no monitored item's count is underestimated, as shown in Lemma 3.6. As a result, the remaining $m - k$ insert counts sum to at most $I - \sum_{i=1}^{k} f^I_i = I^{res(k)}$, so $min_I \le I^{res(k)}/(m - k)$. We know that, by the definition of bounded deletion, $I \le \alpha F$. Hence, $min_I \le I^{res(k)}/(m - k) \le \alpha F/(m - k)$. ∎
Theorem 4.3.
Integrated SpaceSaving± with $m = k + 1/\epsilon$ counters achieves the residual error bound guarantee, in which $|\hat{f}(x) - f(x)| \le \epsilon I^{res(k)} \le \epsilon F_\alpha^{res(k)}$ for every item $x$. This follows from Lemma 4.2, since the estimation error of every item is at most $min_I$ (see the proof of Lemma 3.8).
4.2. Relative Error Bound
The relative error guarantee is defined such that, for every item $x$, $|\hat{f}(x) - f(x)| \le \epsilon f(x)$. The relative error has been shown to be much harder to achieve, but it is of practical importance (Gupta and Zane, 2003; Cormode et al., 2021). In a recent work, Mathieu and de Rougemont (Mathieu and de Rougemont, 2023) showed that the original SpaceSaving algorithm achieves relative error guarantees under mild assumptions. Accordingly, we demonstrate that the proposed algorithms in the bounded deletion model also achieve relative error guarantees under similar assumptions.
Lemma 4.4.
Let $f$ be $\beta$-decreasing with $\beta > 2$. Then $\sum_{i > t} f(i) \le \frac{\beta}{\beta - 2}\, t\, f(t)$, and hence $F^{res(t)} \le \frac{\beta}{\beta - 2}\, t\, f(t)$.
Proof.
We omit the proof; it is proven in Lemma 5 of (Mathieu and de Rougemont, 2023). ∎
Definition 4.5 (Zipfian Distribution (Zipf, 2016)).
Let a function $f$ with domain $[n]$ follow the Zipf distribution with parameter $\zeta$; then $f(i) = \frac{F}{i^{\zeta}\, H_{n,\zeta}}$, where $H_{n,\zeta} = \sum_{j=1}^{n} j^{-\zeta}$ and $F = \sum_{i=1}^{n} f(i)$.
By the definition of the Zipf distribution, we know that $H_{n,\zeta}$ converges to a small constant when $\zeta > 1$ and $n \to \infty$.
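This convergence is easy to check numerically (an illustrative script; the choice $\zeta = 1.5$ is ours):

```python
# H_{n,zeta} = sum_{j=1}^{n} j^{-zeta} approaches the Riemann zeta value
# (about 2.612 for zeta = 1.5) as n grows, so it is a small constant.
def H(n, zeta):
    return sum(j ** -zeta for j in range(1, n + 1))

for n in (10**3, 10**5, 10**6):
    print(n, round(H(n, zeta=1.5), 4))
```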
Theorem 4.6.
Assuming the insertion and deletion counts in the $\alpha$-bounded deletion model follow the $\beta$-decreasing property ($\beta > 2$) and the exact frequencies follow the Zipf distribution with parameter $\zeta > 1$, Double SpaceSaving± with $m_I = m_D = k + 2c\,\alpha H_{n,\zeta} k^{\zeta}/\epsilon$ counters, where $c = \beta/(\beta - 2)$, achieves the relative error bound guarantee $|\hat{f}(x) - f(x)| \le \epsilon f(x)$ for every item $x$ of rank at most $k$.
Proof.
Recall that the maximum estimation error is upper bounded by $E \le I^{res(k)}/(m_I - k) + D^{res(k)}/(m_D - k)$. Since the insertion and deletion counts follow the $\beta$-decreasing property, Lemma 4.4 gives $I^{res(k)} \le c\,k\,f^I_k$ and $D^{res(k)} \le c\,k\,f^D_k$, with $c = \beta/(\beta - 2)$. Hence, $E \le c\,k\,f^I_k/(m_I - k) + c\,k\,f^D_k/(m_D - k)$.
For insertions, we know i) $k\,f^I_k \le \sum_{i=1}^{k} f^I_i \le I$, ii) $I \le \alpha F$, and iii) $F = H_{n,\zeta}\, k^{\zeta} f_k$ by the Zipf assumption; for deletions, we know i) $k\,f^D_k \le \sum_{i=1}^{k} f^D_i \le D$, ii) $D \le (\alpha - 1) F$, and iii) the same Zipf identity.
Hence, $f^I_k \le \alpha H_{n,\zeta}\, k^{\zeta - 1} f_k$ and $f^D_k \le \alpha H_{n,\zeta}\, k^{\zeta - 1} f_k$. Let $m_I = m_D = k + 2c\,\alpha H_{n,\zeta} k^{\zeta}/\epsilon$; then $E \le \epsilon f_k$, and hence for all items $x$ of rank at most $k$, $|\hat{f}(x) - f(x)| \le \epsilon f(x)$.
∎
Theorem 4.7.
Assuming the insertion counts in the $\alpha$-bounded deletion model follow the $\beta$-decreasing property ($\beta > 2$) and the exact frequencies follow the Zipf distribution with parameter $\zeta > 1$, Integrated SpaceSaving± with $m = k + c\,\alpha H_{n,\zeta} k^{\zeta}/\epsilon$ counters, where $c = \beta/(\beta - 2)$, achieves the relative error bound guarantee $|\hat{f}(x) - f(x)| \le \epsilon f(x)$ for every item $x$ of rank at most $k$.
Proof.
We know that the estimation error is at most $min_I$, and $min_I \le I^{res(k)}/(m - k)$ by Lemma 4.2. Since the insertion counts follow the $\beta$-decreasing property, $I^{res(k)} \le c\,k\,f^I_k$ with $c = \beta/(\beta - 2)$ by Lemma 4.4. Hence, the error is at most $c\,k\,f^I_k/(m - k)$. We know i) $k\,f^I_k \le I$, ii) $I \le \alpha F$, and iii) $F = H_{n,\zeta}\, k^{\zeta} f_k$ by the Zipf assumption. Hence, the error is at most $c\,\alpha H_{n,\zeta}\, k^{\zeta} f_k/(m - k)$. Let $m = k + c\,\alpha H_{n,\zeta} k^{\zeta}/\epsilon$; then the error is at most $\epsilon f_k$, and hence for all items $x$ of rank at most $k$, $|\hat{f}(x) - f(x)| \le \epsilon f(x)$. ∎
5. Mergeability
Mergeability is a desirable property in distributed settings: given two summaries built over two data sets, the merged summary must have error and size guarantees equivalent to those of a single summary that processed the union of the two data sets.
Definition 5.1 (Mergeable (Agarwal et al., 2013)).
A summary is mergeable if there exists a merge algorithm that produces a summary $S(\sigma_1 \uplus \sigma_2, \epsilon)$ from two input summaries $S(\sigma_1, \epsilon)$ and $S(\sigma_2, \epsilon)$. Note that the size of all three summaries must be the same while achieving the same error guarantee $\epsilon$.
Both the original SpaceSaving and the Unbiased SpaceSaving have been shown to be mergeable (Agarwal et al., 2013; Ting, 2018). Since Double SpaceSaving± relies on two independent SpaceSaving summaries, Double SpaceSaving± is also mergeable.
For Integrated SpaceSaving±, to define a merge algorithm, we take inspiration from (Agarwal et al., 2013). The merge algorithm is shown in Algorithm 8 and sketched below. Given two Integrated SpaceSaving± summaries of $m$ counters, $S_1$ and $S_2$, we first union the two summaries (i.e., if both summaries monitor item $x$, then $x$'s insert and delete counts in the merged summary are the sums of $x$'s insert and delete counts from $S_1$ and $S_2$). After the union, the merged summary contains at most $2m$ counters. The merged summary then consists of the $m$ items with the largest insert counts among these $2m$ counters.
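A minimal sketch of Algorithm 8, reusing the IntegratedSpaceSavingPM class from Section 3.2 (the function name is ours):

```python
# Merge two Integrated SpaceSaving± summaries of m counters each: union the
# per-item insert/delete counts, then keep the m entries with the largest
# insert counts, so that no kept item is underestimated.
def merge(s1, s2):
    merged = IntegratedSpaceSavingPM(s1.m)
    union = {}
    for s in (s1, s2):
        for x, (ci, cd) in s.entries.items():
            ci0, cd0 = union.get(x, (0, 0))
            union[x] = (ci0 + ci, cd0 + cd)
    top = sorted(union.items(), key=lambda kv: kv[1][0], reverse=True)[:s1.m]
    merged.entries = {x: [ci, cd] for x, (ci, cd) in top}
    return merged
```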
Theorem 5.2.
Integrated SpaceSaving± summaries with $m$ counters are mergeable for bounded deletion data streams, following merge Algorithm 8.
Proof.
Assume the given summaries with $m$ counters, $S_1$ and $S_2$, have processed two bounded deletion data streams $\sigma_1$ and $\sigma_2$ with $I_1$ and $I_2$ insertions and $F_1$ and $F_2$ total frequencies, respectively. Let $I = I_1 + I_2$ and $F = F_1 + F_2$. First, we observe that the union step does not introduce any additional error. As a result, the error for the unioned items is at most $I_1/m + I_2/m = I/m$.
Since we do not want to underestimate any monitored item, Algorithm 8 keeps the $m$ largest items from the unioned summary based on insert counts. We need to show that evicting all items except the $m$ largest will not lead to an estimation error larger than $I/m$ (and we know that $I/m \le \epsilon F$ for $m \ge \alpha/\epsilon$). This can be shown by contradiction. Assume there exists an item $x$ in the union of $S_1$ and $S_2$ with true frequency larger than or equal to $I/m$ that is not included in the $m$ largest items. Note that $x$'s insert count must be at least its true frequency by Lemma 3.6. This would imply that the sum of the insert counts of the $m$ largest items and $x$ is at least $(m + 1) \cdot I/m > I$. However, by Lemma 3.4, we know the sum of the insert counts cannot exceed $I$. As a result, evicting all items except the $m$ largest from the unioned summary based on insert counts does not lead to an estimation error larger than $I/m$. Therefore, Integrated SpaceSaving± is mergeable.
∎
6. Evaluation
This section evaluates and compares the performance of our proposed algorithms with linear sketches from the turnstile model (Li et al., 2014). Our proposed counter-based summaries make no assumption on the input universe but operate in the bounded deletion model, whereas linear sketches depend on the universe size but allow the entire data set to be deleted. The experiments aim to better understand the advantages and disadvantages of our proposed algorithms. The linear sketches that we compare with are Count-Min (Cormode and Muthukrishnan, 2005) and Count Sketch (Charikar et al., 2002).
Experiment Setup. We implemented all of the algorithms in Python. Throughout all experiments, we set the parameters of the linear sketches to align the experiments with the theoretical literature (Bhattacharyya et al., 2018).
Data Sets. We use both synthetic and real-world data consisting of items that are inserted and deleted. Zipf Stream: the insertions are drawn from the Zipf distribution (Zipf, 2016) and the deletions are uniformly chosen from the insertions; deletions happen after all the insertions. The cardinality is 22k. YCSB: we use 60% insertions and 40% updates (a delete followed by an insert) with the request distribution set to zipfian in the YCSB benchmark (Cooper et al., 2010), yielding 116,645 insertions interleaved with 39,825 deletions. The cardinality is 65k.
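A generator in the spirit of the Zipf Stream dataset might look as follows (a sketch: the sizes and the exponent below are placeholders, not the exact values used in our experiments):

```python
import random

# Insertions drawn from a Zipf distribution; deletions chosen uniformly from
# the insertions (so no frequency goes negative) and appended after all
# insertions, respecting the alpha-bounded deletion constraint.
def zipf_stream(n_items=22_000, n_inserts=100_000, alpha=2, zeta=1.5):
    weights = [j ** -zeta for j in range(1, n_items + 1)]
    inserts = random.choices(range(n_items), weights=weights, k=n_inserts)
    n_deletes = random.randint(0, int((1 - 1 / alpha) * n_inserts))
    deletes = random.sample(inserts, n_deletes)   # without replacement over positions
    return [("+", x) for x in inserts] + [("-", x) for x in deletes]
```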
Metrics. We use the average relative error (ARE) as the metric for frequency estimation. Let the final data set be $A = \{x : f(x) > 0\}$. The ARE is calculated as $\frac{1}{|A|} \sum_{x \in A} |\hat{f}(x) - f(x)|/f(x)$. A lower ARE indicates more accurate approximations. For identifying the top 100 heavy hitters, we query all items in $A$ and report the set of 100 items with the largest estimated frequencies; the metric is the F1 score.
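A sketch of the ARE computation, assuming a summary object exposing the query interface from Section 3 (names ours):

```python
# Average relative error over the items with nonzero final frequency.
def average_relative_error(true_freq, summary):
    active = {x: f for x, f in true_freq.items() if f > 0}
    errors = (abs(summary.query(x) - f) / f for x, f in active.items())
    return sum(errors) / len(active)
```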
6.1. Main Results
Our proposed algorithms perform very well on the frequency estimation task, as shown in Figure 1(a)-(b). The y-axis is the average relative error and the x-axis is the number of counters used. For fair comparisons, we let Integrated SpaceSaving± use three counters per entry, Double SpaceSaving± and Unbiased Double SpaceSaving± use two counters per entry, and the linear sketches use one counter per entry. We find that Integrated SpaceSaving± is always the best, followed by Double SpaceSaving± (DSS) or Unbiased Double SpaceSaving± (UDSS), and Count Sketch always ranks fourth. For instance, when using 16384 counters over the YCSB dataset, the average relative error of Integrated SpaceSaving± is 2.20, DSS is 2.25, UDSS is 2.26, Count Sketch is 2.29, and Count-Min is 35.
In identifying the top 100 items, the linear sketches are more competitive with our proposed algorithms. As shown in Figure 1(c)-(d), the y-axis is the F1 score and the x-axis is the number of counters. When a larger memory budget is allowed, Integrated SpaceSaving± is always the best. With a space budget of up to 8192 counters, DSS and UDSS always outperform the linear sketches. For instance, using 8192 counters over YCSB, DSS and UDSS both achieve a 0.95 F1 score, whereas Count Sketch scores 0.91 and Count-Min scores 0.78.
7. Conclusion
This paper presents a detailed analysis of the SpaceSaving± family of algorithms with bounded deletions. We proposed new algorithms built on top of the original SpaceSaving and SpaceSaving±. They exhibit many desirable properties such as no underestimation, unbiased estimation, residual/relative error guarantees, and mergeability. We believe these unique characteristics make our proposed algorithms strong candidates for real-world applications and systems.
References
- Agarwal et al. (2012) Pankaj K Agarwal, Graham Cormode, Zengfeng Huang, Jeff Phillips, Zhewei Wei, and Ke Yi. 2012. Mergeable summaries. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems. 23–34.
- Agarwal et al. (2013) Pankaj K Agarwal, Graham Cormode, Zengfeng Huang, Jeff M Phillips, Zhewei Wei, and Ke Yi. 2013. Mergeable summaries. ACM Transactions on Database Systems (TODS) 38, 4 (2013), 1–28.
- Agrawal et al. (2011) Divyakant Agrawal, Philip Bernstein, Elisa Bertino, Susan Davidson, Umeshwas Dayal, Michael Franklin, Johannes Gehrke, Laura Haas, Alon Halevy, Jiawei Han, et al. 2011. Challenges and opportunities with Big Data 2011-1. (2011).
- Alon et al. (1996) Noga Alon, Yossi Matias, and Mario Szegedy. 1996. The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing. 20–29.
- Bailis et al. (2017) Peter Bailis, Edward Gan, Samuel Madden, Deepak Narayanan, Kexin Rong, and Sahaana Suri. 2017. Macrobase: Prioritizing attention in fast data. In Proceedings of the 2017 ACM International Conference on Management of Data. 541–556.
- Ben-Eliezer et al. (2022) Omri Ben-Eliezer, Rajesh Jayaram, David P Woodruff, and Eylon Yogev. 2022. A framework for adversarially robust streaming algorithms. ACM Journal of the ACM (JACM) 69, 2 (2022), 1–33.
- Berinde et al. (2010) Radu Berinde, Piotr Indyk, Graham Cormode, and Martin J Strauss. 2010. Space-optimal heavy hitters with strong error bounds. ACM Transactions on Database Systems (TODS) 35, 4 (2010), 1–28.
- Bhattacharyya et al. (2018) Arnab Bhattacharyya, Palash Dey, and David P Woodruff. 2018. An optimal algorithm for l1-heavy hitters in insertion streams and related problems. ACM Transactions on Algorithms (TALG) 15, 1 (2018), 1–27.
- Bloom (1970) Burton H Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422–426.
- Braverman et al. (2017) Vladimir Braverman, Stephen R Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P Woodruff. 2017. BPTree: an l2 heavy hitters algorithm using constant memory. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 361–376.
- Broder (1997) Andrei Z Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171). IEEE, 21–29.
- Bruch et al. (2023) Sebastian Bruch, Franco Maria Nardini, Amir Ingber, and Edo Liberty. 2023. An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors. arXiv preprint arXiv:2301.10622 (2023).
- Charikar et al. (2002) Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming. Springer, 693–703.
- Chen et al. (2023) Wei-Ning Chen, Ayfer Ozgur, Graham Cormode, and Akash Bharadwaj. 2023. The communication cost of security and privacy in federated frequency estimation. In International Conference on Artificial Intelligence and Statistics. PMLR, 4247–4274.
- Chiosa et al. (2021) Monica Chiosa, Thomas B Preußer, and Gustavo Alonso. 2021. Skt: A one-pass multi-sketch data analytics accelerator. Proceedings of the VLDB Endowment 14, 11 (2021), 2369–2382.
- Cohen et al. (2022) Edith Cohen, Xin Lyu, Jelani Nelson, Tamás Sarlós, Moshe Shechner, and Uri Stemmer. 2022. On the robustness of countsketch to adaptive inputs. In International Conference on Machine Learning. PMLR, 4112–4140.
- Cooper et al. (2010) Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing. 143–154.
- Cormode (2023) Graham Cormode. 2023. Applications of Sketching and Pathways to Impact. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (Seattle, WA, USA) (PODS ’23). Association for Computing Machinery, New York, NY, USA, 5–10. https://doi.org/10.1145/3584372.3589937
- Cormode and Hadjieleftheriou (2008) Graham Cormode and Marios Hadjieleftheriou. 2008. Finding frequent items in data streams. Proceedings of the VLDB Endowment 1, 2 (2008), 1530–1541.
- Cormode et al. (2021) Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, and Pavel Veselỳ. 2021. Relative error streaming quantiles. In Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 96–108.
- Cormode and Muthukrishnan (2005) Graham Cormode and Shan Muthukrishnan. 2005. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55, 1 (2005), 58–75.
- Cormode and Yi (2020) Graham Cormode and Ke Yi. 2020. Small summaries for big data. Cambridge University Press.
- Dayan et al. (2017) Niv Dayan, Manos Athanassoulis, and Stratos Idreos. 2017. Monkey: Optimal navigable key-value store. In Proceedings of the 2017 ACM International Conference on Management of Data. 79–94.
- Demaine et al. (2002) Erik D Demaine, Alejandro López-Ortiz, and J Ian Munro. 2002. Frequency estimation of internet packet streams with limited space. In European Symposium on Algorithms. Springer, 348–360.
- Duffield et al. (2007) Nick Duffield, Carsten Lund, and Mikkel Thorup. 2007. Priority sampling for estimation of arbitrary subset sums. Journal of the ACM (JACM) 54, 6 (2007), 32–es.
- Flajolet et al. (2007) Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Discrete mathematics & theoretical computer science Proceedings (2007).
- Gupta and Zane (2003) Anupam Gupta and Francis Zane. 2003. Counting inversions in lists. In SODA, Vol. 3. 253–254.
- Han et al. (2006) Jiawei Han, Jian Pei, and Hanghang Tong. 2006. Data mining: Concepts and techniques. Morgan Kaufmann 10, 559-569 (2006), 4.
- Hasidim et al. (2020) Avinatan Hasidim, Haim Kaplan, Yishay Mansour, Yossi Matias, and Uri Stemmer. 2020. Adversarially robust streaming algorithms via differential privacy. Advances in Neural Information Processing Systems 33 (2020), 147–158.
- Ivkin et al. (2019) Nikita Ivkin, Edo Liberty, Kevin Lang, Zohar Karnin, and Vladimir Braverman. 2019. Streaming quantiles algorithms with small space and update time. arXiv preprint arXiv:1907.00236 (2019).
- Jayaram and Woodruff (2018) Rajesh Jayaram and David P Woodruff. 2018. Data streams with bounded deletions. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 341–354.
- Jiang et al. (2018) Jiawei Jiang, Fangcheng Fu, Tong Yang, and Bin Cui. 2018. Sketchml: Accelerating distributed machine learning with data sketches. In Proceedings of the 2018 International Conference on Management of Data. 1269–1284.
- Karnin et al. (2016) Zohar Karnin, Kevin Lang, and Edo Liberty. 2016. Optimal quantile approximation in streams. In 2016 ieee 57th annual symposium on foundations of computer science (focs). IEEE, 71–78.
- Karp et al. (2003) Richard M Karp, Scott Shenker, and Christos H Papadimitriou. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems (TODS) 28, 1 (2003), 51–55.
- Lebeda and Tetek (2023) Christian Janos Lebeda and Jakub Tetek. 2023. Better differentially private approximate histograms and heavy hitters using the Misra-Gries sketch. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 79–88.
- Li et al. (2020) Jizhou Li, Zikun Li, Yifei Xu, Shiqi Jiang, Tong Yang, Bin Cui, Yafei Dai, and Gong Zhang. 2020. Wavingsketch: An unbiased and generic sketch for finding top-k items in data streams. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1574–1584.
- Li et al. (2014) Yi Li, Huy L Nguyen, and David P Woodruff. 2014. Turnstile streaming algorithms might as well be linear sketches. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing. 174–183.
- Liu et al. (2023) Zirui Liu, Yixin Zhang, Yifan Zhu, Ruwen Zhang, Tong Yang, Kun Xie, Sha Wang, Tao Li, and Bin Cui. 2023. TreeSensing: Linearly Compressing Sketches with Flexibility. Proceedings of the ACM on Management of Data 1, 1 (2023), 1–28.
- Manku and Motwani (2002) Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency counts over data streams. In VLDB’02: Proceedings of the 28th International Conference on Very Large Databases. Elsevier, 346–357.
- Mathieu and de Rougemont (2023) Claire Mathieu and Michel de Rougemont. 2023. Testing frequency distributions in a stream. arXiv preprint arXiv:2309.11175 (2023).
- Metwally et al. (2005) Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2005. Efficient computation of frequent and top-k elements in data streams. In Database Theory-ICDT 2005: 10th International Conference, Edinburgh, UK, January 5-7, 2005. Proceedings 10. Springer, 398–412.
- Misra and Gries (1982) Jayadev Misra and David Gries. 1982. Finding repeated elements. Science of computer programming 2, 2 (1982), 143–152.
- Pagh and Thorup (2022) Rasmus Pagh and Mikkel Thorup. 2022. Improved utility analysis of private countsketch. Advances in Neural Information Processing Systems 35 (2022), 25631–25643.
- Rothchild et al. (2020) Daniel Rothchild, Ashwinee Panda, Enayat Ullah, Nikita Ivkin, Ion Stoica, Vladimir Braverman, Joseph Gonzalez, and Raman Arora. 2020. Fetchsgd: Communication-efficient federated learning with sketching. In International Conference on Machine Learning. PMLR, 8253–8265.
- Ting (2018) Daniel Ting. 2018. Data sketches for disaggregated subset sum and frequent item estimation. In Proceedings of the 2018 International Conference on Management of Data. 1129–1140.
- Wang et al. (2019) Pinghui Wang, Yiyan Qi, Yuanming Zhang, Qiaozhu Zhai, Chenxu Wang, John CS Lui, and Xiaohong Guan. 2019. A memory-efficient sketch method for estimating high similarities in streaming sets. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 25–33.
- Wang et al. (2024) Yiping Wang, Yanhao Wang, and Cen Chen. 2024. DPSW-Sketch: A Differentially Private Sketch Framework for Frequency Estimation over Sliding Windows (Technical Report). arXiv preprint arXiv:2406.07953 (2024).
- Xu et al. (2023) Yuchen Xu, Wenfei Wu, Bohan Zhao, Tong Yang, and Yikai Zhao. 2023. MimoSketch: A Framework to Mine Item Frequency on Multiple Nodes with Sketches. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2838–2849.
- Yang et al. (2018) Tong Yang, Jie Jiang, Peng Liu, Qun Huang, Junzhi Gong, Yang Zhou, Rui Miao, Xiaoming Li, and Steve Uhlig. 2018. Elastic sketch: Adaptive and fast network-wide measurements. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. 561–575.
- Zakhary et al. (2021) Victor Zakhary, Lawrence Lim, Divy Agrawal, and Amr El Abbadi. 2021. Cache on Track (CoT): Decentralized Elastic Caches for Cloud Environments. In International Conference on Extending Database Technology.
- Zhao et al. (2021a) Fuheng Zhao, Divyakant Agrawal, Amr El Abbadi, and Ahmed Metwally. 2021a. SpaceSaving±: An Optimal Algorithm for Frequency Estimation and Frequent Items in the Bounded Deletion Model. arXiv preprint arXiv:2112.03462 (2021).
- Zhao et al. (2023a) Fuheng Zhao, Divyakant Agrawal, Amr El Abbadi, Ahmed Metwally, Claire Mathieu, and Michel de Rougemont. 2023a. Errata for "SpaceSaving±: An Optimal Algorithm for Frequency Estimation and Frequent Items in the Bounded-Deletion Model". Proceedings of the VLDB Endowment 17, 4 (2023), 643–643.
- Zhao et al. (2023b) Fuheng Zhao, Punnal Ismail Khan, Divyakant Agrawal, Amr El Abbadi, Arpit Gupta, and Zaoxing Liu. 2023b. Panakos: Chasing the Tails for Multidimensional Data Streams. Proceedings of the VLDB Endowment 16, 6 (2023), 1291–1304.
- Zhao et al. (2021b) Fuheng Zhao, Sujaya Maiyya, Ryan Wiener, Divyakant Agrawal, and Amr El Abbadi. 2021b. Kll Approximate Quantile Sketches over Dynamic Datasets. Proceedings of the VLDB Endowment 14, 7 (2021), 1215–1227.
- Zhao et al. (2022) Fuheng Zhao, Dan Qiao, Rachel Redberg, Divyakant Agrawal, Amr El Abbadi, and Yu-Xiang Wang. 2022. Differentially private linear sketches: Efficient implementations and applications. Advances in Neural Information Processing Systems 35 (2022), 12691–12704.
- Zhao et al. (2023c) Fuheng Zhao, Leron Reznikov, Divyakant Agrawal, and Amr El Abbadi. 2023c. Autumn: A Scalable Read Optimized LSM-tree based Key-Value Stores with Fast Point and Range Read Speed. arXiv preprint arXiv:2305.05074 (2023).
- Zipf (2016) George Kingsley Zipf. 2016. Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books.