
WOR and $p$'s:
Sketches for $\ell_p$-Sampling Without Replacement

Edith Cohen
Google Research
Tel Aviv University
[email protected]
Rasmus Pagh
IT University of Copenhagen
BARC
Google Research
[email protected]

David P. Woodruff
CMU
[email protected]
Abstract

Weighted sampling is a fundamental tool in data analysis and machine learning pipelines. Samples are used for efficient estimation of statistics or as sparse representations of the data. When weight distributions are skewed, as is often the case in practice, without-replacement (WOR) sampling is much more effective than with-replacement (WR) sampling: it provides a broader representation and higher accuracy for the same number of samples. We design novel composable sketches for WOR $\ell_p$ sampling, weighted sampling of keys according to a power $p\in[0,2]$ of their frequency (or, for signed data, the sum of updates). Our sketches have size that grows only linearly with the sample size. Our design is simple and practical, despite intricate analysis, and is based on off-the-shelf use of widely implemented heavy hitters sketches such as CountSketch. Our method is the first to provide WOR sampling in the important regime of $p>1$ and the first to handle signed updates for $p>0$.

1 Introduction

Weighted random sampling is a fundamental tool that is pervasive in machine learning and data analysis pipelines. A sample serves as a sparse summary of the data and provides efficient estimation of statistics and aggregates.

We consider data $\mathcal{E}$ presented as elements in the form of key value pairs $e=(e.\textsf{key},e.\textsf{val})$. We operate with respect to the aggregated form of keys and their frequencies $\nu_{x}:=\sum_{e\mid e.\textsf{key}=x}e.\textsf{val}$, defined as the sum of values of elements with key $x$. Examples of such data sets are stochastic gradient updates (keys are parameters, element values are signed, and the aggregated form is the combined gradient), search (keys are queries, elements have unit values, and the aggregated form is query-frequency pairs), or training examples for language models (keys are co-occurring terms).

The data is commonly distributed across servers or devices, or is streamed, and the number of distinct keys is very large. In this scenario it is beneficial to perform computations without explicitly producing a table of key-frequency pairs, as this requires storage or communication that grows linearly with the number of keys. Instead, we use composable sketches, which are data structures that support (i) processing a new element $e$: computing a sketch of $\mathcal{E}\cup\{e\}$ from a sketch of $\mathcal{E}$ and $e$; (ii) merging: computing a sketch of $\mathcal{E}_{1}\cup\mathcal{E}_{2}$ from sketches of each $\mathcal{E}_{i}$; and (iii) producing the desired output from the sketch. Composability facilitates parallel, distributed, or streaming computation. We aim to design sketches of small size, because the sketch size determines the storage and communication requirements. For sampling, we aim for the sketch size to be not much larger than the desired sample size.

The case for $p$'s:

Aggregation and statistics of functions of the frequencies are essential for resource allocation, planning, and management of large scale systems across application areas. The need for efficiency prompted rich theoretical and applied work on streaming and sketching methods that spanned decades [60, 41, 4, 38, 43, 36, 35, 55, 54]. We study $\ell_p$ sampling, weighted sampling of keys with respect to a power $p$ of their frequency $\nu_x^p$. These samples support estimates of frequency statistics of the general form $\sum_{x}f(\nu_{x})L_{x}$ for functions of frequency $f$ and constitute sparse representations of the data. Low powers ($p<1$) are used to mitigate frequent keys and obtain a better resolution of the tail, whereas higher powers ($p>1$) emphasize more frequent keys. Moreover, recent work suggests that on realistic distributions, $\ell_p$ samples for $p\in[0,2]$ provide accurate estimates for a surprisingly broad set of tasks [34].

Sampling is at the heart of stochastic optimization. When training data is distributed [57], sampling can facilitate efficient example selection for training and efficient communication of gradient updates of model parameters. Training examples are commonly weighted by a function of their frequency: language models [59, 66] use low powers $p<1$ of frequency to mitigate the impact of frequent examples. More generally, the function of frequency can be adjusted in the course of training to shift focus to rarer and harder examples as training progresses [9]. A sample of examples can be used to produce stochastic gradients or evaluate loss on domains of examples (expressed as frequency statistics). In distributed learning, the communication of dense gradient updates can be a bottleneck, prompting the development of methods that sparsify communication while retaining accuracy [57, 2, 71, 47]. Weighted sampling by the $p$-th powers of magnitudes complements existing methods that sparsify using heavy hitters (or other methods, e.g., sparsify randomly), provides adjustable emphasis to larger magnitudes, and retains sparsity as updates are composed.

The case for WOR:

Weighted sampling is classically considered with (WR) or without (WOR) replacement. We study here the WOR setting. The benefits of WOR sampling were noted in very early work [44, 42, 72] and are becoming more apparent with modern applications and the typical skewed distributions of massive datasets. WOR sampling provides a broader representation and more accurate estimates, with tail norms replacing full norms in error bounds. Figure 1 illustrates these benefits of WOR for Zipfian distributions with $\ell_1$ sampling (weighted by frequencies) and $\ell_2$ sampling (weighted by the squares of frequencies). We can see that WR samples have a smaller effective sample size than WOR (due to high multiplicity of heavy keys) and that, while both WR and WOR well-approximate the frequency distribution on heavy keys, WOR provides a much better approximation of the tail.


Figure 1: WOR vs. WR. Left and middle: effective vs. actual sample size for $\textsf{Zipf}[\alpha=1]$ and $\textsf{Zipf}[\alpha=2]$, with each point reflecting a single sample. Right: estimates of the frequency distribution for $\textsf{Zipf}[\alpha=2]$.

Related work.

The sampling literature offers many WOR sampling schemes for aggregated data [68, 72, 14, 69, 64, 37, 25, 26, 23]. A particularly appealing technique is bottom-$k$ (order) sampling, where weights are scaled by random variables and the sample is the set of keys with top-$k$ transformed values [69, 64, 37, 25, 26]. There is also a large body of work on sketches for sampling unaggregated data by functions of frequency. There are two primary approaches. The first approach involves transforming data elements so that a bottom-$k$ sample by a function of frequency is converted to an easier problem of finding the top-$k$ keys sorted according to the maximum value of an element with the key. This approach yields WOR distinct ($\ell_0$) sampling [53], $\ell_1$ sampling [41, 22], and sampling with respect to any concave sublinear function of frequency (including $\ell_p$ sampling for $p\leq 1$) [20, 24]. These sketches work with non-negative element values but only provide limited support for negative updates [40, 22]. The second approach performs WR $\ell_p$ sampling for $p\in[0,2]$ using sketches that are random projections [45, 39, 5, 51, 61, 6, 52, 50]. These methods support signed updates but were not adapted to WOR sampling. For $p>2$, a classic lower bound [4, 7] establishes that sketches of size polynomial in the number of distinct keys are required for worst case frequency distributions. This task has also been studied in distributed settings [16, 49]; [49] observes the importance of WOR in that setting but does not allow for updates to element values.

Contributions:

We present WORp: a method for WOR $\ell_p$ sampling for $p\in[0,2]$ via composable sketches of size that grows linearly with the sample size. WORp is simple and practical and uses a bottom-$k$ transform to reduce sampling to a (residual) heavy hitters (rHH) computation on the transformed data. The technical heart of the paper is establishing that for any set of input frequencies, the keys with the top-$k$ transformed frequencies are indeed rHH. In terms of implementation, WORp only requires an off-the-shelf use of popular (and widely implemented) HH sketches [60, 56, 15, 35, 58, 10]. WORp is the first WOR $\ell_p$ sampling method (that uses sample-sized sketches) for the regime $p\in(1,2]$ and the first to fully support negative updates for $p\in(0,2]$. As a bonus, we include practical optimizations (that preserve the theoretical guarantees) and perform experiments that demonstrate both the practicality and accuracy of WORp. (Code for the experiments is provided in the Colab notebook https://colab.research.google.com/drive/1JcNokk_KdQz1gfUarKUcxhnoo9CUL-fZ?usp=sharing.)

In addition to the above, we show that, perhaps surprisingly, it is possible to obtain a WOR $\ell_p$-sample of a set of $k$ indices, for any $p\in[0,2]$, with variation distance at most $\frac{1}{\mathrm{poly}(n)}$ to a true WOR $\ell_p$-sample, using only $k\cdot\mathrm{poly}(\log n)$ bits of memory. Our variation distance is extremely small and cannot be detected by any polynomial time algorithm. This makes it applicable in settings for which privacy may be a concern; indeed, it shows that no polynomial time algorithm can learn anything from the sampled output other than what follows from a simulator who outputs a WOR $\ell_p$-sample from the actual (variation distance $0$) distribution. Finally, for $p\in(0,2)$, we show that the memory of our algorithm is optimal up to an $O(\log^{2}\log n)$ factor.

2 Preliminaries

A dataset $\mathcal{E}$ consists of data elements that are key value pairs $e=(e.\textsf{key},e.\textsf{val})$. The frequency of a key $x$, denoted $\nu_{x}:=\sum_{e\mid e.\textsf{key}=x}e.\textsf{val}$, is the sum of values of elements with key $x$. We use the notation $\boldsymbol{\nu}$ for a vector of frequencies of keys.

For a function $f$ and vector $\boldsymbol{w}$, we denote by $f(\boldsymbol{w})$ the vector with entries $f(w_{x})$. In particular, $\boldsymbol{w}^{p}$ is the vector with entries $w^{p}_{x}$, the $p$-th powers of the entries of $\boldsymbol{w}$. For a vector $\boldsymbol{w}\in\Re^{n}$ and index $i$, we denote by $w_{(i)}$ the value of the entry with the $i$-th largest magnitude in $\boldsymbol{w}$. We denote by $\textsf{order}(\boldsymbol{w})$ the permutation of the indices $[n]=\{1,2,\ldots,n\}$ that corresponds to a decreasing order of entries by magnitude. For $k\geq 1$, we denote by $\textsf{tail}_{k}(\boldsymbol{w})$ the vector with the $k$ entries of largest magnitude removed (or replaced with $0$).
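As a small illustration of this notation (an illustrative aid, not part of the paper's code), the following Python snippet computes $\textsf{order}(\boldsymbol{w})$ and $\textsf{tail}_k(\boldsymbol{w})$ for a dense vector:

    # Illustrative helpers for the notation above (assumes a dense NumPy vector).
    import numpy as np

    def order(w):
        """Permutation of [n] sorting the entries of w by decreasing magnitude."""
        return np.argsort(-np.abs(w))

    def tail_k(w, k):
        """Copy of w with its k largest-magnitude entries replaced by 0."""
        t = np.array(w, dtype=float)
        t[order(w)[:k]] = 0.0
        return t

    # Example: for w = [5, -7, 2, 1], w_(1) = -7 and tail_2(w) zeroes out -7 and 5.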

In the remainder of the section we review the ingredients that we build on: bottom-$k$ sampling, implementing a bottom-$k$ transform on unaggregated data, and composable sketch structures for residual heavy hitters (rHH).

2.1 Bottom-$k$ sampling (ppswor and priority)

Bottom-$k$ sampling (also called order sampling [69]) is a family of without-replacement weighted sampling schemes of a set $\{(x,w_{x})\}$ of key and weight pairs. The weights are transformed via independent random maps $w^{T}_{x}\leftarrow w_{x}/r_{x}$, where $r_{x}\sim\mathcal{D}$ are i.i.d. from some distribution $\mathcal{D}$. The sample includes the pairs $(x,w_{x})$ for the keys $x$ that are top-$k$ by transformed magnitudes $|\boldsymbol{w}^{T}|$ [63, 69, 37, 17, 13, 25, 27]. (Historically, the term bottom-$k$ is due to the analogous use of $1/w_{x}^{T}$, but here we find it more convenient to work with "top-$k$.") For estimation tasks, we also include a threshold $\tau:=|w^{T}_{(k+1)}|$, the $(k+1)$-st largest magnitude of transformed weights. Bottom-$k$ schemes differ by the choice of distribution $\mathcal{D}$. Two popular choices are Probability Proportional to Size WithOut Replacement (ppswor) [69, 17, 27], via the exponential distribution $\mathcal{D}\leftarrow\textsf{Exp}[1]$, and Priority (Sequential Poisson) sampling [63, 37], via the uniform distribution $\mathcal{D}\leftarrow U[0,1]$. Ppswor is equivalent to a weighted sampling process [68] where keys are drawn successively (without replacement) with probability proportional to their weight. Priority sampling mimics pure Probability Proportional to Size (pps) sampling, where sampling probabilities are proportional to weights (but truncated to be at most $1$).
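To make the scheme concrete, here is a minimal Python rendering (an illustrative sketch, not the paper's implementation) of bottom-$k$ sampling over an aggregated set of key-weight pairs with positive weights; the `scheme` argument selects ppswor ($\mathcal{D}=\textsf{Exp}[1]$) or priority ($\mathcal{D}=U[0,1]$):

    # A minimal bottom-k sampler over aggregated (key, weight) pairs (positive weights).
    import numpy as np

    def bottom_k_sample(weights, k, scheme="ppswor", seed=0):
        """weights: dict key -> positive weight. Returns (sample: key -> weight, tau)."""
        rng = np.random.default_rng(seed)
        transformed = {}
        for x, w in weights.items():
            # r_x ~ Exp[1] for ppswor, r_x ~ U[0,1] for priority sampling
            r = rng.exponential(1.0) if scheme == "ppswor" else rng.uniform(0.0, 1.0)
            transformed[x] = w / r  # w^T_x = w_x / r_x
        ranked = sorted(transformed, key=lambda x: -transformed[x])
        sample = {x: weights[x] for x in ranked[:k]}
        # tau is the (k+1)-st largest transformed weight (0 if fewer than k+1 keys)
        tau = transformed[ranked[k]] if len(ranked) > k else 0.0
        return sample, tau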

Estimating statistics from a bottom-$k$ sample

Bottom-$k$ samples provide us with unbiased inverse-probability [44] per-key estimates of $f(w_{x})$, where $f$ is a function applied to the weight [3, 26, 21, 19]:

\widehat{f(w_{x})}:=\begin{cases}\frac{f(w_{x})}{\Pr_{r\sim\mathcal{D}}[r\leq|w_{x}|/\tau]}&\text{ if }x\in S\\ 0&\text{ if }x\notin S\end{cases} \qquad (1)

These estimates can be used to sparsify a vector $f(\boldsymbol{w})$ to $k$ entries or to estimate sum statistics of the general form

\sum_{x}f(w_{x})L_{x} \qquad (2)

using the unbiased estimate

\widehat{\sum_{x}f(w_{x})L_{x}}:=\sum_{x}\widehat{f(w_{x})}L_{x}=\sum_{x\in S}\widehat{f(w_{x})}L_{x}\ .

The quality of estimates depends on $f$ and $L$. We measure the quality of these unbiased estimates by the sum over keys of the per-key variance. With both ppswor and priority sampling and $f(w):=w$, this sum is bounded by the respective one for pps with replacement. The per-key variance bound is

\textsf{Var}[\widehat{w_{x}}]\leq\frac{1}{k-1}w_{x}\|\boldsymbol{w}\|_{1} \qquad (3)

and the respective sum is bounded by $\sum_{x}\textsf{Var}[\widehat{w_{x}}]\leq\frac{1}{k-1}\|\boldsymbol{w}\|^{2}_{1}$. This can be tightened to $\textsf{Var}[\widehat{w_{x}}]\leq\min\{O(\frac{1}{k})w_{x}\|\textsf{tail}_{k}(\boldsymbol{w})\|_{1},\ \exp\left(-O(k)\frac{w_{x}}{\|\textsf{tail}_{k}(\boldsymbol{w})\|_{1}}\right)w_{x}^{2}\}$, with a respective bound of $O(\|\textsf{tail}_{k}(\boldsymbol{w})\|^{2}_{1}/k)$ on the sum. For skewed distributions, $\|\textsf{tail}_{k}(\boldsymbol{w})\|^{2}_{1}\ll\|\boldsymbol{w}\|^{2}_{1}$, and hence WOR sampling is beneficial. Conveniently, bottom-$k$ estimates for different keys $x_{1}\neq x_{2}$ have non-positive correlations, $\textsf{Cov}[\widehat{w_{x_{1}}},\widehat{w_{x_{2}}}]\leq 0$, so the variance of sum estimates is at most the respective weighted sum of per-key variances. Note that the per-key variance for a function of the weight is $\textsf{Var}[\widehat{f(w_{x})}]=\frac{f(w_{x})^{2}}{w_{x}^{2}}\textsf{Var}[\widehat{w_{x}}]$. WOR (and WR) estimates are more accurate (in terms of the normalized variance sum) when the sampling is weighted by $f(\boldsymbol{w})$.
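The following hedged snippet renders the per-key estimator (1) and the sum estimator for statistics of the form $\sum_{x}f(w_{x})L_{x}$ in Python; `sample` and `tau` are assumed to come from a bottom-$k$ sample as in the sketch above, with positive weights:

    # Inverse-probability estimation from a bottom-k sample (positive weights assumed).
    import math

    def inclusion_prob(w, tau, scheme="ppswor"):
        """Pr_{r ~ D}[r <= w / tau]: exponential CDF for ppswor, uniform CDF for priority."""
        if tau == 0.0:
            return 1.0  # fewer than k+1 keys: every key is sampled with probability 1
        return 1.0 - math.exp(-w / tau) if scheme == "ppswor" else min(1.0, w / tau)

    def estimate_sum(sample, tau, f, L, scheme="ppswor"):
        """Unbiased estimate of sum_x f(w_x) * L(x); keys outside the sample contribute 0."""
        return sum(f(w) * L(x) / inclusion_prob(w, tau, scheme) for x, w in sample.items())

    # Example: estimate the second moment sum_x w_x^2 with f(w) = w**2 and L(x) = 1.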

2.2 Bottom-$k$ sampling by power of frequency

To perform bottom-$k$ sampling of $\boldsymbol{w}^{p}$ with distribution $\mathcal{D}$, we draw $r_{x}\sim\mathcal{D}$, transform the weights $w^{T}_{x}\leftarrow w_{x}^{p}/r_{x}$, and return the top-$k$ keys in $\boldsymbol{w}^{T}$. This is equivalent to bottom-$k$ sampling the vector $\boldsymbol{w}$ using the distribution $\mathcal{D}^{1/p}$, that is, drawing $r_{x}\sim\mathcal{D}$, transforming the weights

w^{*}_{x}\leftarrow\frac{w_{x}}{r_{x}^{1/p}}\ , \qquad (4)

and returning the top-$k$ keys according to $\boldsymbol{w}^{*}$. The equivalence holds because $(w^{*}_{x})^{p}=\left(\frac{w_{x}}{r_{x}^{1/p}}\right)^{p}=\frac{w_{x}^{p}}{r_{x}}=w^{T}_{x}$ and $f(x)=x^{p}$ is monotone increasing, hence $\textsf{order}(\boldsymbol{w}^{*})=\textsf{order}(\boldsymbol{w}^{T})$. We denote the distribution of $\boldsymbol{w}^{*}$ obtained from the bottom-$k$ transform (4) by $p$-$\mathcal{D}[\boldsymbol{w}]$ and, specifically, $p$-ppswor$[\boldsymbol{w}]$ when $\mathcal{D}=\textsf{Exp}[1]$ and $p$-priority$[\boldsymbol{w}]$ when $\mathcal{D}=U[0,1]$. We use the term $p$-ppswor for bottom-$k$ sampling by $\textsf{Exp}[1]^{1/p}$.

The linear transform (4) can be performed efficiently over unaggregated data by using a random hash to represent $r_{x}$ for keys $x$ and then locally generating an output element for each input element:

(e.\textsf{key},e.\textsf{val})\rightarrow(e.\textsf{key},\ e.\textsf{val}/r_{e.\textsf{key}}^{1/p}) \qquad (5)

The task of sampling by the $p$-th power of frequency $\boldsymbol{\nu}^{p}$ is replaced by the task of finding the top-$k$ keys by frequency $\nu^{*}_{x}:=\sum_{e\in\mathcal{E}^{*}\mid e.\textsf{key}=x}e.\textsf{val}$ on the respective output dataset $\mathcal{E}^{*}$, noting that $\nu^{*}_{x}=\nu_{x}/r_{x}^{1/p}$. Therefore, the top-$k$ keys in $\boldsymbol{\nu}^{*}$ are a bottom-$k$ sample according to $\mathcal{D}$ of $\boldsymbol{\nu}^{p}$. Note that we can approximate the input frequency $\nu_{x}$ of a key $x$ from an approximate output frequency $\widehat{\nu^{*}_{x}}$ using the hash $r_{x}$, and that the relative error is preserved:

\nu^{\prime}_{x}\leftarrow\widehat{\nu^{*}_{x}}\,r_{x}^{1/p}\ . \qquad (6)

This per-element scaling was proposed in the precision sampling framework of Andoni et al. [6] and inspired by a technique for frequency moment estimation using stable distributions [45].
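A minimal sketch of the per-element transform (5) and the frequency recovery (6), assuming $r_{x}\sim\textsf{Exp}[1]$ ($p$-ppswor); the SHA-256 based hash that generates $r_{x}$ here is a stand-in for the shared random hash assumed in the text:

    # Per-element p-ppswor transform (5) and recovery of an approximate input
    # frequency from an output-frequency estimate (6).
    import hashlib
    import math

    def r_of_key(key, seed=42):
        """Deterministic hash-based r_x ~ Exp[1], shared across elements and sketches."""
        h = hashlib.sha256(f"{seed}:{key}".encode()).digest()
        u = (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)  # uniform in (0,1)
        return -math.log(1.0 - u)                             # Exp[1] via inverse CDF

    def transform_element(key, val, p, seed=42):
        """(e.key, e.val) -> (e.key, e.val / r_{e.key}^{1/p}), as in (5)."""
        return key, val / (r_of_key(key, seed) ** (1.0 / p))

    def recover_frequency(key, est_out_freq, p, seed=42):
        """nu'_x <- hat(nu*_x) * r_x^{1/p}, as in (6); relative error is preserved."""
        return est_out_freq * (r_of_key(key, seed) ** (1.0 / p))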

Generally, finding the top-$k$ frequencies is a task that requires large sketch structures of size linear in the number of keys. However, [6] showed that when the frequencies are drawn from $p$-priority$[\boldsymbol{w}]$ (applied to an arbitrary $\boldsymbol{w}$) and $p\leq 2$, then the top-$1$ value is likely to be an $\ell_2$ heavy hitter. Here we refine the analysis and use the more subtle notion of residual heavy hitters [10]. We show that the top-$k$ output frequencies in $\boldsymbol{w}^{*}\sim p$-ppswor$[\boldsymbol{w}]$ are very likely to be $\ell_q$ residual heavy hitters (when $q\geq p$) and can be found with a sketch of size $\tilde{O}(k)$.

2.3 Residual heavy hitters (rHH)

An entry in a weight vector $\boldsymbol{w}$ is called an $\varepsilon$-heavy hitter if $w_{x}\geq\varepsilon\sum_{y}w_{y}$. A heavy hitter with respect to a function $f$ is defined as a key with $f(w_{x})\geq\varepsilon\sum_{y}f(w_{y})$. When $f(w)=w^{q}$, we refer to such a key as an $\ell_q$ heavy hitter. For $k\geq 1$ and $\psi>0$, a vector $\boldsymbol{w}$ has $(k,\psi)$ residual heavy hitters [10] when the top-$k$ keys are "heavy" with respect to the tail of the frequencies starting at the $(k+1)$-st most frequent key, that is, $\forall i\leq k,\ w_{(i)}\geq\frac{\psi}{k}\|\textsf{tail}_{k}(\boldsymbol{w})\|_{1}$. This is equivalent to $\frac{\|\textsf{tail}_{k}(\boldsymbol{w})\|_{1}}{w_{(k)}}\leq\frac{k}{\psi}$. We say that $\boldsymbol{w}$ has rHH with respect to a function $f$ if $f(\boldsymbol{w})$ has rHH. In particular, $\boldsymbol{w}$ has $\ell_q$ $(k,\psi)$ rHH if

\frac{\|\textsf{tail}_{k}(\boldsymbol{w})\|_{q}^{q}}{w^{q}_{(k)}}\leq\frac{k}{\psi}\ . \qquad (7)

Popular composable HH sketches were adapted to rHH and include (see Table 1): (i) $\ell_1$ sketches designed for positive data elements, a deterministic counter-based variety [60, 56, 58] with an rHH adaptation [10] and the randomized CountMin sketch [35]; and (ii) $\ell_2$ sketches designed for signed data elements, notably CountSketch [15], with an rHH analysis in [52]. With these designs, a sketch for $\ell_q$ $(k,\psi)$-rHH provides estimates $\widehat{\nu_{x}}$ for all keys $x$ with the error bound

\|\widehat{\boldsymbol{\nu}}-\boldsymbol{\nu}\|^{q}_{\infty}\leq\frac{\psi}{k}\|\textsf{tail}_{k}(\boldsymbol{\nu})\|_{q}^{q}\ . \qquad (8)

With randomized sketches, the error bound (8) is guaranteed with some probability $1-\delta$. CountSketch has the advantages of capturing top keys that are $\ell_2$ but not $\ell_1$ heavy hitters and of supporting signed data, but otherwise provides lower accuracy than $\ell_1$ sketches for the same sketch size. Methods also vary in the supported key domains: counters natively work with key strings, whereas randomized sketches work for keys from $[n]$ (see Appendix A for further discussion). We use these sketches off-the-shelf through the following operations:

  • $R.\textnormal{Initialize}(k,\psi,\delta)$: Initialize a sketch structure.

  • $\textnormal{Merge}(R_{1},R_{2})$: Merge two sketches with the same parameters and internal randomization.

  • $R.\textnormal{Process}(e)$: Process a data element $e$.

  • $R.\textnormal{Est}(x)$: Return an estimate of the frequency of a key $x$.

Sketch ($\ell_q$, sign) | Size | $\|\widehat{\boldsymbol{\nu}}-\boldsymbol{\nu}\|^{q}_{\infty}\leq$
Counters ($\ell_1$, $+$) [10] | $O(\frac{k}{\psi})$ | $\frac{\psi}{k}\|\textsf{tail}_{k}(\boldsymbol{\nu})\|_{1}$
CountSketch ($\ell_2$, $\pm$) [15] | $O(\frac{k}{\psi}\log\frac{n}{\delta})$ | $\frac{\psi}{k}\|\textsf{tail}_{k}(\boldsymbol{\nu})\|_{2}^{2}$
Table 1: Sketches for $\ell_q$ $(k,\psi)$ rHH.
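For concreteness, the following self-contained Python class sketches the four operations above for CountSketch; the SHA-256 based hash functions and the choice of width and depth (which in a real deployment would be derived from $(k,\psi,\delta)$ via Table 1) are simplified, illustrative stand-ins rather than the tuned structures used in the paper:

    # A simplified composable CountSketch supporting Initialize / Merge / Process / Est.
    import hashlib

    class CountSketch:
        def __init__(self, width, depth, seed=0):  # Initialize: parameters set width/depth
            self.width, self.depth, self.seed = width, depth, seed
            self.table = [[0.0] * width for _ in range(depth)]

        def _bucket_sign(self, key, row):
            """Hash-based bucket index and +/-1 sign for (key, row)."""
            h = hashlib.sha256(f"{self.seed}:{row}:{key}".encode()).digest()
            bucket = int.from_bytes(h[:8], "big") % self.width
            sign = 1.0 if h[8] % 2 == 0 else -1.0
            return bucket, sign

        def process(self, key, value):  # R.Process(e): add the value into one bucket per row
            for row in range(self.depth):
                b, s = self._bucket_sign(key, row)
                self.table[row][b] += s * value

        def merge(self, other):  # Merge(R1, R2): requires identical parameters and seed
            assert (self.width, self.depth, self.seed) == (other.width, other.depth, other.seed)
            for row in range(self.depth):
                for b in range(self.width):
                    self.table[row][b] += other.table[row][b]

        def est(self, key):  # R.Est(x): median over rows of the signed bucket counters
            ests = []
            for row in range(self.depth):
                b, s = self._bucket_sign(key, row)
                ests.append(s * self.table[row][b])
            ests.sort()
            m = len(ests)
            return ests[m // 2] if m % 2 == 1 else 0.5 * (ests[m // 2 - 1] + ests[m // 2])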

3 WORp Overview

Our WORp methods apply a $p$-ppswor transform (5) to data elements and (for $q\geq p$) compute an $\ell_q$ $(k,\psi)$-rHH sketch of the output elements. The rHH sketch is used to produce a sample of $k$ keys.

We would like to set $\psi$ low enough so that for any input frequencies $\boldsymbol{\nu}$, the top-$k$ keys by transformed frequencies $\boldsymbol{\nu}^{*}\sim p\text{-ppswor}[\boldsymbol{\nu}]$ are rHH (satisfy condition (7)) with probability at least $1-\delta$. We refine this desideratum to be conditioned on the permutation of keys $\textsf{order}(\boldsymbol{\nu}^{*})$. This conditioning turns out not to further constrain $\psi$, and it allows us to provide the success probability uniformly for any potential $k$-sample. Since our sketch size grows inversely with $\psi$ (see Table 1), we want to use the maximum value that works. We will be guided by the following:

\Psi_{n,k,\rho=q/p}(\delta):=\sup\left\{\psi\ \middle|\ \forall\boldsymbol{w}\in\Re^{n},\ \forall\pi\in S^{n}:\ \Pr_{\boldsymbol{w}^{*}\sim p\text{-ppswor}[\boldsymbol{w}]\mid\textsf{order}(\boldsymbol{w}^{*})=\pi}\left[k\frac{|w^{*}_{(k)}|^{q}}{\|\textsf{tail}_{k}(\boldsymbol{w}^{*})\|_{q}^{q}}\leq\psi\right]\leq\delta\right\}, \qquad (9)

where $S^{n}$ denotes the set of permutations of $[n]$. If we set the rHH sketch parameter to $\psi\leftarrow\varepsilon\Psi_{n,k,\rho}(\delta)$ then, using (8), with probability at least $1-\delta$,

\|\widehat{\boldsymbol{\nu}^{*}}-\boldsymbol{\nu}^{*}\|^{q}_{\infty}\leq\frac{\psi}{k}\|\textsf{tail}_{k}(\boldsymbol{\nu}^{*})\|_{q}^{q}=\varepsilon\frac{\Psi_{n,k,\rho}(\delta)}{k}\|\textsf{tail}_{k}(\boldsymbol{\nu}^{*})\|_{q}^{q}\leq\varepsilon|\nu^{*}_{(k)}|^{q}\ . \qquad (10)

We establish the following lower bounds on $\Psi_{n,k,\rho}(\delta)$:

Theorem 3.1.

There is a universal constant $C>0$ such that for all $n$, $k>1$, and $\rho=q/p$:

For $\rho=1$: \quad $\Psi_{n,k,\rho}(3e^{-k})\geq\frac{1}{C\ln\frac{n}{k}}$ \qquad (11)
For $\rho>1$: \quad $\Psi_{n,k,\rho}(3e^{-k})\geq\frac{1}{C}\max\left\{\rho-1,\ \frac{1}{\ln\frac{n}{k}}\right\}$. \qquad (12)

To set sketch parameters in implementations, we approximate $\Psi$ using simulations of what we establish to be a "worst case" frequency distribution. For this we use a specification of a "worst-case" set of frequencies as part of the proof of Theorem 3.1 (see Appendix B.1). From simulations we obtain that very small values of $C<2$ suffice for $\delta=0.01$, $\rho\in\{1,2\}$, and $k\geq 10$.

We analyse a few WORp variants. The first we consider returns an exact $p$-ppswor sample, including exact frequencies of keys, using two passes. We then consider a variant that returns an approximate $p$-ppswor sample in a single pass. The two-pass method uses smaller rHH sketches and works efficiently with keys that are arbitrary strings.

We also present another rHH-based technique that guarantees a very small variation distance on $k$-tuples in a single pass.

4 Two-pass WORp

sign, $p$ | Sketch size (words) | Sketch size (key strings) | $\Pr[\text{success}]$
$\pm$, $p<2$ | $O(k\log n)$ | $O(k)$ | $(1-\frac{1}{\mathrm{poly}(n)})(1-3e^{-k})$
$\pm$, $p=2$ | $O(k\log^{2}n)$ | $O(k)$ | $(1-\frac{1}{\mathrm{poly}(n)})(1-3e^{-k})$
$+$, $p<1$ | $O(k)$ | $O(k)$ | $1-3e^{-k}$
$+$, $p=1$ | $O(k\log n)$ | $O(k)$ | $1-3e^{-k}$

Table 2: Two-pass ppswor sampling of $k$ keys according to $\boldsymbol{\nu}^{p}$: sketch size (memory words and number of stored key strings), for $p\in(0,2]$ and signed ($\pm$) or positive ($+$) value elements.

We design a two-pass method for ppswor sampling according to $\boldsymbol{\nu}^{p}$ for $p\in(0,2]$ (equivalently, a $p$-ppswor sample according to $\boldsymbol{\nu}$); a schematic code sketch follows the step descriptions below:

  • Pass I: We compute an $\ell_q$ $(k+1,\psi)$-rHH sketch $R$ of the transformed data elements

    (\textnormal{KeyHash}(e.\textsf{key}),\ e.\textsf{val}/r_{e.\textsf{key}}^{1/p})\ . \qquad (13)

    A hash $\textnormal{KeyHash}\rightarrow[n]$ is applied when keys have a large or non-integer domain, to facilitate the use of CountSketch or to reduce storage with Counters. We set $\psi\leftarrow\frac{1}{3^{q}}\Psi_{n,k,\rho}(\delta)$.

  • Pass II: We collect key strings $x$ (if needed) and corresponding exact frequencies $\nu_{x}$ for the keys with the $Bk$ largest $|\widehat{\nu^{*}_{x}}|$, where $B$ is a constant (see below) and $\widehat{\nu^{*}_{x}}:=R.\textnormal{Est}[\textnormal{KeyHash}(x)]$ are the estimates of $\nu^{*}_{x}$ provided by $R$. For this purpose we use a composable top-$(Bk)$ sketch structure $T$. The size of $T$ is dominated by storing $Bk$ key strings.

  • Producing a $p$-ppswor sample from $T$: Compute exact transformed frequencies $\nu^{*}_{x}\leftarrow\nu_{x}/r_{x}^{1/p}$ for the stored keys $x\in T$. Our sample is the set of key frequency pairs $(x,\nu_{x})$ for the top-$k$ stored keys by $\nu^{*}_{x}$. The threshold $\tau$ is the $(k+1)^{\text{th}}$ largest $\nu^{*}_{x}$ over stored keys.

  • Estimation: We compute per-key estimates as in (1): plugging in $\mathcal{D}=\textsf{Exp}[1]^{1/p}$ for $p$-ppswor, we have $\widehat{f(\nu_{x})}:=0$ for $x\notin S$ and $\widehat{f(\nu_{x})}:=\frac{f(\nu_{x})}{1-\exp\left(-(\nu_{x}/\tau)^{p}\right)}$ for $x\in S$.
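The schematic sketch below is heavily simplified: Pass I aggregates transformed frequencies exactly in a dictionary where the method uses an $\ell_q$ $(k+1,\psi)$-rHH sketch, KeyHash is omitted, values are assumed positive, and $B$ is a placeholder constant. It is meant only to show the control flow of the two passes under these assumptions.

    # Schematic two-pass WORp (positive values; an exact dictionary stands in for
    # the rHH sketch R of Pass I, so the "estimates" below happen to be error-free).
    import hashlib
    import math

    def r_of_key(key, seed=42):
        """Hash-based r_x ~ Exp[1], shared across both passes."""
        h = hashlib.sha256(f"{seed}:{key}".encode()).digest()
        u = (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)
        return -math.log(1.0 - u)

    def two_pass_worp(elements, k, p, B=3, seed=42):
        # Pass I: "sketch" the transformed elements (e.key, e.val / r_x^{1/p}).
        est = {}
        for key, val in elements:
            est[key] = est.get(key, 0.0) + val / (r_of_key(key, seed) ** (1.0 / p))
        # Pass II: collect exact input frequencies for the B*k keys with largest estimates.
        top = sorted(est, key=lambda x: -abs(est[x]))[:B * k]
        exact = dict.fromkeys(top, 0.0)
        for key, val in elements:
            if key in exact:
                exact[key] += val
        # Produce the p-ppswor sample: top-k stored keys by exact transformed frequency.
        transformed = {x: exact[x] / (r_of_key(x, seed) ** (1.0 / p)) for x in exact}
        ranked = sorted(transformed, key=lambda x: -abs(transformed[x]))
        sample = {x: exact[x] for x in ranked[:k]}
        tau = abs(transformed[ranked[k]]) if len(ranked) > k else 0.0
        # Per-key estimates as in (1) with D = Exp[1]^{1/p}.
        estimates = {}
        for x, v in sample.items():
            prob = 1.0 if tau == 0.0 else 1.0 - math.exp(-((v / tau) ** p))
            estimates[x] = v / prob
        return sample, tau, estimates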

We establish that the method returns the $p$-ppswor sample with probability at least $1-\delta$, propose practical optimizations, and analyze the sketch size:

Theorem 4.1.

The 2-pass method returns a $p$-ppswor sample of size $k$ according to $\boldsymbol{\nu}$ with success probability and composable sketch sizes as detailed in Table 2. The success probability is defined as that of returning the exact top-$k$ keys by transformed frequencies. The bound applies even when conditioned on the top-$k$ being a particular $k$-tuple.

Proof.

From (10), the estimates $\widehat{\nu^{*}_{x}}=R.\textnormal{Est}[\textnormal{KeyHash}(x)]$ of $\nu^{*}_{x}$ are such that

\Pr\left[\forall x\in\{e.\textsf{key}\mid e\in\mathcal{E}\},\ |\widehat{\nu^{*}_{x}}-\nu^{*}_{x}|\leq\frac{1}{3}|\nu^{*}_{(k+1)}|\right]\geq 1-\delta\ . \qquad (14)

We set $B$ in the second pass so that the following holds:

The top-$(k+1)$ keys by $\boldsymbol{\nu}^{*}$ are a subset of the top-$(B(k+1))$ keys by $\widehat{\boldsymbol{\nu}^{*}}$. \qquad (15)

Note that for any frequency distribution with rHH, it suffices to store $O(k/\psi)$ keys for (15) to hold. We establish that for our particular distributions, a constant $B$ suffices. For that we use the following:

Lemma 4.1.

A sufficient condition for property (15) is that $|\nu^{*}_{(B(k+1))}|\leq\frac{1}{3}|\nu^{*}_{(k+1)}|$.

Proof.

Note that $|\widehat{\nu^{*}_{x}}|\geq\frac{2}{3}|\nu^{*}_{(k+1)}|$ for keys that are top-$(k+1)$ by $\boldsymbol{\nu}^{*}$, and $|\widehat{\nu^{*}}_{(B(k+1))}|\leq|\nu^{*}_{(B(k+1))}|+\frac{1}{3}|\nu^{*}_{(k+1)}|$. Hence $|\widehat{\nu^{*}_{x}}|\geq|\widehat{\nu^{*}}_{(B(k+1))}|$ for all keys that are top-$(k+1)$ by $\boldsymbol{\nu}^{*}$. ∎

We then use Lemma E.1 to express a "worst-case" distribution for the ratio $\nu^{*}_{(B(k+1))}/\nu^{*}_{(k+1)}$ and use the latter (via Corollary D.2) to show that the setting of $\Psi(\delta)$ according to our proof of Theorem 3.1 (Appendices B-D) implies that the conditions that guarantee the rHH property also imply a ratio of at most $1/3$ with a constant $B$.

Correctness of the final sample follows from property (15): $T$ stores the top-$(k+1)$ keys in the data according to $\boldsymbol{\nu}^{*}$. To conclude the proof of Theorem 4.1 we need to specify the rHH sketch structure we use. From Theorem 3.1 we obtain a lower bound on $\Psi_{n,k,\rho}$ for $\delta=3e^{-k}$ and use it to set $\psi$. For our rHH sketch we use CountSketch ($q=2$, supporting signed values) or Counters ($q=1$, positive values). The top two lines in Table 2 are for CountSketch and the next two are for Counters. The rHH sketch sizes follow from $\psi$ and Table 1. ∎

4.1 Practical optimizations

We propose practical optimizations that improve efficiency without compromising quality guarantees.

The first optimization allows us to store $k^{\prime}\ll B(k+1)$ keys in the second pass: we always store the top-$(k+1)$ keys by $\widehat{\boldsymbol{\nu}^{*}}$, but beyond that we only store a key if it satisfies

\widehat{\nu^{*}}_{x}\geq\frac{1}{2}\widehat{\nu^{*}}_{(k+1)}\ , \qquad (16)

where $\boldsymbol{\nu}^{*}$ is with respect to the data elements processed by the current sketch. We establish correctness:

Lemma 4.2.

  1. There is a composable structure that only stores keys that satisfy (16) and collects exact frequencies for these keys.

  2. If (14) holds, the top-$(k+1)$ keys by $\boldsymbol{\nu}^{*}$ satisfy (16) (and hence will be stored in the sketch).

Proof.

(i) The structure is a slight modification of a top-$k^{\prime}$ structure. Since $\widehat{\nu^{*}}_{(k+1)}$ can only increase as more distinct keys are processed, the condition (16) only becomes more stringent as we merge sketches and process elements. Therefore, if a key satisfies the condition at some point, it would have satisfied the condition when elements with that key were previously processed, and therefore we can collect exact frequencies.

(ii) From the uniform approximation (14), we have $\widehat{\nu^{*}}_{(k+1)}\leq\frac{4}{3}\nu^{*}_{(k+1)}$. Let $x$ be the $(k+1)$-th key by $|\boldsymbol{\nu}^{*}|$. Its estimated transformed frequency is at least $\widehat{\nu^{*}_{x}}\geq\frac{2}{3}\nu^{*}_{(k+1)}\geq\frac{2}{3}\cdot\frac{3}{4}\widehat{\nu^{*}}_{(k+1)}=\frac{1}{2}\widehat{\nu^{*}}_{(k+1)}$. Therefore, if we store all keys $x$ with $\widehat{\nu^{*}_{x}}\geq\frac{1}{2}\widehat{\nu^{*}}_{(k+1)}$, we store the top-$(k+1)$ keys by $\nu^{*}_{x}$. ∎

A second optimization allows us to extract a larger effective sample of $k^{\prime}\geq k$ keys from the sketch. This can be done when we can certify that the top-$k^{\prime}$ keys by $\boldsymbol{\nu}^{*}$ in the transformed data are stored in the sketch $T$. Using a larger sample is helpful, as it can only improve (in expectation) estimation quality (see, e.g., [28, 31]). To this end, we compute the uniform error bound $\nu^{*}_{(k+1)}/3$ (available because the top-$(k+1)$ keys by $\boldsymbol{\nu}^{*}$ are stored). Let $L\leftarrow\min_{x\in T}\widehat{\nu^{*}}_{x}$. We obtain that any key $x$ in the dataset with $\nu^{*}_{x}\geq L+\nu^{*}_{(k+1)}/3$ must be stored in $T$. Our final sample contains these keys, with the minimum $\nu^{*}_{x}$ in the set used as the threshold $\tau$.

5 One-pass WORp

Our 1-pass WORp yields a sample of size $k$ that approximates a $p$-ppswor sample of the same size and provides similar guarantees on estimation quality.

  • Sketch: For $q\geq p$ and $\varepsilon\in(0,\frac{1}{3}]$, compute an $\ell_q$ $(k+1,\psi)$-rHH sketch $R$ of the transformed data elements (5), where $\psi\leftarrow\varepsilon^{q}\Psi_{n,k+1,\rho}$.

  • Produce a sample: Our sample $S$ includes the keys with the top-$k$ estimated transformed frequencies $\widehat{\nu^{*}_{x}}:=R.\textnormal{Est}[x]$. For each key $x\in S$ we include $(x,\nu^{\prime}_{x})$, where the approximate frequency $\nu^{\prime}_{x}\leftarrow\widehat{\nu^{*}_{x}}r^{1/p}_{x}$ is according to (6). We include with the sample the threshold $\tau\leftarrow\widehat{\nu^{*}}_{(k+1)}$.

  • Estimation: We treat the sample as a $p$-ppswor sample and compute per-key estimates as in (1), substituting the approximate frequencies $\nu^{\prime}_{x}$ for the actual frequencies $\nu_{x}$ of sampled keys and the 1-pass threshold $\widehat{\nu^{*}}_{(k+1)}$ for the exact $\nu^{*}_{(k+1)}$. The estimate is $\widehat{f(\nu_{x})}:=0$ for $x\notin S$, and for $x\in S$ it is (a code rendering follows the equation):

    \widehat{f(\nu_{x})}:=\frac{f(\nu^{\prime}_{x})}{1-\exp\left(-(\frac{\nu^{\prime}_{x}}{\widehat{\nu^{*}}_{(k+1)}})^{p}\right)}=\frac{f(\widehat{\nu^{*}_{x}}r_{x}^{1/p})}{1-\exp\left(-r_{x}(\frac{\widehat{\nu^{*}_{x}}}{\widehat{\nu^{*}}_{(k+1)}})^{p}\right)} \qquad (17)
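A hedged code rendering of the estimator (17), assuming positive estimated frequencies; the arguments are a sampled key's estimated output frequency $\widehat{\nu^{*}_{x}}$, its hash value $r_{x}$, the 1-pass threshold $\tau=\widehat{\nu^{*}}_{(k+1)}$, and the function $f$:

    # One-pass WORp per-key estimate (17).
    import math

    def one_pass_estimate(est_out_freq, r_x, tau, p, f=lambda w: w):
        nu_prime = est_out_freq * (r_x ** (1.0 / p))              # approximate frequency, as in (6)
        prob = 1.0 - math.exp(-r_x * (est_out_freq / tau) ** p)   # = 1 - exp(-(nu'_x / tau)^p)
        return f(nu_prime) / prob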

We relate the quality of the estimates to that of a perfect $p$-ppswor sample. Since our 1-pass estimates are biased (unlike the unbiased perfect $p$-ppswor), we consider both bias and variance. The proof is provided in Appendix G.

Theorem 5.1.

Let $f(w)$ be such that $|f((1+\varepsilon)w)-f(w)|\leq c\varepsilon f(w)$ for some $c>0$ and all $\varepsilon\in[0,1/2]$. Let $\widehat{f(\nu_{x})}$ be per-key estimates obtained with a one-pass WORp sample and let $\widehat{f(\nu_{x})}^{\prime}$ be the respective estimates obtained with a (perfect) $p$-ppswor sample. Then $\textsf{Bias}[\widehat{f(\nu_{x})}]\leq O(\varepsilon)f(\nu_{x})$ and $\textsf{MSE}[\widehat{f(\nu_{x})}]\leq(1+O(\varepsilon))\textsf{Var}[\widehat{f(\nu_{x})}^{\prime}]+O(\varepsilon)f(\nu_{x})^{2}$.

Note that the assumption on $f$ holds for $f(w)=w^{p}$ with $c=(1.5)^{p}$. Also note that the bias bound implies a respective contribution of $O(\varepsilon)$ to the relative error of all sum estimates.

6 One-pass Total Variation Distance Guarantee

We provide another 1-pass method, based on the combined use of rHH sketches and known WR perfect $\ell_p$ sampling sketches [50], that selects a $k$-tuple with polynomially small total variation (TV) distance from the $k$-tuple distribution of a perfect $p$-ppswor. The method uses $O(k)$ perfect samplers (for variation distance $2^{-\Theta(k)}$; $O(k\log n)$ for variation distance $1/n^{C}$ for an arbitrarily large constant $C>0$), each providing a single WR sample, together with an rHH sketch. The perfect samplers are processed in sequence, with prior selections "subtracted" from the linear sketch (using approximate frequencies provided by the rHH sketch) to uncover fresh samples. As with WORp, exact frequencies of sampled keys can be obtained in a second pass or approximated using larger sketches in a single pass. Details are provided in Appendix F.

Theorem 6.1.

There is a 1-pass method via composable sketches of size $O(k\,\mathrm{polylog}(n))$ that returns a $k$-tuple of keys such that the total variation distance from the $k$-tuples produced by a perfect $p$-ppswor sample is at most $1/n^{C}$ for an arbitrarily large constant $C>0$. The method applies to keys from a domain $[n]$, and to signed values with magnitudes and intermediate frequencies that are polynomial in $n$.

We also show in Appendix F that our sketches in Theorem 6.1 use $O(k\log^{2}n(\log\log n)^{2})$ bits of memory for $0<p<2$, and we prove a matching lower bound on the memory required of any algorithm achieving this guarantee, up to a $(\log\log n)^{2}$ factor. For $p=2$ we also show they are of optimal size, up to an $O(\log n)$ factor.

7 Experiments

We implemented 2-pass and 1-pass WORp in Python using CountSketch as our rHH sketch. Figure 2 reports estimates of the rank-frequency distribution obtained with 1-pass and 2-pass WORp, perfect WOR ($p$-ppswor), and perfect WR samples (shown for reference). For the best comparison, all WOR methods use the same randomization of the $p$-ppswor transform. Table 3 reports normalized root averaged squared errors (NRMSE) on example statistics. As expected, 2-pass WORp and perfect $2$-ppswor are similar, and WR $\ell_2$ samples are less accurate with larger skew or on the tail. Note that current state-of-the-art sketching methods are not more efficient for WR sampling than for WOR sampling, and estimation quality with perfect methods is only shown for reference. We can also see that 1-pass WORp performs well, although it requires more accuracy (larger sketch size) since it works with estimated weights (reported results are with a fixed CountSketch size of $k\times 31$ for all methods).


Figure 2: Estimates of the rank-frequency distribution of $\textsf{Zipf}[1]$ and $\textsf{Zipf}[2]$, using WORp 1-pass, WORp 2-pass with CountSketch (matrix $k\times 31$), perfect WOR, and perfect WR. Estimates are from a (representative) single sample of size $k=100$. Left and center: $\ell_2$ sampling. Right: $\ell_1$ sampling.

$\ell_p$ | $\alpha$ | $p'$ | perfect WR | perfect WOR | 1-pass WORp | 2-pass WORp
$\ell_2$ | $\textsf{Zipf}[2]$ | $\nu^{3}$ | 1.16e-04 | 2.09e-11 | 1.06e-03 | 2.08e-11
$\ell_2$ | $\textsf{Zipf}[2]$ | $\nu^{2}$ | 7.96e-05 | 1.26e-07 | 1.14e-02 | 1.25e-07
$\ell_1$ | $\textsf{Zipf}[2]$ | $\nu$ | 9.51e-03 | 1.60e-03 | 2.79e-02 | 1.60e-03
$\ell_1$ | $\textsf{Zipf}[1]$ | $\nu^{3}$ | 3.59e-01 | 5.73e-03 | 5.14e-03 | 5.72e-03
$\ell_1$ | $\textsf{Zipf}[2]$ | $\nu^{3}$ | 3.45e-04 | 7.34e-10 | 5.11e-05 | 7.38e-10
Table 3: NRMSE of estimates of frequency moments $\|\boldsymbol{\nu}\|_{p'}^{p'}$ from $\ell_p$ samples ($p=1,2$). $\textsf{Zipf}[\alpha]$ distributions with support size $n=10^{4}$, $k=100$ samples, averaged over $100$ runs. CountSketch of size $k\times 31$.

Conclusion

We present novel composable sketches for without-replacement (WOR) $\ell_p$ sampling, based on "reducing" the sampling problem to a heavy hitters (HH) problem. The reduction, which is simple and practical, allows us to use existing implementations of HH sketches to perform WOR sampling. Moreover, streaming HH sketches that support time decay (for example, sliding windows [8]) provide a respective time-decay variant of sampling. We present two approaches: WORp, based on a bottom-$k$ transform, and another technique based on "perfect" with-replacement sampling sketches, which provides 1-pass WOR samples with negligible variation distance from a true sample. Our methods open the door for a wide range of future applications. Our WORp schemes produce bottom-$k$ samples (aka bottom-$k$ sketches) with respect to a specified randomization $r_{x}$ over the support (with 1-pass WORp we obtain approximate bottom-$k$ samples). Samples of different datasets, different $p$ values, or different time-decay functions that are generated with the same $r_{x}$ are coordinated [12, 70, 65, 17, 32, 13]. Coordination is a desirable and powerful property: samples are locality sensitive (LSH) and change little with small changes in the dataset [12, 70, 46, 33, 19]. This LSH property allows for a compact representation of multiple samples, efficient updating, and sketch-based similarity searches. Moreover, coordinated samples (sketches) facilitate powerful estimators for multi-set statistics and similarity measures such as weighted Jaccard similarity, min or max sums, and one-sided distance norms [17, 13, 28, 31, 29, 30, 18].

Acknowledgments:

D. Woodruff would like to acknowledge partial support from the Office of Naval Research (ONR) grant N00014-18-1-2562 and the National Science Foundation (NSF) under Grant No. CCF-1815840.

References

  • [1] Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff M. Phillips, Zhewei Wei, and Ke Yi. Mergeable summaries. ACM Trans. Database Syst., 38(4):26:1–26:28, 2013.
  • [2] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1709–1720. Curran Associates, Inc., 2017.
  • [3] N. Alon, N. Duffield, M. Thorup, and C. Lund. Estimating arbitrary subset sums with few probes. In Proceedings of the 24th ACM Symposium on Principles of Database Systems, pages 317–325, 2005.
  • [4] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58:137–147, 1999.
  • [5] Alexandr Andoni, Khanh Do Ba, Piotr Indyk, and David P. Woodruff. Efficient sketches for earth-mover distance, with applications. In 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, October 25-27, 2009, Atlanta, Georgia, USA, pages 324–330. IEEE Computer Society, 2009.
  • [6] Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms via precision sampling. In IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS 2011, Palm Springs, CA, USA, October 22-25, 2011, 2011.
  • [7] Ziv Bar-Yossef, Thathachar S Jayram, Ravi Kumar, and D Sivakumar. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences, 68(4):702–732, 2004.
  • [8] R. Ben-Basat, G. Einziger, R. Friedman, and Y. Kassner. Heavy hitters in streams and sliding windows. In IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, 2016.
  • [9] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
  • [10] Radu Berinde, Graham Cormode, Piotr Indyk, and Martin J. Strauss. Space-optimal heavy hitters with strong error bounds. In Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’09. Association for Computing Machinery, 2009.
  • [11] Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P. Woodruff. Bptree: An l2 heavy hitters algorithm using constant memory. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14-19, 2017, 2017.
  • [12] K. R. W. Brewer, L. J. Early, and S. F. Joyce. Selecting several samples from a single population. Australian Journal of Statistics, 14(3):231–239, 1972.
  • [13] A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21–29. IEEE, 1997.
  • [14] M. T. Chao. A general purpose unequal probability sampling plan. Biometrika, 69(3):653–656, 1982.
  • [15] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP), pages 693–703, 2002.
  • [16] Yung-Yu Chung, Srikanta Tirthapura, and David P. Woodruff. A simple message-optimal algorithm for random sampling from a distributed stream. IEEE Trans. Knowl. Data Eng., 28(6):1356–1368, 2016.
  • [17] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441–453, 1997.
  • [18] E. Cohen. Distance queries from sampled data: Accurate and efficient. In KDD. ACM, 2014. full version: http://arxiv.org/abs/1203.4903.
  • [19] E. Cohen. Multi-objective weighted sampling. In HotWeb. IEEE, 2015. full version: http://arxiv.org/abs/1509.07445.
  • [20] E. Cohen. Stream sampling for frequency cap statistics. In KDD. ACM, 2015. full version: http://arxiv.org/abs/1502.05955.
  • [21] E. Cohen. Stream sampling framework and application for frequency cap statistics. ACM Trans. Algorithms, 14(4):52:1–52:40, 2018. preliminary version published in KDD 2015. arXiv:http://arxiv.org/abs/1502.05955.
  • [22] E. Cohen, G. Cormode, and N. Duffield. Don’t let the negatives bring you down: Sampling from streams of signed updates. In Proc. ACM SIGMETRICS/Performance, 2012.
  • [23] E. Cohen, N. Duffield, C. Lund, M. Thorup, and H. Kaplan. Efficient stream sampling for variance-optimal estimation of subset sums. SIAM J. Comput., 40(5), 2011.
  • [24] E. Cohen and O. Geri. Sampling sketches for concave sublinear functions of frequencies. In NeurIPS, 2019.
  • [25] E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In ACM PODC, 2007.
  • [26] E. Cohen and H. Kaplan. Sketch-based estimation of subpopulation-weight. Technical Report 802.3448, CORR, 2008.
  • [27] E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In Proceedings of the 34th VLDB Conference, 2008.
  • [28] E. Cohen and H. Kaplan. Leveraging discarded samples for tighter estimation of multiple-set aggregates. In ACM SIGMETRICS, 2009.
  • [29] E. Cohen and H. Kaplan. Get the most out of your sample: Optimal unbiased estimators using partial information. In Proc. of the 2011 ACM Symp. on Principles of Database Systems (PODS 2011). ACM, 2011. full version: http://arxiv.org/abs/1203.4903.
  • [30] E. Cohen and H. Kaplan. What you can do with coordinated samples. In The 17th. International Workshop on Randomization and Computation (RANDOM), 2013. full version: http://arxiv.org/abs/1206.5637.
  • [31] E. Cohen, H. Kaplan, and S. Sen. Coordinated weighted sampling for estimating aggregates over multiple weight assignments. VLDB, 2, 2009. full version: http://arxiv.org/abs/0906.4560.
  • [32] E. Cohen, Y.-M. Wang, and G. Suri. When piecewise determinism is almost true. In Proc. Pacific Rim International Symposium on Fault-Tolerant Systems, pages 66–71, December 1995.
  • [33] Edith Cohen, Graham Cormode, Nick Duffield, and Carsten Lund. On the tradeoff between stability and fit. ACM Trans. Algorithms, 13(1), 2016.
  • [34] Edith Cohen, Ofir Geri, and Rasmus Pagh. Composable sketches for functions of frequencies: Beyond the worst case. In ICML, 2020.
  • [35] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. J. Algorithms, 55(1), 2005.
  • [36] N. Duffield, C. Lund, and M. Thorup. Estimating flow distributions from sampled flow statistics. In Proceedings of the ACM SIGCOMM’03 Conference, pages 325–336, 2003.
  • [37] N. Duffield, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums. J. Assoc. Comput. Mach., 54(6), 2007.
  • [38] C. Estan and G. Varghese. New directions in traffic measurement and accounting. In SIGCOMM. ACM, 2002.
  • [39] Gereon Frahling, Piotr Indyk, and Christian Sohler. Sampling in dynamic data streams and applications. Int. J. Comput. Geom. Appl., 18(1/2):3–28, 2008.
  • [40] R. Gemulla, W. Lehner, and P. J. Haas. A dip in the reservoir: Maintaining sample synopses of evolving datasets. In VLDB, 2006.
  • [41] P. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In SIGMOD. ACM, 1998.
  • [42] H. O. Hartley and J. N. K. Rao. Sampling with unequal probabilities and without replacement. Annals of Mathematical Statistics, 33(2), 1962.
  • [43] N. Hohn and D. Veitch. Inverting sampled traffic. In Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, pages 222–233, 2003.
  • [44] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
  • [45] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. 41st IEEE Annual Symposium on Foundations of Computer Science, pages 189–197. IEEE, 2001.
  • [46] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.
  • [47] Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Vladimir Braverman, Ion Stoica, and Raman Arora. Communication-efficient distributed SGD with sketching. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, 2019.
  • [48] Svante Janson. Tail bounds for sums of geometric and exponential variables. http://www2.math.uu.se/~svante/papers/sj328.pdf, 2017.
  • [49] Rajesh Jayaram, Gokarna Sharma, Srikanta Tirthapura, and David P. Woodruff. Weighted reservoir sampling from distributed streams. In Dan Suciu, Sebastian Skritek, and Christoph Koch, editors, Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 218–235. ACM, 2019.
  • [50] Rajesh Jayaram and David P. Woodruff. Perfect lp sampling in a data stream. In FOCS, 2018.
  • [51] T. S. Jayram and David P. Woodruff. The data stream space complexity of cascaded norms. In 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, October 25-27, 2009, Atlanta, Georgia, USA, pages 765–774. IEEE Computer Society, 2009.
  • [52] H. Jowhari, M. Saglam, and G. Tardos. Tight bounds for Lp samplers, finding duplicates in streams, and related problems. In PODS, 2011.
  • [53] D. E. Knuth. The Art of Computer Programming, Vol 2, Seminumerical Algorithms. Addison-Wesley, 1st edition, 1968.
  • [54] Zaoxing Liu, Ran Ben-Basat, Gil Einziger, Yaron Kassner, Vladimir Braverman, Roy Friedman, and Vyas Sekar. Nitrosketch: robust and general sketch-based monitoring in software switches. In Proceedings of the ACM Special Interest Group on Data Communication, SIGCOMM 2019, Beijing, China, August 19-23, 2019, pages 334–350, 2019.
  • [55] Zaoxing Liu, Antonis Manousis, Gregory Vorsanger, Vyas Sekar, and Vladimir Braverman. One sketch to rule them all: Rethinking network flow monitoring with univmon. In SIGCOMM, 2016.
  • [56] G. Manku and R. Motwani. Approximate frequency counts over data streams. In International Conference on Very Large Databases (VLDB), pages 346–357, 2002.
  • [57] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Aguera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research. PMLR, 2017.
  • [58] A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. In ICDT, 2005.
  • [59] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
  • [60] J. Misra and D. Gries. Finding repeated elements. Technical report, Cornell University, 1982.
  • [61] M. Monemizadeh and D. P. Woodruff. 1-pass relative-error $\ell_p$-sampling with applications. In Proc. 21st ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM, 2010.
  • [62] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. In Algorithms and Theory of Computation Handbook. 1999.
  • [63] E. Ohlsson. Sequential poisson sampling from a business register and its application to the swedish consumer price index. Technical Report 6, Statistics Sweden, 1990.
  • [64] E. Ohlsson. Sequential poisson sampling. J. Official Statistics, 14(2):149–162, 1998.
  • [65] E. Ohlsson. Coordination of pps samples over time. In The 2nd International Conference on Establishment Surveys, pages 255–264. American Statistical Association, 2000.
  • [66] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
  • [67] Eric Price. Efficient sketches for the set query problem. In Dana Randall, editor, Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011, San Francisco, California, USA, January 23-25, 2011, pages 41–56. SIAM, 2011.
  • [68] B. Rosén. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373–397, 1972.
  • [69] B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135–158, 1997.
  • [70] P. J. Saavedra. Fixed sample size pps approximations with a permanent random number. In Proc. of the Section on Survey Research Methods, pages 697–700, Alexandria, VA, 1995. American Statistical Association.
  • [71] Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18. Curran Associates Inc., 2018.
  • [72] Y. Tillé. Sampling Algorithms. Springer-Verlag, New York, 2006.

Appendix A Properties of rHH sketches

Sketches for $\ell_1$ heavy hitters on datasets with positive values:

These include the deterministic counter-based Misra-Gries [60, 1], Lossy Counting [56], and Space Saving [58], and the randomized Count-Min sketch [35]. A sketch of size $O(\varepsilon^{-1})$ provides frequency estimates with absolute error at most $\varepsilon\|\boldsymbol{\nu}\|_{1}$. Berinde et al. [10] provide a counter-based sketch of size $O(k/\psi)$ that provides absolute error at most $\frac{\psi}{k}\|\textsf{tail}_{k}(\boldsymbol{\nu})\|_{1}$.

Sketches for $\ell_2$ heavy hitters on datasets with signed values:

Pioneered by CountSketch [15]: a sketch of size $O(\varepsilon^{-1}\log\frac{n}{\delta})$ provides, with confidence $1-\delta$, estimates with error bound $\varepsilon\|\boldsymbol{\nu}\|_{2}^{2}$ for squared frequencies. For rHH, a CountSketch of size $O(\frac{k}{\psi}\log\frac{n}{\delta})$ provides estimates for all squared frequencies with absolute error at most $\frac{\psi}{k}\|\textsf{tail}_{k}(\boldsymbol{\nu})\|_{2}^{2}$. These bounds were further refined in [52] for $\ell_p$ rHH. The dependence on $\log n$ was replaced by $1/\psi$ in [11] for insertion-only streams. Unlike the case for counter-based sketches, the estimates produced by CountSketch are unbiased, a helpful property for estimation tasks.

Obtaining the rHH keys

Keys can be arbitrary strings (search queries, URLs, terms) or integers from a domain $[n]$ (parameters in an ML model). Keys in the form of strings can be handled by hashing them to a domain $[n]$, but we discuss applications that require the sketch to return rHH keys in their string form. Counter-based sketches explicitly store $O(k/\psi)$ keys. The stored keys can be arbitrary strings. The estimates are positive for stored keys and $0$ for other keys. The rHH keys are contained in the stored keys. The randomized rHH sketches (CountSketch and CountMin) are designed for keys that are integers in $[n]$. The bare sketch does not explicitly store keys. The rHH keys can be recovered by enumerating over $[n]$ and retaining the keys with the largest estimates. Alternatively, when streaming (performing one sequential pass over elements) we can maintain an auxiliary structure that holds the key strings with the current top-$k$ estimates [15].

With general composable sketches, key strings can be handled using an auxiliary structure that increases the sketch size by a factor linear in the string length. This is inefficient compared with sketches that natively store the string. Alternatively, a two-pass method can be used, with the first pass computing an rHH sketch for a hashed numeric representation and the second pass obtaining the key strings of hashed representations with high estimates.

Recovering (approximate) frequencies of rHH keys

For our application, we need approximate or exact frequencies of the rHH keys. The estimates provided by a (k,\psi) rHH sketch come with absolute error (statistical) guarantees (see Table 1). One approach is to recover exact frequencies in a second pass. We can also obtain more accurate estimates (of relative error at most \epsilon) by using a (k,\epsilon\psi) rHH sketch.

Testing for failure

Recall that the dataset may not have (k,\psi) rHH. We can test whether any of the k largest estimated frequencies, raised to the power p, falls below the specified error bound \frac{\psi}{k}\|\textsf{tail}_{k}(\boldsymbol{\nu})\|_{p}^{p}. If so, we declare “failure.”

Appendix B Overview of the proof of Theorem 3.1

For a vector 𝒘n\boldsymbol{w}\in\Re^{n}, permutation πSn\pi\in S^{n}, and p>0p>0, let the random variable 𝒘p-ppswor[𝒘]π\boldsymbol{w^{*}}\sim\text{$p$-ppswor}[\boldsymbol{w}]\mid\pi be a pp-ppswor transform (4) of 𝒘\boldsymbol{w} conditioned on the event order(𝒘)=π\textsf{order}(\boldsymbol{w^{*}})=\pi. For q>pq>p and k>1k>1, we define the following distribution:

F𝒘,p,q,kπ:=tailk(𝒘)qq(w(k))q.F_{\boldsymbol{w},p,q,k}\mid\pi:=\frac{\|\textsf{tail}_{k}(\boldsymbol{w^{*}})\|_{q}^{q}}{(w^{*}_{(k)})^{q}}\ . (18)

Note that for any 𝒘n\boldsymbol{w}\in\Re^{n} and πSn\pi\in S^{n},

\Pr_{\boldsymbol{w}^{*}\sim\text{$p$-ppswor}[\boldsymbol{w}]\mid\textsf{order}(\boldsymbol{w}^{*})=\pi}\left[k\frac{|w^{*}_{(k)}|^{q}}{\|\textsf{tail}_{k}(\boldsymbol{w}^{*})\|_{q}^{q}}\leq\psi\right]=\Pr_{z\sim F_{\boldsymbol{w},p,q,k}\mid\pi}\left[z\geq\frac{k}{\psi}\right] (19)

Therefore tail bounds on FF that hold for any 𝒘n\boldsymbol{w}\in\Re^{n} and πSn\pi\in S^{n} can be used to establish the claim.

We now define another distribution that does not depend on 𝒘\boldsymbol{w} and π\pi:

Definition B.1.

For 1kn1\leq k\leq n and ρ1\rho\geq 1 we define a distribution Rn,k,ρR_{n,k,\rho} as follows.

R_{n,k,\rho}:=\sum_{i=k+1}^{n}\frac{\left(\sum_{j=1}^{k}Z_{j}\right)^{\rho}}{\left(\sum_{j=1}^{i}Z_{j}\right)^{\rho}}\enspace,

where Z_{i}\sim\textsf{Exp}[1], i\in[n], are i.i.d.

The proof of Theorem 3.1 will follow using the following two components:

  (i) We show (Section C) that for any \boldsymbol{w}\in\Re^{n} and permutation \pi\in S^{n},

    F_{\boldsymbol{w},p,q,k}|\pi\preceq R_{n,k,\rho=(q/p)}\ ,

    where the relation \preceq corresponds here to statistical domination of distributions.

  (ii) We establish (Section D) tail bounds on R_{n,k,\rho=(q/p)}.

Because of the domination, tail bounds on R_{n,k,\rho=(q/p)} provide corresponding tail bounds on F_{\boldsymbol{w},p,q,k}|\pi for any \boldsymbol{w}\in\Re^{n} and \pi\in S^{n}. Together with (19), we use these tail bounds to conclude the proof of Theorem 3.1.

Moreover, the domination relation is tight in the sense that for some \boldsymbol{w} and \pi, F_{\boldsymbol{w},p,q,k}|\pi is very close to R_{n,k,q/p}: for distributions with k keys of relative frequency \epsilon each and \pi that places these keys in the first k positions (as \epsilon\rightarrow 0), or for uniform distributions with n\gg k (as n grows), F_{\boldsymbol{w},p,q,k}|\pi approaches R_{n,k,q/p}.

The tail bounds (and hence the claim of Theorem 3.1) also hold without the condition on π\pi.

Lemma B.2.

The tail bounds also hold for the unconditional distribution F𝐰,p,q,kF_{\boldsymbol{w},p,q,k}.

Proof.

The distribution F_{\boldsymbol{w},p,q,k} is a convex combination of the distributions F_{\boldsymbol{w},p,q,k}|\pi. Specifically, for each permutation \pi let p_{\pi} be the probability of obtaining this permutation by successive weighted sampling without replacement (with respect to the weights |w_{x}|^{p}). Then

F𝒘,p,q,k=πpπF𝒘,p,q,k|π.F_{\boldsymbol{w},p,q,k}=\sum_{\pi}p_{\pi}F_{\boldsymbol{w},p,q,k}|\pi\enspace. (20)

Since tail bounds hold for each term, they hold for the combination. ∎

B.1 Approximating 𝝍\boldsymbol{\psi} by simulations

\Psi_{n,k,\rho}(\delta) is the solution of the following equation for \psi:

\Pr_{z\sim R_{n,k,\rho}}[z\geq k/\psi]=\delta\ . (21)

We can approximate \Psi_{n,k,\rho}(\delta) by drawing i.i.d. samples z_{i}\sim R_{n,k,\rho}, taking the (1-\delta) quantile z^{\prime} of the empirical distribution, and returning k/z^{\prime}.

From simulations we obtain that for δ=0.01\delta=0.01 and ρ{1,2}\rho\in\{1,2\}, C=2C=2 suffices for sample size k10k\geq 10, C=1.4C=1.4 suffices for k100k\geq 100, and C=1.1C=1.1 suffices for k1000k\geq 1000.
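The following Python sketch (function names ours) carries out this approximation directly from Definition B.1 and (21); it is a minimal simulation, not an optimized routine.

import numpy as np

def sample_R(n, k, rho, rng):
    # One draw from R_{n,k,rho} (Definition B.1) using i.i.d. Exp(1) variables.
    z = rng.exponential(1.0, size=n)
    prefix = np.cumsum(z)                       # prefix[i-1] = Z_1 + ... + Z_i
    return np.sum((prefix[k - 1] / prefix[k:]) ** rho)

def approx_Psi(n, k, rho, delta, trials=10000, seed=0):
    # Approximate Psi_{n,k,rho}(delta) per (21): take the (1-delta) empirical
    # quantile z' of R_{n,k,rho} and return k / z'.
    rng = np.random.default_rng(seed)
    draws = np.array([sample_R(n, k, rho, rng) for _ in range(trials)])
    z_prime = np.quantile(draws, 1.0 - delta)
    return k / z_prime

# Illustrative call: approx_Psi(n=10**4, k=100, rho=2, delta=0.01)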

Appendix C Domination of the ratio distribution

Lemma C.1 (Domination).

For any permutation \pi, \boldsymbol{w}, p, q\geq p, and k\geq 1, the distribution F_{\boldsymbol{w},p,q,k}|_{\pi} (18) is dominated by R_{n,k,q/p}. That is, for all \psi>0,

PrzF𝒘,p,q,k|π[zkψ]PrzRn,k,q/p[zkψ]\Pr_{z\sim F_{\boldsymbol{w},p,q,k}|\pi}\left[z\leq\frac{k}{\psi}\right]\geq\Pr_{z\sim R_{n,k,q/p}}\left[z\leq\frac{k}{\psi}\right] (22)
Proof.

Assume without loss of generality that \textsf{order}(\boldsymbol{w})=\pi. Let \boldsymbol{w}^{*}\sim\text{$p$-ppswor}[\boldsymbol{w}]\mid\textsf{order}(\boldsymbol{w}^{*})=\pi. Note that by definition \boldsymbol{w}^{*} is in decreasing order of magnitude. Define the random variables y_{i}:=|w^{*}_{(i)}|^{q}, the q-th powers of the magnitudes of the transformed weights of a p-ppswor sample of \boldsymbol{w}, conditioned on the order \pi. We use properties of the exponential distribution (see a similar use in [26]) to express the joint distribution of \{y_{i}\}. We use the following set of independent random variables:

XiExp[j=inwjp].X_{i}\sim\textsf{Exp}[\sum_{j=i}^{n}w^{p}_{j}]\enspace.

We have:

y_{i}=\frac{1}{\left(\sum_{j=1}^{i}X_{j}\right)^{q/p}}\enspace. (23)

To see this, note that |w^{*}_{(1)}|^{p} is the inverse of the minimum of independent exponential random variables with parameters w^{p}_{1},\ldots,w^{p}_{n}, and this minimum is itself an exponential random variable with parameter equal to their sum; hence |w^{*}_{(1)}|^{p}=1/X_{1} and y_{1}=1/X_{1}^{q/p}. By memorylessness, conditioned on the order \pi, the difference between the inverse of the i-th largest and the inverse of the (i-1)-st largest transformed p-th power is distributed as X_{i}. Therefore |w^{*}_{(i)}|^{p}=1/\sum_{j=1}^{i}X_{j}, and the claimed distribution (23) follows.

We are now ready to express the random variable that is the ratio (18) in terms of the independent random variables XiX_{i}:

\frac{\sum_{j=k+1}^{n}y_{j}}{y_{k}}=\frac{\sum_{i=k+1}^{n}\frac{1}{\left(\sum_{j=1}^{i}X_{j}\right)^{q/p}}}{\frac{1}{\left(\sum_{j=1}^{k}X_{j}\right)^{q/p}}}=\sum_{i=k+1}^{n}\frac{\left(\sum_{j=1}^{k}X_{j}\right)^{q/p}}{\left(\sum_{j=1}^{i}X_{j}\right)^{q/p}}\enspace. (24)

We rewrite this using i.i.d. random variables ZiExp[1]Z_{i}\sim\textsf{Exp}[1], recalling that for any ww, Exp[w]\textsf{Exp}[w] is the same as Exp[1]/w\textsf{Exp}[1]/w. Then we have Xi=Zi/j=inwjpX_{i}=Z_{i}/\sum_{j=i}^{n}w^{p}_{j}.

We next provide a simpler distribution that dominates the distribution of the ratio. Let W:=j=knwjpW^{\prime}:=\sum_{j=k}^{n}w^{p}_{j} and consider the i.i.d. random variables Xi=Zi/WX^{\prime}_{i}=Z_{i}/W^{\prime}. Note that XjXjX_{j}\leq X^{\prime}_{j} for jkj\leq k and XjXjX_{j}\geq X^{\prime}_{j} for jkj\geq k. Thus, for ik+1i\geq k+1,

\frac{\sum_{j=1}^{k}X_{j}}{\sum_{j=1}^{i}X_{j}}=\frac{1}{1+\frac{\sum_{j=k+1}^{i}X_{j}}{\sum_{j=1}^{k}X_{j}}}\leq\frac{1}{1+\frac{\sum_{j=k+1}^{i}X^{\prime}_{j}}{\sum_{j=1}^{k}X^{\prime}_{j}}}=\frac{\sum_{j=1}^{k}X^{\prime}_{j}}{\sum_{j=1}^{i}X^{\prime}_{j}}=\frac{\sum_{j=1}^{k}Z_{j}}{\sum_{j=1}^{i}Z_{j}} (25)

Raising (25) to the power q/p and applying it to each term in the RHS of (24), we obtain

\frac{\sum_{j=k+1}^{n}y_{j}}{y_{k}}\leq\sum_{i=k+1}^{n}\frac{\left(\sum_{j=1}^{k}Z_{j}\right)^{q/p}}{\left(\sum_{j=1}^{i}Z_{j}\right)^{q/p}}\enspace.

Since this inequality holds pointwise for every realization of the Z_{i}, the claimed domination (22) follows. ∎

Appendix D Tail bounds on R_{n,k,\rho}

We establish the following upper tail bounds on the distribution Rn,k,ρR_{n,k,\rho}:

Theorem D.1 (Concentration of Rn,k,ρR_{n,k,\rho}).

There is a constant CC, such that for any n,k,ρn,k,\rho

ρ=1:\displaystyle\rho=1\text{:} PrrRn,k,ρ[rCkln(nk)]3ek\displaystyle\Pr_{r\sim R_{n,k,\rho}}\left[r\geq Ck\ln(\frac{n}{k})\right]\leq 3e^{-k} (26)
ρ>1:\displaystyle\rho>1\text{:} PrrRn,k,ρ[rCk1ρ1]3ek\displaystyle\Pr_{r\sim R_{n,k,\rho}}\left[r\geq Ck\frac{1}{\rho-1}\right]\leq 3e^{-k} (27)

We start with a “back of the envelope” calculation to provide intuition: replace the random variables ZiZ_{i} in Rn,k,ρR_{n,k,\rho} (see Definition B.1) by their expectation E[Zi]=1\textsf{E}[Z_{i}]=1 to obtain

Sn,k,ρ:=i=k+1nkρiρ.S_{n,k,\rho}:=\sum_{i=k+1}^{n}\frac{k^{\rho}}{i^{\rho}}\enspace.

For ρ=1\rho=1, Sn,k,ρk(HnHk)kln(n/k)S_{n,k,\rho}\leq k(H_{n}-H_{k})\approx k\ln(n/k). For ρ>1\rho>1 we have Sn,k,ρkρ1S_{n,k,\rho}\approx\frac{k}{\rho-1}. We will see that we can expect the sums not to deviate too far from this value.
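A quick numeric check of this back-of-the-envelope calculation, with illustrative parameter values:

import math

def S(n, k, rho):
    # S_{n,k,rho} = sum_{i=k+1}^{n} (k/i)^rho
    return sum((k / i) ** rho for i in range(k + 1, n + 1))

n, k = 10**6, 100
print(S(n, k, 1.0), k * math.log(n / k))   # rho = 1: close to k ln(n/k)
print(S(n, k, 1.5), k / (1.5 - 1))         # rho = 1.5: close to k/(rho-1)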

The sum of \ell i.i.d. \textsf{Exp}[1] random variables follows the Erlang distribution \textsf{Erlang}[\ell,1] (rate parameter 1), with expectation \textsf{E}_{r\sim\textsf{Erlang}[\ell,1]}[r]=\ell and variance \textsf{Var}_{r\sim\textsf{Erlang}[\ell,1]}[r]=\ell. We will use the following Erlang tail bounds [48]:

Lemma D.1.

For X\sim\textsf{Erlang}[\ell,1],

\varepsilon\geq 1:\quad\Pr[X\geq\varepsilon\ell]\leq\frac{1}{\varepsilon}e^{-\ell(\varepsilon-1-\ln\varepsilon)}\leq e^{1-\varepsilon}
\varepsilon\leq 1:\quad\Pr[X\leq\varepsilon\ell]\leq e^{-\ell(\varepsilon-1-\ln\varepsilon)}

When ε<0.159\varepsilon<0.159 or ε>3.2\varepsilon>3.2 we have the bound ee^{-\ell}. For ε>3.2\varepsilon>3.2 we also have 1εe(ε2.2)\frac{1}{\varepsilon}e^{-\ell(\varepsilon-2.2)}
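For intuition, the upper tail bound can be compared numerically against the exact Erlang tail (\textsf{Erlang}[\ell,1] is a Gamma distribution with shape \ell and scale 1); a small Python check with illustrative values:

import math
from scipy.stats import gamma

l = 50
for eps in (1.5, 2.0, 3.2):
    exact = gamma(a=l).sf(eps * l)                                  # Pr[X >= eps * l]
    bound = (1.0 / eps) * math.exp(-l * (eps - 1 - math.log(eps)))  # upper tail bound
    print(eps, exact, bound)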

Proof of Theorem D.1.

We bound the probability of a “bad event" which we define as the numerator being “too large” and denominators being too “small.” More formally, the numerator is the sum N=i=1kZiN=\sum_{i=1}^{k}Z_{i} and we define a bad event as N3.2kN\geq 3.2k. Substituting ε=3.2\varepsilon=3.2 and =k\ell=k in the upper tail bounds from Lemma D.1, we have that the probability of this bad event is bounded by

PrrErlang[k,1][r>kε]ek.\Pr_{r\sim\textsf{Erlang}[k,1]}\left[r>k\varepsilon\right]\leq e^{-k}\enspace. (28)

The denominators are prefix sums of the sequence of random variables. We partition the sequence Z_{k+1},\ldots,Z_{n} into consecutive stretches of size

\ell_{h}:=2^{h}k,\quad(h\geq 1)\ .

We denote by S_{h} the sum of stretch h. Note that the S_{h}\sim\textsf{Erlang}[\ell_{h},1] are independent random variables. We define a bad event as the event that for some h\geq 1, S_{h}\leq 0.15\ell_{h}=0.15\cdot 2^{h}k. From the lower tail bound of Lemma D.1, we have

Pr[Sh0.15h]=PrrErlang[h,1][r<0.15h]ehe2hk.\Pr[S_{h}\leq 0.15\ell_{h}]=\Pr_{r\sim\textsf{Erlang}[\ell_{h},1]}\left[r<0.15\ell_{h}\right]\leq e^{-\ell_{h}}\leq e^{-2^{h}k}\enspace. (29)

The combined probability of the union of these bad events (for the numerator and all stretches) is at most ek+h1e2hk3eke^{-k}+\sum_{h\geq 1}e^{-2^{h}k}\leq 3e^{-k}.

We are now ready to compute a probabilistic upper bound on the ratio sum when no bad event occurs:

R_{n,k,\rho} \leq \sum_{h\geq 1}\ell_{h}\frac{N^{\rho}}{(N+\sum_{i<h}S_{i})^{\rho}}
\leq \sum_{h\geq 1}\ell_{h}\frac{\left(3.2k\right)^{\rho}}{\left(3.2k+0.15\sum_{i<h}\ell_{i}\right)^{\rho}}
= 2k\frac{\left(3.2k\right)^{\rho}}{\left(3.2k\right)^{\rho}}+\sum_{h\geq 2}2^{h}k\frac{\left(3.2k\right)^{\rho}}{\left(3.2k+(2^{h}-2)k\right)^{\rho}}
\leq k\left(2+\sum_{h=2}^{\lceil\log_{2}(n/k)\rceil}2^{h}\left(\frac{3.2}{2^{h}+1.2}\right)^{\rho}\right)\leq k\left(2+3.2^{\rho}\sum_{h=2}^{\lceil\log_{2}(n/k)\rceil}2^{-h(\rho-1)}\right)

For \rho=1 this gives O(k\log(n/k)). For \rho>1, it gives O(k/(\rho-1)). ∎

From the proof of Theorem D.1 we obtain:

Corollary D.2.

There is a constant BB such that when there are no “bad events" in the sense of the proof of Theorem D.1,

i=1kZii=1BkZi1/3.\frac{\sum_{i=1}^{k}Z_{i}}{\sum_{i=1}^{Bk}Z_{i}}\leq 1/3\ .
Proof.

With no bad events, N=i=1kZi<3.2kN=\sum_{i=1}^{k}Z_{i}<3.2k and i=k+1k2hZi0.15k(2h1)\sum_{i=k+1}^{k2^{h}}Z_{i}\geq 0.15k(2^{h}-1). Solving for 0.15kB6.4k0.15kB\geq 6.4k (for B=2h1B=2^{h}-1 for some hh) we obtain B=63B=63. ∎

Appendix E Ratio of magnitudes of transformed weights

For k_{2}>k_{1} we consider the distribution of the ratio between the magnitudes of the k_{2}-th and k_{1}-th largest transformed weights, raised to the power q:

G_{\boldsymbol{w},p,q,k_{1},k_{2}}\mid\pi:=\left|\frac{w^{*}_{(k_{2})}}{w^{*}_{(k_{1})}}\right|^{q}
Lemma E.1.

For any 𝐰n\boldsymbol{w}\in\Re^{n}, πSn\pi\in S^{n}, and k1<k2nk_{1}<k_{2}\leq n, the distribution G𝐰,p,q,k1,k2πG_{\boldsymbol{w},p,q,k_{1},k_{2}}\mid\pi is dominated by

Gρ=q/p,k1,k2:=(i=1k1Zii=1k2Zi)ρ,G^{\prime}_{\rho=q/p,k_{1},k_{2}}:=\left(\frac{\sum_{i=1}^{k_{1}}Z_{i}}{\sum_{i=1}^{k_{2}}Z_{i}}\right)^{\rho}\enspace, (30)

where ZiExp[1]Z_{i}\sim\textsf{Exp}[1] are i.i.d.

Proof.

Following the notation in the proof of Lemma C.1, the distribution G𝒘,p,q,k1,k2G_{\boldsymbol{w},p,q,k_{1},k_{2}} can be expressed as

(i=1k1Xii=1k2Xi)ρ\left(\frac{\sum_{i=1}^{k_{1}}X_{i}}{\sum_{i=1}^{k_{2}}X_{i}}\right)^{\rho}

where Xi:=Zij=inwjpX_{i}:=\frac{Z_{i}}{\sum_{j=i}^{n}w_{j}^{p}}.

For i\in[n] we define X^{\prime}_{i}:=\frac{Z_{i}}{\sum_{j=k_{1}}^{n}w_{j}^{p}}. Now note that X^{\prime}_{i}\geq X_{i} for i\leq k_{1} and X^{\prime}_{i}\leq X_{i} for i\geq k_{1}. Therefore,

\frac{\sum_{i=1}^{k_{1}}X_{i}}{\sum_{i=1}^{k_{2}}X_{i}}=\frac{1}{1+\frac{\sum_{i=k_{1}+1}^{k_{2}}X_{i}}{\sum_{i=1}^{k_{1}}X_{i}}}\leq\frac{1}{1+\frac{\sum_{i=k_{1}+1}^{k_{2}}X^{\prime}_{i}}{\sum_{i=1}^{k_{1}}X^{\prime}_{i}}}=\frac{1}{1+\frac{\sum_{i=k_{1}+1}^{k_{2}}Z_{i}}{\sum_{i=1}^{k_{1}}Z_{i}}}=\frac{\sum_{i=1}^{k_{1}}Z_{i}}{\sum_{i=1}^{k_{2}}Z_{i}}\enspace.

Raising this to the power \rho=q/p preserves the inequality pointwise, and the claimed domination follows. ∎

Appendix F 1-pass with total variation distance on sample kk-tuple: upper and lower bounds

Perfect ppswor returns each subset of k keys S=\{i_{1},\ldots,i_{k}\} with probability

p(S)=\sum_{\pi\mid\{\pi_{1},\ldots,\pi_{k}\}=S}\prod_{j=1}^{k}\frac{w_{\pi_{j}}}{\|\boldsymbol{w}\|_{1}-\sum_{h<j}w_{\pi_{h}}}\ ,

where the sum is over the k! ordered arrangements (\pi_{1},\ldots,\pi_{k}) of the keys of S.

Recall that the distribution is equivalent to successive weighted sampling without replacement. It is also equivalent to successive sampling with replacement if we “skip" repetitions until we obtain kk distinct keys.

With pp-ppswor and unaggregated data, this is with respect to νxp\nu_{x}^{p}. The WORp 1-pass method returns an approximate pp-ppswor sample in terms of estimation quality and per-key inclusion probability but the TV distance on kk-tuples can be large.

We present here another 1-pass method that returns a k-tuple whose distribution has polynomially small TV distance from that of a p-ppswor sample.

Input: p\ell_{p} rHH method, perfect p\ell_{p}-single sampler method, sample size kk, pp, δ\delta, nn,
Initialization:
      Initialize r=Cklognr=C\cdot k\log n independent perfect p\ell_{p}-single sampling algorithms A1,,ArA^{1},\ldots,A^{r}.
      Initialize an p\ell_{p} rHH method RR.
Pass 1:
      Feed each stream update into A1,,ArA^{1},\ldots,A^{r} as well as into RR.
Produce sample:
      SS\leftarrow\emptyset
      For i=1,,ri=1,\ldots,r
       Let OutiOut_{i} be the index returned by AiA^{i}
       If OutiSOut_{i}\notin S, then
         SS{Outi}S\leftarrow S\cup\{Out_{i}\}
         For each j>ij>i, feed the update xOutixOutiR(Outi)x_{Out_{i}}\leftarrow x_{Out_{i}}-R(Out_{i}) into AjA^{j}
  // R(Out_{i}) is the estimate of x_{Out_{i}} given by R
       If |S|=k|S|=k then exit and return SS
      end
Output Fail
  // The algorithm returns S before reaching this line with high probability
Algorithm 1 1-pass Low Variation Distance Sampling
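The following Python sketch simulates the sample-production logic of Algorithm 1 in an idealized setting: the perfect \ell_{p}-single samplers are replaced by exact draws from the residual vector and R(\cdot) returns the exact coordinate value, so the residual of an accepted key becomes zero. It illustrates the control flow only, not the small-space streaming data structures; the function name and defaults are ours.

import numpy as np

def low_variation_sample(x, k, p, r=None, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    r = r if r is not None else 4 * k * max(1, int(np.log(n)))  # number of samplers
    residual = np.array(x, dtype=float)     # shared residual vector (R is exact here)
    S = []
    for _ in range(r):
        w = np.abs(residual) ** p
        total = w.sum()
        if total == 0:
            break
        out = int(rng.choice(n, p=w / total))   # Out_i drawn with prob |x_i|^p / ||x||_p^p
        if out not in S:                        # mirrors the duplicate check; with exact
            S.append(out)                       # R, accepted keys are fully zeroed out
            residual[out] = 0.0                 # feed x_out <- x_out - R(out)
        if len(S) == k:
            return S
    return None                                 # Fail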
Theorem F.1.

Let p(0,2]p\in(0,2]. There is a 11-pass turnstile streaming algorithm using kpoly(logn)k\cdot\textrm{poly}(\log n) bits of memory which, given a stream of updates to an underlying vector x{M,M+1,,M1,M}nx\in\{-M,-M+1,\ldots,M-1,M\}^{n}, with M=poly(n)M=\textrm{poly}(n), outputs a set SS of size kk such that the distribution of SS has variation distance at most 1nC\frac{1}{n^{C}} from the distribution of a sample without replacement of size kk from the distribution μ=(μ1,,μn)\mu=(\mu_{1},\ldots,\mu_{n}), where μi=|xi|pxpp\mu_{i}=\frac{|x_{i}|^{p}}{\|x\|_{p}^{p}}, where C>0C>0 is an arbitrarily large constant.

Proof.

The algorithm is 11-pass and works in a turnstile stream given an p\ell_{p} rHH method and perfect p\ell_{p}-single sampler methods that have this property. We use the p\ell_{p} rHH method of [52], which has this property and uses O(klog2n)O(k\cdot\log^{2}n) bits of memory. We also use the perfect p\ell_{p}-single sampler method of [50], which has this property and uses log2npoly(loglogn)\log^{2}n\cdot\textrm{poly}(\log\log n) bits of memory for 0<p<20<p<2 and O(log3n)O(\log^{3}n) bits of memory for p=2p=2. The perfect p\ell_{p}-single sampler method of [50] can output Fail with constant probability, but we can repeat it ClognC\log n times and output the first sample found, so that it outputs Fail with probability at most 1nC\frac{1}{n^{C}} for an arbitrarily large constant C>0C>0, and consequently we can assume it never outputs Fail (by say, always outputting index 11 when Fail occurs). This gives us the claimed kpoly(logn)k\cdot\textrm{poly}(\log n) total bits of memory.

We next state properties of these subroutines. The p\ell_{p} rHH method we use satisfies: with probability 11nC1-\frac{1}{n^{C}} for an arbitrarily large constant C>0C>0, simultaneously for all j[n]j\in[n], it outputs an estimate R(j)R(j) for which

R(j)=x_{j}\pm\left(\frac{1}{2k}\right)^{1/p}\cdot\|\textsf{tail}_{k}(x)\|_{p}.

We assume this event occurs and add 1nC\frac{1}{n^{C}} to our overall variation distance.

The next property concerns the perfect p\ell_{p}-single samplers AjA^{j} we use. Each AjA^{j} returns an index i{1,2,,n}i\in\{1,2,\ldots,n\} such that the distribution of ii has variation distance at most 1nC\frac{1}{n^{C}} from the distribution μ\mu. Here C>0C>0 is an arbitrarily large constant of our choice.

We next analyze our procedure for producing a sample. Consider the joint distribution of (Out_{1},Out_{2},\ldots,Out_{r}). The algorithms A^{i} use independent randomness, though the input to A^{i} may depend on the randomness of A^{i^{\prime}} for i^{\prime}<i. By definition, conditioned on A^{i} not outputting Out_{i^{\prime}} for any i^{\prime}<i, we have that Out_{i} is independent of Out_{1},\ldots,Out_{i-1}; moreover, the distribution of Out_{i} has variation distance at most \frac{1}{n^{C}} from the distribution of a sample s from \mu conditioned on s\notin\{Out_{1},\ldots,Out_{i-1}\}, for an arbitrarily large constant C>0.

Let E be the event that we sample k distinct indices, i.e., do not output Fail in our overall algorithm. We show below that \Pr[E]\geq 1-\frac{1}{n^{C}} for an arbitrarily large constant C>0. Consequently, our output distribution has variation distance at most \frac{1}{n^{C}} from that of an idealized algorithm that samples until it has k distinct values.

Consider the probability of outputting a particular ordered tuple (i1,,ik)(i_{1},\ldots,i_{k}) of kk distinct indices in the idealized algorithm that samples until it has kk distinct values. By the above, this is

j=1k(1±1nC)μij1j<jμij=(1±2knC)j=1kμij1j<jμij,\prod_{j=1}^{k}(1\pm\frac{1}{n^{C}})\frac{\mu_{i_{j}}}{1-\sum_{j^{\prime}<j}\mu_{i_{j^{\prime}}}}=(1\pm\frac{2k}{n^{C}})\prod_{j=1}^{k}\frac{\mu_{i_{j}}}{1-\sum_{j^{\prime}<j}\mu_{i_{j^{\prime}}}},

for an arbitrarily large constant C>0C>0. Summing up over all orderings, we obtain the probability of obtaining the sample {i1,,ik}\{i_{1},\ldots,i_{k}\} is within (1±1nC)(1\pm\frac{1}{n^{C}}) times its probability of being sampled from μ\mu in a sample without replacement of size kk, where C>0C>0 is a sufficiently large constant.

It remains to show that \Pr[E]\geq 1-n^{-C} for an arbitrarily large constant C>0. Here we use that for all j\in\{1,2,\ldots,n\}, R(j)=x_{j}\pm\left(\frac{1}{2k}\right)^{1/p}\cdot\|\textsf{tail}_{k}(x)\|_{p}. Let Y_{i} be the number of trials until (and including the time) we sample the i-th distinct item, given that we have just sampled i-1 distinct items. After the updates x_{Out_{i^{\prime}}}\leftarrow x_{Out_{i^{\prime}}}-R(Out_{i^{\prime}}), the total probability mass on the items we have already sampled is at most i\cdot\frac{1}{2k}\|\textsf{tail}_{k}(x)\|_{p}^{p}, while the mass on the remaining items is at least \|\textsf{tail}_{k}(x)\|_{p}^{p}; thus the probability that we re-sample an item already sampled is at most \frac{1}{2}. It follows that {\bf E}[Y_{i}]\leq 2. Thus, the number of trials in the algorithm is stochastically dominated by \sum_{i=1}^{k}Z_{i}, where Z_{i} is a geometric random variable with {\bf E}[Z_{i}]=2. This sum is a negative binomial random variable, and by standard tail bounds relating a sum of independent geometric random variables to binomial random variables [62] (see also, e.g., https://math.stackexchange.com/questions/1565559/tail-bound-for-sum-of-geometric-random-variables), it is at most Ck\log n with probability 1-\frac{1}{n^{C}} for an arbitrarily large constant C>0.

This completes the proof. ∎

We now analyze the memory in Theorem F.1 more precisely. Algorithm 1 runs r=O(klogn)r=O(k\log n) independent perfect p\ell_{p}-sampling algorithms of [50]. The choice of r=O(klogn)r=O(k\log n) is to ensure that the variation distance is at most 1poly(n)\frac{1}{\mathop{\mathrm{poly}}(n)}; however, with only r=O(k)r=O(k) such samplers, the same argument as in the proof of Theorem F.1 gives variation distance at most 2Θ(k)2^{-\Theta(k)}. Now, each p\ell_{p}-sampler of [50] uses O(log2n(loglogn)2)O(\log^{2}n(\log\log n)^{2}) bits of memory for 0<p<20<p<2, and uses O(log3n)O(\log^{3}n) bits of memory for p=2p=2. We also do not need to repeat the algorithm O(logn)O(\log n) times to create a high probability of not outputting Fail; indeed, already if with only constant probability the algorithm does not output Fail, we will still obtain kk distinct samples with 2Θ(k)2^{-\Theta(k)} failure probability provided we have a large enough r=O(k)r=O(k) number of such samplers.

Algorithm 1 also runs an \ell_{p} rHH method, and this uses O(k\log^{2}n) bits of memory [52]. Consequently, to achieve variation distance at most 2^{-\Theta(k)}, Algorithm 1 uses O(k\log^{2}n(\log\log n)^{2}) bits of memory for 0<p<2, and O(k\log^{3}n) bits of memory for p=2.

We now show that for 0<p<2, the memory used by Algorithm 1 is best possible for any algorithm, up to a multiplicative O((\log\log n)^{2}) factor. For p=2, we show that our algorithm's memory is optimal up to a multiplicative O(\log n) factor. Further, our lower bound holds even for any algorithm with the much weaker requirement of achieving variation distance at most \frac{1}{3}, as opposed to the variation distance at most 2^{-\Theta(k)} that we achieve.

Theorem F.2.

Any 11-pass turnstile streaming algorithm which outputs a set SS of size kk such that the distribution of SS has variation distance at most 13\frac{1}{3} from the distribution of a sample without replacement of size kk from the distribution μ=(μ1,,μn)\mu=(\mu_{1},\ldots,\mu_{n}), where μi=|xi|pxpp\mu_{i}=\frac{|x_{i}|^{p}}{\|x\|_{p}^{p}}, requires Ω(klog2n)\Omega(k\log^{2}n) bits of memory, provided k<nC0k<n^{C_{0}} for a certain absolute constant C0>0C_{0}>0.

Proof.

We use the fact that such a sample SS can be used to find a constant fraction of the q(k,1)\ell_{q}(k,1) residual heavy hitters in a data stream. Here we do not need to achieve residual error for our lower bound, and can instead define such indices ii to be those that satisfy |xi|p1kxpp|x_{i}|^{p}\geq\frac{1}{k}\|x\|_{p}^{p}. Notice that there are at most kk such indices ii, and any sample SS (with or without replacement) with variation distance at most 1/31/3 from a true sample has probability at least 1(11/k)k1/311/e1/3.291-(1-1/k)^{k}-1/3\geq 1-1/e-1/3\geq.29 of containing the index ii. By repeating our algorithm O(1)O(1) times, we obtain a superset of size O(k)O(k) which contains any particular such index ii with arbitrarily large constant probability, and these O(1)O(1) repetitions only increase our memory by a constant factor.

It is also well-known that there exists a point-query data structure, in particular the CountSketch data structure [15, 67], which only needs O(log|S|)=O(logk)O(\log|S|)=O(\log k) rows and thus O((klogk)logn)O((k\log k)\log n) bits of memory, such that given all the indices jj in a set SS, one can return all items jSj\in S for which |xj|p1kxpp|x_{j}|^{p}\geq\frac{1}{k}\|x\|_{p}^{p} and no items jSj\in S for which |xj|p<12kxpp|x_{j}|^{p}<\frac{1}{2k}\|x\|_{p}^{p}. Here we avoid the need for O(logn)O(\log n) rows since we only need to union bound over correct estimates in the set SS.

In short, the above procedure allows us to, with arbitrarily large constant probability, return a set SS containing a random .99.99 fraction of the indices jj for which |xj|p1kxpp|x_{j}|^{p}\geq\frac{1}{k}\|x\|_{p}^{p}, and containing no index jj for which |xj|p<12kxpp|x_{j}|^{p}<\frac{1}{2k}\|x\|_{p}^{p}.

We now use an existing Ω(klog2n)\Omega(k\log^{2}n) bit lower bound, which is stated for finding all the heavy hitters [52], to show an Ω(klog2n)\Omega(k\log^{2}n) bit lower bound for the above task. This is in fact immediate from the proof of Theorem 9 of [52], which is a reduction from the one-way communication complexity of the Augmented Indexing Problem and just requires any particular heavy hitter index to be output with constant probability. In particular, our algorithm, combined with the O((klogk)logn)O((k\log k)\log n) bits of memory side data structure of [15] described above, achieves this.

Consequently, the memory required of any 11-pass streaming algorithm for the sampling problem is at least Ω(klog2n)O((klogk)logn)\Omega(k\log^{2}n)-O((k\log k)\log n) bits, which gives us an Ω(klog2n)\Omega(k\log^{2}n) lower bound provided k<nC0k<n^{C_{0}} for an absolute constant C0>0C_{0}>0, completing the proof. ∎

Appendix G Estimates of one-pass WORp

We first review the setup. Our one-pass WORp method returns the top kk keys by νx^\widehat{\nu^{*}_{x}} as our sample SS and returns ν^(k+1)\widehat{\nu^{*}}_{(k+1)} as the threshold. The estimate of f(νx)f(\nu_{x}) is 0 for xSx\not\in S and for xSx\in S is

\widehat{f(\nu_{x})}:=\frac{f(\widehat{\nu^{*}_{x}}\,r_{x}^{1/p})}{1-\exp\left(-r_{x}\left(\frac{\widehat{\nu^{*}_{x}}}{\widehat{\nu^{*}}_{(k+1)}}\right)^{p}\right)}\enspace. (31)

We assume that f(w)f(w) is such that for some constant cc,

ε<1/2,|f((1+ε)w)f(w)|cεf(w).\forall\varepsilon<1/2,|f((1+\varepsilon)w)-f(w)|\leq c\varepsilon f(w)\enspace. (32)

We need to establish that

Bias[f(νx)^]\displaystyle\textsf{Bias}[\widehat{f(\nu_{x})}] O(ε)f(νx)\displaystyle\leq O(\varepsilon)f(\nu_{x})
MSE[f(νx)^]\displaystyle\textsf{MSE}[\widehat{f(\nu_{x})}] (1+O(ε))Var[f(νx)^]+O(ε)f(νx)2,\displaystyle\leq(1+O(\varepsilon))\textsf{Var}[\widehat{f(\nu_{x})}^{\prime}]+O(\varepsilon)f(\nu_{x})^{2}\enspace,

where f(νx)^\widehat{f(\nu_{x})}^{\prime} are estimates obtained with a (perfect) pp-ppswor sample.
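Before the proof, a minimal Python sketch of how the per-key estimate (31) is evaluated from the sketch outputs; the function and argument names are ours.

import math

def worp_estimate(f, nu_hat_star, r_x, threshold, p):
    # nu_hat_star: estimated transformed frequency of the sampled key x
    # r_x:         the key's Exp(1) randomization
    # threshold:   the (k+1)-st largest estimated transformed frequency
    nu_est = nu_hat_star * r_x ** (1.0 / p)                           # estimate of nu_x
    p_est = 1.0 - math.exp(-r_x * (nu_hat_star / threshold) ** p)     # inclusion prob. as in (34)
    return f(nu_est) / p_est                                          # inverse-probability estimate (31)

# Keys outside the sample are estimated as 0; e.g., for f(w) = w**2:
# worp_estimate(lambda w: w * w, nu_hat_star=5.3, r_x=0.8, threshold=2.1, p=1.0)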

Proof of Theorem 5.1.

From (10), the rHH sketch has the property that for all keys in the dataset,

𝝂^𝝂εν(k+1).\|\widehat{\boldsymbol{\nu}^{*}}-\boldsymbol{\nu}^{*}\|_{\infty}\leq\varepsilon\nu^{*}_{(k+1)}\enspace. (33)

For sampled keys, |\nu^{*}_{x}|\geq|\nu^{*}_{(k+1)}| and hence |\widehat{\nu^{*}_{x}}-\nu^{*}_{x}|\leq\varepsilon|\nu^{*}_{(k+1)}|\leq\varepsilon|\nu^{*}_{x}|. Using (6) we obtain that |\nu^{\prime}_{x}-\nu_{x}|\leq\varepsilon|\nu_{x}|.

From our assumption (32) , we have |f(νx)f(νx)|cεf(νx)|f(\nu^{\prime}_{x})-f(\nu_{x})|\leq c\varepsilon f(\nu_{x}).

We consider the inclusion probability and the frequency estimate of a particular key x, conditioned on a fixed randomization r_{z} of all other keys z\not=x. The key x is included in the sample if \widehat{\nu^{*}_{x}}\geq\widehat{\boldsymbol{\nu}^{*}}_{(k+1)}. We consider the distribution of \widehat{\nu^{*}_{x}} as a function of r_{x}\sim\textsf{Exp}[1]. The value has the form E+\nu_{x}/r_{x}^{1/p}, where the error E satisfies |E|\leq\varepsilon|\nu^{*}_{(k+1)}|. The conditional inclusion probability thus falls in the range

px=Pr[νx/rx1/p±ε|ν(k)|𝝂^(k)]=Pr[rx(νx𝝂^(k+1)±ε|ν(k)|)p]=1exp((νx𝝂^(k+1)±ε|ν(k)|)p).p^{\prime}_{x}=\Pr[\nu_{x}/r_{x}^{1/p}\pm\varepsilon|\nu^{*}_{(k)}|\geq\widehat{\boldsymbol{\nu}^{*}}_{(k)}]=\Pr\left[r_{x}\leq\left(\frac{\nu_{x}}{\widehat{\boldsymbol{\nu}^{*}}_{(k+1)}\pm\varepsilon|\nu^{*}_{(k)}|}\right)^{p}\right]=1-\exp(-\left(\frac{\nu_{x}}{\widehat{\boldsymbol{\nu}^{*}}_{(k+1)}\pm\varepsilon|\nu^{*}_{(k)}|}\right)^{p})\enspace.

We estimate pxp^{\prime}_{x} by

px′′=1exp(rx(νx^ν^(k+1))p).p^{\prime\prime}_{x}=1-\exp\left(-r_{x}(\frac{\widehat{\nu^{*}_{x}}}{\widehat{\nu^{*}}_{(k+1)}})^{p}\right)\enspace. (34)

This estimate has a small relative error. This is due to the relative error in \widehat{\nu^{*}_{x}}, because |(1-\exp(-(1\pm\epsilon)b))-(1-\exp(-b))|=O(\epsilon)(1-\exp(-b)), and because (\frac{\nu^{\prime}_{x}}{\widehat{\boldsymbol{\nu}^{*}}_{(k)}})^{p} is an O(\epsilon) relative error approximation of (\frac{\nu_{x}}{\widehat{\boldsymbol{\nu}^{*}}_{(k)}-E})^{p}.

We first consider the bias. Instead of using the unbiased inverse probability estimate f(\nu_{x})/p^{\prime}_{x} when x is sampled (which happens with probability p^{\prime}_{x}), our estimator (17), f(\nu^{\prime}_{x})/p^{\prime\prime}_{x}, approximates both the numerator and the denominator.

In the numerator of the estimator, we replace f(\nu_{x}) by the relative error approximation f(\nu^{\prime}_{x}). Therefore, overall, we use a small relative error estimate of the actual inverse probability estimate when it is nonzero, which translates to a bias of O(\epsilon)f(\nu_{x}).

We next bound the Mean Squared Error (MSE) of the estimator (17). We express the variance contribution of exact pp-ppswor conditioned on the same randomization rzr_{z} of all keys zxz\not=x. This is Var[f(νx)^]=(1/px1)f(νx)2\textsf{Var}[\widehat{f(\nu_{x})}^{\prime}]=(1/p_{x}-1)f(\nu_{x})^{2}, where px=Pr[νx/rx1/p𝝂(k)]=1exp((νx𝝂(k))p)p_{x}=\Pr[\nu_{x}/r_{x}^{1/p}\geq\boldsymbol{\nu}^{*}_{(k)}]=1-\exp(-\left(\frac{\nu_{x}}{\boldsymbol{\nu}^{*}_{(k)}}\right)^{p}). The MSE contribution is

px(f(νx)/px′′f(νx))2+(1px)f(νx)2.p^{\prime}_{x}(f(\nu^{\prime}_{x})/p^{\prime\prime}_{x}-f(\nu_{x}))^{2}+(1-p^{\prime}_{x})f(\nu_{x})^{2}\enspace. (35)

We observe that the approximate threshold (that determines pxp^{\prime}_{x}) approximates the perfect pp-ppswor threshold: |𝝂^(k)𝝂(k)|ε|ν(k)||\widehat{\boldsymbol{\nu}^{*}}_{(k)}-\boldsymbol{\nu}^{*}_{(k)}|\leq\varepsilon|\nu^{*}_{(k)}|. When px<1/2p_{x}<1/2, (35) approximates Var[f(νx)^]\textsf{Var}[\widehat{f(\nu_{x})}^{\prime}] with relative error O(ε)O(\varepsilon).

When p_{x} is close to 1, this is dominated by O(\varepsilon)f(\nu_{x})^{2}. ∎

We remark that our analysis of the error only assumed the rHH error bound (33), which holds for all sketch types, including counter-based sketches. The bias analysis can be tightened for CountSketch, which returns unbiased estimates of the frequencies.

Appendix H Pseudocode

Input: q\ell_{q} rHH method, sample size kk, pp, δ\delta, nn,
Initialization:
Draw a random hash rx𝒟r_{x}\sim\mathcal{D}
  // Random map of keys xx to rxr_{x}
ψ13Ψn,k,q/p(δ)\psi\leftarrow\frac{1}{3}\Psi_{n,k,q/p}(\delta)
Initialize KeyHash
  // random hash function from strings to [n][n]
R.Initialize(k,ψ)R.\textnormal{{Initialize}}(k,\psi)
  // Initialize rHH structure randomization
Pass I:
  // Use composable aggregation (process input keys into rHH structures and merge rHH structures)
begin
       Process data element e=(e.key,e.val)e=(e.\textsf{key},e.\textsf{val}) into rHH sketch RR
       R.Process(KeyHash(e.key),e.val/re.key1/p)R.\textnormal{{Process}}(\textnormal{{KeyHash}}(e.\textsf{key}),e.\textsf{val}/r_{e.\textsf{key}}^{1/p})
        // Generate and process output element
      
Pass II:
  // For keys with top 2k2k estimates νx^\widehat{\nu^{*}_{x}}, collect exact frequencies νx\nu_{x}.
Initialize a composable top-2k2k structure TT. The structure stores for each key its priority and frequency. The structure collects exact frequencies for the keys with top 2k2k priorities. Merge(T1,T2)\textnormal{{Merge}}(T_{1},T_{2}): Add up values and retain 3k3k top priority keys.
Process data element e=(e.key,e.val)e=(e.\textsf{key},e.\textsf{val}) into TT
begin
       if e.keyTe.\textsf{key}\in T then
            T[e.key].val+=e.valT[e.\textsf{key}].val+=e.\textsf{val}
      else
             estR.Est[KeyHash(e.key)]est\leftarrow R.\textnormal{{Est}}[\textnormal{{KeyHash}}(e.\textsf{key})]
              // νkey^\widehat{\nu^{*}_{\textsf{key}}}
            
             if est>lowest priority in Test>\text{lowest priority in $T$} then
                  Insert e.keye.\textsf{key} to TT
                   T[e.key].vale.valT[e.\textsf{key}].val\leftarrow e.\textsf{val}
                   T[e.key].priorityestT[e.\textsf{key}].priority\leftarrow est
                   if |T|>2k then
                        Eject lowest priority key from TT
                  
            
      
Produce sample: Sort T by decreasing |T[x].val|/r^{1/p}_{x}
  // the actual \nu^{*}_{x} for keys in T
Return (x,T[x].val) for the top-k keys, and the (k+1)-th largest |T[x].val|/r^{1/p}_{x} as the threshold
Algorithm 2 2-pass WORp
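For concreteness, the following Python sketch simulates the two passes with an exact dictionary standing in for the rHH sketch R and with the key hashing omitted; it mirrors the structure of Algorithm 2 but is not a small-space implementation, and the function name is ours.

import random
from collections import defaultdict

def two_pass_worp(elements, k, p, seed=0):
    rng = random.Random(seed)
    r = defaultdict(lambda: rng.expovariate(1.0))   # r_x ~ Exp(1), fixed per key

    # Pass I: aggregate transformed values nu*_x = nu_x / r_x^{1/p} (exact stand-in for R).
    transformed = defaultdict(float)
    for key, val in elements:
        transformed[key] += val / r[key] ** (1.0 / p)

    # Pass II: collect exact frequencies of the keys with the top-2k estimated priorities.
    top = sorted(transformed, key=lambda x: abs(transformed[x]), reverse=True)[:2 * k]
    top_set = set(top)
    freq = defaultdict(float)
    for key, val in elements:
        if key in top_set:
            freq[key] += val

    # Produce sample: rank collected keys by the actual |nu*_x| and return the top k,
    # together with the (k+1)-st priority as the threshold.
    prio = sorted(top, key=lambda x: abs(freq[x]) / r[x] ** (1.0 / p), reverse=True)
    sample = [(x, freq[x]) for x in prio[:k]]
    tau = abs(freq[prio[k]]) / r[prio[k]] ** (1.0 / p) if len(prio) > k else 0.0
    return sample, tau

In the full algorithm, Pass I feeds KeyHash(e.key) and e.val/r^{1/p}_{e.key} into the composable rHH sketch R, and Pass II maintains only the top-2k structure T, as in the pseudocode above.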