
Unmasking Vulnerabilities:
Cardinality Sketches under Adaptive Inputs

Sara Ahmadian (Google Research, United States), Edith Cohen (Google Research, United States, and Department of Computer Science, Tel Aviv University, Israel)
Abstract

Cardinality sketches are popular data structures that enhance the efficiency of working with large data sets. The sketches are randomized representations of sets that are only of logarithmic size but can support set merges and approximate cardinality (i.e., distinct count) queries. When queries are not adaptive, that is, they do not depend on preceding query responses, the design provides strong guarantees of correctly answering a number of queries exponential in the sketch size $k$. In this work, we investigate the performance of cardinality sketches in adaptive settings and unveil inherent vulnerabilities. We design an attack against the “standard” estimators that constructs an adversarial input by post-processing responses to a set of simple non-adaptive queries of size linear in the sketch size $k$. Empirically, our attack used only $4k$ queries with the widely used HyperLogLog (HLL++) sketch (Flajolet et al., 2007a; Heule et al., 2013). The simplicity of the attack suggests it can be effective with post-processed natural workloads. Finally and importantly, we demonstrate that the vulnerability is inherent, as any estimator applied to known sketch structures can be attacked using a number of queries that is quadratic in $k$, matching a generic upper bound.

1 Introduction

Composable sketches for cardinality estimation are data structures that are commonly used in practice (Apache Software Foundation, Accessed: 2024; Google Cloud, Accessed: 2024) and have been studied extensively (Flajolet and Martin, 1985; Flajolet et al., 2007b; Cohen, 1997; Bar-Yossef et al., 2002; Kane et al., 2010; Nelson et al., 2014; Cohen, 2015; Pettie and Wang, 2021). The sketch of a set is a compact representation that supports merge (set union) operations, adding elements, and retrieval of an approximate cardinality. The sketch size is only logarithmic or double logarithmic in the cardinality of queries, which allows for a significant efficiency boost over linear-size data structures such as Bloom filters.

Formally, for a universe $\mathcal{U}$ of keys and randomness $\rho$, a sketch map $U\mapsto S_{\rho}(U)$ is a mapping from sets of keys $U\in 2^{\mathcal{U}}$ to their sketch $S_{\rho}(U)$. Sketch maps are designed so that for each $U$ we can recover with high probability (over $\rho$) an estimate of $|U|$ by applying an estimator $\mathcal{M}$ to $S_{\rho}(U)$. A common guarantee is a bound on the Normalized Root Mean Squared Error (NRMSE), so that for accuracy parameter $\epsilon$,

$$\forall U,\quad \textsf{E}_{\rho}\left[\left(\frac{\mathcal{M}(S_{\rho}(U))-|U|}{|U|}\right)^{2}\right]\leq\epsilon^{2}. \qquad (1)$$

The maps $S_{\rho}$ are designed to be composable: for a set $U$ and key $u$, the sketch $S_{\rho}(U\cup\{u\})$ can be computed from $S_{\rho}(U)$ and $u$. For two sets $U$, $V$, the sketch $S_{\rho}(U\cup V)$ can be computed from their respective sketches $S_{\rho}(U)$, $S_{\rho}(V)$. Composability is a crucial property that makes the sketch representations useful for streaming, distributed, and parallel applications. Importantly, the use of the same internal randomness $\rho$ across all queries is necessary for composability, and therefore in typical use cases it is fixed across a system.

The basic technique in the design of cardinality (distinct count) sketches is to randomly prioritize the keys in the universe $\mathcal{U}$ through the use of random hash functions (specified by $\rho$). The sketch of a set $U$ keeps the lowest priorities of keys that are in $U$. This provides information on the cardinality $|U|$, since a larger cardinality corresponds to the presence of lower-priority keys in $U$. This technique was introduced by Flajolet and Martin (1985) for counting distinct elements in streaming and as composable sketches of sets by Cohen (1997). The core idea of sampling keys based on a random order emerged in reservoir sampling (Knuth, 1998) and in weighted sampling (Rosén, 1997). Cardinality sketches are also Locality Sensitive Hashing (LSH) maps (Indyk and Motwani, 1998) with respect to set differences.
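As a toy illustration of this prioritization (our own code, not the paper's construction; the hash-based `priority` helper and key names are assumptions): hashing keys to pseudo-random priorities in $(0,1)$ makes the minimum priority over a set shrink as the set grows, which is exactly the signal a cardinality estimator inverts.

```python
import hashlib

def priority(key: str, seed: str = "rho") -> float:
    """Map a key to a pseudo-random priority in [0,1) via hashing.
    The seed plays the role of the shared randomness rho."""
    h = hashlib.sha256((seed + key).encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def min_priority(keys) -> float:
    return min(priority(x) for x in keys)

small = [f"key{i}" for i in range(100)]
large = [f"key{i}" for i in range(100_000)]  # superset of `small`

# A larger set contains lower-priority keys, so its minimum priority is
# no larger -- roughly on the order of 1/|U| for uniform priorities.
assert 0.0 < min_priority(large) <= min_priority(small) < 1.0
```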

The specific designs of cardinality sketches vary and include MinHash sketches (randomly map keys to priorities) and domain sampling (randomly map keys to sampling rates). With these methods, the sketch size dependence on the maximum query size $|U|\leq n$ is $\log n$ or $\log\log n$. The sketch size (number of registers) needed for the NRMSE guarantee of Equation (1) is $k=O(\epsilon^{-2})$. The sketch size needed for the following $(\epsilon,\delta)$ guarantee (confidence $1-\delta$ of relative error $\epsilon$)

$$\forall U,\quad \textsf{Pr}_{\rho}\left[\left|\frac{\mathcal{M}(S_{\rho}(U))-|U|}{|U|}\right|>\epsilon\right]\leq\delta \qquad (2)$$

is $k=O(\epsilon^{-2}\log(1/\delta))$. This guarantee means that for any $U$, almost any sampled $\rho$ works well.

We model the use of the sketching map $S_{\rho}$ as an interaction between a source, which issues queries $U_{i}\subset\mathcal{U}$, and a query response (QR) algorithm, which receives the sketch $S_{\rho}(U_{i})$ (but not $U_{i}$), applies an estimator $\mathcal{M}$, and returns the estimate $\mathcal{M}(S_{\rho}(U_{i}))$ of the cardinality $|U_{i}|$.

When randomized data structures or algorithms are invoked interactively, it is important to distinguish between non-adaptive queries, which do not depend on $\rho$, and adaptive queries. In non-adaptive settings we can treat queries as fixed in advance. In this case, we can apply a union bound with the guarantee of Equation (2) and obtain that the probability that the responses for all $r$ queries are within a relative error of $\epsilon$ is at least $1-r\delta$. Therefore, the sketch size needed to provide an $(\epsilon,\delta)$-guarantee on this $\ell_{\infty}$ error is $k=O(\epsilon^{-2}\log(r/\delta))$. In particular, the query response algorithm can be correct on a number of non-adaptive queries that is exponential in the sketch size before a set is encountered for which the estimate is off.

Many settings, however, such as control loops, optimization processes, or malicious behavior, give rise to adaptive inputs. This can happen inadvertently when a platform such as Apache Software Foundation (Accessed: 2024) or SQL (Google Cloud, Accessed: 2024) is used. In such cases, information on the randomness $\rho$ may leak from query responses, and the union bound argument does not hold. An important question that arises is thus to understand the actual vulnerability of our specific algorithms in such settings. Are they practically robust? How efficiently can they be attacked? What can be the consequences of finding such an adversarial input?

Randomized data structures designed for non-adaptive queries can be applied in generic ways with adaptive queries. However, the guarantees provided by the resulting (robust) algorithms tend to be significantly weaker than those of their non-robust counterparts. The straightforward approach is to maintain multiple copies of the sketch maps (with independent randomness) and discard a copy after it is used once to respond to a query. This results in a linear dependence of the number of queries $r$ on the size of the data structure. Hassidim et al. (2020) proposed the robustness wrapper method that allows for $r^{2}$ adaptive queries using $\tilde{O}(r)$ sketch maps. The method uses differential privacy to protect the randomness, and the analysis uses generalization (Dwork et al., 2015b; Bassily et al., 2021) and advanced composition (Dwork et al., 2006). The quadratic relation is known to be tight in the worst case for adaptive statistical queries, with an attack (Hardt and Ullman, 2014; Steinke and Ullman, 2015) designed using Fingerprinting Codes (Boneh and Shaw, 1998). But these attacks do not preclude a tailored design with a better utility guarantee for cardinality sketches, and they also do not apply to “natural” workloads.

Contributions and Overview

We consider the known sublinear composable sketch structures for cardinality estimation, which we review in Section 3. Our primary contribution is designing attacks that construct a set $U$ that is adversarial for the randomness $\rho$. We make this precise in the sequel, but for now, an adversarial set $U$ is one whose cardinality estimates are far off.

  • We consider query response algorithms that use the “standard” cardinality estimators. These estimators optimally use the information in the sketch and report a value that is a function of a sufficient statistic of the cardinality. In Section 4 we present an attack on these estimators. The product of the attack is an adversarial set, one for which the sketch $S_{\rho}(U)$ is grossly out of distribution. The attack uses a number of queries $O(k)$ that is linear in the sketch size and, importantly, issues all queries in a single batch. The only adaptive component is the post-processing of the query responses. This single-batch attack suggests that it is possible to construct an adversarial input by simply observing and post-processing a normal, non-adaptive workload of the system. The linear size of the attack matches the straightforward upper bound of using disjoint components of the sketch for different queries.

  • We conduct an empirical evaluation of our proposed attack on the HyperLogLog (HLL) sketch (Durand and Flajolet, 2003; Flajolet et al., 2007b) with the HLL++ estimator (Heule et al., 2013). This is the most widely utilized sketch for cardinality estimation in practice. The results reported in Section 5 show that even with a single-batch attack using $4k$ queries, we can consistently construct adversarial inputs on which the estimator substantially overestimates or underestimates the cardinality by 40%.

  • In Section 6 and Section 7, we present an attack that applies broadly against any correct query response algorithm. By that, we establish the inherent vulnerability of the sketch structures themselves. Our attack uses $\tilde{O}(k^{2})$ adaptive queries. We show that multiple batches are necessary against strategic query response algorithms. This quadratic attack size matches the generic quadratic upper bound construction of Hassidim et al. (2020). The product of our attack is a small mask set $M$ that can poison larger sets $U$, in the sense that $S(M\cup U)\approx S(M)$, making any estimator ineffective. The attack applies even when the query response is for the more specialized soft threshold problem: determine whether the cardinality is below or above a range of the form $[A,2A]$. Moreover, it applies even when the response is tailored to the attack algorithm and its internal state, including the distribution from which the query sets are selected at each step. Note that this strengthening of the query response and simplification of the task only makes the query response algorithm harder to attack.

Our attacks have the following structure: we fix a ground set $N$ of keys and issue queries that are subsets $U_{i}\subset N$. We maintain scores for the keys in $N$ that are adjusted for the keys in $U_{i}$ based on the response. The design has the property that the scores are correlated with the priorities of keys, and the score is higher when the cardinality is underestimated. The adversarial set is then identified as a prefix or a suffix of the keys ordered by their score.

The vulnerabilities we exposed may have practical significance in multiple scenarios. In a non-malicious setting, an adaptive algorithm or an optimization process that is applied in sketch space can select keys that tend to be in overestimated (or underestimated) sets, essentially emulating an attack and inadvertently selecting a biased set on which the estimate is off. In malicious settings, the construction of an adversarial input set $U$ can be an end goal. For example, a system that collects statistics on network traffic can be tricked into reporting that traffic is much larger or much smaller than it actually is. A malicious player can poison the dataset by injecting a small adversarial set $M$ into the data $U$, for example, by issuing respective search queries to a system that sketches sets of search queries. The sketch $S_{\rho}(M\cup U)$ then masks $S_{\rho}(U)$, making it impossible to recover an estimate of the true cardinality of $U$. Finally, cardinality sketches have weighted extensions (max-distinct statistics) and are building blocks of sketches designed for a large class of concave sublinear frequency statistics, including cap statistics and frequency moments with $p\leq 1$ (Cohen, 2018; Cohen and Geri, 2019; Jayaram and Woodruff, 2023), and thus these vulnerabilities apply to these extensions as well.

2 Related Work

There are prolific lines of research on the effect of adaptive inputs that span multiple areas including dynamic graph algorithms (Shiloach and Even, 1981; Ahn et al., 2012; Gawrychowski et al., 2020; Gutenberg and Wulff-Nilsen, 2020; Wajc, 2020; Beimel et al., 2022), sketching and streaming algorithms (Mironov et al., 2008; Hardt and Woodruff, 2013; Ben-Eliezer et al., 2021b; Hassidim et al., 2020; Woodruff and Zhou, 2021; Attias et al., 2021; Ben-Eliezer et al., 2021a; Cohen et al., 2022a, b), adaptive data analysis (Freedman, 1983; Ioannidis, 2005; Lukacs et al., 2009; Hardt and Ullman, 2014; Dwork et al., 2015a) and machine learning (Szegedy et al., 2013; Goodfellow et al., 2014; Athalye et al., 2018; Papernot et al., 2017).

Reviriego and Ting (2020) and Paterson and Raynal (2021) proposed attacks on the HLL sketch with standard estimators. The proposed attacks were in a streaming setting and utilized many dependent queries in order to detect keys whose insertion results in updates to the cardinality estimate. Our attacks are more general: we construct single-batch attacks with standard estimators and also construct attacks on these cardinality sketches that apply with any estimator. The question of the robustness of cardinality sketches to adaptive inputs is related to, but different from, the well-studied question of whether they are differentially private. Cardinality sketches were shown to be not privacy preserving when the sketch randomness or content is public (Desfontaines et al., 2019). Other works (Smith et al., 2020; Pagh and Stausholm, 2021; Knop and Steinke, 2023) showed that cardinality sketches are privacy preserving, but this is under the assumption that the randomness is used once. Our contribution here is designing attacks and quantifying their efficiency. The common ground with privacy is the high sensitivity of the sketch maps to the removal or insertion of low-priority keys.

Several works constructed attacks on linear sketches, including the Johnson-Lindenstrauss Transform (Cherapanamjeri and Nelson, 2020), the AMS sketch (Ben-Eliezer et al., 2021b; Cohen et al., 2022b), and CountSketch (Cohen et al., 2022a, b). The latter showed that the standard estimators for CountSketch and the AMS sketch can be compromised with a linear number of queries, and that the sketches with arbitrary estimators can be compromised with a quadratic number of queries. The method was to combine low-bias inputs with disjoint supports and have the bias amplified, since the bias increases linearly whereas the $\ell_{2}$ norm increases proportionally to $\sqrt{r}$. This approach does not work with cardinality sketches, which required a different attack structure: combining disjoint sets on which the estimate is slightly biased up will not amplify the bias. The common ground, perhaps surprisingly, is that these fundamental and popular sketches are all vulnerable to adaptive inputs in a similar manner: estimators that optimally use the sketch require linear-size attacks, while arbitrary correct estimators require quadratic-size attacks.

3 Preliminaries

An attack is an interaction designed to construct a set that is adversarial to the randomness $\rho$. An adversarial set can be identified by trying out a large number of inputs. We measure the efficiency of the attack by its size (number of issued queries) and concurrency (number of batches of concurrent queries).

A set $U$ is adversarial for the randomness $\rho$ if a sufficient statistic for the cardinality that is computed from $S_{\rho}(U)$ is very skewed with respect to its distribution under sampling of $\rho$. That is, it has proportionally too few or too many low-priority keys.

Definition 3.1 (Sufficient Statistics).

A statistic $T:S\mapsto\mathbb{R}$ on the sketch domain is sufficient for the cardinality $|U|$ if it includes all the information the sketch provides on the cardinality $|U|$. That is, for each $t$, the conditional distribution of the random variable $S_{\rho}(U)$ given $T(S_{\rho}(U))=t$ does not depend on $|U|$.

3.1 Composable Cardinality Sketches

The underlying technique in all small-space cardinality sketches is to use random hash maps $h$ that assign “priorities” to keys in $\mathcal{U}$. (Chakraborty et al. (2022) proposed a method for streaming cardinality estimates that does not require hash maps, but the sketch is not composable.) The sketch of a set is specified by the priorities of a small set of keys with the lowest priorities. This information is related to the cardinality, as smaller lowest priorities in a subset $U$ are indicative of larger cardinality. Therefore cardinality estimates can be recovered from the sketch. We describe several common designs. MinHash sketches (see the surveys (Cohen, 2008, 2023)) are suitable for insertions only (set unions and insertions of new elements) and are also suitable for sketch-based sampling. Domain sampling has priorities that are discretized sampling rates and has the advantage that the sketch can be represented as random linear maps (specified by $\rho$) of the data vector, and therefore supports deletions (negative entries in sketched vectors) (Ganguly, 2007).

MinHash sketches

Types of MinHash sketches

  • $k$-mins (Flajolet and Martin, 1985; Cohen, 1997). $k$ random hash functions $h_{1},\ldots,h_{k}$ map each key $x\in\mathcal{U}$ to i.i.d. samples from the domain of the hash function. The sketch $S_{\rho}(U)$ of a set $U$ is the list $(\min_{x\in U}h_{i}(x))_{i\in[k]}$ of minimum values of each hash function over the keys in $U$. With exponentially distributed hash values, the sketch distribution for a subset $U$ is $\textsf{Exp}[|U|]^{k}$, that is, $k$ i.i.d. exponentially distributed random variables with parameter $|U|$. The sum $T(S):=\|S\|_{1}$ is a sufficient statistic for estimating the parameter $|U|$. An unbiased cardinality estimator is $(k-1)/T(S)$.

  • Bottom-$k$ (Rosén, 1997; Cohen, 1997; Bar-Yossef et al., 2002). One random hash function $h$ maps each $x\in\mathcal{U}$ to i.i.d. samples from a distribution. The sketch $\{h(x)\mid x\in U\}_{(1:k)}$ stores the $k$ smallest hash values of keys $x\in U$. The $k$th smallest value $T(S):=\{h(x)\mid x\in U\}_{(k)}$ is a sufficient statistic for estimating $|U|$. When the distribution is $U[0,1]$, an unbiased cardinality estimate is $(k-1)/T(S)$.

  • $k$-partition (Flajolet et al., 2007b). One hash function $P:\mathcal{U}\to[k]$ randomly partitions keys into $k$ parts. One hash function $h$ maps keys to i.i.d. $\textsf{Exp}[1]$ samples. The sketch includes the minimum in each part: $(\min_{x\in U\mid P(x)=i}h(x))_{i\in[k]}$.

Note that the choice of (continuous) distribution does not affect the information content of the sketch. Variations of these sketches store rounded/truncated numbers (HLL (Flajolet et al., 2007b) stores a maximum negated exponent). When studying vulnerabilities of query response algorithms, the result is stronger when the full-precision representation is available to them.

The cardinality estimates obtained with these sketches have an NRMSE (Equation (1)) of $1/\sqrt{k}$.
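To make the MinHash variants concrete, here is a minimal composable $k$-mins sketch (our own illustrative implementation, not the paper's code; it derives Exp(1) priorities from a hash, uses the sufficient statistic $T(S)=\|S\|_1$, and the unbiased estimator $(k-1)/T(S)$; class and parameter names are ours):

```python
import hashlib
import math

class KMinsSketch:
    """Minimal k-mins MinHash sketch: k hash functions map keys to
    Exp(1) priorities; the sketch keeps the per-function minima."""

    def __init__(self, k: int, seed: str = "rho", mins=None):
        self.k, self.seed = k, seed
        self.mins = list(mins) if mins is not None else [math.inf] * k

    def _hash(self, key: str, i: int) -> float:
        h = hashlib.sha256(f"{self.seed}|{i}|{key}".encode()).digest()
        u = (int.from_bytes(h[:8], "big") + 1) / (2**64 + 1)  # u in (0,1]
        return -math.log(u)  # Exp(1)-distributed priority

    def add(self, key: str) -> None:
        for i in range(self.k):
            self.mins[i] = min(self.mins[i], self._hash(key, i))

    def merge(self, other: "KMinsSketch") -> "KMinsSketch":
        assert self.k == other.k and self.seed == other.seed
        return KMinsSketch(self.k, self.seed,
                           [min(a, b) for a, b in zip(self.mins, other.mins)])

    def estimate(self) -> float:
        # T(S) = ||S||_1 is sufficient; (k-1)/T(S) is unbiased.
        return (self.k - 1) / sum(self.mins)

keys = [f"key{i}" for i in range(2000)]
full, a, b = KMinsSketch(128), KMinsSketch(128), KMinsSketch(128)
for x in keys:
    full.add(x)
for x in keys[:1000]:
    a.add(x)
for x in keys[1000:]:
    b.add(x)

merged = a.merge(b)
assert merged.mins == full.mins       # composability: S(A) + S(B) = S(A ∪ B)
est = full.estimate()
assert 0.7 * 2000 < est < 1.3 * 2000  # NRMSE on the order of 1/sqrt(k)
```

The merge assertion is the composability property used throughout: the sketch of a union is computable from the sketches of the parts, with no access to the underlying sets.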

Definition 3.2 (bias of the sketch).

We say that the sketch $S_{\rho}(U)$ of a set $U$ is biased up by a factor of $1/\alpha$ when $T(S_{\rho}(U))\leq\alpha k/|U|$, and we say it is biased down by a factor of $\alpha$ when $T(S_{\rho}(U))\geq(1/\alpha)k/|U|$.

For our purposes, $\alpha\leq 1/2$ would place the sketch at the $\delta=e^{-\Omega(k)}$ tail of the distribution under sampling of $\rho$, and we say that $U$ is adversarial for $\rho$.

Domain sampling

These cardinality sketches can be expressed as discretized bottom-$k$ sketches. Therefore vulnerabilities of bottom-$k$ sketches also apply to domain sampling sketches. The input is viewed as a vector of dimension $|\mathcal{U}|$ where the set $U$ corresponds to its nonzero entries. The cardinality $|U|$ is thus the sparsity (number of nonzero entries). The sketch map $S_{\rho}$ is a dimensionality reduction via a random linear map (specified by $\rho$).

We sample the domain $\mathcal{U}=[n]$ with different rates $p=2^{-j}$. For each rate, we collect a count $c_{j}$ (capped at $k$) of the number of sampled keys from our set $X$. This can be done by storing the first $k$ distinct keys we see, or (approximately) by random hashing into a domain of size $k$ and counting how many cells were hit. A continuous version, known as liquid legions (Kreuter et al., 2020), is equivalent to a bottom-$k$ sketch: each key is assigned a random i.i.d. priority (the lowest sampling rate at which it is counted with domain sampling) and we seek the sampling rate at which we have $k$ keys.
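A minimal version of this scheme can be sketched as follows (our own simplified code under illustrative assumptions: the sampling level is the number of leading zero bits of a hash, and the hashing-into-$k$-cells approximation is omitted):

```python
import hashlib

def sample_level(key: str, seed: str = "rho") -> int:
    """Leading zero bits of the key's 64-bit hash: the key survives
    sampling at rate 2**-j exactly when its level is >= j."""
    h = int.from_bytes(hashlib.sha256((seed + key).encode()).digest()[:8], "big")
    return 64 - h.bit_length()

def domain_sampling_estimate(keys, k: int = 256) -> float:
    # Lower the sampling rate 2**-j until at most k keys survive
    # (the capped count c_j), then scale that count back up by 2**j.
    levels = [sample_level(x) for x in keys]
    j = 0
    while sum(1 for lev in levels if lev >= j) > k:
        j += 1
    c = sum(1 for lev in levels if lev >= j)
    return c * 2**j

keys = [f"key{i}" for i in range(10_000)]
est = domain_sampling_estimate(keys)
assert 0.6 * 10_000 < est < 1.4 * 10_000  # relative error ~ 1/sqrt(k)
```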

Specifying keys for the sketch

Note that with all these sketch maps, the sketch of a set $U$ is specified by a small subset $U_{0}\subset U$ of the “lowest priority” keys in $U$. With $k$-mins and $k$-partition sketches these are the keys $\arg\min_{x\in U}\{h_{i}(x)\}$ for $i\in[k]$. With bottom-$k$ sketches, these are the keys with the $k$ smallest values in $\{h(x)\}_{x\in U}$. With domain sampling, these are the keys with the highest sampling rates. Note that $|U_{0}|=O(k)\ll|U|$ but $S_{\rho}(U_{0})=S_{\rho}(U)$.
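This property, that a tiny subset $U_0$ carries the entire sketch, can be checked directly for a bottom-$k$ sketch (illustrative code with our own hash-based priorities):

```python
import hashlib

def hval(key: str, seed: str = "rho") -> float:
    """Pseudo-random priority in [0,1), fixed by the seed (the role of rho)."""
    h = hashlib.sha256((seed + key).encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def bottom_k_sketch(keys, k: int):
    return sorted(hval(x) for x in keys)[:k]

U = [f"key{i}" for i in range(1000)]
k = 8
U0 = sorted(U, key=hval)[:k]  # the k lowest-priority keys of U

# U0 is tiny, yet S(U0) == S(U): the other keys contribute nothing.
assert len(U0) == k
assert bottom_k_sketch(U0, k) == bottom_k_sketch(U, k)
```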

4 Attack on the “standard” estimators

The “standard” cardinality estimators optimally use the content of the sketch. They can be equivalently viewed as reporting a sufficient statistic. We design a single-batch attack, described in Algorithm 1. The algorithm fixes a ground set $N$ of keys. For each of $r$ queries, it samples a random subset $U\subset N$ where each $u\in N$ is included independently with probability $1/2$. It receives from the estimator the value of the sufficient statistic $T(S_{\rho}(U)):=1/\mathcal{M}(S_{\rho}(U))$ (we use the inverse of the cardinality estimate). For each key $x\in N$ it computes a score $A[x]$ that is the average value of $T(S_{\rho}(U))$ over all subsets where $x\in U$.

Input: $\rho$, $n$, $r$, $T$
Fix a set $N$ of $n$ keys // selected randomly, independently of $\rho$
foreach key $x\in N$ do // initialize
      $t[x]\leftarrow 0$
      $c[x]\leftarrow 0$
foreach $i=1,\ldots,r$ do
      $U\leftarrow$ include each $x\in N$ independently with prob. $\frac{1}{2}$
      foreach key $x\in U$ do // score keys
            $t[x]\leftarrow t[x]+1$
            $c[x]\leftarrow c[x]+T(S_{\rho}(U))$
return the keys in $N$ ordered by average score $A[x]=\frac{c[x]}{t[x]}$
Algorithm 1: Attack on “standard” estimators
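As a self-contained sanity check of Algorithm 1 (our own simulation, not the paper's experimental code): we instantiate a $k$-mins sketch by drawing the i.i.d. Exp(1) priorities directly in place of hash functions, run the single-batch attack with the sufficient statistic $T(S)=\|S\|_1$, and verify that the prefix of keys with the lowest average scores produces a grossly biased-up estimate. The parameters $n$, $k$, $r$, and the seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, r = 2000, 64, 2048                 # ground set, sketch size, #queries
prior = rng.exponential(size=(n, k))     # Exp(1) priorities: the randomness rho

tcnt = np.zeros(n)                       # t[x]: number of queries containing x
csum = np.zeros(n)                       # c[x]: summed statistic over those queries
for _ in range(r):
    mask = rng.random(n) < 0.5           # U: each key included w.p. 1/2
    T = prior[mask].min(axis=0).sum()    # sufficient statistic T(S) = ||S||_1
    tcnt[mask] += 1
    csum[mask] += T

score = csum / tcnt                      # A[x] = c[x] / t[x]
order = np.argsort(score)                # ascending average score

alpha = 0.1
U_adv = order[: int(alpha * n)]          # prefix of lowest-score keys
true_card = len(U_adv)
est = (k - 1) / prior[U_adv].min(axis=0).sum()  # standard k-mins estimate
# Low scores flag low-priority keys, so the prefix drags T(S) down and
# pushes the cardinality estimate far above the true size of the set.
assert est > 2 * true_card
```

Taking a suffix of the order (highest scores) instead yields a set that excludes the low-priority keys, biasing the estimate down, as in the experiments of Section 5.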

We show that for $\alpha>0$, an attack of size $O(k/\alpha^{2})$ produces an adversarial set whose sketch is biased up by a factor of $1/\alpha$ (see Definition 3.2).

Theorem 4.1 (Utility of Algorithm 1).

Consider Algorithm 1 with $k$-mins or bottom-$k$ sketches and $T(S)$ being the inverse of the cardinality estimate as specified in Section 3.1. For $\alpha>0$, set the parameters $n=\Omega(\frac{1}{\alpha}k\log(kr))$ and $r=O\left(\frac{k}{\alpha^{2}}\right)$. Then with probability at least $0.99$, the sketch $S_{\rho}(U_{\alpha})$, where $U_{\alpha}\subset N$ is the set of the $\alpha n$ keys with the lowest $A[u]$ scores, is biased up by a factor of $\Omega(1/\alpha)$:

$$\textsf{E}\left[\mathcal{M}(U_{\alpha})\right]=\Theta(n).$$

Our analysis extends to the case where the estimator reports $T(S_{\rho}(U))$ with relative error $O(1/\sqrt{k})$. That is, as long as the estimates are sufficiently accurate (within the order of the accuracy guarantees of a size-$k$ sketch), $O(k)$ attack queries suffice.

Analysis Highlights

The proof of Theorem 4.1 is presented in Appendix A. The high-level idea is as follows. We establish that the scores are correlated with the priorities of keys: the keys with the lowest priorities have, in expectation, lower scores. Therefore a prefix of the order will contain disproportionately more of them and overestimate the cardinality, and a suffix will contain disproportionately fewer of them and underestimate the cardinality.

We consider, for each key $x\in N$, the distribution of $T(S_{\rho}(U))$ conditioned on $x\in U$. We bound from above the variance of these distributions and bound from below the gap in means between the keys that have the “lowest priority” in $N$ and the bulk of the other keys in $N$. We then apply Chebyshev’s Inequality to bound the number of rounds needed so that enough of the low-priority keys have lower average scores $A[u]$ than “most” other keys. A nuance we overcame in the analysis was handling the dependence among the sketches of the different queries, which are all selected from the same ground set.

5 Experimental Evaluation

In this section, we empirically demonstrate the efficacy of our proposed attack (Algorithm 1) against the HyperLogLog (HLL) sketch (Durand and Flajolet, 2003; Flajolet et al., 2007b) with the HLL++ estimator (Heule et al., 2013). This is the most widely utilized sketch for cardinality estimation in practice. Given an accuracy parameter $\epsilon$, the HLL sketch stores $k=1.04\epsilon^{-2}$ values that are the negated exponents of a $k$-partition MinHash sketch (described in Section 3.1).

The HLL++ estimator is a hybrid that was introduced in order to improve accuracy on low-cardinality queries. When the sketch representation is sparse (fewer parts are populated), which is the case when the cardinality is lower than the sketch size, HLL++ uses the sketch as a hash table and estimates the cardinality based on the number of populated parts. This yields essentially precise values. When all parts are populated, HLL++ uses an estimator based on the MinHash property. We set the size of our ground set $n\gg k$ to be in this relevant regime.

We conduct two primary experiments: (i) for a fixed sketch size, we analyze the efficacy of the attack with a varying number of queries; (ii) for different sketch sizes, we evaluate the effectiveness of the attack with the number of queries linearly proportional to the sketch size. In the following, we first provide a detailed explanation of the ingredients of our experimental setup and subsequently present the results of each experiment.

Experiment setup.

To generate the data, we ensure that a ground set of size at least $10k$ is produced for a given sketch size $k$. The size of the ground set must be at least linearly larger than the sketch size to prevent the sketch from memorizing the entire dataset. Given the desired size of the ground set, we generate random strings over the English alphabet of a fixed length, where the length is chosen so that we can generate the desired number of distinct strings.

We utilize an open-source implementation of the HLL++ algorithm on GitHub. In this implementation, the sketch is configured by specifying the error rate $\epsilon\in(0,1)$, and the sketch size $k$ for error rate $\epsilon$ is $\lceil 1.04/\epsilon^{2}\rceil$ (consistent with Flajolet et al. (2007b)).

5.1 Efficacy with a varying number of queries

In this experiment we examine the impact of issuing a variable number of queries. The attack is executed with the same ground set for eight distinct query counts, where each count is a power of 4. At the conclusion, the algorithm generates scores and returns the keys sorted in ascending order of their scores. Keys with low scores correspond to low-priority keys, whose presence makes the estimate biased up. By including these keys in the adversarial set, we can trick the estimator into thinking that it is seeing the sketch of a large set. Similarly, we can construct adversarial input sets by including keys with high scores and trick the estimator into thinking it is seeing the sketch of a small set.

We present two sets of plots corresponding to how the estimator overestimates or underestimates as keys are incrementally added to the adversarial input in increasing or decreasing order of their scores. We consider two different error rates: $\epsilon=0.1$ with corresponding sketch size $k=104$, and $\epsilon=0.05$ with corresponding sketch size $k=416$. We use the same ground set comprising $5000$ keys for both sets of experiments. It is worth noting that the plot with one query, which oscillates around the line $y=x$ (denoted by a dashed line), is close to a non-adversarial setting, and we can see that the estimates are within the specified error $\epsilon$.

Figure 1 reports cardinality estimates when keys are added incrementally in increasing order of their scores. We can see that as we increase the number of queries, the gap between the estimated value and the $y=x$ line (actual value) widens. This gap indicates the overestimation error. Our algorithm is able to construct a more effective adversarial input with a larger number of queries. However, the gain in effectiveness becomes marginal at some point. For example, for $k=104$, we already see a substantial estimation error with $4096$ queries.

Figure 1: Attack on the HLL++ sketch and estimator, for a varying number of queries. Cardinality estimates for the prefix of keys with the lowest average scores after $r=4^{i}$ queries.

Figure 2 reports results when keys are added incrementally in decreasing order of their scores. The gap here corresponds to an underestimation error. We can see that the attacks are more effective with more queries.

Figure 2: Attack on the HLL++ sketch and estimator, for a varying number of queries. Cardinality estimates for the prefix of keys with the largest average scores after $r=4^{i}$ queries.

5.2 Efficacy of the attack with a varying sketch sizes

In this section, our focus is on examining HLL++ with different sketch sizes; namely, we consider six different error rates corresponding to sketch sizes $k=2^{i}$ for $i$ ranging from $6$ to $11$. For each sketch size $k$, we generate a ground set of size $n=10\cdot 10^{\lceil\log_{10}(k)\rceil}$ to ensure that the ground set is larger than the sketch size and the MinHash component of the HLL++ estimator is used. In Figure 3, we report the ratio of the estimated size to the actual size of the set for all subsets constructed as a prefix of the order on keys, sorted by increasing average scores $A[x]$, with the number of queries fixed at $4k$. In this cardinality regime, HLL++ is nearly unbiased, and we expect a ratio close to $1$ when the queries are not adversarial. However, by running attacks with a sufficient number of queries (linear in the sketch size), we are able to identify low-priority keys and then trick the estimator into giving an estimate much higher than the actual size of the set.

Figure 3: Attack on HLL++ for varying sketch sizes while utilizing queries of size 4 times the sketch size.

6 Attack Setup Against Strategic Estimators

We design attacks that apply generally against any query response (QR) algorithm. The attacks are effective even when the specifics of the attack and the full internal state of the attack algorithm are shared with the QR algorithm, including the per-step distribution from which the attacker selects each query. Moreover, we can even assume that the QR algorithm is provided with an enhanced sketch that includes the identities of the low-priority keys that determined the sketch, and that after the QR algorithm responds to a query, the full query set UU is shared with it. The only requirement from the QR algorithm is that it selects correct response maps (with respect to the query distribution). Note that such a powerful QR algorithm precludes attacks that use queries of fixed cardinality, since the QR algorithm can simply return that cardinality value without even considering the actual input sketch. (Our attack in Algorithm 1 is also precluded, since a fixed response of n/2n/2 satisfies the requirements: the cardinality is Binom(n,p)\textsf{Binom}(n,p) and, for nkn\gg k, all queries have size close to n/2n/2.)

Moreover, the task of the QR algorithm is the following problem that is more specialized than cardinality estimation:

Problem 6.1 (Soft Threshold AA).

Return 0 when |U|A|U|\leq A and 11 when |U|2A|U|\geq 2A.

Remark 6.2.

Soft Threshold can be solved via a cardinality estimate with a multiplicative error of at most 2\sqrt{2} by reporting 11 when the estimate is larger than 2A\sqrt{2}A and 0 otherwise. When estimates are computed from cardinality sketches with randomness ρ\rho that does not depend on the queries, a sketch of size k=Θ(log(1/δ))k=\Theta(\log(1/\delta)) is necessary and sufficient for providing correct responses with probability 1δ\geq 1-\delta.
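The reduction in the remark amounts to a single threshold test (a minimal sketch; the function and parameter names are ours, and `estimate` stands for any cardinality estimator with multiplicative error at most the square root of 2):

```python
import math

def soft_threshold_response(estimate: float, A: float) -> int:
    """Answer Soft Threshold from an estimate with multiplicative error <= sqrt(2):
      |U| <= A   implies  estimate <= sqrt(2)*A  -> return 0
      |U| >= 2A  implies  estimate >= sqrt(2)*A  -> return 1
    """
    return 1 if estimate > math.sqrt(2) * A else 0

A = 1000
assert soft_threshold_response(1300, A) == 0  # consistent with |U| <= A
assert soft_threshold_response(1500, A) == 1  # consistent with |U| >= 2A
```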

Attack Framework

We describe the attack framework here and specify the attacks themselves in Sections 7 and 8. We model the interaction as a process between three parties: the Attacker, the QR algorithm, and System. The attacker fixes a ground set NN from which it samples query sets. The product of the attack is a subset MNM\subset N which we refer to as a mask. The aim is for the mask to be much smaller than the query subset sizes, with the property that for a uniformly sampled UNU\subset N (|U||M||U|\gg|M|) the information in the sketch Sρ(MU)S_{\rho}(M\cup U) is insufficient for estimating |U||U|. The attack proceeds in steps described in Algorithm 2:

  • Attacker specifies a distribution 𝒟\mathcal{D} over the support 2N2^{N} and sends it to QR. It then selects a query U𝒟U\sim\mathcal{D} and sends it to System.
  • QR selects a correct map with δ=O(1/k)\delta=O(1/\sqrt{k}) (as in Definition 6.3) of sketches to probabilities Sπ(S)[0,1]S\mapsto\pi(S)\in[0,1]. The selection may depend on the prior interaction transcript and on 𝒟\mathcal{D}. If there is no correct map, QR reports failure and halts.
  • System computes the sketch Sρ(U)S_{\rho}(U) and sends it to QR.
  • QR sends ZBern[π(Sρ(U))]Z\sim\textsf{Bern}[\pi(S_{\rho}(U))] to Attacker. Attacker shares UU and its internal state with QR.
Algorithm 2 Attack Interaction Step
Definition 6.3 (Correct Map).

We say that the map π\pi is correct for AA and δ\delta and query distribution 𝒟\mathcal{D} if for any cardinality value, over the query distribution for this value, it returns a correct response to a soft threshold problem with AA with probability at least 1δ1-\delta. That is,

for c<AEU𝒟|U|=c(π(Sρ(U)))\displaystyle\text{for $c<A$, }\textsf{E}_{U\sim\mathcal{D}\mid|U|=c}(\pi(S_{\rho}(U))) δ\displaystyle\leq\delta
for c>2AEU𝒟|U|=c(π(Sρ(U)))\displaystyle\text{for $c>2A$, }\textsf{E}_{U\sim\mathcal{D}\mid|U|=c}(\pi(S_{\rho}(U))) 1δ.\displaystyle\geq 1-\delta\ .
Remark 6.4 (Many correct maps).

There can be multiple correct maps and QR may choose any one at any step. Since the output when |U|[A,2A]|U|\in[A,2A] is not specified, the probability EU𝒟π(Sρ(U))\textsf{E}_{U\sim\mathcal{D}}\pi(S_{\rho}(U)) of reporting Z=1Z=1 may vary by PrU𝒟[|U|[A,2A]]+δ\approx\textsf{Pr}_{U\sim\mathcal{D}}[|U|\in[A,2A]]+\delta between correct maps.

Recall that our attack on the standard estimators (Algorithm 1) issued a single batch of queries (all drawn from the same pre-specified distribution 𝒟0\mathcal{D}_{0}). We show that multiple batches are necessary to attack general QR algorithms:

Lemma 6.5 (Multiple batches are necessary).

Any attack of polynomial size in kk on a soft threshold estimator must use multiple batches.

Proof.

When there is a single batch of rr queries, we can apply the standard estimator while accessing only a “component” of the sketch of size k=O(log(r/δ))k^{\prime}=O(\log(r/\delta)) and obtain correct responses to all queries. This component consists of the same kk^{\prime} hash functions with kk-mins sketches, of only kk^{\prime} parts with a kk-partition sketch, or of the bottom-kk^{\prime} values with a bottom-kk sketch. Since the query responses only leak information on this component, a single-batch attacker can only compromise that component. Therefore, a number of queries exponential in kk is needed to construct an adversarial input in a single batch. ∎

7 Single-batch attack on symmetric QR

Algorithm 3 specifies a single-batch attack. We establish that the attack succeeds when we set the size r=O~(k2)r=\tilde{O}(k^{2}) and QR is constrained to be symmetric (see Definition 7.2). In essence, symmetry means that the QR algorithm does not significantly distinguish between components of the sketch. Symmetry excludes strategies that distinguish between components as in the proof of Lemma 6.5, but still allows flexibility, including randomly selecting components of the sketch. In Section 8 we extend this attack to an adaptive attack that works against any QR algorithm.

Input: ρ\rho, nn, rr
Select a set NN of nn keys
  // Randomly from 𝒰\mathcal{U}
An/16A\leftarrow n/16
foreach key xNx\in N do C[x]0C[x]\leftarrow 0
  // initialize
for i=1,,ri=1,\ldots,r do
       Sample qq as specified in Algorithm 5
        // using A,nA,n
       UU\leftarrow includes each uNu\in N independently with prob qq
       send UU
        // \to System
       receive ZZ // \leftarrow Symmetric Query Response
       foreach key xUx\in U do C[x]C[x]+ZC[x]\leftarrow C[x]+Z
        // score
C¯median{C[N]}\overline{C}\leftarrow\textrm{median}\{C[N]\}
  // Compute median score
return M{xNC[x]>C¯+Ω~(rk)}M\leftarrow\{x\in N\mid C[x]>\overline{C}+\tilde{\Omega}(\frac{r}{k})\}
  // Mask
Algorithm 3 Single Batch Attacker

The attacker initializes the scores C[x]0C[x]\leftarrow 0 of all keys xNx\in N in the ground set. Each query is formed by sampling a rate qq (as described in Algorithm 5) and selecting a random subset UNU\subset N so that each key in NN is included independently with probability qq. The attacker receives ZZ and increments by ZZ the score C[x]C[x] of all keys xUx\in U. The final product is the set MM of keys with scores that are higher by Ω~(r/k)\tilde{\Omega}(r/k) than the median score.
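The scoring loop can be sketched compactly (illustrative only: the rate sampler and the QR oracle are simplified stand-ins for Algorithm 5 and a real sketch-based responder; the toy oracle deliberately leaks membership of one hidden low-rank key so that the mask recovers it):

```python
import random
import statistics

def single_batch_attack(N, r, k, qr_oracle, sample_rate):
    """Skeleton of Algorithm 3: score keys by co-occurrence with Z=1 responses."""
    C = {x: 0 for x in N}
    for _ in range(r):
        q = sample_rate()                          # rate q (stand-in for Algorithm 5)
        U = [x for x in N if random.random() < q]  # include each key w.p. q
        Z = qr_oracle(U)                           # response bit from QR
        if Z:
            for x in U:
                C[x] += 1                          # score
    median = statistics.median(C.values())
    return {x for x in N if C[x] > median + r / k}  # keys ~r/k above the median

random.seed(0)
N = list(range(50))
# Toy QR whose response bit leaks whether a hidden low-rank key (key 0) was queried.
mask = single_batch_attack(N, r=2000, k=10,
                           qr_oracle=lambda U: 1 if 0 in U else 0,
                           sample_rate=lambda: 0.5)
assert mask == {0}
```

Here key 0 is scored on roughly half the steps while every other key is scored on roughly a quarter of them, so its score separates from the median by far more than r/k.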

Theorem 7.1 (Utility of Algorithm 3 with symmetric maps).

For α>0\alpha>0, set n=Ω(1αklog(kr))n=\Omega(\frac{1}{\alpha}k\log(kr)) and r=Ω~(k2α2)r=\tilde{\Omega}\left(\frac{k^{2}}{\alpha^{2}}\right). Then Pr[(Sρ(M)=Sρ(N))(|M|<αn)]0.99\textsf{Pr}[(S_{\rho}(M)=S_{\rho}(N))\land(|M|<\alpha n)]\geq 0.99.

7.1 Proof Overview

See Appendix B for details. We work with the rank-domain representations of the sketches with respect to the ground set NN. This representation simplifies our analysis since it depends only on the rank order of keys by their hash values, thereby factoring out the hash values themselves. The rank-domain sketches SR(U)S^{R}(U) have the form (Y1,,Yk)(Y_{1},\ldots,Y_{k}) where the YiY_{i} are positive integers. For fixed qq, the sketch distribution over the sampling of UU is that of kk independent Geom[q]\textsf{Geom}[q] random variables YiY_{i}. The sum T=i=1kYiT=\sum_{i=1}^{k}Y_{i} is a sufficient statistic for qq from the sketch.
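This model is straightforward to simulate (a sketch under the stated assumptions: k independent Geom[q] coordinates, with q recovered from the sufficient statistic T):

```python
import random
random.seed(1)

k, q = 1024, 0.05

def geom(q):
    """Sample Y ~ Geom[q]: number of trials up to and including the first success."""
    y = 1
    while random.random() >= q:
        y += 1
    return y

# Rank-domain sketch for a fixed rate q: k i.i.d. Geom[q] coordinates.
Y = [geom(q) for _ in range(k)]
T = sum(Y)        # sufficient statistic for q
q_hat = k / T     # estimate of q, since E[Y_i] = 1/q
assert abs(q_hat - q) < 0.01
```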

Definition 7.2 (symmetric map).

A map π\pi is symmetric if (i) it uses the rank-domain sketch as an unordered set and (ii) it is monotone, in that if a sketch S1S2S_{1}\leq S_{2} coordinate-wise then π(S1)π(S2)\pi(S_{1})\geq\pi(S_{2}).

We denote by N0N_{0}^{*} the set of kk lowest-rank (lowest priority) keys. It includes the bottom-kk keys with bottom-kk sketches, and the minimum hash key with respect to each of the kk hashmaps with kk-mins and kk-partition sketches. Note that Sρ(N)=Sρ(N0)S_{\rho}(N)=S_{\rho}(N_{0}^{*}). We denote by NNN^{\prime}\subset N a set of keys that are transparent – very unlikely to influence the sketch if included in the attack subsets of Algorithm 3. We show that a key in N0N_{0}^{*} obtains in expectation a higher score than a key in NN^{\prime} and use that to establish the utility claim:

Lemma 7.3 (separation with symmetric maps).

Let π\pi be correct and symmetric (Definition 7.2). Then for any mN0m\in N^{*}_{0} and uNu\in N^{\prime},

EU[π(Sρ(U))𝟏(mU)]EU[π(Sρ(U))𝟏(uU)]=Ω~(1/k)\textsf{E}_{U}[\pi(S_{\rho}(U))\cdot\mathbf{1}(m\in U)]-\textsf{E}_{U}[\pi(S_{\rho}(U))\cdot\mathbf{1}(u\in U)]=\tilde{\Omega}(1/k)
Proof.

See Appendix B.5. The gap only holds on average for general correct maps (Lemma B.6) but holds per-key when specialized to symmetric maps. ∎

Proof of Theorem 7.1.

The score C[x]C[x] of every transparent key uNu\in N^{\prime} has the same distribution: the sum i=1rZiBern[qi]\sum_{i=1}^{r}Z_{i}\cdot\textsf{Bern}[q_{i}] of rr independent Bernoulli (Poisson trial) random variables.

From Lemma 7.3, E[C[m]]E[C[u]]=Ω~(r/k)\textsf{E}[C[m]]-\textsf{E}[C[u]]=\tilde{\Omega}(r/k). For xN0x\in N^{*}_{0}, the gap random variables of different steps may be dependent, but the lower bound on the expected gap in Lemma 7.3 holds even when conditioned on the transcript. Additionally, the per-step gap is bounded in [1,1][-1,1]. We can therefore apply Chernoff bounds (Chernoff, 1952) to bound the probability that a sum deviates by more than λ\lambda from its expectation:

Pr[|C[x]E[C[x]]|λ]2e2λ2/r.\textsf{Pr}[|C[x]-\textsf{E}[C[x]]|\geq\lambda]\leq 2e^{-2\lambda^{2}/r}\ . (3)

Setting λ=cr/k\lambda=cr/k separates a key in N0N_{0}^{*} from a key in NN^{\prime} with probability 12e2c2r/k21-2e^{-2c^{2}r/k^{2}}. Choosing r=O(k2log|N|)r=O(k^{2}\log|N|), the scores separate, with high probability, all keys in N0N^{*}_{0} from all keys in NN^{\prime}. Note that there are only O~(k)\tilde{O}(k) non-transparent keys N0:=NNN_{0}:=N\setminus N^{\prime}. Therefore, with high probability, N0MN0N^{*}_{0}\subset M\subset N_{0} and (we can fix constants so that) |M|=O~(k)αn|M|=\tilde{O}(k)\leq\alpha n. Since N0MN^{*}_{0}\subset M, Sρ(M)=Sρ(N)S_{\rho}(M)=S_{\rho}(N). ∎
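The separation argument can be checked by simulation (assumed per-step scoring probabilities: p for a transparent key and p + 1/k for a low-rank key, reflecting the per-step gap of Lemma 7.3):

```python
import random
random.seed(2)

k = 32
r = 16 * k * k        # r on the order of k^2 steps
p = 0.25              # per-step scoring probability of a transparent key
gap = 1.0 / k         # per-step advantage of a low-rank key

low_rank = sum(random.random() < p + gap for _ in range(r))
transparent = [sum(random.random() < p for _ in range(r)) for _ in range(20)]

# Threshold at lambda = r*gap/2 above the transparent mean: by the Chernoff
# bound above, each side fails with probability <= 2*exp(-2*lambda^2/r).
lam = r * gap / 2
assert low_rank > p * r + lam
assert max(transparent) < p * r + lam
```

With these parameters the threshold sits more than four standard deviations from both means, so the low-rank key's score separates cleanly from all transparent scores.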

8 Adaptive Attack on General QR

Input: ρ\rho, nn, rr
Select a set NN of nn keys
  // Randomly from 𝒰\mathcal{U}
An/16A\leftarrow n/16; MM\leftarrow\emptyset
foreach key xNx\in N do C[x]0C[x]\leftarrow 0
  // initialize
for i=1,,ri=1,\ldots,r do
       Sample U𝒟0U\sim\mathcal{D}_{0} as in Algorithm 3
       send MUM\cup U to system
       receive ZZ from QR
       if failure then exit
       foreach key xUx\in U do // score keys
            C[x]C[x]+ZC[x]\leftarrow C[x]+Z
             if C[x]median(C[NM])+ilog(200nr)/2C[x]\geq\textrm{median}(C[N\setminus M])+\sqrt{i\log(200nr)/2} then // test if score is high
                  MM{x}M\leftarrow M\cup\{x\}
            
      send M,C,UM,C,U to QR
        // share internal state
      
return MM
  // Mask
Algorithm 4 Adaptive Attacker

An attack on general QR algorithms is given in Algorithm 4. The attacker maintains an initially empty set MNM\subset N of keys which we refer to as the mask. The query sets have the form MUM\cup U, where U𝒟0U\sim\mathcal{D}_{0} is sampled and scored as in Algorithm 3. A key is added to MM when its score separates from the median score. We establish the following:
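A compact simulation of this adaptive masking loop (illustrative only: the query rate and the QR oracle are simplified stand-ins, and the separation threshold follows the form used in Algorithm 4):

```python
import math
import random
import statistics

def adaptive_attack(N, r, qr_oracle):
    """Skeleton of Algorithm 4: grow a mask M of keys whose score
    separates from the median score of unmasked keys."""
    C = {x: 0 for x in N}
    M = set()
    n = len(N)
    for i in range(1, r + 1):
        U = {x for x in N if random.random() < 0.5}   # U ~ D0 (stand-in rate)
        Z = qr_oracle(M | U)                          # query the mask union U
        if Z:
            for x in U - M:
                C[x] += 1                             # score unmasked keys
        med = statistics.median(C[x] for x in N if x not in M)
        bar = math.sqrt(i * math.log(200 * n * r) / 2)
        M |= {x for x in N - M if C[x] >= med + bar}  # mask separated keys
    return M

random.seed(3)
N = set(range(50))
# Toy QR whose response bit leaks whether the hidden low-rank key 0 is queried.
M = adaptive_attack(N, r=2000, qr_oracle=lambda Q: 1 if 0 in Q else 0)
assert 0 in M and len(M) <= 3
```

Once key 0 enters the mask, every query M ∪ U contains it, the oracle's bit no longer depends on U, and the remaining scores rise together without further separation.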

Theorem 8.1 (Utility of Algorithm 4).

For α>0\alpha>0, set n=Ω(1αklog(kr))n=\Omega(\frac{1}{\alpha}k\log(kr)) and r=Ω~(k2α2)r=\tilde{\Omega}\left(\frac{k^{2}}{\alpha^{2}}\right). Then with probability at least 0.990.99, |M|<αn|M|<\alpha n and there is no correct map for the query distribution MUM\cup U where U𝒟0U\sim\mathcal{D}_{0} is as in Algorithm 3.

We overview the proof with details deferred to Appendix B. The condition for adding a key to MM is such that with probability at least 0.990.99, only N0N_{0} keys are placed in MM, so |M|αn|M|\leq\alpha n (Claim B.9). If the QR algorithm fails, there is no correct map for the distribution M𝒟0M\cup\mathcal{D}_{0}. (A situation with no correct map can be identified by Attacker by tracking the error rate of QR, even if it is not declared by QR.) It remains to consider the case where the attack is not halted.

Since the mask MM is shared with QR “for free,” QR only needs to estimate |U||U| (or qq). But the sketch of MUM\cup U partially masks the sketch of UU. The set of non-transparent keys N0N0N^{\prime}_{0}\subset N_{0} decreases as MM increases. Additionally, the effective sketch size kkk^{\prime}\leq k is lower (that is, QR only obtains kk^{\prime} i.i.d Geom[q]\textsf{Geom}[q] random variables). Recall that when k<log(k)/2k^{\prime}<\log(k)/2, there is no correct map.

With general correct maps, we can only establish a weaker average version of the score gap over N0N^{\prime}_{0} keys. This allows some N0N^{\prime}_{0} keys to remain indistinguishable by score from transparent keys. But what works in Attacker’s favour is that in this case the score of other N0N_{0} keys must increase faster. Let p(π,M,x)p(\pi,M,x) be the probability that key xx is scored with map π\pi and mask MM. The probability is the same for all transparent keys xN0x\not\in N^{\prime}_{0} and we denote it by p(π,M)p^{\prime}(\pi,M). We establish (see Lemma B.8) that for a correct map π\pi for M𝒟0M\cup\mathcal{D}_{0} it holds that

xN0(p(π,M,x)p(π,M))=Ω~(1).\sum_{x\in N^{\prime}_{0}}\left(p(\pi,M,x)-p^{\prime}(\pi,M)\right)=\tilde{\Omega}(1)\ .

Therefore, in r=O~(k2)r=\tilde{O}(k^{2}) steps, the combined score advantage of N0N_{0} keys is (concentrated well around) O~(k2)\tilde{O}(k^{2}). But crucially, no single key can gain too much advantage: once C(x)C¯=Ω~(k)C(x)-\overline{C}=\tilde{\Omega}(k) (where C¯\overline{C} is the median score), key xx is placed in the mask MM, exits N0N^{\prime}_{0}, and stops being scored. Therefore, if QR does not fail, Ω~(r/k)\tilde{\Omega}(r/k) keys are eventually placed in MM, and these must include all N0N_{0} keys.

9 Conclusion

We demonstrated the inherent vulnerability of the known composable cardinality sketches to adaptive inputs. We designed attacks that use a number of queries that asymptotically matches the upper bounds: a linear number of queries with the “standard” estimator and a quadratic number of queries with any estimator applied to the sketch. Empirically, our attacks are simple and effective with small constants. An interesting direction for further study is to show that this vulnerability applies with any composable sketch structure. On the positive side, we suspect that restricting the maximum number of queries that any one key can participate in to sublinear (with standard estimators) or subquadratic (with general estimators) would enhance robustness.

Acknowledgements

The authors are grateful to Jelani Nelson and Uri Stemmer for discussions. Edith Cohen is partially supported by Israel Science Foundation (grant no. 1156/23).

References

  • Ahn et al. [2012] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Analyzing graph structure via linear measurements. In Proceedings of the 2012 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 459–467, 2012. doi: 10.1137/1.9781611973099.40. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611973099.40.
  • Apache Software Foundation [Accessed: 2024] Apache Software Foundation. DataSketches, Accessed: 2024. URL https://datasketches.apache.org. Apache Software Foundation Documentation.
  • Athalye et al. [2018] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In International conference on machine learning, pages 284–293. PMLR, 2018.
  • Attias et al. [2021] Idan Attias, Edith Cohen, Moshe Shechner, and Uri Stemmer. A framework for adversarial streaming via differential privacy and difference estimators. CoRR, abs/2107.14527, 2021.
  • Bar-Yossef et al. [2002] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM. ACM, 2002.
  • Bassily et al. [2021] Raef Bassily, Kobbi Nissim, Adam D. Smith, Thomas Steinke, Uri Stemmer, and Jonathan R. Ullman. Algorithmic stability for adaptive data analysis. SIAM J. Comput., 50(3), 2021. doi: 10.1137/16M1103646. URL https://doi.org/10.1137/16M1103646.
  • Beimel et al. [2022] Amos Beimel, Haim Kaplan, Yishay Mansour, Kobbi Nissim, Thatchaphol Saranurak, and Uri Stemmer. Dynamic algorithms against an adaptive adversary: generic constructions and lower bounds. page 1671–1684, 2022. doi: 10.1145/3519935.3520064. URL https://doi.org/10.1145/3519935.3520064.
  • Ben-Eliezer et al. [2021a] Omri Ben-Eliezer, Talya Eden, and Krzysztof Onak. Adversarially robust streaming via dense-sparse trade-offs. CoRR, abs/2109.03785, 2021a.
  • Ben-Eliezer et al. [2021b] Omri Ben-Eliezer, Rajesh Jayaram, David P. Woodruff, and Eylon Yogev. A framework for adversarially robust streaming algorithms. SIGMOD Rec., 50(1):6–13, 2021b.
  • Boneh and Shaw [1998] Dan Boneh and James Shaw. Collusion-secure fingerprinting for digital data. IEEE Trans. Inf. Theory, 44(5):1897–1905, 1998. doi: 10.1109/18.705568. URL https://doi.org/10.1109/18.705568.
  • Chakraborty et al. [2022] Sourav Chakraborty, N. V. Vinodchandran, and Kuldeep S. Meel. Distinct elements in streams: An algorithm for the (text) book. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi: 10.4230/LIPICS.ESA.2022.34. URL https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2022.34.
  • Cherapanamjeri and Nelson [2020] Yeshwanth Cherapanamjeri and Jelani Nelson. On adaptive distance estimation. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • Chernoff [1952] H. Chernoff. A measure of the asymptotic efficiency for test of a hypothesis based on the sum of observations. Annals of Math. Statistics, 23:493–509, 1952.
  • Cohen [1997] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 55:441–453, 1997.
  • Cohen [2015] E. Cohen. All-distances sketches, revisited: HIP estimators for massive graphs analysis. TKDE, 2015. URL http://arxiv.org/abs/1306.3284.
  • Cohen [2008] Edith Cohen. Min-Hash Sketches, pages 1–7. Springer US, Boston, MA, 2008. ISBN 978-3-642-27848-8. doi: 10.1007/978-3-642-27848-8_573-1. URL https://doi.org/10.1007/978-3-642-27848-8_573-1.
  • Cohen [2018] Edith Cohen. Stream sampling framework and application for frequency cap statistics. ACM Trans. Algorithms, 14(4):52:1–52:40, 2018. ISSN 1549-6325. doi: 10.1145/3234338.
  • Cohen [2023] Edith Cohen. Sampling big ideas in query optimization. In Floris Geerts, Hung Q. Ngo, and Stavros Sintos, editors, Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2023, Seattle, WA, USA, June 18-23, 2023, pages 361–371. ACM, 2023. doi: 10.1145/3584372.3589935. URL https://doi.org/10.1145/3584372.3589935.
  • Cohen and Geri [2019] Edith Cohen and Ofir Geri. Sampling sketches for concave sublinear functions of frequencies. In NeurIPS, 2019.
  • Cohen et al. [2022a] Edith Cohen, Xin Lyu, Jelani Nelson, Tamás Sarlós, Moshe Shechner, and Uri Stemmer. On the robustness of countsketch to adaptive inputs. In ICML, volume 162 of Proceedings of Machine Learning Research, pages 4112–4140. PMLR, 2022a.
  • Cohen et al. [2022b] Edith Cohen, Jelani Nelson, Tamás Sarlós, and Uri Stemmer. Tricking the hashing trick: A tight lower bound on the robustness of CountSketch to adaptive inputs. arXiv:2207.00956, 2022b. doi: 10.48550/ARXIV.2207.00956. URL https://arxiv.org/abs/2207.00956.
  • Desfontaines et al. [2019] Damien Desfontaines, Andreas Lochbihler, and David A. Basin. Cardinality estimators do not preserve privacy. In Privacy Enhancing Technologies Symposium, volume 19, 2019. doi: https://doi.org/10.2478/popets-2019-0018. URL http://arxiv.org/abs/1808.05879.
  • Durand and Flajolet [2003] M. Durand and P. Flajolet. Loglog counting of large cardinalities (extended abstract). In ESA, 2003.
  • Dwork et al. [2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
  • Dwork et al. [2015a] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In STOC, pages 117–126. ACM, 2015a.
  • Dwork et al. [2015b] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC ’15, page 117–126, New York, NY, USA, 2015b. Association for Computing Machinery. ISBN 9781450335362. doi: 10.1145/2746539.2746580. URL https://doi.org/10.1145/2746539.2746580.
  • Flajolet and Martin [1985] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31:182–209, 1985.
  • Flajolet et al. [2007a] P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AofA). DMTCS, 2007a.
  • Flajolet et al. [2007b] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Discrete mathematics & theoretical computer science, (Proceedings), 2007b.
  • Freedman [1983] David A. Freedman. A note on screening regression equations. The American Statistician, 37(2):152–155, 1983. doi: 10.1080/00031305.1983.10482729. URL https://www.tandfonline.com/doi/abs/10.1080/00031305.1983.10482729.
  • Ganguly [2007] Sumit Ganguly. Counting distinct items over update streams. Theoretical Computer Science, 378(3):211–222, 2007. ISSN 0304-3975. doi: https://doi.org/10.1016/j.tcs.2007.02.031. URL https://www.sciencedirect.com/science/article/pii/S0304397507001223. Algorithms and Computation.
  • Gawrychowski et al. [2020] Pawel Gawrychowski, Shay Mozes, and Oren Weimann. Minimum cut in o(m log2 n) time. In ICALP, volume 168 of LIPIcs, pages 57:1–57:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
  • Goodfellow et al. [2014] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Google Cloud [Accessed: 2024] Google Cloud. BigQuery Documentation: Approximate Aggregate Functions, Accessed: 2024. URL https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions. Google Cloud Documentation.
  • Gutenberg and Wulff-Nilsen [2020] Maximilian Probst Gutenberg and Christian Wulff-Nilsen. Decremental SSSP in weighted digraphs: Faster and against an adaptive adversary. In Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’20, page 2542–2561, USA, 2020. Society for Industrial and Applied Mathematics.
  • Hardt and Ullman [2014] M. Hardt and J. Ullman. Preventing false discovery in interactive data analysis is hard. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), pages 454–463. IEEE Computer Society, 2014. doi: 10.1109/FOCS.2014.55. URL https://doi.ieeecomputersociety.org/10.1109/FOCS.2014.55.
  • Hardt and Woodruff [2013] Moritz Hardt and David P. Woodruff. How robust are linear sketches to adaptive inputs? In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC ’13, page 121–130, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450320290. doi: 10.1145/2488608.2488624. URL https://doi.org/10.1145/2488608.2488624.
  • Hassidim et al. [2020] Avinatan Hassidim, Haim Kaplan, Yishay Mansour, Yossi Matias, and Uri Stemmer. Adversarially robust streaming algorithms via differential privacy. In Annual Conference on Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Heule et al. [2013] S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT, 2013.
  • Indyk and Motwani [1998] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.
  • Ioannidis [2005] John P. A. Ioannidis. Why most published research findings are false. PLoS Med, (2):8, 2005.
  • Janson [2017a] Svante Janson. Tail bounds for sums of geometric and exponential variables, 2017a. URL https://arxiv.org/abs/1709.08157.
  • Janson [2017b] Svante Janson. Tail bounds for sums of geometric and exponential variables, 2017b. URL https://arxiv.org/abs/1709.08157.
  • Jayaram and Woodruff [2023] Rajesh Jayaram and David P. Woodruff. Towards optimal moment estimation in streaming and distributed models. ACM Trans. Algorithms, 19(3), jun 2023. ISSN 1549-6325. doi: 10.1145/3596494. URL https://doi.org/10.1145/3596494.
  • Kane et al. [2010] D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, 2010.
  • Knop and Steinke [2023] Alexander Knop and Thomas Steinke. Counting distinct elements under person-level differential privacy. CoRR, abs/2308.12947, 2023. doi: 10.48550/ARXIV.2308.12947. URL https://doi.org/10.48550/arXiv.2308.12947.
  • Knuth [1998] D. E. Knuth. The Art of Computer Programming, Vol 2, Seminumerical Algorithms. Addison-Wesley, 2nd edition, 1998.
  • Kreuter et al. [2020] Benjamin Kreuter, Craig William Wright, Evgeny Sergeevich Skvortsov, Raimundo Mirisola, and Yao Wang. Privacy-preserving secure cardinality and frequency estimation. Technical report, Google, LLC, 2020.
  • Lukacs et al. [2009] Paul M. Lukacs, Kenneth P. Burnham, and David R. Anderson. Model selection bias and Freedman’s paradox. Annals of the Institute of Statistical Mathematics, 62(1):117, 2009.
  • Mironov et al. [2008] Ilya Mironov, Moni Naor, and Gil Segev. Sketching in adversarial environments. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC ’08, page 651–660, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781605580470. doi: 10.1145/1374376.1374471. URL https://doi.org/10.1145/1374376.1374471.
  • Nelson et al. [2014] Jelani Nelson, Huy L. Nguyễn, and David P. Woodruff. On deterministic sketching and streaming for sparse recovery and norm estimation. Lin. Alg. Appl., 441:152–167, January 2014. Preliminary version in RANDOM 2012.
  • Pagh and Stausholm [2021] Rasmus Pagh and Nina Mesing Stausholm. Efficient differentially private F0 linear sketching. In Ke Yi and Zhewei Wei, editors, 24th International Conference on Database Theory (ICDT 2021), volume 186 of Leibniz International Proceedings in Informatics (LIPIcs), pages 18:1–18:19, Dagstuhl, Germany, 2021. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. ISBN 978-3-95977-179-5. doi: 10.4230/LIPIcs.ICDT.2021.18. URL https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2021.18.
  • Papernot et al. [2017] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pages 506–519, 2017.
  • Paterson and Raynal [2021] Kenneth G. Paterson and Mathilde Raynal. Hyperloglog: Exponentially bad in adversarial settings. Cryptology ePrint Archive, Paper 2021/1139, 2021. URL https://eprint.iacr.org/2021/1139. https://eprint.iacr.org/2021/1139.
  • Pettie and Wang [2021] Seth Pettie and Dingyu Wang. Information theoretic limits of cardinality estimation: Fisher meets shannon. In Samir Khuller and Virginia Vassilevska Williams, editors, STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, June 21-25, 2021, pages 556–569. ACM, 2021. doi: 10.1145/3406325.3451032. URL https://doi.org/10.1145/3406325.3451032.
  • Reviriego and Ting [2020] Pedro Reviriego and Daniel Ting. Security of hyperloglog (HLL) cardinality estimation: Vulnerabilities and protection. IEEE Commun. Lett., 24(5):976–980, 2020. doi: 10.1109/LCOMM.2020.2972895. URL https://doi.org/10.1109/LCOMM.2020.2972895.
  • Rosén [1997] B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135–158, 1997.
  • Shiloach and Even [1981] Yossi Shiloach and Shimon Even. An on-line edge-deletion problem. J. ACM, 28(1):1–4, jan 1981. ISSN 0004-5411. doi: 10.1145/322234.322235. URL https://doi.org/10.1145/322234.322235.
  • Smith et al. [2020] Adam Smith, Shuang Song, and Abhradeep Guha Thakurta. The flajolet-martin sketch itself preserves differential privacy: Private counting with minimal space. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 19561–19572. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/e3019767b1b23f82883c9850356b71d6-Paper.pdf.
  • Steinke and Ullman [2015] Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of preventing false discovery. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1588–1628, Paris, France, 03–06 Jul 2015. PMLR. URL https://proceedings.mlr.press/v40/Steinke15.html.
  • Szegedy et al. [2013] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Wajc [2020] David Wajc. Rounding Dynamic Matchings against an Adaptive Adversary. Association for Computing Machinery, New York, NY, USA, 2020. URL https://doi.org/10.1145/3357713.3384258.
  • Woodruff and Zhou [2021] David P. Woodruff and Samson Zhou. Tight bounds for adversarially robust streams and sliding windows via difference estimators. In Proceedings of the 62nd IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2021.

Appendix A Analysis of the Attack on the Standard Estimators

This section includes the proof of Theorem 4.1. We first consider kk-mins sketches and T(S)=S1T(S)=\|S\|_{1}. The modifications needed for bottom-kk sketches are in Section A.4.

A.1 Preliminaries

The following order-statistics properties are useful for analysing MinHash sketches. Let XiExp[1]X_{i}\sim\textsf{Exp}[1] for i[n]i\in[n] be i.i.d. random variables. Then the minimum value and the differences between the (i+1)(i+1)th and the iith order statistics (smallest values) are independent random variables with distributions

Δ1\displaystyle\Delta_{1} :=mini[n]XiExp[n]\displaystyle:=\min_{i\in[n]}X_{i}\sim\textsf{Exp}[n]
Δi+1\displaystyle\Delta_{i+1} :={Xi}(i+1){Xi}(i)Exp[ni]\displaystyle:=\{X_{i}\}_{(i+1)}-\{X_{i}\}_{(i)}\sim\textsf{Exp}[n-i] i1\displaystyle i\geq 1
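These order-statistics facts are easy to verify empirically (a quick Monte Carlo check with arbitrary parameter values):

```python
import random
random.seed(4)

n, trials = 100, 10000  # n i.i.d. Exp(1) variables per trial

mins, gaps = [], []
for _ in range(trials):
    xs = sorted(random.expovariate(1.0) for _ in range(n))
    mins.append(xs[0])          # minimum ~ Exp[n]
    gaps.append(xs[1] - xs[0])  # first spacing ~ Exp[n-1]

mean_min = sum(mins) / trials
mean_gap = sum(gaps) / trials
assert abs(mean_min - 1.0 / n) < 0.002       # E[Exp[n]] = 1/n
assert abs(mean_gap - 1.0 / (n - 1)) < 0.002  # E[Exp[n-1]] = 1/(n-1)
```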
Lemma A.1 (Chebyshev’s Inequality).
Pr[|ZE[Z]|cσ]1/c2.\textsf{Pr}\left[|Z-\textsf{E}[Z]|\geq c\sigma\right]\leq 1/c^{2}\ .

We set some notation: For a fixed ground set N and randomness ρ, for each hash function i ∈ [k], let m^i_j ∈ N be the key with the jth rank in the ith hashmap; that is, h_i(m^i_j) is the jth smallest value in {h_i(u)}_{u∈N}. Let

L:=\log_{2}(rk)+10
N^{i}_{0}:=\{m^{i}_{j}\}_{j\leq L}
N_{0}:=\bigcup_{i\in[k]}N^{i}_{0}
N^{\prime}:=N\setminus N_{0}

be a rank threshold L, for i ∈ [k] the set N^i_0 of keys with rank up to L in the ith hashmap, the set N_0 that is the union of these keys across hashmaps, and the set N′ of the remaining keys in N.

We show that a choice of n = O(kL/α) ensures that certain properties hold that simplify our analysis. Our analysis applies under the event that these properties are satisfied:

Lemma A.2 (Good draws).

For n = Ω((1/α)·k·log(rk)), the following hold with probability at least 0.99:

  • p1

    (property of ρ and N) The keys m^i_j for i ∈ [k] and j ≤ L are distinct.

  • p2

    In a run of Algorithm 1, in all r steps and for all i ∈ [k], U includes a key from N^i_0.

  • p3

    n ≥ 3kL/α

Proof.

p3 is immediate. For p1, note that if we set n ≥ (2/p)·kL then the claim follows with probability 1−p by the birthday paradox. For p2, the probability that a random U does not include one of the L smallest-rank keys of a particular hash function is 2^{−L}. The probability that this happens for any of the k hash functions in any of the r rounds is at most rk·2^{−L}. Substituting L = log_2(rk) + 10 we get the claim. ∎

For fixed N and ρ, consider the random variable Z := T(S_ρ(U)) over the sampling of U, and the contribution Z_i of hash function i ∈ [k] to Z:

Z_{i}:=\min_{x\in U}h_{i}(x)
Z:=\sum_{i\in[k]}Z_{i}

For a key u ∈ N, we consider the random variables Z_i ∣ u∈U and Z ∣ u∈U, conditioned on u ∈ U. By property p1 of Lemma A.2, the Z_i are independent, also when conditioned on u ∈ U.
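To make the variables concrete, a toy simulation of the Z_i and Z can look as follows (our construction: `kmins_sketch` and the Exp[1] "hash values" derived from a seeded PRNG are illustrative stand-ins for the actual hash functions):

```python
import random

def kmins_sketch(U, k, seed=0):
    """k-mins MinHash sketch: Z_i = min over x in U of h_i(x), where each
    'hash value' h_i(x) is an Exp[1] draw derived deterministically from
    (seed, i, x)."""
    def h(i, x):
        return random.Random(f"{seed}:{i}:{x}").expovariate(1.0)
    return [min(h(i, x) for x in U) for i in range(k)]

def statistic(sketch):
    """The statistic Z = sum over i of Z_i."""
    return sum(sketch)

Z = statistic(kmins_sketch(range(1000), k=64))
# E[Z] = k/|U| = 0.064, with standard deviation sqrt(k)/|U| = 0.008
```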

A.2 Proof outline

We will need the following two lemmas (their proofs are deferred to Section A.3). Intuitively, we expect E_U[Z] to be lower when conditioned on m^i_1 ∈ U. We bound this gap from below. For fixed ρ and N, let

G(u,v):=\textsf{E}_{U\mid u\in U}[Z]-\textsf{E}_{U\mid v\in U}[Z]\ .
Lemma A.3 (Expectations gap bound).

For each i ∈ [k] and δ > 0,

\textsf{Pr}_{\rho,N}\left[\min_{u\in N^{\prime}}G(u,m^{i}_{1})\geq\frac{\delta}{3n}\right]\geq 1-\delta\ .

We bound from above the maximum, over u ∈ N, of Var_{U∣u∈U}[Z]:

Lemma A.4 (Variance bound).

For every δ > 0 there is a constant c such that

\textsf{Pr}_{\rho,N}\left[\max_{u\in N}\textsf{Var}_{U\mid u\in U}[Z]\leq c\left(1+\frac{1}{\sqrt{k\delta}}\right)\frac{k}{n^{2}}\right]\geq 1-\delta\ .

We use the following lemma to bound the number r of attack queries needed so that the sorted order by average score separates the minimum-hash keys from the bulk of the keys in N′:

Lemma A.5 (Separation).

Let α > 0. Assume that

  • min_{u∈N′} G(u, m^i_1) ≥ M > 0

  • max_{u∈N} Var_{U∣u∈U}[Z] ≤ V²

  • during Algorithm 1, each of the keys u ∈ N′ and m^i_1 is selected into U in at least r′ ≥ (2V²/M²)·(1/α) rounds.

Then

\textsf{Pr}[A[u]<A[m^{i}_{1}]]\leq\alpha
Proof.

Consider the random variable

Y=A[u]-A[m^{i}_{1}]\ .

From our assumptions:

\textsf{E}[Y]\geq M
\textsf{Var}[Y]\leq\frac{2V^{2}}{r^{\prime}}

Since r′ ≥ (2V²/M²)(1/α), we have M ≥ (1/√α)·(√2 V/√r′). Using Chebyshev’s inequality (Lemma A.1), we get

\textsf{Pr}[Y<0]\leq\textsf{Pr}[|Y-\textsf{E}[Y]|\geq\textsf{E}[Y]]\leq\textsf{Pr}[|Y-\textsf{E}[Y]|\geq M]\leq\textsf{Pr}\left[|Y-\textsf{E}[Y]|\geq\frac{1}{\sqrt{\alpha}}\cdot\frac{\sqrt{2}V}{\sqrt{r^{\prime}}}\right]\leq\alpha\ .

∎
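A small numerical experiment in the spirit of Lemma A.5 (our construction; Gaussian noise stands in for the actual score distribution, which the lemma does not assume) shows that averaging r′ ≈ 2V²/(M²α) scores per key keeps the failure probability below α:

```python
import random

def separation_failure_rate(M, V, alpha, trials=2000, seed=0):
    """Empirical Pr[A[u] < A[m]] when the per-round scores of m and u have
    means 0 and M (gap M) and standard deviation V, and each key is scored
    in r' = ceil(2 V^2 / (M^2 alpha)) rounds."""
    rng = random.Random(seed)
    r = int(2 * V * V / (M * M * alpha)) + 1
    failures = 0
    for _ in range(trials):
        a_m = sum(rng.gauss(0.0, V) for _ in range(r)) / r  # score of m
        a_u = sum(rng.gauss(M, V) for _ in range(r)) / r    # score of u
        failures += a_u < a_m
    return failures / trials

rate = separation_failure_rate(M=0.5, V=1.0, alpha=0.05)
```

Chebyshev guarantees a failure rate of at most α; with Gaussian noise it is far smaller.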

We are now ready to conclude the utility proof of Algorithm 1:

Lemma A.6 (Utility of Algorithm 1).

Let α, δ > 0 and consider Algorithm 1 with r = O(k/(δα)). Then for each i ∈ [k], with probability 1−δ, for every u ∈ N′ we have Pr[A[u] < A[m^i_1]] ≤ α.

Proof.

Consider a key m^i_1 and a key u ∈ N′. We bound the probability that A[u] < A[m^i_1].

From Lemma A.4 (applied with δ = 1/k), with probability 1−1/k we have V² = O(k/n²). From Lemma A.3, for each i, with probability 1−δ we have M ≥ δ/(3n). If we choose r = 4r′ in Algorithm 1, then with high probability each key x ∈ N is selected into U at least r′ times. The claim follows from Lemma A.5 by setting r′ = O(k/(δα)). ∎

We are now ready to conclude the proof of Theorem 4.1. Recall that a subset U of keys of size M = 2αn has T(S(U)), over random ρ, concentrated around k/M with standard error √k/M. We now consider the prefix U of M = 2αn keys in the sorted order by average scores. This selection has E[T(S(U))] ≤ (3α)·k/(2αn). To see this, note that a (1−δ) fraction of the MinHash values are the same as in the sketch of N, and therefore have expected value 1/n. The remaining δ fraction have expected value 1/(2αn). Therefore E[T_ρ(S(U))] = k((1−δ)/n + δ/(2αn)) = (k/(2αn))(δ + (1−δ)2α). Hence T is only a (δ+2α) fraction of the value expected for a set of this size, which for small α and δ < α translates into a large constant multiplicative error.

A.3 Proofs of Lemma A.3 and Lemma A.4

For fixed ρ (and N), for i ∈ [k] and j ∈ [n−1], denote by

\Delta^{i}_{1}:=\min_{x\in N}h_{i}(x)\equiv h_{i}(m^{i}_{1})
\Delta^{i}_{j}:=\{h_{i}(x)\mid x\in N\}_{(j)}-\{h_{i}(x)\mid x\in N\}_{(j-1)}\equiv h_{i}(m^{i}_{j})-h_{i}(m^{i}_{j-1}) \qquad \text{for } j>1

the gap between the (j−1)th and jth smallest values in {h_i(x)}_{x∈N}.

Lemma A.7 (Properties of Δ^i_j).

The random variables Δ^i_j, for i ∈ [k] and j ∈ [L], over the sampling of ρ, are independent with distributions Δ^i_j ~ Exp[n−j+1]. This holds also when conditioning on properties p1 and p2 of Lemma A.2.

Proof.

It follows from properties of the exponential distribution that, over the sampling of ρ, the Δ^i_j ~ Exp[n−j+1] are independent random variables across i and j. Note that p1 and p2 do not depend on the actual values of the h_i(m^i_j), only on the ranks in the order. ∎

We can now express the distribution of the random variable Z_i in terms of the Δ^i_j. For j ≥ 1, the probability that Z_i = Σ_{ℓ=1}^j Δ^i_ℓ is 2^{−j}/(1−2^{−L}): this corresponds to the event that U does not include the keys m^i_ℓ for ℓ < j and includes the key m^i_j. The normalizing factor (1−2^{−L}) arises from property p2 of Lemma A.2; in the sequel we omit it for brevity.

Z_{i}=\begin{cases}\Delta^{i}_{1}&\text{with probability }2^{-1}/(1-2^{-L})\\ \Delta^{i}_{1}+\Delta^{i}_{2}&\text{with probability }2^{-2}/(1-2^{-L})\\ \Delta^{i}_{1}+\Delta^{i}_{2}+\Delta^{i}_{3}&\text{with probability }2^{-3}/(1-2^{-L})\\ &\vdots\\ \Delta^{i}_{1}+\cdots+\Delta^{i}_{j}&\text{with probability }2^{-j}/(1-2^{-L})\\ &\vdots\end{cases}
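The mixture representation above can be checked against direct simulation; the sketch below (our naming) draws Z_i both as the minimum hash over a rate-1/2 sample U and via the spacing mixture with a Geom[1/2] rank, and compares the empirical means:

```python
import random

def zi_two_ways(n=200, trials=4000, seed=0):
    """Compare E[Z_i] computed directly (minimum hash over a rate-1/2
    sample U) with the spacing mixture Z_i = Delta_1 + ... + Delta_J,
    where J ~ Geom[1/2] is the rank of the smallest sampled key."""
    rng = random.Random(seed)
    hashes = sorted(rng.expovariate(1.0) for _ in range(n))  # fixed rho
    direct = mixture = 0.0
    done = 0
    while done < trials:
        included = [h for h in hashes if rng.random() < 0.5]
        if not included:
            continue  # ignore the (astronomically unlikely) empty sample
        direct += included[0]  # hashes are sorted, so this is the minimum
        j = 1
        while rng.random() < 0.5:  # J = j with probability 2^{-j}
            j += 1
        mixture += hashes[min(j, n) - 1]  # Delta_1 + ... + Delta_j
        done += 1
    return direct / trials, mixture / trials

d, m = zi_two_ways()
```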

We now consider Z_i conditioned on the event m^i_j ∈ U. For j = 1 (conditioning on m^i_1 ∈ U) we have Z_i ≡ Δ^i_1. For j > 1, we have Z_i = Σ_{ℓ=1}^h Δ^i_ℓ with probability 2^{−h} for h < j, and Z_i = Σ_{ℓ=1}^j Δ^i_ℓ with probability 2^{−j+1}.

We bound the expected value of Z_i conditioned on the presence of a key u ∈ U.

Lemma A.8.
  • (i)

    (Anti-concentration) For u ≠ m^i_1, the random variable over the sampling of ρ, N

    G=\max_{u\in N\setminus\{m^{i}_{1}\}}\textsf{E}_{U\mid u\in U}[Z_{i}]-\textsf{E}_{U\mid m^{i}_{1}\in U}[Z_{i}]

    is such that for all c > 0, \textsf{Pr}_{\rho,N}\left[G\geq\frac{c}{n-1}\right]\geq e^{-2c}.

  • (ii)

    (Concentration) For u ∈ N, the random variable

    G=\textsf{E}_{U\mid u\in U}[Z_{i}]-\textsf{E}_{U\mid m^{i}_{j}\in U}[Z_{i}]

    is such that for c ≥ 1,

    \textsf{Pr}_{\rho,N}\left[G\geq c\cdot\frac{3}{n2^{j}}\right]\leq ce^{-c}\ .
Proof.

Per Lemma A.2, we assume the event (which holds with probability at least 1−δ) that U includes a key from {m^i_j}_{j∈[L]} for every i ∈ [k]. Therefore, for a key u ∈ N′ it holds that

\textsf{E}_{U\mid u\in U}[Z_{i}]=\textsf{E}_{U}[Z_{i}]\ .

Recall that Z_i ∣ m^i_1∈U ≡ Δ^i_1. Otherwise (when not conditioning on m^i_1∈U, or when conditioning on the presence of some u ≠ m^i_1), Z_i = Δ^i_1 with probability 1/2 (when the random U includes m^i_1) and Z_i ≥ Δ^i_1 + Δ^i_2 otherwise (when U does not include m^i_1). Thus,

\textsf{E}_{U}[Z_{i}]-\textsf{E}_{U\mid m^{i}_{1}\in U}[Z_{i}]\geq\Delta^{i}_{2}/2
\textsf{E}_{U\mid m^{i}_{j}\in U}[Z_{i}]-\textsf{E}_{U\mid m^{i}_{1}\in U}[Z_{i}]\geq\Delta^{i}_{2}/2 \qquad \text{for } 1<j\leq L

Therefore for u ≠ m^i_1,

G:=\textsf{E}_{U\mid u\in U}[Z_{i}]-\textsf{E}_{U\mid m^{i}_{1}\in U}[Z_{i}]\geq\Delta^{i}_{2}/2\ .

Since Δ^i_2 ~ Exp[n−1], we have for all t > 0 that Pr_ρ[G ≥ t] ≥ Pr[Δ^i_2 ≥ 2t] = e^{−2(n−1)t}; substituting t = c/(n−1) establishes claim (i).

For claim (ii), note that

\textsf{E}_{U\mid u\in U}[Z_{i}]\leq\textsf{E}[Z_{i}]\leq\sum_{\ell=1}^{L}\frac{1}{2^{\ell-1}}\Delta^{i}_{\ell}
\textsf{E}_{U\mid m^{i}_{j}\in U}[Z_{i}]\geq\sum_{\ell=1}^{j}\frac{1}{2^{\ell-1}}\Delta^{i}_{\ell}

Therefore

G:=\textsf{E}_{U\mid u\in U}[Z_{i}]-\textsf{E}_{U\mid m^{i}_{j}\in U}[Z_{i}]\leq\sum_{\ell=j+1}^{L}\frac{1}{2^{\ell-1}}\Delta^{i}_{\ell}\ .

This is a sum of independent exponential random variables, and recall that we assumed L ≤ n/3, so each Δ^i_ℓ with ℓ ≤ L has rate at least 2n/3, that is, mean at most 3/(2n). The sum is therefore stochastically smaller than the corresponding geometrically decreasing weighted sum of independent Exp[2n/3] random variables. It follows that

\textsf{E}[G]\leq\frac{3}{2n}\sum_{\ell=j+1}^{L}\frac{1}{2^{\ell-1}}\leq\frac{3}{n2^{j}}

We apply an upper bound on the tail [Janson, 2017a] showing that such a sum concentrates almost as well as a single exponential random variable, Pr[G ≥ tμ] ≤ te^{−t} with μ the bound on the mean, and obtain claim (ii):

\textsf{Pr}\left[G\geq t\frac{3}{n2^{j}}\right]\leq te^{-t}\ .

∎

Lemma A.3 is a corollary of the first claim of Lemma A.8.

We now bound the variance of Z_i, also when conditioned on the presence of any key u ∈ U, for fixed ρ and N.

Lemma A.9.

For fixed N, ρ, and any u ∈ N,

\textsf{Var}_{U|u\in U}[Z_{i}],\textsf{Var}[Z_{i}]=\Theta\left(\sum_{j}(3/2)^{-j}(\Delta_{j}^{i})^{2}\right)\ .

Proof.

\textsf{Var}_{U|u\in U}[Z_{i}],\textsf{Var}[Z_{i}]\leq\textsf{E}[Z_{i}^{2}]\leq\sum_{j\geq 1}2^{-j}\left(\sum_{\ell=1}^{j}\Delta^{i}_{\ell}\right)^{2}\leq\sum_{j\geq 1}2^{-j}j\sum_{\ell=1}^{j}(\Delta^{i}_{\ell})^{2}
=\sum_{j\geq 1}\left(\sum_{\ell\geq j}\frac{\ell}{2^{\ell}}\right)(\Delta^{i}_{j})^{2}=\Theta\left(\sum_{j}(3/2)^{-j}(\Delta_{j}^{i})^{2}\right)

∎

Proof of Lemma A.4.

Since the Z_i, also when conditioned on u ∈ U, are independent (modulo our simplifying assumption in Lemma A.2), it follows from Lemma A.9 that

\textsf{Var}_{U\mid u\in U}[Z]=\Theta\left(\sum_{i\in[k]}\sum_{j\geq 1}(3/2)^{-j}(\Delta_{j}^{i})^{2}\right)\ .

Therefore,

\max_{u\in N}\textsf{Var}_{U\mid u\in U}[Z]=\Theta\left(\sum_{i\in[k]}\sum_{j\geq 1}(3/2)^{-j}(\Delta_{j}^{i})^{2}\right)\ .

The right hand side

Y:=\max_{u\in N}\textsf{Var}_{U\mid u\in U}[Z]

is a random variable over ρ, N that is a weighted sum of the squares of independent exponential random variables Δ^i_j. The PDF of the square of an exponential random variable, Exp[w]², is \frac{w}{2\sqrt{t}}e^{-w\sqrt{t}}. The mean is 2/w² and the variance is at most E[t²] = 24/w⁴. Applying this, we obtain that E[Y] = Θ(k/n²) and Var[Y] = O(k/n⁴).

From Chebyshev’s inequality, Pr[Y − E[Y] ≥ c·√k/n²] = O(1/c²), and we obtain that for any δ > 0 there is a fixed constant c such that

\textsf{Pr}_{\rho,N}\left[\max_{u\in N}\textsf{Var}_{U\mid u\in U}[Z]\geq c\left(1+\frac{1}{\sqrt{k\delta}}\right)\frac{k}{n^{2}}\right]\leq\delta\ .

∎

A.4 Attack on the Bottom-k standard estimator

The argument is similar to that for k-mins sketches; we highlight the differences. Recall that a bottom-k sketch uses a single hash function h, with the sketch storing the k smallest hash values S(U) := {h(x) ∣ x∈U}_{(1:k)}. We use as statistic the kth order statistic (kth smallest value) T(S) := {h(x) ∣ x∈U}_{(k)}.

For fixed ρ and N, let m_j ∈ N (j ∈ [n]) be the key with the jth smallest hash value, h(m_j) = {h(x) ∣ x∈N}_{(j)}. Define Δ_1 := h(m_1) and, for j > 1, Δ_j := h(m_j) − h(m_{j−1}).

Let R be the random variable that is the rank in N of the key with the kth smallest hash value in U. The distribution of R is that of a sum of k i.i.d. geometric random variables Geom[q = 1/2]. We have E[R] = k/q and the concentration bound [Janson, 2017a] that for any c ≥ 1, Pr[R > c·E[R]] ≤ ce^{−c}.
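The claim E[R] = k/q is easy to check by simulating the selection process (the helper name is ours):

```python
import random

def mean_rank_of_kth_selected(k, q, trials=5000, seed=0):
    """Empirical mean of R, the rank of the k-th selected key when ranked
    keys are kept independently with probability q; R is a sum of k i.i.d.
    Geom[q] variables, so E[R] = k/q."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        count, rank = 0, 0
        while count < k:       # examine keys by increasing rank
            rank += 1
            if rng.random() < q:
                count += 1     # key selected
        total += rank
    return total / trials

mean_R = mean_rank_of_kth_selected(k=16, q=0.5)
# E[R] = k/q = 32
```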

Let L = 10 + 2log r and let N_0 = {m_j}_{j ≤ kL/q} be the set of keys with the kL/q smallest hash values. Let N′ = N∖N_0 be the remaining keys. We show that the attack separates, except with probability α, a key with one of the bottom-k ranks from a key in N′.

Assume n > 3|N_0|, and assume that we declare failure when a set U selected by the algorithm does not contain k keys from N_0. The probability of such a selection is at most Le^{−L} < 0.01/r, and at most 0.01 over all r steps.

For fixed ρ and N, consider the random variable Z := T(S(U)). The following parallels Lemma A.3 and Lemma A.9:

Lemma A.10.

Fixing ρ, N:

  • (i)

    Let

    G(u,v):=\textsf{E}_{U\mid u\in U}[Z]-\textsf{E}_{U\mid v\in U}[Z]

    For each i ∈ [k] and δ > 0,

    \textsf{Pr}\left[\min_{u\in N^{\prime}}G(u,m_{i})\geq\frac{\delta}{3n}\right]\geq 1-\delta\ .

  • (ii)

    For δ > 0 and some constant c,

    \textsf{Pr}_{\rho,N}\left[\max_{u\in N}\textsf{Var}_{U\mid u\in U}[Z]\leq c\left(1+\frac{1}{\sqrt{k\delta}}\right)\frac{k}{n^{2}}\right]\geq 1-\delta\ .
Proof.

(i) Note that Z = h(m_R), and that conditioning on m_i ∈ U for i ∈ [k] shifts the kth selected key to a stochastically smaller rank. The gap is a weighted average of the Δ_j for j ∈ [R, |N_0|]. These are independent Exp[n−j+1] random variables with j ≤ n/3. The expected value is Θ(1/n) and the tail bounds are at least as tight as for a single Exp[n/3] random variable.

(ii) We use the concentration bound on R to express the variance, for fixed ρ, N, as a weighted sum, with total weight Θ(k) and individual weights O(1), of independent squared exponential random variables. The argument is as in the proof of Lemma A.9. ∎

Using the same analysis, a subset U ⊂ N of size αn has T(S(U)) that in expectation corresponds to roughly the (k/α)th smallest rank in N, with standard error √k/α and normalized standard error 1/√k. The subset U selected as a prefix of the order generated by the attack includes (1−δ)k of the bottom-k keys of N together with δk other bottom keys of U. This means that in expectation T(S(U)) corresponds to roughly rank kδ/α in N. That is, the statistic is a factor of δ smaller than it should be, so the estimate is off by a factor of 1/δ.

Appendix B Analysis of Attack on General Query Response Algorithms

We include details for Sections 7 and 8.

B.1 Rank-domain representation of sketches

We use the rank-domain representation S^R_ρ(U) of the input sketch S_ρ(U). This representation is defined for subsets of a fixed ground set N. Instead of hash values, it records the ranks in N of the keys that are represented in the sketch S_ρ(U), with respect to the relevant hashmaps.

Definition B.1.

(Rank-domain representation) For a fixed ground set N and a subset U ⊂ N, the rank-domain representation S^R_ρ(U) of the respective MinHash sketch has the form (Y_1, …, Y_k), where Y_i ∈ ℕ.

  • k-mins sketch: For i ∈ [k] and j ≥ 1, let m^i_j be the key x ∈ N with the jth smallest h_i(x). For i ∈ [k], let Y_i := argmin_j {m^i_j ∈ U}; that is, Y_i is the smallest j such that m^i_j ∈ U.

  • k-partition sketch: For i ∈ [k] and j ≥ 1, let m^i_j be the key x ∈ N in part i with the jth smallest h(x). For i ∈ [k], let Y_i := argmin_j {m^i_j ∈ U} be the smallest such rank in the ith part.

  • Bottom-k sketch: For j ≥ 1, let m_j be the key x ∈ N with the jth smallest h(x). Let the bottom-k keys in U be m_{i_1}, m_{i_2}, …, m_{i_k} where i_1 < i_2 < ⋯ < i_k. We then define Y_1 := i_1 and Y_j := i_j − i_{j−1} for 1 < j ≤ k.

Note that when the ground set N is available to the query response algorithm, the rank-domain and MinHash representations are equivalent (each can be computed from the other).
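As an illustration of Definition B.1, a small Python helper (ours, not from the paper) computes the rank-domain representation of a bottom-k sketch from an explicitly ranked ground set:

```python
def bottomk_rank_domain(ranked_keys, U, k):
    """Rank-domain representation of a bottom-k sketch (Definition B.1):
    ranked_keys lists the ground set N by increasing hash value; the ranks
    i_1 < ... < i_k of the bottom-k keys of U are encoded as Y_1 = i_1 and
    Y_j = i_j - i_{j-1} for j > 1."""
    ranks = [j + 1 for j, x in enumerate(ranked_keys) if x in U][:k]
    assert len(ranks) == k, "U must contain at least k keys"
    return [ranks[0]] + [ranks[j] - ranks[j - 1] for j in range(1, k)]

# toy example: ground set ranked a < b < c < d < e < f; U = {b, c, f}, k = 2
Y = bottomk_rank_domain(["a", "b", "c", "d", "e", "f"], {"b", "c", "f"}, k=2)
# bottom-2 keys of U are b (rank 2) and c (rank 3), so Y = [2, 1]
```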

The following properties of the rank domain facilitate a simpler analysis: (i) it depends only on the order induced by the hashmaps and not on actual hash values, which allows us to factor out the dependence on ρ; (ii) it subsumes the information on q (and on |U|) available from S_ρ(U); and (iii) it has a unified form, facilitating a unified treatment across the MinHash sketch types.

Our attack algorithm generates the subsets U ~ 𝒟_0 by selecting a rate q and then sampling U so that each x ∈ N is included independently with probability q. We consider the distribution, denoted S^R[q], of the rank-domain sketch under this sampling of U with rate q. We show that for sufficiently large |N| = n, the rank-domain representation is distributed as follows:

Lemma B.2 (distribution of rank-domain sketches).

For δ > 0, q ∈ (0,1), and an upper bound r on the attack size, let L = log_2(rk/δ)/q + 10 and assume n > 3kL/δ. Then for all three MinHash sketch types, the distribution S^R[q] is within total variation distance δ of (Y_1, …, Y_k), where the Y_i ~ Geom[q] are k independent geometric random variables with parameter q.

Proof.

As in Lemma A.2, applying the birthday paradox with n > 3kL/δ, the following holds with probability at least 1−δ: for k-mins sketches, the keys m^i_j for i ∈ [k] and j ∈ [L] are distinct; for k-partition sketches, at least L keys are assigned to each part, so the keys m^i_j for i ∈ [k] and j ∈ [L] are well specified.

A sketch from S^R[q] can be equivalently sampled using the following process:

  • k-mins and k-partition sketches: For each i ∈ [k], process the keys m^i_j by increasing j ≥ 1 until a Bern[q] trial succeeds, and then set Y_i = j.

  • Bottom-k sketch: Process the keys m_j by increasing j ≥ 1 until k Bern[q] trials succeed.

We next establish that, with our choice of n, with probability at least 1−δ, in all r samplings of U the sketch S^R(U) is determined by the Lk smallest-rank keys. Therefore there are sufficiently many keys for the sketch to agree with the sampling of k i.i.d. Geom[q] random variables.

For k-mins and k-partition sketches, the probability that for a single hashmap i ∈ [k] none of the L smallest-rank keys is included is at most (1−q)^L. Taking a union bound over the k maps and r samplings, and using that log(1/(1−q)) ≈ q, gives the claim. With bottom-k sketches the requirement is that in all r selections, the kth smallest rank is O(kL). ∎

Remark B.3.

Estimating q from a sketch drawn from S^R[q] is a standard parameter estimation problem. A sufficient statistic T for estimating q is T(S^R) := Σ_{i=1}^k Y_i. Note the following properties:

  • The distribution S^R[q] provides no additional information on the cardinality |U| beyond an estimate of q.

  • The distribution of S^R[q] conditioned on T(S) = τ is the same for all q (this follows from the definition of a sufficient statistic).

  • The statistic T has expected value k/q, variance k(1−q)/q², and single-exponential concentration [Janson, 2017b].

B.1.1 Continuous rank domain representation

We now cast the distribution of S^R[q] in terms of a continuous representation S^C. This is simply a tool that we use in the analysis.

We can sample a sketch from S^R[q] as follows:

  • Set the rate q′ := −ln(1−q).

  • Sample a sketch S^C(U) = (Y′_1, …, Y′_k), where the Y′_i ~ Exp[q′] are i.i.d.

  • Compute S^R(U) from S^C(U) using Y_i ← ⌊Y′_i⌋ + 1 for i ∈ [k].

The correctness of this transformation follows from the relation between the geometric Geom[q] and exponential Exp[q′] distributions:

\textsf{Pr}[Y_{i}=t]=\textsf{Pr}[t-1\leq Y^{\prime}_{i}<t]=e^{-q^{\prime}(t-1)}-e^{-q^{\prime}t}=(1-e^{-q^{\prime}})\cdot e^{-q^{\prime}(t-1)}=q\cdot(1-q)^{t-1}\ .
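This correspondence is easy to verify numerically; the snippet below (our naming) checks that Y = ⌊Y′⌋ + 1 with Y′ ~ Exp[−ln(1−q)] assigns probability q to the outcome Y = 1, as Geom[q] requires:

```python
import math
import random

def prob_y_equals_one(q, trials=20000, seed=0):
    """Empirical Pr[Y = 1] where Y = floor(Y') + 1 and Y' ~ Exp[-ln(1-q)];
    for Geom[q] on {1, 2, ...} this probability should equal q."""
    rng = random.Random(seed)
    rate = -math.log(1.0 - q)
    hits = sum(int(rng.expovariate(rate)) + 1 == 1 for _ in range(trials))
    return hits / trials

p1 = prob_y_equals_one(q=0.3)
# Pr[Y = 1] = Pr[Y' < 1] = 1 - e^{ln(1-q)} = q = 0.3
```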

Note that we can always recover S^R from S^C, but we need to know q in order to sample S^C from S^R:

Y_{i}^{\prime}\sim\textsf{Exp}[-\ln(1-q)]\mid Y^{\prime}_{i}\in[Y_{i}-1,Y_{i})\ .

Therefore, being provided with the continuous representation only makes the query response algorithm more informed and potentially more powerful. Also note that |q−q′| = O(q²).

A sufficient statistic for estimating q′ from S^C[q′] is T′ := Σ_{i=1}^k Y′_i. In the sequel we work with S^C and omit the primes from q and T.

Now note that the distribution of T := ‖S^C[q]‖_1 for given k and q is that of a sum of k i.i.d. Exp[q] random variables. This is the Erlang distribution, with density function for x ∈ [0,∞):

f_{T}(k,q;x)=\frac{q^{k}}{(k-1)!}x^{k-1}e^{-qx} (4)

The distribution has mean E[T] = k/q, variance Var[T] = k/q², and exponential tail bounds [Janson, 2017a]:

For c>1: \textsf{Pr}[T\geq c\cdot k/q]\leq\frac{1}{c}e^{-k(c-1-\ln c)} (5)
For c<1: \textsf{Pr}[T\leq c\cdot k/q]\leq e^{-k(c-1-\ln c)} (6)
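A quick simulation (our helper) confirms the Erlang mean and variance:

```python
import random

def erlang_moments(k, q, trials=5000, seed=0):
    """Empirical mean and variance of T = sum of k i.i.d. Exp[q] draws
    (the Erlang distribution); predicted values are k/q and k/q^2."""
    rng = random.Random(seed)
    samples = [sum(rng.expovariate(q) for _ in range(k)) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / trials
    return mean, var

m, v = erlang_moments(k=16, q=2.0)
# E[T] = 16/2 = 8 and Var[T] = 16/4 = 4
```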

Consider the random variable

Z=(T-\textsf{E}[T])/\sqrt{\textsf{Var}[T]} (7)

that is, the number of standard deviations of T from its mean. We have T=\frac{k}{q}+Z\cdot\frac{\sqrt{k}}{q} and Z=\frac{qT}{\sqrt{k}}-\sqrt{k}.

The domain of Z is [−√k, ∞) and the density function of Z is

f_{Z}(k;z)=\frac{\sqrt{k}}{q}\frac{q^{k}}{(k-1)!}(\frac{k}{q}+z\frac{\sqrt{k}}{q})^{k-1}e^{-q(\frac{k}{q}+z\frac{\sqrt{k}}{q})}
=\frac{\sqrt{k}}{(k-1)!}(k+z\sqrt{k})^{k-1}e^{-(k+z\sqrt{k})} (8)

This density satisfies

\int_{-\sqrt{k}}^{\infty}f_{Z}(k;x)xdx=0 (9)
\int_{-\sqrt{k}}^{\infty}f_{Z}(k;x)x^{2}dx=1 (10)
\int_{-\sqrt{k}/4}^{\sqrt{k}/4}f_{Z}(k;x)x^{2}dx=\Theta(1) (11)
for c\in(0,1]: \int_{-c}^{0}f_{Z}(k;x)dx,\int_{0}^{c}f_{Z}(k;x)dx=\Omega(c) (12)
for c\geq 0: \textsf{Pr}[Z\geq c\cdot\sqrt{k}]\leq\frac{1}{c+1}e^{-k\cdot(c-\ln(c+1))} (13)
for c\in(0,1): \textsf{Pr}[Z\leq-c\cdot\sqrt{k}]\leq e^{-k\cdot(-c-\ln(1-c))} (14)

Note that T is available to the query response algorithm, but q, and therefore the value of Z, are not.

B.2 Correct maps

A map S ↦ π(S) ∈ [0,1] assigns to each sketch the probability of returning 1. We require that the maps selected by QR are correct as in Definition 6.3 with δ = O(1/√k).

For a map π and a value τ, we denote by π̄(τ) the mean value of π(S) over sketches with statistic value T(S) = τ. This is well defined since, for the query distribution 𝒟_0 in our attacks, even when conditioned on a fixed rate q*, the distribution of the sketch conditioned on T(S) = τ does not depend on q* (see Remark B.3).

We now specify conditions on π̄(τ) that must be satisfied by a correct π. A correct map may return an incorrect output, conditioned on cardinality, with probability δ. This means that there can be correct maps with large error on certain values of τ (since each cardinality induces a distribution over τ). We therefore cannot make a sharp claim on π̄(τ) that must hold for every τ in an applicable range. Instead, we make an averaged claim: for any interval of τ values that is wide enough to include an Ω(1) fraction of the values for some cardinality c ∉ [A, 2A], the average error of the mapping must be O(δ).

Claim B.4.

For any ξ > 0, there is c_0 > 0 such that for any map π that is correct for A with δ ≤ c_0/√k, and any τ_b > (1+0.1/√k)τ_a, it holds that

\begin{cases}\frac{1}{\tau_{b}-\tau_{a}}\int_{\tau_{a}}^{\tau_{b}}\overline{\pi}(x)dx<\xi&\text{if }\tau_{a}>\frac{kn}{A}(1-1/\sqrt{k})\\ \frac{1}{\tau_{b}-\tau_{a}}\int_{\tau_{a}}^{\tau_{b}}\overline{\pi}(x)dx>1-\xi&\text{if }\tau_{b}<\frac{kn}{2A}(1+1/\sqrt{k})\end{cases} (15)
Proof.

For a cardinality value c, the distribution of the statistic T conditioned on c is f_T(k, c/n; x) (4). With cardinality value c, it holds that Pr[T < kn/c] ≥ 1/e and Pr[T > kn/c] ≥ 1/e. Moreover, the density in the interval (kn/c)·[1−0.1/√k, 1+0.1/√k] is Θ(1). It follows from the correctness requirement for cardinality value c = kn/τ that there exists c_1 > 0 such that:

\begin{cases}\int_{\tau}^{\tau(1+0.1/\sqrt{k})}\overline{\pi}(x)dx<c_{1}\delta&\text{if }\tau>\frac{kn}{A}(1-1/\sqrt{k})\\ \int_{\tau(1-0.1/\sqrt{k})}^{\tau}\overline{\pi}(x)dx>1-c_{1}\delta&\text{if }\tau<\frac{kn}{2A}(1+1/\sqrt{k})\end{cases} (16)

Therefore for τ_a > (kn/A)(1−1/√k),

\int_{\tau_{a}}^{\tau_{b}}\overline{\pi}(x)dx\leq 10\sqrt{k}(\tau_{b}-\tau_{a})c_{1}\delta

and for τ_b < (kn/2A)(1+1/√k),

\int_{\tau_{a}}^{\tau_{b}}\overline{\pi}(x)dx\geq(\tau_{b}-\tau_{a})\cdot(1-10\sqrt{k}c_{1}\delta)\ .

Choosing c_0 ≤ ξ/(10c_1) establishes the claim. ∎

For fixed q, the cardinality |U| of the selected U has distribution Binom(n, q). The n chosen for the attack is large enough so that, for all of our r queries, ||U| − qn|/(qn) ≪ 1/√k. That is, the variation in |U| for fixed q is small compared with the error of the sketch, and |U| ≈ qn.

B.3 Relating ZZ and sampling probability of low rank keys

The sketch is determined by the k lowest-rank keys in U. We can view the sampling of U, up to the point that the sketch is determined, as a process (as in the proof of Lemma B.2) that examines keys from the ground set N in a certain order until the k keys that determine the sketch are selected. The process selects each examined key with probability q. For a bottom-k sketch, keys are examined in order of increasing rank until k are selected. With k-mins and k-partition sketches, the keys in each part (or hashmap) are examined sequentially by increasing rank until there is a selection for the part. For all sketch types, the statistic value T corresponds to the number of keys from N that are examined until k are selected. This applies also with the continuous representation S^C.

We denote by N_0 the set of keys that are examined with probability at least δ_c ≤ 1/(rk) when the rate is at least q_a. We refer to these keys as low-rank keys. It holds that |N_0| ≤ k ln(1/δ_c)/q_a. For δ_c = 1/O(rk) we have |N_0| = O(k log(rk)). The remaining keys N′ := N∖N_0 are unlikely to impact the sketch content, and we refer to them as transparent.

With rate q, the probability that a given key is included in U is q. We now consider a rate q and the inclusion probability conditioned on the normalized deviation from the mean, Z. Transparent keys have inclusion probability q. The low-rank keys N_0, however, have an average inclusion probability that depends on Z. Qualitatively, we expect that when Z < 0 the inclusion probability is larger than q and increases with the magnitude |Z|, and that when Z > 0 it is lower than q and decreases with the magnitude. This is quantified in the following claim:

Claim B.5.

Fix a rate q and a value Δ, and consider the distribution of U conditioned on Z = Δ. The average probability, over the keys of N_0, of being included in U is q − Δ√k/|N_0|.

Proof.

Equivalently, consider the distribution conditioned on T=1q(k+Δk)T=\frac{1}{q}(k+\Delta\sqrt{k}). The sampling process selects kk keys after examining TT keys. The average effective sampling rate for the examined keys is qe=k/Tq_{e}=k/T. There are TT keys out of N0N_{0} that are processed with effective rate qeq_{e} and the remaining keys in N0N_{0} have effective rate qq.

Averaging the effective rate over the T=1q(k+Δk)T=\frac{1}{q}(k+\Delta\sqrt{k}) processed keys and the remaining |N0|T|N_{0}|-T keys of N0N_{0}, we obtain

Tqe+(|N0|T)q|N0|\displaystyle\frac{T\cdot q_{e}+(|N_{0}|-T)\cdot q}{|N_{0}|} =TkT+(|N0|T)q|N0|=k+(|N0|(k+Δkq))q|N0|\displaystyle=\frac{T\cdot\frac{k}{T}+(|N_{0}|-T)\cdot q}{|N_{0}|}=\frac{k+(|N_{0}|-(\frac{k+\Delta\sqrt{k}}{q}))\cdot q}{|N_{0}|}
=qΔk|N0|\displaystyle=q-\Delta\frac{\sqrt{k}}{|N_{0}|}

∎
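The averaging step in this proof can be checked numerically; the values of kk, qq, Δ\Delta, and |N0||N_{0}| below are arbitrary illustrative choices (chosen so that T|N0|T\leq|N_{0}|):

```python
from math import sqrt, isclose

# Arbitrary illustrative parameters (must satisfy T <= |N0|).
k, q, Delta, N0 = 64.0, 0.1, 1.5, 3000.0
T = (k + Delta * sqrt(k)) / q           # keys examined until k are selected
q_eff = k / T                           # effective rate on the T examined keys
avg = (T * q_eff + (N0 - T) * q) / N0   # average inclusion rate over N0
# algebraically equal to q - Delta * sqrt(k) / N0
```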

B.4 Scoring probability gap

For a map π\pi, let p(π)p^{\prime}(\pi) be the score probability, over the distribution of qq and ZZ, of a key in NN^{\prime}. Let p0(π)p_{0}(\pi) be the average over N0N_{0} of the score probability of keys in N0N_{0}.

Let fλ(x)f_{\lambda}(x) be the density function of the selected rate, described by Algorithm 5.

Input: AA, nn
ωn2A\omega\leftarrow\frac{n}{2A}
ωa12ω\omega_{a}\leftarrow\frac{1}{2}\omega; ωb52ω\omega_{b}\leftarrow\frac{5}{2}\omega
  // range of inverse rates
DU[0,ω/4]D\sim U[0,\omega/4]
ωaωa+D\omega^{*}_{a}\leftarrow\omega_{a}+D; ωbωa+74ω\omega^{*}_{b}\leftarrow\omega^{*}_{a}+\frac{7}{4}\omega
  // range of sampled inverse rate
return q1U[ωa,ωb]q\sim\frac{1}{U[\omega^{*}_{a},\omega^{*}_{b}]}
Algorithm 5 Sample rate

Note that the selected rate is in the interval q[1ωb,1ωa]=1ω[25,2]=An[45,4]q\in[\frac{1}{\omega_{b}},\frac{1}{\omega_{a}}]=\frac{1}{\omega}\cdot[\frac{2}{5},2]=\frac{A}{n}\cdot[\frac{4}{5},4].
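A direct Python rendering of Algorithm 5 (our own sketch of the pseudocode; variable names are ours, and the inputs AA and nn are arbitrary examples) illustrates that the sampled rate always lands in An[45,4]\frac{A}{n}\cdot[\frac{4}{5},4]:

```python
import random

def sample_rate(A, n, rng):
    """Sample a rate q as in Algorithm 5: draw an inverse rate uniformly
    from a randomly shifted window and return its reciprocal."""
    omega = n / (2 * A)
    omega_a = 0.5 * omega            # lower end of the inverse-rate range
    D = rng.uniform(0, omega / 4)    # random shift of the window
    wa = omega_a + D                 # omega*_a
    wb = wa + 1.75 * omega           # omega*_b: window of sampled inverse rates
    return 1.0 / rng.uniform(wa, wb)

rng = random.Random(0)
A, n = 100, 10_000  # arbitrary illustrative inputs
qs = [sample_rate(A, n, rng) for _ in range(1000)]
# every sampled rate lies in (A/n) * [4/5, 4]
```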

For each transparent key, the score probability is:

p(π)=qaqbkπ¯(kq(1+z/k))fZ(k;z)𝑑zqfλ(q)𝑑q\displaystyle p^{\prime}(\pi)=\int_{q_{a}}^{q_{b}}\int_{-\sqrt{k}}^{\infty}\overline{\pi}(\frac{k}{q}(1+z/\sqrt{k}))f_{Z}(k;z)\cdot dz\cdot q\cdot f_{\lambda}(q)dq (17)

On average over the low-rank keys N0N_{0} using Claim B.5 it is

p0(π)=qaqbkπ¯(kq(1+z/k))fZ(k;z)(qzk|N0|)𝑑zfλ(q)𝑑q\displaystyle p_{0}(\pi)=\int_{q_{a}}^{q_{b}}\int_{-\sqrt{k}}^{\infty}\overline{\pi}(\frac{k}{q}(1+z/\sqrt{k}))f_{Z}(k;z)\cdot\left(q-z\frac{\sqrt{k}}{|N_{0}|}\right)\cdot dz\cdot f_{\lambda}(q)dq (18)

For a correct map π\pi (as in Definition 6.3), we express the gap between p(π)p^{\prime}(\pi) and p0(π)p_{0}(\pi). Note that we bound the gap without assuming much about the actual values, as they can vary widely across different correct maps π\pi.

Lemma B.6 (Score probability gap).

Consider a step of the algorithm and a correct map π\pi (see Remark 6.4). Then

p0(π)p(π)=Ω(1|N0|)=Ω(1klog(kr))p_{0}(\pi)-p^{\prime}(\pi)=\Omega\left(\frac{1}{|N_{0}|}\right)=\Omega\left(\frac{1}{k\log(kr)}\right)

In the remaining part of this section we present the proof of Lemma B.6. We will need the following claim, which relates Δ\Delta to the scoring probability.

Claim B.7.

For |Δ|<k/4|\Delta|<\sqrt{k}/4

qaqbπ¯(kq(1+Δk))fλ(q)𝑑q=qaqbπ¯(kq)fλ(q)𝑑q+Θ(Δk)\int_{q_{a}}^{q_{b}}\overline{\pi}(\frac{k}{q}(1+\frac{\Delta}{\sqrt{k}}))\cdot f_{\lambda}(q)dq=\int_{q_{a}}^{q_{b}}\overline{\pi}(\frac{k}{q})\cdot f_{\lambda}(q)dq+\Theta(\frac{\Delta}{\sqrt{k}})
Proof.

Using the distribution specified in Algorithm 5, for any g()g():

qaqbg(q)fλ(q)𝑑q=4ω0ω/4𝑑D47ωωa+Dωa+D+74ωg(1/x)𝑑x\displaystyle\int_{q_{a}}^{q_{b}}g(q)f_{\lambda}(q)dq=\frac{4}{\omega}\int_{0}^{\omega/4}dD\frac{4}{7\omega}\int_{\omega_{a}+D}^{\omega_{a}+D+\frac{7}{4}\omega}g(1/x)dx (19)

We use ωa=ωa+D\omega^{*}_{a}=\omega_{a}+D and ωb=ωa+D+74ω\omega^{*}_{b}=\omega_{a}+D+\frac{7}{4}\omega and get

ωaωbπ¯(kx(1+Δk))𝑑x\displaystyle\int_{\omega^{*}_{a}}^{\omega^{*}_{b}}\overline{\pi}(kx(1+\frac{\Delta}{\sqrt{k}}))\cdot dx =11+Δkωa(1+Δ/k)ωb(1+Δ/k)π¯(ky)𝑑y (change variable x to y=x(1+Δ/k))\displaystyle=\frac{1}{1+\frac{\Delta}{\sqrt{k}}}\int_{\omega^{*}_{a}\cdot(1+\Delta/\sqrt{k})}^{\omega^{*}_{b}\cdot(1+\Delta/\sqrt{k})}\overline{\pi}(ky)dy\text{$\;\;$ (change variable $x$ to $y=x(1+\Delta/\sqrt{k})$)}
=11+Δk(ωaωbπ¯(kx)𝑑xωaωa(1+Δ/k)π¯(kx)𝑑x+ωbωb(1+Δ/k)π¯(kx)𝑑x)\displaystyle=\frac{1}{1+\frac{\Delta}{\sqrt{k}}}\left(\int_{\omega^{*}_{a}}^{\omega^{*}_{b}}\overline{\pi}(kx)dx-\int_{\omega^{*}_{a}}^{\omega^{*}_{a}(1+\Delta/\sqrt{k})}\overline{\pi}(kx)dx+\int_{\omega^{*}_{b}}^{\omega^{*}_{b}(1+\Delta/\sqrt{k})}\overline{\pi}(kx)dx\right)

Therefore (note that Δ\Delta can be negative; in that case, to streamline expressions, we interpret the asymptotic notation O(cΔ)O(c\Delta) as −O(c|Δ|)-O(c|\Delta|)):

ωaωb(π¯(kx(1+Δk))π¯(kx))𝑑x=\displaystyle\int_{\omega^{*}_{a}}^{\omega^{*}_{b}}\left(\overline{\pi}(kx(1+\frac{\Delta}{\sqrt{k}}))-\overline{\pi}(kx)\right)\cdot dx= (20)
=Θ(Δk)ωaωbπ¯(kx)𝑑xΘ(1)ωaωa(1+Δ/k)π¯(kx)𝑑x+Θ(1)ωbωb(1+Δ/k)π¯(kx)𝑑x\displaystyle=-\Theta(\frac{\Delta}{\sqrt{k}})\cdot\int_{\omega^{*}_{a}}^{\omega^{*}_{b}}\overline{\pi}(kx)dx-\Theta(1)\cdot\int_{\omega^{*}_{a}}^{\omega^{*}_{a}(1+\Delta/\sqrt{k})}\overline{\pi}(kx)dx+\Theta(1)\cdot\int_{\omega^{*}_{b}}^{\omega^{*}_{b}(1+\Delta/\sqrt{k})}\overline{\pi}(kx)dx
=O(Δkω)Θ(1)ωaωa(1+Δ/k)π¯(kx)𝑑x+Θ(1)ωbωb(1+Δ/k)π¯(kx)𝑑x.\displaystyle=-O(\frac{\Delta}{\sqrt{k}}\omega)-\Theta(1)\cdot\int_{\omega^{*}_{a}}^{\omega^{*}_{a}(1+\Delta/\sqrt{k})}\overline{\pi}(kx)dx+\Theta(1)\cdot\int_{\omega^{*}_{b}}^{\omega^{*}_{b}(1+\Delta/\sqrt{k})}\overline{\pi}(kx)dx\ .

The last equality follows using

ωaωbπ¯(kx)𝑑x[0,ωbωa]=[0,74ω]\int_{\omega^{*}_{a}}^{\omega^{*}_{b}}\overline{\pi}(kx)dx\in[0,\omega^{*}_{b}-\omega^{*}_{a}]=[0,\frac{7}{4}\omega]
qaqbπ¯(kq(1+Δk))fλ(q)𝑑qqaqbπ¯(kq)fλ(q)𝑑q=\displaystyle\int_{q_{a}}^{q_{b}}\overline{\pi}(\frac{k}{q}(1+\frac{\Delta}{\sqrt{k}}))\cdot f_{\lambda}(q)dq-\int_{q_{a}}^{q_{b}}\overline{\pi}(\frac{k}{q})\cdot f_{\lambda}(q)dq= (21)
=qaqb(π¯(kq(1+Δk))π¯(kq))fλ(q)𝑑q=\displaystyle=\int_{q_{a}}^{q_{b}}\left(\overline{\pi}(\frac{k}{q}(1+\frac{\Delta}{\sqrt{k}}))-\overline{\pi}(\frac{k}{q})\right)\cdot f_{\lambda}(q)dq=
=4ω0ω/4𝑑D47ωωa+Dωa+D+74ω(π¯(kx(1+Δk))π¯(kx))𝑑x(Using (19))\displaystyle=\frac{4}{\omega}\int_{0}^{\omega/4}dD\frac{4}{7\omega}\int_{\omega_{a}+D}^{\omega_{a}+D+\frac{7}{4}\omega}\left(\overline{\pi}(kx(1+\frac{\Delta}{\sqrt{k}}))-\overline{\pi}(kx)\right)dx\;\;\text{(Using \eqref{q2omega:eq})}
=O(Δk)Θ(1ω2)0ω/4𝑑Dωaωa(1+Δ/k)π¯(kx)𝑑x+Θ(1ω2)0ω/4𝑑Dωbωb(1+Δ/k)π¯(kx)𝑑x(Apply (20))\displaystyle=-O(\frac{\Delta}{\sqrt{k}})-\Theta(\frac{1}{\omega^{2}})\cdot\int_{0}^{\omega/4}dD\int_{\omega^{*}_{a}}^{\omega^{*}_{a}(1+\Delta/\sqrt{k})}\overline{\pi}(kx)dx+\Theta(\frac{1}{\omega^{2}})\cdot\int_{0}^{\omega/4}dD\cdot\int_{\omega^{*}_{b}}^{\omega^{*}_{b}(1+\Delta/\sqrt{k})}\overline{\pi}(kx)dx\ \;\;\text{(Apply \eqref{diffomega:eq})}

We now separately bound the terms (the argument for negative Δ\Delta is similar):

0ω/4𝑑Dωaωa(1+Δ/k)π¯(kx)𝑑x\displaystyle\int_{0}^{\omega/4}dD\int_{\omega^{*}_{a}}^{\omega^{*}_{a}(1+\Delta/\sqrt{k})}\overline{\pi}(kx)dx 0ω/4𝑑Dωa+Dωa+D+Δkωaπ¯(kx)𝑑x\displaystyle\geq\int_{0}^{\omega/4}dD\int_{\omega_{a}+D}^{\omega_{a}+D+\frac{\Delta}{\sqrt{k}}\omega_{a}}\overline{\pi}(kx)dx
=0Δk(ω/4)𝑑Wωa+Wωa+ω/4+Wπ¯(kx)𝑑x\displaystyle=\int_{0}^{\frac{\Delta}{\sqrt{k}}(\omega/4)}dW\int_{\omega_{a}+W}^{\omega_{a}+\omega/4+W}\overline{\pi}(kx)dx
Δkω4ω4(1ξ)=Θ(ω2Δk)(Using Claim B.4)\displaystyle\geq\frac{\Delta}{\sqrt{k}}\frac{\omega}{4}\cdot\frac{\omega}{4}(1-\xi)=\Theta(\omega^{2}\frac{\Delta}{\sqrt{k}})\;\;\text{(Using Claim~{}\ref{Tforcorrectmap:claim})} (22)
0ω/4𝑑Dωaωa(1+Δ/k)π¯(kx)𝑑x\displaystyle\int_{0}^{\omega/4}dD\int_{\omega^{*}_{a}}^{\omega^{*}_{a}(1+\Delta/\sqrt{k})}\overline{\pi}(kx)dx 0ω/4𝑑Dωa+Dωa+D+Δk(ωa+ω/4)π¯(kx)𝑑x\displaystyle\leq\int_{0}^{\omega/4}dD\int_{\omega_{a}+D}^{\omega_{a}+D+\frac{\Delta}{\sqrt{k}}(\omega_{a}+\omega/4)}\overline{\pi}(kx)dx
=0Δk(ωa+ω/4)𝑑Wωa+Wωa+ω/4+Wπ¯(kx)𝑑x\displaystyle=\int_{0}^{\frac{\Delta}{\sqrt{k}}(\omega_{a}+\omega/4)}dW\int_{\omega_{a}+W}^{\omega_{a}+\omega/4+W}\overline{\pi}(kx)dx =O(ω2Δk)\displaystyle=O(\omega^{2}\frac{\Delta}{\sqrt{k}}) (23)

Combining (22) and (23) we obtain that

0ω/4𝑑Dωaωa(1+Δ/k)π¯(kx)𝑑x=Θ(ω2Δk)\int_{0}^{\omega/4}dD\int_{\omega^{*}_{a}}^{\omega^{*}_{a}(1+\Delta/\sqrt{k})}\overline{\pi}(kx)dx=\Theta(\omega^{2}\frac{\Delta}{\sqrt{k}}) (24)

We next bound the last term:

0ω/4𝑑Dωbωb(1+Δ/k)π¯(kx)𝑑x\displaystyle\int_{0}^{\omega/4}dD\int_{\omega^{*}_{b}}^{\omega^{*}_{b}(1+\Delta/\sqrt{k})}\overline{\pi}(kx)dx =0ω/4𝑑D94ω+D(94ω+D)(1+Δ/k)π¯(kx)𝑑x\displaystyle=\int_{0}^{\omega/4}dD\int_{\frac{9}{4}\omega+D}^{(\frac{9}{4}\omega+D)\cdot(1+\Delta/\sqrt{k})}\overline{\pi}(kx)dx
0ω/4𝑑D94ω+D(94ω+D)+52ωΔkπ¯(kx)𝑑x\displaystyle\leq\int_{0}^{\omega/4}dD\int_{\frac{9}{4}\omega+D}^{(\frac{9}{4}\omega+D)+\frac{5}{2}\omega\cdot\frac{\Delta}{\sqrt{k}}}\overline{\pi}(kx)dx
=052ωΔk𝑑W94ω+W94ω+W+ω/4π¯(kx)𝑑x\displaystyle=\int_{0}^{\frac{5}{2}\omega\cdot\frac{\Delta}{\sqrt{k}}}dW\int_{\frac{9}{4}\omega+W}^{\frac{9}{4}\omega+W+\omega/4}\overline{\pi}(kx)dx
52ωΔkω4ξ=58Δkω2ξ(Apply Claim B.4)\displaystyle\leq\frac{5}{2}\omega\frac{\Delta}{\sqrt{k}}\cdot\frac{\omega}{4}\xi=\frac{5}{8}\frac{\Delta}{\sqrt{k}}\omega^{2}\xi\;\;\text{(Apply Claim~{}\ref{Tforcorrectmap:claim})} (25)

We substitute (24) and (25) in (21) to conclude the proof, choosing a small enough constant ξ\xi. ∎

Proof of Lemma B.6.

We express the difference between the average score probability of a key in N0N_{0} (18) and the score probability of a key in NN^{\prime} (17):

p0(π)p(π)\displaystyle p_{0}(\pi)-p^{\prime}(\pi) =k|N0|qaqbkπ¯(k+kzq)fZ(k;z)z𝑑zfλ(q)𝑑q\displaystyle=\frac{\sqrt{k}}{|N_{0}|}\cdot\int_{q_{a}}^{q_{b}}\int_{-\sqrt{k}}^{\infty}\overline{\pi}(\frac{k+\sqrt{k}z}{q})f_{Z}(k;z)\cdot zdz\cdot f_{\lambda}(q)dq
=k|N0|k(qaqbπ¯(k+kzq)fλ(q)𝑑q)fZ(k;z)z𝑑z\displaystyle=\frac{\sqrt{k}}{|N_{0}|}\cdot\int_{-\sqrt{k}}^{\infty}\left(\int_{q_{a}}^{q_{b}}\overline{\pi}(\frac{k+\sqrt{k}z}{q})\cdot f_{\lambda}(q)dq\right)\cdot f_{Z}(k;z)\cdot zdz (26)

We separately consider ZZ in the range Iin:=[k/4,k/4]I_{\text{in}}:=[-\sqrt{k}/4,\sqrt{k}/4] and ZZ outside this range, in Iout:=[k,k/4][k/4,)I_{\text{out}}:=[-\sqrt{k},-\sqrt{k}/4]\cup[\sqrt{k}/4,\infty).

For ZZ outside the range, we use that qaqbπ¯(k+kzq)fλ(q)𝑑q[0,1]\int_{q_{a}}^{q_{b}}\overline{\pi}(\frac{k+\sqrt{k}z}{q})\cdot f_{\lambda}(q)dq\in[0,1] and tail bounds (13) and (14) on fZ(k;z)f_{Z}(k;z) to get:

k|N0|Iout(qaqbπ¯(k+kzq)fλ(q)𝑑q)fZ(k;z)z𝑑z=1|N0|eΩ(k)Apply (14) and (13)\displaystyle\frac{\sqrt{k}}{|N_{0}|}\int_{I_{\text{out}}}\left(\int_{q_{a}}^{q_{b}}\overline{\pi}(\frac{k+\sqrt{k}z}{q})\cdot f_{\lambda}(q)dq\right)\cdot f_{Z}(k;z)\cdot zdz=\frac{1}{|N_{0}|}e^{-\Omega(k)}\;\;\text{Apply \eqref{lowtailZ:eq} and \eqref{uptailZ:eq}} (27)

For ZZ inside the range, we apply Claim B.7:

k|N0|Iin(qaqbπ¯(k+kzq)fλ(q)𝑑q)fZ(k;z)z𝑑z\displaystyle\frac{\sqrt{k}}{|N_{0}|}\cdot\int_{I_{\text{in}}}\left(\int_{q_{a}}^{q_{b}}\overline{\pi}(\frac{k+\sqrt{k}z}{q})\cdot f_{\lambda}(q)dq\right)\cdot f_{Z}(k;z)\cdot zdz
=k|N0|(Iin(qaqbπ¯(kq)fλ(q)𝑑q)fZ(k;z)z𝑑z+IinΘ(zk)fZ(k;z)z𝑑z)\displaystyle=\frac{\sqrt{k}}{|N_{0}|}\cdot\left(\int_{I_{\text{in}}}\left(\int_{q_{a}}^{q_{b}}\overline{\pi}(\frac{k}{q})\cdot f_{\lambda}(q)dq\right)\cdot f_{Z}(k;z)\cdot zdz+\int_{I_{\text{in}}}\Theta(\frac{z}{\sqrt{k}})\cdot f_{Z}(k;z)\cdot zdz\right) Claim B.7
=k|N0|(qaqbπ¯(kq)fλ(q)𝑑qIinfZ(k;z)z𝑑z+1kΘ(IinfZ(k;z)z2𝑑z))\displaystyle=\frac{\sqrt{k}}{|N_{0}|}\cdot\left(\int_{q_{a}}^{q_{b}}\overline{\pi}(\frac{k}{q})\cdot f_{\lambda}(q)dq\cdot\int_{I_{\text{in}}}f_{Z}(k;z)\cdot zdz+\frac{1}{\sqrt{k}}\cdot\Theta\left(\int_{I_{\text{in}}}f_{Z}(k;z)\cdot z^{2}dz\right)\right)
=k|N0|(qaqbπ¯(kq)fλ(q)𝑑q0+1kΘ(1))\displaystyle=\frac{\sqrt{k}}{|N_{0}|}\cdot\left(\int_{q_{a}}^{q_{b}}\overline{\pi}(\frac{k}{q})\cdot f_{\lambda}(q)dq\cdot 0+\frac{1}{\sqrt{k}}\cdot\Theta\left(1\right)\right) Using (9) and (11)
=k|N0|1kΘ(1)=Θ(1|N0|)\displaystyle=\frac{\sqrt{k}}{|N_{0}|}\cdot\frac{1}{\sqrt{k}}\cdot\Theta(1)=\Theta(\frac{1}{|N_{0}|}) (28)

The statement of the Lemma follows by combining (27) and (28) in (26). ∎

B.5 The case of symmetric estimators

The proof of Lemma 7.3 (gap for symmetric estimators) follows as a corollary of the proof of Lemma B.6.

Proof of Lemma 7.3.

For symmetric maps (Definition 7.2), keys in N0N_{0} of lower rank can only have higher scoring probabilities. That is, when j<jj<j^{\prime}, the score probability of mjim^{i}_{j} is no lower than that of mjim^{i}_{j^{\prime}}; with bottom-kk sketches, the score probability of mjm_{j} is no lower than that of mjm_{j^{\prime}}. In particular, the keys in N0N^{*}_{0} have the highest average score among keys in their components. Additionally, there is symmetry between components. Therefore, the average score of each of the kk lowest rank keys in N0N^{*}_{0} is no lower than the average over all N0N_{0} keys:

EU[π(Sρ(U))𝟏(mU)]p0(π).\textsf{E}_{U}[\pi(S_{\rho}(U))\cdot\mathbf{1}(m\in U)]\geq p_{0}(\pi)\ .

Therefore using Lemma B.6:

EU[π(Sρ(U))𝟏(mU)]p(π)p0(π)p(π)=Ω(1klog(kr)).\textsf{E}_{U}[\pi(S_{\rho}(U))\cdot\mathbf{1}(m\in U)]-p^{\prime}(\pi)\geq p_{0}(\pi)-p^{\prime}(\pi)=\Omega(\frac{1}{k\log(kr)})\ . ∎

B.6 Analysis details of the Adaptive Algorithm

We consider the information available to the query response algorithm. The mask MM is shared with the query response algorithm, and hence it only needs to estimate the cardinality of the (much larger) set UU. The mask keys MM hide information in Sρ(U)S_{\rho}(U) and make additional keys transparent. For kk-partition and kk-mins sketches, keys mhim^{i}_{h} where h>argminjmjiMh>\arg\min_{j}m^{i}_{j}\in M are transparent. For bottom-kk sketches, we see only the kkk^{\prime}\leq k bottom ranks in UU if kkk-k^{\prime} keys from MM have lower ranks.

We describe the sampling of S(UM)S(U\cup M) (for given mask MM) as an equivalent process that examines keys in UU in order, selecting each examined key with probability qq, until the sketch is determined. This process generalizes the process we described for the case without a mask in the proof of Lemma B.2:

  • Bottom-kk sketch: Set counter ckc\leftarrow k. Sample tGeom[q]t\sim\textsf{Geom}[q]. Process keys mjm_{j} in increasing jj:

    1. If mjMm_{j}\in M, decrease cc and output mjm_{j}. If c=0c=0, halt.

    2. If mjMm_{j}\not\in M, decrease tt. If t=0t=0, output mjm_{j}, decrease cc, and sample a new tGeom[q]t\sim\textsf{Geom}[q]. If c=0c=0, halt.

  • kk-mins and kk-partition sketches: For i[k]i\in[k] let hiargminmiMh^{i}\leftarrow\arg\min_{\ell}m^{i}_{\ell}\in M. Sample tGeom[q]t\sim\textsf{Geom}[q]. Process i[k]i\in[k] in order:

    1. If hih^{i} is defined and hith^{i}\leq t, then tthi+1t\leftarrow t-h^{i}+1 and continue with the next ii.

    2. If hih^{i} is undefined or hi>th^{i}>t, then output mtim^{i}_{t}, sample a new tGeom[q]t\sim\textsf{Geom}[q], and continue to the next ii.

The QR algorithm has the results of the process, which yields kkk^{\prime}\leq k i.i.d. Geom[q]\textsf{Geom}[q] random variables. As keys are added to the mask MM, the information we can glean on qq from the sketch, which corresponds to the number kk^{\prime} of Geom[q]\textsf{Geom}[q] samples we obtain, decreases. As the mask gets augmented, additional keys in N0N_{0} become transparent, in the sense that they have probability smaller than δ/r\delta/r of impacting the sketch if included in UU. With kk-mins and kk-partition sketches, keys mjim^{i}_{j} where j>hij>h^{i} become transparent. With bottom-kk sketches, keys mjm_{j} where j>mini(k|M(m)<j|)Ω(log(r))j>\min_{i}(k-|M\cap(m_{\ell})_{\ell<j}|)\cdot\Omega(\log(r)) are transparent. These keys are no longer candidates to be examined by the process above. We denote by N0N0N^{\prime}_{0}\subset N_{0} the set of keys that remain non-transparent. It holds that |N0|=O(k¯log(kr))|N^{\prime}_{0}|=O(\overline{k^{\prime}}\log(kr)), where k¯\overline{k^{\prime}} is the mean kk^{\prime} with our current mask. When kk^{\prime} becomes too small, k=O(log(kr))k^{\prime}=O(\log(kr)) (see Remark 6.2), there are no correct maps and the algorithm halts and returns MM.
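The bottom-kk process with a mask described above can be sketched as follows (an illustrative rendering of ours; keys are identified with their rank order, the mask ranks are hypothetical examples, and k_prime counts the completed Geom[q]\textsf{Geom}[q] samples the QR algorithm observes):

```python
import random

def bottomk_process_with_mask(keys, mask, k, q, rng):
    """Process keys in increasing rank order. Mask keys are output
    directly; the remaining keys are selected via fresh Geom[q]
    counters. Returns (sketch, k_prime) where k_prime is the number
    of completed Geom[q] samples (the k' of the analysis)."""
    def geom():
        t = 1  # Geom[q]: number of trials up to and including first success
        while rng.random() >= q:
            t += 1
        return t

    sketch, t, k_prime = [], geom(), 0
    for key in keys:
        if key in mask:
            sketch.append(key)      # mask key: output for free
        else:
            t -= 1
            if t == 0:
                sketch.append(key)  # selected non-mask key
                k_prime += 1        # one Geom[q] sample fully observed
                t = geom()
        if len(sketch) == k:
            break
    return sketch, k_prime

# Illustrative run: 200 keys in rank order, 3 hypothetical mask ranks.
sketch, k_prime = bottomk_process_with_mask(list(range(200)), {0, 3, 7},
                                            k=10, q=0.3, rng=random.Random(42))
```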

Let p(π,M,x)p(\pi,M,x) be the probability that key xx is scored with map π\pi and mask MM. The probability is the same for all transparent keys xN0x\not\in N^{\prime}_{0} and we denote it by p(π,M)p^{\prime}(\pi,M).

Lemma B.8 (Score probability gap with mask).

Let π\pi be a correct map for M𝒟0M\cup\mathcal{D}_{0}. Then

xN0(p(π,M,x)p(π,M))=Ω(1log(kr)).\sum_{x\in N^{\prime}_{0}}\left(p(\pi,M,x)-p^{\prime}(\pi,M)\right)=\Omega\left(\frac{1}{\log(kr)}\right)\ .
Proof.

The argument in the proof of Lemma B.6 applies with respect to kk^{\prime}, using that |N0|=O(k¯log(kr))|N^{\prime}_{0}|=O(\overline{k^{\prime}}\log(kr)). ∎

Claim B.9.

With probability at least 0.990.99, no transparent keys are placed in MM.

Proof.

First note that all transparent keys have the same score distribution (see the proof of Theorem 7.1). Keys get placed in MM when their score separates from the median score in NMN\setminus M. Note that since nearly all keys (all except an α\alpha fraction) are transparent, the median score is the score of a transparent key. By Chernoff bounds (3), the probability that a transparent key deviates by more than λ\lambda from its expectation at a given step (and is thus placed in MM) is <1/(100nr)<1/(100nr). Taking a union bound over all steps and transparent keys we obtain the claim. ∎
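The union-bound arithmetic behind the claim, with illustrative (hypothetical) sizes nn and rr:

```python
# Union-bound arithmetic behind Claim B.9; n and r are arbitrary example sizes.
n, r = 10_000, 500
p_step = 1 / (100 * n * r)   # per-key, per-step probability of a large deviation
p_total = n * r * p_step     # union bound over <= n transparent keys and r steps
# p_total is 1/100, so no transparent key is placed in M w.p. >= 0.99
```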