Towards Better Bounds for Finding Quasi-Identifiers
(Authors are in alphabetical order.)
Abstract
We revisit the problem of finding small -separation keys introduced by Motwani and Xu (2008). In this problem, the input is a data set consisting of -dimensional tuples. The goal is to find a small subset of coordinates that separates at least pairs of tuples. For large data sets, they provided a fast algorithm that runs on tuples sampled uniformly at random. We show that the sample size can be improved to . Our algorithm also enjoys a faster running time.
To obtain this result, we consider a decision problem that takes a subset of coordinates . It rejects if separates fewer than pairs of tuples, and accepts if separates all pairs of tuples. The algorithm must be correct with probability at least for all choices of . We show that for algorithms based on uniform sampling:
• samples are sufficient and necessary so that .
• samples are necessary so that is a constant. Closing the gap between the upper and lower bounds in this case is still an open question.
The analysis is based on a constrained version of the balls-into-bins problem whose worst case can be determined using Karush Kuhn Tucker (KKT) conditions. We believe our analysis may be of independent interest.
We also study a related problem that asks for the following sketching algorithm: with given parameters and , the algorithm takes a subset of coordinates of size at most and returns an estimate of the number of unseparated pairs in up to a factor if it is at least . We show that even for constant and success probability, such a sketching algorithm must use bits of space; on the other hand, uniform sampling yields a sketch of size for this purpose.
1 Introduction
Motivation.
In large data sets, it is often important to find a small subset of attributes that identifies most of the tuples. Consider tuples, each of which has coordinates (attributes) taking values in some universe . We say a subset of coordinates is a key if every pair of tuples differs in at least one coordinate value in (i.e., uniquely identifies all tuples). Motwani and Xu [14] considered the problem of finding the minimum key of a given data set.
Let denote the collection of all subsets of of size and let denote the data set. Motwani and Xu [14] reduced the minimum key problem to the set cover problem in which the ground set is and each coordinate corresponds to the subset of consisting of the pairs of tuples that differ in their th coordinates. Finding the minimum key of is equivalent to finding the minimum set cover of . The set cover problem admits a greedy approximation with time complexity (where is the cardinality of the ground set and is the number of subsets) [20]. The combination of these two ideas yields an approach that runs in time. (With a careful implementation, described in Appendix B, one can improve the running time to , which still depends on .) For massive data sets (i.e., large ), this approach is, however, costly.
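To make the reduction concrete, the sketch below runs the greedy set-cover routine directly on the ground set of all pairs of tuples; the representation and function names are ours and only illustrate the reduction, not the exact implementation of [14].

```python
from itertools import combinations

def greedy_minimum_key(tuples):
    """Greedy set cover on the ground set of all pairs of tuples: repeatedly
    pick the coordinate that separates the largest number of still-unseparated
    pairs, until every pair is separated."""
    d = len(tuples[0])
    uncovered = set(combinations(range(len(tuples)), 2))
    key = []
    while uncovered:
        best_coord, best_covered = None, set()
        for i in range(d):
            covered = {(a, b) for (a, b) in uncovered
                       if tuples[a][i] != tuples[b][i]}
            if len(covered) > len(best_covered):
                best_coord, best_covered = i, covered
        if not best_covered:   # remaining pairs are identical tuples; no key exists
            break
        key.append(best_coord)
        uncovered -= best_covered
    return key
```

For instance, on the three tuples (1, 0, 0), (1, 1, 0), (0, 1, 1) this sketch returns the two-coordinate key [0, 1], which indeed distinguishes all three tuples.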
Approximate minimum -separation key via minimum set cover.
To this end, Motwani and Xu considered the relaxed problem of finding small -separation keys. We say that a subset of coordinates separates and if they differ in at least one coordinate in . In this problem, with high probability, we want to find a small subset of coordinates that separates at least a fraction of the pairs of tuples such that . In this case, one often refers to as a quasi-identifier with separation ratio . The parameter controls how large is allowed to be compared to the minimum key .
Their idea is to uniformly sample pairs of tuples and solve the set cover problem in which the ground set is and each coordinate corresponds to a subset of that consists of pairs of tuples differing at their th coordinates.
If separates fewer than pairs, the probability that separates all pairs in (or equivalently, is the set cover of the described set cover instance) is at most
by choosing . By appealing to a union bound over all subsets of coordinates, we guarantee that, with probability at least , no such subset of attributes separates all pairs in . (Note that the failure probability can be set to for an arbitrary constant ; this constant is absorbed in the big- notation of the sample complexity.) Therefore, finding a approximation to the aforementioned set cover instance yields an -separation key whose size is at most .
One can attain using the brute-force algorithm, whose running time is (although the size of the ground set is much smaller: instead of ). On the other hand, one can also attain using the greedy set cover algorithm, whose running time is , assuming we can compare two values in constant time. This running time is more manageable as it does not depend on the size of the data set . Note that sampling pairs of tuples can easily be implemented in the streaming model, and the space would be proportional to the number of samples.
In this work, we mainly focus on sample and space bounds. However, we will also address running time and query time whenever appropriate. For this purpose, we make a mild assumption that one can define a total ordering on the values in . This is the case in most natural applications (e.g., numbers, strings); we also assume that comparing two values in takes constant time.
A small tweak to Motwani-Xu’s algorithm.
Our goal is to improve the sample complexity with a small tweak. In particular, we sample tuples uniformly at random, and let be the set of sampled tuples. We then solve the set cover instance in which the ground set is and each coordinate corresponds to the pairs in that it separates.
In other words, the key difference between the two algorithms is that ours samples tuples and uses as the ground set (for the set cover instance), while the approach of [14] samples pairs of tuples of the data set and uses as the ground set.
We show that our approach achieves the same guarantees. In particular, for all , if separates fewer than pairs, the probability that separates all pairs in is at most . While this seems like a small tweak, the analysis is significantly more involved.
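The difference between the two samplers can be summarized in a short sketch; the sample sizes are left as parameters since the analysis below determines how large they must be, and all names are ours.

```python
import random
from itertools import combinations

def ground_set_from_pairs(tuples, num_pairs):
    """In the spirit of [14]: sample pairs of tuples uniformly; the sampled
    pairs themselves form the ground set of the set cover instance."""
    n = len(tuples)
    return [tuple(random.sample(range(n), 2)) for _ in range(num_pairs)]

def ground_set_from_tuples(tuples, num_tuples):
    """In the spirit of our tweak: sample tuples uniformly without replacement
    and use all pairs within the sample as the ground set."""
    sampled = random.sample(range(len(tuples)), num_tuples)
    return list(combinations(sampled, 2))
```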
If we use the greedy set cover algorithm to obtain the approximation factor , this tweak also leads to a better time complexity in comparison to that of Motwani and Xu whose running time is .
-separation key filter.
We first consider a decision problem that captures the essence of our analysis and then describe how this yields the aforementioned improvements to finding a small -separation key. We say is bad if it separates fewer than pairs of tuples. This problem asks for an algorithm that takes and rejects if is bad. Furthermore, the algorithm must accept if separates all pairs (i.e., is a key). If is neither bad nor a key, the algorithm can either accept or reject. The success probability is required to be at least for all choices of . More formally,
Hereinafter, we only consider the above “for all” notion of success (as opposed to the “for each” notion). The query time is the time to compute the answer for a given subset of coordinates .
From the discussion above, it is fairly easy to see that the approach of Motwani and Xu solves this problem using a sample size with query time . Specifically, we reject if it fails to separate any of pairs in and accept otherwise. Our goal is to improve the sample size and query time.
Non-separation estimation.
We also consider a related approximate counting problem. Let be the number of pairs of tuples that are not separated by . Let and be given parameters. The non-separation estimation problem asks for an algorithm that takes a subset of coordinates as input, where . If , the algorithm must return an estimate of such that
Otherwise, the algorithm may output “small”. The algorithm must return correct answers for all queries (where ) with probability at least .
Sketching algorithms and algorithms based on uniform sampling.
A sketch of a data set , with respect to some problem , is a compression of such that given only access to , one can solve the problem on . Often, the primary objective is to minimize the size of .
Given , an algorithm based on uniform sampling is a special case of sketching algorithms in which is a set of items drawn uniformly at random from .
Main results and techniques.
For the -separation key filter problem, our main results are as follows. We first propose a better sketching algorithm based on uniform sampling in terms of sample size and query time. We then establish lower bounds on its sampling complexity. In a certain regime, we show that our strategy is optimal in terms of sampling complexity.
Theorem 1 (Main result 1).
Consider the -separation key filter problem. For algorithms based on uniform sampling, if for some sufficiently large constant , then:
• samples are sufficient and necessary so that ; furthermore, the query time is where is the subset of coordinates being queried.
• samples are necessary so that is a constant.
The analysis of the upper bound is highly non-trivial. At a high level, it uses the Karush–Kuhn–Tucker (KKT) conditions [15, Chapter 12] to identify the worst-case input. With some further arguments, we can apply the birthday problem [13, Page 45] to show that for this worst-case input, a sample size of suffices.
This result leads to the following improvements for the approximate minimum -separation key problem in both sample size and running time.
Proposition 1.
There exists an algorithm based on uniform sampling that solves the approximate minimum -separation key problem with a sample size . Furthermore, if the approximation factor , the running time is .
For the non-separation estimation problem, we show that a sketching algorithm based on uniform sampling is optimal in terms of and up to logarithmic factors.
Theorem 2 (Main result 2).
Consider the non-separation estimation problem; the following holds.
• There exists an algorithm based on uniform sampling that requires a sample size .
• Suppose and is a constant. Such a sketching algorithm must use space.
Note that a sample size of requires bits of space. Hence, for constant , the upper and lower bounds are tight in terms of and up to a logarithmic factor. The upper bound in the theorem above is a direct application of Chernoff and union bounds. The lower bound is based on an encoding argument. Liberty et al. [11] also used an encoding argument to prove a space lower bound for finding frequent itemsets, although the technical details are different.
Further applications.
Motwani and Xu [14] have highlighted multiple applications of this problem, which we will summarize here. One example is using this problem as a fundamental tool to identify various dependencies or keys in noisy data. This problem can also aid in comprehending the risks associated with releasing data, specifically the potential for attacks that may re-identify individuals.
Small quasi-identifiers are crucial information to consider from a privacy perspective because they can be utilized by adversaries to conduct linking attacks. The collection of attribute values may come with a cost for adversaries, leading them to seek a small set of attributes that form a key.
This problem also has applications in data cleaning, such as identifying and removing fuzzy duplicates resulting from spelling mistakes or inconsistent conventions [2, 3]. Moreover, quasi-identifiers are a specific case of approximate functional dependency [8, 16]. The discovery of such quasi-identifiers can be valuable in query optimization and indexing [6].
Related work.
There has been recent work on sketching and streaming algorithms as well as lower bounds for multidimensional data. In this line of work, we want to compute a sketch of the input data set such that given only access to the sketch, we can approximate statistics of the data restricted to some query subset(s) of coordinates ; in this setting, is provided after the sketch has been computed. Some example problems include heavy-hitters and frequent itemset [10, 11], frequency estimation [4], and box-queries [5].
Preliminaries.
We provide some useful tools and facts as well as common notations for the rest of the paper here.
Chernoff bound. We often rely on the following version of Chernoff bound:
Theorem 3 (Chernoff bound [18]).
Let be independent Bernoulli random variables such that . Let and . Then, for all ,
When , we have:
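For reference, a standard multiplicative form of the bound, writing $X$ for the sum and $\mu$ for its mean (the constants may differ slightly from the version in [18]), is:

```latex
\Pr[X \ge (1+\delta)\mu] \le \exp\!\Big(-\frac{\delta^2\mu}{2+\delta}\Big)
\quad \text{for all } \delta > 0,
\qquad
\Pr\big[\,|X-\mu| \ge \delta\mu\,\big] \le 2\exp\!\Big(-\frac{\delta^2\mu}{3}\Big)
\quad \text{for } 0 < \delta \le 1 .
```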
The birthday problem. If one throws balls (people) into bins (birthdays) uniformly at random, one can show that the probability that there is a collision (i.e., there is a bin that contains at least two balls) is approximately . (The uniformity assumption is in fact unnecessary: if the distribution is non-uniform, one can show that the probability of collision only increases [12, 17].) The following can be found in various textbooks and lecture notes (for example, see [13, Page 45]).
Theorem 4 (The birthday problem).
Let be the probability that there is at least one collision if one throws balls into bins uniformly at random. We have:
This implies that for the non-collision probability to be less than , we can set
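For reference, writing $m$ for the number of balls and $n$ for the number of bins, the standard calculation (constants may differ from the statement above) is:

```latex
\Pr[\text{no collision}] \;=\; \prod_{i=1}^{m-1}\Big(1-\frac{i}{n}\Big)
\;\le\; \exp\!\Big(-\frac{m(m-1)}{2n}\Big),
```

so drawing $m = \Theta\big(\sqrt{n\log(1/\delta)}\big)$ balls suffices to push the non-collision probability below $\delta$.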
Notation and assumptions. For convenience, hereinafter, we let be a sufficiently large universal constant. Unless stated otherwise, we may round down to the nearest power of , i.e., where is an integer, so that and are integers. This does not change the complexity as the extra constant is absorbed in the big- notation. Unless stated otherwise, refers to .
Organization.
Section 2 provides the improved upper bound and lower bounds for the -key filter problem in Theorem 1. Section 3 provides the upper and lower bounds for the non-separation estimation problem in Theorem 2. We also present a small experiment demonstrating the efficiency of the new approach in Section 4.
The resulting improvements to the approximate minimum -separation key in Proposition 1 are explained in Appendix B.
All omitted proofs can be found in Appendix C.
2 Improved Upper and Lower Bounds for -Separation Key Filter
2.1 An Improved Upper Bound
Algorithms.
We will prove the first part of Theorem 1 in this subsection. The detailed algorithm is given in Algorithm 1. Our algorithm samples tuples without replacement uniformly at random. For any given , if fails to separate any pair in , then reject ; otherwise, accept. We will show that this algorithm is correct with probability at least .
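A minimal sketch of this filter, assuming the sample is drawn once up front (a hash-based duplicate check is used here for brevity; the comparison-based query described at the end of this subsection uses sorting, and all names are ours):

```python
import random

def draw_sample(tuples, sample_size):
    """Sample tuples uniformly at random without replacement, once up front."""
    return random.sample(tuples, min(sample_size, len(tuples)))

def accepts(sample, coords):
    """Accept the queried subset `coords` iff it separates every pair of
    sampled tuples, i.e. no two sampled tuples share the same projection."""
    seen = set()
    for t in sample:
        projection = tuple(t[i] for i in coords)
        if projection in seen:   # an unseparated pair was sampled: reject
            return False
        seen.add(projection)
    return True
```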
Improvement to minimum -separation key.
Note that if this algorithm for the -separation key filter is correct, the improvement for the minimum -separation key problem in terms of sample size is immediate. The improved running time is less trivial and requires some careful bookkeeping; we defer the discussion to Appendix B.
Analysis.
Let us visualize this problem in terms of auxiliary graphs. Consider a subset of attributes . If fails to separate and then we draw an edge between and . Note that if fails to separate and , and fails to separate and , then also fails to separate and . Therefore, the graph consists of disjoint cliques. The goal is to sample an edge in (or equivalently, sample two tuples not separated by ) if fails to separate fraction of the pairs (in this case, we call bad).
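Concretely, two tuples share an edge exactly when they agree on every coordinate of the queried subset, so the cliques are simply the groups of tuples with the same projection; a small sketch (names are ours):

```python
from collections import Counter

def clique_sizes(tuples, coords):
    """Sizes of the cliques of the auxiliary graph: group tuples by their
    projection onto `coords` (singleton groups are isolated vertices)."""
    groups = Counter(tuple(t[i] for i in coords) for t in tuples)
    return sorted(groups.values(), reverse=True)

def unseparated_pairs(tuples, coords):
    """Number of edges of the auxiliary graph, i.e. pairs not separated by `coords`."""
    return sum(s * (s - 1) // 2 for s in clique_sizes(tuples, coords))
```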
We want to argue that if is bad, then we sample two distinct vertices in the same clique in with probability at least . By an application of the union bound over subsets of attributes, we are able to discard all bad subsets of attributes with probability at least .
We note that if is bad, has at least edges. Thus, the problem can be rephrased as follows. Given a graph of vertices and at least edges that consists of disjoint cliques, what is the smallest number of vertices that one needs to sample uniformly at random such that the induced graph has at least one edge with probability at least ?
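For intuition, this probability can be estimated empirically for any given clique-size profile; the following is a small sanity-check simulation, not part of the analysis (names are ours):

```python
import random

def edge_hit_probability(clique_sizes, sample_size, trials=10_000):
    """Monte Carlo estimate of the probability that `sample_size` vertices,
    drawn uniformly without replacement, contain two vertices from the same
    clique (i.e. the induced subgraph has at least one edge)."""
    labels = [c for c, size in enumerate(clique_sizes) for _ in range(size)]
    hits = 0
    for _ in range(trials):
        drawn = random.sample(labels, sample_size)
        if len(set(drawn)) < len(drawn):   # two sampled vertices share a clique
            hits += 1
    return hits / trials
```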
It remains to show that sampling tuples (or vertices) is sufficient. The key observation is that this problem is fairly similar to the birthday problem. Of course, it is not quite the same, since we want to sample two distinct vertices from cliques of size at least two. We can simply sample without replacement to avoid this issue. However, for a cleaner presentation, our analysis is based on sampling with replacement, and we then relate the two.
There could be at most cliques so we use a vector to denote the cliques’ sizes where zero-entries are empty cliques. Recall that which implies for sufficiently large .
We relax the problem so that can be a non-negative real number (i.e., ). Consider the following set of points such that
(1)
(2)
(3)
We use to denote the set of all that satisfy all three constraints. For convenience, we can think of cliques as colors, and each vertex that we sample uniformly at random is a ball whose color is distributed according to the multinomial distribution where .
We are interested in how large should be such that the probability of having two balls with the same color is at least . In particular, let be the probability that no two balls have the same color (we refer to this event as non-collision) if we draw balls whose colors follow the distribution .
Let us fix and determine the that maximizes the non-collision probability. Without constraint (1), it is easy to see that the non-collision probability is largest when the color distribution is uniform [12, 17]. With constraint (1), one might suspect that this is still true, that is, that the non-collision probability is largest when all non-zero have the same value (i.e., all non-zero entries have the same value where ). This intuition is, however, incorrect; one can see an example in Appendix C.3. In this case, the problem indeed becomes more complicated.
Given the distribution , the probability that we do not have a collision after sampling vertices uniformly at random is
If we sample without replacement, the probability that we do not have a collision is
As a consequence, we have:
The next claim relates and .
Claim 1.
For , we have
Proof.
Note that implies . Thus,
(4)
We will later set ; the condition implies that . We are now interested in maximizing for fixed . Let
We make the following key observation.
Lemma 1.
If , then any optimal satisfies the following: the non-zero entries of take at most two distinct values.
Proof.
Since is being fixed, we drop from for a cleaner presentation. Let
(5)
In fact, is feasible as it satisfies all constraints (1)–(3). We note that any that has fewer than non-zero entries cannot be optimal, since while , as it has non-zero entries and therefore at least one term in must be positive.
Our proof has the KKT conditions, together with the LICQ regularity condition, at its core. We refer readers who are unfamiliar with this technical machinery to Appendix A for full details.
Let be any local optimum. First, we assume that LICQ holds at . Based on the KKT conditions (Theorem 5), there exist some constants , , and such that at any local maximum, we have
(6)
and
(7)
Equations (6) and (7) correspond to the stationarity and complementary slackness conditions (see Appendix A). Each coordinate of the right hand side of Eq. (6) has the following form
(8)
Each coordinate of is as follows:
(9)
Therefore, for each :
(10)
Consider any and . Note that . We have:
(11)
Thus, either or
(12)
If , then consider any other entry . It must be the case that either or . Without loss of generality, suppose . With the same derivation, we have:
(13)
From Eq. (12) and Eq. (13), we have:
(14)
In the last step, we divide both sides by
(15)
This is because we assume that at least entries are non-zero. Thus, all entries in can either take value 0, , or .
It remains to deal with the case where LICQ does not hold at . We denote by and the vectors in which all entries are zeros and ones, respectively; furthermore, denotes the th canonical vector. Let be the (active) set of constraints that hold with equality (see Appendix A and Definition 1 for a formal definition). The active set can contain the following three types of constraints:
1. Constraint (1): . The gradient corresponding to this constraint is .
2. Constraint (2).
3. Constraints (3): . The gradient corresponding to this type of constraint is .
Denote by the set of indices whose components of are zero. If does not contain Constraint (1) and LICQ does not hold, then there exist such that:
This happens if and only if contains all constraints (3), i.e., , which contradicts Constraint (2). Therefore, must contain Constraint (1). By our assumption that LICQ does not hold, there exists a vector such that:
As a consequence, for all . Thus, there is at most one distinct value among non-zero entries of when LICQ does not hold. That concludes our proof. ∎
The next lemma shows that we can still apply the birthday problem for .
Lemma 2.
Proof.
Consider the case where there are two distinct values among the non-zero entries of . Let these two distinct values be and . We say a color is in group if ; otherwise, is in group . Define . We have:
Thus either or . Without loss of generality, suppose . Then
Thus, each ball drawn uniformly at random has a color in group with probability at least .
Let be some sufficiently large constant. If we draw at least balls uniformly at random restricted to group- colors, then by the birthday problem (i.e., Theorem 4), the probability that no two balls share the same color is at most .
Suppose we sample balls uniformly at random. In expectation, the number of balls whose colors are in group in the sample is at least
By Chernoff bound (Theorem 3), the probability that we sample fewer than balls whose colors are in group is at most . Thus, we get two balls with the same color with probability at least
The case of one distinct value can be dealt with similarly by simply choosing to be some arbitrary value and . The same argument remains valid, and this concludes our proof. ∎
Putting it all together.
If we sample tuples without replacement, for each bad subset of coordinates , the probability that we do not sample an edge in is at most
Taking a union bound over at most bad subsets , we have proved that the probability of failing to detect a bad subset of coordinates is at most .
Query time.
Recall that each coordinate takes values in . If has a total ordering, then we can sort the tuples in using comparisons, each of which takes time, and detect duplicate(s). Thus, the query time is .
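The comparison-based query can be sketched as follows, assuming, as stated, a total order on coordinate values (names are ours):

```python
def accepts_by_sorting(sample, coords):
    """Sort the projections of the sampled tuples onto `coords` and look for
    an adjacent duplicate; a duplicate is an unseparated pair, so reject."""
    projections = sorted(tuple(t[i] for i in coords) for t in sample)
    return all(a != b for a, b in zip(projections, projections[1:]))
```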
2.2 A lower bound for constant failure probability
In this section, we consider the following question: what is a lower bound on the number of tuples one needs to sample (uniform sampling, with or without replacement) such that the failure probability of Algorithm 1 is smaller than a fixed constant ? To this end, we construct a data set that yields a lower bound for this question.
Lemma 3.
There exists a data set where one needs to sample with replacement (resp. without replacement) at least tuples to reject all bad subsets with probability (resp. ).
Proof sketch.
We provide a proof sketch for the case of sampling with replacement here. The full proof is deferred to Appendix C.1. We consider the data set . In this data set, the following holds:
1. All subsets of size one are bad (i.e., they separate fewer than pairs).
2. Sampling a tuple uniformly at random from is equivalent to sampling each coordinate i.i.d. from the uniform multinomial distribution on .
Let denote the event that a bad subset is detected after samples (i.e., we sample a pair of tuples that are not separated by ). We derive our lower bound by bounding , which is the probability that one can detect all the bad subsets of cardinality one. Let . We have
For , the above can be upper bounded as
2.3 A lower bound for failure probability
For a constant success probability , our analysis leaves a gap between the upper and lower bounds on the sampling complexity (i.e., the upper bound is while the lower bound is ). However, for large success probability (i.e., ), one can actually show that our analysis is tight. In fact, this lower bound holds even if one only needs to tell whether a single coordinate is a -separation key.
Lemma 4.
There exists a data set where one needs to sample without replacement at least to reject a bad subset with probability .
Proof sketch.
We build a data set satisfying the following properties: A) If we sample a tuple uniformly at random, its first coordinate follows the multinomial distribution given in Equation (5) and B) The remaining coordinates can be chosen arbitrarily as long as there exists a key for . For this data set, one can show that:
1. Coordinate is bad.
2. The graph (the auxiliary graph corresponding to the first coordinate) has one cluster of size and clusters of size one.
To detect that is bad with probability , one needs to sample at least two vertices (or tuples) in the largest cluster with the same probability. This requires samples since the probability of sampling a tuple in is . A detailed proof can be found in Appendix C.2. ∎
3 Estimating Non-Separation
In this section, we prove Theorem 2. Specifically, we present space upper and lower bounds for estimating non-separation. The upper bound is based on sampling pairs of tuples uniformly at random.
We then prove a lower bound on the sketch’s size for constant . Note that this implies a lower bound on the sample size since each sample requires bits of space. Therefore, for constant , the simple sketching algorithm based on uniform sampling is tight in terms of and , up to poly-logarithmic factor.
3.1 Upper bound
We give a simple upper bound based on random sampling. Let denote the data set . The algorithm samples pairs of tuples uniformly at random for some sufficiently large constant . For a fixed , we use to denote the number of sampled pairs that fails to separate. That is,
If , output “small”. Otherwise, output
as an estimate of . If fails to separate at most pairs of tuples, appealing to Chernoff bound yields:
If fails to separate at least pairs, we have
The success probability is therefore at least by appealing to a union bound over at most possible choices of .
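A sketch of this estimator; the number of sampled pairs, the scaling, and the cut-off below which we output “small” stand in for the omitted quantities in the text (names are ours):

```python
import random

def sample_pairs(tuples, num_pairs):
    """Draw pairs of distinct tuples uniformly at random, once up front."""
    n = len(tuples)
    return [random.sample(range(n), 2) for _ in range(num_pairs)]

def estimate_unseparated(tuples, pairs, coords, cutoff):
    """Scale the fraction of sampled pairs that `coords` fails to separate up
    to the total number of pairs; report "small" below the cut-off."""
    total_pairs = len(tuples) * (len(tuples) - 1) // 2
    unseparated = sum(
        1 for a, b in pairs
        if all(tuples[a][i] == tuples[b][i] for i in coords)
    )
    estimate = unseparated / len(pairs) * total_pairs
    return "small" if estimate < cutoff else estimate
```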
3.2 Lower bound
We show that even for constant , a valid data sketch must use bits. The proof is based on an encoding argument. Let denote the Hamming distance. We will make use of the following communication problem.
Lemma 5.
Suppose Alice has a bit string of length indexed as a matrix. Each of the columns has exactly one-entries and zero-entries. For Bob to compute a reconstruction of such that with probability at least , the one-way (randomized) communication complexity is bits.
The proof of the above lemma is an adaptation of the proof of the Index problem [1, 9] which we defer to the end of this section. We will later set to obtain the lower bound .
Let and be a matrix whose entries are all ones. Furthermore, let be the canonical vector with the one-entry in the th row. Alice constructs the following data set as below. Here, we make the assumption that ; therefore, .
We will show that Bob can recover , column-by-column using on the described data sketch for non-separation estimation with . Then, appealing to Lemma 5, such a data sketch must have size bits.
Fix a column of . Bob can recover column of as follows. If Bob guesses that rows
in contain all the one-entries, Bob can verify if this guess is good using the estimate where
Here, a guess is good if the reconstruction of corresponding to this guess satisfies .
Note that . Let and be the number of rows in that Bob guesses correctly and incorrectly respectively. That is , and . Note that . The intuition is that for each that is a correct guess, includes another coordinate that separates another pairs; if is a wrong guess, includes another coordinate that separates pairs.
Lemma 6.
We have the following equality regarding the number of unseparated pairs in :
Proof.
Consider the following process. We start with as the coordinate corresponding to column and then add the coordinates for each to one by one. Originally, fails to separate the pairs corresponding to rows that are 1's in column and the pairs corresponding to rows that are 0's in column .
For the first correct , includes a coordinate that separates row from other rows; for the second correct , includes a coordinate that separates row from other rows and so on. Similarly, for the first incorrect , includes a coordinate that separates row from other rows; for the second incorrect , includes a coordinate that separates row from other rows and so on.
Therefore, in the end, the number of pairs that remain unseparated in is
(16)
∎
First, note that . Thus, a correct data sketch must output the required estimate .
Observe that the expression (16) is decreasing for . Suppose ,
On the other hand, if , then
Our objective is to choose such that if the provided estimate , Bob can tell whether his guess is good. To do so, it suffices to choose such that:
Hence, we can set for some sufficiently large constant . Specifically, Bob queries the estimates for at most choices of . If , then he knows that the guess is good; he will use the corresponding reconstruction for column .
Therefore, provided a valid sketch, Bob can correctly compute a reconstruction of such that
with probability at least 3/4. Reparameterizing and gives us the lower bound.
Proof of Lemma 5.
For convenience, let (i.e., the length of the bit string that is given to Alice) and (we want to show that communication is needed).
Recall that Bob wants to compute a reconstruction of such that
We consider the distributional complexity methodology where follows a distribution and communication protocols are deterministic. Here, we simply choose in which in each column, bits are chosen uniformly at random to be 1’s and the remaining bits are set to 0’s. It suffices to show that for any deterministic protocol that uses bits of communication,
By the minimax principle [19], this implies that there exists an input such that any randomized protocol that uses bits of communication will fail to reconstruct as required with probability more than 1/3.
Suppose Bob gets a message from Alice ( is a function of ). Let be his reconstruction based on . Let
be the set of all possible reconstructions by Bob. Note that .
We say Alice’s input is good if there exists a reconstruction such that
Otherwise, we say is bad. Fix a reconstruction . The number of inputs whose Hamming distance from is at most is
(17)
Recall that . For sufficiently large and , expression (17) can be upper bounded as:
Hence, for sufficiently large and , the total number of good inputs is at most
Therefore,
4 Implementation
Despite the involved analysis, the modified algorithm for detecting -separation keys is very simple. We therefore briefly demonstrate the effectiveness of our approach described in Section 2 and compare it to the baseline approach given by Motwani and Xu [14]. We ran our experiments on an M1 Pro processor with 16 gigabytes of unified memory. The code for the experiments is available on GitHub [7].
Description of datasets.
We used two of the data sets that were tested in [14], the adult income data set and the covtype data set, both of which are from the UCI Machine Learning Repository. We also used the 2016 Current Population Survey, which is publicly provided by the US Census. These data sets are representative in terms of data size and number of attributes. For example, the adult data set contains slightly more than 32,000 records with 14 attributes, while the census data contains millions of records with 388 attributes.
Comparison methodology.
We compare the two approaches with and . These are the same tuning parameters that were considered in [14]. We compare our results in terms of the following: (i) sample size, (ii) run-time, and (iii) the percentage of queries on which both algorithms agree on accepting or rejecting a set of attributes. Note that in some cases, even though the two algorithms’ outputs differ, both can be correct. In particular, if a subset is not a perfect key but it separates at least pairs, then it is correct to either accept or reject.
For each data set, we select about random subsets of attributes to query. See Table 1 for the detailed results. At a high level, the two approaches agree on nearly all queries, while ours requires a much smaller sample size.
Dataset | Sample size ([14]) | Sample size (ours) | Time ([14]) | Time (ours) | Agreement %
---|---|---|---|---|---
Adult | 13,000 | 411 | 1.903 sec | 0.208 sec | 95%
Covtype | 55,000 | 1,739 | 188.02 sec | 2.49 sec | 98%
CPS | 372,000 | 11,764 | 790.08 sec | 60.03 sec | 100%
References
- [1] Farid M. Ablayev. Lower bounds for one-way probabilistic communication complexity and their application to space complexity. Theor. Comput. Sci., 157(2):139–159, 1996.
- [2] Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, pages 586–597. Morgan Kaufmann, 2002.
- [3] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD Conference, pages 313–324. ACM, 2003.
- [4] Graham Cormode, Charlie Dickens, and David P. Woodruff. Subspace exploration: Bounds on projected frequency estimation. In PODS, pages 273–284. ACM, 2021.
- [5] Roy Friedman and Rana Shahout. Box queries over multi-dimensional streams. Inf. Syst., 109:102086, 2022.
- [6] Chris M. Giannella, Mehmet M. Dalkilic, Dennis P. Groth, and Edward L. Robertson. Using horizontal-vertical decompositions to improve query evaluation. In Proceedings of the 19th British National Conference on Databases (BNCOD), volume 2405 of Lecture Notes in Computer Science, pages 26–41, 2002.
- [7] Ryan Hildebrant. Github. https://github.com/Ryanhilde/min_set_cover/.
- [8] Jyrki Kivinen and Heikki Mannila. Approximate dependency inference from relations. In ICDT, volume 646 of Lecture Notes in Computer Science, pages 86–98. Springer, 1992.
- [9] Ilan Kremer, Noam Nisan, and Dana Ron. Errata for: "on randomized one-round communication complexity". Comput. Complex., 10(4):314–315, 2001.
- [10] Branislav Kveton, S. Muthukrishnan, Hoa T. Vu, and Yikun Xian. Finding subcube heavy hitters in analytics data streams. In WWW, pages 1705–1714. ACM, 2018.
- [11] Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan R. Ullman. Space lower bounds for itemset frequency sketches. In PODS, pages 441–454. ACM, 2016.
- [12] Terry R McConnell. An inequality related to the birthday problem. Preprint, 2001.
- [13] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Cambridge university press, 1995.
- [14] Rajeev Motwani and Ying Xu. Efficient algorithms for masking and finding quasi-identifiers. In Technical Report, 2008.
- [15] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY, USA, 2e edition, 2006.
- [16] Bernhard Pfahringer and Stefan Kramer. Compression-based evaluation of partial determinations. In KDD, pages 234–239. AAAI Press, 1995.
- [17] J Michael Steele. The Cauchy-Schwarz master class: an introduction to the art of mathematical inequalities. Cambridge University Press, 2004.
- [18] Wikipedia. Chernoff bound — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Chernoff%20bound&oldid=1119845299, 2022. [Online; accessed 18-November-2022].
- [19] Andrew Chi-Chih Yao. Lower bounds by probabilistic arguments (extended abstract). In FOCS, pages 420–428. IEEE Computer Society, 1983.
- [20] Neal E. Young. Greedy set-cover algorithms. In Ming-Yang Kao, editor, Encyclopedia of Algorithms. Springer, 2008.
Appendix A Applying Karush–Kuhn–Tucker (KKT) conditions
In this section, we present the KKT conditions for self-containedness. This classical result is presented as in [15, Chapter 12]. Consider a constrained optimization problem with an objective function , inequality constraints , and equality constraints .
Maximize  (18)
Subject to:
 for .
 for .
Definition 1 (Active set [15]).
The active set of a feasible solution of (18) is the set of indices defined as: . In other words, consists of all equality constraints together with the inequality constraints that hold with equality at .
Definition 2 (LICQ [15]).
Theorem 5 (KKT conditions [15]).
Consider a local minimum of (18), and suppose that and are continuously differentiable and that the LICQ holds at . Then there is a Lagrange multiplier vector with components such that the following conditions are satisfied:
(Primal feasibility for inequality)  (19)
(Primal feasibility for equality)  (20)
(Dual feasibility)  (21)
(Stationarity)  (22)
(Complementary slackness)  (23)
Application to our problem.
Recall that we want to maximize:
Subject to:
The KKT conditions imply that when LICQ holds, there exist constants such that at any local maximum, we have:
Furthermore, according to the complementary slackness condition, for each :
Therefore, if then .
Appendix B Improvements for Approximate Minimum -Separation Key
In this section, we prove Proposition 1. Recall from Section 1 that we treat as the ground set and each coordinate is a set that contains all pairs that it separates. The minimum key separates all pairs in since . Furthermore, in Section 2, we showed that no subset of coordinates that separates fewer than pairs is a set cover of with high probability. Hence, a approximation to this minimum set cover instance yields an -separation key whose size is at most with high probability. This implies the improvement in terms of sample size.
In terms of running time, to achieve the approximation factor , both the algorithm in [14] and ours use the greedy set cover algorithm [20] (described in Algorithm 2).
Let be the output, initialized to . The greedy algorithm, at each step, adds to the coordinate that separates the largest number of currently unseparated pairs. The algorithm stops when all pairs are separated by .
The key difference between the two algorithms is that ours samples tuples and uses as the ground set, while the approach of [14] samples pairs of tuples of the data set and uses as the ground set. Algorithm 2 runs in time, where is the ground set’s size and is the number of sets in the input. Therefore, our algorithm and that of Motwani and Xu [14] yield time complexities and , respectively. However, our algorithm can be refined to a better time complexity . We outline how to achieve this running time as follows.
We can visualize this process in terms of auxiliary graphs introduced in Section 2. Let be the output, initialized to . Originally, consists of one clique that contains all since the empty set does not separate any pair.
As we add more coordinates to the solution, will separate more pairs in and the number of disjoint cliques in increases. We stop when ; in other words, all pairs in are separated by . We will make use of the following procedure.
Partitioning.
Let be a subset of the sampled tuples. For some given , we can partition into based on the th coordinates. The simplest approach is to sort the th coordinates of the tuples in . This takes time.
Let be the disjoint cliques in after steps (i.e., after we have added coordinates to ). Originally, .
At each step , for each coordinate , adding to breaks each clique into new cliques based on the th coordinates. This corresponds to the fact that adding to results in separating more pairs of tuples. We use the procedure to compute . We observe that coordinate would separate
new pairs of tuples in .
Given , can be computed in time by computing , as there can be at most terms in the sum. The algorithm then adds the coordinate with the highest to and updates the cliques accordingly. In other words, we replace with . For each , the time to partition based on the th coordinates is
We need to do this for at most coordinates not in and the process repeats for at most steps. Thus, the algorithm’s running time is
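Putting these pieces together, the partitioning step and the refined greedy loop can be sketched as follows; the lookup-table speed-up of Algorithm 3 is omitted, and all names are ours.

```python
from itertools import groupby

def partition_by_coordinate(group, sample, i):
    """Split `group` (a list of tuple indices) into sub-cliques that agree on
    the i-th coordinate, by sorting on that coordinate (the partitioning
    procedure described above)."""
    ordered = sorted(group, key=lambda t: sample[t][i])
    return [list(g) for _, g in groupby(ordered, key=lambda t: sample[t][i])]

def greedy_separation_key(sample):
    """Greedy set cover over the pairs of sampled tuples, maintaining the
    cliques of the auxiliary graph instead of the pairs themselves."""
    if not sample:
        return []
    d = len(sample[0])
    cliques = [list(range(len(sample)))]        # the empty set separates nothing
    key = []
    while any(len(c) > 1 for c in cliques):
        best_coord, best_gain = None, 0
        for i in range(d):
            if i in key:
                continue
            gain = 0
            for c in cliques:
                if len(c) < 2:
                    continue
                parts = partition_by_coordinate(c, sample, i)
                # Newly separated pairs: pairs inside c minus pairs left inside its parts.
                gain += (len(c) * (len(c) - 1)
                         - sum(len(p) * (len(p) - 1) for p in parts)) // 2
            if gain > best_gain:
                best_coord, best_gain = i, gain
        if best_coord is None:                  # duplicate tuples remain; no coordinate helps
            break
        key.append(best_coord)
        cliques = [p for c in cliques
                   for p in partition_by_coordinate(c, sample, best_coord)]
    return key
```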
If we allow an extra factor in the space usage, we can reduce the running time further by improving the partitioning procedure. For each , the partition step can be done in time (instead of time) using Algorithm 3.
Algorithm 3 is provided with a lookup table and an index . We partition based on the th coordinates into , , . The value is equal to the index of the partition that belongs to. In other words, . Pre-computing involves sorting each of coordinates of tuples in . The running time to compute is therefore
Thus, the overall running time is
Appendix C Omitted Proofs and Examples
C.1 Proof of Lemma 3
Proof.
For technical reasons, we will consider such that where . Our construction of the data set is . Thus, . We choose such that (or equivalently ).
Firstly, we will prove that all subsets of attributes of cardinality one are bad. Indeed, given any singleton set , the auxiliary graph decomposes evenly into cliques whose sizes are . The number of edges (of ) is:
To show , it is sufficient to demonstrate that:
which is true if .
Let denote the event that a bad subset is detected after samples. We derive our lower bound by bounding , which is the probability that one can detect all the bad subsets of cardinality one. To distinguish between the cases of sampling with and without replacement, we denote by and the probability of an event under sampling with and without replacement, respectively.
We first deal with the case of uniform sampling with replacement. In this case, sampling a tuple is equivalent to sampling each coordinate i.i.d. from the uniform multinomial distribution on the set . Hence,
Moreover, we have: . This is not an equality since we need to exclude the cases where a tuple of is sampled more than once. Since , we have:
for . If one chooses (which makes the inequality valid since ), we have:
Finally, one can bound . Thus, the sampling complexity is lower bounded by .
For the case of sampling without replacement, we will use the following observation: let be a sequence of distinct elements in and let be the th element sampled from . We have:
Let be the set of all sequences of distinct tuples of length that remove all bad subsets, and let be the event that all bad subsets are detected (i.e., for each bad subset there exists an unseparated pair in the sample) after drawing samples. We have:
We choose such that . With (as in the case of sampling with replacement), this is equivalent to:
which is true for . If , we have:
Thus, we can derive the same complexity bound for the case of sampling without replacement with failure probability . ∎
C.2 Proof of Lemma 4
Proof.
We choose such that . We construct the data set of tuples as follows.
1. The first coordinate of each of has distinct values. Without loss of generality, we assume the set of distinct values of the first coordinate is . More importantly, there are exactly tuples whose first coordinates equal , and the remaining tuples have distinct first coordinates in the set . We denote by the set of tuples whose first coordinates are and by the set of remaining tuples. Thus, the graph has one big clique of size and isolated vertices.
2. Each of the remaining coordinates can be chosen so that a key exists for .
It is worth noting that is a bad subset of attributes. This is clear since:
Assume we sample times and . The event (i.e., failing to reject ) includes the event where we get exactly distinct elements of . Hence,
Since for all , we further have:
To reject with probability , one needs to have . Our analysis implies:
For , we have:
since we choose such that . This does not hold for . This shows one needs to sample to reject with probability . ∎
C.3 An Example Regarding Lemma 2
Let . We provide an example refuting the intuition that the optimal value that maximizes the non-collision probability must have equal non-zero entries (with values ). This implies that Lemma 2 is necessary.
In particular, let , and . Consider where
Note that .
Then, consider where
Also note that as required. One can check that .