
Towards Better Bounds for Finding Quasi-Identifiers
(Authors are in alphabetical order.)

Ryan Hildebrant (University of California, Irvine; email: [email protected]; part of this work was done while the author was at San Diego State University)    Quoc-Tung Le (Univ Lyon, ENS de Lyon, UCBL, CNRS, Inria, LIP, F-69342, LYON Cedex 07, France; email: [email protected])    Duy-Hoang Ta (National University of Singapore; email: [email protected])    Hoa T. Vu (San Diego State University; email: [email protected])
Abstract

We revisit the problem of finding small $\varepsilon$-separation keys introduced by Motwani and Xu (2008). In this problem, the input is a data set consisting of $m$-dimensional tuples $\{x_1, x_2, \ldots, x_n\}$. The goal is to find a small subset of coordinates that separates at least $(1-\varepsilon)\binom{n}{2}$ pairs of tuples. When $n$ is large, they provided a fast algorithm that runs on $\Theta(m/\varepsilon)$ tuples sampled uniformly at random. We show that the sample size can be improved to $\Theta\left(m/\sqrt{\varepsilon}\right)$. Our algorithm also enjoys a faster running time.

To obtain this result, we consider a decision problem that takes a subset of coordinates $A \subseteq [m]$. It rejects if $A$ separates fewer than $(1-\varepsilon)\binom{n}{2}$ pairs of tuples, and accepts if $A$ separates all $\binom{n}{2}$ pairs of tuples. The algorithm must be correct with probability at least $1-\delta$ for all $2^m$ choices of $A$. We show that for algorithms based on uniform sampling:

  • $\Theta\left(\frac{m}{\sqrt{\varepsilon}}\right)$ samples are sufficient and necessary so that $\delta \leq e^{-m}$.

  • $\Omega\left(\sqrt{\frac{\log m}{\varepsilon}}\right)$ samples are necessary so that $\delta$ is a constant. Closing the gap between the upper and lower bounds in this case remains an open question.

The analysis is based on a constrained version of the balls-into-bins problem whose worst case can be determined using the Karush–Kuhn–Tucker (KKT) conditions. We believe our analysis may be of independent interest.

We also study a related problem that asks for the following sketching algorithm: given parameters $\alpha$, $k$, and $\varepsilon$, the algorithm takes a subset of coordinates $A$ of size at most $k$ and returns an estimate of the number of unseparated pairs in $A$ up to a $(1\pm\varepsilon)$ factor if it is at least $\alpha\binom{n}{2}$. We show that even for constant $\alpha$ and constant success probability, such a sketching algorithm must use $\Omega(mk\log\varepsilon^{-1})$ bits of space; on the other hand, uniform sampling yields a sketch of size $\Theta\left(\frac{mk}{\alpha\varepsilon^{2}}\right)$ for this purpose.

1 Introduction

Motivation.

In large data sets, it is often important to find a small subset of attributes that identifies most of the tuples. Consider $n$ tuples $x_1, x_2, \ldots, x_n \in U^m$, each of which has $m$ coordinates (attributes) in some universe $U$. We say a subset of coordinates $A \subseteq [m]$ is a key if every pair of tuples differs in at least one coordinate value in $A$ (i.e., $A$ uniquely identifies all tuples). Motwani and Xu [14] considered the problem of finding the minimum key $I^{\star}$ of a given data set.

Let $\binom{R}{t}$ denote the collection of all subsets of $R$ of size $t$ and let $X = \{x_1, \ldots, x_n\}$ denote the data set. Motwani and Xu [14] reduced the minimum key problem to the set cover problem where the ground set is $\binom{X}{2}$ and each coordinate $i$ corresponds to the subset of $\binom{X}{2}$ consisting of pairs of tuples that differ at their $i$th coordinate. Finding the minimum key of $X$ is equivalent to finding the minimum set cover of $\binom{X}{2}$. The set cover problem admits a $(\ln N + 1)$-approximation via the greedy algorithm with time complexity $O(NM^2)$ (where $N$ is the cardinality of the ground set and $M$ is the number of subsets) [20]. The combination of these two ideas yields an approach that runs in $\Theta(m^2 n^2)$ time (with a careful implementation, described in Appendix B, one can improve the running time to $O(m^2 n\log n)$, which still depends on $n$). For massive data sets (i.e., large $n$), this approach is, however, costly.
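To make the reduction concrete, the following is a minimal Python sketch of this baseline; the function and variable names are ours, and this is the straightforward $O(m^2 n^2)$-style version rather than the optimized implementation from Appendix B.

    from itertools import combinations

    def greedy_min_key(tuples):
        """Greedy set cover over all pairs: returns a small key (subset of coordinates)."""
        n, m = len(tuples), len(tuples[0])
        # Ground set: all pairs of tuple indices; coordinate i "covers" the pairs it separates.
        uncovered = set(combinations(range(n), 2))
        covers = [{(a, b) for (a, b) in uncovered if tuples[a][i] != tuples[b][i]}
                  for i in range(m)]
        key = []
        while uncovered:
            # Pick the coordinate separating the largest number of still-unseparated pairs.
            i = max(range(m), key=lambda j: len(covers[j] & uncovered))
            gained = covers[i] & uncovered
            if not gained:  # remaining pairs are identical tuples; no key exists
                break
            key.append(i)
            uncovered -= gained
        return key

For instance, greedy_min_key([(1, 0, 'a'), (1, 1, 'a'), (0, 1, 'b')]) returns a small set of coordinate indices separating all three tuples.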

Approximate minimum $\varepsilon$-separation key via minimum set cover.

To this end, Motwani and Xu considered the relaxed problem of finding small $\varepsilon$-separation keys. We say that a subset of coordinates $A$ separates $x_i$ and $x_j$ if they differ in at least one coordinate in $A$. In this problem, with high probability, we want to find a small subset of coordinates $I$ that separates at least a $(1-\varepsilon)$ fraction of the pairs of tuples and satisfies $|I| \leq \gamma|I^{\star}|$. In this case, one often refers to $I$ as a quasi-identifier with separation ratio $1-\varepsilon$. The parameter $\gamma$ controls how large $I$ is allowed to be compared to the minimum key $I^{\star}$.

Their idea is to uniformly sample $\Theta(m/\varepsilon)$ pairs of tuples $R = \{(x_{i_1}, x_{j_1}), (x_{i_2}, x_{j_2}), \ldots\}$ and solve the set cover problem in which the ground set is $R$ and each coordinate $i$ corresponds to the subset of $R$ consisting of pairs of tuples that differ at their $i$th coordinate.

If $A$ separates fewer than $(1-\varepsilon)\binom{n}{2}$ pairs, the probability that $A$ separates all pairs in $R$ (or equivalently, $A$ is a set cover of the described set cover instance) is at most

$(1-\varepsilon)^{|R|} \leq e^{-\varepsilon|R|} \leq e^{-10m}$

by choosing $|R| = 10m/\varepsilon$. By appealing to a union bound over all $2^m$ subsets of coordinates, we guarantee that, with probability at least $1 - e^{-5m}$, no such subset of attributes separates all pairs in $R$ (the failure probability can be set to $e^{-Km}$ for an arbitrary constant $K$; this constant is absorbed in the big-$O$ notation of the sample complexity). Therefore, finding a $\gamma$-approximation to the aforementioned set cover instance yields an $\varepsilon$-separation key whose size is at most $\gamma|I^{\star}|$.

One can attain $\gamma = 1$ using the brute-force algorithm whose running time is $2^{O(m)}$ (but the size of the ground set is much smaller: $O(m/\varepsilon)$ instead of $O(n)$). On the other hand, one can attain $\gamma = O\left(\ln\frac{m}{\varepsilon}\right)$ using the greedy set cover algorithm whose running time is $O(m^3/\varepsilon)$, assuming we can compare two coordinate values in constant time. This running time is more manageable as it does not depend on the size of the data set $n$. Note that sampling pairs of tuples can easily be implemented in the streaming model, and the space would be proportional to the number of samples.

In this work, we mainly focus on sample and space bounds. However, we will also address running time and query time whenever appropriate. For this purpose, we make the mild assumption that one can define a total ordering on the values in $U$; this is the case in most natural applications (e.g., numbers, strings, etc.). We also assume that comparing two values in $U$ takes constant time.

A small tweak to Motwani-Xu’s algorithm.

Our goal is to improve the sample complexity with a small tweak. In particular, we sample $\Theta\left(\frac{m}{\sqrt{\varepsilon}}\right)$ tuples uniformly at random, and let $R$ be the set of sampled tuples. We then solve the set cover instance in which the ground set is $\binom{R}{2}$ and each coordinate corresponds to the pairs in $\binom{R}{2}$ that it separates.

In other words, the key difference between the two algorithms is that ours samples $|R| = \Theta(m/\sqrt{\varepsilon})$ tuples and uses $\binom{R}{2}$ as the ground set (for the set cover instance), while the approach of [14] samples $|R'| = \Theta(m/\varepsilon)$ pairs of tuples of the data set and uses $R'$ as the ground set.

We show that our approach achieves the same guarantees. In particular, for all $A \subseteq [m]$, if $A$ separates fewer than $(1-\varepsilon)\binom{n}{2}$ pairs, the probability that $A$ separates all pairs in $\binom{R}{2}$ is at most $e^{-m}$. While this seems like a small tweak, the analysis is significantly more involved.

If we use the greedy set cover algorithm to obtain the approximation factor $\gamma = O\left(\ln\frac{m}{\varepsilon}\right)$, this tweak also leads to a better time complexity of $O\left(\frac{m^3}{\sqrt{\varepsilon}}\right)$, compared to the $O\left(\frac{m^3}{\varepsilon}\right)$ running time of Motwani and Xu.

$\varepsilon$-separation key filter.

We first consider a decision problem that captures the essence of our analysis and then describe how it yields the aforementioned improvements for finding a small $\varepsilon$-separation key. We say $A$ is bad if it separates fewer than $(1-\varepsilon)\binom{n}{2}$ pairs of tuples. This problem asks for an algorithm that takes $A \subseteq [m]$ and rejects if $A$ is bad. Furthermore, the algorithm must accept if $A$ separates all $\binom{n}{2}$ pairs (i.e., $A$ is a key). If $A$ is neither bad nor a key, the algorithm can either accept or reject. The success probability is required to be at least $1-\delta$ for all $2^m$ choices of $A$. More formally,

$\mathbb{P}\left(\forall A \subseteq [m], \text{ the algorithm is correct on } A\right) \geq 1-\delta.$

Hereinafter, we only consider the above “for all” notion of success (as opposed to the “for each” notion). The query time is the time to compute the answer for a given subset of coordinates $A$.

From the discussion above, it is fairly easy to see that the approach of Motwani and Xu solves this problem using a sample of size $\Theta(m/\varepsilon)$ with query time $O\left(\frac{m|A|}{\varepsilon}\right)$. Specifically, we reject $A$ if it fails to separate any of the $\Theta(m/\varepsilon)$ sampled pairs in $R'$ and accept otherwise. Our goal is to improve the sample size and query time.
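For reference, a minimal Python sketch of this pair-sampling filter might look as follows; the constant 10 and the helper names are illustrative, not from [14].

    import random

    def build_pair_sample(tuples, eps, m, c=10):
        """Sample c*m/eps pairs of distinct tuples uniformly at random (with replacement over pairs)."""
        n = len(tuples)
        num_pairs = int(c * m / eps)
        return [tuple(random.sample(range(n), 2)) for _ in range(num_pairs)]

    def accept_pairs(tuples, pair_sample, A):
        """Accept A iff it separates every sampled pair; query time O(|sample| * |A|)."""
        for a, b in pair_sample:
            if all(tuples[a][i] == tuples[b][i] for i in A):
                return False  # found an unseparated sampled pair: reject A
        return True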

Non-separation estimation.

We also consider a related approximate counting problem. Let $\Gamma_A$ be the number of pairs of tuples that are not separated by $A$. Let $\alpha$, $\varepsilon$, and $k$ be given parameters. The non-separation estimation problem asks for an algorithm that takes as input a subset of coordinates $A \subseteq [m]$ with $|A| \leq k$. If $\Gamma_A \geq \alpha\binom{n}{2}$, the algorithm must return an estimate $\hat{\Gamma}_A$ of $\Gamma_A$ such that

$\hat{\Gamma}_A \in [(1-\varepsilon)\Gamma_A, (1+\varepsilon)\Gamma_A].$

Otherwise, the algorithm may output “small”. The algorithm must return correct answers for all queries $A \subseteq [m]$ (where $|A| \leq k$) with probability at least $1-\delta$.

Sketching algorithms and algorithms based on uniform sampling.

A sketch $S$ of a data set $X$, with respect to some problem $F$, is a compression of $X$ such that, given only access to $S$, one can solve the problem $F$ on $X$. Often, the primary objective is to minimize the size of $S$.

Given $X$, an algorithm based on uniform sampling is a special case of a sketching algorithm in which $S$ is a set of items drawn uniformly at random from $X$.

Main results and techniques.

For the $\varepsilon$-separation key filter problem, our main results are as follows. We first propose a better sketching algorithm based on uniform sampling in terms of sample size and query time. We then establish lower bounds on its sample complexity. In a certain regime, we show that our strategy is optimal in terms of sample complexity.

Theorem 1 (Main result 1).

Consider the $\varepsilon$-separation key filter problem. For algorithms based on uniform sampling, if $n \geq Km/\varepsilon$ for some sufficiently large constant $K$, then:

  • $\Theta\left(\frac{m}{\sqrt{\varepsilon}}\right)$ samples are sufficient and necessary so that $\delta \leq e^{-m}$; furthermore, the query time is $O\left(\frac{m|A|}{\sqrt{\varepsilon}}\log\frac{m}{\varepsilon}\right)$ where $A$ is the subset of coordinates being queried.

  • $\Omega\left(\sqrt{\frac{\log m}{\varepsilon}}\right)$ samples are necessary so that $\delta$ is a constant.

The analysis of the upper bound is highly non-trivial. At a high level, it uses the Karush–Kuhn–Tucker (KKT) conditions [15, Chapter 12] to identify the worst-case input. With some further arguments, we can apply the birthday problem [13, Page 45] to show that, on this worst-case input, a sample of size $\Theta(m/\sqrt{\varepsilon})$ suffices.

This result leads to the following improvements for the approximate minimum $\varepsilon$-separation key problem in both sample size and running time.

Proposition 1.

There exists an algorithm based on uniform sampling that solves the approximate minimum $\varepsilon$-separation key problem with a sample of size $\Theta\left(\frac{m}{\sqrt{\varepsilon}}\right)$. Furthermore, if the approximation factor is $\gamma = O\left(\ln\frac{m}{\varepsilon}\right)$, the running time is $O\left(\frac{m^3}{\sqrt{\varepsilon}}\right)$.

For the non-separation estimation problem, we show that a sketching algorithm based on uniform sampling is optimal in terms of $k$ and $m$ up to logarithmic factors.

Theorem 2 (Main result 2).

Consider the non-separation estimation problem. The following holds.

  • There exists an algorithm based on uniform sampling that requires a sample of size $\Theta\left(\frac{k\log m}{\alpha\varepsilon^{2}}\right)$.

  • Suppose $m \geq k/\sqrt{\varepsilon}$ and $\alpha$ is a constant. Such a sketching algorithm must use $\Omega(mk\log(1/\varepsilon))$ bits of space.

Note that a sample of size $\Theta\left(\frac{k\log m}{\alpha\varepsilon^{2}}\right)$ requires $\Omega\left(\frac{km\log m\log|U|}{\alpha\varepsilon^{2}}\right)$ bits of space. Hence, for constant $\alpha$, the upper and lower bounds are tight in terms of $k$ and $m$ up to logarithmic factors. The upper bound in the theorem above is a direct application of the Chernoff and union bounds. The lower bound is based on an encoding argument. Liberty et al. [11] also used an encoding argument to prove a space lower bound for finding frequent itemsets, although the technical details are different.

Further applications.

Motwani and Xu [14] have highlighted multiple applications of this problem, which we will summarize here. One example is using this problem as a fundamental tool to identify various dependencies or keys in noisy data. This problem can also aid in comprehending the risks associated with releasing data, specifically the potential for attacks that may re-identify individuals.

Small quasi-identifiers are crucial information to consider from a privacy perspective because they can be utilized by adversaries to conduct linking attacks. The collection of attribute values may come with a cost for adversaries, leading them to seek a small set of attributes that form a key.

This problem also has applications in data cleaning, such as identifying and removing fuzzy duplicates resulting from spelling mistakes or inconsistent conventions [2, 3]. Moreover, quasi-identifiers are a specific case of approximate functional dependency [8, 16]. The discovery of such quasi-identifiers can be valuable in query optimization and indexing [6].

Related work.

There has been recent work on sketching and streaming algorithms, as well as lower bounds, for multidimensional data. In this line of work, we want to compute a sketch of the input data set such that, given only access to the sketch, we can approximate statistics of the data restricted to some query subset(s) of coordinates $A$; in this setting, $A$ is provided after the sketch has been computed. Some example problems include heavy hitters and frequent itemsets [10, 11], frequency estimation [4], and box queries [5].

Preliminaries.

We provide some useful tools and facts as well as common notations for the rest of the paper here.

Chernoff bound. We often rely on the following version of the Chernoff bound:

Theorem 3 (Chernoff bound [18]).

Let $X_1, X_2, \ldots, X_N$ be independent Bernoulli random variables such that $\mathbb{P}(X_i = 1) = p$. Let $X = \sum_{i=1}^{N} X_i$ and $\mu = \mathbb{E}[X] = pN$. Then, for all $\varepsilon > 0$,

$\mathbb{P}\left(|X - pN| \geq \varepsilon pN\right) \leq 2e^{-\frac{\varepsilon^{2}}{2+\varepsilon}\mu} \implies \mathbb{P}\left(X \leq \frac{pN}{2}\right) \leq 2e^{-0.1pN}.$

When $\varepsilon \geq 2$, we have:

$\mathbb{P}\left(|X - pN| \geq \varepsilon pN\right) \leq 2e^{-\frac{\varepsilon}{2}\mu}.$

The birthday problem. If one throws $q$ balls (people) into $N$ bins (birthdays) uniformly at random (this uniformity assumption is in fact unnecessary: if the distribution is non-uniform, one can show that the probability of collision only increases [12, 17]), one can show that the probability that there is a collision (i.e., some bin contains at least two balls) is approximately $1 - e^{-\Theta(q^2/N)}$. The following can be found in various textbooks and lecture notes (for example, see [13, Page 45]).

Theorem 4 (The birthday problem).

Let $C(N, q)$ be the probability that there is at least one collision if one throws $q$ balls into $N$ bins uniformly at random. We have:

$C(N, q) \geq 1 - e^{\frac{-q(q-1)}{2N}}.$

This implies that for the non-collision probability to be less than $\delta^{\star}$, we can set

$q \geq \frac{1}{2}\left(1 + \sqrt{8N\log\frac{1}{\delta^{\star}} + 1}\right);$ in particular, $q = 4\sqrt{N\log\frac{1}{\delta^{\star}}}$ suffices.
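For completeness, the stated bound on $q$ is a routine calculation: requiring $e^{-q(q-1)/(2N)} \leq \delta^{\star}$ is equivalent to $q^2 - q - 2N\log\frac{1}{\delta^{\star}} \geq 0$, and solving this quadratic inequality gives $q \geq \frac{1}{2}\left(1 + \sqrt{1 + 8N\log\frac{1}{\delta^{\star}}}\right)$. Since $\frac{1}{2}\left(1 + \sqrt{1 + 8x}\right) \leq 2\sqrt{x} \leq 4\sqrt{x}$ whenever $x \geq 1$, taking $q = 4\sqrt{N\log\frac{1}{\delta^{\star}}}$ suffices once $N\log\frac{1}{\delta^{\star}} \geq 1$.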

Notation and assumptions. For convenience, hereinafter, we let $K$ denote a universal, sufficiently large constant. Unless stated otherwise, we may round $\varepsilon$ down to the nearest power of $1/4$, i.e., $\varepsilon = 1/4^t$ where $t$ is an integer, so that $1/\varepsilon$ and $1/\sqrt{\varepsilon}$ are integers. This does not change the complexity as the extra constant is absorbed in the big-$O$ notation. Unless stated otherwise, $\log n$ refers to $\log_2 n$.

Organization.

Section 2 provides the improved upper bound and the lower bounds for the $\varepsilon$-key filter problem stated in Theorem 1. Section 3 provides the upper and lower bounds for the non-separation estimation problem stated in Theorem 2. We also present a small experiment demonstrating the efficiency of the new approach in Section 4.

The resulting improvements for the approximate minimum $\varepsilon$-separation key problem stated in Proposition 1 are explained in Appendix B.

All omitted proofs can be found in Appendix C.

2 Improved Upper and Lower Bounds for $\varepsilon$-Separation Key Filter

2.1 An Improved Upper Bound

Algorithms.

We prove the first part of Theorem 1 in this subsection. The detailed algorithm is given in Algorithm 1. Our algorithm samples $|R| = \Theta(m/\sqrt{\varepsilon})$ tuples uniformly at random without replacement. For any given $A \subseteq [m]$, if $A$ fails to separate some pair in $\binom{R}{2}$, then reject $A$; otherwise, accept. We will show that this algorithm is correct with probability at least $1 - e^{-m}$.

Algorithm 1 An improved sampling-based algorithm for $\varepsilon$-separation key filter
1: Data set $X = \{x_1, \ldots, x_n\}$, $x_i \in U^m$, and $\varepsilon$.
2: Sample $\Theta\left(m/\sqrt{\varepsilon}\right)$ tuples without replacement; call the sample $R$.
3: Given $A \subseteq [m]$, accept $A$ if and only if $A$ separates all $\binom{R}{2}$ pairs of samples.
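A minimal Python sketch of Algorithm 1 follows; the constant in the sample size and the helper names are ours, and the duplicate check hashes the $|A|$-projections, whereas the query-time bound later in this section is stated for a comparison-based check.

    import math
    import random

    def build_sample(tuples, eps, m, c=4):
        """Sample Theta(m/sqrt(eps)) tuples without replacement (constant c is illustrative)."""
        r = min(len(tuples), int(c * m / math.sqrt(eps)))
        return random.sample(tuples, r)

    def accept(sample, A):
        """Accept A iff A separates all pairs of sampled tuples.

        Equivalent check: the projections of the sampled tuples onto A are all distinct.
        """
        projections = [tuple(x[i] for i in A) for x in sample]
        return len(set(projections)) == len(projections)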

Improvement to minimum $\varepsilon$-separation key.

Note that if this algorithm for the $\varepsilon$-separation key filter is correct, the improvement for the minimum $\varepsilon$-separation key problem in terms of sample size is immediate. The improved running time is less trivial and requires some careful bookkeeping; we defer the discussion to Appendix B.

Analysis.

Let us visualize this problem in terms of auxiliary graphs. Consider a subset of attributes $A$. If $A$ fails to separate $x_i$ and $x_j$, we draw an edge between $x_i$ and $x_j$. Note that if $A$ fails to separate $x_i$ and $x_j$, and $A$ fails to separate $x_j$ and $x_l$, then $A$ also fails to separate $x_i$ and $x_l$. Therefore, the graph $G_A$ consists of disjoint cliques. The goal is to sample an edge in $G_A$ (or equivalently, two tuples not separated by $A$) whenever $A$ fails to separate an $\varepsilon$ fraction of the pairs (in this case, we call $A$ bad).

We want to argue that if $A$ is bad, then we sample two distinct vertices in the same clique of $G_A$ with probability at least $1 - e^{-10m}$. By an application of the union bound over the $2^m$ subsets of attributes, we are then able to discard all bad subsets of attributes with probability at least $1 - e^{-m}$.

We note that if $A$ is bad, then $G_A$ has at least $\varepsilon\binom{n}{2}$ edges. Thus, the problem can be rephrased as follows. Given a graph on $n$ vertices with at least $\varepsilon\binom{n}{2}$ edges that consists of disjoint cliques, what is the smallest number of vertices that one needs to sample uniformly at random such that the induced subgraph contains at least one edge with probability at least $1 - e^{-10m}$?

It remains to show that sampling $r = \Theta\left(\frac{m}{\sqrt{\varepsilon}}\right)$ tuples (or vertices) is sufficient. The key observation is that this problem is fairly similar to the birthday problem. Of course, it is not quite the same, since we want to sample two distinct vertices from cliques of size at least two. We could simply sample without replacement to avoid this issue; however, for a cleaner presentation, our analysis is based on sampling with replacement, and we then relate the two.

There can be at most $n$ cliques, so we use a vector $s = (s_1, s_2, \ldots, s_n)$ to denote the cliques' sizes, where zero entries correspond to empty cliques. Recall that $\sum_{i=1}^{n}\frac{s_i(s_i-1)}{2} \geq \varepsilon\binom{n}{2}$, which implies $\sum_{i=1}^{n}s_i^2 \geq \varepsilon n^2/4$ for sufficiently large $n$.
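For completeness, this implication follows from the short calculation $\sum_{i=1}^{n}s_i^2 = \sum_{i=1}^{n}s_i(s_i-1) + \sum_{i=1}^{n}s_i \geq 2\varepsilon\binom{n}{2} + n = \varepsilon n(n-1) + n \geq \varepsilon n^2/2 \geq \varepsilon n^2/4$, where the second-to-last inequality uses $n-1 \geq n/2$ for $n \geq 2$.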

We relax the problem so that $s_i$ can be a non-negative real number (i.e., $s_i \in \mathbb{R}_{\geq 0}$). Consider the following set of points $s \in \mathbb{R}_{\geq 0}^n$ such that

$\sum_{i=1}^{n}s_i^2 \geq \varepsilon n^2/4$   (1)
$\sum_{i=1}^{n}s_i = n$   (2)
$s_i \in \mathbb{R}_{\geq 0}$, for all $i = 1, 2, \ldots, n$.   (3)

We use $\mathcal{P}$ to denote the set of all $s$ that satisfy all three constraints. For convenience, we can think of cliques as colors; each vertex that we sample uniformly at random is a ball whose color is distributed according to the multinomial distribution $\mathcal{D}_s = \left(\frac{s_1}{n}, \frac{s_2}{n}, \ldots, \frac{s_n}{n}\right)$ where $s \in \mathcal{P}$.

We are interested in how large $r$ should be so that the probability of having two balls with the same color is at least $1 - e^{-10m}$. In particular, let $\mathbb{P}_{r,\mathcal{D}_s}\left(\xi\right)$ be the probability that no two balls have the same color (we refer to this event $\xi$ as non-collision) if we draw $r$ balls whose colors follow the distribution $\mathcal{D}_s$.

Let us fix $r$ and determine the $s$ that maximizes the non-collision probability. Without constraint (1), it is easy to see that the non-collision probability is largest when the color distribution is uniform [12, 17]. With constraint (1), one might suspect that this is still true, i.e., that the non-collision probability is largest when all non-zero $s_i$ have the same value (all $1/\varepsilon'$ non-zero entries equal $\varepsilon' n$, where $\varepsilon' = \varepsilon/4$). This intuition is, however, incorrect; one can see an example in Appendix C.3. The problem therefore becomes more complicated.
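To build intuition, one can compare non-collision probabilities of different feasible size vectors empirically. The following Python sketch (purely illustrative and not part of the analysis) estimates the non-collision probability for a given size vector by Monte Carlo simulation.

    import random

    def non_collision_prob(sizes, r, trials=100_000):
        """Estimate the probability that r draws from D_s (colors proportional to sizes)
        are pairwise distinct."""
        n = sum(sizes)
        population = [color for color, s in enumerate(sizes) for _ in range(s)]
        ok = 0
        for _ in range(trials):
            draws = [population[random.randrange(n)] for _ in range(r)]
            ok += len(set(draws)) == len(draws)
        return ok / trials

    # Two feasible size vectors with the same total n = 100:
    # (a) mass split evenly over a few cliques, (b) one big clique plus singletons.
    # print(non_collision_prob([25, 25, 25, 25], r=5))
    # print(non_collision_prob([50] + [1] * 50, r=5))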

Given the distribution $\mathcal{D}_s$, the probability that we do not have a collision after sampling $r$ vertices uniformly at random is

$\mathbb{P}_{r,\mathcal{D}_s}\left(\xi\right) = \frac{r!}{n^r}\sum_{1\leq j_1 < j_2 < \ldots < j_r \leq n} s_{j_1} s_{j_2} \ldots s_{j_r} := \frac{r!}{n^r} f_r(s).$

If we instead sample without replacement, the probability $\mathbb{P}_{r,\mathcal{D}_s,\diamond}\left(\xi\right)$ that we do not have a collision is

$\mathbb{P}_{r,\mathcal{D}_s,\diamond}\left(\xi\right) := \frac{r!}{n(n-1)(n-2)\ldots(n-r+1)} f_r(s).$

As a consequence, we have:

$\mathbb{P}_{r,\mathcal{D}_s,\diamond}\left(\xi\right) = \frac{n^r}{n(n-1)\ldots(n-r+1)}\mathbb{P}_{r,\mathcal{D}_s}\left(\xi\right).$

The next claim relates $\mathbb{P}_{r,\mathcal{D}_s,\diamond}\left(\xi\right)$ and $\mathbb{P}_{r,\mathcal{D}_s}\left(\xi\right)$.

Claim 1.

For $n > \frac{r(r-1)}{m} + r - 1 = \Theta\left(\frac{r^2}{m} + r\right)$, we have

$\mathbb{P}_{r,\mathcal{D}_s,\diamond}\left(\xi\right) < e^m\,\mathbb{P}_{r,\mathcal{D}_s}\left(\xi\right).$
Proof.

Note that $n > \frac{r(r-1)}{m} + r - 1$ implies $\frac{r(r-1)}{n-r+1} < m$. Thus,

$\frac{n^r}{n(n-1)(n-2)\ldots(n-r+1)} \leq \left(\frac{n}{n-r+1}\right)^r = \left(1 + \frac{r-1}{n-r+1}\right)^r \leq e^{\frac{r(r-1)}{n-r+1}} \leq e^m. \qquad (4)\ \qed$

We will later set $r = \Theta\left(\frac{m}{\sqrt{\varepsilon}}\right)$; the condition $n > \frac{r(r-1)}{m} + r - 1$ then amounts to $n \geq \frac{Km}{\varepsilon}$. We are now interested in maximizing $f_r(s)$ for fixed $r$. Let

$s^{\star} = \arg\max_{s\in\mathcal{P}} \mathbb{P}_{r,\mathcal{D}_s}\left(\xi\right) = \arg\max_{s\in\mathcal{P}} f_r(s).$

We make the following key observation.

Lemma 1.

If $4 \leq r \leq 1 + (1-\sqrt{\varepsilon}/2)n$, then the optimizer $s^{\star} = \arg\max_{s\in\mathcal{P}} f_r(s)$ satisfies the following: the non-zero entries of $s^{\star}$ take at most two distinct values.

Proof.

Since $r$ is fixed, we drop $r$ from $f_r$ for a cleaner presentation. Let

$\tilde{s} = (\sqrt{\varepsilon}n/2, \underbrace{1, 1, \ldots, 1}_{(1-\sqrt{\varepsilon}/2)n\text{ times}}, 0, 0, \ldots, 0). \qquad (5)$

In fact, $\tilde{s}$ is feasible as it satisfies all constraints (1)–(3). We note that any $s$ with fewer than $r$ non-zero entries cannot be optimal, since then $f(s) = 0$ while $f(\tilde{s}) > 0$: $\tilde{s}$ has $(1-\sqrt{\varepsilon}/2)n + 1$ non-zero entries, so at least one term in $f(\tilde{s})$ is positive.

Our proof uses the KKT conditions with LICQ regularity condition as its core. We invite readers who are not familiar with this technical theorem to visit Appendix A for full details.

Let $s$ be any local optimum. First, we assume that LICQ holds at $s$. By the KKT conditions (Theorem 5), there exist constants $\mu \geq 0$, $\eta \in \mathbb{R}$, and $\{\lambda_i\}_{i\in[n]} \geq 0$ such that

$\nabla f(s) = \mu\nabla\left(\sum_{i=1}^{n}s_i^2\right) + \sum_{i=1}^{n}\lambda_i\nabla(s_i) + \eta\nabla\left(\sum_{i=1}^{n}s_i\right) \qquad (6)$

and

$s_i > 0 \implies \lambda_i = 0. \qquad (7)$

Equations (6) and (7) correspond to the stationarity and complementary slackness conditions (see Appendix A). Each coordinate of the right-hand side of Eq. (6) has the form

$2\mu s_i + \lambda_i + \eta. \qquad (8)$

Coordinate $i$ of $\nabla f(s)$ is:

$\frac{\partial f(s)}{\partial s_i} = \sum_{\substack{1\leq j_1 < j_2 < \ldots < j_{r-1} \leq n \\ j_1, j_2, \ldots, j_{r-1} \neq i}} s_{j_1} s_{j_2} \ldots s_{j_{r-1}}. \qquad (9)$

Therefore, for each $i = 1, 2, \ldots, n$:

$\sum_{\substack{1\leq j_1 < j_2 < \ldots < j_{r-1} \leq n \\ j_1, j_2, \ldots, j_{r-1} \neq i}} s_{j_1} s_{j_2} \ldots s_{j_{r-1}} - (2\mu s_i + \lambda_i + \eta) = 0. \qquad (10)$

Consider any $s_a > 0$ and $s_b > 0$. Note that $\lambda_a = \lambda_b = 0$. We have:

$\bigg(\sum_{\substack{1\leq j_1 < \ldots < j_{r-1} \leq n \\ j_1, \ldots, j_{r-1} \neq a}} s_{j_1} \ldots s_{j_{r-1}}\bigg) - 2\mu s_a = \bigg(\sum_{\substack{1\leq j_1 < \ldots < j_{r-1} \leq n \\ j_1, \ldots, j_{r-1} \neq b}} s_{j_1} \ldots s_{j_{r-1}}\bigg) - 2\mu s_b \qquad (11)$
$\iff s_b\bigg(\sum_{\substack{1\leq j_1 < \ldots < j_{r-2} \leq n \\ j_1, \ldots, j_{r-2} \neq a,b}} s_{j_1} \ldots s_{j_{r-2}}\bigg) - 2\mu s_a = s_a\bigg(\sum_{\substack{1\leq j_1 < \ldots < j_{r-2} \leq n \\ j_1, \ldots, j_{r-2} \neq a,b}} s_{j_1} \ldots s_{j_{r-2}}\bigg) - 2\mu s_b$
$\iff (s_b - s_a)\sum_{\substack{1\leq j_1 < \ldots < j_{r-2} \leq n \\ j_1, \ldots, j_{r-2} \neq a,b}} s_{j_1} \ldots s_{j_{r-2}} = -2\mu(s_b - s_a).$

Thus, either $s_a = s_b$ or

$\sum_{\substack{1\leq j_1 < \ldots < j_{r-2} \leq n \\ j_1, \ldots, j_{r-2} \neq a,b}} s_{j_1} \ldots s_{j_{r-2}} = -2\mu. \qquad (12)$

If $s_a \neq s_b$, consider any other entry $s_c > 0$. It must be the case that either $s_c \neq s_a$ or $s_c \neq s_b$. Without loss of generality, suppose $s_c \neq s_b$. By the same derivation, we have:

$\sum_{\substack{1\leq j_1 < \ldots < j_{r-2} \leq n \\ j_1, \ldots, j_{r-2} \neq b,c}} s_{j_1} \ldots s_{j_{r-2}} = -2\mu. \qquad (13)$

From Eq. (12) and Eq. (13), we have:

$\sum_{\substack{1\leq j_1 < \ldots < j_{r-2} \leq n \\ j_1, \ldots, j_{r-2} \neq b,c}} s_{j_1} \ldots s_{j_{r-2}} = \sum_{\substack{1\leq j_1 < \ldots < j_{r-2} \leq n \\ j_1, \ldots, j_{r-2} \neq a,b}} s_{j_1} \ldots s_{j_{r-2}} \qquad (14)$
$\implies s_a \sum_{\substack{1\leq j_1 < \ldots < j_{r-3} \leq n \\ j_1, \ldots, j_{r-3} \neq a,b,c}} s_{j_1} \ldots s_{j_{r-3}} = s_c \sum_{\substack{1\leq j_1 < \ldots < j_{r-3} \leq n \\ j_1, \ldots, j_{r-3} \neq a,b,c}} s_{j_1} \ldots s_{j_{r-3}}$
$\implies s_a = s_c.$

In the last step, we divide both sides by

$\sum_{\substack{1\leq j_1 < \ldots < j_{r-3} \leq n \\ j_1, \ldots, j_{r-3} \neq a,b,c}} s_{j_1} \ldots s_{j_{r-3}} > 0. \qquad (15)$

This quantity is positive because at least $r \geq 4$ entries of $s$ are non-zero. Thus, every entry of $s$ takes one of the values $0$, $s_a$, or $s_b$.

It remains to deal with the case where LICQ does not hold at $s$. We denote by $\mathbf{0}$ and $\mathbf{1}$ the vectors whose entries are all zeros and all ones, respectively; furthermore, $\mathbf{1}_i$ denotes the $i$th canonical vector. Let $\mathcal{A}(s)$ be the (active) set of constraints that hold with equality (see Appendix A and Definition 1 for a formal definition). The active set $\mathcal{A}(s)$ can contain the three following types of constraints:

  1. Constraint (1): $\sum_{i=1}^{n}s_i^2 \geq \varepsilon n^2/4$. The gradient corresponding to this constraint is $2s$.

  2. Constraint (2): $\sum_{i=1}^{n}s_i = n$. The gradient corresponding to this constraint is $\mathbf{1}$. Note that constraint (2) must be in $\mathcal{A}(s)$ by definition.

  3. Constraints (3): $s_i \geq 0$. The gradient corresponding to this type of constraint is $\mathbf{1}_i$.

Denote by $A := \{i \mid s_i = 0\}$ the set of indices whose components of $s$ are zero. If $\mathcal{A}(s)$ does not contain Constraint (1), then LICQ failing at $s$ means that there exists $\gamma \in \mathbb{R}^{|A|+1}$, $\gamma \neq 0$, such that:

$\mathbf{0} = \gamma_0\mathbf{1} + \sum_{i\in A}\gamma_i\mathbf{1}_i.$

This happens if and only if $\mathcal{A}(s)$ contains all constraints (3), i.e., $s = \mathbf{0}$, which contradicts Constraint (2). Therefore, $\mathcal{A}(s)$ must contain Constraint (1). By our assumption that LICQ does not hold, there exists a vector $\gamma \in \mathbb{R}^{|A|+1}$ such that:

$2s = \gamma_0\mathbf{1} + \sum_{i\in A}\gamma_i\mathbf{1}_i.$

As a consequence, $s_i = \gamma_0/2$ for all $i \notin A$. Thus, the non-zero entries of $s$ take at most one distinct value when LICQ does not hold. That concludes our proof. ∎

The next lemma shows that we can still apply the birthday problem for $r = \Theta\left(\frac{m}{\sqrt{\varepsilon}}\right)$.

Lemma 2.

Suppose the non-zero entries of $s \in \mathcal{P}$ ($\mathcal{P}$ defined by Constraints (1)–(3)) take at most two distinct values. By drawing $r = \Theta\left(\frac{m}{\sqrt{\varepsilon}}\right)$ balls uniformly at random from $\mathcal{D}_s$, the probability that no two balls have the same color is at most $e^{-20m^2} + e^{-20m}$.

Proof.

Consider the case where there are two distinct values among the non-zero entries of $s$. Let these two distinct values be $a$ and $b$. We say a color $i \in [n]$ is in group $A$ if $s_i = a$; otherwise $i$ is in group $B$. Defining $k_a := |A|$ and $k_b := |B|$, we have:

$k_a a^2 + k_b b^2 \geq \frac{\varepsilon n^2}{4}.$

Thus, either $k_a a^2 \geq \frac{\varepsilon n^2}{8}$ or $k_b b^2 \geq \frac{\varepsilon n^2}{8}$. Without loss of generality, suppose $k_a a^2 \geq \frac{\varepsilon n^2}{8}$. Then

$a \geq \frac{\sqrt{\varepsilon}n}{\sqrt{8k_a}} \iff k_a a \geq \frac{\sqrt{k_a\varepsilon}\,n}{\sqrt{8}}.$

Thus, each ball drawn uniformly at random has a color in group $A$ with probability at least $\frac{\sqrt{\varepsilon k_a}}{\sqrt{8}}$.

Let $K$ be some sufficiently large constant. If we draw at least $Km\sqrt{k_a}$ balls uniformly at random restricted to group-$A$ colors, then by the birthday problem (i.e., Theorem 4), the probability that no two of them share the same color is at most $e^{-20m^2}$.

Suppose we sample $\frac{2\sqrt{8}Km}{\sqrt{\varepsilon}}$ balls uniformly at random. In expectation, the number of balls in the sample whose colors are in group $A$ is at least

$\frac{2\sqrt{8}Km}{\sqrt{\varepsilon}} \times \frac{\sqrt{\varepsilon k_a}}{\sqrt{8}} = 2Km\sqrt{k_a}.$

By the Chernoff bound (Theorem 3), the probability that we sample fewer than $Km\sqrt{k_a}$ balls whose colors are in group $A$ is at most $2e^{-0.1\times 2Km\sqrt{k_a}} \leq e^{-20m}$. Thus, we get two balls with the same color with probability at least

$1 - e^{-20m^2} - e^{-20m}.$

The case of one distinct value can be handled similarly by simply choosing $b$ to be an arbitrary value and $k_b = 0$. The same argument remains valid, and this concludes our proof. ∎

Let $r = \frac{Cm}{\sqrt{\varepsilon}}$ for some sufficiently large constant $C$. From Lemmas 1 and 2, for any distribution $\mathcal{D}_s$,

$\mathbb{P}_{r,\mathcal{D}_s}\left(\xi\right) \leq \mathbb{P}_{r,\mathcal{D}_{s^{\star}}}\left(\xi\right) \leq e^{-20m^2} + e^{-20m}.$

Putting it all together.

If we sample tuples without replacement, then for each bad subset of coordinates $A \subseteq [m]$, the probability that we do not sample an edge of $G_A$ is at most

$e^m\left(e^{-20m^2} + e^{-20m}\right) = e^{-20m^2+m} + e^{-19m} < e^{-10m}.$

Taking a union bound over the at most $2^m$ bad subsets $A \subseteq [m]$, the probability of failing to detect a bad subset of coordinates is at most $2^m e^{-10m} < e^{-m}$.

Query time.

Recall that each coordinate has values in $U$. If $U$ has a total ordering, then we can sort the tuples in $R$ by their projections onto $A$ using $O\left(\frac{m}{\sqrt{\varepsilon}}\log\frac{m}{\varepsilon}\right)$ comparisons, each of which takes $O(|A|)$ time, and then detect duplicates among adjacent entries. Thus, the query time is $O\left(\frac{m|A|}{\sqrt{\varepsilon}}\log\frac{m}{\varepsilon}\right)$.
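A short Python sketch of this comparison-based query (it assumes only a total order on $U$; names are illustrative):

    def accept_sorted(sample, A):
        """Sort the |A|-projections of the sampled tuples and check adjacent entries.

        Uses O(|R| log |R|) comparisons, each costing O(|A|) time.
        """
        proj = sorted(tuple(x[i] for i in A) for x in sample)
        return all(proj[j] != proj[j + 1] for j in range(len(proj) - 1))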

2.2 A lower bound of $\Omega\left(\sqrt{\frac{\log m}{\varepsilon}}\right)$ for constant failure probability

In this section, we consider the following question: what is the minimum number of tuples one needs to sample (uniformly, with or without replacement) so that the failure probability of Algorithm 1 is smaller than a fixed constant $\delta < 1$? To this end, we construct an instance that yields a lower bound of $\Omega\left(\sqrt{\frac{\log m}{\varepsilon}}\right)$ for this question.

Lemma 3.

There exists a data set where one needs to sample with replacement (resp. without replacement) at least $\Omega\left(\sqrt{\frac{\log m}{\varepsilon}}\right)$ tuples to reject all bad subsets $A$ with probability $1/e$ (resp. $2/e$).

Proof sketch.

We provide a proof sketch for the case of sampling with replacement here. The full proof is deferred to Appendix C.1. We consider the data set $\mathcal{D} := \{1, \ldots, \lfloor 1/\varepsilon\rfloor\}^m$. In this data set, the following holds:

  1. All subsets of size one are bad (i.e., they separate fewer than $(1-\varepsilon)\binom{n}{2}$ pairs).

  2. Sampling a tuple uniformly at random from $\mathcal{D}$ is equivalent to sampling each coordinate i.i.d. from the uniform distribution on $\{1, \ldots, \lfloor 1/\varepsilon\rfloor\}$.

Denote by $R_{A,r}$ the event that a bad subset $A$ is detected after $r$ samples (i.e., we sample a pair of tuples that are not separated by $A$). We derive our lower bound by bounding $\mathbb{P}\left(\bigcap_{A:|A|=1}R_{A,r}\right)$, which is the probability that one detects all the bad subsets of cardinality one. Let $q = 1/\varepsilon$. We have

$\mathbb{P}\left(\bigcap_{A:|A|=1}R_{A,r}\right) = \prod_{i=1}^{m}\mathbb{P}\left(R_{\{i\},r}\right) = \mathbb{P}\left(R_{\{1\},r}\right)^m \leq \left(1 - \prod_{i=0}^{r-1}(1 - i/q)\right)^m.$

For $m \leq 2^{1/\varepsilon}$ and $r = \sqrt{\frac{\log m}{\varepsilon}}$, the above can be upper bounded as

$\mathbb{P}\left(\bigcap_{A:|A|=1}R_{A,r}\right) \leq \left(1 - \exp\left(-r^2/q\right)\right)^m \leq \left(1 - 1/m\right)^m \leq 1/e. \qquad\qed$

2.3 A lower bound of $\Omega\left(\frac{m}{\sqrt{\varepsilon}}\right)$ for failure probability $e^{-m}$

For a constant failure probability $\delta$, our analysis leaves a gap between the upper and lower bounds on the sample complexity (i.e., the upper bound is $\Theta\left(\frac{m}{\sqrt{\varepsilon}}\right)$ while the lower bound is $\Omega\left(\sqrt{\frac{\log m}{\varepsilon}}\right)$). However, for large success probability (i.e., $1-e^{-m}$), one can actually show that our analysis is tight. In fact, this lower bound holds even if one only needs to tell whether a single coordinate is an $\varepsilon$-separation key.

Lemma 4.

There exists a data set where one needs to sample without replacement at least $\Omega\left(\frac{m}{\sqrt{\varepsilon}}\right)$ tuples to reject a bad subset $A$ with probability $1-e^{-m}$.

Proof sketch.

We build a data set $\mathcal{D} := \{x_1, \ldots, x_n\}$ satisfying the following properties: (A) if we sample a tuple uniformly at random, its first coordinate follows the multinomial distribution given in Equation (5); and (B) the remaining coordinates can be chosen arbitrarily as long as there exists a key for $\mathcal{D}$. For this data set, one can show that:

  1. Coordinate $\{1\}$ is bad.

  2. The graph $G_{\{1\}}$ (the auxiliary graph corresponding to the first coordinate) has one clique $C$ of size $\Theta(\sqrt{\varepsilon}n)$ and $n - \Theta(\sqrt{\varepsilon}n)$ cliques of size one.

To detect that $\{1\}$ is bad with probability $1-e^{-m}$, one needs to sample at least two vertices (or tuples) in the largest clique $C$ with the same probability. This requires $\Theta(m/\sqrt{\varepsilon})$ samples since the probability of sampling a tuple in $C$ is $\Theta(\sqrt{\varepsilon})$. A detailed proof can be found in Appendix C.2. ∎

3 Estimating Non-Separation

In this section, we prove Theorem 2. Specifically, we present space upper and lower bounds for estimating non-separation. The upper bound is based on sampling $O\left(\frac{k\log m}{\alpha\varepsilon^{2}}\right)$ pairs of tuples uniformly at random.

We then prove a lower bound of $\Omega\left(km\log\frac{1}{\varepsilon}\right)$ on the sketch size for constant $\alpha$. Note that this implies a lower bound of $\Omega\left(k\log\frac{1}{\varepsilon}\right)$ on the sample size, since each sample requires $\Omega(m)$ bits of space. Therefore, for constant $\alpha$, the simple sketching algorithm based on uniform sampling is tight in terms of $k$ and $m$, up to poly-logarithmic factors.

3.1 Upper bound

We give a simple upper bound based on random sampling. Let $\mathcal{D}$ denote the data set $\{x_1, x_2, \ldots, x_n\}$. The algorithm samples $\frac{Kk\log m}{\alpha\varepsilon^{2}}$ pairs of tuples $(y_1, z_1), (y_2, z_2), \ldots \in \binom{\mathcal{D}}{2}$ uniformly at random for some sufficiently large constant $K$. For a fixed $A \subseteq [m]$, we use $D_A$ to denote the number of sampled pairs that $A$ fails to separate. That is,

$D_A := \left|\{(y_i, z_i) \mid y_i \text{ and } z_i \text{ have the same coordinate values in } A\}\right|.$

If $D_A < \frac{Kk\log m}{10\varepsilon^{2}}$, output “small”. Otherwise, output

$\hat{\Gamma}_A = D_A \cdot \frac{\alpha\varepsilon^{2}\binom{n}{2}}{Kk\log m}$

as an estimate of $\Gamma_A$. If $A$ fails to separate at most $(\alpha/100)\binom{n}{2}$ pairs of tuples, appealing to the Chernoff bound yields:

$\mathbb{P}\left(D_A \geq \frac{Kk\log m}{10\varepsilon^{2}}\right) \leq 2e^{-4.5Kk\log m} \leq m^{-100k}.$

If $A$ fails to separate at least $(\alpha/100)\binom{n}{2}$ pairs, we have

$\mathbb{P}\left(D_A \notin (1\pm\varepsilon)\Gamma_A\frac{Kk\log m}{\alpha\varepsilon^{2}\binom{n}{2}}\right) \leq 2\exp\left(-\varepsilon^{2}\frac{Kk\log m}{\alpha\varepsilon^{2}\binom{n}{2}}\Gamma_A\right) \leq m^{-100k}.$

The success probability is therefore at least $1 - m^{-90k}$ by appealing to a union bound over the at most $\binom{m}{1} + \binom{m}{2} + \ldots + \binom{m}{k} \leq m^{k+1}$ possible choices of $A$.
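A minimal Python sketch of this estimator (the constant $K$ and the helper names are illustrative):

    import math
    import random

    def build_pair_sketch(tuples, k, m, alpha, eps, K=20):
        """Sample K*k*log(m)/(alpha*eps^2) uniformly random pairs of distinct tuples."""
        n = len(tuples)
        num = int(K * k * math.log(m) / (alpha * eps ** 2))
        pairs = [tuple(random.sample(range(n), 2)) for _ in range(num)]
        return pairs, n

    def estimate_non_separation(tuples, sketch, A, k, m, eps, K=20):
        """Return an estimate of Gamma_A, or 'small' when the sampled count is too low."""
        pairs, n = sketch
        D_A = sum(all(tuples[a][i] == tuples[b][i] for i in A) for a, b in pairs)
        threshold = K * k * math.log(m) / (10 * eps ** 2)
        if D_A < threshold:
            return "small"
        total_pairs = n * (n - 1) / 2
        return D_A * total_pairs / len(pairs)  # rescale the sampled count to all pairs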

3.2 Lower bound

We show that even for constant $\alpha$, a valid data sketch must use $\Omega(mk\log(1/\varepsilon))$ bits. The proof is based on an encoding argument. Let $d_H$ denote the Hamming distance. We will make use of the following communication problem.

Lemma 5.

Suppose Alice has a bit string $C$ of length $ktm$, indexed as a $[kt]\times[m]$ matrix. Each of the $m$ columns has exactly $k$ one-entries and $k(t-1)$ zero-entries. For Bob to compute a reconstruction $\hat{C}$ of $C$ such that $d_H(C, \hat{C}) \leq \frac{|C|}{10t}$ with probability at least $2/3$, the one-way (randomized) communication complexity is $\Omega(km\log t)$ bits.

The proof of the above lemma is an adaptation of the proof for the Index problem [1, 9], which we defer to the end of this section. We will later set $t = \Theta(1/\sqrt{\varepsilon})$ to obtain the lower bound $\Omega(km\log(1/\varepsilon))$.

Let $n = kt$ and let $D$ be an $n\times m$ matrix whose entries are all ones. Furthermore, let $\mathbf{1}_i$ be the $2n\times 1$ canonical vector with its one-entry in the $i$th row. Alice constructs the following $2n\times(m+n)$ data set $M$. Here, we make the assumption that $m \geq \frac{k}{\sqrt{\varepsilon}}$; therefore, $m+n = O(m)$.

$M = \left[\begin{array}{c|c} C & I_n \\ \hline D & 0_{n\times n} \end{array}\right],$

i.e., the left block stacks $C$ on top of $D$, and the right block consists of the canonical columns $\mathbf{1}_1, \mathbf{1}_2, \ldots, \mathbf{1}_n$ (its top half is the $n\times n$ identity matrix $I_n$ and its bottom half is the all-zeros matrix $0_{n\times n}$).

We will show that Bob can recover $C$, column by column, using the described data sketch for non-separation estimation with $\alpha = 1/16$. Then, appealing to Lemma 5, such a data sketch must have size $\Omega\left(mk\log\frac{1}{\varepsilon}\right)$ bits.

Fix a column $c$ of $C$. Bob can recover column $c$ as follows. If Bob guesses that the rows

$R = \{r_1, \ldots, r_k\} \subset [kt]$

in column $c$ contain all the one-entries, he can verify whether this guess is good using the estimate $\hat{\Gamma}_A$ where

$A = \{c, m+r_1, m+r_2, \ldots, m+r_k\}.$

Here, a guess is good if the reconstruction $\hat{c}$ of column $c$ corresponding to this guess satisfies $d_H(\hat{c}, c) \leq |c|/(10t)$.

Note that $|A| = k+1 = O(k)$. Let $u$ and $v$ be the numbers of rows in $R$ that Bob guesses correctly and incorrectly, respectively. That is, $u := |\{r_i : c_{r_i} = 1\}|$ and $v := |\{r_i : c_{r_i} = 0\}|$; note that $v = k - u$. The intuition is that for each $r_i$ that is a correct guess, $A$ includes an additional coordinate that separates roughly $n+k$ additional pairs, while for each incorrect $r_i$, the additional coordinate separates only roughly $n-k$ additional pairs.

Lemma 6.

We have the following equality for the number of unseparated pairs in $A$:

$\Gamma_A = (t^2-t+5/2)k^2 - (t-1/2)k + u^2 - 3ku.$
Proof.

Consider the following process. Start with $A$ containing only the coordinate corresponding to column $c$; we then add the coordinates $m+r_i$ to $A$ one by one. Initially, $A$ fails to separate $\binom{n+k}{2}$ pairs corresponding to the $n+k$ rows that have value 1 in column $c$, and $\binom{n-k}{2}$ pairs corresponding to the $n-k$ rows that have value 0 in column $c$.

For the first correct $r_{i_1}$, $A$ gains a coordinate that separates row $r_{i_1}$ from $n+k-1$ other rows; for the second correct $r_{i_2}$, $A$ gains a coordinate that separates row $r_{i_2}$ from $n+k-2$ other rows, and so on. Similarly, for the first incorrect $r_{j_1}$, $A$ gains a coordinate that separates row $r_{j_1}$ from $n-k-1$ other rows; for the second incorrect $r_{j_2}$, $A$ gains a coordinate that separates row $r_{j_2}$ from $n-k-2$ other rows, and so on.

Therefore, at the end of this process, the number of pairs that remain unseparated by $A$ is

$\binom{n+k}{2} + \binom{n-k}{2} - \sum_{a=1}^{u}(n+k-a) - \sum_{b=1}^{v}(n-k-b)$
$= \binom{n+k}{2} + \binom{n-k}{2} - u(n+k) - (k-u)(n-k) + \frac{u(u+1)}{2} + \frac{(k-u)(k-u+1)}{2}$
$= \binom{n+k}{2} + \binom{n-k}{2} + u^2 - 3ku + \frac{k^2+k}{2} - k(n-k)$
$= \frac{(n+k)(n+k-1)}{2} + \frac{(n-k)(n-k-1)}{2} + u^2 - 3ku + \frac{k^2+k}{2} - k(n-k)$
$= n^2 + k^2 - n + u^2 - 3ku + \frac{k^2+k}{2} - k(n-k)$
$= n^2 + \frac{5k^2}{2} - n - kn + \frac{k}{2} + u^2 - 3ku$
$= (t^2-t+5/2)k^2 - (t-1/2)k + u^2 - 3ku. \qquad (16)\ \qed$

First, note that $\Gamma_A > \binom{n}{2} > \frac{1}{16}\binom{2n}{2}$. Thus, a correct data sketch must output the required estimate $\hat{\Gamma}_A = (1\pm\varepsilon)\Gamma_A$.

Observe that expression (16) is decreasing in $u$ for $u \leq 3k/2$. Suppose $u \leq 0.9k$; then

$\Gamma_A \geq (t^2-t+5/2+0.9^2-2.7)k^2 - (t-1/2)k = (t^2-t+0.61)k^2 - (t-1/2)k.$

On the other hand, if $u = k$, then

$\Gamma_A \leq (t^2-t+5/2-2)k^2 - (t-1/2)k = (t^2-t+0.5)k^2 - (t-1/2)k.$

Our objective is to choose $t$ such that, given the estimate $\hat{\Gamma}_A = (1\pm\varepsilon)\Gamma_A$, Bob can tell whether his guess is good. To do so, it suffices to choose $t$ such that:

$\frac{t^2-t+0.61}{t^2-t+0.5} > \frac{1+\varepsilon}{1-\varepsilon} \iff \frac{11}{200t^2-200t+111} > \varepsilon.$

Hence, we can set $t = \frac{1}{K\sqrt{\varepsilon}}$ for some sufficiently large constant $K$. Specifically, Bob queries the estimate $\hat{\Gamma}_A$ for at most $\binom{n}{k+1}$ choices of $A$. If $\hat{\Gamma}_A \leq (1+\varepsilon)\left((t^2-t+0.5)k^2 - (t-1/2)k\right)$, then he knows that the guess is good and uses the corresponding reconstruction for column $c$.

Therefore, given a valid sketch, Bob can correctly compute a reconstruction $\hat{C}$ of $C$ such that

$d_H(C, \hat{C}) \leq \frac{|C|}{10t}$

with probability at least $3/4$. Reparameterizing $k \leftarrow k+1$ and $m \leftarrow m+n$ gives us the lower bound.
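Schematically, Bob's decoding of a single column can be written as the following Python loop, assuming oracle access estimate(A) to the sketch's answers (this enumeration is exponential in $k$ and purely illustrative):

    from itertools import combinations

    def decode_column(estimate, c, m, k, t, eps):
        """Try every guess R of k rows; accept the first guess whose estimate falls below the
        threshold separating u = k from u <= 0.9k."""
        threshold = (1 + eps) * ((t * t - t + 0.5) * k * k - (t - 0.5) * k)
        for R in combinations(range(k * t), k):
            A = (c,) + tuple(m + r for r in R)
            if estimate(A) <= threshold:
                return set(R)  # these rows are guessed to be the 1-entries of column c
        return None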

Proof of Lemma 5.

For convenience, let $N = mkt$ (i.e., the length of the bit string $C$ given to Alice) and $L = mk\log t$ (we want to show that $\Omega(L)$ communication is needed).

Recall that Bob wants to compute a reconstruction $\hat{C}$ of $C$ such that

$d_H(\hat{C}, C) \leq \frac{|C|}{10t}.$

We use the distributional complexity methodology, where $C$ follows a distribution $\mathcal{D}$ and communication protocols are deterministic. Here, we simply choose the distribution $\mathcal{D}$ in which, in each column, $k$ bits are chosen uniformly at random to be 1's and the remaining $(t-1)k$ bits are set to 0's. It suffices to show that for any deterministic protocol $P$ that uses $0.001L$ bits of communication,

$\mathbb{P}_{C\sim\mathcal{D}}\left(\text{$P$ fails to reconstruct $C$ as required}\right) > 1/3.$

By the minimax principle [19], this implies that there exists an input $C$ such that any randomized protocol that uses $0.001L$ bits of communication fails to reconstruct $C$ as required with probability more than $1/3$.

Suppose Bob receives a message $Z$ from Alice ($Z$ is a function of $C$). Let $Q(Z)$ be his reconstruction based on $Z$. Let

$\mathcal{A} = \{Q(Z) : Z \in \{0,1\}^{0.001L}\}$

be the set of all possible reconstructions by Bob. Note that $|\mathcal{A}| \leq 2^{0.001L}$.

We say Alice's input $C$ is good if there exists a reconstruction $Q \in \mathcal{A}$ such that

$d_H(Q, C) \leq \frac{N}{10t}.$

Otherwise, we say $C$ is bad. Fix a reconstruction $Q$; the number of inputs $C$ whose Hamming distance from $Q$ is at most $\frac{N}{10t}$ is

$\binom{N}{0} + \binom{N}{1} + \binom{N}{2} + \ldots + \binom{N}{N/(10t)}. \qquad (17)$

Recall that $(a/b)^b \leq \binom{a}{b} \leq (ea/b)^b$. For sufficiently large $k$, $m$, and $t$, expression (17) can be upper bounded by:

$N(10te)^{N/(10t)} = N(10te)^{0.1km} = N2^{0.1km\log(10et)} \leq N2^{0.1km\log t + 0.1km\log(10e)} < N2^{0.11km\log t}.$

Hence, for sufficiently large $k$, $m$, and $t$, the total number of good inputs is at most

$|\mathcal{A}| \cdot N \cdot 2^{0.11km\log t} \leq 2^{\log N + 0.111km\log t} = 2^{\log k + \log m + \log t + 0.111km\log t} < 2^{0.2km\log t}.$

Therefore,

$\mathbb{P}_{C\sim\mathcal{D}}\left(\text{$P$ fails to reconstruct $C$ as required}\right) \geq \mathbb{P}_{C\sim\mathcal{D}}\left(\text{$P$ fails to reconstruct $C$ as required} \mid \text{$C$ is bad}\right) \times \mathbb{P}_{C\sim\mathcal{D}}\left(\text{$C$ is bad}\right) \geq 1 \times \left(1 - \frac{2^{0.2km\log t}}{\binom{tk}{k}^{m}}\right) \geq 1 - \frac{t^{0.2km}}{t^{km}} = 1 - \frac{1}{t^{0.8km}} > 1/3. \qquad\qed$

4 Implementation

The modified algorithm to detect $\varepsilon$-separation keys is very simple, despite the involved analysis. Therefore, we briefly demonstrate the effectiveness of our approach described in Section 2 and compare it to the naive approach given by Motwani and Xu [14]. We ran our experiments on an M1 Pro processor with 16 gigabytes of unified memory. The code for the experiments is available on GitHub [7].

Description of datasets.

We used two of the data sets that were tested in [14], the adult income data set and the covtype data set, both of which are from the UCI Machine Learning Repository. We also used the 2016 Current Population Survey, which is publicly provided by the US Census Bureau. These data sets cover a good range of data sizes and numbers of attributes. For example, the adult data set contains slightly more than 32,000 records with 14 attributes, while the census data contains millions of records with 388 attributes.

Comparison methodology.

We compare the two approaches with $\varepsilon = 0.001$ and $\delta = 0.01$. These are the same tuning parameters that were considered in [14]. We compare our results in terms of the following: (i) sample size, (ii) running time, and (iii) the percentage of queries on which both algorithms agree on accepting or rejecting a set of attributes. Note that in some cases, even though the two algorithms' outputs differ, both can be correct. In particular, if a subset is not a perfect key but it separates at least $(1-\varepsilon)\binom{n}{2}$ pairs, then it is correct to either accept or reject it.

For each data set, we select about 100 random subsets of attributes to query. See Table 1 for the detailed results. At a high level, the two approaches agree on nearly all queries, while ours requires a much smaller sample size.

Table 1: Sample sizes and average running times across 10 trials; $\star$ and $\star\star$ denote the results of the approach in [14] and of this paper, respectively. S: sample size. T: time. A: agreement.
Dataset S (\star) S (\star\star) T (\star) T (\star\star) A %
Adult 13,000 411 1.903 sec 0.208 sec 95%
Covtype 55,000 1,739 188.02 sec 2.49 sec 98%
CPS 372,000 11,764 790.08 sec 60.03 sec 100%

References

  • [1] Farid M. Ablayev. Lower bounds for one-way probabilistic communication complexity and their application to space complexity. Theor. Comput. Sci., 157(2):139–159, 1996.
  • [2] Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, pages 586–597. Morgan Kaufmann, 2002.
  • [3] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD Conference, pages 313–324. ACM, 2003.
  • [4] Graham Cormode, Charlie Dickens, and David P. Woodruff. Subspace exploration: Bounds on projected frequency estimation. In PODS, pages 273–284. ACM, 2021.
  • [5] Roy Friedman and Rana Shahout. Box queries over multi-dimensional streams. Inf. Syst., 109:102086, 2022.
  • [6] Chris M. Giannella, Mehmet M. Dalkilic, Dennis P. Groth, and Edward L. Robertson. Using horizontal-vertical decompositions to improve query evaluation. In Proceedings of the 19th British National Conference on Databases (BNCOD), volume 2405 of Lecture Notes in Computer Science, pages 26–41. Springer, 2002.
  • [7] Ryan Hildebrant. Github. https://github.com/Ryanhilde/min_set_cover/.
  • [8] Jyrki Kivinen and Heikki Mannila. Approximate dependency inference from relations. In ICDT, volume 646 of Lecture Notes in Computer Science, pages 86–98. Springer, 1992.
  • [9] Ilan Kremer, Noam Nisan, and Dana Ron. Errata for: "On randomized one-round communication complexity". Comput. Complex., 10(4):314–315, 2001.
  • [10] Branislav Kveton, S. Muthukrishnan, Hoa T. Vu, and Yikun Xian. Finding subcube heavy hitters in analytics data streams. In WWW, pages 1705–1714. ACM, 2018.
  • [11] Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan R. Ullman. Space lower bounds for itemset frequency sketches. In PODS, pages 441–454. ACM, 2016.
  • [12] Terry R. McConnell. An inequality related to the birthday problem. Preprint, 2001.
  • [13] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Cambridge University Press, 1995.
  • [14] Rajeev Motwani and Ying Xu. Efficient algorithms for masking and finding quasi-identifiers. Technical report, 2008.
  • [15] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY, USA, 2nd edition, 2006.
  • [16] Bernhard Pfahringer and Stefan Kramer. Compression-based evaluation of partial determinations. In KDD, pages 234–239. AAAI Press, 1995.
  • [17] J. Michael Steele. The Cauchy-Schwarz master class: an introduction to the art of mathematical inequalities. Cambridge University Press, 2004.
  • [18] Wikipedia. Chernoff bound — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Chernoff%20bound&oldid=1119845299, 2022. [Online; accessed 18-November-2022].
  • [19] Andrew Chi-Chih Yao. Lower bounds by probabilistic arguments (extended abstract). In FOCS, pages 420–428. IEEE Computer Society, 1983.
  • [20] Neal E. Young. Greedy set-cover algorithms. In Ming-Yang Kao, editor, Encyclopedia of Algorithms. Springer, 2008.

Appendix A Applying Karush–Kuhn–Tucker (KKT) conditions

In this section, we present the KKT conditions to keep the paper self-contained. This classical result is presented as in [15, Chapter 12]. Consider a constrained optimization problem with an objective function ff, inequality and equality constraints ci,ic_{i},i\in\mathcal{I}\cup\mathcal{E}.

Maximize f(s)\displaystyle f(s) (18)
Subject to:
ci(s)\displaystyle c_{i}(s) 0\displaystyle\geq 0 for ii\in\mathcal{I} .
ci(s)\displaystyle c_{i}(s) =0\displaystyle=0 for ii\in\mathcal{E} .
Definition 1 (Active set [15]).

The active set 𝒜(s)\mathcal{A}(s) of a feasible solution ss of (18) is a set of indices defined as: 𝒜(s):={ici(s)=0}\mathcal{A}(s):=\{i\in\mathcal{I}\mid c_{i}(s)=0\}\cup\mathcal{E}. In other words, 𝒜(s)\mathcal{A}(s) consists of all equality constraints together with those inequality constraints that hold with equality at ss.

Definition 2 (LICQ [15]).

Given a feasible solution ss of (18) and its active set in Definition 1, we say that the linear independence constraint qualification (LICQ) holds at ss if the gradients {ci(s),i𝒜(s)}\{\nabla c_{i}(s),i\in\mathcal{A}(s)\} are linearly independent.

Theorem 5 (KKT conditions [15]).

Let ss^{\star} be a local minimum of (18), where ff and ci,ic_{i},i\in\mathcal{I}\cup\mathcal{E} are continuously differentiable, and suppose that the LICQ holds at ss^{\star}. Then there is a Lagrange multiplier vector λ\lambda^{\star} with components λi,i\lambda_{i}^{\star},i\in\mathcal{I}\cup\mathcal{E} such that the following conditions are satisfied:

ci(s)\displaystyle c_{i}(s^{\star}) 0,i\displaystyle\geq 0,\forall i\in\mathcal{I} (Primal feasibility for inequality) (19)
ci(s)\displaystyle c_{i}(s^{\star}) =0,i\displaystyle=0,\forall i\in\mathcal{E} (Primal feasibility for equality) (20)
λi\displaystyle\lambda^{\star}_{i} 0,i\displaystyle\geq 0,\forall i\in\mathcal{I} (Dual feasibility) (21)
f(s)\displaystyle\nabla f(s^{\star}) =iλici(s)\displaystyle=\sum_{i\in\mathcal{I}\cup\mathcal{E}}\lambda^{\star}_{i}\nabla c_{i}(s^{\star}) (Stationarity) (22)
λici(s)\displaystyle\lambda^{\star}_{i}c_{i}(s^{\star}) =0,i\displaystyle=0,\forall i\in\mathcal{E}\cup\mathcal{I} (Complementary slackness).\displaystyle\text{(Complementary slackness)}~{}. (23)

Application to our problem.

Recall that we want to maximize:

f(s)=1j1<j2<<jrnsj1sj2sjr.\displaystyle f(s)=\sum_{1\leq j_{1}<j_{2}<\ldots<j_{r}\leq n}s_{j_{1}}s_{j_{2}}\ldots s_{j_{r}}~{}.

Subject to:

i=1nsi\displaystyle\sum_{i=1}^{n}s_{i} =n\displaystyle=n
si\displaystyle s_{i} 0, for i=1,2,,n\displaystyle\geq 0\text{, for $i=1,2,\ldots,n$}
i=1nsi2\displaystyle\sum_{i=1}^{n}s_{i}^{2} εn2/4.\displaystyle\geq\varepsilon n^{2}/4~{}.

The KKT conditions imply that when LICQ holds, at any local maximum there exist constants η,{λi}i[n]0,μ0\eta\in\mathbb{R},\{\lambda_{i}\}_{i\in[n]}\geq 0,\mu\geq 0 such that:

f(s)=μ(i=1nsi2)+i=1nλi(si)+η(i=1nsi).\displaystyle\nabla f(s)=\mu\nabla\left(\sum_{i=1}^{n}s_{i}^{2}\right)+\sum_{i=1}^{n}\lambda_{i}\nabla(s_{i})+\eta\nabla\left(\sum_{i=1}^{n}s_{i}\right)~{}.

Furthermore, according to the complementary slackness condition, for each i=1,2,,ni=1,2,\ldots,n:

λisi=0\displaystyle\lambda_{i}s_{i}=0

Therefore, if si>0s_{i}>0 then λi=0\lambda_{i}=0.
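For concreteness, the stationarity condition can be written coordinate-wise; the following is a sketch, where $e_{r-1}(s_{-i})$ is shorthand (introduced only here) for the sum of the products $s_{j_1}\cdots s_{j_{r-1}}$ over indices $j_1<\cdots<j_{r-1}$ all different from $i$:

\[
\frac{\partial f}{\partial s_i}(s) \;=\; e_{r-1}(s_{-i}) \;=\; 2\mu s_i + \lambda_i + \eta, \qquad i=1,2,\ldots,n.
\]

In particular, combining this with complementary slackness, every coordinate with $s_i>0$ satisfies $e_{r-1}(s_{-i}) = 2\mu s_i + \eta$.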

Appendix B Improvements for Approximate Minimum ε\varepsilon-Separation Key

In this section, we prove Proposition 1. Recall from Section 1 that we treat (R2){R\choose 2} as the ground set and each coordinate corresponds to the set of pairs that it separates. The minimum key II^{\star} separates all pairs in (R2){R\choose 2} since (R2)(X2){R\choose 2}\subset{X\choose 2}. Furthermore, in Section 2, we showed that, with high probability, no subset of coordinates that separates fewer than (1ε)(n2)(1-\varepsilon){n\choose 2} pairs is a set cover of (R2){R\choose 2}. Hence, a γ\gamma approximation to this minimum set cover instance yields an ε\varepsilon-separation key whose size is at most γ|I|\gamma|I^{\star}| with high probability. This implies the improvement in terms of sample size.

In terms of running time, to achieve the approximation factor γ=O(lnmε)\gamma=O\left(\ln\frac{m}{\varepsilon}\right), both the algorithm in [14] and ours use the greedy set cover algorithm [20] (described in Algorithm 2).

Let AA be the output, initialized to \emptyset. At each step, the greedy algorithm adds to AA the coordinate that separates the largest number of currently unseparated pairs. The algorithm stops when all pairs are separated by AA.

The key difference between the two algorithms is the following: ours samples R=Θ(m/ε)R=\Theta(m/\sqrt{\varepsilon}) tuples and uses (R2)R\choose 2 as the ground set, while the approach of [14] samples R=Θ(m/ε)R^{\prime}=\Theta(m/\varepsilon) pairs of tuples of the data set and uses RR^{\prime} as the ground set. Algorithm 2 runs in O(M2N)O\left(M^{2}N\right) time where NN is the ground set’s size and MM is the number of sets in the input. Therefore, our algorithm and that of Motwani and Xu [14] yield time complexity O(m4ε)O\left(\frac{m^{4}}{\varepsilon}\right) and O(m3ε)O\left(\frac{m^{3}}{\varepsilon}\right) respectively. However, our algorithm can be refined to a better time complexity O(m3ε)O\left(\frac{m^{3}}{\sqrt{\varepsilon}}\right). We outline how to achieve this running time as follows.

Algorithm 2 Greedy Set Cover Algorithm
1:X={X1,,XN}X=\{X_{1},\ldots,X_{N}\} ground set, S={S1,,SM}S=\{S_{1},\ldots,S_{M}\} collections of subsets of XX.
2:U=XU=X \triangleright UU stores the uncovered elements
3:C=C=\emptyset \triangleright CC stores the indices of sets of the cover
4:while UU\neq\emptyset do
5:     Select SiSS_{i}\in S that maximizes |SiU||S_{i}\cap U|.
6:     UUSiU\leftarrow U\setminus S_{i}.
7:     CC{i}C\leftarrow C\cup\{i\}.
8:end while
9:return CC

We can visualize this process in terms of auxiliary graphs introduced in Section 2. Let AA be the output, initialized to \emptyset. Originally, GAG_{A} consists of one clique CC that contains all xjRx_{j}\in R since the empty set does not separate any pair.

As we add more coordinates to the solution, AA will separate more pairs in (R2){R\choose 2} and the number of disjoint cliques cc in GAG_{A} increases. We stop when c=|R|c=|R|; in other words, all pairs in (R2){R\choose 2} are separated by AA. We will make use of the following procedure.

Partitioning.

Let CRC\subseteq R be a subset of the sampled tuples. For some given ii, we can partition CC into D1,D2,D_{1},D_{2},\ldots based on the iith coordinates. The simplest approach is to sort the iith coordinates of the tuples in CC. This takes O(|C|log|C|)O\left(|C|\log|C|\right) time.
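A minimal Python sketch of this Partition procedure, assuming the tuples are stored as Python tuples:

from itertools import groupby

def partition(C, i):
    # Sort the tuples in C by their i-th coordinate and group equal values;
    # this is the O(|C| log |C|) partitioning step described above.
    C_sorted = sorted(C, key=lambda x: x[i])
    return [list(group) for _, group in groupby(C_sorted, key=lambda x: x[i])]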

Let Cy,1,Cy,2,,C_{y,1},C_{y,2},\ldots, be the disjoint cliques in GAG_{A} after yy steps (i.e., after we have added yy coordinates to AA). Originally, C1,1=RC_{1,1}=R.

At each step yy, for each coordinate kAk\notin A, adding kk to AA breaks each clique Cy,iC_{y,i} into some new cliques D1(i),D2(i),D_{1}^{(i)},D_{2}^{(i)},\ldots based on the kkth coordinates. This corresponds to the fact that adding kk to AA results in AA separating more pairs of tuples. We use the procedure D(i)Partition(Cy,i,k)D^{(i)}\leftarrow Partition(C_{y,i},k) to compute D1(i),D2(i),D_{1}^{(i)},D_{2}^{(i)},\ldots We observe that coordinate kk would separate

gk:=12iab|Da(i)||Db(i)|\displaystyle g_{k}:=\frac{1}{2}\sum_{i}\sum_{a\neq b}{|D_{a}^{(i)}||D_{b}^{(i)}|} =12i[(a|Da(i)|)2a|Da(i)|2]\displaystyle=\frac{1}{2}\sum_{i}\left[\left(\sum_{a}|D_{a}^{(i)}|\right)^{2}-\sum_{a}|D_{a}^{(i)}|^{2}\right]
=12i[|Ci|2a|Da(i)|2].\displaystyle=\frac{1}{2}\sum_{i}\left[|C_{i}|^{2}-\sum_{a}|D_{a}^{(i)}|^{2}\right]~{}.

new pairs of tuples in (R2){R\choose 2}.

Given D(i)D^{(i)}, gkg_{k} can be computed in O(i|Ci|)=O(|R|)O(\sum_{i}|C_{i}|)=O(|R|) time by computing a|Da(i)|2\sum_{a}|D_{a}^{(i)}|^{2}, as there can be at most |Ci||C_{i}| terms in the sum. The algorithm then adds the coordinate kk with the highest gkg_{k} to AA and updates the cliques accordingly; in other words, it replaces each CiC_{i} with D1(i),D2(i),D_{1}^{(i)},D_{2}^{(i)},\ldots For each kAk\notin A, the time to partition based on the kkth coordinates is

O(i|Ci|log|Ci|)=O(|R|log|R|).O\left(\sum_{i}|C_{i}|\log|C_{i}|\right)=O(|R|\log|R|)~{}.

We need to do this for at most mm coordinates not in AA and the process repeats for at most mm steps. Thus, the algorithm’s running time is

O(m2|R|log|R|)=O(m3εlogmε).O\left(m^{2}|R|\log|R|\right)=O\left(\frac{m^{3}}{\sqrt{\varepsilon}}\log\frac{m}{\varepsilon}\right)~{}.
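Putting these pieces together, one greedy step can be sketched in Python as follows (reusing the partition function from the sketch above; gain computes g_k and greedy_step adds the best coordinate and refines the cliques):

def gain(cliques, k):
    # g_k = (1/2) * sum_i ( |C_i|^2 - sum_a |D_a^(i)|^2 ): the number of pairs
    # newly separated by adding coordinate k.
    g = 0
    for C in cliques:
        parts = partition(C, k)
        g += len(C) ** 2 - sum(len(D) ** 2 for D in parts)
    return g // 2  # each summand is even, so the division is exact

def greedy_step(cliques, A, m):
    # Add the coordinate outside A with the largest gain, then split each clique
    # according to that coordinate.
    best_k = max((k for k in range(m) if k not in A), key=lambda k: gain(cliques, k))
    A.add(best_k)
    return [D for C in cliques for D in partition(C, best_k)]

The full procedure starts from cliques = [list(R)] and A = set(), and repeats greedy_step until every clique is a singleton, i.e., c=|R|c=|R|.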

If we allow an extra factor mm in the space usage, we can reduce the running time further by improving the partitioning procedure. For each kAk\notin A, the partition step can be done in O(i|Ci|)=O(|R|)O\left(\sum_{i}|C_{i}|\right)=O(|R|) time (instead of O(|R|log|R|)O(|R|\log|R|) time) using Algorithm 3.

Algorithm 3 is provided with a lookup table Pm×|R|P\in\mathbb{N}^{m\times|R|} and an index kk. We partition RR based on the kkth coordinates into C1(k)C^{(k)}_{1}, C2(k)C^{(k)}_{2}, \ldots. The value P[k,j]P[k,j] is equal to the index of the partition that xjRx_{j}\in R belongs to. In other words, xjCP[k,j](k)x_{j}\in C^{(k)}_{P[k,j]}. Pre-computing PP involves sorting each of the mm coordinates of the tuples in RR. The running time to compute PP is therefore

O(m|R|log|R|)=O(m2εlog(mε)).O(m|R|\log|R|)=O\left(\frac{m^{2}}{\sqrt{\varepsilon}}\log\left(\frac{m}{\varepsilon}\right)\right)~{}.

Thus, the overall running time is

O(m2εlog(mε))+O(m3ε)=O(m3ε).O\left(\frac{m^{2}}{\sqrt{\varepsilon}}\log\left(\frac{m}{\varepsilon}\right)\right)+O\left(\frac{m^{3}}{\sqrt{\varepsilon}}\right)=O\left(\frac{m^{3}}{\sqrt{\varepsilon}}\right)~{}.
Algorithm 3 Partitioning CC based on the kkth coordinates using the look-up table PP
1:A subset CRC\subseteq R, a look-up table PP and an index 1km1\leq k\leq m.
2:Initialize DD as an array of empty lists. \triangleright DD contains D(i)D^{(i)} as an array of lists.
3:Initialize LL as an empty list. \triangleright LL is the list of indices jj such that D[j]D[j] is non-empty
4:for xjCx_{j}\in C do
5:     if D[P[k,j]]=D[P[k,j]]=\emptyset then
6:         L=L{P[k,j]}L=L\cup\{P[k,j]\}
7:     end if
8:     D[P[k,j]]D[P[k,j]]{xj}D[P[k,j]]\leftarrow D[P[k,j]]\cup\{x_{j}\}.
9:end for
10:return {D[j]jL}\{D[j]\mid j\in L\}.
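As an illustration, here is a Python sketch of the pre-computation of PP and of Algorithm 3; storing indices into RR rather than the tuples themselves is an implementation choice of ours:

from collections import defaultdict

def precompute_lookup(R, m):
    # P[k][j] is the index of the group that tuple R[j] falls into when R is
    # partitioned by its k-th coordinate; building P takes O(m |R| log |R|) time.
    P = [[0] * len(R) for _ in range(m)]
    for k in range(m):
        order = sorted(range(len(R)), key=lambda j: R[j][k])
        group = 0
        for pos, j in enumerate(order):
            if pos > 0 and R[j][k] != R[order[pos - 1]][k]:
                group += 1
            P[k][j] = group
    return P

def partition_with_lookup(C, P, k):
    # Algorithm 3: partition the indices in C by coordinate k in O(|C|) time.
    D = defaultdict(list)
    for j in C:
        D[P[k][j]].append(j)
    return list(D.values())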

Appendix C Omitted Proofs and Examples

C.1 Proof of Lemma 3

Proof.

For technical reasons, we will consider ε\varepsilon such that 1/ε=q+1/21/\varepsilon=q+1/2 where qq\in\mathbb{N}. Our construction of the data set 𝒟\mathcal{D} is 𝒟:={1,2,,q}m=[q]m\mathcal{D}:=\{1,2,\ldots,q\}^{m}=[q]^{m}. Thus, n:=|𝒟|=qmn:=|\mathcal{D}|=q^{m}. We choose mm such that m<exp(14(1ε12))m<{\textup{exp}}\left(\frac{1}{4}(\frac{1}{\varepsilon}-\frac{1}{2})\right) (or equivalently logm<q/4\log m<q/4).

Firstly, we will prove that every subset of attributes of cardinality one is bad. Indeed, given any singleton set A={i}A=\{i\} with 1im1\leq i\leq m, the auxiliary graph GA=(VA,EA)G_{A}=(V_{A},E_{A}) decomposes evenly into qq cliques, each of size n/qn/q. The number of edges (of GAG_{A}) is:

|EA|=qn/q(n/q1)2=n(n/q1)2.\displaystyle|E_{A}|=q\frac{n/q(n/q-1)}{2}=\frac{n(n/q-1)}{2}~{}.

To show |EA|>εn(n1)/2|E_{A}|>\varepsilon n(n-1)/2, it is sufficient to demonstrate that:

n/q1>ε(n1)\displaystyle n/q-1>\varepsilon(n-1) nεq1ε>n1\displaystyle\iff\frac{n}{\varepsilon q}-\frac{1}{\varepsilon}>n-1
nq+1/2qq12>n1\displaystyle\iff n\frac{q+1/2}{q}-q-\frac{1}{2}>n-1
n2q>q12\displaystyle\iff\frac{n}{2q}>q-\frac{1}{2}
qm1>2q1\displaystyle\iff q^{m-1}>2q-1

which is true if m3,q>1m\geq 3,q>1.

Denote by RA,rR_{A,r} the event that a bad subset AA is detected after rr samples. We derive our lower bound by bounding (A:|A|=1RA,r)\mathbb{P}\left(\bigcap_{A:|A|=1}R_{A,r}\right), which is the probability that one detects all the bad subsets of cardinality one. To distinguish between sampling with and without replacement, we denote by w(E)\mathbb{P}_{w}\left(E\right) and o(E)\mathbb{P}_{o}\left(E\right) the probability of an event EE under sampling with and without replacement, respectively.

We first deal with the case of uniform sampling with replacement. In this case, sampling a tuple xx is equivalent to sampling each coordinate i.i.d. from the uniform distribution on the set {1,,q}\{1,\ldots,q\}. Hence,

w(A:|A|=1RA,r)=i=1mw(R{i},r)=w(R{i},r)m.\displaystyle\mathbb{P}_{w}\left(\bigcap_{A:|A|=1}R_{A,r}\right)=\prod_{i=1}^{m}\mathbb{P}_{w}\left(R_{\{i\},r}\right)=\mathbb{P}_{w}\left(R_{\{i\},r}\right)^{m}~{}.

Moreover, we have: w(RA,r)1i=0r1(1i/q)\mathbb{P}_{w}\left(R_{A,r}\right)\leq 1-\prod_{i=0}^{r-1}(1-i/q). This is an inequality rather than an equality because we must exclude the cases where the same tuple of 𝒟\mathcal{D} is sampled more than once. Since exp(2x)1x,x1/2{\textup{exp}}\left(-2x\right)\leq 1-x,\forall x\leq 1/2, we have:

w(RA,r)1exp(r(r1)/q)1exp(r2/q).\displaystyle\mathbb{P}_{w}\left(R_{A,r}\right)\leq 1-{\textup{exp}}\left(-r(r-1)/q\right)\leq 1-{\textup{exp}}\left(-r^{2}/q\right)~{}.

for rq/2r\leq q/2. If one chooses r=qlogmr=\sqrt{q\log m} (which makes the inequality exp(2x)1x{\textup{exp}}\left(-2x\right)\leq 1-x valid since logmq/4\log m\leq q/4), we have:

w(RA,r)11m\mathbb{P}_{w}\left(R_{A,r}\right)\leq 1-\frac{1}{m}

Finally, one can bound w(A|A|=1RA,r)=w(RA,r)m(11/m)m1/e\mathbb{P}_{w}\left(\bigcap_{A\mid|A|=1}R_{A,r}\right)=\mathbb{P}_{w}\left(R_{A,r}\right)^{m}\leq(1-1/m)^{m}\leq 1/e. Thus, the sample complexity is lower bounded by Ω(qlogm)=Ω(logmε)\Omega(\sqrt{q\log m})=\Omega\left(\sqrt{\frac{\log m}{\varepsilon}}\right).

For the case of sampling without replacement, we will use the following observation. Let x:=(x1,,xr)x:=(x_{1},\ldots,x_{r}) be a sequence of distinct elements in 𝒟\mathcal{D} and let Xi,1irX_{i},1\leq i\leq r be the iith element sampled from 𝒟\mathcal{D}. We have:

o(Xi=xi,1ir)=1n(n1)(nr+1)\displaystyle\mathbb{P}_{o}\left(X_{i}=x_{i},\forall 1\leq i\leq r\right)=\frac{1}{n(n-1)\ldots(n-r+1)}
w(Xi=xi,1ir)=1nr\displaystyle\mathbb{P}_{w}\left(X_{i}=x_{i},\forall 1\leq i\leq r\right)=\frac{1}{n^{r}}

Let 𝒳\mathcal{X} be the set of all sequences of distinct tuples of length rr that detect all bad subsets and let RR be the event that all bad subsets are detected (i.e., for each bad subset, the sample contains an unseparated pair) after rr samples. We have:

o(R)\displaystyle\mathbb{P}_{o}\left(R\right) =x𝒳o(X=x)\displaystyle=\sum_{x\in\mathcal{X}}\mathbb{P}_{o}\left(X=x\right)
=nrn(n1)(nr+1)x𝒳w(X=x)\displaystyle=\frac{n^{r}}{n(n-1)\ldots(n-r+1)}\sum_{x\in\mathcal{X}}\mathbb{P}_{w}\left(X=x\right)
nrn(n1)(nr+1)w(R)\displaystyle\leq\frac{n^{r}}{n(n-1)\ldots(n-r+1)}\mathbb{P}_{w}\left(R\right)
(4)er(r1)/(nr+1)w(R).\displaystyle\overset{\eqref{eq:boundedterm}}{\leq}e^{r(r-1)/(n-r+1)}\mathbb{P}_{w}\left(R\right).

We choose mm such that nr+11ln2r(r1)n-r+1\geq\frac{1}{\ln 2}r(r-1). With r=qlogmr=\sqrt{q\log m} (as in the case of sampling with replacement), it suffices to have:

qm1ln2qlogm(qlogm+1)+qlogm1\displaystyle q^{m}\geq\frac{1}{\ln 2}\sqrt{q\log m}(\sqrt{q\log m}+1)+\sqrt{q\log m}-1

which is true for q>3,m>2q>3,m>2. If nr+11ln2r(r1)n-r+1\geq\frac{1}{\ln 2}r(r-1), we have:

o(R)2w(R)\mathbb{P}_{o}\left(R\right)\leq 2\mathbb{P}_{w}\left(R\right)

Thus, we can derive the same lower bound Ω(qlogm)=Ω(logmε)\Omega(\sqrt{q\log m})=\Omega\left(\sqrt{\frac{\log m}{\varepsilon}}\right) for the case of sampling without replacement, where the detection succeeds with probability at most 2/e2/e. ∎
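As a numerical sanity check on the with-replacement bound above, the following short script compares the collision probability 1i=0r1(1i/q)1-\prod_{i=0}^{r-1}(1-i/q) at r=qlogmr=\sqrt{q\log m} with the bound 11/m1-1/m; the concrete values of qq and mm are arbitrary illustrative choices:

import math

def collision_prob(q, r):
    # 1 - prod_{i=0}^{r-1} (1 - i/q): probability that r i.i.d. uniform values
    # from a set of size q are not all distinct.
    prod = 1.0
    for i in range(r):
        prod *= 1 - i / q
    return 1 - prod

q, m = 1000, 20
r = int(math.sqrt(q * math.log(m)))  # r = sqrt(q log m); here log m <= q/4
print(collision_prob(q, r))  # roughly 0.77, indeed at most 1 - 1/m
print(1 - 1 / m)             # 0.95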

C.2 Proof of Lemma 4

Proof.

We choose nn such that nm2/εn\gg m^{2}/\varepsilon. We construct the data set 𝒟\mathcal{D} of nn tuples 𝒟:={x1,,xn}\mathcal{D}:=\{x_{1},\ldots,x_{n}\} as follows.

  1. 1.

    The first coordinates of the tuples {xi}i[n]\{x_{i}\}_{i\in[n]} take q:=(12ε)n+1q:=(1-\sqrt{2\varepsilon})n+1 distinct values. Without loss of generality, we assume the set of distinct values of the first coordinate is {1,,q}\{1,\ldots,q\}. More importantly, there are exactly 2εn\sqrt{2\varepsilon}n tuples whose first coordinates equal 11, and the (12ε)n(1-\sqrt{2\varepsilon})n remaining tuples have distinct first coordinates in the set {2,,q}\{2,\ldots,q\}. We denote by SS the set of tuples whose first coordinates are 11 and by S¯={xi1in}S\bar{S}=\{x_{i}\mid 1\leq i\leq n\}\setminus S the set of remaining tuples. Thus, the graph G{1}G_{\{1\}} has one big clique of size 2εn\sqrt{2\varepsilon}n and (12ε)n(1-\sqrt{2\varepsilon})n isolated vertices.

  2. 2.

    Each of the remaining coordinates can be chosen so that a key exists for 𝒟\mathcal{D}.

It is worth noting that A:={1}A:=\{1\} is a bad subset of attributes. This is clear since:

|E(GA)|=2εn(2εn1)2>εn(n1)2.|E(G_{A})|=\frac{\sqrt{2\varepsilon}n(\sqrt{2\varepsilon}n-1)}{2}>\varepsilon\frac{n(n-1)}{2}~{}.

Assume we sample rr times with rn/2r\leq n/2. The event RA,r¯\overline{R_{A,r}} (i.e., failing to reject AA) includes the event RA,r¯\overline{R_{A,r}^{\prime}} where we get exactly rr distinct elements of S¯\bar{S}. Hence,

(RA,r¯)\displaystyle\mathbb{P}(\overline{R_{A,r}}) (RA,r¯)\displaystyle\geq\mathbb{P}\left(\overline{R^{\prime}_{A,r}}\right)
(12ε)nn(12ε)n1n1(12ε)nr+1nr+1\displaystyle\geq\frac{(1-\sqrt{2\varepsilon})n}{n}\frac{(1-\sqrt{2\varepsilon})n-1}{n-1}\ldots\frac{(1-\sqrt{2\varepsilon})n-r+1}{n-r+1}
=i=0r1(12ε2εini)\displaystyle=\prod_{i=0}^{r-1}\left(1-\sqrt{2\varepsilon}-\frac{\sqrt{2\varepsilon}i}{n-i}\right)
i=0r1(12ε2εinr)\displaystyle\geq\prod_{i=0}^{r-1}\left(1-\sqrt{2\varepsilon}-\frac{\sqrt{2\varepsilon}i}{n-r}\right)
i=0r1(12ε22εin).\displaystyle\geq\prod_{i=0}^{r-1}\left(1-\sqrt{2\varepsilon}-\frac{2\sqrt{2\varepsilon}i}{n}\right)~{}.

Since e2x1xe^{-2x}\leq 1-x for all x1/2x\leq 1/2, we further have:

(RA,r¯)i=0r1exp(2(2ε+22εin))\displaystyle\mathbb{P}\left(\overline{R_{A,r}}\right)\geq\prod_{i=0}^{r-1}{\textup{exp}}\left(-2\left(\sqrt{2\varepsilon}+\frac{2\sqrt{2\varepsilon}i}{n}\right)\right)
=\displaystyle= exp(22εr22εr(r1)n).\displaystyle{\textup{exp}}\left(-2\sqrt{2\varepsilon}r-2\sqrt{2\varepsilon}\frac{r(r-1)}{n}\right)~{}.

To reject AA with probability 1exp(m)1-{\textup{exp}}\left(-m\right), one needs to have (RA,r¯)em\mathbb{P}\left(\overline{R_{A,r}}\right)\leq e^{-m}. Our analysis implies:

m2r2ε+22εr(r1)nm\leq 2r\sqrt{2}\sqrt{\varepsilon}+2\sqrt{2\varepsilon}\frac{r(r-1)}{n}

For r=m4εr=\frac{m}{4\sqrt{\varepsilon}}, we have:

m\displaystyle m\leq 22εr+22εr(r1)n\displaystyle 2\sqrt{2\varepsilon}r+2\sqrt{2\varepsilon}\frac{r(r-1)}{n} m2+22εm2+22\displaystyle\leq\frac{m}{\sqrt{2}}+2\sqrt{2\varepsilon}\leq\frac{m}{\sqrt{2}}+2\sqrt{2}

since we choose nn such that r2=Θ(m2/ε)nr^{2}=\Theta(m^{2}/\varepsilon)\ll n. This does not hold for m>4/(21)m>4/(\sqrt{2}-1). This shows one needs to sample Ω(m/ε)\Omega(m/\sqrt{\varepsilon}) tuples to reject AA with probability 1exp(m)1-{\textup{exp}}\left(-m\right). ∎

C.3 An Example Regarding Lemma 2

Let ε=ε/4\varepsilon^{\prime}=\varepsilon/4. We provide an example refuting the intuition that the optimal value sPs^{\star}\in P maximizing the non-collision probability must have equal nonzero entries (with values εn\varepsilon^{\prime}n). This implies that Lemma 2 is necessary.

In particular, let n=40,ε=1/42=0.0625n=40,\varepsilon^{\prime}=1/4^{2}=0.0625, and r=10r=10. Consider s1Ps_{1}\in P where

s1=(2.5,,2.516 times ,0,0,,0).s_{1}=(\underbrace{2.5,\ldots,2.5}_{16\text{ times }},0,0,\ldots,0).

Note that 2.52×16=0.0625×402=εn22.5^{2}\times 16=0.0625\times 40^{2}=\varepsilon^{\prime}n^{2}.

Then, consider s2Ps_{2}\in P where

s2=(10,1,,130 times,0,,0).s_{2}=(10,\underbrace{1,\ldots,1}_{30\text{ times}},0,\ldots,0).

Also note that 102+30×12>0.0625×402=εn210^{2}+30\times 1^{2}>0.0625\times 40^{2}=\varepsilon^{\prime}n^{2} as required. One can check that f(s1)76370239.25<f(s2)=173116515f(s_{1})\approx 76370239.25\ldots<f(s_{2})=173116515.
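The two values of ff can be verified with a short script, since f(s)f(s) is exactly the degree-rr elementary symmetric polynomial of the entries of ss (the helper name below is ours):

def elementary_symmetric(s, r):
    # Standard DP for e_r(s): process the entries one at a time while maintaining
    # e_0, ..., e_r of the prefix processed so far.
    e = [0.0] * (r + 1)
    e[0] = 1.0
    for x in s:
        for j in range(r, 0, -1):
            e[j] += x * e[j - 1]
    return e[r]

s1 = [2.5] * 16 + [0] * 24
s2 = [10] + [1] * 30 + [0] * 9
print(elementary_symmetric(s1, 10))  # approximately 76,370,239.26
print(elementary_symmetric(s2, 10))  # 173,116,515.0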