Towards Better Bounds for Finding Quasi-Identifiers
(Authors are in alphabetical order.)
Abstract
We revisit the problem of finding small -separation keys introduced by Motwani and Xu (2008). In this problem, the input is a data set consisting of -dimensional tuples. The goal is to find a small subset of coordinates that separates at least pairs of tuples. For large data sets, they provided a fast algorithm that runs on tuples sampled uniformly at random. We show that the sample size can be improved to . Our algorithm also enjoys a faster running time.
To obtain this result, we consider a decision problem that takes a subset of coordinates . It rejects if separates fewer than pairs of tuples, and accepts if separates all pairs of tuples. The algorithm must be correct with probability at least for all choices of . We show that for algorithms based on uniform sampling:
• samples are sufficient and necessary so that .
• samples are necessary so that is a constant. Closing the gap between the upper and lower bounds in this case is still an open question.
The analysis is based on a constrained version of the balls-into-bins problem whose worst case can be determined using Karush Kuhn Tucker (KKT) conditions. We believe our analysis may be of independent interest.
We also study a related problem that asks for the following sketching algorithm: with given parameters and , the algorithm takes a subset of coordinates of size at most and returns an estimate of the number of unseparated pairs in up to a factor if it is at least . We show that even for constant and success probability, such a sketching algorithm must use bits of space; on the other hand, uniform sampling yields a sketch of size for this purpose.
1 Introduction
Motivation.
In large data sets, it is often important to find a small subset of attributes that identifies most of the tuples. Consider tuples, each of which has coordinates (attributes) taking values in some universe . We say a subset of coordinates is a key if every pair of tuples differs in at least one coordinate value in (i.e., uniquely identifies all tuples). Motwani and Xu [14] considered the problem of finding the minimum key of a given data set.
Let denote the collection of all subsets of of size and let denote the data set. Motwani and Xu [14] reduced the minimum key problem to the set cover problem in which the ground set is and each coordinate corresponds to the subset of consisting of the pairs of tuples that differ in their th coordinates. Finding the minimum key of is equivalent to finding the minimum set cover of . The set cover problem admits a greedy approximation with time complexity (where is the cardinality of the ground set and is the number of subsets) [20]. The combination of these two ideas yields an approach that runs in time. (With a careful implementation, described in Appendix B, one can improve the running time to , which still depends on .) For massive data sets (i.e., large ), this approach is, however, costly.
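To make the reduction concrete, the sketch below runs the greedy set-cover routine directly on the ground set of all pairs of tuples; the representation and function names are ours and only illustrate the reduction, not the exact implementation of [14].

```python
from itertools import combinations

def greedy_minimum_key(tuples):
    """Greedy set cover on the ground set of all pairs of tuples: repeatedly
    pick the coordinate that separates the largest number of still-unseparated
    pairs, until every pair is separated."""
    d = len(tuples[0])
    uncovered = set(combinations(range(len(tuples)), 2))
    key = []
    while uncovered:
        best_coord, best_covered = None, set()
        for i in range(d):
            covered = {(a, b) for (a, b) in uncovered
                       if tuples[a][i] != tuples[b][i]}
            if len(covered) > len(best_covered):
                best_coord, best_covered = i, covered
        if not best_covered:   # remaining pairs are identical tuples; no key exists
            break
        key.append(best_coord)
        uncovered -= best_covered
    return key
```

For instance, on the three tuples (1, 0, 0), (1, 1, 0), (0, 1, 1) this sketch returns the two-coordinate key [0, 1], which indeed distinguishes all three tuples.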
Approximate minimum -separation key via minimum set cover.
To this end, Motwani and Xu considered the relaxed problem of finding small -separation keys. We say that a subset of coordinates separates and if they differ in at least one coordinate in . In this problem, with high probability, we want to find a small subset of coordinates that separates at least a fraction of the pairs of tuples such that . In this case, one often refers to as a quasi-identifier with separation ratio . The parameter controls how large is allowed to be compared to the minimum key .
Their idea is to uniformly sample pairs of tuples and solve the set cover problem in which the ground set is and each coordinate corresponds to a subset of that consists of pairs of tuples differing at their th coordinates.
If separates fewer than pairs, the probability that separates all pairs in (or equivalently, is the set cover of the described set cover instance) is at most
by choosing . By appealing to a union bound over all subsets of coordinates, we guarantee that, with probability at least , no such subset of attributes separates all pairs in . (Note that the failure probability can be set to for an arbitrary constant ; this constant is absorbed in the big- notation of the sample complexity.) Therefore, finding a approximation to the aforementioned set cover instance yields an -separation key whose size is at most .
One can attain using the brute-force algorithm, whose running time is (although the size of the ground set is much smaller: instead of ). On the other hand, one can also attain using the greedy set cover algorithm, whose running time is , assuming we can compare two values in constant time. This running time is more manageable as it does not depend on the size of the data set . Note that sampling pairs of tuples can easily be implemented in the streaming model, and the space would be proportional to the number of samples.
In this work, we mainly focus on sample and space bounds. However, we will also address running time and query time whenever appropriate. For this purpose, we make a mild assumption that one can define a total ordering on the values in . This is the case in most natural applications (e.g., numbers, strings); we also assume that comparing two values in takes constant time.
A small tweak to Motwani-Xu’s algorithm.
Our goal is to improve the sample complexity with a small tweak. In particular, we sample tuples uniformly at random, and let be the set of sampled tuples. We then solve the set cover instance in which the ground set is and each coordinate corresponds to the pairs in that it separates.
In other words, the key difference between the two algorithms is that ours samples tuples and uses as the ground set (for the set cover instance), while the approach of [14] samples pairs of tuples of the data set and uses as the ground set.
We show that our approach achieves the same guarantees. In particular, for all , if separates fewer than pairs, the probability that separates all pairs in is at most . While this seems like a small tweak, the analysis is significantly more involved.
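The difference between the two samplers can be summarized in a short sketch; the sample sizes are left as parameters since the analysis below determines how large they must be, and all names are ours.

```python
import random
from itertools import combinations

def ground_set_from_pairs(tuples, num_pairs):
    """In the spirit of [14]: sample pairs of tuples uniformly; the sampled
    pairs themselves form the ground set of the set cover instance."""
    n = len(tuples)
    return [tuple(random.sample(range(n), 2)) for _ in range(num_pairs)]

def ground_set_from_tuples(tuples, num_tuples):
    """In the spirit of our tweak: sample tuples uniformly without replacement
    and use all pairs within the sample as the ground set."""
    sampled = random.sample(range(len(tuples)), num_tuples)
    return list(combinations(sampled, 2))
```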
If we use the greedy set cover algorithm to obtain the approximation factor , this tweak also leads to a better time complexity in comparison to that of Motwani and Xu whose running time is .
-separation key filter.
We first consider a decision problem that captures the essence of our analysis and then describe how this yields the aforementioned improvements to finding a small -separation key. We say is bad if it separates fewer than pairs of tuples. This problem asks for an algorithm that takes and rejects if is bad. Furthermore, the algorithm must accept if separates all pairs (i.e., is a key). If is neither bad nor a key, the algorithm can either accept or reject. The success probability is required to be at least for all choices of . More formally,
Hereinafter, we only consider the above “for all” notion of success (as opposed to the “for each” notion). The query time is the time to compute the answer for a given subset of coordinates .
From the discussion above, it is fairly easy to see that the approach of Motwani and Xu solves this problem using a sample size with query time . Specifically, we reject if it fails to separate any of pairs in and accept otherwise. Our goal is to improve the sample size and query time.
Non-separation estimation.
We also consider a related approximate counting problem. Let be the number of pairs of tuples that are not separated by . Let and be given parameters. The non-separation estimation problem asks for an algorithm that takes a subset of coordinates as input, where . If , the algorithm must return an estimate of such that
Otherwise, the algorithm may output “small”. The algorithm must return correct answers for all queries (where ) with probability at least .
Sketching algorithms and algorithms based on uniform sampling.
A sketch of a data set , with respect to some problem , is a compression of such that given only access to , one can solve the problem on . Often, the primary objective is to minimize the size of .
Given , an algorithm based on uniform sampling is a special case of sketching algorithms in which is a set of items drawn uniformly at random from .
Main results and techniques.
For the -separation key filter problem, our main results are as follows. We first propose a better sketching algorithm based on uniform sampling in terms of sample size and query time. We then establish lower bounds on its sampling complexity. In a certain regime, we show that our strategy is optimal in terms of sampling complexity.
Theorem 1 (Main result 1).
Consider the -separation key filter problem. For algorithms based on uniform sampling, if for some sufficiently large constant , then:
• samples are sufficient and necessary so that ; furthermore, the query time is where is the subset of coordinates being queried.
• samples are necessary so that is a constant.
The analysis of the upper bound is highly non-trivial. At a high level, it uses the Karush–Kuhn–Tucker (KKT) conditions [15, Chapter 12] to identify the worst-case input. With some further arguments, we can apply the birthday problem [13, Page 45] to show that for this worst-case input, a sample size of suffices.
This result leads to the following improvements for the approximate minimum -separation key problem in both sample size and running time.
Proposition 1.
There exists an algorithm based on uniform sampling that solves the approximate minimum -separation key problem with a sample size . Furthermore, if the approximation factor , the running time is .
For the non-separation estimation problem, we show that a sketching algorithm based on uniform sampling is optimal in terms of and up to logarithmic factors.
Theorem 2 (Main result 2).
Consider the non-separation estimation problem; the following holds.
• There exists an algorithm based on uniform sampling that requires a sample size .
• Suppose and is a constant. Such a sketching algorithm must use space.
Note that a sample size of requires bits of space. Hence, for constant , the upper and lower bounds are tight in terms of and up to a logarithmic factor. The upper bound in the theorem above is a direct application of Chernoff and union bounds. The lower bound is based on an encoding argument. Liberty et al. [11] also used an encoding argument to prove a space lower bound for finding frequent itemsets, although the technical details are different.
Further applications.
Motwani and Xu [14] have highlighted multiple applications of this problem, which we will summarize here. One example is using this problem as a fundamental tool to identify various dependencies or keys in noisy data. This problem can also aid in comprehending the risks associated with releasing data, specifically the potential for attacks that may re-identify individuals.
Small quasi-identifiers are crucial information to consider from a privacy perspective because they can be utilized by adversaries to conduct linking attacks. The collection of attribute values may come with a cost for adversaries, leading them to seek a small set of attributes that form a key.
This problem also has applications in data cleaning, such as identifying and removing fuzzy duplicates resulting from spelling mistakes or inconsistent conventions [2, 3]. Moreover, quasi-identifiers are a specific case of approximate functional dependency [8, 16]. The discovery of such quasi-identifiers can be valuable in query optimization and indexing [6].
Related work.
There has been recent work on sketching and streaming algorithms as well as lower bounds for multidimensional data. In this line of work, we want to compute a sketch of the input data set such that given only access to the sketch, we can approximate statistics of the data restricted to some query subset(s) of coordinates ; in this setting, is provided after the sketch has been computed. Some example problems include heavy-hitters and frequent itemset [10, 11], frequency estimation [4], and box-queries [5].
Preliminaries.
We provide some useful tools and facts as well as common notations for the rest of the paper here.
Chernoff bound. We often rely on the following version of Chernoff bound:
Theorem 3 (Chernoff bound [18]).
Let be independent Bernoulli random variables such that . Let and . Then, for all ,
When , we have:
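For reference, a standard multiplicative form of the bound, writing $X$ for the sum and $\mu$ for its mean (the constants may differ slightly from the version in [18]), is:

```latex
\Pr[X \ge (1+\delta)\mu] \le \exp\!\Big(-\frac{\delta^2\mu}{2+\delta}\Big)
\quad \text{for all } \delta > 0,
\qquad
\Pr\big[\,|X-\mu| \ge \delta\mu\,\big] \le 2\exp\!\Big(-\frac{\delta^2\mu}{3}\Big)
\quad \text{for } 0 < \delta \le 1 .
```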
The birthday problem. If one throws balls (people) into bins (birthdays) uniformly at random, one can show that the probability that there is a collision (i.e., there is a bin that contains at least two balls) is approximately . (The uniformity assumption is in fact unnecessary: if the distribution is non-uniform, one can show that the probability of collision only increases [12, 17].) The following can be found in various textbooks and lecture notes (for example, see [13, Page 45]).
Theorem 4 (The birthday problem).
Let be the probability that there is at least one collision if one throws balls into bins uniformly at random. We have:
This implies that for the non-collision probability to be less than , we can set
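For reference, writing $m$ for the number of balls and $n$ for the number of bins, the standard calculation (constants may differ from the statement above) is:

```latex
\Pr[\text{no collision}] \;=\; \prod_{i=1}^{m-1}\Big(1-\frac{i}{n}\Big)
\;\le\; \exp\!\Big(-\frac{m(m-1)}{2n}\Big),
```

so drawing $m = \Theta\big(\sqrt{n\log(1/\delta)}\big)$ balls suffices to push the non-collision probability below $\delta$.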
Notation and assumptions. For convenience, hereinafter, we let be a sufficiently large universal constant. Unless stated otherwise, we may round down to the nearest power of , i.e., where is an integer, so that and are integers. This does not change the complexity as the extra constant is absorbed in the big- notation. Unless stated otherwise, refers to .
Organization.
Section 2 provides the improved upper bound and lower bounds for the -key filter problem in Theorem 1. Section 3 provides the upper and lower bounds for the non-separation estimation problem in Theorem 2. We also present a small experiment demonstrating the efficiency of the new approach in Section 4.
The resulting improvements to the approximate minimum -separation key in Proposition 1 are explained in Appendix B.
All omitted proofs can be found in Appendix C.
2 Improved Upper and Lower Bounds for -Separation Key Filter
2.1 An Improved Upper Bound
Algorithms.
We will prove the first part of Theorem 1 in this subsection. The detailed algorithm is given in Algorithm 1. Our algorithm samples tuples without replacement uniformly at random. For any given , if fails to separate any pair in , then reject ; otherwise, accept. We will show that this algorithm is correct with probability at least .
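A minimal sketch of this filter, assuming the sample is drawn once up front (a hash-based duplicate check is used here for brevity; the comparison-based query described at the end of this subsection uses sorting, and all names are ours):

```python
import random

def draw_sample(tuples, sample_size):
    """Sample tuples uniformly at random without replacement, once up front."""
    return random.sample(tuples, min(sample_size, len(tuples)))

def accepts(sample, coords):
    """Accept the queried subset `coords` iff it separates every pair of
    sampled tuples, i.e. no two sampled tuples share the same projection."""
    seen = set()
    for t in sample:
        projection = tuple(t[i] for i in coords)
        if projection in seen:   # an unseparated pair was sampled: reject
            return False
        seen.add(projection)
    return True
```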
Improvement to minimum -separation key.
Note that if this algorithm for the -separation key filter is correct, the improvement for the minimum -separation key problem in terms of sample size is immediate. The improved running time is less trivial and requires some careful bookkeeping; we defer the discussion to Appendix B.
Analysis.
Let us visualize this problem in terms of auxiliary graphs. Consider a subset of attributes . If fails to separate and then we draw an edge between and . Note that if fails to separate and , and fails to separate and , then also fails to separate and . Therefore, the graph consists of disjoint cliques. The goal is to sample an edge in (or equivalently, sample two tuples not separated by ) if fails to separate fraction of the pairs (in this case, we call bad).
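Concretely, two tuples share an edge exactly when they agree on every coordinate of the queried subset, so the cliques are simply the groups of tuples with the same projection; a small sketch (names are ours):

```python
from collections import Counter

def clique_sizes(tuples, coords):
    """Sizes of the cliques of the auxiliary graph: group tuples by their
    projection onto `coords` (singleton groups are isolated vertices)."""
    groups = Counter(tuple(t[i] for i in coords) for t in tuples)
    return sorted(groups.values(), reverse=True)

def unseparated_pairs(tuples, coords):
    """Number of edges of the auxiliary graph, i.e. pairs not separated by `coords`."""
    return sum(s * (s - 1) // 2 for s in clique_sizes(tuples, coords))
```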
We want to argue that if is bad, then we sample two distinct vertices in the same clique in with probability at least . By an application of the union bound over subsets of attributes, we are able to discard all bad subsets of attributes with probability at least .
We note that if is bad, has at least edges. Thus, the problem can be rephrased as follows. Given a graph of vertices and at least edges that consists of disjoint cliques, what is the smallest number of vertices that one needs to sample uniformly at random such that the induced graph has at least one edge with probability at least ?
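For intuition, this probability can be estimated empirically for any given clique-size profile; the following is a small sanity-check simulation, not part of the analysis (names are ours):

```python
import random

def edge_hit_probability(clique_sizes, sample_size, trials=10_000):
    """Monte Carlo estimate of the probability that `sample_size` vertices,
    drawn uniformly without replacement, contain two vertices from the same
    clique (i.e. the induced subgraph has at least one edge)."""
    labels = [c for c, size in enumerate(clique_sizes) for _ in range(size)]
    hits = 0
    for _ in range(trials):
        drawn = random.sample(labels, sample_size)
        if len(set(drawn)) < len(drawn):   # two sampled vertices share a clique
            hits += 1
    return hits / trials
```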
It remains to show that sampling tuples (or vertices) is sufficient. The key observation is that this problem is fairly similar to the birthday problem. Of course, it is not quite the same, since we want to sample two distinct vertices from cliques of size at least two. We can simply sample without replacement to avoid this issue. However, for a cleaner presentation, our analysis is based on sampling with replacement, and we then relate the two.
There could be at most cliques so we use a vector to denote the cliques’ sizes where zero-entries are empty cliques. Recall that which implies for sufficiently large .
We relax the problem so that can be a non-negative real number (i.e., ). Consider the following set of points such that
(1)
(2)
(3)
We use to denote the set of all that satisfy all three constraints. For convenience, we can think of cliques as colors, and each vertex that we sample uniformly at random is a ball whose color is distributed according to the multinomial distribution where .
We are interested in how large should be such that the probability of having two balls with the same color is at least . In particular, let be the probability that no two balls have the same color (we refer to this event as non-collision) if we draw balls whose colors follow the distribution .
Let us fix and determine the that maximizes the non-collision probability. Without constraint (1), it is easy to see that the non-collision probability is largest when the color distribution is uniform [12, 17]. With constraint (1), one might suspect that this is still true, that is, that the non-collision probability is largest when all non-zero have the same value (i.e., all non-zero entries have the same value where ). This intuition is, however, incorrect; one can see an example in Appendix C.3. In this case, the problem indeed becomes more complicated.
Given the distribution , the probability that we do not have a collision after sampling vertices uniformly at random is
If we sample without replacement, the probability that we do not have a collision is
As a consequence, we have:
The next claim relates and .
Claim 1.
For , we have
Proof.
Note that implies . Thus,
(4)
We will later set ; the condition implies that . We are now interested in maximizing for fixed . Let
We make the following key observation.
Lemma 1.
If , then any optimal satisfies the following: the non-zero entries of take at most two distinct values.
Proof.
Since is being fixed, we drop from for a cleaner presentation. Let
(5)
In fact, is feasible as it satisfies all constraints (1)–(3). We note that any that has fewer than non-zero entries cannot be optimal, since while , as it has non-zero entries and therefore at least one term in must be positive.
Our proof has the KKT conditions, together with the LICQ regularity condition, at its core. We refer readers who are unfamiliar with this technical machinery to Appendix A for full details.
Let be any local optimum. First, we assume that LICQ holds at . Based on the KKT conditions (Theorem 5), there exist some constants , , and such that at any local maximum, we have
(6)
and
(7)
Equations (6) and (7) correspond to the stationarity and complementary slackness conditions (see Appendix A). Each coordinate of the right hand side of Eq. (6) has the following form
(8)
Each coordinate of is as follows:
(9)
Therefore, for each :
(10)
Consider any and . Note that . We have:
(11)
Thus, either or
(12)
If , then consider any other entry . It must be the case that either or . Without loss of generality, suppose . With the same derivation, we have:
(13)
From Eq. (12) and Eq. (13), we have:
(14)
In the last step, we divide both sides by
(15)
This is because we assume that at least entries are non-zero. Thus, all entries in can either take value 0, , or .
It remains to deal with the case where LICQ does not hold at . We denote by and the vectors in which all entries are zeros and ones, respectively; furthermore, denotes the th canonical vector. Let be the (active) set of constraints that hold with equality (see Appendix A and Definition 1 for a formal definition). The active set can contain the following three types of constraints:
1. Constraint (1): . The gradient corresponding to this constraint is .
2. Constraint (2).
3. Constraints (3): . The gradient corresponding to this type of constraint is .
Denote by the set of indices whose components of are zero. If does not contain Constraint (1) and LICQ does not hold, then there exist such that:
This happens if and only if contains all constraints (3), i.e., , which contradicts Constraint (2). Therefore, must contain Constraint (1). By our assumption that LICQ does not hold, there exists a vector such that:
As a consequence, for all . Thus, there is at most one distinct value among non-zero entries of when LICQ does not hold. That concludes our proof. ∎
The next lemma shows that we can still apply the birthday problem for .
Lemma 2.
Proof.
Consider the case where there are two distinct values among the non-zero entries of . Let these two distinct values be and . We say a color is in group if ; otherwise, is in group . Define . We have:
Thus either or . Without loss of generality, suppose . Then
Thus, each ball drawn uniformly at random has a color in group with probability at least .
Let be some sufficiently large constant. If we draw at least balls uniformly at random restricted to group- colors, then by the birthday problem (i.e., Theorem 4), the probability that no two balls share the same color is at most .
Suppose we sample balls uniformly at random. In expectation, the number of balls whose colors are in group in the sample is at least
By Chernoff bound (Theorem 3), the probability that we sample fewer than balls whose colors are in group is at most . Thus, we get two balls with the same color with probability at least
The case of one distinct value can be dealt with similarly by simply choosing to be some arbitrary value and . The same argument remains valid, and this concludes our proof. ∎
Putting it all together.
If we sample tuples without replacement, for each bad subset of coordinates , the probability that we do not sample an edge in is at most
Taking a union bound over at most bad subsets , we have proved that the probability of failing to detect a bad subset of coordinates is at most .
Query time.
Recall that each coordinate takes values in . If has a total ordering, then we can sort the tuples in using comparisons, each of which takes time, and detect duplicate(s). Thus, the query time is .
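The comparison-based query can be sketched as follows, assuming, as stated, a total order on coordinate values (names are ours):

```python
def accepts_by_sorting(sample, coords):
    """Sort the projections of the sampled tuples onto `coords` and look for
    an adjacent duplicate; a duplicate is an unseparated pair, so reject."""
    projections = sorted(tuple(t[i] for i in coords) for t in sample)
    return all(a != b for a, b in zip(projections, projections[1:]))
```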
2.2 A lower bound for constant failure probability
In this section, we consider the following question: what is a lower bound on the number of tuples one needs to sample (uniform sampling, with or without replacement) such that the failure probability of Algorithm 1 is smaller than a fixed constant ? To this end, we construct a data set that yields a lower bound for this question.
Lemma 3.
There exists a data set where one needs to sample with replacement (resp. without replacement) at least tuples to reject all bad subsets with probability (resp. ).
Proof sketch.
We provide a proof sketch for the case of sampling with replacement here. The full proof is deferred to Appendix C.1. We consider the data set . In this data set, the following holds:
1. All subsets of size one are bad (i.e., they separate fewer than pairs).
2. Sampling a tuple uniformly at random from is equivalent to sampling each coordinate i.i.d. from the uniform multinomial distribution on .
Let denote the event that a bad subset is detected after samples (i.e., we sample a pair of tuples that are not separated by ). We derive our lower bound by bounding , which is the probability that one can detect all the bad subsets of cardinality one. Let . We have
For , the above can be upper bounded as
2.3 A lower bound for failure probability
For a constant success probability , our analysis leaves a gap between the upper and lower bounds on the sampling complexity (i.e., the upper bound is while the lower bound is ). However, for large success probability (i.e., ), one can actually show that our analysis is tight. In fact, this lower bound holds even if one only needs to tell whether a single coordinate is a -separation key.
Lemma 4.
There exists a data set where one needs to sample without replacement at least to reject a bad subset with probability .
Proof sketch.
We build a data set satisfying the following properties: A) If we sample a tuple uniformly at random, its first coordinate follows the multinomial distribution given in Equation (5) and B) The remaining coordinates can be chosen arbitrarily as long as there exists a key for . For this data set, one can show that:
1. Coordinate is bad.
2. The graph (the auxiliary graph corresponding to the first coordinate) has one cluster of size and clusters of size one.
To detect that is bad with probability , one needs to sample at least two vertices (or tuples) in the largest cluster with the same probability. This requires samples since the probability of sampling a tuple in is . A detailed proof can be found in Appendix C.2. ∎
3 Estimating Non-Separation
In this section, we prove Theorem 2. Specifically, we present space upper and lower bounds for estimating non-separation. The upper bound is based on sampling pairs of tuples uniformly at random.
We then prove a lower bound on the sketch’s size for constant . Note that this implies a lower bound on the sample size since each sample requires bits of space. Therefore, for constant , the simple sketching algorithm based on uniform sampling is tight in terms of and , up to poly-logarithmic factor.
3.1 Upper bound
We give a simple upper bound based on random sampling. Let denote the data set . The algorithm samples pairs of tuples uniformly at random for some sufficiently large constant . For a fixed , we use to denote the number of sampled pairs that fails to separate. That is,
If , output “small”. Otherwise, output
as an estimate of . If fails to separate at most pairs of tuples, appealing to Chernoff bound yields:
If fails to separate at least pairs, we have
The success probability is therefore at least by appealing to a union bound over at most possible choices of .
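A sketch of this estimator; the number of sampled pairs, the scaling, and the cut-off below which we output “small” stand in for the omitted quantities in the text (names are ours):

```python
import random

def sample_pairs(tuples, num_pairs):
    """Draw pairs of distinct tuples uniformly at random, once up front."""
    n = len(tuples)
    return [random.sample(range(n), 2) for _ in range(num_pairs)]

def estimate_unseparated(tuples, pairs, coords, cutoff):
    """Scale the fraction of sampled pairs that `coords` fails to separate up
    to the total number of pairs; report "small" below the cut-off."""
    total_pairs = len(tuples) * (len(tuples) - 1) // 2
    unseparated = sum(
        1 for a, b in pairs
        if all(tuples[a][i] == tuples[b][i] for i in coords)
    )
    estimate = unseparated / len(pairs) * total_pairs
    return "small" if estimate < cutoff else estimate
```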
3.2 Lower bound
We show that even for constant , a valid data sketch must use bits. The proof is based on an encoding argument. Let denote the Hamming distance. We will make use of the following communication problem.
Lemma 5.
Suppose Alice has a bit string of length indexed as a matrix. Each of the columns has exactly one-entries and zero-entries. For Bob to compute a reconstruction of such that with probability at least , the one-way (randomized) communication complexity is bits.
The proof of the above lemma is an adaptation of the proof of the Index problem [1, 9] which we defer to the end of this section. We will later set to obtain the lower bound .
Let and be a matrix whose entries are all ones. Furthermore, let be the canonical vector with the one-entry in the th row. Alice constructs the following data set as below. Here, we make the assumption that ; therefore, .
We will show that Bob can recover , column-by-column using on the described data sketch for non-separation estimation with . Then, appealing to Lemma 5, such a data sketch must have size bits.
Fix a column of . Bob can recover column of as follows. If Bob guesses that rows
in contain all the one-entries, Bob can verify if this guess is good using the estimate where
Here, a guess is good if the reconstruction of corresponding to this guess satisfies .
Note that . Let and be the number of rows in that Bob guesses correctly and incorrectly respectively. That is , and . Note that . The intuition is that for each that is a correct guess, includes another coordinate that separates another pairs; if is a wrong guess, includes another coordinate that separates pairs.
Lemma 6.
We have the following equality regarding the number of unseparated pairs in :
Proof.
Consider the following process. We start with as the coordinate corresponding to column and then add the coordinates for each to one by one. Originally, fails to separate the pairs corresponding to rows that are 1's in column and the pairs corresponding to rows that are 0's in column .
For the first correct , includes a coordinate that separates row from other rows; for the second correct , includes a coordinate that separates row from other rows and so on. Similarly, for the first incorrect , includes a coordinate that separates row from other rows; for the second incorrect , includes a coordinate that separates row from other rows and so on.
Therefore, in the end, the number of pairs that remain unseparated in is
(16)
∎
First, note that . Thus, a correct data sketch must output the required estimate .
Observe that the expression (16) is decreasing for . Suppose ,
On the other hand, if , then
Our objective is to choose such that if the provided estimate , Bob can tell whether his guess is good. To do so, it suffices to choose such that:
Hence, we can set for some sufficiently large constant . Specifically, Bob queries the estimates for at most choices of . If , then he knows that the guess is good; he will use the corresponding reconstruction for column .
Therefore, provided a valid sketch, Bob can correctly compute a reconstruction of such that
with probability at least 3/4. Reparameterizing and gives us the lower bound.
Proof of Lemma 5.
For convenience, let (i.e., the length of the bit string that is given to Alice) and (we want to show that communication is needed).
Recall that Bob wants to compute a reconstruction of such that
We consider the distributional complexity methodology where follows a distribution and communication protocols are deterministic. Here, we simply choose in which in each column, bits are chosen uniformly at random to be 1’s and the remaining bits are set to 0’s. It suffices to show that for any deterministic protocol that uses bits of communication,
By the minimax principle [19], this implies that there exists an input such that any randomized protocol that uses bits of communication will fail to reconstruct as required with probability more than 1/3.
Suppose Bob gets a message from Alice ( is a function of ). Let be his reconstruction based on . Let
be the set of all possible reconstructions by Bob. Note that .
We say Alice’s input is good if there exists a reconstruction such that
Otherwise, we say is bad. Fix a reconstruction . The number of inputs whose Hamming distance from is at most is
(17)
Recall that . For sufficiently large and , expression (17) can be upper bounded as:
Hence, for sufficiently large and , the total number of good inputs is at most
Therefore,
4 Implementation
Despite the involved analysis, the modified algorithm for detecting -separation keys is very simple. We therefore briefly demonstrate the effectiveness of our approach described in Section 2 and compare it to the baseline approach given by Motwani and Xu [14]. We ran our experiments on an M1 Pro processor with 16 gigabytes of unified memory. The code for the experiments is available on GitHub [7].
Description of datasets.
We used two of the data sets that were tested in [14], the adult income data set and the covtype data set, both of which are from the UCI Machine Learning Repository. We also used the 2016 Current Population Survey, which is publicly provided by the US Census. These data sets are representative in terms of data size and number of attributes. For example, the adult data set contains slightly more than 32,000 records with 14 attributes, while the census data contains millions of records with 388 attributes.
Comparison methodology.
We compare the two approaches with and . These are the same tuning parameters that were considered in [14]. We compare our results in terms of the following: (i) sample size, (ii) run-time, and (iii) the percentage of queries on which both algorithms agree on accepting or rejecting a set of attributes. Note that in some cases, even though the two algorithms’ outputs differ, both can be correct. In particular, if a subset is not a perfect key but it separates at least pairs, then it is correct to either accept or reject.
For each data set, we select about random subsets of attributes to query. See Table 1 for the detailed results. At a high level, the two approaches agree on nearly all queries, while ours requires a much smaller sample size.
Dataset | Sample size ([14]) | Sample size (ours) | Time ([14]) | Time (ours) | Agreement %
---|---|---|---|---|---
Adult | 13,000 | 411 | 1.903 sec | 0.208 sec | 95%
Covtype | 55,000 | 1,739 | 188.02 sec | 2.49 sec | 98%
CPS | 372,000 | 11,764 | 790.08 sec | 60.03 sec | 100%
References
- [1] Farid M. Ablayev. Lower bounds for one-way probabilistic communication complexity and their application to space complexity. Theor. Comput. Sci., 157(2):139–159, 1996.
- [2] Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, pages 586–597. Morgan Kaufmann, 2002.
- [3] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD Conference, pages 313–324. ACM, 2003.
- [4] Graham Cormode, Charlie Dickens, and David P. Woodruff. Subspace exploration: Bounds on projected frequency estimation. In PODS, pages 273–284. ACM, 2021.
- [5] Roy Friedman and Rana Shahout. Box queries over multi-dimensional streams. Inf. Syst., 109:102086, 2022.
- [6] Chris M. Giannella, Mehmet M. Dalkilic, Dennis P. Groth, and Edward L. Robertson. Using horizontal-vertical decompositions to improve query evaluation. In Proceedings of the 19th British National Conference on Databases (BNCOD), volume 2405 of Lecture Notes in Computer Science, pages 26–41, 2002.
- [7] Ryan Hildebrant. Github. https://github.com/Ryanhilde/min_set_cover/.
- [8] Jyrki Kivinen and Heikki Mannila. Approximate dependency inference from relations. In ICDT, volume 646 of Lecture Notes in Computer Science, pages 86–98. Springer, 1992.
- [9] Ilan Kremer, Noam Nisan, and Dana Ron. Errata for: "on randomized one-round communication complexity". Comput. Complex., 10(4):314–315, 2001.
- [10] Branislav Kveton, S. Muthukrishnan, Hoa T. Vu, and Yikun Xian. Finding subcube heavy hitters in analytics data streams. In WWW, pages 1705–1714. ACM, 2018.
- [11] Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan R. Ullman. Space lower bounds for itemset frequency sketches. In PODS, pages 441–454. ACM, 2016.
- [12] Terry R McConnell. An inequality related to the birthday problem. Preprint, 2001.
- [13] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Cambridge university press, 1995.
- [14] Rajeev Motwani and Ying Xu. Efficient algorithms for masking and finding quasi-identifiers. In Technical Report, 2008.
- [15] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY, USA, 2e edition, 2006.
- [16] Bernhard Pfahringer and Stefan Kramer. Compression-based evaluation of partial determinations. In KDD, pages 234–239. AAAI Press, 1995.
- [17] J Michael Steele. The Cauchy-Schwarz master class: an introduction to the art of mathematical inequalities. Cambridge University Press, 2004.
- [18] Wikipedia. Chernoff bound — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Chernoff%20bound&oldid=1119845299, 2022. [Online; accessed 18-November-2022].
- [19] Andrew Chi-Chih Yao. Lower bounds by probabilistic arguments (extended abstract). In FOCS, pages 420–428. IEEE Computer Society, 1983.
- [20] Neal E. Young. Greedy set-cover algorithms. In Ming-Yang Kao, editor, Encyclopedia of Algorithms. Springer, 2008.
Appendix A Applying Karush–Kuhn–Tucker (KKT) conditions
In this section, we present the KKT conditions for self-containedness. This classical result is presented as in [15, Chapter 12]. Consider a constrained optimization problem with an objective function , inequality constraints , and equality constraints .
Maximize  (18)
Subject to:
 for .
 for .
Definition 1 (Active set [15]).
The active set of a feasible solution of (18) is the set of indices defined as: . In other words, consists of all equality constraints together with the inequality constraints that hold with equality at .
Definition 2 (LICQ [15]).
Theorem 5 (KKT conditions [15]).
Consider a local minimum of (18), and suppose that and are continuously differentiable and that the LICQ holds at . Then there is a Lagrange multiplier vector with components such that the following conditions are satisfied:
(Primal feasibility for inequality)  (19)
(Primal feasibility for equality)  (20)
(Dual feasibility)  (21)
(Stationarity)  (22)
(Complementary slackness)  (23)
Application to our problem.
Recall that we want to maximize:
Subject to:
The KKT conditions imply that when LICQ holds, there exist constants such that at any local maximum, we have:
Furthermore, according to the complementary slackness condition, for each :
Therefore, if then .
Appendix B Improvements for Approximate Minimum -Separation Key
In this section, we prove Proposition 1. Recall from Section 1 that we treat as the ground set and each coordinate is a set that contains all pairs that it separates. The minimum key separates all pairs in since . Furthermore, in Section 2, we showed that no subset of coordinates that separates fewer than pairs is a set cover of with high probability. Hence, a approximation to this minimum set cover instance yields an -separation key whose size is at most with high probability. This implies the improvement in terms of sample size.
In terms of running time, to achieve the approximation factor , both the algorithm in [14] and ours use the greedy set cover algorithm [20] (described in Algorithm 2).
Let be the output, initialized to . The greedy algorithm, at each step, adds to the coordinate that separates the largest number of currently unseparated pairs. The algorithm stops when all pairs are separated by .
The key difference between the two algorithms is that ours samples tuples and uses as the ground set, while the approach of [14] samples pairs of tuples of the data set and uses as the ground set. Algorithm 2 runs in time, where is the ground set’s size and is the number of sets in the input. Therefore, our algorithm and that of Motwani and Xu [14] yield time complexities and , respectively. However, our algorithm can be refined to a better time complexity . We outline how to achieve this running time as follows.
We can visualize this process in terms of auxiliary graphs introduced in Section 2. Let be the output, initialized to . Originally, consists of one clique that contains all since the empty set does not separate any pair.
As we add more coordinates to the solution, will separate more pairs in and the number of disjoint cliques in increases. We stop when ; in other words, all pairs in are separated by . We will make use of the following procedure.
Partitioning.
Let be a subset of the sampled tuples. For some given , we can partition into based on the th coordinates. The simplest approach is to sort the th coordinates of the tuples in . This takes time.
Let be the disjoint cliques in after steps (i.e., after we have added coordinates to ). Originally, .
At each step , for each coordinate , adding to breaks each clique into new cliques based on the th coordinates. This corresponds to the fact that adding to results in separating more pairs of tuples. We use the procedure to compute . We observe that coordinate would separate
new pairs of tuples in .
Given , can be computed in time by computing , as there can be at most terms in the sum. The algorithm then adds the coordinate with the highest to and updates the cliques accordingly. In other words, we replace with . For each , the time to partition based on the th coordinates is
We need to do this for at most coordinates not in and the process repeats for at most steps. Thus, the algorithm’s running time is
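Putting these pieces together, the partitioning step and the refined greedy loop can be sketched as follows; the lookup-table speed-up of Algorithm 3 is omitted, and all names are ours.

```python
from itertools import groupby

def partition_by_coordinate(group, sample, i):
    """Split `group` (a list of tuple indices) into sub-cliques that agree on
    the i-th coordinate, by sorting on that coordinate (the partitioning
    procedure described above)."""
    ordered = sorted(group, key=lambda t: sample[t][i])
    return [list(g) for _, g in groupby(ordered, key=lambda t: sample[t][i])]

def greedy_separation_key(sample):
    """Greedy set cover over the pairs of sampled tuples, maintaining the
    cliques of the auxiliary graph instead of the pairs themselves."""
    if not sample:
        return []
    d = len(sample[0])
    cliques = [list(range(len(sample)))]        # the empty set separates nothing
    key = []
    while any(len(c) > 1 for c in cliques):
        best_coord, best_gain = None, 0
        for i in range(d):
            if i in key:
                continue
            gain = 0
            for c in cliques:
                if len(c) < 2:
                    continue
                parts = partition_by_coordinate(c, sample, i)
                # Newly separated pairs: pairs inside c minus pairs left inside its parts.
                gain += (len(c) * (len(c) - 1)
                         - sum(len(p) * (len(p) - 1) for p in parts)) // 2
            if gain > best_gain:
                best_coord, best_gain = i, gain
        if best_coord is None:                  # duplicate tuples remain; no coordinate helps
            break
        key.append(best_coord)
        cliques = [p for c in cliques
                   for p in partition_by_coordinate(c, sample, best_coord)]
    return key
```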
If we allow an extra factor in the space usage, we can reduce the running time further by improving the partitioning procedure. For each , the partition step can be done in time (instead of time) using Algorithm 3.
Algorithm 3 is provided with a lookup table and an index . We partition based on the th coordinates into , , . The value is equal to the index of the partition that belongs to. In other words, . Pre-computing involves sorting each of coordinates of tuples in . The running time to compute is therefore
Thus, the overall running time is
Appendix C Omitted Proofs and Examples
C.1 Proof of Lemma 3
Proof.
For technical reasons, we will consider such that where . Our construction of the data set is . Thus, . We choose such that (or equivalently ).
Firstly, we will prove that all subsets of attributes of cardinality one are bad. Indeed, given any singleton set , the auxiliary graph decomposes evenly into cliques whose sizes are . The number of edges (of ) is:
To show , it is sufficient to demonstrate that:
which is true if .
Let denote the event that a bad subset is detected after samples. We derive our lower bound by bounding , which is the probability that one can detect all the bad subsets of cardinality one. To distinguish between the cases of sampling with and without replacement, we denote by and the probability of an event under sampling with and without replacement, respectively.
We first deal with the case of uniform sampling with replacement. In this case, sampling a tuple is equivalent to sampling each coordinate i.i.d. from the uniform multinomial distribution on the set . Hence,
Moreover, we have: . This is not an equality since we need to exclude the cases where a tuple of is sampled more than once. Since , we have:
for . If one chooses (which makes the inequality valid since ), we have:
Finally, one can bound . Thus, the sampling complexity is lower bounded by .
For the case of sampling without replacement, we will use the following observation: let be a sequence of distinct elements in and let be the th element sampled from . We have:
Let be the set of all sequences of distinct tuples of length that remove all bad subsets, and let be the event that all bad subsets are detected (i.e., for each bad subset there exists an unseparated pair in the sample) after drawing samples. We have:
We choose such that . With (as in the case of sampling with replacement), this is equivalent to:
which is true for . If , we have:
Thus, we can derive the same complexity bound for the case of sampling without replacement with failure probability . ∎
C.2 Proof of Lemma 4
Proof.
We choose such that . We construct the data set of tuples as follows.
1. The first coordinate of each of has distinct values. Without loss of generality, we assume the set of distinct values of the first coordinate is . More importantly, there are exactly tuples whose first coordinates equal , and the remaining tuples have distinct first coordinates in the set . We denote by the set of tuples whose first coordinates are and by the set of remaining tuples. Thus, the graph has one big clique of size and isolated vertices.
2. Each of the remaining coordinates can be chosen so that a key exists for .
It is worth noting that is a bad subset of attributes. This is clear since:
Assume we sample times and . The event (i.e., failing to reject ) includes the event where we get exactly distinct elements of . Hence,
Since for all , we further have:
To reject with probability , one needs to have . Our analysis implies:
For , we have:
since we choose such that . This does not hold for . This shows one needs to sample to reject with probability . ∎
C.3 An Example Regarding Lemma 2
Let . We provide an example refuting the intuition that the optimal value that maximizes the non-collision probability must have equal non-zero entries (with values ). This implies that Lemma 2 is necessary.
In particular, let , and . Consider where
Note that .
Then, consider where
Also note that as required. One can check that .