Differentially Private Data Release over Multiple Tables
Abstract.
We study synthetic data release for answering multiple linear queries over a set of database tables in a differentially private way. Two special cases have been well-studied in the literature: how to release a synthetic dataset for answering multiple linear queries over a single table, and how to release the answer for a single counting (join size) query over a set of database tables. Compared to the single-table case, the join operator has emerged as the major difficulty in query answering, since the sensitivity (i.e., how much an individual data record can change the query answer) could be heavily amplified via complex join relationships.
We present an algorithm for the general problem, and prove a lower bound illustrating that our general algorithm achieves parameterized optimality (up to logarithmic factors) on some simple queries (for example, two-table join queries) in the most commonly-used privacy parameter regimes. For the class of hierarchical joins, we present a data partition procedure that exploits the idea of uniformizing sensitivities to further improve the utility.
1. Introduction
Synthetic data release is a very useful objective in private data analysis. Differential privacy (DP) (dwork2006calibrating; dwork2006our) has emerged as a compelling model that enables us to formally study the tradeoff between utility of released information and the privacy of individuals. In the literature, there has been a large body of work that studies synthetic data release over a single table, for example, the private multiplicative weight algorithm (hardt2012simple), the histogram-based algorithm (vadhan2017complexity), the matrix mechanism (li2010optimizing), the Bayesian network algorithm (zhang2017privbayes); as well as other works focusing on geometric range queries (li2011efficient; li2012adaptive; dwork2010differential; bun2015differentially; chan2011private; dwork2015pure; cormode2012differentially; huang2021approximate), and datacubes (ding2011differentially).
Data analysis over multiple private tables connected via join operators has been the subject of significant interest within the area of modern database systems. In particular, the challenging question of releasing the join size over a set of private tables has been studied in several recent works including the sensitivity-based framework (johnson2018towards; dong2021residual; dong2021nearly), the truncation-based mechanism (kotsogiannis2019privatesql; dong2022r2t; tao2020computing), as well as in works on one-to-one joins (mcsherry2009privacy; proserpio2014calibrating), and on graph databases (blocki2013differentially; chen2013recursive). In practice, multiple queries (as opposed to a single one) are typically issued for data analysis, for example, a large class of linear queries on top of join results with different weights on input tuples, as a generalization of the counting join size query. One might consider answering each query independently but the utility would be very low due to the limited privacy budget, implied by the DP composition rule (dwork2006calibrating; dwork2006our). Hence the question that we tackle in this paper is: how can we release synthetic data for accurately answering a large class of linear queries over multiple tables in a differentially private way?
1.1. Problem Definition
Multi-way Join and Linear Queries.
A (natural) join query can be represented as a hypergraph (abiteboul1995foundations), where is the set of vertices or attributes and each is a hyperedge or relation. Let be the domain of attribute . Define , and for any . Given an input instance , each table is defined as a function from domain to a non-negative integer, indicating the frequency of each tuple. This is more general than the setting where each tuple appears at most once in each relation, since it can capture annotated relations (green2007provenance) with non-negative values. The input size of is defined as the sum of frequencies of data records, i.e., . When the context is clear, we just drop the superscript from , and often use to denote the set of tuples with non-zero frequency.
For each , we have a family of linear queries defined over such that for , . The family of linear queries defined over is . For each linear query , the result is defined as
where is the natural join function, such that for each combination , if and only if for each pair of . Here, is the standard projection operator, such that indicates the value(s) that tuple displays on attribute(s) .
Our goal is to release a function such that all queries in can be answered over as accurately as possible, i.e., minimizing the -error , where
DP Setting.
Two DP settings have been studied in the relational model, depending on whether foreign key constraints are considered or not. The one considering foreign key constraints assumes the existence of a primary private table, and deleting a tuple in the primary private relation will delete all other tuples referencing it; see (kotsogiannis2019privatesql; tao2020computing; dong2022r2t). In this work, we adopt the other notion, which does not consider foreign key constraints, but defines instances to be neighboring if one can be converted into the other by adding/removing a single tuple; this is the same as the notion studied in previous works (johnson2018towards; kifer2011no; narayan2012djoin; proserpio2014calibrating). (Our definition corresponds to the add/remove variant of DP, where a record is added/removed from a single database . Another possible definition is based on substitution DP. Add/remove DP implies substitution DP with only a factor of two increase in the privacy parameters; therefore, all of our results also apply to substitution DP.)
Definition 1.1 (Neighboring Instance).
A pair of instances and are neighboring if there exist some and such that:
-
•
for any , for every ;
-
•
for every and .
Definition 1.2 (Differential Privacy (dwork2006calibrating; dwork2006our)).
For , a randomized algorithm is -differentially private (denoted by -DP) if for any pair , of neighboring instances and any subset of outputs, .
Notation.
For simplicity of presentation, we henceforth assume throughout that and . Furthermore, all lower bounds below hold against an algorithm that achieves the stated error with probability for some sufficiently small constant ; we will omit mentioning this probabilistic part for brevity. For notational ease, we define the following quantities:
When are clear from context, we will omit them. Let , which is a commonly-used parameter in this paper.
1.2. Prior Work
We first review the problem of releasing synthetic data for a single table, and mention two important results in the literature. In the single table case, nearly tight upper and lower bounds are known, as stated below.
Theorem 1.3 ((hardt2012simple)).
For a single table of at most records, a family of linear queries, and , , there exists an algorithm that is -DP, and with probability at least produces such that all queries in can be answered within error using .
Theorem 1.4 ((bun2018fingerprinting)).
For every sufficiently small , and , there exists a family of queries of size on a domain of size such that any -DP algorithm that takes as input a database of size at most , and outputs an approximate answer to each query in to within error must satisfy .
Another related problem that has been widely studied by the database community is how to release the answer of join size queries over multiple tables with DP, denoted as :
which can be captured by a special linear query such that for .
A popular approach to answering the counting join size query is based on the sensitivity framework, which first computes , then computes the sensitivity of the query (measuring the difference between the query answers on neighboring database instances), and releases a noise-masked query answer, where the noise is drawn from some distribution calibrated appropriately according to the sensitivity. It is known that the local sensitivity cannot be directly used for calibrating noise, as for a pair of neighboring databases can lead to distinguishability of . Global sensitivity (e.g., (dwork2006calibrating)) could be used, but the utility is not satisfactory in many instances. Instead, smooth upper bounds on local sensitivity (nissim2007smooth) have been considered, for example, the smooth sensitivity, which is the smallest smooth upper bound; the residual sensitivity (dong2021residual), which is a constant-approximation to the smooth sensitivity; and the elastic sensitivity (johnson2018towards), which is strictly larger than the residual sensitivity.
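To make the sensitivity-based framework concrete, the following is a minimal Python sketch (not the mechanism of any specific cited work): the two-table join-size computation, the relation encoding, and the bound sens_bound, which stands in for whichever smooth upper bound (smooth, residual, or elastic sensitivity) is actually used, are all illustrative assumptions.

```python
import numpy as np

def noisy_join_size(R1, R2, sens_bound, eps, rng=np.random.default_rng()):
    """Release a counting join size via the sensitivity framework (sketch).

    R1, R2     : dicts mapping tuples (a, b) and (b, c) to non-negative frequencies
    sens_bound : a DP-safe upper bound on the sensitivity of the join size
    eps        : privacy budget spent on this single release
    """
    # 1. Compute the true counting join size |J| of R1 join R2 (join on b).
    true_count = sum(f1 * f2
                     for (a, b1), f1 in R1.items()
                     for (b2, c), f2 in R2.items()
                     if b1 == b2)
    # 2. Calibrate Laplace noise to the sensitivity bound and release the sum.
    return true_count + rng.laplace(scale=sens_bound / eps)
```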
We note that this sensitivity-based framework can be adapted to answering any single linear query, as long as the sensitivity is correctly computed. However, we are not interested in answering each linear query individually, since the privacy budget blows up with the number of queries to be answered, as implied by the DP composition rule (dwork2006calibrating; dwork2006our), which would be too costly when the query space is extremely large. Instead, we aim to release a synthetic dataset such that an arbitrary linear query can be freely answered over it while preserving DP. In building our data release algorithm, we also resort to the existing literature on the sensitivities derived for this counting join size query . Although smooth sensitivity enjoys the best utility, we adopt the residual sensitivity in this paper for two reasons: it is much more computationally efficient than smooth sensitivity; and, more importantly, it is built only on statistics of the input instance, whereas smooth sensitivity is built on all neighboring instances, which is critical to our optimization technique of uniformizing sensitivities.
1.3. Our Results
We first present the basic join-as-one approach for releasing a synthetic dataset for multi-table queries, with error a function of the output size of join results and the residual sensitivity of .
Theorem 1.5.
For any join query , an instance , a family of linear queries, and , , there exists an -DP algorithm that with probability at least produces that can be used to answer all queries in within error:
where is the join size of and is the -residual sensitivity of the counting join size query on for (see Definition 3.5).
We next present the parameterized optimality with respect to the output size of join results and local sensitivity of .
Theorem 1.6.
Given arbitrary parameters and a join query , for every sufficiently small , and , there exists a family of queries of size on domain of size such that any -differentially private algorithm that takes as input a multi-table database over of input size at most , join size and local sensitivity , and outputs an approximate answer to each query in to within error must satisfy
It should be noted that for the setting with and for some constant , we just have . The gap between the upper bound in Theorem 1.5 and the lower bound in Theorem 1.6 boils down to the gap between smooth sensitivity and local sensitivity of , which also appears (but in a different formula) in answering such a single query.
From the upper bound in terms of residual sensitivity, we exploit the idea of uniformizing sensitivity (a similar idea of uniformizing the degrees of join values, i.e., how many tuples a join value appears in within a relation, has been used in query processing (joglekar2015s), but we use it in a more complicated scenario in combination with sensitivity) to further improve Theorem 1.5, by using a more fine-grained characterization of input instances. For the class of hierarchical queries, which has been widely studied in the context of various database problems (dalvi2007efficient; fink2016dichotomies; berkholz2017answering; hu2022temporal), we obtain a more fine-grained parameterized upper bound in Theorem 4.12 and lower bound in Theorem 4.13. As the utility formulas display a rather complicated form, we defer all details to Section 4.
2. Preliminaries
For random variables and scalars , we write if and for all sets of outcomes.
For convenience, we write as a shorthand for where is a random variable with distribution (denoted by ).
Laplace distribution .
The Laplace distribution with scale parameter b has density proportional to exp(-|x|/b). Adding Laplace noise with scale equal to the query's sensitivity divided by the privacy parameter yields a pure DP release (dwork2006calibrating).
Shifted and truncated geometric distribution (chan2019foundations).
Let . Let , , and . Let be the smallest positive integer such that , where . The shifted and truncated geometric distribution has support in , and is defined as . It has been proved (balcer2017differential; chan2019foundations) that for any pair with ,
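Since the exact constants are specified in (chan2019foundations), the sketch below only illustrates the general shape of the mechanism we rely on; the decay rate, the shift, and the truncation range are assumptions rather than the precise parameters used in the paper.

```python
import numpy as np

def sample_stgeom(eps, shift, rng=np.random.default_rng()):
    """Illustrative shifted and truncated geometric sample (sketch).

    Two-sided geometric noise (difference of two geometric variables) with
    decay exp(-eps), shifted upward by `shift` and truncated to [0, 2*shift]
    so that the released value is never negative.
    """
    p = 1.0 - np.exp(-eps)                       # success probability of each geometric
    noise = rng.geometric(p) - rng.geometric(p)  # symmetric two-sided geometric noise
    return int(np.clip(shift + noise, 0, 2 * shift))
```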
Exponential Mechanism.
The exponential mechanism (EM) (McSherryT07) can select a "good" candidate from a set of candidates. The goodness is defined by a scoring function for an input dataset and a candidate, and is assumed to have sensitivity at most one. The EM algorithm then samples each candidate with probability ; this algorithm is -DP.
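A minimal sketch of the exponential mechanism as used here; the candidate set, the scoring function score (assumed to have sensitivity at most one), and the budget eps are supplied by the caller.

```python
import numpy as np

def exponential_mechanism(candidates, score, dataset, eps,
                          rng=np.random.default_rng()):
    """Select one candidate with probability proportional to exp(eps * score / 2)."""
    scores = np.array([score(dataset, c) for c in candidates], dtype=float)
    # Subtract the maximum score for numerical stability; the distribution is unchanged.
    logits = eps * (scores - scores.max()) / 2.0
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```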
Local Sensitivity.
Given a function that takes an input dataset and outputs a vector, its local sensitivity at is defined as where the maximum is over all neighbors of .
Join Result as a Function.
For , as each input relation is abstracted as a function, we introduce the join result function as , such that for any . Moreover, is denoted as the join size of , and we may use the two interchangeably later.
When we mention “local sensitivity” without a specific function , it should be assumed that the function is the counting join size query; note that this is equal to .
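For concreteness, a small sketch of materializing the join result as a frequency function, for a two-table join on a single shared attribute; the relation encoding and attribute layout are illustrative assumptions.

```python
from collections import defaultdict

def join_as_function(R1, R2):
    """Materialize J = R1 join R2 as a frequency function (two-table sketch).

    R1 : dict mapping tuples (a, b) to frequencies, joining on attribute b
    R2 : dict mapping tuples (b, c) to frequencies
    Returns a dict mapping (a, b, c) to R1[(a, b)] * R2[(b, c)].
    """
    by_b = defaultdict(list)
    for (b, c), f2 in R2.items():
        by_b[b].append((c, f2))
    J = defaultdict(int)
    for (a, b), f1 in R1.items():
        for c, f2 in by_b[b]:
            J[(a, b, c)] += f1 * f2
    return dict(J)
```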
3. Join-As-One Algorithm
In this section we present our first, join-as-one algorithm for general join queries, which computes the join result as a single function and then invokes the single-table private multiplicative weight (PMW) algorithm (hardt2012simple). While this approach appears simple, there are many non-trivial issues in putting everything together. (All missing proofs are in Appendix B.)
3.1. Algorithms for Two-Table Query
We start with the simplest two-table query. Assume is defined on , , and . Let us define some parameters first. For , we define the degree of join value to be . The maximum degree of any join value in is denoted as . Define . We note that .
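A sketch of the degree computation for the two-table join, using the same illustrative relation encoding as above; for the counting join size query over two tables, the maximum degree over both relations coincides with the local sensitivity used later.

```python
from collections import Counter

def max_join_degree(R1, R2):
    """Return the maximum degree of any join value over both relations.

    The degree of a join value b in a relation is the total frequency of tuples
    whose join attribute equals b.  Adding/removing a tuple with join value b in
    one relation changes the join size by the degree of b in the other relation,
    so this maximum equals the local sensitivity of the counting join size query.
    """
    deg1, deg2 = Counter(), Counter()
    for (a, b), f in R1.items():
        deg1[b] += f
    for (b, c), f in R2.items():
        deg2[b] += f
    values = set(deg1) | set(deg2)
    return max(max(deg1[b], deg2[b]) for b in values) if values else 0
```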
A Natural (but Flawed) Idea.
Let us first consider a natural approach, which turns out to violate DP:
-
•
Compute and release a synthetic dataset for as by invoking the single-table PMW algorithm;
-
•
As the PMW algorithm releases a set of records with their total frequencies as , which will reveal information about the input instance, we add noise to to obtain .
-
•
We draw a random noise from and choose records randomly from , denoted as .
-
•
We release the union of these two datasets .
The challenge is that invoking the single-table algorithm on does not preserve DP, as the initial data distribution parameterized by can leak information about the input database.
Modified Algorithm.
The main insight is to swap the order of the two steps in the flawed version. We add random noise to as , and then invoke the single-table algorithm on starting with a uniform distribution parameterized by . As is released under DP, and the single-table algorithm (hardt2012simple) also releases a synthetic dataset for under DP, Algorithm 1 also satisfies DP, as stated in Lemma 3.1.
Lemma 3.1.
Algorithm 1 is -DP.
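A minimal sketch of the modified ordering described above, assuming a routine pmw implementing the single-table private multiplicative weights algorithm (hardt2012simple) is available; the Laplace noise used here, the even budget split, and the pmw interface are placeholders rather than the exact choices of Algorithm 1.

```python
import numpy as np

def release_two_table(J, sens_bound, eps, queries, pmw,
                      rng=np.random.default_rng()):
    """Join-as-one release for a two-table query (sketch of the fixed ordering).

    J          : frequency function of the join result (dict: tuple -> count)
    sens_bound : DP upper bound on the sensitivity of the join size
    eps        : total privacy budget, split evenly between the two steps
    pmw        : single-table PMW routine with signature pmw(data, queries, eps, total)
    """
    true_size = sum(J.values())
    # Step 1: release a noisy join size FIRST (this fixes the flawed ordering).
    noisy_size = max(1, int(round(true_size + rng.laplace(scale=2 * sens_bound / eps))))
    # Step 2: run single-table PMW on the join result, initialized with a
    # uniform distribution whose total mass is the released noisy size.
    return pmw(J, queries, eps / 2.0, total=noisy_size)
```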
Error Analysis.
As shown in Appendix A, with probability at least , Algorithm 1 produces such that every linear query can be answered within error (omitting ): . Then, we come to the following result:
Theorem 3.2.
For any two-table instance , a family of linear queries , and , , there exists an algorithm that is -DP, and with probability at least produces such that all queries in can be answered within error:
where , is the join size of , and is the local sensitivity of on .
3.2. Lower Bounds for Two-Table Query
The first lower bound is based on the local sensitivity, which can be shown via standard techniques.
Theorem 3.3.
For any , if there exists a family of queries such that any -DP algorithm that takes as input a database with local sensitivity at most and outputs an approximate answer to each query in to within error , then .
Our second lower bound is via a reduction to the single-table case: we create a two-table instance where encodes the single-table and “amplifies” both the sensitivity and the join size by a factor of . This eventually results in the following lower bound.
Theorem 3.4.
For any , any sufficiently small , and , if there exists a family of queries of size on domain of size such that any -DP algorithm that takes as input a two-table database of join size and local sensitivity and outputs an approximate answer to each query in to within error , then .
Proof.
Let . From Theorem 1.4, there exists a set of queries on domain for which any -DP algorithm that takes as input a single-table database and outputs an approximate answer to each query in within error requires that . For an arbitrary single-table database , we construct a two-table instance of join size , and local sensitivity as follows:
-
•
Set , and .
-
•
Let for all and .
-
•
Let for all and .
It can be easily checked that has join size at most and local sensitivity , and that two neighboring datasets result in neighboring . Finally, let contain queries from applied on its first attribute (i.e., ), and let contain only a single query . An example is illustrated in Figure 1.
Our lower bound argument is a reduction to the single-table case. Let be an algorithm that takes any two-table instance in and outputs an approximate answer to each query in within error . For each query , let be its corresponding query in . Let be the approximate answer for query . We then return as an approximate answer of .
We first note that this algorithm for answering queries in is -DP due to the post-processing property of DP. The error guarantee follows immediately from the observation that .
Therefore, from Theorem 1.4, we can conclude that . ∎
3.3. Multi-Table Query
We next extend the join-as-one approach to the multi-table query. First, notice that Algorithm 1 does not work in the multi-table setting because may itself have a large local sensitivity, unlike in the two-table case where the global sensitivity was one.
To handle this, we will need the notion of residual sensitivity. For a join query , the residual query on a subset of relations is . Its boundary, denoted as , is the set of attributes that belong to relations both in and out of , i.e., . For a residual query on over instance , the query result is (here, the semi-join result of is defined as a function such that if and otherwise):
(1)
which is a join-aggregate query over the attributes in .
Definition 3.5 (Residual Sensitivity for Counting Query (dong2021residual; dong2021nearly)).
Given , the -residual sensitivity of counting the join size of multi-way query over instance is defined as
where
(2)
for .
It is clear that is an upper bound on , and therefore it suffices for us to find an upper bound for (which is then passed to the single-table algorithm). To compute , we observe that has global sensitivity at most . Therefore, adding an appropriately calibrated (truncated and shifted) Laplace noise to it provides a DP upper bound. The idea is formalized in Algorithm 3. Its privacy guarantee is immediate:
Lemma 3.6.
Algorithm 3 is -DP.
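To make the idea behind Algorithm 3 concrete, the following sketch releases a value that upper-bounds a sensitivity-related statistic with high probability; it assumes the perturbed statistic has global sensitivity at most one, and the noise and shift constants are illustrative rather than the exact calibration used in the paper.

```python
import numpy as np

def dp_upper_bound(stat_value, eps, beta, rng=np.random.default_rng()):
    """Release a DP value exceeding `stat_value` except with probability <= beta.

    Assumes `stat_value` has global sensitivity at most 1.  We add Laplace noise
    with scale 1/eps and then shift upward so that the noise rarely pushes the
    released value below the true one.
    """
    noise = rng.laplace(scale=1.0 / eps)
    shift = np.log(1.0 / beta) / eps   # with this shift, P(noise < -shift) <= beta
    return stat_value + noise + shift
```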
Error analysis.
With probability at least , , then since , and therefore . Hence . Putting everything together, the error that can be achieved is (omitting ):
4. Uniformize Sensitivity
So far, we have shown a parameterized algorithm for answering linear queries, whose utility is in terms of the join size and the residual sensitivity. A natural question arises: Can we achieve a more fine-grained parameterized algorithm with better utility?
Let us start with a two-table instance (see Figure 2), with input size , join size and local sensitivity . Algorithm 1 achieves an error of . However, this instance is beyond the scope of Theorem 3.4, as the sensitivity (or degree, in this simple scenario) distribution over join values is extremely non-uniform. Revisiting the error complexity in Theorem 3.2, we can gain some intuition regarding why Algorithm 1 does not perform well on this instance. The costly term is , where is the largest degree of join values in the input instance. However, there are many join values whose degree is much smaller than , so a natural idea is to uniformize sensitivities, i.e., partition the input instance into a set of sub-instances by join values, such that join values with similar sensitivities are put into the same sub-instance. After this uniformization, we just invoke our previous join-as-one approach as a primitive on each sub-instance independently, and return the union of the synthetic datasets generated for all sub-instances. The framework of uniformization is illustrated in Algorithm 4. (All missing proofs are given in Appendix B.1.)
4.1. Uniformize Two-Table
As mentioned, there is quite an intuitive way to uniformize a two-table join. As described in Algorithm 5, the high-level idea is to bucket join values in by their maximum degree in and . To protect the privacy of individual tuples, we add some noise drawn from to each join value's degree before determining to which bucket it should go. Recall that . Let and for all . Conceptually, we divide into buckets, where the -th bucket is associated with for . The set of values from whose maximum noisy degree in and falls into the -th bucket is denoted as . For each , we identify the tuples in whose join value falls into as , which forms the sub-instance . At last, we return all sub-instances. Note that is not explicitly used in Algorithm 5 as it is not public, but it is easy to check that no join value has degree more than under the input size constraint. Hence, it is safe to only consider in our analysis.
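A sketch of the bucketing step in Algorithm 5: each join value's maximum degree is perturbed with noise and assigned to a logarithmic bucket; the Laplace noise and the power-of-two bucket boundaries are illustrative assumptions.

```python
import math
import numpy as np

def bucket_join_values(deg1, deg2, eps, rng=np.random.default_rng()):
    """Partition join values into logarithmic buckets by noisy maximum degree.

    deg1, deg2 : dicts mapping a join value to its degree in R1 and R2
    Returns a dict mapping bucket index i to the set of join values whose noisy
    maximum degree falls (roughly) into (2**(i-1), 2**i].
    """
    buckets = {}
    for b in set(deg1) | set(deg2):
        noisy = max(deg1.get(b, 0), deg2.get(b, 0)) + rng.laplace(scale=1.0 / eps)
        i = max(0, math.ceil(math.log2(max(noisy, 1.0))))
        buckets.setdefault(i, set()).add(b)
    return buckets
```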
Lemma 4.1.
Algorithm 4 on two-table join is -DP.
The key insight in Lemma 4.1 is that adding or removing one input tuple can increase or decrease the degree of one join value by at most one. Hence, Algorithm 5 satisfies -DP by the parallel composition rule (mcsherry2009privacy). Moreover, since each input tuple participates in exactly one sub-instance, and Algorithm 2 preserves -DP for each sub-instance by Lemma 3.1, Algorithm 4 preserves -DP by the basic composition (dwork2006calibrating).
Error Analysis.
We next analyze the error of Algorithm 8. Given an instance , let be the partition of generated by Algorithm 8. Correspondingly, derives a partition of input instance , where is the sub-instance in which tuples have their join values falling into . Moreover, derives a partition of join result . Let be the synthetic dataset generated for by JoinMultiTable. From Theorem 3.2, with probability , the error of answering linear queries defined over that can be achieved with is . By a union bound, with probability , the error can be achieved with is:
where .
One may observe an overlap between the bucket ranges defined by , but always holds as induces a partition of as well as . In this way, Algorithm 4 on the two-table query is never worse than Algorithm 1, since , which is implied by the Cauchy-Schwarz inequality. Furthermore, we show in Appendix B.1 that for some input instances the gap between Theorem 4.3 and Theorem 3.2 can be polynomially large in terms of the data size.
Careful inspection reveals that although the partition generated by Algorithm 5 is randomized due to the noise, it is not far away from the fixed partition based on true degrees. As shown in Appendix B.1, Algorithm 5 always has its error bounded by what can be achieved through the uniform partition defined as below.
Definition 4.2 (Uniform Partition).
For , and an input instance , the uniform partition of on is such that for any , if .
Theorem 4.3.
For any two-table instance , a family of linear queries , and , , there exists an algorithm that is -DP, and with probability at least produces such that all queries in can be answered within error:
where , is the number of join results whose join value falls into under the uniform partition .
Let be a characterization vector of join sizes under the uniform sensitivity partition . A two-table database conforms to if
holds for each . Then, we introduce the following parameterized lower bound in terms of join size distribution, and show that Algorithm 8 is optimal on the two-table query up to poly-logarithmic factors. The proof is given in Appendix B.1.
Theorem 4.4.
Given arbitrary parameters under the uniform partition , for every sufficiently small , and , there exists a family of queries of size on domain of size such that any -differentially private algorithm that takes as input a two-table database of input size at most while conforming to , and outputs an approximate answer to each query in to within error must satisfy .
4.2. Uniformize Hierarchical Query
Beyond two-table query, we surprisingly find that this uniformization technique can be further extended to the class of hierarchical queries. A join query is hierarchical, if for any pair of attributes , either , or , or , where is the set of relations containing attribute . One can always organize the attributes of a hierarchical join into a tree such that every relation corresponds to a root-to-node path (see Figure 3).
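A small sketch of checking the hierarchical property, representing the query by a mapping from each attribute to the set of relations containing it; the attribute and relation names in the usage example are illustrative.

```python
def is_hierarchical(rel):
    """Return True iff the join query is hierarchical.

    rel : dict mapping each attribute x to the set of relations containing x.
    Hierarchical means: for every pair of attributes x, y, the sets rel(x) and
    rel(y) are either nested or disjoint.
    """
    attrs = list(rel)
    for i, x in enumerate(attrs):
        for y in attrs[i + 1:]:
            rx, ry = set(rel[x]), set(rel[y])
            if not (rx <= ry or ry <= rx or rx.isdisjoint(ry)):
                return False
    return True

# R1(x, y), R2(x, z) is hierarchical; R1(x), R2(x, y), R3(y) is not.
print(is_hierarchical({"x": {"R1", "R2"}, "y": {"R1"}, "z": {"R2"}}))   # True
print(is_hierarchical({"x": {"R1", "R2"}, "y": {"R2", "R3"}}))          # False
```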
Thanks to the nice structure of hierarchical joins, the uniformization technique can be applied effectively. Recall that the essence of uniformization is to partition the instance into a set of sub-instances, so that we can get an upper bound on the residual sensitivity that is as tight as possible, as implied by the error expression in Theorem 1.5. In (2), the residual sensitivity is built on the join sizes of a set of residual queries, but these statistics are too far from serving as partition criteria. Instead of using residual sensitivity directly, we first rewrite the join size of a residual query and derive an upper bound for it in terms of degrees.
4.2.1. An Upper Bound on Residual Query Size
To derive an upper bound on 's for , we target a broader class of -hierarchical queries (this notion differs from the q-hierarchical queries in (berkholz2017answering), where serves as the set of output attributes, although they have the same property characterization on ) by generalizing the aggregate attributes in (1) to any subset of attributes that sit on the "top" of . It is not hard to see that for boundary attributes .
Definition 4.5.
For a hierarchical join and instance , a -hierarchical query on and is defined as , where satisfies the property: for any , if and , then .
In the remainder, we focus on deriving an upper bound for -hierarchical queries , which boils down to a product chain of degree functions. For example, in Figure 3 indicates the number of tuples in that appears.
Definition 4.6 (Degree Function).
For a hierarchical join and an instance , the degree function for a subset of relations , a subset of attributes , and a tuple is defined as follows (when the context is clear, we drop the superscript from ; moreover, when , say , we just write as short for ):
(3)
Base Case. When , say , degenerates to the degree of tuples over attributes in (as ):
General Case. Denote as the residual query of by removing attributes in . We distinguish the following two cases:
-
•
is disconnected. Let be the set of connected components of . We can then decompose into a set of sub-queries, after pushing the semi-join inside each connected component:
The correctness follows from the fact that and satisfies the property in Definition 4.6.
-
•
is connected. In this case, we must have .
The correctness follows from the fact that and satisfies the property in Definition 4.6.
It is not hard to see that is eventually upper bounded by a product chain of degree functions (see Example 4.8). Careful inspection reveals that the degree functions participating in are not arbitrary; instead, they display very special structures captured by Lemma 4.7, which is critical to our partition procedure.
Lemma 4.7.
Every degree function participating in the upper bound of corresponds to a distinct attribute such that and corresponds to the set of attributes lying on the path from the root to ’s parent in .
Example 4.8.
An upper bound derived for in Figure 3:
4.2.2. Partition with Degree Characterization
From the upper bound on , we now define the degree characterization for hierarchical joins, similarly to the two-table join. Our target is to partition the input instance into a set of sub-instances (which may not be tuple-disjoint), such that each sub-instance is captured by a distinct degree characterization, and the join results over all sub-instances form a partition of the final join result.
Definition 4.9 (Degree Characterization).
For a hierarchical join , a degree characterization is defined as such that for any and , if and only if there exists an attribute such that and is the set of ancestors of in .
As described in Algorithm 6, the high-level idea is to partition the input instance by attributes in a bottom-up way on , where is the attribute tree of the input hierarchical query , and invoke Algorithm 7 as a primitive on every sub-instance obtained so far.
As described in Algorithm 7, PartitionByAttr takes as input an attribute and an instance such that
-
•
for every attribute as a descendant of in , is uniformized for every , where is the set of ancestors of in .
and outputs a set of sub-instances of , such that
-
•
, i.e., the join results of sub-instances in form a partition of join result of ;
-
•
for every sub-instance , is roughly the same for every , where is the set of ancestors of in .
Similar to the two-table case, for every , we compute a noisy version of its degree and put it into the bucket indexed by its logarithmic value. For each , we obtain a tuple-disjoint partition of based on , such that each defines a sub-instance involving a sub-relation for each as well as all remaining relations with . At last, we return the union of all sub-instances.
Lemma 4.10.
For input instance , let be the set of sub-instances returned by Algorithm 7. satisfies the following properties:
-
•
, i.e., the join results of all sub-instances in form a partition of the join result of ;
-
•
Each input tuple appears in sub-instances of , where is a constant depending on query size;
-
•
Each sub-instance corresponds to a distinct degree characterization such that for any and with :
(4)
Lemma 4.11.
Algorithm 4 on hierarchical joins is -DP, where is a constant depending only on the tree.
The logarithmic factor in Lemma 4.11 arises because each input tuple participates in sub-instances, as implied by Lemma 4.10. Next, given a degree characterization , how do we derive the residual sensitivity of instances consistent with ? Actually, this is already resolved by just rewriting the upper bound derived for , denoted as . Similar to the two-table case, we also define the uniform partition of an input instance over a hierarchical join query using true degrees as . As shown in Appendix B.1, the error complexity achieved by the partition based on noisy degrees can be bounded by that of the uniform one.
Theorem 4.12.
For any hierarchical join and an instance , a family of linear queries , and , , there exists an algorithm that is -DP, and with probability at least produces such that all queries in can be answered within error:
where is over all degree characterizations of , is the sub-instance characterized by , is the join size of , and .
Let be the output size distribution over the uniform degree partition. An instance conforms to if , for every degree characterization , where is the join result of sub-instance . Extending the lower bound argument of Theorem 4.4, we obtain the following parameterized lower bound for hierarchical queries.
Theorem 4.13.
Given a hierarchical join and arbitrary parameters , for every sufficiently small , and , there exists a family of queries of size on domain of size such that any -differentially private algorithm that takes as input a multi-table database over of input size at most while conforming to , and outputs an approximate answer to each query in to within error must satisfy
where is over all degree characterizations and is the local sensitivity of under .
5. Conclusions and Discussion
In this work, we proposed algorithms for answering linear queries of multi-table joins. Our work opens up several interesting questions that we list below.
Non-Hierarchical Setting.
First, and perhaps the most immediate question, is whether uniformization can benefit the non-hierarchical case. We note that the uniformization technique from Section 4.2 itself is applicable but may not lead to uniformized residual sensitivity on non-hierarchical queries, even on the simplest non-hierarchical join . We can simply partition values in into buckets, each of which is indexed by two integers such that . A similar partition idea applies to values in . Note that each pair of buckets corresponding to defines a sub-instance . But the residual sensitivity within each sub-instance is not guaranteed to be uniformized. In determining the residual sensitivity in (2), we observe that and ; however, and are defined over the whole instance , instead of the specific instance . This coarse-grained partition can be extended to arbitrary acyclic queries (see Appendix B.1), but does not necessarily lead to uniformized residual sensitivity inside each sub-instance. We leave this as interesting future work.
Query-Specific Optimality.
Throughout this work, we consider a worst-case set of queries parameterized by its size. Although this is a reasonable starting point, it is also plausible to hope for an algorithm that is nearly optimal for all query sets . In the single-table case, this has been achieved in (HardtT10; BhaskaraDKT12; NikolovT016). However, the situation is rather complicated in the multi-table setting since we have to take the local sensitivity into account, whereas, in the single-table case, the query family already tells us everything about the possible change by moving to a neighboring dataset. This also seems like an interesting open question for future research.
Instance-Specific Optimality.
Similarly, we have considered worst-case instances. One might instead prefer to achieve finer-grained instance-optimal errors. For the single-table case, (asi2020instance) observed that an instance-optimal DP algorithm is not achievable, since a trivial algorithm can return the same answer(s) for all input instances, which could work perfectly on one specific instance but miserably on all remaining instances. To overcome this, a notion of "neighborhood optimality" has been proposed (dong2021nearly; asi2020instance), where we consider not only a single instance but also its neighbors at some constant distance. We note, however, that this would not work for our setting when there are a large number of queries. Specifically, if we again consider an algorithm that always returns the true answer for this instance, then its error with respect to the entire neighborhood set is still quite small: at most the distance times the maximum local sensitivity in the set. This is independent of the table size , whereas our lower bounds show that the dependency on is inevitable. As such, the question of how to define and prove instance-optimal errors remains open for the multi-query case.
Appendix A Missing Proof in Section 1
We adapt the single-table proof to our context. Let be a table of data items over domain . Define to be the set of linear queries. Let be the distribution returned by the MWEM algorithm in the -th iteration. Define as the maximum error over all queries in terms of and . Now we are going to bound the maximum error of the MWEM algorithm:
Define for . Note that the sensitivity for invoking the exponential mechanism is
and that for invoking the Laplace mechanism is
which follows by a similar argument.
Lemma A.1.
With probability at least for any , for all , we have:
Proof.
For the first inequality, we note that the probability that the exponential mechanism with parameter selects a query with quality score at least less than the optimal is bounded by
For the second inequality, we note that happens if and only if happens. By definition of the Laplace distribution, we have:
A union bound over events yields such a failure probability. ∎
We then show how the MW mechanism improves the approximation in each round where has large magnitude. To capture the improvement, we use the relative entropy function:
Here, we can only show that
where such that for every .
We next consider how changes in each iteration:
where
Using the fact that for and ,
Then, we can rewrite in the following way:
where the last inequality follows from the previous analysis by taking . By rewriting the last inequality, we have
Then,
By taking , we can obtain the minimized error as . In -DP, this error can be generalized as (omitting the factor of ):
Appendix B Missing Proofs in Section 3
Lemma B.1.
For any pair of neighboring instances , let be the join results of , respectively. Then, .
Lemma B.1 follows directly from the definition of and the non-negativity of STGeom.
Proof of Lemma 3.1.
Consider two neighboring databases and . We first prove . Let be the local sensitivity of , respectively. It can be easily checked that . Let be the random variables obtained by adding random noise from to , respectively. Let be the support of , respectively. From STGeom, we observe:
(5)
since , hence . We next note that for any , the following holds:
since . From the sequential composition rule (dwork2006calibrating; dwork2006our), we conclude that .
We next turn to line 6. Let be the synthetic data generated for separately. The single-table algorithm starts with a uniform distribution with for , and for . Implied by the analysis above, the initial state satisfies -DP, i.e., .
We finally come to lines 6-9, with invocations of the EM mechanism and the Laplace mechanism. For the exponential mechanism, let be the query selected after line 7. For the Laplace mechanism, let be the resulting value after line 9 for , respectively. From the above, we already have (5). We next note two facts for any :
-
•
since holds for every query ,
since . Summing over iterations, it yields -DP. Thus, Algorithm 1 preserves -DP. ∎
Proof of Lemma 3.6.
For any pair of neighboring instances , we note that always holds. Let be the updated value after line 3 for , respectively. It can be shown that with probability at least , , then , and therefore . For any , we have
thus, . Then, for any , we know that
since . Let be the resulting value after line 5 for , respectively. Then, .
We next turn to line 6. Let be the synthetic data generated for separately. The single-table algorithm starts with a uniform distribution with for , and for . Implied by the analysis above, the initial state satisfies -DP, i.e., .
We finally come to lines 7-10, with invocations of the EM mechanism and the Laplace mechanism. For the exponential mechanism, let be the query selected after line 7. For the Laplace mechanism, let be the resulting value after line 9 for , respectively. As before, we already showed that . We next note two facts for any :
-
•
;
-
•
The first one follows from the fact that holds for every query . The second one follows from the fact that . Summing over iterations, it results in -DP. Thus, Algorithm 3 preserves -DP. ∎
Proof of Theorem 1.6.
Set and . From Theorem 1.4, let be the set of queries on which any -differentially private algorithm that takes as input a single-table database and outputs an approximate answer to each query in within -error requires that . For an arbitrary single-table database , we construct a multi-table instance for of input size , join size , and local sensitivity as follows. Without loss of generality, we pick as the anchor relation to encode .
-
•
Set if , and otherwise;
-
•
There are distinct values in for , where ;
-
•
For each data record , we add a tuple in ;
-
•
Tuples in for form a Cartesian product of size over all attributes in ;
It can be easily checked that . From , we construct a set of queries over as . For each query , we construct another query as follows:
-
•
such that if and only if for any ;
-
•
over for .
We also add the counting query with for any to .
It suffices to show that if there is a -differentially private algorithm that takes any multi-table instance in and outputs an approximate answer to each query in within error , then there is a -differentially private algorithm that takes any single-table instance and outputs an approximate answer to each query in within error . The remaining argument exactly follows that of the two-table case. ∎
B.0.1. Missing Proofs for the Lower Bound
Proof of Theorem 3.3.
Let . Let be a pair of neighboring instances such that .
Suppose for the sake of contradiction that there is an -DP algorithm that achieves an error with probability at least 0.99. Let be the results released for join sizes of respectively. Then, as preserves -DP, it must be that
a contradiction. ∎
B.1. Missing Proofs in Section 4
example 0.
example 0.
Proof of Lemma 4.1.
Consider two neighboring instances and . Let be the partitions of for , respectively. By definition, if and only if ; is defined similarly. Note that holds for any and . Hence, . As the degrees for different 's are calculated on disjoint parts of the input data, the claim follows from the parallel composition rule. ∎
Proof of Theorem 4.3.
Given an input instance over the two-table query, let , be a fixed partition of , such that if and only if
Let be a random partition of by Algorithm 5. Note that happens only if
It is easy to see that . For simplicity, we denote the number of join results involving values in as and the number of join results involving values in as . Then the cost of Algorithm 8 under partition is
thus can be bounded by the cost of under the fixed uniform partition of . ∎
Proof of Theorem 4.4.
Let be the set of two-table instances with output size in which every join value has maximum degree falling into . Our proof consists of two steps:
-
•
Step (1): There exists a family of queries such that any -algorithm that takes as input an instance from and outputs an approximate answer to each query in within error must require .
-
•
Step (2): There exists a family of queries such that any -algorithm that takes as input an instance and outputs an approximate answer to each query in within error must require .
Let’s first focus on step (1) with arbitrary . Let be an arbitrary integer such that holds. From Theorem 1.4, let be the set of hard queries on which any -differentially private algorithm takes as input any single table , and outputs an approximate answer to each query in within error must require . For an arbitrary single table , we can construct a two-table instance as follows:
-
•
Set , and ;
-
•
Set and ;
-
•
Tuples in form a Cartesian product of size over ;
-
•
Tuples in form a Cartesian product of sizes over ;
-
•
For each of the records in , say , we add a tuple and set .
It can be easily checked that . From Theorem 1.4, let be the set of all linear queries over . For each query , we construct another query such that:
-
•
such that if and only if ;
-
•
.
We borrow a similar argument from the two-table lower bound (Theorem 3.4): if there is an -algorithm that takes a two-table instance in and outputs an approximate answer to each query in within error , then there exists an -algorithm that takes an arbitrary single table and outputs an approximate answer to each query in within error . As , .
Step (2).
From Theorem 1.4, let be the family of linear queries over , such that . Consider an arbitrary -differentially private algorithm that takes as input in and outputs an approximate answer to each query in within error . If , there must exist an -differentially private algorithm that takes as input any instance from and outputs an approximate answer to each query in , for some , contradicting Step (1). ∎
In the attribute tree , if there exists an attribute with only one child, say with child , we know that always appear together in all relations, so they can simply be considered as one "combined" attribute. Hence, it is safe to assume that every non-leaf attribute has at least two children in .
Proof of Lemma 4.10.
We will prove the first two properties by induction. In the base case when only contains one instance, these two properties hold trivially. Consider an arbitrary iteration of while-loop in Algorithm 6. By hypothesis, all sub-instances in have disjoint join results, and the union of join results of all sub-instances in is . Then, it suffices to show that PartitionByAttr generates a partition of as
such that all ’s have disjoint join results, and the union of join results of all ’s is the join result of .
Consider an arbitrary join result of , where is the projection of onto . For any , . Let be the tuple such that holds for every . Denote . Then, for any , if and only if by line 9. Thus, , but for any .
For the third property, we actually show that each input tuple appears in sub-instances, where is the total number of appearances of join attributes in all relations. In an invocation of PartitionHier, tuple appears in sub-instances if , and in only one sub-instance otherwise. Overall, the procedure PartitionByAttr will be invoked on every non-leaf attribute in , thus every input tuple appears in sub-instances.
Lastly, we show that each sub-instance corresponds to a degree characterization. Let's focus on an arbitrary single relation in Algorithm 6, which is partitioned by attributes lying on the path corresponding to in in a bottom-up way, say . When PartitionByAttr is invoked, the sub-instance must have with
holds for any with . In the subsequent partitions on , we claim that for each , the set of tuples with the same value always appear together. Suppose PartitionByAttr is invoked with any . Observe that for any with , always holds since , where and . By line 9 in Algorithm 7, tuples with the same value always fall into the same sub-relation. In other words, remains the same in subsequent partitions, for any with . This argument can also be generalized to any attribute .
Now, let’s interpret the partition process as a directed tree, such that has sub-instances in as its children. Each sub-instance returned by Algorithm 6 corresponds to a distinct root-to-leaf path in this directed tree. Consider an arbitrary sub-instance corresponding to the root-to-leaf path . When each instance goes through an invocation of PartitionByAttr, a distinct value is set up for each of ’s children. As proved above, this degree value will be preserved in any for . As each instance on this path corresponds to one invocation of PartitionByAttr on a distinct attribute, all possible values of will be set up well for . Thus, each sub-instance returned by Algorithm 6 corresponds to a degree characterization.
Moreover, for any pair of sub-instances returned by Algorithm 6, we can find the lowest common ancestor of these two corresponding root-to-leaf paths, say with an invocation of PartitionByAttr on . So, these two sub-instances must have different degree values over attributes in relations from . As proved above, the difference will be preserved in these two sub-instances. This argument can be applied to any pair of sub-instances returned by Algorithm 6, thus each sub-instance corresponds to a distinct degree characterization. ∎
In an attribute tree rooted at , let be the sequence of attributes lying on the path from to .
Proof of Lemma 4.7.
We will prove it through the following four steps:
-
•
Step 1: ;
-
•
Step 2: ;
-
•
Step 3: there exists no such that .
-
•
Step 4: If is the node such that , let be the lowest ancestor of with more than one child. Then, and .
Step 1. It suffices to show that the first property holds for every invocation of . First, it holds trivially for , which follows from the definition of partial attributes . By hypothesis, we assume it holds for some . Next, we show that it is also preserved in the two recursive rules applied to . We distinguish two cases.
-
•
is disconnected. Consider an arbitrary connected component . For any pair of and , is implied by the hypothesis. For any pair of and , is implied by the disconnectedness of . Together, for any pair of and , we have . Hence, for each , also satisfies this property.
-
•
is connected. We note that every attribute in only appears in relations from by hypothesis. As , every attribute in only appears in relations from . In other words, if there exists a pair and , . Hence, also satisfies this property.
Step 2. With the first property, we next prove the second property. From Definition 4.6, . In the base case when is invoked, we must have , thus . In the recursive step when is connected, we already showed . Overall, when is invoked, we must have .
Step 3. Then, we come to the third property. Suppose there exist some such that . Implied by the first property, , contradicting the second property.
Step 4. Finally, we come to the last step. As is hierarchical, any subset of relations from also forms a hierarchical join. Attributes in form a root-to-node path in the attribute forest of . Together with the third property, if equals the intersection of relations in , then . Moreover, we note that is also a root-to-node sub-path of . Suppose . Then, we can always find , since has at least one child which is not an ancestor of . As , we have . By definition, , a contradiction. Under the assumption that every non-leaf attribute in has at least two children, it is safe to replace with the parent of . ∎
Proof of Lemma 4.11.
Consider two neighboring instances and . Note that holds for any , , and . This is not hard to show since changing one tuple can change at most one tuple in as well as one tuple in its projection onto attributes , thus at most one tuple in . Let be the root-to-leaf path corresponding to . Each input tuple contributes to at most degree functions, i.e., with , for any . Hence, , where is a query-dependent quantity.
Note that for arbitrary , adding noisy degree can only enlarge it by a factor of . Let be the set of join attributes, and . We consider the most fine-grained partition of such that , where corresponds to the set of all join results participated by . Let be the degree characterization where falls into with true degrees, and be the degree characterization where falls into with noisy degrees by Algorithm 7. The cost of Algorithm 4 on hierarchical joins is
where the last inequality is implied by the fact that for any , the number of such that some could have and is at most .
Proof of Theorem 4.13.
Let be the set of multi-table instances with output size and following the degree characterization . Our proof consists of two steps:
-
•
Step (1): There exists a family of queries such that any -algorithm that takes as input an instance from and outputs an approximate answer to each query in within error must require .
-
•
Step (2): There exists a family of queries such that any -algorithm that takes as input an instance and outputs an approximate answer to each query in within error must require .
Let’s first focus on step (1) with arbitrary . Assume is achieved on some , i.e., . We compute the value of and set . From Theorem 1.4, let be the set of hard queries on which any -differentially private algorithm takes as input any single table , and outputs an approximate answer to each query in within error must require . For an arbitrary single table , we can construct an instance as follows:
-
•
Set for any ;
-
•
When visiting attributes of in a bottom-up way, say , we set where is the set of attributes lying on the path from to ’s parent in ;
-
•
For the root , set ;
-
•
Every relation is a Cartesian product over its own attributes;
-
•
For each data record , we exclusively pick a tuple and attach to it.
It can be easily checked that , and . From Theorem 1.4, let be the set of all linear queries over . For each query , we construct another query such that:
-
•
such that if and only if is attached to ;
-
•
for .
We borrow a similar argument from the two-table lower bound (Theorem 3.4): if there is an -algorithm that takes an instance in and outputs an approximate answer to each query in within error , then there exists an -algorithm that takes an arbitrary single table and outputs an approximate answer to each query in within error . As , .
Step (2).
From Theorem 1.4, let be the family of linear queries over , such that . Consider an arbitrary -differentially private algorithm that takes as input in and outputs an approximate answer to each query in within error . If , there must exist an -differentially private algorithm that takes as input any instance from and outputs an approximate answer to each query in for some , contradicting Step (1). ∎
Uniformize Acyclic Query.
A join query is acyclic if there is a join tree such that every node of corresponds to a relation, and for every attribute , the set of nodes containing forms a connected subtree of . For general acyclic join query , we can first fix a join tree . Let be the set of connected components of relations in on . For each , we pick an arbitrary relation with as the root, which is always feasible; otherwise itself is a connected component, and . After picking as the root, we denote for as the parent of in . For completeness, let . Then,
At last, . To apply the uniformization technique, it suffices to partition each relation by the degree of values in attributes , for all possible and root .