
Differentially Private Data Release over Multiple Tables

Abstract.

We study synthetic data release for answering multiple linear queries over a set of database tables in a differentially private way. Two special cases have been well-studied in the literature: how to release a synthetic dataset for answering multiple linear queries over a single table, and how to release the answer to a single counting (join size) query over a set of database tables. Compared to the single-table case, the join operator has emerged as the major difficulty in query answering, since the sensitivity (i.e., how much an individual data record can change the query answer) can be heavily amplified via complex join relationships.

We present an algorithm for the general problem, and prove a lower bound illustrating that our general algorithm achieves parameterized optimality (up to logarithmic factors) on some simple queries (for example, two-table join queries) in the most commonly-used privacy parameter regimes. For the class of hierarchical joins, we present a data partition procedure that exploits the idea of uniformizing sensitivities to further improve the utility.


1. Introduction

Synthetic data release is a central objective in private data analysis. Differential privacy (DP) (dwork2006calibrating; dwork2006our) has emerged as a compelling model that enables us to formally study the tradeoff between the utility of released information and the privacy of individuals. In the literature, there is a large body of work that studies synthetic data release over a single table, for example, the private multiplicative weights algorithm (hardt2012simple), the histogram-based algorithm (vadhan2017complexity), the matrix mechanism (li2010optimizing), and the Bayesian network algorithm (zhang2017privbayes); as well as other works focusing on geometric range queries (li2011efficient; li2012adaptive; dwork2010differential; bun2015differentially; chan2011private; dwork2015pure; cormode2012differentially; huang2021approximate) and datacubes (ding2011differentially).

Data analysis over multiple private tables connected via join operators has been the subject of significant interest within the area of modern database systems. In particular, the challenging question of releasing the join size over a set of private tables has been studied in several recent works, including the sensitivity-based framework (johnson2018towards; dong2021residual; dong2021nearly), the truncation-based mechanism (kotsogiannis2019privatesql; dong2022r2t; tao2020computing), as well as works on one-to-one joins (mcsherry2009privacy; proserpio2014calibrating) and on graph databases (blocki2013differentially; chen2013recursive). In practice, multiple queries (as opposed to a single one) are typically issued for data analysis, for example, a large class of linear queries on top of join results with different weights on input tuples, as a generalization of the counting join size query. One might consider answering each query independently, but the utility would be very low due to the limited privacy budget, as implied by the DP composition rule (dwork2006calibrating; dwork2006our). Hence the question that we tackle in this paper is: how can we release synthetic data for accurately answering a large class of linear queries over multiple tables in a differentially private way?

1.1. Problem Definition

Multi-way Join and Linear Queries.

A (natural) join query can be represented as a hypergraph $\mathcal{H}=(\mathbf{x},\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{m}\})$ (abiteboul1995foundations), where $\mathbf{x}$ is the set of vertices or attributes and each $\mathbf{x}_{i}\subseteq\mathbf{x}$ is a hyperedge or relation. Let $\textsf{dom}(x)$ be the domain of attribute $x$. Define $\mathbb{D}=\textsf{dom}(\mathbf{x})=\prod_{x\in\mathbf{x}}\textsf{dom}(x)$, and $\mathbb{D}_{i}=\textsf{dom}(\mathbf{x}_{i})=\prod_{x\in\mathbf{x}_{i}}\textsf{dom}(x)$ for any $i\in[m]$. Given an input instance $\mathbf{I}=(R^{\mathbf{I}}_{1},R^{\mathbf{I}}_{2},\cdots,R^{\mathbf{I}}_{m})$, each table $R^{\mathbf{I}}_{i}:\mathbb{D}_{i}\to\mathbb{Z}^{\geq 0}$ is a function from the domain $\mathbb{D}_{i}$ to the non-negative integers, indicating the frequency of each tuple. This is more general than the setting where each tuple appears at most once in each relation, since it can capture annotated relations (green2007provenance) with non-negative values. The input size of $\mathbf{I}$ is defined as the sum of frequencies of data records, i.e., $n=\sum_{i\in[m]}\sum_{t\in\mathbb{D}_{i}}R^{\mathbf{I}}_{i}(t)$. When the context is clear, we drop the superscript $\mathbf{I}$ from $R^{\mathbf{I}}_{j}$, and often use $R_{j}=\{t\in\mathbb{D}_{j}:R_{j}(t)>0\}$ to denote the set of tuples with non-zero frequency.

For each $i\in[m]$, we have a family $\mathcal{Q}_{i}$ of linear queries defined over $\mathbb{D}_{i}$ such that for $q\in\mathcal{Q}_{i}$, $q:\mathbb{D}_{i}\to[-1,+1]$. The family of linear queries defined over $\mathbf{I}$ is $\mathcal{Q}=\times_{i\in[m]}\mathcal{Q}_{i}$. For each linear query $q=(q_{1},q_{2},\cdots,q_{m})\in\mathcal{Q}$, the result is defined as

$$q(\mathbf{I})=\sum_{\vec{t}=(t_{1},t_{2},\cdots,t_{m})\in\times_{i\in[m]}\mathbb{D}_{i}}\rho\left(\vec{t}\right)\prod_{i\in[m]}q_{i}(t_{i})\cdot R_{i}(t_{i}),$$

where $\rho:\times_{i\in[m]}\mathbb{D}_{i}\to\{0,1\}$ is the natural join function, such that for each combination $\vec{t}\in\times_{i\in[m]}\mathbb{D}_{i}$, $\rho(\vec{t})=1$ if and only if $\pi_{\mathbf{x}_{i}\cap\mathbf{x}_{j}}t_{i}=\pi_{\mathbf{x}_{i}\cap\mathbf{x}_{j}}t_{j}$ for each pair $i,j\in[m]$. Here, $\pi$ is the standard projection operator, such that $\pi_{x}t$ indicates the value(s) that tuple $t$ takes on attribute(s) $x$.
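To make the semantics concrete, here is a minimal (and deliberately brute-force) Python sketch of evaluating a linear query under the definition above; the dict-based relation encoding and all helper names are ours, for illustration only.

```python
from itertools import product

def rho(ts):
    # natural-join indicator: every pair of tuples agrees on all shared attributes
    return all(t1[a] == t2[a]
               for i, t1 in enumerate(ts) for t2 in ts[i + 1:]
               for a in t1.keys() & t2.keys())

def answer(instance, q):
    # instance: one list per relation, each entry a (tuple-as-dict, frequency) pair;
    # q: one function per relation mapping a tuple to a weight in [-1, +1].
    # Tuples with zero frequency are simply omitted, so the sum matches q(I) above.
    total = 0.0
    for combo in product(*instance):
        ts = [t for t, _ in combo]
        if rho(ts):
            w = 1.0
            for (t, f), qi in zip(combo, q):
                w *= qi(t) * f
            total += w
    return total

# count() on R1(A,B) |><| R2(B,C): all q_i identically +1.
R1 = [({"A": 1, "B": 1}, 2), ({"A": 2, "B": 2}, 1)]
R2 = [({"B": 1, "C": 5}, 3), ({"B": 2, "C": 6}, 1)]
print(answer([R1, R2], [lambda t: 1, lambda t: 1]))  # 2*3 + 1*1 = 7
```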

Our goal is to release a function $\mathbb{F}:\times_{i\in[m]}\mathbb{D}_{i}\to\mathbb{N}$ such that all queries in $\mathcal{Q}$ can be answered over $\mathbb{F}$ as accurately as possible, i.e., minimizing the $\ell_{\infty}$-error $\alpha=\max_{q\in\mathcal{Q}}|q(\mathbf{I})-q(\mathbb{F})|$, where

$$q(\mathbb{F})=\sum_{(t_{1},t_{2},\cdots,t_{m})\in\times_{i\in[m]}\mathbb{D}_{i}}\mathbb{F}(t_{1},t_{2},\cdots,t_{m})\prod_{i\in[m]}q_{i}(t_{i}).$$
DP Setting.

Two DP settings have been studied in the relational model, depending on whether foreign key constraints are considered or not. The one considering foreign key constraints assumes the existence of a primary private table, and deleting a tuple in the primary private relation will delete all other tuples referencing it; see (kotsogiannis2019privatesql; tao2020computing; dong2022r2t). In this work, we adopt the other notion, which does not consider foreign key constraints, but defines instances to be neighboring if one can be converted into the other by adding/removing a single tuple; this is the same as the notion studied in previous works (johnson2018towards; kifer2011no; narayan2012djoin; proserpio2014calibrating).¹

¹Our definition corresponds to the add/remove variant of DP, where a record is added/removed from a single database $R_{i}$. Another possible definition is based on substitution DP. Add/remove DP implies substitution DP with only a factor of two increase in the privacy parameters; therefore, all of our results also apply to substitution DP.

Definition 1.1 (Neighboring Instance).

A pair of instances $\mathbf{I}=(R_{1},\cdots,R_{m})$ and $\mathbf{I}^{\prime}=(R^{\prime}_{1},\cdots,R^{\prime}_{m})$ are neighboring if there exist some $i\in[m]$ and $t^{*}\in\mathbb{D}_{i}$ such that:

  • for any $j\in[m]-\{i\}$, $R_{j}(t)=R^{\prime}_{j}(t)$ for every $t\in\mathbb{D}_{j}$;

  • $R_{i}(t)=R^{\prime}_{i}(t)$ for every $t\in\mathbb{D}_{i}-\{t^{*}\}$, and $|R_{i}(t^{*})-R^{\prime}_{i}(t^{*})|=1$.
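Under the same frequency-function view of relations, Definition 1.1 can be checked mechanically; the following small Python sketch (with an illustrative dict encoding, not from the paper) is one way to do so.

```python
def is_neighboring(I, J):
    # I, J: lists of dicts mapping tuples to non-negative frequencies.
    # Neighbors differ in the frequency of exactly one tuple of exactly
    # one relation, and only by 1 (Definition 1.1).
    diffs = []
    for Ri, Si in zip(I, J):
        for t in Ri.keys() | Si.keys():
            d = abs(Ri.get(t, 0) - Si.get(t, 0))
            if d:
                diffs.append(d)
    return diffs == [1]

I1 = [{("a", 1): 2}, {(1, "c"): 1}]
I2 = [{("a", 1): 3}, {(1, "c"): 1}]
print(is_neighboring(I1, I2))  # True: one frequency changed by exactly 1
```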

Definition 1.2 (Differential Privacy (dwork2006calibrating; dwork2006our)).

For $\epsilon,\delta>0$, a randomized algorithm $\mathcal{A}$ is $(\epsilon,\delta)$-differentially private (denoted by $(\epsilon,\delta)$-DP) if for any pair $\mathbf{I}$, $\mathbf{I}^{\prime}$ of neighboring instances and any subset $\mathcal{S}$ of outputs, $\Pr(\mathcal{A}(\mathbf{I})\in\mathcal{S})\leq e^{\epsilon}\cdot\Pr(\mathcal{A}(\mathbf{I}^{\prime})\in\mathcal{S})+\delta$.

Notation.

For simplicity of presentation, we henceforth assume that $0<\epsilon\leq O(1)$ and $0\leq\delta\leq 1/2$. Furthermore, all lower bounds below hold against any algorithm that achieves the stated error $\alpha$ with probability $1-\beta$ for some sufficiently small constant $\beta>0$; we omit this probabilistic qualifier for brevity. For notational ease, we define the following quantities:

$$f^{\mathrm{lower}}(\mathbb{D},\mathcal{Q},\epsilon)=\sqrt{\frac{1}{\epsilon}\cdot\log|\mathcal{Q}|\cdot\sqrt{\log|\mathbb{D}|}},\mbox{ and }$$
$$f^{\mathrm{upper}}(\mathbb{D},\mathcal{Q},\epsilon,\delta)=f^{\mathrm{lower}}(\mathbb{D},\mathcal{Q},\epsilon)\cdot\sqrt{\log 1/\delta}.$$

When $\mathbb{D},\mathcal{Q},\epsilon,\delta$ are clear from context, we omit them. Let $\lambda=\frac{1}{\epsilon}\log\frac{1}{\delta}$, a parameter used throughout this paper.

1.2. Prior Work

We first review the problem of releasing synthetic data for a single table, for which nearly tight upper and lower bounds are known, as stated below.

Theorem 1.3 ((hardt2012simple)).

For a single table $R$ of at most $n$ records, a family $\mathcal{Q}$ of linear queries, and $\epsilon>0$, $\delta>0$, there exists an algorithm that is $(\epsilon,\delta)$-DP, and with probability at least $1-1/\mathrm{poly}(|\mathcal{Q}|)$ produces $\mathbb{F}$ such that all queries in $\mathcal{Q}$ can be answered within error $\alpha=O\left(\sqrt{n}\cdot f^{\mathrm{upper}}\right)$ using $\mathbb{F}$.

Theorem 1.4 ((bun2018fingerprinting)).

For every sufficiently small $\epsilon,\alpha>0$, $n_{D}\geq(1/\alpha)^{\Omega(1)}$ and $n_{Q}\leq(1/\alpha)^{O(1)}$, there exists a family $\mathcal{Q}$ of queries of size $n_{Q}$ on a domain $\mathbb{D}$ of size $n_{D}$ such that any $(\epsilon,o(1/n))$-DP algorithm that takes as input a database of size at most $n$, and outputs an approximate answer to each query in $\mathcal{Q}$ to within error $\alpha$, must satisfy $\alpha\geq\tilde{\Omega}\left(\sqrt{n}\cdot f^{\mathrm{lower}}\right)$.

Another related problem that has been widely studied by the database community is how to release the answer to the join size query over multiple tables with DP, denoted as $\textsf{count}(\cdot)$:

$$\textsf{count}(\mathbf{I})=\sum_{\vec{t}=(t_{1},t_{2},\dots,t_{m})\in\times_{i\in[m]}\mathbb{D}_{i}}\rho\left(\vec{t}\right)\prod_{i\in[m]}R_{i}(t_{i}),$$

which can be captured by a special linear query $q=(q_{1},q_{2},\dots,q_{m})$ with $q_{i}:\mathbb{D}_{i}\to\{+1\}$ for all $i\in[m]$.


A popular approach to answering the counting join size query is based on the sensitivity framework: first compute $\textsf{count}(\mathbf{I})$, then compute the sensitivity of the query (measuring how much the query answer can differ between neighboring database instances), and finally release a noise-masked query answer, where the noise is drawn from some distribution calibrated according to the sensitivity. It is known that the local sensitivity $LS_{\textsf{count}}(\cdot)$ cannot be directly used to calibrate noise, since $LS_{\textsf{count}}(\mathbf{I})$ and $LS_{\textsf{count}}(\mathbf{I}^{\prime})$ for a pair of neighboring databases can differ enough to make $\mathbf{I},\mathbf{I}^{\prime}$ distinguishable. Global sensitivity (e.g., (dwork2006calibrating)) could be used, but its utility is unsatisfactory on many instances. Instead, smooth upper bounds on local sensitivity (nissim2007smooth) have been considered, for example, the smooth sensitivity, which is the smallest smooth upper bound; the residual sensitivity (dong2021residual), which is a constant-factor approximation of the smooth sensitivity; and the elastic sensitivity (johnson2018towards), which is strictly larger than the residual sensitivity.

We note that this sensitivity-based framework can be adapted to answering any single linear query, as long as the sensitivity is correctly computed. However, we are not interested in answering each linear query individually: the privacy budget blows up with the number of queries answered, as implied by the DP composition rule (dwork2006calibrating; dwork2006our), which is too costly when the query space is extremely large. Instead, we aim to release a synthetic dataset over which arbitrary linear queries can be freely answered while preserving DP. In building our data release algorithm, we also draw on the existing literature on sensitivities for the counting join size query $\textsf{count}(\cdot)$. Although smooth sensitivity enjoys the best utility, we adopt the residual sensitivity in this paper for two reasons: it is much more computationally efficient than smooth sensitivity; and, more importantly, it is built only on statistics of the input instance, whereas smooth sensitivity depends on all neighboring instances. The latter property is critical to our optimization technique of uniformizing sensitivities.

1.3. Our Results


We first present the basic join-as-one approach for releasing a synthetic dataset for multi-table queries, whose error is a function of the join output size and the residual sensitivity of $\textsf{count}(\cdot)$.

Theorem 1.5.

For any join query $\mathcal{H}$, an instance $\mathbf{I}$, a family $\mathcal{Q}$ of linear queries, and $\epsilon>0$, $\delta>0$, there exists an $(\epsilon,\delta)$-DP algorithm that with probability at least $1-1/\textsf{poly}(|\mathcal{Q}|)$ produces $\mathbb{F}$ that can be used to answer all queries in $\mathcal{Q}$ within error:

$$\alpha=O\left(\left(\sqrt{\mathsf{OUT}\cdot SS^{\beta}_{\textsf{count}}(\mathbf{I})\cdot\lambda}+SS^{\beta}_{\textsf{count}}(\mathbf{I})\cdot\lambda\right)\cdot f^{\mathrm{upper}}\right),$$

where $\mathsf{OUT}$ is the join size of $\mathbf{I}$ and $SS^{\beta}_{\textsf{count}}(\mathbf{I})$ is the $\beta$-residual sensitivity of the counting join size query $\textsf{count}(\cdot)$ on $\mathbf{I}$ for $\beta=\frac{\epsilon}{2\ln 2/\delta}$ (see Definition 3.5).

We next present a parameterized optimality result with respect to the join output size and the local sensitivity of $\textsf{count}(\cdot)$.

Theorem 1.6.

Given arbitrary parameters $\mathsf{OUT}\geq\Delta>0$ and a join query $\mathcal{H}$, for every sufficiently small $\epsilon,\alpha>0$, $n_{D}\geq(1/\alpha)^{\Omega(1)}$ and $n_{Q}\leq(1/\alpha)^{O(1)}$, there exists a family $\mathcal{Q}$ of queries of size $n_{Q}$ on a domain $\mathbb{D}$ of size $n_{D}$ such that any $(\epsilon,o(1/n))$-differentially private algorithm that takes as input a multi-table database over $\mathcal{H}$ of input size at most $n$, join size $\mathsf{OUT}$ and local sensitivity $\Delta$, and outputs an approximate answer to each query in $\mathcal{Q}$ to within error $\alpha$, must satisfy

$$\alpha\geq\tilde{\Omega}\left(\sqrt{\mathsf{OUT}\cdot\Delta}\cdot f^{\mathrm{lower}}\right).$$

It should be noted that in the setting where $\epsilon=\Omega(1)$ and $\delta=O\left(\frac{1}{n^{c}}\right)$ for some constant $c$, we have $\lambda=O\left(\log n\right)$. The gap between the upper bound in Theorem 1.5 and the lower bound in Theorem 1.6 boils down to the gap between the smooth sensitivity and the local sensitivity of $\textsf{count}(\cdot)$, which also appears (in a different form) when answering a single such query.


Starting from the upper bound in terms of residual sensitivity, we exploit the idea of uniformizing sensitivities² to further improve Theorem 1.5, using a more fine-grained characterization of input instances. For the class of hierarchical queries, which has been widely studied in the context of various database problems (dalvi2007efficient; fink2016dichotomies; berkholz2017answering; hu2022temporal), we obtain a more fine-grained parameterized upper bound in Theorem 4.12 and lower bound in Theorem 4.13. As the utility formulas take a rather complicated form, we defer all details to Section 4.

²A similar idea of uniformizing the degrees of join values, i.e., how many tuples a join value appears in within a relation, has been used in query processing (joglekar2015s), but we use it in a more complicated scenario in combination with sensitivity.

2. Preliminaries

For random variables $X,Y$ and scalars $\epsilon>0,\delta\in[0,1)$, we write $X\sim_{(\epsilon,\delta)}Y$ if $\Pr[X\in S]\leq e^{\epsilon}\cdot\Pr[Y\in S]+\delta$ and $\Pr[Y\in S]\leq e^{\epsilon}\cdot\Pr[X\in S]+\delta$ for all sets $S$ of outcomes.

For convenience, we write $a+\mathcal{D}$ as a shorthand for $a+X$, where $X$ is a random variable with distribution $\mathcal{D}$ (denoted by $X\sim\mathcal{D}$).

Laplace distribution.

For $\eta>0$, the Laplace distribution $\textsf{Lap}_{\eta}$ has probability density function $f(x)=\frac{\eta}{2}e^{-\eta|x|}$ over the reals. Adding noise drawn from $\textsf{Lap}_{\epsilon/\Delta}$ to a query of sensitivity $\Delta$ yields an $(\epsilon,0)$-DP answer (dwork2006calibrating).

Shifted and truncated geometric distribution (chan2019foundations).

Let $\textsf{Geom}_{\alpha}$ denote the two-sided geometric distribution over $\mathbb{Z}$ with probability mass function $\textsf{Geom}_{\alpha}(x)=\frac{\alpha-1}{\alpha+1}\cdot\alpha^{-|x|}$. Let $\epsilon>0$, $\delta\in(0,1)$, and $\Delta\geq 1$. Let $n_{0}$ be the smallest positive integer such that $\Pr\left[\left|\textsf{Geom}_{\exp(\epsilon/\Delta)}\right|\geq n_{0}\right]\leq\delta$; note that $n_{0}\leq\frac{\Delta}{\epsilon}\log\frac{2}{\delta}+1$. The shifted and truncated geometric distribution $\textsf{STGeom}_{\epsilon,\delta,\Delta}$ has support $[0,2(n_{0}+\Delta-1)]$, and is defined as $\min\left\{\max\left\{0,n_{0}+\Delta-1+\textsf{Geom}_{\exp(\epsilon/\Delta)}\right\},2(n_{0}+\Delta-1)\right\}$. It has been proved (balcer2017differential; chan2019foundations) that for any pair $u,v$ with $|u-v|\leq\Delta$,

$$u+\textsf{STGeom}_{\epsilon,\delta,\Delta}\sim_{(\epsilon,\delta)}v+\textsf{STGeom}_{\epsilon,\delta,\Delta},$$
$$u+\textsf{Geom}_{\exp(\epsilon/\Delta)}\sim_{(\epsilon,0)}v+\textsf{Geom}_{\exp(\epsilon/\Delta)}.$$
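As a concrete illustration, the following Python sketch samples from $\textsf{Geom}_{\alpha}$ (as the difference of two one-sided geometric variables) and from $\textsf{STGeom}_{\epsilon,\delta,\Delta}$. The tail formula used to pick $n_{0}$ follows from summing the pmf above; all function names are ours.

```python
import math, random

def sample_two_sided_geom(alpha):
    # Pr[X = x] = (alpha-1)/(alpha+1) * alpha^{-|x|}, sampled as the
    # difference of two i.i.d. one-sided geometrics with ratio q = 1/alpha.
    q = 1.0 / alpha
    def g():
        return int(math.floor(math.log(random.random()) / math.log(q)))
    return g() - g()

def sample_stgeom(eps, delta, Delta):
    alpha = math.exp(eps / Delta)
    # smallest n0 >= 1 with Pr[|Geom_alpha| >= n0] = 2*alpha^(1-n0)/(alpha+1) <= delta
    n0 = 1
    while 2 * alpha ** (1 - n0) / (alpha + 1) > delta:
        n0 += 1
    shift = n0 + Delta - 1
    # shift, then truncate to the support [0, 2(n0 + Delta - 1)]
    return min(max(0, shift + sample_two_sided_geom(alpha)), 2 * shift)
```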
Exponential Mechanism.

The exponential mechanism (EM) (McSherryT07) selects a "good" candidate from a set $\mathcal{C}$ of candidates. Goodness is defined by a scoring function $s(\mathbf{I},c)$ for input dataset $\mathbf{I}$ and candidate $c\in\mathcal{C}$, which is assumed to have sensitivity at most one. The EM algorithm samples each candidate $c\in\mathcal{C}$ with probability $\propto\exp\left(0.5\epsilon\cdot s(\mathbf{I},c)\right)$; this algorithm is $(\epsilon,0)$-DP.
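A minimal sketch of the EM sampler (with a log-sum-exp style shift for numerical stability; names are illustrative):

```python
import math, random

def exponential_mechanism(candidates, score, eps):
    # Samples c with probability proportional to exp(0.5 * eps * score(c)),
    # assuming score has sensitivity at most 1. Shifting all logits by their
    # maximum avoids overflow without changing the sampling distribution.
    logits = [0.5 * eps * score(c) for c in candidates]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    return random.choices(candidates, weights=weights, k=1)[0]
```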

Local Sensitivity.

Given a function $f$ that takes an input dataset and outputs a vector, its ($\ell_{1}$-)local sensitivity at $\mathbf{I}$ is defined as $s_{f}(\mathbf{I}):=\max_{\mathbf{I}^{\prime}}\|f(\mathbf{I}^{\prime})-f(\mathbf{I})\|_{1}$, where the maximum is over all neighbors $\mathbf{I}^{\prime}$ of $\mathbf{I}$.

Join Result as a Function.

For $\mathbf{I}=(R_{1},R_{2},\cdots,R_{m})$, as each input relation is abstracted as a function, we introduce the join result function $J=\Join_{i\in[m]}R_{i}$, where $J(\vec{t})=\prod_{i\in[m]}R_{i}(\pi_{\mathbf{x}_{i}}\vec{t})$ for any $\vec{t}\in\mathbb{D}$. Moreover, $\textsf{count}(\mathbf{I})=|J|=\sum_{\vec{t}\in\mathbb{D}}J(\vec{t})$ is called the join size of $\mathbf{I}$, and we may use the two interchangeably.

When we mention "local sensitivity" without specifying a function $f$, the function should be understood to be the counting join size query; note that this equals $\mathop{\mathrm{LS}}_{\textsf{count}}(\mathbf{I})$.


3. Join-As-One Algorithm

In this section we present our join-as-one algorithm for general join queries, which computes the join result as a single function and then invokes the single-table private multiplicative weights (PMW) algorithm (hardt2012simple). While this appears simple, there are several non-trivial issues in putting everything together. (All missing proofs are in Appendix B.)

3.1. Algorithms for Two-Table Query

We start with the simplest two-table query. Assume $\mathcal{H}$ is defined on $\mathbf{x}=\{A,B,C\}$, $\mathbf{x}_{1}=\{A,B\}$, and $\mathbf{x}_{2}=\{B,C\}$. Let us define some parameters first. For $i\in[2]$, we define the degree of join value $b\in\textsf{dom}(B)$ as $\textsf{deg}_{i}(B,b)=\sum_{t\in\mathbb{D}_{i}:\pi_{B}t=b}R_{i}(t)$. The maximum degree of any join value in $R_{i}$ is denoted $\Delta_{i}=\max_{b\in\textsf{dom}(B)}\textsf{deg}_{i}(B,b)$. Define $\Delta=\max\{\Delta_{1},\Delta_{2}\}$. We note that $\Delta=LS_{\textsf{count}}(\mathbf{I})$.
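For concreteness, the degrees and $\Delta$ can be computed directly from the frequency representation; a small Python sketch (reusing the dict-based encoding from the earlier example) follows.

```python
from collections import Counter

def degrees(R, attr):
    # deg_i(B, b): total frequency of tuples in R_i whose B-value is b.
    d = Counter()
    for t, f in R:
        d[t[attr]] += f
    return d

def two_table_local_sensitivity(R1, R2, attr="B"):
    # Delta = max{Delta_1, Delta_2}, which equals LS_count(I) for a
    # two-table join (as noted in the text).
    d1, d2 = degrees(R1, attr), degrees(R2, attr)
    return max(max(d1.values(), default=0), max(d2.values(), default=0))
```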

A Natural (but Flawed) Idea.

Let us first consider a natural approach, which turns out to violate DP:

  • Compute $J=\Join_{i\in[m]}R_{i}$ and release a synthetic dataset $\tilde{J}_{1}$ for $J$ by invoking the single-table PMW algorithm;

  • As the PMW algorithm releases a set of records whose total frequency is $|J|$, which would reveal information about the input instance, we add $\textsf{STGeom}_{\epsilon/2,\delta/2,1}$ noise to $\Delta$ to obtain $\tilde{\Delta}$;

  • We draw a random noise $\rho$ from $\textsf{STGeom}_{\epsilon/2,\delta/2,\tilde{\Delta}}$ and choose $\rho$ records randomly from $\mathbb{D}_{1}\times\mathbb{D}_{2}$, denoted $\tilde{J}_{2}$;

  • We release the union of these two datasets, $\mathbb{F}=\tilde{J}_{1}\cup\tilde{J}_{2}$.

The problem is that invoking the single-table algorithm on $J$ directly does not preserve DP: the initial data distribution is parameterized by $|J|$, which can leak information about the input database.

1. $\tilde{\Delta}\leftarrow LS_{\textsf{count}}(\mathbf{I})+\textsf{STGeom}_{\epsilon/2,\delta/2,1}$;
2. return $\textsc{PMW}_{\epsilon/2,\delta/2,\tilde{\Delta}}(\mathbf{I})$ ▶ see Algorithm 2;
Algorithm 1: JoinTwoTable$(\mathbf{I}=\{R_{1},R_{2}\})$
1. $\widehat{n}\leftarrow\max\{1,\textsf{count}(\mathbf{I})+\textsf{Geom}_{\exp(0.5\epsilon/\tilde{\Delta})}\}$;
2. $\mathbb{F}_{0}\leftarrow\widehat{n}$ times the uniform distribution over $\mathbb{D}$: $\mathbb{F}_{0}(x)=\frac{\widehat{n}}{|\mathbb{D}|}$;
3. $\epsilon^{\prime}\leftarrow\frac{\epsilon}{16\sqrt{T\cdot\log(1/\delta)}}$;
4. foreach $i=1,2,\dots,T$ do
5.   Sample a query $q_{i}\in\mathcal{Q}$ using the $\epsilon^{\prime}$-DP EM with score function $s_{i}(\mathbf{I},q)=\frac{1}{\tilde{\Delta}}\cdot|q(\mathbb{F}_{i-1})-q(\mathbf{I})|$;
6.   $m_{i}\leftarrow q_{i}(\mathbf{I})+\textsf{Geom}_{\exp(\epsilon^{\prime}/\tilde{\Delta})}$;
7.   Update for each $x\in\mathbb{D}$: $\mathbb{F}_{i}(x)\propto\mathbb{F}_{i-1}(x)\times\exp\left(q_{i}(x)\cdot\left(m_{i}-q_{i}(\mathbb{F}_{i-1})\right)\cdot\frac{1}{2\widehat{n}}\right)$;
8. return $\textsf{Avg}\left(\sum_{i=1}^{T}\mathbb{F}_{i}\right)$;
Algorithm 2: PMW$_{\epsilon,\delta,\tilde{\Delta}}(\mathbf{I}=\{R_{1},R_{2},\dots,R_{m}\})$
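The following Python sketch is a loose transcription of Algorithm 2, not the authors' implementation. It assumes the helpers sketched in Section 2 (`sample_two_sided_geom`, `exponential_mechanism`), a hypothetical `count(instance)` helper, and a hypothetical query interface with `on_instance` (evaluate $q(\mathbf{I})$), `on_hist` (evaluate $q(\mathbb{F})$), and `weight` (the per-element weight $q(x)$); we also assume the $\propto$ in line 7 renormalizes $\mathbb{F}_{i}$ to total mass $\widehat{n}$.

```python
import math

def pmw(instance, queries, domain, eps, delta, Delta_tilde, T=50):
    n_hat = max(1, count(instance)
                   + sample_two_sided_geom(math.exp(0.5 * eps / Delta_tilde)))
    F = {x: n_hat / len(domain) for x in domain}   # uniform start, mass n_hat
    eps_p = eps / (16 * math.sqrt(T * math.log(1 / delta)))
    avg = {x: 0.0 for x in domain}
    for _ in range(T):
        # EM with score s(I, q) = |q(F) - q(I)| / Delta_tilde (sensitivity <= 1)
        q = exponential_mechanism(
            queries,
            lambda q: abs(q.on_hist(F) - q.on_instance(instance)) / Delta_tilde,
            eps_p)
        m = q.on_instance(instance) \
            + sample_two_sided_geom(math.exp(eps_p / Delta_tilde))
        err = m - q.on_hist(F)
        for x in domain:                           # multiplicative-weights step
            F[x] *= math.exp(q.weight(x) * err / (2 * n_hat))
        scale = n_hat / sum(F.values())            # renormalize (the "prop-to")
        F = {x: v * scale for x, v in F.items()}
        for x in domain:
            avg[x] += F[x] / T
    return avg
```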
Modified Algorithm.

The main insight is to swap the order of two steps in the flawed version. We add $\textsf{Geom}_{\exp(0.5\epsilon/\tilde{\Delta})}$ noise to $\textsf{count}(\mathbf{I})$ to obtain $\widehat{n}$, and then invoke the single-table algorithm on $J$ starting from a uniform distribution parameterized by $\widehat{n}$. As $\widehat{n}$ is released under DP, and the single-table algorithm (hardt2012simple) also releases a synthetic dataset for $J$ under DP, Algorithm 1 satisfies DP, as stated in Lemma 3.1.

Lemma 3.1.

Algorithm 1 is $(\epsilon,\delta)$-DP.

Error Analysis.

As shown in Appendix A, with probability at least $1-1/\textsf{poly}(|\mathcal{Q}|)$, Algorithm 1 produces $\mathbb{F}=\textsf{Avg}\sum_{i=1}^{T}\mathbb{F}_{i}$ such that every linear query can be answered within error (omitting $f^{\mathrm{upper}}$): $O\left(\sqrt{|J|}\cdot\sqrt{\Delta+\lambda}+\Delta+\lambda\right)$. This yields the following result:

Theorem 3.2.

For any two-table instance $\mathbf{I}$, a family $\mathcal{Q}$ of linear queries, and $\epsilon>0$, $\delta>0$, there exists an algorithm that is $(\epsilon,\delta)$-DP, and with probability at least $1-1/\textsf{poly}(|\mathcal{Q}|)$ produces $\mathbb{F}$ such that all queries in $\mathcal{Q}$ can be answered within error:

$$\alpha=O\left(\left(\sqrt{\mathsf{OUT}\cdot(\Delta+\lambda)}+\Delta+\lambda\right)\cdot f^{\mathrm{upper}}\right),$$

where $\lambda=\frac{1}{\epsilon}\log\frac{1}{\delta}$, $\mathsf{OUT}=\textsf{count}(\mathbf{I})$ is the join size of $\mathbf{I}$, and $\Delta=LS_{\textsf{count}}(\mathbf{I})$ is the local sensitivity of $\textsf{count}(\cdot)$ on $\mathbf{I}$.

3.2. Lower Bounds for Two-Table Query

The first lower bound is based on the local sensitivity and can be shown via standard techniques.

Theorem 3.3.

For any $\Delta>0$, there exists a family $\mathcal{Q}$ of queries such that any $(\epsilon,\delta)$-DP algorithm that takes as input a database $\mathbf{I}$ with local sensitivity at most $\Delta$ and outputs an approximate answer to each query in $\mathcal{Q}$ to within error $\alpha$ must satisfy $\alpha\geq\Omega\left(\Delta\right)$.

Our second lower bound is via a reduction from the single-table case: we create a two-table instance where $R_{1}$ encodes the single table and $R_{2}$ "amplifies" both the sensitivity and the join size by a factor of $\Delta$. This results in the following lower bound.

Theorem 3.4.

For any $\mathsf{OUT}\geq\Delta>0$ and every sufficiently small $\epsilon,\alpha>0$, $n_{D}\geq(1/\alpha)^{\Omega(1)}$ and $n_{Q}\leq(1/\alpha)^{O(1)}$, there exists a family $\mathcal{Q}$ of queries of size $n_{Q}$ on a domain $\mathbb{D}$ of size $n_{D}$ such that any $(\epsilon,o(1/n))$-DP algorithm that takes as input a two-table database of join size $\mathsf{OUT}$ and local sensitivity $\Delta$, and outputs an approximate answer to each query in $\mathcal{Q}$ to within error $\alpha$, must satisfy $\alpha\geq\tilde{\Omega}\left(\sqrt{\mathsf{OUT}}\cdot\sqrt{\Delta}\cdot f^{\mathrm{lower}}\right)$.

Proof.

Let $n=\frac{\mathsf{OUT}}{\Delta}$. From Theorem 1.4, there exists a set $\mathcal{Q}_{\textsf{one}}$ of queries on domain $\mathbb{D}$ for which any $(\epsilon,\delta)$-DP algorithm that takes as input a single-table database $T$ over $\mathbb{D}$ and outputs an approximate answer to each query in $\mathcal{Q}_{\textsf{one}}$ within error $\alpha$ requires $\alpha\geq\tilde{\Omega}\left(\sqrt{n}\cdot f^{\mathrm{lower}}(\mathbb{D},\mathcal{Q}_{\textsf{one}},\epsilon)\right)$. For an arbitrary single-table database $T:\mathbb{D}\to\mathbb{Z}^{+}$, we construct a two-table instance $\mathbf{I}$ of join size $\mathsf{OUT}$ and local sensitivity $\Delta$ as follows:

  • Set $\textsf{dom}(A)=\mathbb{D}$, $\textsf{dom}(B)=\mathbb{D}\times[n]$ and $\textsf{dom}(C)=[\Delta]$.

  • Let $R_{1}(a,(b_{1},b_{2}))=\mathbf{1}[a=b_{1}\wedge b_{2}\leq T(a)]$ for all $a\in A$ and $(b_{1},b_{2})\in B$.

  • Let $R_{2}(b,c)=1$ for all $b\in B$ and $c\in C$.

It can be easily checked that $\mathbf{I}$ has join size at most $\mathsf{OUT}$ and local sensitivity $\Delta$, and that two neighboring datasets $T,T^{\prime}$ result in neighboring $\mathbf{I},\mathbf{I}^{\prime}$. Finally, let $\mathcal{Q}_{1}$ contain the queries from $\mathcal{Q}_{\textsf{one}}$ applied to the first attribute (i.e., $\mathcal{Q}_{1}:=\{q\circ\pi_{A}\mid q\in\mathcal{Q}_{\textsf{one}}\}$), and let $\mathcal{Q}_{2}$ contain only the single query $q_{\textsf{all-one}}:\mathbb{D}_{2}\to\{+1\}$. An example is illustrated in Figure 1.

Our lower bound argument is a reduction from the single-table case. Let $\mathcal{A}$ be an algorithm that takes any two-table instance with join size $\mathsf{OUT}$ and local sensitivity $\Delta$ and outputs an approximate answer to each query in $\mathcal{Q}$ within error $\alpha^{\prime}$. For each query $q\in\mathcal{Q}_{\textsf{one}}$, let $q^{\prime}=(q\circ\pi_{A},q_{\textsf{all-one}})$ be its corresponding query in $\mathcal{Q}$. Let $\widetilde{q}^{\prime}(\mathbf{I})$ be the approximate answer for query $q^{\prime}$. We then return $\widetilde{q}(T)=\widetilde{q}^{\prime}(\mathbf{I})/\Delta$ as an approximate answer to $q(T)$.

We first note that this algorithm for answering queries in $\mathcal{Q}_{\textsf{one}}$ is $(\epsilon,\delta)$-DP due to the post-processing property of DP. The error guarantee follows immediately from the observation that $q^{\prime}(\mathbf{I})=\Delta\cdot q(T)$.
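The reduction instance can be built directly from a single-table database $T$; a sketch follows (restricting $R_{2}$ to the $B$-values realized in $R_{1}$, which suffices for the join-size and sensitivity claims; names are illustrative).

```python
def lower_bound_instance(T, Delta):
    # T: dict mapping each domain element a to its frequency T(a);
    # the construction follows the three bullets above.
    R1 = {(a, (a, b2)): 1                      # a = b1 and b2 <= T(a)
          for a, freq in T.items() for b2 in range(1, freq + 1)}
    R2 = {(b, c): 1
          for (_, b) in R1 for c in range(1, Delta + 1)}
    return R1, R2   # join size = n * Delta = OUT, local sensitivity = Delta

R1, R2 = lower_bound_instance({"u": 2, "v": 1}, Delta=3)
print(len(R1) * 3)   # OUT = n * Delta = 9
```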

Figure 1. Lower bound instance with $n=9$, $\Delta=3$, $\mathsf{OUT}=27$.

Therefore, from Theorem 1.4, we conclude that $\alpha^{\prime}\geq\Delta\cdot\tilde{\Omega}\left(\sqrt{n}\cdot f^{\mathrm{lower}}(\mathbb{D},\mathcal{Q}_{\textsf{one}},\epsilon)\right)\geq\tilde{\Omega}\left(\sqrt{\mathsf{OUT}}\cdot\sqrt{\Delta}\cdot f^{\mathrm{lower}}\right)$. ∎

3.3. Multi-Table Query

We next extend the join-as-one approach to multi-table queries. First, notice that Algorithm 1 does not work in the multi-table setting, because $\mathop{\mathrm{LS}}_{\textsf{count}}(\mathbf{I})$ may itself have large local sensitivity, unlike in the two-table case where its global sensitivity is one.

To handle this, we need the notion of residual sensitivity. For a join query $\mathcal{H}=(\mathbf{x},\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{m}\})$, the residual query on a subset $E\subseteq[m]$ of relations is $\Join_{i\in E}R_{i}$. Its boundary, denoted $\partial E$, is the set of attributes that belong to relations both in and out of $E$, i.e., $\partial E=\{x\mid x\in\mathbf{x}_{i}\cap\mathbf{x}_{j},i\in E,j\notin E\}$. For a residual query on $E$ over instance $\mathbf{I}$, the query result is³:

(1)  $T_{E}(\mathbf{I})=\max_{t\in\textsf{dom}(\partial E)}\big{|}\Join_{i\in E}(R_{i}\ltimes t)\big{|},$

which is a join-aggregate query over the attributes in $\partial E$.

³The semi-join result $R_{i}\ltimes t$ is defined as the function $R^{\prime}_{i}:\mathbb{D}_{i}\to\mathbb{Z}^{+}$ such that $R^{\prime}_{i}(t^{\prime})=R_{i}(t^{\prime})$ if $t^{\prime}$ agrees with $t$ on their common attributes, and $R^{\prime}_{i}(t^{\prime})=0$ otherwise.
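Computing the boundary $\partial E$ is a one-liner over the hypergraph; a small sketch with an illustrative three-relation chain join:

```python
def boundary(H, E):
    # H: list of attribute sets x_1..x_m; E: set of relation indices.
    # dE = attributes shared by some relation in E and some relation outside E.
    inside = set().union(*(H[i] for i in E))
    outside = set().union(*(H[j] for j in range(len(H)) if j not in E))
    return inside & outside

H = [{"A", "B"}, {"B", "C"}, {"C", "D"}]   # a 3-relation chain join
print(boundary(H, {0}))      # {'B'}
print(boundary(H, {0, 1}))   # {'C'}
```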

Definition 3.5 (Residual Sensitivity for Counting Query (dong2021residual; dong2021nearly)).

Given $\beta>0$, the $\beta$-residual sensitivity of counting the join size of the multi-way query $\mathcal{H}$ over instance $\mathbf{I}$ is defined as

$$SS^{\beta}_{\textsf{count}}(\mathbf{I})=\max_{k\geq 0}e^{-\beta k}\cdot\hat{LS}^{(k)}_{\textsf{count}}(\mathbf{I}),$$

where

(2)  $\hat{LS}^{(k)}_{\textsf{count}}(\mathbf{I})=\max_{\mathbf{s}\in\mathcal{S}^{k}}\max_{i\in[m]}\sum_{E^{\prime}\subseteq[m]-\{i\}}T_{[m]-\{i\}-E^{\prime}}(\mathbf{I})\cdot\prod_{j\in E^{\prime}}\mathbf{s}_{j},$

for $\mathcal{S}^{k}=\left\{\mathbf{s}=(\mathbf{s}_{1},\mathbf{s}_{2},\dots,\mathbf{s}_{m}):\sum_{i=1}^{m}\mathbf{s}_{i}=k,\ \mathbf{s}_{i}\in\mathbb{Z},\ \forall i\in[m]\right\}$.

Clearly $SS^{\beta}_{\textsf{count}}(\mathbf{I})$ is an upper bound on $\mathop{\mathrm{LS}}_{\textsf{count}}(\mathbf{I})$, so it suffices to find an upper bound $\tilde{\Delta}$ for $SS^{\beta}_{\textsf{count}}(\mathbf{I})$ (which is then passed to the single-table algorithm). To compute $\tilde{\Delta}$, we observe that $\ln(SS^{\beta}_{\textsf{count}}(\mathbf{I}))$ has global sensitivity at most $\beta$. Therefore, adding appropriately calibrated (shifted) Laplace noise to it yields a DP upper bound. The idea is formalized in Algorithm 3, whose privacy guarantee is immediate:

1. $\beta\leftarrow\frac{\epsilon}{2\ln 2/\delta}$;
2. $SS^{\beta}_{\textsf{count}}(\mathbf{I})\leftarrow\beta$-residual sensitivity of $\textsf{count}(\mathbf{I})$;
3. $\tilde{\Delta}\leftarrow\max\left\{SS^{\beta}_{\textsf{count}}(\mathbf{I}),\ SS^{\beta}_{\textsf{count}}(\mathbf{I})\cdot e^{\frac{1}{2}+\textsf{Lap}_{\epsilon/2\beta}}\right\}$;
4. return $\textsc{PMW}_{\epsilon/2,\delta/2,\tilde{\Delta}}(\mathbf{I})$;
Algorithm 3: JoinMultiTable$(\mathbf{I}=\{R_{1},R_{2},\dots,R_{m}\})$
Lemma 3.6.

Algorithm 3 is $(\epsilon,\delta)$-DP.

Error Analysis.

With probability at least $1-\delta$, $|\textsf{Lap}_{\epsilon/2\beta}|<1/2$, in which case $\frac{1}{2}+\textsf{Lap}_{\epsilon/2\beta}\in[0,1]$ since $\beta=\frac{\epsilon}{2\ln 2/\delta}$, and therefore $\tilde{\Delta}\in[SS^{\beta}_{\textsf{count}}(\mathbf{I}),SS^{\beta}_{\textsf{count}}(\mathbf{I})\cdot e]$. Hence $\widehat{n}=O\left(\mathsf{OUT}+SS^{\beta}_{\textsf{count}}(\mathbf{I})\cdot\lambda\right)$. Putting everything together, the achievable error is (omitting $f^{\mathrm{upper}}$):

$$O\left(\sqrt{\widehat{n}}\cdot\sqrt{\tilde{\Delta}}\right)=O\left(\sqrt{\mathsf{OUT}\cdot SS^{\beta}_{\textsf{count}}(\mathbf{I})\cdot\lambda}+SS^{\beta}_{\textsf{count}}(\mathbf{I})\cdot\lambda\right).$$

This establishes the upper bound in Theorem 1.5. Moreover, extending the lower bound argument for the two-table query, we obtain the lower bound for the multi-table query in Theorem 1.6.

4. Uniformize Sensitivity

So far, we have shown a parameterized algorithm for answering linear queries, whose utility depends on the join size and the residual sensitivity. A natural question arises: can we achieve a more fine-grained parameterized algorithm with better utility?

1. $\mathbb{I}\leftarrow\textsc{Partition}(\mathbf{I})$;
2. foreach $\mathbf{I}^{\prime}\in\mathbb{I}$ do $\mathbb{F}(\mathbf{I}^{\prime})\leftarrow\textsc{JoinMultiTable}(\mathbf{I}^{\prime})$;
3. return $\bigcup_{\mathbf{I}^{\prime}\in\mathbb{I}}\mathbb{F}(\mathbf{I}^{\prime})$;
Algorithm 4: Uniformize$(\mathbf{I})$

Let us start with a two-table instance (see Figure 2) with input size $\Theta(n)$, join size $\mathsf{OUT}=\Theta(n\sqrt{n})$, and local sensitivity $\sqrt{n}$. Algorithm 1 achieves an error of $O(n)$. However, this instance is beyond the scope of Theorem 3.4, as the sensitivity (or degree, in this simple scenario) distribution over join values is extremely non-uniform. Revisiting the error bound in Theorem 3.2 gives some intuition for why Algorithm 1 does not perform well on this instance. The costly term is $O(\sqrt{\mathsf{OUT}}\cdot\sqrt{\Delta})$, where $\Delta$ is the largest degree of any join value in the input instance. However, many join values have degree much smaller than $\Delta$, so a natural idea is to uniformize sensitivities, i.e., partition the input instance into a set of sub-instances by join values, such that join values with similar sensitivities are put into the same sub-instance. After this uniformization, we invoke our previous join-as-one approach as a primitive on each sub-instance independently, and return the union of the synthetic datasets generated for all sub-instances. The uniformization framework is illustrated in Algorithm 4. (All missing proofs are given in Appendix B.1.)

4.1. Uniformize Two-Table


As mentioned, there is an intuitive way to uniformize a two-table join. As described in Algorithm 5, the high-level idea is to bucket join values in $\textsf{dom}(B)$ by their maximum degree in $R_{1}$ and $R_{2}$. To protect the privacy of individual tuples, we add noise drawn from $\textsf{STGeom}_{\epsilon,\delta,1}$ to each join value's degree before determining which bucket it should go to. Recall that $\lambda=\frac{1}{\epsilon}\log\frac{1}{\delta}$. Let $\gamma_{0}=0$ and $\gamma_{i}=2^{i}\lambda$ for all $i\in\mathbb{N}$. Conceptually, we divide $[0,n+\lambda]$ into $\ell=\lceil\log(\frac{n}{\lambda}+1)\rceil$ buckets, where the $i$-th bucket is associated with $(\gamma_{i-1},\gamma_{i}]$ for $i\in[\ell]$. The set of values from $\textsf{dom}(B)$ whose maximum noisy degree in $R_{1}$ and $R_{2}$ falls into the $i$-th bucket is denoted $B^{i}$. For each $i$, we identify the tuples in $R_{1},R_{2}$ whose join value falls into $B^{i}$ as $R^{i}_{1},R^{i}_{2}$, which form the sub-instance $(R^{i}_{1},R^{i}_{2})$. Finally, we return all sub-instances. Note that $n$ is not explicitly used in Algorithm 5 since it is not public, but it is easy to check that no join value can have degree more than $n$ under the input size constraint. Hence, it is safe to consider only $i\in[\ell]$ in our analysis.

1. foreach $i\in\mathbb{N}$ do $B^{i}\leftarrow\emptyset$;
2. foreach $b\in\textsf{dom}(B)$ do
3.   $\widetilde{\textsf{deg}}(B,b)\leftarrow\max\{\textsf{deg}_{1}(B,b),\textsf{deg}_{2}(B,b)\}+\textsf{STGeom}_{\epsilon,\delta,1}$;
4.   $i\leftarrow\max\left\{1,\left\lceil\log\left(\frac{1}{\lambda}\cdot\widetilde{\textsf{deg}}(B,b)\right)\right\rceil\right\}$;
5.   $B^{i}\leftarrow B^{i}\cup\{b\}$;
6. foreach $i$ with $B^{i}\neq\emptyset$ do
7.   foreach $j\in\{1,2\}$ do $R^{i}_{j}\leftarrow\{R_{j}(t):t\in\mathbb{D}_{j},\pi_{B}t\in B^{i}\}$;
8. return $\bigcup_{i:B^{i}\neq\emptyset}\{(R^{i}_{1},R^{i}_{2})\}$;
Algorithm 5: Partition-TwoTable$(\mathbf{I}=\{R_{1},R_{2}\})$
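A Python sketch of Algorithm 5, reusing the `degrees` and `sample_stgeom` helpers from the earlier sketches; bucket $i$ covers $(2^{i-1}\lambda,2^{i}\lambda]$, hence the base-2 logarithm.

```python
import math

def partition_two_table(R1, R2, eps, delta, attr="B"):
    lam = math.log(1 / delta) / eps
    d1, d2 = degrees(R1, attr), degrees(R2, attr)
    buckets = {}
    for b in d1.keys() | d2.keys():
        # noisy maximum degree of join value b across the two relations
        noisy = max(d1.get(b, 0), d2.get(b, 0)) + sample_stgeom(eps, delta, 1)
        i = max(1, math.ceil(math.log2(max(noisy, 1) / lam)))
        buckets.setdefault(i, set()).add(b)
    # restrict each relation to the join values of each bucket
    return [([(t, f) for t, f in R1 if t[attr] in B],
             [(t, f) for t, f in R2 if t[attr] in B])
            for B in buckets.values()]
```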
Figure 2. An instance beyond the scope of Theorem 3.4. There are $\sqrt{n}$ join values in attribute $B$, and for every $i\in[\sqrt{n}]$ there is exactly one join value with degree $i$ in both $R_{1}$ and $R_{2}$.
Lemma 4.1.

Algorithm 4 on a two-table join is $(\epsilon,\delta)$-DP.

The key insight in Lemma 4.1 is that adding or removing one input tuple can increase or decrease the degree of at most one join value $b\in\textsf{dom}(B)$, and by at most one. Hence, Algorithm 5 satisfies $(\epsilon,\delta)$-DP by the parallel composition rule (mcsherry2009privacy). Moreover, since each input tuple participates in exactly one sub-instance, and Algorithm 2 preserves $(\epsilon,\delta)$-DP for each sub-instance by Lemma 3.1, Algorithm 4 preserves $(2\epsilon,2\delta)$-DP by basic composition (dwork2006calibrating).

Error Analysis.

We next analyze the error of Algorithm 4. Given an instance $\mathbf{I}$, let $\pi=\{B^{1}_{\pi},\cdots,B^{\ell}_{\pi}\}$ be the partition of $\textsf{dom}(B)$ generated by Algorithm 5. Correspondingly, $\pi$ induces a partition of the input instance $\mathbf{I}=\{(R^{i}_{1},R^{i}_{2}):i\in[\ell]\}$, where $(R^{i}_{1},R^{i}_{2})$ is the sub-instance whose tuples have join values falling into $B^{i}_{\pi}$. Moreover, $\pi$ induces a partition of the join result $J=\left\{J^{1}_{\pi},J^{2}_{\pi},\cdots,J^{\ell}_{\pi}\right\}$. Let $\mathbb{F}^{i}$ be the synthetic dataset generated for $(R^{i}_{1},R^{i}_{2})$ by JoinMultiTable. From Theorem 3.2, with probability $1-1/\textsf{poly}(|\mathcal{Q}|)$, the error of answering linear queries defined over $\textsf{dom}(A)\times\textsf{dom}(B^{i}_{\pi})\times\textsf{dom}(C)$ that can be achieved with $\mathbb{F}^{i}$ is $\alpha_{i}=O\left(\sqrt{|J^{i}_{\pi}|}\cdot\sqrt{2^{i}\cdot\lambda}+2^{i}\cdot\lambda\right)$. By a union bound, with probability $1-1/\mathop{\mathrm{poly}}(|\mathcal{Q}|)$, the error that can be achieved with $\bigcup_{i}\mathbb{F}^{i}$ is:

$$\alpha\leq\sum_{i\in[\ell]}\alpha_{i}\leq\sum_{i\in[\ell]}\sqrt{|J^{i}_{\pi}|}\cdot\sqrt{2^{i}\cdot\lambda}+(\Delta+\lambda)\cdot\lambda,$$

where $\sum_{i\in[\ell]}2^{i}=\sum_{i\in[\lceil\log(\Delta+\lambda)\rceil]}2^{i}=O\left(\Delta+\lambda\right)$.

One may observe an overlap between the bucket ranges defined by $\pi$, but $\sum_{i}|J^{i}_{\pi}|=|J|$ always holds, as $\pi$ induces a partition of $\textsf{dom}(B)$ and hence of $J$. Consequently, Algorithm 4 on a two-table query is always at least as good as Algorithm 1, since $\sum_{i\in[\ell]}\sqrt{|J^{i}_{\pi}|}\cdot\sqrt{2^{i}}\leq\sqrt{|J|}\cdot\sqrt{\Delta}$ by the Cauchy-Schwarz inequality, as spelled out below. Furthermore, we show in Appendix B.1 that on some input instances the gap between Theorem 4.3 and Theorem 3.2 can be polynomially large in the data size.
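To spell out the Cauchy-Schwarz step (with $a_{i}=|J^{i}_{\pi}|$ and $b_{i}=2^{i}$, and using $\sum_{i\in[\ell]}2^{i}=O(\Delta+\lambda)$ from above):

$$\sum_{i\in[\ell]}\sqrt{|J^{i}_{\pi}|\cdot 2^{i}}\ \leq\ \sqrt{\sum_{i\in[\ell]}|J^{i}_{\pi}|}\cdot\sqrt{\sum_{i\in[\ell]}2^{i}}\ =\ \sqrt{|J|}\cdot O\left(\sqrt{\Delta+\lambda}\right).$$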

Careful inspection reveals that although the partition generated by Algorithm 5 is randomized due to the noise, it is not far from the fixed partition based on the true degrees. As shown in Appendix B.1, Algorithm 5 always has its error bounded by what can be achieved through the uniform partition, defined below.

Definition 4.2 (Uniform Partition).

For $n>0,\epsilon>0,\delta>0$, and an input instance $\mathbf{I}=(R_{1},R_{2})$, the uniform partition of $\textsf{dom}(B)$ on $\mathbf{I}$ is $\pi^{*}=\left\{B^{1}_{\pi^{*}},\cdots,B^{\lceil\log\frac{n}{\lambda}\rceil}_{\pi^{*}}\right\}$ such that for any $i\in[\lceil\log\frac{n}{\lambda}\rceil]$, $b\in B^{i}_{\pi^{*}}$ if $\max\{\textsf{deg}_{1}(B,b),\textsf{deg}_{2}(B,b)\}\in(\gamma_{i-1},\gamma_{i}]$.

Theorem 4.3.

For any two-table instance $\mathbf{I}$, a family $\mathcal{Q}$ of linear queries, and $\epsilon>0$, $\delta>0$, there exists an algorithm that is $(\epsilon,\delta)$-DP, and with probability at least $1-1/\textsf{poly}(|\mathcal{Q}|)$ produces $\mathbb{F}$ such that all queries in $\mathcal{Q}$ can be answered within error:

$$\alpha=O\left(\left(\sum_{i\in[\lceil\log\frac{n}{\lambda}\rceil]}\sqrt{\mathsf{OUT}^{i}_{\pi^{*}}\cdot 2^{i}\cdot\lambda}+\lambda^{2}\right)\cdot f^{\mathrm{upper}}\right),$$

where $\lambda=\frac{1}{\epsilon}\log\frac{1}{\delta}$ and $\mathsf{OUT}^{i}_{\pi^{*}}$ is the number of join results whose join value falls into $B^{i}_{\pi^{*}}$ under the uniform partition $\pi^{*}$.

Let $\overrightarrow{\mathsf{OUT}}_{\pi^{*}}=(\mathsf{OUT}^{i}_{\pi^{*}})_{i\in[\lceil\log\frac{n}{\lambda}\rceil]}$ be a characterization vector of join sizes under the uniform sensitivity partition $\pi^{*}$. A two-table database $\mathbf{I}=(R_{1},R_{2})$ conforms to $\overrightarrow{\mathsf{OUT}}_{\pi^{*}}$ if

$$\sum_{(t_{1},t_{2})\in\mathbb{D}_{1}\times\mathbb{D}_{2}:\pi_{B}t_{1}=\pi_{B}t_{2}\in B^{i}_{\pi^{*}}}R_{1}(t_{1})\cdot R_{2}(t_{2})\leq\mathsf{OUT}^{i}_{\pi^{*}}$$

holds for each $i\in[\lceil\log\frac{n}{\lambda}\rceil]$. We then introduce the following parameterized lower bound in terms of the join size distribution, which shows that Algorithm 4 is optimal for the two-table query up to poly-logarithmic factors. The proof is given in Appendix B.1.

Theorem 4.4.

Given arbitrary parameters $\overrightarrow{\mathsf{OUT}}_{\pi^{*}}$ under the uniform partition $\pi^{*}$, for every sufficiently small $\epsilon,\alpha>0$, $n_{D}\geq(1/\alpha)^{\Omega(1)}$ and $n_{Q}\leq(1/\alpha)^{O(1)}$, there exists a family $\mathcal{Q}$ of queries of size $n_{Q}$ on a domain $\mathbb{D}$ of size $n_{D}$ such that any $(\epsilon,o(1/n))$-differentially private algorithm that takes as input a two-table database of input size at most $n$ conforming to $\overrightarrow{\mathsf{OUT}}_{\pi^{*}}$, and outputs an approximate answer to each query in $\mathcal{Q}$ to within error $\alpha$, must satisfy $\alpha\geq\Omega\left(\max_{i\in[\lceil\log\frac{n}{\lambda}\rceil]}\sqrt{\mathsf{OUT}^{i}_{\pi^{*}}}\cdot\sqrt{2^{i}\cdot\lambda}\cdot f^{\mathrm{lower}}\right)$.


4.2. Uniformize Hierarchical Query

Beyond the two-table query, we find, perhaps surprisingly, that this uniformization technique extends to the class of hierarchical queries. A join query $\mathcal{H}=(\mathbf{x},\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{m}\})$ is hierarchical if for any pair of attributes $x,y$, either $E_{x}\subseteq E_{y}$, or $E_{y}\subseteq E_{x}$, or $E_{x}\cap E_{y}=\emptyset$, where $E_{x}=\{i:x\in\mathbf{x}_{i}\}$ is the set of relations containing attribute $x$. One can always organize the attributes of a hierarchical join into a tree $\mathcal{T}$ such that every relation corresponds to a root-to-node path (see Figure 3).
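The hierarchical property is easy to test from the hypergraph alone; a small Python sketch, using the schema of Figure 3 as an example:

```python
def is_hierarchical(H):
    # H: list of attribute sets x_1..x_m. For each attribute x, E_x is the
    # set of relations containing x; hierarchical means every pair E_x, E_y
    # is nested or disjoint.
    attrs = set().union(*H)
    E = {x: {i for i, xi in enumerate(H) if x in xi} for x in attrs}
    return all(E[x] <= E[y] or E[y] <= E[x] or not (E[x] & E[y])
               for x in attrs for y in attrs)

# Schema of Figure 3: R1(A,B,D), R2(A,B,F), R3(A,B,G,K), R4(A,B,G,L), R5(A,C)
H = [{"A","B","D"}, {"A","B","F"}, {"A","B","G","K"}, {"A","B","G","L"}, {"A","C"}]
print(is_hierarchical(H))  # True
```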

Thanks to the nice structure of hierarchical joins, uniformization can be applied effectively. Recall that the essence of uniformization is to partition the instance into a set of sub-instances so that we obtain as tight an upper bound on the residual sensitivity as possible, as suggested by the error expression in Theorem 1.5. In (2), the residual sensitivity is built on the join sizes of a set of residual queries, but these statistics are too coarse to serve directly as partition criteria. Instead of using the residual sensitivity directly, we first rewrite the join size of a residual query and upper bound it in terms of degrees.

4.2.1. An Upper Bound on Residual Query Size

To derive an upper bound on the $T_{E}$'s for $E\subseteq[m]$, we target a broader class of $\bar{q}$-hierarchical queries⁴ by generalizing the aggregate attributes in (1) to any subset of attributes that sit at the "top" of $\mathcal{T}$. It is not hard to see that $T_{E}=T_{E,\partial E}$ for the boundary attributes $\partial E$.

⁴This is different from the q-hierarchical queries of (berkholz2017answering), where $\mathbf{y}$ serves as the set of output attributes, although they have the same property characterization on $\mathbf{y}$.

Definition 4.5.

For a hierarchical join $\mathcal{H}=(\mathbf{x},\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{m}\})$ and instance $\mathbf{I}$, a $\bar{q}$-hierarchical query on $E\subseteq[m]$ and $\mathbf{y}\subseteq(\cup_{i\in E}\mathbf{x}_{i})$ is defined as $T_{E,\mathbf{y}}(\mathbf{I})=\max_{t\in\textsf{dom}(\mathbf{y})}\big{|}(\Join_{i\in E}R_{i})\ltimes t\big{|}$, where $\mathbf{y}$ satisfies the following property: for any $x_{1},x_{2}\in\mathbf{x}$, if $E_{x_{1}}\subseteq E_{x_{2}}$ and $x_{1}\in\mathbf{y}$, then $x_{2}\in\mathbf{y}$.

In the following, we focus on deriving an upper bound for $\bar{q}$-hierarchical queries $T_{E,\mathbf{y}}(\mathbf{I})$, which boils down to a product chain of degree functions. For example, $\textsf{deg}^{\mathbf{I}}_{34}(AB,ab)$ in Figure 3 indicates the number of tuples in $\pi_{ABG}(R_{3}\Join R_{4})$ in which $ab$ appears.

Definition 4.6 (Degree Function).

For a hierarchical join $\mathcal{H}=(\mathbf{x},\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{m}\})$ and an instance $\mathbf{I}$, the degree function $\textsf{deg}^{\mathbf{I}}_{E}(\mathbf{y},\vec{y})$ for a subset of relations $E\subseteq[m]$, a subset of attributes $\mathbf{y}\subseteq(\cap_{i\in E}\mathbf{x}_{i})$ and a tuple $\vec{y}\in\textsf{dom}(\mathbf{y})$ is defined as⁵:

(3)  $\textsf{deg}^{\mathbf{I}}_{E}(\mathbf{y},\vec{y})=\bigg{|}\left\{t\in\pi_{\cap_{i\in E}\mathbf{x}_{i}}\Join_{i\in E}R_{i}:\pi_{\mathbf{y}}t=\vec{y}\right\}\bigg{|}$

⁵When the context is clear, we drop the superscript $\mathbf{I}$ from $\textsf{deg}^{\mathbf{I}}_{E}(\cdot,\cdot)$. Moreover, when $|E|=1$, say $E=\{i\}$, we write $\textsf{deg}_{i}(\cdot,\cdot)$ as shorthand for $\textsf{deg}_{E}(\cdot,\cdot)$.

Base Case. When $|E|=1$, say $E=\{i\}$, $T_{E,\mathbf{y}}$ degenerates to the maximum degree of tuples over attributes $\mathbf{y}$ in $R_{i}$ (as $\mathbf{y}\subseteq\mathbf{x}_{i}$):

$$T_{E,\mathbf{y}}(\mathbf{I})=\max_{t\in\textsf{dom}(\mathbf{y})}\big{|}R_{i}\ltimes t\big{|}=\left\{\begin{array}{ll}1&\textrm{if }\mathbf{y}=\mathbf{x}_{i}\\ \max_{t\in\textsf{dom}(\mathbf{y})}\textsf{deg}_{i}(\mathbf{y},\pi_{\mathbf{y}}t)&\textrm{otherwise}\end{array}\right.$$

General Case. Denote by $\bar{H}_{E}$ the residual query of $\Join_{i\in E}R_{i}$ after removing the attributes in $\mathbf{y}$. We distinguish two cases:

  • $\bar{H}_{E}$ is disconnected. Let $\mathcal{C}_{E}$ be the set of connected components of $\bar{H}_{E}$. We can then decompose $\Join_{i\in E}R_{i}$ into a set of sub-queries, after pushing the semi-join inside each connected component:

$$T_{E,\mathbf{y}}(\mathbf{I})=\max_{t\in\textsf{dom}(\mathbf{y})}\prod_{E^{\prime}\in\mathcal{C}_{E}}\big{|}\left(\Join_{i\in E^{\prime}}R_{i}\right)\ltimes t\big{|}\leq\prod_{E^{\prime}\in\mathcal{C}_{E}}\max_{t\in\textsf{dom}(\mathbf{y})}\big{|}\left(\Join_{i\in E^{\prime}}R_{i}\right)\ltimes t\big{|}=\prod_{E^{\prime}\in\mathcal{C}_{E}}T_{E^{\prime},\mathbf{y}\cap(\cup_{i\in E^{\prime}}\mathbf{x}_{i})}(\mathbf{I})$$

    The correctness follows since $\mathbf{y}\cap(\cup_{i\in E^{\prime}}\mathbf{x}_{i})\subseteq\cup_{i\in E^{\prime}}\mathbf{x}_{i}$ and $\mathbf{y}\cap(\cup_{i\in E^{\prime}}\mathbf{x}_{i})$ satisfies the property in Definition 4.5.

  • $\bar{H}_{E}$ is connected. In this case, we must have $\mathbf{y}\subsetneq(\cap_{i\in E}\mathbf{x}_{i})$:

$$T_{E,\mathbf{y}}(\mathbf{I})\leq\max_{t\in\textsf{dom}(\mathbf{y})}\textsf{deg}_{E}(\mathbf{y},t)\cdot\max_{t^{\prime}\in\textsf{dom}(\cap_{i\in E}\mathbf{x}_{i}):\pi_{\mathbf{y}}t^{\prime}=t}\big{|}(\Join_{i\in E}R_{i})\ltimes t^{\prime}\big{|}\leq\max_{t\in\textsf{dom}(\mathbf{y})}\textsf{deg}_{E}(\mathbf{y},t)\cdot\max_{t^{\prime}\in\textsf{dom}(\cap_{i\in E}\mathbf{x}_{i})}\big{|}(\Join_{i\in E}R_{i})\ltimes t^{\prime}\big{|}=\max_{t\in\textsf{dom}(\mathbf{y})}\textsf{deg}_{E}(\mathbf{y},t)\cdot T_{E,\cap_{i\in E}\mathbf{x}_{i}}(\mathbf{I})$$

    The correctness follows since $(\cap_{i\in E}\mathbf{x}_{i})\subseteq(\cup_{i\in E}\mathbf{x}_{i})$ and $\cap_{i\in E}\mathbf{x}_{i}$ satisfies the property in Definition 4.5.

It is not hard to see that $T_{E,\mathbf{y}}$ is eventually upper bounded by a product chain of degree functions (see Example 4.8). Careful inspection reveals that the degree functions participating in $T_{E,\partial E}$ are not arbitrary; rather, they exhibit a special structure, captured by Lemma 4.7, which is critical to our partition procedure.

Lemma 4.7.

Every degree function degE(𝐲,)\textsf{deg}_{E^{\prime}}(\mathbf{y}^{\prime},\cdot) participating in the upper bound of TE,ET_{E,\partial E} corresponds to a distinct attribute x𝐱x\in\mathbf{x} such that E=ExE^{\prime}=E_{x} and 𝐲\mathbf{y}^{\prime} corresponds to the set of attributes lying on the path from the root rr to xx’s parent in 𝒯\mathcal{T}.

Figure 3. Attribute tree for hierarchical join: R1(A,B,D)R2(A,B,F)R3(A,B,G,K)R4(A,B,G,L)R5(A,C)R_{1}(A,B,D)\Join R_{2}(A,B,F)\Join R_{3}(A,B,G,K)\Join R_{4}(A,B,G,L)\Join R_{5}(A,C), and an example of HEH_{E} for E={3,4,5}E=\{3,4,5\}, with boundary attributes E={A,B}\partial E=\{A,B\}. After removing E\partial E, EE is decomposed into {{3,4},5}\{\{3,4\},5\}.
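As a quick companion to Figure 3, the following sketch computes $E_{x}$ and $\partial E$ for the example schema. It assumes, as the caption suggests, that $\partial E$ collects the attributes shared between relations in $E$ and relations outside $E$:

```python
# Schema of Figure 3: which attributes appear in which relation.
schema = {1: {"A", "B", "D"}, 2: {"A", "B", "F"}, 3: {"A", "B", "G", "K"},
          4: {"A", "B", "G", "L"}, 5: {"A", "C"}}

def E_of(x):
    """E_x: the set of relations whose schema contains attribute x."""
    return {i for i, attrs in schema.items() if x in attrs}

def boundary(E):
    """Boundary attributes: shared between relations in E and those outside E."""
    inside = set().union(*(schema[i] for i in E))
    outside = set().union(*(schema[i] for i in set(schema) - set(E)))
    return inside & outside

print(E_of("G"))            # {3, 4}
print(boundary({3, 4, 5}))  # {'A', 'B'}, matching Figure 3
```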
Example 4.8.

An upper bound derived for $T_{E}$ with $E=\{3,4,5\}$ in Figure 3:

$\displaystyle T_{345}=\max_{(a,b)\in A\times B}|R_{3}\Join R_{4}\Join R_{5}\ltimes(a,b)|$

$\displaystyle\quad\leq\max_{a\in A}|R_{5}\ltimes(a)|\times\max_{(a,b)\in A\times B}|(R_{3}\Join R_{4})\ltimes(a,b)|$

$\displaystyle\quad\leq\max_{a\in A}|R_{5}\ltimes(a)|\times\max_{(a,b)\in A\times B}|\pi_{ABG}(R_{3}\Join R_{4})\ltimes(a,b)|$

$\displaystyle\qquad\times\max_{(a,b,g)\in A\times B\times G}|R_{3}\ltimes(a,b,g)|\times\max_{(a,b,g)\in A\times B\times G}|R_{4}\ltimes(a,b,g)|$

$\displaystyle\quad=\max_{a\in A}\textsf{deg}_{5}(A,a)\times\max_{(a,b)\in A\times B}\textsf{deg}_{34}(AB,ab)$

$\displaystyle\qquad\times\max_{(a,b,g)\in A\times B\times G}\textsf{deg}_{3}(ABG,abg)\times\max_{(a,b,g)\in A\times B\times G}\textsf{deg}_{4}(ABG,abg)$

4.2.2. Partition with Degree Characterization

From the upper bound on $T_{E,\partial E}(\mathbf{I})$, we now define the degree characterization for hierarchical joins, similarly to the two-table join. Our target is to partition the input instance into a set of sub-instances (which may not be tuple-disjoint), such that each sub-instance is captured by a distinct degree characterization, and the join results over all sub-instances form a partition of the final join result.

Definition 4.9 (Degree Characterization).

For a hierarchical join =(𝐱,{𝐱1,𝐱2,,𝐱m})\mathcal{H}=(\mathbf{x},\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{m}\}), a degree characterization is defined as σ:2[m]×2𝐱{}\sigma:2^{[m]}\times 2^{\mathbf{x}}\to\mathbb{Z}^{*}\cup\{\bot\} such that for any E[m]E\subseteq[m] and 𝐲𝐱\mathbf{y}\subseteq\mathbf{x}, σ(E,𝐲)\sigma(E,\mathbf{y})\neq\bot if and only if there exists an attribute x𝐱x\in\mathbf{x} such that E=ExE=E_{x} and 𝐲\mathbf{y} is the set of ancestors of xx in 𝒯\mathcal{T}.

As described in Algorithm 6, the high-level idea is to partition the input instance by attributes in a bottom-up way on 𝒯\mathcal{T}, where 𝒯\mathcal{T} is the attribute tree of the input hierarchical query \mathcal{H}, and invoke Algorithm 7 as a primitive on every sub-instance obtained so far.

As described in Algorithm 7, PartitionByAttr takes as input an attribute $x$ and an instance $\mathbf{I}^{\prime}$ such that

  • for every attribute xx^{\prime} as a descendant of xx in 𝒯\mathcal{T}, degEx𝐈(𝐲,y)\textsf{deg}^{\mathbf{I}^{\prime}}_{E_{x^{\prime}}}(\mathbf{y}^{\prime},\vec{y}) is uniformized for every ydom(𝐲)\vec{y}\in\textsf{dom}(\mathbf{y}^{\prime}), where 𝐲\mathbf{y}^{\prime} is the set of ancestors of xx^{\prime} in 𝒯\mathcal{T}.

and outputs a set of sub-instances 𝕀\mathbb{I}^{\prime} of 𝐈\mathbf{I}^{\prime}, such that

  • \bigcupplus𝐈′′𝕀(i[m]Ri𝐈′′)=(i[m]Ri𝐈)\bigcupplus_{\mathbf{I}^{\prime\prime}\in\mathbb{I}^{\prime}}\left(\Join_{i\in[m]}R^{\mathbf{I}^{\prime\prime}}_{i}\right)=\left(\Join_{i\in[m]}R^{\mathbf{I}^{\prime}}_{i}\right), i.e., the join results of sub-instances in 𝕀\mathbb{I}^{\prime} form a partition of join result of 𝐈\mathbf{I}^{\prime};

  • for every sub-instance 𝐈′′𝕀\mathbf{I}^{\prime\prime}\in\mathbb{I}^{\prime}, degEx𝐈′′(𝐲,y)\textsf{deg}^{\mathbf{I}^{\prime\prime}}_{E_{x}}(\mathbf{y},\vec{y}) is roughly the same for every ydom(𝐲)\vec{y}\in\textsf{dom}(\mathbf{y}), where 𝐲\mathbf{y} is the set of ancestors of xx in 𝒯\mathcal{T}.

Similar to the two-table case, for every $\vec{y}\in\textsf{dom}(\mathbf{y})$, we compute a noisy version of its degree $\textsf{deg}^{\mathbf{I}^{\prime}}_{E_{x}}(\mathbf{y},\vec{y})$ and put it into the bucket indexed by its logarithmic value. For each $j\in E_{x}$, we obtain a tuple-disjoint partition of $R^{\mathbf{I}^{\prime}}_{j}$ based on $\mathbf{y}$, such that each $i\in\left[\left\lceil\log\left(\frac{n}{\lambda}+1\right)\right\rceil\right]$ defines a sub-instance involving a sub-relation $R^{\mathbf{I}^{\prime}}_{j,i}$ for each $j\in E_{x}$ as well as all remaining relations $R^{\mathbf{I}^{\prime}}_{j}$ with $j\notin E_{x}$. At last, we return the union of all sub-instances; a simplified sketch of the bucketing step is given right after the two listings below.

1: 𝕀 ← {𝐈};
2: 𝒯 ← the attribute tree of (𝐱, {𝐱_1, 𝐱_2, ⋯, 𝐱_m});
3: while there is a non-visited node in 𝒯 do
4:     x ← any leaf, or any node whose children are all visited;
5:     foreach 𝐈′ ∈ 𝕀 do
6:         𝕀 ← 𝕀 ∪ PartitionByAttr(𝐈′, x);
7:     mark x as “visited”;
8: return 𝕀;
Algorithm 6 Partition-Hierarchical(𝐈)
1: 𝐲 ← {y ∈ 𝐱 : E_x ⊊ E_y};
2: foreach y⃗ ∈ dom(𝐲) do
3:     deg̃^{𝐈′}_{E_x}(𝐲, y⃗) ← deg^{𝐈′}_{E_x}(𝐲, y⃗) + STGeom_{ϵ,δ,1}(x);
4:     i ← max{1, ⌈log (1/λ) · deg̃^{𝐈′}_{E_x}(𝐲, y⃗)⌉};
5:     if 𝐲^i does not exist then 𝐲^i ← {y⃗};
6:     else 𝐲^i ← 𝐲^i ∪ {y⃗};
7: foreach i with 𝐲^i ≠ ∅ do
8:     foreach j ∈ E_x do
9:         R^{𝐈′}_{j,i} ← {R^{𝐈′}_j(t) : t ∈ 𝔻_j, π_𝐲 t ∈ 𝐲^i, R^{𝐈′}_j(t) > 0};
10: return ⋃_{i : 𝐲^i ≠ ∅} { {R^{𝐈′}_{j,i} : j ∈ E_x} ∪ {R^{𝐈′}_j : j ∉ E_x} };
Algorithm 7 PartitionByAttr(𝐈′, x)
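As promised above, here is a simplified Python sketch of the bucketing step of Algorithm 7 (lines 2-6). The helper `st_geom` is only a stand-in for the $\textsf{STGeom}_{\epsilon,\delta,1}$ noise of the paper (a rounded, non-negatively truncated Laplace sample), so its constants and tail behavior differ from the real mechanism:

```python
import math
import random

def st_geom(eps):
    """Illustrative stand-in for STGeom: rounded Laplace(1/eps) noise,
    truncated to be non-negative (the real STGeom also adds a shift)."""
    u = random.random() - 0.5
    mag = max(1e-12, 1.0 - 2.0 * abs(u))
    lap = -math.copysign(1.0, u) * math.log(mag) / eps
    return max(0, round(lap))

def bucket_by_noisy_degree(degrees, lam, eps):
    """degrees: dict mapping a value y over the ancestor attributes to its
    true degree deg_{E_x}(y, y_vec). Returns bucket index i -> set of values,
    with i = max(1, ceil(log2(noisy_degree / lam))), mirroring line 4."""
    buckets = {}
    for y, d in degrees.items():
        noisy = d + st_geom(eps)
        i = max(1, math.ceil(math.log2(max(noisy, 1) / lam)))
        buckets.setdefault(i, set()).add(y)
    return buckets

# Values whose degrees live on very different scales land in different buckets.
print(bucket_by_noisy_degree({"y1": 3, "y2": 40, "y3": 500}, lam=4, eps=1.0))
```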
Lemma 4.10.

For an input instance $\mathbf{I}$, let $\mathbb{I}$ be the set of sub-instances returned by Algorithm 6. Then $\mathbb{I}$ satisfies the following properties:

  • \bigcupplus𝐈𝕀(i[m]Ri𝐈)=(i[m]Ri𝐈)\bigcupplus_{\mathbf{I}^{\prime}\in\mathbb{I}}\left(\Join_{i\in[m]}R^{\mathbf{I}^{\prime}}_{i}\right)=\left(\Join_{i\in[m]}R^{\mathbf{I}}_{i}\right), i.e., the join results of all sub-instances in 𝕀\mathbb{I} form a partition of the join result of 𝐈\mathbf{I};

  • Each input tuple appears in $O\left(\log^{c}n\right)$ sub-instances of $\mathbb{I}$, where $c$ is a constant depending only on the attribute tree $\mathcal{T}$;

  • Each sub-instance $\mathbf{I}^{\sigma}\in\mathbb{I}$ corresponds to a distinct degree characterization $\sigma$ such that for any $E\subseteq[m]$ and $\mathbf{y}\subseteq\mathbf{x}_{E}$ with $\sigma(E,\mathbf{y})\neq\bot$:

    (4) $\widetilde{\textsf{deg}}^{\mathbf{I}^{\sigma}}_{E}(\mathbf{y},\vec{y})\in\left(\lambda\cdot 2^{\sigma(E,\mathbf{y})-1},\ \lambda\cdot 2^{\sigma(E,\mathbf{y})}\right]$ holds for any $\vec{y}\in\textsf{dom}(\mathbf{y})$ with $\widetilde{\textsf{deg}}^{\mathbf{I}^{\sigma}}_{E}(\mathbf{y},\vec{y})>0$.
Lemma 4.11.

Algorithm 4 on hierarchical joins is $(O(\log^{c}n)\cdot\epsilon,O(\log^{c}n)\cdot\delta)$-DP, where $c$ is a constant depending only on the attribute tree.

The logarithmic factor in Lemma 4.11 arises since each input tuple participates in $O(\log^{c}n)$ sub-instances, as implied by Lemma 4.10. Next, given a degree characterization $\sigma$, how do we derive the residual sensitivity of instances consistent with $\sigma$? This is already resolved by rewriting the upper bound derived for $T_{E}$, denoted as $T^{\sigma}_{E}$. Similar to the two-table case, we also define the uniform partition of an input instance $\mathbf{I}$ over a hierarchical join query $\mathcal{Q}$ using true degrees as $\{\mathbf{I}^{\sigma}:\sigma\textrm{ is a degree characterization of }\mathbf{I}\}$. As shown in Appendix B.1, the error complexity achieved by the partition based on noisy degrees can be bounded by that of the uniform one.

Theorem 4.12.

For any hierarchical join =(𝐱,{𝐱1,𝐱2,,𝐱m})\mathcal{H}=(\mathbf{x},\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{m}\}) and an instance 𝐈\mathbf{I}, a family of linear queries 𝒬\mathcal{Q}, and ϵ>0\epsilon>0, δ>0\delta>0, there exists an algorithm that is (ϵ,δ)(\epsilon,\delta)-DP, and with probability at least 11/poly(|𝒬|)1-1/\textsf{poly}(|\mathcal{Q}|) produces 𝔽\mathbb{F} such that all queries in 𝒬\mathcal{Q} can be answered within error:

$\alpha=O\left(\sum_{\sigma}\left(\sqrt{\mathsf{OUT}^{\sigma}\cdot SS^{\beta}_{\textsf{count}}\left(\mathbf{I}^{\sigma}\right)\cdot\lambda}+SS^{\beta}_{\textsf{count}}\left(\mathbf{I}^{\sigma}\right)\cdot\lambda\right)\cdot f^{\mathrm{upper}}\right)$

where σ\sigma is over all degree characterizations of 𝐈\mathbf{I}, 𝐈σ\mathbf{I}^{\sigma} is the sub-instance characterized by σ\sigma, 𝖮𝖴𝖳σ\mathsf{OUT}^{\sigma} is the join size of 𝐈σ\mathbf{I}^{\sigma}, and β=ϵ2ln(2/δ)\beta=\frac{\epsilon}{2\ln(2/\delta)}.
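For intuition, the bound of Theorem 4.12 is a simple aggregation over degree characterizations; a toy evaluation (with made-up per-characterization statistics, and ignoring the hidden constant) could look as follows:

```python
import math

def error_bound(stats, lam, f_upper=1.0):
    """Evaluate the error expression of Theorem 4.12, up to its constant.
    stats: dict sigma -> (OUT_sigma, SS_sigma); lam: the threshold lambda."""
    return f_upper * sum(math.sqrt(out * ss * lam) + ss * lam
                         for out, ss in stats.values())

# Many low-sensitivity join results vs. few high-sensitivity ones.
print(error_bound({"low": (10_000, 2), "high": (400, 64)}, lam=8))
```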

Let $\overrightarrow{\mathsf{OUT}}=\left\{\mathsf{OUT}^{\sigma}\in\mathbb{Z}:\sigma\textrm{ is a degree characterization of }\mathbf{I}\right\}$ be the output size distribution over the uniform degree partition. An instance $\mathbf{I}$ conforms to $\overrightarrow{\mathsf{OUT}}$ if $|J^{\sigma}|=\mathsf{OUT}^{\sigma}$ for every degree characterization $\sigma$, where $J^{\sigma}$ is the join result of the sub-instance $\mathbf{I}^{\sigma}$. Extending the lower bound argument of Theorem 4.4, we obtain the following parameterized lower bound for hierarchical queries.

Theorem 4.13.

Given a hierarchical join $\mathcal{H}=(\mathbf{x},\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{m}\})$ and arbitrary parameters $\overrightarrow{\mathsf{OUT}}$, for every sufficiently small $\epsilon,\alpha>0$, $n_{D}\geq(1/\alpha)^{\Omega(1)}$ and $n_{Q}\leq(1/\alpha)^{O(1)}$, there exists a family of queries $\mathcal{Q}$ of size $n_{Q}$ on domain $\mathbb{D}$ of size $n_{D}$ such that any $(\epsilon,o(1/n))$-differentially private algorithm that takes as input a multi-table database over $\mathcal{H}$ of input size at most $n$ conforming to $\overrightarrow{\mathsf{OUT}}$, and outputs an approximate answer to each query in $\mathcal{Q}$ within error $\alpha$, must satisfy

αΩ~(maxσ𝖮𝖴𝖳σLScountσflower)\alpha\geq\tilde{\Omega}\left(\max_{\sigma}\sqrt{\mathsf{OUT}^{\sigma}}\cdot\sqrt{LS^{\sigma}_{\textsf{count}}}\cdot f^{\mathrm{lower}}\right)

where σ\sigma is over all degree characterizations and LScountσ=maxi[m]T[m]iσ\displaystyle{LS^{\sigma}_{\textsf{count}}=\max_{i\in[m]}T^{\sigma}_{[m]-i}} is the local sensitivity of count()\textsf{count}(\cdot) under σ\sigma.

5. Conclusions and Discussion

In this work, we proposed algorithms for answering linear queries over multi-table joins. Our work opens up several interesting questions, which we list below.

Non-Hierarchical Setting.

First, and perhaps the most immediate question, is whether uniformization can benefit the non-hierarchical case. We note that the uniformization technique from Section 4.2 itself is applicable but may not lead to uniformized residual sensitivity on non-hierarchical queries, even on the simplest non-hierarchical join $R_{1}(A,B)\Join R_{2}(B,C)\Join R_{3}(C,D)$. We can simply partition values in $\textsf{dom}(B)$ into buckets, each of which is indexed by two integers, such that $B^{ij}=\{b\in\textsf{dom}(B):\textsf{deg}_{1}(B,b)\in(\gamma_{i-1},\gamma_{i}],\textsf{deg}_{2}(B,b)\in(\gamma_{j-1},\gamma_{j}]\}$. A similar partition idea applies to values in $\textsf{dom}(C)$; a small sketch of this bucketing appears below. Note that each pair $\left(B^{ij},C^{kh}\right)$ of buckets corresponding to $\sigma=(i,j,k,h)$ defines a sub-instance $\mathbf{I}^{\sigma}$. But residual sensitivity within each sub-instance is not guaranteed to be uniformized. In determining the residual sensitivity in (2), we observe that $T_{23}\leq\lambda^{2}\cdot 2^{j+h}$ and $T_{12}\leq\lambda^{2}\cdot 2^{i+k}$; however, $\textsf{deg}_{2}(B,b)$ and $\textsf{deg}_{2}(C,c)$ are defined over the whole instance $\mathbf{I}$, instead of the specific sub-instance $\mathbf{I}^{\sigma}$. This coarse-grained partition can be extended to arbitrary acyclic queries (see Appendix B.1), but does not necessarily lead to uniformized residual sensitivity inside each sub-instance. We leave this as interesting future work.
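A minimal sketch of this coarse-grained bucketing, assuming bucket boundaries $\gamma_{i}=\lambda\cdot 2^{i}$ as in the two-table case (names are illustrative):

```python
import math

def bucket_index(d, lam):
    """Index i with d in (lam * 2**(i-1), lam * 2**i]; degree 0 maps to 1."""
    return max(1, math.ceil(math.log2(max(d, 1) / lam)))

def partition_dom_B(deg1, deg2, lam):
    """deg1, deg2: dicts b -> deg_1(B,b) and b -> deg_2(B,b).
    Returns (i, j) -> B^{ij}, the buckets of values of dom(B)."""
    buckets = {}
    for b in deg1.keys() | deg2.keys():
        ij = (bucket_index(deg1.get(b, 0), lam),
              bucket_index(deg2.get(b, 0), lam))
        buckets.setdefault(ij, set()).add(b)
    return buckets

# b2 has a high degree in R1 but a low one in R2, so it gets its own bucket.
print(partition_dom_B({"b1": 2, "b2": 300}, {"b1": 3, "b2": 2}, lam=4))
```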

Query-Specific Optimality.

Throughout this work, we consider a worst-case set $\mathcal{Q}$ of queries parameterized by its size. Although this is a reasonable starting point, it is also plausible to hope for an algorithm that is nearly optimal for every query set $\mathcal{Q}$. In the single-table case, this has been achieved in (HardtT10; BhaskaraDKT12; NikolovT016). However, the situation is rather complicated in the multi-table setting, since we have to take the local sensitivity into account, whereas, in the single-table case, the query family already tells us everything about the possible change incurred by moving to a neighboring dataset. This also seems like an interesting open question for future research.

Instance-Specific Optimality.

Similarly, we have considered worst-case instances. One might instead prefer to achieve finer-grained instance-optimal errors. For the single-table case, (asi2020instance) observed that an instance-optimal DP algorithm is not achievable, since a trivial algorithm can return the same answer(s) for all input instances, which works perfectly on one specific instance but miserably on all remaining ones. To overcome this, a notion of "neighborhood optimality" has been proposed (dong2021nearly; asi2020instance), where we consider not only a single instance but also its neighbors at some constant distance. We note, however, that this would not work for our setting when there are a large number of queries. Specifically, if we again consider an algorithm that always returns the true answer for this instance, then its error with respect to the entire neighborhood set is still quite small: at most the distance times the maximum local sensitivity in the set. This is independent of the table size $n$, whereas our lower bounds show that the dependency on $n$ is inevitable. As such, the question of how to define and prove instance-optimal errors remains open for the multi-query case.

Appendix A Missing Proof in Section 1

We adapt the single-table proof to our context. Let $\mathbb{J}$ be a table of $n$ data items over domain $\mathbb{D}$. Let $\mathcal{Q}=\{q:\mathbb{D}\to[-1,+1]\}$ be the set of linear queries. Let $\mathbb{F}_{i}$ be the distribution returned by the MWEM algorithm in the $i$-th iteration. Define $\lambda_{i}=\max_{q}|q(\mathbb{F}_{i-1})-q(\mathbb{J})|$ as the maximum error over all queries in terms of $\mathbb{F}_{i-1}$ and $\mathbb{J}$. Now we are going to bound the maximum error of the MWEM algorithm:

maxq|q(avgiT𝔽i)q(𝕁)|\displaystyle\max_{q}|q(\textrm{avg}_{i\leq T}\mathbb{F}_{i})-q(\mathbb{J})| =maxq|avgiTq(𝔽i)q(𝕁)|\displaystyle=\max_{q}|\textrm{avg}_{i\leq T}q(\mathbb{F}_{i})-q(\mathbb{J})|
maxqavgiT|q(𝔽i)q(𝕁)|\displaystyle\leq\max_{q}\textrm{avg}_{i\leq T}|q(\mathbb{F}_{i})-q(\mathbb{J})|
avgiTmaxq|q(𝔽i)q(𝕁)|\displaystyle\leq\textrm{avg}_{i\leq T}\max_{q}|q(\mathbb{F}_{i})-q(\mathbb{J})|
avgiTλi\displaystyle\leq\textrm{avg}_{i\leq T}\lambda_{i}

Define γ=2TΔ~ϵlog|𝒬|\gamma=2T\cdot\frac{\widetilde{\Delta}}{\epsilon}\cdot\log|\mathcal{Q}| for Δ~=Δ+Gϵ,δ,1\widetilde{\Delta}=\Delta+G_{\epsilon,\delta,1}. Note that the sensitivity for invoking the exponential mechanism is

max𝕁,𝕁are neighboringmaxq|q(𝕁)q(𝕁)|||𝕁||𝕁||ΔΔ~\displaystyle\max_{\mathbb{J},\mathbb{J}^{\prime}\textrm{are neighboring}}\max_{q}|q(\mathbb{J})-q(\mathbb{J}^{\prime})|\leq\big{|}|\mathbb{J}|-|\mathbb{J}^{\prime}|\big{|}\leq\Delta\leq\widetilde{\Delta}

and the sensitivity for invoking the Laplace mechanism is

max𝕁,𝕁are neighboringmaxqmaxi{1,2,,T}|si(𝕁,q)si(𝕁,q)|\displaystyle\max_{\mathbb{J},\mathbb{J}^{\prime}\textrm{are neighboring}}\max_{q}\max_{i\in\{1,2,\cdots,T\}}|s_{i}(\mathbb{J},q)-s_{i}(\mathbb{J}^{\prime},q)|
\displaystyle\leq max𝕁,𝕁are neighboringmaxq|q(𝕁)q(𝕁)|Δ~\displaystyle\max_{\mathbb{J},\mathbb{J}^{\prime}\textrm{are neighboring}}\max_{q}|q(\mathbb{J})-q(\mathbb{J}^{\prime})|\leq\widetilde{\Delta}

which follows by a similar argument.

Lemma A.1.

With probability at least 12T/|𝒬|c1-2T/|\mathcal{Q}|^{c} for any c0c\geq 0, for all 1iT1\leq i\leq T, we have:

|qi(𝔽i1)qi(𝕁)|λi(2c+2)×γ\displaystyle|q_{i}(\mathbb{F}_{i-1})-q_{i}(\mathbb{J})|\geq\lambda_{i}-(2c+2)\times\gamma
|miqi(𝕁)|c×γ\displaystyle|m_{i}-q_{i}(\mathbb{J})|\leq c\times\gamma
Proof.

For the first inequality, we note that the probability that the exponential mechanism with parameter $\frac{\epsilon}{2T\cdot\widetilde{\Delta}}$ selects a query with quality score at least $(2c+2)\times\gamma$ less than the optimal is bounded by

Pr[|qi(𝔽i1)qi(𝕁)|<λi(2c+2)×γ]|𝒬|×exp(ϵγ4TΔ~)1|𝒬|c\Pr[|q_{i}(\mathbb{F}_{i-1})-q_{i}(\mathbb{J})|<\lambda_{i}-(2c+2)\times\gamma]\leq|\mathcal{Q}|\times\exp(-\frac{\epsilon\gamma}{4T\cdot\widetilde{\Delta}})\leq\frac{1}{|\mathcal{Q}|^{c}}

For the second inequality, we note that $|m_{i}-q_{i}(\mathbb{J})|\leq c\cdot\gamma$ holds if and only if $\left|\textsf{Lap}_{\frac{2T\cdot\widetilde{\Delta}}{\epsilon}}(x)\right|\leq c\cdot\gamma$ holds. By definition of the Laplace distribution, we have:

Pr[|Lap2TΔ~ϵ(x)|>c2TΔ~ϵlog|𝒬|]exp(clog|𝒬|)=1|𝒬|c\displaystyle\Pr\left[\left|\textsf{Lap}_{\frac{2T\cdot\widetilde{\Delta}}{\epsilon}}(x)\right|>c\cdot 2T\cdot\frac{\widetilde{\Delta}}{\epsilon}\cdot\log|\mathcal{Q}|\right]\leq\exp(-c\cdot\log|\mathcal{Q}|)=\frac{1}{|\mathcal{Q}|^{c}}

A union bound over 2T2T events yields such a failure probability. ∎

Next, we show how the MW mechanism improves the approximation in each round in which $q_{i}(\mathbb{F}_{i-1})-q_{i}(\mathbb{J})$ has large magnitude. To capture the improvement, we use the relative entropy function:

$\Psi^{\prime}_{i}=\frac{1}{n}\sum_{x\in\mathbb{D}}\mathbb{J}(x)\log\left(\frac{\mathbb{J}(x)}{\mathbb{F}_{i}(x)}\right),\qquad\Psi_{i}=\frac{n}{\hat{n}}\cdot\Psi^{\prime}_{i}$

Here, we can only show that

Ψ0\displaystyle\Psi_{0} =1n^t𝔻J(t)logJ(t)𝔽0(t)|𝕁|n^log|𝕁||𝔻|n^log|𝔻|\displaystyle=\frac{1}{\widehat{n}}\cdot\sum_{t\in\mathbb{D}}J(t)\log\frac{J(t)}{\mathbb{F}_{0}(t)}\leq\frac{|\mathbb{J}|}{\widehat{n}}\cdot\log\frac{|\mathbb{J}|\cdot|\mathbb{D}|}{\widehat{n}}\leq\log|\mathbb{D}|
Ψi\displaystyle\Psi_{i} =1n^t𝔻𝕁(t)log𝔽i(t)𝕁(t)1n^t𝔻𝔽i(t)+|𝕁|n^1\displaystyle=-\frac{1}{\widehat{n}}\cdot\sum_{t\in\mathbb{D}}\mathbb{J}(t)\log\frac{\mathbb{F}_{i}(t)}{\mathbb{J}(t)}\geq-\frac{1}{\widehat{n}}\sum_{t\in\mathbb{D}^{\prime}}\mathbb{F}_{i}(t)+\frac{|\mathbb{J}|}{\widehat{n}}\geq-1

where 𝔻𝔻\mathbb{D}^{\prime}\subseteq\mathbb{D} such that J(t)>0J(t)>0 for every t𝔻t\in\mathbb{D}^{\prime}.

We next consider how Ψ\Psi changes in each iteration:

ΨiΨi1=\displaystyle\Psi_{i}-\Psi_{i-1}= 1n^x𝔻𝕁(x)log(𝔽i(x)𝔽i1(x))\displaystyle\frac{1}{\hat{n}}\cdot\sum_{x\in\mathbb{D}}\mathbb{J}(x)\cdot\log\left(\frac{\mathbb{F}_{i}(x)}{\mathbb{F}_{i-1}(x)}\right)
=\displaystyle= 1n^qi(𝕁)ηinn^logβi\displaystyle\frac{1}{\hat{n}}\cdot q_{i}(\mathbb{J})\cdot\eta_{i}-\frac{n}{\hat{n}}\log\beta_{i}

where

ηi\displaystyle\eta_{i} =12n^(miqi(𝔽i1)),\displaystyle=\frac{1}{2\hat{n}}\cdot(m_{i}-q_{i}(\mathbb{F}_{i-1})),
βi\displaystyle\beta_{i} =1n^x𝔻exp(qi(x)ηi)𝔽i1(x).\displaystyle=\frac{1}{\hat{n}}\cdot\sum_{x\in\mathbb{D}}\exp(q_{i}(x)\eta_{i})\mathbb{F}_{i-1}(x).

Using the fact that exp(x)1+x+x2\exp(x)\leq 1+x+x^{2} for |x|1|x|\leq 1 and |qi(x)ηi|1|q_{i}(x)\eta_{i}|\leq 1,

βi\displaystyle\beta_{i}\leq 1n^x𝔻(1+qi(x)ηi+qi2(x)ηi2)𝔽i1(x)\displaystyle\frac{1}{\hat{n}}\cdot\sum_{x\in\mathbb{D}}(1+q_{i}(x)\eta_{i}+q^{2}_{i}(x)\eta^{2}_{i})\mathbb{F}_{i-1}(x)
\displaystyle\leq 1n^x𝔻(1+qi(x)ηi+ηi2)𝔽i1(x)\displaystyle\frac{1}{\hat{n}}\cdot\sum_{x\in\mathbb{D}}(1+q_{i}(x)\eta_{i}+\eta^{2}_{i})\mathbb{F}_{i-1}(x)
\displaystyle\leq 1+1n^ηiqi(𝔽i1)+ηi2\displaystyle 1+\frac{1}{\hat{n}}\cdot\eta_{i}q_{i}(\mathbb{F}_{i-1})+\eta^{2}_{i}

Then, we can rewrite ΨiΨi1\Psi_{i}-\Psi_{i-1} in the following way:

ΨiΨi1=\displaystyle\Psi_{i}-\Psi_{i-1}= 1n^qi(𝕁)ηilogβi\displaystyle\frac{1}{\hat{n}}\cdot q_{i}(\mathbb{J})\cdot\eta_{i}-\log\beta_{i}
\displaystyle\geq 1n^qi(𝕁)ηi(1n^ηiqi(𝔽i1)+ηi2)\displaystyle\frac{1}{\hat{n}}\cdot q_{i}(\mathbb{J})\cdot\eta_{i}-\left(\frac{1}{\hat{n}}\cdot\eta_{i}q_{i}(\mathbb{F}_{i-1})+\eta^{2}_{i}\right)
\displaystyle\geq ηi(qi(𝕁)qi(𝔽i1)n^miqi(𝔽i1)2n^)\displaystyle\eta_{i}\cdot\left(\frac{q_{i}(\mathbb{J})-q_{i}(\mathbb{F}_{i-1})}{\hat{n}}-\frac{m_{i}-q_{i}(\mathbb{F}_{i-1})}{2\hat{n}}\right)
=\displaystyle= miqi(𝔽i1)2n^2qi(𝕁)miqi(𝔽i1)2n^\displaystyle\frac{m_{i}-q_{i}(\mathbb{F}_{i-1})}{2\hat{n}}\cdot\frac{2q_{i}(\mathbb{J})-m_{i}-q_{i}(\mathbb{F}_{i-1})}{2\hat{n}}
=\displaystyle= (qi(𝕁)qi(𝔽i1)2n^)2(qi(𝕁)mi2n^)2\displaystyle\left(\frac{q_{i}(\mathbb{J})-q_{i}(\mathbb{F}_{i-1})}{2\hat{n}}\right)^{2}-\left(\frac{q_{i}(\mathbb{J})-m_{i}}{2\hat{n}}\right)^{2}
\displaystyle\geq 14n^2((λi4γ)2γ2)\displaystyle\frac{1}{4\hat{n}^{2}}\cdot\left((\lambda_{i}-4\gamma)^{2}-\gamma^{2}\right)

where the last inequality follows from the previous analysis by taking $c=1$. Rewriting the last inequality, we have

λi4n^2(ΨiΨi1)+γ2+4×γ\displaystyle\lambda_{i}\leq\sqrt{4\hat{n}^{2}\cdot(\Psi_{i}-\Psi_{i-1})+\gamma^{2}}+4\times\gamma

Then,

avgiTλi\displaystyle\textrm{avg}_{i\leq T}\lambda_{i}\leq 4n^2avgiT(ΨiΨi1)+γ2+4×γ\displaystyle\sqrt{4\hat{n}^{2}\cdot\textrm{avg}_{i\leq T}(\Psi_{i}-\Psi_{i-1})+\gamma^{2}}+4\times\gamma
\displaystyle\leq 2n^log|𝔻|T+5×γ\displaystyle 2\hat{n}\cdot\sqrt{\frac{\log|\mathbb{D}|}{T}}+5\times\gamma
=\displaystyle= O(n^log|𝔻|T+Tlog|𝒬|Δ~ϵ)\displaystyle O\left(\hat{n}\cdot\sqrt{\frac{\log|\mathbb{D}|}{T}}+\frac{T\cdot\log|\mathcal{Q}|\cdot\widetilde{\Delta}}{\epsilon}\right)

By taking $T=\left(\frac{\widehat{n}\cdot\epsilon\cdot\sqrt{\log|\mathbb{D}|}}{\widetilde{\Delta}\cdot\log|\mathcal{Q}|}\right)^{2/3}$, we obtain the minimized error $O\left(\widehat{n}^{2/3}\cdot\left(\frac{\log|\mathcal{Q}|\cdot\widetilde{\Delta}\cdot\log|\mathbb{D}|}{\epsilon}\right)^{1/3}\right)$. In $(\epsilon,\delta)$-DP, this error can be generalized as (omitting the factor of $\left(\frac{\log|\mathcal{Q}|\cdot\log\frac{1}{\delta}\cdot\log^{\frac{1}{2}}|\mathbb{D}|}{\epsilon}\right)^{1/2}$):

O(n^Δ~)=O(|𝕁|Δ~+Δ~)O\left(\sqrt{\widehat{n}}\cdot\sqrt{\widetilde{\Delta}}\right)=O\left(\sqrt{|\mathbb{J}|}\cdot\sqrt{\widetilde{\Delta}}+\widetilde{\Delta}\right)
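For intuition, here is a compact Python sketch of the multiplicative-weights update analyzed above: the synthetic distribution is reweighted by $\exp(q_{i}(x)\cdot\eta_{i})$ and renormalized back to total mass $\widehat{n}$ (the direct renormalization plays the role of $\beta_{i}$; all names are illustrative):

```python
import math

def mw_update(F, q, m_i, n_hat):
    """One MW step. F: dict x -> weight with total mass n_hat; q: x -> [-1, 1];
    m_i: the noisy answer for the selected query. eta_i follows the text."""
    q_F = sum(q(x) * w for x, w in F.items())      # q_i(F_{i-1})
    eta = (m_i - q_F) / (2.0 * n_hat)
    new = {x: w * math.exp(q(x) * eta) for x, w in F.items()}
    scale = n_hat / sum(new.values())              # renormalize (beta_i's role)
    return {x: w * scale for x, w in new.items()}

# Uniform start over a 4-point domain with n_hat = 10; the query is
# undercounted by F_0 (m_i = 4 > 0), so mass shifts toward q(x) = +1 points.
F0 = {x: 2.5 for x in range(4)}
F1 = mw_update(F0, lambda x: 1.0 if x % 2 == 0 else -1.0, m_i=4.0, n_hat=10.0)
print({x: round(w, 3) for x, w in F1.items()})
```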

Appendix B Missing Proofs in Section 3

Lemma B.1.

For any pair of neighboring instances $\mathbf{I},\mathbf{I}^{\prime}$, let $J,J^{\prime}$ be the join results of $\mathbf{I},\mathbf{I}^{\prime}$ respectively. Then, $\big{|}|J|-|J^{\prime}|\big{|}\leq\widetilde{\Delta}$.

Lemma B.1 follows directly from the definition of $LS_{\textsf{count}}(\mathbf{I})$ and the non-negativity of STGeom.

Proof of Lemma 3.1.

Consider two neighboring databases $\mathbf{I}=(R_{1},R_{2})$ and $\mathbf{I}^{\prime}=(R^{\prime}_{1},R^{\prime}_{2})$. We first prove $\widehat{n}\sim_{(2\epsilon,2\delta)}\widehat{n^{\prime}}$. Let $\Delta,\Delta^{\prime}$ be the local sensitivities of $\mathbf{I},\mathbf{I}^{\prime}$ respectively. It can be easily checked that $|\Delta-\Delta^{\prime}|\leq 1$. Let $\widetilde{\Delta},\widetilde{\Delta^{\prime}}$ be the random variables obtained by adding random noise from $\textsf{STGeom}_{\epsilon,\delta,1}$ to $\Delta,\Delta^{\prime}$ respectively, and let $\textsf{supp}\left(\widetilde{\Delta}\right),\textsf{supp}\left(\widetilde{\Delta^{\prime}}\right)$ be their supports. From STGeom, we observe:

(5) Δ+STGeomϵ,δ,1(x)(ϵ,δ)Δ+STGeomϵ,δ,1(x)\Delta+\textsf{STGeom}_{\epsilon,\delta,1}(x)\sim_{(\epsilon,\delta)}\Delta^{\prime}+\textsf{STGeom}_{\epsilon,\delta,1}(x)

since $|\Delta-\Delta^{\prime}|\leq 1$; hence $\widetilde{\Delta}\sim_{(\epsilon,\delta)}\widetilde{\Delta^{\prime}}$. We next note that for any $\widehat{\Delta}\in\textsf{supp}\left(\widetilde{\Delta}\right)\cup\textsf{supp}\left(\widetilde{\Delta^{\prime}}\right)$, the following holds:

|J|+STGeomϵ,δ,Δ^(x)(ϵ,δ)|J|+STGeomϵ,δ,Δ^\displaystyle|J|+\textsf{STGeom}_{\epsilon,\delta,\widehat{\Delta}}(x)\sim_{(\epsilon,\delta)}|J^{\prime}|+\textsf{STGeom}_{\epsilon,\delta,\widehat{\Delta}}

since ||J||J||Δ~\big{|}|J|-|J^{\prime}|\big{|}\leq\widetilde{\Delta}. From the sequential composition rule (dwork2006calibrating; dwork2006our), we conclude that n^(2ϵ,2δ)n^\widehat{n}\sim_{(2\epsilon,2\delta)}\widehat{n}^{\prime}.

We next turn to line 6. Let 𝔽,𝔽\mathbb{F},\mathbb{F}^{\prime} be the synthetic data generated for (R1,R2),(R1,R2)(R_{1},R_{2}),(R^{\prime}_{1},R^{\prime}_{2}) separately. The single-table algorithm starts with a uniform distribution with 𝔽0(x)=n^|𝔻|\mathbb{F}_{0}(x)=\frac{\widehat{n}}{|\mathbb{D}|} for (R1,R2)(R_{1},R_{2}), and 𝔽0(x)=n^|𝔻|\mathbb{F}^{\prime}_{0}(x)=\frac{\widehat{n}^{\prime}}{|\mathbb{D}|} for (R1,R2)(R^{\prime}_{1},R^{\prime}_{2}). Implied by the analysis above, the initial state satisfies (2ϵ,2δ)(2\epsilon,2\delta)-DP, i.e., 𝔽0(2ϵ,2δ)𝔽0\mathbb{F}_{0}\sim_{(2\epsilon,2\delta)}\mathbb{F}^{\prime}_{0}.

We finally come to lines 6-9, with $T$ invocations of the exponential mechanism and the Laplace mechanism. For the exponential mechanism, let $q_{j},q_{j^{\prime}}$ be the queries selected after line 7. For the Laplace mechanism, let $m_{i},m^{\prime}_{i}$ be the resulting values after line 9 for $(R_{1},R_{2}),(R^{\prime}_{1},R^{\prime}_{2})$ respectively. From the above, we already have (5). We next note two facts for any $\widehat{\Delta}\in\textsf{supp}\left(\widetilde{\Delta}\right)\cup\textsf{supp}\left(\widetilde{\Delta^{\prime}}\right)$:

  • $j\sim_{(\epsilon,\delta)}j^{\prime}$, since $s_{i}(J,q)-s_{i}(J^{\prime},q)\leq|q(J)-q(J^{\prime})|\leq\big{|}|J|-|J^{\prime}|\big{|}\leq\widehat{\Delta}$ holds for every query $q$;

  • $q_{i}(J)+\textsf{STGeom}_{\frac{\epsilon}{2T},\delta,\widehat{\Delta}}(x)\sim_{(\frac{\epsilon}{2T},\delta)}q_{i}(J^{\prime})+\textsf{STGeom}_{\frac{\epsilon}{2T},\delta,\widehat{\Delta}}(x)$, since $\big{|}q_{i}(J)-q_{i}(J^{\prime})\big{|}\leq\big{|}|J|-|J^{\prime}|\big{|}\leq\Delta\leq\widehat{\Delta}$.

Summing over $T$ iterations yields $(\epsilon,\delta)$-DP. Thus, Algorithm 1 preserves $(3\epsilon,3\delta)$-DP. ∎

Proof of Lemma 3.6.

For any pair of neighboring instances $\mathbf{I},\mathbf{I}^{\prime}$, we note that $SS^{\beta}_{\textsf{count}}(\mathbf{I})\leq e^{\beta}\cdot SS^{\beta}_{\textsf{count}}(\mathbf{I}^{\prime})$ always holds. Let $\widetilde{\Delta},\widetilde{\Delta}^{\prime}$ be the updated values after line 3 for $\mathbf{I},\mathbf{I}^{\prime}$ respectively. It can be shown that with probability at least $1-\delta$, $|\textsf{Lap}_{1}(x)|<\ln\frac{2}{\delta}$, so $\frac{\beta}{\epsilon}\textsf{Lap}_{1}(x)+\frac{1}{2}\in[0,1]$, and therefore $\widetilde{\Delta}\in[SS^{\beta}_{\textsf{count}}(\mathbf{I}),SS^{\beta}_{\textsf{count}}(\mathbf{I})\cdot e]$. For any $y\in[SS^{\beta}_{\textsf{count}}(\mathbf{I}),+\infty)$, we have

Pr[Δ~=y]Pr[Δ~=y]=exp(ϵβlogSScount(𝐈)SScount(𝐈))exp(ϵ)\displaystyle\frac{\Pr[\widetilde{\Delta}=y]}{\Pr[\widetilde{\Delta}^{\prime}=y]}=\exp\left(\frac{\epsilon}{\beta}\cdot\log\frac{SS_{\textsf{count}}(\mathbf{I})}{SS_{\textsf{count}}(\mathbf{I}^{\prime})}\right)\leq\exp(\epsilon)

thus, $\widetilde{\Delta}\sim_{(\epsilon,\delta)}\widetilde{\Delta}^{\prime}$. Then, for any $\widehat{\Delta}\in\textsf{supp}\left(\widetilde{\Delta}\right)\cup\textsf{supp}\left(\widetilde{\Delta}^{\prime}\right)$, we know that

|J|+STGeomϵ,δ,Δ^(x)(ϵ,δ)|J|+STGeomϵ,δ,Δ^(x)|J|+\textsf{STGeom}_{\epsilon,\delta,\widehat{\Delta}}(x)\sim_{(\epsilon,\delta)}|J^{\prime}|+\textsf{STGeom}_{\epsilon,\delta,\widehat{\Delta}}(x)

since $\big{|}|J|-|J^{\prime}|\big{|}\leq LS_{\textsf{count}}(\mathbf{I})\leq SS_{\textsf{count}}(\mathbf{I})\leq\widehat{\Delta}$. Let $\widehat{n},\widehat{n}^{\prime}$ be the resulting values after line 5 for $\mathbf{I},\mathbf{I}^{\prime}$ respectively. Then, $\widehat{n}\sim_{(2\epsilon,2\delta)}\widehat{n}^{\prime}$.

We next turn to line 6. Let 𝔽,𝔽\mathbb{F},\mathbb{F}^{\prime} be the synthetic data generated for 𝐈,𝐈\mathbf{I},\mathbf{I}^{\prime} separately. The single-table algorithm starts with a uniform distribution with 𝔽0(x)=n^|𝔻|\mathbb{F}_{0}(x)=\frac{\widehat{n}}{|\mathbb{D}|} for (R1,R2)(R_{1},R_{2}), and 𝔽0(x)=n^|𝔻|\mathbb{F}^{\prime}_{0}(x)=\frac{\widehat{n}^{\prime}}{|\mathbb{D}|} for (R1,R2)(R^{\prime}_{1},R^{\prime}_{2}). Implied by the analysis above, the initial state satisfies (2ϵ,2δ)(2\epsilon,2\delta)-DP, i.e., 𝔽0(2ϵ,2δ)𝔽0\mathbb{F}_{0}\sim_{(2\epsilon,2\delta)}\mathbb{F}^{\prime}_{0}.

We finally come to lines 7-10, with $T$ invocations of the exponential mechanism and the Laplace mechanism. For the exponential mechanism, let $q_{j},q_{j^{\prime}}$ be the queries selected after line 7. For the Laplace mechanism, let $m_{i},m^{\prime}_{i}$ be the resulting values after line 9 for $\mathbf{I},\mathbf{I}^{\prime}$ respectively. As shown above, $\widetilde{\Delta}\sim_{(\epsilon,\delta)}\widetilde{\Delta}^{\prime}$. We next note two facts for any $\widehat{\Delta}\in\textsf{supp}\left(\widetilde{\Delta}\right)\cup\textsf{supp}\left(\widetilde{\Delta^{\prime}}\right)$:

  • j(ϵ,δ)jj\sim_{(\epsilon,\delta)}j^{\prime};

  • qi(J)+STGeomϵ2T,δ,Δ^(x)(ϵ2T,δ)qi(J)+STGeomϵ2T,δ,Δ^(x)q_{i}(J)+\textsf{STGeom}_{\frac{\epsilon}{2T},\delta,\widehat{\Delta}}(x)\sim_{(\frac{\epsilon}{2T},\delta)}q_{i}(J^{\prime})+\textsf{STGeom}_{\frac{\epsilon}{2T},\delta,\widehat{\Delta}}(x)

The first fact follows since $s_{i}(J,q)-s_{i}(J^{\prime},q)\leq|q(J)-q(J^{\prime})|\leq\big{|}|J|-|J^{\prime}|\big{|}\leq LS_{\textsf{count}}(\mathbf{I})\leq SS_{\textsf{count}}(\mathbf{I})\leq\widehat{\Delta}$ holds for every query $q$. The second follows from the fact that $\big{|}q_{i}(J)-q_{i}(J^{\prime})\big{|}\leq\big{|}|J|-|J^{\prime}|\big{|}\leq\Delta\leq\widehat{\Delta}$. Summing over $T$ iterations yields $(\epsilon,\delta)$-DP. Thus, Algorithm 1 preserves $(3\epsilon,3\delta)$-DP. ∎

Proof of Theorem 1.6.

Set $n=\frac{\mathsf{OUT}}{\Delta}$ and $\mathbb{D}=(\{0,1\}^{d})^{n}$. From Theorem 1.4, let $\mathcal{Q}_{\textsf{one}}$ be the set of queries on which any $(\epsilon,\delta)$-differentially private algorithm that takes as input a single-table database $T\in\mathbb{D}$ and outputs an approximate answer to each query in $\mathcal{Q}_{\textsf{one}}$ within $\ell_{\infty}$-error $\alpha$ requires $\alpha\geq\widetilde{\Omega}\left(\sqrt{n}\cdot(\frac{\sqrt{d}\log|\mathcal{Q}_{\textsf{one}}|}{\epsilon})^{\frac{1}{2}}\right)$. For an arbitrary single-table database $T\in\mathbb{D}$, we construct a multi-table instance $\mathbf{I}$ for $\mathcal{H}$ of input size $n$, join size $\mathsf{OUT}$, and local sensitivity $\Delta$ as follows. Without loss of generality, we pick $\mathbf{x}_{1}$ as the anchor relation to encode $T$.

  • Set dom(x)=𝔻\textsf{dom}(x)=\mathbb{D} if x𝐱1x\in\mathbf{x}_{1}, and dom(x)=\textsf{dom}(x)=\mathbb{Z} otherwise;

  • There are Δ1k\Delta^{\frac{1}{k}} distinct values in dom(x)\textsf{dom}(x) for x𝐱1x\notin\mathbf{x}_{1}, where k=|i[m]𝐱i𝐱1|k=\left|\bigcup_{i\in[m]}\mathbf{x}_{i}-\mathbf{x}_{1}\right|;

  • For each data record $y\in T$, we add a tuple $(y,y,\cdots,y)$ to $R_{1}$;

  • Tuples in RiR_{i} for i2i\geq 2 form a Cartesian product of size Δ|𝐱i|k\Delta^{\frac{|\mathbf{x}_{i}|}{k}} over all attributes in 𝐱i\mathbf{x}_{i};

It can be easily checked that 𝐈𝕀(𝖮𝖴𝖳,Δ)\mathbf{I}\in\mathbb{I}(\mathsf{OUT},\Delta). From 𝒬one\mathcal{Q}_{\textsf{one}}, we construct a set of queries over 𝐈\mathbf{I} as 𝒬hier\mathcal{Q}_{\textsf{hier}}. For each query q𝒬oneq\in\mathcal{Q}_{\textsf{one}}, we construct another query q=(q1,q2,,qm)q^{\prime}=(q_{1},q_{2},\cdots,q_{m}) as follows:

  • q1:𝔻1[0,1]q_{1}:\mathbb{D}_{1}\to[0,1] such that q1(t)=q(y)q_{1}(t)=q(y) if and only if πxt=y\pi_{x}t=y for any xdom(x1)x\in\textsf{dom}(x_{1});

  • qi:𝔻i{+1}q_{i}:\mathbb{D}_{i}\to\{+1\} over 𝔻i\mathbb{D}_{i} for i2i\geq 2.

We also add the counting query $q^{\prime}=(q_{1},q_{2},\cdots,q_{m})$ with $q_{i}:\mathbb{D}_{i}\to\{+1\}$ for every $i\in[m]$ to $\mathcal{Q}_{\textsf{hier}}$.

It suffices to show that if there is an $(\epsilon,\delta)$-differentially private algorithm that takes any multi-table instance in $\mathbb{I}(\mathsf{OUT},\Delta)$ and outputs an approximate answer to each query in $\mathcal{Q}_{\textsf{hier}}$ within error $\alpha$, then there is an $(\epsilon,\delta)$-differentially private algorithm that takes any single-table instance $T\in\mathbb{D}$ and outputs an approximate answer to each query in $\mathcal{Q}_{\textsf{one}}$ within error $\frac{\alpha}{\Delta}$. The remaining argument exactly follows that of the two-table case. ∎

1: ℓ ← ⌈log(n/λ + 1)⌉;
2: B^σ ← ∅ for each σ ∈ [ℓ]^m;
3: foreach b ∈ B do
4:     foreach i ∈ [m] do
5:         deg̃_i(B, b) ← deg_i(B, b) + STGeom_{ϵ,δ,1}(x);
6:     σ ← ⟨max{0, ⌈log (1/λ) · deg̃_i(B, b)⌉} : i ∈ [m]⟩;
7:     B^σ ← B^σ ∪ {b};
8: foreach σ ∈ [ℓ]^m do
9:     foreach i ∈ [m] do
10:         R^σ_i ← {R_i(t) : t ∈ 𝔻_i, π_B t ∈ B^σ};
11: return ⋃_{σ ∈ [ℓ]^m} {(R^σ_1, R^σ_2, ⋯, R^σ_m)};
Algorithm 8 Uniformize-Star(𝐈 = {R₁, R₂, ⋯, Rₘ})
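A compact sketch of the grouping performed by Algorithm 8: each value $b$ is tagged with the vector $\sigma$ of bucket indices of its (noisy) per-relation degrees, and every relation is then split along these groups. The noise function is again only a placeholder for STGeom; with the zero-noise default, this yields the uniform (true-degree) partition:

```python
import math

def uniformize_star(degrees, lam, noise=lambda: 0):
    """degrees: dict b -> [deg_1(B,b), ..., deg_m(B,b)]. Returns a dict
    sigma (an m-tuple of bucket indices) -> group of b values."""
    groups = {}
    for b, degs in degrees.items():
        sigma = tuple(max(0, math.ceil(math.log2(max(d + noise(), 1) / lam)))
                      for d in degs)
        groups.setdefault(sigma, set()).add(b)
    return groups

# m = 2 relations: b-values with similar degree vectors share a group.
print(uniformize_star({"b1": [3, 5], "b2": [100, 4], "b3": [90, 3]}, lam=4))
```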

B.0.1. Missing Proofs for the Lower Bound

Proof of Theorem 3.3.

Let $\mathcal{Q}=\{\textsf{count}\}$, and let $\mathbf{I},\mathbf{I}^{\prime}\in\mathbb{I}(\Delta)$ be a pair of neighboring instances such that $|\textsf{count}(\mathbf{I})-\textsf{count}(\mathbf{I}^{\prime})|\geq\Omega(\Delta)$.

Suppose for the sake of contradiction that there is an (ϵ,δ)(\epsilon,\delta)-DP algorithm 𝒜\mathcal{A} that achieves an error α<|count(𝐈)count(𝐈)|/2\alpha<|\textsf{count}(\mathbf{I})-\textsf{count}(\mathbf{I}^{\prime})|/2 with probability at least 0.99. Let count~(𝐈),count~(𝐈)\widetilde{\textsf{count}}(\mathbf{I}),\widetilde{\textsf{count}}(\mathbf{I}^{\prime}) be the results released for join sizes of 𝐈,𝐈\mathbf{I},\mathbf{I}^{\prime} respectively. Then, as 𝒜\mathcal{A} preserves (ϵ,δ)(\epsilon,\delta)-DP, it must be that

$\displaystyle 0.99\leq\Pr\left(\widetilde{\textsf{count}}(\mathbf{I})\in[\textsf{count}(\mathbf{I})-\alpha,\textsf{count}(\mathbf{I})+\alpha]\right)$
$\displaystyle\leq e^{\epsilon}\cdot\Pr\left(\widetilde{\textsf{count}}(\mathbf{I}^{\prime})\in[\textsf{count}(\mathbf{I})-\alpha,\textsf{count}(\mathbf{I})+\alpha]\right)+\delta$
$\displaystyle\leq e^{\epsilon}\cdot\left(1-\Pr\left(\widetilde{\textsf{count}}(\mathbf{I}^{\prime})\in[\textsf{count}(\mathbf{I}^{\prime})-\alpha,\textsf{count}(\mathbf{I}^{\prime})+\alpha]\right)\right)+\delta$
$\displaystyle\leq e^{\epsilon}\cdot 0.01+\delta<0.99,$

a contradiction. ∎

B.1. Missing Proofs in Section 4

Example B.2.

Consider an example in which $|J^{i}|=\frac{|J|}{\ell}$ for each $i\in[\ell]$. Algorithm 1 achieves an error of $\alpha=O\left(\sqrt{|J|}\cdot\sqrt{\Delta}+\Delta\right)$, while Algorithm 8 achieves an error of

α=i[]|J|2i+i[]Δi=|J|2+1+Δ=O(|J|Δ+Δ)\alpha=\sum_{i\in[\ell]}\sqrt{\frac{|J|}{\ell}}\cdot\sqrt{2}^{i}+\sum_{i\in[\ell]}\Delta_{i}=\sqrt{\frac{|J|}{\ell}}\cdot\sqrt{2}^{\ell+1}+\Delta=O\left(\sqrt{\frac{|J|}{\ell}}\cdot\sqrt{\Delta}+\Delta\right)

where we can save a factor of =O(logn)\sqrt{\ell}=O(\sqrt{\log n}).

Example B.3.

Consider an example in which $|J^{i}|\cdot\Delta_{i}=\Delta^{2}$ for each $i\in[\ell]$, obtained by setting $|B^{i}|=4^{\ell-i}$ and $|J^{i}|=4^{\ell-i}\cdot 2^{i}$. The total number of join results is $|J|=\sum_{i\in[\ell]}|J^{i}|=\sum_{i\in[\ell]}4^{\ell-\frac{i}{2}}\leq 4^{\ell}=\Delta^{2}$. By setting $\Delta=\sqrt{n}/\log n$, the total number of tuples is at most $\sum_{i\in[\ell]}2^{i}\cdot 4^{\ell-i}=4^{\ell}\cdot\sum_{i\in[\ell]}2^{-i}\leq\Delta^{2}\leq n$. Algorithm 1 achieves an error of $\alpha=O(\Delta\cdot\sqrt{\Delta})$, while Algorithm 8 achieves an error of

α=i[]4i2i2i+Δ=2+Δ=O(Δlogn)\alpha=\sum_{i\in[\ell]}\sqrt{4^{\ell-i}\cdot 2^{i}}\cdot\sqrt{2^{i}}+\Delta=2^{\ell}\cdot\ell+\Delta=O(\Delta\cdot\log n)

where we can save a factor of $\widetilde{\Omega}(\sqrt{\Delta})=\widetilde{\Omega}(n^{1/4})$.
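The collapse of the bucketed sum to $\ell\cdot 2^{\ell}$ can be checked numerically; a one-line sketch (with $\ell=10$, and the $\lambda$ factor dropped as in the display above):

```python
# Each summand sqrt(4**(ell-i) * 2**i) * sqrt(2**i) equals 2**ell
# (up to float rounding), so the sum over i in [ell] is ell * 2**ell.
ell = 10
total = sum((4 ** (ell - i) * 2 ** i) ** 0.5 * (2 ** i) ** 0.5
            for i in range(1, ell + 1))
print(total, ell * 2 ** ell)  # ~10240.0 and 10240
```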

Proof of Lemma 4.1.

Consider two neighboring instances $\mathbf{I}=(R_{1},R_{2})$ and $\mathbf{I}^{\prime}=(R^{\prime}_{1},R^{\prime}_{2})$. Let $\sigma,\sigma^{\prime}:\textsf{dom}(B)\to[\ell]$ be the partitions of $\textsf{dom}(B)$ for $\mathbf{I},\mathbf{I}^{\prime}$ respectively. By definition, $\sigma(b)=i$ if and only if $b\in B^{i}$; $\sigma^{\prime}$ is defined similarly. Note that $|\textsf{deg}_{i}(B,b)-\textsf{deg}^{\prime}_{i}(B,b)|\leq 1$ holds for any $i\in\{1,2\}$ and $b\in\textsf{dom}(B)$. Hence, $\sigma(b)\sim_{(\epsilon,\delta)}\sigma^{\prime}(b)$. As the degrees for different $b$'s are calculated on disjoint parts of the input data, $\sigma\sim_{(\epsilon,\delta)}\sigma^{\prime}$ by the parallel composition rule.

Let 𝔽i,𝔽i\mathbb{F}^{i},\mathbb{F}^{\prime i} be the synthetic dataset generated for (R1i,R2i),(R1i,R2i)(R^{i}_{1},R^{i}_{2}),(R^{\prime i}_{1},R^{\prime i}_{2}) respectively. Implied by Lemma 3.1, 𝔽i(3ϵ,3δ)𝔽i\mathbb{F}^{i}\sim_{(3\epsilon,3\delta)}\mathbb{F}^{\prime i} for each i[]i\in[\ell]. Note that {(R1i,R2i):i[]}\{(R^{i}_{1},R^{i}_{2}):i\in[\ell]\} forms a disjoint partition of (R1,R2)(R_{1},R_{2}). By the parallel composition rule, i[]𝔽i(3ϵ,3δ)i[]𝔽i\bigcup_{i\in[\ell]}\mathbb{F}^{i}\sim_{(3\epsilon,3\delta)}\bigcup_{i\in[\ell]}\mathbb{F}^{\prime i}. Putting everything together, we conclude that Algorithm 8 preserves (4ϵ,4δ)(4\epsilon,4\delta)-DP implied by the sequential composition rule. ∎

Proof of Theorem 4.3.

Given an input instance $\mathbf{I}$ over the two-table query, let $\pi_{1}=\left\{B^{0}_{1},B^{1}_{1},\cdots,B_{1}^{\lceil\log(\frac{n}{\lambda}+1)\rceil}\right\}$ be a fixed partition of $\textsf{dom}(B)$, such that $b\in B^{i}_{1}$ if and only if

max{deg1(B,b),deg2(B,b)}(λ2i1,λ2i].\max\{\textsf{deg}_{1}(B,b),\textsf{deg}_{2}(B,b)\}\in(\lambda\cdot 2^{i-1},\lambda\cdot 2^{i}].

Let π2={B20,B21,,B2log(nλ+1)}\pi_{2}=\left\{B^{0}_{2},B^{1}_{2},\cdots,B_{2}^{\lceil\log(\frac{n}{\lambda}+1)\rceil}\right\} be a random partition of dom(B)\textsf{dom}(B) by Algorithm 5. Note that bB2ib\in B^{i}_{2} happens only if

max{deg1(B,b),deg2(B,b)}(λ2i1λ,λ2i].\max\{\textsf{deg}_{1}(B,b),\textsf{deg}_{2}(B,b)\}\in(\lambda\cdot 2^{i-1}-\lambda,\lambda\cdot 2^{i}].

It is easy to see that $B^{i+1}_{1}\subseteq B^{i}_{2}\cup B^{i+1}_{2}$. For simplicity, denote by $x_{i}$ the number of join results contributed by values in $B^{i}_{1}\cap B^{i}_{2}$, and by $y_{i}$ the number of join results contributed by values in $B^{i}_{1}-B^{i}_{2}$. Then the cost of Algorithm 8 under partition $\pi_{2}$ is

$\displaystyle\sum_{i\in[\lceil\log(\frac{n}{\lambda}+1)\rceil]}\sqrt{x_{i}+y_{i-1}}\cdot\sqrt{\lambda\cdot 2^{i}}$
$\displaystyle\leq\sqrt{\lambda}\cdot\sum_{i\in[\lceil\log(\frac{n}{\lambda}+1)\rceil]}(\sqrt{x_{i}}+\sqrt{y_{i-1}})\cdot\sqrt{2^{i}}$
$\displaystyle=\sqrt{\lambda}\cdot\left(\sum_{i\in[\lceil\log(\frac{n}{\lambda}+1)\rceil]}\sqrt{x_{i}}\cdot\sqrt{2^{i}}+\sqrt{2}\cdot\sum_{i\in[\lceil\log(\frac{n}{\lambda}+1)\rceil]}\sqrt{y_{i-1}}\cdot\sqrt{2^{i-1}}\right)$
$\displaystyle\leq\sqrt{2}\cdot\sqrt{\lambda}\cdot\sum_{i\in[\lceil\log(\frac{n}{\lambda}+1)\rceil]}(\sqrt{x_{i}}+\sqrt{y_{i}})\cdot\sqrt{2^{i}}$
$\displaystyle\leq 2\cdot\sum_{i\in[\lceil\log(\frac{n}{\lambda}+1)\rceil]}\sqrt{x_{i}+y_{i}}\cdot\sqrt{2^{i}\cdot\lambda}$

and thus it can be bounded by twice the cost of $\mathcal{A}$ under the fixed uniform partition of $\textsf{dom}(B)$. ∎

Proof of Theorem 4.4.

Let 𝕀(𝖮𝖴𝖳i,i)\mathbb{I}(\mathsf{OUT}_{i},i) be the set of two-table instances with output size 𝖮𝖴𝖳i\mathsf{OUT}_{i} and every join value has maximum degree falling into (λ2i1,λ2i](\lambda\cdot 2^{i-1},\lambda\cdot 2^{i}]. Our proof consists of two steps:

  • Step (1): There exists a family of queries $\mathcal{Q}_{i}$ such that any $(\epsilon,\delta)$-algorithm that takes as input an instance from $\mathbb{I}(\mathsf{OUT}_{i},i)$ and outputs an approximate answer to each query in $\mathcal{Q}_{i}$ within error $\alpha_{i}$ must require $\alpha_{i}\geq\Omega\left(\sqrt{\mathsf{OUT}_{i}}\cdot\sqrt{2^{i}\cdot\lambda}\right)$.

  • Step (2): There exists a family of queries 𝒬\mathcal{Q} such that any (ϵ,δ)(\epsilon,\delta)-algorithm that takes as input an instance 𝐈(𝖮𝖴𝖳)\mathbf{I}(\overrightarrow{\mathsf{OUT}}) and outputs an approximate answer to each query in 𝒬\mathcal{Q} within error α\alpha must require αΩ(maxi𝖮𝖴𝖳i2iλ)\displaystyle{\alpha\geq\Omega\left(\max_{i}\sqrt{\mathsf{OUT}_{i}}\cdot\sqrt{2^{i}\cdot\lambda}\right)}.

Let's first focus on Step (1) with an arbitrary $i\in[\lceil\log(\frac{n}{\lambda})\rceil]$. Let $n_{i}$ be an arbitrary integer such that $n_{i}\cdot\lambda\cdot 2^{i-1}\leq\mathsf{OUT}_{i}\leq n_{i}\cdot\lambda\cdot 2^{i}$ holds. From Theorem 1.4, let $\mathcal{Q}^{i}_{\textsf{one}}$ be the set of hard queries on which any $(\epsilon,\delta)$-differentially private algorithm that takes as input any single table $T^{i}\in(\{0,1\}^{d})^{n_{i}}$ and outputs an approximate answer to each query in $\mathcal{Q}^{i}_{\textsf{one}}$ within error $\alpha$ must require $\alpha\geq\widetilde{\Omega}\left(\sqrt{n_{i}}\cdot(\frac{\sqrt{d}\cdot\log|\mathcal{Q}^{i}_{\textsf{one}}|}{\epsilon})^{\frac{1}{2}}\right)$. For an arbitrary single table $T^{i}\in(\{0,1\}^{d})^{n_{i}}$, we can construct a two-table instance $(R^{i}_{1},R^{i}_{2})$ as follows:

  • Set dom(A)=𝔻\textsf{dom}(A)=\mathbb{D}, dom(B)=\textsf{dom}(B)=\mathbb{Z} and dom(C)=\textsf{dom}(C)=\mathbb{Z};

  • Set 𝔻1=dom(A)×dom(B)\mathbb{D}_{1}=\textsf{dom}(A)\times\textsf{dom}(B) and 𝔻2=dom(B)×dom(C)\mathbb{D}_{2}=\textsf{dom}(B)\times\textsf{dom}(C);

  • Tuples in R1iR^{i}_{1} form a Cartesian product of size ni×1n_{i}\times 1 over dom(A)×dom(B)\textsf{dom}(A)\times\textsf{dom}(B);

  • Tuples in $R^{i}_{2}$ form a Cartesian product of size $1\times\frac{\mathsf{OUT}_{i}}{n_{i}}$ over $\textsf{dom}(B)\times\textsf{dom}(C)$;

  • For each of the $n_{i}$ records in $T^{i}$, say $x$, we add a tuple $t\in R^{i}_{1}$ and set $\pi_{A}t=x$.

It can be easily checked that 𝐈𝕀(𝖮𝖴𝖳i,i)\mathbf{I}\in\mathbb{I}(\mathsf{OUT}_{i},i). From Theorem 1.4, let 𝒬twoi\mathcal{Q}^{i}_{\textsf{two}} be the set of all linear queries over 𝔻1i×𝔻2i\mathbb{D}^{i}_{1}\times\mathbb{D}^{i}_{2}. For each query q𝒬oneiq\in\mathcal{Q}^{i}_{\textsf{one}}, we construct another query q=(q1,q2)𝒬twoiq^{\prime}=(q_{1},q_{2})\in\mathcal{Q}^{i}_{\textsf{two}} such that:

  • q1:𝔻1i[0,1]q_{1}:\mathbb{D}^{i}_{1}\to[0,1] such that q1(t)=q(x)q_{1}(t)=q(x) if and only if πAt=x\pi_{A}t=x;

  • q2:𝔻2i{+1}q_{2}:\mathbb{D}^{i}_{2}\to\{+1\}.

We borrow a similar argument from the two-table lower bound, showing that if there is an $(\epsilon,\delta)$-algorithm that takes a two-table instance in $\mathbb{I}(\mathsf{OUT}_{i},i)$ and outputs an approximate answer to each query in $\mathcal{Q}^{i}_{\textsf{two}}$ within error $\alpha_{i}$, then there exists an $(\epsilon,\delta)$-algorithm that takes an arbitrary single table $T_{i}\in(\{0,1\}^{d})^{n_{i}}$ and outputs an approximate answer to each query in $\mathcal{Q}^{i}_{\textsf{one}}$ within error $\frac{\alpha_{i}}{\lambda\cdot 2^{i}}$. As $\frac{\alpha_{i}}{\lambda\cdot 2^{i}}\geq\Omega(\sqrt{n_{i}})$, we have $\alpha_{i}\geq\Omega\left(\sqrt{\mathsf{OUT}_{i}}\cdot\sqrt{2^{i}\cdot\lambda}\right)$.

Step (2).

From Theorem 1.4, let $\mathcal{Q}_{\textsf{two}}$ be the family of linear queries over $\mathbb{D}_{1}\times\mathbb{D}_{2}$, such that $\mathcal{Q}_{\textsf{two}}=\left\{\cup_{i}q_{i}:q_{i}\in\mathcal{Q}^{i}_{\textsf{two}},\forall i\in[\lceil\log n\rceil]\right\}$. Consider an arbitrary $(\epsilon,\delta)$-differentially private algorithm $\mathcal{A}$ that takes as input an instance from $\mathbb{I}(\overrightarrow{\mathsf{OUT}})$ and outputs an approximate answer to each query in $\mathcal{Q}_{\textsf{two}}$ within error $\alpha$. If $\alpha\leq\max_{i}\sqrt{\mathsf{OUT}_{i}}\cdot\sqrt{2^{i}\cdot\lambda}$, there must exist an $(\epsilon,\delta)$-differentially private algorithm that takes as input any instance from $\mathbb{I}(\overrightarrow{\mathsf{OUT}})$ and outputs an approximate answer to each query in $\mathcal{Q}^{i}_{\textsf{two}}$ within the same error, for some $i$, contradicting Step (1). ∎

In the attribute tree $\mathcal{T}$, if there exists an attribute with only one child, say $u$ with child $v$, then $u,v$ always appear together in all relations, so they can simply be considered as one "combined" attribute. Hence, it is safe to assume that every non-leaf attribute has at least two children in $\mathcal{T}$.

Proof of Lemma 4.10.

We will prove the first two properties by induction. In the base case when 𝕀\mathbb{I} only contains one instance, these two properties hold trivially. Consider an arbitrary iteration of while-loop in Algorithm 6. By hypothesis, all sub-instances in 𝕀\mathbb{I} have disjoint join results, and the union of join results of all sub-instances in 𝕀\mathbb{I} is i[m]Ri\Join_{i\in[m]}R_{i}. Then, it suffices to show that PartitionByAttr(𝐈,x)(\mathbf{I},x) generates a partition of 𝐈\mathbf{I} as

{𝐈i=(jExRji)(jExRj):i[log(nλ+1)]},\left\{\mathbf{I}^{i}=\left(\bigcup_{j\in E_{x}}R^{i}_{j}\right)\cup\left(\bigcup_{j\notin E_{x}}R_{j}\right):i\in\left[\left\lceil\log\left(\frac{n}{\lambda}+1\right)\right\rceil\right]\right\},

such that all 𝐈i\mathbf{I}^{i}’s have disjoint join results, and the union of join results of all 𝐈i\mathbf{I}^{i}’s is the join result of 𝐈\mathbf{I}.

Consider an arbitrary join result $t=(t_{1},t_{2},\cdots,t_{m})$ of $\mathbf{I}$, where $t_{j}$ is the projection of $t$ onto $\mathbf{x}_{j}$. For any $j\notin E_{x}$, $t_{j}\in R_{j}$. Let $\vec{y}\in\textsf{dom}(\mathbf{y})$ be the tuple such that $\pi_{\mathbf{y}}t_{j}=\vec{y}$ holds for every $j\in E_{x}$. Denote $i^{*}=\max\{1,\lceil\log\frac{1}{\lambda}\cdot\widetilde{\textsf{deg}}_{E_{x}}(\mathbf{y},\vec{y})\rceil\}$. Then, for any $j\in E_{x}$, $t_{j}\in R^{i}_{j}$ if and only if $i=i^{*}$, by line 9. Thus, $t\in\left(\Join_{j\in E_{x}}R^{i^{*}}_{j}\right)\Join\left(\Join_{j\notin E_{x}}R_{j}\right)$, but $t\notin\left(\Join_{j\in E_{x}}R^{i}_{j}\right)\Join\left(\Join_{j\notin E_{x}}R_{j}\right)$ for any $i\neq i^{*}$.

For the third property, we actually show that each input tuple $t\in R_{i}$ appears in $O(\log^{c}(\frac{n}{\lambda}+1))$ sub-instances, where $c=|\mathbf{x}|$ is the number of join attributes. In an invocation of PartitionByAttr$(\mathbf{I},x)$, a tuple $t\in R_{i}$ appears in $O\left(\log(\frac{n}{\lambda}+1)\right)$ sub-instances if $i\notin E_{x}$, and in only one sub-instance otherwise. Overall, the procedure PartitionByAttr will be invoked on every attribute in $\mathcal{T}$, thus every input tuple $t\in R_{i}$ appears in $O\left(\prod_{x\in\mathbf{x}}\log(\frac{n}{\lambda}+1)\right)=O\left(\log^{|\mathbf{x}|}(\frac{n}{\lambda}+1)\right)$ sub-instances.

Lastly, we show that each sub-instance corresponds to a distinct degree characterization. Let's first focus on an arbitrary relation $R_{j}(\mathbf{x}_{j})$ in Algorithm 6, which is partitioned by the attributes lying on the path corresponding to $\mathbf{x}_{j}$ in $\mathcal{T}$ in a bottom-up way, say $\langle x_{1},x_{2},\cdots,x_{k}(=r)\rangle$. When PartitionByAttr$(\mathbf{I},x_{1})$ is invoked, the sub-instance $\mathbf{I}^{i}=\{R^{i}_{j}:j\in E_{x_{1}}\}\cup\{R_{j}:j\notin E_{x_{1}}\}$ must have $\sigma(E_{x_{1}},\mathbf{y})=i$ with

deg~Ex1𝐈i(𝐲,y)(2σ(Ex1,𝐲)1,2σ(Ex1,𝐲)]\widetilde{\textsf{deg}}^{\mathbf{I}^{i}}_{E_{x_{1}}}(\mathbf{y},\vec{y})\in\left(2^{\sigma(E_{x_{1}},\mathbf{y})-1},2^{\sigma(E_{x_{1}},\mathbf{y})}\right]

holding for any $\vec{y}\in\textsf{dom}(\mathbf{y})$ with $\textsf{deg}^{\mathbf{I}^{i}}_{E_{x_{1}}}(\mathbf{y},\vec{y})>0$. In the subsequent partitions on $\mathbf{I}^{i}$, we claim that for each $\vec{y}\in\textsf{dom}(\mathbf{y}_{1})$, the set of tuples with the same value $\vec{y}$ always appear together. Suppose PartitionByAttr$(\mathbf{I}^{\prime},x_{\ell})$ is invoked for any $\ell\geq 2$. Observe that for any $t\in R^{i}_{j}$ with $\pi_{\mathbf{y}_{1}}t=\vec{y}$, $\pi_{\mathbf{y}_{\ell}}t=\pi_{\mathbf{y}_{\ell}}\vec{y}$ always holds since $\mathbf{y}_{\ell}\subseteq\mathbf{y}_{1}$, where $\mathbf{y}_{1}=\{x_{2},x_{3},\cdots,x_{k}\}$ and $\mathbf{y}_{\ell}=\{x_{\ell+1},x_{\ell+2},\cdots,x_{k}\}$. By line 9 in Algorithm 7, tuples with the same value $\vec{y}\in\textsf{dom}(\mathbf{y}_{1})$ always fall into the same sub-relation. In other words, $\textsf{deg}_{E_{x_{1}}}(\mathbf{y}_{1},\vec{y})$ remains the same in subsequent partitions, for any $\vec{y}$ with $\textsf{deg}_{E_{x_{1}}}(\mathbf{y}_{1},\vec{y})>0$. This argument also generalizes to any attribute $x$.

Now, let's interpret the partition process as a directed tree, in which $\mathbf{I}^{\prime}$ has the sub-instances in PartitionByAttr$(\mathbf{I}^{\prime},x)$ as its children. Each sub-instance returned by Algorithm 6 corresponds to a distinct root-to-leaf path in this directed tree. Consider an arbitrary sub-instance $\mathbf{I}^{\prime\prime}$ corresponding to the root-to-leaf path $\langle\mathbf{I}_{1}(=\mathbf{I}),\mathbf{I}_{2},\cdots,\mathbf{I}_{|\mathbf{x}|}(=\mathbf{I}^{\prime\prime})\rangle$. When an instance goes through an invocation of PartitionByAttr$(\mathbf{I}_{j},x)$, a distinct value $\sigma(E_{x},\mathbf{y})$ is set for each of $\mathbf{I}_{j}$'s children. As proved above, this degree value is preserved in every $\mathbf{I}_{k}$ for $k\geq j$. As each instance on this path corresponds to one invocation of PartitionByAttr on a distinct attribute, all values of $\sigma(\cdot,\cdot)$ are well defined for $\mathbf{I}^{\prime\prime}$. Thus, each sub-instance returned by Algorithm 6 corresponds to a degree characterization.

Moreover, for any pair of sub-instances returned by Algorithm 6, we can find the lowest common ancestor of the two corresponding root-to-leaf paths, say an invocation of PartitionByAttr on $x$. These two sub-instances must then have different degree values over the attributes $\mathbf{y}=\{y\in\mathbf{x}:E_{x}\subsetneq E_{y}\}$ in relations from $E_{x}$. As proved above, this difference is preserved in both sub-instances. The argument applies to any pair of sub-instances returned by Algorithm 6, thus each sub-instance corresponds to a distinct degree characterization. ∎

In an attribute tree $\mathcal{T}$ rooted at $r$, let $\textsf{path}(r,x)$ be the sequence of attributes lying on the path from $r$ to $x$.

Proof of Lemma 4.7.

We will prove it through the following four steps:

  • Step 1: 𝐲={x𝐱:iE,jE,x𝐱i𝐱j}\mathbf{y}^{\prime}=\{x\in\mathbf{x}:\exists i\in E^{\prime},j\notin E^{\prime},x\in\mathbf{x}_{i}\cap\mathbf{x}_{j}\};

  • Step 2: 𝐲(iE𝐱i)\mathbf{y}^{\prime}\subsetneq(\cap_{i\in E^{\prime}}\mathbf{x}_{i});

  • Step 3: there exists no jEj\notin E^{\prime} such that (iE𝐱i)𝐱j(\cap_{i\in E^{\prime}}\mathbf{x}_{i})\subseteq\mathbf{x}_{j}.

  • Step 4: If xx is the node such that iE𝐱i=path(r,x)\cap_{i\in E^{\prime}}\mathbf{x}_{i}=\textsf{path}(r,x), let xpath(r,x)x^{\prime}\in\textsf{path}(r,x) be the lowest ancestor of xx with more than one child. Then, E=ExE^{\prime}=E_{x} and 𝐲=path(r,x)\mathbf{y}^{\prime}=\textsf{path}(r,x^{\prime}).

Step 1. It suffices to show that the first property holds for every invocation of $T_{E,\mathbf{y}}$. First, it holds trivially for $T_{E,\partial E}$, which follows from the definition of the boundary attributes $\partial E$. By hypothesis, we assume it holds for some $T_{E,\mathbf{y}}$. Next, we show that it is also preserved by the two recursive rules applied to $T_{E,\mathbf{y}}$. We distinguish two cases.

  • $\bar{H}_{E}$ is disconnected. Consider an arbitrary connected component $E^{\prime\prime}\in\mathcal{C}_{E}$. For any pair of $i\in E^{\prime\prime}$ and $j\notin E$, $\mathbf{x}_{i}\cap\mathbf{x}_{j}\subseteq\mathbf{y}$ is implied by the hypothesis. For any pair of $i\in E^{\prime\prime}$ and $j\in E-E^{\prime\prime}$, $\mathbf{x}_{i}\cap\mathbf{x}_{j}\subseteq\mathbf{y}$ is implied by the disconnectedness of $\bar{H}_{E}$. Together, for any pair of $i\in E^{\prime\prime}$ and $j\notin E^{\prime\prime}$, we have $\mathbf{x}_{i}\cap\mathbf{x}_{j}\subseteq\mathbf{y}$. Hence, for each $E^{\prime\prime}\in\mathcal{C}_{E}$, $T_{E^{\prime\prime},\mathbf{y}\cap(\cup_{i\in E^{\prime\prime}}\mathbf{x}_{i})}$ also satisfies this property.

  • $\bar{H}_{E}$ is connected. We note that every attribute in $\cup_{i\in E}\mathbf{x}_{i}-\mathbf{y}$ only appears in relations from $E$, by hypothesis. As $\mathbf{y}\subseteq(\cap_{i\in E}\mathbf{x}_{i})$, every attribute in $\cup_{i\in E}\mathbf{x}_{i}-(\cap_{i\in E}\mathbf{x}_{i})$ only appears in relations from $E$. In other words, if there exists a pair $i\in E$ and $j\notin E$, then $\mathbf{x}_{i}\cap\mathbf{x}_{j}\subseteq(\cap_{i\in E}\mathbf{x}_{i})$. Hence, $T_{E,(\cap_{i\in E}\mathbf{x}_{i})}$ also satisfies this property.

Step 2. With the first property in place, we prove the second property. From Definition 4.6, $\mathbf{y}\subseteq(\cap_{i\in E}\mathbf{x}_i)$. In the base case, when $\textsf{deg}_E(\mathbf{y},\cdot)$ is invoked we must have $\mathbf{y}\neq\mathbf{x}_i$, and thus $\mathbf{y}\subsetneq\mathbf{x}_i$. In the recursive step, when $\bar{H}_E$ is connected we have already shown that $\mathbf{y}\subsetneq(\cap_{i\in E}\mathbf{x}_i)$. Overall, whenever $\textsf{deg}_E(\mathbf{y},\cdot)$ is invoked, we must have $\mathbf{y}\subsetneq(\cap_{i\in E}\mathbf{x}_i)$.

Step 3. We now come to the third property. Suppose there exists some $j\notin E$ such that $(\cap_{i\in E}\mathbf{x}_i)\subseteq\mathbf{x}_j$. The first property then implies $(\cap_{i\in E}\mathbf{x}_i)\subseteq\mathbf{y}$, contradicting the second property.

Step 4. Finally, we come to the last step. As $H$ is hierarchical, any subset of relations $E\subseteq[m]$ also forms a hierarchical join. The attributes in $\cap_{i\in E}\mathbf{x}_i$ form a root-to-node path in the attribute forest of $H$. Together with the third property, if $\textsf{path}(r,x)$ equals the intersection of the attribute sets of the relations in $E'$, then $E'=E_x$. Moreover, note that $\mathbf{y}'$ is also a root-to-node sub-path of $\textsf{path}(r,x)$. Suppose $x'\notin\mathbf{y}'$. Since $x'$ has at least one child that is not an ancestor of $x$, we can always find $j\in[m]-E_x$ with $x'\in\mathbf{x}_j$. As $E'=E_x$, we have $j\notin E'$, so $x'\in\mathbf{x}_i\cap\mathbf{x}_j$ for any $i\in E'$, and hence $x'\in\mathbf{y}'$ by the first property, coming to a contradiction. Under the assumption that every non-leaf attribute in $\mathcal{T}$ has at least two children, it is safe to replace $x'$ with the parent of $x$. ∎
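For intuition, the two quantities this step manipulates are straightforward to compute. The sketch below uses our own minimal encoding of an attribute tree and relation schemas (not the paper's code) to derive $\textsf{path}(r,x)$ and $E_x=\{i:x\in\mathbf{x}_i\}$ for a hierarchical join.

```python
def path_from_root(parent, x):
    """path(r, x): the attributes on the path from the root r down to x.
    `parent` maps each attribute to its parent; the root maps to None."""
    rev = []
    while x is not None:
        rev.append(x)
        x = parent[x]
    return rev[::-1]

def relations_containing(schemas, x):
    """E_x: indices of relations whose attribute set contains x. In a
    hierarchical join each x_i is a root-to-node path, so E_x can only
    shrink as x moves away from the root."""
    return {i for i, attrs in enumerate(schemas) if x in attrs}
```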

Proof of Lemma 4.11.

Consider two neighboring instances $\mathbf{I}$ and $\mathbf{I}'$. Note that $|\textsf{deg}^{\mathbf{I}}_E(\mathbf{y},\vec{y})-\textsf{deg}^{\mathbf{I}'}_E(\mathbf{y},\vec{y})|\leq 1$ holds for any $E\subseteq[m]$, $\mathbf{y}\subseteq(\cap_{i\in E}\mathbf{x}_i)$, and $\vec{y}\in\textsf{dom}(\mathbf{y})$. This is not hard to show: changing one input tuple changes at most one tuple in $\Join_{i\in E}R_i$, hence at most one tuple in its projection onto the attributes $\cap_{i\in E}\mathbf{x}_i$, and thus at most one tuple in $\{t\in\pi_{\cap_{i\in E}\mathbf{x}_i}\Join_{i\in E}R_i:\pi_{\mathbf{y}}t=\vec{y}\}$. Let $\langle x_1(=r),x_2,\cdots,x_k\rangle$ be the root-to-leaf path corresponding to $\mathbf{x}_i$. Each input tuple $t\in R_i(\mathbf{x}_i)$ contributes to at most $|\mathbf{x}_i|$ degree functions, i.e., $\textsf{deg}_{E_{x_j}}(\mathbf{y},\pi_{\mathbf{y}}t)$ with $\mathbf{y}=\{x_1,x_2,\cdots,x_{j-1}\}$, for any $j\in[k]$. Hence, $\widetilde{\textsf{deg}}\sim_{(c'\cdot\epsilon,c'\cdot\delta)}\widetilde{\textsf{deg}}'$, where $c'=\max_{i\in[m]}|\mathbf{x}_i|$ is a query-dependent quantity.

Moreover, for each sub-instance $\mathbf{I}^\sigma$ returned, $\mathbb{F}^\sigma\sim_{(3\epsilon,3\delta)}\mathbb{F}'^\sigma$, as implied by Lemma 3.1. As each tuple participates in at most $O(\log^c(\frac{n}{\lambda}+1))$ sub-instances, $\bigcup_\sigma\mathbb{F}^\sigma\sim_{(3\log^c(\frac{n}{\lambda}+1)\cdot\epsilon,\,3\log^c(\frac{n}{\lambda}+1)\cdot\delta)}\bigcup_\sigma\mathbb{F}'^\sigma$ by group privacy. Putting everything together, we conclude that Algorithm 8 preserves $(O(\log^c n\cdot\epsilon),O(\log^c n\cdot\delta))$-DP by the sequential composition rule. ∎
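To illustrate the first part of this accounting, the sketch below (hypothetical helper names, not the paper's Algorithm 7) releases all degree functions with Laplace noise. Since each degree has sensitivity 1 and one input tuple touches at most $c'=\max_{i\in[m]}|\mathbf{x}_i|$ of them, the joint release is $(c'\epsilon)$-DP. We use the pure-DP Laplace mechanism for simplicity; the paper's $(\epsilon,\delta)$ regime would substitute an approximate-DP mechanism.

```python
import numpy as np

def noisy_degrees(true_degrees, eps, rng=None):
    """Release every degree function deg_E(y, y_vec) with Laplace noise.

    true_degrees: dict mapping (E, y_attrs, y_value) -> true degree.
    Each degree has sensitivity 1 under a single-tuple change, and one
    tuple affects at most c' = max_i |x_i| entries, so the joint release
    satisfies (c' * eps)-DP by sequential composition over those entries.
    """
    rng = rng or np.random.default_rng()
    scale = 1.0 / eps                    # sensitivity 1 per degree function
    return {key: deg + rng.laplace(0.0, scale)
            for key, deg in true_degrees.items()}
```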

Note that for an arbitrary $T_E$, replacing true degrees with noisy ones can enlarge it by a factor of at most $\sqrt{2}^{|\mathbf{x}|}$. Let $\mathbf{y}\subseteq\mathbf{x}$ be the set of join attributes, and $\pi_{\mathbf{y}}J=\{t_1,t_2,\cdots,t_k\}$. We consider the most fine-grained partition of $J$ into $J_1,J_2,\cdots,J_k$, where $J_i$ is the set of all join results in which $t_i\in\textsf{dom}(\mathbf{y})$ participates. Let $\pi(i)$ be the degree characterization into which $J_i$ falls under the true degrees, and $\widetilde{\pi}(i)$ the one into which it falls under the noisy degrees computed by Algorithm 7. The cost of Algorithm 4 on hierarchical joins is

\begin{align*}
&\sum_{\sigma}\sqrt{\sum_{i:\widetilde{\pi}(i)=\sigma}|J_{i}|}\cdot\sqrt{SS^{\sigma}_{\textsf{count}}}\\
=\ &\sum_{\sigma}\sqrt{\sum_{i:\pi(i)=\widetilde{\pi}(i)=\sigma}|J_{i}|+\sum_{i:\pi(i)\neq\sigma,\widetilde{\pi}(i)=\sigma}|J_{i}|}\cdot\sqrt{SS^{\sigma}_{\textsf{count}}}\\
\leq\ &\sum_{\sigma}\sqrt{\sum_{i:\pi(i)=\widetilde{\pi}(i)=\sigma}|J_{i}|}\cdot\sqrt{SS^{\sigma}_{\textsf{count}}}+\sum_{\sigma}\sqrt{\sum_{i:\pi(i)\neq\sigma,\widetilde{\pi}(i)=\sigma}|J_{i}|}\cdot\sqrt{SS^{\sigma}_{\textsf{count}}}\\
\leq\ &\sum_{\sigma}\sqrt{\sum_{i:\pi(i)=\widetilde{\pi}(i)=\sigma}|J_{i}|}\cdot\sqrt{SS^{\sigma}_{\textsf{count}}}+\sum_{\sigma}\sum_{\sigma':\sigma'\neq\sigma}\sqrt{\sum_{i:\pi(i)=\sigma',\widetilde{\pi}(i)=\sigma}|J_{i}|}\cdot\sqrt{SS^{\sigma'}_{\textsf{count}}}\cdot\sqrt{2}^{|\mathbf{x}|}\\
=\ &\sum_{\sigma}\left(\sqrt{\sum_{i:\pi(i)=\widetilde{\pi}(i)=\sigma}|J_{i}|}+\sum_{\sigma':\sigma'\neq\sigma}\sqrt{\sum_{i:\pi(i)=\sigma,\widetilde{\pi}(i)=\sigma'}|J_{i}|}\cdot\sqrt{2}^{|\mathbf{x}|}\right)\cdot\sqrt{SS^{\sigma}_{\textsf{count}}}\\
\leq\ &\sum_{\sigma}\sqrt{\sum_{i:\pi(i)=\sigma}|J_{i}|}\cdot\sqrt{SS^{\sigma}_{\textsf{count}}}\cdot O\left(2^{|\mathbf{x}|}\right),
\end{align*}

where the second inequality uses the fact that moving a group between characterizations changes $\sqrt{SS_{\textsf{count}}}$ by a factor of at most $\sqrt{2}^{|\mathbf{x}|}$, and the last inequality follows from the Cauchy-Schwarz inequality together with the fact that, for any $\sigma$, the number of $\sigma'$ such that some $J_i$ could have $\pi(i)=\sigma$ and $\widetilde{\pi}(i)=\sigma'$ is at most $O(2^{|\mathbf{x}|})$.
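As a quick numeric sanity check of this regrouping (synthetic data and our own encoding of characterizations; illustrative only), the snippet below splits each true class into at most $2^{|\mathbf{x}|}$ noisy classes, inflates $SS_{\textsf{count}}$ by a factor of at most $2^{|\mathbf{x}|}$ (so $\sqrt{SS_{\textsf{count}}}$ grows by $\sqrt{2}^{|\mathbf{x}|}$), and verifies that the noisy-partition cost stays within the claimed $O(2^{|\mathbf{x}|})$ factor of the true-partition cost.

```python
import numpy as np

rng = np.random.default_rng(0)
num_attrs = 3                                 # |x|
fan = 2 ** num_attrs                          # max noisy classes per true class
sizes = rng.integers(1, 100, size=50)         # |J_i| for 50 groups
pi = rng.integers(0, 5, size=50)              # true characterization pi(i)
pi_tilde = pi * fan + rng.integers(0, fan, size=50)  # noisy characterization
ss = {s: rng.uniform(1.0, 10.0) for s in set(pi)}    # SS^sigma_count (true)
# noisy classes inherit SS_count inflated by at most 2^{|x|}
ss_tilde = {s: ss[s // fan] * fan for s in set(pi_tilde)}

def cost(labels, ssd):
    """sum_sigma sqrt(sum of |J_i| in class sigma) * sqrt(SS^sigma)."""
    return sum(np.sqrt(sizes[labels == s].sum()) * np.sqrt(ssd[s]) for s in ssd)

# the noisy partition pays at most a 2^{|x|} factor over the true one
assert cost(pi_tilde, ss_tilde) <= cost(pi, ss) * fan
```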

Proof of Theorem 4.13.

Let $\mathbb{I}(\mathsf{OUT}_\sigma,\sigma)$ be the set of instances with output size $\mathsf{OUT}_\sigma$ that follow the degree characterization $\sigma$. Our proof consists of two steps:

  • Step (1): there exists a family of queries $\mathcal{Q}_\sigma$ such that any $(\epsilon,\delta)$-DP algorithm that takes as input an instance from $\mathbb{I}(\mathsf{OUT}_\sigma,\sigma)$ and outputs an approximate answer to each query in $\mathcal{Q}_\sigma$ within error $\alpha^\sigma$ must have $\alpha^\sigma\geq\Omega\left(\sqrt{\mathsf{OUT}_\sigma}\cdot\sqrt{SS^\sigma_{\textsf{count}}}\right)$.

  • Step (2): there exists a family of queries $\mathcal{Q}$ such that any $(\epsilon,\delta)$-DP algorithm that takes as input an instance $\mathbf{I}(\overrightarrow{\mathsf{OUT}})$ and outputs an approximate answer to each query in $\mathcal{Q}$ within error $\alpha$ must have $\alpha\geq\Omega\left(\max_\sigma\sqrt{\mathsf{OUT}_\sigma}\cdot\sqrt{SS^\sigma_{\textsf{count}}}\right)$.

Let us first focus on step (1) for an arbitrary $\sigma$. Assume $LS^\sigma_{\textsf{count}}$ is achieved on some $i^*\in[m]$, i.e., $LS^\sigma_{\textsf{count}}=T^\sigma_{[m]-\{i^*\}}$. We compute the value of $T^\sigma_{[m]-\{i^*\}}$ and set $n_\sigma=\frac{\mathsf{OUT}_\sigma}{T^\sigma_{[m]-\{i^*\}}}$. From Theorem 1.4, let $\mathcal{Q}^\sigma_{\textsf{one}}$ be the set of hard queries such that any $(\epsilon,\delta)$-DP algorithm that takes as input a single table $R_\sigma\in(\{0,1\}^d)^{n_\sigma}$ and outputs an approximate answer to each query in $\mathcal{Q}^\sigma_{\textsf{one}}$ within error $\alpha$ must have $\alpha\geq\widetilde{\Omega}\left(\sqrt{n_\sigma}\cdot\left(\frac{\sqrt{d}\cdot\log|\mathcal{Q}^\sigma_{\textsf{one}}|}{\epsilon}\right)^{\frac{1}{2}}\right)$. For an arbitrary single table $R_\sigma\in(\{0,1\}^d)^{n_\sigma}$, we construct an instance $\mathbf{I}^\sigma$ as follows:

  • Set $\mathbb{D}^\sigma_j=\prod_{x\in\mathbf{x}_j}\textsf{dom}(x)$ for every $j\in[m]$;

  • when visiting an attribute $x$ of $\mathcal{T}$ in a bottom-up way, set $|\textsf{dom}(x)|=\sigma(E_x,\mathbf{y})$, where $\mathbf{y}$ is the set of attributes lying on the path from $r$ to $x$'s parent in $\mathcal{T}$;

  • for the root $r$, set $|\textsf{dom}(r)|=\frac{\mathsf{OUT}_\sigma}{\prod_{x\in\mathbf{x}-\{r\}}|\textsf{dom}(x)|}$;

  • every relation is a Cartesian product over its own attributes;

  • for each data record $x\in R_\sigma$, we exclusively pick a tuple $t\in R_{i^*}$ and attach $x$ to it.

It can be easily checked that $\mathbf{I}^\sigma\in\mathbb{I}(\mathsf{OUT}_\sigma,\sigma)$ and $|R_{i^*}|=n_\sigma$. Let $\mathcal{Q}^\sigma_{\textsf{hier}}$ be the set of all linear queries over $\mathbb{D}^\sigma_1\times\mathbb{D}^\sigma_2\times\cdots\times\mathbb{D}^\sigma_m$. For each query $q\in\mathcal{Q}^\sigma_{\textsf{one}}$, we construct a query $q'=(q_1,q_2,\cdots,q_m)\in\mathcal{Q}^\sigma_{\textsf{hier}}$ such that:

  • $q_{i^*}:\mathbb{D}^\sigma_{i^*}\to[-1,1]$ with $q_{i^*}(t)=q(x)$ if and only if $x$ is attached to $t$;

  • $q_j:\mathbb{D}^\sigma_j\to\{+1\}$ for $j\in[m]-\{i^*\}$.

Borrowing the argument from Theorem LABEL:the:lb-two-table, if an $(\epsilon,\delta)$-DP algorithm takes as input an instance in $\mathbb{I}(\mathsf{OUT}_\sigma,\sigma)$ and outputs an approximate answer to each query in $\mathcal{Q}^\sigma_{\textsf{hier}}$ within error $\alpha^\sigma$, then there exists an $(\epsilon,\delta)$-DP algorithm that takes as input an arbitrary single table $R_\sigma\in(\{0,1\}^d)^{n_\sigma}$ and outputs an approximate answer to each query in $\mathcal{Q}^\sigma_{\textsf{one}}$ within error $\alpha^\sigma/LS^\sigma_{\textsf{count}}$. As $\alpha^\sigma/LS^\sigma_{\textsf{count}}\geq\sqrt{n_\sigma}$, we have $\alpha^\sigma\geq\sqrt{\mathsf{OUT}_\sigma}\cdot\sqrt{LS^\sigma_{\textsf{count}}}$.
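The reduction can be pictured with a short sketch (Python; the names and data layout are ours, purely illustrative). A single-table query $q$ is lifted to a per-relation tuple of weight functions: the relation $R_{i^*}$ carries $q$ through the attached records, and every other relation contributes weight $+1$, so the lifted query's answer on the constructed instance equals the answer to $q$ scaled by $T^\sigma_{[m]-\{i^*\}}$.

```python
def lift_query(q, attach, i_star, m):
    """Lift a single-table query q: record -> [-1, 1] to a multi-table
    linear query (q_1, ..., q_m), following the construction above.

    attach: maps a tuple t of R_{i*} to its attached record x, or None
    if t carries no record (in which case its weight is 0)."""
    def q_istar(t):
        x = attach(t)
        return q(x) if x is not None else 0.0
    # every relation other than R_{i*} contributes a constant weight +1
    return [q_istar if j == i_star else (lambda t: 1.0) for j in range(m)]
```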

Step (2).

Let $\mathcal{Q}_{\textsf{hier}}$ be the family of linear queries over $\mathbb{D}_1\times\mathbb{D}_2\times\cdots\times\mathbb{D}_m$ such that $\mathcal{Q}_{\textsf{hier}}=\left\{\cup_{i\in[m]}q_i:q_i\in\mathcal{Q}^\sigma_{\textsf{hier}},\forall\sigma\right\}$. Consider an arbitrary $(\epsilon,\delta)$-DP algorithm $\mathcal{A}$ that takes as input an instance in $\mathbb{I}(\overrightarrow{\mathsf{OUT}})$ and outputs an approximate answer to each query in $\mathcal{Q}_{\textsf{hier}}$ within error $\alpha$. If $\alpha\leq\max_\sigma\sqrt{\mathsf{OUT}_\sigma}\cdot\sqrt{LS^\sigma_{\textsf{count}}}$, then there must exist an $(\epsilon,\delta)$-DP algorithm that takes as input any instance from $\mathbb{I}(\overrightarrow{\mathsf{OUT}})$ and outputs an approximate answer to each query in $\mathcal{Q}^\sigma_{\textsf{hier}}$ for some $\sigma$ with error below the bound of step (1), a contradiction. ∎

Uniformizing Acyclic Queries.

A join query $\mathcal{H}$ is acyclic if there is a join tree $\mathcal{T}$ such that every node of $\mathcal{T}$ corresponds to a relation and, for every attribute $x\in\mathbf{x}$, the set of nodes containing $x$ forms a connected subtree of $\mathcal{T}$. For a general acyclic join query $\mathcal{H}$, we first fix a join tree $\mathcal{T}$. Let $\mathcal{C}_E$ be the set of connected components of the relations of $E$ on $\mathcal{T}$. For each $E'\in\mathcal{C}_E$, we pick as root an arbitrary relation $R_i$ with $\mathbf{x}_i\cap\partial E\neq\emptyset$; this is always feasible, since otherwise $E'$ itself is a connected component of $\mathcal{H}$ and $T_{E'}=|\Join_{i\in E'}R_i|$. After picking $R_i$ as the root, we denote by $p(j,i)$, for $j\neq i$, the parent of $R_j$ in $\mathcal{T}$. For completeness, let $\mathbf{x}_{p(i,i)}=\mathbf{x}_i\cap\partial E$. Then,

$$T_{E'}\leq\prod_{j\in E'}\max_{t\in\textsf{dom}(\mathbf{x}_j\cap(\mathbf{x}_{p(j,i)}\cup\partial E))}\big|R_j\ltimes t\big|.$$

At last, $T_E=\prod_{E'\in\mathcal{C}_E}T_{E'}$. To apply the uniformization technique, it suffices to partition each relation $R_j$ by the degree of its values over the attributes $\mathbf{x}_j\cap(\mathbf{x}_{p(j,i)}\cup\partial E)$, for all possible $E\subseteq[m]$ and choices of root $R_i$.
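As a concrete rendering of this bound, the sketch below (our own minimal data layout: relations as lists of dicts, schemas as attribute lists; not the paper's implementation) evaluates the right-hand side of the displayed inequality for one component: each relation contributes its maximum semi-join degree over $\mathbf{x}_j\cap(\mathbf{x}_{p(j,i)}\cup\partial E)$, with the root falling back to $\mathbf{x}_i\cap\partial E$.

```python
from collections import Counter

def max_degree(rel, attrs):
    """max_t |R_j semijoin t| over t in dom(attrs); rel is a list of dicts."""
    if not attrs:
        return len(rel)                  # every tuple matches the empty value
    cnt = Counter(tuple(tup[a] for a in attrs) for tup in rel)
    return max(cnt.values(), default=0)

def t_bound(component, schemas, parent, boundary):
    """Upper bound on T_{E'}: the product, over relations j in the
    component, of the max degree over x_j ∩ (x_{p(j,i)} ∪ ∂E).
    `parent[j]` is None for the chosen root, which then uses x_i ∩ ∂E."""
    total = 1
    for j, rel in component.items():
        base = set(boundary)
        if parent[j] is not None:
            base |= set(schemas[parent[j]])
        total *= max_degree(rel, sorted(set(schemas[j]) & base))
    return total
```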