
Sketches-based join size estimation under local differential privacy

Meifan Zhang, Xin Liu, Lihua Yin (corresponding author)
Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, 510006, China
[email protected], [email protected], [email protected]
Abstract

Join size estimation on sensitive data poses a risk of privacy leakage. Local differential privacy (LDP) is a solution to preserve privacy while collecting sensitive data, but it introduces significant noise when dealing with sensitive join attributes that have large domains. Employing probabilistic structures such as sketches is a way to handle large domains, but it leads to hash-collision errors. To achieve accurate estimations, it is necessary to reduce both the noise error and hash-collision error. To tackle the noise error caused by protecting sensitive join values with large domains, we introduce a novel algorithm called LDPJoinSketch for sketch-based join size estimation under LDP. Additionally, to address the inherent hash-collision errors in sketches under LDP, we propose an enhanced method called LDPJoinSketch+. It utilizes a frequency-aware perturbation mechanism that effectively separates high-frequency and low-frequency items without compromising privacy. The proposed methods satisfy LDP, and the estimation error is bounded. Experimental results show that our method outperforms existing methods, effectively enhancing the accuracy of join size estimation under LDP.

Index Terms:
Local differential privacy, Join query, Sketch

I Introduction

Join size estimation, also known as inner product estimation, represents a fundamental statistical problem with applications spanning various domains, including query optimization [1], similarity computation [2], correlation computation [3], and dataset discovery [4]. Despite numerous research efforts dedicated to this issue, the data collection process for join size estimation poses inherent privacy risks. Local differential privacy (LDP) [5] is a solution to privacy-preserving data collection [6, 7, 8], gaining traction in real-world applications at companies including Apple [9], Google [10, 11], and Microsoft [12]. However, most previous works have primarily focused on basic statistics like frequency and mean estimation, offering limited solutions for more intricate statistical tasks such as join aggregations. In a specific study [13], the authors explored join operations on two private data sources. However, it is important to highlight that their approach assumed non-private values for the join attributes.

Join size estimation under LDP holds significance in various scenarios. (1) Private similarity computation for data valuation and pricing. Estimating the similarity of two streams is a basic application of join size estimation. Private similarity computation is essential in data markets to assess the value of private data from different sources for analysis or learning tasks [14]. (2) Private correlation computation for dataset search and discovery. Join size estimation plays a role in dataset search by identifying relevant factors through joinable columns and computing column correlation [4]. Private join size estimation benefits relevant industries such as hospitals and genetics research companies in assessing the relevance of private data before collaborating. (3) Private approximate query processing. Approximate Query Processing (AQP) can enhance response times by returning approximate results. Applying AQP to private data presents a promising approach for efficient and private data analysis [15, 16], given that exact query answers are unattainable under DP or LDP. Join size estimation, as a specific task of AQP, holds potential to be extended to handle more general AQP tasks, including other join aggregations and predicates, under LDP.

Join size estimation on private join attribute values under LDP is a complex task. A common LDP setting involves two kinds of participants: a large number of users on the client side and one untrusted aggregator on the server side. An LDP workflow can be broken down into two steps: each user locally perturbs its sensitive value with an LDP mechanism and sends the perturbed value to the aggregator; the server then aggregates the perturbed values for subsequent analysis. Join size estimation under LDP comes with several challenges. Challenge I: the join attribute values are sensitive and have a large domain. It is difficult to perturb the private join attribute values from two data sources while preserving the join size. Directly perturbing the join values with LDP mechanisms such as k-RR [6] and FLH [17] introduces significant noise. To tackle the challenge of large domains, previous approaches like Apple-HCMS [9] and RAPPOR [10] turn to probabilistic data structures, such as sketches and Bloom filters, for estimation. However, these structures reduce the domain at the cost of hash-collision errors. Challenge II: reducing the hash-collision error of probabilistic structures while preserving privacy. Without privacy constraints, the main idea for reducing hash-collision error is to compute the join sizes of high-frequency and low-frequency items separately, because collisions involving high-frequency items introduce larger errors. However, separating high-frequency and low-frequency items while satisfying LDP is not simple either, since whether a value is high-frequency is itself private.

Main idea. To tackle Challenge I, we propose LDPJoinSketch, which adapts the fast-AGMS sketch to an LDP version: the server constructs fast-AGMS sketches for join size estimation from the perturbed values collected from individual users on the client side. To tackle Challenge II, we propose LDPJoinSketch+, which reduces the hash-collision error by encoding low-frequency and high-frequency items differently while satisfying LDP.

Figure 1: Sketches-based join size estimation under LDP.

Fig. 1 shows the workflow of the two proposed methods. The workflow of LDPJoinSketch adheres to the common LDP pattern. In step ①, each user on the client side encodes and perturbs its sensitive join attribute value $d_i$ and transmits the processed value $P(E(d_i))$ to the server, where $P(E(d_i))$ denotes the encoding and perturbation of value $d_i$. In step ②, the server receives all perturbed values and constructs sketches $M_A$ and $M_B$ for attributes $A$ and $B$, respectively. Ultimately, the join size can be estimated from the product $M_A \times M_B$. The workflow of the two-phase algorithm LDPJoinSketch+ can be summarized as follows. In phase 1, the server constructs the LDPJoinSketches using steps ① and ② based on the values from some sampled users. In step ③, the server computes the frequent item set $FI$ based on the sketches and sends this set to the remaining users. In phase 2, for each attribute, the remaining users are divided into two groups: one for constructing a sketch for the high-frequency items in $FI$ and another for the low-frequency items. Step ④ encodes and perturbs each private value from the two groups based on the Frequency-Aware Perturbation (FAP) mechanism we propose, which encodes high-frequency and low-frequency items differently while satisfying LDP. Finally, in step ⑤ the server separately constructs sketches summarizing the high-frequency items and the low-frequency items. Take group $A_1$, which constructs the sketch $ML_A$ summarizing the low-frequency items of attribute $A$, as an example: the low-frequency items $d_j \notin FI$ are target values, and the high-frequency ones $d_j \in FI$ are non-target values. FAP encodes each target value in the same way as step ①, and encodes non-target values differently using a method denoted as $E_N$ in the figure. FAP ensures that the contributions of all non-target values can be removed from the sketch $ML_A$, thereby reducing hash collisions, since high-frequency and low-frequency items are summarized in separate sketches.

The main contributions of this work are as follows:

  • We present a sketch-based join size estimation method called LDPJoinSketch for data whose sensitive join values have large domains. We prove that it satisfies LDP and that its estimation error is bounded.

  • We develop a Frequency-Aware Perturbation (FAP) mechanism that distinguishes between high-frequency and low-frequency items while ensuring compliance with LDP. Building upon FAP, we propose an enhanced method named LDPJoinSketch+, which reduces hash-collision errors while upholding privacy preservation.

  • We conduct extensive experiments on synthetic and real-world datasets. The experimental results show that our methods outperform state-of-the-art LDP mechanisms.

II Related work

Join size estimation. The join size of two data streams is a crucial statistic in data stream analysis, representing the inner product of their frequency vectors. We mainly introduce the sketch-based join size estimation most relevant to this article; methods based on sampling and histograms have been well elaborated in previous works [18, 19, 20]. Sketch-based methods such as the AGMS sketch [21] and the fast-AGMS sketch [22] adopt compact probabilistic data structures to summarize data and approximately compute statistics [23] such as frequencies and inner products. To mitigate the error caused by hash collisions, studies such as Skimmed sketch [24], Red sketch [25], and JoinSketch [26] separate high-frequency and low-frequency items.

Local differential privacy (LDP) is a promising privacy-preserving framework widely used in the fields of private data release [27], data mining [28], machine learning [29], and social network analysis [30].

Since no method has been proposed for estimating join sizes or inner products under LDP, we instead focus on the relevant task of frequency estimation under LDP. A basic LDP mechanism, k-ary Randomized Response (k-RR) [6], perturbs the original value to any other value within the domain with the same probability. To handle large domains and reduce communication cost, Apple-HCMS [9] uses sketches to encode values before perturbation via the Hadamard transform. FLH [17] reduces the domain size by local hashing, sacrificing accuracy for computational gains. These methods can be employed to estimate join sizes by calculating the frequency of each candidate join value. However, they suffer from both cumulative errors and efficiency issues, particularly when the join attribute has a large domain. In contrast, our LDPJoinSketch, which is based on the fast-AGMS sketch, estimates the join size by computing the product of sketches rather than by accumulating the join size of each join value.

III Preliminaries

III-A AGMS sketch and Fast-AGMS sketch

AGMS sketch [21] uses a counter $M_A$ to summarize the values of a data stream $A$. For every item $d$ in $A$, it adds $\xi(d)$ to the counter $M_A$, so $M_A = \sum_{d\in A}\xi(d)$, where $\xi$ is a 4-wise independent hash function mapping $d$ to $\{+1,-1\}$. The join size of two data streams $A$ and $B$ can be estimated from sketches $M_A$ and $M_B$ constructed with the same hash function $\xi$: the estimate is $Est = M_A \cdot M_B$. The accuracy can be enhanced by computing the median of multiple independent estimators.
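To make the estimator concrete, the following minimal Python sketch builds a single AGMS counter per stream and multiplies them. The SHA-256-based sign function is only a stand-in for a true 4-wise independent hash family, and the toy streams are illustrative.

```python
import hashlib

def sign(d):
    """Map d to {+1, -1}; a stand-in for a 4-wise independent hash family."""
    b = hashlib.sha256(f"xi:{d}".encode()).digest()[0]
    return 1 if b % 2 == 0 else -1

def agms_counter(stream):
    """A single AGMS counter: M = sum of sign(d) over the stream."""
    return sum(sign(d) for d in stream)

# One estimator of |A join B|; in practice the median of several
# independent counters (each with its own sign hash) is reported.
A = [1, 2, 2, 3]
B = [2, 3, 3, 4]
est = agms_counter(A) * agms_counter(B)
```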

Fast-AGMS sketch [22] is proposed to improve the construction efficiency of the AGMS sketch. It adds a hash function $h$ to determine the location of each update, which avoids having each update touch every counter as AGMS does. A fast-AGMS sketch $M(k,m)$ is an array of counters with $k$ lines and $m$ columns. For each $j \in [k]$, two hash functions, $h_j$ and $\xi_j$, are employed for the $j$th line of the sketch. Here, $h_j$ is responsible for determining which counter to increment, while $\xi_j$ is a member of a family of 4-wise independent binary random variables uniformly distributed over $\{-1,+1\}$. For each value $d$, the counter at indices $[j, h_j(d)]$ is updated by $M[j, h_j(d)] \mathrel{+}= \xi_j(d)$. The join size $|A \Join B|$ can be estimated according to the following equation.

$$Est = M_A \cdot M_B = \mathop{median}_{i\in[1,k]}\Big\{\sum_{j=1}^{m} M_A[i,j] \times M_B[i,j]\Big\} \qquad (1)$$

where $M_A$ and $M_B$ are fast-AGMS sketches with parameters $(m,k)$ for $A$ and $B$, respectively. The estimation error is bounded as $\Pr[|M_A \cdot M_B - |A \Join B|| > \frac{1}{\sqrt{m}}\|A\|_1\|B\|_1] \leq \delta$, where $k = \log(1/\delta)$, and $\|A\|_1$, $\|B\|_1$ are the numbers of values of attributes $A$ and $B$.
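A minimal, non-private Python sketch of the fast-AGMS construction and the estimator in Eq. (1) may help fix ideas. The hash helpers are simple SHA-256 stand-ins for the 4-wise independent families assumed above, and they are reused by the later sketches in this paper.

```python
import hashlib
import numpy as np

def _hash(key, mod):
    """Deterministic hash of a string key into [0, mod)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % mod

def h(d, j, m):
    """Bucket hash h_j(d) -> [0, m-1] for line j."""
    return _hash(f"h:{j}:{d}", m)

def xi(d, j):
    """Sign hash xi_j(d) -> {-1, +1} (stand-in for 4-wise independence)."""
    return 1 if _hash(f"xi:{j}:{d}", 2) == 0 else -1

def fast_agms(stream, k, m):
    """Build a k x m fast-AGMS sketch of a stream of join values."""
    M = np.zeros((k, m))
    for d in stream:
        for j in range(k):
            M[j, h(d, j, m)] += xi(d, j)
    return M

def join_size_est(MA, MB):
    """Eq. (1): median over the k per-line inner products."""
    return float(np.median((MA * MB).sum(axis=1)))
```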

III-B Local differential privacy

Local differential privacy (LDP) [5] extends the notion of differential privacy (DP) [31] to scenarios where the aggregator cannot be trusted and each data owner locally perturbs its data before sharing it with the server.

Definition 1.

($\epsilon$-local differential privacy). A local randomized privacy algorithm $\mathcal{R}$ is $\epsilon$-locally differentially private ($\epsilon$-LDP), where $\epsilon \geq 0$, iff for all pairs of inputs $x, x' \in \mathcal{D}$, we have that

$$\forall y \in \mathcal{Y}: \Pr[\mathcal{R}(x) = y] \leq e^{\epsilon} \cdot \Pr[\mathcal{R}(x') = y] \qquad (2)$$

where $\mathcal{Y}$ denotes the set of all possible outputs, and the privacy budget $\epsilon$ measures the level of privacy protection.

III-C Hadamard Mechanism

Hadamard Mechanism [9] utilizes the Hadamard transform to improve efficiency. The Hadamard transform of a vector is obtained by multiplying it with the Hadamard matrix $H_m$. Here $H_m$ denotes the Hadamard matrix of order $m$, a special square matrix whose entries take only two values, $+1$ and $-1$. $H_m$ is defined recursively as $H_m = \begin{bmatrix} H_{m/2} & H_{m/2} \\ H_{m/2} & -H_{m/2} \end{bmatrix}$, with $H_1 = [1]$. The Hadamard transform spreads the information of a sparse vector (a 1 in a single location) across multiple dimensions, so that a single bit sampled from the resulting dense vector still carries a sufficiently strong signal about the original vector.
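A short Python sketch of the recursive construction follows, with a cache so repeated calls are cheap; the one-hot example illustrates how the transform densifies a sparse signal.

```python
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=None)
def hadamard(m):
    """Recursive Hadamard matrix H_m; m must be a power of two."""
    if m == 1:
        return np.array([[1]])
    H = hadamard(m // 2)
    return np.block([[H, H], [H, -H]])

# A one-hot vector spreads into a dense +/-1 vector, so one sampled
# coordinate still carries signal about the original non-zero position.
v = np.zeros(8)
v[3] = 1
w = v @ hadamard(8)   # w equals row 3 of H_8; every entry is +1 or -1
```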

IV LDPJoinSketch

We propose LDPJoinSketch, a locally differentially private sketch for join size estimation. The LDPJoinSketch protocol can be divided into client-side and server-side algorithms, whose goals are as follows. Client side: to perturb the private value and transmit the perturbed value along with the indices that determine which counter of the server-side sketch the perturbed value should be added to. Server side: to construct the sketch by adding each perturbed value to the counter with the given indices, and to compute the join size based on the constructed sketches.

IV-A Client-side of LDPJoinSketch

The goal of the client-side of LDPJoinSketch is to encode and perturb each private value, ensuring that the output satisfies LDP and is safe to be transmitted to the server.

Algorithm 1 Client-Side of LDPJoinSketch

Input: join value $d \in D$, privacy budget $\epsilon$, sketch parameters $(k, m)$, hash function pairs $\{(h_0,\xi_0),\dots,(h_{k-1},\xi_{k-1})\}$
Output: perturbed value $y$, indices $(j,l)$

1: Sample $j$ uniformly at random from $[k]$; sample $l$ uniformly at random from $[m]$.
2: Initialize a vector $v \leftarrow \{0\}^{1\times m}$
3: Encode: $v[h_j(d)] \leftarrow \xi_j(d)$
4: $w \leftarrow v \times H_m$ ▷ $H_m$: Hadamard matrix of order $m$
5: Sample $b \in \{-1,+1\}$, which is $-1$ with probability $\frac{1}{e^{\epsilon}+1}$.
6: Perturb: $y \leftarrow b \cdot w[l]$.
7: return: $y$, $(j, l)$

The pseudo-code of the client side is shown in Algorithm 1. Given a private join value $d$, the privacy budget $\epsilon$, and the sketch parameters $(k,m)$ indicating the numbers of lines and columns of the sketch on the server, the algorithm first samples a line index $j$ uniformly from the $k$ lines and a column index $l$ uniformly from the $m$ columns (line 1). It then initializes a vector $v$ of size $1\times m$ with all zeros and encodes $d$ by setting position $h_j(d)$ of $v$ to $\xi_j(d)$ (line 3). To reduce communication cost by sampling while maintaining a sufficiently strong signal about the original vector, we apply the Hadamard transform before sampling, as Apple-HCMS [9] does. The algorithm computes $w = v \times H_m$, which transforms $v$, whose only non-zero entry is $\xi_j(d)$, into a vector $w \in \{-\xi_j(d), \xi_j(d)\}^{1\times m}$. It then samples the bit $w[l]$ from the vector $w$, where $l \sim [m]$, and perturbs $w[l]$ by multiplying it by $-1$ with probability $\frac{1}{e^{\epsilon}+1}$ (lines 5-6). The client side of LDPJoinSketch is almost the same as that of Apple-HCMS [9]; the only difference is the encoding method in line 3. We encode each value based on the fast-AGMS sketch and set $v[h_j(d)] \leftarrow \xi_j(d)$, while Apple-HCMS encodes each item $d$ based on the Count Mean Sketch and sets $v[h_j(d)] \leftarrow 1$.
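A minimal Python sketch of Algorithm 1 follows, reusing the h, xi, and hadamard helpers defined above. Since v is one-hot, the sampled coordinate $w[l]$ can be read directly from the Hadamard matrix without forming $w$.

```python
import math
import random

def client_side(d, eps, k, m):
    """Sketch of Algorithm 1 (client side of LDPJoinSketch)."""
    j = random.randrange(k)          # line index (line 1)
    l = random.randrange(m)          # sampled column (line 1)
    # v has a single non-zero entry xi_j(d) at h_j(d) (line 3), so
    # w = v x H_m gives w[l] = H_m[h_j(d), l] * xi_j(d) (line 4).
    w_l = hadamard(m)[h(d, j, m), l] * xi(d, j)
    b = -1 if random.random() < 1.0 / (math.exp(eps) + 1) else 1
    return b * w_l, (j, l)           # perturb and report (lines 5-7)
```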

Theorem 1.

LDPJoinSketch satisfies $\epsilon$-LDP.

Proof.

Let $v$ and $v'$ be the encodings of $d$ and $d'$, respectively. According to Algorithm 1, the differences between $v$ and $v'$ lie in two bits, i.e., $v[h_j(d)] = \xi_j(d)$ and $v'[h_j(d')] = \xi_j(d')$. Let $J$ be the random variable selected uniformly from $[k]$, and $L$ be the random variable selected uniformly from $[m]$. Let $B$ be the random variable for the random bit $b$, where $\Pr[B=1] = \frac{e^{\epsilon}}{1+e^{\epsilon}}$ and $\Pr[B=-1] = \frac{1}{1+e^{\epsilon}}$. Let $w = v \times H_m$ and $w' = v' \times H_m$. We denote Algorithm 1 by $\mathcal{A}$ in the following.

$$\frac{\Pr[\mathcal{A}(d)=(y,j,l)]}{\Pr[\mathcal{A}(d')=(y,j,l)]} = \frac{\Pr[B\cdot w[l]=y \mid J=j, L=l]\,\Pr[J=j]\,\Pr[L=l]}{\Pr[B\cdot w'[l]=y \mid J=j, L=l]\,\Pr[J=j]\,\Pr[L=l]} = \frac{\Pr[B\cdot H_m[h_j(d),l]\cdot\xi_j(d)=y \mid J=j]}{\Pr[B\cdot H_m[h_j(d'),l]\cdot\xi_j(d')=y \mid J=j]} \qquad (3)$$

Since $\xi_j(d), \xi_j(d') \in \{-1,1\}$, the sampled Hadamard coefficients satisfy $H_m[h_j(d),l]\cdot\xi_j(d),\ H_m[h_j(d'),l]\cdot\xi_j(d') \in \{-1,1\}$. In addition, since $\Pr[B=1]=\frac{e^{\epsilon}}{1+e^{\epsilon}}$ and $\Pr[B=-1]=\frac{1}{1+e^{\epsilon}}$, the probabilities of obtaining the same output from the different inputs $d$ and $d'$ are close:

$$e^{-\epsilon} \leq \frac{\Pr[\mathcal{A}(d)=(y,j,l)]}{\Pr[\mathcal{A}(d')=(y,j,l)]} \leq e^{\epsilon} \qquad (4)$$

Thus, LDPJoinSketch satisfies $\epsilon$-LDP. ∎

IV-B Server-side of LDPJoinSketch

After receiving the perturbed values, the server has two tasks: (1) to construct a sketch for each join attribute, and (2) to compute the join size based on the sketches.

Algorithm 2 LDPJoinSketch-construction (PriSk)

Input: $\{(y^{(1)},j^{(1)},l^{(1)}), \dots, (y^{(n)},j^{(n)},l^{(n)})\}$, $\epsilon$, $(k, m)$
Output: private sketch $M$

1: Initialize $M \in \{0\}^{k\times m}$.
2: Set $c_{\epsilon} = \frac{e^{\epsilon}+1}{e^{\epsilon}-1}$
3: for each $i \in [n]$ do
4:     $M[j^{(i)}, l^{(i)}] \leftarrow M[j^{(i)}, l^{(i)}] + k \cdot c_{\epsilon} \cdot y^{(i)}$
5: end for
6: $M \leftarrow M \times H_m^T$
7: return: $M$

Construction of LDPJoinSketch. The pseudo-code of the construction algorithm is shown in Algorithm 2. For each input $(y^{(i)}, j^{(i)}, l^{(i)})$ from the client side, it multiplies $y^{(i)}$ by $k \cdot c_{\epsilon}$ and adds the result to the counter at indices $[j^{(i)}, l^{(i)}]$ (line 4). The scaling factor $k \cdot c_{\epsilon}$ (line 3) debiases the sketch, since both sampling the line index $j$ and the perturbation in Algorithm 1 introduce bias: $\mathbb{E}[B] = \frac{1}{c_{\epsilon}}$ and $\mathbb{E}[\mathbb{1}\{J=j\}] = 1/k$. Finally, it multiplies the sketch $M$ by the Hadamard matrix to transform the sketch back (line 6).
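The server side is only a few lines of Python. The following is a sketch under the same assumptions as before, consuming the (y, (j, l)) reports produced by client_side above and the hadamard helper from Section III-C's sketch.

```python
import math
import numpy as np

def build_sketch(reports, eps, k, m):
    """Sketch of Algorithm 2: aggregate debiased reports, transform back."""
    c_eps = (math.exp(eps) + 1) / (math.exp(eps) - 1)
    M = np.zeros((k, m))
    for y, (j, l) in reports:
        M[j, l] += k * c_eps * y     # debiased update (lines 2-4)
    return M @ hadamard(m).T         # inverse-transform the columns (line 6)
```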

Figure 2: Example of LDPJoinSketch
Example 1.

We use the example in Fig. 2 to demonstrate the workings of the client side and server side of LDPJoinSketch. Suppose $k=3$, $m=4$, and the private data on the client is $d$.

Client-side: Let $h_j(d) = 2$, where $j \in [k]$. The client encodes $d$ as $v = [0, 0, \xi_j(d), 0]$, where $v[h_j(d)] = \xi_j(d)$. It applies the Hadamard transform to convert the single non-zero signal $\xi_j(d)$ into the vector $w = v \times H_m = [\xi_j(d), \xi_j(d), -\xi_j(d), -\xi_j(d)]$, and perturbs a random bit $w[l]$ of $w$ as $y = b \cdot w[l]$, where $l \in [m]$. Finally, the client sends $(y, j, l)$ to the server.

Server-side: The server constructs the sketch by adding $y$ to the counter at indices $[j,l]$. It then multiplies the sketch by $H_m^T$ and the scale $c_{\epsilon} k$ to debias it. We prove below that the expected contribution of $d$ to $M[j, h_j(d)]$ is still $\xi_j(d)$.

Join size estimation. Based on the LDPJoinSketches $M_A$ and $M_B$ for attributes $A$ and $B$, the server estimates the join size:

$$Est(|A \Join B|) = M_A \cdot M_B = \mathop{median}_{j\in[1,k]}\Big\{\sum_{x=1}^{m} M_A[j,x]\, M_B[j,x]\Big\} \qquad (5)$$
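Putting the two sides together, a hypothetical end-to-end run might look as follows. The Zipf parameter and sample sizes are arbitrary choices for illustration, and client_side and build_sketch are the sketches given above.

```python
from collections import Counter
import numpy as np

eps, k, m = 4.0, 18, 1024
rng = np.random.default_rng(0)
A = rng.zipf(1.5, 100_000)    # synthetic join values of attribute A
B = rng.zipf(1.5, 100_000)    # synthetic join values of attribute B

MA = build_sketch([client_side(d, eps, k, m) for d in A], eps, k, m)
MB = build_sketch([client_side(d, eps, k, m) for d in B], eps, k, m)

est = float(np.median((MA * MB).sum(axis=1)))          # Eq. (5)

# Ground truth |A join B| = sum over d of f_A(d) * f_B(d), for comparison.
cA, cB = Counter(A.tolist()), Counter(B.tolist())
true = sum(f * cB.get(d, 0) for d, f in cA.items())
```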

IV-C Analysis of estimation error

We follow three steps to compute the error of join size estimation based on LDPJoinSketch. Step 1: we analyze the contribution of each value $d$ to the sketch. Step 2: we prove that each estimator $M_A[j]\cdot M_B[j]$ ($j \in [k]$) is an unbiased estimator of $|A \Join B|$, and we compute its variance. Step 3: we take the median of the $k$ estimators as the final estimate and calculate its error bound. For the convenience of the reader, Table I lists the symbols used in the following proofs.

TABLE I: Notations.
$d$: private join attribute value
$M$: the sketch
$(k,m)$: the numbers of lines and columns of a sketch
$h_j$: hash function for the $j$th line, $h_j(x) \rightarrow [0, m-1]$
$\xi_j$: 4-wise independent hash function, $\xi_j(x) \rightarrow \{-1,+1\}$
$f_A(d)$: true frequency of value $d$ of attribute $A$
$M(j, h_j(d))^{(i)}$: contribution of data $d^{(i)}$ to the counter $M[j, h_j(d)]$
$F_1(X)$: total frequency of attribute $X$, i.e., $\sum_{d\in X} f_X(d)$
$F_2(X)$: second frequency moment, $\sum_{d\in X} (f_X(d))^2$
$L, R$: random variables selected uniformly from $[m]$

Step 1. The contribution of each private value to the sketch.

Definition 2.

The contribution of the $i$th private data entry $d^{(i)}$ to the counter at indices $(j, h_j(d))$ of an LDPJoinSketch $M$ of size $(k,m)$ is written $M(j, h_j(d))^{(i)}$, where $M(j,h_j(d))^{(i)} = c_{\epsilon} k B\, \xi_j(d^{(i)})\, H_m[h_j(d^{(i)}), L]\, H_m[L, h_j(d)]\, \mathbb{1}\{J=j\}$. Here $c_{\epsilon} = \frac{e^{\epsilon}+1}{e^{\epsilon}-1}$, and $\mathbb{1}\{J=j\}$ is 0 when $J \neq j$. We have $J \sim U[k]$ and $L \sim U[m]$, where $X \sim U[n]$ denotes a variable chosen uniformly at random from $[n]$, and $B \in \{-1,+1\}$ equals 1 with probability $\frac{e^{\epsilon}}{1+e^{\epsilon}}$.

In Theorem 2, we give the expectation of $M(j, h_j(d))^{(i)}$.

Theorem 2.

The expected contribution of a value $d^{(i)}$ to $M[j, h_j(d)]$ is $\mathbb{E}[M(j,h_j(d))^{(i)}] = \xi_j(d)\cdot\mathbb{1}\{d^{(i)}=d\} + \frac{\xi_j(d^{(i)})}{m}\cdot\mathbb{1}\{d^{(i)}\neq d\}$.

Proof.

We analyze $\mathbb{E}[M(j,h_j(d))^{(i)}]$ in two situations.
(1) If $d^{(i)} = d$,
$$\mathbb{E}[M(j,h_j(d))^{(i)}] = \frac{1}{k}\,\mathbb{E}[M(j,h_j(d))^{(i)} \mid J=j] \qquad (6)$$
$$= c_{\epsilon}\cdot\mathbb{E}\big[H_m[h_j(d),L]\,\xi_j(d^{(i)})\,B\cdot H_m[L,h_j(d^{(i)})]\big] \qquad (7)$$
$$= c_{\epsilon}\cdot\mathbb{E}[\xi_j(d)B]\cdot\mathbb{E}\big[H_m[L,h_j(d)]^2\big] = \xi_j(d) \qquad (8)$$
(2) If $d^{(i)} \neq d$,
$$\mathbb{E}[M(j,h_j(d))^{(i)}] = \frac{1}{k}\,\mathbb{E}[M(j,h_j(d))^{(i)} \mid J=j] \qquad (9)$$
$$= \mathbb{E}[\xi_j(d^{(i)})\,c_{\epsilon}B]\cdot\mathbb{E}\big[H_m[h_j(d^{(i)}),L]\,H_m[L,h_j(d)]\big] \qquad (10)$$
$$= \xi_j(d^{(i)})\,\mathbb{E}\big[\mathbb{1}\{h_j(d)=h_j(d^{(i)})\}\big] = \xi_j(d^{(i)})\frac{1}{m} \qquad (11)$$
Merging the two situations above, we get
$$\mathbb{E}[M(j,h_j(d))^{(i)}] = \xi_j(d)\cdot\mathbb{1}\{d^{(i)}=d\} + \frac{\xi_j(d^{(i)})}{m}\cdot\mathbb{1}\{d^{(i)}\neq d\}.$$
The expectation is the same as that of the fast-AGMS sketch. ∎

Step 2: The expectation and variance of one estimator. We take the inner product of the corresponding lines of the two sketches, $M_A[j]\cdot M_B[j]$, as one estimator of $|A \Join B|$. We prove that $\mathbb{E}[M_A[j] M_B[j]] = |A \Join B|$ and that $\mathrm{Var}[M_A[j] M_B[j]]$ is bounded. Before computing $\mathbb{E}[M_A[j] M_B[j]]$, we first give a lemma for the product of two entries from the two datasets.

Lemma 1.

Let $d^{(i_A)}$ and $d^{(i_B)}$ be values of attributes $A$ and $B$. Given $h_j(d^{(i_A)}) = h_j(d^{(i_B)}) = x$, we have
$$\mathbb{E}[M_A(j,x)^{(i_A)} M_B(j,x)^{(i_B)}] = \begin{cases}1, & d^{(i_A)} = d^{(i_B)}\\ 0, & d^{(i_A)} \neq d^{(i_B)}\end{cases} \qquad (12)$$
Proof.

(1) If $d^{(i_A)} = d^{(i_B)}$, then $\mathbb{E}[M_A(j,x)^{(i_A)} M_B(j,x)^{(i_B)}] = \mathbb{E}[\xi_j(d^{(i_A)})\,\xi_j(d^{(i_A)})] = 1$. (2) If $d^{(i_A)} \neq d^{(i_B)}$, then $\mathbb{E}[M_A(j,x)^{(i_A)} M_B(j,x)^{(i_B)}] = \mathbb{E}[\xi_j(d^{(i_A)})\,\xi_j(d^{(i_B)})] = 0$. ∎

Theorem 3.

$\mathbb{E}[M_A[j]\cdot M_B[j]] = |A \Join B|$.

Proof.

$$\mathbb{E}[M_A[j]\cdot M_B[j]] = \mathbb{E}\Big[\sum_{x=1}^{m} M_A[j,x]\times M_B[j,x]\Big]$$
$$= \sum_{x=1}^{m}\mathbb{E}\Big[\sum_{d^{(i_A)}=d^{(i_B)}} M_A(j,x)^{(i_A)} M_B(j,x)^{(i_B)} + \sum_{d^{(i_A)}\neq d^{(i_B)}} M_A(j,x)^{(i_A)} M_B(j,x)^{(i_B)}\Big] = \sum_{x=1}^{m}\Big(\sum_{h_j(d)=x} f_A(d)\, f_B(d) + 0\Big) = |A \Join B|, \qquad (13)$$
Here, $f_A(d)$ denotes the frequency of $d$ in attribute $A$. ∎

Before computing $\mathrm{Var}[M_A[j] M_B[j]]$, we provide Lemma 2, which computes $\mathbb{E}[M(j,x)^{(i_1)} M(j,x)^{(i_2)}]$, and Lemma 3, which computes $\mathbb{E}[(M[j,x])^2]$ based on Lemma 2.

Lemma 2.

Given $h_j(d^{(i_1)}) = h_j(d^{(i_2)}) = x$, we have
$$\mathbb{E}[M(j,x)^{(i_1)} M(j,x)^{(i_2)}] = \begin{cases} c_{\epsilon}^2 k, & i_1 = i_2\\ 1, & i_1 \neq i_2,\ d^{(i_1)} = d^{(i_2)}\\ 0, & i_1 \neq i_2,\ d^{(i_1)} \neq d^{(i_2)}\end{cases} \qquad (14)$$
Proof.

According to the definition of $M(j,x)^{(i)}$, given two data entries $d^{(i_1)}$ and $d^{(i_2)}$ of the same attribute, we have $\mathbb{E}[M(j,x)^{(i_1)} M(j,x)^{(i_2)}] = c_{\epsilon}^2 k^2 \cdot \Pr[J^{(i_1)} = J^{(i_2)} = j]\cdot\mathbb{E}[\xi_j(d^{(i_1)})\,\xi_j(d^{(i_2)})\, B^{(i_1)} B^{(i_2)}]$, where $J^{(i_1)}, J^{(i_2)} \sim U[k]$ are the line indices chosen uniformly from $[k]$ for $d^{(i_1)}$ and $d^{(i_2)}$, and $B^{(i_1)}$, $B^{(i_2)}$ are the corresponding perturbation variables.
Case 1. If $i_1 = i_2 = i$, then $d^{(i_1)}$ and $d^{(i_2)}$ are the same data entry $d^{(i)}$, so $\Pr[J^{(i_1)} = J^{(i_2)} = j] = \frac{1}{k}$ and $B^{(i_1)} B^{(i_2)} = (B^{(i)})^2 = 1$. Therefore, $\mathbb{E}[M(j,x)^{(i_1)} M(j,x)^{(i_2)}] = c_{\epsilon}^2 k$.
Case 2. If $i_1 \neq i_2$ and $d^{(i_1)} = d^{(i_2)}$, then $J^{(i_1)}$ and $J^{(i_2)}$ are independent, so $\Pr[J^{(i_1)} = J^{(i_2)} = j] = \Pr[J^{(i_1)}=j]\,\Pr[J^{(i_2)}=j] = 1/k^2$. Similarly, $\mathbb{E}[B^{(i_1)} B^{(i_2)}] = \mathbb{E}[B^{(i_1)}]^2 = \frac{1}{c_{\epsilon}^2}$, and $\xi_j(d^{(i_1)})\,\xi_j(d^{(i_2)}) = 1$ since $d^{(i_1)} = d^{(i_2)}$. Therefore, $\mathbb{E}[M(j,x)^{(i_1)} M(j,x)^{(i_2)}] = 1$.
Case 3. If $i_1 \neq i_2$ and $d^{(i_1)} \neq d^{(i_2)}$, similarly to Case 2 we have $\Pr[J^{(i_1)} = J^{(i_2)} = j] = \frac{1}{k^2}$ and $\mathbb{E}[B^{(i_1)} B^{(i_2)}] = \frac{1}{c_{\epsilon}^2}$. The difference from Case 2 is that $\mathbb{E}[\xi_j(d^{(i_1)})\,\xi_j(d^{(i_2)})] = 0$ for $d^{(i_1)} \neq d^{(i_2)}$. ∎

Lemma 3.

The expectation of $(M[j,x])^2$ is
$$\mathbb{E}[(M[j,x])^2] = \sum_{h_j(d)=x}\big[(k c_{\epsilon}^2 - 1)\, f(d) + f(d)^2\big] + \sum_{\substack{d\neq d'\\ h_j(d)=h_j(d')=x}} \xi_j(d)\,\xi_j(d')$$
Proof.

Suppose $h_j(d^{(i_1)}) = h_j(d^{(i_2)}) = x$, where $d^{(i_1)}$ and $d^{(i_2)}$ are two data entries summarized by sketch $M$. Since $M[j,x] = \sum_{d^{(i)}\in D} M(j,h_j(d))^{(i)}$, based on Lemma 2 we have
$$\mathbb{E}[(M[j,x])^2] = \sum_{d^{(i_1)},d^{(i_2)}\in D} \mathbb{E}\big[M(j,x)^{(i_1)} M(j,x)^{(i_2)}\big]$$
$$= \sum_{d^{(i_1)},d^{(i_2)}\in D} \Big[c_{\epsilon}^2 k\,\mathbb{1}\{i_1=i_2\} + \mathbb{1}\{i_1\neq i_2,\ d^{(i_1)}=d^{(i_2)}\} + \xi_j(d^{(i_1)})\,\xi_j(d^{(i_2)})\,\mathbb{1}\{i_1\neq i_2,\ d^{(i_1)}\neq d^{(i_2)}\}\Big] \qquad (15)$$
$$= k c_{\epsilon}^2 \sum_{h_j(d)=x} f(d) + \sum_{h_j(d)=x} \big(f(d)^2 - f(d)\big) + \sum_{\substack{d\neq d'\\ h_j(d)=h_j(d')=x}} \xi_j(d)\,\xi_j(d') \qquad (16)$$
Based on this, we can compute $\mathrm{Var}[M_A[j,x] M_B[j,x]]$. ∎

We next define notations used in the subsequent theorems.

Definition 3.

Let $d$ be a value of attribute $X$ and $f_X(d)$ be the frequency (number of occurrences) of $d$ in the sequence of values of attribute $X$. We define (1) the total frequency of all the values of $X$ as $F_1(X) = \sum_{d\in X} f_X(d)$, and (2) the second frequency moment as $F_2(X) = \sum_{d\in X} (f_X(d))^2$.

Lemma 4.

$\mathrm{Var}[M_A[j,x] M_B[j,x]] \leq \frac{2}{m^2}\big(F_1(A) + \frac{k c_{\epsilon}^2-1}{2}\big)^2 \times \big(F_1(B) + \frac{k c_{\epsilon}^2-1}{2}\big)^2$.

Proof.

Let $X = M_A[j,x] M_B[j,x]$; then $\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$. (i) First, we compute $\mathbb{E}[X^2]$.
$$\mathbb{E}\big[M_A[j,x]^2\big] = \frac{1}{m}\Big[(k c_{\epsilon}^2-1)\, F_1(A) + F_2(A) + \sum_{\substack{d_A \neq d'_A\\ h_j(d_A)=h_j(d'_A)}} f_A(d_A)\, f_A(d'_A)\,\xi_j(d_A)\,\xi_j(d'_A)\Big] \qquad (17)$$
Based on Eq. (17), we have
$$\mathbb{E}[X^2] = \mathbb{E}\big[M_A[j,x]^2 M_B[j,x]^2\big] = \frac{1}{m^2}\big[(k c_{\epsilon}^2-1)\, F_1(A) + F_2(A)\big]\big[(k c_{\epsilon}^2-1)\, F_1(B) + F_2(B)\big] + \frac{4}{m^2}\sum_{d<d'} f_A(d)\, f_A(d')\, f_B(d)\, f_B(d') \qquad (18)$$
(ii) Second, $\mathbb{E}[X] = \frac{1}{m}\sum_{d} f_A(d)\, f_B(d)$.
Based on (i) and (ii), we have
$$\mathrm{Var}[M_A[j,x] M_B[j,x]] \leq \frac{1}{m^2}\big[(k c_{\epsilon}^2-1)\, F_1(A) + F_2(A)\big]\big[(k c_{\epsilon}^2-1)\, F_1(B) + F_2(B)\big] + \frac{1}{m^2} F_2(A)\, F_2(B) \qquad (19)$$
$$\leq \frac{2}{m^2}\Big(F_1(A) + \frac{k c_{\epsilon}^2-1}{2}\Big)^2\Big(F_1(B) + \frac{k c_{\epsilon}^2-1}{2}\Big)^2, \qquad (20)$$
where $F_2(A)$ and $F_2(B)$ are defined in Definition 3. ∎

With Lemma 4, we compute $\mathrm{Var}[M_A[j] M_B[j]]$ as follows.

Theorem 4.

The variance of $M_A[j] M_B[j]$ is bounded:
$\mathrm{Var}[M_A[j] M_B[j]] \leq \frac{2}{m}\big(F_1(A) + \frac{k c_{\epsilon}^2-1}{2}\big)^2 \times \big(F_1(B) + \frac{k c_{\epsilon}^2-1}{2}\big)^2$.

Proof.

The variance of each estimator $M_A[j] M_B[j]$ satisfies
$$\mathrm{Var}[M_A[j] M_B[j]] = \mathrm{Var}\Big[\sum_{x=1}^{m} M_A[j,x] M_B[j,x]\Big] \qquad (21)$$
$$\leq \frac{2}{m}\Big(F_1(A) + \frac{k c_{\epsilon}^2-1}{2}\Big)^2\Big(F_1(B) + \frac{k c_{\epsilon}^2-1}{2}\Big)^2 \qquad (22)$$
Thus, the variance of each estimator is bounded. ∎

Step 3. Error bound of LDPJoinSketch. The final estimate is computed as the median of the $k$ estimators $Est_j$, i.e., $Est = \mathop{median}_{j\in[k]}(Est_j)$. We use the following theorem to prove that the error of $Est$ is bounded.

Theorem 5.

Let $k = 4\log\frac{1}{\delta}$, $Est = \mathop{median}_{j\in[k]} Est_j$, and let the join size estimation error be $Er = Est - |A \Join B|$. Then
$$\Pr\Big[|Er| \geq \frac{4}{\sqrt{m}}\Big|F_1(A) + \frac{k c_{\epsilon}^2-1}{2}\Big|\,\Big|F_1(B) + \frac{k c_{\epsilon}^2-1}{2}\Big|\Big] \leq \delta.$$
Proof.

According to Chebyshev's inequality, for a random variable $X$, $\Pr[|X - \mathbb{E}[X]| \geq w] \leq \frac{\mathrm{Var}[X]}{w^2}$. We can thus bound the error of each estimator $Est_j = M_A[j] M_B[j]$:
$$\Pr\big[|Est_j - |A \Join B|| \geq w\big] \leq \frac{\mathrm{Var}[M_A[j] M_B[j]]}{w^2} \qquad (23)$$
Let $w = \sqrt{8\,\mathrm{Var}[M_A[j] M_B[j]]}$; we get
$$\Pr\big[|Est_j - |A \Join B|| \geq \sqrt{8\,\mathrm{Var}[M_A[j] M_B[j]]}\big] \leq \frac{1}{8} \qquad (24)$$
By an application of Chernoff bounds, with $k = 4\log\frac{1}{\delta}$, taking $Est = \mathop{median}_{j\in[k]}(Est_j)$ reduces the failure probability to $\delta$, so that $\Pr[|Er| \geq \frac{4}{\sqrt{m}}|F_1(A) + \frac{k c_{\epsilon}^2-1}{2}||F_1(B) + \frac{k c_{\epsilon}^2-1}{2}|] \leq \delta$. ∎

Based on Theorem 5, the error of LDPJoinSketch-based join size estimation is bounded. LDP adds an extra term to the error bound, but this does not destroy the utility of the sketch, since $\frac{k c_{\epsilon}^2-1}{2}$ is much smaller than $F_1(A)$ and $F_1(B)$ in practice.

V LDPJoinSketch+

To reduce hash collisions while preserving privacy, we present a novel two-phase framework called LDPJoinSketch+. This framework builds upon LDPJoinSketch and improves its accuracy by specifically addressing hash collisions (Section V-A). In LDPJoinSketch+, we design a Frequency-Aware Perturbation (FAP) mechanism for the client side (Section V-B). Finally, we show how to estimate the join size based on LDPJoinSketch+ (Section V-C).

V-A The framework of LDPJoinSketch+

LDPJoinSketch+ aims to reduce error caused by hash-collisions under LDP. It has a two-phase framework. In phase 1, we identify candidate frequent join values from each table and communicate these candidates to the clients. In phase 2, each client distinguishes whether its value is high-frequency or not, and encodes high-frequency and low-frequency values in distinct manners, all while ensuring compliance with LDP.

The pseudo-code in Algorithm 3 shows the framework of LDPJoinSketch+, which has two phases. In phase 1, we find the frequent item set $FI$ based on the LDPJoinSketches $M_A$ and $M_B$ constructed from sampled users for attributes $A$ and $B$ (Phase 1, lines 1-4). In phase 2, to avoid splitting the privacy budget, we divide the users into two groups (Phase 2, line 1): one for the join size estimation of low-frequency values and another for high-frequency values. We can use the whole privacy budget for each group according to the composition theorem of LDP. For ease of description, we encapsulate the process of perturbing each data entry with FAP and aggregating all perturbed data into a sketch as a function called $sk$. Based on $sk$, we construct the sketches $ML_A$ and $ML_B$ summarizing the low-frequency items of attributes $A$ and $B$ from the data of groups $A_1$ and $B_1$ (Phase 2, line 2); the sketches $MH_A$ and $MH_B$ summarizing the high-frequency items are constructed similarly (Phase 2, line 3). The algorithm then computes the join size estimates $LEst$ for low-frequency values and $HEst$ for high-frequency values (lines 4-5). The final result is the sum of the scaled $LEst$ and $HEst$ (Phase 2, line 6): because $LEst$ is the estimated join size of the data from $A_1$ and $B_1$ only, we scale it by $\frac{|A||B|}{|A_1||B_1|}$ to estimate the corresponding part of $|A \Join B|$, and likewise for $HEst$.

Algorithm 3 LDPJoinSketch+

Phase 1: Find frequent join values

1: $S_A, S_B \leftarrow$ sample clients from $A$ and $B$, respectively.
2: Clients: Perturb data from $S_A, S_B$ with Algorithm 1.
3: Server: Construct LDPJoinSketches $M_A$ and $M_B$.
4: Frequent item set $FI \leftarrow FreqItems(M_A, M_B)$.

Phase 2: Join size estimation

1: Groups $A_1, A_2 \leftarrow A$; groups $B_1, B_2 \leftarrow B$.
2: $ML_A \leftarrow sk(A_1, L, \epsilon, FI)$; $ML_B \leftarrow sk(B_1, L, \epsilon, FI)$.
3: $MH_A \leftarrow sk(A_2, H, \epsilon, FI)$; $MH_B \leftarrow sk(B_2, H, \epsilon, FI)$.
4: $LEst \leftarrow JoinEst(ML_A, ML_B, mode=L)$.
5: $HEst \leftarrow JoinEst(MH_A, MH_B, mode=H)$.
6: return: $\frac{|A||B|}{|A_1||B_1|} LEst + \frac{|A||B|}{|A_2||B_2|} HEst$

Func $sk(D, mode, \epsilon, FI)$

1: for $d^{(i)} \in D$ do
2:     $(y^{(i)}, j^{(i)}, l^{(i)}) \leftarrow FAP(d^{(i)}, mode, \epsilon, FI)$
3: end for
4: $M \leftarrow PriSk\big((y^{(1)},j^{(1)},l^{(1)}), \dots, (y^{(n)},j^{(n)},l^{(n)}), \epsilon, k, m\big)$
5: return $M$

We introduce the FAP mechanism and how to estimate the join size based on LDPJoinSketch+ in the subsequent sections.

V-B Frequency-Aware Perturbation

Phase 2 of LDPJoinSketch+ aims to improve accuracy by estimating the join size of high-frequency and low-frequency items separately. To achieve this without privacy leaks, we propose a frequency-aware perturbation (FAP) mechanism.

Since phase 1 of LDPJoinSketch+ obtains the frequent items, each client can distinguish whether its value is high-frequency and handle high-frequency and low-frequency items differently. For ease of explanation, we call a value that has the same frequency property as the estimation target a "target value", and a value that does not a "non-target value". For instance, when constructing sketches for estimating the join size of high-frequency items, the high-frequency values are target values, while the low-frequency values are non-target values. FAP has two goals: (1) the server cannot infer from the perturbed value whether the true value is high-frequency, and (2) the contribution of non-target values to the sketches can be removed.

For the first goal, FAP encodes target values in the same way as LDPJoinSketch does, while encoding non-target values independently of their true values. The encoded values of both target and non-target values are then perturbed by randomized response, as in LDPJoinSketch, before being sent to the server. Thus, privacy is still preserved.

For the second goal, non-target values are encoded randomly, resulting in their impact spreading uniformly across each cell of the sketch. Consequently, the influence of non-target values can be effectively removed from the target sketch, given that we know the total number of non-target values. This improves the accuracy of target estimation.

Algorithm 4 Frequency-Aware Perturbation (FAP)

Input: $d \in D$, mode, $\epsilon$, $FI$
Output: perturbed value $y$, indices $(j, l)$

1: if (mode == H) == ($d \notin FI$) then ▷ Non-target value
2:     Sample $j$ uniformly at random from $[k]$.
3:     Sample $l, r$ uniformly at random from $[m]$.
4:     Initialize a vector $v \leftarrow \{0\}^{1\times m}$
5:     Set $v[r] \leftarrow 1$
6:     Transform $w \leftarrow v \times H_m$
7:     Sample $b \in \{-1,+1\}$, where $\Pr[b=-1] = \frac{1}{e^{\epsilon}+1}$.
8:     $y \leftarrow b \cdot w[l]$.
9: else ▷ Target value
10:     $y, j, l \leftarrow$ LDPJoinSketch-client($d$, $\epsilon$, $m$, $k$)
11: end if
12: return: $y, j, l$

The pseudo-code for the FAP algorithm is presented in Algorithm 4. The parameter $mode$ signifies the encoding mode for the data. When $mode = H$, the target values are the high-frequency ones, and low-frequency items are encoded randomly. Conversely, when $mode = L$, the target values are the low-frequency ones. Therefore, when the conditions $mode = H$ and $d \notin FI$ are either both true or both false, the algorithm employs random encoding for $d$; otherwise, it encodes and perturbs $d$ using the client side of LDPJoinSketch.

The sole distinction between the encoding methods for target and non-target values lies in line 5: FAP encodes a non-target value $d$ randomly via $v[r] \leftarrow 1$, where $r$ is chosen uniformly at random from $[m]$, but encodes a target value $d$ via $v[h_j(d)] \leftarrow \xi_j(d)$ according to the client side of LDPJoinSketch (line 10).
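A compact Python sketch of Algorithm 4, under the same assumptions and helpers as the earlier client-side sketch; FI is the frequent item set from phase 1, and mode is "H" or "L".

```python
import math
import random

def fap_client(d, mode, eps, k, m, FI):
    """Sketch of Algorithm 4 (Frequency-Aware Perturbation)."""
    j = random.randrange(k)
    l = random.randrange(m)
    if (mode == "H") == (d not in FI):
        # Non-target value: encode position r chosen independently of d,
        # so its mass spreads uniformly over the sketch (see Theorem 8).
        r = random.randrange(m)
        w_l = hadamard(m)[r, l]              # w[l] for v with v[r] = 1
    else:
        # Target value: encode exactly as the LDPJoinSketch client does.
        w_l = hadamard(m)[h(d, j, m), l] * xi(d, j)
    b = -1 if random.random() < 1.0 / (math.exp(eps) + 1) else 1
    return b * w_l, (j, l)
```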

Example 2.

We use the example shown in Fig. 3 to illustrate the process of FAP. Since a target value is encoded and perturbed exactly as in LDPJoinSketch, we only show how a non-target value is handled. The parameters in this example are the same as those in Example 1. A significant distinction from LDPJoinSketch is that FAP employs a randomly chosen index $r \in [m]$ in place of $h_j(d)$. Consequently, the encoding of a non-target value $d$ becomes independent of its true value, ensuring that the impact of non-target values is evenly distributed across the cells of the sketch.

Figure 3: Example of LDPJoinSketch+
Theorem 6.

Algorithm FAP satisfies $\epsilon$-LDP.

Proof.

We have proved that LDPJoinSketch satisfies $\epsilon$-LDP. We only need to prove that the server still cannot distinguish the outputs of a non-target value $d$ and a target value $d'$.

Suppose the encodings of $d$ and $d'$ are $v$ and $v'$, respectively. The differences between $v$ and $v'$ lie in two bits, i.e., $v[r] = 1$ and $v'[h_j(d')] = \xi_j(d')$. Let $J$ be the random variable selected uniformly from $[k]$, $L$ be the random variable selected uniformly from $[m]$, and $B$ be the random variable for the random bit $b$. Here, $\Pr[B=1] = \frac{e^{\epsilon}}{1+e^{\epsilon}}$ and $\Pr[B=-1] = \frac{1}{1+e^{\epsilon}}$.

$$\frac{\Pr[\mathcal{A}(d)=(y,j,l)]}{\Pr[\mathcal{A}(d')=(y,j,l)]} = \frac{\Pr[B\,H_m[l,r] = y \mid J=j]}{\Pr[B\,H_m[l,h_j(d')]\,\xi_j(d') = y \mid J=j]}$$

Since both $H_m[l,r]$ and $H_m[l,h_j(d')]\,\xi_j(d')$ lie in $\{-1,1\}$, the probabilities of obtaining the same output from the different inputs $d$ and $d'$ are close, as demonstrated in the following equation.

$$e^{-\epsilon} \leq \frac{\Pr[\mathcal{A}(d)=(y,j,l)]}{\Pr[\mathcal{A}(d')=(y,j,l)]} \leq e^{\epsilon} \qquad (25)$$

Thus, algorithm FAP, denoted $\mathcal{A}$, satisfies $\epsilon$-LDP. ∎

Each client perturbs its value with FAP. The server receives the perturbed values from the two groups and constructs sketches for the join size estimation of high-frequency and low-frequency values, respectively. We show how to compute the join size based on LDPJoinSketch+ in the next part.

V-C The server-side of LDPJoinSketch+

The server-side of LDPJoinSketch+ has different aims in the two phases. Phase 1 aims to find the frequent items. Phase 2 aims to construct sketches summarizing high-frequency and low-frequency values separately and estimate the join size.

Phase 1: Frequency estimation based on LDPJoinSketch.
We use the following Theorem 7 to prove that LDPJoinSketch provides unbiased frequency estimates.

Theorem 7.

Given an LDPJoinSketch $M$ summarizing the values of attribute $A$, the frequency of a value $d \in A$ can be estimated as $\tilde{f}(d) = \mathop{mean}_{j\in[1,k]} M[j, h_j(d)]\,\xi_j(d)$. Here, $\tilde{f}(d)$ is an unbiased estimate of $f(d)$, i.e., $\mathbb{E}[\tilde{f}(d)] = f(d)$.

Proof.

According to Theorem 2, we have $\mathbb{E}[M(j,h_j(d))^{(i)}] = \xi_j(d)$ for $d^{(i)} = d$, and $\mathbb{E}[M(j,h_j(d))^{(i)}] = \frac{\xi_j(d^{(i)})}{m}$ for $d^{(i)} \neq d$. Hence
$$\mathbb{E}[\tilde{f}(d)] = \frac{1}{k}\sum_{j=1}^{k}\sum_{i=1}^{n}\mathbb{E}\big[M(j,h_j(d))^{(i)}\,\xi_j(d)\big] = \frac{1}{k}\sum_{j=1}^{k}\Big[f(d)\,\xi_j(d)\,\xi_j(d) + \sum_{\substack{d'\neq d\\ h_j(d)=h_j(d')}} f(d')\,\xi_j(d')\,\xi_j(d)\Big] = f(d)$$
LDPJoinSketch thus provides unbiased frequency estimates. ∎

As LDPJoinSketch provides unbiased frequency estimates, given a proper threshold $\theta$ to separate high-frequency and low-frequency values, we can find the frequent item sets $FI_A = \{d_i \in A \mid \tilde{f}(d_i) > \theta|A|\}$ and $FI_B = \{d_i \in B \mid \tilde{f}(d_i) > \theta|B|\}$ for attributes $A$ and $B$, where $|A|$ and $|B|$ represent the total frequencies of items in attributes $A$ and $B$, respectively. We let the final frequent item set be the union of $FI_A$ and $FI_B$, i.e., $FI = FI_A \cup FI_B$.
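In Python, the phase-1 server side might look as follows. This is a sketch reusing the helpers above; the candidates argument (the set of values tested against the threshold) is an assumption of this illustration, since the enumeration of candidate values is not fixed here.

```python
import numpy as np

def freq_est(M, d, k, m):
    """Unbiased frequency estimate from an LDPJoinSketch (Theorem 7)."""
    return float(np.mean([M[j, h(d, j, m)] * xi(d, j) for j in range(k)]))

def frequent_items(MA, MB, candidates, nA, nB, theta, k, m):
    """FI = FI_A union FI_B; nA and nB are the total frequencies |A|, |B|."""
    FI_A = {d for d in candidates if freq_est(MA, d, k, m) > theta * nA}
    FI_B = {d for d in candidates if freq_est(MB, d, k, m) > theta * nB}
    return FI_A | FI_B
```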

Phase 2: Join size estimation.

The server side of LDPJoinSketch+ constructs the sketches $MH_A, ML_A$ for attribute $A$ and $MH_B, ML_B$ for attribute $B$ in the same way as LDPJoinSketch, so we only discuss how to estimate the join size based on LDPJoinSketch+.

LDPJoinSketch+ (Algorithm 3) calls $JoinEst$ to separately compute the join size estimates for high-frequency and low-frequency values. The pseudo-code of $JoinEst$ is presented in Algorithm 5. The algorithm first calculates the total frequencies of the elements belonging to the high-frequency sets of attributes $A$ and $B$ (lines 1-3). The parameter $mode$ indicates the target values of the sketches: when $mode = L$, the sketches $M_A$ and $M_B$ summarize low-frequency values, and the algorithm removes the contribution of high-frequency values from the sketches (lines 5-8); conversely, when $mode = H$, it removes the contribution of low-frequency values (lines 9-12).

Algorithm 5 JoinEst

Input: $M_A$, $M_B$, mode
Output: join size estimation $Est$

1: for $d \in FI$ do ▷ $FI$ is the frequent item set from phase 1.
2:     $HighFreq_A \mathrel{+}= \tilde{f}_A(d) \cdot \frac{|A|}{|S_A|}$
3:     $HighFreq_B \mathrel{+}= \tilde{f}_B(d) \cdot \frac{|B|}{|S_B|}$
4: end for
5: if mode == L then
6:     $M_A \leftarrow M_A - \{\frac{HighFreq_A}{m}\}^{k\times m}$
7:     $M_B \leftarrow M_B - \{\frac{HighFreq_B}{m}\}^{k\times m}$
8:     $Est = M_A \cdot M_B$.
9: else ▷ mode == H
10:     $M_A \leftarrow M_A - \{\frac{|A| - HighFreq_A}{m}\}^{k\times m}$
11:     $M_B \leftarrow M_B - \{\frac{|B| - HighFreq_B}{m}\}^{k\times m}$
12:     $Est = M_A \cdot M_B$.
13: end if
14: return: $Est$

The following theorem computes the contribution of non-target values to sketch MM.

Theorem 8.

The contribution of non-target values to $M[j,x]$ satisfies $\mathbb{E}[\sum_{d^{(i)}\in NT} M(j,x)^{(i)}] = |NT|/m$, where $|NT|$ is the total frequency of all the non-target values.

Proof.

The expected contribution of a non-target value $d^{(i)}$ to $M[j,x]$ is
$$\mathbb{E}[M(j,x)^{(i)}] = \frac{1}{k}\,\mathbb{E}[M(j,x)^{(i)} \mid J=j] = \mathbb{E}\big[c_{\epsilon}\, H_m[x,L]\, H_m[L,R]\, B\big] = \mathbb{E}\big[\mathbb{1}\{x=R\}\big] = \frac{1}{m}, \qquad (26)$$
where $L, R$ are random variables selected uniformly from $[m]$. Therefore, $\mathbb{E}[\sum_{d^{(i)}\in NT} M(j,x)^{(i)}] = |NT|/m$. ∎

With the above theorem, we can remove the contribution of non-target values from the sketches; for example, $|NT|$ is $HighFreq_A$ for sketch $M_A$ when $mode = L$ (line 6). In this way, we separately estimate the join size of high-frequency and low-frequency items.
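A Python sketch of Algorithm 5's core step, under the assumption that the frequency estimates and group sizes are already available; subtracting the constant |NT|/m from every cell implements the correction of Theorem 8.

```python
import numpy as np

def join_est(MA, MB, nA, nB, high_A, high_B, mode, m):
    """Sketch of Algorithm 5: subtract the non-target mass, then estimate."""
    ntA = high_A if mode == "L" else nA - high_A   # |NT| for attribute A
    ntB = high_B if mode == "L" else nB - high_B   # |NT| for attribute B
    MA = MA - ntA / m                              # lines 6-7 / 10-11
    MB = MB - ntB / m
    return float(np.median((MA * MB).sum(axis=1)))
```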

VI Extension to multi-way Joins

Taking inspiration from COMPASS [1], which utilizes multi-dimensional fast-AGMS sketches to facilitate multi-way joins, we illustrate in Fig. 4 how our LDPJoinSketch can be extended to support multi-way joins.

Figure 4: LDPJoinSketch for multi-way join.

Take $Q = T_1(A) \Join T_2(A,B) \Join T_3(B)$ as an example of a multi-way join. In this scenario, each join attribute is equipped with a unique pair of hash functions $h$ and $\xi$. COMPASS constructs fast-AGMS sketches $M_1$ and $M_3$ for attributes $T_1.A$ and $T_3.B$, respectively. For $T_2$, it constructs a two-dimensional matrix $M_2$ of shape $(m_1 \times m_2)$. For every tuple $t \in T_2$ with $t.A = a$ and $t.B = b$, the counter at indices $[h_A(a), h_B(b)]$ is incremented by $\xi_A(a)\,\xi_B(b)$. The size of $Q$ can be estimated as $\sum_{l_1\in[m_1],\, l_2\in[m_2]} M_1[l_1]\cdot M_2[l_1,l_2]\cdot M_3[l_2]$. The accuracy can be improved by computing the median of $k$ independent estimators like the one in this example.

Our LDPJoinSketch can support multi-way joins in a similar way. We can directly construct LDPJoinSketches $\tilde{M}_1$ and $\tilde{M}_3$ for $T_1$ and $T_3$, which each contain only one join attribute, so we only discuss how to construct a sketch $M_2$ for $T_2$ with two join attributes. Assuming each attribute has one pair of hash functions $h$ and $\xi$, for a tuple $t \in T_2$ with $t.A = a$ and $t.B = b$, we encode $t$ using the Hadamard matrices $H_{m_1}$ and $H_{m_2}$. We extract the $h_A(a)$th line of $H_{m_1}$ and the $h_B(b)$th column of $H_{m_2}$ (the pink cells in Fig. 4), and randomly select two indices $l_1 \in [m_1]$ and $l_2 \in [m_2]$. The client side encodes $t$ as $H_{m_1}[h_A(a), l_1] \times \xi_A(a)\,\xi_B(b) \times H_{m_2}[l_2, h_B(b)]$, and then perturbs the encoded value by multiplying it by $-1$ with probability $\frac{1}{e^{\epsilon}+1}$. The perturbed value $y$ is then sent to the server to update the counter at indices $[l_1, l_2]$. The server constructs the sketch from the indices and perturbed values, and restores it by $\tilde{M} = H_{m_1}^T \cdot M \cdot H_{m_2}^T$. A scale factor $c_{\epsilon}$ is applied to debias the sketch, giving $\tilde{M} \leftarrow c_{\epsilon}\tilde{M}$. Finally, the server computes the estimation:

$$Est = \sum_{l_1\in[m_1],\, l_2\in[m_2]} \tilde{M}_1[l_1]\cdot \tilde{M}_2[l_1,l_2]\cdot \tilde{M}_3[l_2] \qquad (27)$$
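A hypothetical client-side encoding for a tuple of $T_2$, written against the helpers defined earlier; the line indices 0 and 1 of the hash helpers stand in for the per-attribute pairs $(h_A, \xi_A)$ and $(h_B, \xi_B)$, which is an assumption of this sketch.

```python
import math
import random

def client_two_attrs(a, b_val, eps, m1, m2):
    """Sketch of the two-attribute encoding for T2 (illustrative only)."""
    l1 = random.randrange(m1)
    l2 = random.randrange(m2)
    # H_{m1}[h_A(a), l1] * xi_A(a) * xi_B(b) * H_{m2}[l2, h_B(b)]
    enc = (hadamard(m1)[h(a, 0, m1), l1] * xi(a, 0) *
           xi(b_val, 1) * hadamard(m2)[l2, h(b_val, 1, m2)])
    flip = -1 if random.random() < 1.0 / (math.exp(eps) + 1) else 1
    return flip * enc, (l1, l2)   # the server adds this to counter [l1, l2]
```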

Discussion. We give a solution for simple chain multi-join operations. For chain multi-join queries on tables with at most two join attributes per table, the computational complexity of our method is $O(m^3)$ if each two-dimensional sketch has shape $(m \times m)$; this is the complexity of matrix multiplication, which can be further reduced to $O(m^{2.371552})$ [32]. The complexity increases for scenarios involving star join and cyclic join operations. While our method adeptly handles uncomplicated cyclic joins, such as $T_1(A,B) \Join T_2(B,C) \Join T_3(C,A)$, addressing general multi-way joins poses challenges due to intricate join graphs and the need to decompose the privacy budget among multiple join attributes. These challenges exceed the scope of this paper, and we defer their exploration to future research.

VII Experiments

In this section we design experiments to evaluate the performance of our methods on synthetic and real datasets.

VII-A Experimental Setup

Hardware. We implement all the algorithms on a machine with 256 GB RAM running Ubuntu 20.04.1 with Python 3.9.

Queries. We consider join queries of the following form:
Q: SELECT COUNT(*) FROM $T_1$ JOIN $T_2$ ON $T_1.A = T_2.B$, where $A$ and $B$ are the private join attributes of tables $T_1$ and $T_2$.

Datasets. We generate one-dimensional synthetic datasets following Zipf and Gaussian distributions, respectively; treating such synthetic data as join attribute values is a common setting in previous works [24, 26]. We also conduct experiments on four real-world datasets.

(1) Zipf datasets: We generate several datasets of size $N$ following the Zipf distribution with different skewness parameters, whose probability mass function is $f(x \mid \alpha, N) = \frac{1/x^{\alpha}}{\sum_{n=1}^{N}(1/n^{\alpha})}$, where $\alpha$ is the skewness parameter and $x$ is the rank of the item.

(2) Gaussian dataset: We also generate a dataset following the Gaussian distribution, whose probability density function is $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, where $\mu$ is the mean of the distribution and $\sigma$ is the standard deviation. (A generation sketch for both synthetic datasets follows this list.)

(3) TPC-DS dataset (https://www.tpc.org/tpcds/): The TPC Benchmark™ DS is a benchmark for measuring the performance of decision support systems. We extract the store sales data for our experiments.

(4) MovieLens dataset (https://grouplens.org/datasets/movielens/): The MovieLens dataset is commonly used in recommender systems and collaborative filtering research, and contains movie ratings and user information.

(5) Twitter ego-network dataset (https://snap.stanford.edu/data/ego-Twitter.html): This real-world dataset consists of circles from Twitter.

(6) Facebook ego-network dataset (https://snap.stanford.edu/data/ego-Facebook.html): Facebook data collected from survey participants using the Facebook app.
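A minimal sketch of how such synthetic join-attribute values can be drawn is given below; the domain size, skewness, and Gaussian parameters shown are illustrative assumptions rather than the exact generation settings.

import numpy as np

rng = np.random.default_rng(0)

def sample_zipf(alpha, domain, size):
    # Truncated Zipf over ranks 1..domain, matching the PMF above.
    ranks = np.arange(1, domain + 1)
    p = ranks ** (-float(alpha))
    p /= p.sum()
    return rng.choice(ranks, size=size, p=p)

def sample_gaussian(mu, sigma, size):
    # Gaussian values rounded to integers to form a discrete join domain.
    return np.rint(rng.normal(mu, sigma, size)).astype(int)

zipf_values = sample_zipf(alpha=1.5, domain=100_000, size=1_000_000)
gauss_values = sample_gaussian(mu=0, sigma=10_000, size=1_000_000)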

Table II shows the domain and data size of all datasets.

TABLE II: Information of Datasets.

Dataset Name | Domain | Size
Zipf | 4,377–2,816,390 | 40,000,000
Gaussian | 75,949 | 40,000,000
MovieLens | 83,239 | 67,664,324
TPC-DS | 18,000 | 5,760,808
Twitter | 77,072 | 4,841,532
Facebook | 4,039 | 352,936

Competitors

(1) k-RR [6]. We adopt it to perturb the join values and compute the join size from the calibrated frequency vectors (a textbook sketch of the k-RR perturbation follows this list).

(2) Apple-HCMS [9]. Apple’s private Hadamard Count Mean Sketch, which also forms a $k\times m$ sketch.

(3) FLH [17]. The heuristic fast variant of Optimal Local Hashing (OLH).

(4) Fast-AGMS [22]. It estimates the join size without privacy protection. We refer to this method as FAGMS in the figures.
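For reference, the following is a textbook sketch of the k-ary randomized response perturbation underlying k-RR, assuming an integer domain of size $d$; the function name is illustrative.

import numpy as np

def krr_perturb(v, d, eps, rng):
    # Report the true value v in [0, d) with probability e^eps/(e^eps + d - 1),
    # otherwise report a uniformly random other value in the domain.
    p_true = np.exp(eps) / (np.exp(eps) + d - 1)
    if rng.random() < p_true:
        return v
    other = rng.integers(d - 1)
    return other if other < v else other + 1  # skip v itself

rng = np.random.default_rng(0)
report = krr_perturb(v=7, d=1000, eps=4.0, rng=rng)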

Error Metrics

(1) Absolute Error (AE): $\frac{1}{t}\sum|\mathcal{J}-\hat{\mathcal{J}}|$, where $\mathcal{J}$ and $\hat{\mathcal{J}}$ denote the true and estimated join sizes, and $t$ is the number of test rounds.

(2) Relative Error (RE): $\frac{1}{t}\sum|\mathcal{J}-\hat{\mathcal{J}}|/\mathcal{J}$, with the same parameters as in AE.

(3) Mean Squared Error (MSE): the metric for frequency estimation, $MSE=\frac{1}{n}\sum_{d\in D}(f(d)-\tilde{f}(d))^{2}$, where $f(d)$ and $\tilde{f}(d)$ are the true and estimated frequencies of $d$, and $n$ is the number of distinct values in $D$.
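The following minimal Python implementations of the three metrics (assuming arrays of per-round true and estimated values) match the definitions above:

import numpy as np

def absolute_error(true_sizes, est_sizes):
    # AE: mean |J - J_hat| over t test rounds.
    return np.mean(np.abs(np.asarray(true_sizes) - np.asarray(est_sizes)))

def relative_error(true_sizes, est_sizes):
    # RE: mean |J - J_hat| / J over t test rounds.
    j, j_hat = np.asarray(true_sizes), np.asarray(est_sizes)
    return np.mean(np.abs(j - j_hat) / j)

def mse(true_freqs, est_freqs):
    # MSE over the n distinct values of the domain D.
    f, f_tilde = np.asarray(true_freqs), np.asarray(est_freqs)
    return np.mean((f - f_tilde) ** 2)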

Parameters

$\epsilon$: The privacy budget of differential privacy.

$\alpha$: The skewness parameter of the synthetic Zipf datasets.

$(k,m)$: The number of rows and columns of a sketch.

$r$: The sampling rate of LDPJoinSketch+ used in phase 1.

$\theta$: The threshold for separating high-frequency items from low-frequency items.

Figure 5: The Accuracy of Join Size Estimation.

VII-B Accuracy

We test the accuracy of different methods on all six datasets; the experimental results are shown in Fig. 5. We fix the privacy parameter $\epsilon=4$ and set sketch parameters ($k=18$, $m=1024$) for the sketch-based methods. Our proposed methods outperform the others in most cases, and their relative errors (RE) are sufficiently small, close to those of the non-private FAGMS, which indicates that they provide rigorous privacy protection while maintaining excellent accuracy. Methods like k-RR and FLH introduce significant noise into the join values and consequently yield extremely poor results, and Apple-HCMS does not account for hash collisions. We can also observe that LDPJoinSketch+ achieves better utility on the different datasets, because FAP further reduces the hash-collision error. Our method demonstrates clear advantages for large data volumes with large domains, but not for small datasets with small domains, because most LDP mechanisms are unsuitable for small data: LDP requires a large amount of data to ensure that the noise introduced by perturbation follows the desired distribution.

VII-C Space and Communication cost

Space Cost. We test the accuracy of the sketch-based methods under similar space costs on the Zipf($\alpha=2.0$) dataset; the results are shown in Fig. 6. We set $\epsilon=10$, $r=0.1$, and $\theta=0.001$. The space cost of Apple-HCMS and LDPJoinSketch consists of one sketch per table, while the space cost of LDPJoinSketch+ includes the sketches used in both phases. For simplicity, we use sketches of the same size in both phases of LDPJoinSketch+, so its phase-2 space cost is nearly twice that of phase 1. We vary the settings so that the different methods have similar space costs. We observe that LDPJoinSketch+ attains greater accuracy than Apple-HCMS at a comparable space cost.

Figure 6: Impact of space cost.
Figure 7: Communication cost.

Communication cost. Fig. 7 shows the communication costs of different methods on the Zipf($\alpha=1.1$) and MovieLens datasets. We set ($k=18$, $m=1024$) and $\epsilon=4$. The y-axis represents the cumulative number of bits transmitted from all clients to the server. Since LDPJoinSketch and Apple-HCMS both encode each value with the Hadamard matrix and send only one bit to the server, they significantly reduce the communication cost. In contrast, k-RR sends the complete encoded value to the server, and FLH sends an encoded vector of optimal length.

Figure 8: Impact of $\epsilon$. (Panels: (a) Zipf($\alpha$=1.5), (b) Gaussian, (c) MovieLens, (d) Twitter.)
Figure 9: Impact of $(m,k)$. (Panels (a)–(d) show the impact of $m$ on Zipf($\alpha$=1.1), Zipf($\alpha$=2.0), MovieLens, and Twitter; panels (e)–(h) show the impact of $k$ on the same datasets.)

VII-D Impact of parameters

The impact of privacy budget $\epsilon$. We report the accuracy of different methods with the privacy budget $\epsilon$ varying in $\{0.1,1,\ldots,10\}$ in Fig. 8. We set ($k=18$, $m=1024$). Overall, the accuracy improves as $\epsilon$ increases because less noise is introduced. Our methods perform better when $\epsilon$ is relatively small, and LDPJoinSketch+ shows a notable accuracy improvement on skewed data. We also find that the errors of the sketch-based algorithms do not vary much when $\epsilon$ is large.

The impact of sketch parameters $(k,m)$. Fig. 9 shows the impact of the sketch size on the estimation results. First, Fig. 9(a)-(d) show the impact of the hash domain size $m$ on accuracy with a fixed $k=18$. We set $\epsilon=10$ and $r=0.1$. LDPJoinSketch+ achieves the best utility on both synthetic and real-world datasets under different $m$. The error of all methods decreases as $m$ increases because of fewer hash collisions. Second, Fig. 9(e)-(h) illustrate the impact of the number of hash functions $k$ with a fixed $m=1024$. As mentioned in Theorem 5, the parameter $k$ in our methods determines the failure probability $\delta$; we thus set $\delta\in\{0.1,0.05,0.01,\ldots,0.0001\}$ and correspondingly $k\in\{9,12,18,21,28,30,36\}$. Our methods perform best in most cases. For both FAGMS and Apple-HCMS, the errors decrease with larger $k$, but for our methods the error changes little or increases slightly as $k$ grows. The reason is that we only update one counter in the $j$th row of the sketch for each entry, where $j$ is randomly chosen from $[k]$; as $k$ increases, the error caused by sampling increases.

Figure 10: Impact of $r$.
Figure 11: Impact of $\theta$.

The impact of sampling rate $r$. Fig. 10 illustrates the impact of the sampling rate $r$ of LDPJoinSketch+ on accuracy. Since the sampling rate in phase 1 affects the frequency estimation accuracy, it also influences FAP in phase 2. We set ($k=18$, $m=1024$), $\epsilon=4$, and vary $r\in\{0.1,0.15,\ldots,0.30\}$ on the Zipf($\alpha$=1.1) dataset. We observe that the accuracy increases with a larger sampling rate.

The impact of threshold $\theta$. Fig. 11 shows the results of LDPJoinSketch+ on the Zipf($\alpha$=1.1) dataset with the threshold $\theta$ ranging from $5\times 10^{-5}$ to 0.1. LDPJoinSketch+ uses $\theta$ to separate the high-frequency items from the low-frequency ones; here ($k=18$, $m=1024$) and $\epsilon=4$. On one hand, a smaller $\theta$ causes some low-frequency items to be treated as high-frequency ones, reducing the accuracy of LDPJoinSketch+. On the other hand, a larger $\theta$ leaves fewer items in the frequent-item set, making it less effective at mitigating hash-collision errors. It is therefore essential to select a threshold $\theta$ tailored to the data distribution. A minimal sketch of this separation follows.
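The sketch below assumes phase-1 estimated counts est_freq and a total data size n_total; both names are illustrative, not our exact implementation.

def split_by_threshold(est_freq, n_total, theta):
    # est_freq: dict mapping item -> estimated count from phase 1.
    # An item is high-frequency if its estimated relative frequency >= theta.
    high = {d for d, f in est_freq.items() if f / n_total >= theta}
    low = set(est_freq) - high
    return high, low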

Figure 12: Impact of skewness.
Figure 13: Efficiency.

The impact of skewness $\alpha$. Fig. 12 shows the results on Zipf datasets with different levels of skewness. We set ($k=18$, $m=1024$) and $\epsilon=4$. Our methods achieve the best performance at all skewness levels. As the skewness increases, the error of all methods decreases: the true join size grows considerably with larger $\alpha$, while the number of distinct items decreases, resulting in smaller errors.

VII-E Efficiency

Fig. 13 shows the efficiency of different methods on three datasets. The solid areas represent the offline running time, including collecting perturbed values and constructing sketches, while the grid areas represent the online running time of join size estimation. We observe that the online time cost of all sketch-based methods is nearly negligible, indicating that a response can be generated quickly once the sketches are constructed, which is valuable in interactive settings. Compared with other LDP mechanisms, our algorithms require slightly more offline time, but given the substantial improvement in estimation accuracy and online querying, the extra time is well spent.

Figure 14: Frequency Estimation. (Panels: (a) Zipf($\alpha$=1.5), (b) MovieLens.)

VII-F Frequency Estimation

Fig. 14 shows the performance of LDPJoinSketch on frequency estimation compared with other LDP mechanisms. LDPJoinSketch reaches the same accuracy level as Apple-HCMS, since the two algorithms have almost identical sketch structures apart from the additional hash function $\xi$ in LDPJoinSketch. Moreover, LDPJoinSketch is more accurate when $\epsilon$ is small. The errors of LDPJoinSketch and Apple-HCMS change little once $\epsilon$ reaches a certain value, because for large $\epsilon$ the error is introduced mainly by the sketch rather than by the privacy budget.

Figure 15: Varying $\epsilon$ on multi-way join.

VII-G Experiments on Multi-way join

Fig. 15 illustrates the performance of various methods for multi-way chain joins on the Zipf($\alpha=1.5$) dataset. We use 3-way and 4-way chain join queries for evaluation; an example of a 3-way join is $T_{1}(A)\Join T_{2}(A,B)\Join T_{3}(B)$. Due to the extremely high time cost of frequency-based methods such as k-RR, we only show the performance of COMPASS [1] and LDPJoinSketch in the 4-way join cases. We set ($k=18$, $m=1024$). The dashed lines in the figure represent the 4-way join cases. We observe that LDPJoinSketch remains effective and performs best. Moreover, the estimation error decreases continuously as $\epsilon$ increases and eventually stabilizes, because at that point the noise comes mainly from the sampling operation of the sketch.

VII-H Summary of experimental results

• LDPJoinSketch is more accurate than k-RR, FLH, and Apple-HCMS for join size estimation on sensitive data.

• Our improved method LDPJoinSketch+ further reduces the error introduced by hash collisions through its frequency-aware perturbation mechanism.

• Both LDPJoinSketch and LDPJoinSketch+ are better suited to large datasets whose join values are sensitive and have large domains.

VIII Conclusion

This paper proposes two sketch-based methods, LDPJoinSketch and LDPJoinSketch+, for join size estimation under LDP. LDPJoinSketch is tailored to private join values with large domains, while LDPJoinSketch+ enhances accuracy by mitigating hash collisions within the LDP framework. Join size estimation, exemplified by the aggregation function “COUNT”, serves as an initial focal point. In future research, we will address general join aggregation queries under local differential privacy, which demands additional diligence and investigation.

Acknowledgment

This research was partially supported by the NSFC grant 62202113; GuangDong Basic and Applied Basic Research Foundation SL2022A04J01306; Open Project of Jiangsu Province Big Data Intelligent Engineering Laboratory SDGC2229; the Guangzhou Science and Technology Plan Project (No. 2023A03J0119); National Key Research and Development Program of China (No. 2022YFB3104100).

References

  • [1] Y. Izenov, A. Datta, F. Rusu, and J. H. Shin, “COMPASS: online sketch-based query optimization for in-memory databases,” in SIGMOD ’21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021.   ACM, 2021, pp. 804–816. [Online]. Available: https://doi.org/10.1145/3448016.3452840
  • [2] P. Wang, Y. Qi, Y. Zhang, Q. Zhai, C. Wang, J. C. S. Lui, and X. Guan, “A memory-efficient sketch method for estimating high similarities in streaming sets,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019.   ACM, 2019, pp. 25–33. [Online]. Available: https://doi.org/10.1145/3292500.3330825
  • [3] A. S. R. Santos, A. Bessa, F. Chirigati, C. Musco, and J. Freire, “Correlation sketches for approximate join-correlation queries,” in SIGMOD ’21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021.   ACM, 2021, pp. 1531–1544. [Online]. Available: https://doi.org/10.1145/3448016.3458456
  • [4] A. Bessa, M. Daliri, J. Freire, C. Musco, C. Musco, A. S. R. Santos, and H. Zhang, “Weighted minwise hashing beats linear sketching for inner product estimation,” in Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2023, Seattle, WA, USA, June 18-23, 2023.   ACM, 2023, pp. 169–181. [Online]. Available: https://doi.org/10.1145/3584372.3588679
  • [5] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. D. Smith, “What can we learn privately?” 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pp. 531–540, 2008. [Online]. Available: https://api.semanticscholar.org/CorpusID:1935
  • [6] T. Wang, J. Blocki, N. Li, and S. Jha, “Locally differentially private protocols for frequency estimation,” in USENIX Security Symposium, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:10051640
  • [7] M. Zhang, S. Lin, and L. Yin, “Local differentially private frequency estimation based on learned sketches,” Inf. Sci., vol. 649, p. 119667, 2023. [Online]. Available: https://doi.org/10.1016/j.ins.2023.119667
  • [8] J. C. Duchi, M. J. Wainwright, and M. I. Jordan, “Minimax optimal procedures for locally private estimation,” Journal of the American Statistical Association, vol. 113, pp. 182 – 201, 2016. [Online]. Available: https://api.semanticscholar.org/CorpusID:15762329
  • [9] Apple Differential Privacy Team, “Learning with privacy at scale,” 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:43986173
  • [10] Ú. Erlingsson, A. Korolova, and V. Pihur, “Rappor: Randomized aggregatable privacy-preserving ordinal response,” Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, 2014. [Online]. Available: https://api.semanticscholar.org/CorpusID:6855746
  • [11] G. C. Fanti, V. Pihur, and Ú. Erlingsson, “Building a rappor with the unknown: Privacy-preserving learning of associations and data dictionaries,” Proceedings on Privacy Enhancing Technologies, vol. 2016, pp. 41 – 61, 2015. [Online]. Available: https://api.semanticscholar.org/CorpusID:9001011
  • [12] B. Ding, J. Kulkarni, and S. Yekhanin, “Collecting telemetry data privately,” in Neural Information Processing Systems, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:3277268
  • [13] M. Xu, B. Ding, T. Wang, and J. Zhou, “Collecting and analyzing data jointly from multiple services under local differential privacy,” Proceedings of the VLDB Endowment, vol. 13, pp. 2760 – 2772, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:221375864
  • [14] R. B. Christensen, S. R. Pandey, and P. Popovski, “Semi-private computation of data similarity with applications to data valuation and pricing,” IEEE Trans. Inf. Forensics Secur., vol. 18, pp. 1978–1988, 2023. [Online]. Available: https://doi.org/10.1109/TIFS.2023.3259879
  • [15] J. Bater, Y. Park, X. He, X. Wang, and J. Rogers, “SAQE: practical privacy-preserving approximate query processing for data federations,” Proc. VLDB Endow., vol. 13, no. 11, pp. 2691–2705, 2020. [Online]. Available: http://www.vldb.org/pvldb/vol13/p2691-bater.pdf
  • [16] J. Ock, T. Lee, and S. Kim, “Privacy-preserving approximate query processing with differentially private generative models,” in IEEE International Conference on Big Data, BigData 2023, Sorrento, Italy, December 15-18, 2023.   IEEE, 2023, pp. 6242–6244. [Online]. Available: https://doi.org/10.1109/BigData59044.2023.10386956
  • [17] G. Cormode, S. Maddock, and C. Maple, “Frequency estimation under local differential privacy,” Proc. VLDB Endow., vol. 14, pp. 2046–2058, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:232427949
  • [18] S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz, “Bifocal sampling for skew-resistant join size estimation,” in ACM SIGMOD Conference, 1996. [Online]. Available: https://api.semanticscholar.org/CorpusID:2892590
  • [19] C. Estan and J. F. Naughton, “End-biased samples for join cardinality estimation,” 22nd International Conference on Data Engineering (ICDE’06), pp. 20–20, 2006. [Online]. Available: https://api.semanticscholar.org/CorpusID:5265860
  • [20] Y. E. Ioannidis and S. Christodoulakis, “Optimal histograms for limiting worst-case error propagation in the size of join results,” ACM Trans. Database Syst., vol. 18, pp. 709–748, 1993. [Online]. Available: https://api.semanticscholar.org/CorpusID:16703047
  • [21] N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy, “Tracking join and self-join sizes in limited storage,” in ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1999. [Online]. Available: https://api.semanticscholar.org/CorpusID:1650858
  • [22] G. Cormode and M. N. Garofalakis, “Sketching streams through the net: Distributed approximate query tracking,” in Very Large Data Bases Conference, 2005. [Online]. Available: https://api.semanticscholar.org/CorpusID:3402807
  • [23] H. Chen, Z. Wang, Y. Li, R. Yang, Y. Zhao, R. Zhou, and K. Zheng, “Deep learning-based bloom filter for efficient multi-key membership testing,” Data Science and Engineering, vol. 8, pp. 234–246, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:261499850
  • [24] S. Ganguly, M. N. Garofalakis, and R. Rastogi, “Processing data-stream join aggregates using skimmed sketches,” in International Conference on Extending Database Technology, 2004. [Online]. Available: https://api.semanticscholar.org/CorpusID:11330374
  • [25] S. Ganguly, D. Kesh, and C. Saha, “Practical algorithms for tracking database join sizes,” in Foundations of Software Technology and Theoretical Computer Science, 2005. [Online]. Available: https://api.semanticscholar.org/CorpusID:1195913
  • [26] F. Wang, Q. Chen, Y. Li, T. Yang, Y. Tu, L. Yu, and B. Cui, “Joinsketch: A sketch algorithm for accurate and unbiased inner-product estimation,” Proceedings of the ACM on Management of Data, vol. 1, pp. 1 – 26, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:259077177
  • [27] S. Aydöre, W. Brown, M. Kearns, K. Kenthapadi, L. Melis, A. Roth, and A. Siva, “Differentially private query release through adaptive projection,” in International Conference on Machine Learning, 2021.
  • [28] T. Wang, N. Li, and S. Jha, “Locally differentially private frequent itemset mining,” 2018 IEEE Symposium on Security and Privacy (SP), pp. 127–143, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:50787144
  • [29] A. Triastcyn and B. Faltings, “Bayesian differential privacy for machine learning,” in International Conference on Machine Learning, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:199472691
  • [30] H. Jiang, J. Pei, D. Yu, J. Yu, B. Gong, and X. Cheng, “Applications of differential privacy in social network analysis: A survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, pp. 108–127, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:235083200
  • [31] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography Conference, 2006. [Online]. Available: https://api.semanticscholar.org/CorpusID:2468323
  • [32] V. V. Williams, Y. Xu, Z. Xu, and R. Zhou, “New bounds for matrix multiplication: from alpha to omega,” in Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2024.