Sketch-based join size estimation under local differential privacy
Abstract
Join size estimation on sensitive data poses a risk of privacy leakage. Local differential privacy (LDP) is a solution to preserve privacy while collecting sensitive data, but it introduces significant noise when dealing with sensitive join attributes that have large domains. Employing probabilistic structures such as sketches is a way to handle large domains, but it leads to hash-collision errors. To achieve accurate estimations, it is necessary to reduce both the noise error and hash-collision error. To tackle the noise error caused by protecting sensitive join values with large domains, we introduce a novel algorithm called LDPJoinSketch for sketch-based join size estimation under LDP. Additionally, to address the inherent hash-collision errors in sketches under LDP, we propose an enhanced method called LDPJoinSketch+. It utilizes a frequency-aware perturbation mechanism that effectively separates high-frequency and low-frequency items without compromising privacy. The proposed methods satisfy LDP, and the estimation error is bounded. Experimental results show that our method outperforms existing methods, effectively enhancing the accuracy of join size estimation under LDP.
Index Terms:
Local differential privacy, Join query, Sketch
I Introduction
Join size estimation, also known as inner product estimation, represents a fundamental statistical problem with applications spanning various domains, including query optimization [1], similarity computation [2], correlation computation [3], and dataset discovery [4]. Despite numerous research efforts dedicated to this issue, the data collection process for join size estimation poses inherent privacy risks. Local differential privacy (LDP) [5] is a solution to privacy-preserving data collection [6, 7, 8], gaining traction in real-world applications at companies including Apple [9], Google [10, 11], and Microsoft [12]. However, most previous works have primarily focused on basic statistics like frequency and mean estimation, offering limited solutions for more intricate statistical tasks such as join aggregations. In a specific study [13], the authors explored join operations on two private data sources. However, it is important to highlight that their approach assumed non-private values for the join attributes.
Join size estimation under LDP holds significance in various scenarios. (1) Private similarity computation for data valuation and pricing. Estimating the similarity of two streams is a basic application of join size estimation. Private similarity computation is essential in data markets to assess the value of private data from different sources for analysis or learning tasks [14]. (2) Private correlation computation for dataset search and discovery. Join size estimation plays a role in dataset search by identifying relevant factors through joinable columns and computing column correlation [4]. Private join size estimation benefits relevant industries such as hospitals and genetics research companies in assessing the relevance of private data before collaborating. (3) Private approximate query processing. Approximate Query Processing (AQP) can enhance response times by returning approximate results. Applying AQP to private data presents a promising approach for efficient and private data analysis [15, 16], given that exact query answers are unattainable under DP or LDP. Join size estimation, as a specific task of AQP, holds potential to be extended to handle more general AQP tasks, including other join aggregations and predicates, under LDP.
Join size estimation on private join attribute values under LDP is a complex task. A common LDP setting has two main kinds of participants: a large number of users on the client side and one untrusted aggregator on the server side. An LDP workflow can be broken down into two steps: each user locally perturbs its sensitive value using an LDP mechanism and sends the perturbed value to the aggregator; the server side then aggregates the perturbed values for subsequent analysis. Join size estimation under LDP comes with several challenges. Challenge I: the join attribute values are sensitive and have a large domain. It is difficult to perturb the private join attribute values from two data sources while preserving the join size. Directly perturbing the join values with LDP mechanisms such as k-RR [6] and FLH [17] can introduce significant noise. To tackle the challenge of large domains, previous approaches like Apple-HCMS [9] and RAPPOR [10] have turned to probabilistic data structures, such as sketches and Bloom filters, for estimation. However, these structures reduce the domain at the cost of introducing hash-collision errors. Challenge II: reducing the hash-collision error of probabilistic structures while preserving privacy. Without privacy constraints, the main idea for addressing hash-collision error is to compute the join size of high-frequency items and low-frequency items separately, because collisions involving high-frequency items tend to introduce larger errors. However, separating high-frequency and low-frequency items under LDP is not simple, since whether a value is high-frequency is itself private information.
Main idea. To tackle Challenge I, we propose LDPJoinSketch, whose main idea is to adapt the fast-AGMS sketch to an LDP setting: the server constructs fast-AGMS sketches for join size estimation from the perturbed values collected from individual users on the client side. To tackle Challenge II, we propose LDPJoinSketch+, which reduces the hash-collision error by encoding low-frequency and high-frequency items differently while satisfying LDP.

Fig. 1 shows the workflow of the two proposed methods. The workflow of LDPJoinSketch adheres to the common LDP pattern. In step (1), each user on the client side encodes and perturbs its sensitive join attribute value and transmits the processed value to the server. In step (2), the server receives all perturbed values and constructs a sketch for each of the two join attributes. Ultimately, the join size can be estimated from the product of the two sketches. The workflow of the two-phase algorithm LDPJoinSketch+ can be summarized as follows. In phase 1, the server constructs the LDPJoinSketches using steps (1) and (2) based on the values from some sampled users. In step (3), the server computes the set of frequent items based on the sketches and sends this set to the remaining users. In phase 2, for each attribute, the remaining users are divided into two groups: one for constructing a sketch for high-frequency items and another for low-frequency items. Step (4) involves encoding and perturbing each private value from the two groups based on a Frequency-Aware Perturbation (FAP) mechanism we propose. This mechanism encodes high-frequency and low-frequency items differently while satisfying LDP. Finally, in step (5), the server separately constructs sketches summarizing the high-frequency items and the low-frequency items. Take the group that constructs the sketch summarizing the low-frequency items of one attribute as an example: here low-frequency items are target values and high-frequency ones are non-target values. FAP encodes each target value in the same way as LDPJoinSketch does, and encodes non-target values in a way that is independent of their true values. FAP ensures that the contributions of all non-target values can be removed from the sketch, thereby reducing hash collisions, since high-frequency and low-frequency items are summarized by separate sketches.
The main contributions of this work are as follows:
•
We present a sketch-based join size estimation method called LDPJoinSketch for data with sensitive join values drawn from large domains. We prove that it satisfies LDP and that its estimation error is bounded.
•
We develop a Frequency-Aware Perturbation (FAP) mechanism that distinguishes between high-frequency and low-frequency items while ensuring compliance with LDP. Building upon FAP, we propose an enhanced method named LDPJoinSketch+, which reduces hash-collision errors while upholding privacy preservation.
•
We conduct extensive experiments on synthetic and real-world datasets. The experimental results show that our methods outperform state-of-the-art LDP mechanisms.
II Related work
Join size estimation. The join size of two data streams is a crucial statistic in data stream analysis, representing the inner product of their frequency vectors. We mainly introduce the sketch-based join size estimation most relevant to this article, while the methods based on sampling and histograms have been well elaborated in previous works [18, 19, 20]. Sketch-based methods such as AGMS sketch [21] and fast-AGMS sketch [22] adopt compact probabilistic data structures to summarize data, and approximately compute the statistics [23] such as frequency estimation and inner-product. To mitigate the error caused by hash collisions, studies such as Skimmed sketch [24], Red sketch [25] and JoinSketch [26] have focused on separating high-frequency and low-frequency items.
Local differential privacy (LDP) is a promising privacy-preserving framework widely used in the fields of private data release [27], data mining [28], machine learning [29], and social network analysis [30].
Since no method has been proposed for estimating join size or inner product under LDP, we instead discuss the relevant task of frequency estimation under LDP. A basic LDP mechanism, k-ary Randomized Response (k-RR) [6], perturbs the original value to every other value within the domain with the same probability. To handle large domains and reduce communication cost, Apple-HCMS [9] uses sketches with the Hadamard transform to encode values before perturbation. FLH [17] reduces the domain size by local hashing, which sacrifices accuracy for computational gains. These methods can be employed for estimating join sizes by calculating the frequency of each candidate join value. However, they suffer from both cumulative errors and efficiency issues, particularly when the join attribute has a large domain. In contrast, our LDPJoinSketch, which is based on the fast-AGMS sketch, estimates the join size by computing the product of sketches rather than by accumulating the join size of each join value.
III Preliminaries
III-A AGMS sketch and Fast-AGMS sketch
AGMS sketch [21] uses a counter z to summarize the values of a data stream F. For every item v in F, it adds ξ(v) to the counter z, thus z = Σ_v f_v·ξ(v), where ξ is a 4-wise independent hash function mapping values to {+1, −1} and f_v is the frequency of v. The join size of two data streams F and G can be estimated based on sketches z_F and z_G constructed with the same hash function ξ: the estimator is z_F·z_G. The accuracy can be enhanced by computing the median of multiple independent estimators.
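To make the construction concrete, the following is a minimal Python sketch of the AGMS estimator. The helper `make_xi` is an illustrative stand-in for a 4-wise independent ±1 family (a real implementation would use, e.g., a degree-3 polynomial over a prime field), and `agms_join_size` takes the median of several independent estimators as described above.

```python
import random
import statistics

def make_xi(seed, domain):
    # Stand-in for a 4-wise independent +/-1 hash family: one fixed
    # random sign per domain value (illustrative, not production-grade).
    rng = random.Random(seed)
    return [rng.choice((+1, -1)) for _ in range(domain)]

def agms_counter(stream, xi):
    # A single AGMS counter z: add xi(v) for every item v in the stream.
    return sum(xi[v] for v in stream)

def agms_join_size(F, G, domain, copies=64):
    # One estimator is z_F * z_G built with a shared xi; the median of
    # several independent estimators sharpens the estimate.
    ests = []
    for seed in range(copies):
        xi = make_xi(seed, domain)
        ests.append(agms_counter(F, xi) * agms_counter(G, xi))
    return statistics.median(ests)
```

Sharing the same ξ between the two streams is essential: it is what makes the expectation of z_F·z_G equal the inner product of the two frequency vectors.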
Fast-AGMS sketch [22] is proposed to improve the construction efficiency of the AGMS sketch. It adds a hash function to determine the location of each update, which avoids each update touching every counter as AGMS does. A fast-AGMS sketch sk is an array of counters with d lines and m columns. For each line j ∈ [d], two hash functions, h_j and ξ_j, are employed. Here, h_j: [n] → [m] is responsible for determining which counter to increment, while ξ_j maps each value to {+1, −1} and is a member of a family of 4-wise independent binary random variables. For each value v, the sketch increases the counter at indices (j, h_j(v)) by ξ_j(v). The join size can be estimated according to the following equation:

|F ⋈ G|_est = median_{1≤j≤d} Σ_{k=1}^{m} sk_F[j, k]·sk_G[j, k],   (1)

where sk_F and sk_G are fast-AGMS sketches with the same parameters (d, m) for F and G, respectively. The estimation error is bounded with high probability; the bound shrinks as the width m grows and depends on n_F and n_G, the numbers of values of attributes F and G.
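A minimal fast-AGMS sketch in the same spirit can be written as follows. The hash functions here are simple salted stand-ins (not true pairwise/4-wise independent families), and both sketches must be built with the same seed so they share h_j and ξ_j.

```python
import random
import statistics

class FastAGMS:
    # d lines x m columns; line j uses h_j (column chooser) and xi_j (+/-1).
    def __init__(self, d, m, seed=0):
        rng = random.Random(seed)
        self.d, self.m = d, m
        # Illustrative salted hashes; a real implementation would use
        # pairwise-independent h_j and 4-wise independent xi_j families.
        self.h_salt = [rng.randrange(1 << 30) for _ in range(d)]
        self.xi_salt = [rng.randrange(1 << 30) for _ in range(d)]
        self.counters = [[0] * m for _ in range(d)]

    def _h(self, j, v):
        return hash((self.h_salt[j], v)) % self.m

    def _xi(self, j, v):
        return 1 if hash((self.xi_salt[j], v)) & 1 else -1

    def update(self, v):
        for j in range(self.d):
            self.counters[j][self._h(j, v)] += self._xi(j, v)

def join_size(skF, skG):
    # One estimator per line: the inner product of the corresponding
    # lines; the final answer is the median over the d estimators.
    ests = [sum(a * b for a, b in zip(skF.counters[j], skG.counters[j]))
            for j in range(skF.d)]
    return statistics.median(ests)
```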
III-B Local differential privacy
Local differential privacy (LDP) [5] extends the notion of differential privacy (DP) [31] to scenarios where the aggregator cannot be trusted and each data owner locally perturbs its data before sharing it with the server.
Definition 1.
(ε-local differential privacy). A local randomized algorithm A is ε-locally differentially private (ε-LDP), where ε > 0, iff for all pairs of inputs v, v′ and every output o ∈ Range(A), we have

Pr[A(v) = o] ≤ e^ε · Pr[A(v′) = o],   (2)

where Range(A) denotes the set of all possible outputs, and the privacy budget ε measures the level of privacy protection.
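As a concrete example of a mechanism satisfying this definition, the k-RR mechanism mentioned in Section II can be sketched in a few lines: for any two inputs, the output probabilities differ by a factor of at most e^ε.

```python
import math
import random

def k_rr(v, domain, eps, rng=random):
    # k-ary Randomized Response: keep the true value with probability
    # e^eps / (e^eps + k - 1); otherwise report one of the other k - 1
    # domain values uniformly at random.
    k = len(domain)
    p_keep = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p_keep:
        return v
    return rng.choice([u for u in domain if u != v])
```

The ratio of the "keep" probability to any single "switch" probability is exactly e^ε, which is what makes k-RR ε-LDP.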
III-C Hadamard Mechanism
Hadamard Mechanism [9] utilizes the Hadamard transform to improve efficiency. The Hadamard transform of a vector x is obtained by multiplying it with the Hadamard matrix: w = H_m·x. Here H_m denotes the Hadamard matrix of order m, a special type of square matrix in which every entry is either +1 or −1. It is defined recursively by H_1 = [1] and H_{2m} = [[H_m, H_m], [H_m, −H_m]]. The Hadamard transform spreads the information of a sparse vector (a 1 in a single location) across multiple dimensions, so that a single bit sampled from the dense transformed vector still carries a sufficiently strong signal about the original vector.
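The recursive definition can be implemented directly. This small sketch builds the order-m Sylvester Hadamard matrix (m must be a power of two):

```python
def hadamard(m):
    # H_1 = [[1]]; H_{2k} = [[H_k, H_k], [H_k, -H_k]] (Sylvester form).
    if m == 1:
        return [[1]]
    h = hadamard(m // 2)
    top = [row + row for row in h]
    bottom = [row + [-x for x in row] for row in h]
    return top + bottom
```

Every entry is ±1 and the rows are mutually orthogonal, which is exactly the property that lets a single sampled coefficient carry signal about the original sparse vector.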
IV LDPJoinSketch
We propose LDPJoinSketch, a local differentially private sketch for join size estimation. The LDPJoinSketch protocol can be divided into client-side and server-side algorithms, with the following goals. Client side: to perturb the private value and transmit the perturbed value along with the indices that determine which counter of the server-side sketch the perturbed value should be added to. Server side: to construct the sketch by adding each perturbed value to the counter with the given indices, and to compute the join size based on the constructed sketches.
IV-A Client-side of LDPJoinSketch
The goal of the client-side of LDPJoinSketch is to encode and perturb each private value, ensuring that the output satisfies LDP and is safe to be transmitted to the server.
Input: join value v, privacy budget ε, sketch parameters (d, m), hash function pairs ⟨h_j, ξ_j⟩ for j ∈ [d]
Output: perturbed value w̃, indices (j, l)
The pseudo-code of the client side is shown in Algorithm 1. Given a private join value v, the privacy budget ε, and the sketch parameters (d, m) indicating the number of lines and columns of the sketch on the server, the algorithm first samples a line index j uniformly from [d] and samples an index l uniformly from [m] (line 1). It then initializes a vector x of size m with all zeros and encodes the vector by setting position h_j(v) to ξ_j(v) (line 3). To reduce communication cost by sampling while maintaining a sufficiently strong signal about the original vector, we adopt the Hadamard transform before sampling, as Apple-HCMS [9] does. The algorithm computes w = H_m·x, which transforms x, with only one non-zero member, into a dense vector w. It then samples the bit w_l from w and perturbs it by multiplying by −1 with probability 1/(e^ε + 1) (lines 5-6). The client side of LDPJoinSketch is almost the same as that of Apple-HCMS [9]; the only difference is the encoding method in line 3. We encode each value based on the fast-AGMS sketch and set x[h_j(v)] = ξ_j(v), while Apple-HCMS encodes each item based on the Count Mean Sketch and sets x[h_j(v)] = 1.
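The client-side steps can be sketched as follows. Since the encoded vector x has a single non-zero entry, its Hadamard coefficient at position l is just ξ_j(v)·H[l, h_j(v)], so the dense vector never needs to be materialized; the hash functions h and xi are assumed to be shared with the server.

```python
import math
import random

def hadamard_entry(r, c):
    # Entry (r, c) of the Sylvester Hadamard matrix: (-1)^popcount(r & c).
    return -1 if bin(r & c).count("1") % 2 else 1

def client_encode(v, eps, d, m, h, xi, rng=random):
    # One round of the client side (a hedged sketch of Algorithm 1).
    # h[j] picks the column, xi[j] gives the +/-1 sign.
    j = rng.randrange(d)   # sample a line uniformly from [d]
    l = rng.randrange(m)   # sample a Hadamard coefficient uniformly from [m]
    # x = xi_j(v) * e_{h_j(v)}, so (H_m x)[l] = xi_j(v) * H[l][h_j(v)].
    w = xi[j](v) * hadamard_entry(l, h[j](v))
    if rng.random() < 1.0 / (math.exp(eps) + 1.0):
        w = -w             # randomized-response flip
    return w, j, l
```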
Theorem 1.
LDPJoinSketch satisfies ε-LDP.
Proof.
Let x and x′ be the encodings of v and v′, respectively. According to Algorithm 1, x and x′ differ in at most two positions, namely x[h_j(v)] = ξ_j(v) and x′[h_j(v′)] = ξ_j(v′). Let J be the random variable selected uniformly from [d], and L be the random variable selected uniformly from [m]. Let B be the random variable for the random sign flip, where Pr[B = −1] = 1/(e^ε + 1) and Pr[B = 1] = e^ε/(e^ε + 1). We denote Algorithm 1 by A in the following.

Pr[A(v) = (w̃, j, l)] = Pr[J = j]·Pr[L = l]·Pr[B·(H_m x)_l = w̃].   (3)

Since both (H_m x)_l ∈ {+1, −1} and (H_m x′)_l ∈ {+1, −1}, the probabilities of obtaining the same output with the different inputs v and v′ are close:

Pr[A(v) = (w̃, j, l)] / Pr[A(v′) = (w̃, j, l)] ≤ Pr[B = 1] / Pr[B = −1] = e^ε.   (4)
Thus, LDPJoinSketch satisfies ε-LDP. ∎
IV-B Server-side of LDPJoinSketch
After receiving the perturbed values, the server has two tasks: (1) to construct a sketch for each join attribute, and (2) to compute the join size based on the sketches.
Input: perturbed values with indices (w̃_1, j_1, l_1), …, (w̃_N, j_N, l_N), privacy budget ε, sketch parameters (d, m)
Output: private sketch
Construction of LDPJoinSketch. The pseudo-code of the construction algorithm is shown in Algorithm 2. For each input (w̃, j, l) from the client side, it multiplies w̃ by the scaling factor and adds the result to the counter at indices (j, l) (line 4). The scaling factor (line 3) is used to debias the sketch, as both sampling an index and the perturbation in Algorithm 1 introduce bias. Finally, it multiplies the sketch with the Hadamard matrix H_m to transform the sketch back (line 6).
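A hedged sketch of the server-side construction follows. We assume the debiasing factor is c_ε·d, where c_ε = (e^ε + 1)/(e^ε − 1) corrects the sign flip and d corrects line sampling; the 1/m introduced by coefficient sampling is cancelled by multiplying with H itself rather than H/m.

```python
import math

def _H(r, c):
    # Entry (r, c) of the Sylvester Hadamard matrix: (-1)^popcount(r & c).
    return -1 if bin(r & c).count("1") % 2 else 1

def build_sketch(reports, eps, d, m):
    # Server-side construction (a hedged sketch of Algorithm 2).
    # Each report (w, j, l) comes from one client.
    c_eps = (math.exp(eps) + 1.0) / (math.exp(eps) - 1.0)
    M = [[0.0] * m for _ in range(d)]
    for w, j, l in reports:
        M[j][l] += c_eps * d * w       # debias flip and line sampling
    # Transform each line back with the (symmetric) Hadamard matrix;
    # the 1/m from sampling one of m coefficients cancels here.
    return [[sum(M[j][l] * _H(l, k) for l in range(m)) for k in range(m)]
            for j in range(d)]
```

With a large ε (almost no flips) and a value hashed to column 0, the recovered counter approaches the exact count, while the other columns approach 0.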

Example 1.
We use the example in Fig. 2 to demonstrate the workings of the client side and server side of LDPJoinSketch; the concrete sketch parameters and the client's private value are as shown in the figure.
Client-side: the client encodes its private value v as a vector x with a single non-zero entry x[h_j(v)] = ξ_j(v). It adopts the Hadamard transform to spread the single non-zero signal into a dense vector w, and perturbs a randomly sampled bit w_l of w. Finally, the client sends (w̃_l, j, l) to the server.
Server-side: the server constructs the sketch by adding the scaled perturbed value to the counter at indices (j, l). It then multiplies the sketch with H_m and a scaling factor to debias it. We prove later that the expected contribution of v to the sketch equals its contribution to a fast-AGMS sketch.
Join size estimation. Based on the LDPJoinSketches sk_F and sk_G for attributes F and G, the server estimates the join size:

|F ⋈ G|_est = median_{1≤j≤d} Σ_{k=1}^{m} sk_F[j, k]·sk_G[j, k].   (5)
IV-C Analysis of estimation error
We follow three steps to compute the error of join size estimation based on LDPJoinSketch. Step 1: we analyze the contribution of each value to the sketch. Step 2: we prove that each estimator est_j (the inner product of the j-th lines of the two sketches) is an unbiased estimator of the join size; additionally, we compute the variance of each estimator. Step 3: we take the median of the d estimators as the final estimate and calculate its error bound. For the convenience of the reader, we provide a notation table (Table I) containing the symbols used in the following proofs.
Notation | Description
---|---
v | private join attribute value
sk | the sketch
d, m | the number of lines and columns of a sketch
h_j | hash function for the j-th line, h_j: [n] → [m]
ξ_j | 4-wise independent hash function mapping [n] to {+1, −1}
f_v | true frequency of value v of an attribute
C_{j,k}(v) | contribution of datum v to the counter at (j, k)
n_F | number of values of attribute F, i.e., n_F = Σ_v f_v
F2(F) | second frequency moment Σ_v f_v²
J, L | random variables selected uniformly from [d], [m]
Step 1. The contribution of each private value to the sketch.
Definition 2.
The contribution C_{j,k}(v_i) of the i-th private datum v_i to the counter at indices (j, k) of a d × m LDPJoinSketch is the debiased value that Algorithm 2 adds on behalf of v_i; it is 0 when the sampled line J_i ≠ j. Here J_i is a variable chosen uniformly at random from [d], L_i is chosen uniformly at random from [m], and the perturbation bit B_i is 1 with probability e^ε/(e^ε + 1).
In Theorem 2, we give the expectation of C_{j,k}(v).
Theorem 2.
The expected contribution of a value v to the counter at (j, k) is E[C_{j,k}(v)] = ξ_j(v) if k = h_j(v), and 0 otherwise.
Proof.
We analyze E[C_{j,k}(v)] under two situations.

(1) If k = h_j(v): the counter is touched only when the sampled line equals j; the scaling factor cancels the bias introduced by the sign flip and by the index sampling, and inverting the Hadamard transform recovers the encoded entry x[h_j(v)] = ξ_j(v). Hence E[C_{j,k}(v)] = ξ_j(v) (equations (6)-(8)).

(2) If k ≠ h_j(v): the encoded entry x[k] = 0, and the recovered entry has expectation 0, so E[C_{j,k}(v)] = 0 (equations (9)-(11)).

Merging these two situations, we get E[C_{j,k}(v)] = ξ_j(v)·1[k = h_j(v)]. The expectation is the same as that of the fast-AGMS sketch. ∎
Step 2: The expectation and variance of one estimator. We take the inner product of the corresponding lines of the two sketches, est_j = Σ_k sk_F[j, k]·sk_G[j, k], as one estimator of the join size. We prove that E[est_j] equals the true join size and that Var[est_j] is bounded. Before computing E[est_j], we first give a lemma on the product of two entries from the two datasets.
Lemma 1.
Let a and b be values of attributes F and G, respectively. Given that the two sketches share the same hash functions, the expected product of their contributions to the counter at (j, k) satisfies

E[C^F_{j,k}(a)·C^G_{j,k}(b)] = ξ_j(a)·ξ_j(b)·1[h_j(a) = k]·1[h_j(b) = k].   (12)
Proof.
(1) If h_j(a) ≠ k or h_j(b) ≠ k, one of the factors has expectation 0 by Theorem 2, and since the two clients' randomness is independent, the expectation of the product is 0. (2) If h_j(a) = h_j(b) = k, independence gives E[C^F_{j,k}(a)·C^G_{j,k}(b)] = E[C^F_{j,k}(a)]·E[C^G_{j,k}(b)] = ξ_j(a)·ξ_j(b). ∎
Theorem 3.
E[est_j] = Σ_v f_v·g_v, i.e., each estimator is an unbiased estimator of the join size.
Proof.
By linearity of expectation and Lemma 1,

E[est_j] = Σ_{a,b} f_a·g_b·E[ξ_j(a)·ξ_j(b)·1[h_j(a) = h_j(b)]] = Σ_v f_v·g_v,   (13)

since over the random choice of ξ_j, E[ξ_j(a)·ξ_j(b)] = 1 if a = b and 0 otherwise. Here f_v and g_v denote the frequency of v in attributes F and G, respectively. ∎
Lemma 2.
Given two data entries of the same attribute, the expectation of the product of their contributions C_{j,k} is determined by the three cases below (equation (14)).
Proof.
According to the definition of C_{j,k}, given two data entries of the same attribute, the product of their contributions involves the random variables (line index, column index, and perturbation bit) drawn independently for each entry.

Case 1. If the two contributions refer to the same data entry, the sampled indices and perturbation bit coincide, so the product reduces to the square of a single contribution.

Case 2. If the two entries hold different values, their random variables are independent, so the expectation of the product factorizes into the product of the expectations.

Case 3. If the two entries are distinct but hold the same value, the situation is similar to Case 2, except that the two entries share the same hashed column h_j(v) and sign ξ_j(v). ∎
Lemma 3.
The expectation of est_j² can be expanded as follows.
Proof.
Suppose each sketch entry is expanded as a sum of per-entry contributions. Since the contributions of entries from different attributes are independent, based on Lemma 2 we can expand E[est_j²] into sums over pairs of entries of F and pairs of entries of G (equations (15)-(16)). Based on this, we can compute Var[est_j] = E[est_j²] − (E[est_j])². ∎
We now define notation used in the subsequent theorems.
Definition 3.
Let v be a value of attribute F and f_v be the frequency (number of occurrences) of v in the sequence of values of attribute F. We define (1) the total frequency of all values of F as n_F = Σ_v f_v, and (2) the second frequency moment F2(F) = Σ_v f_v².
Lemma 4.
E[est_j²] can be bounded in terms of F2(F), F2(G), n_F, n_G, m, and ε.
Proof.
The lemma follows by expanding est_j² with Lemma 3 and bounding each term. With Lemma 4, we can compute the variance as follows.
Theorem 4.
The variance of each estimator est_j is bounded in terms of m, ε, F2(F), F2(G), n_F, and n_G.
Proof.
Expanding Var[est_j] = E[est_j²] − (E[est_j])² and applying Lemma 3 term by term yields the stated bound (equations (21)-(22)). Thus, the variance of each estimator is bounded. ∎
Step 3. Error bound of LDPJoinSketch. The final estimate is computed as the median of the d estimators, est = median_{j∈[d]} est_j. We use the following theorem to prove that the error of est is bounded.
Theorem 5.
Let the join size estimation error be |est − |F ⋈ G||; with probability at least 1 − δ, the error is bounded when d = O(log(1/δ)).
Proof.
According to Chebyshev's inequality, for a random variable X, Pr[|X − E[X]| ≥ t] ≤ Var[X]/t². Applying it to each estimator est_j gives an error bound that holds with constant probability (23). Choosing t as a suitable multiple of the standard deviation, each estimator fails with probability at most 1/3 (24). By a standard application of Chernoff bounds, taking the median of d = O(log(1/δ)) independent estimators reduces the failure probability to δ. ∎
Based on Theorem 5, the error of LDPJoinSketch-based join size estimation is bounded. LDP introduces an additional term into the error bound, but this does not destroy the utility of the sketch, since the LDP-induced term is much smaller than n_F and n_G in practice.
V LDPJoinSketch+
To reduce hash collisions while preserving privacy, we present a novel two-phase framework called LDPJoinSketch+. This framework builds upon LDPJoinSketch and enhances its accuracy by specifically addressing hash collisions (Section V-A). In LDPJoinSketch+, we design a Frequency-Aware Perturbation (FAP) mechanism for the client side (Section V-B). Finally, we show how to estimate the join size based on LDPJoinSketch+ (Section V-C).
V-A The framework of LDPJoinSketch+
LDPJoinSketch+ aims to reduce the error caused by hash collisions under LDP. It has a two-phase framework. In phase 1, we identify candidate frequent join values from each table and communicate these candidates to the clients. In phase 2, each client determines whether its value is high-frequency, and encodes high-frequency and low-frequency values in distinct manners, all while ensuring compliance with LDP.
We use the pseudo-code in Algorithm 3 to show the framework of LDPJoinSketch+. In phase 1, we find the frequent item set based on the LDPJoinSketches constructed from sampled users for the two attributes (Phase 1, lines 1-4). In phase 2, to avoid splitting the privacy budget, we divide the users into two groups (Phase 2, line 1): one for the join size estimation of low-frequency values, another for high-frequency values. We can use the whole privacy budget for each group according to the composition theorem of LDP. For ease of description, we encapsulate the process of perturbing each datum through FAP and aggregating all perturbed data into a sketch as a single function. Based on it, we construct the sketches summarizing the low-frequency items of the two attributes from the first pair of groups (Phase 2, line 2); the sketches summarizing the high-frequency items are constructed similarly (Phase 2, line 3). The algorithm then computes the join size estimate for low-frequency values and the estimate for high-frequency values (lines 4-5). The final result is the sum of the two scaled estimates (Phase 2, line 6): because each estimate is computed from the data of only a subset of users, we multiply by a scaling factor to extrapolate to the full population.
Phase 1: Find frequent join values
Phase 2: Join size estimation
We introduce the FAP mechanism and how to estimate the join size based on LDPJoinSketch+ in the following sections.
V-B Frequency-Aware Perturbation
Phase 2 of LDPJoinSketch+ aims to improve accuracy by estimating the join size of high-frequency and low-frequency items separately. To achieve this without privacy leaks, we propose a frequency-aware perturbation (FAP) mechanism.
As phase 1 of LDPJoinSketch+ obtains the frequent items, each client can determine whether its value is high-frequency, and handle high-frequency and low-frequency items differently. For ease of explanation, we call a value that possesses the same property as the estimation target a "target value", and a value that does not a "non-target value". For instance, when constructing sketches for estimating the join size of high-frequency items, the high-frequency values are target values, while the low-frequency values are non-target values. FAP has two goals: (1) the server cannot infer from the perturbed value whether the true value is high-frequency; (2) the contribution of non-target values to the sketches can be removed.
For the first goal, FAP encodes target values in the same way as LDPJoinSketch does, while encoding non-target values independently of their true values. The encoded values of both target and non-target values are perturbed by randomized response, as in LDPJoinSketch, before being sent to the server. Thus, privacy is still preserved.
For the second goal, non-target values are encoded randomly, resulting in their impact spreading uniformly across each cell of the sketch. Consequently, the influence of non-target values can be effectively removed from the target sketch, given that we know the total number of non-target values. This improves the accuracy of target estimation.
Input: join value v, mode, privacy budget ε, frequent item set FI
Output: perturbed value w̃, indices (j, l)
The pseudo-code for the FAP algorithm is presented in Algorithm 4. The parameter mode signifies which values are the encoding targets. When the mode targets high-frequency values, low-frequency items are encoded randomly; conversely, when the mode targets low-frequency values, high-frequency items are encoded randomly. Therefore, when the two conditions "the mode targets low-frequency values" and "v belongs to the frequent item set" are both true or both false, v is a non-target value and the algorithm employs random encoding for it. Otherwise, it encodes and perturbs v using the client side of LDPJoinSketch.
The sole distinction between the encoding of target and non-target values lies in line 5: FAP encodes a non-target value randomly using an index r chosen uniformly at random from [m], whereas it encodes a target value by h_j(v) and ξ_j(v), following the client side of LDPJoinSketch (line 10).
Example 2.
We also use the example shown in Fig. 3 to illustrate the process of FAP. Since a target value is encoded and perturbed exactly as in LDPJoinSketch, we only show how a non-target value is handled. The parameters in this example are the same as those in Example 1. The significant distinction from LDPJoinSketch is that FAP employs a randomly chosen index r in place of h_j(v). Consequently, the encoding of a non-target value is independent of its true value, ensuring that the impact of non-target values is evenly distributed across the cells of the sketch.
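The FAP client logic can be sketched under the same assumptions as the LDPJoinSketch client above; the fixed +1 sign for non-target values is our illustrative assumption.

```python
import math
import random

def fap_encode(v, freq_set, mode, eps, d, m, h, xi, rng=random):
    # Frequency-Aware Perturbation (a hedged sketch of Algorithm 4).
    # Target values (those matching the mode) are encoded exactly as in
    # the LDPJoinSketch client; non-target values get a random column r,
    # so their encoding is independent of v.
    j = rng.randrange(d)                 # sampled line
    l = rng.randrange(m)                 # sampled Hadamard coefficient
    target = (mode == "high") == (v in freq_set)
    if target:
        col, sign = h[j](v), xi[j](v)
    else:
        col, sign = rng.randrange(m), 1  # assumption: fixed +1 sign
    # Hadamard coefficient of the single-entry encoding at position l.
    w = sign * (-1 if bin(l & col).count("1") % 2 else 1)
    if rng.random() < 1.0 / (math.exp(eps) + 1.0):
        w = -w                           # randomized-response flip
    return w, j, l
```

Because target and non-target values go through the same randomized response, the server cannot tell them apart from a single report, which is the first FAP goal.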

Theorem 6.
Algorithm FAP satisfies ε-LDP.
Proof.
We have proved that LDPJoinSketch satisfies ε-LDP. We only need to prove that the server cannot distinguish the output of a non-target value v from that of a target value v′.
Suppose the encodings of v and v′ are x and x′, respectively; x and x′ differ in at most two positions. Let J be the random variable selected uniformly from [d], L the random variable selected uniformly from [m], and B the random variable for the random sign flip, where Pr[B = −1] = 1/(e^ε + 1) and Pr[B = 1] = e^ε/(e^ε + 1). Since both (H_m x)_L and (H_m x′)_L lie in {+1, −1}, the probabilities of obtaining the same output from the different inputs v and v′ differ by at most a factor of e^ε, as demonstrated in equation (25).
Thus, the algorithm FAP satisfies ε-LDP. ∎
Each client perturbs its value based on FAP. The server receives the perturbed values from the two groups and constructs sketches for join size estimation of high-frequency and low-frequency values, respectively. We show how to compute the join size based on LDPJoinSketch+ in the next part.
V-C The server-side of LDPJoinSketch+
The server-side of LDPJoinSketch+ has different aims in the two phases. Phase 1 aims to find the frequent items. Phase 2 aims to construct sketches summarizing high-frequency and low-frequency values separately and estimate the join size.
Phase 1: Frequency estimation based on LDPJoinSketch.
We use the following Theorem 7 to prove that the LDPJoinSketch can provide unbiased frequency estimations.
Theorem 7.
Given an LDPJoinSketch sk summarizing the values of attribute F, the frequency of a value v can be estimated as f̂_v = ξ_j(v)·sk[j, h_j(v)] (for any line j, or the median over lines). Each line estimator is unbiased, i.e., E[ξ_j(v)·sk[j, h_j(v)]] = f_v.
Proof.
According to Lemma 2, the expected contribution of each of the f_v occurrences of v to sk[j, h_j(v)] is ξ_j(v), while the expected contribution of every other value to this counter is 0. Hence E[ξ_j(v)·sk[j, h_j(v)]] = f_v·ξ_j(v)² = f_v.

LDPJoinSketch provides unbiased frequency estimations. ∎
As LDPJoinSketch provides unbiased frequency estimation, given a proper threshold to separate high-frequency and low-frequency values, we can find the frequent item sets FI_F and FI_G for attributes F and G, where the threshold is set relative to the total frequencies n_F and n_G of the two attributes. We let the final frequent item set be the union FI = FI_F ∪ FI_G.
Phase 2: Join size estimation.
The server side of LDPJoinSketch+ constructs the sketches for the two attributes in the same way as LDPJoinSketch, so we only discuss how to estimate the join size based on LDPJoinSketch+.
LDPJoinSketch+ (Algorithm 3) separately computes the join size estimates for high-frequency and low-frequency values. The pseudo-code is presented in Algorithm 5. The algorithm initially calculates the total frequencies of elements belonging to the high-frequency sets of the two attributes (lines 1-3). The mode parameter indicates which values are the targets of the sketches. When the target is low-frequency values, the algorithm removes the contribution of high-frequency values from the sketches (lines 5-8); conversely, when the target is high-frequency values, it removes the contribution of low-frequency values (lines 9-12).
Input: sketches, frequent item set FI, mode
Output: Join size estimation
The following theorem computes the contribution of non-target values to the sketch.
Theorem 8.
The expected contribution of the non-target values to each counter of the sketch is F_non/m, where F_non is the total frequency of all non-target values.
Proof.
The contribution of a non-target value to the counter at (j, k) depends only on the uniformly random index R chosen in place of h_j(v), where R is a random variable that selects uniformly from [m] (equation (26)). Hence each counter receives the value with probability 1/m, and summing over all non-target values, the expected contribution to each counter is F_non/m. ∎
With the above theorem, we can remove the contribution of non-target values from the sketches; for example, the algorithm subtracts the expected non-target contribution from every counter of the low-frequency sketch when the target is low-frequency values (line 6). In this way, we separately estimate the join size of high-frequency and low-frequency items.
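The removal step itself is a one-liner: subtract the expected uniform contribution from every counter, assuming, per Theorem 8, that non-target values contribute F_non/m in expectation.

```python
def remove_nontarget(sketch, f_nontarget, m):
    # Subtract the expected uniform contribution F_non / m (Theorem 8)
    # from every counter of the debiased sketch.
    return [[c - f_nontarget / m for c in row] for row in sketch]
```

After this correction, the remaining counters reflect (in expectation) only the target values, so the subsequent sketch product estimates their join size without collisions from the non-target side.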
VI Extension to multi-way Joins
Taking inspiration from COMPASS [1], which utilizes multi-dimensional fast-AGMS sketches to facilitate multi-way joins, we illustrate in Fig. 4 how our LDPJoinSketch can be extended to support multi-way joins.

Consider a three-table chain join as a multi-way example. In this scenario, each join attribute is equipped with a unique pair of hash functions h and ξ. COMPASS constructs fast-AGMS sketches for the two outer tables on their single join attributes. For the middle table, which contains two join attributes, it constructs a two-dimensional matrix of counters: for every tuple, the counter indexed by the two column hashes is incremented by the product of the two signs. The join size is then estimated by combining the three sketches, and the accuracy can be improved by computing the median of independent estimators like the one in this example.
Our LDPJoinSketch can support multi-way joins in a similar way. We can directly construct LDPJoinSketches for the two outer tables, which contain only one join attribute each; therefore, we only discuss how to construct a sketch for the middle table with two join attributes. Assuming each attribute has one pair of hash functions, for a tuple holding two join values the client encodes it using the Hadamard matrices of the two dimensions: it extracts the line indexed by one hashed value and the column indexed by the other, randomly selects an index in each dimension, and encodes the tuple as the product of the two signs and the two selected Hadamard entries. It then perturbs the encoded value by multiplying it by −1 with probability 1/(e^ε + 1). The perturbed value is sent to the server to update the counter at the sampled indices. The server constructs the sketch from the indices and perturbed values, restores it with the two Hadamard transforms, and applies a scaling factor to debias it. Finally, the server computes the estimation:
(27) |
Discussion. We give a solution for simple chain multi-join operations. For chain multi-join queries on tables with at most two join attributes each, the computational cost of our method is dominated by the sketch products (the complexity of matrix multiplication is O(n³), which can be further reduced [32]) if each two-dimensional sketch has shape m × m. The complexity increases for star and cyclic join operations. While our method adeptly handles uncomplicated cyclic joins, addressing general multi-joins poses challenges due to intricate join graphs and the need to decompose the privacy budget among multiple join attributes. These challenges exceed the scope of this paper, and we defer this exploration to future research.
VII Experiments
In this section we design experiments to evaluate the performance of our methods on synthetic and real datasets.
VII-A Experimental Setup
Hardware. We implement all the algorithms in Python 3.9 on a machine with 256 GB of RAM running Ubuntu 20.04.1.
Queries.
We consider join queries of the following form:
Q: SELECT COUNT(*) FROM T1 JOIN T2 ON T1.A = T2.B,
where A and B are the private join attributes of tables T1 and T2.
Datasets. We generate one-dimensional synthetic datasets following the Zipf and Gaussian distributions, respectively. Treating such synthetic data as join attribute values is a common setting in previous works [24, 26]. We also conduct experiments on four real-world datasets.
(1) Zipf datasets: We generate several datasets of size 40,000,000 following the Zipf distribution with different skewness parameters. Its probability mass function is p(x) = x^(−s) / Σ_{n=1}^{N} n^(−s), where s is the skewness parameter and x is the rank of an item.
(2) Gaussian dataset: We also generate a dataset that follows the Gaussian distribution, whose probability density function for a given value x is f(x) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²)), where μ is the mean of the distribution and σ is the standard deviation.
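Such synthetic join values can be generated in a few lines; the sizes and distribution parameters below are illustrative, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(42)

# Zipf-distributed join values; numpy's sampler requires skewness a > 1.
zipf_values = rng.zipf(a=1.5, size=100_000)

# Gaussian join values, rounded to integers and clipped to a non-negative domain.
gauss_values = np.clip(rng.normal(loc=500.0, scale=100.0, size=100_000), 0, None)
gauss_values = gauss_values.round().astype(int)
```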
(3) TPC-DS dataset (https://www.tpc.org/tpcds/): The TPC Benchmark DS is a benchmark for measuring the performance of decision support systems. We extract the store sales data for experiments.
(4) MovieLens dataset (https://grouplens.org/datasets/movielens/): The MovieLens dataset is commonly used in the fields of recommender systems and collaborative filtering, containing movie ratings and user information.
(5) Twitter ego-network dataset (https://snap.stanford.edu/data/ego-Twitter.html): This real-world dataset consists of circle data from Twitter.
(6) Facebook ego-network dataset (https://snap.stanford.edu/data/ego-Facebook.html): The Facebook data was collected from survey participants using a Facebook app.
Table II shows the domain and data size of all datasets.
Dataset Name | Domain | Size
---|---|---
Zipf | 4,377–2,816,390 | 40,000,000
Gaussian | 75,949 | 40,000,000
MovieLens | 83,239 | 67,664,324
TPC-DS | 18,000 | 5,760,808
Twitter | 77,072 | 4,841,532
Facebook | 4,039 | 352,936
Competitors.
(1) k-RR [6]. We adopt it to perturb the join values and compute the join size from the calibrated frequency vectors.
(2) Apple-HCMS [9]. Apple's private Hadamard Count Mean Sketch, which also builds a sketch.
(3) FLH [17]. The heuristic fast variant of Optimized Local Hashing (OLH).
(4) Fast-AGMS [22]. It estimates the join size without privacy preservation. We refer to this method as FAGMS for short in the figures.
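For reference, the first competitor's perturbation step, generalized randomized response (k-RR), can be sketched as follows; the domain and ε in the usage are illustrative.

```python
import math
import random

def krr_perturb(value, domain, eps, rng=random):
    """k-RR: report the true value with probability e^eps / (e^eps + k - 1),
    otherwise report one of the other k-1 domain values uniformly at random."""
    k = len(domain)
    p_true = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p_true:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)
```

The server then debiases the observed reports to recover calibrated frequency vectors; note that p_true shrinks quickly as the domain size k grows, which is why k-RR degrades on large domains.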
Error Metrics.
(1) Absolute Error (AE). AE = (1/t) Σ_{i=1}^{t} |J_i − Ĵ_i|, where J_i and Ĵ_i denote the true and estimated join sizes in the i-th round, and t is the number of testing rounds.
(2) Relative Error (RE). RE = (1/t) Σ_{i=1}^{t} |J_i − Ĵ_i| / J_i. The parameters are the same as those defined in AE.
(3) Mean Squared Error (MSE). A metric for frequency estimation: MSE = (1/|D|) Σ_{v∈D} (f_v − f̂_v)², where f_v and f̂_v are the true and estimated frequencies of value v, and |D| is the number of distinct values in the domain.
Parameters.
ε: The privacy budget of differential privacy.
s: The skewness parameter of the synthetic Zipf datasets.
k, m: The number of rows and columns of a sketch.
r: The sampling rate of LDPJoinSketch+ used in phase 1.
θ: The threshold for separating high-frequency items from low-frequency items.

VII-B Accuracy
We test the accuracy of different methods on all six datasets; the experimental results are illustrated in Fig. 5. We fix the privacy parameter ε and the sketch parameters (k, m) for the sketch-based methods. Our proposed methods outperform the others in most cases, and their relative error (RE) is sufficiently small, akin to that of FAGMS (non-private), which indicates that they provide rigorous privacy protection while maintaining excellent accuracy. Methods like k-RR and FLH introduce significant noise into the join values, consequently leading to extremely poor results, and Apple-HCMS does not account for hash collisions. Besides, we can observe that LDPJoinSketch+ achieves better utility on the different datasets, because FAP further reduces the hash-collision error. Our method demonstrates clear advantages in situations with large data volumes and large domains, but it does not exhibit significant advantages on small datasets with small domains, because most LDP mechanisms are unsuitable for small data: LDP requires a large amount of data to ensure that the noise introduced by perturbation follows the desired distribution.
VII-C Space and Communication cost
Space Cost. We test the accuracy of the sketch-based methods under similar space costs on the Zipf dataset; the results are shown in Fig. 7. The space cost of Apple-HCMS and LDPJoinSketch comprises only the size of one sketch per table, while that of LDPJoinSketch+ includes the sketches used in both phases. For simplicity, we use the same sketch size in both phases of LDPJoinSketch+, which means its space cost in phase 2 is nearly twice that of phase 1. We take a variety of settings to make the different methods have similar space costs, and we can observe that LDPJoinSketch+ attains greater accuracy than Apple-HCMS at a comparable space cost.
Communication cost. Fig. 7 shows the communication costs of different methods on the Zipf and MovieLens datasets. The y-axis represents the cumulative number of bits transmitted from all clients to the server. Since LDPJoinSketch and Apple-HCMS both encode each value with the Hadamard matrix and send only one bit to the server, they significantly reduce the communication cost. In contrast, k-RR sends the complete encoded value to the server, and FLH sends an encoded vector of optimal length.
VII-D Impact of parameters
The impact of privacy budget ε. We report the accuracy of different methods as the privacy budget ε varies in Fig. 8. Overall, the accuracy improves as ε increases because less noise is introduced. Our methods perform better when ε is relatively small, and LDPJoinSketch+ shows a notable improvement in accuracy on skewed data. We can also observe that the error of the sketch-based algorithms does not vary much when ε is large.
The impact of sketch parameters k and m. Fig. 9 shows the impact of the sketch size on the estimation results. First, Fig. 9(a)-(d) show the impact of the hash domain size m on accuracy with a fixed k. We find that LDPJoinSketch+ has optimal utility on both the synthetic and real-world datasets under different m. The error of all methods decreases as m increases because of fewer hash collisions. Second, Fig. 9(e)-(h) illustrate the impact of the number of hash functions k with a fixed m. As mentioned in Theorem 5, the parameter k in our methods determines the failure probability δ, so we set δ and derive the number of hash functions k accordingly. Our methods perform best in most cases. For both FAGMS and Apple-HCMS, the errors decrease with a larger k. But for our methods, the error changes little or increases slightly as k grows. The reason is that we update only one counter in the j-th row of the sketch for each entry, where j is chosen uniformly at random among the k rows; as k increases, the error caused by this sampling increases.
The impact of sampling rate r. Fig. 11 illustrates the impact of the sampling rate r of LDPJoinSketch+ on accuracy. Since the sampling rate in phase 1 affects the frequency estimation accuracy, it consequently influences the FAP in phase 2. We vary r to test the accuracy on the Zipf (s = 1.1) dataset, and we observe that the accuracy increases with a larger sampling rate.
The impact of threshold θ. Fig. 11 shows the results of LDPJoinSketch+ on the Zipf (s = 1.1) dataset with the threshold θ ranging up to 0.1. LDPJoinSketch+ uses θ to separate the high-frequency and low-frequency items. On one hand, a smaller θ causes some low-frequency items to be treated as high-frequency ones, thereby reducing the accuracy of LDPJoinSketch+. On the other hand, a larger θ results in fewer items in the frequent-item set, thus making it less effective in mitigating hash-collision errors. Therefore, it is essential to select an appropriate threshold tailored to the data distribution.
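The separation step itself is simple once phase 1 has produced estimated frequencies. The following sketch assumes a dictionary of estimated counts over n sampled reports; the names and the usage values are illustrative, not taken from the paper.

```python
def split_by_threshold(freq_est, n, theta):
    """Split items into high- and low-frequency sets: an item is 'high'
    if its estimated relative frequency reaches the threshold theta."""
    high = {v for v, count in freq_est.items() if count / n >= theta}
    low = set(freq_est) - high
    return high, low
```

In phase 2, the frequency-aware perturbation then treats the two sets differently, which is what mitigates the hash-collision error of the sketch.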
The impact of skewness s. Fig. 13 shows the results on Zipf datasets with different levels of skewness. Our methods achieve optimal performance at all skewness levels. As the skewness increases, the error of all methods decreases: the true join size grows considerably with larger s, and meanwhile the number of distinct items decreases, resulting in smaller errors.
VII-E Efficiency
Fig. 13 shows the efficiency of different methods on three datasets. The solid areas in the figure represent offline running time, including collecting perturbed values and constructing sketches, while the grid areas represent the online running time of join size estimation. From the figure, we can observe that the online time cost of all sketch-based methods is nearly negligible, indicating that a response can be generated quickly once the sketches are constructed, which is valuable in interactive settings. Compared with other LDP mechanisms, our algorithms require a little more offline time, but considering the large improvement in estimation accuracy and online querying, the extra time is well spent.
VII-F Frequency Estimation
Fig. 14 shows the performance of LDPJoinSketch on frequency estimation compared with other LDP mechanisms. LDPJoinSketch reaches the same accuracy level as Apple-HCMS, since the two algorithms have almost identical sketch structures except for the additional hash function in LDPJoinSketch. Moreover, LDPJoinSketch is more accurate when ε is small. The errors of LDPJoinSketch and Apple-HCMS do not change much once ε reaches a certain value, because given a large ε the error is mainly introduced by the sketch rather than by the privacy budget.

VII-G Experiments on Multi-way join
Fig. 15 illustrates the performance of various methods for multi-way chain joins on the Zipf dataset. Here we use 3-way and 4-way chain join queries for evaluation; a 3-way join is a chain join of three tables on two join attributes, as in the example of Section VI. Due to the extremely high time cost of frequency-based methods such as k-RR, we only show the performance of COMPASS and LDPJoinSketch in the 4-way cases, represented by the dashed lines in the figure. We can observe that our LDPJoinSketch remains effective and performs best. Besides, the estimation error decreases continuously as ε increases and eventually becomes stable, because the noise at that point mainly comes from the sampling operation of the sketch.
VII-H Summary of experimental results
- LDPJoinSketch is more accurate than k-RR, FLH, and Apple-HCMS for join size estimation on sensitive data.
- Our improved method LDPJoinSketch+ further decreases the error introduced by hash collisions through its frequency-aware perturbation mechanism.
- Both LDPJoinSketch and LDPJoinSketch+ are better suited for large datasets whose join values are sensitive and have large domains.
VIII Conclusion
This paper proposes two sketch-based methods, namely LDPJoinSketch and LDPJoinSketch+, for join size estimation under LDP. LDPJoinSketch is tailored to handle private join values with large domains, while LDPJoinSketch+ enhances accuracy by mitigating hash collisions within the LDP framework. Join size estimation, exemplified by the aggregation function "COUNT", serves as an initial focal point. Looking ahead, our future research will focus on addressing general join aggregation queries under local differential privacy, which demands additional diligence and investigation.
Acknowledgment
This research was partially supported by the NSFC grant 62202113; GuangDong Basic and Applied Basic Research Foundation SL2022A04J01306; Open Project of Jiangsu Province Big Data Intelligent Engineering Laboratory SDGC2229; the Guangzhou Science and Technology Plan Project (No. 2023A03J0119); National Key Research and Development Program of China (No. 2022YFB3104100).
References
- [1] Y. Izenov, A. Datta, F. Rusu, and J. H. Shin, “COMPASS: online sketch-based query optimization for in-memory databases,” in SIGMOD ’21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021. ACM, 2021, pp. 804–816. [Online]. Available: https://doi.org/10.1145/3448016.3452840
- [2] P. Wang, Y. Qi, Y. Zhang, Q. Zhai, C. Wang, J. C. S. Lui, and X. Guan, “A memory-efficient sketch method for estimating high similarities in streaming sets,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019. ACM, 2019, pp. 25–33. [Online]. Available: https://doi.org/10.1145/3292500.3330825
- [3] A. S. R. Santos, A. Bessa, F. Chirigati, C. Musco, and J. Freire, “Correlation sketches for approximate join-correlation queries,” in SIGMOD ’21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021. ACM, 2021, pp. 1531–1544. [Online]. Available: https://doi.org/10.1145/3448016.3458456
- [4] A. Bessa, M. Daliri, J. Freire, C. Musco, C. Musco, A. S. R. Santos, and H. Zhang, “Weighted minwise hashing beats linear sketching for inner product estimation,” in Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2023, Seattle, WA, USA, June 18-23, 2023. ACM, 2023, pp. 169–181. [Online]. Available: https://doi.org/10.1145/3584372.3588679
- [5] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. D. Smith, “What can we learn privately?” 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pp. 531–540, 2008. [Online]. Available: https://api.semanticscholar.org/CorpusID:1935
- [6] T. Wang, J. Blocki, N. Li, and S. Jha, “Locally differentially private protocols for frequency estimation,” in USENIX Security Symposium, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:10051640
- [7] M. Zhang, S. Lin, and L. Yin, “Local differentially private frequency estimation based on learned sketches,” Inf. Sci., vol. 649, p. 119667, 2023. [Online]. Available: https://doi.org/10.1016/j.ins.2023.119667
- [8] J. C. Duchi, M. J. Wainwright, and M. I. Jordan, “Minimax optimal procedures for locally private estimation,” Journal of the American Statistical Association, vol. 113, pp. 182 – 201, 2016. [Online]. Available: https://api.semanticscholar.org/CorpusID:15762329
- [9] Differential Privacy Team, Apple, “Learning with privacy at scale,” 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:43986173
- [10] Ú. Erlingsson, A. Korolova, and V. Pihur, “Rappor: Randomized aggregatable privacy-preserving ordinal response,” Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, 2014. [Online]. Available: https://api.semanticscholar.org/CorpusID:6855746
- [11] G. C. Fanti, V. Pihur, and Ú. Erlingsson, “Building a rappor with the unknown: Privacy-preserving learning of associations and data dictionaries,” Proceedings on Privacy Enhancing Technologies, vol. 2016, pp. 41 – 61, 2015. [Online]. Available: https://api.semanticscholar.org/CorpusID:9001011
- [12] B. Ding, J. Kulkarni, and S. Yekhanin, “Collecting telemetry data privately,” in Neural Information Processing Systems, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:3277268
- [13] M. Xu, B. Ding, T. Wang, and J. Zhou, “Collecting and analyzing data jointly from multiple services under local differential privacy,” Proceedings of the VLDB Endowment, vol. 13, pp. 2760 – 2772, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:221375864
- [14] R. B. Christensen, S. R. Pandey, and P. Popovski, “Semi-private computation of data similarity with applications to data valuation and pricing,” IEEE Trans. Inf. Forensics Secur., vol. 18, pp. 1978–1988, 2023. [Online]. Available: https://doi.org/10.1109/TIFS.2023.3259879
- [15] J. Bater, Y. Park, X. He, X. Wang, and J. Rogers, “SAQE: practical privacy-preserving approximate query processing for data federations,” Proc. VLDB Endow., vol. 13, no. 11, pp. 2691–2705, 2020. [Online]. Available: http://www.vldb.org/pvldb/vol13/p2691-bater.pdf
- [16] J. Ock, T. Lee, and S. Kim, “Privacy-preserving approximate query processing with differentially private generative models,” in IEEE International Conference on Big Data, BigData 2023, Sorrento, Italy, December 15-18, 2023. IEEE, 2023, pp. 6242–6244. [Online]. Available: https://doi.org/10.1109/BigData59044.2023.10386956
- [17] G. Cormode, S. Maddock, and C. Maple, “Frequency estimation under local differential privacy,” Proc. VLDB Endow., vol. 14, pp. 2046–2058, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:232427949
- [18] S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz, “Bifocal sampling for skew-resistant join size estimation,” in ACM SIGMOD Conference, 1996. [Online]. Available: https://api.semanticscholar.org/CorpusID:2892590
- [19] C. Estan and J. F. Naughton, “End-biased samples for join cardinality estimation,” 22nd International Conference on Data Engineering (ICDE’06), pp. 20–20, 2006. [Online]. Available: https://api.semanticscholar.org/CorpusID:5265860
- [20] Y. E. Ioannidis and S. Christodoulakis, “Optimal histograms for limiting worst-case error propagation in the size of join results,” ACM Trans. Database Syst., vol. 18, pp. 709–748, 1993. [Online]. Available: https://api.semanticscholar.org/CorpusID:16703047
- [21] N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy, “Tracking join and self-join sizes in limited storage,” in ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1999. [Online]. Available: https://api.semanticscholar.org/CorpusID:1650858
- [22] G. Cormode and M. N. Garofalakis, “Sketching streams through the net: Distributed approximate query tracking,” in Very Large Data Bases Conference, 2005. [Online]. Available: https://api.semanticscholar.org/CorpusID:3402807
- [23] H. Chen, Z. Wang, Y. Li, R. Yang, Y. Zhao, R. Zhou, and K. Zheng, “Deep learning-based bloom filter for efficient multi-key membership testing,” Data Science and Engineering, vol. 8, pp. 234–246, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:261499850
- [24] S. Ganguly, M. N. Garofalakis, and R. Rastogi, “Processing data-stream join aggregates using skimmed sketches,” in International Conference on Extending Database Technology, 2004. [Online]. Available: https://api.semanticscholar.org/CorpusID:11330374
- [25] S. Ganguly, D. Kesh, and C. Saha, “Practical algorithms for tracking database join sizes,” in Foundations of Software Technology and Theoretical Computer Science, 2005. [Online]. Available: https://api.semanticscholar.org/CorpusID:1195913
- [26] F. Wang, Q. Chen, Y. Li, T. Yang, Y. Tu, L. Yu, and B. Cui, “Joinsketch: A sketch algorithm for accurate and unbiased inner-product estimation,” Proceedings of the ACM on Management of Data, vol. 1, pp. 1 – 26, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:259077177
- [27] S. Aydöre, W. Brown, M. Kearns, K. Kenthapadi, L. Melis, A. Roth, and A. Siva, “Differentially private query release through adaptive projection,” in International Conference on Machine Learning, 2021.
- [28] T. Wang, N. Li, and S. Jha, “Locally differentially private frequent itemset mining,” 2018 IEEE Symposium on Security and Privacy (SP), pp. 127–143, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:50787144
- [29] A. Triastcyn and B. Faltings, “Bayesian differential privacy for machine learning,” in International Conference on Machine Learning, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:199472691
- [30] H. Jiang, J. Pei, D. Yu, J. Yu, B. Gong, and X. Cheng, “Applications of differential privacy in social network analysis: A survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, pp. 108–127, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:235083200
- [31] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography Conference, 2006. [Online]. Available: https://api.semanticscholar.org/CorpusID:2468323
- [32] V. V. Williams, Y. Xu, Z. Xu, and R. Zhou, “New bounds for matrix multiplication: from alpha to omega,” in Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2024.