
DUMP: A Dummy-point-based Local Differential Privacy Enhancement Approach under the Shuffle Model

Xiaochen Li Zhejiang University [email protected] Weiren Liu Alibaba Group [email protected] Hanwen Feng Alibaba Group [email protected] Kunzhe Huang Zhejiang University [email protected] Yuke Hu Zhejiang University [email protected] Jinfei Liu Zhejiang University [email protected] Kui Ren Zhejiang University [email protected]  and  Zhan Qin Zhejiang University [email protected]
Abstract.

The shuffle model was recently proposed to address the severe utility loss that Local Differential Privacy (LDP) suffers due to distributed data randomization. In the shuffle model, a shuffler is utilized to break the link between the user identity and the message uploaded to the data analyst. Since less noise needs to be introduced to achieve the same privacy guarantee, this paradigm improves the utility of privacy-preserving data collection.

In this paper, we propose DUMP (DUMmy-Point-based), a new framework for privacy-preserving histogram estimation in the shuffle model. In DUMP, dummy messages are introduced on the user side, creating an additional dummy blanket that further improves the utility of the shuffle model. This new privacy-preserving mechanism also achieves an LDP privacy amplification effect on the uploaded data against a curious shuffler. We instantiate DUMP with two protocols, pureDUMP and mixDUMP, both of which achieve theoretical improvements in accuracy and communication complexity compared with existing protocols. We conduct a comprehensive experimental evaluation comparing the two designs under DUMP with existing protocols. Experimental results on both synthetic and real-world datasets confirm our theoretical analysis and show that the proposed protocols achieve better utility than existing protocols under the same privacy guarantee.

PVLDB Reference Format:
PVLDB, 14(1): XXX-XXX, 2020.
doi:XX.XX/XXX.XX

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097.
doi:XX.XX/XXX.XX

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at URL_TO_YOUR_ARTIFACTS.

1. Introduction

Differential privacy is widely accepted, by both academia and industry (Erlingsson et al., 2014; Team et al., 2017; Ding et al., 2017; Wang et al., 2019), as the standard privacy definition for statistical analysis on databases containing sensitive user information. Over the last decades, the community has devoted considerable effort to designing practical mechanisms that achieve better utility while providing a reliable privacy guarantee for users; most of them fall into either the central model (Geng et al., 2015; Ghosh et al., 2012) or the local model (Kairouz et al., 2014; Holohan et al., 2017). In the central model, the analyst, who collects data from individuals, is responsible for adding noise to the aggregated result such that the noisy result does not leak too much information about any individual. In contrast, the local model lets each user perturb his data before submitting it to an analyst and thus provides a more reliable privacy guarantee even when the analyst is curious. However, mechanisms achieving local differential privacy (LDP) have limited utility and turn out to be ill-suited to practice (Bittau et al., 2017). It is known that the estimation error is at least Ω(√n) for a wide range of natural problems in the local model (n is the number of users), whereas an error of just O(1) can be achieved in the central model (Duchi et al., 2013).

The above dilemma motivates the study of new approaches that prevent third parties from violating users' privacy while achieving reasonable utility; the shuffle model, initiated by Bittau et al. (Bittau et al., 2017), is emerging as a promising one. This model introduces an independent third party called the shuffler, who is responsible for shuffling all (permuted) messages received from users before passing them to the analyst. (In practice, various cryptographic tools such as nested encryptions (Bittau et al., 2017) can be employed to ensure that the shuffler cannot learn the content of these reports.) Regarding accuracy, for queries that are independent of the message order, this shuffling introduces no extra utility loss. Regarding privacy, the analyst intuitively cannot link each report in a shuffled database to an individual, and thus the privacy against analysts is amplified. Equivalently, users need to add less "noise" to their messages for the same privacy, which in turn improves utility. Though similar (or even better) results can be achieved using heavy cryptographic primitives such as homomorphic encryption and multiparty computation protocols (Lazar et al., 2018; Roy Chowdhury et al., 2020), shuffle-model approaches are more popular in industry since they are considerably more efficient and implementation-friendly. In recent years, several shuffle-model protocols have been presented for various tasks (e.g., frequency estimation (Balcer and Cheu, 2019), data summation (Ghazi et al., 2019b), and range query (Ghazi et al., 2019a)) over either numerical or categorical data.

We focus on the problem of histogram estimation, the fundamental task over categorical data, in the shuffle model. Many recent works (Wang et al., 2020; Ghazi et al., 2021; Balcer and Cheu, 2019; Balle et al., 2019b; Ghazi et al., 2020b) have aimed to design truly practical mechanisms for this problem, and they offer different tradeoffs. In terms of efficiency and implementation-friendliness, the generalized random response (GRR) based mechanisms (Balle et al., 2019b), in which a user simply replaces his data x by a uniformly random value r in the data domain 𝔻 with a certain probability q, turn out to be optimal: (1) only a single message is sent per user, and (2) they directly use uniform randomness, which is naturally provided by operating systems via standard means. We note that feature (2) is highly desirable in practice: although random variables with many other distributions can be calculated from uniformly random variables, inexact calculation over finite-precision values cannot produce randomness with exactly the specified distribution, and the implemented privacy mechanism may leak significant information about the original data (see (Ilvento, 2020; Mironov, 2012; Balcer and Vadhan, 2018; Gazeau et al., 2016)). More interestingly, Balle et al. (Balle et al., 2019b) proposed the privacy blanket intuition for GRR-based protocols, which gives a thoroughly theoretical explanation of how (and how much) shuffling enhances privacy, in a conceptually simple manner.

Nonetheless, the estimation error of GRR-based protocols becomes unexpectedly large as the domain size |𝔻| grows; the privacy guarantee they can provide is also subject to the number of users and the domain size and might not be sufficient for applications. Subsequent works (Wang et al., 2020; Ghazi et al., 2021; Balcer and Cheu, 2019; Ghazi et al., 2020b) improve the accuracy and achieve errors independent of |𝔻| (under the same privacy guarantee) by letting users randomize their data in rather complex ways (e.g., encoding data using hash functions (Wang et al., 2020) or the Hadamard matrix (Ghazi et al., 2021), adding noise from negative binomial distributions (Ghazi et al., 2020b), etc.); some of them (Ghazi et al., 2021; Ghazi et al., 2020b) also enable a high privacy guarantee. Yet they indeed sacrifice efficiency and/or implementation-friendliness (a detailed comparison is presented later).

In this work, we present shuffle-model mechanisms for histogram estimation that are almost as efficient and implementation-friendly as GRR-based protocols (Balle et al., 2019b), enable a high privacy guarantee, and have accuracy competitive with (Wang et al., 2020; Ghazi et al., 2021; Balcer and Cheu, 2019; Ghazi et al., 2020b). Moreover, we give an intuition that clearly explains how, and how much, our mechanisms provide privacy (mimicking the privacy blanket intuition); we believe such intuition can assist in the design and analysis of future mechanisms.

Our Contributions. We overview our contributions in more detail below.

Dummy blanket & DUMP framework. We introduce the concept of the dummy blanket, which enables a very simple method of enhancing users' privacy (particularly in the shuffle model) and significantly outperforms the privacy blanket (Balle et al., 2019b) in accuracy. Recall that the privacy blanket provides an insightful view of the GRR mechanism: if each user reports the true value v with probability p and a uniformly sampled v′ with probability q = 1 − p, then, equivalently, there are expected to be nq uniformly random values among the n reports. After shuffling, the victim data v* is mixed with these nq points, which serve as a "blanket" to hide v*. Our dummy blanket intuition generalizes the privacy blanket. Specifically, we observe that privacy essentially comes from these uniformly random points and is determined by their number, and we need not insist that these points be contributed by data randomization. Instead, we allow users to directly add uniformly random points (called dummy points) before reporting to the shuffler. As in the case of the privacy blanket, these dummy points also serve as a "blanket" to protect the victim.

The dummy blanket allows us to achieve better accuracy at the cost of slightly increased communication overhead. To see this, we first show that the privacy guarantee is mostly determined by the total number of dummy points. Then, to achieve a certain privacy level (which needs s dummy points), the privacy blanket intuition requires replacing s true values, while we just add s dummy points. Since the original values are preserved, the estimation result under the dummy blanket is more accurate. Formally, we show the estimation error introduced by the dummy blanket is independent of the domain size |𝔻| and always concretely smaller than that of the privacy blanket (see Section 5.1). Regarding communication overhead, note that only the total number of dummy points matters, so the amortized communication overhead is close to a single message when n is large. Furthermore, to some extent, the dummy blanket intuition explains why previous works (Balcer and Cheu, 2019; Wang et al., 2020) improved accuracy: their data randomizers implicitly generate dummy points, though the methodology of adding dummy points and its advantage had never been formally studied.

We incorporate the dummy blanket intuition into the existing shuffle-model framework and present a general DUMP (DUMmy-Point-based) framework for histogram estimation. In DUMP, a user first processes his data using a data randomizer (as in the conventional shuffle-model framework), then invokes a dummy-point generator to obtain a certain number of dummy points, and finally submits the randomized data together with these dummy points to the shuffler. By the dummy blanket intuition, adding dummy points enhances privacy without significantly degrading accuracy. Note that most existing locally private protection mechanisms (e.g., Randomized Response (Warner, 1965), RAPPOR (Erlingsson et al., 2014), OLH (Wang et al., 2017)) can be formulated as special cases of the data randomizer, so the framework admits diverse instantiations offering different tradeoffs for different purposes.

Instantiations: pureDUMP & mixDUMP. We first instantiate DUMP with an empty data randomizer and obtain the pureDUMP protocol. In pureDUMP, the privacy guarantee is fully provided by dummy points, so the analysis of pureDUMP quantifies the properties of the dummy blanket. Assume each user adds s dummy points and the data domain 𝔻 has size k. We prove that (1) pureDUMP satisfies (ϵ, δ)-differential privacy for δ ≤ 0.2907 and ϵ = √(14k·ln(2/δ)/(ns − 1)); (2) its expected mean squared error (eMSE) of frequency estimation is s(k − 1)/(nk²). The eMSE is therefore O(log(1/δ)/(ϵ²n²)), independent of k, and exceeds the error of the central model only by a log(1/δ) factor that is logarithmic in n; the number of messages per user is 1 + O(k·log(1/δ)/(nϵ²)), which approaches 1 as n grows. Moreover, by increasing s, ϵ can be made arbitrarily close to 0 as desired. Meanwhile, we prove that pureDUMP even satisfies local differential privacy, which is meaningful when the shuffler and the analyst collude.

We then present mixDUMP, which utilizes GRR as the data randomizer in the DUMP framework. mixDUMP offers a new tradeoff between communication overhead and accuracy. In particular, since privacy is also provided by GRR, fewer dummy points are needed. We show that mixDUMP has the smallest communication overhead among all existing multi-message mechanisms, while its eMSE is still competitive (though slightly larger than that of pureDUMP).

Finally, notice that privacy is determined by the total number of dummy points; when n is large, it may not be necessary to ask each user to send even a single dummy point. Therefore, we investigate the setting where each user generates a fixed number of dummy points only with a certain probability. This setting is also interesting when not all users are willing to use a dummy-point-based privacy protection mechanism due to the limited network bandwidth or computation capability of their devices. We formally prove that pureDUMP and mixDUMP still provide enhanced (ϵ, δ)-differential privacy protection in this setting. This result allows smaller communication overhead and better accuracy when n is large.

Implementations & Comparisons. We analyze and implement all the existing related protocols (Balle et al., 2019b; Wang et al., 2020; Ghazi et al., 2021; Balcer and Cheu, 2019; Ghazi et al., 2020b) (to the best of our knowledge) for comprehensive comparisons. We conduct experiments on both synthetic and real-world datasets; we are also the first (to the best of our knowledge) to report large-scale experiments on (Balcer and Cheu, 2019; Ghazi et al., 2021). With the same privacy guarantee, the results show that (1) pureDUMP and mixDUMP have the lowest communication overhead among all existing multi-message protocols (Ghazi et al., 2021; Balcer and Cheu, 2019; Ghazi et al., 2020b); (2) pureDUMP achieves the best accuracy compared with (Balle et al., 2019b; Wang et al., 2020; Ghazi et al., 2021; Balcer and Cheu, 2019), while the only exception (Ghazi et al., 2020b), which achieves near-optimal accuracy, has much higher communication overhead and has to use randomness from negative binomial distributions, whose implementation is challenging and vulnerable in practice. Details follow.

• Efficiency. The communication overhead of both pureDUMP and mixDUMP decreases as the number of users increases, while (Ghazi et al., 2021; Balcer and Cheu, 2019) do not have this feature; the expected communication overhead of (Ghazi et al., 2020b) turns out to be concretely larger. In our experiments, for example, on the real-world dataset Ratings (rat, 2017) with ϵ = 1 and δ = 10⁻⁶, pureDUMP (and mixDUMP) only require around 0.8 (and 0.5) expected extra messages per user, while (Ghazi et al., 2021), (Balcer and Cheu, 2019), and (Ghazi et al., 2020b) need around 10², 10³, and 10⁴ extra messages, respectively. On the other hand, single-message protocols (Balle et al., 2019b; Wang et al., 2020) are less competitive in accuracy and/or subject to upper bounds on the privacy guarantee (namely, lower bounds on ϵ). In particular, for Ratings, (Balle et al., 2019b) cannot even achieve (ϵ, δ)-privacy with ϵ ≤ 1.

• Accuracy. Our experiments show that pureDUMP and mixDUMP enjoy better accuracy than all existing protocols except (Ghazi et al., 2020b). In particular, in the experiments on Ratings with ϵ = 1 and δ = 10⁻⁶, the eMSEs of pureDUMP and of (Ghazi et al., 2020b) lie in [10⁻¹⁰, 10⁻⁹), while the eMSEs of the others are close to 10⁻⁸.

• Implementation-friendliness. pureDUMP and mixDUMP are extremely implementation-friendly, as they just generate a fixed number of uniformly random values with a fixed probability and do not involve any computation that must be approximated on finite-precision devices (e.g., the base-e logarithm ln(·)). In contrast, (Ghazi et al., 2020b) uses randomness with the negative binomial distribution NB(r, p), where r = 1/n and p = e^{−ϵ}. (For k ∈ ℤ and X ← NB(r, p), Pr[X = k] = (k+r−1 choose k)·(1−p)^r·p^k.) Notice that r = 1/n is not an integer, so we cannot sample from this distribution by naively running independent events with success probability p until seeing r failures. We may have to follow the common approach to sampling negative binomial distributions: first sample λ from the Gamma distribution Γ(r, p/(1−p)) and then sample from the Poisson distribution Poisson(λ). Note that sampling from Poisson distributions involves computing ln(·), and Mironov (Mironov, 2012) showed that DP mechanisms implementing ln(·) over double-precision floating-point numbers cannot provide the claimed privacy protection. Other sampling methods for NB(r, p) might be possible but remain to be explored.

2. Background

Central Differential Privacy. Differential privacy was initially defined in the central model (Dwork, 2008), in which a trusted analyst collects sensitive data from users, gathers them into a dataset, and computes the statistical results. The statistical results are perturbed by the analyst before being released. Intuitively, differential privacy protects each individual's privacy by requiring that any single element in the dataset have limited impact on the output.

Definition 2.1.

(Central Differential Privacy, CDP). An algorithm ℳ satisfies (ϵ, δ)-DP, where ϵ, δ ≥ 0, if and only if for any neighboring datasets 𝒟 and 𝒟′ that differ in one element, and any possible set of outputs ℛ ⊆ Range(ℳ), we have

Pr[ℳ(𝒟) ∈ ℛ] ≤ e^ϵ · Pr[ℳ(𝒟′) ∈ ℛ] + δ

Here, ϵ is called the privacy budget. A smaller ϵ means that the outputs of ℳ on two adjacent datasets are more similar, and thus the privacy guarantee is stronger. δ is generally interpreted as the probability that ℳ fails to satisfy ϵ-DP. As usual, we consider neighboring datasets 𝒟 and 𝒟′ with the same number of elements (a.k.a. bounded differential privacy). For simplicity, we assume w.l.o.g. that 𝒟 and 𝒟′ differ in the n-th element, and we denote them by {x_1, x_2, …, x_n} and {x_1, x_2, …, x′_n}.

Local Differential Privacy. The central model assumes the analyst is fully trusted and does not ensure privacy against curious analysts, which is impractical in many scenarios. In contrast, the local model (Kasiviswanathan et al., 2011) allows each user to locally perturb his data with a randomization algorithm ℳ before sending it to the untrusted analyst, so that the analyst never sees the raw data. We say ℳ satisfies local differential privacy (LDP) if for any two values v and v′, the distributions of ℳ(v) and ℳ(v′) are close.

Definition 2.2.

(Local Differential Privacy, LDP). An algorithm ℳ satisfies (ϵ, δ)-LDP, where ϵ, δ ≥ 0, if and only if for any two elements v, v′ ∈ 𝔻 and any possible set of outputs ℛ ⊆ Range(ℳ), we have

Pr[ℳ(v) ∈ ℛ] ≤ e^ϵ · Pr[ℳ(v′) ∈ ℛ] + δ

δ is typically set to 0, and (ϵ, 0)-LDP is abbreviated as ϵ-LDP.

(Generalized) Randomized Response. Randomized Response (RR) (Warner, 1965) is considered the first LDP mechanism, supporting binary responses. It allows each user to give a false answer with a certain probability and thus provides plausible deniability. The RR mechanism can be extended to multi-valued domains, which is known as Generalized Randomized Response (GRR) (Wang et al., 2017). Let k denote the size of the domain 𝔻. Each user with private value v ∈ 𝔻 reports the true value v′ = v with probability p, and reports each other value v′ ∈ 𝔻, v′ ≠ v, with probability q. By standard analysis, GRR satisfies ϵ-LDP when

(1)  p = e^ϵ / (e^ϵ + k − 1),  q = 1 / (e^ϵ + k − 1).
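To make the mechanism concrete, here is a minimal Python sketch of a GRR randomizer with these probabilities (the function name and signature are ours, for illustration only):

import math
import random

def grr(v, k, epsilon):
    # Keep the true value v with probability p = e^eps / (e^eps + k - 1);
    # otherwise report one of the other k - 1 values, each with
    # probability q = 1 / (e^eps + k - 1).
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p:
        return v
    r = random.randint(1, k - 1)      # uniform over the k - 1 other values
    return r if r < v else r + 1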

Shuffle Model. The shuffle model follows the Encode, Shuffle, Analyze (ESA) architecture proposed by Bittau et al. (Bittau et al., 2017). A shuffler is inserted between users and the analyst to break the link between the user identity and the message. Following (Cheu et al., 2019), we describe a shuffle-model protocol by the following three algorithms.

  1. R: 𝒳 → 𝒴^m. The local randomized encoder R takes a single user's data x_i as input and outputs a set of m messages y_{i,1}, …, y_{i,m} ∈ 𝒴.

  2. S: (𝒴^m)^n → 𝒴^{mn}. The shuffler S takes the n users' output messages of R as input and outputs these messages in a uniformly random order.

  3. A: 𝒴^{mn} → 𝒵. The analyst A takes all the output messages of S as input and runs some analysis function on them.

Such a protocol is called a single-message protocol when m = 1, or a multi-message protocol when m > 1.
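As an illustration, the following is a minimal sketch of this three-algorithm composition; the helper name run_shuffle_protocol is ours, and R and A stand for whatever encoder and analyzer a concrete protocol supplies:

import random

def run_shuffle_protocol(data, R, A):
    # R: encode each user's value into m messages; S: shuffle the union
    # of all messages into a uniformly random order; A: analyze them.
    messages = [m for x in data for m in R(x)]
    random.shuffle(messages)   # the shuffler S
    return A(messages)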

Variable | Description
𝔻 | Finite, discrete domain {1, 2, …, k}
k | Size of the domain 𝔻
𝒟 | Set of users' input values
n | Number of users
𝒮 | Set of dummy points
|𝒮| | Total number of dummy points
s | Number of dummy points sent per user
x_i ∈ 𝔻 | Input value of the i-th user
y_{i,j} ∈ 𝔻 | The j-th value sent by the i-th user
z⃗ | Estimated frequencies of the elements in 𝔻
Ber(·) | Sample a random binary value from a Bernoulli distribution
Unif([k]) | Uniformly sample a random value from {1, 2, …, k}
Table 1. Important Notations

3. Histogram Estimation in the Shuffle Model

3.1. Problem Statement

System Setting. The task is performed in the shuffle model. The system involves an analyst, a shuffler, and n users. The analyst knows the data domain 𝔻; we assume its size is k, i.e., 𝔻 = {1, 2, …, k}. The i-th user holds a single value x_i ∈ 𝔻. All users send perturbed data to the shuffler, and the shuffler scrambles the order before transmitting the data to the analyst. For the j-th item in 𝔻, the analyst estimates the frequency z⃗[j] (the proportion of users holding that value), i.e., z⃗[j] = (1/n)·Σ_{i=1}^{n} 𝟙_{x_i = j}, where 𝟙_{x_i = j} = 1 if x_i = j and 𝟙_{x_i = j} = 0 otherwise.

Threat Model. As in most shuffle-model works, we consider privacy against curious analysts. We concentrate on the standard case, in which the analyst does not collude with the shuffler. We say a shuffle-model protocol P = (R, S, A) (see Section 2) satisfies (ϵ, δ)-DP against curious analysts if the mechanism (S ∘ R)(x_1, …, x_n) = S(R(x_1), …, R(x_n)), i.e., the view of the analyst, satisfies (ϵ, δ)-DP in the central model (see Definition 2.1). We also study the case where the analyst colludes with the shuffler and thus has direct access to users' reports. In this case, the differential privacy of P against the collusion is equivalent to the local differential privacy of R (see Definition 2.2).

Following (Balle et al., 2019b), throughout this paper we consider powerful adversaries with maximal background knowledge, i.e., all users' values except the victim's; our mechanisms are thus also secure against any weaker adversary. All privacy analyses are under the assumption of bounded differential privacy, so we do not need to consider the case where some users drop out.

3.2. Existing Related Protocols

There are several existing works on histogram estimation in the shuffle model. To the best of our knowledge, the first protocol was proposed by Balle et al. (Balle et al., 2019b), in which each user randomizes his sensitive data with the GRR mechanism. Later, other mechanisms were adopted as randomizers to improve the utility of such protocols. Ghazi et al. (Ghazi et al., 2021) propose two multi-message protocols: private-coin and public-coin. The private-coin protocol utilizes Hadamard Response as the data randomizer; the public-coin protocol combines the Count Min data structure (Cormode and Muthukrishnan, 2005) with GRR (Wang et al., 2017). In the truncation-based protocol proposed by Balcer et al. (Balcer and Cheu, 2019), each user sends his sensitive data together with at most k (the domain size) non-repeated values sampled from 𝔻 to the shuffler. However, the truncation-based protocol may suffer from an undesirable communication load: each user needs to send thousands of messages when there are thousands of candidate values in 𝔻. SOLH, proposed by Wang et al. (Wang et al., 2020), utilizes Optimal Local Hash (OLH) as the data randomizer to improve the utility when 𝔻 is large. The correlated-noise protocol proposed by Ghazi et al. (Ghazi et al., 2020b) adds noise from negative binomial distributions and achieves accuracy arbitrarily close to that of CDP.

All the above protocols seek more suitable randomizers to improve the utility of the shuffle model. Different from existing works, we introduce a new mechanism to provide privacy protection. Our method obtains more privacy benefit than existing approaches, which further improves the utility of histogram estimation in the shuffle model.

4. DUMP Framework

4.1. Framework Overview

There are three parties in the DUMP framework: the user side, the shuffler, and the analyst. Figure 1 shows the interactions among them.

User Side. There are two protection procedures on the user side. The first is the data randomizer. Most existing LDP mechanisms, such as Randomized Response (RR), Unary Encoding (UE), Optimal Local Hash (OLH), and Hadamard Response (HR), can be viewed as special cases of the data randomizer. Hence, the output R(x_i) of the data randomizer may take different forms, such as an element, a vector, or a matrix. When the data randomizer provides no privacy protection, it directly outputs the user's sensitive data, i.e., R(x_i) = x_i. The second is the dummy-point generator, which provides privacy protection by generating dummy points: s dummy points are sampled uniformly at random from the output space of the data randomizer, where s is calculated from the number of users n, the domain size k, and the privacy parameters. The output R(x_i), mixed with the s dummy points, is then sent to the shuffler.

Shuffler. The shuffler in DUMP operates the same as in the existing shuffle model (Bittau et al., 2017). It collects all messages from users and removes all implicit metadata (such as IP addresses, timestamps, and routing paths). Then all messages are randomly shuffled before being sent to the analyst.

Figure 1. Overview of the DUMP framework.

Analyst. The analyst performs histogram estimation on the set of messages received from the shuffler. It removes the bias caused by dummy points and data randomization to obtain unbiased estimation results.

Under the DUMP framework, we design two protocols, pureDUMP and mixDUMP, in Section 5 to show the advantage of dummy points in improving utility. Dummy points provide privacy protection in both protocols. In addition, mixDUMP uses GRR as the data randomizer, while no protection is provided by the data randomizer in pureDUMP. We show that dummy points can improve accuracy without adding too much communication overhead under a given privacy guarantee. Table 2 presents our results alongside existing ones; all protocols are used for k-bin histogram estimation in the shuffle model.

Protocol | No. Messages per User | No. Bits per Message | Expected Mean Squared Error | Condition
privacy-amplification (Balle et al., 2019b) | 1 | O(log k) | O(ϵ²·log(1/δ) / (ϵ²(n−1) − k·log(1/δ))²) | √(14k·log(2/δ)/(n−1)) < ϵ ≤ 1
SOLH (Wang et al., 2020) | 1 | O(log k) | O(log(1/δ) / (ϵ²n² − log(1/δ))) | √(14·log(2/δ)/(n−1)) < ϵ ≤ 1
truncation-based (Balcer and Cheu, 2019) | 1 + O(k) | O(log k) | O(log(1/δ) / (ϵ²n²)) | e^{−O(nϵ²)} ≤ δ < 1/n
private-coin (Ghazi et al., 2021) | 1 + O((1/ϵ²)·log(1/(ϵδ))) | O(log n · log k) | O(log(1/δ) / (ϵ²n²)) | ϵ < 1, δ < 1/log k
public-coin (Ghazi et al., 2021) | 1 + O((1/ϵ²)·log³k·log((1/δ)·log k)) | O(log n + log log k) | O((1/n)·log(log(k)/δ)·log²k) | ϵ < 1, δ < 1/log k
correlated-noise (Ghazi et al., 2020b) | 1 + O((k/(nϵ²))·log²(1/δ)) | O(log k) | O(1/(ϵ²n²)) | ϵ ≤ O(1)
mixDUMP | 1 + O((k/(nϵ²))·log(1/δ)) | O(log k) | O((1/(ϵ²n²))·(1 + k/(e^ϵ − 1))²·log(1/δ)) | ϵ ≤ 1
pureDUMP | 1 + O((k/(nϵ²))·log(1/δ)) | O(log k) | O((1/(ϵ²n²))·log(1/δ)) | ϵ ≤ 1
Table 2. Comparison of results for histogram estimation. Each user is assumed to hold one value from 𝔻.

4.2. Dummy Blanket

The blanket intuition was first proposed by Balle et al. (Balle et al., 2019b). They decompose the probability distribution of the data randomizer's output into two distributions to show the privacy enhancement of the shuffle model: one depends on the input value, and the other is independently random. Consider feeding a dataset {x_1, x_2, …, x_n} into the data randomizer. The output histogram Y can be decomposed as Y = Y_1 ∪ Y_2, where Y_1 is the histogram formed by independently random data and Y_2 is the histogram formed by unchanged sensitive data. Suppose the analyst knows {x_1, x_2, …, x_{n−1}}, and x_n remains unchanged. To break the privacy of x_n from the dataset sent by the shuffler, the analyst effectively has to analyze Y_1 ∪ {x_n}. Although x_n is not randomized, Y_1 hides x_n like a random blanket.

In this paper, we propose the concept of the dummy blanket to take further advantage of the shuffle model; we only consider uniformly distributed dummy blankets. Whether or not he participates in randomization, each user sends s uniformly generated dummy points to the shuffler. The ns dummy points in total form a histogram Y_3 that hides each user's data like a dummy blanket. Under the aforementioned assumptions, the analyst now needs to distinguish x_n from Y_1 ∪ Y_3 ∪ {x_n}. The dummy blanket provides the same privacy guarantee for each user, while the communication overhead is shared by all users; this brings a significant utility enhancement when there are a large number of users. Moreover, the dummy blanket does not conflict with any randomization mechanism: it can be wrapped around the blanket formed by any data randomizer.

5. Proposed Protocols

In this section, we present the pureDUMP protocol (Section 5.1) and the mixDUMP protocol (Section 5.2). For each protocol, we analyze its privacy, accuracy, and efficiency.

5.1. pureDUMP: A Pure DUMP protocol

Design of Algorithms. In pureDUMP, all the privacy guarantee is provided by the dummy blanket. On the user side, instead of randomizing the input value x_i ∈ 𝔻, s dummy points are randomly sampled from the domain 𝔻 to hide x_i. Here s is determined by the privacy parameters once the number of users and the domain size are fixed; the effect of s on the privacy budget ϵ is discussed later in the privacy analysis. The shuffler collects all n(s+1) messages from users and randomly shuffles them before sending them to the analyst. From the received messages, the analyst approximates the frequency of each i ∈ 𝔻 with a standard debiasing step. We omit the description of the shuffling operation as it is the same as in standard shuffle-based mechanisms. The user-side and analyst-side algorithms are given in Algorithm 1 and Algorithm 2, respectively.

Algorithm 1 User-side Algorithm of pureDUMP
1: procedure P(x_i):
2: Input: x_i ∈ 𝔻, parameters n, k, s ∈ ℕ⁺
3: Output: multiset Y_i ⊆ 𝔻
4:   for j ← 1 to s do  ▷ Generate dummy points
5:     y_{i,j} ← Unif([k])
6:   end for
7:   Mix x_i with {y_{i,1}, …, y_{i,s}} to get Y_i := {y_{i,1}, …, y_{i,s+1}}
8:   return Y_i
9: end procedure

Algorithm 2 Analyst-side Algorithm of pureDUMP
1: procedure A(Y_1, …, Y_n):
2: Input: multisets {y_{1,1}, …, y_{1,s+1}}, …, {y_{n,1}, …, y_{n,s+1}} ⊆ 𝔻, parameters n, k, s ∈ ℕ⁺
3: Output: vector z⃗ ∈ [0, 1]^k  ▷ Frequencies of the k elements
4:   for m ← 1 to k do
5:     z⃗[m] = (1/n)·(Σ_{i∈[n], j∈[s+1]} 𝟙_{m = y_{i,j}} − ns/k)
6:   end for
7:   return z⃗
8: end procedure
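For concreteness, the following is a minimal runnable Python sketch of Algorithms 1 and 2 (NumPy-based; the function names and parameter values are ours and purely illustrative):

import numpy as np

def pure_dump_user(x, k, s, rng):
    # Algorithm 1: mix the true value with s uniform dummy points.
    return np.append(rng.integers(1, k + 1, size=s), x)

def pure_dump_analyst(messages, n, k, s):
    # Algorithm 2: debias each bin by the expected ns/k dummy points.
    counts = np.bincount(messages, minlength=k + 1)[1:]
    return (counts - n * s / k) / n

rng = np.random.default_rng(0)
n, k, s = 100_000, 100, 2                       # illustrative parameters
data = rng.integers(1, k + 1, size=n)           # synthetic user values
msgs = np.concatenate([pure_dump_user(x, k, s, rng) for x in data])
rng.shuffle(msgs)                               # the shuffler
z = pure_dump_analyst(msgs, n, k, s)
true_freq = np.bincount(data, minlength=k + 1)[1:] / n
print(np.abs(z - true_freq).max())              # small estimation error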

Privacy Analysis. We now analyze the privacy guarantee of pureDUMP. First, we show how much privacy the dummy blanket provides against the analyst. We focus on the regime where the count of each element in the dummy dataset is much smaller than the total number of dummy points, which captures numerous practical scenarios since the size of the data domain 𝔻 is usually large. Our privacy analysis is therefore based on Lemma 5.1. In fact, the privacy analyses in (Balle et al., 2019b; Wang et al., 2020) also need to rely on Lemma 5.1, although this is not made explicit there.

Lemma 5.1.

Let a_1, a_2, …, a_m be random variables that follow binomial distributions with a_1 + a_2 + … + a_m = N. For any i, j ∈ [1, m], i ≠ j, a_i and a_j can be approximately treated as mutually independent when a_1, a_2, …, a_m ≪ N.

The proof of Lemma 5.1 is in Appendix B.1. Then we show the analysis result of pureDUMP in Theorem 5.2.

Theorem 5.2.

PureDUMP satisfies (ϵ_d, δ_d)-DP against the analyst for any k, s, n ∈ ℕ⁺, ϵ_d ∈ (0, 1], and δ_d ∈ (0, 0.2907], where

ϵ_d = √( 14k·ln(2/δ_d) / (|𝒮| − 1) ),  |𝒮| = ns
Proof.

(Sketch) For any two neighboring datasets 𝒟 and 𝒟′, w.l.o.g. assume they differ in the n-th value, with x_n = 1 in 𝒟 and x′_n = 2 in 𝒟′. Let 𝒮 and 𝒮′ be two dummy datasets with |𝒮| = |𝒮′| = ns dummy points sent by users. What the analyst receives is a mixed dataset 𝒱, a mixture of the user dataset 𝒟 (or 𝒟′) and the dummy dataset 𝒮 (or 𝒮′); we write 𝒱 = 𝒟 ∪ 𝒮 (or 𝒱 = 𝒟′ ∪ 𝒮′). Denote by s_j the number of occurrences of value j in the dummy dataset. To produce the same 𝒱, a dummy point must equal 2 in 𝒮 but 1 in 𝒮′. Thus the counts of the k values among the other |𝒮| − 1 dummy points are [s_1 − 1, s_2, …, s_k] in 𝒮 and [s_1, s_2 − 1, …, s_k] in 𝒮′. Writing C(N; a_1, …, a_k) for the multinomial coefficient, we have

Pr[𝒟 ∪ 𝒮 = 𝒱] / Pr[𝒟′ ∪ 𝒮′ = 𝒱] = C(|𝒮| − 1; s_1 − 1, s_2, …, s_k) / C(|𝒮| − 1; s_1, s_2 − 1, …, s_k) = s_1 / s_2

As all dummy points are sampled uniformly at random from 𝔻, both s_1 and s_2 follow binomial distributions: s_1 follows Bin(|𝒮| − 1, 1/k) + 1 and s_2 follows Bin(|𝒮| − 1, 1/k). Denote by S_1 the distribution Bin(|𝒮| − 1, 1/k) + 1 and by S_2 the distribution Bin(|𝒮| − 1, 1/k). By Lemma 5.1, s_1 and s_2 can be treated as mutually independent. The probability that pureDUMP does not satisfy ϵ_d-DP is then

Pr[ Pr[𝒟 ∪ 𝒮 = 𝒱] / Pr[𝒟′ ∪ 𝒮′ = 𝒱] ≥ e^{ϵ_d} ] = Pr[ S_1/S_2 ≥ e^{ϵ_d} ] ≤ δ_d

Let c = E[S_2] = (|𝒮| − 1)/k. The event S_1/S_2 ≥ e^{ϵ_d} can be covered by two (overlapping) events, S_1 ≥ c·e^{ϵ_d/2} and S_2 ≤ c·e^{−ϵ_d/2}. By the multiplicative Chernoff bound, we have

Pr[S_1/S_2 ≥ e^{ϵ_d}] ≤ Pr[S_1 ≥ c·e^{ϵ_d/2}] + Pr[S_2 ≤ c·e^{−ϵ_d/2}]
  ≤ e^{−(c/3)(e^{ϵ_d} − 1 − 1/c)²} + e^{−(c/2)(1 − e^{−ϵ_d/2})²} ≤ δ_d

Bounding each of e^{−(c/3)(e^{ϵ_d} − 1 − 1/c)²} and e^{−(c/2)(1 − e^{−ϵ_d/2})²} by δ_d/2, we can deduce that c ≥ 14·ln(2/δ_d)/ϵ_d² always suffices when ϵ_d ≤ 1 and δ_d ≤ 0.2907. Substituting c = (|𝒮| − 1)/k into c ≥ 14·ln(2/δ_d)/ϵ_d² yields the upper bound on ϵ_d.

The detailed proof is deferred to Appendix B.2. ∎
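Inverting the bound in Theorem 5.2 gives the number of dummy points each user needs for a target (ϵ_d, δ_d); a small sketch (the helper name is ours, and the ceiling makes it slightly conservative):

import math

def dummy_points_per_user(n, k, eps, delta):
    # From eps = sqrt(14k * ln(2/delta) / (ns - 1)):
    # require ns >= 14k * ln(2/delta) / eps^2 + 1, shared by n users.
    total = 14 * k * math.log(2 / delta) / eps**2 + 1
    return math.ceil(total / n)

print(dummy_points_per_user(n=10**6, k=100, eps=1.0, delta=1e-6))  # -> 1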

Then, we show that when the shuffler colludes with the analyst, each individual user’s data can also be protected in pureDUMP.

Corollary 5.3.

PureDUMP satisfies (ϵ_l, δ_l)-LDP against the shuffler for any k, s ∈ ℕ⁺, ϵ_l ∈ (0, 1], and δ_l ∈ (0, 0.2907], where

ϵ_l = √( 14k·ln(2/δ_l) / (s − 1) )
Proof.

The proof is similar to that of Theorem 5.2, except that each user's sensitive data is protected only by the s dummy points generated by its owner. More specifically, the LDP protection is provided by a dummy blanket composed of s dummy points, even before the shuffling operation. ∎

Accuracy Analysis. We analyze the accuracy of pureDUMP by measuring the Mean Squared Error (MSE) of the estimation results:

(2)  MSE = (1/k)·Σ_{v∈𝔻} E[(f̃_v − f_v)²]

where f_v is the true frequency of element v ∈ 𝔻 in the users' dataset and f̃_v is its estimate. We first show that f̃_v is unbiased in pureDUMP (Lemma 5.4).

Lemma 5.4.

For any v ∈ 𝔻, the estimate f̃_v in pureDUMP is an unbiased estimate of f_v.

Proof.

E[f̃_v] = E[ (1/n)·( Σ_{i∈[n], j∈[s+1]} 𝟙_{v = y_{i,j}} − ns/k ) ]
  = (1/n)·E[ Σ_{i∈[n], j∈[s+1]} 𝟙_{v = y_{i,j}} ] − s/k
  = (1/n)·(n·f_v + ns/k) − s/k
  = f_v  ∎

Since f̃_v is an unbiased estimate of f_v, the MSE of f̃_v in pureDUMP equals the variance of f̃_v by Equation (2):

(3)  MSE = (1/k)·Σ_{v∈𝔻} E[(f̃_v − f_v)²] = (1/k)·Σ_{v∈𝔻} ( Var[f̃_v] + (E[f̃_v] − f_v)² ) = (1/k)·Σ_{v∈𝔻} Var[f̃_v]

Now we calculate the MSE of pureDUMP in Theorem 5.5.

Theorem 5.5.

The MSE of frequency estimation in pureDUMP is

s(k − 1) / (nk²).

Proof.

Var[f̃_v] = Var[ (1/n)·( Σ_{i∈[n], j∈[s+1]} 𝟙_{v = y_{i,j}} − ns/k ) ]
  = (1/n²)·Var[ Σ_{i∈[n], j∈[s+1]} 𝟙_{v = y_{i,j}} ]
  = (1/n²)·( (1/k)·(1 − 1/k)·ns )
  = s(k − 1) / (nk²)

According to Equation (3), we have

MSE = (1/k)·Σ_{v∈𝔻} E[(f̃_v − f_v)²] = (1/k)·Σ_{v∈𝔻} Var[f̃_v] = s(k − 1) / (nk²)  ∎
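Theorem 5.5 can be checked empirically; below is a short simulation comparing the measured MSE with s(k − 1)/(nk²) (all parameter choices here are illustrative):

import numpy as np

rng = np.random.default_rng(1)
n, k, s, trials = 10_000, 50, 5, 200
data = rng.integers(0, k, size=n)               # domain indexed 0..k-1 here
f = np.bincount(data, minlength=k) / n          # true frequencies

errs = []
for _ in range(trials):
    dummies = rng.integers(0, k, size=n * s)
    counts = np.bincount(np.concatenate([data, dummies]), minlength=k)
    est = (counts - n * s / k) / n              # Algorithm 2's debiasing
    errs.append(np.mean((est - f) ** 2))

print(np.mean(errs), s * (k - 1) / (n * k**2))  # empirical vs. theoretical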

Efficiency Analysis. We now analyze the efficiency of pureDUMP. For an input x_i ∈ 𝔻, the user-side output P(x_i) consists of s + 1 messages of length O(log k) bits, and the user-side runtime is at most O(s). Moreover, s is at most O(k·ln(1/δ)/(nϵ²)) when pureDUMP satisfies (ϵ, δ)-DP on the analyst side. The shuffler transmits n(s+1) messages to the analyst using O(ns·log k) space. The analyst's runtime on the n(s+1) input messages is at most O(kns), and its output takes O(k·log(n(s+1))) bits.

According to Theorem 5.5, the expected MSE of pureDUMP is O(log(1/δ)/(ϵ²n²)). Note that the domain size has almost no effect on the estimation error, because more dummy points are generated to offset the error caused by a larger domain. When the user side relies only on a data randomizer satisfying ϵ_l-LDP for protection, the results in (Balle et al., 2019b) imply an expected MSE of O(ϵ²·log(1/δ)/(ϵ²(n−1) − k·log(1/δ))²). By comparison, the estimation error of pureDUMP is always smaller than that of a method using only data randomization on the user side, which shows that the dummy blanket dominates randomization in the shuffle model.

5.2. mixDUMP: A DUMP Protocol with GRR

Design of Algorithms. mixDUMP utilizes GRR as the data randomizer to provide a privacy guarantee along with dummy points. The details are given in Algorithm 3. Note that the GRR mechanism used in mixDUMP differs slightly in form from that in (Wang et al., 2017). To analyze the privacy benefit with the blanket idea, we use the technique proposed in (Balle et al., 2019b) to decompose GRR as

∀ x_r ∈ 𝔻:  Pr[GRR(x) = x_r] = (1 − λ)·Pr[x = x_r] + λ·Pr[Unif([k]) = x_r]

where Pr[x = x_r] is the real data distribution, which depends on x, and Pr[Unif([k]) = x_r] is the uniform distribution. With probability 1 − λ the output is the true input, and with probability λ the output is sampled uniformly at random from the data domain. Given the probability q = 1/(e^{ϵ_l} + k − 1) in GRR, we have

λ = kq = k / (e^{ϵ_l} + k − 1)

where ϵ_l is the privacy budget of LDP.
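The decomposition can be verified numerically: with λ = kq, reporting the true value happens with probability (1 − λ) + λ/k = p, and any other fixed value with probability λ/k = q. A tiny sketch:

import math

k, eps_l = 10, 1.0
p = math.exp(eps_l) / (math.exp(eps_l) + k - 1)
q = 1 / (math.exp(eps_l) + k - 1)
lam = k * q                                    # lambda = kq

assert math.isclose((1 - lam) + lam / k, p)    # output equals the input
assert math.isclose(lam / k, q)                # any other fixed output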

The procedure for generating dummy points is the same as in Algorithm 1, and as before we omit the description of the shuffling operation. The user-side and analyst-side algorithms are shown in Algorithm 4 and Algorithm 5, respectively.

Algorithm 3 Data Randomizer of mixDUMP
1: procedure R(x_i):
2: Input: x_i ∈ 𝔻, parameters k ∈ ℕ⁺ and λ ∈ (0, 1]
3: Output: x_r ∈ 𝔻
4:   b ← Ber(λ)  ▷ Randomize the data with probability λ
5:   if b = 0 then
6:     x_r ← x_i
7:   else
8:     x_r ← Unif([k])
9:   end if
10:   return x_r
11: end procedure

Privacy Analysis. We show how much of the privacy guarantee is provided by randomization and by the dummy blanket, for comparison with pureDUMP. In particular, we analyze the privacy amplification provided by the dummy blanket locally, to distinguish our design from work that adds dummy points on the shuffler side (Wang et al., 2020). We first show the privacy guarantee against the analyst in Theorem 5.6.

Theorem 5.6.

MixDUMP satisfies (ϵ_s, δ_s)-DP for any k, s, n ∈ ℕ⁺, ϵ_s ∈ (0, 1], λ ∈ (0, 1], and δ_s ∈ (0, 0.5814], where

ϵ_s = √( 14k·ln(4/δ_s) / (|𝒮| + (n−1)λ − √(2(n−1)λ·ln(2/δ_s)) − 1) ),  |𝒮| = ns
Proof.

(Sketch) The setup is the same as in Theorem 5.2: the n-th value is x_n = 1 in 𝒟 and x′_n = 2 in its neighboring dataset 𝒟′. Denote by R(𝒟) (or R(𝒟′)) the randomized user dataset {R(x_1), …, R(x_n)} (or {R(x_1), …, R(x′_n)}). If the n-th user randomizes his data, we immediately get

Pr[R(𝒟) ∪ 𝒮 = 𝒱] / Pr[R(𝒟′) ∪ 𝒮′ = 𝒱] = 1

So we focus on the case where the n-th user does not randomize his data. Denote by ℋ the set of users who randomize in Algorithm 3, with size |ℋ|. Since both the randomized users' data and the dummy points follow the uniform distribution, mixDUMP can be regarded as pureDUMP with the dummy dataset ℋ ∪ 𝒮. By Theorem 5.2, mixDUMP satisfies (ϵ_H, δ_s/2)-DP, i.e.,

Pr[R(𝒟) ∪ 𝒮 = 𝒱] ≤ e^{ϵ_H}·Pr[R(𝒟′) ∪ 𝒮′ = 𝒱] + δ_s/2

where ϵ_H = √( 14k·ln(4/δ_s) / (|ℋ| + |𝒮| − 1) ) and δ_s ≤ 0.5814. Since |ℋ| follows Bin(n−1, λ), the Chernoff bound gives

Pr[ |ℋ| < (1 − γ)·(n−1)λ ] < δ_s/2

where γ = √( 2·ln(2/δ_s) / ((n−1)λ) ). Let t = (n−1)λ − √( 2(n−1)λ·ln(2/δ_s) ); then

Pr[R(𝒟) ∪ 𝒮 = 𝒱]
  ≤ Pr[R(𝒟) ∪ 𝒮 = 𝒱 ∧ |ℋ| ≥ t] + δ_s/2
  ≤ ( Σ_{h≥t} Pr[R(𝒟) ∪ 𝒮 = 𝒱 | |ℋ| = h]·Pr[|ℋ| = h] ) + δ_s/2
  ≤ e^{√( 14k·ln(4/δ_s) / (t + |𝒮| − 1) )}·Pr[R(𝒟′) ∪ 𝒮′ = 𝒱] + δ_s

The detailed proof is in Appendix B.3. ∎
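For intuition about the bound, here is a small sketch that evaluates Theorem 5.6 directly (the helper name and parameter values are ours and purely illustrative):

import math

def eps_mixdump(n, k, s, eps_l, delta_s):
    # Direct transcription of Theorem 5.6, with lambda = k/(e^eps_l + k - 1).
    lam = k / (math.exp(eps_l) + k - 1)
    S = n * s
    denom = S + (n - 1) * lam \
        - math.sqrt(2 * (n - 1) * lam * math.log(2 / delta_s)) - 1
    return math.sqrt(14 * k * math.log(4 / delta_s) / denom)

print(eps_mixdump(n=10**5, k=100, s=1, eps_l=4.0, delta_s=1e-6))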

Algorithm 4 User-side Algorithm of mixDUMP
1: procedure M(x_i):
2: Input: x_i ∈ 𝔻, parameters n, k, s ∈ ℕ⁺ and λ ∈ (0, 1]
3: Output: multiset Y_i ⊆ 𝔻
4:   x_r ← R(x_i)  ▷ Randomize the real data
5:   Y_i ← P(x_r)  ▷ Mix x_r with s dummy points
6:   return Y_i
7: end procedure

Algorithm 5 Analyst-side Algorithm of mixDUMP
1: procedure A(Y_1, …, Y_n):
2: Input: multisets {y_{1,1}, …, y_{1,s+1}}, …, {y_{n,1}, …, y_{n,s+1}} ⊆ 𝔻, parameters n, k, s ∈ ℕ, λ ∈ (0, 1]
3: Output: vector z⃗ ∈ [0, 1]^k  ▷ Frequencies of the k elements
4:   for m ← 1 to k do
5:     z⃗[m] = (1/n)·( Σ_{i∈[n], j∈[s+1]} 𝟙_{m = y_{i,j}} − n(λ + s)/k ) / (1 − λ)
6:   end for
7:   return z⃗
8: end procedure

Next, we analyze the LDP guarantee mixDUMP provides when the shuffler colludes with the analyst. Unlike (Wang et al., 2020), our design obtains privacy benefits from dummy points before the shuffling operation, which makes users less dependent on trusting the shuffler. When the data randomizer satisfies ϵ_l-LDP, the introduced dummy points achieve an enhanced LDP guarantee with a privacy budget smaller than ϵ_l. The theoretical analysis is given in Theorem 5.7.

Theorem 5.7.

If the data randomizer satisfies ϵ_l-LDP, then after introducing s dummy points, (ϵ_r, δ_r)-LDP is achieved for any k/s ≤ min( 1/(2·ln(1/δ_r)), (√(c² + 4c/√(2·ln(1/δ_r))) − c)² ), where c = √(2·ln(1/δ_r)) · e^{ϵ_l}(e^{ϵ_l} + 1) / (s(e^{ϵ_l} + 1) + e^{ϵ_l}) and

ϵ_r = ln( k / (1 − √(2k·ln(1/δ_r)/s)) · (1 + e^{ϵ_l} / (s(e^{ϵ_l} + 1))) )
Proof.

(Sketch) Denote by [n_1, …, n_k] the counts of the k values in the i-th user's report set 𝒱_i, where Σ_{i=1}^{k} n_i = s + 1. Denote by 𝒮_i (or 𝒮′_i) the dummy dataset generated by the i-th user, and let v be the true value held by the i-th user. Following the definition of LDP, we compare the probabilities that the shuffler receives the same set 𝒱_i with value counts [n_1, …, n_k] when v = 1 and when v = 2, and prove

Pr[R(v=1) ∪ 𝒮_i = 𝒱_i] ≤ e^{ϵ_r}·Pr[R(v=2) ∪ 𝒮′_i = 𝒱_i] + δ_r

According to Equation (1) of the GRR mechanism, we have

Pr[R(v=1) ∪ 𝒮_i = 𝒱_i] / Pr[R(v=2) ∪ 𝒮′_i = 𝒱_i] ≤ (e^{ϵ_l}·n_1 + n_2) / (e^{ϵ_l}·n_2 + n_1)

Both n_1 and n_2 follow binomial distributions: n_1 follows Bin(s, 1/k) + 1 and n_2 follows Bin(s, 1/k). By Lemma 5.1, n_1 and n_2 can be treated as mutually independent. Then we have

Pr[R(v=1) ∪ 𝒮_i = 𝒱_i] / Pr[R(v=2) ∪ 𝒮′_i = 𝒱_i] ≤ Bin(s, 1/k)/Bin(s, 1/k) + e^{ϵ_l} / ( Bin(s, 1/k)·(e^{ϵ_l} + 1) )

According to the Chernoff bound, we have

Pr[ Bin(s, 1/k) ≥ (s/k)·(1 − √(2k·ln(1/δ_r)/s)) ] ≥ 1 − δ_r

Note that 1 − √(2k·ln(1/δ_r)/s) ≥ 0 requires k/s ≤ 1/(2·ln(1/δ_r)). As Bin(s, 1/k) ∈ [ (s/k)·(1 − √(2k·ln(1/δ_r)/s)), s ] with probability larger than 1 − δ_r, we have

Pr[R(v=1) ∪ 𝒮_i = 𝒱_i] / Pr[R(v=2) ∪ 𝒮′_i = 𝒱_i] ≤ k / (1 − √(2k·ln(1/δ_r)/s)) · (1 + e^{ϵ_l} / (s(e^{ϵ_l} + 1)))

When k/s ≤ (√(c² + 4c/√(2·ln(1/δ_r))) − c)², this bound is at most e^{ϵ_l}, where c = √(2·ln(1/δ_r)) · e^{ϵ_l}(e^{ϵ_l} + 1) / (s(e^{ϵ_l} + 1) + e^{ϵ_l}).

The detailed proof is in Appendix B.5. ∎

Figure 2 shows the theoretical result of Theorem 5.7. We set ϵ_l = 5 and δ_r = 10⁻⁴ and plot the privacy budget ϵ_r against the number of dummy points for three sizes of the data domain. We can observe that dummy points significantly reduce the upper bound of the LDP privacy budget, so enhanced privacy protection is provided even when the shuffler colludes with the analyst.

Figure 2. Enhancement of LDP privacy protection by dummy points, where ϵ_l = 5 and δ_r = 10⁻⁴.
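A short sketch evaluating ϵ_r from Theorem 5.7 under the settings of Figure 2 (ϵ_l = 5, δ_r = 10⁻⁴; the domain size k = 20 and the values of s are our illustrative choices, and the formula applies only where the theorem's conditions on k/s hold):

import math

def eps_r(k, s, eps_l, delta_r):
    # ln( k / (1 - sqrt(2k ln(1/delta_r)/s)) * (1 + e^eps_l/(s(e^eps_l+1))) )
    inner = 1 + math.exp(eps_l) / (s * (math.exp(eps_l) + 1))
    return math.log(k / (1 - math.sqrt(2 * k * math.log(1 / delta_r) / s)) * inner)

for s in (5_000, 20_000, 100_000):
    print(s, round(eps_r(k=20, s=s, eps_l=5.0, delta_r=1e-4), 3))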

Accuracy Analysis. We again use the MSE as the accuracy metric for mixDUMP. As before, we first show that f̃_v is unbiased in mixDUMP (Lemma 5.8).

Lemma 5.8.

For any v ∈ 𝔻, the estimate f̃_v in mixDUMP is an unbiased estimate of f_v.

The proof of Lemma 5.8 is deferred to Appendix B.4. Consequently, the MSE of frequency estimation equals the variance of f̃_v in mixDUMP. We give the MSE of mixDUMP in Theorem 5.9.

Theorem 5.9.

The MSE of frequency estimation in mixDUMP is

(1/n)·(e^{ϵ_l} + k − 2)/(e^{ϵ_l} − 1)² + (s(k − 1)/(nk²))·((e^{ϵ_l} + k − 1)/(e^{ϵ_l} − 1))²

where ϵ_l is the privacy budget of GRR in mixDUMP.

Proof.

We give the estimation function in Algorithm 5. With p=eϵleϵl+k1,q=1eϵl+k1p=\frac{e^{\epsilon_{l}}}{e^{\epsilon_{l}}+k-1},\ q=\frac{1}{e^{\epsilon_{l}}+k-1} in Equation (1), and λ=kq\lambda=kq, we replace λ\lambda with qq to simplify the derivation. We can get the variance of f~v\tilde{f}_{v} is

Var[f~v]\displaystyle Var[\tilde{f}_{v}] =Var[1ni[n],j[s+1]𝟙{v=yi,j}ns1knqpq]\displaystyle=Var[\frac{1}{n}\cdot\frac{\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-ns\cdot\frac{1}{k}-nq}{p-q}]
=1n21(pq)2Var[i[n],j[s+1]𝟙{v=yi,j}]\displaystyle=\frac{1}{n^{2}}\cdot\frac{1}{(p-q)^{2}}Var[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}]
=nfvp(1p)+n(1fv)q(1q)+ns(k1)/k2n2(pq)2\displaystyle=\frac{nf_{v}p(1-p)+n(1-f_{v})q(1-q)+ns(k-1)/k^{2}}{n^{2}(p-q)^{2}}
=nq(1q)+nfv(p(1p)q(1q))+ns(k1)/k2n2(pq)2\displaystyle=\frac{nq(1-q)+nf_{v}(p(1-p)-q(1-q))+ns(k-1)/k^{2}}{n^{2}(p-q)^{2}}

As in (Wang et al., 2017), fvf_{v} is assumed to be small on average, so we have

Var[f~v]\displaystyle Var[\tilde{f}_{v}] 1n2(pq)2(nq(1q)+ns(k1)k2)\displaystyle\simeq\frac{1}{n^{2}(p-q)^{2}}(nq(1-q)+\frac{ns(k-1)}{k^{2}})
=1neϵl+k2(eϵl1)2+s(k1)nk2(eϵl+k1eϵl1)2\displaystyle=\frac{1}{n}\cdot\frac{e^{\epsilon_{l}}+k-2}{(e^{\epsilon_{l}}-1)^{2}}+\frac{s(k-1)}{nk^{2}}\cdot(\frac{e^{\epsilon_{l}}+k-1}{e^{\epsilon_{l}}-1})^{2}

According to Equation (3) we have

MSE=\displaystyle MSE= 1kv𝔻E[(f~vfv)2]=1kv𝔻Var(f~v)=\displaystyle\frac{1}{k}\sum_{v\in\mathbb{D}}E[(\tilde{f}_{v}-f_{v})^{2}]=\frac{1}{k}\sum_{v\in\mathbb{D}}Var(\tilde{f}_{v})=
1neϵl+k2(eϵl1)2+s(k1)nk2(eϵl+k1eϵl1)2\displaystyle\frac{1}{n}\cdot\frac{e^{\epsilon_{l}}+k-2}{(e^{\epsilon_{l}}-1)^{2}}+\frac{s(k-1)}{nk^{2}}\cdot(\frac{e^{\epsilon_{l}}+k-1}{e^{\epsilon_{l}}-1})^{2}
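
As a quick sanity check, the closed-form MSE above is easy to evaluate numerically. The sketch below is our own illustrative code with hypothetical parameter values; it separates the GRR term and the dummy-point term of Theorem 5.9.

import math

def mixdump_mse(n, k, s, eps_l):
    # Closed-form MSE of mixDUMP from Theorem 5.9 (illustrative sketch).
    e = math.exp(eps_l)
    grr_term = (e + k - 2) / (n * (e - 1) ** 2)  # noise from GRR
    dummy_term = (s * (k - 1) / (n * k ** 2)) * ((e + k - 1) / (e - 1)) ** 2
    return grr_term + dummy_term

# Hypothetical setting: n = 500,000 users, k = 50, eps_l = 8, s = 1 dummy each.
print(mixdump_mse(n=500_000, k=50, s=1, eps_l=8.0))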

Efficiency Analysis. Next, we analyze the efficiency of mixDUMP. For an input xi𝔻x_{i}\in\mathbb{D}, the output of the user side M(xi)M(x_{i}) is s+1s+1 messages of length O(logk)O(\log k) bits each. Since the amplification is applied twice in the process of proving the upper bound of the privacy budget in mixDUMP, the advantage of mixDUMP over pureDUMP in the number of messages is weakened. When mixDUMP satisfies (ϵ,δ)(\epsilon,\delta)-DP, ss is at most O(14kln(4/δ)/(nϵ2)λ(12ln(2/δ)/(λn)))O(14k\ln(4/\delta)/(n\epsilon^{2})-\lambda(1-\sqrt{2\ln(2/\delta)/(\lambda n)})). The upper bound of ss can be approximated as O(14kln(4/δ)/(nϵ2))O(14k\ln(4/\delta)/(n\epsilon^{2})) when ϵ\epsilon is close to 0.

The shuffler transmits n(s+1)n(s+1) messages to the analyst, occupying O(nslogk)O(ns\log k) bits. The analyst outputs O(klog(n(s+1)))O(k\log(n(s+1))) bits estimated from the n(s+1)n(s+1) messages, with expected MSE O(ln(1/δ)/(nϵ(1λ))2)O(\ln(1/\delta)/(n\epsilon(1-\lambda))^{2}). Note that λ=k/(eϵl+k1)\lambda=k/(e^{\epsilon_{l}}+k-1) is the probability that each user randomizes its data. As ϵlϵ\epsilon_{l}\geq\epsilon, we can bound λk/(eϵ+k1)\lambda\leq k/(e^{\epsilon}+k-1). Then, the estimation error can be approximated as O(1ϵ2n2(1+keϵ1)2log1δ)O(\frac{1}{\epsilon^{2}n^{2}}\cdot(1+\frac{k}{e^{\epsilon}-1})^{2}\cdot\log\frac{1}{\delta}).

5.3. Analysis for Practical Implementations

We propose two protocols for privacy-preserving histogram estimation in Subsection 5.1 and Subsection 5.2. Under the assumption that each user sends ss dummy points, we analyze the performance of the proposed protocols theoretically. In practical implementations, some users might be limited by network bandwidth or computation capability (e.g., running on mobile devices or sensors), and some users may even refuse to send additional dummy points. Taking these conditions into account, we also analyze the privacy and accuracy of the proposed protocols under a more realistic assumption.

The weak assumption of the flexible DUMP protocols is that each user sends dummy points only with probability γ\gamma. More specifically, each user uniformly samples ss random dummy points from 𝔻\mathbb{D} with probability γ\gamma, and only sends one message with probability 1γ1-\gamma; a minimal sketch of this user-side behavior is given below. In practical implementations, γ\gamma can be obtained by surveying users' willingness or referring to the collectors' past experience. In the following part, we first answer how much the weak assumption affects the privacy of protocols in the DUMP framework, by proving the following lemma.
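
The sketch below illustrates the flexible user side under the weak assumption, in the pureDUMP flavor where the true value is sent alongside the dummy points; the function and parameter names are our own.

import numpy as np

rng = np.random.default_rng()

def flexible_user_report(x, k, s, gamma):
    # User-side report under the weak assumption (pureDUMP flavor):
    # with probability gamma, send the true value x plus s dummy points
    # sampled uniformly from the domain {0, ..., k-1}; otherwise send x alone.
    if rng.random() < gamma:
        dummies = rng.integers(0, k, size=s)
        return [x] + dummies.tolist()
    return [x]

# Example: domain size k = 50, s = 13 dummy points, gamma = 0.01.
print(flexible_user_report(x=7, k=50, s=13, gamma=0.01))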

Lemma 5.10.

For any DUMP protocols that satisfy (ϵ,δ)(\epsilon,\ \delta)-DP for any n+n\in\mathbb{N}^{+} and γ(0,1]\gamma\in(0,1], the flexible DUMP protocols satisfy (ϵln(1eγn),δ+eγn)(\epsilon-\ln(1-e^{-\gamma n}),\ \delta+e^{-\gamma n})-DP.

Proof.

The probability that no user generates dummy points is Pr[A]=(1γ)neγn\Pr[A]=(1-\gamma)^{n}\leq e^{-\gamma n}, so Pr[A¯]1eγn\Pr[\overline{A}]\geq 1-e^{-\gamma n}. According to the results of Theorem 5.2 and Theorem 5.6, the DUMP protocols satisfy

Pr[R(𝒟)𝒮=𝒱|A¯]eϵPr[R(𝒟)𝒮=𝒱|A¯]+δ\displaystyle\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}|\overline{A}]\leq e^{\epsilon}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}|\overline{A}]+\delta

where R(𝒟)R(\mathcal{D}) and R(𝒟)R(\mathcal{D}^{\prime}) are outputs of data randomizers, 𝒮\mathcal{S} and 𝒮\mathcal{S}^{\prime} are sets of dummy points. Then we have

Pr[R(𝒟)𝒮=𝒱]\displaystyle\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}]
=Pr[R(𝒟)𝒮=𝒱|A¯]Pr[A¯]+Pr[R(𝒟)𝒮=𝒱|A]Pr[A]\displaystyle=\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}|\overline{A}]\Pr[\overline{A}]+\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}|A]\Pr[A]
Pr[R(𝒟)𝒮=𝒱|A¯]+eγn\displaystyle\leq\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}|\overline{A}]+e^{-\gamma n}
eϵPr[R(𝒟)𝒮=𝒱|A¯]+δ+eγn\displaystyle\leq e^{\epsilon}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}|\overline{A}]+\delta+e^{-\gamma n}
1Pr[A¯]eϵPr[R(𝒟)𝒮=𝒱]+δ+eγn\displaystyle\leq\frac{1}{\Pr[\overline{A}]}\cdot e^{\epsilon}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]+\delta+e^{-\gamma n}
11eγneϵPr[R(𝒟)𝒮=𝒱]+δ+eγn\displaystyle\leq\frac{1}{1-e^{-\gamma n}}\cdot e^{\epsilon}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]+\delta+e^{-\gamma n}

According to Lemma 5.10, we can obtain Theorem 5.11 and Theorem 5.12. They show that if each user sends dummy points only with probability γ\gamma, the privacy budget and the failure probability of the DUMP protocols increase. The reason is that, under this weak assumption, there is a chance that no user sends dummy points, in which case the protocols may fail to satisfy DP. In practice this penalty is usually negligible: for instance, with n=500,000 and γ=0.001, the additive term e^{-γn}=e^{-500} is vanishingly small.

Theorem 5.11.

The flexible pureDUMP satisfies (ϵdln(1eγn),δd+eγn)(\epsilon_{d}-\ln(1-e^{-\gamma n}),\ \delta_{d}+e^{-\gamma n})-DP for any k,S,n+,γ(0,1]k,S,n\in\mathbb{N}^{+},\ \gamma\in(0,1], where δd(0,0.2907]\delta_{d}\in(0,0.2907], and ϵd=14kln(2/δd)/(S1)\epsilon_{d}=\sqrt{14k\ln(2/\delta_{d})/(S-1)}.

Theorem 5.12.

The flexible mixDUMP satisfies (ϵsln(1eγn),δs+eγn)(\epsilon_{s}-\ln(1-e^{-\gamma n}),\ \delta_{s}+e^{-\gamma n})-DP for any k,S,n+,λ(0,1],γ(0,1]k,S,n\in\mathbb{N}^{+},\ \lambda\in(0,1],\ \gamma\in(0,1], where δs(0,0.5814]\delta_{s}\in(0,0.5814], and ϵs=14kln(4/δs)(S+(n1)λ2(n1)λln(2/δs)1)\epsilon_{s}=\sqrt{\frac{14k\ln(4/\delta_{s})}{(S+(n-1)\lambda-\sqrt{2(n-1)\lambda\ln(2/\delta_{s})}-1)}}.

Next, we analyze the impact of the weak assumption on the accuracy of the DUMP protocols. As before, we first show in Lemma 5.13 and Lemma 5.14 that the estimation results of flexible pureDUMP and flexible mixDUMP are unbiased. Both the estimation functions and the proofs are deferred to Appendix B.6 and Appendix B.7.

Lemma 5.13.

The estimation result f~v\tilde{f}_{v} for any v𝔻v\in\mathbb{D} in flexible pureDUMP is an unbiased estimation of fvf_{v}.

Lemma 5.14.

The estimation result f~v\tilde{f}_{v} for any v𝔻v\in\mathbb{D} in flexible mixDUMP is an unbiased estimation of fvf_{v}.

As the weak assumption reduces the strength of privacy protection, the smaller number of dummy points also reduces the MSE of the estimation results. We show the accuracy of the flexible DUMP protocols in Theorem 5.15 and Theorem 5.16; the proofs are deferred to Appendix B.8 and Appendix B.9.

Theorem 5.15.

The MSE of frequency estimation in flexible pureDUMP is

sγ(kγ)nk2\frac{s\gamma(k-\gamma)}{nk^{2}}
Theorem 5.16.

The MSE of frequency estimation in flexible mixDUMP is

1neϵl+k2(eϵl1)2+sγ(kγ)nk2(eϵl+k1eϵl1)2\frac{1}{n}\cdot\frac{e^{\epsilon_{l}}+k-2}{(e^{\epsilon_{l}}-1)^{2}}+\frac{s\gamma(k-\gamma)}{nk^{2}}\cdot(\frac{e^{\epsilon_{l}}+k-1}{e^{\epsilon_{l}}-1})^{2}

where ϵl\epsilon_{l} is the privacy budget of GRR in mixDUMP.

6. Evaluation

6.1. Setup

We design experiments to evaluate the performance of the proposed protocols (pureDUMP and mixDUMP) and compare them with the existing protocols (truncation-based (Balcer and Cheu, 2019), private-coin (Ghazi et al., 2021), public-coin (Ghazi et al., 2021), SOLH (Wang et al., 2020), correlated-noise (Ghazi et al., 2020b)). Note that although the source code of these existing protocols, except SOLH, is not formally released, we implemented them to the best of our ability and include them for comparison.

In the experiments, we measure (1) the accuracy of the proposed protocols, i.e., how much accuracy they gain over existing shuffle protocols; (2) the communication overhead of the proposed protocols, i.e., how many messages each user needs to send, and how this compares with existing shuffle protocols; and (3) how key parameters, namely the number of users nn and the size of the domain kk, affect the accuracy of the proposed protocols. Towards these goals, we conduct experiments over both synthetic and real-world datasets. The synthetic datasets allow us to adjust the key parameters to observe their impact on the utility of the protocols, while the real-world datasets show the utility of the proposed protocols in a more practical setting.

Figure 3. Experimental results of impact of size of data domain on pureDUMP and mixDUMP, where nn is fixed to 500,000, δ=106\delta=10^{-6}. Dashed lines are theoretical results.
Figure 4. Experimental results of impact of number of users on pureDUMP and mixDUMP, where kk is fixed to 50, δ=106\delta=10^{-6}. Dashed lines are theoretical results.

Environment. All protocols are implemented in Python 3.7 with the python-xxhash 1.4.4 and numpy 1.18.1 libraries. All experiments are conducted on servers running Ubuntu 18.04.4 LTS, with 2.50GHz CPUs and 4×32GB memory.

Synthetic Datasets. We generate six synthetic datasets with different sizes of the data domain kk and numbers of users nn. In the first three datasets, we fix n=500,000n=500,000 and generate datasets with k=50k=50, k=500k=500, and k=5,000k=5,000, respectively. In the remaining three datasets, we fix k=50k=50 and generate datasets with n=5,000n=5,000, n=50,000n=50,000, and n=500,000n=500,000, respectively. All datasets follow a uniform distribution. We observe that the distribution of the datasets has little effect on the performance of the protocols, so we only show the experimental results on the uniform distribution.
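
For illustration, such a dataset can be generated in a few lines with numpy; this is our own sketch, not the released implementation.

import numpy as np

# One synthetic dataset: n = 500,000 users, each holding a value drawn
# uniformly at random from a domain of size k = 50.
rng = np.random.default_rng(seed=0)
dataset = rng.integers(0, 50, size=500_000)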

Real-World Datasets. We conduct experiments on the following two real-world datasets.

  • IPUMS (ipu, 2017): We sample 1%1\% of the US Census data from the Los Angeles and Long Beach area for the year 1970. The dataset contains 70,187 users and 61 attributes. We select the birthplace attribute for evaluation, which includes 900 different values.

  • Ratings (rat, 2017): This dataset contains 20 million ratings of movies given by 138,000 users. We sample 494,352 records covering 2,000 different movies.

Figure 5. Results of MSE varying ϵ\epsilon on the synthetic dataset, where the number of users n=500,000n=500,000, the domain size k=50k=50.
Figure 6. Number of messages sent by each user varying ϵ\epsilon on the synthetic dataset, where the number of users n=500,000n=500,000, the domain size k=50k=50.

Protocols for Comparison. We compare pureDUMP and mixDUMP with the following protocols.

  • private-coin (Ghazi et al., 2021): A multi-message protocol for histogram frequency estimation in the shuffle model. It utilizes the Hadamard response as the data randomizer.

  • public-coin (Ghazi et al., 2021): A multi-message protocol for histogram frequency estimation in the shuffle model. It is based on combining the Count Min data structure (Cormode and Muthukrishnan, 2005) with the GRR mechanism. It requires ϵ>τ90ln(2τ/δ)/n\epsilon>\tau\sqrt{90\ln(2\tau/\delta)/n} when the randomization probability γ\gamma is set to 90τ2ln(2τ/δ)/(nϵ2)90\tau^{2}\ln(2\tau/\delta)/(n\epsilon^{2}), where τ=ln(2k)\tau=\ln(2k).

  • SOLH (Wang et al., 2020): A single-message protocol for histogram frequency estimation in the shuffle model. It utilizes the hashing-based mechanism OLH as the data randomizer.

  • truncation-based (Balcer and Cheu, 2019): A multi-message protocol for histogram frequency estimation in the shuffle model. The protocol can provide privacy protection with ϵ>50ln(2/δ)/n\epsilon>\sqrt{50\ln(2/\delta)/n}.

  • correlated-noise (Ghazi et al., 2020b): A multi-message protocol for histogram frequency estimation in the shuffle model. It adds noise drawn from a negative binomial distribution to users' data.

  • privacy-amplification (Balle et al., 2019b): A privacy amplification result in the shuffle model. It has an amplification effect only when ϵ\epsilon is within (14kln(2/δ)/(n1),1](\sqrt{14k\ln{(2/\delta)}/(n-1)},1].

Accuracy Metrics. In our experiments, we use the mean squared error (MSE) as the accuracy metric. For each value vv in 𝔻\mathbb{D}, we compute the real frequency fvf_{v} on the sensitive dataset and the estimated frequency f~v\tilde{f}_{v} on the noisy dataset. Then we average the squared differences over the kk different values.

MSE=1kv𝔻(f~vfv)2MSE=\frac{1}{k}\sum_{v\in\mathbb{D}}(\tilde{f}_{v}-f_{v})^{2}

All MSE results in the experiments are averaged over 50 repetitions.
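
For concreteness, the metric can be computed as follows; this is a minimal sketch of the evaluation step, assuming true and estimated frequency vectors of length k.

import numpy as np

def mse(f_true, f_est):
    # Mean squared error between the true and estimated frequency vectors,
    # averaged over the k values of the data domain.
    f_true, f_est = np.asarray(f_true), np.asarray(f_est)
    return np.mean((f_est - f_true) ** 2)

# Example with hypothetical frequencies over a domain of size k = 4.
print(mse([0.25, 0.25, 0.25, 0.25], [0.24, 0.26, 0.25, 0.25]))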

For private-coin and correlated-noise, we report the expected mean squared error instead. For private-coin, O(nklogn)O(nk\log n) comparisons are required to match messages with Hadamard codewords on the analyst side, which leads to excessive computational overhead, e.g., about 1,450 hours for a dataset with k=512 and n=70,000. For correlated-noise, we did not find a way to sample from the negative binomial distribution NB(r,q)NB(r,q) when rr is a non-integer. We hope to further improve this implementation in the future.

Figure 7. Results of MSE varying ϵ\epsilon on the synthetic dataset, where the number of users n=500,000n=500,000, the domain size k=500k=500.
Figure 8. Number of messages sent by each user varying ϵ\epsilon on the synthetic dataset, where the number of users n=500,000n=500,000, the domain size k=500k=500.

6.2. Experiments with Synthetic Datasets

Overall Results. The influence of the domain size kk on the MSE of pureDUMP and mixDUMP is shown in Figure 3; the domain size has almost no influence on the MSE of pureDUMP. The influence of the number of users nn on the MSE of pureDUMP and mixDUMP is shown in Figure 4; the MSE of both protocols decreases rapidly as the number of users grows. Meanwhile, the small gap between the theoretical and empirical results validates the correctness of our theoretical error analysis.

Figure 5 and Figure 7 show MSE comparisons of all protocols on two different synthetic datasets, and the numbers of messages sent by each user in all protocols except privacy-amplification and SOLH (always one message) are depicted in Figure 6 and Figure 8. The proposed protocols, pureDUMP and mixDUMP, support any ϵ\epsilon in the range of (0,1](0,1] with the best accuracy among the compared protocols; the only exception is correlated-noise, which achieves near-optimal accuracy but has much higher communication overhead (since the coefficient 500eϵ2/log(1/δ)500e^{\epsilon^{2}/\log(1/\delta)} is ignored, its declared asymptotic upper bound is smaller than the empirical MSE). Meanwhile, we observe that the numbers of messages sent by each user in pureDUMP and mixDUMP are always the smallest among all protocols. Although the MSE of truncation-based is similar to that of pureDUMP and mixDUMP, its communication overhead is at least one order of magnitude higher. Since a sufficient privacy guarantee can be provided by privacy amplification when hundreds of thousands of users are involved, we set ϵl\epsilon_{l} of mixDUMP to 8 throughout the experiments to accommodate any ϵ\epsilon in the range of (0,1](0,1].

Figure 9. Results of MSE varying ϵ\epsilon on the IPUMS dataset.
Figure 10. Number of messages sent by each user varying ϵ\epsilon on the IPUMS dataset.

Influence of Domain Size. We compare the MSE of pureDUMP on three synthetic datasets with different sizes of the data domain in Figure 3a. The result shows that the size of the data domain has little influence on the accuracy of pureDUMP. This is because, according to our theoretical result in Subsection 5.1, the MSE of pureDUMP can be approximately simplified to 14ln(2/δ)/(n2ϵ2)14\ln(2/\delta)/(n^{2}\epsilon^{2}) when kk is large. Figure 3b shows that the MSE of mixDUMP is proportional to the domain size kk, because a larger data domain increases the error introduced by the data randomizer in mixDUMP.

Influence of the Number of Users. Figure 4a and Figure 4b show the influence of the number of users on the MSE of pureDUMP and mixDUMP, respectively. We can observe that the MSE is inversely proportional to the square of the number of users nn: the MSE decreases by two orders of magnitude when the number of users increases by one order of magnitude, which is consistent with our efficiency analysis in Subsection 5.1 and Subsection 5.2.

Figure 11. Results of MSE varying ϵ\epsilon on the Ratings dataset.
Figure 12. Number of messages sent by each user varying ϵ\epsilon on the Ratings dataset.

Accuracy Comparisons with Existing Protocols. We choose two synthetic datasets on which as many protocols as possible can provide privacy protection with ϵ\epsilon within the range of (0,1](0,1]; MSE values for ϵ\epsilon outside the valid range of a protocol are omitted. The comparison results are shown in Figure 5. We can observe that pureDUMP and mixDUMP always enjoy privacy amplification and have smaller MSE than all other protocols except correlated-noise. Meanwhile, we can also observe that pureDUMP has higher accuracy than mixDUMP, which means that dummy points introduce less error than the data randomizer when providing the same strength of privacy protection. Besides, privacy-amplification has no amplification effect at some small ϵ\epsilon values, which causes poor MSE results. Moreover, we can also observe that the lines of public-coin always increase slightly at the beginning. This is because the noise follows a binomial distribution Bin(n,γ)Bin(n,\gamma), where γ\gamma decreases as ϵ\epsilon increases, and the variance increases until γ\gamma drops below 0.50.5.

Communication Comparisons with Existing Protocols. The communication overhead is compared by counting the number of messages sent by each user in each protocol. Since the number of messages sent by each user in private-coin and public-coin is always larger than in the other protocols, and only these two protocols have message lengths larger than O(logk)O(\log k) for datasets with n>kn>k (see Table 2), comparing the number of messages does not affect the fairness of the results. Moreover, as the number of messages sent by each user in privacy-amplification and SOLH is always 11, we omit them from our comparisons. We also omit the ϵ\epsilon values that are out of the valid range of a protocol. The comparison results are shown in Figure 6. We can observe that each user always sends fewer messages in pureDUMP and mixDUMP than in the other protocols, and mixDUMP requires the smallest number of messages per user because its data randomizer provides partial privacy protection, so fewer dummy points need to be generated. Note that the number of messages that each user sends in the DUMP protocols is less than 11 when a large ϵ\epsilon is required; this means that fewer than nn dummy points in total are needed to satisfy a large ϵ\epsilon when there is a sufficient number of users. According to the theoretical results for the flexible DUMP protocols in Subsection 5.3, we provide their parameter settings for different ϵ\epsilon on these two synthetic datasets in Appendix A. We can also observe that the number of messages sent by each user increases as ϵ\epsilon increases in the truncation-based protocol. This is because each user sends an additional 11 with probability p=150ln(2/δ)/(ϵ2n)p=1-50\ln(2/\delta)/(\epsilon^{2}n), where p(1/2,1)p\in(1/2,1); pp approaches 1 as ϵ\epsilon increases, so more 1s are sent by each user.

6.3. Experiments with Real-World Datasets

We also conduct experiments on two real-world datasets: IPUMS and Ratings. The MSE comparisons of the protocols are shown in Figure 9 and Figure 11, and the corresponding communication overheads are shown in Figure 10 and Figure 12. privacy-amplification has no amplification effect on either the IPUMS or the Ratings dataset, so it is not included in the comparison. Besides, public-coin is also not included because most values of ϵ(0,1]\epsilon\in(0,1] are outside its valid range on both datasets. We can observe that pureDUMP and mixDUMP still achieve the highest accuracy among all protocols except correlated-noise under the large domain sizes of the real-world datasets. Over the Ratings dataset with ϵ=1\epsilon=1 and δ=106\delta=10^{-6}, pureDUMP (and mixDUMP) only require around 0.80.8 (and 0.50.5) expected extra messages per user, while truncation-based, private-coin, and correlated-noise need around 10210^{2}, 10310^{3}, and 10410^{4} extra messages, respectively. On the other hand, the single-message protocols privacy-amplification and SOLH are less competitive in accuracy and/or subject to upper bounds on the privacy guarantee (namely, lower bounds on ϵ\epsilon).

7. Related Works

Anonymous Data Collection. Many works focus on anonymizing users' messages. Some mechanisms are based on cryptographic techniques, including (Syverson et al., 2004; Lazar et al., 2018; Tyagi et al., 2017; Van Den Hooff et al., 2015). Recently, Chowdhury et al. (Roy Chowdhury et al., 2020) proposed a system, Cryptϵ\epsilon, that achieves the accuracy guarantee of CDP while, like LDP, requiring no trusted data collector. In this approach, users encrypt their messages with homomorphic encryption and send them to an auxiliary server that aggregates them and adds random noise. After that, the encrypted aggregated result is sent to the analyst, who decrypts it to obtain a noisy result that satisfies CDP. Other mechanisms are based on the shuffling idea, originally proposed by Bittau et al. (Bittau et al., 2017), who design the Encode, Shuffle, Analyze (ESA) architecture. The ESA architecture eliminates the link between the message and the user identity by introducing a shuffler between users and the analyst. Many follow-up works analyze the privacy benefits that come from the shuffle model (Erlingsson et al., 2019; Cheu et al., 2019; Balle et al., 2019b; Feldman et al., 2020). Among them, Feldman et al. (Feldman et al., 2020) provide the tightest upper bound, showing that ϵ\epsilon can be reduced to (1eϵ)eϵlog(1/δ)/n(1-e^{-\epsilon})\sqrt{e^{\epsilon}\log(1/\delta)}/\sqrt{n}.

Protocols in the shuffle model. Recent shuffle-related works seek to design more accurate and communication-efficient protocols. Studies focusing on frequency estimation in the shuffle model are covered in Section 3.2; we list some other related works here as a supplement. Some studies focus on binary data summation. Cheu et al. (Cheu et al., 2019) design a single-message protocol for binary summation in which each user sends a one-bit message to the analyst, with expected error at most O(log(1/δ)/ϵ)O(\sqrt{\log(1/\delta)}/\epsilon). Ghazi et al. (Ghazi et al., 2020a) design a multi-message protocol for binary summation where each user sends O(logn/ϵ)O(\log n/\epsilon) one-bit messages to the analyst, with expected error at most O(log(1/ϵ)/ϵ3)O(\sqrt{\log(1/\epsilon)/\epsilon^{3}}). Balcer et al. (Balcer and Cheu, 2019) propose a two-message protocol for binary summation with expected error at most O(log(1/δ)/ϵ2)O(\log(1/\delta)/\epsilon^{2}). There are also many studies on real-number summation, and many excellent protocols have been proposed (Cheu et al., 2019; Balle et al., 2019b; Ghazi et al., 2019b; Balle et al., 2019a; Ghazi et al., 2020c).

8. Conclusions

In this paper, a novel framework called DUMP is proposed for privacy-preserving histogram estimation in the shuffle model. Two types of protection mechanisms, a data randomizer and a dummy-point generator, are employed on the user side of DUMP. The analysis shows that dummy points gain more privacy benefit from the shuffling operation. To further explore the advantages of dummy points, two protocols, pureDUMP and mixDUMP, are proposed. The proposed protocols achieve better trade-offs between privacy guarantee, statistical accuracy, and communication overhead than existing protocols. Besides, the proposed protocols are also analyzed under a more realistic assumption for practical applications, and we prove that their privacy and utility guarantees still hold. Finally, we perform experiments to compare the different protocols and demonstrate the advantages of the proposed protocols on both synthetic and real-world datasets. For future work, we will study different distributions of dummy points and apply DUMP to more statistical tasks such as real-number summation, range queries, and quantile queries.

References

  • ipu (2017) 2017. IPUMS Census Dataset. https://data.world/uci/ipums-census-database/.
  • rat (2017) 2017. Movie rating dataset. https://www.kaggle.com/ashukr/movie-rating-data/.
  • Balcer and Cheu (2019) Victor Balcer and Albert Cheu. 2019. Separating local & shuffled differential privacy via histograms. arXiv preprint arXiv:1911.06879 (2019).
  • Balcer and Vadhan (2018) Victor Balcer and Salil P. Vadhan. 2018. Differential Privacy on Finite Computers. In ITCS (LIPIcs), Vol. 94. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 43:1–43:21.
  • Balle et al. (2019a) Borja Balle, James Bell, Adria Gascon, and Kobbi Nissim. 2019a. Differentially private summation with multi-message shuffling. arXiv preprint arXiv:1906.09116 (2019).
  • Balle et al. (2019b) Borja Balle, James Bell, Adria Gascón, and Kobbi Nissim. 2019b. The privacy blanket of the shuffle model. In Annual International Cryptology Conference. Springer, 638–667.
  • Bittau et al. (2017) Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnes, and Bernhard Seefeld. 2017. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles. 441–459.
  • Cheu et al. (2019) Albert Cheu, Adam Smith, Jonathan Ullman, David Zeber, and Maxim Zhilyaev. 2019. Distributed differential privacy via shuffling. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 375–403.
  • Cormode and Muthukrishnan (2005) Graham Cormode and Shan Muthukrishnan. 2005. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55, 1 (2005), 58–75.
  • Ding et al. (2017) Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. 2017. Collecting telemetry data privately. In Advances in Neural Information Processing Systems. 3571–3580.
  • Duchi et al. (2013) John C Duchi, Michael I Jordan, and Martin J Wainwright. 2013. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE, 429–438.
  • Dwork (2008) Cynthia Dwork. 2008. Differential privacy: A survey of results. In International conference on theory and applications of models of computation. Springer, 1–19.
  • Erlingsson et al. (2019) Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. 2019. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2468–2479.
  • Erlingsson et al. (2014) Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security. 1054–1067.
  • Feldman et al. (2020) Vitaly Feldman, Audra McMillan, and Kunal Talwar. 2020. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. arXiv preprint arXiv:2012.12803 (2020).
  • Gazeau et al. (2016) Ivan Gazeau, Dale Miller, and Catuscia Palamidessi. 2016. Preserving differential privacy under finite-precision semantics. Theor. Comput. Sci. 655 (2016), 92–108.
  • Geng et al. (2015) Quan Geng, Peter Kairouz, Sewoong Oh, and Pramod Viswanath. 2015. The staircase mechanism in differential privacy. IEEE Journal of Selected Topics in Signal Processing 9, 7 (2015), 1176–1184.
  • Ghazi et al. (2020a) Badih Ghazi, Noah Golowich, Ravi Kumar, Pasin Manurangsi, Rasmus Pagh, and Ameya Velingker. 2020a. Pure differentially private summation from anonymous messages. arXiv preprint arXiv:2002.01919 (2020).
  • Ghazi et al. (2019a) Badih Ghazi, Noah Golowich, Ravi Kumar, Rasmus Pagh, and Ameya Velingker. 2019a. Private Heavy Hitters and Range Queries in the Shuffled Model. arXiv preprint arXiv:1908.11358 (2019).
  • Ghazi et al. (2021) Badih Ghazi, Noah Golowich, Ravi Kumar, Rasmus Pagh, and Ameya Velingker. 2021. On the Power of Multiple Anonymous Messages: Frequency Estimation and Selection in the Shuffle Model of Differential Privacy. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 463–488.
  • Ghazi et al. (2020b) Badih Ghazi, Ravi Kumar, Pasin Manurangsi, and Rasmus Pagh. 2020b. Private counting from anonymous messages: Near-optimal accuracy with vanishing communication overhead. In International Conference on Machine Learning. PMLR, 3505–3514.
  • Ghazi et al. (2020c) Badih Ghazi, Pasin Manurangsi, Rasmus Pagh, and Ameya Velingker. 2020c. Private aggregation from fewer anonymous messages. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 798–827.
  • Ghazi et al. (2019b) Badih Ghazi, Rasmus Pagh, and Ameya Velingker. 2019b. Scalable and differentially private distributed aggregation in the shuffled model. arXiv preprint arXiv:1906.08320 (2019).
  • Ghosh et al. (2012) Arpita Ghosh, Tim Roughgarden, and Mukund Sundararajan. 2012. Universally utility-maximizing privacy mechanisms. SIAM J. Comput. 41, 6 (2012), 1673–1693.
  • Holohan et al. (2017) Naoise Holohan, Douglas J Leith, and Oliver Mason. 2017. Optimal differentially private mechanisms for randomised response. IEEE Transactions on Information Forensics and Security 12, 11 (2017), 2726–2735.
  • Ilvento (2020) Christina Ilvento. 2020. Implementing the Exponential Mechanism with Base-2 Differential Privacy. In CCS. ACM, 717–742.
  • Kairouz et al. (2014) Peter Kairouz, Sewoong Oh, and Pramod Viswanath. 2014. Extremal mechanisms for local differential privacy. Advances in neural information processing systems 27 (2014), 2879–2887.
  • Kasiviswanathan et al. (2011) Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2011. What can we learn privately? SIAM J. Comput. 40, 3 (2011), 793–826.
  • Lazar et al. (2018) David Lazar, Yossi Gilad, and Nickolai Zeldovich. 2018. Karaoke: Distributed private messaging immune to passive traffic analysis. In USENIX Symposium on Operating Systems Design and Implementation. 711–725.
  • Mironov (2012) Ilya Mironov. 2012. On significance of the least significant bits for differential privacy. In CCS. ACM, 650–661.
  • Roy Chowdhury et al. (2020) Amrita Roy Chowdhury, Chenghong Wang, Xi He, Ashwin Machanavajjhala, and Somesh Jha. 2020. Cryptϵ\epsilon: Crypto-Assisted Differential Privacy on Untrusted Servers. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 603–619.
  • Syverson et al. (2004) Paul Syverson, Roger Dingledine, and Nick Mathewson. 2004. Tor: The second-generation onion router. In USENIX Security Symposium. 303–320.
  • Team et al. (2017) ADP Team et al. 2017. Learning with privacy at scale. Apple Mach. Learn. J 1, 9 (2017).
  • Tyagi et al. (2017) Nirvan Tyagi, Yossi Gilad, Derek Leung, Matei Zaharia, and Nickolai Zeldovich. 2017. Stadium: A distributed metadata-private messaging system. In Proceedings of the 26th Symposium on Operating Systems Principles. 423–440.
  • Van Den Hooff et al. (2015) Jelle Van Den Hooff, David Lazar, Matei Zaharia, and Nickolai Zeldovich. 2015. Vuvuzela: Scalable private messaging resistant to traffic analysis. In Proceedings of the 25th Symposium on Operating Systems Principles. 137–152.
  • Wang et al. (2017) Tianhao Wang, Jeremiah Blocki, Ninghui Li, and Somesh Jha. 2017. Locally differentially private protocols for frequency estimation. In USENIX Security Symposium. 729–745.
  • Wang et al. (2020) Tianhao Wang, Bolin Ding, Min Xu, Zhicong Huang, Cheng Hong, Jingren Zhou, Ninghui Li, and Somesh Jha. 2020. Improving Utility and Security of the Shuffler based Differential Privacy. Proceedings of the VLDB Endowment 13, 13 (2020).
  • Wang et al. (2019) Tianhao Wang, Bolin Ding, Jingren Zhou, Cheng Hong, Zhicong Huang, Ninghui Li, and Somesh Jha. 2019. Answering multi-dimensional analytical queries under local differential privacy. In Proceedings of the 2019 International Conference on Management of Data. 159–176.
  • Warner (1965) Stanley L Warner. 1965. Randomized response: A survey technique for eliminating evasive answer bias. J. Amer. Statist. Assoc. 60, 309 (1965), 63–69.

Appendix A Parameter setting in flexible pureDUMP and mixDUMP

Table 3 and Table 4 show the number of dummy points that need to be sent by a user in flexible pureDUMP and flexible mixDUMP, respectively. The number of users is 500,000500,000; we show results for two different values of kk (size of the data domain) and γ\gamma (probability of a user sending dummy points).

ϵ      k=50               k=500
       γ=0.01   γ=0.001   γ=0.01   γ=0.001
0.4    13       127       127      1270
0.6    6        57        57       565
0.8    4        32        32       318
1.0    3        21        21       204
Table 3. Number of dummy points sent by a user in flexible pureDUMP. We fix n=500,000n=500,000. All decimals are rounded up.
ϵ      k=50               k=500
       γ=0.01   γ=0.001   γ=0.01   γ=0.001
0.4    12       118       119      1190
0.6    5        44        46       451
0.8    2        18        20       192
1.0    1        6         8        72
Table 4. Number of dummy points sent by a user in flexible mixDUMP. We fix n=500,000n=500,000 and ϵl=8\epsilon_{l}=8. All decimals are rounded up.
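
The entries of Table 3 follow directly from Theorem 5.11 by solving for the per-user number of dummy points. The sketch below is our own illustrative code; it assumes δ=10^{-6} as in the experiments and neglects the e^{-γn} terms and the constant −1 in S−1, both of which are negligible at this scale.

import math

def pure_dummies_per_sender(eps, k, gamma, n=500_000, delta=1e-6):
    # Per-user dummy count for flexible pureDUMP (sketch): the total number
    # of dummy points S must satisfy eps <= sqrt(14k*ln(2/delta)/(S-1)),
    # and only a gamma fraction of the n users actually send them.
    total = 14 * k * math.log(2 / delta) / eps**2  # required total S (approx.)
    return math.ceil(total / (gamma * n))          # rounded up, as in Table 3

for eps in [0.4, 0.6, 0.8, 1.0]:
    print(eps, [pure_dummies_per_sender(eps, k, g)
                for k, g in [(50, 0.01), (50, 0.001), (500, 0.01), (500, 0.001)]])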

Appendix B Proofs

B.1. Proof of Lemma 5.1

Proof.

Denote a1,a2,,ama_{1},a_{2},...,a_{m} as mm random variables that follow the binomial distribution, where a1+a2++am=Na_{1}+a_{2}+...+a_{m}=N. Let aia_{i} be a constant value cc; then the conditional probability distribution Pr[aj=x|ai=c]Pr[a_{j}=x|a_{i}=c] (iji\neq j) can be expressed as

Pr[aj=x|ai=c]\displaystyle Pr[a_{j}=x|a_{i}=c] =CNcx(1m1)x(m2m1)Ncx\displaystyle=C_{N-c}^{x}(\frac{1}{m-1})^{x}(\frac{m-2}{m-1})^{N-c-x}
=(Nc)!x!(Ncx)!(m2)Ncx(m1)Nc\displaystyle=\frac{\frac{(N-c)!}{x!(N-c-x)!}(m-2)^{N-c-x}}{(m-1)^{N-c}}

The marginal distribution Pr[aj=x]Pr[a_{j}=x] can be expressed as

Pr[aj=x]\displaystyle Pr[a_{j}=x] =CNx(1m)x(m1m)Nx\displaystyle=C_{N}^{x}(\frac{1}{m})^{x}\cdot(\frac{m-1}{m})^{N-x}
=N!x!(Nx)!(m1)NxmN\displaystyle=\frac{\frac{N!}{x!(N-x)!}\cdot(m-1)^{N-x}}{m^{N}}

Then we calculate the upper bound of the ratio of these two probability distributions.

Pr[aj=x|ai=c]Pr[aj=x]\displaystyle\frac{Pr[a_{j}=x|a_{i}=c]}{Pr[a_{j}=x]}
=(Nc)!x!(Ncx)!(m2)Ncx(m1)NcmNN!x!(Nx)!(m1)Nx\displaystyle=\frac{\frac{(N-c)!}{x!(N-c-x)!}(m-2)^{N-c-x}}{(m-1)^{N-c}}\cdot\frac{m^{N}}{\frac{N!}{x!(N-x)!}\cdot(m-1)^{N-x}}
=(m2)NcxmN(m1)2Nxc(Nc)!(Nx)!N!(Ncx)!\displaystyle=\frac{(m-2)^{N-c-x}m^{N}}{(m-1)^{2N-x-c}}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
=(m2)Ncx(m1)NxcmN(m1)N(Nc)!(Nx)!N!(Ncx)!\displaystyle=\frac{(m-2)^{N-c-x}}{(m-1)^{N-x-c}}\cdot\frac{m^{N}}{(m-1)^{N}}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
=(11m1)Ncx(1+1m1)N(Nc)!(Nx)!N!(Ncx)!\displaystyle=(1-\frac{1}{m-1})^{N-c-x}\cdot(1+\frac{1}{m-1})^{N}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
=(1(1m1)2)Ncx(1+1m1)c+x(Nc)!(Nx)!N!(Ncx)!\displaystyle=(1-(\frac{1}{m-1})^{2})^{N-c-x}\cdot(1+\frac{1}{m-1})^{c+x}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
(mm1)c+x(Nx)(Nx1)(Nxc+1)N(N1)(Nc+1)\displaystyle\leq(\frac{m}{m-1})^{c+x}\cdot\frac{(N-x)\cdot(N-x-1)\cdot...\cdot(N-x-c+1)}{N\cdot(N-1)\cdot...\cdot(N-c+1)}
(mm1)c+x\displaystyle\leq(\frac{m}{m-1})^{c+x}

The lower bound of the ratio of these two probability distributions can be calculated as

Pr[aj=x|ai=c]Pr[aj=x]\displaystyle\frac{Pr[a_{j}=x|a_{i}=c]}{Pr[a_{j}=x]}
=(m2)NcxmN(m1)2Nxc(Nc)!(Nx)!N!(Ncx)!\displaystyle=\frac{(m-2)^{N-c-x}m^{N}}{(m-1)^{2N-x-c}}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
=(m2)Ncx(m1)NxcmN(m1)N(Nc)!(Nx)!N!(Ncx)!\displaystyle=\frac{(m-2)^{N-c-x}}{(m-1)^{N-x-c}}\cdot\frac{m^{N}}{(m-1)^{N}}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
=(11m1)Ncx(1+1m1)N(Nc)!(Nx)!N!(Ncx)!\displaystyle=(1-\frac{1}{m-1})^{N-c-x}\cdot(1+\frac{1}{m-1})^{N}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
(m2m1)Ncx(Nx)(Nx1)(Nxc+1)N(N1)(Nc+1)\displaystyle\geq(\frac{m-2}{m-1})^{N-c-x}\cdot\frac{(N-x)\cdot(N-x-1)\cdot...\cdot(N-x-c+1)}{N\cdot(N-1)\cdot...\cdot(N-c+1)}
(m2m1)Nxc(Nxc+1Nc+1)c\displaystyle\geq(\frac{m-2}{m-1})^{N-x-c}\cdot(\frac{N-x-c+1}{N-c+1})^{c}

Since a1,a2,,ama_{1},a_{2},...,a_{m} are assumed to be much smaller than NN, mm must be a large value. Hence both mm1\frac{m}{m-1} and m2m1\frac{m-2}{m-1} are close to 11. Besides, Nxc+1Nc+1\frac{N-x-c+1}{N-c+1} is also close to 11 since xx is much smaller than NN. Therefore, both the upper bound and the lower bound of Pr[aj=x|ai=c]Pr[aj=x]\frac{Pr[a_{j}=x|a_{i}=c]}{Pr[a_{j}=x]} are close to 11. We have

Pr[aj=x,ai=c]\displaystyle Pr[a_{j}=x,a_{i}=c] =Pr[aj=x|ai=c]Pr[ai=c]\displaystyle=Pr[a_{j}=x|a_{i}=c]Pr[a_{i}=c]
Pr[aj=x]Pr[ai=c]\displaystyle\simeq Pr[a_{j}=x]Pr[a_{i}=c]

Based on the above, we can approximately treat the distributions of aia_{i} and aja_{j} as independent. ∎
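
The approximate independence can also be checked empirically. The following Monte Carlo sketch (our own illustrative code) compares the conditional and marginal distributions of two multinomial cell counts in the regime assumed by the lemma.

import numpy as np

rng = np.random.default_rng(seed=0)

# N balls thrown uniformly into m bins; the bin counts a_1, ..., a_m are
# marginally binomial and sum to N, as in Lemma 5.1.
N, m, trials = 10_000, 100, 50_000
counts = rng.multinomial(N, [1.0 / m] * m, size=trials)
a_i, a_j = counts[:, 0], counts[:, 1]

c = x = 100  # condition on a_i = c, a value near its mean N/m
p_cond = np.mean(a_j[a_i == c] == x)  # estimates Pr[a_j = x | a_i = c]
p_marg = np.mean(a_j == x)            # estimates Pr[a_j = x]
print(p_cond, p_marg)                 # nearly equal, as the lemma predicts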

B.2. Proof of Theorem 5.2

Proof.

For any two neighboring datasets 𝒟\mathcal{D} and 𝒟\mathcal{D^{\prime}}, w.l.o.g., we assume they differ in the nthn^{\text{th}} value, with xn=1x_{n}=1 in 𝒟\mathcal{D} and xn=2x^{\prime}_{n}=2 in 𝒟\mathcal{D^{\prime}}. Let 𝒮\mathcal{S} and 𝒮\mathcal{S^{\prime}} be two dummy datasets with |𝒮|=|𝒮|=ns|\mathcal{S}|=|\mathcal{S}^{\prime}|=ns dummy points sent by users. What the analyst receives is a mixed dataset 𝒱\mathcal{V} that is a mixture of the user dataset 𝒟\mathcal{D}(or 𝒟)\mathcal{D^{\prime}}) and the dummy dataset 𝒮\mathcal{S}(or 𝒮)\mathcal{S^{\prime}}). We denote 𝒱=𝒟𝒮\mathcal{V}=\mathcal{D}\cup\mathcal{S}(or 𝒱=𝒟𝒮\mathcal{V}=\mathcal{D^{\prime}}\cup\mathcal{S^{\prime}}). Denote sjs_{j} as the number of value jj in the dummy dataset. To get the same 𝒱\mathcal{V}, there should be a dummy point equal to 2 in 𝒮\mathcal{S} but equal to 1 in 𝒮\mathcal{S^{\prime}}. Thus the numbers of the kk values in the other |𝒮|1|\mathcal{S}|-1 dummy points are [s11,s2,,sk][s_{1}-1,s_{2},\ldots,s_{k}] in 𝒮\mathcal{S} and [s1,s21,,sk][s_{1},s_{2}-1,\ldots,s_{k}] in 𝒮\mathcal{S^{\prime}}. Then we have

Pr[𝒟𝒮=𝒱]Pr[𝒟𝒮=𝒱]\displaystyle\frac{\Pr[\mathcal{D}\cup\mathcal{S}=\mathcal{V}]}{\Pr[\mathcal{D^{\prime}}\cup\mathcal{S^{\prime}}=\mathcal{V}]} =(s11,s2,,sk|𝒮|1)(s1,s21,,sk|𝒮|1)=Cs1+s21s2Cs11s11Cs1+s21s21Cs1s1\displaystyle=\frac{(_{s_{1}-1,s_{2},\ldots,s_{k}}^{\ \ \ \ \ |\mathcal{S}|-1})}{(_{s_{1},s_{2}-1,\ldots,s_{k}}^{\ \ \ \ \ |\mathcal{S}|-1})}=\frac{C_{s_{1}+s_{2}-1}^{s_{2}}C_{s_{1}-1}^{s_{1}-1}}{C_{s_{1}+s_{2}-1}^{s_{2}-1}C_{s_{1}}^{s_{1}}}
=s1!(s21)!(s11)!s2!=s1s2\displaystyle=\frac{s_{1}!(s_{2}-1)!}{(s_{1}-1)!s_{2}!}=\frac{s_{1}}{s_{2}}

As dummy points are all uniformly sampled at random from 𝔻\mathbb{D}, both s1s_{1} and s2s_{2} follow binomial distributions, where s1s_{1} follows Bin(|𝒮|1,1k)+1Bin(|\mathcal{S}|-1,\frac{1}{k})+1 and s2s_{2} follows Bin(|𝒮|1,1k)Bin(|\mathcal{S}|-1,\frac{1}{k}). Denote S1S_{1} as Bin(|𝒮|1,1k)+1Bin(|\mathcal{S}|-1,\frac{1}{k})+1 and S2S_{2} as Bin(|𝒮|1,1k)Bin(|\mathcal{S}|-1,\frac{1}{k}). Here, s1s_{1} and s2s_{2} can be treated as mutually independent variables according to Lemma 5.1. Then the probability that pureDUMP does not satisfy ϵd\epsilon_{d}-DP is

Pr[Pr[𝒟𝒮=𝒱]Pr[𝒟𝒮=𝒱]eϵd]=Pr[S1S2eϵd]δd\displaystyle\Pr[\frac{\Pr[\mathcal{D}\cup\mathcal{S}=\mathcal{V}]}{\Pr[\mathcal{D^{\prime}}\cup\mathcal{S^{\prime}}=\mathcal{V}]}\geq e^{\epsilon_{d}}]=\Pr[\frac{S_{1}}{S_{2}}\geq e^{\epsilon_{d}}]\leq\delta_{d}

Let c=E(S2)=|𝒮|1kc=E(S_{2})=\frac{|\mathcal{S}|-1}{k}. The event in Pr[S1S2eϵd]\Pr[\frac{S_{1}}{S_{2}}\geq e^{\epsilon_{d}}] can be divided into two overlapping cases, S1ceϵd/2S_{1}\geq ce^{\epsilon_{d}/2} or S2ceϵd/2S_{2}\leq ce^{-\epsilon_{d}/2}, so we have

Pr[S1S2eϵd]\displaystyle\Pr[\frac{S_{1}}{S_{2}}\geq e^{\epsilon_{d}}]
Pr[S1ceϵd/2]+Pr[S2ceϵd/2]\displaystyle\leq\Pr[S_{1}\geq ce^{\epsilon_{d}/2}]+\Pr[S_{2}\leq ce^{-\epsilon_{d}/2}]
=Pr[S2ceϵd/21]+Pr[S2ceϵd/2]\displaystyle=\Pr[S_{2}\geq ce^{\epsilon_{d}/2}-1]+\Pr[S_{2}\leq ce^{-\epsilon_{d}/2}]
=Pr[S2cceϵd/2c1]+Pr[S2cceϵd/2c]\displaystyle=\Pr[S_{2}-c\geq ce^{\epsilon_{d}/2}-c-1]+\Pr[S_{2}-c\leq ce^{-\epsilon_{d}/2}-c]
=Pr[S2cc(eϵd/211/c)]+Pr[S2cc(eϵd/21)]\displaystyle=\Pr[S_{2}-c\geq c(e^{\epsilon_{d}/2}-1-1/c)]+\Pr[S_{2}-c\leq c(e^{-\epsilon_{d}/2}-1)]

According to multiplicative Chernoff bound, we have

Pr[S1S2eϵd]ec3(eϵd/211/c)2+ec2(1eϵd/2)2\Pr[\frac{S_{1}}{S_{2}}\geq e^{\epsilon_{d}}]\leq e^{-\frac{c}{3}(e^{\epsilon_{d}/2}-1-1/c)^{2}}+e^{-\frac{c}{2}(1-e^{-\epsilon_{d}/2})^{2}}

To ensure that Pr[S1S2eϵd]δd\Pr[\frac{S_{1}}{S_{2}}\geq e^{\epsilon_{d}}]\leq\delta_{d}, we need

ec3(eϵd/211/c)2δd2andec2(1eϵd/2)2δd2e^{-\frac{c}{3}(e^{\epsilon_{d}/2}-1-1/c)^{2}}\leq\frac{\delta_{d}}{2}\ and\ e^{-\frac{c}{2}(1-e^{-\epsilon_{d}/2})^{2}}\leq\frac{\delta_{d}}{2}

For ec2(1eϵd/2)2δd2e^{-\frac{c}{2}(1-e^{-\epsilon_{d}/2})^{2}}\leq\frac{\delta_{d}}{2}, we use 1eϵd/2(1e1/2)ϵdϵd/71-e^{-\epsilon_{d}/2}\geq(1-e^{-1/2})\epsilon_{d}\geq\epsilon_{d}/\sqrt{7} for ϵd1\epsilon_{d}\leq 1, so we can get c14ln(2/δd)ϵd2c\geq\frac{14\ln{(2/\delta_{d})}}{\epsilon_{d}^{2}}. For ec3(eϵd/211/c)2δd2e^{-\frac{c}{3}(e^{\epsilon_{d}/2}-1-1/c)^{2}}\leq\frac{\delta_{d}}{2}, we use eϵd/21ϵd2e^{\epsilon_{d}/2}-1\geq\frac{\epsilon_{d}}{2}, so eϵd/211/c2554ϵde^{\epsilon_{d}/2}-1-1/c\geq\frac{25}{54}\epsilon_{d} when c27/ϵdc\geq 27/\epsilon_{d}. Therefore, we can get c3542ln(2/δd)252ϵd2c\geq\frac{3\cdot 54^{2}\ln{(2/\delta_{d})}}{25^{2}\epsilon_{d}^{2}}. Note that 354225214\frac{3\cdot 54^{2}}{25^{2}}\leq 14.

In summary, we can get cmax{14ln(2/δd)ϵd2,27ϵd}c\geq max\{\frac{14\ln{(2/\delta_{d})}}{\epsilon_{d}^{2}},\frac{27}{\epsilon_{d}}\}. Because ϵd1\epsilon_{d}\leq 1, when δd0.2907\delta_{d}\leq 0.2907, ϵd14ln(2/δd)27\epsilon_{d}\leq\frac{14\ln{(2/\delta_{d})}}{27} always holds, so c14ln(2/δd)ϵd2c\geq\frac{14\ln{(2/\delta_{d})}}{\epsilon_{d}^{2}}. Therefore, we have S1k14ln(2/δd)ϵd2\frac{S-1}{k}\geq\frac{14\ln{(2/\delta_{d})}}{\epsilon_{d}^{2}}, that is, ϵd14kln(2/δd)S1\epsilon_{d}\leq\sqrt{\frac{14k\ln{(2/\delta_{d})}}{S-1}}. ∎
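
The resulting bound is straightforward to evaluate; the sketch below (our own illustrative code) computes the pureDUMP privacy budget of Theorem 5.2 for a given total number of dummy points.

import math

def puredump_epsilon(k, S, delta_d):
    # Privacy budget of pureDUMP from Theorem 5.2 (illustrative sketch):
    # eps_d = sqrt(14k * ln(2/delta_d) / (S - 1)), for delta_d <= 0.2907,
    # where S is the total number of dummy points.
    assert S > 1 and 0 < delta_d <= 0.2907
    return math.sqrt(14 * k * math.log(2 / delta_d) / (S - 1))

# Example: domain size k = 50, S = 100,000 dummy points, delta_d = 1e-6.
print(puredump_epsilon(50, 100_000, 1e-6))  # about 0.32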

B.3. Proof of Theorem 5.6

Proof.

The assumptions are the same as in the proof of Theorem 5.2. The nthn^{\text{th}} value xn=1x_{n}=1 in 𝒟\mathcal{D} and xn=2x^{\prime}_{n}=2 in its neighboring dataset 𝒟\mathcal{D^{\prime}}. If the nthn^{\text{th}} user randomizes its data, then we can easily get

Pr[R(𝒟)𝒮=𝒱]Pr[R(𝒟)𝒮=𝒱]=1\frac{\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}]}{\Pr[R(\mathcal{D^{\prime}})\cup\mathcal{S}^{\prime}=\mathcal{V}]}=1

So, we focus on the case where the nthn^{\text{th}} user does not randomize its data. Denote \mathcal{H} as the set of users participating in randomization in Algorithm 3, with size |||\mathcal{H}|. Since the randomized users' data and the dummy points all follow the uniform distribution, mixDUMP can be regarded as pureDUMP with the dummy dataset 𝒮\mathcal{H}\cup\mathcal{S} introduced. From Theorem 5.2, mixDUMP satisfies (ϵH,δs2)(\epsilon_{H},\frac{\delta_{s}}{2})-DP, so we have

Pr[R(𝒟)𝒮=𝒱]eϵHPr[R(𝒟)𝒮=𝒱]+δs2\displaystyle\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}]\leq e^{\epsilon_{H}}\Pr[R(\mathcal{D^{\prime}})\cup\mathcal{S^{\prime}}=\mathcal{V}]+\frac{\delta_{s}}{2}

where ϵH=14kln(4/δs)||+|𝒮|1\epsilon_{H}=\sqrt{\frac{14k\ln{(4/\delta_{s})}}{|\mathcal{H}|+|\mathcal{S}|-1}}, δs0.5814\delta_{s}\leq 0.5814. Since λ\lambda is the probability that each user randomizes its data, the number of users participating in randomization |||\mathcal{H}| follows Bin(n1,λ)Bin(n-1,\lambda). Define μ=(n1)λ\mu=(n-1)\lambda, and γ=2ln(2/δs)(n1)λ\gamma=\sqrt{\frac{2\ln{(2/\delta_{s})}}{(n-1)\lambda}}. According to Chernoff bound, we can get

Pr[||<(1γ)μ]<eμγ2/2=δs2\Pr[|\mathcal{H}|<(1-\gamma)\mu]<e^{-\mu\gamma^{2}/2}=\frac{\delta_{s}}{2}

Then we have

Pr[R(𝒟)𝒮=𝒱]\displaystyle\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}]
=Pr[R(𝒟)𝒮=𝒱||(n1)λ2(n1)λln(2/δs)]+\displaystyle=\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}\cap|\mathcal{H}|\geq(n-1)\lambda-\sqrt{2(n-1)\lambda\ln{(2/\delta_{s})}}]+
Pr[R(𝒟)𝒮=𝒱||<(n1)λ2(n1)λln(2/δs)]\displaystyle\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}\cap|\mathcal{H}|<(n-1)\lambda-\sqrt{2(n-1)\lambda\ln{(2/\delta_{s})}}]
Pr[R(𝒟)𝒮=𝒱||(n1)λ2(n1)λln(2/δs)]+δs2\displaystyle\leq\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}\cap|\mathcal{H}|\geq(n-1)\lambda-\sqrt{2(n-1)\lambda\ln{(2/\delta_{s})}}]+\frac{\delta_{s}}{2}
Let t=(n1)λ2(n1)λln(2/δs)t=(n-1)\lambda-\sqrt{2(n-1)\lambda\ln{(2/\delta_{s})}}. Then
Pr[R(𝒟)𝒮=𝒱](htPr[R(𝒟)𝒮=𝒱]Pr[||=h])+δs2\displaystyle\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}]\leq(\sum_{h\geq t}\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}]\Pr[|\mathcal{H}|=h])+\frac{\delta_{s}}{2}
(ht(eϵHPr[R(𝒟)𝒮=𝒱]Pr[||=h]+δs2Pr[||=h]))+δs2\displaystyle\leq(\sum_{h\geq t}(e^{\epsilon_{H}}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]\Pr[|\mathcal{H}|=h]+\frac{\delta_{s}}{2}\Pr[|\mathcal{H}|=h]))+\frac{\delta_{s}}{2}
(hteϵHPr[R(𝒟)𝒮=𝒱]Pr[||=h])+δs\displaystyle\leq(\sum_{h\geq t}e^{\epsilon_{H}}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]\Pr[|\mathcal{H}|=h])+\delta_{s}
maxhteϵHPr[R(𝒟)𝒮=𝒱]+δs\displaystyle\leq max_{h\geq t}e^{\epsilon_{H}}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]+\delta_{s}
e14kln(4/δs)t+S1Pr[R(𝒟)𝒮=𝒱]+δs\displaystyle\leq e^{\sqrt{\frac{14k\ln{(4/\delta_{s})}}{t+S-1}}}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]+\delta_{s}
=e14kln(4/δs)(n1)λ2(n1)λln(2/δs)+S1Pr[R(𝒟)𝒮=𝒱]+δs\displaystyle=e^{\sqrt{\frac{14k\ln{(4/\delta_{s})}}{(n-1)\lambda-\sqrt{2(n-1)\lambda\ln{(2/\delta_{s})}}+S-1}}}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]+\delta_{s}
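
Analogously, the mixDUMP budget of Theorem 5.6 can be evaluated directly; the sketch below is our own illustrative code, with hypothetical parameter values.

import math

def mixdump_epsilon(k, S, n, lam, delta_s):
    # Privacy budget of mixDUMP from Theorem 5.6 (illustrative sketch):
    # the randomized users' reports act as roughly (n-1)*lam extra dummy
    # points, minus a Chernoff-style correction, on top of S real dummies.
    mu = (n - 1) * lam
    effective = mu - math.sqrt(2 * mu * math.log(2 / delta_s)) + S - 1
    assert effective > 0
    return math.sqrt(14 * k * math.log(4 / delta_s) / effective)

# Example: k = 50, n = 500,000 users, lam = 0.1, S = 10,000, delta_s = 1e-6.
print(mixdump_epsilon(50, 10_000, 500_000, 0.1, 1e-6))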

B.4. Proof of Lemma 5.8

Proof.

fvf_{v} is the true frequency with which element vv appears in the users' dataset, and f~v\tilde{f}_{v} is its estimation. We prove that f~v\tilde{f}_{v} in mixDUMP is unbiased.

In the GRR mechanism, each value keeps the true value with probability pp and changes to each other value in the data domain with probability qq. We give the estimation function of mixDUMP in Algorithm 5. The formula is

f~v=i[n],j[s+1]𝟙{v=yi,j}nλ(11k)nskn(12λ(11k))\tilde{f}_{v}=\frac{\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-n\lambda(1-\frac{1}{k})-\frac{ns}{k}}{n(1-2\lambda(1-\frac{1}{k}))}

where λ=kk1q\lambda=\frac{k}{k-1}q. To simplify the derivation, we replace λ\lambda with qq; then the expectation of f~v\tilde{f}_{v} is

E[f~v]\displaystyle E[\tilde{f}_{v}] =E[1n(i[n],j[s+1]𝟙{v=yi,j}nqnsk)(12q)]\displaystyle=E[\frac{1}{n}\cdot\frac{(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-nq-\frac{ns}{k})}{(1-2q)}]
=1nE[(i[n],j[s+1]𝟙{v=yi,j}nqnsk)(12q)]\displaystyle=\frac{1}{n}\cdot E[\frac{(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-nq-\frac{ns}{k})}{(1-2q)}]
=1n112qE[i[n],j[s+1]𝟙{v=yi,j}nqnsk]\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot E[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-nq-\frac{ns}{k}]
=1n112q(E[i[n],j[s+1]𝟙{v=yi,j}]nqnsk)\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot(E[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}]-nq-\frac{ns}{k})
=1n112q(nfv(1q)+n(1fv)q+nsk\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot(nf_{v}(1-q)+n(1-f_{v})q+\frac{ns}{k}
nqnsk)\displaystyle-nq-\frac{ns}{k})
=1n112q(12q)nfv\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot(1-2q)\cdot nf_{v}
=fv\displaystyle=f_{v}
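
The estimation function above translates directly into code. The following sketch is our own illustrative implementation of the analyst-side estimator, applied to placeholder messages; it is not the released code.

import numpy as np

def mixdump_estimate(messages, n, k, s, lam):
    # Analyst-side estimator for mixDUMP, following the estimation function
    # above: subtract the expected GRR and dummy-point noise, then rescale.
    # `messages` holds all n*(s+1) shuffled reports as integers in [0, k).
    counts = np.bincount(messages, minlength=k).astype(float)
    bias = n * lam * (1 - 1 / k) + n * s / k
    return (counts - bias) / (n * (1 - 2 * lam * (1 - 1 / k)))

# Hypothetical usage with placeholder messages: n = 500,000, k = 50, s = 2.
rng = np.random.default_rng(seed=0)
msgs = rng.integers(0, 50, size=500_000 * 3)
print(mixdump_estimate(msgs, n=500_000, k=50, s=2, lam=0.1)[:5])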

B.5. Proof of Theorem 5.7

Proof.

Denote [n1,,nk][n_{1},\ldots,n_{k}] as the numbers of the kk values in the ithi^{\text{th}} user's report set 𝒱i\mathcal{V}_{i}, where i=1kni=s+1\sum_{i=1}^{k}n_{i}=s+1. Denote 𝒮i\mathcal{S}_{i} (or 𝒮i\mathcal{S}_{i}^{\prime}) as the dummy dataset generated by the ithi^{\text{th}} user. Let vv be the true value held by the ithi^{\text{th}} user. According to the definition of LDP, we compare the probabilities that the shuffler receives the same set 𝒱i\mathcal{V}_{i} with value counts [n1,,nk][n_{1},\ldots,n_{k}] when v=1v=1 and when v=2v=2. We prove the following formula

Pr[R(v=1)𝒮i=𝒱i]eϵrPr[R(v=2)𝒮i=𝒱i]+δr\Pr[R(v=1)\cup\mathcal{S}_{i}=\mathcal{V}_{i}]\leq e^{\epsilon_{r}}\Pr[R(v=2)\cup\mathcal{S}_{i}^{\prime}=\mathcal{V}_{i}]+\delta_{r}

According to Equation (1) of GRR mechanism, we have

Pr[R(v=1)𝒮i=𝒱i]Pr[R(v=2)𝒮i=𝒱i]\displaystyle\frac{\Pr[R(v=1)\cup\mathcal{S}_{i}=\mathcal{V}_{i}]}{\Pr[R(v=2)\cup\mathcal{S}_{i}^{\prime}=\mathcal{V}_{i}]}
=eϵl(n11,n2,,nks)+(n1,n21,,nks)++(n1,n2,,nk1s)eϵl(n1,n21,,nks)+(n11,n2,,nks)++(n1,n2,,nk1s)\displaystyle=\frac{e^{\epsilon_{l}}(_{n_{1}-1,n_{2},\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})+(_{n_{1},n_{2}-1,\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})+\ldots+(_{n_{1},n_{2},\ldots,n_{k}-1}^{\ \ \ \ \ \ \ \ s})}{e^{\epsilon_{l}}(_{n_{1},n_{2}-1,\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})+(_{n_{1}-1,n_{2},\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})+\ldots+(_{n_{1},n_{2},\ldots,n_{k}-1}^{\ \ \ \ \ \ \ \ s})}
eϵl(n11,n2,,nks)+(n1,n21,,nks)eϵl(n1,n21,,nks)+(n11,n2,,nks)=eϵln1+n2eϵln2+n1\displaystyle\leq\frac{e^{\epsilon_{l}}(_{n_{1}-1,n_{2},\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})+(_{n_{1},n_{2}-1,\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})}{e^{\epsilon_{l}}(_{n_{1},n_{2}-1,\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})+(_{n_{1}-1,n_{2},\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})}=\frac{e^{\epsilon_{l}}n_{1}+n_{2}}{e^{\epsilon_{l}}n_{2}+n_{1}}

Since dummy points are uniformly sampled at random from 𝔻\mathbb{D}, both n1n_{1} and n2n_{2} follow binomial distributions: n1n_{1} follows Bin(s,1k)+1Bin(s,\frac{1}{k})+1 and n2n_{2} follows Bin(s,1k)Bin(s,\frac{1}{k}). Based on Lemma 5.1, n1n_{1} and n2n_{2} can be treated as mutually independent variables. We have

Pr[R(v=1)𝒮i=𝒱i]Pr[R(v=2)𝒮i=𝒱i]\displaystyle\frac{\Pr[R(v=1)\cup\mathcal{S}_{i}=\mathcal{V}_{i}]}{\Pr[R(v=2)\cup\mathcal{S}_{i}^{\prime}=\mathcal{V}_{i}]} eϵlBin(s,1k)+eϵl+Bin(s,1k)eϵlBin(s,1k)+Bin(s,1k)+1\displaystyle\leq\frac{e^{\epsilon_{l}}Bin(s,\frac{1}{k})+e^{\epsilon_{l}}+Bin(s,\frac{1}{k})}{e^{\epsilon_{l}}Bin(s,\frac{1}{k})+Bin(s,\frac{1}{k})+1}
Bin(s,1k)(eϵl+1)+eϵlBin(s,1k)(eϵl+1)\displaystyle\leq\frac{Bin(s,\frac{1}{k})(e^{\epsilon_{l}}+1)+e^{\epsilon_{l}}}{Bin(s,\frac{1}{k})(e^{\epsilon_{l}}+1)}
=Bin(s,1k)Bin(s,1k)+eϵlBin(s,1k)(eϵl+1)\displaystyle=\frac{Bin(s,\frac{1}{k})}{Bin(s,\frac{1}{k})}+\frac{e^{\epsilon_{l}}}{Bin(s,\frac{1}{k})(e^{\epsilon_{l}}+1)}

According to the Chernoff bound, Pr[Bin(s,1k)<(1γ)μ]<δ\Pr[Bin(s,\frac{1}{k})<(1-\gamma)\mu]<\delta, where γ=2kln(1/δ)s\gamma=\sqrt{\frac{2k\ln(1/\delta)}{s}} and μ=sk\mu=\frac{s}{k}. That is, Pr[Bin(s,1k)sk(12kln(1/δ)s)]1δ\Pr[Bin(s,\frac{1}{k})\geq\frac{s}{k}(1-\sqrt{\frac{2k\ln(1/\delta)}{s}})]\geq 1-\delta. Note that 12kln(1/δ)s01-\sqrt{\frac{2k\ln(1/\delta)}{s}}\geq 0, so ks12ln(1/δ)\frac{k}{s}\leq\frac{1}{2\ln(1/\delta)}. As Bin(s,1k)[sk(12kln(1/δ)s),s]Bin(s,\frac{1}{k})\in[\frac{s}{k}(1-\sqrt{\frac{2k\ln(1/\delta)}{s}}),s] with probability at least 1δ1-\delta, we have

Pr[R(v=1)𝒮i=𝒱i]Pr[R(v=2)𝒮i=𝒱i]\displaystyle\frac{\Pr[R(v=1)\cup\mathcal{S}_{i}=\mathcal{V}_{i}]}{\Pr[R(v=2)\cup\mathcal{S}_{i}^{\prime}=\mathcal{V}_{i}]}
ssk(12kln(1/δ)s)+eϵlsk(12kln(1/δ)s)(eϵl+1)\displaystyle\leq\frac{s}{\frac{s}{k}(1-\sqrt{\frac{2k\ln(1/\delta)}{s}})}+\frac{e^{\epsilon_{l}}}{\frac{s}{k}(1-\sqrt{\frac{2k\ln(1/\delta)}{s}})(e^{\epsilon_{l}}+1)}
=k12kln(1/δ)s(1+eϵls(eϵl+1))\displaystyle=\frac{k}{1-\sqrt{\frac{2k\ln(1/\delta)}{s}}}(1+\frac{e^{\epsilon_{l}}}{s(e^{\epsilon_{l}}+1)})

When ks(c2+4c2ln(1/δ)c)2\frac{k}{s}\leq(\sqrt{c^{2}+\frac{4c}{\sqrt{2\ln(1/\delta)}}}-c)^{2}, it holds that k12kln(1/δ)s(1+eϵls(eϵl+1))eϵl\frac{k}{1-\sqrt{\frac{2k\ln(1/\delta)}{s}}}(1+\frac{e^{\epsilon_{l}}}{s(e^{\epsilon_{l}}+1)})\leq e^{\epsilon_{l}}, where c=2ln(1/δ)eϵl(eϵl+1)s(eϵl+1)+eϵlc=\sqrt{2\ln(1/\delta)}\cdot\frac{e^{\epsilon_{l}}(e^{\epsilon_{l}}+1)}{s(e^{\epsilon_{l}}+1)+e^{\epsilon_{l}}}. ∎

B.6. Proof of Lemma 5.13

Proof.

fvf_{v} is the true frequency with which element vv appears in the users' dataset, and f~v\tilde{f}_{v} is its estimation. We prove that f~v\tilde{f}_{v} in flexible pureDUMP is unbiased.

As each user generates dummy points with probability γ\gamma, and does not generate any dummy point with probability 1γ1-\gamma, the estimation function is

f~v=1n(i[n],j[s+1]𝟙{v=yi,j}γnsk)\tilde{f}_{v}=\frac{1}{n}\cdot(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-\frac{\gamma ns}{k})

Then we have

E[f~v]\displaystyle E[\tilde{f}_{v}] =E[1n(i[n],j[s+1]𝟙{v=yi,j}γnsk)]\displaystyle=E[\frac{1}{n}\cdot(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-\frac{\gamma ns}{k})]
=1nE[i[n],j[s+1]𝟙{v=yi,j}]γsk\displaystyle=\frac{1}{n}\cdot E[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}]-\frac{\gamma s}{k}
=1n(nfv+γnsk)γsk\displaystyle=\frac{1}{n}\cdot(nf_{v}+\frac{\gamma ns}{k})-\frac{\gamma s}{k}
=fv\displaystyle=f_{v}
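
A quick simulation makes the unbiasedness tangible. This sketch (our own illustrative code, with hypothetical parameters) applies the flexible pureDUMP estimation function above to synthetic reports and compares the mean estimate with the true frequency.

import numpy as np

rng = np.random.default_rng(seed=0)
n, k, s, gamma = 100_000, 10, 5, 0.5
f_true = 0.3  # hypothetical: value v = 0 is held by 30% of users

estimates = []
for _ in range(20):
    data = rng.choice(k, size=n, p=[f_true] + [(1 - f_true) / (k - 1)] * (k - 1))
    senders = rng.random(n) < gamma              # users who send dummy points
    dummies = rng.integers(0, k, size=senders.sum() * s)
    count_v = np.sum(data == 0) + np.sum(dummies == 0)
    estimates.append((count_v - gamma * n * s / k) / n)

print(np.mean(estimates))  # close to f_true = 0.3, as Lemma 5.13 states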

B.7. Proof of Lemma 5.14

Proof.

fvf_{v} is the true frequency with which element vv appears in the users' dataset, and f~v\tilde{f}_{v} is its estimation. We prove that f~v\tilde{f}_{v} of flexible mixDUMP is unbiased.

In the GRR mechanism, each value keeps unchanged with probability pp and changes to each other value in the data domain with probability qq. As each user generates dummy points with probability γ\gamma, and does not generate any dummy point with probability 1γ1-\gamma, the estimation function is

f~v=i[n],j[s+1]𝟙{v=yi,j}nλ(11k)γnskn(12λ(11k))\tilde{f}_{v}=\frac{\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-n\lambda(1-\frac{1}{k})-\frac{\gamma ns}{k}}{n(1-2\lambda(1-\frac{1}{k}))}

where λ=kk1q\lambda=\frac{k}{k-1}q. To simplify the derivation, we replace λ\lambda with qq; then the expectation of f~v\tilde{f}_{v} is

E[f~v]\displaystyle E[\tilde{f}_{v}] =E[1n(i[n],j[s+1]𝟙{v=yi,j}nqγnsk)(12q)]\displaystyle=E[\frac{1}{n}\cdot\frac{(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-nq-\frac{\gamma ns}{k})}{(1-2q)}]
=1nE[(i[n],j[s+1]𝟙{v=yi,j}nqγnsk)(12q)]\displaystyle=\frac{1}{n}\cdot E[\frac{(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-nq-\frac{\gamma ns}{k})}{(1-2q)}]
=1n112qE[i[n],j[s+1]𝟙{v=yi,j}nqγnsk]\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot E[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-nq-\frac{\gamma ns}{k}]
=1n112q(E[i[n],j[s+1]𝟙{v=yi,j}]nqγnsk)\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot(E[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}]-nq-\frac{\gamma ns}{k})
=1n112q(nfv(1q)+n(1fv)q+γnsk\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot(nf_{v}(1-q)+n(1-f_{v})q+\frac{\gamma ns}{k}
nqγnsk)\displaystyle-nq-\frac{\gamma ns}{k})
=1n112q(12q)nfv\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot(1-2q)\cdot nf_{v}
=fv\displaystyle=f_{v}

B.8. Proof of Theorem 5.15

Proof.

As each user sends dummy points with probability γ\gamma, the number of occurrences of any distinct value among the dummy points follows Bin(ns,γk)Bin(ns,\frac{\gamma}{k}). Thus, the estimation function of flexible pureDUMP is

f~v=1n(i[n],j[s+1]𝟙{v=yi,j}γnsk)\tilde{f}_{v}=\frac{1}{n}\cdot(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-\frac{\gamma ns}{k})

Then, we have

Var[f~v]\displaystyle Var[\tilde{f}_{v}] =Var[1n(i[n],j[s+1]𝟙{v=yi,j}γnsk)]\displaystyle=Var[\frac{1}{n}\cdot(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-\frac{\gamma ns}{k})]
=1n2Var[i[n],j[s+1]𝟙{v=yi,j}]\displaystyle=\frac{1}{n^{2}}\cdot Var[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}]
=1n2nsγk(1γk)\displaystyle=\frac{1}{n^{2}}\cdot ns\cdot\frac{\gamma}{k}\cdot(1-\frac{\gamma}{k})
=sγ(kγ)nk2\displaystyle=\frac{s\gamma(k-\gamma)}{nk^{2}}

Lemma 5.13 proves that f~v\tilde{f}_{v} is unbiased in flexible pureDUMP, thus the MSE of flexible pureDUMP is

MSE\displaystyle MSE =1kv𝔻E[(f~vfv)2]=1kv𝔻Var(f~v)\displaystyle=\frac{1}{k}\sum_{v\in\mathbb{D}}E[(\tilde{f}_{v}-f_{v})^{2}]=\frac{1}{k}\sum_{v\in\mathbb{D}}Var(\tilde{f}_{v})
=sγ(kγ)nk2\displaystyle=\frac{s\gamma(k-\gamma)}{nk^{2}}

B.9. Proof of Theorem 5.16

Proof.

As each user generates dummy points with probability γ\gamma in flexible mixDUMP, the number of occurrences of any distinct value among the dummy points follows Bin(ns,γk)Bin(ns,\frac{\gamma}{k}). Thus, the estimation function of flexible mixDUMP is

f~v=i[n],j[s+1]𝟙{v=yi,j}nλ(11k)γnskn(12λ(11k))\tilde{f}_{v}=\frac{\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-n\lambda(1-\frac{1}{k})-\frac{\gamma ns}{k}}{n(1-2\lambda(1-\frac{1}{k}))}

where λ=kk1q\lambda=\frac{k}{k-1}q. To simplify the derivation, we replace λ\lambda with qq; with p=eϵleϵl+k1,q=1eϵl+k1p=\frac{e^{\epsilon_{l}}}{e^{\epsilon_{l}}+k-1},\ q=\frac{1}{e^{\epsilon_{l}}+k-1}, and again assuming fvf_{v} is small on average as in Theorem 5.9, the variance of f~v\tilde{f}_{v} is

Var[f~v]\displaystyle Var[\tilde{f}_{v}] =Var[1ni[n],j[s+1]𝟙{v=yi,j}γns1knqpq]\displaystyle=Var[\frac{1}{n}\cdot\frac{\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-\gamma ns\cdot\frac{1}{k}-nq}{p-q}]
=1n21(pq)2Var[i[n],j[s+1]𝟙{v=yi,j}]\displaystyle=\frac{1}{n^{2}}\cdot\frac{1}{(p-q)^{2}}Var[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}]
1n21(pq)2(nq(1q)+nsγkkγk)\displaystyle\simeq\frac{1}{n^{2}}\cdot\frac{1}{(p-q)^{2}}(nq(1-q)+ns\cdot\frac{\gamma}{k}\cdot\frac{k-\gamma}{k})
=1neϵl+k2(eϵl1)2+sγ(kγ)nk2(eϵl+k1eϵl1)2\displaystyle=\frac{1}{n}\cdot\frac{e^{\epsilon_{l}}+k-2}{(e^{\epsilon_{l}}-1)^{2}}+\frac{s\gamma(k-\gamma)}{nk^{2}}\cdot(\frac{e^{\epsilon_{l}}+k-1}{e^{\epsilon_{l}}-1})^{2}

Lemma 5.14 proves that f~v\tilde{f}_{v} is unbiased in flexible mixDUMP; thus the MSE of flexible mixDUMP is

MSE\displaystyle MSE =1kv𝔻E[(f~vfv)2]=1kv𝔻Var(f~v)\displaystyle=\frac{1}{k}\sum_{v\in\mathbb{D}}E[(\tilde{f}_{v}-f_{v})^{2}]=\frac{1}{k}\sum_{v\in\mathbb{D}}Var(\tilde{f}_{v})
=1neϵl+k2(eϵl1)2+sγ(kγ)nk2(eϵl+k1eϵl1)2\displaystyle=\frac{1}{n}\cdot\frac{e^{\epsilon_{l}}+k-2}{(e^{\epsilon_{l}}-1)^{2}}+\frac{s\gamma(k-\gamma)}{nk^{2}}\cdot(\frac{e^{\epsilon_{l}}+k-1}{e^{\epsilon_{l}}-1})^{2}