
DUMP: A Dummy-point-based Local Differential Privacy Enhancement Approach under the Shuffle Model

Xiaochen Li Zhejiang University [email protected] Weiren Liu Alibaba Group [email protected] Hanwen Feng Alibaba Group [email protected] Kunzhe Huang Zhejiang University [email protected] Yuke Hu Zhejiang University [email protected] Jinfei Liu Zhejiang University [email protected] Kui Ren Zhejiang University [email protected]  and  Zhan Qin Zhejiang University [email protected]
Abstract.

The shuffle model was recently proposed to address the severe utility loss that Local Differential Privacy (LDP) suffers due to distributed data randomization. In the shuffle model, a shuffler is utilized to break the link between the user identity and the message uploaded to the data analyst. Since less noise needs to be introduced to achieve the same privacy guarantee, this paradigm improves the utility of privacy-preserving data collection.

In this paper, we propose DUMP (DUMmy-Point-based), a new framework for privacy-preserving histogram estimation in the shuffle model. In DUMP, dummy messages are introduced on the user side, creating an additional dummy blanket that further improves the utility of the shuffle model. This new privacy-preserving mechanism also achieves an LDP privacy amplification effect on the uploaded data against a curious shuffler. We instantiate DUMP with two protocols, pureDUMP and mixDUMP, both of which achieve theoretical improvements in accuracy and communication complexity compared with existing protocols. We conduct a comprehensive experimental evaluation comparing the two designs under DUMP with existing protocols. Experimental results on both synthetic and real-world datasets confirm our theoretical analysis and show that the proposed protocols achieve better utility than existing protocols under the same privacy guarantee.

PVLDB Reference Format:
PVLDB, 14(1): XXX-XXX, 2020.
doi:XX.XX/XXX.XX

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097.
doi:XX.XX/XXX.XX

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at URL_TO_YOUR_ARTIFACTS.

1. Introduction

Differential privacy is widely accepted, by both academia and industry (Erlingsson et al., 2014; Team et al., 2017; Ding et al., 2017; Wang et al., 2019), as the standard privacy definition for statistical analysis on databases containing sensitive user information. Over the last decades, the community has devoted considerable effort to designing practical mechanisms that achieve better utility while providing a reliable privacy guarantee for users; most of them fall into either the central model (Geng et al., 2015; Ghosh et al., 2012) or the local model (Kairouz et al., 2014; Holohan et al., 2017). In the central model, the analyst, who collects data from individuals, is responsible for adding noise to the aggregated result such that the noisy result does not leak too much information about any individual. In contrast, the local model lets each user perturb his data before submitting it to an analyst and thus provides a more reliable privacy guarantee even when the analyst is curious. However, mechanisms achieving local differential privacy (LDP) have limited utility and turn out to be ill-suited to practice (Bittau et al., 2017). It is known that the estimation error is at least Ω(√n) for a wide range of natural problems in the local model (n is the number of users), whereas an error of just O(1) can be achieved in the central model (Duchi et al., 2013).

The above dilemma motivates the study of new approaches that prevent third parties from violating users' privacy while achieving reasonable utility; the shuffle model, initiated by Bittau et al. (Bittau et al., 2017), is emerging as a promising one. This model introduces an independent third party called the shuffler, who is responsible for shuffling all (permuted) messages received from users before passing them to the analyst. (In practice, various cryptographic tools such as nested encryptions (Bittau et al., 2017) can be employed to ensure that the shuffler cannot learn the content of these reports.) Regarding accuracy, for queries that are independent of the message order, this shuffling introduces no extra utility loss. Regarding privacy, the analyst intuitively cannot link each report in a shuffled database to an individual, and thus the privacy against analysts is amplified. Equivalently, users need to add less "noise" to their messages for the same privacy, which in turn improves utility. Though similar (or even better) results can be achieved using heavy cryptographic primitives such as homomorphic encryption and multiparty computation protocols (Lazar et al., 2018; Roy Chowdhury et al., 2020), shuffle-model approaches are more popular in industry since they are considerably more efficient and implementation-friendly. In recent years, several shuffle-model protocols have been presented for various tasks (e.g., frequency estimation (Balcer and Cheu, 2019), data summation (Ghazi et al., 2019b), and range query (Ghazi et al., 2019a)) over either numerical or categorical data.

We focus on the problem of histogram estimation, the fundamental task over categorical data, in the shuffle model. Many recent works (Wang et al., 2020; Ghazi et al., 2021; Balcer and Cheu, 2019; Balle et al., 2019b; Ghazi et al., 2020b) have aimed to design truly practical mechanisms for this problem, and they offer different tradeoffs. In terms of efficiency and implementation-friendliness, the generalized random response (GRR) based mechanisms (Balle et al., 2019b), in which a user simply replaces his data x by a uniformly random value r in the data domain 𝔻 with a certain probability q, turn out to be optimal: (1) only a single message is sent per user, and (2) they directly use uniform randomness, which is naturally provided by operating systems via standard means. We note that feature (2) is highly desirable in practice: although random variables with many other distributions can be calculated from uniformly random variables, inexact calculation over finite-precision values cannot produce randomness with exactly the specified distribution, and the implemented privacy mechanism may leak significant information about the original data (see (Ilvento, 2020; Mironov, 2012; Balcer and Vadhan, 2018; Gazeau et al., 2016)). More interestingly, Balle et al. (Balle et al., 2019b) proposed the privacy blanket intuition for GRR-based protocols, which gives a thoroughly theoretical explanation of how (and how much) shuffling enhances privacy, in a conceptually simple manner.

Nonetheless, the estimation error of GRR-based protocols becomes unexpectedly large as the domain size |𝔻| grows; the privacy guarantee they can provide is also subject to the number of users and the domain size and might not be sufficient for applications. Subsequent works (Wang et al., 2020; Ghazi et al., 2021; Balcer and Cheu, 2019; Ghazi et al., 2020b) improve the accuracy and achieve errors independent of |𝔻| (under the same privacy guarantee) by letting users randomize their data in rather complex ways (e.g., encoding data using hash functions (Wang et al., 2020) or the Hadamard matrix (Ghazi et al., 2021), adding noise from negative binomial distributions (Ghazi et al., 2020b), etc.); some of them (Ghazi et al., 2021; Ghazi et al., 2020b) also enable a high privacy guarantee. Yet they indeed sacrifice efficiency and/or implementation-friendliness (a detailed comparison is presented later).

In this work, we present shuffle-model mechanisms for histogram estimation that are almost as efficient and implementation-friendly as GRR-based protocols (Balle et al., 2019b), enable a high privacy guarantee, and have accuracy competitive with (Wang et al., 2020; Ghazi et al., 2021; Balcer and Cheu, 2019; Ghazi et al., 2020b). Moreover, we give an intuition that clearly explains how, and how much, our mechanisms provide privacy (mimicking the privacy blanket intuition); we believe such intuition can assist in the design and analysis of future mechanisms.

Our Contributions. We overview our contributions in more detail below.

Dummy blanket & DUMP framework. We introduce the concept of the dummy blanket, which enables a very simple method of enhancing users' privacy (particularly in the shuffle model) and significantly outperforms the privacy blanket (Balle et al., 2019b) in accuracy. Recall that the privacy blanket provides an insightful view of the GRR mechanism: if each user reports the true value v with probability p and a uniformly sampled v′ with probability q = 1 − p, then, equivalently, there are expected to be nq uniformly random values among the n reports. After shuffling, the victim data v* is mixed with these nq points, which serve as a "blanket" to hide v*. Our dummy blanket intuition generalizes the privacy blanket. Specifically, we observe that privacy essentially comes from these uniformly random points and is determined by their number, and we need not insist that these points be contributed by data randomization. Instead, we allow users to directly add uniformly random points (called dummy points) before reporting to the shuffler. As in the case of the privacy blanket, these dummy points also serve as a "blanket" to protect the victim.

The dummy blanket allows us to achieve better accuracy at the cost of slightly increased communication overhead. To see this, we first show that the privacy guarantee is mostly determined by the total number of dummy points. Then, to achieve a certain privacy level (which needs s dummy points), the privacy blanket intuition requires replacing s true values, while we just add s dummy points. Since the original values are preserved, the estimation result under the dummy blanket is more accurate. Formally, we show the estimation error introduced by the dummy blanket is independent of the domain size |𝔻| and always concretely smaller than that of the privacy blanket (see Section 5.1). Regarding communication overhead, note that only the total number of dummy points matters, so the amortized communication overhead is close to a single message when n is large. Furthermore, to some extent, the dummy blanket intuition explains why previous works (Balcer and Cheu, 2019; Wang et al., 2020) improved accuracy: their data randomizers implicitly generate dummy points, though the methodology of adding dummy points and its advantage had never been formally studied.

We incorporate the dummy blanket intuition into the existing shuffle-model framework and present a general DUMP (DUMmy-Point-based) framework for histogram estimation. In DUMP, a user first processes his data using a data randomizer (as in the conventional shuffle-model framework), then invokes a dummy-point generator to obtain a certain number of dummy points, and finally submits the randomized data together with these dummy points to the shuffler. By the dummy blanket intuition, adding dummy points enhances privacy without significantly degrading accuracy. Note that most existing locally private protection mechanisms (e.g., Randomized Response (Warner, 1965), RAPPOR (Erlingsson et al., 2014), OLH (Wang et al., 2017)) can be formulated as special cases of the data randomizer, so the framework admits diverse instantiations offering different tradeoffs for different purposes.

Instantiations: pureDUMP & mixDUMP. We first instantiate DUMP with an empty data randomizer and obtain the pureDUMP protocol. In pureDUMP, the privacy guarantee is fully provided by dummy points, so the analysis of pureDUMP quantifies the properties of the dummy blanket. Assume each user adds s dummy points and the data domain 𝔻 has size k. We prove that (1) pureDUMP satisfies (ϵ, δ)-differential privacy for δ ≤ 0.2907 and ϵ = √(14k·ln(2/δ)/(ns − 1)); (2) its expected mean squared error (eMSE) of frequency estimation is s(k − 1)/(nk²). The eMSE is therefore O(log(1/δ)/(ϵ²n²)), independent of k, and exceeds the error of the central model only by a log(1/δ) factor that is logarithmic in n; the number of messages per user is 1 + O(k·log(1/δ)/(nϵ²)), which approaches 1 as n grows. Moreover, by increasing s, ϵ can be made arbitrarily close to 0 as desired. Meanwhile, we prove that pureDUMP even satisfies local differential privacy, which is meaningful when the shuffler and the analyst collude.

We then present mixDUMP, which utilizes GRR as the data randomizer in the DUMP framework. mixDUMP offers a new tradeoff between communication overhead and accuracy. In particular, since privacy is also provided by GRR, fewer dummy points are needed. We show that mixDUMP has the smallest communication overhead among all existing multi-message mechanisms, while its eMSE is still competitive (though slightly larger than that of pureDUMP).

Finally, notice that privacy is determined by the total number of dummy points; when n is large, it may not be necessary to ask each user to send even a single dummy point. Therefore, we investigate the setting where each user generates a fixed number of dummy points only with a certain probability. This setting is also interesting when not all users are willing to use a dummy-point-based privacy protection mechanism due to the limited network bandwidth or computation capability of their devices. We formally prove that pureDUMP and mixDUMP still provide enhanced (ϵ, δ)-differential privacy protection in this setting. This result allows smaller communication overhead and better accuracy when n is large.

Implementations & Comparisons. We analyze and implement all the existing related protocols (Balle et al., 2019b; Wang et al., 2020; Ghazi et al., 2021; Balcer and Cheu, 2019; Ghazi et al., 2020b) (to the best of our knowledge) for comprehensive comparisons. We conduct experiments on both synthetic and real-world datasets; we are also the first (to the best of our knowledge) to report large-scale experiments on (Balcer and Cheu, 2019; Ghazi et al., 2021). With the same privacy guarantee, the results show that (1) pureDUMP and mixDUMP have the lowest communication overhead among all existing multi-message protocols (Ghazi et al., 2021; Balcer and Cheu, 2019; Ghazi et al., 2020b); (2) pureDUMP achieves the best accuracy compared with (Balle et al., 2019b; Wang et al., 2020; Ghazi et al., 2021; Balcer and Cheu, 2019), while the only exception (Ghazi et al., 2020b), which achieves near-optimal accuracy, has much higher communication overhead and has to use randomness from negative binomial distributions, whose implementation is challenging and vulnerable in practice. Details follow.

• Efficiency. The communication overhead of both pureDUMP and mixDUMP decreases as the number of users increases, while (Ghazi et al., 2021; Balcer and Cheu, 2019) do not have this feature; the expected communication overhead of (Ghazi et al., 2020b) turns out to be concretely larger. In our experiments, for example, on the real-world dataset Ratings (rat, 2017) with ϵ = 1 and δ = 10⁻⁶, pureDUMP (and mixDUMP) only require around 0.8 (and 0.5) expected extra messages per user, while (Ghazi et al., 2021), (Balcer and Cheu, 2019), and (Ghazi et al., 2020b) need around 10², 10³, and 10⁴ extra messages, respectively. On the other hand, single-message protocols (Balle et al., 2019b; Wang et al., 2020) are less competitive in accuracy and/or subject to upper bounds on the privacy guarantee (namely, lower bounds on ϵ). In particular, for Ratings, (Balle et al., 2019b) cannot even achieve (ϵ, δ)-privacy with ϵ ≤ 1.

• Accuracy. Our experiments show that pureDUMP and mixDUMP enjoy better accuracy than all existing protocols except (Ghazi et al., 2020b). In particular, in the experiments on Ratings with ϵ = 1 and δ = 10⁻⁶, the eMSEs of pureDUMP and of (Ghazi et al., 2020b) lie in [10⁻¹⁰, 10⁻⁹), while the eMSEs of the others are close to 10⁻⁸.

• Implementation-friendliness. pureDUMP and mixDUMP are extremely implementation-friendly, as they just generate a fixed number of uniformly random values with a fixed probability and do not involve any computation that must be approximated on finite-precision devices (e.g., the base-e logarithm ln(·)). In contrast, (Ghazi et al., 2020b) uses randomness with the negative binomial distribution NB(r, p), where r = 1/n and p = e^{−ϵ}. (For k ∈ ℤ and X ← NB(r, p), Pr[X = k] = (k+r−1 choose k)·(1−p)^r·p^k.) Notice that r = 1/n is not an integer, so we cannot sample from this distribution by naively running independent events with success probability p until seeing r failures. We may have to follow the common approach to sampling negative binomial distributions: first sample λ from the Gamma distribution Γ(r, p/(1−p)) and then sample from the Poisson distribution Poisson(λ). Note that sampling from Poisson distributions involves computing ln(·), and Mironov (Mironov, 2012) showed that DP mechanisms implementing ln(·) over double-precision floating-point numbers cannot provide the claimed privacy protection. Other sampling methods for NB(r, p) might be possible but remain to be explored.

2. Background

Central Differential Privacy. Differential privacy was initially defined in the central model (Dwork, 2008), in which a trusted analyst collects sensitive data from users, gathers them into a dataset, and computes the statistical results. The statistical results are perturbed by the analyst before being released. Intuitively, differential privacy protects each individual's privacy by requiring that any single element in the dataset have limited impact on the output.

Definition 2.1.

(Central Differential Privacy, CDP). An algorithm ℳ satisfies (ϵ, δ)-DP, where ϵ, δ ≥ 0, if and only if for any neighboring datasets 𝒟 and 𝒟′ that differ in one element, and any possible set of outputs ℛ ⊆ Range(ℳ), we have

Pr[ℳ(𝒟) ∈ ℛ] ≤ e^ϵ · Pr[ℳ(𝒟′) ∈ ℛ] + δ

Here, ϵ is called the privacy budget. A smaller ϵ means that the outputs of ℳ on two adjacent datasets are more similar, and thus the privacy guarantee is stronger. δ is generally interpreted as the probability that ℳ fails to satisfy ϵ-DP. As usual, we consider neighboring datasets 𝒟 and 𝒟′ with the same number of elements (a.k.a. bounded differential privacy). For simplicity, we assume w.l.o.g. that 𝒟 and 𝒟′ differ in the n-th element, and we denote them by {x_1, x_2, …, x_n} and {x_1, x_2, …, x′_n}.

Local Differential Privacy. The central model assumes the analyst is fully trusted and does not ensure privacy against curious analysts, which is impractical in many scenarios. In contrast, the local model (Kasiviswanathan et al., 2011) allows each user to locally perturb his data with a randomization algorithm ℳ before sending it to the untrusted analyst, so that the analyst never sees the raw data. We say ℳ satisfies local differential privacy (LDP) if for any two values v and v′, the distributions of ℳ(v) and ℳ(v′) are close.

Definition 2.2.

(Local Differential Privacy, LDP). An algorithm ℳ satisfies (ϵ, δ)-LDP, where ϵ, δ ≥ 0, if and only if for any two elements v, v′ ∈ 𝔻 and any possible set of outputs ℛ ⊆ Range(ℳ), we have

Pr[ℳ(v) ∈ ℛ] ≤ e^ϵ · Pr[ℳ(v′) ∈ ℛ] + δ

δ is typically set to 0, and (ϵ, 0)-LDP is abbreviated as ϵ-LDP.

(Generalized) Randomized Response. Randomized Response (RR) (Warner, 1965) is considered the first LDP mechanism, supporting binary responses. It allows each user to give a false answer with a certain probability and thus provides plausible deniability. The RR mechanism can be extended to multi-valued domains, which is known as Generalized Randomized Response (GRR) (Wang et al., 2017). Let k denote the size of the domain 𝔻. Each user with private value v ∈ 𝔻 reports the true value v′ = v with probability p, and reports each other value v′ ∈ 𝔻, v′ ≠ v, with probability q. By standard analysis, GRR satisfies ϵ-LDP when

(1)  p = e^ϵ / (e^ϵ + k − 1),  q = 1 / (e^ϵ + k − 1).
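To make the mechanism concrete, here is a minimal Python sketch of a GRR randomizer with these probabilities (the function name and signature are ours, for illustration only):

import math
import random

def grr(v, k, epsilon):
    # Keep the true value v with probability p = e^eps / (e^eps + k - 1);
    # otherwise report one of the other k - 1 values, each with
    # probability q = 1 / (e^eps + k - 1).
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p:
        return v
    r = random.randint(1, k - 1)      # uniform over the k - 1 other values
    return r if r < v else r + 1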

Shuffle Model. The shuffle model follows the Encode, Shuffle, Analyze (ESA) architecture proposed by Bittau et al. (Bittau et al., 2017). A shuffler is inserted between users and the analyst to break the link between the user identity and the message. Following (Cheu et al., 2019), we describe a shuffle-model protocol by the following three algorithms.

  1. R: 𝒳 → 𝒴^m. The local randomized encoder R takes a single user's data x_i as input and outputs a set of m messages y_{i,1}, …, y_{i,m} ∈ 𝒴.

  2. S: (𝒴^m)^n → 𝒴^{mn}. The shuffler S takes the n users' output messages of R as input and outputs these messages in a uniformly random order.

  3. A: 𝒴^{mn} → 𝒵. The analyst A takes all the output messages of S as input and runs some analysis function on them.

Such a protocol is called a single-message protocol when m = 1, or a multi-message protocol when m > 1.
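As an illustration, the following is a minimal sketch of this three-algorithm composition; the helper name run_shuffle_protocol is ours, and R and A stand for whatever encoder and analyzer a concrete protocol supplies:

import random

def run_shuffle_protocol(data, R, A):
    # R: encode each user's value into m messages; S: shuffle the union
    # of all messages into a uniformly random order; A: analyze them.
    messages = [m for x in data for m in R(x)]
    random.shuffle(messages)   # the shuffler S
    return A(messages)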

Variable | Description
𝔻 | Finite, discrete domain {1, 2, …, k}
k | Size of the domain 𝔻
𝒟 | Set of users' input values
n | Number of users
𝒮 | Set of dummy points
|𝒮| | Total number of dummy points
s | Number of dummy points sent per user
x_i ∈ 𝔻 | Input value of the i-th user
y_{i,j} ∈ 𝔻 | The j-th value sent by the i-th user
z⃗ | Estimated frequencies of the elements in 𝔻
Ber(·) | Sample a random binary value from a Bernoulli distribution
Unif([k]) | Uniformly sample a random value from {1, 2, …, k}
Table 1. Important Notations

3. Histogram Estimation in the Shuffle Model

3.1. Problem Statement

System Setting. The task is performed in the shuffle model. The system involves an analyst, a shuffler, and n users. The analyst knows the data domain 𝔻; we assume its size is k, i.e., 𝔻 = {1, 2, …, k}. The i-th user holds a single value x_i ∈ 𝔻. All users send perturbed data to the shuffler, and the shuffler scrambles the order before transmitting the data to the analyst. For the j-th item in 𝔻, the analyst estimates the frequency z⃗[j] (the proportion of users holding that value), i.e., z⃗[j] = (1/n)·Σ_{i=1}^{n} 𝟙_{x_i = j}, where 𝟙_{x_i = j} = 1 if x_i = j and 𝟙_{x_i = j} = 0 otherwise.

Threat Model. As in most shuffle-model works, we consider privacy against curious analysts. We concentrate on the standard case, in which the analyst does not collude with the shuffler. We say a shuffle-model protocol P = (R, S, A) (see Section 2) satisfies (ϵ, δ)-DP against curious analysts if the mechanism (S ∘ R)(x_1, …, x_n) = S(R(x_1), …, R(x_n)), i.e., the view of the analyst, satisfies (ϵ, δ)-DP in the central model (see Definition 2.1). We also study the case where the analyst colludes with the shuffler and thus has direct access to users' reports. In this case, the differential privacy of P against the collusion is equivalent to the local differential privacy of R (see Definition 2.2).

Following (Balle et al., 2019b), throughout this paper we consider powerful adversaries with maximal background knowledge, i.e., all users' values except the victim's; our mechanisms are thus also secure against any weaker adversary. All privacy analyses are under the assumption of bounded differential privacy, so we do not need to consider the case where some users drop out.

3.2. Existing Related Protocols

There are several existing works on histogram estimation in the shuffle model. To the best of our knowledge, the first protocol was proposed by Balle et al. (Balle et al., 2019b), in which each user randomizes his sensitive data with the GRR mechanism. Later, other mechanisms were adopted as randomizers to improve the utility of such protocols. Ghazi et al. (Ghazi et al., 2021) propose two multi-message protocols: private-coin and public-coin. The private-coin protocol utilizes Hadamard Response as the data randomizer; the public-coin protocol combines the Count Min data structure (Cormode and Muthukrishnan, 2005) with GRR (Wang et al., 2017). In the truncation-based protocol proposed by Balcer et al. (Balcer and Cheu, 2019), each user sends his sensitive data together with at most k (the domain size) non-repeated values sampled from 𝔻 to the shuffler. However, the truncation-based protocol may suffer from an undesirable communication load: each user needs to send thousands of messages when there are thousands of candidate values in 𝔻. SOLH, proposed by Wang et al. (Wang et al., 2020), utilizes Optimal Local Hash (OLH) as the data randomizer to improve the utility when 𝔻 is large. The correlated-noise protocol proposed by Ghazi et al. (Ghazi et al., 2020b) adds noise from negative binomial distributions and achieves accuracy arbitrarily close to that of CDP.

All the above protocols seek more suitable randomizers to improve the utility of the shuffle model. Different from existing works, we introduce a new mechanism to provide privacy protection. Our method obtains more privacy benefit than existing approaches, which further improves the utility of histogram estimation in the shuffle model.

4. DUMP Framework

4.1. Framework Overview

There are three parties in the DUMP framework: the user side, the shuffler, and the analyst. Figure 1 shows the interactions among them.

User Side. There are two protection procedures on the user side. The first is the data randomizer. Most existing LDP mechanisms, such as Randomized Response (RR), Unary Encoding (UE), Optimal Local Hash (OLH), and Hadamard Response (HR), can be viewed as special cases of the data randomizer. Hence, the output R(x_i) of the data randomizer may take different forms, such as an element, a vector, or a matrix. When the data randomizer provides no privacy protection, it directly outputs the user's sensitive data, i.e., R(x_i) = x_i. The second is the dummy-point generator, which provides privacy protection by generating dummy points: s dummy points are sampled uniformly at random from the output space of the data randomizer, where s is calculated from the number of users n, the domain size k, and the privacy parameters. The output R(x_i), mixed with the s dummy points, is then sent to the shuffler.

Shuffler. The shuffler in DUMP operates the same as in the existing shuffle model (Bittau et al., 2017). It collects all messages from users and removes all implicit metadata (such as IP addresses, timestamps, and routing paths). Then all messages are randomly shuffled before being sent to the analyst.

Figure 1. Overview of the DUMP framework.

Analyst. The analyst performs histogram estimation on the set of messages received from the shuffler. It removes the bias caused by dummy points and data randomization to obtain unbiased estimation results.

Under the DUMP framework, we design two protocols, pureDUMP and mixDUMP, in Section 5 to show the advantage of dummy points in improving utility. Dummy points provide privacy protection in both protocols. In addition, mixDUMP uses GRR as the data randomizer, while no protection is provided by the data randomizer in pureDUMP. We show that dummy points can improve accuracy without adding too much communication overhead under a given privacy guarantee. Table 2 presents our results alongside existing ones; all protocols are used for k-bin histogram estimation in the shuffle model.

Protocol | No. Messages per User | No. Bits per Message | Expected Mean Squared Error | Condition
privacy-amplification (Balle et al., 2019b) | 1 | O(log k) | O(ϵ²·log(1/δ) / (ϵ²(n−1) − k·log(1/δ))²) | √(14k·log(2/δ)/(n−1)) < ϵ ≤ 1
SOLH (Wang et al., 2020) | 1 | O(log k) | O(log(1/δ) / (ϵ²n² − log(1/δ))) | √(14·log(2/δ)/(n−1)) < ϵ ≤ 1
truncation-based (Balcer and Cheu, 2019) | 1 + O(k) | O(log k) | O(log(1/δ) / (ϵ²n²)) | e^{−O(nϵ²)} ≤ δ < 1/n
private-coin (Ghazi et al., 2021) | 1 + O((1/ϵ²)·log(1/(ϵδ))) | O(log n · log k) | O(log(1/δ) / (ϵ²n²)) | ϵ < 1, δ < 1/log k
public-coin (Ghazi et al., 2021) | 1 + O((1/ϵ²)·log³k·log((1/δ)·log k)) | O(log n + log log k) | O((1/n)·log(log(k)/δ)·log²k) | ϵ < 1, δ < 1/log k
correlated-noise (Ghazi et al., 2020b) | 1 + O((k/(nϵ²))·log²(1/δ)) | O(log k) | O(1/(ϵ²n²)) | ϵ ≤ O(1)
mixDUMP | 1 + O((k/(nϵ²))·log(1/δ)) | O(log k) | O((1/(ϵ²n²))·(1 + k/(e^ϵ − 1))²·log(1/δ)) | ϵ ≤ 1
pureDUMP | 1 + O((k/(nϵ²))·log(1/δ)) | O(log k) | O((1/(ϵ²n²))·log(1/δ)) | ϵ ≤ 1
Table 2. Comparison of results for histogram estimation. Each user is assumed to hold one value from 𝔻.

4.2. Dummy Blanket

The blanket intuition was first proposed by Balle et al. (Balle et al., 2019b). They decompose the probability distribution of the data randomizer's output into two distributions to show the privacy enhancement of the shuffle model: one depends on the input value, and the other is independently random. Consider feeding a dataset {x_1, x_2, …, x_n} into the data randomizer. The output histogram Y can be decomposed as Y = Y_1 ∪ Y_2, where Y_1 is the histogram formed by independently random data and Y_2 is the histogram formed by unchanged sensitive data. Suppose the analyst knows {x_1, x_2, …, x_{n−1}}, and x_n remains unchanged. To break the privacy of x_n from the dataset sent by the shuffler, the analyst effectively has to analyze Y_1 ∪ {x_n}. Although x_n is not randomized, Y_1 hides x_n like a random blanket.

In this paper, we propose the concept of the dummy blanket to take further advantage of the shuffle model; we only consider uniformly distributed dummy blankets. Whether or not he participates in randomization, each user sends s uniformly generated dummy points to the shuffler. The ns dummy points in total form a histogram Y_3 that hides each user's data like a dummy blanket. Under the aforementioned assumptions, the analyst now needs to distinguish x_n from Y_1 ∪ Y_3 ∪ {x_n}. The dummy blanket provides the same privacy guarantee for each user, while the communication overhead is shared by all users; this brings a significant utility enhancement when there are a large number of users. Moreover, the dummy blanket does not conflict with any randomization mechanism: it can be wrapped around the blanket formed by any data randomizer.

5. Proposed Protocols

In this section, we present the pureDUMP protocol (Section 5.1) and the mixDUMP protocol (Section 5.2). For each protocol, we analyze its privacy, accuracy, and efficiency.

5.1. pureDUMP: A Pure DUMP protocol

Design of Algorithms. In pureDUMP, all the privacy guarantee is provided by the dummy blanket. On the user side, instead of randomizing the input value x_i ∈ 𝔻, s dummy points are randomly sampled from the domain 𝔻 to hide x_i. Here s is determined by the privacy parameters once the number of users and the domain size are fixed; the effect of s on the privacy budget ϵ is discussed later in the privacy analysis. The shuffler collects all n(s+1) messages from users and randomly shuffles them before sending them to the analyst. From the received messages, the analyst approximates the frequency of each i ∈ 𝔻 with a standard debiasing step. We omit the description of the shuffling operation as it is the same as in standard shuffle-based mechanisms. The user-side and analyst-side algorithms are given in Algorithm 1 and Algorithm 2, respectively.

Algorithm 1 User-side Algorithm of pureDUMP
1: procedure P(x_i):
2: Input: x_i ∈ 𝔻, parameters n, k, s ∈ ℕ⁺
3: Output: multiset Y_i ⊆ 𝔻
4:   for j ← 1 to s do  ▷ Generate dummy points
5:     y_{i,j} ← Unif([k])
6:   end for
7:   Mix x_i with {y_{i,1}, …, y_{i,s}} to get Y_i := {y_{i,1}, …, y_{i,s+1}}
8:   return Y_i
9: end procedure

Algorithm 2 Analyst-side Algorithm of pureDUMP
1: procedure A(Y_1, …, Y_n):
2: Input: multisets {y_{1,1}, …, y_{1,s+1}}, …, {y_{n,1}, …, y_{n,s+1}} ⊆ 𝔻, parameters n, k, s ∈ ℕ⁺
3: Output: vector z⃗ ∈ [0, 1]^k  ▷ Frequencies of the k elements
4:   for m ← 1 to k do
5:     z⃗[m] = (1/n)·(Σ_{i∈[n], j∈[s+1]} 𝟙_{m = y_{i,j}} − ns/k)
6:   end for
7:   return z⃗
8: end procedure
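For concreteness, the following is a minimal runnable Python sketch of Algorithms 1 and 2 (NumPy-based; the function names and parameter values are ours and purely illustrative):

import numpy as np

def pure_dump_user(x, k, s, rng):
    # Algorithm 1: mix the true value with s uniform dummy points.
    return np.append(rng.integers(1, k + 1, size=s), x)

def pure_dump_analyst(messages, n, k, s):
    # Algorithm 2: debias each bin by the expected ns/k dummy points.
    counts = np.bincount(messages, minlength=k + 1)[1:]
    return (counts - n * s / k) / n

rng = np.random.default_rng(0)
n, k, s = 100_000, 100, 2                       # illustrative parameters
data = rng.integers(1, k + 1, size=n)           # synthetic user values
msgs = np.concatenate([pure_dump_user(x, k, s, rng) for x in data])
rng.shuffle(msgs)                               # the shuffler
z = pure_dump_analyst(msgs, n, k, s)
true_freq = np.bincount(data, minlength=k + 1)[1:] / n
print(np.abs(z - true_freq).max())              # small estimation error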

Privacy Analysis. We now analyze the privacy guarantee of pureDUMP. First, we show how much privacy the dummy blanket provides against the analyst. We focus on the regime where the count of each element in the dummy dataset is much smaller than the total number of dummy points, which captures numerous practical scenarios since the size of the data domain 𝔻 is usually large. Our privacy analysis is therefore based on Lemma 5.1. In fact, the privacy analyses in (Balle et al., 2019b; Wang et al., 2020) also need to rely on Lemma 5.1, although this is not made explicit there.

Lemma 5.1.

Let a_1, a_2, …, a_m be random variables that follow binomial distributions with a_1 + a_2 + … + a_m = N. For any i, j ∈ [1, m], i ≠ j, a_i and a_j can be approximately treated as mutually independent when a_1, a_2, …, a_m ≪ N.

The proof of Lemma 5.1 is in Appendix B.1. Then we show the analysis result of pureDUMP in Theorem 5.2.

Theorem 5.2.

PureDUMP satisfies (ϵ_d, δ_d)-DP against the analyst for any k, s, n ∈ ℕ⁺, ϵ_d ∈ (0, 1], and δ_d ∈ (0, 0.2907], where

ϵ_d = √( 14k·ln(2/δ_d) / (|𝒮| − 1) ),  |𝒮| = ns
Proof.

(Sketch) For any two neighboring datasets 𝒟 and 𝒟′, w.l.o.g. assume they differ in the n-th value, with x_n = 1 in 𝒟 and x′_n = 2 in 𝒟′. Let 𝒮 and 𝒮′ be two dummy datasets with |𝒮| = |𝒮′| = ns dummy points sent by users. What the analyst receives is a mixed dataset 𝒱, a mixture of the user dataset 𝒟 (or 𝒟′) and the dummy dataset 𝒮 (or 𝒮′); we write 𝒱 = 𝒟 ∪ 𝒮 (or 𝒱 = 𝒟′ ∪ 𝒮′). Denote by s_j the number of occurrences of value j in the dummy dataset. To produce the same 𝒱, a dummy point must equal 2 in 𝒮 but 1 in 𝒮′. Thus the counts of the k values among the other |𝒮| − 1 dummy points are [s_1 − 1, s_2, …, s_k] in 𝒮 and [s_1, s_2 − 1, …, s_k] in 𝒮′. Writing C(N; a_1, …, a_k) for the multinomial coefficient, we have

Pr[𝒟 ∪ 𝒮 = 𝒱] / Pr[𝒟′ ∪ 𝒮′ = 𝒱] = C(|𝒮| − 1; s_1 − 1, s_2, …, s_k) / C(|𝒮| − 1; s_1, s_2 − 1, …, s_k) = s_1 / s_2

As all dummy points are sampled uniformly at random from 𝔻, both s_1 and s_2 follow binomial distributions: s_1 follows Bin(|𝒮| − 1, 1/k) + 1 and s_2 follows Bin(|𝒮| − 1, 1/k). Denote by S_1 the distribution Bin(|𝒮| − 1, 1/k) + 1 and by S_2 the distribution Bin(|𝒮| − 1, 1/k). By Lemma 5.1, s_1 and s_2 can be treated as mutually independent. The probability that pureDUMP does not satisfy ϵ_d-DP is then

Pr[ Pr[𝒟 ∪ 𝒮 = 𝒱] / Pr[𝒟′ ∪ 𝒮′ = 𝒱] ≥ e^{ϵ_d} ] = Pr[ S_1/S_2 ≥ e^{ϵ_d} ] ≤ δ_d

Let c = E[S_2] = (|𝒮| − 1)/k. The event S_1/S_2 ≥ e^{ϵ_d} can be covered by two (overlapping) events, S_1 ≥ c·e^{ϵ_d/2} and S_2 ≤ c·e^{−ϵ_d/2}. By the multiplicative Chernoff bound, we have

Pr[S_1/S_2 ≥ e^{ϵ_d}] ≤ Pr[S_1 ≥ c·e^{ϵ_d/2}] + Pr[S_2 ≤ c·e^{−ϵ_d/2}]
  ≤ e^{−(c/3)(e^{ϵ_d} − 1 − 1/c)²} + e^{−(c/2)(1 − e^{−ϵ_d/2})²} ≤ δ_d

Bounding each of e^{−(c/3)(e^{ϵ_d} − 1 − 1/c)²} and e^{−(c/2)(1 − e^{−ϵ_d/2})²} by δ_d/2, we can deduce that c ≥ 14·ln(2/δ_d)/ϵ_d² always suffices when ϵ_d ≤ 1 and δ_d ≤ 0.2907. Substituting c = (|𝒮| − 1)/k into c ≥ 14·ln(2/δ_d)/ϵ_d² yields the upper bound on ϵ_d.

The detailed proof is deferred to Appendix B.2. ∎
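Inverting the bound in Theorem 5.2 gives the number of dummy points each user needs for a target (ϵ_d, δ_d); a small sketch (the helper name is ours, and the ceiling makes it slightly conservative):

import math

def dummy_points_per_user(n, k, eps, delta):
    # From eps = sqrt(14k * ln(2/delta) / (ns - 1)):
    # require ns >= 14k * ln(2/delta) / eps^2 + 1, shared by n users.
    total = 14 * k * math.log(2 / delta) / eps**2 + 1
    return math.ceil(total / n)

print(dummy_points_per_user(n=10**6, k=100, eps=1.0, delta=1e-6))  # -> 1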

Then, we show that when the shuffler colludes with the analyst, each individual user’s data can also be protected in pureDUMP.

Corollary 5.3.

PureDUMP satisfies (ϵ_l, δ_l)-LDP against the shuffler for any k, s ∈ ℕ⁺, ϵ_l ∈ (0, 1], and δ_l ∈ (0, 0.2907], where

ϵ_l = √( 14k·ln(2/δ_l) / (s − 1) )
Proof.

The proof is similar to that of Theorem 5.2, except that each user's sensitive data is protected only by the s dummy points generated by its owner. More specifically, the LDP protection is provided by a dummy blanket composed of s dummy points, even before the shuffling operation. ∎

Accuracy Analysis. We analyze the accuracy of pureDUMP by measuring the Mean Squared Error (MSE) of the estimation results:

(2)  MSE = (1/k)·Σ_{v∈𝔻} E[(f̃_v − f_v)²]

where f_v is the true frequency of element v ∈ 𝔻 in the users' dataset and f̃_v is its estimate. We first show that f̃_v is unbiased in pureDUMP (Lemma 5.4).

Lemma 5.4.

For any v ∈ 𝔻, the estimate f̃_v in pureDUMP is an unbiased estimate of f_v.

Proof.

E[f̃_v] = E[ (1/n)·( Σ_{i∈[n], j∈[s+1]} 𝟙_{v = y_{i,j}} − ns/k ) ]
  = (1/n)·E[ Σ_{i∈[n], j∈[s+1]} 𝟙_{v = y_{i,j}} ] − s/k
  = (1/n)·(n·f_v + ns/k) − s/k
  = f_v  ∎

Since f̃_v is an unbiased estimate of f_v, the MSE of f̃_v in pureDUMP equals the variance of f̃_v by Equation (2):

(3)  MSE = (1/k)·Σ_{v∈𝔻} E[(f̃_v − f_v)²] = (1/k)·Σ_{v∈𝔻} ( Var[f̃_v] + (E[f̃_v] − f_v)² ) = (1/k)·Σ_{v∈𝔻} Var[f̃_v]

Now we calculate the MSE of pureDUMP in Theorem 5.5.

Theorem 5.5.

The MSE of frequency estimation in pureDUMP is

s(k − 1) / (nk²).

Proof.

Var[f̃_v] = Var[ (1/n)·( Σ_{i∈[n], j∈[s+1]} 𝟙_{v = y_{i,j}} − ns/k ) ]
  = (1/n²)·Var[ Σ_{i∈[n], j∈[s+1]} 𝟙_{v = y_{i,j}} ]
  = (1/n²)·( (1/k)·(1 − 1/k)·ns )
  = s(k − 1) / (nk²)

According to Equation (3), we have

MSE = (1/k)·Σ_{v∈𝔻} E[(f̃_v − f_v)²] = (1/k)·Σ_{v∈𝔻} Var[f̃_v] = s(k − 1) / (nk²)  ∎
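Theorem 5.5 can be checked empirically; below is a short simulation comparing the measured MSE with s(k − 1)/(nk²) (all parameter choices here are illustrative):

import numpy as np

rng = np.random.default_rng(1)
n, k, s, trials = 10_000, 50, 5, 200
data = rng.integers(0, k, size=n)               # domain indexed 0..k-1 here
f = np.bincount(data, minlength=k) / n          # true frequencies

errs = []
for _ in range(trials):
    dummies = rng.integers(0, k, size=n * s)
    counts = np.bincount(np.concatenate([data, dummies]), minlength=k)
    est = (counts - n * s / k) / n              # Algorithm 2's debiasing
    errs.append(np.mean((est - f) ** 2))

print(np.mean(errs), s * (k - 1) / (n * k**2))  # empirical vs. theoretical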

Efficiency Analysis. We now analyze the efficiency of pureDUMP. For an input x_i ∈ 𝔻, the user-side output P(x_i) consists of s + 1 messages of length O(log k) bits, and the user-side runtime is at most O(s). Moreover, s is at most O(k·ln(1/δ)/(nϵ²)) when pureDUMP satisfies (ϵ, δ)-DP on the analyst side. The shuffler transmits n(s+1) messages to the analyst using O(ns·log k) space. The analyst's runtime on the n(s+1) input messages is at most O(kns), and its output takes O(k·log(n(s+1))) bits.

According to Theorem 5.5, the expected MSE of pureDUMP is O(log(1/δ)/(ϵ²n²)). Note that the domain size has almost no effect on the estimation error, because more dummy points are generated to offset the error caused by a larger domain. When the user side relies only on a data randomizer satisfying ϵ_l-LDP for protection, the results in (Balle et al., 2019b) imply an expected MSE of O(ϵ²·log(1/δ)/(ϵ²(n−1) − k·log(1/δ))²). By comparison, the estimation error of pureDUMP is always smaller than that of a method using only data randomization on the user side, which shows that the dummy blanket dominates randomization in the shuffle model.

5.2. mixDUMP: A DUMP Protocol with GRR

Design of Algorithms. mixDUMP utilizes GRR as the data randomizer to provide a privacy guarantee along with dummy points. The details are given in Algorithm 3. Note that the GRR mechanism used in mixDUMP differs slightly in form from that in (Wang et al., 2017). To analyze the privacy benefit with the blanket idea, we use the technique proposed in (Balle et al., 2019b) to decompose GRR as

∀ x_r ∈ 𝔻:  Pr[GRR(x) = x_r] = (1 − λ)·Pr[x = x_r] + λ·Pr[Unif([k]) = x_r]

where Pr[x = x_r] is the real data distribution, which depends on x, and Pr[Unif([k]) = x_r] is the uniform distribution. With probability 1 − λ the output is the true input, and with probability λ the output is sampled uniformly at random from the data domain. Given the probability q = 1/(e^{ϵ_l} + k − 1) in GRR, we have

λ = kq = k / (e^{ϵ_l} + k − 1)

where ϵ_l is the privacy budget of LDP.
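The decomposition can be verified numerically: with λ = kq, reporting the true value happens with probability (1 − λ) + λ/k = p, and any other fixed value with probability λ/k = q. A tiny sketch:

import math

k, eps_l = 10, 1.0
p = math.exp(eps_l) / (math.exp(eps_l) + k - 1)
q = 1 / (math.exp(eps_l) + k - 1)
lam = k * q                                    # lambda = kq

assert math.isclose((1 - lam) + lam / k, p)    # output equals the input
assert math.isclose(lam / k, q)                # any other fixed output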

The procedure for generating dummy points is the same as in Algorithm 1, and as before we omit the description of the shuffling operation. The user-side and analyst-side algorithms are shown in Algorithm 4 and Algorithm 5, respectively.

Algorithm 3 Data Randomizer of mixDUMP
1: procedure R(x_i):
2: Input: x_i ∈ 𝔻, parameters k ∈ ℕ⁺ and λ ∈ (0, 1]
3: Output: x_r ∈ 𝔻
4:   b ← Ber(λ)  ▷ Randomize the data with probability λ
5:   if b = 0 then
6:     x_r ← x_i
7:   else
8:     x_r ← Unif([k])
9:   end if
10:   return x_r
11: end procedure

Privacy Analysis. We show how much of the privacy guarantee is provided by randomization and by the dummy blanket, for comparison with pureDUMP. In particular, we analyze the privacy amplification provided by the dummy blanket locally, to distinguish our design from work that adds dummy points on the shuffler side (Wang et al., 2020). We first show the privacy guarantee against the analyst in Theorem 5.6.

Theorem 5.6.

MixDUMP satisfies (ϵ_s, δ_s)-DP for any k, s, n ∈ ℕ⁺, ϵ_s ∈ (0, 1], λ ∈ (0, 1], and δ_s ∈ (0, 0.5814], where

ϵ_s = √( 14k·ln(4/δ_s) / (|𝒮| + (n−1)λ − √(2(n−1)λ·ln(2/δ_s)) − 1) ),  |𝒮| = ns
Proof.

(Sketch) The setup is the same as in Theorem 5.2: the n-th value is x_n = 1 in 𝒟 and x′_n = 2 in its neighboring dataset 𝒟′. Denote by R(𝒟) (or R(𝒟′)) the randomized user dataset {R(x_1), …, R(x_n)} (or {R(x_1), …, R(x′_n)}). If the n-th user randomizes his data, we immediately get

Pr[R(𝒟) ∪ 𝒮 = 𝒱] / Pr[R(𝒟′) ∪ 𝒮′ = 𝒱] = 1

So we focus on the case where the n-th user does not randomize his data. Denote by ℋ the set of users who randomize in Algorithm 3, with size |ℋ|. Since both the randomized users' data and the dummy points follow the uniform distribution, mixDUMP can be regarded as pureDUMP with the dummy dataset ℋ ∪ 𝒮. By Theorem 5.2, mixDUMP satisfies (ϵ_H, δ_s/2)-DP, i.e.,

Pr[R(𝒟) ∪ 𝒮 = 𝒱] ≤ e^{ϵ_H}·Pr[R(𝒟′) ∪ 𝒮′ = 𝒱] + δ_s/2

where ϵ_H = √( 14k·ln(4/δ_s) / (|ℋ| + |𝒮| − 1) ) and δ_s ≤ 0.5814. Since |ℋ| follows Bin(n−1, λ), the Chernoff bound gives

Pr[ |ℋ| < (1 − γ)·(n−1)λ ] < δ_s/2

where γ = √( 2·ln(2/δ_s) / ((n−1)λ) ). Let t = (n−1)λ − √( 2(n−1)λ·ln(2/δ_s) ); then

Pr[R(𝒟) ∪ 𝒮 = 𝒱]
  ≤ Pr[R(𝒟) ∪ 𝒮 = 𝒱 ∧ |ℋ| ≥ t] + δ_s/2
  ≤ ( Σ_{h≥t} Pr[R(𝒟) ∪ 𝒮 = 𝒱 | |ℋ| = h]·Pr[|ℋ| = h] ) + δ_s/2
  ≤ e^{√( 14k·ln(4/δ_s) / (t + |𝒮| − 1) )}·Pr[R(𝒟′) ∪ 𝒮′ = 𝒱] + δ_s

The detailed proof is in Appendix B.3. ∎
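For intuition about the bound, here is a small sketch that evaluates Theorem 5.6 directly (the helper name and parameter values are ours and purely illustrative):

import math

def eps_mixdump(n, k, s, eps_l, delta_s):
    # Direct transcription of Theorem 5.6, with lambda = k/(e^eps_l + k - 1).
    lam = k / (math.exp(eps_l) + k - 1)
    S = n * s
    denom = S + (n - 1) * lam \
        - math.sqrt(2 * (n - 1) * lam * math.log(2 / delta_s)) - 1
    return math.sqrt(14 * k * math.log(4 / delta_s) / denom)

print(eps_mixdump(n=10**5, k=100, s=1, eps_l=4.0, delta_s=1e-6))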

Algorithm 4 User-side Algorithm of mixDUMP
1: procedure M(x_i):
2: Input: x_i ∈ 𝔻, parameters n, k, s ∈ ℕ⁺ and λ ∈ (0, 1]
3: Output: multiset Y_i ⊆ 𝔻
4:   x_r ← R(x_i)  ▷ Randomize the real data
5:   Y_i ← P(x_r)  ▷ Mix x_r with s dummy points
6:   return Y_i
7: end procedure

Algorithm 5 Analyst-side Algorithm of mixDUMP
1: procedure A(Y_1, …, Y_n):
2: Input: multisets {y_{1,1}, …, y_{1,s+1}}, …, {y_{n,1}, …, y_{n,s+1}} ⊆ 𝔻, parameters n, k, s ∈ ℕ, λ ∈ (0, 1]
3: Output: vector z⃗ ∈ [0, 1]^k  ▷ Frequencies of the k elements
4:   for m ← 1 to k do
5:     z⃗[m] = (1/n)·( Σ_{i∈[n], j∈[s+1]} 𝟙_{m = y_{i,j}} − n(λ + s)/k ) / (1 − λ)
6:   end for
7:   return z⃗
8: end procedure

Next, we analyze the LDP guarantee mixDUMP provides when the shuffler colludes with the analyst. Unlike (Wang et al., 2020), our design obtains privacy benefits from dummy points before the shuffling operation, which makes users less dependent on trusting the shuffler. When the data randomizer satisfies ϵ_l-LDP, the introduced dummy points achieve an enhanced LDP guarantee with a privacy budget smaller than ϵ_l. The theoretical analysis is given in Theorem 5.7.

Theorem 5.7.

If the data randomizer satisfies ϵ_l-LDP, then after introducing s dummy points, (ϵ_r, δ_r)-LDP is achieved for any k/s ≤ min( 1/(2·ln(1/δ_r)), (√(c² + 4c/√(2·ln(1/δ_r))) − c)² ), where c = √(2·ln(1/δ_r)) · e^{ϵ_l}(e^{ϵ_l} + 1) / (s(e^{ϵ_l} + 1) + e^{ϵ_l}) and

ϵ_r = ln( k / (1 − √(2k·ln(1/δ_r)/s)) · (1 + e^{ϵ_l} / (s(e^{ϵ_l} + 1))) )
Proof.

(Sketch) Denote by [n_1, …, n_k] the counts of the k values in the i-th user's report set 𝒱_i, where Σ_{i=1}^{k} n_i = s + 1. Denote by 𝒮_i (or 𝒮′_i) the dummy dataset generated by the i-th user, and let v be the true value held by the i-th user. Following the definition of LDP, we compare the probabilities that the shuffler receives the same set 𝒱_i with value counts [n_1, …, n_k] when v = 1 and when v = 2, and prove

Pr[R(v=1) ∪ 𝒮_i = 𝒱_i] ≤ e^{ϵ_r}·Pr[R(v=2) ∪ 𝒮′_i = 𝒱_i] + δ_r

According to Equation (1) of the GRR mechanism, we have

Pr[R(v=1) ∪ 𝒮_i = 𝒱_i] / Pr[R(v=2) ∪ 𝒮′_i = 𝒱_i] ≤ (e^{ϵ_l}·n_1 + n_2) / (e^{ϵ_l}·n_2 + n_1)

Both n_1 and n_2 follow binomial distributions: n_1 follows Bin(s, 1/k) + 1 and n_2 follows Bin(s, 1/k). By Lemma 5.1, n_1 and n_2 can be treated as mutually independent. Then we have

Pr[R(v=1) ∪ 𝒮_i = 𝒱_i] / Pr[R(v=2) ∪ 𝒮′_i = 𝒱_i] ≤ Bin(s, 1/k)/Bin(s, 1/k) + e^{ϵ_l} / ( Bin(s, 1/k)·(e^{ϵ_l} + 1) )

According to the Chernoff bound, we have

Pr[ Bin(s, 1/k) ≥ (s/k)·(1 − √(2k·ln(1/δ_r)/s)) ] ≥ 1 − δ_r

Note that 1 − √(2k·ln(1/δ_r)/s) ≥ 0 requires k/s ≤ 1/(2·ln(1/δ_r)). As Bin(s, 1/k) ∈ [ (s/k)·(1 − √(2k·ln(1/δ_r)/s)), s ] with probability larger than 1 − δ_r, we have

Pr[R(v=1) ∪ 𝒮_i = 𝒱_i] / Pr[R(v=2) ∪ 𝒮′_i = 𝒱_i] ≤ k / (1 − √(2k·ln(1/δ_r)/s)) · (1 + e^{ϵ_l} / (s(e^{ϵ_l} + 1)))

When k/s ≤ (√(c² + 4c/√(2·ln(1/δ_r))) − c)², this bound is at most e^{ϵ_l}, where c = √(2·ln(1/δ_r)) · e^{ϵ_l}(e^{ϵ_l} + 1) / (s(e^{ϵ_l} + 1) + e^{ϵ_l}).

The detailed proof is in Appendix B.5. ∎

Figure 2 shows the theoretical result of Theorem 5.7. We set ϵ_l = 5 and δ_r = 10⁻⁴ and plot the privacy budget ϵ_r against the number of dummy points for three sizes of the data domain. We can observe that dummy points significantly reduce the upper bound of the LDP privacy budget, so enhanced privacy protection is provided even when the shuffler colludes with the analyst.

Figure 2. Enhancement of LDP privacy protection by dummy points, where ϵ_l = 5 and δ_r = 10⁻⁴.
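A short sketch evaluating ϵ_r from Theorem 5.7 under the settings of Figure 2 (ϵ_l = 5, δ_r = 10⁻⁴; the domain size k = 20 and the values of s are our illustrative choices, and the formula applies only where the theorem's conditions on k/s hold):

import math

def eps_r(k, s, eps_l, delta_r):
    # ln( k / (1 - sqrt(2k ln(1/delta_r)/s)) * (1 + e^eps_l/(s(e^eps_l+1))) )
    inner = 1 + math.exp(eps_l) / (s * (math.exp(eps_l) + 1))
    return math.log(k / (1 - math.sqrt(2 * k * math.log(1 / delta_r) / s)) * inner)

for s in (5_000, 20_000, 100_000):
    print(s, round(eps_r(k=20, s=s, eps_l=5.0, delta_r=1e-4), 3))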

Accuracy Analysis. We again use the MSE as the accuracy metric for mixDUMP. As before, we first show that f̃_v is unbiased in mixDUMP (Lemma 5.8).

Lemma 5.8.

For any v ∈ 𝔻, the estimate f̃_v in mixDUMP is an unbiased estimate of f_v.

The proof of Lemma 5.8 is deferred to Appendix B.4. Consequently, the MSE of frequency estimation equals the variance of f̃_v in mixDUMP. We give the MSE of mixDUMP in Theorem 5.9.

Theorem 5.9.

The MSE of frequency estimation in mixDUMP is

(1/n)·(e^{ϵ_l} + k − 2)/(e^{ϵ_l} − 1)² + (s(k − 1)/(nk²))·((e^{ϵ_l} + k − 1)/(e^{ϵ_l} − 1))²

where ϵ_l is the privacy budget of GRR in mixDUMP.

Proof.

We give the estimation function in Algorithm 5. With p=eϵleϵl+k1,q=1eϵl+k1p=\frac{e^{\epsilon_{l}}}{e^{\epsilon_{l}}+k-1},\ q=\frac{1}{e^{\epsilon_{l}}+k-1} in Equation (1), and λ=kq\lambda=kq, we replace λ\lambda with qq to simplify the derivation. We can get the variance of f~v\tilde{f}_{v} is

Var[f~v]\displaystyle Var[\tilde{f}_{v}] =Var[1ni[n],j[s+1]𝟙{v=yi,j}ns1knqpq]\displaystyle=Var[\frac{1}{n}\cdot\frac{\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-ns\cdot\frac{1}{k}-nq}{p-q}]
=1n21(pq)2Var[i[n],j[s+1]𝟙{v=yi,j}]\displaystyle=\frac{1}{n^{2}}\cdot\frac{1}{(p-q)^{2}}Var[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}]
=nfvp(1p)+n(1fv)q(1q)+ns(k1)/k2n2(pq)2\displaystyle=\frac{nf_{v}p(1-p)+n(1-f_{v})q(1-q)+ns(k-1)/k^{2}}{n^{2}(p-q)^{2}}
=nq(1q)+nfv(p(1p)q(1q))+ns(k1)/k2n2(pq)2\displaystyle=\frac{nq(1-q)+nf_{v}(p(1-p)-q(1-q))+ns(k-1)/k^{2}}{n^{2}(p-q)^{2}}

As in (Wang et al., 2017), fvf_{v} is assumed to be small on average, so we have

Var[f~v]\displaystyle Var[\tilde{f}_{v}] 1n2(pq)2(nq(1q)+ns(k1)k2)\displaystyle\simeq\frac{1}{n^{2}(p-q)^{2}}(nq(1-q)+\frac{ns(k-1)}{k^{2}})
=1neϵl+k2(eϵl1)2+s(k1)nk2(eϵl+k1eϵl1)2\displaystyle=\frac{1}{n}\cdot\frac{e^{\epsilon_{l}}+k-2}{(e^{\epsilon_{l}}-1)^{2}}+\frac{s(k-1)}{nk^{2}}\cdot(\frac{e^{\epsilon_{l}}+k-1}{e^{\epsilon_{l}}-1})^{2}

According to Equation (3) we have

MSE=\displaystyle MSE= 1kv𝔻E[(f~vfv)2]=1kv𝔻Var(f~v)=\displaystyle\frac{1}{k}\sum_{v\in\mathbb{D}}E[(\tilde{f}_{v}-f_{v})^{2}]=\frac{1}{k}\sum_{v\in\mathbb{D}}Var(\tilde{f}_{v})=
1neϵl+k2(eϵl1)2+s(k1)nk2(eϵl+k1eϵl1)2\displaystyle\frac{1}{n}\cdot\frac{e^{\epsilon_{l}}+k-2}{(e^{\epsilon_{l}}-1)^{2}}+\frac{s(k-1)}{nk^{2}}\cdot(\frac{e^{\epsilon_{l}}+k-1}{e^{\epsilon_{l}}-1})^{2}
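
As a quick sanity check, the closed-form MSE above is easy to evaluate numerically. The sketch below is our own illustrative code with hypothetical parameter values; it separates the GRR term and the dummy-point term of Theorem 5.9.

import math

def mixdump_mse(n, k, s, eps_l):
    # Closed-form MSE of mixDUMP from Theorem 5.9 (illustrative sketch).
    e = math.exp(eps_l)
    grr_term = (e + k - 2) / (n * (e - 1) ** 2)  # noise from GRR
    dummy_term = (s * (k - 1) / (n * k ** 2)) * ((e + k - 1) / (e - 1)) ** 2
    return grr_term + dummy_term

# Hypothetical setting: n = 500,000 users, k = 50, eps_l = 8, s = 1 dummy each.
print(mixdump_mse(n=500_000, k=50, s=1, eps_l=8.0))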

Efficiency Analysis. Next, we analyze the efficiency of mixDUMP. For an input xi𝔻x_{i}\in\mathbb{D}, the output of the user side M(xi)M(x_{i}) is s+1s+1 messages of length O(logk)O(\log k) bits each. Since the amplification is applied twice in the process of proving the upper bound of the privacy budget in mixDUMP, the advantage of mixDUMP over pureDUMP in the number of messages is weakened. When mixDUMP satisfies (ϵ,δ)(\epsilon,\delta)-DP, ss is at most O(14kln(4/δ)/(nϵ2)λ(12ln(2/δ)/(λn)))O(14k\ln(4/\delta)/(n\epsilon^{2})-\lambda(1-\sqrt{2\ln(2/\delta)/(\lambda n)})). The upper bound of ss can be approximated as O(14kln(4/δ)/(nϵ2))O(14k\ln(4/\delta)/(n\epsilon^{2})) when ϵ\epsilon is close to 0.

The shuffler transmits n(s+1)n(s+1) messages to the analyst, occupying O(nslogk)O(ns\log k) bits. The analyst outputs O(klog(n(s+1)))O(k\log(n(s+1))) bits estimated from the n(s+1)n(s+1) messages, with expected MSE O(ln(1/δ)/(nϵ(1λ))2)O(\ln(1/\delta)/(n\epsilon(1-\lambda))^{2}). Note that λ=k/(eϵl+k1)\lambda=k/(e^{\epsilon_{l}}+k-1) is the probability that each user randomizes its data. As ϵlϵ\epsilon_{l}\geq\epsilon, we can bound λk/(eϵ+k1)\lambda\leq k/(e^{\epsilon}+k-1). Then, the estimation error can be approximated as O(1ϵ2n2(1+keϵ1)2log1δ)O(\frac{1}{\epsilon^{2}n^{2}}\cdot(1+\frac{k}{e^{\epsilon}-1})^{2}\cdot\log\frac{1}{\delta}).

5.3. Analysis for Practical Implementations

We propose two protocols for privacy-preserving histogram estimation in Subsection 5.1 and Subsection 5.2. Under the assumption that each user sends ss dummy points, we analyze the performance of the proposed protocols theoretically. In practical implementations, some users might be limited by network bandwidth or computation capability (e.g., running on mobile devices or sensors), and some users may even refuse to send additional dummy points. Taking these conditions into account, we also analyze the privacy and accuracy of the proposed protocols under a more realistic assumption.

The weak assumption of the flexible DUMP protocols is that each user sends dummy points only with probability γ\gamma. More specifically, each user uniformly samples ss random dummy points from 𝔻\mathbb{D} with probability γ\gamma, and only sends one message with probability 1γ1-\gamma; a minimal sketch of this user-side behavior is given below. In practical implementations, γ\gamma can be obtained by surveying users' willingness or referring to the collectors' past experience. In the following part, we first answer how much the weak assumption affects the privacy of protocols in the DUMP framework, by proving the following lemma.
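
The sketch below illustrates the flexible user side under the weak assumption, in the pureDUMP flavor where the true value is sent alongside the dummy points; the function and parameter names are our own.

import numpy as np

rng = np.random.default_rng()

def flexible_user_report(x, k, s, gamma):
    # User-side report under the weak assumption (pureDUMP flavor):
    # with probability gamma, send the true value x plus s dummy points
    # sampled uniformly from the domain {0, ..., k-1}; otherwise send x alone.
    if rng.random() < gamma:
        dummies = rng.integers(0, k, size=s)
        return [x] + dummies.tolist()
    return [x]

# Example: domain size k = 50, s = 13 dummy points, gamma = 0.01.
print(flexible_user_report(x=7, k=50, s=13, gamma=0.01))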

Lemma 5.10.

For any DUMP protocols that satisfy (ϵ,δ)(\epsilon,\ \delta)-DP for any n+n\in\mathbb{N}^{+} and γ(0,1]\gamma\in(0,1], the flexible DUMP protocols satisfy (ϵln(1eγn),δ+eγn)(\epsilon-\ln(1-e^{-\gamma n}),\ \delta+e^{-\gamma n})-DP.

Proof.

The probability that no user generates dummy points is Pr[A]=(1γ)neγn\Pr[A]=(1-\gamma)^{n}\leq e^{-\gamma n}, so Pr[A¯]1eγn\Pr[\overline{A}]\geq 1-e^{-\gamma n}. According to the results of Theorem 5.2 and Theorem 5.6, the DUMP protocols satisfy

Pr[R(𝒟)𝒮=𝒱|A¯]eϵPr[R(𝒟)𝒮=𝒱|A¯]+δ\displaystyle\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}|\overline{A}]\leq e^{\epsilon}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}|\overline{A}]+\delta

where R(𝒟)R(\mathcal{D}) and R(𝒟)R(\mathcal{D}^{\prime}) are outputs of data randomizers, 𝒮\mathcal{S} and 𝒮\mathcal{S}^{\prime} are sets of dummy points. Then we have

Pr[R(𝒟)𝒮=𝒱]\displaystyle\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}]
=Pr[R(𝒟)𝒮=𝒱|A¯]Pr[A¯]+Pr[R(𝒟)𝒮=𝒱|A]Pr[A]\displaystyle=\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}|\overline{A}]\Pr[\overline{A}]+\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}|A]\Pr[A]
Pr[R(𝒟)𝒮=𝒱|A¯]+eγn\displaystyle\leq\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}|\overline{A}]+e^{-\gamma n}
eϵPr[R(𝒟)𝒮=𝒱|A¯]+δ+eγn\displaystyle\leq e^{\epsilon}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}|\overline{A}]+\delta+e^{-\gamma n}
1Pr[A¯]eϵPr[R(𝒟)𝒮=𝒱]+δ+eγn\displaystyle\leq\frac{1}{\Pr[\overline{A}]}\cdot e^{\epsilon}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]+\delta+e^{-\gamma n}
11eγneϵPr[R(𝒟)𝒮=𝒱]+δ+eγn\displaystyle\leq\frac{1}{1-e^{-\gamma n}}\cdot e^{\epsilon}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]+\delta+e^{-\gamma n}

According to Lemma 5.10, we can obtain Theorem 5.11 and Theorem 5.12. They show that if each user sends dummy points only with probability γ\gamma, the privacy budget and the failure probability of the DUMP protocols increase. The reason is that, under this weak assumption, there is a chance that no user sends dummy points, in which case the protocols may fail to satisfy DP. In practice this penalty is usually negligible: for instance, with n=500,000 and γ=0.001, the additive term e^{-γn}=e^{-500} is vanishingly small.

Theorem 5.11.

The flexible pureDUMP satisfies (ϵdln(1eγn),δd+eγn)(\epsilon_{d}-\ln(1-e^{-\gamma n}),\ \delta_{d}+e^{-\gamma n})-DP for any k,S,n+,γ(0,1]k,S,n\in\mathbb{N}^{+},\ \gamma\in(0,1], where δd(0,0.2907]\delta_{d}\in(0,0.2907], and ϵd=14kln(2/δd)/(S1)\epsilon_{d}=\sqrt{14k\ln(2/\delta_{d})/(S-1)}.

Theorem 5.12.

The flexible mixDUMP satisfies (ϵsln(1eγn),δs+eγn)(\epsilon_{s}-\ln(1-e^{-\gamma n}),\ \delta_{s}+e^{-\gamma n})-DP for any k,S,n+,λ(0,1],γ(0,1]k,S,n\in\mathbb{N}^{+},\ \lambda\in(0,1],\ \gamma\in(0,1], where δs(0,0.5814]\delta_{s}\in(0,0.5814], and ϵs=14kln(4/δs)(S+(n1)λ2(n1)λln(2/δs)1)\epsilon_{s}=\sqrt{\frac{14k\ln(4/\delta_{s})}{(S+(n-1)\lambda-\sqrt{2(n-1)\lambda\ln(2/\delta_{s})}-1)}}.

Next, we analyze the impact of the weak assumption on the accuracy of the DUMP protocols. As before, we first show in Lemma 5.13 and Lemma 5.14 that the estimation results of flexible pureDUMP and flexible mixDUMP are unbiased. Both the estimation functions and the proofs are deferred to Appendix B.6 and Appendix B.7.

Lemma 5.13.

The estimation result f~v\tilde{f}_{v} for any v𝔻v\in\mathbb{D} in flexible pureDUMP is an unbiased estimation of fvf_{v}.

Lemma 5.14.

The estimation result f~v\tilde{f}_{v} for any v𝔻v\in\mathbb{D} in flexible mixDUMP is an unbiased estimation of fvf_{v}.

As the weak assumption reduces the strength of privacy protection, the smaller number of dummy points also reduces the MSE of the estimation results. We show the accuracy of the flexible DUMP protocols in Theorem 5.15 and Theorem 5.16; the proofs are deferred to Appendix B.8 and Appendix B.9.

Theorem 5.15.

The MSE of frequency estimation in flexible pureDUMP is

sγ(kγ)nk2\frac{s\gamma(k-\gamma)}{nk^{2}}
Theorem 5.16.

The MSE of frequency estimation in flexible mixDUMP is

1neϵl+k2(eϵl1)2+sγ(kγ)nk2(eϵl+k1eϵl1)2\frac{1}{n}\cdot\frac{e^{\epsilon_{l}}+k-2}{(e^{\epsilon_{l}}-1)^{2}}+\frac{s\gamma(k-\gamma)}{nk^{2}}\cdot(\frac{e^{\epsilon_{l}}+k-1}{e^{\epsilon_{l}}-1})^{2}

where ϵl\epsilon_{l} is the privacy budget of GRR in mixDUMP.

6. Evaluation

6.1. Setup

We design experiments to evaluate the performance of the proposed protocols (pureDUMP and mixDUMP) and compare them with the existing protocols (truncation-based (Balcer and Cheu, 2019), private-coin (Ghazi et al., 2021), public-coin (Ghazi et al., 2021), SOLH (Wang et al., 2020), correlated-noise (Ghazi et al., 2020b)). Note that although the source code of these existing protocols, except SOLH, is not formally released, we implemented them to the best of our ability and include them for comparison.

In the experiments, we measure (1) the accuracy of the proposed protocols, i.e., how much accuracy they gain over existing shuffle protocols; (2) the communication overhead of the proposed protocols, i.e., how many messages each user needs to send, and how this compares with existing shuffle protocols; and (3) how key parameters, namely the number of users nn and the size of the domain kk, affect the accuracy of the proposed protocols. Towards these goals, we conduct experiments over both synthetic and real-world datasets. The synthetic datasets allow us to adjust the key parameters to observe their impact on the utility of the protocols, while the real-world datasets show the utility of the proposed protocols in a more practical setting.

Figure 3. Experimental results of impact of size of data domain on pureDUMP and mixDUMP, where nn is fixed to 500,000, δ=106\delta=10^{-6}. Dashed lines are theoretical results.
Figure 4. Experimental results of impact of number of users on pureDUMP and mixDUMP, where kk is fixed to 50, δ=106\delta=10^{-6}. Dashed lines are theoretical results.

Environment. All protocols are implemented in Python 3.7 with the python-xxhash 1.4.4 and numpy 1.18.1 libraries. All experiments are conducted on servers running Ubuntu 18.04.4 LTS, with 2.50GHz CPUs and 4×32GB memory.

Synthetic Datasets. We generate six synthetic datasets with different sizes of the data domain kk and numbers of users nn. In the first three datasets, we fix n=500,000n=500,000 and generate datasets with k=50k=50, k=500k=500, and k=5,000k=5,000, respectively. In the remaining three datasets, we fix k=50k=50 and generate datasets with n=5,000n=5,000, n=50,000n=50,000, and n=500,000n=500,000, respectively. All datasets follow a uniform distribution. We observe that the distribution of the datasets has little effect on the performance of the protocols, so we only show the experimental results on the uniform distribution.
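
For illustration, such a dataset can be generated in a few lines with numpy; this is our own sketch, not the released implementation.

import numpy as np

# One synthetic dataset: n = 500,000 users, each holding a value drawn
# uniformly at random from a domain of size k = 50.
rng = np.random.default_rng(seed=0)
dataset = rng.integers(0, 50, size=500_000)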

Real-World Datasets. We conduct experiments on the following two real-world datasets.

  • IPUMS (ipu, 2017): We sample 1%1\% of the US Census data from the Los Angeles and Long Beach area for the year 1970. The dataset contains 70,187 users and 61 attributes. We select the birthplace attribute for evaluation, which includes 900 different values.

  • Ratings (rat, 2017): This dataset contains 20 million ratings of movies given by 138,000 users. We sample 494,352 records covering 2,000 different movies.

Figure 5. Results of MSE varying ϵ\epsilon on the synthetic dataset, where the number of users n=500,000n=500,000, the domain size k=50k=50.
Figure 6. Number of messages sent by each user varying ϵ\epsilon on the synthetic dataset, where the number of users n=500,000n=500,000, the domain size k=50k=50.

Protocols for Comparison. We compare pureDUMP and mixDUMP with the following protocols.

  • private-coin (Ghazi et al., 2021): A multi-message protocol for histogram frequency estimation in the shuffle model. It utilizes the Hadamard response as the data randomizer.

  • public-coin (Ghazi et al., 2021): A multi-message protocol for histogram frequency estimation in the shuffle model. It is based on combining the Count Min data structure (Cormode and Muthukrishnan, 2005) with the GRR mechanism. It requires ϵ>τ90ln(2τ/δ)/n\epsilon>\tau\sqrt{90\ln(2\tau/\delta)/n} when the randomization probability γ\gamma is set to 90τ2ln(2τ/δ)/(nϵ2)90\tau^{2}\ln(2\tau/\delta)/(n\epsilon^{2}), where τ=ln(2k)\tau=\ln(2k).

  • SOLH (Wang et al., 2020): A single-message protocol for histogram frequency estimation in the shuffle model. It utilizes the hashing-based mechanism OLH as the data randomizer.

  • truncation-based (Balcer and Cheu, 2019): A multi-message protocol for histogram frequency estimation in the shuffle model. The protocol can provide privacy protection with ϵ>50ln(2/δ)/n\epsilon>\sqrt{50\ln(2/\delta)/n}.

  • correlated-noise (Ghazi et al., 2020b): A multi-message protocol for histogram frequency estimation in the shuffle model. It adds noise drawn from a negative binomial distribution to users' data.

  • privacy-amplification (Balle et al., 2019b): A privacy amplification result in the shuffle model. It has an amplification effect only when ϵ\epsilon is within (14kln(2/δ)/(n1),1](\sqrt{14k\ln{(2/\delta)}/(n-1)},1].

Accuracy Metrics. In our experiments, we use the mean squared error (MSE) as the accuracy metric. For each value vv in 𝔻\mathbb{D}, we compute the real frequency fvf_{v} on the sensitive dataset and the estimated frequency f~v\tilde{f}_{v} on the noisy dataset. Then we average the squared differences over the kk different values.

MSE=1kv𝔻(f~vfv)2MSE=\frac{1}{k}\sum_{v\in\mathbb{D}}(\tilde{f}_{v}-f_{v})^{2}

All MSE results in the experiments are averaged over 50 repetitions.
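
For concreteness, the metric can be computed as follows; this is a minimal sketch of the evaluation step, assuming true and estimated frequency vectors of length k.

import numpy as np

def mse(f_true, f_est):
    # Mean squared error between the true and estimated frequency vectors,
    # averaged over the k values of the data domain.
    f_true, f_est = np.asarray(f_true), np.asarray(f_est)
    return np.mean((f_est - f_true) ** 2)

# Example with hypothetical frequencies over a domain of size k = 4.
print(mse([0.25, 0.25, 0.25, 0.25], [0.24, 0.26, 0.25, 0.25]))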

For private-coin and correlated-noise, we report the expected mean squared error instead. For private-coin, O(nklogn)O(nk\log n) comparisons are required to match messages with Hadamard codewords on the analyst side, which leads to excessive computational overhead, e.g., about 1,450 hours for a dataset with k=512 and n=70,000. For correlated-noise, we did not find a way to sample from the negative binomial distribution NB(r,q)NB(r,q) when rr is a non-integer. We hope to further improve this implementation in the future.

Figure 7. Results of MSE varying ϵ\epsilon on the synthetic dataset, where the number of users n=500,000n=500,000, the domain size k=500k=500.
Figure 8. Number of messages sent by each user varying ϵ\epsilon on the synthetic dataset, where the number of users n=500,000n=500,000, the domain size k=500k=500.

6.2. Experiments with Synthetic Datasets

Overall Results. The influence of the domain size kk on the MSE of pureDUMP and mixDUMP is shown in Figure 3; the domain size has almost no influence on the MSE of pureDUMP. The influence of the number of users nn on the MSE of pureDUMP and mixDUMP is shown in Figure 4; the MSE of both protocols decreases rapidly as the number of users grows. Meanwhile, the small gap between the theoretical and empirical results validates the correctness of our theoretical error analysis.

Figure 5 and Figure 7 show MSE comparisons of all protocols on two different synthetic datasets, and the numbers of messages sent by each user in all protocols except privacy-amplification and SOLH (always one message) are depicted in Figure 6 and Figure 8. The proposed protocols, pureDUMP and mixDUMP, support any ϵ\epsilon in the range of (0,1](0,1] with the best accuracy among the compared protocols; the only exception is correlated-noise, which achieves near-optimal accuracy but has much higher communication overhead (since the coefficient 500eϵ2/log(1/δ)500e^{\epsilon^{2}/\log(1/\delta)} is ignored, its declared asymptotic upper bound is smaller than the empirical MSE). Meanwhile, we observe that the numbers of messages sent by each user in pureDUMP and mixDUMP are always the smallest among all protocols. Although the MSE of truncation-based is similar to that of pureDUMP and mixDUMP, its communication overhead is at least one order of magnitude higher. Since a sufficient privacy guarantee can be provided by privacy amplification when hundreds of thousands of users are involved, we set ϵl\epsilon_{l} of mixDUMP to 8 throughout the experiments to accommodate any ϵ\epsilon in the range of (0,1](0,1].

Figure 9. Results of MSE varying ϵ\epsilon on the IPUMS dataset.
Figure 10. Number of messages sent by each user varying ϵ\epsilon on the IPUMS dataset.

Influence of Domain Size. We compare the MSE of pureDUMP on three synthetic datasets with different sizes of the data domain in Figure 3a. The result shows that the size of the data domain has little influence on the accuracy of pureDUMP. This is because, according to our theoretical result in Subsection 5.1, the MSE of pureDUMP can be approximately simplified to 14ln(2/δ)/(n2ϵ2)14\ln(2/\delta)/(n^{2}\epsilon^{2}) when kk is large. Figure 3b shows that the MSE of mixDUMP is proportional to the domain size kk, because a larger data domain increases the error introduced by the data randomizer in mixDUMP.

Influence of the Number of Users. Figure 4a and Figure 4b show the influence of the number of users on the MSE of pureDUMP and mixDUMP, respectively. We can observe that the MSE is inversely proportional to the square of the number of users nn: the MSE decreases by two orders of magnitude when the number of users increases by one order of magnitude, which is consistent with our efficiency analysis in Subsection 5.1 and Subsection 5.2.

Figure 11. Results of MSE varying ϵ\epsilon on the Ratings dataset.
Figure 12. Number of messages sent by each user varying ϵ\epsilon on the Ratings dataset.

Accuracy Comparisons with Existing Protocols. We choose two synthetic datasets on which as many protocols as possible can provide privacy protection with ϵ\epsilon within the range of (0,1](0,1]; MSE values for ϵ\epsilon outside the valid range of a protocol are omitted. The comparison results are shown in Figure 5. We can observe that pureDUMP and mixDUMP always enjoy privacy amplification and have smaller MSE than all other protocols except correlated-noise. Meanwhile, we can also observe that pureDUMP has higher accuracy than mixDUMP, which means that dummy points introduce less error than the data randomizer when providing the same strength of privacy protection. Besides, privacy-amplification has no amplification effect at some small ϵ\epsilon values, which causes poor MSE results. Moreover, we can also observe that the lines of public-coin always increase slightly at the beginning. This is because the noise follows a binomial distribution Bin(n,γ)Bin(n,\gamma), where γ\gamma decreases as ϵ\epsilon increases, and the variance increases until γ\gamma drops below 0.50.5.

Communication Comparisons with Existing Protocols. The communication overhead is compared by counting the number of messages sent by each user in each protocol. Since the number of messages sent by each user in private-coin and public-coin is always larger than in the other protocols, and only these two protocols have message lengths larger than O(logk)O(\log k) for datasets with n>kn>k (see Table 2), comparing the number of messages does not affect the fairness of the results. Moreover, as the number of messages sent by each user in privacy-amplification and SOLH is always 11, we omit them from our comparisons. We also omit the ϵ\epsilon values that are out of the valid range of a protocol. The comparison results are shown in Figure 6. We can observe that each user always sends fewer messages in pureDUMP and mixDUMP than in the other protocols, and mixDUMP requires the smallest number of messages per user because its data randomizer provides partial privacy protection, so fewer dummy points need to be generated. Note that the number of messages that each user sends in the DUMP protocols is less than 11 when a large ϵ\epsilon is required; this means that fewer than nn dummy points in total are needed to satisfy a large ϵ\epsilon when there is a sufficient number of users. According to the theoretical results for the flexible DUMP protocols in Subsection 5.3, we provide their parameter settings for different ϵ\epsilon on these two synthetic datasets in Appendix A. We can also observe that the number of messages sent by each user increases as ϵ\epsilon increases in the truncation-based protocol. This is because each user sends an additional 11 with probability p=150ln(2/δ)/(ϵ2n)p=1-50\ln(2/\delta)/(\epsilon^{2}n), where p(1/2,1)p\in(1/2,1); pp approaches 1 as ϵ\epsilon increases, so more 1s are sent by each user.

6.3. Experiments with Real-World Datasets

We also conduct experiments on two real-world datasets: IPUMS and Ratings. The MSE comparisons of the protocols are shown in Figure 9 and Figure 11, and the corresponding communication overheads are shown in Figure 10 and Figure 12. privacy-amplification has no amplification effect on either the IPUMS or the Ratings dataset, so it is not included in the comparison. Besides, public-coin is also not included because most values of ϵ(0,1]\epsilon\in(0,1] are outside its valid range on both datasets. We can observe that pureDUMP and mixDUMP still achieve the highest accuracy among all protocols except correlated-noise under the large domain sizes of the real-world datasets. Over the Ratings dataset with ϵ=1\epsilon=1 and δ=106\delta=10^{-6}, pureDUMP (and mixDUMP) only require around 0.80.8 (and 0.50.5) expected extra messages per user, while truncation-based, private-coin, and correlated-noise need around 10210^{2}, 10310^{3}, and 10410^{4} extra messages, respectively. On the other hand, the single-message protocols privacy-amplification and SOLH are less competitive in accuracy and/or subject to upper bounds on the privacy guarantee (namely, lower bounds on ϵ\epsilon).

7. Related Works

Anonymous Data Collection. Many works focus on anonymizing users' messages. Some mechanisms are based on cryptographic techniques, including (Syverson et al., 2004; Lazar et al., 2018; Tyagi et al., 2017; Van Den Hooff et al., 2015). Recently, Chowdhury et al. (Roy Chowdhury et al., 2020) proposed a system, Cryptϵ\epsilon, that achieves the accuracy guarantee of CDP while, like LDP, requiring no trusted data collector. In this approach, users encrypt their messages with homomorphic encryption and send them to an auxiliary server that aggregates them and adds random noise. After that, the encrypted aggregated result is sent to the analyst, who decrypts it to obtain a noisy result that satisfies CDP. Other mechanisms are based on the shuffling idea, originally proposed by Bittau et al. (Bittau et al., 2017), who design the Encode, Shuffle, Analyze (ESA) architecture. The ESA architecture eliminates the link between the message and the user identity by introducing a shuffler between users and the analyst. Many follow-up works analyze the privacy benefits that come from the shuffle model (Erlingsson et al., 2019; Cheu et al., 2019; Balle et al., 2019b; Feldman et al., 2020). Among them, Feldman et al. (Feldman et al., 2020) provide the tightest upper bound, showing that ϵ\epsilon can be reduced to (1eϵ)eϵlog(1/δ)/n(1-e^{-\epsilon})\sqrt{e^{\epsilon}\log(1/\delta)}/\sqrt{n}.

Protocols in the shuffle model. Recent shuffle-related works seek to design more accurate and communication-efficient protocols. Studies focusing on frequency estimation in the shuffle model are covered in Section 3.2; we list some other related works here as a supplement. Some studies focus on binary data summation. Cheu et al. (Cheu et al., 2019) design a single-message protocol for binary summation in which each user sends a one-bit message to the analyst, with expected error at most O(log(1/δ)/ϵ)O(\sqrt{\log(1/\delta)}/\epsilon). Ghazi et al. (Ghazi et al., 2020a) design a multi-message protocol for binary summation where each user sends O(logn/ϵ)O(\log n/\epsilon) one-bit messages to the analyst, with expected error at most O(log(1/ϵ)/ϵ3)O(\sqrt{\log(1/\epsilon)/\epsilon^{3}}). Balcer et al. (Balcer and Cheu, 2019) propose a two-message protocol for binary summation with expected error at most O(log(1/δ)/ϵ2)O(\log(1/\delta)/\epsilon^{2}). There are also many studies on real-number summation, and many excellent protocols have been proposed (Cheu et al., 2019; Balle et al., 2019b; Ghazi et al., 2019b; Balle et al., 2019a; Ghazi et al., 2020c).

8. Conclusions

In this paper, a novel framework called DUMP is proposed for privacy-preserving histogram estimation in the shuffle model. Two types of protection mechanisms, a data randomizer and a dummy-point generator, are employed on the user side of DUMP. The analysis shows that dummy points gain more privacy benefit from the shuffling operation. To further explore the advantages of dummy points, two protocols, pureDUMP and mixDUMP, are proposed. The proposed protocols achieve better trade-offs between privacy guarantee, statistical accuracy, and communication overhead than existing protocols. Besides, the proposed protocols are also analyzed under a more realistic assumption for practical applications, and we prove that their privacy and utility guarantees still hold. Finally, we perform experiments to compare the different protocols and demonstrate the advantages of the proposed protocols on both synthetic and real-world datasets. For future work, we will study different distributions of dummy points and apply DUMP to more statistical tasks such as real-number summation, range queries, and quantile queries.

References

  • ipu (2017) 2017. IPUMS Census Dataset. https://data.world/uci/ipums-census-database/.
  • rat (2017) 2017. Movie rating dataset. https://www.kaggle.com/ashukr/movie-rating-data/.
  • Balcer and Cheu (2019) Victor Balcer and Albert Cheu. 2019. Separating local & shuffled differential privacy via histograms. arXiv preprint arXiv:1911.06879 (2019).
  • Balcer and Vadhan (2018) Victor Balcer and Salil P. Vadhan. 2018. Differential Privacy on Finite Computers. In ITCS (LIPIcs), Vol. 94. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 43:1–43:21.
  • Balle et al. (2019a) Borja Balle, James Bell, Adria Gascon, and Kobbi Nissim. 2019a. Differentially private summation with multi-message shuffling. arXiv preprint arXiv:1906.09116 (2019).
  • Balle et al. (2019b) Borja Balle, James Bell, Adria Gascón, and Kobbi Nissim. 2019b. The privacy blanket of the shuffle model. In Annual International Cryptology Conference. Springer, 638–667.
  • Bittau et al. (2017) Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnes, and Bernhard Seefeld. 2017. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles. 441–459.
  • Cheu et al. (2019) Albert Cheu, Adam Smith, Jonathan Ullman, David Zeber, and Maxim Zhilyaev. 2019. Distributed differential privacy via shuffling. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 375–403.
  • Cormode and Muthukrishnan (2005) Graham Cormode and Shan Muthukrishnan. 2005. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55, 1 (2005), 58–75.
  • Ding et al. (2017) Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. 2017. Collecting telemetry data privately. In Advances in Neural Information Processing Systems. 3571–3580.
  • Duchi et al. (2013) John C Duchi, Michael I Jordan, and Martin J Wainwright. 2013. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE, 429–438.
  • Dwork (2008) Cynthia Dwork. 2008. Differential privacy: A survey of results. In International conference on theory and applications of models of computation. Springer, 1–19.
  • Erlingsson et al. (2019) Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. 2019. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2468–2479.
  • Erlingsson et al. (2014) Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security. 1054–1067.
  • Feldman et al. (2020) Vitaly Feldman, Audra McMillan, and Kunal Talwar. 2020. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. arXiv preprint arXiv:2012.12803 (2020).
  • Gazeau et al. (2016) Ivan Gazeau, Dale Miller, and Catuscia Palamidessi. 2016. Preserving differential privacy under finite-precision semantics. Theor. Comput. Sci. 655 (2016), 92–108.
  • Geng et al. (2015) Quan Geng, Peter Kairouz, Sewoong Oh, and Pramod Viswanath. 2015. The staircase mechanism in differential privacy. IEEE Journal of Selected Topics in Signal Processing 9, 7 (2015), 1176–1184.
  • Ghazi et al. (2020a) Badih Ghazi, Noah Golowich, Ravi Kumar, Pasin Manurangsi, Rasmus Pagh, and Ameya Velingker. 2020a. Pure differentially private summation from anonymous messages. arXiv preprint arXiv:2002.01919 (2020).
  • Ghazi et al. (2019a) Badih Ghazi, Noah Golowich, Ravi Kumar, Rasmus Pagh, and Ameya Velingker. 2019a. Private Heavy Hitters and Range Queries in the Shuffled Model. arXiv preprint arXiv:1908.11358 (2019).
  • Ghazi et al. (2021) Badih Ghazi, Noah Golowich, Ravi Kumar, Rasmus Pagh, and Ameya Velingker. 2021. On the Power of Multiple Anonymous Messages: Frequency Estimation and Selection in the Shuffle Model of Differential Privacy. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 463–488.
  • Ghazi et al. (2020b) Badih Ghazi, Ravi Kumar, Pasin Manurangsi, and Rasmus Pagh. 2020b. Private counting from anonymous messages: Near-optimal accuracy with vanishing communication overhead. In International Conference on Machine Learning. PMLR, 3505–3514.
  • Ghazi et al. (2020c) Badih Ghazi, Pasin Manurangsi, Rasmus Pagh, and Ameya Velingker. 2020c. Private aggregation from fewer anonymous messages. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 798–827.
  • Ghazi et al. (2019b) Badih Ghazi, Rasmus Pagh, and Ameya Velingker. 2019b. Scalable and differentially private distributed aggregation in the shuffled model. arXiv preprint arXiv:1906.08320 (2019).
  • Ghosh et al. (2012) Arpita Ghosh, Tim Roughgarden, and Mukund Sundararajan. 2012. Universally utility-maximizing privacy mechanisms. SIAM J. Comput. 41, 6 (2012), 1673–1693.
  • Holohan et al. (2017) Naoise Holohan, Douglas J Leith, and Oliver Mason. 2017. Optimal differentially private mechanisms for randomised response. IEEE Transactions on Information Forensics and Security 12, 11 (2017), 2726–2735.
  • Ilvento (2020) Christina Ilvento. 2020. Implementing the Exponential Mechanism with Base-2 Differential Privacy. In CCS. ACM, 717–742.
  • Kairouz et al. (2014) Peter Kairouz, Sewoong Oh, and Pramod Viswanath. 2014. Extremal mechanisms for local differential privacy. Advances in neural information processing systems 27 (2014), 2879–2887.
  • Kasiviswanathan et al. (2011) Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2011. What can we learn privately? SIAM J. Comput. 40, 3 (2011), 793–826.
  • Lazar et al. (2018) David Lazar, Yossi Gilad, and Nickolai Zeldovich. 2018. Karaoke: Distributed private messaging immune to passive traffic analysis. In USENIX Symposium on Operating Systems Design and Implementation. 711–725.
  • Mironov (2012) Ilya Mironov. 2012. On significance of the least significant bits for differential privacy. In CCS. ACM, 650–661.
  • Roy Chowdhury et al. (2020) Amrita Roy Chowdhury, Chenghong Wang, Xi He, Ashwin Machanavajjhala, and Somesh Jha. 2020. Cryptϵ\epsilon: Crypto-Assisted Differential Privacy on Untrusted Servers. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 603–619.
  • Syverson et al. (2004) Paul Syverson, Roger Dingledine, and Nick Mathewson. 2004. Tor: The second-generation onion router. In USENIX Security Symposium. 303–320.
  • Team et al. (2017) ADP Team et al. 2017. Learning with privacy at scale. Apple Mach. Learn. J 1, 9 (2017).
  • Tyagi et al. (2017) Nirvan Tyagi, Yossi Gilad, Derek Leung, Matei Zaharia, and Nickolai Zeldovich. 2017. Stadium: A distributed metadata-private messaging system. In Proceedings of the 26th Symposium on Operating Systems Principles. 423–440.
  • Van Den Hooff et al. (2015) Jelle Van Den Hooff, David Lazar, Matei Zaharia, and Nickolai Zeldovich. 2015. Vuvuzela: Scalable private messaging resistant to traffic analysis. In Proceedings of the 25th Symposium on Operating Systems Principles. 137–152.
  • Wang et al. (2017) Tianhao Wang, Jeremiah Blocki, Ninghui Li, and Somesh Jha. 2017. Locally differentially private protocols for frequency estimation. In USENIX Security Symposium. 729–745.
  • Wang et al. (2020) Tianhao Wang, Bolin Ding, Min Xu, Zhicong Huang, Cheng Hong, Jingren Zhou, Ninghui Li, and Somesh Jha. 2020. Improving Utility and Security of the Shuffler based Differential Privacy. Proceedings of the VLDB Endowment 13, 13 (2020).
  • Wang et al. (2019) Tianhao Wang, Bolin Ding, Jingren Zhou, Cheng Hong, Zhicong Huang, Ninghui Li, and Somesh Jha. 2019. Answering multi-dimensional analytical queries under local differential privacy. In Proceedings of the 2019 International Conference on Management of Data. 159–176.
  • Warner (1965) Stanley L Warner. 1965. Randomized response: A survey technique for eliminating evasive answer bias. J. Amer. Statist. Assoc. 60, 309 (1965), 63–69.

Appendix A Parameter setting in flexible pureDUMP and mixDUMP

Table 3 and Table 4 show the number of dummy points that need to be sent by a user in flexible pureDUMP and flexible mixDUMP, respectively. The number of users is 500,000500,000; we show results for two different values of kk (size of the data domain) and γ\gamma (probability of a user sending dummy points).

ϵ      k=50               k=500
       γ=0.01   γ=0.001   γ=0.01   γ=0.001
0.4    13       127       127      1270
0.6    6        57        57       565
0.8    4        32        32       318
1.0    3        21        21       204
Table 3. Number of dummy points sent by a user in flexible pureDUMP. We fix n=500,000n=500,000. All decimals are rounded up.
ϵ      k=50               k=500
       γ=0.01   γ=0.001   γ=0.01   γ=0.001
0.4    12       118       119      1190
0.6    5        44        46       451
0.8    2        18        20       192
1.0    1        6         8        72
Table 4. Number of dummy points sent by a user in flexible mixDUMP. We fix n=500,000n=500,000 and ϵl=8\epsilon_{l}=8. All decimals are rounded up.
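
The entries of Table 3 follow directly from Theorem 5.11 by solving for the per-user number of dummy points. The sketch below is our own illustrative code; it assumes δ=10^{-6} as in the experiments and neglects the e^{-γn} terms and the constant −1 in S−1, both of which are negligible at this scale.

import math

def pure_dummies_per_sender(eps, k, gamma, n=500_000, delta=1e-6):
    # Per-user dummy count for flexible pureDUMP (sketch): the total number
    # of dummy points S must satisfy eps <= sqrt(14k*ln(2/delta)/(S-1)),
    # and only a gamma fraction of the n users actually send them.
    total = 14 * k * math.log(2 / delta) / eps**2  # required total S (approx.)
    return math.ceil(total / (gamma * n))          # rounded up, as in Table 3

for eps in [0.4, 0.6, 0.8, 1.0]:
    print(eps, [pure_dummies_per_sender(eps, k, g)
                for k, g in [(50, 0.01), (50, 0.001), (500, 0.01), (500, 0.001)]])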

Appendix B Proofs

B.1. Proof of Lemma 5.1

Proof.

Denote a1,a2,,ama_{1},a_{2},...,a_{m} as mm random variables that follow the binomial distribution, where a1+a2++am=Na_{1}+a_{2}+...+a_{m}=N. Let aia_{i} be a constant value cc; then the conditional probability distribution Pr[aj=x|ai=c]Pr[a_{j}=x|a_{i}=c] (iji\neq j) can be expressed as

Pr[aj=x|ai=c]\displaystyle Pr[a_{j}=x|a_{i}=c] =CNcx(1m1)x(m2m1)Ncx\displaystyle=C_{N-c}^{x}(\frac{1}{m-1})^{x}(\frac{m-2}{m-1})^{N-c-x}
=(Nc)!x!(Ncx)!(m2)Ncx(m1)Nc\displaystyle=\frac{\frac{(N-c)!}{x!(N-c-x)!}(m-2)^{N-c-x}}{(m-1)^{N-c}}

The marginal distribution Pr[aj=x]Pr[a_{j}=x] can be expressed as

Pr[aj=x]\displaystyle Pr[a_{j}=x] =CNx(1m)x(m1m)Nx\displaystyle=C_{N}^{x}(\frac{1}{m})^{x}\cdot(\frac{m-1}{m})^{N-x}
=N!x!(Nx)!(m1)NxmN\displaystyle=\frac{\frac{N!}{x!(N-x)!}\cdot(m-1)^{N-x}}{m^{N}}

Then we calculate the upper bound of the ratio of these two probability distributions.

Pr[aj=x|ai=c]Pr[aj=x]\displaystyle\frac{Pr[a_{j}=x|a_{i}=c]}{Pr[a_{j}=x]}
=(Nc)!x!(Ncx)!(m2)Ncx(m1)NcmNN!x!(Nx)!(m1)Nx\displaystyle=\frac{\frac{(N-c)!}{x!(N-c-x)!}(m-2)^{N-c-x}}{(m-1)^{N-c}}\cdot\frac{m^{N}}{\frac{N!}{x!(N-x)!}\cdot(m-1)^{N-x}}
=(m2)NcxmN(m1)2Nxc(Nc)!(Nx)!N!(Ncx)!\displaystyle=\frac{(m-2)^{N-c-x}m^{N}}{(m-1)^{2N-x-c}}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
=(m2)Ncx(m1)NxcmN(m1)N(Nc)!(Nx)!N!(Ncx)!\displaystyle=\frac{(m-2)^{N-c-x}}{(m-1)^{N-x-c}}\cdot\frac{m^{N}}{(m-1)^{N}}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
=(11m1)Ncx(1+1m1)N(Nc)!(Nx)!N!(Ncx)!\displaystyle=(1-\frac{1}{m-1})^{N-c-x}\cdot(1+\frac{1}{m-1})^{N}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
=(1(1m1)2)Ncx(1+1m1)c+x(Nc)!(Nx)!N!(Ncx)!\displaystyle=(1-(\frac{1}{m-1})^{2})^{N-c-x}\cdot(1+\frac{1}{m-1})^{c+x}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
(mm1)c+x(Nx)(Nx1)(Nxc+1)N(N1)(Nc+1)\displaystyle\leq(\frac{m}{m-1})^{c+x}\cdot\frac{(N-x)\cdot(N-x-1)\cdot...\cdot(N-x-c+1)}{N\cdot(N-1)\cdot...\cdot(N-c+1)}
(mm1)c+x\displaystyle\leq(\frac{m}{m-1})^{c+x}

The lower bound of the ratio of these two probability distributions can be calculated as

Pr[aj=x|ai=c]Pr[aj=x]\displaystyle\frac{Pr[a_{j}=x|a_{i}=c]}{Pr[a_{j}=x]}
=(m2)NcxmN(m1)2Nxc(Nc)!(Nx)!N!(Ncx)!\displaystyle=\frac{(m-2)^{N-c-x}m^{N}}{(m-1)^{2N-x-c}}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
=(m2)Ncx(m1)NxcmN(m1)N(Nc)!(Nx)!N!(Ncx)!\displaystyle=\frac{(m-2)^{N-c-x}}{(m-1)^{N-x-c}}\cdot\frac{m^{N}}{(m-1)^{N}}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
=(11m1)Ncx(1+1m1)N(Nc)!(Nx)!N!(Ncx)!\displaystyle=(1-\frac{1}{m-1})^{N-c-x}\cdot(1+\frac{1}{m-1})^{N}\cdot\frac{(N-c)!(N-x)!}{N!(N-c-x)!}
(m2m1)Ncx(Nx)(Nx1)(Nxc+1)N(N1)(Nc+1)\displaystyle\geq(\frac{m-2}{m-1})^{N-c-x}\cdot\frac{(N-x)\cdot(N-x-1)\cdot...\cdot(N-x-c+1)}{N\cdot(N-1)\cdot...\cdot(N-c+1)}
(m2m1)Nxc(Nxc+1Nc+1)c\displaystyle\geq(\frac{m-2}{m-1})^{N-x-c}\cdot(\frac{N-x-c+1}{N-c+1})^{c}

Since a1,a2,,ama_{1},a_{2},...,a_{m} are assumed to be much smaller than NN, mm must be a large value. Hence both mm1\frac{m}{m-1} and m2m1\frac{m-2}{m-1} are close to 11. Besides, Nxc+1Nc+1\frac{N-x-c+1}{N-c+1} is also close to 11 since xx is much smaller than NN. Therefore, both the upper bound and the lower bound of Pr[aj=x|ai=c]Pr[aj=x]\frac{Pr[a_{j}=x|a_{i}=c]}{Pr[a_{j}=x]} are close to 11. We have

Pr[aj=x,ai=c]\displaystyle Pr[a_{j}=x,a_{i}=c] =Pr[aj=x|ai=c]Pr[ai=c]\displaystyle=Pr[a_{j}=x|a_{i}=c]Pr[a_{i}=c]
Pr[aj=x]Pr[ai=c]\displaystyle\simeq Pr[a_{j}=x]Pr[a_{i}=c]

Based on the above, we can approximately treat the distributions of aia_{i} and aja_{j} as independent. ∎
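
The approximate independence can also be checked empirically. The following Monte Carlo sketch (our own illustrative code) compares the conditional and marginal distributions of two multinomial cell counts in the regime assumed by the lemma.

import numpy as np

rng = np.random.default_rng(seed=0)

# N balls thrown uniformly into m bins; the bin counts a_1, ..., a_m are
# marginally binomial and sum to N, as in Lemma 5.1.
N, m, trials = 10_000, 100, 50_000
counts = rng.multinomial(N, [1.0 / m] * m, size=trials)
a_i, a_j = counts[:, 0], counts[:, 1]

c = x = 100  # condition on a_i = c, a value near its mean N/m
p_cond = np.mean(a_j[a_i == c] == x)  # estimates Pr[a_j = x | a_i = c]
p_marg = np.mean(a_j == x)            # estimates Pr[a_j = x]
print(p_cond, p_marg)                 # nearly equal, as the lemma predicts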

B.2. Proof of Theorem 5.2

Proof.

For any two neighboring datasets 𝒟\mathcal{D} and 𝒟\mathcal{D^{\prime}}, w.l.o.g., we assume they differ in the nthn^{\text{th}} value, with xn=1x_{n}=1 in 𝒟\mathcal{D} and xn=2x^{\prime}_{n}=2 in 𝒟\mathcal{D^{\prime}}. Let 𝒮\mathcal{S} and 𝒮\mathcal{S^{\prime}} be two dummy datasets with |𝒮|=|𝒮|=ns|\mathcal{S}|=|\mathcal{S}^{\prime}|=ns dummy points sent by users. What the analyst receives is a mixed dataset 𝒱\mathcal{V} that is a mixture of the user dataset 𝒟\mathcal{D}(or 𝒟)\mathcal{D^{\prime}}) and the dummy dataset 𝒮\mathcal{S}(or 𝒮)\mathcal{S^{\prime}}). We denote 𝒱=𝒟𝒮\mathcal{V}=\mathcal{D}\cup\mathcal{S}(or 𝒱=𝒟𝒮\mathcal{V}=\mathcal{D^{\prime}}\cup\mathcal{S^{\prime}}). Denote sjs_{j} as the number of value jj in the dummy dataset. To get the same 𝒱\mathcal{V}, there should be a dummy point equal to 2 in 𝒮\mathcal{S} but equal to 1 in 𝒮\mathcal{S^{\prime}}. Thus the numbers of the kk values in the other |𝒮|1|\mathcal{S}|-1 dummy points are [s11,s2,,sk][s_{1}-1,s_{2},\ldots,s_{k}] in 𝒮\mathcal{S} and [s1,s21,,sk][s_{1},s_{2}-1,\ldots,s_{k}] in 𝒮\mathcal{S^{\prime}}. Then we have

Pr[𝒟𝒮=𝒱]Pr[𝒟𝒮=𝒱]\displaystyle\frac{\Pr[\mathcal{D}\cup\mathcal{S}=\mathcal{V}]}{\Pr[\mathcal{D^{\prime}}\cup\mathcal{S^{\prime}}=\mathcal{V}]} =(s11,s2,,sk|𝒮|1)(s1,s21,,sk|𝒮|1)=Cs1+s21s2Cs11s11Cs1+s21s21Cs1s1\displaystyle=\frac{(_{s_{1}-1,s_{2},\ldots,s_{k}}^{\ \ \ \ \ |\mathcal{S}|-1})}{(_{s_{1},s_{2}-1,\ldots,s_{k}}^{\ \ \ \ \ |\mathcal{S}|-1})}=\frac{C_{s_{1}+s_{2}-1}^{s_{2}}C_{s_{1}-1}^{s_{1}-1}}{C_{s_{1}+s_{2}-1}^{s_{2}-1}C_{s_{1}}^{s_{1}}}
=s1!(s21)!(s11)!s2!=s1s2\displaystyle=\frac{s_{1}!(s_{2}-1)!}{(s_{1}-1)!s_{2}!}=\frac{s_{1}}{s_{2}}

As dummy points are all uniformly sampled at random from 𝔻\mathbb{D}, both s1s_{1} and s2s_{2} follow binomial distributions, where s1s_{1} follows Bin(|𝒮|1,1k)+1Bin(|\mathcal{S}|-1,\frac{1}{k})+1 and s2s_{2} follows Bin(|𝒮|1,1k)Bin(|\mathcal{S}|-1,\frac{1}{k}). Denote S1S_{1} as Bin(|𝒮|1,1k)+1Bin(|\mathcal{S}|-1,\frac{1}{k})+1 and S2S_{2} as Bin(|𝒮|1,1k)Bin(|\mathcal{S}|-1,\frac{1}{k}). Here, s1s_{1} and s2s_{2} can be treated as mutually independent variables according to Lemma 5.1. Then the probability that pureDUMP does not satisfy ϵd\epsilon_{d}-DP is

Pr[Pr[𝒟𝒮=𝒱]Pr[𝒟𝒮=𝒱]eϵd]=Pr[S1S2eϵd]δd\displaystyle\Pr[\frac{\Pr[\mathcal{D}\cup\mathcal{S}=\mathcal{V}]}{\Pr[\mathcal{D^{\prime}}\cup\mathcal{S^{\prime}}=\mathcal{V}]}\geq e^{\epsilon_{d}}]=\Pr[\frac{S_{1}}{S_{2}}\geq e^{\epsilon_{d}}]\leq\delta_{d}

Let c=E(S2)=|𝒮|1kc=E(S_{2})=\frac{|\mathcal{S}|-1}{k}. The event in Pr[S1S2eϵd]\Pr[\frac{S_{1}}{S_{2}}\geq e^{\epsilon_{d}}] can be divided into two overlapping cases, S1ceϵd/2S_{1}\geq ce^{\epsilon_{d}/2} or S2ceϵd/2S_{2}\leq ce^{-\epsilon_{d}/2}, so we have

Pr[S1S2eϵd]\displaystyle\Pr[\frac{S_{1}}{S_{2}}\geq e^{\epsilon_{d}}]
Pr[S1ceϵd/2]+Pr[S2ceϵd/2]\displaystyle\leq\Pr[S_{1}\geq ce^{\epsilon_{d}/2}]+\Pr[S_{2}\leq ce^{-\epsilon_{d}/2}]
=Pr[S2ceϵd/21]+Pr[S2ceϵd/2]\displaystyle=\Pr[S_{2}\geq ce^{\epsilon_{d}/2}-1]+\Pr[S_{2}\leq ce^{-\epsilon_{d}/2}]
=Pr[S2cceϵd/2c1]+Pr[S2cceϵd/2c]\displaystyle=\Pr[S_{2}-c\geq ce^{\epsilon_{d}/2}-c-1]+\Pr[S_{2}-c\leq ce^{-\epsilon_{d}/2}-c]
=Pr[S2cc(eϵd/211/c)]+Pr[S2cc(eϵd/21)]\displaystyle=\Pr[S_{2}-c\geq c(e^{\epsilon_{d}/2}-1-1/c)]+\Pr[S_{2}-c\leq c(e^{-\epsilon_{d}/2}-1)]

According to multiplicative Chernoff bound, we have

Pr[S1S2eϵd]ec3(eϵd/211/c)2+ec2(1eϵd/2)2\Pr[\frac{S_{1}}{S_{2}}\geq e^{\epsilon_{d}}]\leq e^{-\frac{c}{3}(e^{\epsilon_{d}/2}-1-1/c)^{2}}+e^{-\frac{c}{2}(1-e^{-\epsilon_{d}/2})^{2}}

To ensure that Pr[S1S2eϵd]δd\Pr[\frac{S_{1}}{S_{2}}\geq e^{\epsilon_{d}}]\leq\delta_{d}, we need

ec3(eϵd/211/c)2δd2andec2(1eϵd/2)2δd2e^{-\frac{c}{3}(e^{\epsilon_{d}/2}-1-1/c)^{2}}\leq\frac{\delta_{d}}{2}\ and\ e^{-\frac{c}{2}(1-e^{-\epsilon_{d}/2})^{2}}\leq\frac{\delta_{d}}{2}

For ec2(1eϵd/2)2δd2e^{-\frac{c}{2}(1-e^{-\epsilon_{d}/2})^{2}}\leq\frac{\delta_{d}}{2}, we use 1eϵd/2(1e1/2)ϵdϵd/71-e^{-\epsilon_{d}/2}\geq(1-e^{-1/2})\epsilon_{d}\geq\epsilon_{d}/\sqrt{7} for ϵd1\epsilon_{d}\leq 1, so we can get c14ln(2/δd)ϵd2c\geq\frac{14\ln{(2/\delta_{d})}}{\epsilon_{d}^{2}}. For ec3(eϵd/211/c)2δd2e^{-\frac{c}{3}(e^{\epsilon_{d}/2}-1-1/c)^{2}}\leq\frac{\delta_{d}}{2}, we use eϵd/21ϵd2e^{\epsilon_{d}/2}-1\geq\frac{\epsilon_{d}}{2}, so eϵd/211/c2554ϵde^{\epsilon_{d}/2}-1-1/c\geq\frac{25}{54}\epsilon_{d} when c27/ϵdc\geq 27/\epsilon_{d}. Therefore, we can get c3542ln(2/δd)252ϵd2c\geq\frac{3\cdot 54^{2}\ln{(2/\delta_{d})}}{25^{2}\epsilon_{d}^{2}}. Note that 354225214\frac{3\cdot 54^{2}}{25^{2}}\leq 14.

In summary, we can get cmax{14ln(2/δd)ϵd2,27ϵd}c\geq max\{\frac{14\ln{(2/\delta_{d})}}{\epsilon_{d}^{2}},\frac{27}{\epsilon_{d}}\}. Because ϵd1\epsilon_{d}\leq 1, when δd0.2907\delta_{d}\leq 0.2907, ϵd14ln(2/δd)27\epsilon_{d}\leq\frac{14\ln{(2/\delta_{d})}}{27} always holds, so c14ln(2/δd)ϵd2c\geq\frac{14\ln{(2/\delta_{d})}}{\epsilon_{d}^{2}}. Therefore, we have S1k14ln(2/δd)ϵd2\frac{S-1}{k}\geq\frac{14\ln{(2/\delta_{d})}}{\epsilon_{d}^{2}}, that is, ϵd14kln(2/δd)S1\epsilon_{d}\leq\sqrt{\frac{14k\ln{(2/\delta_{d})}}{S-1}}. ∎
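
The resulting bound is straightforward to evaluate; the sketch below (our own illustrative code) computes the pureDUMP privacy budget of Theorem 5.2 for a given total number of dummy points.

import math

def puredump_epsilon(k, S, delta_d):
    # Privacy budget of pureDUMP from Theorem 5.2 (illustrative sketch):
    # eps_d = sqrt(14k * ln(2/delta_d) / (S - 1)), for delta_d <= 0.2907,
    # where S is the total number of dummy points.
    assert S > 1 and 0 < delta_d <= 0.2907
    return math.sqrt(14 * k * math.log(2 / delta_d) / (S - 1))

# Example: domain size k = 50, S = 100,000 dummy points, delta_d = 1e-6.
print(puredump_epsilon(50, 100_000, 1e-6))  # about 0.32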

B.3. Proof of Theorem 5.6

Proof.

The assumptions are the same as in the proof of Theorem 5.2. The nthn^{\text{th}} value xn=1x_{n}=1 in 𝒟\mathcal{D} and xn=2x^{\prime}_{n}=2 in its neighboring dataset 𝒟\mathcal{D^{\prime}}. If the nthn^{\text{th}} user randomizes its data, then we can easily get

Pr[R(𝒟)𝒮=𝒱]Pr[R(𝒟)𝒮=𝒱]=1\frac{\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}]}{\Pr[R(\mathcal{D^{\prime}})\cup\mathcal{S}^{\prime}=\mathcal{V}]}=1

So, we focus on the case where the nthn^{\text{th}} user does not randomize its data. Denote \mathcal{H} as the set of users participating in randomization in Algorithm 3, with size |||\mathcal{H}|. Since the randomized users' data and the dummy points all follow the uniform distribution, mixDUMP can be regarded as pureDUMP with the dummy dataset 𝒮\mathcal{H}\cup\mathcal{S} introduced. From Theorem 5.2, mixDUMP satisfies (ϵH,δs2)(\epsilon_{H},\frac{\delta_{s}}{2})-DP, so we have

Pr[R(𝒟)𝒮=𝒱]eϵHPr[R(𝒟)𝒮=𝒱]+δs2\displaystyle\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}]\leq e^{\epsilon_{H}}\Pr[R(\mathcal{D^{\prime}})\cup\mathcal{S^{\prime}}=\mathcal{V}]+\frac{\delta_{s}}{2}

where ϵH=14kln(4/δs)||+|𝒮|1\epsilon_{H}=\sqrt{\frac{14k\ln{(4/\delta_{s})}}{|\mathcal{H}|+|\mathcal{S}|-1}}, δs0.5814\delta_{s}\leq 0.5814. Since λ\lambda is the probability that each user randomizes its data, the number of users participating in randomization |||\mathcal{H}| follows Bin(n1,λ)Bin(n-1,\lambda). Define μ=(n1)λ\mu=(n-1)\lambda, and γ=2ln(2/δs)(n1)λ\gamma=\sqrt{\frac{2\ln{(2/\delta_{s})}}{(n-1)\lambda}}. According to Chernoff bound, we can get

Pr[||<(1γ)μ]<eμγ2/2=δs2\Pr[|\mathcal{H}|<(1-\gamma)\mu]<e^{-\mu\gamma^{2}/2}=\frac{\delta_{s}}{2}

Then we have

Pr[R(𝒟)𝒮=𝒱]\displaystyle\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}]
=Pr[R(𝒟)𝒮=𝒱||(n1)λ2(n1)λln(2/δs)]+\displaystyle=\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}\cap|\mathcal{H}|\geq(n-1)\lambda-\sqrt{2(n-1)\lambda\ln{(2/\delta_{s})}}]+
Pr[R(𝒟)𝒮=𝒱||<(n1)λ2(n1)λln(2/δs)]\displaystyle\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}\cap|\mathcal{H}|<(n-1)\lambda-\sqrt{2(n-1)\lambda\ln{(2/\delta_{s})}}]
Pr[R(𝒟)𝒮=𝒱||(n1)λ2(n1)λln(2/δs)]+δs2\displaystyle\leq\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}\cap|\mathcal{H}|\geq(n-1)\lambda-\sqrt{2(n-1)\lambda\ln{(2/\delta_{s})}}]+\frac{\delta_{s}}{2}
Let t=(n1)λ2(n1)λln(2/δs)t=(n-1)\lambda-\sqrt{2(n-1)\lambda\ln{(2/\delta_{s})}}. Then
Pr[R(𝒟)𝒮=𝒱](htPr[R(𝒟)𝒮=𝒱]Pr[||=h])+δs2\displaystyle\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}]\leq(\sum_{h\geq t}\Pr[R(\mathcal{D})\cup\mathcal{S}=\mathcal{V}]\Pr[|\mathcal{H}|=h])+\frac{\delta_{s}}{2}
(ht(eϵHPr[R(𝒟)𝒮=𝒱]Pr[||=h]+δs2Pr[||=h]))+δs2\displaystyle\leq(\sum_{h\geq t}(e^{\epsilon_{H}}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]\Pr[|\mathcal{H}|=h]+\frac{\delta_{s}}{2}\Pr[|\mathcal{H}|=h]))+\frac{\delta_{s}}{2}
(hteϵHPr[R(𝒟)𝒮=𝒱]Pr[||=h])+δs\displaystyle\leq(\sum_{h\geq t}e^{\epsilon_{H}}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]\Pr[|\mathcal{H}|=h])+\delta_{s}
maxhteϵHPr[R(𝒟)𝒮=𝒱]+δs\displaystyle\leq max_{h\geq t}e^{\epsilon_{H}}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]+\delta_{s}
e14kln(4/δs)t+S1Pr[R(𝒟)𝒮=𝒱]+δs\displaystyle\leq e^{\sqrt{\frac{14k\ln{(4/\delta_{s})}}{t+S-1}}}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]+\delta_{s}
=e14kln(4/δs)(n1)λ2(n1)λln(2/δs)+S1Pr[R(𝒟)𝒮=𝒱]+δs\displaystyle=e^{\sqrt{\frac{14k\ln{(4/\delta_{s})}}{(n-1)\lambda-\sqrt{2(n-1)\lambda\ln{(2/\delta_{s})}}+S-1}}}\Pr[R(\mathcal{D}^{\prime})\cup\mathcal{S}^{\prime}=\mathcal{V}]+\delta_{s}
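
Analogously, the mixDUMP budget of Theorem 5.6 can be evaluated directly; the sketch below is our own illustrative code, with hypothetical parameter values.

import math

def mixdump_epsilon(k, S, n, lam, delta_s):
    # Privacy budget of mixDUMP from Theorem 5.6 (illustrative sketch):
    # the randomized users' reports act as roughly (n-1)*lam extra dummy
    # points, minus a Chernoff-style correction, on top of S real dummies.
    mu = (n - 1) * lam
    effective = mu - math.sqrt(2 * mu * math.log(2 / delta_s)) + S - 1
    assert effective > 0
    return math.sqrt(14 * k * math.log(4 / delta_s) / effective)

# Example: k = 50, n = 500,000 users, lam = 0.1, S = 10,000, delta_s = 1e-6.
print(mixdump_epsilon(50, 10_000, 500_000, 0.1, 1e-6))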

B.4. Proof of Lemma 5.8

Proof.

fvf_{v} is the true frequency with which element vv appears in the users' dataset, and f~v\tilde{f}_{v} is its estimation. We prove that f~v\tilde{f}_{v} in mixDUMP is unbiased.

In the GRR mechanism, each value keeps the true value with probability pp and changes to each other value in the data domain with probability qq. We give the estimation function of mixDUMP in Algorithm 5. The formula is

f~v=i[n],j[s+1]𝟙{v=yi,j}nλ(11k)nskn(12λ(11k))\tilde{f}_{v}=\frac{\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-n\lambda(1-\frac{1}{k})-\frac{ns}{k}}{n(1-2\lambda(1-\frac{1}{k}))}

where λ=kk1q\lambda=\frac{k}{k-1}q. To simplify the derivation, we replace λ\lambda with qq; then the expectation of f~v\tilde{f}_{v} is

E[f~v]\displaystyle E[\tilde{f}_{v}] =E[1n(i[n],j[s+1]𝟙{v=yi,j}nqnsk)(12q)]\displaystyle=E[\frac{1}{n}\cdot\frac{(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-nq-\frac{ns}{k})}{(1-2q)}]
=1nE[(i[n],j[s+1]𝟙{v=yi,j}nqnsk)(12q)]\displaystyle=\frac{1}{n}\cdot E[\frac{(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-nq-\frac{ns}{k})}{(1-2q)}]
=1n112qE[i[n],j[s+1]𝟙{v=yi,j}nqnsk]\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot E[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-nq-\frac{ns}{k}]
=1n112q(E[i[n],j[s+1]𝟙{v=yi,j}]nqnsk)\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot(E[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}]-nq-\frac{ns}{k})
=1n112q(nfv(1q)+n(1fv)q+nsk\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot(nf_{v}(1-q)+n(1-f_{v})q+\frac{ns}{k}
nqnsk)\displaystyle-nq-\frac{ns}{k})
=1n112q(12q)nfv\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot(1-2q)\cdot nf_{v}
=fv\displaystyle=f_{v}
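
The estimation function above translates directly into code. The following sketch is our own illustrative implementation of the analyst-side estimator, applied to placeholder messages; it is not the released code.

import numpy as np

def mixdump_estimate(messages, n, k, s, lam):
    # Analyst-side estimator for mixDUMP, following the estimation function
    # above: subtract the expected GRR and dummy-point noise, then rescale.
    # `messages` holds all n*(s+1) shuffled reports as integers in [0, k).
    counts = np.bincount(messages, minlength=k).astype(float)
    bias = n * lam * (1 - 1 / k) + n * s / k
    return (counts - bias) / (n * (1 - 2 * lam * (1 - 1 / k)))

# Hypothetical usage with placeholder messages: n = 500,000, k = 50, s = 2.
rng = np.random.default_rng(seed=0)
msgs = rng.integers(0, 50, size=500_000 * 3)
print(mixdump_estimate(msgs, n=500_000, k=50, s=2, lam=0.1)[:5])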

B.5. Proof of Theorem 5.7

Proof.

Denote [n1,,nk][n_{1},\ldots,n_{k}] as the numbers of the kk values in the ithi^{\text{th}} user's report set 𝒱i\mathcal{V}_{i}, where i=1kni=s+1\sum_{i=1}^{k}n_{i}=s+1. Denote 𝒮i\mathcal{S}_{i} (or 𝒮i\mathcal{S}_{i}^{\prime}) as the dummy dataset generated by the ithi^{\text{th}} user. Let vv be the true value held by the ithi^{\text{th}} user. According to the definition of LDP, we compare the probabilities that the shuffler receives the same set 𝒱i\mathcal{V}_{i} with value counts [n1,,nk][n_{1},\ldots,n_{k}] when v=1v=1 and when v=2v=2. We prove the following formula

Pr[R(v=1)𝒮i=𝒱i]eϵrPr[R(v=2)𝒮i=𝒱i]+δr\Pr[R(v=1)\cup\mathcal{S}_{i}=\mathcal{V}_{i}]\leq e^{\epsilon_{r}}\Pr[R(v=2)\cup\mathcal{S}_{i}^{\prime}=\mathcal{V}_{i}]+\delta_{r}

According to Equation (1) of GRR mechanism, we have

Pr[R(v=1)𝒮i=𝒱i]Pr[R(v=2)𝒮i=𝒱i]\displaystyle\frac{\Pr[R(v=1)\cup\mathcal{S}_{i}=\mathcal{V}_{i}]}{\Pr[R(v=2)\cup\mathcal{S}_{i}^{\prime}=\mathcal{V}_{i}]}
=eϵl(n11,n2,,nks)+(n1,n21,,nks)++(n1,n2,,nk1s)eϵl(n1,n21,,nks)+(n11,n2,,nks)++(n1,n2,,nk1s)\displaystyle=\frac{e^{\epsilon_{l}}(_{n_{1}-1,n_{2},\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})+(_{n_{1},n_{2}-1,\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})+\ldots+(_{n_{1},n_{2},\ldots,n_{k}-1}^{\ \ \ \ \ \ \ \ s})}{e^{\epsilon_{l}}(_{n_{1},n_{2}-1,\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})+(_{n_{1}-1,n_{2},\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})+\ldots+(_{n_{1},n_{2},\ldots,n_{k}-1}^{\ \ \ \ \ \ \ \ s})}
eϵl(n11,n2,,nks)+(n1,n21,,nks)eϵl(n1,n21,,nks)+(n11,n2,,nks)=eϵln1+n2eϵln2+n1\displaystyle\leq\frac{e^{\epsilon_{l}}(_{n_{1}-1,n_{2},\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})+(_{n_{1},n_{2}-1,\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})}{e^{\epsilon_{l}}(_{n_{1},n_{2}-1,\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})+(_{n_{1}-1,n_{2},\ldots,n_{k}}^{\ \ \ \ \ \ \ \ s})}=\frac{e^{\epsilon_{l}}n_{1}+n_{2}}{e^{\epsilon_{l}}n_{2}+n_{1}}

Since dummy points are uniformly sampled at random from 𝔻\mathbb{D}, both n1n_{1} and n2n_{2} follow binomial distributions: n1n_{1} follows Bin(s,1k)+1Bin(s,\frac{1}{k})+1 and n2n_{2} follows Bin(s,1k)Bin(s,\frac{1}{k}). Based on Lemma 5.1, n1n_{1} and n2n_{2} can be treated as mutually independent variables. We have

Pr[R(v=1)𝒮i=𝒱i]Pr[R(v=2)𝒮i=𝒱i]\displaystyle\frac{\Pr[R(v=1)\cup\mathcal{S}_{i}=\mathcal{V}_{i}]}{\Pr[R(v=2)\cup\mathcal{S}_{i}^{\prime}=\mathcal{V}_{i}]} eϵlBin(s,1k)+eϵl+Bin(s,1k)eϵlBin(s,1k)+Bin(s,1k)+1\displaystyle\leq\frac{e^{\epsilon_{l}}Bin(s,\frac{1}{k})+e^{\epsilon_{l}}+Bin(s,\frac{1}{k})}{e^{\epsilon_{l}}Bin(s,\frac{1}{k})+Bin(s,\frac{1}{k})+1}
Bin(s,1k)(eϵl+1)+eϵlBin(s,1k)(eϵl+1)\displaystyle\leq\frac{Bin(s,\frac{1}{k})(e^{\epsilon_{l}}+1)+e^{\epsilon_{l}}}{Bin(s,\frac{1}{k})(e^{\epsilon_{l}}+1)}
=Bin(s,1k)Bin(s,1k)+eϵlBin(s,1k)(eϵl+1)\displaystyle=\frac{Bin(s,\frac{1}{k})}{Bin(s,\frac{1}{k})}+\frac{e^{\epsilon_{l}}}{Bin(s,\frac{1}{k})(e^{\epsilon_{l}}+1)}

According to the Chernoff bound, Pr[Bin(s,1k)<(1γ)μ]<δ\Pr[Bin(s,\frac{1}{k})<(1-\gamma)\mu]<\delta, where γ=2kln(1/δ)s\gamma=\sqrt{\frac{2k\ln(1/\delta)}{s}} and μ=sk\mu=\frac{s}{k}. That is, Pr[Bin(s,1k)sk(12kln(1/δ)s)]1δ\Pr[Bin(s,\frac{1}{k})\geq\frac{s}{k}(1-\sqrt{\frac{2k\ln(1/\delta)}{s}})]\geq 1-\delta. Note that 12kln(1/δ)s01-\sqrt{\frac{2k\ln(1/\delta)}{s}}\geq 0, so ks12ln(1/δ)\frac{k}{s}\leq\frac{1}{2\ln(1/\delta)}. As Bin(s,1k)[sk(12kln(1/δ)s),s]Bin(s,\frac{1}{k})\in[\frac{s}{k}(1-\sqrt{\frac{2k\ln(1/\delta)}{s}}),s] with probability at least 1δ1-\delta, we have

Pr[R(v=1)𝒮i=𝒱i]Pr[R(v=2)𝒮i=𝒱i]\displaystyle\frac{\Pr[R(v=1)\cup\mathcal{S}_{i}=\mathcal{V}_{i}]}{\Pr[R(v=2)\cup\mathcal{S}_{i}^{\prime}=\mathcal{V}_{i}]}
ssk(12kln(1/δ)s)+eϵlsk(12kln(1/δ)s)(eϵl+1)\displaystyle\leq\frac{s}{\frac{s}{k}(1-\sqrt{\frac{2k\ln(1/\delta)}{s}})}+\frac{e^{\epsilon_{l}}}{\frac{s}{k}(1-\sqrt{\frac{2k\ln(1/\delta)}{s}})(e^{\epsilon_{l}}+1)}
=k12kln(1/δ)s(1+eϵls(eϵl+1))\displaystyle=\frac{k}{1-\sqrt{\frac{2k\ln(1/\delta)}{s}}}(1+\frac{e^{\epsilon_{l}}}{s(e^{\epsilon_{l}}+1)})

When ks(c2+4c2ln(1/δ)c)2\frac{k}{s}\leq(\sqrt{c^{2}+\frac{4c}{\sqrt{2\ln(1/\delta)}}}-c)^{2}, it holds that k12kln(1/δ)s(1+eϵls(eϵl+1))eϵl\frac{k}{1-\sqrt{\frac{2k\ln(1/\delta)}{s}}}(1+\frac{e^{\epsilon_{l}}}{s(e^{\epsilon_{l}}+1)})\leq e^{\epsilon_{l}}, where c=2ln(1/δ)eϵl(eϵl+1)s(eϵl+1)+eϵlc=\sqrt{2\ln(1/\delta)}\cdot\frac{e^{\epsilon_{l}}(e^{\epsilon_{l}}+1)}{s(e^{\epsilon_{l}}+1)+e^{\epsilon_{l}}}. ∎

B.6. Proof of Lemma 5.13

Proof.

fvf_{v} is the true frequency with which element vv appears in the users' dataset, and f~v\tilde{f}_{v} is its estimation. We prove that f~v\tilde{f}_{v} in flexible pureDUMP is unbiased.

As each user generates dummy points with probability γ\gamma, and does not generate any dummy point with probability 1γ1-\gamma, the estimation function is

f~v=1n(i[n],j[s+1]𝟙{v=yi,j}γnsk)\tilde{f}_{v}=\frac{1}{n}\cdot(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-\frac{\gamma ns}{k})

Then we have

E[f~v]\displaystyle E[\tilde{f}_{v}] =E[1n(i[n],j[s+1]𝟙{v=yi,j}γnsk)]\displaystyle=E[\frac{1}{n}\cdot(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-\frac{\gamma ns}{k})]
=1nE[i[n],j[s+1]𝟙{v=yi,j}]γsk\displaystyle=\frac{1}{n}\cdot E[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}]-\frac{\gamma s}{k}
=1n(nfv+γnsk)γsk\displaystyle=\frac{1}{n}\cdot(nf_{v}+\frac{\gamma ns}{k})-\frac{\gamma s}{k}
=fv\displaystyle=f_{v}
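
A quick simulation makes the unbiasedness tangible. This sketch (our own illustrative code, with hypothetical parameters) applies the flexible pureDUMP estimation function above to synthetic reports and compares the mean estimate with the true frequency.

import numpy as np

rng = np.random.default_rng(seed=0)
n, k, s, gamma = 100_000, 10, 5, 0.5
f_true = 0.3  # hypothetical: value v = 0 is held by 30% of users

estimates = []
for _ in range(20):
    data = rng.choice(k, size=n, p=[f_true] + [(1 - f_true) / (k - 1)] * (k - 1))
    senders = rng.random(n) < gamma              # users who send dummy points
    dummies = rng.integers(0, k, size=senders.sum() * s)
    count_v = np.sum(data == 0) + np.sum(dummies == 0)
    estimates.append((count_v - gamma * n * s / k) / n)

print(np.mean(estimates))  # close to f_true = 0.3, as Lemma 5.13 states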

B.7. Proof of Lemma 5.14

Proof.

fvf_{v} is the true frequency with which element vv appears in the users' dataset, and f~v\tilde{f}_{v} is its estimation. We prove that f~v\tilde{f}_{v} of flexible mixDUMP is unbiased.

In the GRR mechanism, each value keeps unchanged with probability pp and changes to each other value in the data domain with probability qq. As each user generates dummy points with probability γ\gamma, and does not generate any dummy point with probability 1γ1-\gamma, the estimation function is

f~v=i[n],j[s+1]𝟙{v=yi,j}nλ(11k)γnskn(12λ(11k))\tilde{f}_{v}=\frac{\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-n\lambda(1-\frac{1}{k})-\frac{\gamma ns}{k}}{n(1-2\lambda(1-\frac{1}{k}))}

where λ=kk1q\lambda=\frac{k}{k-1}q. To simplify the derivation, we replace λ\lambda with qq; then the expectation of f~v\tilde{f}_{v} is

E[f~v]\displaystyle E[\tilde{f}_{v}] =E[1n(i[n],j[s+1]𝟙{v=yi,j}nqγnsk)(12q)]\displaystyle=E[\frac{1}{n}\cdot\frac{(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-nq-\frac{\gamma ns}{k})}{(1-2q)}]
=1nE[(i[n],j[s+1]𝟙{v=yi,j}nqγnsk)(12q)]\displaystyle=\frac{1}{n}\cdot E[\frac{(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-nq-\frac{\gamma ns}{k})}{(1-2q)}]
=1n112qE[i[n],j[s+1]𝟙{v=yi,j}nqγnsk]\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot E[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-nq-\frac{\gamma ns}{k}]
=1n112q(E[i[n],j[s+1]𝟙{v=yi,j}]nqγnsk)\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot(E[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}]-nq-\frac{\gamma ns}{k})
=1n112q(nfv(1q)+n(1fv)q+γnsk\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot(nf_{v}(1-q)+n(1-f_{v})q+\frac{\gamma ns}{k}
nqγnsk)\displaystyle-nq-\frac{\gamma ns}{k})
=1n112q(12q)nfv\displaystyle=\frac{1}{n}\cdot\frac{1}{1-2q}\cdot(1-2q)\cdot nf_{v}
=fv\displaystyle=f_{v}

B.8. Proof of Theorem 5.15

Proof.

As each user sends dummy points with probability γ\gamma, the number of occurrences of any distinct value among the dummy points follows Bin(ns,γk)Bin(ns,\frac{\gamma}{k}). Thus, the estimation function of flexible pureDUMP is

f~v=1n(i[n],j[s+1]𝟙{v=yi,j}γnsk)\tilde{f}_{v}=\frac{1}{n}\cdot(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-\frac{\gamma ns}{k})

Then, we have

Var[f~v]\displaystyle Var[\tilde{f}_{v}] =Var[1n(i[n],j[s+1]𝟙{v=yi,j}γnsk)]\displaystyle=Var[\frac{1}{n}\cdot(\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-\frac{\gamma ns}{k})]
=1n2Var[i[n],j[s+1]𝟙{v=yi,j}]\displaystyle=\frac{1}{n^{2}}\cdot Var[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}]
=1n2nsγk(1γk)\displaystyle=\frac{1}{n^{2}}\cdot ns\cdot\frac{\gamma}{k}\cdot(1-\frac{\gamma}{k})
=sγ(kγ)nk2\displaystyle=\frac{s\gamma(k-\gamma)}{nk^{2}}

Lemma 5.13 proves that f~v\tilde{f}_{v} is unbiased in flexible pureDUMP, thus the MSE of flexible pureDUMP is

MSE\displaystyle MSE =1kv𝔻E[(f~vfv)2]=1kv𝔻Var(f~v)\displaystyle=\frac{1}{k}\sum_{v\in\mathbb{D}}E[(\tilde{f}_{v}-f_{v})^{2}]=\frac{1}{k}\sum_{v\in\mathbb{D}}Var(\tilde{f}_{v})
=sγ(kγ)nk2\displaystyle=\frac{s\gamma(k-\gamma)}{nk^{2}}

B.9. Proof of Theorem 5.16

Proof.

As each user generates dummy points with probability γ\gamma in flexible mixDUMP, the number of occurrences of any distinct value among the dummy points follows Bin(ns,γk)Bin(ns,\frac{\gamma}{k}). Thus, the estimation function of flexible mixDUMP is

f~v=i[n],j[s+1]𝟙{v=yi,j}nλ(11k)γnskn(12λ(11k))\tilde{f}_{v}=\frac{\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-n\lambda(1-\frac{1}{k})-\frac{\gamma ns}{k}}{n(1-2\lambda(1-\frac{1}{k}))}

where λ=kk1q\lambda=\frac{k}{k-1}q. To simplify the derivation, we replace λ\lambda with qq; with p=eϵleϵl+k1,q=1eϵl+k1p=\frac{e^{\epsilon_{l}}}{e^{\epsilon_{l}}+k-1},\ q=\frac{1}{e^{\epsilon_{l}}+k-1}, and again assuming fvf_{v} is small on average as in Theorem 5.9, the variance of f~v\tilde{f}_{v} is

Var[f~v]\displaystyle Var[\tilde{f}_{v}] =Var[1ni[n],j[s+1]𝟙{v=yi,j}γns1knqpq]\displaystyle=Var[\frac{1}{n}\cdot\frac{\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}-\gamma ns\cdot\frac{1}{k}-nq}{p-q}]
=1n21(pq)2Var[i[n],j[s+1]𝟙{v=yi,j}]\displaystyle=\frac{1}{n^{2}}\cdot\frac{1}{(p-q)^{2}}Var[\sum_{i\in[n],j\in[s+1]}\mathbbm{1}_{\{v=y_{i,j}\}}]
1n21(pq)2(nq(1q)+nsγkkγk)\displaystyle\simeq\frac{1}{n^{2}}\cdot\frac{1}{(p-q)^{2}}(nq(1-q)+ns\cdot\frac{\gamma}{k}\cdot\frac{k-\gamma}{k})
=1neϵl+k2(eϵl1)2+sγ(kγ)nk2(eϵl+k1eϵl1)2\displaystyle=\frac{1}{n}\cdot\frac{e^{\epsilon_{l}}+k-2}{(e^{\epsilon_{l}}-1)^{2}}+\frac{s\gamma(k-\gamma)}{nk^{2}}\cdot(\frac{e^{\epsilon_{l}}+k-1}{e^{\epsilon_{l}}-1})^{2}

Lemma 5.14 proves that f~v\tilde{f}_{v} is unbiased in flexible mixDUMP; thus the MSE of flexible mixDUMP is

MSE\displaystyle MSE =1kv𝔻E[(f~vfv)2]=1kv𝔻Var(f~v)\displaystyle=\frac{1}{k}\sum_{v\in\mathbb{D}}E[(\tilde{f}_{v}-f_{v})^{2}]=\frac{1}{k}\sum_{v\in\mathbb{D}}Var(\tilde{f}_{v})
=1neϵl+k2(eϵl1)2+sγ(kγ)nk2(eϵl+k1eϵl1)2\displaystyle=\frac{1}{n}\cdot\frac{e^{\epsilon_{l}}+k-2}{(e^{\epsilon_{l}}-1)^{2}}+\frac{s\gamma(k-\gamma)}{nk^{2}}\cdot(\frac{e^{\epsilon_{l}}+k-1}{e^{\epsilon_{l}}-1})^{2}