
A Generalized Shuffle Framework for Privacy Amplification: Strengthening Privacy Guarantees and Enhancing Utility

E Chen1, Yang Cao2, Yifei Ge3. Corresponding author: Yang Cao, [email protected]
Abstract

The shuffle model of local differential privacy is an advanced method of privacy amplification designed to enhance privacy protection with high utility. It achieves this by randomly shuffling sensitive data, making it more challenging to link individual data points to specific individuals. However, most existing studies have focused on the shuffle model based on $(\epsilon_0,0)$-Locally Differentially Private (LDP) randomizers, with limited consideration for complex scenarios such as $(\epsilon_0,\delta_0)$-LDP or personalized LDP (PLDP). This hinders a comprehensive understanding of the shuffle model's potential and limits its application in various settings. To bridge this research gap, we propose a generalized shuffle framework that can be applied to any $(\epsilon_i,\delta_i)$-PLDP setting with personalized privacy parameters. This generalization allows for a broader exploration of the privacy-utility trade-off and facilitates the design of privacy-preserving analyses in diverse contexts. We prove that a shuffled $(\epsilon_i,\delta_i)$-PLDP process approximately preserves $\mu$-Gaussian Differential Privacy with $\mu=\sqrt{\frac{2}{\sum_{i=1}^{n}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}-\max_{i}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}}}$. This approach allows us to avoid the limitations and potential inaccuracies associated with inequality estimations. To strengthen the privacy guarantee, we improve the lower bound by utilizing hypothesis testing instead of relying on rough estimations like the Chernoff bound or Hoeffding's inequality. Furthermore, extensive comparative evaluations clearly show that our approach outperforms existing methods in achieving strong central privacy guarantees while preserving the utility of the global model. We have also carefully designed corresponding algorithms for the average function, frequency estimation, and stochastic gradient descent.

Introduction

The shuffle model (Bittau2017prochlo) is a state-of-the-art technique for balancing privacy and utility in differentially private data analysis. In traditional differential privacy, a trusted server (or aggregator) is often assumed to collect all users' data before privacy-preserving data analysis (Dwork2014algorithmic). However, such approaches may not be feasible or practical in scenarios where a trusted curator does not exist. Given this, Local Differential Privacy (LDP) (KS11) has been proposed to achieve differential privacy by allowing users to add noise individually; however, LDP suffers from low utility due to the accumulated noise. To address this, the shuffle model of differential privacy (shuffle DP) (Bittau2017prochlo; balle2019privacy; Erlingsson2019amplification) adds a shuffler between the users and the server to randomly shuffle the noisy data before sending it to the server. The shuffle DP has an intriguing theoretical privacy amplification effect: a small amount of local noise can yield a strong privacy guarantee against the untrusted server. Extensive studies (balle2019privacy; Erlingsson2019amplification; Girgis2021renyi; Feldman2022hiding; liu2021flame; GDDSK21federated) have been devoted to proving better (tighter) privacy amplification in the shuffle DP.

However, most existing studies have focused on the shuffle model based on an $(\epsilon_0,\delta_0)$-LDP randomizer with uniform and restricted settings of the local privacy parameters $\epsilon_0$ and $\delta_0$. For example, Erlingsson2019amplification assumes $0<\epsilon_0<1/2$ and $\delta_0=0$. Although a recent work (Liu2023) provides a privacy bound with a local personalized privacy parameter $\epsilon_i$ for each user $i$ (and a fixed $\delta_0$), the bound is relatively loose and leaves substantial room for improvement. To address this problem, we make the following contributions.

Firstly, we propose a Generalized Shuffle framework for Privacy Amplification (GSPA) that allows arbitrary local privacy parameters, and we provide a new privacy amplification analysis. Our analysis technique benefits from the adoption of Functional Differential Privacy (Dong2022gaussian) and a careful analysis of the distance between two multinomial distributions (see Theorems 1 and 2). For both uniform and personalized privacy parameter settings, we provide privacy bounds that are tighter than existing results (see Figure 2).

Secondly, we apply GSPA with different personalized privacy parameter settings to diverse privacy-preserving analysis tasks, including private mean estimation, private frequency estimation, and DP-SGD, to demonstrate the effectiveness of our approach. For mean and frequency estimation with GSPA (see Figure 3), utility decreases linearly as the fraction of conservative users grows, and increases linearly as the privacy parameters of conservative users increase. For DP-SGD with GSPA (see Figure 4), there is an interesting phenomenon: although the constant scenario ($\epsilon_0=0.5$) offers stronger privacy protection, its test accuracy (94.8%) is higher than that (93.5%) of the scenario $U(0.01,2)$, which has varying local privacy parameters.

Preliminaries

This section presents the definitions and tools necessary for understanding the shuffle model; they form the basis of our approach.

Definition 1

(Differential Privacy) A randomized algorithm $\mathcal{R}$ satisfies $(\epsilon,\delta)$-differential privacy, denoted as $(\epsilon,\delta)$-DP, if for all $\mathcal{S}\subseteq\mathrm{Range}(\mathcal{R})$ and for all neighboring databases $D_{0},D_{1}$ ($D_{0}$ can be obtained from $D_{1}$ by replacing exactly one record):

$\mathbb{P}(\mathcal{R}(D_{0})\in\mathcal{S})\leq e^{\epsilon}\mathbb{P}(\mathcal{R}(D_{1})\in\mathcal{S})+\delta.$ (1)

$\epsilon$ is known as the privacy budget, while $\delta$ is referred to as the indistinguishability parameter, which describes the probability that the privacy leakage exceeds $\epsilon$. Both $\epsilon$ and $\delta$ should be as small as possible; smaller values indicate stronger privacy protection.

Definition 2

(Local Differential Privacy) A randomized algorithm $\mathcal{R}:\mathcal{D}\rightarrow\mathcal{S}$ satisfies $(\epsilon,\delta)$-local differential privacy, denoted as $(\epsilon,\delta)$-LDP, if for all pairs $x,x^{\prime}\in\mathcal{D}$, $\mathcal{R}(x)$ and $\mathcal{R}(x^{\prime})$ satisfy

$\mathbb{P}(\mathcal{R}(x)\in\mathcal{S})\leq e^{\epsilon}\mathbb{P}(\mathcal{R}(x^{\prime})\in\mathcal{S})+\delta.$ (2)

In Local Differential Privacy (LDP), each data contributor applies a local randomization mechanism to perturb their own data before sharing it with a central aggregator.


Figure 1: The Generalized Shuffle framework for Privacy Amplification (GSPA). Privacy parameters $(\epsilon_{i},\delta_{i})$ and each client's output are shuffled separately. The random permutation of the shuffler is unknown to anyone except the shuffler itself. The type of query, whether non-adaptive or adaptive, depends on whether the next query depends on the previous output.

Privacy Tools

Differential privacy can be regarded as a hypothesis testing problem for a given pair of distributions (Kairouz2015composition). In brief, we consider a hypothesis testing problem with two hypotheses:

$H_{0}$: The underlying dataset is $D_{0}$,
$H_{1}$: The underlying dataset is $D_{1}$.

To provide an intuitive explanation, we use the name Bob to denote the single individual present in $D_{0}$ but absent in $D_{1}$. Consequently, rejecting the null hypothesis corresponds to detecting Bob's absence, whereas accepting the null hypothesis corresponds to concluding that Bob is present in the dataset.

Inspired by this, an effective tool called $f$-DP (Dong2022gaussian) has been introduced, which utilizes hypothesis testing to handle differential privacy. For two neighbouring databases $D_{0}$ and $D_{1}$, let $U$ and $V$ denote the probability distributions of $\mathcal{R}(D_{0})$ and $\mathcal{R}(D_{1})$, respectively. We consider a rejection rule $0\leq\phi\leq 1$, with type I and type II error rates defined as

$\alpha_{\phi}=\mathbb{E}_{U}[\phi],\quad\beta_{\phi}=1-\mathbb{E}_{V}[\phi].$ (3)

It is well-known that

$\alpha_{\phi}+\beta_{\phi}\geq 1-TV(U,V),$ (4)

where $TV(U,V)$ is the supremum of $|U(A)-V(A)|$ over all measurable sets $A$. To characterize the fine-grained trade-off between the two errors, Table 1 summarizes the relationship between them.

                  | Actual True              | Actual False
Accept Hypothesis | Correct                  | Type II Error ($\beta$)
Reject Hypothesis | Type I Error ($\alpha$)  | Correct
Table 1: Table of Error Types

For any two probability distributions $U$ and $V$ on the same space $\Omega$, the trade-off function $T(U,V):[0,1]\rightarrow[0,1]$ is defined as

$T(U,V)(\alpha)=\inf\{\beta_{\phi}:\alpha_{\phi}\leq\alpha\},$ (5)

where the infimum is taken over all measurable rejection rules $\phi$, with $\alpha_{\phi}=\mathbb{E}_{U}[\phi]$ and $\beta_{\phi}=1-\mathbb{E}_{V}[\phi]$.
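As a concrete example, for two unit-variance Gaussians the Neyman-Pearson lemma gives the closed-form trade-off $T(N(0,1),N(\mu,1))(\alpha)=\Phi(\Phi^{-1}(1-\alpha)-\mu)$ (see Fact 3 in the Appendix). The following minimal Python sketch evaluates this curve; the sample point $\alpha=0.05$ is chosen purely for illustration.

```python
from scipy.stats import norm

def gaussian_tradeoff(alpha, mu):
    # T(N(0,1), N(mu,1))(alpha) = Phi(Phi^{-1}(1 - alpha) - mu)
    return norm.cdf(norm.ppf(1 - alpha) - mu)

# A smaller type II error at the same alpha means the two
# output distributions are easier to distinguish (weaker privacy).
print(gaussian_tradeoff(0.05, mu=1.0))
```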

Definition 3

(Functional Differential Privacy, $f$-DP) Let $f$ be a trade-off function. A mechanism $\mathcal{R}$ is said to be $f$-DP if

$T(\mathcal{R}(D_{0}),\mathcal{R}(D_{1}))\geq f,$ (6)

for all neighboring datasets $D_{0}$ and $D_{1}$.

To enhance readability, we include the introduction and relevant properties of $f$-DP in the Appendix. It is worth noting that traditional DP is a special case of $f$-DP; hence $f$-DP has a wider scope of applicability.

In addition, the Laplace mechanism and the Gaussian mechanism are two common approaches in differential privacy (Dwork2014algorithmic). The choice between them depends on the data type, privacy requirements, and query tasks: the Laplace mechanism provides pure $\epsilon$-DP but may introduce larger errors, while the Gaussian mechanism offers only $(\epsilon,\delta)$-DP yet is often more suitable when accurate results are required. Thus, it is important to strike a balance between privacy and accuracy based on specific requirements.

Definition 4 (Laplace Mechanism)

Given a query function $Q:D\rightarrow\mathbb{R}^{d}$, privacy parameter $\epsilon$ and $\ell_{1}$ sensitivity $\Delta(Q)=\max\|Q(D)-Q(D^{\prime})\|_{1}$, then for any two neighbouring datasets $D,D^{\prime}$, the Laplace mechanism

$M(D,Q)=Q(D)+\mathrm{Lap}\left(\frac{\Delta(Q)}{\epsilon}\right)$ (7)

preserves $\epsilon$-DP, where $\mathrm{Lap}(\lambda)$ denotes centralized Laplace noise with scale parameter $\lambda$.

In the absence of ambiguity, we express queries and answers as $Q(\cdot)$ and $A(\cdot)$, respectively.

Definition 5 (Gaussian Mechanism)

Given a query function $Q:D\rightarrow\mathbb{R}^{d}$, privacy parameter $\epsilon$ and $\ell_{2}$ sensitivity $\Delta_{2}(Q)=\max\|Q(D)-Q(D^{\prime})\|_{2}$, then for any two neighbouring datasets $D,D^{\prime}$, the Gaussian mechanism

$M(D,Q)=Q(D)+N\left(0,\frac{2\log(1.25/\delta)\Delta_{2}^{2}(Q)}{\epsilon^{2}}\right)$ (8)

preserves $(\epsilon,\delta)$-DP, where $N(\mu,\sigma^{2})$ denotes Gaussian noise with mean $\mu$ and variance $\sigma^{2}$.
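To make Definitions 4 and 5 concrete, here is a minimal Python sketch of both mechanisms for a scalar query; the query value and the unit sensitivity used in the example are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(answer, sensitivity, epsilon):
    # epsilon-DP: Laplace noise with scale Delta(Q)/epsilon (Definition 4).
    return answer + rng.laplace(scale=sensitivity / epsilon)

def gaussian_mechanism(answer, l2_sensitivity, epsilon, delta):
    # (epsilon, delta)-DP: Gaussian noise with the variance from Definition 5.
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return answer + rng.normal(scale=sigma)

# Example: a counting query with sensitivity 1.
print(laplace_mechanism(42.0, 1.0, epsilon=0.5))
print(gaussian_mechanism(42.0, 1.0, epsilon=0.5, delta=1e-5))
```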

Privacy Analysis of GSPA Framework

Our Generalized Shuffle framework for Privacy Amplification (GSPA) consists of three main components: local randomizers, a trustworthy shuffler, and an aggregator, as in existing shuffle DP frameworks; however, GSPA allows local randomizers with arbitrary privacy parameters. (i) For $n$ users, the local randomizer $M_{i}$ adds noise to the original data $x_{i}$ on the $i$-th user's device, thus providing $(\epsilon^{\ell}_{i},\delta^{\ell}_{i})$-PLDP for user $i$. (ii) The shuffler randomly permutes the order of the data elements, ensuring that the resulting arrangement is unknown to any party other than the shuffler itself. (iii) The aggregator collects and integrates the shuffled data for simple queries, while for complex tasks like machine learning, it trains models on shuffled data over multiple iterations. Where no confusion arises, the notation $(\epsilon_{0},\delta_{0})$-LDP represents the uniform scenario, while $(\epsilon_{i},\delta_{i})$-PLDP denotes the personalized scenario. A minimal sketch of this pipeline appears below.
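The sketch instantiates the three components for a non-adaptive mean query, using Laplace randomizers so that user $i$ satisfies $(\epsilon_{i},0)$-PLDP; the unit sensitivity is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def gspa_pipeline(data, epsilons, sensitivity=1.0):
    # (i) Local randomizers: user i perturbs x_i with noise scaled to eps_i.
    noisy = np.asarray(data, float) + rng.laplace(
        scale=sensitivity / np.asarray(epsilons, float))
    # (ii) Shuffler: a uniformly random permutation severs the link
    #      between reports and user identities.
    shuffled = rng.permutation(noisy)
    # (iii) Aggregator: answers a simple query on the shuffled reports.
    return shuffled.mean()

print(gspa_pipeline([0.2, 0.7, 0.4], epsilons=[0.1, 0.5, 1.0]))
```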

Privacy Amplification Effect

In this section, we address the issue of privacy protection in the context of a general shuffled adaptive process for personalized local randomizers.

Definition 6

For a domain $\mathcal{D}$, let $\mathcal{R}^{(i)}:\mathcal{S}^{(1)}\times\mathcal{S}^{(2)}\times\cdots\times\mathcal{S}^{(i-1)}\times\mathcal{D}\rightarrow\mathcal{S}^{(i)}$ for $i\in[n]$, where $\mathcal{S}^{(i)}$ is the range space of $\mathcal{R}^{(i)}$, be a sequence of algorithms such that $\mathcal{R}^{(i)}(z_{1:i-1},\cdot)$ is an $(\epsilon_{i},\delta_{i})$-PLDP randomizer for all values of the auxiliary inputs $z_{1:i-1}\in\mathcal{S}^{(1)}\times\mathcal{S}^{(2)}\times\cdots\times\mathcal{S}^{(i-1)}$. Let $\mathcal{A}_{R}:\mathcal{D}^{n}\rightarrow\mathcal{S}^{(1)}\times\mathcal{S}^{(2)}\times\cdots\times\mathcal{S}^{(n)}$ be the algorithm that, given a dataset $x_{1:n}\in\mathcal{D}^{n}$, sequentially computes $z_{i}=\mathcal{R}^{(i)}(z_{1:i-1},x_{i})$ for $i\in[n]$ and outputs $z_{1:n}$. We say $\mathcal{A}_{R}(\mathcal{D})$ is a personalized LDP (PLDP) adaptive process. Similarly, if we first sample a permutation $\pi$ uniformly at random, then sequentially compute $z_{i}=\mathcal{R}^{(i)}(z_{1:i-1},x_{\pi_{i}})$ for $i\in[n]$ and output $z_{1:n}$, we say this process is shuffled PLDP adaptive and denote it by $\mathcal{A}_{R,S}(\mathcal{D})$.

Lemma 1

Given an $(\epsilon_{i},\delta_{i})$-PLDP adaptive process, in the $i$-th step the local randomizer is $\mathcal{R}^{(i)}:\mathcal{D}\rightarrow\mathcal{S}$, and for any $n+1$ inputs $x_{1}^{0},x_{1}^{1},x_{2},\cdots,x_{n}\in\mathcal{D}$, there exist distributions $\mathcal{Q}_{1}^{0},\mathcal{Q}_{1}^{1},\mathcal{Q}_{1},\mathcal{Q}_{2},\cdots,\mathcal{Q}_{n}$ such that

$\mathcal{R}^{(i)}(x_{1}^{0})=\frac{(1-\delta_{i})e^{\epsilon_{i}}}{1+e^{\epsilon_{i}}}\mathcal{Q}_{1}^{0}+\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}\mathcal{Q}_{1}^{1}+\delta_{i}\mathcal{Q}_{1},$ (9)
$\mathcal{R}^{(i)}(x_{1}^{1})=\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}\mathcal{Q}_{1}^{0}+\frac{(1-\delta_{i})e^{\epsilon_{i}}}{1+e^{\epsilon_{i}}}\mathcal{Q}_{1}^{1}+\delta_{i}\mathcal{Q}_{1}.$ (10)

For all $x_{i}\in\{x_{2},\cdots,x_{n}\}$,

$\mathcal{R}(x_{i})=\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}\mathcal{Q}^{0}_{1}+\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}\mathcal{Q}^{1}_{1}+\left(1-\frac{2(1-\delta_{i})}{1+e^{\epsilon_{i}}}\right)\mathcal{Q}_{i}.$ (11)
Proof:

For inputs $X_{0}=\{x_{1}^{0},x_{2},\ldots,x_{n}\}$ and $X_{1}=\{x^{1}_{1},x_{2},\ldots,x_{n}\}$, $\mathcal{R}^{(i)}$ satisfies the constraints of Lemma 4, so there exist an $(\epsilon_{i},\delta_{i})$-PLDP local randomizer $\mathcal{R}^{\prime(i)}:\mathcal{D}\rightarrow\mathcal{Z}$ for the $i$-th output and a post-processing function $proc(\cdot)$ such that $proc(\mathcal{R}^{\prime(i)}(x))=\mathcal{R}^{(i)}(x)$, and

$P(\mathcal{R}^{\prime(i)}(x_{1}^{0})=z)=\begin{cases}0 & \text{if } z=A,\\ \frac{(1-\delta_{i})e^{\epsilon_{i}}}{1+e^{\epsilon_{i}}} & \text{if } z=0,\\ \frac{1-\delta_{i}}{1+e^{\epsilon_{i}}} & \text{if } z=1,\\ \delta_{i} & \text{if } z=B.\end{cases}$

$P(\mathcal{R}^{\prime(i)}(x_{1}^{1})=z)=\begin{cases}\delta_{i} & \text{if } z=A,\\ \frac{1-\delta_{i}}{1+e^{\epsilon_{i}}} & \text{if } z=0,\\ \frac{(1-\delta_{i})e^{\epsilon_{i}}}{1+e^{\epsilon_{i}}} & \text{if } z=1,\\ 0 & \text{if } z=B.\end{cases}$

Let $L=\{z\in\mathcal{Z}\,|\,\mathbb{P}(\mathcal{R}^{\prime}(x_{1}^{0})=z)=\frac{(1-\delta_{i})e^{\epsilon_{i}}}{1+e^{\epsilon_{i}}}$ and $\mathbb{P}(\mathcal{R}^{\prime}(x_{1}^{1})=z)=\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}\}$, and $U=\{z\in\mathcal{Z}\,|\,\mathbb{P}(\mathcal{R}^{\prime}(x_{1}^{0})=z)=\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}$ and $\mathbb{P}(\mathcal{R}^{\prime}(x_{1}^{1})=z)=\frac{(1-\delta_{i})e^{\epsilon_{i}}}{1+e^{\epsilon_{i}}}\}$. Let $M=\mathcal{Z}\setminus(L\cup U)$ and $p=\sum_{z\in L}p_{z}=\sum_{z\in U}p_{z}$. Conditioned on the output lying in $L$, the distributions of $\mathcal{R}^{\prime}(x_{1}^{0})$ and $\mathcal{R}^{\prime}(x_{1}^{1})$ are the same, and analogously on $U$ and $M$. Let $\mathcal{W}_{1}^{0}=\mathcal{R}^{\prime}(x_{1}^{0})|L=\mathcal{R}^{\prime}(x_{1}^{1})|L$, $\mathcal{W}^{1}_{1}=\mathcal{R}^{\prime}(x_{1}^{0})|U=\mathcal{R}^{\prime}(x_{1}^{1})|U$ and $\mathcal{W}_{1}=\mathcal{R}^{\prime}(x_{1}^{0})|M=\mathcal{R}^{\prime}(x_{1}^{1})|M$. Then

$\mathcal{R}^{\prime}(x_{1}^{0})=\frac{(1-\delta_{i})e^{\epsilon_{i}}}{1+e^{\epsilon_{i}}}\mathcal{W}_{1}^{0}+\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}\mathcal{W}_{1}^{1}+\delta_{i}\mathcal{W}_{1},$
$\mathcal{R}^{\prime}(x_{1}^{1})=\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}\mathcal{W}_{1}^{0}+\frac{(1-\delta_{i})e^{\epsilon_{i}}}{1+e^{\epsilon_{i}}}\mathcal{W}_{1}^{1}+\delta_{i}\mathcal{W}_{1}.$

Further, for all $x_{i}\in\{x_{2},\cdots,x_{n}\}$,

$\mathcal{R}^{\prime}(x_{i})=\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}\mathcal{W}_{1}^{0}+\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}\mathcal{W}_{1}^{1}+\left(1-\frac{2(1-\delta_{i})}{1+e^{\epsilon_{i}}}\right)\mathcal{W}_{i}.$

Letting $\mathcal{Q}_{1}^{0}=proc(\mathcal{W}_{1}^{0})$, $\mathcal{Q}_{1}^{1}=proc(\mathcal{W}_{1}^{1})$, $\mathcal{Q}_{1}=proc(\mathcal{W}_{1})$, and $\mathcal{Q}_{i}=proc(\mathcal{W}_{i})$ for all $i\in\{2,\cdots,n\}$ completes the proof. $\square$

Theorem 1

For a domain $\mathcal{D}$, if $\mathcal{A}_{R,S}(\mathcal{D})$ is a shuffled PLDP adaptive process, then for arbitrary two neighboring datasets $D_{0},D_{1}\in\mathcal{D}^{n}$ differing in the $n$-th data point, there exists a post-processing function $proc(\cdot):(0,1,2)\rightarrow\mathcal{S}^{(1)}\times\mathcal{S}^{(2)}\times\cdots\times\mathcal{S}^{(n)}$ such that

$T(\mathcal{A}_{R,S}(D_{0}),\mathcal{A}_{R,S}(D_{1}))=T(proc(\rho_{0}),proc(\rho_{1})).$

Here,

$\rho_{0}=(\Delta_{0},\Delta_{1},\Delta_{2})+\boldsymbol{V},$ (12)
$\rho_{1}=(\Delta_{1},\Delta_{0},\Delta_{2})+\boldsymbol{V},$ (13)

where $\Delta_{2}\sim Bern(\delta_{n})$, $\Delta_{0}\sim Bin(1-\Delta_{2},\frac{e^{\epsilon_{n}}}{1+e^{\epsilon_{n}}})$, and $\Delta_{1}=1-\Delta_{0}-\Delta_{2}$; here $Bern(p)$ denotes a Bernoulli random variable with bias $p$ and $Bin(n,p)$ denotes a Binomial distribution with $n$ trials and success probability $p$. In addition, $\boldsymbol{V}=\sum_{i=1}^{n-1}MultiBern\left(\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}},\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}},1-\frac{2(1-\delta_{i})}{1+e^{\epsilon_{i}}}\right)$, where $MultiBern(\theta_{1},\cdots,\theta_{d})$ represents a $d$-dimensional Bernoulli distribution with $\sum_{j=1}^{d}\theta_{j}=1$.

To enhance readability, proof details are placed in the Appendix. Based on Theorem 1, we can simplify the original problem by analyzing the shuffling process in a simple non-adaptive protocol.
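To build intuition for the reduction, the following Monte Carlo sketch samples the dominating pair $(\rho_{0},\rho_{1})$ from equations (12) and (13); it is an illustration under the stated distributions, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_rho(b, eps, delta):
    """One draw from rho_b in Theorem 1 (b = 0 or 1), as counts of (0s, 1s, others)."""
    eps = np.asarray(eps, float)
    delta = np.asarray(delta, float)
    # Differing n-th user: (Delta_0, Delta_1, Delta_2).
    if rng.random() < delta[-1]:
        d = np.array([0, 0, 1])
    elif rng.random() < np.exp(eps[-1]) / (1 + np.exp(eps[-1])):
        d = np.array([1, 0, 0])
    else:
        d = np.array([0, 1, 0])
    if b == 1:
        d = d[[1, 0, 2]]          # rho_1 swaps the roles of Delta_0 and Delta_1
    # V: each of the other n-1 users emits 0, 1, or "other" w.p. (p_i, p_i, 1-2p_i).
    p = (1 - delta[:-1]) / (1 + np.exp(eps[:-1]))
    u = rng.random(len(p))
    v0 = int((u < p).sum())
    v1 = int(((u >= p) & (u < 2 * p)).sum())
    return d + np.array([v0, v1, len(p) - v0 - v1])

print(sample_rho(0, eps=[0.5] * 1000, delta=[1e-5] * 1000))
```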

The primary objective in what follows is to bound the distance between the two distributions. The Berry-Esseen lemma (berry1941accuracy; Esseen1942) is highly valuable and essential for proving asymptotic properties.

Lemma 2 (Berry-Esseen)

Let $P=(\xi_{0},\xi_{1},\xi_{2})\sim\sum_{i=1}^{m}MultiBern\left(\frac{p_{i}}{2},\frac{p_{i}}{2},1-p_{i}\right)$ and $Q\sim N(\mu,\Sigma)$, where $\mu=\mathbb{E}(P)$ and $\Sigma=Var(P)$. Then for the first two components $(\xi_{0},\xi_{1})$, there exists $C>0$ such that $\|\tilde{P}-\tilde{Q}\|_{TV}\leq\frac{C}{\sqrt{m}}$, where $\tilde{P}$ and $\tilde{Q}$ denote the distribution of $(\xi_{0},\xi_{1})$ and the corresponding normal distribution, respectively.

In fact, for given $n$, we can obtain a sharper bound on $\|\tilde{P}-\tilde{Q}\|_{TV}$ by numerical methods. Without loss of generality, assume $\epsilon_{i}=\epsilon_{0}$ and $\delta_{i}=\delta_{0}=O(1/n)$, so that $p_{0}=\frac{1-\delta_{0}}{1+e^{\epsilon_{0}}}$. For a fixed output $(\xi_{0},\xi_{1})=(k_{0},k_{1})$, we approximate its probability by integrating the normal probability density function over the unit cell around that point. Let $G(\cdot)$ be the cumulative distribution function of $\tilde{Q}$ and $h(k_{0},k_{1})=G(k_{0}+0.5,k_{1}+0.5)-G(k_{0}+0.5,k_{1}-0.5)-G(k_{0}-0.5,k_{1}+0.5)+G(k_{0}-0.5,k_{1}-0.5)$; then

$\|\tilde{P}-\tilde{Q}\|_{TV}=\sup_{(k_{0},k_{1})}|\mathbb{P}(\xi_{0}=k_{0},\xi_{1}=k_{1})-h(k_{0},k_{1})|.$ (14)
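Under the uniform assumption above, $(\xi_{0},\xi_{1},\xi_{2})$ is multinomial, so equation (14) can be evaluated directly. The following sketch computes the supremum for small $m$; its runtime grows quadratically in $m$, so it is intended for sanity checks rather than large-scale use.

```python
import numpy as np
from scipy.stats import multinomial, multivariate_normal

def tv_pointwise(m, p):
    """Evaluate eq. (14) for the sum of m iid MultiBern(p/2, p/2, 1-p) draws."""
    mean = np.array([m * p / 2, m * p / 2])
    cov = np.array([[m * p / 2 * (1 - p / 2), -m * p * p / 4],
                    [-m * p * p / 4, m * p / 2 * (1 - p / 2)]])
    mvn = multivariate_normal(mean, cov)
    G = lambda x, y: mvn.cdf(np.array([x, y]))
    worst = 0.0
    for k0 in range(m + 1):
        for k1 in range(m + 1 - k0):
            pmf = multinomial.pmf([k0, k1, m - k0 - k1], m, [p / 2, p / 2, 1 - p])
            h = (G(k0 + .5, k1 + .5) - G(k0 + .5, k1 - .5)
                 - G(k0 - .5, k1 + .5) + G(k0 - .5, k1 - .5))
            worst = max(worst, abs(pmf - h))
    return worst

print(tv_pointwise(m=50, p=0.6))
```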
Lemma 3

Let $p_{i}=\frac{2(1-\delta_{i})}{1+e^{\epsilon_{i}}}$. If $\bar{\mu}=\sum_{i=1}^{n-1}(\frac{p_{i}}{2},\frac{p_{i}}{2})^{\prime}$, $\mu_{0}=(1,0)^{\prime}+\bar{\mu}$ and $\mu_{1}=(0,1)^{\prime}+\bar{\mu}$, then $T(N(\mu_{0},\boldsymbol{\Sigma}),N(\mu_{1},\boldsymbol{\Sigma}))(\alpha)=\Phi\left(\Phi^{-1}(1-\alpha)-\frac{2}{\sqrt{\sum_{i=1}^{n-1}p_{i}}}\right)$, where

$\boldsymbol{\Sigma}=\sum_{i=1}^{n-1}\begin{pmatrix}\frac{p_{i}}{2}(1-\frac{p_{i}}{2})&-\frac{p_{i}^{2}}{4}\\ -\frac{p_{i}^{2}}{4}&\frac{p_{i}}{2}(1-\frac{p_{i}}{2})\end{pmatrix}.$
Theorem 2 (Enhanced Central Privacy Upper bound)

Assume $\rho_{0}$ and $\rho_{1}$ are defined as in equations (12) and (13). Then there exists $C>0$ such that

$T(\rho_{0},\rho_{1})\geq G_{\mu}\left(\alpha+\frac{C}{\sqrt{n-1}}\right)-\frac{C}{\sqrt{n-1}},$ (15)

where $G_{\mu}(\alpha)=\Phi(\Phi^{-1}(1-\alpha)-\mu)$ and $\mu=\sqrt{\frac{2}{\sum_{i=1}^{n}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}-\max_{i}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}}}$. When no ambiguity arises, we say the process approximately satisfies $\mu$-GDP.

Proof:

First, we analyze the scenario where the $n$-th data point differs. According to the definition of $(\Delta_{0},\Delta_{1},\Delta_{2})$,

$(\Delta_{0},\Delta_{1},\Delta_{2})=\begin{cases}(0,0,1)&\text{w.p. }\delta_{n};\\(1,0,0)&\text{w.p. }(1-\delta_{n})\frac{e^{\epsilon_{n}}}{1+e^{\epsilon_{n}}};\\(0,1,0)&\text{w.p. }(1-\delta_{n})\frac{1}{1+e^{\epsilon_{n}}}.\end{cases}$ (16)

When $\Delta_{2}=1$, $\rho_{0}$ and $\rho_{1}$ are indistinguishable, which implies $T(\rho_{0},\rho_{1})|_{\Delta_{2}=1}=1-\alpha$. Let $\rho^{\prime}_{0}=(1,0,0)^{\prime}+\sum_{i=1}^{n-1}MultiBern\left(p_{i}/2,p_{i}/2,1-p_{i}\right)$ and $\rho^{\prime}_{1}=(0,1,0)^{\prime}+\sum_{i=1}^{n-1}MultiBern\left(p_{i}/2,p_{i}/2,1-p_{i}\right)$ with $p_{i}=\frac{2(1-\delta_{i})}{1+e^{\epsilon_{i}}}$; then

$T(\rho_{0},\rho_{1})=\delta_{n}(1-\alpha)+(1-\delta_{n})T_{symm}(\rho^{\prime}_{0},\rho^{\prime}_{1}),$ (17)

where $T_{symm}(\rho^{\prime}_{0},\rho^{\prime}_{1})=\max\{T(\rho^{\prime}_{0},\rho^{\prime}_{1}),T(\rho^{\prime}_{1},\rho^{\prime}_{0})\}$. Assume $P\sim N(\mu_{0},\Sigma)$ and $Q\sim N(\mu_{1},\Sigma)$, where $\mu_{0},\mu_{1},\Sigma$ are the same as in Lemma 3. Let $\mu=\sqrt{\frac{2}{\sum_{i=1}^{n-1}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}}}$; according to equation (4),

$T(\rho^{\prime}_{0},P)\geq 1-\alpha-\|\rho^{\prime}_{0}-P\|_{TV},$
$T(\rho^{\prime}_{1},Q)\geq 1-\alpha-\|\rho^{\prime}_{1}-Q\|_{TV},$

then based on Fact 4,

$T(\rho^{\prime}_{0},Q)\geq\Phi\left(\Phi^{-1}(1-\alpha-\|\rho^{\prime}_{0}-P\|_{TV})-\mu\right)=F(\alpha).$

Reusing Fact 4, we can obtain that

$T(\rho^{\prime}_{0},\rho^{\prime}_{1})\geq 1-(1-F(\alpha))-\|\rho^{\prime}_{1}-Q\|_{TV}=F(\alpha)-\|\rho^{\prime}_{1}-Q\|_{TV}.$

Lemma 2 shows that there exists $C>0$ such that $\|\rho^{\prime}_{1}-Q\|_{TV}\leq\frac{C}{\sqrt{n-1}}$ and $\|\rho^{\prime}_{0}-P\|_{TV}\leq\frac{C}{\sqrt{n-1}}$. Hence

$T(\rho^{\prime}_{0},\rho^{\prime}_{1})\geq G_{\mu}\left(\alpha+\frac{C}{\sqrt{n-1}}\right)-\frac{C}{\sqrt{n-1}}.$

Then

$T(\rho_{0},\rho_{1})\geq\delta_{n}(1-\alpha)+(1-\delta_{n})\left(G_{\mu}\left(\alpha+\frac{C}{\sqrt{n-1}}\right)-\frac{C}{\sqrt{n-1}}\right).$

Since any trade-off function $f$ satisfies $f(\alpha)\leq 1-\alpha$, it follows that

$T(\rho_{0},\rho_{1})\geq G_{\mu}\left(\alpha+\frac{C}{\sqrt{n-1}}\right)-\frac{C}{\sqrt{n-1}}.$

Finally, taking into account the case where the $i$-th ($1\leq i\leq n$) data point differs in the neighboring datasets, the privacy bound is determined by the worst case, that is, $\mu=\sqrt{\frac{2}{\sum_{i=1}^{n}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}-\max_{i}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}}}$. $\square$

Comparison with Existing Results

We provide numerical evaluations of the privacy amplification effect under the fixed LDP settings in Table 2, given local privacy budgets $\epsilon^{\ell}\in[0.01,2]$. For comparison, we examine privacy amplification for a fixed $\epsilon^{\ell}$ while varying $n$ from $10^{3}$ to $10^{4}$, setting the central $\delta$ for shuffling to $10^{-4}$ for simplicity. To avoid misunderstanding, we repeat the first $10^{3}$ parameters as $n$ grows. Since the convergence rate in Lemma 2 is nearly $O(1/n)$ and can be neglected in numerical analysis, our focus lies in measuring $G_{\mu}$.

Name           | Distribution of $\epsilon^{\ell}=(\epsilon_{1}^{\ell},\cdots,\epsilon_{n}^{\ell})$
Unif 1         | $U(0.01,1)$
Unif 2         | $U(0.01,2)$
Constant       | $0.5$
Mixed Constant | $50\%~0.5+50\%~0.01$
Table 2: Distributions of LDP budgets $\epsilon^{\ell}$. $U(a,b)$ represents the uniform distribution ranging from $a$ to $b$.

To keep it concise, we use Fact 3 in the Appendix to compute the corresponding central $\epsilon$ and $\delta$ for Theorem 2. Baseline bounds for the privacy amplification effect include [Liu23] (Liu2023), [FMT22] (Feldman2022hiding), and [Erlingsson19] (Erlingsson2019amplification). [Liu23] provides bounds for the personalized scenario, while [FMT22] and [Erlingsson19] only consider a common $\epsilon^{\ell}$.
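For reproducibility, the sketch below computes $\mu$ from Theorem 2 and converts it to a central $(\epsilon,\delta)$ pair via Fact 3; the Unif 2 budgets are drawn as in Table 2, and the evaluation point $\epsilon=1$ is an assumption for the example.

```python
import numpy as np
from scipy.stats import norm

def gdp_mu(eps, delta):
    # mu of Theorem 2 for personalized (eps_i, delta_i).
    eps, delta = np.asarray(eps, float), np.asarray(delta, float)
    w = (1 - delta) / (1 + np.exp(eps))
    return np.sqrt(2.0 / (w.sum() - w.max()))

def central_delta(epsilon, mu):
    # delta(epsilon) for a mu-GDP mechanism (Fact 3).
    return (norm.cdf(-epsilon / mu + mu / 2)
            - np.exp(epsilon) * norm.cdf(-epsilon / mu - mu / 2))

# Example: n = 10^4 users with budgets drawn as in "Unif 2" of Table 2.
rng = np.random.default_rng(0)
eps = rng.uniform(0.01, 2.0, size=10_000)
mu = gdp_mu(eps, np.zeros_like(eps))
print(mu, central_delta(1.0, mu))
```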

The numerical results demonstrate the following: (i) Our bound applies to extreme privacy budgets, while [Liu23] requires that no $\epsilon_{i}$ be close to zero; however, it is natural to encounter user responses that contain no information, i.e., $\epsilon_{i}=0$. (ii) As the sample size $n$ increases, the amplification effect grows proportionally to the square root of $n$. (iii) Our privacy bounds significantly outperform existing ones in all the scenarios considered, even when the privacy parameters are identical.

Figure 2: Privacy Bounds for Varied Budgets

Application and Experiments

All the experiments are implemented on a workstation with an Intel Core i5-1155G7 processor running Windows 11.

Application to Mean and Frequency Estimation

Mean Estimation

The average function is a fundamental and commonly used mathematical operation with wide-ranging applications. In this section, we apply GSPA to the average function on synthetic data. We randomly divide the users into three groups: conservative, moderate, and liberal, with fractions determined by $f_{c},f_{m},f_{l}$. As reported in (Acquisti2005privacy), the default values in this experiment are $f_{c}=0.54$, $f_{m}=0.37$, $f_{l}=0.09$. For convenience, the privacy preferences of users in the conservative, moderate and liberal groups are $\epsilon_{C}$, $\epsilon_{M}$ and $\epsilon_{L}$, respectively. In the LDP case, the privacy preference of users in the liberal group is fixed at $\epsilon_{L}=1$, while the default values of $\epsilon_{C}$ and $\epsilon_{M}$ are set to $0.1$ and $0.5$, respectively.

Theorem 3

Algorithm 1 approximately preserves $\mu$-GDP for each user, where $\mu=\sqrt{\frac{2}{\sum_{i=1}^{n}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}-\max_{i}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}}}$.

Proof:

By the definition of the Laplace mechanism, the report of each user $i\in[n]$ satisfies $(\epsilon_{i},0)$-LDP. Combined with Theorem 2, Algorithm 1 approximately satisfies $\mu$-GDP with $\mu=\sqrt{\frac{2}{\sum_{i=1}^{n}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}-\max_{i}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}}}$ (here $\delta_{i}=0$). $\square$

Next, we simulate the accuracy under different privacy settings. To facilitate comparison, we fix $f_{l}=0.09$ and vary $f_{c}$ from $0.01$ to $0.5$ with $f_{m}=1-f_{l}-f_{c}$. Additionally, we generate $n=10{,}000$ privacy budgets for users based on the privacy preference rule. We assume each sample is drawn from a normal distribution $N(50,\sigma^{2})$ and then clipped to the range $[20,80]$. We repeat this procedure $1{,}000$ times to obtain a confidence interval. According to Fact 3, the privacy parameter $\mu$ under the shuffle model can be obtained for varying $\epsilon_{C}$. Figure 3(a) shows that an increase in the proportion of conservative users leads to a decrease in estimation accuracy. On the other hand, Figure 3(b) demonstrates that increasing the privacy budget improves accuracy.

Algorithm 1 Mean estimation with GSPA.
Input: Dataset $X=(x_{1},\ldots,x_{n})\in\mathbb{R}^{n}$, privacy budgets $\mathcal{S}=\{\epsilon_{1},\cdots,\epsilon_{n}\}$ for the users.
Output: $z\in\mathbb{R}$
1: for each $i\in[n]$ do
2:   $y_{i}\leftarrow x_{i}+Lap(\Delta f/\epsilon_{i})$
3: end for
4: Choose a random permutation $\pi:[n]\rightarrow[n]$
5: $z=\frac{1}{n}\sum_{i=1}^{n}y_{\pi(i)}$
6: return $z$
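A runnable Python counterpart of Algorithm 1 is sketched below; the clipping range $[20,80]$ mirrors the experiment, and taking the $\ell_{1}$ sensitivity $\Delta f$ as the width of that range is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_estimation_gspa(x, eps, lo=20.0, hi=80.0):
    x = np.clip(np.asarray(x, float), lo, hi)
    sens = hi - lo                                   # Delta f for one clipped report
    y = x + rng.laplace(scale=sens / np.asarray(eps, float))
    return rng.permutation(y).mean()                 # shuffle, then average

x = np.clip(rng.normal(50, 10, size=10_000), 20, 80)
eps = rng.choice([0.1, 0.5, 1.0], p=[0.54, 0.37, 0.09], size=10_000)
print(mean_estimation_gspa(x, eps))
```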
Figure 3: Impact of privacy parameter settings on MAE. (a) Impact of $f_{c}$ (Mean); (b) Impact of $\epsilon_{C}$ (Mean); (c) Impact of $f_{c}$ (Frequency); (d) Impact of $\epsilon_{C}$ (Frequency).

Frequency Estimation

In machine learning, frequency estimation is often used as a preprocessing step to understand the distribution and importance of different features or categories within a dataset. By accurately estimating the frequencies of various features or categories, it helps in feature selection, dimensionality reduction, and building effective models.

To obtain the dataset, a total of 10,000 records are generated for counting. Each record is encoded as a binary attribute. The proportion of records with value $1$ is determined by a density parameter $c$, which ranges from $0$ to $1$ (with a default value of $c=0.7$).

Theorem 4

Algorithm 2 approximately preserves $\mu$-GDP for each user, where $\mu=\sqrt{\frac{2}{\sum_{i=1}^{n}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}-\max_{i}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}}}$.

The proof of Theorem 4 is the same as that of Theorem 3. A direct calculation shows that $z$ is an unbiased estimator of $c$: since $\mathbb{E}(A)-B=\sum_{i}x_{i}\frac{e^{\epsilon_{i}}-1}{1+e^{\epsilon_{i}}}$ and $n-2B=\sum_{i}\frac{e^{\epsilon_{i}}-1}{1+e^{\epsilon_{i}}}$, taking the expectation over records drawn with density $c$ gives $\mathbb{E}(z)=c$. Similar to the average function, we adopt the same configuration of personalized privacy budgets.

Algorithm 2 Frequency estimation with GSPA
Input: Dataset $X=(x_{1},\ldots,x_{n})\in\{0,1\}^{n}$, privacy budgets $\mathcal{S}=\{\epsilon_{1},\cdots,\epsilon_{n}\}$ for the users.
Output: $z\in\mathbb{R}$
1: for each $i\in[n]$ do
2:   if $x_{i}=1$ then
3:     $y_{i}\leftarrow Ber\left(\frac{e^{\epsilon_{i}}}{1+e^{\epsilon_{i}}}\right)$
4:   else
5:     $y_{i}\leftarrow Ber\left(\frac{1}{1+e^{\epsilon_{i}}}\right)$
6:   end if
7: end for
8: Choose a random permutation $\pi:[n]\rightarrow[n]$
9: $A=\sum_{i=1}^{n}y_{\pi(i)}$
10: $B=\sum_{i=1}^{n}\frac{1}{1+e^{\epsilon_{\pi(i)}}}$
11: $z=\frac{A-B}{n-2B}$
12: return $z$
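Below is a runnable sketch of Algorithm 2: binary randomized response per user, a shuffle, and the debiasing step $z=(A-B)/(n-2B)$. The input data and budgets in the example are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def frequency_estimation_gspa(x, eps):
    x = np.asarray(x, int)
    eps = np.asarray(eps, float)
    p_keep = np.exp(eps) / (1 + np.exp(eps))   # P(report truthfully) per user
    flip = rng.random(len(x)) >= p_keep
    y = np.where(flip, 1 - x, x)               # randomized response
    y = rng.permutation(y)                     # shuffle before aggregation
    A = y.sum()
    B = (1.0 / (1 + np.exp(eps))).sum()        # expected false-positive mass
    return (A - B) / (len(x) - 2 * B)          # debiased estimate of c

x = (rng.random(10_000) < 0.7).astype(int)     # density c = 0.7
eps = rng.choice([0.1, 0.5, 1.0], p=[0.54, 0.37, 0.09], size=10_000)
print(frequency_estimation_gspa(x, eps))
```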

Personalized Private Stochastic Gradient Descent

Private stochastic gradient descent is a common method in deep learning (abadi2016deep). Personalized private stochastic gradient descent combines personalized differential privacy with stochastic gradient descent for model training and parameter updates while ensuring privacy protection. In the context of personalized differential privacy, the privacy of individual users must be protected, and direct use of raw data for parameter updates is not feasible.

The key idea of personalized differential privacy is to introduce personalized parameters into the differentially private mechanism to flexibly adjust the level of privacy protection. For the gradient descent algorithm, personalized differential privacy can be achieved by introducing noise during gradient computation.

Theorem 5

Algorithm 3 approximately satisfies $\sqrt{T}\mu$-GDP with $\mu=\sqrt{\frac{2}{\sum_{i=1}^{n}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}-\max_{i}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}}}$.

Proof:

For arbitrary $j\in[m]$, client $j$ satisfies $(\epsilon_{j},\delta_{j})$-LDP before sending its report to the shuffler, by the definition of the Gaussian mechanism. By Theorem 2, the process preserves $\mu$-GDP after shuffling with $\mu=\sqrt{\frac{2}{\sum_{i=1}^{n}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}-\max_{i}\frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}}}$. Combined with Fact 3, it satisfies $\sqrt{T}\mu$-GDP under $T$-fold composition.

$\square$

Dataset and Implementation

The MNIST dataset (lecun1998gradient) for handwritten digit recognition consists of 60,000 training images and 10,000 test images. Each sample represents a $28\times 28$ vector generated from a handwritten image, where the independent variable corresponds to the input vector and the dependent variable is the digit label ranging from 0 to 9. In our experiments, we consider a scenario with $m$ clients, where each client has $n/m$ samples. For simplicity, we train a simple classifier using a feed-forward neural network with ReLU activation units and a softmax output layer with 10 classes, corresponding to the 10 possible digits. The model is trained using cross-entropy loss with an initial PCA input layer of 60 components. At each step of the shuffled SGD, we choose one client at random without replacement. The parameters of the experimental setup are listed in Table 3. This experiment is designed to demonstrate use cases of the shuffle model and therefore does not focus on comparison with previous results; for comparative results, please refer to Figure 2.

Parameter Selection

As a result, our approach achieves an accuracy of 96.78% on the test dataset after approximately 50 epochs. This result is consistent with the findings of a vanilla neural network (lecun1998gradient) trained on the same MNIST dataset. By employing this methodology, we can effectively train a simple classifier that achieves high accuracy in recognizing handwritten digits from MNIST.

Parameter         | Value       | Explanation
$C$               | $2$         | Clipping bound
$\delta^{\ell}$   | $10^{-5}$   | Indistinguishability parameter
$\epsilon^{\ell}$ | $[0.01,2]$  | Privacy budget
$\eta$            | $0.05$      | Step size of the gradient
$m$               | $100$       | Number of clients
$n$               | $60{,}000$  | Total number of training samples
$T$               | $50$        | Number of epochs
Table 3: Experiment setting for the shuffled personalized SGD on the MNIST dataset.
Figure 4: Comparison of Test Accuracy with Different Noise Distributions

Utility of GSPA

We evaluate the utility of GSPA with $\epsilon^{\ell}$ drawn from Table 2 on the MNIST dataset, and introduce Unif 3 to denote the distribution $U(0.5,1)$. The numerical results indicate that Unif 3 exhibits the best accuracy, which aligns with expectations since it corresponds to larger privacy budgets. Although the constant scenario offers stronger privacy protection than Unif 2, it actually achieves better accuracy. One possible explanation for this interesting observation is that large differences among the privacy parameters can cause instability in the gradient iterations.

Algorithm 3 SGD with GSPA
Input: Dataset $X=(x_{1},\ldots,x_{n})$, loss function $\mathcal{L}(\boldsymbol{\theta},x)$, initial point $\boldsymbol{\theta}_{0}$, learning rate $\eta$, number of epochs $T$, privacy budgets $\mathcal{S}=\{(\epsilon_{1},\delta_{1}),\cdots,(\epsilon_{n},\delta_{n})\}$, batch size $m$ and gradient norm bound $C$.
Output: $\hat{\boldsymbol{\theta}}_{T}$
1: Split $[n]$ into $n/m$ disjoint subsets $S_{1},\cdots,S_{n/m}$ of equal size $m$
2: Choose an arbitrary initial point $\hat{\boldsymbol{\theta}}_{0}$
3: Choose a random permutation $\pi:[n/m]\rightarrow[n/m]$
4: for each $t\in[T]$ do
5:   $\tilde{\boldsymbol{\theta}}_{0}=\hat{\boldsymbol{\theta}}_{t-1}$
6:   for each $i\in[n/m]$ do
7:     $\sigma=\frac{2C}{m}\frac{\sqrt{2\log(1.25/\delta_{\pi(i)})}}{\epsilon_{\pi(i)}}$
8:     $\boldsymbol{b}_{i}\sim N(0,\sigma^{2}\boldsymbol{I}_{d})$
9:     for each $j\in S_{\pi(i)}$ do
10:       Compute the gradient: $\boldsymbol{g}_{i}^{j}=\nabla\mathcal{L}(\tilde{\boldsymbol{\theta}}_{i-1},x_{j})$
11:     end for
12:     Average and clip to norm $C$: $\boldsymbol{g}_{i}=\frac{\sum_{j}\boldsymbol{g}_{i}^{j}}{m}$, $\tilde{\boldsymbol{g}}_{i}=\boldsymbol{g}_{i}/\max(1,\|\boldsymbol{g}_{i}\|_{2}/C)$
13:     $\tilde{\boldsymbol{\theta}}_{i}=\tilde{\boldsymbol{\theta}}_{i-1}-\eta(\tilde{\boldsymbol{g}}_{i}+\boldsymbol{b}_{i})$
14:   end for
15:   $\hat{\boldsymbol{\theta}}_{t}=\tilde{\boldsymbol{\theta}}_{n/m}$
16: end for
17: return $\hat{\boldsymbol{\theta}}_{T}$
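For concreteness, the following sketch implements one inner-loop update (lines 7-13 of Algorithm 3) for a single client's batch; the default parameter values mirror Table 3, and the synthetic batch of gradients is a hypothetical stand-in for real per-example gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_batch_step(theta, grads, eps_i, delta_i, C=2.0, eta=0.05):
    # grads: per-example gradients of shape (m, d) for the chosen client's batch.
    m = grads.shape[0]
    g = grads.mean(axis=0)                          # line 12: average ...
    g = g / max(1.0, np.linalg.norm(g) / C)         # ... then clip to norm C
    sigma = (2 * C / m) * np.sqrt(2 * np.log(1.25 / delta_i)) / eps_i  # line 7
    b = rng.normal(scale=sigma, size=g.shape)       # line 8: Gaussian noise
    return theta - eta * (g + b)                    # line 13: gradient step

theta = np.zeros(60)
grads = rng.normal(size=(100, 60))                  # hypothetical batch of gradients
print(private_batch_step(theta, grads, eps_i=0.5, delta_i=1e-5)[:3])
```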

Conclusion and Future Work

This work focuses on privacy amplification in the shuffle model. To address the trade-off between privacy and utility, we propose the GSPA framework, which achieves higher accuracy while reducing the privacy loss by at least 33% compared with existing methods.

In future research, we plan to incorporate additional privacy definitions such as Rényi differential privacy (Girgis2021renyi). Moreover, we acknowledge the significance of enhancing techniques for specific data types such as images, speech, and other modalities. This entails developing specialized privacy metrics, differential privacy mechanisms, and model training algorithms that offer more accurate and efficient privacy protection.

Appendix

$f$-DP

Here are several important properties of $f$-DP. We present these facts directly for brevity; for comprehensive proofs, please refer to the related article (Dong2022gaussian).

Fact 1

$(\epsilon,\delta)$-DP is equivalent to $f_{\epsilon,\delta}$-DP, where

$f_{\epsilon,\delta}(\alpha)=\max\{0,\,1-\delta-e^{\epsilon}\alpha,\,e^{-\epsilon}(1-\delta-\alpha)\}.$ (19)
Fact 2

$f$-DP satisfies the post-processing property: if a mechanism $M$ is $f$-DP, then its post-processing $Proc\circ M$ is also $f$-DP.

Fact 3

($\mu$-GDP) An $f$-DP mechanism is called $\mu$-GDP if $f=T(N(0,1),N(\mu,1))=\Phi(\Phi^{-1}(1-\alpha)-\mu)$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard Gaussian distribution $N(0,1)$. A mechanism is $\mu$-GDP if and only if it is $(\epsilon,\delta(\epsilon))$-DP for all $\epsilon\geq 0$, where

$\delta(\epsilon)=\Phi\left(-\frac{\epsilon}{\mu}+\frac{\mu}{2}\right)-e^{\epsilon}\Phi\left(-\frac{\epsilon}{\mu}-\frac{\mu}{2}\right).$

In particular, if a mechanism is $\mu$-GDP, then it is $k\mu$-GDP for groups of size $k$, and the $n$-fold composition of $\mu_{i}$-GDP mechanisms is $\sqrt{\mu_{1}^{2}+\cdots+\mu_{n}^{2}}$-GDP.

Fact 4

Suppose $T(P,R)\geq f$ and $T(Q,R)\geq g$; then $T(P,Q)\geq f\circ g$, where $(f\circ g)(\alpha)=g(1-f(\alpha))$.

The relationship between $f$-DP and traditional DP has been illustrated from the perspective of hypothesis testing (Dong2022gaussian). It provides a visual representation of how the choice of parameter $\mu$ in $\mu$-GDP relates to the strength of privacy protection.

The $(\epsilon_{i},\delta_{i})$-PLDP mechanism can be dominated by the following hypothesis testing problem (Kairouz2015composition). This forms the foundation for the subsequent analysis.

Lemma 4 (KOV15)

Let $\mathcal{R}^{(i)}:\mathcal{D}\rightarrow\mathcal{S}$ be an $(\epsilon_{i},\delta_{i})$-DP local randomizer, and $x_{0},x_{1}\in\mathcal{D}$. Then there exist two quaternary random variables $\tilde{X_{0}}$ and $\tilde{X_{1}}$ such that $\mathcal{R}^{(i)}(x_{0})$ and $\mathcal{R}^{(i)}(x_{1})$ can be viewed as post-processings of $\tilde{X_{0}}$ and $\tilde{X_{1}}$, respectively. In detail,

$P(\tilde{X_{0}}=x)=\begin{cases}\delta_{i}&\text{if }x=A,\\ \frac{(1-\delta_{i})e^{\epsilon_{i}}}{1+e^{\epsilon_{i}}}&\text{if }x=0,\\ \frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}&\text{if }x=1,\\ 0&\text{if }x=B,\end{cases}$

and

$P(\tilde{X_{1}}=x)=\begin{cases}0&\text{if }x=A,\\ \frac{1-\delta_{i}}{1+e^{\epsilon_{i}}}&\text{if }x=0,\\ \frac{(1-\delta_{i})e^{\epsilon_{i}}}{1+e^{\epsilon_{i}}}&\text{if }x=1,\\ \delta_{i}&\text{if }x=B.\end{cases}$

Proof of Theorem 1

Proof:

Formally, for each $i\in\{2,\cdots,n\}$, let $p_{i}=\frac{2(1-\delta_{i})}{1+e^{\epsilon_{i}}}$. We define random variables $Y_{1,i}^{0}$, $Y_{1,i}^{1}$ and $Y_{i}$ as follows:

$Y_{1,i}^{0}=\begin{cases}0&\text{w.p. }e^{\epsilon_{i}}\frac{p_{i}}{2},\\ 1&\text{w.p. }\frac{p_{i}}{2},\\ 2&\text{w.p. }1-e^{\epsilon_{i}}\frac{p_{i}}{2}-\frac{p_{i}}{2}.\end{cases}$ (20)

$Y_{1,i}^{1}=\begin{cases}0&\text{w.p. }\frac{p_{i}}{2},\\ 1&\text{w.p. }e^{\epsilon_{i}}\frac{p_{i}}{2},\\ 2&\text{w.p. }1-e^{\epsilon_{i}}\frac{p_{i}}{2}-\frac{p_{i}}{2}.\end{cases}$ (21)

and

$Y_{i}=\begin{cases}0&\text{w.p. }\frac{p_{i}}{2},\\ 1&\text{w.p. }\frac{p_{i}}{2},\\ 2&\text{w.p. }1-p_{i}.\end{cases}$ (22)

We consider the case of the $t$-th iteration. Given a dataset $X_{b}$ for $b\in\{0,1\}$, we generate $n$ samples from $\{0,1,2\}$ in the following way. Client number one reports a sample from $Y_{1,i}^{b}$; each client $i$ ($i=2,\cdots,n$) reports an independent sample from $Y_{i}$. We then shuffle the reports randomly. Let $\rho_{b}$ denote the resulting distribution over $\{0,1,2\}^{n}$, and count the total numbers of 0s and 1s. Note that a vector containing a permutation of the users' responses contains no more information than the numbers of 0s and 1s, so we can consider these two representations as equivalent.
We claim that there exists a post-processing function $proc(\cdot)$ such that for $y$ sampled from $\rho_{b}$, $proc(y)$ is distributed identically to $\mathcal{A}_{S}(X_{b})$. To see this, let $\pi$ be a uniformly random permutation of $\{1,\cdots,n\}$. For every $t\in\{1,\cdots,n\}$, given the hidden permutation $\pi$, we can generate a sample from $\mathcal{A}_{S}(X_{b})$ by sequentially transforming $proc(y_{t})$ to obtain the correct mixture components, then sampling from the corresponding mixture component. Specifically, by Lemma 1,

$z_{t}=\begin{cases}\mathcal{R}^{(t)}(z_{1:t-1},x_{1}^{0})&\text{if }y_{t}=0;\\ \mathcal{R}^{(t)}(z_{1:t-1},x_{1}^{1})&\text{if }y_{t}=1;\\ \mathcal{R}^{(t)}(z_{1:t-1},x_{\pi(t)})&\text{if }y_{t}=2.\end{cases}$ (23)

By our assumption, this produces a sample $z_{t}$ from $\mathcal{R}^{(t)}(x_{\pi(t)})$. It is easy to see that the resulting random variable $(z,y)$ has the property that for input $b\in\{0,1\}$, its marginal distribution over $\mathcal{S}$ is the same as $\mathcal{A}_{S}(X_{b})$ and its marginal distribution over $\{0,1,2\}^{n}$ is $\rho_{b}$. The difficulty lies in the fact that, conditioned on a particular instantiation $y=v$, the permutation $\pi|_{y=v}$ is not independent of $b$. Note that if $v_{t}=0$ or $1$, the corresponding $\mathcal{Q}^{0(t)}_{1}(z_{1:t-1})$ or $\mathcal{Q}_{1}^{1(t)}(z_{1:t-1})$ is independent of $\pi$. Therefore, in order to do the appropriate post-processing, it suffices to know the permutation $\pi$ restricted to the set of users who sampled $2$, $K=\pi(\{i:y_{i}=2\})$. The set $K$ is independent of $b$ since $Y_{1,i}^{0}$ and $Y_{1,i}^{1}$ have the same probability of sampling $2$. The probability of being included in $K$ is identical for each $i\in\{2,\cdots,n\}$, and slightly smaller for the first user. Since the sampling of $z$ given $y$ only needs $K$, we can sample from $z|_{(y,K)=(v,J)}$ without knowing $b$. This conditional sampling is exactly the post-processing step that we claimed.

We now analyze the divergence between $\rho_{0}$ and $\rho_{1}$. The shuffling step implies that $\rho_{0}$ and $\rho_{1}$ are symmetric, so the divergence between $\rho_{0}$ and $\rho_{1}$ equals the divergence between the distributions of the counts of $0$s and $1$s. The decomposition in equation (11) implies that the divergence between $\mathcal{A}_{S}(X_{0})$ and $\mathcal{A}_{S}(X_{1})$ can be dominated by the divergence between $\rho_{0}$ and $\rho_{1}$, where $\Delta_{2}\sim Bern(\delta_{n})$, $\Delta_{0}\sim Bin(1-\Delta_{2},\frac{e^{\epsilon_{n}}}{1+e^{\epsilon_{n}}})$, $\Delta_{1}=1-\Delta_{0}-\Delta_{2}$, and $MultiBern(\theta_{1},\cdots,\theta_{d})$ represents a $d$-dimensional Bernoulli distribution with $\sum_{j=1}^{d}\theta_{j}=1$. $\square$

Proof of Lemma 3

Proof:

Since $T(N(\boldsymbol{\mu}_{0},\boldsymbol{\Sigma}),N(\boldsymbol{\mu}_{1},\boldsymbol{\Sigma}))$ equals

$\Phi\left(\Phi^{-1}(1-\alpha)-\sqrt{(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0})^{\prime}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0})}\right),$

by the properties of the normal distribution, the key is to calculate $(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0})^{\prime}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0})$. Let $v_{1}=\sum_{i=1}^{n-1}p_{i}$ and $v_{2}=\sum_{i=1}^{n-1}p_{i}^{2}$; then

$\boldsymbol{\Sigma}=\begin{pmatrix}\frac{v_{1}}{2}-\frac{v_{2}}{4}&-\frac{v_{2}}{4}\\ -\frac{v_{2}}{4}&\frac{v_{1}}{2}-\frac{v_{2}}{4}\end{pmatrix},$

and

$\boldsymbol{\Sigma}^{-1}=\begin{pmatrix}\frac{2v_{1}-v_{2}}{v_{1}^{2}-v_{1}v_{2}}&\frac{v_{2}}{v_{1}^{2}-v_{1}v_{2}}\\ \frac{v_{2}}{v_{1}^{2}-v_{1}v_{2}}&\frac{2v_{1}-v_{2}}{v_{1}^{2}-v_{1}v_{2}}\end{pmatrix}.$

By simple calculation, we can obtain that

$(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0})^{\prime}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0})=(-1,1)\boldsymbol{\Sigma}^{-1}(-1,1)^{\prime}=\frac{4}{\sum_{i=1}^{n-1}p_{i}}.$

Substituting $\mu=\sqrt{\frac{4}{\sum_{i=1}^{n-1}p_{i}}}$ yields the claim. $\square$

Acknowledgements

We would like to thank the anonymous reviewers for generously dedicating their time and expertise to evaluate our manuscript with insightful comments. This work is partially supported by JST CREST JPMJCR21M2, JSPS KAKENHI Grant Numbers JP22H00521, JP22H03595, and JP21K19767, and JST/NSF Joint Research SICORP JPMJSC2107.

References

\nobibliography{aaai22}

\bibentry{abadi2016deep}.
\bibentry{balle2019privacy}.
\bibentry{berry1941accuracy}.
\bibentry{Bittau2017prochlo}.
\bibentry{cheu2019distributed}.
\bibentry{Dong2022gaussian}.
\bibentry{Dwork2014algorithmic}.
\bibentry{Erlingsson2019amplification}.
\bibentry{Esseen1942}.
\bibentry{Feldman2022hiding}.
\bibentry{GDDSK21federated}.
\bibentry{Girgis2021renyi}.
\bibentry{JZT2015}.
\bibentry{Kairouz2015composition}.
\bibentry{KS11}.
\bibentry{lecun1998gradient}.
\bibentry{liu2021flame}.
\bibentry{Liu2023}.
\bibentry{NCW21}.