
Algorithmically Effective Differentially Private Synthetic Data

Yiyun He Department of Mathematics, University of California Irvine, Irvine, CA 92697 [email protected] Roman Vershynin Department of Mathematics, University of California Irvine, Irvine, CA 92697 [email protected]  and  Yizhe Zhu Department of Mathematics, University of California Irvine, Irvine, CA 92697 [email protected]
Abstract.

We present a highly effective algorithmic approach for generating ε\varepsilon-differentially private synthetic data in a bounded metric space with near-optimal utility guarantees under the 1-Wasserstein distance. In particular, for a dataset 𝒳\mathcal{X} in the hypercube [0,1]d[0,1]^{d}, our algorithm generates a synthetic dataset 𝒴\mathcal{Y} such that the expected 1-Wasserstein distance between the empirical measures of 𝒳\mathcal{X} and 𝒴\mathcal{Y} is O((εn)1/d)O((\varepsilon n)^{-1/d}) for d2d\geq 2, and is O(log2(εn)(εn)1)O(\log^{2}(\varepsilon n)(\varepsilon n)^{-1}) for d=1d=1. The accuracy guarantee is optimal up to a constant factor for d2d\geq 2, and up to a logarithmic factor for d=1d=1. Our algorithm has a fast running time of O(εdn)O(\varepsilon dn) for all d1d\geq 1 and demonstrates improved accuracy compared to the method in [12] for d2d\geq 2.

1. Introduction

Differential privacy has become the benchmark for privacy protection in scenarios where vast amounts of data need to be analyzed. The aim of differential privacy is to prevent the disclosure of information about individual participants in the dataset. In simple terms, an algorithm that has a randomized output and produces similar results when given two adjacent datasets is considered to be differentially private. This method of privacy protection is increasingly being adopted and implemented in various fields, including the 2020 US Census [2, 29, 28] and numerous machine learning tasks [24].

A wide range of data computations can be performed in a differentially private manner, including regression [17], clustering [37], parameter estimation [21], stochastic gradient descent [36], and deep learning [1]. However, many existing works on differential privacy focus on designing algorithms for specific tasks and are restricted to queries that are predefined before use. This requires expert knowledge and often involves modifying existing algorithms.

One promising solution to this challenge is to generate a synthetic dataset similar to the original dataset with guaranteed differential privacy [27, 8, 31, 7, 10, 11, 12]. As any downstream tasks are based on the synthetic dataset, they can be performed without incurring additional privacy costs.

1.1. Private synthetic data

Mathematically, the problem of generating private synthetic data can be defined as follows. Let (Ω,ρ)(\Omega,\rho) be a metric space. Consider a dataset 𝒳=(X1,,Xn)Ωn\mathcal{X}=(X_{1},\dots,X_{n})\in\Omega^{n}. Our goal is to construct an efficient randomized algorithm that outputs differentially private synthetic data 𝒴=(Y1,,Ym)Ωm\mathcal{Y}=(Y_{1},\dots,Y_{m})\in\Omega^{m} such that the two empirical measures

μ𝒳=1ni=1nδXiandμ𝒴=1mi=1mδYi\mu_{\mathcal{X}}=\frac{1}{n}\sum_{i=1}^{n}\delta_{X_{i}}\quad\text{and}\quad\mu_{\mathcal{Y}}=\frac{1}{m}\sum_{i=1}^{m}\delta_{Y_{i}}

are close to each other. We measure the utility of the output by 𝔼W1(μ𝒳,μ𝒴)\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}), where W1(μ𝒳,μ𝒴)W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}) is the 1-Wasserstein distance, and the expectation is taken over the randomness of the algorithm. The Kantorovich-Rubinstein duality (see, e.g., [47]) gives an equivalent representation of the 1-Wasserstein distance between the two measures μ𝒳\mu_{\mathcal{X}} and μ𝒴\mu_{\mathcal{Y}}:

W1(μ𝒳,μ𝒴)=supLip(f)1(fdμ𝒳fdμ𝒴),W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})=\sup_{{\mathrm{Lip}(f)}\leq 1}\left(\int f\mathrm{d}\mu_{\mathcal{X}}-\int f\mathrm{d}\mu_{\mathcal{Y}}\right), (1.1)

where the supremum is taken over the set of all 11-Lipschitz functions on Ω\Omega. Since many machine learning algorithms are Lipschitz [48, 32, 15, 35], Equation (1.1) provides a uniform accuracy guarantee for a wide range of machine learning tasks performed on synthetic datasets whose empirical measure is close to μ𝒳\mu_{\mathcal{X}} in the 1-Wasserstein distance.

1.2. Main results

The most straightforward way to construct differentially private synthetic data is to add independent noise to the location of each data point. However, this method can result in a significant loss of data utility, as the amount of noise needed for privacy protection may be too large [20]. Another direct approach could be to add noise to the density function of the empirical measure of 𝒳\mathcal{X}, by dividing Ω\Omega into small subregions and perturbing the true counts in each subregion. However, Laplacian noise may perturb the count in a certain subregion to a negative value, causing the output to become a signed measure. To address this issue, we introduce an algorithmically effective method called the Private Measure Mechanism.

Private Measure Mechanism (PMM)

PMM makes the count zero if the noisy count in a subregion is negative. Instead of a single partition of Ω\Omega, we consider a collection of binary hierarchical partitions on Ω\Omega and add inhomogeneous noise to each level of the partition. However, the counts of two subregions do not always add up to the count of the region at a higher level. We develop an algorithm that enforces the consistency of counts in regions at different levels. PMM has O(εdn)O(\varepsilon dn) running time while the running time of the approach in [12] is polynomial in nn.

The accuracy analysis of PMM uses the hierarchical partitions to estimate the 1-Wasserstein distance in terms of the multi-scale geometry of Ω\Omega and the noise magnitude in each level of the partition. In particular, when Ω=[0,1]d\Omega=[0,1]^{d}, by optimizing the choice of the hierarchical partitions and noise magnitude, PMM achieves better accuracy compared to [12] for d2d\geq 2. The accuracy rate is optimal up to a constant factor for d2d\geq 2, and up to a logarithmic factor for d=1d=1. We state this in Theorem 1.1 below.

Hierarchical partitions have appeared in many previous works on the approximation of distributions under Wasserstein distances in a non-private setting, including [4, 18, 50]. In the differential privacy literature, hierarchical partitions are also closely related to the binary tree mechanism [22, 16] for differential privacy under continual observation. However, the accuracy analysis of the two mechanisms is significantly different. In addition, the TopDown algorithm in the 2020 census [3] also has a similar hierarchical structure and enforces consistency, but the accuracy analysis of the algorithm is not provided in [3].

Theorem 1.1 (PMM for data in a hypercube).

Let Ω=[0,1]d\Omega=[0,1]^{d} equipped with the \ell^{\infty} metric. PMM outputs an ε\varepsilon-differentially private synthetic dataset 𝒴\mathcal{Y} in time O(εdn)O(\varepsilon dn) such that

𝔼W1(μ𝒳,μ𝒴){Clog2(εn)(εn)1 if d=1,C(εn)1d if d2.\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\leq\left\{\begin{aligned} &C\log^{2}(\varepsilon n)(\varepsilon n)^{-1}&\quad\textrm{ if }d=1,\\ &C{(\varepsilon n)^{-\frac{1}{d}}}&\quad\textrm{ if }d\geq 2.\end{aligned}\right.

Private Signed Measure Mechanism (PSMM)

In addition to PMM, we introduce an alternative method, the Private Signed Measure Mechanism, that achieves optimal accuracy rate on [0,1]d[0,1]^{d} when d3d\geq 3 in poly(n)\text{poly}(n) time. The analysis of PSMM is not restricted to 1-Wasserstein distance, and it can be generalized to provide a uniform utility guarantee of other function classes.

We first partition the domain Ω\Omega into mm subregions Ω1,,Ωm\Omega_{1},\dots,\Omega_{m}. Perturbing the counts in each subregion with i.i.d. integer Laplacian noise gives an unbiased approximation of μ𝒴\mu_{\mathcal{Y}} by a signed measure ν\nu. Then we find the closest probability measure ν^\hat{\nu} under the bounded Lipschitz distance by solving a linear programming problem.

In the proof of accuracy for PSMM, one ingredient is to estimate the Laplacian complexity of the Lipschitz function class on Ω\Omega and connect it to the 1-Wasserstein distance. This type of argument is similar in spirit to the optimal matching problem for two sets of random points in a metric space [38, 39, 9]. When Ω=[0,1]d\Omega=[0,1]^{d}, PSMM achieves the optimal accuracy rate O((εn)1/d)O((\varepsilon n)^{-1/d}) for d3d\geq 3. For d=2d=2, PSMM achieves a near-optimal accuracy O(log(εn)(εn)1/2)O(\log(\varepsilon n)(\varepsilon n)^{-1/2}). For d=1d=1, the accuracy becomes O((εn)1/2)O((\varepsilon n)^{-1/2}).

For the case when d=2d=2, we believe that the bound in Corollary 3.7 could be improved to Clog(εn)/εnC\sqrt{\log(\varepsilon n)}/\sqrt{\varepsilon n} by replacing Dudley’s chaining bound in Proposition 3.2 with the generic chaining bound in [39, 19] involving the γ1\gamma_{1} and γ2\gamma_{2} functionals on Ω\Omega. We will not pursue this direction in this paper.

Comparison to previous results

[42] proved that it is NP-hard to generate private synthetic data on the Boolean cube which approximately preserves all two-dimensional marginals, assuming the existence of one-way functions. There exists a substantial body of work for differentially private synthetic data with guarantees limited to accuracy bounds for a finite set of specified queries [5, 40, 23, 43, 33, 46, 12, 13, 14].

[49] considered differentially private synthetic data in [0,1]d[0,1]^{d} with guarantees for any smooth queries with bounded partial derivatives of order KK, and achieved an accuracy of O(ε1nK2d+K)O(\varepsilon^{-1}n^{-\frac{K}{2d+K}}). Recently, [12] introduced a method based on superregular random walks to generate differentially private synthetic data with near-optimal guarantees in general compact metric spaces. In particular, when the dataset is in [0,1]d[0,1]^{d}, they obtain 𝔼W1(μ𝒳,μ𝒴)Clog32d(εn)(εn)1d\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\leq C\log^{\frac{3}{2d}}(\varepsilon n)(\varepsilon n)^{-\frac{1}{d}}. A corresponding lower bound of order n1/dn^{-1/d} was also proved in [12, Corollary 9.3]. PMM matches the lower bound up to a constant factor for d2d\geq 2, and up to a logarithmic factor for d=1d=1.

In terms of computational efficiency, PMM runs in time O(εdn)O(\varepsilon dn). This is more efficient compared to the algorithm in [12].

Organization of the paper

The rest of the paper is organized as follows. In Section 2, we introduce some background on differential privacy and distances between measures. Before presenting our main result, we first introduce and analyze the easier and more direct method, PSMM. In Section 3, we describe PSMM in detail and prove its privacy and accuracy for data in a bounded metric space, with detailed results provided for the case of the hypercube. In Section 4, we introduce PMM and analyze its privacy and accuracy. By optimizing the choice of noise parameters, we obtain the optimal accuracy on the hypercube with O(εdn)O(\varepsilon dn) running time, which proves Theorem 1.1.

Additional proofs are included in Appendix A. We use a variant of the Laplacian distribution, called the discrete Laplacian distribution, in PMM and PSMM. Its definition and properties are included in Appendix B.

2. Preliminaries

Differential Privacy

We use the following definition from [24]. A randomized algorithm \mathcal{M} provides ε\varepsilon-differential privacy if for any input datasets D,DD,D^{\prime} that differ in only one element (i.e., DD and DD^{\prime} are adjacent datasets) and for any measurable set Srange()S\subseteq\mathrm{range}(\mathcal{M}), we have

{(D)S}{(D)S}eε.\frac{\mathbb{P}\left\{\mathcal{M}(D)\in S\rule{0.0pt}{8.53581pt}\right\}}{\mathbb{P}\left\{\mathcal{M}(D^{\prime})\in S\rule{0.0pt}{8.53581pt}\right\}}\leq e^{\varepsilon}.

Here the probability is taken over the randomness of \mathcal{M}.

Wasserstein distance

Consider a metric space (Ω,ρ)(\Omega,\rho) with two probability measures μ,ν\mu,\nu. Then the 1-Wasserstein distance (see e.g., [47] for more details) between them is defined as

W1(μ,ν):=infγΓ(μ,ν)Ω×Ωρ(x,y)dγ(x,y),W_{1}(\mu,\nu)\mathrel{\mathop{\mathchar 58\relax}}=\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{\Omega\times\Omega}\rho(x,y)\mathrm{d}\gamma(x,y),

where Γ(μ,ν)\Gamma(\mu,\nu) is the set of all couplings of μ\mu and ν\nu.

Bounded Lipschitz distance

Let (Ω,ρ)(\Omega,\rho) be a bounded metric space. The Lipschitz norm of a function ff is defined as

fLip:=max{Lip(f),fdiam(Ω)},\mathinner{\!\left\lVert f\right\rVert}_{\mathrm{Lip}}\mathrel{\mathop{\mathchar 58\relax}}=\max\left\{\mathrm{Lip}(f),\;\frac{\mathinner{\!\left\lVert f\right\rVert}_{\infty}}{\operatorname{diam}(\Omega)}\right\},

where Lip(f)\mathrm{Lip}(f) is the Lipschitz constant of ff. Let \mathcal{F} be the set of all Lipschitz functions ff on Ω\Omega with fLip1\mathinner{\!\left\lVert f\right\rVert}_{\mathrm{Lip}}\leq 1. For signed measures μ,ν\mu,\nu, we define the bounded Lipschitz distance:

dBL(μ,ν):=supf(fdμfdν).d_{\mathrm{BL}}(\mu,\nu)\mathrel{\mathop{\mathchar 58\relax}}=\sup_{f\in\mathcal{F}}\left(\int f\mathrm{d}\mu-\int f\mathrm{d}\nu\right).

One can easily check that this is a metric. Moreover, in the special case where μ\mu and ν\nu are both probability measures, shifting ff by a constant does not change the value of fdμfdν\int f\mathrm{d}\mu-\int f\mathrm{d}\nu. Therefore, for a bounded domain Ω\Omega, we can always assume f(x0)=0f(x_{0})=0 for a fixed x0Ωx_{0}\in\Omega, so that fdiam(Ω)\mathinner{\!\left\lVert f\right\rVert}_{\infty}\leq\operatorname{diam}(\Omega) when computing the supremum in (1.1). This implies that the dBLd_{\mathrm{BL}}-metric coincides with the classical W1W_{1}-metric when μ,ν\mu,\nu are both probability measures on a bounded domain Ω\Omega:

W1(μ,ν)\displaystyle W_{1}(\mu,\nu) =supLip(f)1(fdμfdν)=supf(fdμfdν)=dBL(μ,ν).\displaystyle=\sup_{{\mathrm{Lip}(f)\leq 1}}\left(\int f\mathrm{d}\mu-\int f\mathrm{d}\nu\right)=\sup_{f\in\mathcal{F}}\left(\int f\mathrm{d}\mu-\int f\mathrm{d}\nu\right)=d_{\mathrm{BL}}(\mu,\nu). (2.1)

3. Private signed measure mechanism (PSMM)

We will first introduce PSMM, which is an easier and more intuitive approach. The procedure of PSMM is formally described in Algorithm 1. Note that in the output step of Algorithm 1, the size of the synthetic data mm^{\prime} depends on the rational approximation of the density function of ν^\hat{\nu}, and we discuss the details here. Let ν^1,,ν^m\hat{\nu}_{1},\dots,\hat{\nu}_{m} be the weights of the probability measure ν^\hat{\nu} on y1,,ymy_{1},\dots,y_{m}, respectively. We can choose rational numbers r1,,rmr_{1},\dots,r_{m} such that maxi[m]|riν^i|\max_{i\in[m]}|r_{i}-\hat{\nu}_{i}| is arbitrarily small. Let mm^{\prime} be the least common multiple of the denominators of r1,,rmr_{1},\dots,r_{m}; then we output the synthetic dataset 𝒴^\hat{\mathcal{Y}} containing mrim^{\prime}r_{i} copies of yiy_{i} for i=1,,mi=1,\dots,m.
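To make this rounding step concrete, here is a minimal Python sketch, assuming the weights of ν^\hat{\nu} are given as floating-point numbers. The function name measure_to_dataset and the parameter max_denom are hypothetical; max_denom controls how small maxi[m]|riν^i|\max_{i\in[m]}|r_{i}-\hat{\nu}_{i}| is.

```python
from fractions import Fraction
from math import lcm

def measure_to_dataset(points, weights, max_denom=1000):
    # Approximate each weight by a rational r_i with denominator at most
    # max_denom, take m' = lcm of the denominators, and output m'*r_i copies
    # of y_i.  (Illustrative only: the rounded weights need not sum exactly
    # to 1, so the output size can differ slightly from m'.)
    fracs = [Fraction(w).limit_denominator(max_denom) for w in weights]
    m_prime = lcm(*(f.denominator for f in fracs))
    synthetic = []
    for y, f in zip(points, fracs):
        synthetic.extend([y] * (m_prime * f.numerator // f.denominator))
    return synthetic

# Example: weights (1/3, 2/3) give m' = 3 and the dataset [y_1, y_2, y_2].
```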

Before analyzing the privacy and accuracy of PSMM, we introduce a useful complexity measure of a given function class, which quantifies the influence of the Laplacian noise on the function class.

Algorithm 1 Private Signed Measure Mechanism
Input: true data 𝒳=(x1,,xn)Ωn\mathcal{X}=(x_{1},\ldots,x_{n})\in\Omega^{n}, partition (Ω1,,Ωm)(\Omega_{1},\dots,\Omega_{m}) of Ω\Omega, privacy parameter ε>0\varepsilon>0.
Compute the true counts:

Compute the true count in each region ni=#{xjΩi:j[n]}n_{i}=\#\{x_{j}\in\Omega_{i}\mathrel{\mathop{\mathchar 58\relax}}j\in[n]\}.

Create a new dataset:

Choose any element yiΩiy_{i}\in\Omega_{i} independently of 𝒳\mathcal{X}, and let 𝒴\mathcal{Y} be the collection of nin_{i} copies of yiy_{i} for each i[m]i\in[m].

Add noise:

Perturb the empirical measure μ𝒴\mu_{\mathcal{Y}} of 𝒴\mathcal{Y} and obtain a signed measure ν\nu such that

ν({yi}):=(ni+λi)/n,\nu(\{y_{i}\})\mathrel{\mathop{\mathchar 58\relax}}=(n_{i}+\lambda_{i})/n,

where λiLap(1/ε)\lambda_{i}\sim\operatorname{Lap}_{\mathbb{Z}}(1/\varepsilon) are i.i.d. discrete Laplacian random variables.

Linear programming:

Find the closest probability measure ν^\hat{\nu} of ν\nu in dBLd_{\mathrm{BL}}-metric using Algorithm 2, and generate synthetic data 𝒴^\widehat{\mathcal{Y}} from ν^\hat{\nu}.

Output: synthetic data 𝒴^=(y1,,ym)Ωm\widehat{\mathcal{Y}}=(y_{1},\ldots,y_{m^{\prime}})\in\Omega^{m^{\prime}} for some integer mm^{\prime}.
Algorithm 2 Linear Programming
Input: A discrete signed measure ν\nu supported on 𝒴={y1,,ym}\mathcal{Y}=\{y_{1},\dots,y_{m}\}.
Compute the distances:

Compute the pairwise distances {yiyj,i>j}\{\|y_{i}-y_{j}\|_{\infty},i>j\}.

Solve the linear programming:

Solve the linear programming problem with 2m2+2m2m^{2}+2m variables and m+1m+1 constraints:

\begin{aligned}
\min\quad &\sum_{i,j=1}^{m}\|y_{i}-y_{j}\|_{\infty}(u_{ij}+u_{ij}^{\prime})+2\sum_{i=1}^{m}v_{i}\\
\text{s.t.}\quad &\sum_{j=1}^{m}(u_{ij}-u_{ij}^{\prime})+v_{i}+\tau_{i}\geq\nu(\{y_{i}\}), &&\forall\, i\leq m,\\
&\sum_{i=1}^{m}\tau_{i}=1,\\
&u_{ij},\,u_{ij}^{\prime},\,v_{i},\,\tau_{i}\geq 0, &&\forall\, i,j\leq m,\ i\neq j.
\end{aligned}
Output: a probability measure ν^\hat{\nu} with ν^({yi})=τi\hat{\nu}(\{y_{i}\})=\tau_{i}.
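This linear program can be handed to any off-the-shelf LP solver. Below is a minimal sketch using scipy.optimize.linprog, with the variables ordered as (u,u,v,τ)(u,u^{\prime},v,\tau); project_to_probability is a hypothetical name, and the diagonal variables uii,uiiu_{ii},u_{ii}^{\prime} are pinned to zero to respect the constraint iji\neq j.

```python
import numpy as np
from scipy.optimize import linprog

def project_to_probability(points, nu):
    # Sketch of Algorithm 2: find the probability weights tau on the support
    # points that are closest to the signed measure nu in the d_BL metric,
    # by solving the linear program above.  points has shape (m, d).
    pts = np.asarray(points, dtype=float).reshape(len(points), -1)
    m = pts.shape[0]
    D = np.max(np.abs(pts[:, None, :] - pts[None, :, :]), axis=2)  # ell_inf distances
    n_var = 2 * m * m + 2 * m                     # u, u', v, tau in that order
    c = np.concatenate([D.ravel(), D.ravel(), 2.0 * np.ones(m), np.zeros(m)])
    # inequality constraints: -(sum_j (u_ij - u'_ij) + v_i + tau_i) <= -nu_i
    A_ub = np.zeros((m, n_var))
    for i in range(m):
        A_ub[i, i * m:(i + 1) * m] = -1.0                  # -u_ij
        A_ub[i, m * m + i * m:m * m + (i + 1) * m] = 1.0   # +u'_ij
        A_ub[i, 2 * m * m + i] = -1.0                      # -v_i
        A_ub[i, 2 * m * m + m + i] = -1.0                  # -tau_i
    b_ub = -np.asarray(nu, dtype=float)
    A_eq = np.zeros((1, n_var))
    A_eq[0, 2 * m * m + m:] = 1.0                          # sum_i tau_i = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_var
    for i in range(m):                                     # enforce i != j
        bounds[i * m + i] = (0, 0)
        bounds[m * m + i * m + i] = (0, 0)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[2 * m * m + m:]                           # the weights tau_i
```

Once the weights τi\tau_{i} are available, the rational rounding sketched above converts them into an explicit synthetic dataset 𝒴^\widehat{\mathcal{Y}}.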

3.1. Laplacian complexity

Given the Kantorovich-Rubinstein duality (1.1), to control the W1W_{1}-distance between the original measure and the private measure, we need to describe how Lipschitz functions behave under Laplacian noise. As an analog of the worst-case Rademacher complexity [6, 25], we consider the worst-case Laplacian complexity. Such a worst-case complexity measure arises because the original dataset is deterministic, without any distributional assumption.

Definition 3.1 (Worst-case Laplacian complexity).

Let \mathcal{F} be a function class on a metric space Ω\Omega. The worst-case Laplacian complexity of \mathcal{F} is defined by

Ln():=supX1,,XnΩ𝔼[supf|1ni=1nλif(Xi)|],\displaystyle L_{n}(\mathcal{F})\mathrel{\mathop{\mathchar 58\relax}}=\sup_{X_{1},\dots,X_{n}\in\Omega}\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}\lambda_{i}f(X_{i})\right|\right], (3.1)

where λ1,,λnLap(1)\lambda_{1},\dots,\lambda_{n}\sim\operatorname{Lap}(1) are i.i.d. random variables.

Since Laplacian random variables are sub-exponential but not sub-gaussian, the Laplacian complexity is not equivalent to the Gaussian or Rademacher complexity; rather, it is related to the suprema of mixed tail processes [19] and the quadratic empirical process [34]. Our next proposition bounds Ln()L_{n}(\mathcal{F}) in terms of the covering numbers of \mathcal{F}. Its proof is a classical application of Dudley’s chaining method (see, e.g., [45]).

Proposition 3.2 (Bounding Laplacian complexity with Dudley’s entropy integral).

Suppose that (Ω,ρ)(\Omega,\rho) is a metric space and \mathcal{F} is a set of functions on Ω\Omega. Then

Ln()Cinfα>0(2α+1nαlog𝒩(,u,)du+1nαlog𝒩(,u,)du)L_{n}(\mathcal{F})\leq C\inf_{\alpha>0}\left(2\alpha+\frac{1}{\sqrt{n}}\int_{\alpha}^{\infty}\sqrt{{\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})}}\mathrm{d}u+\frac{1}{{n}}\int_{\alpha}^{\infty}{\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})}\mathrm{d}u\right)

where 𝒩(,u,)\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty}) is the covering number of \mathcal{F} and C>0C>0 is an absolute constant.

In particular, we are interested in the case where \mathcal{F} is the class of all bounded Lipschitz functions. One can find in [41], or in the more explicit bound of [26], that for the set \mathcal{F} of functions ff with fLip1\mathinner{\!\left\lVert f\right\rVert}_{\mathrm{Lip}}\leq 1, the covering number satisfies

𝒩(,u,)(8u)𝒩(Ω,u/2,ρ).\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})\leq\left(\frac{8}{u}\right)^{\mathcal{N}(\Omega,{u}/{2},\rho)}.

When Ω=[0,1]d\Omega=[0,1]^{d}, a better bound on the covering number for Lipschitz functions is available from [41, 48]:

𝒩(,u,)(22/u+1)2𝒩([0,1]d,u/2,),\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})\leq\left(2\left\lceil{2}/u\right\rceil+1\right)2^{\mathcal{N}\left([0,1]^{d},u/2,\|\cdot\|_{\infty}\right)},

which implies the following corollary.

Corollary 3.3 (Laplacian complexity for Lipschitz functions on the hypercube).

Let Ω=[0,1]d\Omega=[0,1]^{d} with the \|\cdot\|_{\infty} metric, and \mathcal{F} be the set of all Lipschitz functions ff on Ω\Omega with fLip1\mathinner{\!\left\lVert f\right\rVert}_{\mathrm{Lip}}\leq 1. We have

Ln(){Cn1/2 if d=1,Clognn1/2 if d=2,Cd1n1/d if d3.L_{n}(\mathcal{F})\leq\left\{\begin{aligned} &Cn^{-1/2}\quad&\textrm{ if }d=1,\\ &{C\log n}\cdot n^{-1/2}\quad&\textrm{ if }d=2,\\ &Cd^{-1}{n^{-{1}/{d}}}\quad&\textrm{ if }d\geq 3.\end{aligned}\right.

Discrete Laplacian complexity

Laplacian complexity can be useful for differential privacy algorithms based on the Laplacian mechanism [24]. However, since PSMM perturbs the counts in each subregion, it is more convenient for us to add integer-valued noise to the true counts. We will therefore use the worst-case discrete Laplacian complexity defined below:

L~n():=supX1,,XnΩ𝔼[supf|1ni=1nλif(Xi)|],\displaystyle\widetilde{L}_{n}(\mathcal{F})\mathrel{\mathop{\mathchar 58\relax}}=\sup_{X_{1},\dots,X_{n}\in\Omega}\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}\lambda_{i}f(X_{i})\right|\right], (3.2)

where λ1,,λnLap(1)\lambda_{1},\dots,\lambda_{n}\sim\operatorname{Lap}_{\mathbb{Z}}(1) are i.i.d. discrete Laplacian random variables.

In particular, Lap(1)\operatorname{Lap}_{\mathbb{Z}}(1) has a bounded sub-exponential norm; therefore, the proof of Proposition 3.2 works for discrete Laplacian random variables as well. Consequently, Corollary 3.3 also holds for L~n()\widetilde{L}_{n}(\mathcal{F}), with a different absolute constant CC.
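Sampling from a discrete Laplacian is straightforward: a standard construction (cf. [30]) writes it as the difference of two i.i.d. geometric random variables. The sketch below assumes the parametrization in which Lap(σ)\operatorname{Lap}_{\mathbb{Z}}(\sigma) has probability mass function proportional to exp(|k|/σ)\exp(-|k|/\sigma); the exact convention used in this paper is the one given in Appendix B.

```python
import numpy as np

def discrete_laplace(sigma, size=None, rng=None):
    # Discrete Laplacian with P(k) proportional to exp(-|k|/sigma), sampled as
    # the difference of two i.i.d. geometric random variables.
    rng = np.random.default_rng() if rng is None else rng
    p = 1.0 - np.exp(-1.0 / sigma)           # success probability of the geometrics
    g1 = rng.geometric(p, size=size) - 1     # numpy's geometric is supported on {1,2,...}
    g2 = rng.geometric(p, size=size) - 1
    return g1 - g2
```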

3.2. Privacy and Accuracy of Algorithm 1

The privacy guarantee of Algorithm 1 can be proved by checking the definition. The essence of the proof is the same as the classical Laplacian mechanism [24].

Proposition 3.4 (Privacy of Algorithm 1).

Algorithm 1 is ε\varepsilon-differentially private.

We now turn to accuracy. The linear program stated in Algorithm 2 has 2m2+2m2m^{2}+2m variables and m+1m+1 constraints, and can be solved in time polynomial in mm. We first show, in the next proposition, that Algorithm 2 indeed outputs the closest probability measure to ν\nu in the dBLd_{\mathrm{BL}}-distance.

Proposition 3.5.

For a discrete signed measure ν\nu on Ω\Omega, Algorithm 2 outputs the probability measure with the same support set that is closest to ν\nu in the dBLd_{\mathrm{BL}}-distance, with running time polynomial in mm.

Now we are ready to analyze the accuracy of Algorithm 1. In PSMM, independent Laplacian noise is added to the count of each sub-region. Therefore, the Laplacian complexity arises when considering the expected Wasserstein distance between the original empirical measure and the synthetic measure.

Theorem 3.6 (Accuracy of Algorithm 1).

Suppose (Ω1,,Ωm)(\Omega_{1},\dots,\Omega_{m}) is a partition of (Ω,ρ)(\Omega,\rho) and \mathcal{F} is the set of all functions with Lipschitz norm bounded by 11. Then the measure ν^\hat{\nu} generated from Algorithm 1 satisfies

𝔼W1(μ𝒳,ν^)maxidiam(Ωi)+2mεnL~m().\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\hat{\nu})\leq\max_{i}\operatorname{diam}(\Omega_{i})+\frac{2m}{\varepsilon n}\widetilde{L}_{m}(\mathcal{F}).

Note that diam(Ωi)m1/d\operatorname{diam}(\Omega_{i})\asymp m^{-1/d} can be satisfied when we take a partition of Ω=[0,1]d\Omega=[0,1]^{d} where each Ωi\Omega_{i} is a subcube of the same size. Using the formula above and the result of Laplacian complexity for the hypercube in Corollary 3.3, one can easily deduce the following result.

Corollary 3.7 (Accuracy of Algorithm 1 on the hypercube).

Take m=εnm=\lceil\varepsilon n\rceil and let (Ω1,,Ωm)(\Omega_{1},\dots,\Omega_{m}) be a partition of Ω=[0,1]d\Omega=[0,1]^{d} with the norm \|\cdot\|_{\infty}. Assume that diam(Ωi)m1/d\operatorname{diam}(\Omega_{i})\asymp m^{-1/d}. Then the measure ν^\hat{\nu} generated from Algorithm 1 satisfies

𝔼W1(μ𝒳,ν^){C(εn)12 if d=1,Clog(εn)(εn)12 if d=2,C(εn)1d if d3.\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\hat{\nu})\leq\left\{\begin{aligned} &C(\varepsilon n)^{-\frac{1}{2}}&\quad\textrm{ if }d=1,\\ &{C}\log(\varepsilon n)(\varepsilon n)^{-\frac{1}{2}}&\quad\textrm{ if }d=2,\\ &C{(\varepsilon n)^{-\frac{1}{d}}}&\quad\textrm{ if }d\geq 3.\end{aligned}\right.
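For concreteness, a minimal sketch of the counting and perturbation steps of PSMM for this equal-subcube partition of [0,1]d[0,1]^{d} is given below. It reuses the discrete_laplace sampler sketched above; noisy_cell_weights is a hypothetical name, and the output is the vector of (possibly negative) cell weights defining the signed measure ν\nu.

```python
import numpy as np

def noisy_cell_weights(X, eps):
    # X has shape (n, d) with entries in [0,1].  Partition [0,1]^d into k^d
    # equal subcubes with k^d >= m = ceil(eps * n), count the points in each
    # cell, and perturb each count with Lap_Z(1/eps) noise.
    n, d = X.shape
    m = int(np.ceil(eps * n))
    k = int(np.ceil(m ** (1.0 / d)))                 # subcubes of side length 1/k
    idx = np.minimum((X * k).astype(int), k - 1)     # cell index of each data point
    flat = np.ravel_multi_index(idx.T, (k,) * d)
    counts = np.bincount(flat, minlength=k ** d)
    noisy = counts + discrete_laplace(1.0 / eps, size=k ** d)
    return noisy / n
```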

4. Private measure mechanism (PMM)

4.1. Binary partition and noisy counts

A binary hierarchical partition of a set Ω\Omega of depth rr is a family of subsets Ωθ\Omega_{\theta} indexed by θ{0,1}r\theta\in\{0,1\}^{\leq r}, where

{0,1}k={0,1}0{0,1}1{0,1}k,k=0,1,2,\{0,1\}^{\leq k}=\{0,1\}^{0}\sqcup\{0,1\}^{1}\sqcup\cdots\sqcup\{0,1\}^{k},\quad k=0,1,2\dots,

and such that Ωθ\Omega_{\theta} is partitioned into Ωθ0\Omega_{\theta 0} and Ωθ1\Omega_{\theta 1} for every θ{0,1}r1\theta\in\{0,1\}^{\leq r-1}. By convention, the cube {0,1}0\{0,1\}^{0} consists of a single element \emptyset. We usually drop the subscript \emptyset and write nn instead of nn_{\emptyset}. When θ{0,1}j\theta\in\{0,1\}^{j}, we call jj the level of θ\theta. We can also encode a binary hierarchical partition of Ω\Omega in a binary tree of depth rr, where the root is labeled Ω\Omega and the jj-th level of the tree encodes the subsets Ωθ\Omega_{\theta} for θ\theta at level jj.

Let (Ωθ)θ{0,1}r(\Omega_{\theta})_{\theta\in\{0,1\}^{\leq r}} be a binary partition of Ω\Omega. Given true data (x1,,xn)Ωn(x_{1},\ldots,x_{n})\in\Omega^{n}, the true count nθn_{\theta} is the number of data points in the region Ωθ\Omega_{\theta}, i.e.

nθ|{i[n]:xiΩθ}|.n_{\theta}\coloneqq\mathinner{\!\left\lvert\left\{i\in[n]\mathrel{\mathop{\mathchar 58\relax}}\;x_{i}\in\Omega_{\theta}\right\}\right\rvert}.

We will convert true counts into noisy counts nθn^{\prime}_{\theta} by adding Laplacian noise; all regions on the same level will receive noise of the same expected magnitude. Formally, we set

nθ(nθ+λθ)+,whereλθLap(σj),n^{\prime}_{\theta}\coloneqq\left(n_{\theta}+\lambda_{\theta}\right)_{+},\quad\text{where}\quad\lambda_{\theta}\sim\operatorname{Lap}_{\mathbb{Z}}(\sigma_{j}),

and j{0,,r}j\in\{0,\ldots,r\} is the level of θ\theta. At this point, the magnitudes of the noise σj\sigma_{j} can be arbitrary.
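As an illustration for Ω=[0,1]\Omega=[0,1] with the dyadic binary partition of depth rr (the cells at level jj are the dyadic intervals of length 2j2^{-j}), the true and noisy counts can be computed as in the sketch below; it reuses the discrete_laplace sampler from Section 3, and hierarchical_noisy_counts is a hypothetical name.

```python
import numpy as np

def hierarchical_noisy_counts(x, r, sigmas, rng=None):
    # Level j has 2^j dyadic cells; n_theta is the number of points in the
    # corresponding interval, and n'_theta = max(n_theta + Lap_Z(sigma_j), 0).
    # sigmas has length r + 1, one noise magnitude per level.
    x = np.asarray(x, dtype=float)
    true_counts, noisy_counts = [], []
    for j in range(r + 1):
        k = 2 ** j
        cells = np.minimum((x * k).astype(int), k - 1)
        n_j = np.bincount(cells, minlength=k)             # true counts n_theta
        lam = discrete_laplace(sigmas[j], size=k, rng=rng)
        noisy_counts.append(np.maximum(n_j + lam, 0))     # (n_theta + lambda_theta)_+
        true_counts.append(n_j)
    return true_counts, noisy_counts
```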

4.2. Consistency

The true counts nθn_{\theta} are non-negative and consistent, i.e., the counts of subregions always add up to the count of the region:

nθ0+nθ1=nθfor all θ{0,1}r1.n_{\theta 0}+n_{\theta 1}=n_{\theta}\quad\text{for all }\theta\in\{0,1\}^{\leq r-1}.

The noisy counts nθn^{\prime}_{\theta} are non-negative, but not necessarily consistent. Algorithm 3 enforces consistency by adjusting the counts iteratively, from top to bottom. In the case of a deficit, when the sum of the two subregional counts is smaller than the count of the region, we increase both subregional counts; in the opposite case of a surplus, we decrease both. Apart from this requirement, we are free to distribute the deficit or surplus between the subregional counts.

It is convenient to state this requirement by considering a product partial order on +2\mathbb{Z}_{+}^{2}, where we declare that (a0,a1)(b0,b1)(a_{0},a_{1})\preceq(b_{0},b_{1}) if and only if a0b0a_{0}\leq b_{0} and a1b1a_{1}\leq b_{1}. We call the two vectors a,b2a,b\in\mathbb{Z}^{2} comparable if either aba\preceq b or bab\preceq a. Furthermore, L(a)L(a) denotes the line x+y=ax+y=a on the plane.

Algorithm 3 Consistency
Input: non-negative numbers (nθ)θ{0,1}r(n^{\prime}_{\theta})_{\theta\in\{0,1\}^{\leq r}}, where nn^{\prime} is a nonnegative integer.
set mnm\coloneqq n^{\prime}.
for j=0,,r1j=0,\ldots,r-1 do
     for θ{0,1}j\theta\in\{0,1\}^{j} do
         transform the vector (nθ0,nθ1)+2(n^{\prime}_{\theta 0},n^{\prime}_{\theta 1})\in\mathbb{Z}_{+}^{2} into any comparable vector (mθ0,mθ1)+2L(mθ)(m_{\theta 0},m_{\theta 1})\in\mathbb{Z}_{+}^{2}\cap L(m_{\theta}).
     end for
end for
Output: non-negative integers (mθ)θ{0,1}r(m_{\theta})_{\theta\in\{0,1\}^{\leq r}}.

At each step, Algorithm 3 uses a transformation fθ:+2+2L(mθ).f_{\theta}\mathrel{\mathop{\mathchar 58\relax}}\mathbb{Z}_{+}^{2}\to\mathbb{Z}_{+}^{2}\cap L(m_{\theta}). It can be chosen arbitrarily; the only requirement is that fθ(x)f_{\theta}(x) be comparable with xx. The comparability requirement is natural and non-restrictive. For example, the uniform transformation selects the closest point in the discrete interval +2L(mθ)\mathbb{Z}_{+}^{2}\cap L(m_{\theta}) in (say) the Euclidean metric. Alternatively, the proportional transformation selects the point in the discrete interval +2L(mθ)\mathbb{Z}_{+}^{2}\cap L(m_{\theta}) that is closest to the line that connects the input vector and the origin.
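A minimal sketch of Algorithm 3 with (a variant of) the uniform transformation follows: working top-down, the already-fixed parent count mθm_{\theta} is split between the two children by spreading the deficit or surplus as evenly as possible, which keeps the new pair comparable with the noisy pair. The name enforce_consistency is hypothetical, and the input is the list of per-level noisy counts.

```python
import numpy as np

def enforce_consistency(noisy_counts):
    # noisy_counts[j] is the array of noisy counts at level j (length 2^j).
    # Returns consistent counts m with m[j][2t] + m[j][2t+1] = m[j-1][t].
    r = len(noisy_counts) - 1
    m = [np.asarray(noisy_counts[0], dtype=int)]       # level 0: m = n'
    for j in range(r):
        parent = m[j]
        child = np.asarray(noisy_counts[j + 1], dtype=int).reshape(-1, 2)
        shift = (parent - child.sum(axis=1)) // 2      # half the deficit/surplus
        left = np.clip(child[:, 0] + shift, 0, parent)
        m.append(np.column_stack([left, parent - left]).ravel())
    return m
```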

4.3. Synthetic data

Algorithm 3 ensures that the output counts mθm_{\theta} are non-negative, integer, and consistent. They are also private since they are a function of the noisy counts nθn^{\prime}_{\theta}, which are private, as we prove in Theorem 4.1 below. Therefore, the counts mθm_{\theta} can be used to generate private synthetic data by putting mθm_{\theta} points in cell Ωθ\Omega_{\theta}. Algorithm 4 makes this formal.

Algorithm 4 Private Measure Mechanism
Input: true data 𝒳=(x1,,xn)Ωn\mathcal{X}=(x_{1},\ldots,x_{n})\in\Omega^{n}, noise magnitudes σ0,,σr>0\sigma_{0},\ldots,\sigma_{r}>0.
Compute true counts:

Let nθn_{\theta} be the number of data points in Ωθ\Omega_{\theta}.

Add noise:

Let nθ(nθ+λθ)+n^{\prime}_{\theta}\coloneqq(n_{\theta}+\lambda_{\theta})_{+}, where λθLap(σj)\lambda_{\theta}\sim\operatorname{Lap}_{\mathbb{Z}}(\sigma_{j}) are i.i.d. random variables,

Enforce consistency:

Convert the noisy counts (nθ)(n^{\prime}_{\theta}) to consistent counts (mθ)(m_{\theta}) using Algorithm 3.

Sample:

Choose any mθm_{\theta} points in each cell Ωθ\Omega_{\theta}, θ{0,1}r\theta\in\{0,1\}^{r} independently of 𝒳\mathcal{X}.

Output: the set of all these points as synthetic data 𝒴=(y1,,ym)Ωm\mathcal{Y}=(y_{1},\ldots,y_{m})\in\Omega^{m}.
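Putting the pieces together for Ω=[0,1]\Omega=[0,1] with the dyadic partition, a minimal end-to-end sketch of Algorithm 4 is shown below; pmm_synthetic_data is a hypothetical name, and it places the mθm_{\theta} synthetic points at the centers of the level-rr cells, which is one valid choice of the mθm_{\theta} points in each cell.

```python
import numpy as np

def pmm_synthetic_data(x, r, sigmas, rng=None):
    # PMM on [0,1]: noisy hierarchical counts, consistency, then m_theta copies
    # of the center of each level-r dyadic cell.  Uses the helpers sketched above.
    _, noisy = hierarchical_noisy_counts(x, r, sigmas, rng=rng)
    m = enforce_consistency(noisy)
    centers = (np.arange(2 ** r) + 0.5) / 2 ** r
    return np.repeat(centers, m[r])

# Example usage: taking sigma_j = (r + 1) / eps at every level gives
# sum_j 1/sigma_j = eps, hence eps-differential privacy by Theorem 4.1:
#   y = pmm_synthetic_data(x, r=7, sigmas=[(7 + 1) / 1.0] * 8)
```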

4.4. Privacy and accuracy of Algorithm 4

We first prove that Algorithm 4 is differentially private. The proof idea is similar to the classical Laplacian mechanism, but now the noise has a different scale at each level, so a more delicate calculation is needed.

Theorem 4.1 (Privacy of Algorithm 4).

The vector of noisy counts (nθ+λθ)(n_{\theta}+\lambda_{\theta}) in Algorithm 4 is ε\varepsilon-differentially private, where

ε=j=0r1σj.\varepsilon=\sum_{j=0}^{r}\frac{1}{\sigma_{j}}.

Consequently, the synthetic data 𝒴\mathcal{Y} generated by Algorithm 4 is ε\varepsilon-differentially private.

Having analyzed the privacy of the synthetic data, we now turn to its accuracy. It is determined by the magnitudes of the noise σj\sigma_{j} and by the multiscale geometry of the domain Ω\Omega. The latter is captured by the diameters of the regions Ωθ\Omega_{\theta}, specifically by their sum at each level, which we denote

Δjθ{0,1}jdiam(Ωθ)\Delta_{j}\coloneqq\sum_{\theta\in\{0,1\}^{j}}\operatorname{diam}(\Omega_{\theta}) (4.1)

and adopt the notation Δ1Δ0=diam(Ω).\Delta_{-1}\coloneqq\Delta_{0}=\operatorname{diam}(\Omega). In addition to Δj\Delta_{j}, the accuracy is affected by the resolution of the partition, which is the maximum diameter of the cells, denoted by

δmaxθ{0,1}rdiam(Ωθ).\delta\coloneqq\max_{\theta\in\{0,1\}^{r}}\operatorname{diam}(\Omega_{\theta}).
Theorem 4.2 (Accuracy of Algorithm 4).

Algorithm 4 that transforms true data 𝒳\mathcal{X} into synthetic data 𝒴\mathcal{Y} has the following expected accuracy in the Wasserstein metric:

𝔼W1(μ𝒳,μ𝒴)22nj=0rσjΔj1+δ.\operatorname{\mathbb{E}}W_{1}\left(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}\right)\leq\frac{2\sqrt{2}}{n}\sum_{j=0}^{r}\sigma_{j}\Delta_{j-1}+\delta.

Here μ𝒳\mu_{\mathcal{X}} and μ𝒴\mu_{\mathcal{Y}} are the empirical probability distributions on the true and synthetic data, respectively.

The privacy and accuracy guarantees of Algorithm 4 (Theorems 4.1 and 4.2) hold for any choice of noise levels σj\sigma_{j}. By optimizing σj\sigma_{j}, we can achieve the best accuracy for a given level of privacy.

Theorem 4.3 (Optimized accuracy).

With the optimal choice of magnitude levels (A.4), Algorithm 4 that transforms true data 𝒳\mathcal{X} into synthetic data 𝒴\mathcal{Y} is ε\varepsilon-differentially private, and has the following expected accuracy in the 11-Wasserstein distance:

𝔼W1(μ𝒳,μ𝒴)2εn(j=0rΔj1)2+δ.\operatorname{\mathbb{E}}W_{1}\left(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}\right)\leq\frac{\sqrt{2}}{\varepsilon n}\Big{(}\sum_{j=0}^{r}\sqrt{\Delta_{j-1}}\Big{)}^{2}+\delta.

Here μ𝒳\mu_{\mathcal{X}} and μ𝒴\mu_{\mathcal{Y}} are the empirical measures of the true and synthetic data, respectively.

Corollary 4.4 (Optimized accuracy for hypercubes).

When Ω=[0,1]d\Omega=[0,1]^{d} equipped with the \ell^{\infty} metric, with the optimal choice of magnitude levels (A.4) and the optimal choice of

r={log2(εn)1 if d=1,log2(εn) if d2,r=\left\{\begin{aligned} &\log_{2}(\varepsilon n)-1&\textrm{ if }d=1,\\ &\log_{2}(\varepsilon n)&\textrm{ if }d\geq 2,\end{aligned}\right.

we have

𝔼W1(μ𝒳,μ𝒴){log2(εn)εn, if d=1,(εn)1/d, if d2.\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\lesssim\left\{\begin{aligned} \frac{\log^{2}(\varepsilon n)}{\varepsilon n},\quad\textrm{ if }d=1,\\ (\varepsilon n)^{-1/d},\quad\textrm{ if }d\geq 2.\end{aligned}\right.
Remark 4.5 (Computational efficiency of Algorithm 4).

Since a binary hierarchical partition of depth rr has 2r2^{r} cells at the finest level (and at most 2r+12^{r+1} regions in total), the running time of Algorithm 4 is O(2r)O(2^{r}). When Ω=[0,1]d\Omega=[0,1]^{d}, with the same optimal choice of rr in Corollary 4.4, the running time of PMM becomes O(εdn)O(\varepsilon dn).

4.5. Proof of Theorem 4.2

For the proof of Theorem 4.2, we introduce a quantitative notion for the incomparability of two vectors on the plane. For vectors a,b+2a,b\in\mathbb{Z}_{+}^{2}, we define

flux(a,b):={0if a and b are comparable,min(|a1b1|,|a2b2|)otherwise.\operatorname{flux}(a,b)\mathrel{\mathop{\mathchar 58\relax}}=\begin{cases}0&\text{if $a$ and $b$ are comparable,}\\ \min\left(\,\mathinner{\!\left\lvert a_{1}-b_{1}\right\rvert},\,\mathinner{\!\left\lvert a_{2}-b_{2}\right\rvert}\right)&\text{otherwise}.\end{cases}
Lemma 4.6 (Flux as incomparability).

flux(a,b)\operatorname{flux}(a,b) is the \ell_{\infty}-distance from aa to the set of points that are comparable to bb.

For example, if a=(1,9)a=(1,9) and b=(6,7)b=(6,7), then flux(a,b)=2\operatorname{flux}(a,b)=2. Note that aa has a distance 22 to the vector (1,7)(1,7) which is comparable with bb.
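The definition of flux\operatorname{flux} translates directly into code; a small sketch (the helper name flux is hypothetical) reproducing the example above:

```python
def flux(a, b):
    # flux((a1, a2), (b1, b2)): 0 if the vectors are comparable in the product
    # order, otherwise min(|a1 - b1|, |a2 - b2|).
    comparable = (a[0] <= b[0] and a[1] <= b[1]) or (b[0] <= a[0] and b[1] <= a[1])
    return 0 if comparable else min(abs(a[0] - b[0]), abs(a[1] - b[1]))

assert flux((1, 9), (6, 7)) == 2   # the example above
```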

Lemma 4.7 (Flux as transfer).

Suppose we have two bins with a1a_{1} and a2a_{2} balls in them. Then one can achieve b1b_{1} and b2b_{2} balls in these bins by:

  1. (a)

first making the total number of balls correct by adding a total of (b1+b2)(a1+a2)(b_{1}+b_{2})-(a_{1}+a_{2}) balls to the two bins (or removing, if that number is negative);

  2. (b)

    then transferring flux((a1,a2),(b1,b2))\operatorname{flux}\left((a_{1},a_{2}),\,(b_{1},b_{2})\right) balls from one bin to the other.

For example, suppose that one bin has 11 ball and the other has 99. Then we can achieve 66 and 77 balls in these bins by first adding 33 balls to the first bin and then transferring 22 balls from the second bin to the first. As we noted above, 22 is the flux between the vectors (1,9)(1,9) and (6,7)(6,7).

Lemma 4.7 can be generalized to the hierarchical binary partition of Ω\Omega as follows.

Lemma 4.8.

Consider any data set 𝒳Ωn\mathcal{X}\in\Omega^{n}, and let (nθ)θ{0,1}r(n_{\theta})_{\theta\in\{0,1\}^{r}} be its counts. Consider any consistent vector of non-negative integers (mθ)θ{0,1}r(m_{\theta})_{\theta\in\{0,1\}^{r}}. Then one can transform 𝒳\mathcal{X} into a set 𝒵Ωm\mathcal{Z}\in\Omega^{m} that has counts (mθ)θ{0,1}r(m_{\theta})_{\theta\in\{0,1\}^{r}} by:

  1. (a)

first making the total number of points correct by adding a total of mnm-n points to Ω\Omega (or removing, if that number is negative);

  2. (b)

    then transferring flux((nθ0,nθ1),(mθ0,mθ1))\operatorname{flux}\left((n_{\theta 0},n_{\theta 1}),\,(m_{\theta 0},m_{\theta 1})\right) points from Ωθ0\Omega_{\theta 0} to Ωθ1\Omega_{\theta 1} or vice versa, for all j=0,,r1j=0,\ldots,r-1 and θ{0,1}j\theta\in\{0,1\}^{j}.

Combining the concept of the flux\operatorname{flux} and our algorithm, the following two lemmas are useful in the proof of Theorem 4.2.

Lemma 4.9.

In Algorithm 4, we have

flux((nθ0,nθ1),(mθ0,mθ1))max(|λθ0|,|λθ1|)\operatorname{flux}\left((n_{\theta 0},n_{\theta 1}),\,(m_{\theta 0},m_{\theta 1})\right)\leq\max\left(\,\mathinner{\!\left\lvert\lambda_{\theta 0}\right\rvert},\mathinner{\!\left\lvert\lambda_{\theta 1}\right\rvert}\right)

for all j=0,,r1j=0,\ldots,r-1 and θ{0,1}j\theta\in\{0,1\}^{j}.

Lemma 4.10.

For any finite multisets UVU\subset V such that all elements in UU are from Ω\Omega, one has

W1(μU,μV)|VU||V|diam(Ω).W_{1}(\mu_{U},\mu_{V})\leq\frac{\mathinner{\!\left\lvert V\setminus U\right\rvert}}{\mathinner{\!\left\lvert V\right\rvert}}\cdot\operatorname{diam}(\Omega).
Proof.

(Proof of Theorem 4.2) Owing to Lemma 4.8 and Lemma 4.9, the creation of synthetic data from the true data 𝒳𝒴\mathcal{X}\mapsto\mathcal{Y}, described by Algorithm 4, can be achieved by the following three steps.

  1. 1.

    Transform the nn-point input set 𝒳\mathcal{X} to an mm-point set 𝒳1\mathcal{X}_{1} by adding or removing |mn|\mathinner{\!\left\lvert m-n\right\rvert} points.

  2. 2.

    Transform 𝒳1\mathcal{X}_{1} to 𝒳2\mathcal{X}_{2} by moving at most max(|λθ0|,|λθ1|)\max\left(\,\mathinner{\!\left\lvert\lambda_{\theta 0}\right\rvert},\mathinner{\!\left\lvert\lambda_{\theta 1}\right\rvert}\right) many data points for each j=0,1,,r1j=0,1,\dots,r-1 and θ{0,1}j\theta\in\{0,1\}^{j} between the two parts of the region Ωθ\Omega_{\theta}.

  3. 3.

Transform 𝒳2\mathcal{X}_{2} to the output data 𝒴\mathcal{Y} by relocating points within their cells.

We will analyze the accuracy of these steps one at a time.

Analyzing Step 2. The total distance the points are moved at this step is bounded by

j=0r1θ{0,1}jmax(|λθ0|,|λθ1|)diam(Ωθ)D.\sum_{j=0}^{r-1}\sum_{\theta\in\{0,1\}^{j}}\max\left(\,\mathinner{\!\left\lvert\lambda_{\theta 0}\right\rvert},\mathinner{\!\left\lvert\lambda_{\theta 1}\right\rvert}\right)\operatorname{diam}(\Omega_{\theta})\eqqcolon D. (4.2)

Since |𝒳1|=m\mathinner{\!\left\lvert\mathcal{X}_{1}\right\rvert}=m, it follows that

W1(μ𝒳1,μ𝒳2)Dm.W_{1}(\mu_{\mathcal{X}_{1}},\mu_{\mathcal{X}_{2}})\leq\frac{D}{m}. (4.3)

Combining Steps 1 and 2. Recall that step 1 transforms the input data 𝒳\mathcal{X} with |𝒳|=n\mathinner{\!\left\lvert\mathcal{X}\right\rvert}=n into 𝒳1\mathcal{X}_{1} with |𝒳1|=m=n+sign(λ)|λ|\mathinner{\!\left\lvert\mathcal{X}_{1}\right\rvert}=m=n+\operatorname{sign}(\lambda)\cdot\lfloor|\lambda|\rfloor by adding or removing points, depending on the sign of λ\lambda.

Case 1: λ0\lambda\geq 0. Here 𝒳1\mathcal{X}_{1} is obtained from 𝒳\mathcal{X} by adding λ\lfloor\lambda\rfloor points, so Lemma 4.10 gives

W1(μ𝒳,μ𝒳1)λmΔ0.W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{X}_{1}})\leq\frac{\lambda}{m}\cdot\Delta_{0}.

Combining this with (4.3) by triangle inequality, we conclude that

W1(μ𝒳,μ𝒳2)λΔ0+DmλΔ0+Dn.W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{X}_{2}})\leq\frac{\lambda\Delta_{0}+D}{m}\leq\frac{\lambda\Delta_{0}+D}{n}.

Case 2: λ<0\lambda<0. Here 𝒳1\mathcal{X}_{1} is obtained from 𝒳\mathcal{X} by removing a set 𝒳0\mathcal{X}_{0} of nm=|λ|n-m=\lfloor\mathinner{\!\left\lvert\lambda\right\rvert}\rfloor points. Furthermore, by our analysis of step 2, 𝒳2\mathcal{X}_{2} is obtained from 𝒳1\mathcal{X}_{1} by moving points the total distance at most DD. Therefore, 𝒳2𝒳0\mathcal{X}_{2}\cup\mathcal{X}_{0} (as a multiset) is obtained from 𝒳=𝒳1𝒳0\mathcal{X}=\mathcal{X}_{1}\cup\mathcal{X}_{0} by moving points the total distance at most DD, too. (The points in 𝒳0\mathcal{X}_{0} remain unmoved.) Since |𝒳|=n\mathinner{\!\left\lvert\mathcal{X}\right\rvert}=n, it follows that

W1(μ𝒳,μ𝒳2𝒳0)Dn.W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{X}_{2}\cup\mathcal{X}_{0}})\leq\frac{D}{n}.

Moreover, Lemma 4.10 gives

W1(μ𝒳2,μ𝒳2𝒳0)|𝒳0||𝒳2𝒳0|diam(Ω)|λ|Δ0n.W_{1}(\mu_{\mathcal{X}_{2}},\mu_{\mathcal{X}_{2}\cup\mathcal{X}_{0}})\leq\frac{\mathinner{\!\left\lvert\mathcal{X}_{0}\right\rvert}}{\mathinner{\!\left\lvert\mathcal{X}_{2}\cup\mathcal{X}_{0}\right\rvert}}\cdot\operatorname{diam}(\Omega)\leq\frac{\mathinner{\!\left\lvert\lambda\right\rvert}\Delta_{0}}{n}.

(Here we used that the multiset 𝒳2𝒳0\mathcal{X}_{2}\cup\mathcal{X}_{0} has the same number of points as 𝒳\mathcal{X}, which is nn.) Combining the two bounds by triangle inequality, we obtain

W1(μ𝒳,μ𝒳2)|λ|Δ0+Dn.W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{X}_{2}})\leq\frac{\mathinner{\!\left\lvert\lambda\right\rvert}\Delta_{0}+D}{n}. (4.4)

In other words, this bound holds in both cases.

Analyzing Step 3. This step is the easiest to analyze: since 𝒴\mathcal{Y} is obtained from 𝒳2\mathcal{X}_{2} by relocating points within their cells, and the maximal diameter of the cells is δ\delta, we have W1(μ𝒳2,μ𝒴)δ.W_{1}(\mu_{\mathcal{X}_{2}},\mu_{\mathcal{Y}})\leq\delta. Combining this with (4.4) by triangle inequality, we conclude that

W1(μ𝒳,μ𝒴)|λ|Δ0+Dn+δ.W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\leq\frac{\mathinner{\!\left\lvert\lambda\right\rvert}\Delta_{0}+D}{n}+\delta.

Taking expectation. Recall the definition of DD from (4.2). We get

𝔼W1(μ𝒳,μ𝒴)1n[𝔼[|λ|]Δ0+j=0r1θ{0,1}j𝔼[max(|λθ0|,|λθ1|)]diam(Ωθ)]+δ.\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\leq\frac{1}{n}\left[\operatorname{\mathbb{E}}\left[\,\mathinner{\!\left\lvert\lambda\right\rvert}\right]\Delta_{0}+\sum_{j=0}^{r-1}\sum_{\theta\in\{0,1\}^{j}}\operatorname{\mathbb{E}}\left[\max\left(\,\mathinner{\!\left\lvert\lambda_{\theta 0}\right\rvert},\mathinner{\!\left\lvert\lambda_{\theta 1}\right\rvert}\right)\right]\operatorname{diam}(\Omega_{\theta})\right]+\delta.

Since λLap(σ0)\lambda\sim\operatorname{Lap}_{\mathbb{Z}}(\sigma_{0}), by (B.1) we have 𝔼[|λ|](𝔼λ2)1/22σ0.\operatorname{\mathbb{E}}\left[\,\mathinner{\!\left\lvert\lambda\right\rvert}\right]\leq(\operatorname{\mathbb{E}}\lambda^{2})^{1/2}\leq\sqrt{2}\sigma_{0}. Similarly, since λθ0\lambda_{\theta 0} and λθ1\lambda_{\theta 1} are independent Lap(σj+1)\operatorname{Lap}_{\mathbb{Z}}(\sigma_{j+1}) random variables, 𝔼[max(|λθ0|,|λθ1|)]22σj+1\operatorname{\mathbb{E}}\left[\max\left(\,\mathinner{\!\left\lvert\lambda_{\theta 0}\right\rvert},\mathinner{\!\left\lvert\lambda_{\theta 1}\right\rvert}\right)\right]\leq 2\sqrt{2}\sigma_{j+1}. Substituting these estimates and rearranging the terms of the sum completes the proof. ∎

Acknowledgements

R.V. acknowledges support from NSF DMS–1954233, NSF DMS–2027299, U.S. Army 76649–CS, and NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning. Y.Z. is partially supported by NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning.

The authors thank March Boedihardjo, Girish Kumar, and Thomas Strohmer for helpful discussions.

References

  • [1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
  • [2] John Abowd, Robert Ashmead, Simson Garfinkel, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, and William Sexton. Census topdown: Differentially private data, incremental schemas, and consistency with public knowledge. US Census Bureau, 2019.
  • [3] John M Abowd, Robert Ashmead, Ryan Cumings-Menon, Simson Garfinkel, Micah Heineck, Christine Heiss, Robert Johns, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, et al. The 2020 census disclosure avoidance system topdown algorithm. Harvard Data Science Review, (Special Issue 2), 2022.
  • [4] Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms for earth mover’s distance. Theory of Computing Systems, 48:428–442, 2011.
  • [5] Boaz Barak, Kamalika Chaudhuri, Cynthia Dwork, Satyen Kale, Frank McSherry, and Kunal Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 273–282, 2007.
  • [6] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • [7] Steven M Bellovin, Preetam K Dutta, and Nathan Reitinger. Privacy and synthetic datasets. Stan. Tech. L. Rev., 22:1, 2019.
  • [8] Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to noninteractive database privacy. Journal of the ACM (JACM), 60(2):1–25, 2013.
  • [9] Sergey G Bobkov and Michel Ledoux. A simple fourier analytic proof of the AKT optimal matching theorem. The Annals of Applied Probability, 31(6):2567–2584, 2021.
  • [10] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Covariance’s loss is privacy’s gain: Computationally efficient, private and accurate synthetic data. Foundations of Computational Mathematics, pages 1–48, 2022.
  • [11] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Privacy of synthetic data: A statistical framework. IEEE Transactions on Information Theory, 69(1):520–527, 2022.
  • [12] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Private measures, random walks, and synthetic data. arXiv preprint arXiv:2204.09167, 2022.
  • [13] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Private sampling: a noiseless approach for generating differentially private synthetic data. SIAM Journal on Mathematics of Data Science, 4(3):1082–1115, 2022.
  • [14] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Covariance loss, Szemeredi regularity, and differential privacy. arXiv preprint arXiv:2301.02705, 2023.
  • [15] Sébastien Bubeck and Mark Sellke. A universal law of robustness via isoperimetry. Advances in Neural Information Processing Systems, 34:28811–28822, 2021.
  • [16] T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3):1–24, 2011.
  • [17] Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. Advances in neural information processing systems, 21, 2008.
  • [18] Steffen Dereich, Michael Scheutzow, and Reik Schottstedt. Constructive quantization: Approximation by empirical measures. Annales de l’IHP Probabilités et statistiques, 49(4):1183–1203, 2013.
  • [19] Sjoerd Dirksen. Tail bounds via generic chaining. Electron. J. Probab, 20(53):1–29, 2015.
  • [20] Josep Domingo-Ferrer, David Sánchez, and Alberto Blanco-Justicia. The limits of differential privacy (and its misuse in data release and machine learning). Communications of the ACM, 64(7):33–35, 2021.
  • [21] John C Duchi, Michael I Jordan, and Martin J Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 113(521):182–201, 2018.
  • [22] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N Rothblum. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 715–724, 2010.
  • [23] Cynthia Dwork, Aleksandar Nikolov, and Kunal Talwar. Efficient algorithms for privately releasing marginals via convex relaxations. Discrete & Computational Geometry, 53:650–673, 2015.
  • [24] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
  • [25] Dylan J Foster and Alexander Rakhlin. \ell^{\infty} vector contraction for Rademacher complexity. arXiv preprint arXiv:1911.06468, 6, 2019.
  • [26] Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Adaptive metric dimensionality reduction. Theoretical Computer Science, 620:105–118, 2016.
  • [27] Moritz Hardt, Katrina Ligett, and Frank McSherry. A simple and practical algorithm for differentially private data release. Advances in neural information processing systems, 25, 2012.
  • [28] Mathew E Hauer and Alexis R Santos-Lozada. Differential privacy in the 2020 census will distort covid-19 rates. Socius, 7:2378023121994014, 2021.
  • [29] Michael B Hawes. Implementing differential privacy: Seven lessons from the 2020 United States Census. Harvard Data Science Review, 2(2), 2020.
  • [30] Seidu Inusah and Tomasz Kozubowski. A discrete analogue of the Laplace distribution. Journal of Statistical Planning and Inference, 136:1090–1102, 03 2006.
  • [31] James Jordon, Jinsung Yoon, and Mihaela Van Der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. In International conference on learning representations, 2019.
  • [32] Leonid V Kovalev. Lipschitz clustering in metric spaces. The Journal of Geometric Analysis, 32(7):188, 2022.
  • [33] Terrance Liu, Giuseppe Vietri, and Steven Z Wu. Iterative methods for private synthetic data: Unifying framework and new methods. Advances in Neural Information Processing Systems, 34:690–702, 2021.
  • [34] Shahar Mendelson. Empirical processes with a bounded ψ\psi 1 diameter. Geometric and Functional Analysis, 20(4):988–1027, 2010.
  • [35] Laurent Meunier, Blaise J Delattre, Alexandre Araujo, and Alexandre Allauzen. A dynamical system perspective for Lipschitz neural networks. In International Conference on Machine Learning, pages 15484–15500. PMLR, 2022.
  • [36] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE global conference on signal and information processing, pages 245–248. IEEE, 2013.
  • [37] Dong Su, Jianneng Cao, Ninghui Li, Elisa Bertino, and Hongxia Jin. Differentially private kk-means clustering. In Proceedings of the sixth ACM conference on data and application security and privacy, pages 26–37, 2016.
  • [38] Michel Talagrand. Matching random samples in many dimensions. The Annals of Applied Probability, pages 846–856, 1992.
  • [39] Michel Talagrand. The generic chaining: upper and lower bounds of stochastic processes. Springer Science & Business Media, 2005.
  • [40] Justin Thaler, Jonathan Ullman, and Salil Vadhan. Faster algorithms for privately releasing marginals. In Automata, Languages, and Programming: 39th International Colloquium, ICALP 2012, Warwick, UK, July 9-13, 2012, Proceedings, Part I 39, pages 810–821. Springer, 2012.
  • [41] VM Tikhomirov. ε\varepsilon-entropy and ε\varepsilon-capacity of sets in functional spaces. In Selected works of AN Kolmogorov, pages 86–170. Springer, 1993.
  • [42] Jonathan Ullman and Salil Vadhan. PCPs and the hardness of generating private synthetic data. In Theory of Cryptography: 8th Theory of Cryptography Conference, TCC 2011, Providence, RI, USA, March 28-30, 2011. Proceedings 8, pages 400–416. Springer, 2011.
  • [43] Salil Vadhan. The complexity of differential privacy. Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich, pages 347–450, 2017.
  • [44] Vijay V Vazirani. Approximation algorithms, volume 1. Springer, 2001.
  • [45] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • [46] Giuseppe Vietri, Cedric Archambeau, Sergul Aydore, William Brown, Michael Kearns, Aaron Roth, Ankit Siva, Shuai Tang, and Steven Wu. Private synthetic data for multitask learning and marginal queries. In Advances in Neural Information Processing Systems, 2022.
  • [47] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.
  • [48] Ulrike von Luxburg and Olivier Bousquet. Distance-based classification with Lipschitz functions. J. Mach. Learn. Res., 5(Jun):669–695, 2004.
  • [49] Ziteng Wang, Chi Jin, Kai Fan, Jiaqi Zhang, Junliang Huang, Yiqiao Zhong, and Liwei Wang. Differentially private data releasing for smooth queries. The Journal of Machine Learning Research, 17(1):1779–1820, 2016.
  • [50] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. Bernoulli, 25(4 A):2620–2648, 2019.

Appendix A Additional proofs

A.1. Proof of Proposition 3.2

Proof.

We will apply the chaining argument (see, e.g., [45, Chapter 8]) to deduce a bound similar to Dudley’s inequality.

Step 1: (Finding nets)

Define εj=2j\varepsilon_{j}=2^{-j} for jj\in\mathbb{Z} and consider an εj\varepsilon_{j}-net TjT_{j} of \mathcal{F} of size 𝒩(,εj,)\mathcal{N}(\mathcal{F},\varepsilon_{j},\|\cdot\|_{\infty}). Then for any ff\in\mathcal{F} and any level jj, we can find the closest element in the net, denoted πj(f)\pi_{j}(f). In other words, there exists πj(f)\pi_{j}(f) s.t.

πj(f)Tj,fπj(f)εj.\pi_{j}(f)\in T_{j},\quad\|f-\pi_{j}(f)\|_{\infty}\leq\varepsilon_{j}.

Let mm be a positive integer to be determined later. By the telescoping sum and the triangle inequality, we have

𝔼supf1n|i=1nf(Xi)λi|\displaystyle\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\frac{1}{n}\left|\sum_{i=1}^{n}f(X_{i})\lambda_{i}\right|\leq 𝔼supf1n|i=1n(fπm(f))(Xi)λi|\displaystyle\,\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\frac{1}{n}\left|\sum_{i=1}^{n}\left(f-\pi_{m}(f)\right)(X_{i})\cdot\lambda_{i}\right|
+j=j0+1m𝔼supf1n|i=1n(πj(f)πj1(f))(Xi)λi|.\displaystyle\,+\sum_{j=j_{0}+1}^{m}\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\frac{1}{n}\left|\sum_{i=1}^{n}\left(\pi_{j}(f)-\pi_{j-1}(f)\right)(X_{i})\cdot\lambda_{i}\right|.

Note that when j=j0j=j_{0} is small enough, \mathcal{F} can be covered by the single function πj0(f)0\pi_{j_{0}}(f)\equiv 0.

Step 2: (Bounding the telescoping sum)

For a fixed j0<jmj_{0}<j\leq m, we consider the quantity

𝔼supf1n|i=1n(πj(f)πj1(f))(Xi)λi|.\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\frac{1}{n}\left|\sum_{i=1}^{n}\left(\pi_{j}(f)-\pi_{j-1}(f)\right)(X_{i})\cdot\lambda_{i}\right|.

For simplicity we will denote ai=ai(f)a_{i}=a_{i}(f) as the coefficient 1n(πj(f)πj1(f))(Xi)\frac{1}{n}\left(\pi_{j}(f)-\pi_{j-1}(f)\right)(X_{i}). Then we have

|ai|1nfπj1(f)+1nπj(f)f1n(εj+εj1)3εjn.|a_{i}|\leq\frac{1}{n}\left\|f-\pi_{j-1}(f)\right\|_{\infty}+\frac{1}{n}\left\|\pi_{j}(f)-f\right\|_{\infty}\leq\frac{1}{n}(\varepsilon_{j}+\varepsilon_{j-1})\leq\frac{3\varepsilon_{j}}{n}.

Since {λi}i[n]\{\lambda_{i}\}_{i\in[n]} are independent subexponential random variables, we can apply Bernstein’s inequality to the sum iaiλi\sum_{i}a_{i}\lambda_{i}. Setting K=3εjK=3\varepsilon_{j}, we have

{|i=1naiλi|>t}\displaystyle\mathbb{P}\left\{\left|\sum_{i=1}^{n}a_{i}\lambda_{i}\right|>t\rule{0.0pt}{8.53581pt}\right\} 2exp[cmin(t2a22,ta)]\displaystyle\leq 2\exp\left[-c\min\left(\frac{t^{2}}{\|a\|_{2}^{2}},\frac{t}{\|a\|_{\infty}}\right)\right]
2exp[cmin(t2K2/n,tK/n)]\displaystyle\leq 2\exp\left[-c\min\left(\frac{t^{2}}{K^{2}/n},\frac{t}{K/n}\right)\right]
=2exp[cnmin(t2K2,tK)],\displaystyle=2\exp\left[-cn\min\left(\frac{t^{2}}{K^{2}},\frac{t}{K}\right)\right],

Then we can use a union bound to control the supremum. Define N=|Tj||Tj1||Tj|2N=|T_{j}|\cdot|T_{j-1}|\leq|T_{j}|^{2}. Then

{supf|i=1naiλi|>t}\displaystyle\mathbb{P}\left\{\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^{n}a_{i}\lambda_{i}\right|>t\rule{0.0pt}{8.53581pt}\right\} 2Nexp[cnmin(t2K2,tK)]1\displaystyle\leq 2N\exp\left[-cn\min\left(\frac{t^{2}}{K^{2}},\frac{t}{K}\right)\right]\wedge 1
=2exp[logNcnmin(t2K2,tK)]1\displaystyle=2\exp\left[\log N-cn\min\left(\frac{t^{2}}{K^{2}},\frac{t}{K}\right)\right]\wedge 1
2exp(logNcnt2K2)1\displaystyle\leq 2\exp\left(\log N-cn\frac{t^{2}}{K^{2}}\right)\wedge 1
+2exp(logNcntK)1\displaystyle\quad+2\exp\left(\log N-cn\frac{t}{K}\right)\wedge 1

and hence

𝔼supf|i=1naiλi|=\displaystyle\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^{n}a_{i}\lambda_{i}\right|= 02exp(logNcnt2K2)1dt\displaystyle\int_{0}^{\infty}2\exp\left(\log N-cn\frac{t^{2}}{K^{2}}\right)\wedge 1\mathrm{d}t
+02exp(logNcntK)1dt\displaystyle\quad+\int_{0}^{\infty}2\exp\left(\log N-cn\frac{t}{K}\right)\wedge 1\mathrm{d}t
:=\displaystyle\mathrel{\mathop{\mathchar 58\relax}}= I2+I1.\displaystyle\,I_{2}+I_{1}.

We will compute them separately.

I1\displaystyle I_{1} =02exp(logNcntK)1dt\displaystyle=\int_{0}^{\infty}2\exp\left(\log N-cn\frac{t}{K}\right)\wedge 1\mathrm{d}t
=KlogNcn+KlogN/cn2exp(logNcntK)dt\displaystyle=\frac{K\log N}{cn}+\int_{{K\log N}/{cn}}^{\infty}2\exp\left(\log N-cn\frac{t}{K}\right)\mathrm{d}t
=KlogNcn+02exp(cntK)dt\displaystyle=\frac{K\log N}{cn}+\int_{0}^{\infty}2\exp\left(-cn\frac{t}{K}\right)\mathrm{d}t
CKlogNn\displaystyle\leq CK\frac{\log N}{n}
I2\displaystyle I_{2} =02exp(logNcnt2K2)1dt\displaystyle=\int_{0}^{\infty}2\exp\left(\log N-cn\frac{t^{2}}{K^{2}}\right)\wedge 1\mathrm{d}t
=K2logNcn+K2logN/cn2exp(logNcnt2K2)dt\displaystyle=\sqrt{\frac{K^{2}\log N}{cn}}+\int_{\sqrt{{K^{2}\log N}/{cn}}}^{\infty}2\exp\left(\log N-cn\frac{t^{2}}{K^{2}}\right)\mathrm{d}t
=K2logNcn+02exp(cnt2K22cnlogNtK)dt\displaystyle=\sqrt{\frac{K^{2}\log N}{cn}}+\int_{0}^{\infty}2\exp\left(-cn\frac{t^{2}}{K^{2}}-2\sqrt{cn\log N}\frac{t}{K}\right)\mathrm{d}t
K2logNcn+KcnlogN\displaystyle\leq\sqrt{\frac{K^{2}\log N}{cn}}+\frac{K}{\sqrt{cn\log N}}
CKlogNn.\displaystyle\leq CK\sqrt{\frac{\log N}{n}}.

Therefore we conclude that for a fixed level jj,

𝔼supf|i=1naiλi|CK(logNn+logNn)εj(logNn+logNn)\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^{n}a_{i}\lambda_{i}\right|\leq CK\left({\frac{\log N}{n}}+\sqrt{\frac{\log N}{n}}\right)\lesssim\varepsilon_{j}\left({\frac{\log N}{n}}+\sqrt{\frac{\log N}{n}}\right)

Step 3: (Bounding the last entry)

For the last entry in the telescoping sum, similarly, we denote ai:=1n(fπm(f))(Xi)a_{i}\mathrel{\mathop{\mathchar 58\relax}}=\frac{1}{n}\left(f-\pi_{m}(f)\right)(X_{i}) and we have |ai|εm/n|a_{i}|\leq\varepsilon_{m}/n. Then

supf|i=1naiλi|εmni=1n|λi|,\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^{n}a_{i}\lambda_{i}\right|\leq\frac{\varepsilon_{m}}{n}\sum_{i=1}^{n}|\lambda_{i}|,

and the expectation satisfies

𝔼supf|i=1naiλi|εmni=1n𝔼|λi|εm.\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^{n}a_{i}\lambda_{i}\right|\leq\frac{\varepsilon_{m}}{n}\sum_{i=1}^{n}\operatorname{\mathbb{E}}|\lambda_{i}|\lesssim\varepsilon_{m}.

Step 4: (Combining the bounds and choosing mm) Combining the bounds from Steps 2 and 3, we deduce that for any X1,,XnΩX_{1},\dots,X_{n}\in\Omega,

𝔼supf1n|i=1nf(Xi)λi|C(εm+j=j0+1m\displaystyle\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\frac{1}{n}\left|\sum_{i=1}^{n}f(X_{i})\lambda_{i}\right|\leq C\left(\varepsilon_{m}+\sum_{j=j_{0}+1}^{m}\right. εj(log𝒩(,εj,)n\displaystyle\varepsilon_{j}\Bigg{(}{\frac{\log\mathcal{N}(\mathcal{F},\varepsilon_{j},\|\cdot\|_{\infty})}{n}}
+log𝒩(,εj,)n)).\displaystyle\left.+\sqrt{\frac{\log\mathcal{N}(\mathcal{F},\varepsilon_{j},\|\cdot\|_{\infty})}{n}}\Bigg{)}\right).

Then for any α>0\alpha>0, we can choose mm such that 2αεm<4α2\alpha\leq\varepsilon_{m}<4\alpha and bound the sum above by an integral:

𝔼supf1n|i=1nf(Xi)λi|C(2α\displaystyle\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\frac{1}{n}\left|\sum_{i=1}^{n}f(X_{i})\lambda_{i}\right|\leq C\Bigg{(}2\alpha +1nαlog𝒩(,u,)du\displaystyle+\frac{1}{\sqrt{n}}\int_{\alpha}^{\infty}\sqrt{\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})}\mathrm{d}u
+1nαlog𝒩(,u,)du).\displaystyle+\frac{1}{{n}}\int_{\alpha}^{\infty}{\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})}\mathrm{d}u\Bigg{)}. (A.1)

Taking the infimum over α\alpha completes the proof of the first inequality.

Now assume \mathcal{F} is the set of all functions ff with fLip1\|f\|_{\mathrm{Lip}}\leq 1. From [26, Lemma 4.2], we can bound the covering number of \mathcal{F} by the covering number of Ω\Omega as follows:

log𝒩(,u,)log(8/u)𝒩(Ω,u/2,ρ).\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})\leq\log({8}/{u})\,\mathcal{N}(\Omega,{u}/{2},\rho).

As a result, for any α>0\alpha>0,

Ln()C(2α+1nαlog(8/u)𝒩(Ω,u/2,ρ)du+1nαlog(8/u)𝒩(Ω,u/2,ρ)du).L_{n}(\mathcal{F})\leq C\left(2\alpha+\frac{1}{\sqrt{n}}\int_{\alpha}^{\infty}\sqrt{\log({8}/{u})\,\mathcal{N}(\Omega,{u}/{2},\rho)}\mathrm{d}u+\frac{1}{{n}}\int_{\alpha}^{\infty}\log({8}/{u})\,\mathcal{N}(\Omega,{u}/{2},\rho)\mathrm{d}u\right).

This completes the proof. ∎

A.2. Proof of Corollary 3.3

Proof.

For Ω=[0,1]d\Omega=[0,1]^{d} with ll_{\infty}-norm, we have diam(Ω)=1\operatorname{diam}(\Omega)=1 and the covering number

𝒩([0,1]d,u,)ud.\mathcal{N}([0,1]^{d},u,\|\cdot\|_{\infty})\leq u^{-d}.

Then, as the domain Ω=[0,1]d\Omega=[0,1]^{d} is connected and centered, we can apply the bound for the covering number of \mathcal{F} from [48, Theorem 17]:

𝒩(,u,)(22/u+1)2𝒩([0,1]d,u/2,),\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})\leq\left(2\left\lceil{2}/u\right\rceil+1\right)2^{\mathcal{N}\left([0,1]^{d},u/2,\|\cdot\|_{\infty}\right)},
log𝒩(,u,)𝒩(Ω,u/2,)(u/2)d.\Longrightarrow\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})\lesssim\mathcal{N}(\Omega,{u}/{2},\|\cdot\|_{\infty})\lesssim(u/2)^{-d}.

Applying the inequality above to (A.1), we get

LnC(2α+1nα(u/2)d/2du+1nα(u/2)ddu).L_{n}\leq C\left(2\alpha+\frac{1}{\sqrt{n}}\int_{\alpha}^{\infty}(u/2)^{-d/2}\mathrm{d}u+\frac{1}{{n}}\int_{\alpha}^{\infty}(u/2)^{-d}\mathrm{d}u\right). (A.2)

We compute the integrals for the cases d=2d=2 and d3d\geq 3. For d=2d=2, the first integral in (A.2) is effectively truncated at a constant level, since log𝒩(,u,)=0\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})=0 once uu exceeds diam()\operatorname{diam}(\mathcal{F}). This yields

Ln(f){C(2α+2nlog2α+2n(α2)1) if d=2.C(2α+2n1d21(α2)1d2+2n1d1(α2)1d) if d3.L_{n}(f)\leq\left\{\begin{aligned} &C\Bigg{(}2\alpha+\frac{2}{\sqrt{n}}\log\frac{2}{\alpha}+\frac{2}{n}\left(\frac{\alpha}{2}\right)^{-1}\Bigg{)}&\quad\textrm{ if }d=2.\\ &C\Bigg{(}2\alpha+\frac{2}{\sqrt{n}}\cdot\frac{1}{\frac{d}{2}-1}\left(\frac{\alpha}{2}\right)^{1-\frac{d}{2}}+\frac{2}{n}\cdot\frac{1}{d-1}\left(\frac{\alpha}{2}\right)^{1-d}\Bigg{)}\quad&\textrm{ if }d\geq 3.\end{aligned}\right.

Choosing α=2n1/d\alpha=2n^{-1/d} finishes the proof for d2d\geq 2.

When d=1d=1, the Dudley integral in (A.2) is divergent. However, note that diam()2\operatorname{diam}(\mathcal{F})\leq 2 and hence log𝒩(,u,)=0\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})=0 for u>1u>1. From (A.1), we have

Ln()\displaystyle L_{n}(\mathcal{F}) C(2α+1nα1(u/2)1/2du+1nα1(u/2)1du)\displaystyle\leq C\left(2\alpha+\frac{1}{\sqrt{n}}\int_{\alpha}^{1}(u/2)^{-1/2}\mathrm{d}u+\frac{1}{{n}}\int_{\alpha}^{1}(u/2)^{-1}\mathrm{d}u\right)
C(2α+2(2α)n+2nlog1α).\displaystyle\leq C\left(2\alpha+\frac{2(\sqrt{2}-\sqrt{\alpha})}{\sqrt{n}}+\frac{2}{n}\log\frac{1}{\alpha}\right).

The optimal choice of α\alpha is αn1/2\alpha\sim n^{-1/2}, which gives us the result for d=1d=1. ∎
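As a numerical sanity check of this choice (the following short Python sketch is ours and is not part of the proof; the function name is hypothetical), one can evaluate the three terms of the bound for d3d\geq 3 at α=2n1/d\alpha=2n^{-1/d} and verify that their sum stays within a constant multiple of n1/dn^{-1/d}:

def bound_d_ge_3(n: int, d: int) -> float:
    """Illustrative sketch: evaluate the right-hand side of the bound for d >= 3
    at the choice alpha = 2 * n**(-1/d)."""
    alpha = 2 * n ** (-1 / d)
    term1 = 2 * alpha
    term2 = (2 / n ** 0.5) * (1 / (d / 2 - 1)) * (alpha / 2) ** (1 - d / 2)
    term3 = (2 / n) * (1 / (d - 1)) * (alpha / 2) ** (1 - d)
    return term1 + term2 + term3

if __name__ == "__main__":
    d = 3
    for n in (10**3, 10**4, 10**5):
        # the ratio below is constant in n (it equals 9 when d = 3)
        print(n, bound_d_ge_3(n, d) / n ** (-1 / d))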

A.3. Proof of Proposition 3.4

Proof.

It suffices to prove that the map from 𝒳\mathcal{X} to the signed measure ν\nu in Algorithm 1 is ε\varepsilon-differentially private, since the remaining steps are based only on ν\nu. Since both μ𝒴,ν\mu_{\mathcal{Y}},\nu are supported on Y1,,YmY_{1},\dots,Y_{m}, we can identify the two discrete measures with mm-dimensional vectors, denoted μ𝒴¯,ν¯\overline{\mu_{\mathcal{Y}}},\overline{\nu}, respectively. Consider two datasets 𝒳1\mathcal{X}_{1} and 𝒳2\mathcal{X}_{2} that differ in one point. Suppose we obtain μ𝒴1,μ𝒴2{\mu_{\mathcal{Y}_{1}}},{\mu_{\mathcal{Y}_{2}}} and ν1,ν2\nu_{1},\nu_{2} through the first four steps of Algorithm 1 from 𝒳1,𝒳2\mathcal{X}_{1},\mathcal{X}_{2}, respectively. The two vectors μ𝒴1¯,μ𝒴2¯\overline{\mu_{\mathcal{Y}_{1}}},\overline{\mu_{\mathcal{Y}_{2}}} differ in one coordinate, and the difference is bounded by 1/n1/n.

Then

{ν1=η}{ν2=η}\displaystyle\frac{\mathbb{P}\left\{\nu_{1}=\eta\rule{0.0pt}{8.53581pt}\right\}}{\mathbb{P}\left\{\nu_{2}=\eta\rule{0.0pt}{8.53581pt}\right\}} =i=1m{λi=n(ημ𝒴1¯)i}{λi=n(ημ𝒴2¯)i}=i=1mexp(εn|(ημ𝒴1¯)i|)exp(εn|(ημ𝒴2¯)i|)\displaystyle=\prod_{i=1}^{m}\frac{\mathbb{P}\left\{\lambda_{i}=n(\eta-\overline{\mu_{\mathcal{Y}_{1}}})_{i}\rule{0.0pt}{8.53581pt}\right\}}{\mathbb{P}\left\{\lambda_{i}=n(\eta-\overline{\mu_{\mathcal{Y}_{2}}})_{i}\rule{0.0pt}{8.53581pt}\right\}}=\prod_{i=1}^{m}\frac{\exp(-\varepsilon n|(\eta-\overline{\mu_{\mathcal{Y}_{1}}})_{i}|)}{\exp(-\varepsilon n|(\eta-\overline{\mu_{\mathcal{Y}_{2}}})_{i}|)}
exp(εnμ𝒴2μ𝒴11)eε.\displaystyle\leq\exp\left(\varepsilon n\|{\mu_{\mathcal{Y}_{2}}}-{\mu_{\mathcal{Y}_{1}}}\|_{1}\right)\leq e^{\varepsilon}.

By writing {νiS}=ηS{νi=η}\mathbb{P}\left\{\nu_{i}\in S\rule{0.0pt}{8.53581pt}\right\}=\sum_{\eta\in S}\mathbb{P}\left\{\nu_{i}=\eta\rule{0.0pt}{8.53581pt}\right\} for i=1,2i=1,2, the inequality above implies Algorithm 1 is ε\varepsilon-differentially private. ∎
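For concreteness, the perturbation step analyzed above can be sketched in a few lines of Python (this is our illustration, not the paper’s implementation; the helper names are hypothetical). A draw with probability mass proportional to exp(ε|k|)\exp(-\varepsilon|k|), which is Lap(1/ε)\operatorname{Lap}_{\mathbb{Z}}(1/\varepsilon) in the notation of Appendix B, is generated as a difference of two geometric random variables:

import numpy as np

def sample_discrete_laplace(eps: float, size: int, rng: np.random.Generator) -> np.ndarray:
    """Our sketch: integers with P(k) proportional to exp(-eps * |k|), i.e. Lap_Z(1 / eps)."""
    q = np.exp(-eps)
    g1 = rng.geometric(1 - q, size=size) - 1   # geometric variable on {0, 1, 2, ...}
    g2 = rng.geometric(1 - q, size=size) - 1
    return g1 - g2

def perturb_counts(counts: np.ndarray, n: int, eps: float, rng: np.random.Generator) -> np.ndarray:
    """Our sketch: turn the count vector (n_i) into the noisy signed measure (n_i + lambda_i) / n."""
    lam = sample_discrete_laplace(eps, counts.size, rng)
    return (counts + lam) / n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, eps = 1000, 1.0
    counts = np.array([400, 350, 250])          # counts n_i over m = 3 support points y_i
    nu = perturb_counts(counts, n, eps, rng)    # entries may be negative; total mass may differ from 1
    print(nu, nu.sum())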

A.4. Proof of Proposition 3.5

Proof.

For two signed measures τ,ν\tau,\nu supported on 𝒴\mathcal{Y}, the dBLd_{\mathrm{BL}}-distance between τ\tau and ν\nu is

dBL(τ,ν)=supfLip1|i=1mf(yi)(τ({yi})ν({yi}))|.d_{\mathrm{BL}}(\tau,\nu)=\sup_{\|f\|_{\mathrm{Lip}}\leq 1}\left|\sum_{i=1}^{m}f(y_{i})\left(\tau(\{y_{i}\})-\nu(\{y_{i}\})\right)\right|.

For simplicity, we denote fi=f(yi)f_{i}=f(y_{i}), νi=ν({yi})\nu_{i}=\nu(\{y_{i}\}) and τi=τ({yi})\tau_{i}=\tau(\{y_{i}\}). Note that for any ff with fLip1\|f\|_{\mathrm{Lip}}\leq 1, only the values (fi)i[m](f_{i})_{i\in[m]} matter in the definition above. Therefore, for fixed ν\nu and τ\tau, computing the dBLd_{\mathrm{BL}}-distance is equivalent to the following linear programming problem:

max\displaystyle\max\quad i=1m(νiτi)fi\displaystyle\sum_{i=1}^{m}(\nu_{i}-\tau_{i})f_{i}
s.t. fifjyiyj,\displaystyle f_{i}-f_{j}\leq\|y_{i}-y_{j}\|_{\infty}, i,jm,ij,\displaystyle{\forall i,j\leq m,i\not=j},
fi+fjyiyj,\displaystyle-f_{i}+f_{j}\leq\|y_{i}-y_{j}\|_{\infty}, i,jm,ij,\displaystyle\forall i,j\leq m,i\not=j,
1fi1,\displaystyle-1\leq f_{i}\leq 1, im.\displaystyle\forall i\leq m.

After a change of variable fi=fi+1f_{i}^{\prime}=f_{i}+1, we can rewrite it as

max\displaystyle\max\quad i=1m(νiτi)fi(ν(Ω)1)\displaystyle\sum_{i=1}^{m}(\nu_{i}-\tau_{i})f_{i}^{\prime}-\left(\nu(\Omega)-1\right)
s.t. fifjyiyj,\displaystyle f_{i}^{\prime}-f_{j}^{\prime}\leq\|y_{i}-y_{j}\|_{\infty}, i,jm,ij,\displaystyle\forall i,j\leq m,i\not=j,
fi+fjyiyj,\displaystyle-f_{i}^{\prime}+f_{j}^{\prime}\leq\|y_{i}-y_{j}\|_{\infty}, i,jm,ij,\displaystyle\forall i,j\leq m,i\not=j,
0fi2,\displaystyle 0\leq f_{i}^{\prime}\leq 2, im.\displaystyle\forall i\leq m.

Next, we consider the dual of the linear program above. The duality theory of linear programming [44, Chapter 12] shows that the primal and the dual problems have the same optimal value. Let uij,uij0u_{ij},u_{ij}^{\prime}\geq 0 be the dual variables for the linear constraints on fifjf_{i}^{\prime}-f_{j}^{\prime} and fi+fj-f_{i}^{\prime}+f_{j}^{\prime}, and let vi0v_{i}\geq 0 be the dual variable for the constraint fi2f_{i}^{\prime}\leq 2. As the linear program above is in standard form, by duality it is equivalent to

min\displaystyle\min\quad ijyiyj(uij+uij)+2i=1mvi(ν(Ω)1)\sum_{i\neq j}\|y_{i}-y_{j}\|_{\infty}(u_{ij}+u_{ij}^{\prime})+2\sum_{i=1}^{m}v_{i}-\left(\nu(\Omega)-1\right)
s.t. ji(uijuij)+viνiτi,\displaystyle\sum_{j\neq i}(u_{ij}-u_{ij}^{\prime})+v_{i}\geq\nu_{i}-\tau_{i}, im,\displaystyle\forall{i\leq m},
uij,uij,vi0\displaystyle u_{ij},u_{ij}^{\prime},v_{i}\geq 0 i,jm,ij.\displaystyle\forall i,j\leq m,i\not=j.

To find the minimizer τ\tau for a given ν\nu, we regard the τi\tau_{i} as variables and add the constraint that τ\tau is a probability measure. We may also drop the constant ν(Ω)1\nu(\Omega)-1 from the objective function. This yields the linear programming problem:

min\displaystyle\min\quad ijyiyj(uij+uij)+2i=1mvi\sum_{i\neq j}\|y_{i}-y_{j}\|_{\infty}(u_{ij}+u_{ij}^{\prime})+2\sum_{i=1}^{m}v_{i}
s.t. ji(uijuij)+vi+τiνi,\displaystyle\sum_{j\neq i}(u_{ij}-u_{ij}^{\prime})+v_{i}+\tau_{i}\geq\nu_{i}, im,\displaystyle\forall{i\leq m},
i=1mτi=1,\displaystyle\sum_{i=1}^{m}\tau_{i}=1,
uij,uij,vi,τi0\displaystyle u_{ij},u_{ij}^{\prime},v_{i},\tau_{i}\geq 0 i,jm,ij.\displaystyle\forall i,j\leq m,i\not=j.

There are 2m22m^{2} variables in total and m+1m+1 linear constraints, and the τ\tau-component (τi)i=1m(\tau_{i})_{i=1}^{m} of the minimizer is the desired probability measure. ∎
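To make the linear program concrete, here is a minimal sketch that assembles the variables (uij,uij,vi,τi)(u_{ij},u_{ij}^{\prime},v_{i},\tau_{i}) and returns the τ\tau-component of the minimizer (it assumes SciPy is available; the function name and the toy data are ours, not the paper’s implementation):

import numpy as np
from scipy.optimize import linprog

def closest_probability_measure(nu: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Our sketch of the linear program above: find the probability vector tau
    minimizing the bounded-Lipschitz distance to the signed-measure weights nu."""
    m = len(nu)
    pairs = [(i, j) for i in range(m) for j in range(m) if i != j]
    P = len(pairs)                                    # m * (m - 1) ordered pairs
    # variable order: u (P entries), u' (P entries), v (m entries), tau (m entries)
    dist = np.max(np.abs(y[:, None, :] - y[None, :, :]), axis=2)   # ||y_i - y_j||_inf

    cost = np.zeros(2 * P + 2 * m)
    for k, (i, j) in enumerate(pairs):
        cost[k] = cost[P + k] = dist[i, j]
    cost[2 * P:2 * P + m] = 2.0                       # the 2 * sum_i v_i term

    # constraints sum_{j != i}(u_ij - u'_ij) + v_i + tau_i >= nu_i, written as A_ub x <= b_ub
    A_ub = np.zeros((m, 2 * P + 2 * m))
    for k, (i, j) in enumerate(pairs):
        A_ub[i, k] = -1.0
        A_ub[i, P + k] = 1.0
    for i in range(m):
        A_ub[i, 2 * P + i] = -1.0                     # v_i
        A_ub[i, 2 * P + m + i] = -1.0                 # tau_i
    b_ub = -np.asarray(nu, dtype=float)

    A_eq = np.zeros((1, 2 * P + 2 * m))               # sum_i tau_i = 1
    A_eq[0, 2 * P + m:] = 1.0

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=(0, None))
    return res.x[2 * P + m:]

if __name__ == "__main__":
    y = np.array([[0.1], [0.5], [0.9]])               # m = 3 support points in [0, 1]
    nu = np.array([0.5, -0.1, 0.7])                   # noisy signed measure with total mass 1.1
    tau = closest_probability_measure(nu, y)
    print(tau, tau.sum())                             # a probability vector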

A.5. Proof of Theorem 3.6

Proof.

We transform the original empirical measure μ𝒳\mu_{\mathcal{X}} in three steps: μ𝒳μ𝒴νν^.\mu_{\mathcal{X}}\longrightarrow\mu_{\mathcal{Y}}\longrightarrow\nu\longrightarrow\hat{\nu}.

Step 1: For the first step in the algorithm, we have W1(μ𝒳,μ𝒴)maxidiam(Ωi)W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\leq\max_{i}\operatorname{diam}(\Omega_{i}). This follows from the definition of the 11-Wasserstein distance, since each data point is transported within the region Ωi\Omega_{i} that contains it.

Step 2: In this step, ν\nu is no longer a probability measure, and we consider dBL(μ𝒴,ν)d_{\mathrm{BL}}(\mu_{\mathcal{Y}},\nu) instead:

𝔼dBL(μ𝒴,ν)\displaystyle\mathbb{E}d_{\mathrm{BL}}(\mu_{\mathcal{Y}},\nu) =𝔼supfLip1|fdμ𝒴fdν|\displaystyle=\mathbb{E}\sup_{\|f\|_{\mathrm{Lip}}\leq 1}\left|\int f\mathrm{d}\mu_{\mathcal{Y}}-\int f\mathrm{d}\nu\right|
=𝔼supfLip1|i=1mf(yi)(nin+λinnin)|=mεnL~m().\displaystyle=\mathbb{E}\sup_{\|f\|_{\mathrm{Lip}}\leq 1}\left|\sum_{i=1}^{m}f(y_{i})\left(\frac{n_{i}}{n}+\frac{\lambda_{i}}{n}-\frac{n_{i}}{n}\right)\right|=\frac{m}{\varepsilon n}\widetilde{L}_{m}(\mathcal{F}). (A.3)

Step 3: For the last step, we have dBL(ν,ν^)dBL(μ𝒴,ν)d_{\mathrm{BL}}(\nu,\hat{\nu})\leq d_{\mathrm{BL}}(\mu_{\mathcal{Y}},\nu) because, by Proposition 3.5, ν^\hat{\nu} is the probability measure closest to ν\nu, while μ𝒴\mu_{\mathcal{Y}} is itself a probability measure. As a result, we have

W1(μ𝒳,ν^)=dBL(μ𝒳,ν^)\displaystyle W_{1}(\mu_{\mathcal{X}},\hat{\nu})=d_{\mathrm{BL}}(\mu_{\mathcal{X}},\hat{\nu}) dBL(μ𝒳,μ𝒴)+dBL(μ𝒴,ν)+dBL(ν,ν^)\displaystyle\leq d_{\mathrm{BL}}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})+d_{\mathrm{BL}}(\mu_{\mathcal{Y}},\nu)+d_{\mathrm{BL}}(\nu,\hat{\nu})
W1(μ𝒳,μ𝒴)+2dBL(μ𝒴,ν)maxidiam(Ωi)+2dBL(μ𝒴,ν).\displaystyle\leq W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})+2d_{\mathrm{BL}}(\mu_{\mathcal{Y}},\nu)\leq\max_{i}\operatorname{diam}(\Omega_{i})+2d_{\mathrm{BL}}(\mu_{\mathcal{Y}},\nu).

After taking the expectation, we can apply (A.3) to get the desired inequality. ∎

A.6. Proof of Corollary 3.7

Proof.

Using Theorem 3.6, we have

𝔼W1(μ𝒳,ν^)maxidiam(Ωi)+2mεnL~m().\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\hat{\nu})\leq\max_{i}\operatorname{diam}(\Omega_{i})+\frac{2m}{\varepsilon n}\widetilde{L}_{m}(\mathcal{F}).

By assumption, we have maxidiam(Ωi)m1/d(εn)1/d\max_{i}\operatorname{diam}(\Omega_{i})\asymp m^{-1/d}\asymp(\varepsilon n)^{-1/d}. By Corollary 3.3, the Laplacian complexity satisfies

L~m(){C(εn)1/2 if d=1,Clogn(εn)1/2 if d=2,C(εn)1/d if d3.\widetilde{L}_{m}(\mathcal{F})\leq\left\{\begin{aligned} &C(\varepsilon n)^{-1/2}\quad&\textrm{ if }d=1,\\ &{C\log n}\cdot(\varepsilon n)^{-1/2}\quad&\textrm{ if }d=2,\\ &{C}{(\varepsilon n)^{-{1}/{d}}}\quad&\textrm{ if }d\geq 3.\end{aligned}\right.

When d3d\geq 3, the two terms are comparable, while for d=1,2d=1,2 the Laplacian complexity term dominates the error. Combining the two inequalities gives the result. ∎

A.7. Proof of Theorem 4.1

Theorem 4.1 can be obtained by applying the parallel composition lemma [24]. Here we present a self-contained proof by considering an inhomogeneous version of the classical Laplacian mechanism [24].

Lemma A.1 (Inhomogeneous Laplace mechanism).

Let F:ΩnkF\mathrel{\mathop{\mathchar 58\relax}}\Omega^{n}\to\mathbb{R}^{k} be any map, s=(si)i=1k+ks=(s_{i})_{i=1}^{k}\in\mathbb{R}_{+}^{k} be a fixed vector, and λ=(λi)i=1k\lambda=(\lambda_{i})_{i=1}^{k} be a random vector with independent coordinates λiLap(si)\lambda_{i}\sim\operatorname{Lap}_{\mathbb{Z}}(s_{i}). Then the map xF(x)+λx\mapsto F(x)+\lambda is ε\varepsilon-differentially private, where

ε=supx,x~F(x)F(x~)1(s).\varepsilon=\sup_{x,\tilde{x}}\mathinner{\!\left\lVert F(x)-F(\tilde{x})\right\rVert}_{\ell^{1}(s)}.

Here the supremum is over all pairs of input vectors in Ωn\Omega^{n} that differ in one coordinate, and z1(s)=i=1k|zi|/si\mathinner{\!\left\lVert z\right\rVert}_{\ell^{1}(s)}=\sum_{i=1}^{k}\mathinner{\!\left\lvert z_{i}\right\rvert}/{s_{i}}.

Proof of Lemma A.1.

Suppose x,x~Ωnx,\tilde{x}\in\Omega^{n} differ in exactly one coordinate. We compare the probabilities that the two inputs produce the same output y=F(x)+λ=F(x~)+λ~ky=F(x)+\lambda=F(\tilde{x})+\tilde{\lambda}\in\mathbb{Z}^{k}. We have

{F(x)+λ=y}{F(x~)+λ~=y}\displaystyle\frac{\mathbb{P}\left\{F(x)+\lambda=y\rule{0.0pt}{8.53581pt}\right\}}{\mathbb{P}\left\{F(\tilde{x})+\tilde{\lambda}=y\rule{0.0pt}{8.53581pt}\right\}} ={λ=yF(x)}{λ~=yF(x~)}\displaystyle=\frac{\mathbb{P}\left\{\lambda=y-F(x)\rule{0.0pt}{8.53581pt}\right\}}{\mathbb{P}\left\{{\tilde{\lambda}}={y-F(\tilde{x})}\rule{0.0pt}{8.53581pt}\right\}}
=i=1kexp(|(yF(x))i|si)i=1kexp(|(yF(x~))i|si)\displaystyle=\frac{\prod_{i=1}^{k}\exp\left(-\frac{|(y-F(x))_{i}|}{s_{i}}\right)}{\prod_{i=1}^{k}\exp\left(-\frac{|(y-F(\tilde{x}))_{i}|}{s_{i}}\right)}
=exp(i=1k1si(|(yF(x))i||(yF(x~))i|))\displaystyle=\exp\left(-\sum_{i=1}^{k}\frac{1}{s_{i}}\left(|(y-F(x))_{i}|-|(y-F(\tilde{x}))_{i}|\right)\right)
exp(F(x)F(x~)1(s))\displaystyle\leq\exp\left(\|F(x)-F(\tilde{x})\|_{\ell^{1}(s)}\right)
eε.\displaystyle\leq e^{\varepsilon}.

Therefore, the map xF(x)+λx\mapsto F(x)+\lambda is ε\varepsilon-differentially private. ∎

Proof of Theorem 4.1.

Consider the map F(𝒳)=(nθ)F(\mathcal{X})=(n_{\theta}) that transforms the input data into the vector of counts. Suppose a pair of input data 𝒳\mathcal{X} and 𝒳~\tilde{\mathcal{X}} differ in one point xix_{i}. Consider the corresponding vectors of counts (nθ)(n_{\theta}) and (n~θ)(\tilde{n}_{\theta}). For each level j=0,,rj=0,\ldots,r, the vectors of counts differ for a single θ{0,1}j\theta\in\{0,1\}^{j}, namely for the θ\theta that corresponds to the region Ωθ\Omega_{\theta} containing xix_{i}. Moreover, whenever such a difference occurs, we have |nθn~θ|=1\mathinner{\!\left\lvert n_{\theta}-\tilde{n}_{\theta}\right\rvert}=1. Thus, extending the vector (σj)j=0r(\sigma_{j})_{j=0}^{r} to (σθ)θ{0,1}r(\sigma_{\theta})_{\theta\in\{0,1\}^{\leq r}} trivially (by converting σj\sigma_{j} to σθ\sigma_{\theta} for all θ{0,1}j\theta\in\{0,1\}^{j}), we have

F(𝒳)F(𝒳~)1(σ)=j=0r1σjθ{0,1}j|nθn~θ|=j=0r1σj=ε.\mathinner{\!\left\lVert F(\mathcal{X})-F(\tilde{\mathcal{X}})\right\rVert}_{\ell^{1}(\sigma)}=\sum_{j=0}^{r}\frac{1}{\sigma_{j}}\sum_{\theta\in\{0,1\}^{j}}\mathinner{\!\left\lvert n_{\theta}-\tilde{n}_{\theta}\right\rvert}=\sum_{j=0}^{r}\frac{1}{\sigma_{j}}=\varepsilon.

Applying Lemma A.1, we conclude that the map 𝒳(nθ+λθ)\mathcal{X}\mapsto(n_{\theta}+\lambda_{\theta}) is ε\varepsilon-differentially private. ∎
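The mechanism in this proof is easy to simulate. The sketch below is ours (it assumes NumPy and is not taken from the paper’s code): it adds Lap(σj)\operatorname{Lap}_{\mathbb{Z}}(\sigma_{j}) noise to the counts at each level and reports the budget ε=j=0r1/σj\varepsilon=\sum_{j=0}^{r}1/\sigma_{j} guaranteed by Theorem 4.1:

import numpy as np

def discrete_laplace(sigma: float, size: int, rng: np.random.Generator) -> np.ndarray:
    """Lap_Z(sigma): integers with P(k) proportional to exp(-|k| / sigma) (our sketch)."""
    p = np.exp(-1.0 / sigma)
    g1 = rng.geometric(1 - p, size=size) - 1
    g2 = rng.geometric(1 - p, size=size) - 1
    return g1 - g2

def perturb_hierarchical_counts(counts_by_level, sigmas, rng):
    """Our sketch: add level-dependent discrete Laplace noise to the counts n_theta.

    counts_by_level[j] is the length-2^j array of counts at level j, and sigmas[j]
    is the noise magnitude sigma_j; the total privacy budget is sum_j 1 / sigma_j."""
    return [counts + discrete_laplace(sigma, counts.size, rng)
            for counts, sigma in zip(counts_by_level, sigmas)]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    counts_by_level = [np.array([100]), np.array([60, 40]), np.array([35, 25, 22, 18])]
    sigmas = [6.0, 6.0, 6.0]
    print("privacy budget eps =", sum(1 / s for s in sigmas))   # 0.5
    print(perturb_hierarchical_counts(counts_by_level, sigmas, rng))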

A.8. Proof of Theorem 4.3

Proof.

We use the method of Lagrange multipliers to find the optimal choices of σj\sigma_{j}. Given the maximal level rr, recall from Theorem 4.1 that the privacy budget is

ε=j=0r1σj.\varepsilon=\sum_{j=0}^{r}\frac{1}{\sigma_{j}}.

Therefore, we aim to minimize the accuracy bound subject to the specified privacy budget, namely

min𝔼W1(μ𝒳,μ𝒴)s.t. ε=j=0r1σj.\displaystyle\min\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\quad\;\textrm{s.t. }\varepsilon=\sum_{j=0}^{r}\frac{1}{\sigma_{j}}.

Recall the result in Theorem 4.2. Here ε,n\varepsilon,n are given, and δ\delta is fixed once the maximal level rr is chosen. So the minimization problem becomes

minj=0rσjΔj1s.t. ε=j=0r1σj.\displaystyle\min\sum_{j=0}^{r}\sigma_{j}\Delta_{j-1}\quad\;\textrm{s.t. }\varepsilon=\sum_{j=0}^{r}\frac{1}{\sigma_{j}}.

Consider the Lagrangian function

f(σ0,,σr;t):=j=0rσjΔj1t(j=0r1σjε)f(\sigma_{0},\dots,\sigma_{r};t)\mathrel{\mathop{\mathchar 58\relax}}=\sum_{j=0}^{r}\sigma_{j}\Delta_{j-1}-t\left(\sum_{j=0}^{r}\frac{1}{\sigma_{j}}-\varepsilon\right)

and the corresponding equation

fσ0==fσr=ft=0.\frac{\partial f}{\partial\sigma_{0}}=\dots=\frac{\partial f}{\partial\sigma_{r}}=\frac{\partial f}{\partial t}=0.

One can easily check that the equations above have a unique solution

σj=SεΔj1whereS=j=0rΔj1.\sigma_{j}=\frac{S}{\varepsilon\sqrt{\Delta_{j-1}}}\quad\text{where}\quad S=\sum_{j=0}^{r}\sqrt{\Delta_{j-1}}. (A.4)

and it is indeed a minimum point of the constrained problem.

As a result, if we fix ε\varepsilon and want Algorithm 4 to be ε\varepsilon-differentially private, we should choose the noise magnitudes as in (A.4). Substituting these noise magnitudes into the accuracy bound of Theorem 4.2, we see that the accuracy is bounded by 2εnS2+δ\frac{\sqrt{2}}{\varepsilon n}S^{2}+\delta. ∎
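For illustration, the optimal magnitudes (A.4) are straightforward to compute. The following sketch is ours (the function name is hypothetical): given the sequence of Δj1\Delta_{j-1}, it returns the σj\sigma_{j} and checks that they exhaust the privacy budget exactly:

import math

def optimal_sigmas(deltas, eps):
    """Our sketch of (A.4): deltas[j] plays the role of Delta_{j-1}, and
    sigma_j = S / (eps * sqrt(Delta_{j-1})) with S = sum_j sqrt(Delta_{j-1})."""
    S = sum(math.sqrt(x) for x in deltas)
    return [S / (eps * math.sqrt(x)) for x in deltas], S

if __name__ == "__main__":
    eps, n, d, r = 1.0, 10**6, 2, 10
    deltas = [2 ** ((1 - 1 / d) * (j - 1)) for j in range(r + 1)]   # the hypercube example of Corollary 4.4
    sigmas, S = optimal_sigmas(deltas, eps)
    print("budget check:", sum(1 / s for s in sigmas))              # equals eps
    print("leading accuracy term:", math.sqrt(2) * S**2 / (eps * n))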

A.9. Proof of Corollary 4.4

Proof.

Let Ω=[0,1]\Omega=[0,1] with the \ell^{\infty} metric. The natural hierarchical binary decomposition of [0,1][0,1] (cut through the middle) produces subintervals of length diam(Ωθ)=2j\operatorname{diam}(\Omega_{\theta})=2^{-j} for θ{0,1}j\theta\in\{0,1\}^{j}, so Δj=1\Delta_{j}=1 for all jj, and the resolution is δ=2r\delta=2^{-r}. Theorem 4.3 then yields ε\varepsilon-differentially private synthetic data with accuracy

𝔼W1(μ𝒳,μ𝒴)2(r+1)2εn+2r.\operatorname{\mathbb{E}}W_{1}\left(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}\right)\leq\frac{\sqrt{2}(r+1)^{2}}{\varepsilon n}+2^{-r}.

A nearly optimal choice for rr is r=log2(εn)1r=\log_{2}(\varepsilon n)-1, which yields

𝔼W1(μ𝒳,μ𝒴)(2+2)log22(εn)εn.\operatorname{\mathbb{E}}W_{1}\left(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}\right)\leq\frac{(2+\sqrt{2})\log_{2}^{2}(\varepsilon n)}{\varepsilon n}.

The optimal noise magnitudes, per (A.4), are σj=log2(εn)/ε\sigma_{j}=\log_{2}(\varepsilon n)/\varepsilon. In other words, the noise does not decay with the level.

Let Ω=[0,1]d\Omega=[0,1]^{d} for d>1d>1. The natural hierarchical binary decomposition of [0,1]d[0,1]^{d} (cut through the middle along a coordinate hyperplane) produces subregions of diameter diam(Ωθ)2j/d\operatorname{diam}(\Omega_{\theta})\asymp 2^{-j/d} for θ{0,1}j\theta\in\{0,1\}^{j}, so Δj=2j2j/d=2(11/d)j\Delta_{j}=2^{j}\cdot 2^{-j/d}=2^{(1-1/d)j} for all jj, and the resolution is δ=2r/d\delta=2^{-r/d}. Thus,

S=j=0rΔj1212(11d)r.S=\sum_{j=0}^{r}\sqrt{\Delta_{j-1}}\sim 2^{\frac{1}{2}(1-\frac{1}{d})r}.

Theorem 4.3 then yields ε\varepsilon-differentially private synthetic data with accuracy

𝔼W1(μ𝒳,μ𝒴)2(11d)rεn+2r/d.\operatorname{\mathbb{E}}W_{1}\left(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}\right)\lesssim\frac{2^{(1-\frac{1}{d})r}}{\varepsilon n}+2^{-r/d}.

A nearly optimal choice for the depth of the partition is r=log2(εn)r=\log_{2}(\varepsilon n), which yields

𝔼W1(μ𝒳,μ𝒴)(εn)1/d.\operatorname{\mathbb{E}}W_{1}\left(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}\right)\lesssim(\varepsilon n)^{-1/d}.

The optimal noise magnitudes, per (A.4), are

σjε1212(11d)(rj).\sigma_{j}\sim\varepsilon^{-1}2^{\frac{1}{2}(1-\frac{1}{d})(r-j)}.

Thus, the noise decays with the level jj, becoming O(1)O(1) per region for the smallest regions. ∎
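As a numerical companion to this corollary (our sketch; the function name is hypothetical), one can compute the σj\sigma_{j} from (A.4) for the dyadic decomposition of the cube and observe the contrast between d=1d=1, where the noise magnitude is constant across levels, and d2d\geq 2, where it decays geometrically and is O(1)O(1) at the finest level:

import math

def sigmas_for_hypercube(eps: float, n: int, d: int):
    """Our sketch: sigma_j from (A.4) for the dyadic decomposition of [0, 1]^d."""
    r = max(1, round(math.log2(eps * n)) - (1 if d == 1 else 0))   # depth as chosen in the proof
    deltas = [2 ** ((1 - 1 / d) * (j - 1)) for j in range(r + 1)]
    S = sum(math.sqrt(x) for x in deltas)
    return [S / (eps * math.sqrt(x)) for x in deltas]

if __name__ == "__main__":
    for d in (1, 2):
        s = sigmas_for_hypercube(eps=1.0, n=10**6, d=d)
        # d = 1: the same magnitude at every level; d = 2: geometric decay, O(1) at the bottom
        print(f"d={d}: top-level sigma = {s[0]:.1f}, finest-level sigma = {s[-1]:.1f}")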

A.10. Proof of Lemma 4.6

Proof.

If aa and bb are comparable, both quantities are zero. If aa and bb are not comparable, we can assume a1>b1a_{1}>b_{1} and a2<b2a_{2}<b_{2} without loss of generality. The set of points that are comparable to bb is

{(x1,x2)+2x1b1,x2b2}{(x1,x2)+2x1b1,x2b2}.\{(x_{1},x_{2})\in\mathbb{Z}_{+}^{2}\mid x_{1}\leq b_{1},x_{2}\leq b_{2}\}\cup\{(x_{1},x_{2})\in\mathbb{Z}_{+}^{2}\mid x_{1}\geq b_{1},x_{2}\geq b_{2}\}.

Note that the distance from aa to the first set is |a1b1|\mathinner{\!\left\lvert a_{1}-b_{1}\right\rvert} and the distance from aa to the second set is |a2b2|\mathinner{\!\left\lvert a_{2}-b_{2}\right\rvert}. Then flux(a,b)\operatorname{flux}(a,b) is the smaller one of the two distances, which is also the distance from aa to the union set. ∎
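In code, the quantity computed in this proof reads as follows (our illustrative helper; the name mirrors the paper’s notation):

def flux(a, b):
    """Illustrative helper: flux(a, b) for a, b in Z_+^2 is zero if a and b are
    comparable, and otherwise the smaller of the two coordinate gaps, as derived above."""
    comparable = (a[0] <= b[0] and a[1] <= b[1]) or (a[0] >= b[0] and a[1] >= b[1])
    return 0 if comparable else min(abs(a[0] - b[0]), abs(a[1] - b[1]))

if __name__ == "__main__":
    print(flux((5, 1), (2, 4)))   # incomparable: min(3, 3) = 3
    print(flux((3, 7), (1, 2)))   # comparable: 0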

A.11. Proof of Lemma 4.7

Proof.

Case 1: a=(a1,a2)a=(a_{1},a_{2}) and b=(b1,b2)b=(b_{1},b_{2}) are comparable. If aba\preceq b, add b1a1b_{1}-a_{1} balls to bin 1 and b2a2b_{2}-a_{2} balls to bin 2 to achieve the result. If bab\preceq a, remove a1b1a_{1}-b_{1} balls from bin 1 and a2b2a_{2}-b_{2} balls from bin 2 to achieve the result.

Case 2: a=(a1,a2)a=(a_{1},a_{2}) and b=(b1,b2)b=(b_{1},b_{2}) are incomparable. Without loss of generality, we can assume that a1b10a_{1}-b_{1}\geq 0, a2b20a_{2}-b_{2}\leq 0.

Assume first that a1b1b2a2a_{1}-b_{1}\geq b_{2}-a_{2}. Then flux(a,b)=b2a2M\operatorname{flux}(a,b)=b_{2}-a_{2}\eqqcolon M. Set Δ(a1+a2)(b1+b2)0\Delta\coloneqq(a_{1}+a_{2})-(b_{1}+b_{2})\geq 0. Removing Δ\Delta balls from bin 1 and transferring MM balls from bin 1 to bin 2 achieves the result. Note that there are enough balls in bin 11 to transfer, since M+Δ=a1b1[0,a1]M+\Delta=a_{1}-b_{1}\in[0,a_{1}].

Now assume that a1b1b2a2a_{1}-b_{1}\leq b_{2}-a_{2}. Then flux(a,b)=a1b1M\operatorname{flux}(a,b)=a_{1}-b_{1}\eqqcolon M. Set Δ(b1+b2)(a1+a2)0\Delta\coloneqq(b_{1}+b_{2})-(a_{1}+a_{2})\geq 0. Adding Δ\Delta balls to bin 2 and transferring MM balls from bin 1 to bin 2 achieves the result. ∎
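The case analysis above translates into a short routine (our sketch, not the paper’s code) that lists the moves turning the configuration aa into bb: one transfer of flux(a,b)\operatorname{flux}(a,b) balls when the two configurations are incomparable, plus additions or removals accounting for the difference in the total number of balls:

def transform_bins(a, b):
    """Our sketch: moves that turn the two-bin configuration a = (a1, a2) into
    b = (b1, b2), following the case analysis in the proof of Lemma 4.7."""
    d = [b[0] - a[0], b[1] - a[1]]                  # what each bin still needs
    moves = []
    if d[0] * d[1] < 0:                             # incomparable: one transfer is needed
        k = min(abs(d[0]), abs(d[1]))               # = flux(a, b), by Lemma 4.6
        src, dst = (0, 1) if d[0] < 0 else (1, 0)   # move from the bin with a surplus
        moves.append(("transfer", src + 1, dst + 1, k))
        d[src] += k
        d[dst] -= k
    for i in (0, 1):                                # remaining gaps: plain additions or removals
        if d[i] > 0:
            moves.append(("add", i + 1, d[i]))
        elif d[i] < 0:
            moves.append(("remove", i + 1, -d[i]))
    return moves

if __name__ == "__main__":
    print(transform_bins((7, 1), (3, 4)))   # [('transfer', 1, 2, 3), ('remove', 1, 1)]
    print(transform_bins((2, 2), (5, 3)))   # [('add', 1, 3), ('add', 2, 1)]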

A.12. Proof of Lemma 4.8

Proof.

First, we make the total number of points in Ω\Omega correct by adding mnm-n points to Ω\Omega (or removing, if that number is negative).

Apply Lemma 4.7 for the two parts of Ω\Omega: bin Ω0\Omega_{0} that contains n0n_{0} points and bin Ω1\Omega_{1} that contains n1n_{1} points. Since Ω\Omega already contains the correct total number of points mm, we can make the two bins contain the correct number of points, i.e. m0m_{0} and m1m_{1} respectively, by transferring flux((n0,n1),(m0,m1))\operatorname{flux}\left((n_{0},n_{1}),\,(m_{0},m_{1})\right) points from one bin to the other.

Apply Lemma 4.7 for the two parts of Ω0\Omega_{0}: bin Ω00\Omega_{00} that contains n00n_{00} points and bin Ω01\Omega_{01} that contains n01n_{01} points. Since Ω0\Omega_{0} already contains the correct number of points m0m_{0}, we can make the two bins contain the correct number of points, i.e. m00m_{00} and m01m_{01} respectively, by transferring flux((n00,n01),(m00,m01))\operatorname{flux}\left((n_{00},n_{01}),\,(m_{00},m_{01})\right) points from one bin to the other.

Similarly, since Ω1\Omega_{1} already has the correct number of points m1m_{1}, we can make Ω10\Omega_{10} and Ω11\Omega_{11} contain the correct number of points m10m_{10} and m11m_{11} by transferring flux((n10,n11),(m10,m11))\operatorname{flux}\left((n_{10},n_{11}),\,(m_{10},m_{11})\right) points from one bin to the other.

Continuing this way completes the proof. Note that the steps of the iterative procedure we described are interlocked: each step determines from which subregion the points transferred in the previous step are selected, and to which subregion they are moved. For example, the first step calls for adding (or removing) mnm-n points to or from Ω\Omega, but does not specify how these points are distributed between the two parts Ω0\Omega_{0} and Ω1\Omega_{1}; the application of Lemma 4.7 at the next step determines this. ∎
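The bookkeeping in this proof can be illustrated by the following recursion (our sketch; it assumes the counts are stored in dictionaries indexed by the binary strings θ\theta, and it only reports, for each region, the number of points to be transferred between its two halves):

def flux(a, b):
    """flux from Lemma 4.6: zero if comparable, otherwise the smaller coordinate gap."""
    d1, d2 = b[0] - a[0], b[1] - a[1]
    return 0 if d1 * d2 >= 0 else min(abs(d1), abs(d2))

def transfers_per_region(n_counts, m_counts, theta="", depth=2):
    """Our sketch: for each region Omega_theta above the finest level, the number of
    points moved between its two halves when matching (n_theta0, n_theta1) to (m_theta0, m_theta1)."""
    if len(theta) == depth:
        return {}
    a = (n_counts[theta + "0"], n_counts[theta + "1"])
    b = (m_counts[theta + "0"], m_counts[theta + "1"])
    out = {theta: flux(a, b)}
    out.update(transfers_per_region(n_counts, m_counts, theta + "0", depth))
    out.update(transfers_per_region(n_counts, m_counts, theta + "1", depth))
    return out

if __name__ == "__main__":
    n_counts = {"0": 6, "1": 4, "00": 5, "01": 1, "10": 1, "11": 3}   # current counts
    m_counts = {"0": 3, "1": 7, "00": 2, "01": 1, "10": 4, "11": 3}   # target counts
    print(transfers_per_region(n_counts, m_counts))   # {'': 3, '0': 0, '1': 0}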

A.13. Proof of Lemma 4.9

Proof.

We will derive this result from Lemma 4.6. First, let us compute the distance from a=(nθ0,nθ1)a=(n_{\theta 0},n_{\theta 1}) to b=(nθ0,nθ1)=((nθ0+λθ0)+,(nθ1+λθ1)+)b^{\prime}=(n^{\prime}_{\theta 0},n^{\prime}_{\theta 1})=\left((n_{\theta 0}+\lambda_{\theta 0})_{+},\,(n_{\theta 1}+\lambda_{\theta 1})_{+}\right). Since the map xx+x\mapsto x_{+} is 11-Lipschitz, we have

abmax(|λθ0|,|λθ1|).\mathinner{\!\left\lVert a-b^{\prime}\right\rVert}_{\infty}\leq\max\left(\mathinner{\!\left\lvert\lambda_{\theta 0}\right\rvert},\mathinner{\!\left\lvert\lambda_{\theta 1}\right\rvert}\right).

Furthermore, recall that by Algorithm 3, bb^{\prime} is comparable to b=(mθ0,mθ1)b=(m_{\theta 0},m_{\theta 1}). An application of Lemma 4.6 completes the proof. ∎

A.14. Proof of Lemma 4.10

Proof.

Finding the 1-Wasserstein distance in the discrete case is equivalent to solving an optimal transport problem. In fact, we can obtain μU\mu_{U} from μV\mu_{V} by moving |VU|\mathinner{\!\left\lvert V\setminus U\right\rvert} atoms of μV\mu_{V}, each having mass 1/|V|1/\mathinner{\!\left\lvert V\right\rvert}, and distributing their mass uniformly over UU. The distance traveled by each moved atom is bounded by diam(Ω)\operatorname{diam}(\Omega). Therefore the 11-Wasserstein distance between μU\mu_{U} and μV\mu_{V} is bounded by |VU||V|diam(Ω)\frac{|V\setminus U|}{|V|}\operatorname{diam}(\Omega). ∎

Appendix B Discrete Laplacian distribution

Recall that the classical Laplacian distribution Lap(σ)\operatorname{Lap}_{\mathbb{R}}(\sigma) is a continuous distribution with density

f(x)=12σexp(|x|/σ),x.f(x)=\frac{1}{2\sigma}\exp\left(-\mathinner{\!\left\lvert x\right\rvert}/\sigma\right),\quad x\in\mathbb{R}.

A random variable XLap(σ)X\sim\operatorname{Lap}_{\mathbb{R}}(\sigma) has zero mean and

Var(X)=2σ2.\operatorname{Var}(X)=2\sigma^{2}.

To deal with counts, it is more convenient to use the discrete Laplacian distribution Lap(σ)\operatorname{Lap}_{\mathbb{Z}}(\sigma), see [30], which has probability mass function

f(z)=1pσ1+pσexp(|z|/σ),zf(z)=\frac{1-p_{\sigma}}{1+p_{\sigma}}\exp\left(-\mathinner{\!\left\lvert z\right\rvert}/\sigma\right),\quad z\in\mathbb{Z}

where pσ=exp(1/σ)p_{\sigma}=\exp(-1/\sigma). A random variable ZLap(σ)Z\sim\operatorname{Lap}_{\mathbb{Z}}(\sigma) has zero mean and

Var(Z)=2pσ(1pσ)2.\operatorname{Var}(Z)=\frac{2p_{\sigma}}{(1-p_{\sigma})^{2}}.

Thus, one can verify that the discrete Laplacian has a smaller variance than its continuous counterpart:

Var(Z)<2σ2,\operatorname{Var}(Z)<2\sigma^{2}, (B.1)

but the two variances are asymptotically equivalent for large σ\sigma:

Var(Z)2σ21as σ.\frac{\operatorname{Var}(Z)}{2\sigma^{2}}\to 1\quad\text{as }\sigma\to\infty.
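A quick numerical check of (B.1) and of this asymptotic comparison (the snippet is ours):

import math

def discrete_laplace_variance(sigma: float) -> float:
    """Our check: variance of Lap_Z(sigma), namely 2 p / (1 - p)^2 with p = exp(-1 / sigma)."""
    p = math.exp(-1.0 / sigma)
    return 2 * p / (1 - p) ** 2

if __name__ == "__main__":
    for sigma in (0.5, 1.0, 2.0, 10.0, 100.0):
        v = discrete_laplace_variance(sigma)
        # the variance stays below 2 * sigma^2, and the ratio tends to 1 as sigma grows
        print(f"sigma = {sigma:7}: Var = {v:12.4f},  2 sigma^2 = {2 * sigma**2:12.4f},  ratio = {v / (2 * sigma**2):.4f}")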