
Algorithmically Effective Differentially Private Synthetic Data

Yiyun He Department of Mathematics, University of California Irvine, Irvine, CA 92697 [email protected] Roman Vershynin Department of Mathematics, University of California Irvine, Irvine, CA 92697 [email protected]  and  Yizhe Zhu Department of Mathematics, University of California Irvine, Irvine, CA 92697 [email protected]
Abstract.

We present a highly effective algorithmic approach for generating ε\varepsilon-differentially private synthetic data in a bounded metric space with near-optimal utility guarantees under the 1-Wasserstein distance. In particular, for a dataset 𝒳\mathcal{X} in the hypercube [0,1]d[0,1]^{d}, our algorithm generates a synthetic dataset 𝒴\mathcal{Y} such that the expected 1-Wasserstein distance between the empirical measures of 𝒳\mathcal{X} and 𝒴\mathcal{Y} is O((εn)1/d)O((\varepsilon n)^{-1/d}) for d2d\geq 2, and is O(log2(εn)(εn)1)O(\log^{2}(\varepsilon n)(\varepsilon n)^{-1}) for d=1d=1. The accuracy guarantee is optimal up to a constant factor for d2d\geq 2, and up to a logarithmic factor for d=1d=1. Our algorithm has a fast running time of O(εdn)O(\varepsilon dn) for all d1d\geq 1 and demonstrates improved accuracy compared to the method in [12] for d2d\geq 2.

1. Introduction

Differential privacy has become the benchmark for privacy protection in scenarios where vast amounts of data need to be analyzed. The aim of differential privacy is to prevent the disclosure of information about individual participants in the dataset. In simple terms, an algorithm that has a randomized output and produces similar results when given two adjacent datasets is considered to be differentially private. This method of privacy protection is increasingly being adopted and implemented in various fields, including the 2020 US Census [2, 29, 28] and numerous machine learning tasks [24].

A wide range of data computations can be performed in a differentially private manner, including regression [17], clustering [37], parameter estimation [21], stochastic gradient descent [36], and deep learning [1]. However, many existing works on differential privacy focus on designing algorithms for specific tasks and are restricted to queries that are predefined before use. This requires expert knowledge and often involves modifying existing algorithms.

One promising solution to this challenge is to generate a synthetic dataset similar to the original dataset with guaranteed differential privacy [27, 8, 31, 7, 10, 11, 12]. As any downstream tasks are based on the synthetic dataset, they can be performed without incurring additional privacy costs.

1.1. Private synthetic data

Mathematically, the problem of generating private synthetic data can be defined as follows. Let (Ω,ρ)(\Omega,\rho) be a metric space. Consider a dataset 𝒳=(X1,,Xn)Ωn\mathcal{X}=(X_{1},\dots,X_{n})\in\Omega^{n}. Our goal is to construct an efficient randomized algorithm that outputs differentially private synthetic data 𝒴=(Y1,,Ym)Ωm\mathcal{Y}=(Y_{1},\dots,Y_{m})\in\Omega^{m} such that the two empirical measures

μ𝒳=1ni=1nδXiandμ𝒴=1mi=1mδYi\mu_{\mathcal{X}}=\frac{1}{n}\sum_{i=1}^{n}\delta_{X_{i}}\quad\text{and}\quad\mu_{\mathcal{Y}}=\frac{1}{m}\sum_{i=1}^{m}\delta_{Y_{i}}

are close to each other. We measure the utility of the output by 𝔼W1(μ𝒳,μ𝒴)\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}), where W1(μ𝒳,μ𝒴)W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}) is the 1-Wasserstein distance, and the expectation is taken over the randomness of the algorithm. The Kantorovich-Rubinstein duality (see, e.g., [47]) gives an equivalent representation of the 1-Wasserstein distance between the two measures μ𝒳\mu_{\mathcal{X}} and μ𝒴\mu_{\mathcal{Y}}:

W1(μ𝒳,μ𝒴)=supLip(f)1(fdμ𝒳fdμ𝒴),W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})=\sup_{{\mathrm{Lip}(f)}\leq 1}\left(\int f\mathrm{d}\mu_{\mathcal{X}}-\int f\mathrm{d}\mu_{\mathcal{Y}}\right), (1.1)

where the supremum is taken over the set of all 11-Lipschitz functions on Ω\Omega. Since many machine learning algorithms are Lipschitz [48, 32, 15, 35], Equation (1.1) provides a uniform accuracy guarantee for a wide range of machine learning tasks performed on synthetic datasets whose empirical measure is close to μ𝒳\mu_{\mathcal{X}} in the 1-Wasserstein distance.

1.2. Main results

The most straightforward way to construct differentially private synthetic data is to add independent noise to the location of each data point. However, this method can result in a significant loss of data utility, as the amount of noise needed for privacy protection may be too large [20]. Another direct approach could be to add noise to the density function of the empirical measure of 𝒳\mathcal{X}, by dividing Ω\Omega into small subregions and perturbing the true counts in each subregion. However, Laplacian noise may perturb the count in a certain subregion to a negative value, causing the output to become a signed measure. To address this issue, we introduce an algorithmically effective method called the Private Measure Mechanism.

Private Measure Mechanism (PMM)

PMM makes the count zero if the noisy count in a subregion is negative. Instead of a single partition of Ω\Omega, we consider a collection of binary hierarchical partitions on Ω\Omega and add inhomogeneous noise to each level of the partition. However, the counts of two subregions do not always add up to the count of the region at a higher level. We develop an algorithm that enforces the consistency of counts in regions at different levels. PMM has O(εdn)O(\varepsilon dn) running time while the running time of the approach in [12] is polynomial in nn.

The accuracy analysis of PMM uses the hierarchical partitions to estimate the 1-Wasserstein distance in terms of the multi-scale geometry of Ω\Omega and the noise magnitude in each level of the partition. In particular, when Ω=[0,1]d\Omega=[0,1]^{d}, by optimizing the choice of the hierarchical partitions and noise magnitude, PMM achieves better accuracy compared to [12] for d2d\geq 2. The accuracy rate is optimal up to a constant factor for d2d\geq 2, and up to a logarithmic factor for d=1d=1. We state this in Theorem 1.1 below.

Hierarchical partitions have appeared in many previous works on the approximation of distributions under Wasserstein distances in a non-private setting, including [4, 18, 50]. In the differential privacy literature, hierarchical partitions are also closely related to the binary tree mechanism [22, 16] for differential privacy under continual observation. However, the accuracy analysis of the two mechanisms is significantly different. In addition, the TopDown algorithm in the 2020 census [3] also has a similar hierarchical structure and enforces consistency, but the accuracy analysis of the algorithm is not provided in [3].

Theorem 1.1 (PMM for data in a hypercube).

Let Ω=[0,1]d\Omega=[0,1]^{d} equipped with the \ell^{\infty} metric. PMM outputs an ε\varepsilon-differentially private synthetic dataset 𝒴\mathcal{Y} in time O(εdn)O(\varepsilon dn) such that

𝔼W1(μ𝒳,μ𝒴){Clog2(εn)(εn)1 if d=1,C(εn)1d if d2.\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\leq\left\{\begin{aligned} &C\log^{2}(\varepsilon n)(\varepsilon n)^{-1}&\quad\textrm{ if }d=1,\\ &C{(\varepsilon n)^{-\frac{1}{d}}}&\quad\textrm{ if }d\geq 2.\end{aligned}\right.

Private Signed Measure Mechanism (PSMM)

In addition to PMM, we introduce an alternative method, the Private Signed Measure Mechanism, that achieves optimal accuracy rate on [0,1]d[0,1]^{d} when d3d\geq 3 in poly(n)\text{poly}(n) time. The analysis of PSMM is not restricted to 1-Wasserstein distance, and it can be generalized to provide a uniform utility guarantee of other function classes.

We first partition the domain Ω\Omega into mm subregions Ω1,,Ωm\Omega_{1},\dots,\Omega_{m}. Perturbing the counts in each subregion with i.i.d. integer Laplacian noise gives an unbiased approximation of μ𝒴\mu_{\mathcal{Y}} by a signed measure ν\nu. Then we find the closest probability measure ν^\hat{\nu} under the bounded Lipschitz distance by solving a linear programming problem.

In the proof of accuracy for PSMM, one ingredient is to estimate the Laplacian complexity of the Lipschitz function class on Ω\Omega and connect it to the 1-Wasserstein distance. This type of argument is similar in spirit to the optimal matching problem for two sets of random points in a metric space [38, 39, 9]. When Ω=[0,1]d\Omega=[0,1]^{d}, PSMM achieves the optimal accuracy rate O((εn)1/d)O((\varepsilon n)^{-1/d}) for d3d\geq 3. For d=2d=2, PSMM achieves a near-optimal accuracy O(log(εn)(εn)1/2)O(\log(\varepsilon n)(\varepsilon n)^{-1/2}). For d=1d=1, the accuracy becomes O((εn)1/2)O((\varepsilon n)^{-1/2}).

For the case when d=2d=2, we believe that the bound in Corollary 3.7 could be improved to Clog(εn)/εnC\sqrt{\log(\varepsilon n)}/\sqrt{\varepsilon n} by replacing Dudley’s chaining bound in Proposition 3.2 with the generic chaining bound in [39, 19] involving the γ1\gamma_{1} and γ2\gamma_{2} functionals on Ω\Omega. We will not pursue this direction in this paper.

Comparison to previous results

[42] proved that it is NP-hard to generate private synthetic data on the Boolean cube which approximately preserves all two-dimensional marginals, assuming the existence of one-way functions. There exists a substantial body of work for differentially private synthetic data with guarantees limited to accuracy bounds for a finite set of specified queries [5, 40, 23, 43, 33, 46, 12, 13, 14].

[49] considered differentially private synthetic data in [0,1]d[0,1]^{d} with guarantees for any smooth queries with bounded partial derivatives of order KK, and achieved an accuracy of O(ε1nK2d+K)O(\varepsilon^{-1}n^{-\frac{K}{2d+K}}). Recently, [12] introduced a method based on superregular random walks to generate differentially private synthetic data with near-optimal guarantees in general compact metric spaces. In particular, when the dataset is in [0,1]d[0,1]^{d}, they obtain 𝔼W1(μ𝒳,μ𝒴)Clog32d(εn)(εn)1d\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\leq C\log^{\frac{3}{2d}}(\varepsilon n)(\varepsilon n)^{-\frac{1}{d}}. A corresponding lower bound of order n1/dn^{-1/d} was also proved in [12, Corollary 9.3]. PMM matches the lower bound up to a constant factor for d2d\geq 2, and up to a logarithmic factor for d=1d=1.

In terms of computational efficiency, PMM runs in time O(εdn)O(\varepsilon dn). This is more efficient compared to the algorithm in [12].

Organization of the paper

The rest of the paper is organized as follows. In Section 2, we introduce some background on differential privacy and distances between measures. Before presenting our main result, we first introduce and analyze the easier and more direct method, PSMM. In Section 3, we describe PSMM in detail and prove its privacy and accuracy for data in a bounded metric space, with detailed results provided for the case of the hypercube. In Section 4, we introduce PMM and analyze its privacy and accuracy. By optimizing the choice of noise parameters, we obtain the optimal accuracy on the hypercube with O(εdn)O(\varepsilon dn) running time, which proves Theorem 1.1.

Additional proofs are included in Appendix A. We use a variant of the Laplacian distribution, called the discrete Laplacian distribution, in PMM and PSMM. Its definition and properties are included in Appendix B.

2. Preliminaries

Differential Privacy

We use the following definition from [24]. A randomized algorithm \mathcal{M} provides ε\varepsilon-differential privacy if for any input datasets D,DD,D^{\prime} that differ in only one element (i.e., DD and DD^{\prime} are adjacent datasets) and for any measurable set Srange()S\subseteq\mathrm{range}(\mathcal{M}), we have

{(D)S}{(D)S}eε.\frac{\mathbb{P}\left\{\mathcal{M}(D)\in S\rule{0.0pt}{8.53581pt}\right\}}{\mathbb{P}\left\{\mathcal{M}(D^{\prime})\in S\rule{0.0pt}{8.53581pt}\right\}}\leq e^{\varepsilon}.

Here the probability is taken over the randomness of \mathcal{M}.

Wasserstein distance

Consider a metric space (Ω,ρ)(\Omega,\rho) with two probability measures μ,ν\mu,\nu. Then the 1-Wasserstein distance (see e.g., [47] for more details) between them is defined as

W1(μ,ν):=infγΓ(μ,ν)Ω×Ωρ(x,y)dγ(x,y),W_{1}(\mu,\nu)\mathrel{\mathop{\mathchar 58\relax}}=\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{\Omega\times\Omega}\rho(x,y)\mathrm{d}\gamma(x,y),

where Γ(μ,ν)\Gamma(\mu,\nu) is the set of all couplings of μ\mu and ν\nu.

Bounded Lipschitz distance

Let (Ω,ρ)(\Omega,\rho) be a bounded metric space. The Lipschitz norm of a function ff is defined as

fLip:=max{Lip(f),fdiam(Ω)},\mathinner{\!\left\lVert f\right\rVert}_{\mathrm{Lip}}\mathrel{\mathop{\mathchar 58\relax}}=\max\left\{\mathrm{Lip}(f),\;\frac{\mathinner{\!\left\lVert f\right\rVert}_{\infty}}{\operatorname{diam}(\Omega)}\right\},

where Lip(f)\mathrm{Lip}(f) is the Lipschitz constant of ff. Let \mathcal{F} be the set of all Lipschitz functions ff on Ω\Omega with fLip1\mathinner{\!\left\lVert f\right\rVert}_{\mathrm{Lip}}\leq 1. For signed measures μ,ν\mu,\nu, we define the bounded Lipschitz distance:

dBL(μ,ν):=supf(fdμfdν).d_{\mathrm{BL}}(\mu,\nu)\mathrel{\mathop{\mathchar 58\relax}}=\sup_{f\in\mathcal{F}}\left(\int f\mathrm{d}\mu-\int f\mathrm{d}\nu\right).

One can easily check that this is a metric. Moreover, in the special case where μ\mu and ν\nu are both probability measures, shifting ff by a constant does not change the value of fdμfdν\int f\mathrm{d}\mu-\int f\mathrm{d}\nu. Therefore, for a bounded domain Ω\Omega, we can always assume f(x0)=0f(x_{0})=0 for a fixed x0Ωx_{0}\in\Omega, so that fdiam(Ω)\mathinner{\!\left\lVert f\right\rVert}_{\infty}\leq\operatorname{diam}(\Omega) when computing the supremum in (1.1). This implies that the dBLd_{\mathrm{BL}}-metric coincides with the classical W1W_{1}-metric when μ,ν\mu,\nu are both probability measures on a bounded domain Ω\Omega:

W1(μ,ν)\displaystyle W_{1}(\mu,\nu) =supLip(f)1(fdμfdν)=supf(fdμfdν)=dBL(μ,ν).\displaystyle=\sup_{{\mathrm{Lip}(f)\leq 1}}\left(\int f\mathrm{d}\mu-\int f\mathrm{d}\nu\right)=\sup_{f\in\mathcal{F}}\left(\int f\mathrm{d}\mu-\int f\mathrm{d}\nu\right)=d_{\mathrm{BL}}(\mu,\nu). (2.1)

3. Private signed measure mechanism (PSMM)

We will first introduce PSMM, which is an easier and more intuitive approach. The procedure of PSMM is formally described in Algorithm 1. Note that in the output step of Algorithm 1, the size of the synthetic data mm^{\prime} depends on the rational approximation of the density function of ν^\hat{\nu}, and we discuss the details here. Let ν^1,,ν^m\hat{\nu}_{1},\dots,\hat{\nu}_{m} be the weights of the probability measure ν^\hat{\nu} on y1,,ymy_{1},\dots,y_{m}, respectively. We can choose rational numbers r1,,rmr_{1},\dots,r_{m} such that maxi[m]|riν^i|\max_{i\in[m]}|r_{i}-\hat{\nu}_{i}| is arbitrarily small. Let mm^{\prime} be the least common multiple of the denominators of r1,,rmr_{1},\dots,r_{m}; then we output the synthetic dataset 𝒴^\hat{\mathcal{Y}} containing mrim^{\prime}r_{i} copies of yiy_{i} for i=1,,mi=1,\dots,m.
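To make this rounding step concrete, here is a minimal Python sketch, assuming the weights of ν^\hat{\nu} are given as floating-point numbers. The function name measure_to_dataset and the parameter max_denom are hypothetical; max_denom controls how small maxi[m]|riν^i|\max_{i\in[m]}|r_{i}-\hat{\nu}_{i}| is.

```python
from fractions import Fraction
from math import lcm

def measure_to_dataset(points, weights, max_denom=1000):
    # Approximate each weight by a rational r_i with denominator at most
    # max_denom, take m' = lcm of the denominators, and output m'*r_i copies
    # of y_i.  (Illustrative only: the rounded weights need not sum exactly
    # to 1, so the output size can differ slightly from m'.)
    fracs = [Fraction(w).limit_denominator(max_denom) for w in weights]
    m_prime = lcm(*(f.denominator for f in fracs))
    synthetic = []
    for y, f in zip(points, fracs):
        synthetic.extend([y] * (m_prime * f.numerator // f.denominator))
    return synthetic

# Example: weights (1/3, 2/3) give m' = 3 and the dataset [y_1, y_2, y_2].
```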

Before analyzing the privacy and accuracy of PSMM, we introduce a useful complexity measure of a given function class, which quantifies the influence of the Laplacian noise on the function class.

Algorithm 1 Private Signed Measure Mechanism
Input: true data 𝒳=(x1,,xn)Ωn\mathcal{X}=(x_{1},\ldots,x_{n})\in\Omega^{n}, partition (Ω1,,Ωm)(\Omega_{1},\dots,\Omega_{m}) of Ω\Omega, privacy parameter ε>0\varepsilon>0.
Compute the true counts:

Compute the true count in each region ni=#{xjΩi:j[n]}n_{i}=\#\{x_{j}\in\Omega_{i}\mathrel{\mathop{\mathchar 58\relax}}j\in[n]\}.

Create a new dataset:

Choose any element yiΩiy_{i}\in\Omega_{i} independently of 𝒳\mathcal{X}, and let 𝒴\mathcal{Y} be the collection of nin_{i} copies of yiy_{i} for each i[m]i\in[m].

Add noise:

Perturb the empirical measure μ𝒴\mu_{\mathcal{Y}} of 𝒴\mathcal{Y} and obtain a signed measure ν\nu such that

ν({yi}):=(ni+λi)/n,\nu(\{y_{i}\})\mathrel{\mathop{\mathchar 58\relax}}=(n_{i}+\lambda_{i})/n,

where λiLap(1/ε)\lambda_{i}\sim\operatorname{Lap}_{\mathbb{Z}}(1/\varepsilon) are i.i.d. discrete Laplacian random variables.

Linear programming:

Find the closest probability measure ν^\hat{\nu} of ν\nu in dBLd_{\mathrm{BL}}-metric using Algorithm 2, and generate synthetic data 𝒴^\widehat{\mathcal{Y}} from ν^\hat{\nu}.

Output: synthetic data 𝒴^=(y1,,ym)Ωm\widehat{\mathcal{Y}}=(y_{1},\ldots,y_{m^{\prime}})\in\Omega^{m^{\prime}} for some integer mm^{\prime}.
Algorithm 2 Linear Programming
Input: A discrete signed measure ν\nu supported on 𝒴={y1,,ym}\mathcal{Y}=\{y_{1},\dots,y_{m}\}.
Compute the distances:

Compute the pairwise distances {yiyj,i>j}\{\|y_{i}-y_{j}\|_{\infty},i>j\}.

Solve the linear programming:

Solve the linear programming problem with 2m2+2m2m^{2}+2m variables and m+1m+1 constraints:

\begin{aligned}
\min\quad &\sum_{i,j=1}^{m}\|y_{i}-y_{j}\|_{\infty}(u_{ij}+u_{ij}^{\prime})+2\sum_{i=1}^{m}v_{i}\\
\text{s.t.}\quad &\sum_{j=1}^{m}(u_{ij}-u_{ij}^{\prime})+v_{i}+\tau_{i}\geq\nu(\{y_{i}\}), &&\forall\, i\leq m,\\
&\sum_{i=1}^{m}\tau_{i}=1,\\
&u_{ij},\,u_{ij}^{\prime},\,v_{i},\,\tau_{i}\geq 0, &&\forall\, i,j\leq m,\ i\neq j.
\end{aligned}
Output: a probability measure ν^\hat{\nu} with ν^({yi})=τi\hat{\nu}(\{y_{i}\})=\tau_{i}.
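This linear program can be handed to any off-the-shelf LP solver. Below is a minimal sketch using scipy.optimize.linprog, with the variables ordered as (u,u,v,τ)(u,u^{\prime},v,\tau); project_to_probability is a hypothetical name, and the diagonal variables uii,uiiu_{ii},u_{ii}^{\prime} are pinned to zero to respect the constraint iji\neq j.

```python
import numpy as np
from scipy.optimize import linprog

def project_to_probability(points, nu):
    # Sketch of Algorithm 2: find the probability weights tau on the support
    # points that are closest to the signed measure nu in the d_BL metric,
    # by solving the linear program above.  points has shape (m, d).
    pts = np.asarray(points, dtype=float).reshape(len(points), -1)
    m = pts.shape[0]
    D = np.max(np.abs(pts[:, None, :] - pts[None, :, :]), axis=2)  # ell_inf distances
    n_var = 2 * m * m + 2 * m                     # u, u', v, tau in that order
    c = np.concatenate([D.ravel(), D.ravel(), 2.0 * np.ones(m), np.zeros(m)])
    # inequality constraints: -(sum_j (u_ij - u'_ij) + v_i + tau_i) <= -nu_i
    A_ub = np.zeros((m, n_var))
    for i in range(m):
        A_ub[i, i * m:(i + 1) * m] = -1.0                  # -u_ij
        A_ub[i, m * m + i * m:m * m + (i + 1) * m] = 1.0   # +u'_ij
        A_ub[i, 2 * m * m + i] = -1.0                      # -v_i
        A_ub[i, 2 * m * m + m + i] = -1.0                  # -tau_i
    b_ub = -np.asarray(nu, dtype=float)
    A_eq = np.zeros((1, n_var))
    A_eq[0, 2 * m * m + m:] = 1.0                          # sum_i tau_i = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_var
    for i in range(m):                                     # enforce i != j
        bounds[i * m + i] = (0, 0)
        bounds[m * m + i * m + i] = (0, 0)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[2 * m * m + m:]                           # the weights tau_i
```

Once the weights τi\tau_{i} are available, the rational rounding sketched above converts them into an explicit synthetic dataset 𝒴^\widehat{\mathcal{Y}}.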

3.1. Laplacian complexity

Given the Kantorovich-Rubinstein duality (1.1), to control the W1W_{1}-distance between the original measure and the private measure, we need to describe how Lipschitz functions behave under Laplacian noise. As an analog of the worst-case Rademacher complexity [6, 25], we consider the worst-case Laplacian complexity. Such a worst-case complexity measure arises because the original dataset is deterministic, without any distributional assumption.

Definition 3.1 (Worst-case Laplacian complexity).

Let \mathcal{F} be a function class on a metric space Ω\Omega. The worst-case Laplacian complexity of \mathcal{F} is defined by

Ln():=supX1,,XnΩ𝔼[supf|1ni=1nλif(Xi)|],\displaystyle L_{n}(\mathcal{F})\mathrel{\mathop{\mathchar 58\relax}}=\sup_{X_{1},\dots,X_{n}\in\Omega}\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}\lambda_{i}f(X_{i})\right|\right], (3.1)

where λ1,,λnLap(1)\lambda_{1},\dots,\lambda_{n}\sim\operatorname{Lap}(1) are i.i.d. random variables.

Since Laplacian random variables are sub-exponential but not sub-gaussian, the Laplacian complexity is not equivalent to the Gaussian or Rademacher complexity; rather, it is related to the suprema of mixed tail processes [19] and the quadratic empirical process [34]. Our next proposition bounds Ln()L_{n}(\mathcal{F}) in terms of the covering numbers of \mathcal{F}. Its proof is a classical application of Dudley’s chaining method (see, e.g., [45]).

Proposition 3.2 (Bounding Laplacian complexity with Dudley’s entropy integral).

Suppose that (Ω,ρ)(\Omega,\rho) is a metric space and \mathcal{F} is a set of functions on Ω\Omega. Then

Ln()Cinfα>0(2α+1nαlog𝒩(,u,)du+1nαlog𝒩(,u,)du)L_{n}(\mathcal{F})\leq C\inf_{\alpha>0}\left(2\alpha+\frac{1}{\sqrt{n}}\int_{\alpha}^{\infty}\sqrt{{\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})}}\mathrm{d}u+\frac{1}{{n}}\int_{\alpha}^{\infty}{\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})}\mathrm{d}u\right)

where 𝒩(,u,)\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty}) is the covering number of \mathcal{F} and C>0C>0 is an absolute constant.

In particular, we are interested in the case where \mathcal{F} is the class of all bounded Lipschitz functions. One can find in [41], or in the more explicit bound of [26], that for the set \mathcal{F} of functions ff with fLip1\mathinner{\!\left\lVert f\right\rVert}_{\mathrm{Lip}}\leq 1, the covering number satisfies

𝒩(,u,)(8u)𝒩(Ω,u/2,ρ).\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})\leq\left(\frac{8}{u}\right)^{\mathcal{N}(\Omega,{u}/{2},\rho)}.

When Ω=[0,1]d\Omega=[0,1]^{d}, a better bound on the covering number for Lipschitz functions is available from [41, 48]:

𝒩(,u,)(22/u+1)2𝒩([0,1]d,u/2,),\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})\leq\left(2\left\lceil{2}/u\right\rceil+1\right)2^{\mathcal{N}\left([0,1]^{d},u/2,\|\cdot\|_{\infty}\right)},

which implies the following corollary.

Corollary 3.3 (Laplacian complexity for Lipschitz functions on the hypercube).

Let Ω=[0,1]d\Omega=[0,1]^{d} with the \|\cdot\|_{\infty} metric, and \mathcal{F} be the set of all Lipschitz functions ff on Ω\Omega with fLip1\mathinner{\!\left\lVert f\right\rVert}_{\mathrm{Lip}}\leq 1. We have

Ln(){Cn1/2 if d=1,Clognn1/2 if d=2,Cd1n1/d if d3.L_{n}(\mathcal{F})\leq\left\{\begin{aligned} &Cn^{-1/2}\quad&\textrm{ if }d=1,\\ &{C\log n}\cdot n^{-1/2}\quad&\textrm{ if }d=2,\\ &Cd^{-1}{n^{-{1}/{d}}}\quad&\textrm{ if }d\geq 3.\end{aligned}\right.

Discrete Laplacian complexity

Laplacian complexity can be useful for differential privacy algorithms based on the Laplacian mechanism [24]. However, since PSMM perturbs the counts in each subregion, it is more convenient for us to add integer-valued noise to the true counts. We will therefore use the worst-case discrete Laplacian complexity defined below:

L~n():=supX1,,XnΩ𝔼[supf|1ni=1nλif(Xi)|],\displaystyle\widetilde{L}_{n}(\mathcal{F})\mathrel{\mathop{\mathchar 58\relax}}=\sup_{X_{1},\dots,X_{n}\in\Omega}\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}\lambda_{i}f(X_{i})\right|\right], (3.2)

where λ1,,λnLap(1)\lambda_{1},\dots,\lambda_{n}\sim\operatorname{Lap}_{\mathbb{Z}}(1) are i.i.d. discrete Laplacian random variables.

In particular, Lap(1)\operatorname{Lap}_{\mathbb{Z}}(1) has a bounded sub-exponential norm; therefore, the proof of Proposition 3.2 works for discrete Laplacian random variables as well. Consequently, Corollary 3.3 also holds for L~n()\widetilde{L}_{n}(\mathcal{F}), with a different absolute constant CC.
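Sampling from a discrete Laplacian is straightforward: a standard construction (cf. [30]) writes it as the difference of two i.i.d. geometric random variables. The sketch below assumes the parametrization in which Lap(σ)\operatorname{Lap}_{\mathbb{Z}}(\sigma) has probability mass function proportional to exp(|k|/σ)\exp(-|k|/\sigma); the exact convention used in this paper is the one given in Appendix B.

```python
import numpy as np

def discrete_laplace(sigma, size=None, rng=None):
    # Discrete Laplacian with P(k) proportional to exp(-|k|/sigma), sampled as
    # the difference of two i.i.d. geometric random variables.
    rng = np.random.default_rng() if rng is None else rng
    p = 1.0 - np.exp(-1.0 / sigma)           # success probability of the geometrics
    g1 = rng.geometric(p, size=size) - 1     # numpy's geometric is supported on {1,2,...}
    g2 = rng.geometric(p, size=size) - 1
    return g1 - g2
```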

3.2. Privacy and Accuracy of Algorithm 1

The privacy guarantee of Algorithm 1 can be proved by checking the definition. The essence of the proof is the same as the classical Laplacian mechanism [24].

Proposition 3.4 (Privacy of Algorithm 1).

Algorithm 1 is ε\varepsilon-differentially private.

We now turn to accuracy. The linear program stated in Algorithm 2 has 2m2+2m2m^{2}+2m variables and m+1m+1 constraints, and can be solved in time polynomial in mm. We first show, in the next proposition, that Algorithm 2 indeed outputs the closest probability measure to ν\nu in the dBLd_{\mathrm{BL}}-distance.

Proposition 3.5.

For a discrete signed measure ν\nu on Ω\Omega, Algorithm 2 outputs the probability measure with the same support set that is closest to ν\nu in the dBLd_{\mathrm{BL}}-distance, with running time polynomial in mm.

Now we are ready to analyze the accuracy of Algorithm 1. In PSMM, independent Laplacian noise is added to the count of each sub-region. Therefore, the Laplacian complexity arises when considering the expected Wasserstein distance between the original empirical measure and the synthetic measure.

Theorem 3.6 (Accuracy of Algorithm 1).

Suppose (Ω1,,Ωm)(\Omega_{1},\dots,\Omega_{m}) is a partition of (Ω,ρ)(\Omega,\rho) and \mathcal{F} is the set of all functions with Lipschitz norm bounded by 11. Then the measure ν^\hat{\nu} generated from Algorithm 1 satisfies

𝔼W1(μ𝒳,ν^)maxidiam(Ωi)+2mεnL~m().\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\hat{\nu})\leq\max_{i}\operatorname{diam}(\Omega_{i})+\frac{2m}{\varepsilon n}\widetilde{L}_{m}(\mathcal{F}).

Note that diam(Ωi)m1/d\operatorname{diam}(\Omega_{i})\asymp m^{-1/d} can be satisfied when we take a partition of Ω=[0,1]d\Omega=[0,1]^{d} where each Ωi\Omega_{i} is a subcube of the same size. Using the formula above and the result of Laplacian complexity for the hypercube in Corollary 3.3, one can easily deduce the following result.

Corollary 3.7 (Accuracy of Algorithm 1 on the hypercube).

Take m=εnm=\lceil\varepsilon n\rceil and let (Ω1,,Ωm)(\Omega_{1},\dots,\Omega_{m}) be a partition of Ω=[0,1]d\Omega=[0,1]^{d} with the norm \|\cdot\|_{\infty}. Assume that diam(Ωi)m1/d\operatorname{diam}(\Omega_{i})\asymp m^{-1/d}. Then the measure ν^\hat{\nu} generated from Algorithm 1 satisfies

𝔼W1(μ𝒳,ν^){C(εn)12 if d=1,Clog(εn)(εn)12 if d=2,C(εn)1d if d3.\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\hat{\nu})\leq\left\{\begin{aligned} &C(\varepsilon n)^{-\frac{1}{2}}&\quad\textrm{ if }d=1,\\ &{C}\log(\varepsilon n)(\varepsilon n)^{-\frac{1}{2}}&\quad\textrm{ if }d=2,\\ &C{(\varepsilon n)^{-\frac{1}{d}}}&\quad\textrm{ if }d\geq 3.\end{aligned}\right.
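For concreteness, a minimal sketch of the counting and perturbation steps of PSMM for this equal-subcube partition of [0,1]d[0,1]^{d} is given below. It reuses the discrete_laplace sampler sketched above; noisy_cell_weights is a hypothetical name, and the output is the vector of (possibly negative) cell weights defining the signed measure ν\nu.

```python
import numpy as np

def noisy_cell_weights(X, eps):
    # X has shape (n, d) with entries in [0,1].  Partition [0,1]^d into k^d
    # equal subcubes with k^d >= m = ceil(eps * n), count the points in each
    # cell, and perturb each count with Lap_Z(1/eps) noise.
    n, d = X.shape
    m = int(np.ceil(eps * n))
    k = int(np.ceil(m ** (1.0 / d)))                 # subcubes of side length 1/k
    idx = np.minimum((X * k).astype(int), k - 1)     # cell index of each data point
    flat = np.ravel_multi_index(idx.T, (k,) * d)
    counts = np.bincount(flat, minlength=k ** d)
    noisy = counts + discrete_laplace(1.0 / eps, size=k ** d)
    return noisy / n
```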

4. Private measure mechanism (PMM)

4.1. Binary partition and noisy counts

A binary hierarchical partition of a set Ω\Omega of depth rr is a family of subsets Ωθ\Omega_{\theta} indexed by θ{0,1}r\theta\in\{0,1\}^{\leq r}, where

{0,1}k={0,1}0{0,1}1{0,1}k,k=0,1,2,\{0,1\}^{\leq k}=\{0,1\}^{0}\sqcup\{0,1\}^{1}\sqcup\cdots\sqcup\{0,1\}^{k},\quad k=0,1,2\dots,

and such that Ωθ\Omega_{\theta} is partitioned into Ωθ0\Omega_{\theta 0} and Ωθ1\Omega_{\theta 1} for every θ{0,1}r1\theta\in\{0,1\}^{\leq r-1}. By convention, the cube {0,1}0\{0,1\}^{0} consists of a single element \emptyset. We usually drop the subscript \emptyset and write nn instead of nn_{\emptyset}. When θ{0,1}j\theta\in\{0,1\}^{j}, we call jj the level of θ\theta. We can also encode a binary hierarchical partition of Ω\Omega in a binary tree of depth rr, where the root is labeled Ω\Omega and the jj-th level of the tree encodes the subsets Ωθ\Omega_{\theta} for θ\theta at level jj.

Let (Ωθ)θ{0,1}r(\Omega_{\theta})_{\theta\in\{0,1\}^{\leq r}} be a binary partition of Ω\Omega. Given true data (x1,,xn)Ωn(x_{1},\ldots,x_{n})\in\Omega^{n}, the true count nθn_{\theta} is the number of data points in the region Ωθ\Omega_{\theta}, i.e.

nθ|{i[n]:xiΩθ}|.n_{\theta}\coloneqq\mathinner{\!\left\lvert\left\{i\in[n]\mathrel{\mathop{\mathchar 58\relax}}\;x_{i}\in\Omega_{\theta}\right\}\right\rvert}.

We will convert true counts into noisy counts nθn^{\prime}_{\theta} by adding Laplacian noise; all regions on the same level will receive noise of the same expected magnitude. Formally, we set

nθ(nθ+λθ)+,whereλθLap(σj),n^{\prime}_{\theta}\coloneqq\left(n_{\theta}+\lambda_{\theta}\right)_{+},\quad\text{where}\quad\lambda_{\theta}\sim\operatorname{Lap}_{\mathbb{Z}}(\sigma_{j}),

and j{0,,r}j\in\{0,\ldots,r\} is the level of θ\theta. At this point, the magnitudes of the noise σj\sigma_{j} can be arbitrary.
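As an illustration for Ω=[0,1]\Omega=[0,1] with the dyadic binary partition of depth rr (the cells at level jj are the dyadic intervals of length 2j2^{-j}), the true and noisy counts can be computed as in the sketch below; it reuses the discrete_laplace sampler from Section 3, and hierarchical_noisy_counts is a hypothetical name.

```python
import numpy as np

def hierarchical_noisy_counts(x, r, sigmas, rng=None):
    # Level j has 2^j dyadic cells; n_theta is the number of points in the
    # corresponding interval, and n'_theta = max(n_theta + Lap_Z(sigma_j), 0).
    # sigmas has length r + 1, one noise magnitude per level.
    x = np.asarray(x, dtype=float)
    true_counts, noisy_counts = [], []
    for j in range(r + 1):
        k = 2 ** j
        cells = np.minimum((x * k).astype(int), k - 1)
        n_j = np.bincount(cells, minlength=k)             # true counts n_theta
        lam = discrete_laplace(sigmas[j], size=k, rng=rng)
        noisy_counts.append(np.maximum(n_j + lam, 0))     # (n_theta + lambda_theta)_+
        true_counts.append(n_j)
    return true_counts, noisy_counts
```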

4.2. Consistency

The true counts nθn_{\theta} are non-negative and consistent, i.e., the counts of subregions always add up to the count of the region:

nθ0+nθ1=nθfor all θ{0,1}r1.n_{\theta 0}+n_{\theta 1}=n_{\theta}\quad\text{for all }\theta\in\{0,1\}^{\leq r-1}.

The noisy counts nθn^{\prime}_{\theta} are non-negative, but not necessarily consistent. Algorithm 3 enforces consistency by adjusting the counts iteratively, from top to bottom. In the case of a deficit, when the sum of the two subregional counts is smaller than the count of the region, we increase both subregional counts; in the opposite case of a surplus, we decrease both. Apart from this requirement, we are free to distribute the deficit or surplus between the subregional counts.

It is convenient to state this requirement by considering a product partial order on +2\mathbb{Z}_{+}^{2}, where we declare that (a0,a1)(b0,b1)(a_{0},a_{1})\preceq(b_{0},b_{1}) if and only if a0b0a_{0}\leq b_{0} and a1b1a_{1}\leq b_{1}. We call the two vectors a,b2a,b\in\mathbb{Z}^{2} comparable if either aba\preceq b or bab\preceq a. Furthermore, L(a)L(a) denotes the line x+y=ax+y=a on the plane.

Algorithm 3 Consistency
Input: non-negative numbers (nθ)θ{0,1}r(n^{\prime}_{\theta})_{\theta\in\{0,1\}^{\leq r}}, where nn^{\prime} is a nonnegative integer.
set mnm\coloneqq n^{\prime}.
for j=0,,r1j=0,\ldots,r-1 do
     for θ{0,1}j\theta\in\{0,1\}^{j} do
         transform the vector (nθ0,nθ1)+2(n^{\prime}_{\theta 0},n^{\prime}_{\theta 1})\in\mathbb{Z}_{+}^{2} into any comparable vector (mθ0,mθ1)+2L(mθ)(m_{\theta 0},m_{\theta 1})\in\mathbb{Z}_{+}^{2}\cap L(m_{\theta}).
     end for
end for
Output: non-negative integers (mθ)θ{0,1}r(m_{\theta})_{\theta\in\{0,1\}^{\leq r}}.

At each step, Algorithm 3 uses a transformation fθ:+2+2L(mθ).f_{\theta}\mathrel{\mathop{\mathchar 58\relax}}\mathbb{Z}_{+}^{2}\to\mathbb{Z}_{+}^{2}\cap L(m_{\theta}). It can be chosen arbitrarily; the only requirement is that fθ(x)f_{\theta}(x) be comparable with xx. The comparability requirement is natural and non-restrictive. For example, the uniform transformation selects the closest point in the discrete interval +2L(mθ)\mathbb{Z}_{+}^{2}\cap L(m_{\theta}) in (say) the Euclidean metric. Alternatively, the proportional transformation selects the point in the discrete interval +2L(mθ)\mathbb{Z}_{+}^{2}\cap L(m_{\theta}) that is closest to the line that connects the input vector and the origin.
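A minimal sketch of Algorithm 3 with (a variant of) the uniform transformation follows: working top-down, the already-fixed parent count mθm_{\theta} is split between the two children by spreading the deficit or surplus as evenly as possible, which keeps the new pair comparable with the noisy pair. The name enforce_consistency is hypothetical, and the input is the list of per-level noisy counts.

```python
import numpy as np

def enforce_consistency(noisy_counts):
    # noisy_counts[j] is the array of noisy counts at level j (length 2^j).
    # Returns consistent counts m with m[j][2t] + m[j][2t+1] = m[j-1][t].
    r = len(noisy_counts) - 1
    m = [np.asarray(noisy_counts[0], dtype=int)]       # level 0: m = n'
    for j in range(r):
        parent = m[j]
        child = np.asarray(noisy_counts[j + 1], dtype=int).reshape(-1, 2)
        shift = (parent - child.sum(axis=1)) // 2      # half the deficit/surplus
        left = np.clip(child[:, 0] + shift, 0, parent)
        m.append(np.column_stack([left, parent - left]).ravel())
    return m
```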

4.3. Synthetic data

Algorithm 3 ensures that the output counts mθm_{\theta} are non-negative, integer, and consistent. They are also private since they are a function of the noisy counts nθn^{\prime}_{\theta}, which are private, as we prove in Theorem 4.1 below. Therefore, the counts mθm_{\theta} can be used to generate private synthetic data by putting mθm_{\theta} points in cell Ωθ\Omega_{\theta}. Algorithm 4 makes this formal.

Algorithm 4 Private Measure Mechanism
Input: true data 𝒳=(x1,,xn)Ωn\mathcal{X}=(x_{1},\ldots,x_{n})\in\Omega^{n}, noise magnitudes σ0,,σr>0\sigma_{0},\ldots,\sigma_{r}>0.
Compute true counts:

Let nθn_{\theta} be the number of data points in Ωθ\Omega_{\theta}.

Add noise:

Let nθ(nθ+λθ)+n^{\prime}_{\theta}\coloneqq(n_{\theta}+\lambda_{\theta})_{+}, where λθLap(σj)\lambda_{\theta}\sim\operatorname{Lap}_{\mathbb{Z}}(\sigma_{j}) are i.i.d. random variables,

Enforce consistency:

Convert the noisy counts (nθ)(n^{\prime}_{\theta}) to consistent counts (mθ)(m_{\theta}) using Algorithm 3.

Sample:

Choose any mθm_{\theta} points in each cell Ωθ\Omega_{\theta}, θ{0,1}r\theta\in\{0,1\}^{r} independently of 𝒳\mathcal{X}.

Output: the set of all these points as synthetic data 𝒴=(y1,,ym)Ωm\mathcal{Y}=(y_{1},\ldots,y_{m})\in\Omega^{m}.
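Putting the pieces together for Ω=[0,1]\Omega=[0,1] with the dyadic partition, a minimal end-to-end sketch of Algorithm 4 is shown below; pmm_synthetic_data is a hypothetical name, and it places the mθm_{\theta} synthetic points at the centers of the level-rr cells, which is one valid choice of the mθm_{\theta} points in each cell.

```python
import numpy as np

def pmm_synthetic_data(x, r, sigmas, rng=None):
    # PMM on [0,1]: noisy hierarchical counts, consistency, then m_theta copies
    # of the center of each level-r dyadic cell.  Uses the helpers sketched above.
    _, noisy = hierarchical_noisy_counts(x, r, sigmas, rng=rng)
    m = enforce_consistency(noisy)
    centers = (np.arange(2 ** r) + 0.5) / 2 ** r
    return np.repeat(centers, m[r])

# Example usage: taking sigma_j = (r + 1) / eps at every level gives
# sum_j 1/sigma_j = eps, hence eps-differential privacy by Theorem 4.1:
#   y = pmm_synthetic_data(x, r=7, sigmas=[(7 + 1) / 1.0] * 8)
```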

4.4. Privacy and accuracy of Algorithm 4

We first prove that Algorithm 4 is differentially private. The proof idea is similar to the classical Laplacian mechanism, but now the noise has a different scale at each level, so a more delicate calculation is needed.

Theorem 4.1 (Privacy of Algorithm 4).

The vector of noisy counts (nθ+λθ)(n_{\theta}+\lambda_{\theta}) in Algorithm 4 is ε\varepsilon-differentially private, where

ε=j=0r1σj.\varepsilon=\sum_{j=0}^{r}\frac{1}{\sigma_{j}}.

Consequently, the synthetic data 𝒴\mathcal{Y} generated by Algorithm 4 is ε\varepsilon-differentially private.

Having analyzed the privacy of the synthetic data, we now turn to its accuracy. It is determined by the magnitudes of the noise σj\sigma_{j} and by the multiscale geometry of the domain Ω\Omega. The latter is captured by the diameters of the regions Ωθ\Omega_{\theta}, specifically by their sum at each level, which we denote

Δjθ{0,1}jdiam(Ωθ)\Delta_{j}\coloneqq\sum_{\theta\in\{0,1\}^{j}}\operatorname{diam}(\Omega_{\theta}) (4.1)

and adopt the notation Δ1Δ0=diam(Ω).\Delta_{-1}\coloneqq\Delta_{0}=\operatorname{diam}(\Omega). In addition to Δj\Delta_{j}, the accuracy is affected by the resolution of the partition, which is the maximum diameter of the cells, denoted by

δmaxθ{0,1}rdiam(Ωθ).\delta\coloneqq\max_{\theta\in\{0,1\}^{r}}\operatorname{diam}(\Omega_{\theta}).
Theorem 4.2 (Accuracy of Algorithm 4).

Algorithm 4 that transforms true data 𝒳\mathcal{X} into synthetic data 𝒴\mathcal{Y} has the following expected accuracy in the Wasserstein metric:

𝔼W1(μ𝒳,μ𝒴)22nj=0rσjΔj1+δ.\operatorname{\mathbb{E}}W_{1}\left(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}\right)\leq\frac{2\sqrt{2}}{n}\sum_{j=0}^{r}\sigma_{j}\Delta_{j-1}+\delta.

Here μ𝒳\mu_{\mathcal{X}} and μ𝒴\mu_{\mathcal{Y}} are the empirical probability distributions on the true and synthetic data, respectively.

The privacy and accuracy guarantees of Algorithm 4 (Theorems 4.1 and 4.2) hold for any choice of noise levels σj\sigma_{j}. By optimizing σj\sigma_{j}, we can achieve the best accuracy for a given level of privacy.

Theorem 4.3 (Optimized accuracy).

With the optimal choice of magnitude levels (A.4), Algorithm 4 that transforms true data 𝒳\mathcal{X} into synthetic data 𝒴\mathcal{Y} is ε\varepsilon-differentially private, and has the following expected accuracy in the 11-Wasserstein distance:

𝔼W1(μ𝒳,μ𝒴)2εn(j=0rΔj1)2+δ.\operatorname{\mathbb{E}}W_{1}\left(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}\right)\leq\frac{\sqrt{2}}{\varepsilon n}\Big{(}\sum_{j=0}^{r}\sqrt{\Delta_{j-1}}\Big{)}^{2}+\delta.

Here μ𝒳\mu_{\mathcal{X}} and μ𝒴\mu_{\mathcal{Y}} are the empirical measures of the true and synthetic data, respectively.

Corollary 4.4 (Optimized accuracy for hypercubes).

When Ω=[0,1]d\Omega=[0,1]^{d} equipped with the \ell^{\infty} metric, with the optimal choice of magnitude levels (A.4) and the optimal choice of

r={log2(εn)1 if d=1,log2(εn) if d2,r=\left\{\begin{aligned} &\log_{2}(\varepsilon n)-1&\textrm{ if }d=1,\\ &\log_{2}(\varepsilon n)&\textrm{ if }d\geq 2,\end{aligned}\right.

we have

𝔼W1(μ𝒳,μ𝒴){log2(εn)εn, if d=1,(εn)1/d, if d2.\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\lesssim\left\{\begin{aligned} \frac{\log^{2}(\varepsilon n)}{\varepsilon n},\quad\textrm{ if }d=1,\\ (\varepsilon n)^{-1/d},\quad\textrm{ if }d\geq 2.\end{aligned}\right.
Remark 4.5 (Computational efficiency of Algorithm 4).

Since a binary hierarchical partition of depth rr has 2r2^{r} cells at the finest level (and at most 2r+12^{r+1} regions in total), the running time of Algorithm 4 is O(2r)O(2^{r}). When Ω=[0,1]d\Omega=[0,1]^{d}, with the same optimal choice of rr in Corollary 4.4, the running time of PMM becomes O(εdn)O(\varepsilon dn).

4.5. Proof of Theorem 4.2

For the proof of Theorem 4.2, we introduce a quantitative notion for the incomparability of two vectors on the plane. For vectors a,b+2a,b\in\mathbb{Z}_{+}^{2}, we define

flux(a,b):={0if a and b are comparable,min(|a1b1|,|a2b2|)otherwise.\operatorname{flux}(a,b)\mathrel{\mathop{\mathchar 58\relax}}=\begin{cases}0&\text{if $a$ and $b$ are comparable,}\\ \min\left(\,\mathinner{\!\left\lvert a_{1}-b_{1}\right\rvert},\,\mathinner{\!\left\lvert a_{2}-b_{2}\right\rvert}\right)&\text{otherwise}.\end{cases}
Lemma 4.6 (Flux as incomparability).

flux(a,b)\operatorname{flux}(a,b) is the \ell_{\infty}-distance from aa to the set of points that are comparable to bb.

For example, if a=(1,9)a=(1,9) and b=(6,7)b=(6,7), then flux(a,b)=2\operatorname{flux}(a,b)=2. Note that aa has a distance 22 to the vector (1,7)(1,7) which is comparable with bb.
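The definition of flux\operatorname{flux} translates directly into code; a small sketch (the helper name flux is hypothetical) reproducing the example above:

```python
def flux(a, b):
    # flux((a1, a2), (b1, b2)): 0 if the vectors are comparable in the product
    # order, otherwise min(|a1 - b1|, |a2 - b2|).
    comparable = (a[0] <= b[0] and a[1] <= b[1]) or (b[0] <= a[0] and b[1] <= a[1])
    return 0 if comparable else min(abs(a[0] - b[0]), abs(a[1] - b[1]))

assert flux((1, 9), (6, 7)) == 2   # the example above
```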

Lemma 4.7 (Flux as transfer).

Suppose we have two bins with a1a_{1} and a2a_{2} balls in them. Then one can achieve b1b_{1} and b2b_{2} balls in these bins by:

  1. (a)

first making the total number of balls correct by adding a total of (b1+b2)(a1+a2)(b_{1}+b_{2})-(a_{1}+a_{2}) balls to the two bins (or removing, if that number is negative);

  2. (b)

    then transferring flux((a1,a2),(b1,b2))\operatorname{flux}\left((a_{1},a_{2}),\,(b_{1},b_{2})\right) balls from one bin to the other.

For example, suppose that one bin has 11 ball and the other has 99. Then we can achieve 66 and 77 balls in these bins by first adding 33 balls to the first bin and then transferring 22 balls from the second bin to the first. As we noted above, 22 is the flux between the vectors (1,9)(1,9) and (6,7)(6,7).

Lemma 4.7 can be generalized to the hierarchical binary partition of Ω\Omega as follows.

Lemma 4.8.

Consider any data set 𝒳Ωn\mathcal{X}\in\Omega^{n}, and let (nθ)θ{0,1}r(n_{\theta})_{\theta\in\{0,1\}^{r}} be its counts. Consider any consistent vector of non-negative integers (mθ)θ{0,1}r(m_{\theta})_{\theta\in\{0,1\}^{r}}. Then one can transform 𝒳\mathcal{X} into a set 𝒵Ωm\mathcal{Z}\in\Omega^{m} that has counts (mθ)θ{0,1}r(m_{\theta})_{\theta\in\{0,1\}^{r}} by:

  1. (a)

first making the total number of points correct by adding a total of mnm-n points to Ω\Omega (or removing, if that number is negative);

  2. (b)

    then transferring flux((nθ0,nθ1),(mθ0,mθ1))\operatorname{flux}\left((n_{\theta 0},n_{\theta 1}),\,(m_{\theta 0},m_{\theta 1})\right) points from Ωθ0\Omega_{\theta 0} to Ωθ1\Omega_{\theta 1} or vice versa, for all j=0,,r1j=0,\ldots,r-1 and θ{0,1}j\theta\in\{0,1\}^{j}.

Combining the concept of the flux\operatorname{flux} and our algorithm, the following two lemmas are useful in the proof of Theorem 4.2.

Lemma 4.9.

In Algorithm 4, we have

flux((nθ0,nθ1),(mθ0,mθ1))max(|λθ0|,|λθ1|)\operatorname{flux}\left((n_{\theta 0},n_{\theta 1}),\,(m_{\theta 0},m_{\theta 1})\right)\leq\max\left(\,\mathinner{\!\left\lvert\lambda_{\theta 0}\right\rvert},\mathinner{\!\left\lvert\lambda_{\theta 1}\right\rvert}\right)

for all j=0,,r1j=0,\ldots,r-1 and θ{0,1}j\theta\in\{0,1\}^{j}.

Lemma 4.10.

For any finite multisets UVU\subset V such that all elements in UU are from Ω\Omega, one has

W1(μU,μV)|VU||V|diam(Ω).W_{1}(\mu_{U},\mu_{V})\leq\frac{\mathinner{\!\left\lvert V\setminus U\right\rvert}}{\mathinner{\!\left\lvert V\right\rvert}}\cdot\operatorname{diam}(\Omega).
Proof.

(Proof of Theorem 4.2) Owing to Lemma 4.8 and Lemma 4.9, the creation of synthetic data from the true data 𝒳𝒴\mathcal{X}\mapsto\mathcal{Y}, described by Algorithm 4, can be achieved by the following three steps.

  1. 1.

    Transform the nn-point input set 𝒳\mathcal{X} to an mm-point set 𝒳1\mathcal{X}_{1} by adding or removing |mn|\mathinner{\!\left\lvert m-n\right\rvert} points.

  2. 2.

    Transform 𝒳1\mathcal{X}_{1} to 𝒳2\mathcal{X}_{2} by moving at most max(|λθ0|,|λθ1|)\max\left(\,\mathinner{\!\left\lvert\lambda_{\theta 0}\right\rvert},\mathinner{\!\left\lvert\lambda_{\theta 1}\right\rvert}\right) many data points for each j=0,1,,r1j=0,1,\dots,r-1 and θ{0,1}j\theta\in\{0,1\}^{j} between the two parts of the region Ωθ\Omega_{\theta}.

  3. 3.

Transform 𝒳2\mathcal{X}_{2} to the output data 𝒴\mathcal{Y} by relocating points within their cells.

We will analyze the accuracy of these steps one at a time.

Analyzing Step 2. The total distance the points are moved at this step is bounded by

j=0r1θ{0,1}jmax(|λθ0|,|λθ1|)diam(Ωθ)D.\sum_{j=0}^{r-1}\sum_{\theta\in\{0,1\}^{j}}\max\left(\,\mathinner{\!\left\lvert\lambda_{\theta 0}\right\rvert},\mathinner{\!\left\lvert\lambda_{\theta 1}\right\rvert}\right)\operatorname{diam}(\Omega_{\theta})\eqqcolon D. (4.2)

Since |𝒳1|=m\mathinner{\!\left\lvert\mathcal{X}_{1}\right\rvert}=m, it follows that

W1(μ𝒳1,μ𝒳2)Dm.W_{1}(\mu_{\mathcal{X}_{1}},\mu_{\mathcal{X}_{2}})\leq\frac{D}{m}. (4.3)

Combining Steps 1 and 2. Recall that step 1 transforms the input data 𝒳\mathcal{X} with |𝒳|=n\mathinner{\!\left\lvert\mathcal{X}\right\rvert}=n into 𝒳1\mathcal{X}_{1} with |𝒳1|=m=n+sign(λ)|λ|\mathinner{\!\left\lvert\mathcal{X}_{1}\right\rvert}=m=n+\operatorname{sign}(\lambda)\cdot\lfloor|\lambda|\rfloor by adding or removing points, depending on the sign of λ\lambda.

Case 1: λ0\lambda\geq 0. Here 𝒳1\mathcal{X}_{1} is obtained from 𝒳\mathcal{X} by adding λ\lfloor\lambda\rfloor points, so Lemma 4.10 gives

W1(μ𝒳,μ𝒳1)λmΔ0.W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{X}_{1}})\leq\frac{\lambda}{m}\cdot\Delta_{0}.

Combining this with (4.3) by triangle inequality, we conclude that

W1(μ𝒳,μ𝒳2)λΔ0+DmλΔ0+Dn.W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{X}_{2}})\leq\frac{\lambda\Delta_{0}+D}{m}\leq\frac{\lambda\Delta_{0}+D}{n}.

Case 2: λ<0\lambda<0. Here 𝒳1\mathcal{X}_{1} is obtained from 𝒳\mathcal{X} by removing a set 𝒳0\mathcal{X}_{0} of nm=|λ|n-m=\lfloor\mathinner{\!\left\lvert\lambda\right\rvert}\rfloor points. Furthermore, by our analysis of step 2, 𝒳2\mathcal{X}_{2} is obtained from 𝒳1\mathcal{X}_{1} by moving points the total distance at most DD. Therefore, 𝒳2𝒳0\mathcal{X}_{2}\cup\mathcal{X}_{0} (as a multiset) is obtained from 𝒳=𝒳1𝒳0\mathcal{X}=\mathcal{X}_{1}\cup\mathcal{X}_{0} by moving points the total distance at most DD, too. (The points in 𝒳0\mathcal{X}_{0} remain unmoved.) Since |𝒳|=n\mathinner{\!\left\lvert\mathcal{X}\right\rvert}=n, it follows that

W1(μ𝒳,μ𝒳2𝒳0)Dn.W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{X}_{2}\cup\mathcal{X}_{0}})\leq\frac{D}{n}.

Moreover, Lemma 4.10 gives

W1(μ𝒳2,μ𝒳2𝒳0)|𝒳0||𝒳2𝒳0|diam(Ω)|λ|Δ0n.W_{1}(\mu_{\mathcal{X}_{2}},\mu_{\mathcal{X}_{2}\cup\mathcal{X}_{0}})\leq\frac{\mathinner{\!\left\lvert\mathcal{X}_{0}\right\rvert}}{\mathinner{\!\left\lvert\mathcal{X}_{2}\cup\mathcal{X}_{0}\right\rvert}}\cdot\operatorname{diam}(\Omega)\leq\frac{\mathinner{\!\left\lvert\lambda\right\rvert}\Delta_{0}}{n}.

(Here we used that the multiset 𝒳2𝒳0\mathcal{X}_{2}\cup\mathcal{X}_{0} has the same number of points as 𝒳\mathcal{X}, which is nn.) Combining the two bounds by triangle inequality, we obtain

W1(μ𝒳,μ𝒳2)|λ|Δ0+Dn.W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{X}_{2}})\leq\frac{\mathinner{\!\left\lvert\lambda\right\rvert}\Delta_{0}+D}{n}. (4.4)

In other words, this bound holds in both cases.

Analyzing Step 3. This step is the easiest to analyze: since 𝒴\mathcal{Y} is obtained from 𝒳2\mathcal{X}_{2} by relocating points within their cells, and the maximal diameter of the cells is δ\delta, we have W1(μ𝒳2,μ𝒴)δ.W_{1}(\mu_{\mathcal{X}_{2}},\mu_{\mathcal{Y}})\leq\delta. Combining this with (4.4) by triangle inequality, we conclude that

W1(μ𝒳,μ𝒴)|λ|Δ0+Dn+δ.W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\leq\frac{\mathinner{\!\left\lvert\lambda\right\rvert}\Delta_{0}+D}{n}+\delta.

Taking expectation. Recall the definition of DD from (4.2). We get

𝔼W1(μ𝒳,μ𝒴)1n[𝔼[|λ|]Δ0+j=0r1θ{0,1}j𝔼[max(|λθ0|,|λθ1|)]diam(Ωθ)]+δ.\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\leq\frac{1}{n}\left[\operatorname{\mathbb{E}}\left[\,\mathinner{\!\left\lvert\lambda\right\rvert}\right]\Delta_{0}+\sum_{j=0}^{r-1}\sum_{\theta\in\{0,1\}^{j}}\operatorname{\mathbb{E}}\left[\max\left(\,\mathinner{\!\left\lvert\lambda_{\theta 0}\right\rvert},\mathinner{\!\left\lvert\lambda_{\theta 1}\right\rvert}\right)\right]\operatorname{diam}(\Omega_{\theta})\right]+\delta.

Since λLap(σ0)\lambda\sim\operatorname{Lap}_{\mathbb{Z}}(\sigma_{0}), by (B.1) we have 𝔼[|λ|](𝔼λ2)1/22σ0.\operatorname{\mathbb{E}}\left[\,\mathinner{\!\left\lvert\lambda\right\rvert}\right]\leq(\operatorname{\mathbb{E}}\lambda^{2})^{1/2}\leq\sqrt{2}\sigma_{0}. Similarly, since λθ0\lambda_{\theta 0} and λθ1\lambda_{\theta 1} are independent Lap(σj+1)\operatorname{Lap}_{\mathbb{Z}}(\sigma_{j+1}) random variables, 𝔼[max(|λθ0|,|λθ1|)]22σj+1\operatorname{\mathbb{E}}\left[\max\left(\,\mathinner{\!\left\lvert\lambda_{\theta 0}\right\rvert},\mathinner{\!\left\lvert\lambda_{\theta 1}\right\rvert}\right)\right]\leq 2\sqrt{2}\sigma_{j+1}. Substituting these estimates and rearranging the terms of the sum completes the proof. ∎

Acknowledgements

R.V. acknowledges support from NSF DMS–1954233, NSF DMS–2027299, U.S. Army 76649–CS, and NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning. Y.Z. is partially supported by NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning.

The authors thank March Boedihardjo, Girish Kumar, and Thomas Strohmer for helpful discussions.

References

  • [1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
  • [2] John Abowd, Robert Ashmead, Simson Garfinkel, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, and William Sexton. Census topdown: Differentially private data, incremental schemas, and consistency with public knowledge. US Census Bureau, 2019.
  • [3] John M Abowd, Robert Ashmead, Ryan Cumings-Menon, Simson Garfinkel, Micah Heineck, Christine Heiss, Robert Johns, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, et al. The 2020 census disclosure avoidance system topdown algorithm. Harvard Data Science Review, (Special Issue 2), 2022.
  • [4] Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms for earth mover’s distance. Theory of Computing Systems, 48:428–442, 2011.
  • [5] Boaz Barak, Kamalika Chaudhuri, Cynthia Dwork, Satyen Kale, Frank McSherry, and Kunal Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 273–282, 2007.
  • [6] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • [7] Steven M Bellovin, Preetam K Dutta, and Nathan Reitinger. Privacy and synthetic datasets. Stan. Tech. L. Rev., 22:1, 2019.
  • [8] Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to noninteractive database privacy. Journal of the ACM (JACM), 60(2):1–25, 2013.
  • [9] Sergey G Bobkov and Michel Ledoux. A simple fourier analytic proof of the AKT optimal matching theorem. The Annals of Applied Probability, 31(6):2567–2584, 2021.
  • [10] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Covariance’s loss is privacy’s gain: Computationally efficient, private and accurate synthetic data. Foundations of Computational Mathematics, pages 1–48, 2022.
  • [11] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Privacy of synthetic data: A statistical framework. IEEE Transactions on Information Theory, 69(1):520–527, 2022.
  • [12] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Private measures, random walks, and synthetic data. arXiv preprint arXiv:2204.09167, 2022.
  • [13] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Private sampling: a noiseless approach for generating differentially private synthetic data. SIAM Journal on Mathematics of Data Science, 4(3):1082–1115, 2022.
  • [14] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Covariance loss, Szemeredi regularity, and differential privacy. arXiv preprint arXiv:2301.02705, 2023.
  • [15] Sébastien Bubeck and Mark Sellke. A universal law of robustness via isoperimetry. Advances in Neural Information Processing Systems, 34:28811–28822, 2021.
  • [16] T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3):1–24, 2011.
  • [17] Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. Advances in neural information processing systems, 21, 2008.
  • [18] Steffen Dereich, Michael Scheutzow, and Reik Schottstedt. Constructive quantization: Approximation by empirical measures. Annales de l’IHP Probabilités et statistiques, 49(4):1183–1203, 2013.
  • [19] Sjoerd Dirksen. Tail bounds via generic chaining. Electron. J. Probab, 20(53):1–29, 2015.
  • [20] Josep Domingo-Ferrer, David Sánchez, and Alberto Blanco-Justicia. The limits of differential privacy (and its misuse in data release and machine learning). Communications of the ACM, 64(7):33–35, 2021.
  • [21] John C Duchi, Michael I Jordan, and Martin J Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 113(521):182–201, 2018.
  • [22] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N Rothblum. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 715–724, 2010.
  • [23] Cynthia Dwork, Aleksandar Nikolov, and Kunal Talwar. Efficient algorithms for privately releasing marginals via convex relaxations. Discrete & Computational Geometry, 53:650–673, 2015.
  • [24] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
  • [25] Dylan J Foster and Alexander Rakhlin. \ell^{\infty} vector contraction for Rademacher complexity. arXiv preprint arXiv:1911.06468, 6, 2019.
  • [26] Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Adaptive metric dimensionality reduction. Theoretical Computer Science, 620:105–118, 2016.
  • [27] Moritz Hardt, Katrina Ligett, and Frank McSherry. A simple and practical algorithm for differentially private data release. Advances in neural information processing systems, 25, 2012.
  • [28] Mathew E Hauer and Alexis R Santos-Lozada. Differential privacy in the 2020 census will distort covid-19 rates. Socius, 7:2378023121994014, 2021.
  • [29] Michael B Hawes. Implementing differential privacy: Seven lessons from the 2020 United States Census. Harvard Data Science Review, 2(2), 2020.
  • [30] Seidu Inusah and Tomasz Kozubowski. A discrete analogue of the Laplace distribution. Journal of Statistical Planning and Inference, 136:1090–1102, 03 2006.
  • [31] James Jordon, Jinsung Yoon, and Mihaela Van Der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. In International conference on learning representations, 2019.
  • [32] Leonid V Kovalev. Lipschitz clustering in metric spaces. The Journal of Geometric Analysis, 32(7):188, 2022.
  • [33] Terrance Liu, Giuseppe Vietri, and Steven Z Wu. Iterative methods for private synthetic data: Unifying framework and new methods. Advances in Neural Information Processing Systems, 34:690–702, 2021.
  • [34] Shahar Mendelson. Empirical processes with a bounded ψ\psi 1 diameter. Geometric and Functional Analysis, 20(4):988–1027, 2010.
  • [35] Laurent Meunier, Blaise J Delattre, Alexandre Araujo, and Alexandre Allauzen. A dynamical system perspective for Lipschitz neural networks. In International Conference on Machine Learning, pages 15484–15500. PMLR, 2022.
  • [36] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE global conference on signal and information processing, pages 245–248. IEEE, 2013.
  • [37] Dong Su, Jianneng Cao, Ninghui Li, Elisa Bertino, and Hongxia Jin. Differentially private kk-means clustering. In Proceedings of the sixth ACM conference on data and application security and privacy, pages 26–37, 2016.
  • [38] Michel Talagrand. Matching random samples in many dimensions. The Annals of Applied Probability, pages 846–856, 1992.
  • [39] Michel Talagrand. The generic chaining: upper and lower bounds of stochastic processes. Springer Science & Business Media, 2005.
  • [40] Justin Thaler, Jonathan Ullman, and Salil Vadhan. Faster algorithms for privately releasing marginals. In Automata, Languages, and Programming: 39th International Colloquium, ICALP 2012, Warwick, UK, July 9-13, 2012, Proceedings, Part I 39, pages 810–821. Springer, 2012.
  • [41] VM Tikhomirov. ε\varepsilon-entropy and ε\varepsilon-capacity of sets in functional spaces. In Selected works of AN Kolmogorov, pages 86–170. Springer, 1993.
  • [42] Jonathan Ullman and Salil Vadhan. PCPs and the hardness of generating private synthetic data. In Theory of Cryptography: 8th Theory of Cryptography Conference, TCC 2011, Providence, RI, USA, March 28-30, 2011. Proceedings 8, pages 400–416. Springer, 2011.
  • [43] Salil Vadhan. The complexity of differential privacy. Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich, pages 347–450, 2017.
  • [44] Vijay V Vazirani. Approximation algorithms, volume 1. Springer, 2001.
  • [45] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • [46] Giuseppe Vietri, Cedric Archambeau, Sergul Aydore, William Brown, Michael Kearns, Aaron Roth, Ankit Siva, Shuai Tang, and Steven Wu. Private synthetic data for multitask learning and marginal queries. In Advances in Neural Information Processing Systems, 2022.
  • [47] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.
  • [48] Ulrike von Luxburg and Olivier Bousquet. Distance-based classification with Lipschitz functions. J. Mach. Learn. Res., 5(Jun):669–695, 2004.
  • [49] Ziteng Wang, Chi Jin, Kai Fan, Jiaqi Zhang, Junliang Huang, Yiqiao Zhong, and Liwei Wang. Differentially private data releasing for smooth queries. The Journal of Machine Learning Research, 17(1):1779–1820, 2016.
  • [50] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. Bernoulli, 25(4 A):2620–2648, 2019.

Appendix A Additional proofs

A.1. Proof of Proposition 3.2

Proof.

We will apply the chaining argument (see, e.g., [45, Chapter 8]) to deduce a bound similar to Dudley’s inequality.

Step 1: (Finding nets)

Define εj=2j\varepsilon_{j}=2^{-j} for jj\in\mathbb{Z} and consider an εj\varepsilon_{j}-net TjT_{j} of \mathcal{F} of size 𝒩(,εj,)\mathcal{N}(\mathcal{F},\varepsilon_{j},\|\cdot\|_{\infty}). Then for any ff\in\mathcal{F} and any level jj, we can find the closest element in the net, denoted πj(f)\pi_{j}(f). In other words, there exists πj(f)\pi_{j}(f) s.t.

πj(f)Tj,fπj(f)εj.\pi_{j}(f)\in T_{j},\quad\|f-\pi_{j}(f)\|_{\infty}\leq\varepsilon_{j}.

Let mm be a positive integer to be determined later. By the telescoping sum and the triangle inequality, we have

𝔼supf1n|i=1nf(Xi)λi|\displaystyle\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\frac{1}{n}\left|\sum_{i=1}^{n}f(X_{i})\lambda_{i}\right|\leq 𝔼supf1n|i=1n(fπm(f))(Xi)λi|\displaystyle\,\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\frac{1}{n}\left|\sum_{i=1}^{n}\left(f-\pi_{m}(f)\right)(X_{i})\cdot\lambda_{i}\right|
+j=j0+1m𝔼supf1n|i=1n(πj(f)πj1(f))(Xi)λi|.\displaystyle\,+\sum_{j=j_{0}+1}^{m}\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\frac{1}{n}\left|\sum_{i=1}^{n}\left(\pi_{j}(f)-\pi_{j-1}(f)\right)(X_{i})\cdot\lambda_{i}\right|.

Note that when j=j0j=j_{0} is small enough, \mathcal{F} can be covered by the single function πj0(f)0\pi_{j_{0}}(f)\equiv 0.

Step 2: (Bounding the telescoping sum)

For a fixed j0<jmj_{0}<j\leq m, we consider the quantity

𝔼supf1n|i=1n(πj(f)πj1(f))(Xi)λi|.\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\frac{1}{n}\left|\sum_{i=1}^{n}\left(\pi_{j}(f)-\pi_{j-1}(f)\right)(X_{i})\cdot\lambda_{i}\right|.

For simplicity we will denote ai=ai(f)a_{i}=a_{i}(f) as the coefficient 1n(πj(f)πj1(f))(Xi)\frac{1}{n}\left(\pi_{j}(f)-\pi_{j-1}(f)\right)(X_{i}). Then we have

|ai|1nfπj1(f)+1nπj(f)f1n(εj+εj1)3εjn.|a_{i}|\leq\frac{1}{n}\left\|f-\pi_{j-1}(f)\right\|_{\infty}+\frac{1}{n}\left\|\pi_{j}(f)-f\right\|_{\infty}\leq\frac{1}{n}(\varepsilon_{j}+\varepsilon_{j-1})\leq\frac{3\varepsilon_{j}}{n}.

Since {λi}i[n]\{\lambda_{i}\}_{i\in[n]} are independent subexponential random variables, we can apply Bernstein’s inequality to the sum iaiλi\sum_{i}a_{i}\lambda_{i}. Setting K=3εjK=3\varepsilon_{j}, we have

{|i=1naiλi|>t}\displaystyle\mathbb{P}\left\{\left|\sum_{i=1}^{n}a_{i}\lambda_{i}\right|>t\rule{0.0pt}{8.53581pt}\right\} 2exp[cmin(t2a22,ta)]\displaystyle\leq 2\exp\left[-c\min\left(\frac{t^{2}}{\|a\|_{2}^{2}},\frac{t}{\|a\|_{\infty}}\right)\right]
2exp[cmin(t2K2/n,tK/n)]\displaystyle\leq 2\exp\left[-c\min\left(\frac{t^{2}}{K^{2}/n},\frac{t}{K/n}\right)\right]
=2exp[cnmin(t2K2,tK)],\displaystyle=2\exp\left[-cn\min\left(\frac{t^{2}}{K^{2}},\frac{t}{K}\right)\right],

Then we can use a union bound to control the supremum. Define N=|Tj||Tj1||Tj|2N=|T_{j}|\cdot|T_{j-1}|\leq|T_{j}|^{2}. Then

{supf|i=1naiλi|>t}\displaystyle\mathbb{P}\left\{\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^{n}a_{i}\lambda_{i}\right|>t\rule{0.0pt}{8.53581pt}\right\} 2Nexp[cnmin(t2K2,tK)]1\displaystyle\leq 2N\exp\left[-cn\min\left(\frac{t^{2}}{K^{2}},\frac{t}{K}\right)\right]\wedge 1
=2exp[logNcnmin(t2K2,tK)]1\displaystyle=2\exp\left[\log N-cn\min\left(\frac{t^{2}}{K^{2}},\frac{t}{K}\right)\right]\wedge 1
2exp(logNcnt2K2)1\displaystyle\leq 2\exp\left(\log N-cn\frac{t^{2}}{K^{2}}\right)\wedge 1
+2exp(logNcntK)1\displaystyle\quad+2\exp\left(\log N-cn\frac{t}{K}\right)\wedge 1

and hence

𝔼supf|i=1naiλi|=\displaystyle\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^{n}a_{i}\lambda_{i}\right|= 02exp(logNcnt2K2)1dt\displaystyle\int_{0}^{\infty}2\exp\left(\log N-cn\frac{t^{2}}{K^{2}}\right)\wedge 1\mathrm{d}t
+02exp(logNcntK)1dt\displaystyle\quad+\int_{0}^{\infty}2\exp\left(\log N-cn\frac{t}{K}\right)\wedge 1\mathrm{d}t
:=\displaystyle\mathrel{\mathop{\mathchar 58\relax}}= I2+I1.\displaystyle\,I_{2}+I_{1}.

We will compute them separately.

I1\displaystyle I_{1} =02exp(logNcntK)1dt\displaystyle=\int_{0}^{\infty}2\exp\left(\log N-cn\frac{t}{K}\right)\wedge 1\mathrm{d}t
=KlogNcn+KlogN/cn2exp(logNcntK)dt\displaystyle=\frac{K\log N}{cn}+\int_{{K\log N}/{cn}}^{\infty}2\exp\left(\log N-cn\frac{t}{K}\right)\mathrm{d}t
=KlogNcn+02exp(cntK)dt\displaystyle=\frac{K\log N}{cn}+\int_{0}^{\infty}2\exp\left(-cn\frac{t}{K}\right)\mathrm{d}t
CKlogNn\displaystyle\leq CK\frac{\log N}{n}
I2\displaystyle I_{2} =02exp(logNcnt2K2)1dt\displaystyle=\int_{0}^{\infty}2\exp\left(\log N-cn\frac{t^{2}}{K^{2}}\right)\wedge 1\mathrm{d}t
=K2logNcn+K2logN/cn2exp(logNcnt2K2)dt\displaystyle=\sqrt{\frac{K^{2}\log N}{cn}}+\int_{\sqrt{{K^{2}\log N}/{cn}}}^{\infty}2\exp\left(\log N-cn\frac{t^{2}}{K^{2}}\right)\mathrm{d}t
=K2logNcn+02exp(cnt2K22cnlogNtK)dt\displaystyle=\sqrt{\frac{K^{2}\log N}{cn}}+\int_{0}^{\infty}2\exp\left(-cn\frac{t^{2}}{K^{2}}-2\sqrt{cn\log N}\frac{t}{K}\right)\mathrm{d}t
K2logNcn+KcnlogN\displaystyle\leq\sqrt{\frac{K^{2}\log N}{cn}}+\frac{K}{\sqrt{cn\log N}}
CKlogNn.\displaystyle\leq CK\sqrt{\frac{\log N}{n}}.

Therefore we conclude that for a fixed level jj,

𝔼supf|i=1naiλi|CK(logNn+logNn)εj(logNn+logNn)\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^{n}a_{i}\lambda_{i}\right|\leq CK\left({\frac{\log N}{n}}+\sqrt{\frac{\log N}{n}}\right)\lesssim\varepsilon_{j}\left({\frac{\log N}{n}}+\sqrt{\frac{\log N}{n}}\right)

Step 3: (Bounding the last entry)

For the last entry in the telescoping sum, similarly, we denote ai:=1n(fπm(f))(Xi)a_{i}\mathrel{\mathop{\mathchar 58\relax}}=\frac{1}{n}\left(f-\pi_{m}(f)\right)(X_{i}) and we have |ai|εm/n|a_{i}|\leq\varepsilon_{m}/n. Then

supf|i=1naiλi|εmni=1n|λi|,\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^{n}a_{i}\lambda_{i}\right|\leq\frac{\varepsilon_{m}}{n}\sum_{i=1}^{n}|\lambda_{i}|,

and the expectation satisfies

𝔼supf|i=1naiλi|εmni=1n𝔼|λi|εm.\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^{n}a_{i}\lambda_{i}\right|\leq\frac{\varepsilon_{m}}{n}\sum_{i=1}^{n}\operatorname{\mathbb{E}}|\lambda_{i}|\lesssim\varepsilon_{m}.

Step 4: (Combining the bounds and choosing mm) Combining the bounds from Steps 2 and 3, we deduce that for any X1,,XnΩX_{1},\dots,X_{n}\in\Omega,

𝔼supf1n|i=1nf(Xi)λi|C(εm+j=j0+1m\displaystyle\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\frac{1}{n}\left|\sum_{i=1}^{n}f(X_{i})\lambda_{i}\right|\leq C\left(\varepsilon_{m}+\sum_{j=j_{0}+1}^{m}\right. εj(log𝒩(,εj,)n\displaystyle\varepsilon_{j}\Bigg{(}{\frac{\log\mathcal{N}(\mathcal{F},\varepsilon_{j},\|\cdot\|_{\infty})}{n}}
+log𝒩(,εj,)n)).\displaystyle\left.+\sqrt{\frac{\log\mathcal{N}(\mathcal{F},\varepsilon_{j},\|\cdot\|_{\infty})}{n}}\Bigg{)}\right).

Then for any α>0\alpha>0, we can choose mm such that 2αεm<4α2\alpha\leq\varepsilon_{m}<4\alpha and bound the sum above by an integral:

𝔼supf1n|i=1nf(Xi)λi|C(2α\displaystyle\operatorname{\mathbb{E}}\sup_{f\in\mathcal{F}}\frac{1}{n}\left|\sum_{i=1}^{n}f(X_{i})\lambda_{i}\right|\leq C\Bigg{(}2\alpha +1nαlog𝒩(,u,)du\displaystyle+\frac{1}{\sqrt{n}}\int_{\alpha}^{\infty}\sqrt{\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})}\mathrm{d}u
+1nαlog𝒩(,u,)du).\displaystyle+\frac{1}{{n}}\int_{\alpha}^{\infty}{\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})}\mathrm{d}u\Bigg{)}. (A.1)

Taking the infimum over α\alpha completes the proof of the first inequality.

Now assume \mathcal{F} is the set of all functions ff with fLip1\|f\|_{\mathrm{Lip}}\leq 1. From [26, Lemma 4.2], we can bound the covering number of \mathcal{F} by the covering number of Ω\Omega as follows:

log𝒩(,u,)log(8/u)𝒩(Ω,u/2,ρ).\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})\leq\log({8}/{u})\,\mathcal{N}(\Omega,{u}/{2},\rho).

As a result, for any α>0\alpha>0,

Ln()C(2α+1nαlog(8/u)𝒩(Ω,u/2,ρ)du+1nαlog(8/u)𝒩(Ω,u/2,ρ)du).L_{n}(\mathcal{F})\leq C\left(2\alpha+\frac{1}{\sqrt{n}}\int_{\alpha}^{\infty}\sqrt{\log({8}/{u})\,\mathcal{N}(\Omega,{u}/{2},\rho)}\mathrm{d}u+\frac{1}{{n}}\int_{\alpha}^{\infty}\log({8}/{u})\,\mathcal{N}(\Omega,{u}/{2},\rho)\mathrm{d}u\right).

This completes the proof. ∎

A.2. Proof of Corollary 3.3

Proof.

For Ω=[0,1]d\Omega=[0,1]^{d} with ll_{\infty}-norm, we have diam(Ω)=1\operatorname{diam}(\Omega)=1 and the covering number

𝒩([0,1]d,u,)ud.\mathcal{N}([0,1]^{d},u,\|\cdot\|_{\infty})\leq u^{-d}.

Then, as the domain Ω=[0,1]d\Omega=[0,1]^{d} is connected and centered, we can apply the bound for the covering number of \mathcal{F} from [48, Theorem 17]:

𝒩(,u,)(22/u+1)2𝒩([0,1]d,u/2,),\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})\leq\left(2\left\lceil{2}/u\right\rceil+1\right)2^{\mathcal{N}\left([0,1]^{d},u/2,\|\cdot\|_{\infty}\right)},
log𝒩(,u,)𝒩(Ω,u/2,)(u/2)d.\Longrightarrow\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})\lesssim\mathcal{N}(\Omega,{u}/{2},\|\cdot\|_{\infty})\lesssim(u/2)^{-d}.

Applying the inequality above to (A.1), we get

LnC(2α+1nα(u/2)d/2du+1nα(u/2)ddu).L_{n}\leq C\left(2\alpha+\frac{1}{\sqrt{n}}\int_{\alpha}^{\infty}(u/2)^{-d/2}\mathrm{d}u+\frac{1}{{n}}\int_{\alpha}^{\infty}(u/2)^{-d}\mathrm{d}u\right). (A.2)

We compute the integrals for the cases d=2d=2 and d3d\geq 3. For d=2d=2, the first integral in (A.2) is effectively truncated at a constant level, since log𝒩(,u,)=0\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})=0 once uu exceeds diam()\operatorname{diam}(\mathcal{F}). This yields

Ln(f){C(2α+2nlog2α+2n(α2)1) if d=2.C(2α+2n1d21(α2)1d2+2n1d1(α2)1d) if d3.L_{n}(f)\leq\left\{\begin{aligned} &C\Bigg{(}2\alpha+\frac{2}{\sqrt{n}}\log\frac{2}{\alpha}+\frac{2}{n}\left(\frac{\alpha}{2}\right)^{-1}\Bigg{)}&\quad\textrm{ if }d=2.\\ &C\Bigg{(}2\alpha+\frac{2}{\sqrt{n}}\cdot\frac{1}{\frac{d}{2}-1}\left(\frac{\alpha}{2}\right)^{1-\frac{d}{2}}+\frac{2}{n}\cdot\frac{1}{d-1}\left(\frac{\alpha}{2}\right)^{1-d}\Bigg{)}\quad&\textrm{ if }d\geq 3.\end{aligned}\right.

Choosing α=2n1/d\alpha=2n^{-1/d} finishes the proof for d2d\geq 2.

When d=1d=1, the Dudley integral in (A.2) is divergent. However, note that diam()2\operatorname{diam}(\mathcal{F})\leq 2 and hence log𝒩(,u,)=0\log\mathcal{N}(\mathcal{F},u,\|\cdot\|_{\infty})=0 for u>1u>1. From (A.1), we have

Ln()\displaystyle L_{n}(\mathcal{F}) C(2α+1nα1(u/2)1/2du+1nα1(u/2)1du)\displaystyle\leq C\left(2\alpha+\frac{1}{\sqrt{n}}\int_{\alpha}^{1}(u/2)^{-1/2}\mathrm{d}u+\frac{1}{{n}}\int_{\alpha}^{1}(u/2)^{-1}\mathrm{d}u\right)
C(2α+2(2α)n+2nlog1α).\displaystyle\leq C\left(2\alpha+\frac{2(\sqrt{2}-\sqrt{\alpha})}{\sqrt{n}}+\frac{2}{n}\log\frac{1}{\alpha}\right).

The optimal choice of α\alpha is αn1/2\alpha\sim n^{-1/2}, which gives us the result for d=1d=1. ∎
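As a numerical sanity check of this choice (the following short Python sketch is ours and is not part of the proof; the function name is hypothetical), one can evaluate the three terms of the bound for d3d\geq 3 at α=2n1/d\alpha=2n^{-1/d} and verify that their sum stays within a constant multiple of n1/dn^{-1/d}:

def bound_d_ge_3(n: int, d: int) -> float:
    """Illustrative sketch: evaluate the right-hand side of the bound for d >= 3
    at the choice alpha = 2 * n**(-1/d)."""
    alpha = 2 * n ** (-1 / d)
    term1 = 2 * alpha
    term2 = (2 / n ** 0.5) * (1 / (d / 2 - 1)) * (alpha / 2) ** (1 - d / 2)
    term3 = (2 / n) * (1 / (d - 1)) * (alpha / 2) ** (1 - d)
    return term1 + term2 + term3

if __name__ == "__main__":
    d = 3
    for n in (10**3, 10**4, 10**5):
        # the ratio below is constant in n (it equals 9 when d = 3)
        print(n, bound_d_ge_3(n, d) / n ** (-1 / d))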

A.3. Proof of Proposition 3.4

Proof.

It suffices to prove that the map from 𝒳\mathcal{X} to the signed measure ν\nu in Algorithm 1 is ε\varepsilon-differentially private, since the remaining steps are based only on ν\nu. Since both μ𝒴,ν\mu_{\mathcal{Y}},\nu are supported on Y1,,YmY_{1},\dots,Y_{m}, we can identify the two discrete measures with mm-dimensional vectors, denoted μ𝒴¯,ν¯\overline{\mu_{\mathcal{Y}}},\overline{\nu}, respectively. Consider two datasets 𝒳1\mathcal{X}_{1} and 𝒳2\mathcal{X}_{2} that differ in one point. Suppose we obtain μ𝒴1,μ𝒴2{\mu_{\mathcal{Y}_{1}}},{\mu_{\mathcal{Y}_{2}}} and ν1,ν2\nu_{1},\nu_{2} through the first four steps of Algorithm 1 from 𝒳1,𝒳2\mathcal{X}_{1},\mathcal{X}_{2}, respectively. The two vectors μ𝒴1¯,μ𝒴2¯\overline{\mu_{\mathcal{Y}_{1}}},\overline{\mu_{\mathcal{Y}_{2}}} differ in one coordinate, and the difference is bounded by 1/n1/n.

Then

{ν1=η}{ν2=η}\displaystyle\frac{\mathbb{P}\left\{\nu_{1}=\eta\rule{0.0pt}{8.53581pt}\right\}}{\mathbb{P}\left\{\nu_{2}=\eta\rule{0.0pt}{8.53581pt}\right\}} =i=1m{λi=n(ημ𝒴1¯)i}{λi=n(ημ𝒴2¯)i}=i=1mexp(εn|(ημ𝒴1¯)i|)exp(εn|(ημ𝒴2¯)i|)\displaystyle=\prod_{i=1}^{m}\frac{\mathbb{P}\left\{\lambda_{i}=n(\eta-\overline{\mu_{\mathcal{Y}_{1}}})_{i}\rule{0.0pt}{8.53581pt}\right\}}{\mathbb{P}\left\{\lambda_{i}=n(\eta-\overline{\mu_{\mathcal{Y}_{2}}})_{i}\rule{0.0pt}{8.53581pt}\right\}}=\prod_{i=1}^{m}\frac{\exp(-\varepsilon n|(\eta-\overline{\mu_{\mathcal{Y}_{1}}})_{i}|)}{\exp(-\varepsilon n|(\eta-\overline{\mu_{\mathcal{Y}_{2}}})_{i}|)}
exp(εnμ𝒴2μ𝒴11)eε.\displaystyle\leq\exp\left(\varepsilon n\|{\mu_{\mathcal{Y}_{2}}}-{\mu_{\mathcal{Y}_{1}}}\|_{1}\right)\leq e^{\varepsilon}.

By writing {νiS}=ηS{νi=η}\mathbb{P}\left\{\nu_{i}\in S\rule{0.0pt}{8.53581pt}\right\}=\sum_{\eta\in S}\mathbb{P}\left\{\nu_{i}=\eta\rule{0.0pt}{8.53581pt}\right\} for i=1,2i=1,2, the inequality above implies Algorithm 1 is ε\varepsilon-differentially private. ∎
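For concreteness, the perturbation step analyzed above can be sketched in a few lines of Python (this is our illustration, not the paper’s implementation; the helper names are hypothetical). A draw with probability mass proportional to exp(ε|k|)\exp(-\varepsilon|k|), which is Lap(1/ε)\operatorname{Lap}_{\mathbb{Z}}(1/\varepsilon) in the notation of Appendix B, is generated as a difference of two geometric random variables:

import numpy as np

def sample_discrete_laplace(eps: float, size: int, rng: np.random.Generator) -> np.ndarray:
    """Our sketch: integers with P(k) proportional to exp(-eps * |k|), i.e. Lap_Z(1 / eps)."""
    q = np.exp(-eps)
    g1 = rng.geometric(1 - q, size=size) - 1   # geometric variable on {0, 1, 2, ...}
    g2 = rng.geometric(1 - q, size=size) - 1
    return g1 - g2

def perturb_counts(counts: np.ndarray, n: int, eps: float, rng: np.random.Generator) -> np.ndarray:
    """Our sketch: turn the count vector (n_i) into the noisy signed measure (n_i + lambda_i) / n."""
    lam = sample_discrete_laplace(eps, counts.size, rng)
    return (counts + lam) / n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, eps = 1000, 1.0
    counts = np.array([400, 350, 250])          # counts n_i over m = 3 support points y_i
    nu = perturb_counts(counts, n, eps, rng)    # entries may be negative; total mass may differ from 1
    print(nu, nu.sum())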

A.4. Proof of Proposition 3.5

Proof.

For two signed measures τ,ν\tau,\nu supported on 𝒴\mathcal{Y}, the dBLd_{\mathrm{BL}}-distance between τ\tau and ν\nu is

dBL(τ,ν)=supfLip1|i=1mf(yi)(τ({yi})ν({yi}))|.d_{\mathrm{BL}}(\tau,\nu)=\sup_{\|f\|_{\mathrm{Lip}}\leq 1}\left|\sum_{i=1}^{m}f(y_{i})\left(\tau(\{y_{i}\})-\nu(\{y_{i}\})\right)\right|.

For simplicity, we denote fi=f(yi)f_{i}=f(y_{i}), νi=ν({yi})\nu_{i}=\nu(\{y_{i}\}) and τi=τ({yi})\tau_{i}=\tau(\{y_{i}\}). Note that for any ff with fLip1\|f\|_{\mathrm{Lip}}\leq 1, only the values (fi)i[m](f_{i})_{i\in[m]} matter in the definition above. Therefore, for fixed ν\nu and τ\tau, computing the dBLd_{\mathrm{BL}}-distance is equivalent to the following linear programming problem:

max\displaystyle\max\quad i=1m(νiτi)fi\displaystyle\sum_{i=1}^{m}(\nu_{i}-\tau_{i})f_{i}
s.t. fifjyiyj,\displaystyle f_{i}-f_{j}\leq\|y_{i}-y_{j}\|_{\infty}, i,jm,ij,\displaystyle{\forall i,j\leq m,i\not=j},
fi+fjyiyj,\displaystyle-f_{i}+f_{j}\leq\|y_{i}-y_{j}\|_{\infty}, i,jm,ij,\displaystyle\forall i,j\leq m,i\not=j,
1fi1,\displaystyle-1\leq f_{i}\leq 1, im.\displaystyle\forall i\leq m.

After a change of variable fi=fi+1f_{i}^{\prime}=f_{i}+1, we can rewrite it as

max\displaystyle\max\quad i=1m(νiτi)fi(ν(Ω)1)\displaystyle\sum_{i=1}^{m}(\nu_{i}-\tau_{i})f_{i}^{\prime}-\left(\nu(\Omega)-1\right)
s.t. fifjyiyj,\displaystyle f_{i}^{\prime}-f_{j}^{\prime}\leq\|y_{i}-y_{j}\|_{\infty}, i,jm,ij,\displaystyle\forall i,j\leq m,i\not=j,
fi+fjyiyj,\displaystyle-f_{i}^{\prime}+f_{j}^{\prime}\leq\|y_{i}-y_{j}\|_{\infty}, i,jm,ij,\displaystyle\forall i,j\leq m,i\not=j,
0fi2,\displaystyle 0\leq f_{i}^{\prime}\leq 2, im.\displaystyle\forall i\leq m.

Next, we consider the dual of the linear program above. The duality theory of linear programming [44, Chapter 12] shows that the primal and the dual problems have the same optimal value. Let uij,uij0u_{ij},u_{ij}^{\prime}\geq 0 be the dual variables for the linear constraints on fifjf_{i}^{\prime}-f_{j}^{\prime} and fi+fj-f_{i}^{\prime}+f_{j}^{\prime}, and let vi0v_{i}\geq 0 be the dual variable for the constraint fi2f_{i}^{\prime}\leq 2. As the linear program above is in standard form, by duality it is equivalent to

min\displaystyle\min\quad ijyiyj(uij+uij)+2i=1mvi(ν(Ω)1)\sum_{i\neq j}\|y_{i}-y_{j}\|_{\infty}(u_{ij}+u_{ij}^{\prime})+2\sum_{i=1}^{m}v_{i}-\left(\nu(\Omega)-1\right)
s.t. ji(uijuij)+viνiτi,\displaystyle\sum_{j\neq i}(u_{ij}-u_{ij}^{\prime})+v_{i}\geq\nu_{i}-\tau_{i}, im,\displaystyle\forall{i\leq m},
uij,uij,vi0\displaystyle u_{ij},u_{ij}^{\prime},v_{i}\geq 0 i,jm,ij.\displaystyle\forall i,j\leq m,i\not=j.

To find the minimizer τ\tau for a given ν\nu, we regard the τi\tau_{i} as variables and add the constraint that τ\tau is a probability measure. We may also drop the constant ν(Ω)1\nu(\Omega)-1 from the objective function. This yields the linear programming problem:

min\displaystyle\min\quad ijyiyj(uij+uij)+2i=1mvi\sum_{i\neq j}\|y_{i}-y_{j}\|_{\infty}(u_{ij}+u_{ij}^{\prime})+2\sum_{i=1}^{m}v_{i}
s.t. ji(uijuij)+vi+τiνi,\displaystyle\sum_{j\neq i}(u_{ij}-u_{ij}^{\prime})+v_{i}+\tau_{i}\geq\nu_{i}, im,\displaystyle\forall{i\leq m},
i=1mτi=1,\displaystyle\sum_{i=1}^{m}\tau_{i}=1,
uij,uij,vi,τi0\displaystyle u_{ij},u_{ij}^{\prime},v_{i},\tau_{i}\geq 0 i,jm,ij.\displaystyle\forall i,j\leq m,i\not=j.

There are 2m22m^{2} variables in total and m+1m+1 linear constraints, and the τ\tau-component (τi)i=1m(\tau_{i})_{i=1}^{m} of the minimizer is the desired probability measure. ∎
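To make the linear program concrete, here is a minimal sketch that assembles the variables (uij,uij,vi,τi)(u_{ij},u_{ij}^{\prime},v_{i},\tau_{i}) and returns the τ\tau-component of the minimizer (it assumes SciPy is available; the function name and the toy data are ours, not the paper’s implementation):

import numpy as np
from scipy.optimize import linprog

def closest_probability_measure(nu: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Our sketch of the linear program above: find the probability vector tau
    minimizing the bounded-Lipschitz distance to the signed-measure weights nu."""
    m = len(nu)
    pairs = [(i, j) for i in range(m) for j in range(m) if i != j]
    P = len(pairs)                                    # m * (m - 1) ordered pairs
    # variable order: u (P entries), u' (P entries), v (m entries), tau (m entries)
    dist = np.max(np.abs(y[:, None, :] - y[None, :, :]), axis=2)   # ||y_i - y_j||_inf

    cost = np.zeros(2 * P + 2 * m)
    for k, (i, j) in enumerate(pairs):
        cost[k] = cost[P + k] = dist[i, j]
    cost[2 * P:2 * P + m] = 2.0                       # the 2 * sum_i v_i term

    # constraints sum_{j != i}(u_ij - u'_ij) + v_i + tau_i >= nu_i, written as A_ub x <= b_ub
    A_ub = np.zeros((m, 2 * P + 2 * m))
    for k, (i, j) in enumerate(pairs):
        A_ub[i, k] = -1.0
        A_ub[i, P + k] = 1.0
    for i in range(m):
        A_ub[i, 2 * P + i] = -1.0                     # v_i
        A_ub[i, 2 * P + m + i] = -1.0                 # tau_i
    b_ub = -np.asarray(nu, dtype=float)

    A_eq = np.zeros((1, 2 * P + 2 * m))               # sum_i tau_i = 1
    A_eq[0, 2 * P + m:] = 1.0

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=(0, None))
    return res.x[2 * P + m:]

if __name__ == "__main__":
    y = np.array([[0.1], [0.5], [0.9]])               # m = 3 support points in [0, 1]
    nu = np.array([0.5, -0.1, 0.7])                   # noisy signed measure with total mass 1.1
    tau = closest_probability_measure(nu, y)
    print(tau, tau.sum())                             # a probability vector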

A.5. Proof of Theorem 3.6

Proof.

We transform the original empirical measure μ𝒳\mu_{\mathcal{X}} in three steps: μ𝒳μ𝒴νν^.\mu_{\mathcal{X}}\longrightarrow\mu_{\mathcal{Y}}\longrightarrow\nu\longrightarrow\hat{\nu}.

Step 1: For the first step in the algorithm, we have W1(μ𝒳,μ𝒴)maxidiam(Ωi)W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\leq\max_{i}\operatorname{diam}(\Omega_{i}). This follows from the definition of the 11-Wasserstein distance, since each data point is transported within the region Ωi\Omega_{i} that contains it.

Step 2: In this step, ν\nu is no longer a probability measure, and we consider dBL(μ𝒴,ν)d_{\mathrm{BL}}(\mu_{\mathcal{Y}},\nu) instead:

𝔼dBL(μ𝒴,ν)\displaystyle\mathbb{E}d_{\mathrm{BL}}(\mu_{\mathcal{Y}},\nu) =𝔼supfLip1|fdμ𝒴fdν|\displaystyle=\mathbb{E}\sup_{\|f\|_{\mathrm{Lip}}\leq 1}\left|\int f\mathrm{d}\mu_{\mathcal{Y}}-\int f\mathrm{d}\nu\right|
=𝔼supfLip1|i=1mf(yi)(nin+λinnin)|=mεnL~m().\displaystyle=\mathbb{E}\sup_{\|f\|_{\mathrm{Lip}}\leq 1}\left|\sum_{i=1}^{m}f(y_{i})\left(\frac{n_{i}}{n}+\frac{\lambda_{i}}{n}-\frac{n_{i}}{n}\right)\right|=\frac{m}{\varepsilon n}\widetilde{L}_{m}(\mathcal{F}). (A.3)

Step 3: For the last step, we have dBL(ν,ν^)dBL(μ𝒴,ν)d_{\mathrm{BL}}(\nu,\hat{\nu})\leq d_{\mathrm{BL}}(\mu_{\mathcal{Y}},\nu) because, by Proposition 3.5, ν^\hat{\nu} is the probability measure closest to ν\nu, while μ𝒴\mu_{\mathcal{Y}} is itself a probability measure. As a result, we have

W1(μ𝒳,ν^)=dBL(μ𝒳,ν^)\displaystyle W_{1}(\mu_{\mathcal{X}},\hat{\nu})=d_{\mathrm{BL}}(\mu_{\mathcal{X}},\hat{\nu}) dBL(μ𝒳,μ𝒴)+dBL(μ𝒴,ν)+dBL(ν,ν^)\displaystyle\leq d_{\mathrm{BL}}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})+d_{\mathrm{BL}}(\mu_{\mathcal{Y}},\nu)+d_{\mathrm{BL}}(\nu,\hat{\nu})
W1(μ𝒳,μ𝒴)+2dBL(μ𝒴,ν)maxidiam(Ωi)+2dBL(μ𝒴,ν).\displaystyle\leq W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})+2d_{\mathrm{BL}}(\mu_{\mathcal{Y}},\nu)\leq\max_{i}\operatorname{diam}(\Omega_{i})+2d_{\mathrm{BL}}(\mu_{\mathcal{Y}},\nu).

After taking the expectation, we can apply (A.3) to get the desired inequality. ∎

A.6. Proof of Corollary 3.7

Proof.

Using Theorem 3.6, we have

𝔼W1(μ𝒳,ν^)maxidiam(Ωi)+2mεnL~m().\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\hat{\nu})\leq\max_{i}\operatorname{diam}(\Omega_{i})+\frac{2m}{\varepsilon n}\widetilde{L}_{m}(\mathcal{F}).

By assumption, we have maxidiam(Ωi)m1/d(εn)1/d\max_{i}\operatorname{diam}(\Omega_{i})\asymp m^{-1/d}\asymp(\varepsilon n)^{-1/d}. By Corollary 3.3, the Laplacian complexity satisfies

L~m(){C(εn)1/2 if d=1,Clogn(εn)1/2 if d=2,C(εn)1/d if d3.\widetilde{L}_{m}(\mathcal{F})\leq\left\{\begin{aligned} &C(\varepsilon n)^{-1/2}\quad&\textrm{ if }d=1,\\ &{C\log n}\cdot(\varepsilon n)^{-1/2}\quad&\textrm{ if }d=2,\\ &{C}{(\varepsilon n)^{-{1}/{d}}}\quad&\textrm{ if }d\geq 3.\end{aligned}\right.

When d3d\geq 3, the two terms are comparable, while for d=1,2d=1,2 the Laplacian complexity term dominates the error. Combining the two inequalities gives the result. ∎

A.7. Proof of Theorem 4.1

Theorem 4.1 can be obtained by applying the parallel composition lemma [24]. Here we present a self-contained proof by considering an inhomogeneous version of the classical Laplacian mechanism [24].

Lemma A.1 (Inhomogeneous Laplace mechanism).

Let F:ΩnkF\mathrel{\mathop{\mathchar 58\relax}}\Omega^{n}\to\mathbb{R}^{k} be any map, s=(si)i=1k+ks=(s_{i})_{i=1}^{k}\in\mathbb{R}_{+}^{k} be a fixed vector, and λ=(λi)i=1k\lambda=(\lambda_{i})_{i=1}^{k} be a random vector with independent coordinates λiLap(si)\lambda_{i}\sim\operatorname{Lap}_{\mathbb{Z}}(s_{i}). Then the map xF(x)+λx\mapsto F(x)+\lambda is ε\varepsilon-differentially private, where

ε=supx,x~F(x)F(x~)1(s).\varepsilon=\sup_{x,\tilde{x}}\mathinner{\!\left\lVert F(x)-F(\tilde{x})\right\rVert}_{\ell^{1}(s)}.

Here the supremum is over all pairs of input vectors in Ωn\Omega^{n} that differ in one coordinate, and z1(s)=i=1k|zi|/si\mathinner{\!\left\lVert z\right\rVert}_{\ell^{1}(s)}=\sum_{i=1}^{k}\mathinner{\!\left\lvert z_{i}\right\rvert}/{s_{i}}.

Proof of Lemma A.1.

Suppose x,x~Ωnx,\tilde{x}\in\Omega^{n} differ in exactly one coordinate. We compare the probabilities that the two inputs produce the same output y=F(x)+λ=F(x~)+λ~ky=F(x)+\lambda=F(\tilde{x})+\tilde{\lambda}\in\mathbb{Z}^{k}. We have

{F(x)+λ=y}{F(x~)+λ~=y}\displaystyle\frac{\mathbb{P}\left\{F(x)+\lambda=y\rule{0.0pt}{8.53581pt}\right\}}{\mathbb{P}\left\{F(\tilde{x})+\tilde{\lambda}=y\rule{0.0pt}{8.53581pt}\right\}} ={λ=yF(x)}{λ~=yF(x~)}\displaystyle=\frac{\mathbb{P}\left\{\lambda=y-F(x)\rule{0.0pt}{8.53581pt}\right\}}{\mathbb{P}\left\{{\tilde{\lambda}}={y-F(\tilde{x})}\rule{0.0pt}{8.53581pt}\right\}}
=i=1kexp(|(yF(x))i|si)i=1kexp(|(yF(x~))i|si)\displaystyle=\frac{\prod_{i=1}^{k}\exp\left(-\frac{|(y-F(x))_{i}|}{s_{i}}\right)}{\prod_{i=1}^{k}\exp\left(-\frac{|(y-F(\tilde{x}))_{i}|}{s_{i}}\right)}
=exp(i=1k1si(|(yF(x))i||(yF(x~))i|))\displaystyle=\exp\left(-\sum_{i=1}^{k}\frac{1}{s_{i}}\left(|(y-F(x))_{i}|-|(y-F(\tilde{x}))_{i}|\right)\right)
exp(F(x)F(x~)1(s))\displaystyle\leq\exp\left(\|F(x)-F(\tilde{x})\|_{\ell^{1}(s)}\right)
eε.\displaystyle\leq e^{\varepsilon}.

Therefore, the map xF(x)+λx\mapsto F(x)+\lambda is ε\varepsilon-differentially private. ∎

Proof of Theorem 4.1.

Consider the map F(𝒳)=(nθ)F(\mathcal{X})=(n_{\theta}) that transforms the input data into the vector of counts. Suppose a pair of input data 𝒳\mathcal{X} and 𝒳~\tilde{\mathcal{X}} differ in one point xix_{i}. Consider the corresponding vectors of counts (nθ)(n_{\theta}) and (n~θ)(\tilde{n}_{\theta}). For each level j=0,,rj=0,\ldots,r, the vectors of counts differ for a single θ{0,1}j\theta\in\{0,1\}^{j}, namely for the θ\theta that corresponds to the region Ωθ\Omega_{\theta} containing xix_{i}. Moreover, whenever such a difference occurs, we have |nθn~θ|=1\mathinner{\!\left\lvert n_{\theta}-\tilde{n}_{\theta}\right\rvert}=1. Thus, extending the vector (σj)j=0r(\sigma_{j})_{j=0}^{r} to (σθ)θ{0,1}r(\sigma_{\theta})_{\theta\in\{0,1\}^{\leq r}} trivially (by converting σj\sigma_{j} to σθ\sigma_{\theta} for all θ{0,1}j\theta\in\{0,1\}^{j}), we have

F(𝒳)F(𝒳~)1(σ)=j=0r1σjθ{0,1}j|nθn~θ|=j=0r1σj=ε.\mathinner{\!\left\lVert F(\mathcal{X})-F(\tilde{\mathcal{X}})\right\rVert}_{\ell^{1}(\sigma)}=\sum_{j=0}^{r}\frac{1}{\sigma_{j}}\sum_{\theta\in\{0,1\}^{j}}\mathinner{\!\left\lvert n_{\theta}-\tilde{n}_{\theta}\right\rvert}=\sum_{j=0}^{r}\frac{1}{\sigma_{j}}=\varepsilon.

Applying Lemma A.1, we conclude that the map 𝒳(nθ+λθ)\mathcal{X}\mapsto(n_{\theta}+\lambda_{\theta}) is ε\varepsilon-differentially private. ∎
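The mechanism in this proof is easy to simulate. The sketch below is ours (it assumes NumPy and is not taken from the paper’s code): it adds Lap(σj)\operatorname{Lap}_{\mathbb{Z}}(\sigma_{j}) noise to the counts at each level and reports the budget ε=j=0r1/σj\varepsilon=\sum_{j=0}^{r}1/\sigma_{j} guaranteed by Theorem 4.1:

import numpy as np

def discrete_laplace(sigma: float, size: int, rng: np.random.Generator) -> np.ndarray:
    """Lap_Z(sigma): integers with P(k) proportional to exp(-|k| / sigma) (our sketch)."""
    p = np.exp(-1.0 / sigma)
    g1 = rng.geometric(1 - p, size=size) - 1
    g2 = rng.geometric(1 - p, size=size) - 1
    return g1 - g2

def perturb_hierarchical_counts(counts_by_level, sigmas, rng):
    """Our sketch: add level-dependent discrete Laplace noise to the counts n_theta.

    counts_by_level[j] is the length-2^j array of counts at level j, and sigmas[j]
    is the noise magnitude sigma_j; the total privacy budget is sum_j 1 / sigma_j."""
    return [counts + discrete_laplace(sigma, counts.size, rng)
            for counts, sigma in zip(counts_by_level, sigmas)]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    counts_by_level = [np.array([100]), np.array([60, 40]), np.array([35, 25, 22, 18])]
    sigmas = [6.0, 6.0, 6.0]
    print("privacy budget eps =", sum(1 / s for s in sigmas))   # 0.5
    print(perturb_hierarchical_counts(counts_by_level, sigmas, rng))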

A.8. Proof of Theorem 4.3

Proof.

We use the method of Lagrange multipliers to find the optimal choices of σj\sigma_{j}. Given the maximal level rr, recall from Theorem 4.1 that the privacy budget is

ε=j=0r1σj.\varepsilon=\sum_{j=0}^{r}\frac{1}{\sigma_{j}}.

Therefore, we aim to minimize the accuracy bound subject to the specified privacy budget, namely

min𝔼W1(μ𝒳,μ𝒴)s.t. ε=j=0r1σj.\displaystyle\min\operatorname{\mathbb{E}}W_{1}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})\quad\;\textrm{s.t. }\varepsilon=\sum_{j=0}^{r}\frac{1}{\sigma_{j}}.

Recall the result in Theorem 4.2. Here ε,n\varepsilon,n are given, and δ\delta is fixed once the maximal level rr is chosen. So the minimization problem becomes

minj=0rσjΔj1s.t. ε=j=0r1σj.\displaystyle\min\sum_{j=0}^{r}\sigma_{j}\Delta_{j-1}\quad\;\textrm{s.t. }\varepsilon=\sum_{j=0}^{r}\frac{1}{\sigma_{j}}.

Consider the Lagrangian function

f(σ0,,σr;t):=j=0rσjΔj1t(j=0r1σjε)f(\sigma_{0},\dots,\sigma_{r};t)\mathrel{\mathop{\mathchar 58\relax}}=\sum_{j=0}^{r}\sigma_{j}\Delta_{j-1}-t\left(\sum_{j=0}^{r}\frac{1}{\sigma_{j}}-\varepsilon\right)

and the corresponding equation

fσ0==fσr=ft=0.\frac{\partial f}{\partial\sigma_{0}}=\dots=\frac{\partial f}{\partial\sigma_{r}}=\frac{\partial f}{\partial t}=0.

One can easily check that the equations above have a unique solution

σj=SεΔj1whereS=j=0rΔj1.\sigma_{j}=\frac{S}{\varepsilon\sqrt{\Delta_{j-1}}}\quad\text{where}\quad S=\sum_{j=0}^{r}\sqrt{\Delta_{j-1}}. (A.4)

and it is indeed a minimum point of the constrained problem.

As a result, if we fix ε\varepsilon and want Algorithm 4 to be ε\varepsilon-differentially private, we should choose the noise magnitudes as in (A.4). Substituting these noise magnitudes into the accuracy bound of Theorem 4.2, we see that the accuracy is bounded by 2εnS2+δ\frac{\sqrt{2}}{\varepsilon n}S^{2}+\delta. ∎
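For illustration, the optimal magnitudes (A.4) are straightforward to compute. The following sketch is ours (the function name is hypothetical): given the sequence of Δj1\Delta_{j-1}, it returns the σj\sigma_{j} and checks that they exhaust the privacy budget exactly:

import math

def optimal_sigmas(deltas, eps):
    """Our sketch of (A.4): deltas[j] plays the role of Delta_{j-1}, and
    sigma_j = S / (eps * sqrt(Delta_{j-1})) with S = sum_j sqrt(Delta_{j-1})."""
    S = sum(math.sqrt(x) for x in deltas)
    return [S / (eps * math.sqrt(x)) for x in deltas], S

if __name__ == "__main__":
    eps, n, d, r = 1.0, 10**6, 2, 10
    deltas = [2 ** ((1 - 1 / d) * (j - 1)) for j in range(r + 1)]   # the hypercube example of Corollary 4.4
    sigmas, S = optimal_sigmas(deltas, eps)
    print("budget check:", sum(1 / s for s in sigmas))              # equals eps
    print("leading accuracy term:", math.sqrt(2) * S**2 / (eps * n))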

A.9. Proof of Corollary 4.4

Proof.

Let Ω=[0,1]\Omega=[0,1] with the \ell^{\infty} metric. The natural hierarchical binary decomposition of [0,1][0,1] (cut through the middle) produces subintervals of length diam(Ωθ)=2j\operatorname{diam}(\Omega_{\theta})=2^{-j} for θ{0,1}j\theta\in\{0,1\}^{j}, so Δj=1\Delta_{j}=1 for all jj, and the resolution is δ=2r\delta=2^{-r}. Theorem 4.3 then yields ε\varepsilon-differentially private synthetic data with accuracy

𝔼W1(μ𝒳,μ𝒴)2(r+1)2εn+2r.\operatorname{\mathbb{E}}W_{1}\left(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}\right)\leq\frac{\sqrt{2}(r+1)^{2}}{\varepsilon n}+2^{-r}.

A nearly optimal choice for rr is r=log2(εn)1r=\log_{2}(\varepsilon n)-1, which yields

𝔼W1(μ𝒳,μ𝒴)(2+2)log22(εn)εn.\operatorname{\mathbb{E}}W_{1}\left(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}\right)\leq\frac{(2+\sqrt{2})\log_{2}^{2}(\varepsilon n)}{\varepsilon n}.

The optimal noise magnitudes, per (A.4), are σj=log2(εn)/ε\sigma_{j}=\log_{2}(\varepsilon n)/\varepsilon. In other words, the noise does not decay with the level.

Let Ω=[0,1]d\Omega=[0,1]^{d} for d>1d>1. The natural hierarchical binary decomposition of [0,1]d[0,1]^{d} (cut through the middle along a coordinate hyperplane) produces subregions of diameter diam(Ωθ)2j/d\operatorname{diam}(\Omega_{\theta})\asymp 2^{-j/d} for θ{0,1}j\theta\in\{0,1\}^{j}, so Δj=2j2j/d=2(11/d)j\Delta_{j}=2^{j}\cdot 2^{-j/d}=2^{(1-1/d)j} for all jj, and the resolution is δ=2r/d\delta=2^{-r/d}. Thus,

S=j=0rΔj1212(11d)r.S=\sum_{j=0}^{r}\sqrt{\Delta_{j-1}}\sim 2^{\frac{1}{2}(1-\frac{1}{d})r}.

Theorem 4.3 then yields ε\varepsilon-differentially private synthetic data with accuracy

𝔼W1(μ𝒳,μ𝒴)2(11d)rεn+2r/d.\operatorname{\mathbb{E}}W_{1}\left(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}\right)\lesssim\frac{2^{(1-\frac{1}{d})r}}{\varepsilon n}+2^{-r/d}.

A nearly optimal choice for the depth of the partition is r=log2(εn)r=\log_{2}(\varepsilon n), which yields

𝔼W1(μ𝒳,μ𝒴)(εn)1/d.\operatorname{\mathbb{E}}W_{1}\left(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}\right)\lesssim(\varepsilon n)^{-1/d}.

The optimal noise magnitudes, per (A.4), are

σjε1212(11d)(rj).\sigma_{j}\sim\varepsilon^{-1}2^{\frac{1}{2}(1-\frac{1}{d})(r-j)}.

Thus, the noise decays with the level jj, becoming O(1)O(1) per region for the smallest regions. ∎
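As a numerical companion to this corollary (our sketch; the function name is hypothetical), one can compute the σj\sigma_{j} from (A.4) for the dyadic decomposition of the cube and observe the contrast between d=1d=1, where the noise magnitude is constant across levels, and d2d\geq 2, where it decays geometrically and is O(1)O(1) at the finest level:

import math

def sigmas_for_hypercube(eps: float, n: int, d: int):
    """Our sketch: sigma_j from (A.4) for the dyadic decomposition of [0, 1]^d."""
    r = max(1, round(math.log2(eps * n)) - (1 if d == 1 else 0))   # depth as chosen in the proof
    deltas = [2 ** ((1 - 1 / d) * (j - 1)) for j in range(r + 1)]
    S = sum(math.sqrt(x) for x in deltas)
    return [S / (eps * math.sqrt(x)) for x in deltas]

if __name__ == "__main__":
    for d in (1, 2):
        s = sigmas_for_hypercube(eps=1.0, n=10**6, d=d)
        # d = 1: the same magnitude at every level; d = 2: geometric decay, O(1) at the bottom
        print(f"d={d}: top-level sigma = {s[0]:.1f}, finest-level sigma = {s[-1]:.1f}")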

A.10. Proof of Lemma 4.6

Proof.

If aa and bb are comparable, both quantities are zero. If aa and bb are not comparable, we can assume a1>b1a_{1}>b_{1} and a2<b2a_{2}<b_{2} without loss of generality. The set of points that are comparable to bb is

{(x1,x2)+2x1b1,x2b2}{(x1,x2)+2x1b1,x2b2}.\{(x_{1},x_{2})\in\mathbb{Z}_{+}^{2}\mid x_{1}\leq b_{1},x_{2}\leq b_{2}\}\cup\{(x_{1},x_{2})\in\mathbb{Z}_{+}^{2}\mid x_{1}\geq b_{1},x_{2}\geq b_{2}\}.

Note that the distance from aa to the first set is |a1b1|\mathinner{\!\left\lvert a_{1}-b_{1}\right\rvert} and the distance from aa to the second set is |a2b2|\mathinner{\!\left\lvert a_{2}-b_{2}\right\rvert}. Then flux(a,b)\operatorname{flux}(a,b) is the smaller one of the two distances, which is also the distance from aa to the union set. ∎
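In code, the quantity computed in this proof reads as follows (our illustrative helper; the name mirrors the paper’s notation):

def flux(a, b):
    """Illustrative helper: flux(a, b) for a, b in Z_+^2 is zero if a and b are
    comparable, and otherwise the smaller of the two coordinate gaps, as derived above."""
    comparable = (a[0] <= b[0] and a[1] <= b[1]) or (a[0] >= b[0] and a[1] >= b[1])
    return 0 if comparable else min(abs(a[0] - b[0]), abs(a[1] - b[1]))

if __name__ == "__main__":
    print(flux((5, 1), (2, 4)))   # incomparable: min(3, 3) = 3
    print(flux((3, 7), (1, 2)))   # comparable: 0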

A.11. Proof of Lemma 4.7

Proof.

Case 1: a=(a1,a2)a=(a_{1},a_{2}) and b=(b1,b2)b=(b_{1},b_{2}) are comparable. If aba\preceq b, add b1a1b_{1}-a_{1} balls to bin 1 and b2a2b_{2}-a_{2} balls to bin 2 to achieve the result. If bab\preceq a, remove a1b1a_{1}-b_{1} balls from bin 1 and a2b2a_{2}-b_{2} balls from bin 2 to achieve the result.

Case 2: a=(a1,a2)a=(a_{1},a_{2}) and b=(b1,b2)b=(b_{1},b_{2}) are incomparable. Without loss of generality, we can assume that a1b10a_{1}-b_{1}\geq 0, a2b20a_{2}-b_{2}\leq 0.

Assume first that a1b1b2a2a_{1}-b_{1}\geq b_{2}-a_{2}. Then flux(a,b)=b2a2M\operatorname{flux}(a,b)=b_{2}-a_{2}\eqqcolon M. Set Δ(a1+a2)(b1+b2)0\Delta\coloneqq(a_{1}+a_{2})-(b_{1}+b_{2})\geq 0. Removing Δ\Delta balls from bin 1 and transferring MM balls from bin 1 to bin 2 achieves the result. Note that there are enough balls in bin 11 to transfer, since M+Δ=a1b1[0,a1]M+\Delta=a_{1}-b_{1}\in[0,a_{1}].

Now assume that a1b1b2a2a_{1}-b_{1}\leq b_{2}-a_{2}. Then flux(a,b)=a1b1M\operatorname{flux}(a,b)=a_{1}-b_{1}\eqqcolon M. Set Δ(b1+b2)(a1+a2)0\Delta\coloneqq(b_{1}+b_{2})-(a_{1}+a_{2})\geq 0. Adding Δ\Delta balls to bin 2 and transferring MM balls from bin 1 to bin 2 achieves the result. ∎
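The case analysis above translates into a short routine (our sketch, not the paper’s code) that lists the moves turning the configuration aa into bb: one transfer of flux(a,b)\operatorname{flux}(a,b) balls when the two configurations are incomparable, plus additions or removals accounting for the difference in the total number of balls:

def transform_bins(a, b):
    """Our sketch: moves that turn the two-bin configuration a = (a1, a2) into
    b = (b1, b2), following the case analysis in the proof of Lemma 4.7."""
    d = [b[0] - a[0], b[1] - a[1]]                  # what each bin still needs
    moves = []
    if d[0] * d[1] < 0:                             # incomparable: one transfer is needed
        k = min(abs(d[0]), abs(d[1]))               # = flux(a, b), by Lemma 4.6
        src, dst = (0, 1) if d[0] < 0 else (1, 0)   # move from the bin with a surplus
        moves.append(("transfer", src + 1, dst + 1, k))
        d[src] += k
        d[dst] -= k
    for i in (0, 1):                                # remaining gaps: plain additions or removals
        if d[i] > 0:
            moves.append(("add", i + 1, d[i]))
        elif d[i] < 0:
            moves.append(("remove", i + 1, -d[i]))
    return moves

if __name__ == "__main__":
    print(transform_bins((7, 1), (3, 4)))   # [('transfer', 1, 2, 3), ('remove', 1, 1)]
    print(transform_bins((2, 2), (5, 3)))   # [('add', 1, 3), ('add', 2, 1)]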

A.12. Proof of Lemma 4.8

Proof.

First, we make the total number of points in Ω\Omega correct by adding mnm-n points to Ω\Omega (or removing, if that number is negative).

Apply Lemma 4.7 for the two parts of Ω\Omega: bin Ω0\Omega_{0} that contains n0n_{0} points and bin Ω1\Omega_{1} that contains n1n_{1} points. Since Ω\Omega already contains the correct total number of points mm, we can make the two bins contain the correct number of points, i.e. m0m_{0} and m1m_{1} respectively, by transferring flux((n0,n1),(m0,m1))\operatorname{flux}\left((n_{0},n_{1}),\,(m_{0},m_{1})\right) points from one bin to the other.

Apply Lemma 4.7 for the two parts of Ω0\Omega_{0}: bin Ω00\Omega_{00} that contains n00n_{00} points and bin Ω01\Omega_{01} that contains n01n_{01} points. Since Ω0\Omega_{0} already contains the correct number of points m0m_{0}, we can make the two bins contain the correct number of points, i.e. m00m_{00} and m01m_{01} respectively, by transferring flux((n00,n01),(m00,m01))\operatorname{flux}\left((n_{00},n_{01}),\,(m_{00},m_{01})\right) points from one bin to the other.

Similarly, since Ω1\Omega_{1} already has the correct number of points m1m_{1}, we can make Ω10\Omega_{10} and Ω11\Omega_{11} contain the correct number of points m10m_{10} and m11m_{11} by transferring flux((n10,n11),(m10,m11))\operatorname{flux}\left((n_{10},n_{11}),\,(m_{10},m_{11})\right) points from one bin to the other.

Continuing this way completes the proof. Note that the steps of the iterative procedure we described are interlocked: each step determines from which subregion the points transferred in the previous step are selected, and to which subregion they are moved. For example, the first step calls for adding (or removing) mnm-n points to or from Ω\Omega, but does not specify how these points are distributed between the two parts Ω0\Omega_{0} and Ω1\Omega_{1}; the application of Lemma 4.7 at the next step determines this. ∎
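The bookkeeping in this proof can be illustrated by the following recursion (our sketch; it assumes the counts are stored in dictionaries indexed by the binary strings θ\theta, and it only reports, for each region, the number of points to be transferred between its two halves):

def flux(a, b):
    """flux from Lemma 4.6: zero if comparable, otherwise the smaller coordinate gap."""
    d1, d2 = b[0] - a[0], b[1] - a[1]
    return 0 if d1 * d2 >= 0 else min(abs(d1), abs(d2))

def transfers_per_region(n_counts, m_counts, theta="", depth=2):
    """Our sketch: for each region Omega_theta above the finest level, the number of
    points moved between its two halves when matching (n_theta0, n_theta1) to (m_theta0, m_theta1)."""
    if len(theta) == depth:
        return {}
    a = (n_counts[theta + "0"], n_counts[theta + "1"])
    b = (m_counts[theta + "0"], m_counts[theta + "1"])
    out = {theta: flux(a, b)}
    out.update(transfers_per_region(n_counts, m_counts, theta + "0", depth))
    out.update(transfers_per_region(n_counts, m_counts, theta + "1", depth))
    return out

if __name__ == "__main__":
    n_counts = {"0": 6, "1": 4, "00": 5, "01": 1, "10": 1, "11": 3}   # current counts
    m_counts = {"0": 3, "1": 7, "00": 2, "01": 1, "10": 4, "11": 3}   # target counts
    print(transfers_per_region(n_counts, m_counts))   # {'': 3, '0': 0, '1': 0}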

A.13. Proof of Lemma 4.9

Proof.

We will derive this result from Lemma 4.6. First, let us compute the distance from a=(nθ0,nθ1)a=(n_{\theta 0},n_{\theta 1}) to b=(nθ0,nθ1)=((nθ0+λθ0)+,(nθ1+λθ1)+)b^{\prime}=(n^{\prime}_{\theta 0},n^{\prime}_{\theta 1})=\left((n_{\theta 0}+\lambda_{\theta 0})_{+},\,(n_{\theta 1}+\lambda_{\theta 1})_{+}\right). Since the map xx+x\mapsto x_{+} is 11-Lipschitz, we have

abmax(|λθ0|,|λθ1|).\mathinner{\!\left\lVert a-b^{\prime}\right\rVert}_{\infty}\leq\max\left(\mathinner{\!\left\lvert\lambda_{\theta 0}\right\rvert},\mathinner{\!\left\lvert\lambda_{\theta 1}\right\rvert}\right).

Furthermore, recall that by Algorithm 3, bb^{\prime} is comparable to b=(mθ0,mθ1)b=(m_{\theta 0},m_{\theta 1}). An application of Lemma 4.6 completes the proof. ∎

A.14. Proof of Lemma 4.10

Proof.

Finding the 1-Wasserstein distance in the discrete case is equivalent to solving an optimal transport problem. In fact, we can obtain μU\mu_{U} from μV\mu_{V} by moving |VU|\mathinner{\!\left\lvert V\setminus U\right\rvert} atoms of μV\mu_{V}, each having mass 1/|V|1/\mathinner{\!\left\lvert V\right\rvert}, and distributing their mass uniformly over UU. The distance traveled by each moved atom is bounded by diam(Ω)\operatorname{diam}(\Omega). Therefore the 11-Wasserstein distance between μU\mu_{U} and μV\mu_{V} is bounded by |VU||V|diam(Ω)\frac{|V\setminus U|}{|V|}\operatorname{diam}(\Omega). ∎

Appendix B Discrete Laplacian distribution

Recall that the classical Laplacian distribution Lap(σ)\operatorname{Lap}_{\mathbb{R}}(\sigma) is a continuous distribution with density

f(x)=12σexp(|x|/σ),x.f(x)=\frac{1}{2\sigma}\exp\left(-\mathinner{\!\left\lvert x\right\rvert}/\sigma\right),\quad x\in\mathbb{R}.

A random variable XLap(σ)X\sim\operatorname{Lap}_{\mathbb{R}}(\sigma) has zero mean and

Var(X)=2σ2.\operatorname{Var}(X)=2\sigma^{2}.

To deal with counts, it is more convenient to use the discrete Laplacian distribution Lap(σ)\operatorname{Lap}_{\mathbb{Z}}(\sigma), see [30], which has probability mass function

f(z)=1pσ1+pσexp(|z|/σ),zf(z)=\frac{1-p_{\sigma}}{1+p_{\sigma}}\exp\left(-\mathinner{\!\left\lvert z\right\rvert}/\sigma\right),\quad z\in\mathbb{Z}

where pσ=exp(1/σ)p_{\sigma}=\exp(-1/\sigma). A random variable ZLap(σ)Z\sim\operatorname{Lap}_{\mathbb{Z}}(\sigma) has zero mean and

Var(Z)=2pσ(1pσ)2.\operatorname{Var}(Z)=\frac{2p_{\sigma}}{(1-p_{\sigma})^{2}}.

Thus, one can verify that the discrete Laplacian has a smaller variance than its continuous counterpart:

Var(Z)<2σ2,\operatorname{Var}(Z)<2\sigma^{2}, (B.1)

but the two variances are asymptotically equivalent for large σ\sigma:

Var(Z)2σ21as σ.\frac{\operatorname{Var}(Z)}{2\sigma^{2}}\to 1\quad\text{as }\sigma\to\infty.
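A quick numerical check of (B.1) and of this asymptotic comparison (the snippet is ours):

import math

def discrete_laplace_variance(sigma: float) -> float:
    """Our check: variance of Lap_Z(sigma), namely 2 p / (1 - p)^2 with p = exp(-1 / sigma)."""
    p = math.exp(-1.0 / sigma)
    return 2 * p / (1 - p) ** 2

if __name__ == "__main__":
    for sigma in (0.5, 1.0, 2.0, 10.0, 100.0):
        v = discrete_laplace_variance(sigma)
        # the variance stays below 2 * sigma^2, and the ratio tends to 1 as sigma grows
        print(f"sigma = {sigma:7}: Var = {v:12.4f},  2 sigma^2 = {2 * sigma**2:12.4f},  ratio = {v / (2 * sigma**2):.4f}")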