Algorithmically Effective Differentially Private Synthetic Data
Abstract.
We present a highly effective algorithmic approach for generating -differentially private synthetic data in a bounded metric space with near-optimal utility guarantees under the 1-Wasserstein distance. In particular, for a dataset in the hypercube , our algorithm generates synthetic dataset such that the expected 1-Wasserstein distance between the empirical measure of and is for , and is for . The accuracy guarantee is optimal up to a constant factor for , and up to a logarithmic factor for . Our algorithm has a fast running time of for all and demonstrates improved accuracy compared to the method in [12] for .
1. Introduction
Differential privacy has become the benchmark for privacy protection in scenarios where vast amounts of data need to be analyzed. The aim of differential privacy is to prevent the disclosure of information about individual participants in the dataset. In simple terms, an algorithm that has a randomized output and produces similar results when given two adjacent datasets is considered to be differentially private. This method of privacy protection is increasingly being adopted and implemented in various fields, including the 2020 US Census [2, 29, 28] and numerous machine learning tasks [24].
A wide range of data computations can be performed in a differentially private manner, including regression [17], clustering [37], parameter estimation [21], stochastic gradient descent [36], and deep learning [1]. However, many existing works on differential privacy focus on designing algorithms for specific tasks and are restricted to queries that are predefined before use. This requires expert knowledge and often involves modifying existing algorithms.
One promising solution to this challenge is to generate a synthetic dataset similar to the original dataset with guaranteed differential privacy [27, 8, 31, 7, 10, 11, 12]. As any downstream tasks are based on the synthetic dataset, they can be performed without incurring additional privacy costs.
1.1. Private synthetic data
Mathematically, the problem of generating private synthetic data can be defined as follows. Let $(\Omega, \rho)$ be a metric space. Consider a dataset $X = (x_1, \dots, x_n) \in \Omega^n$. Our goal is to construct an efficient randomized algorithm that outputs differentially private synthetic data $Y = (y_1, \dots, y_m) \in \Omega^m$ such that the two empirical measures
$$\mu_X = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}, \qquad \mu_Y = \frac{1}{m} \sum_{i=1}^{m} \delta_{y_i}$$
are close to each other. We measure the utility of the output by $\mathbb{E}\, W_1(\mu_X, \mu_Y)$, where $W_1$ is the 1-Wasserstein distance, and the expectation is taken over the randomness of the algorithm. The Kantorovich-Rubinstein duality (see, e.g., [47]) gives an equivalent representation of the 1-Wasserstein distance between two measures $\mu$ and $\nu$:

(1.1)   $W_1(\mu, \nu) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \Big( \int_{\Omega} f \, d\mu - \int_{\Omega} f \, d\nu \Big),$

where the supremum is taken over the set of all 1-Lipschitz functions on $\Omega$. Since many machine learning algorithms are Lipschitz [48, 32, 15, 35], Equation (1.1) provides a uniform accuracy guarantee for a wide range of machine learning tasks performed on synthetic datasets whose empirical measure is close to $\mu_X$ in the 1-Wasserstein distance.
1.2. Main results
The most straightforward way to construct differentially private synthetic data is to add independent noise to the location of each data point. However, this method can result in a significant loss of data utility, as the amount of noise needed for privacy protection may be too large [20]. Another direct approach is to add noise to the density function of the empirical measure of , by dividing into small subregions and perturbing the true counts in each subregion. However, Laplacian noise may make the count in some subregion negative, causing the output to become a signed measure. To address this issue, we introduce an algorithmically effective method called the Private Measure Mechanism.
Private Measure Mechanism (PMM)
PMM sets the count to zero if the noisy count in a subregion is negative. Instead of a single partition of , we consider a collection of binary hierarchical partitions of and add inhomogeneous noise to each level of the partition. However, the noisy counts of two subregions do not always add up to the noisy count of the region one level above. We develop an algorithm that enforces the consistency of counts of regions at different levels. PMM has running time , while the running time of the approach in [12] is polynomial in .
The accuracy analysis of PMM uses the hierarchical partitions to estimate the 1-Wasserstein distance in terms of the multi-scale geometry of and the noise magnitude at each level of the partition. In particular, when , by optimizing the choice of the hierarchical partitions and the noise magnitudes, PMM achieves better accuracy than [12] for . The accuracy rate is optimal up to a constant factor for , and up to a logarithmic factor for . We state this in the next theorem.
Hierarchical partitions have appeared in many previous works on the approximation of distributions under Wasserstein distances in a non-private setting, including [4, 18, 50]. In the differential privacy literature, hierarchical partitions are also closely related to the binary tree mechanism [22, 16] for differential privacy under continual observation. However, the accuracy analyses of the two mechanisms are significantly different. In addition, the TopDown algorithm used in the 2020 Census [3] has a similar hierarchical structure and also enforces consistency, but an accuracy analysis of that algorithm is not provided in [3].
Theorem 1.1 (PMM for data in a hypercube).
Let equipped with the metric. PMM outputs an -differentially private synthetic dataset in time such that
Private Signed Measure Mechanism (PSMM)
In addition to PMM, we introduce an alternative method, the Private Signed Measure Mechanism, that achieves the optimal accuracy rate on when in time. The analysis of PSMM is not restricted to the 1-Wasserstein distance, and it can be generalized to provide uniform utility guarantees for other function classes.
We first partition the domain into subregions . Perturbing the counts in each subregion with i.i.d. integer Laplacian noise gives an unbiased approximation of with a signed measure . Then we find the closest probability measure under the bounded Lipschitz distance by solving a linear programming problem.
In the proof of accuracy for PSMM, one ingredient is to estimate the Laplacian complexity of the Lipschitz function class on and connect it to the 1-Wasserstein distance. This type of argument is similar in spirit to the optimal matching problem for two sets of random points in a metric space [38, 39, 9]. When , PSMM achieves the optimal accuracy rate for . For , PSMM achieves a near-optimal accuracy . For , the accuracy becomes .
Comparison to previous results
[42] proved that, assuming the existence of one-way functions, no polynomial-time algorithm can generate private synthetic data on the Boolean cube which approximately preserves all two-dimensional marginals. There exists a substantial body of work on differentially private synthetic data with guarantees limited to accuracy bounds for a finite set of specified queries [5, 40, 23, 43, 33, 46, 12, 13, 14].
[49] considered differentially private synthetic data in with guarantees for any smooth queries with bounded partial derivatives of order , and achieved an accuracy of . Recently, [12] introduced a method based on superregular random walks to generate differentially private synthetic data with near-optimal guarantees in general compact metric spaces. In particular, when the dataset is in , they obtain . A corresponding lower bound of order was also proved in [12, Corollary 9.3]. PMM matches the lower bound up to a constant factor for , and up to a logarithmic factor for .
In terms of computational efficiency, PMM runs in time . This is more efficient compared to the algorithm in [12].
Organization of the paper
The rest of the paper is organized as follows. In Section 2, we introduce some background on differential privacy and distances between measures. Before our main result, we first introduce and analyze the simpler and more direct method PSMM. In Section 3, we describe PSMM in detail and prove its privacy and accuracy for data in a bounded metric space, with detailed results provided for the case of the hypercube. In Section 4, we introduce PMM and analyze its privacy and accuracy. Optimizing the choices of the noise parameters, we obtain the optimal accuracy on the hypercube with running time, which proves Theorem 1.1.
2. Preliminaries
Differential Privacy
We use the following definition from [24]. A randomized algorithm $\mathcal{M}$ provides $\varepsilon$-differential privacy if for any input datasets $X$ and $X'$ that differ in only one element (i.e., $X$ and $X'$ are adjacent datasets) and for any measurable set $S$ in the range of $\mathcal{M}$,
$$\mathbb{P}\{\mathcal{M}(X) \in S\} \le e^{\varepsilon}\, \mathbb{P}\{\mathcal{M}(X') \in S\}.$$
Here the probability is taken over the randomness of $\mathcal{M}$.
Wasserstein distance
Consider a metric space $(\Omega, \rho)$ with two probability measures $\mu$ and $\nu$. Then the 1-Wasserstein distance between them (see, e.g., [47] for more details) is defined as
$$W_1(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\Omega \times \Omega} \rho(x, y)\, d\gamma(x, y),$$
where $\Gamma(\mu, \nu)$ is the set of all couplings of $\mu$ and $\nu$.
Bounded Lipschitz distance
Let $(\Omega, \rho)$ be a bounded metric space. The Lipschitz norm of a function $f : \Omega \to \mathbb{R}$ is defined as
$$\|f\|_{\mathrm{Lip}} = \max\Big\{\, \|f\|_{\infty},\ \sup_{x \neq y} \frac{|f(x) - f(y)|}{\rho(x, y)} \,\Big\},$$
where the second term is the Lipschitz constant of $f$. Let $F_{\mathrm{Lip}}$ be the set of all Lipschitz functions on $\Omega$ with $\|f\|_{\mathrm{Lip}} \le 1$. For signed measures $\mu, \nu$, we define the bounded Lipschitz distance
$$d_{\mathrm{BL}}(\mu, \nu) = \sup_{f \in F_{\mathrm{Lip}}} \Big( \int_{\Omega} f\, d\mu - \int_{\Omega} f\, d\nu \Big).$$
One can easily check that this is a metric. Moreover, in the special case where $\mu$ and $\nu$ are both probability measures, shifting $f$ by a constant does not change the value of $\int f\, d\mu - \int f\, d\nu$. Therefore, for a bounded domain $\Omega$, we can always assume $f(x_0) = 0$ for a fixed $x_0 \in \Omega$, and hence $\|f\|_{\infty} \le \operatorname{diam}(\Omega)$, when computing the supremum in (1.1). This implies that the $d_{\mathrm{BL}}$-metric is equivalent to the classical $W_1$-metric when $\mu, \nu$ are both probability measures on a bounded domain $\Omega$:

(2.1)   $d_{\mathrm{BL}}(\mu, \nu) \le W_1(\mu, \nu) \le \max\{1, \operatorname{diam}(\Omega)\}\, d_{\mathrm{BL}}(\mu, \nu).$
3. Private signed measure mechanism (PSMM)
We will first introduce PSMM, which is an easier and more intuitive approach. The procedure of PSMM is formally described in Algorithm 1. Note that in the output step of Algorithm 1, the size of the synthetic data depends on the rational approximation of the density function of , and we discuss the details here. Let be the weights of the probability measure on , respectively. We can choose rational numbers such that is arbitrarily small. Let be the least common multiple of the denominators of ; then we output the synthetic dataset containing copies of for .
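To make this output step concrete, the following Python sketch shows one way to carry it out: approximate each weight by a fraction with a bounded denominator, renormalize, take the least common multiple of the denominators, and emit that many copies of each support point. The function name and the `max_denominator` knob are illustrative choices, not part of the algorithm's specification.

```python
from fractions import Fraction
from math import lcm

def measure_to_synthetic_data(support, weights, max_denominator=1000):
    """Turn a discrete probability measure (support points with weights)
    into a synthetic dataset of copies via a rational approximation.

    `max_denominator` controls how fine the rational approximation is;
    it is an illustrative knob, not a quantity from the algorithm."""
    fracs = [Fraction(w).limit_denominator(max_denominator) for w in weights]
    total = sum(fracs)
    fracs = [f / total for f in fracs]          # renormalize to sum exactly to 1
    m = lcm(*[f.denominator for f in fracs])    # common denominator m
    synthetic = []
    for point, f in zip(support, fracs):
        synthetic.extend([point] * (f.numerator * (m // f.denominator)))
    return synthetic                            # m points in total

# Example: weights 0.5, 0.3, 0.2 give 5, 3 and 2 copies of the three points.
print(measure_to_synthetic_data(["a", "b", "c"], [0.5, 0.3, 0.2]))
```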
Before analyzing the privacy and accuracy of PSMM, we introduce a useful complexity measure of a given function class, which quantifies the influence of the Laplacian noise on the function class.
Algorithm 1 (PSMM):
- Compute the true counts: compute the true count in each region .
- Create a new dataset: choose any element independently of , and let be the collection of copies of for each .
- Add noise: perturb the empirical measure of and obtain a signed measure such that
where are i.i.d. discrete Laplacian random variables.
- Linear programming: find the closest probability measure to in the -metric using Algorithm 2, and generate synthetic data from .
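The counting and perturbation steps of Algorithm 1 admit a direct implementation. The sketch below, for data in the unit cube partitioned into equal subcubes, computes the true counts and perturbs them with i.i.d. discrete Laplacian noise, sampled as a difference of two geometric random variables (a standard construction for the discrete Laplacian distribution discussed in Appendix B and [30]). The noise parameter `alpha` and the grid resolution are left as free parameters; the resulting weights may be negative and need not sum to one, which is exactly why the projection of Algorithm 2 is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

def discrete_laplace(alpha, size=None):
    """Discrete Laplacian noise with pmf proportional to alpha**|k|,
    sampled as the difference of two i.i.d. geometric variables."""
    return rng.geometric(1 - alpha, size) - rng.geometric(1 - alpha, size)

def perturbed_subcube_weights(data, cells_per_axis, alpha):
    """Partition [0,1)^d into equal subcubes, count the data points (an
    (n, d) array) in each subcube, add discrete Laplacian noise, and return
    the weights of the resulting signed measure (noisy counts divided by n)."""
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    # Index of the subcube containing each point.
    idx = np.floor(data * cells_per_axis).astype(int).clip(0, cells_per_axis - 1)
    flat = np.ravel_multi_index(idx.T, (cells_per_axis,) * d)
    counts = np.bincount(flat, minlength=cells_per_axis ** d)
    noisy = counts + discrete_laplace(alpha, size=counts.shape)
    return noisy / n   # may have negative entries; Algorithm 2 projects this
                       # signed measure back to a probability measure
```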
Algorithm 2 (finding the closest probability measure):
- Compute the distances: compute the pairwise distances .
- Solve the linear program: solve the linear programming problem with variables and constraints:
s.t.
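To make the linear-programming step concrete, the sketch below sets up one possible formulation in Python with the cvxpy modeling library: it expresses the bounded Lipschitz distance through its dual (flat-norm) form, with a transport matrix G and mass creation/destruction slacks a, b, and minimizes it jointly over the candidate probability weights p. The variable names, the choice of the sup-metric between support points, and this particular formulation are our own illustrative choices and need not coincide with the exact linear program of Algorithm 2.

```python
import numpy as np
import cvxpy as cp

def closest_probability_measure(support, weights):
    """Given a signed measure with `weights` on the points in `support`
    (an (m, d) array), return probability weights on the same support that
    are closest in bounded Lipschitz distance, via one linear program.

    The BL distance is encoded through its dual: a transport part G
    (cost = transported mass times distance) plus creation/destruction
    slacks a, b (cost = total created/destroyed mass)."""
    support = np.asarray(support, dtype=float)
    weights = np.asarray(weights, dtype=float)
    m = len(weights)
    # Pairwise distances between support points (here: the sup-metric).
    D = np.max(np.abs(support[:, None, :] - support[None, :, :]), axis=-1)

    p = cp.Variable(m, nonneg=True)        # candidate probability weights
    G = cp.Variable((m, m), nonneg=True)   # mass moved from point i to point j
    a = cp.Variable(m, nonneg=True)        # mass created at point i
    b = cp.Variable(m, nonneg=True)        # mass destroyed at point i

    constraints = [
        cp.sum(p) == 1,
        # Mass balance: the discrepancy (weights - p) is explained by net
        # outflow of G plus the creation/destruction slacks.
        a - b + cp.sum(G, axis=1) - cp.sum(G, axis=0) == weights - p,
    ]
    objective = cp.Minimize(cp.sum(cp.multiply(D, G)) + cp.sum(a) + cp.sum(b))
    cp.Problem(objective, constraints).solve()
    return p.value
```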
3.1. Laplacian complexity
Given the Kantorovich-Rubinstein duality (1.1), to control the -distance between the original measure and the private measure, we need to describe how Lipschitz functions behave under Laplacian noise. As an analog of the worst-case Rademacher complexity [6, 25], we consider the worst-case Laplacian complexity. Such a worst-case complexity measure appears since the original dataset is deterministic without any distribution assumption.
Definition 3.1 (Worst-case Laplacian complexity).
Let $F$ be a function class on a metric space $(\Omega, \rho)$. The worst-case Laplacian complexity of $F$ is defined by

(3.1)   $\mathcal{L}_n(F) = \sup_{x_1, \dots, x_n \in \Omega}\ \mathbb{E} \sup_{f \in F} \Big| \frac{1}{n} \sum_{i=1}^{n} \lambda_i f(x_i) \Big|,$

where $\lambda_1, \dots, \lambda_n$ are i.i.d. $\mathrm{Lap}(1)$ random variables.
Since Laplacian random variables are sub-exponential but not sub-gaussian, the corresponding complexity measure is not equivalent to the Gaussian or Rademacher complexity, but it is related to the suprema of mixed-tail processes [19] and the quadratic empirical process [34]. Our next proposition bounds in terms of the covering numbers of . Its proof is a classical application of Dudley’s chaining method (see, e.g., [45]).
Proposition 3.2 (Bounding Laplacian complexity with Dudley’s entropy integral).
Suppose that is a metric space and is a set of functions on . Then
where is the covering number of and is an absolute constant.
In particular, we are interested in the case where is the class of all the bounded Lipschitz functions. One can find the result in [41], or a more explicit bound in [26], that for the set of functions with , the covering number satisfies
When , a better bound on the covering number for Lipschitz functions is available from [41, 48]:
which implies the following corollary.
Corollary 3.3 (Laplacian complexity for Lipschitz functions on the hypercube).
Let with the metric, and be the set of all Lipschitz functions on with . We have
Discrete Laplacian complexity
Laplacian complexity can be useful for differential privacy algorithms based on the Laplacian mechanism [24]. However, since PSMM perturbs counts in each subregion, it is more convenient for us to add integer noise to the true counts. Instead, we will use the worst-case discrete Laplacian complexity defined below:
(3.2)   $\mathcal{L}^{\mathrm{disc}}_n(F) = \sup_{x_1, \dots, x_n \in \Omega}\ \mathbb{E} \sup_{f \in F} \Big| \frac{1}{n} \sum_{i=1}^{n} \lambda_i f(x_i) \Big|,$
where are i.i.d. discrete Laplacian random variables.
3.2. Privacy and Accuracy of Algorithm 1
The privacy guarantee of Algorithm 1 can be proved by checking the definition. The essence of the proof is the same as the classical Laplacian mechanism [24].
We now turn to accuracy. The linear programming problem stated in Algorithm 2 has many variables and many constraints, which can be solved in polynomial time in . We first show that Algorithm 2 indeed outputs the closest probability measure to in the -distance in the next proposition.
Proposition 3.5.
For a discrete signed measure on , Algorithm 2 gives its closest probability measure in the -distance with the same support set, with running time polynomial in .
Now we are ready to analyze the accuracy of Algorithm 1. In PSMM, independent Laplacian noise is added to the count of each sub-region. Therefore, the Laplacian complexity arises when considering the expected Wasserstein distance between the original empirical measure and the synthetic measure.
Theorem 3.6 (Accuracy of Algorithm 1).
Suppose is a partition of and is the set of all functions with Lipschitz norm bounded by . Then the measure generated from Algorithm 1 satisfies
Note that can be satisfied when we take a partition of where each is a subcube of the same size. Using the formula above and the result of Laplacian complexity for the hypercube in Corollary 3.3, one can easily deduce the following result.
4. Private measure mechanism (PMM)
4.1. Binary partition and noisy counts
A binary hierarchical partition of a set of depth is a family of subsets indexed by , where
and such that is partitioned into and for every . By convention, the cube consists of a single element . We usually drop the subscript and write instead of . When , we call the level of . We can also encode a binary hierarchical partition of in a binary tree of depth , where the root is labeled and the -th level of the tree encodes the subsets for at level .
Let be a binary partition of . Given true data , the true count is the number of data points in the region , i.e.
We will convert true counts into noisy counts by adding Laplacian noise; all regions on the same level will receive noise of the same expected magnitude. Formally, we set
and is the level of . At this point, the magnitudes of the noise can be arbitrary.
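For data in [0,1) with the standard dyadic binary partition, the true counts at every level and their noisy versions can be computed as in the sketch below. The per-level noise magnitudes sigma[j] are left as free parameters, as in the text; rounding the noisy counts to integers (and clipping them at zero) is done here only to keep the sketch simple.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_counts(data, depth):
    """Counts of the dyadic binary hierarchical partition of [0, 1).

    Returns a list `counts` where counts[j] has 2**j entries: the number of
    data points in each level-j interval [k 2**-j, (k+1) 2**-j)."""
    data = np.asarray(data, dtype=float)
    leaf = np.floor(data * 2 ** depth).astype(int).clip(0, 2 ** depth - 1)
    counts = [np.bincount(leaf, minlength=2 ** depth)]
    for _ in range(depth):
        counts.append(counts[-1].reshape(-1, 2).sum(axis=1))  # merge siblings
    counts.reverse()
    return counts  # counts[0] = [n], ..., counts[depth] = bottom-level counts

def noisy_counts(counts, sigma):
    """Add independent Laplacian noise of magnitude sigma[j] at level j,
    then round and clip at zero (a simplification made for this sketch)."""
    return [np.maximum(0, np.rint(c + rng.laplace(scale=sigma[j], size=c.shape)))
            .astype(int) for j, c in enumerate(counts)]
```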
4.2. Consistency
The true counts are non-negative and consistent, i.e., the counts of subregions always add up to the count of the region:
The noisy counts are non-negative, but not necessarily consistent. Algorithm 3 enforces consistency by adjusting the counts iteratively, from top to bottom. In the case of a deficit, when the sum of the two subregional counts is smaller than the count of the region, we increase both subregional counts. In the opposite case of a surplus, we decrease both subregional counts. Apart from this requirement, we are free to distribute the deficit or surplus between the subregional counts.
It is convenient to state this requirement by considering a product partial order on , where we declare that if and only if and . We call the two vectors comparable if either or . Furthermore, denotes the line on the plane.
At each step, Algorithm 3 uses a transformation . It can be chosen arbitrarily; the only requirement is that be comparable with . The comparability requirement is natural and non-restrictive. For example, the uniform transformation selects the closest point in the discrete interval in (say) the Euclidean metric. Alternatively, the proportional transformation selects the point in the discrete interval that is closest to the line connecting the input vector and the origin.
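The following is a sketch of this top-down consistency step, using the proportional transformation mentioned above (one admissible choice of the map): for every region, the pair of noisy children counts is rescaled so that it sums to the already-fixed parent count, which moves both children in the same direction and therefore keeps the adjusted pair comparable with the original one.

```python
import numpy as np

def enforce_consistency(noisy):
    """Top-down consistency: for every region, rescale the pair of noisy
    children counts so that it sums to the already-fixed count of the region.
    The proportional rescaling moves both children in the same direction,
    so the adjusted pair stays comparable with the original pair."""
    consistent = [np.asarray(noisy[0], dtype=int).copy()]  # keep the root count
    for j in range(1, len(noisy)):
        parent = consistent[-1]
        left, right = noisy[j][0::2], noisy[j][1::2]
        new = np.empty(2 * len(parent), dtype=int)
        for k, t in enumerate(parent):               # t = target total of the pair
            l, r = int(left[k]), int(right[k])
            if l + r == 0:
                l_new = t // 2                       # arbitrary split of a deficit
            else:
                l_new = int(round(t * l / (l + r)))  # proportional transformation
            new[2 * k], new[2 * k + 1] = l_new, t - l_new
        consistent.append(new)
    return consistent
```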
4.3. Synthetic data
Algorithm 3 ensures that the output counts are non-negative, integer, and consistent. They are also private since they are a function of the noisy counts , which are private as we proved. Therefore, the counts can be used to generate private synthetic data by putting points in cell . Algorithm 4 makes this formal.
Algorithm 4 (PMM):
- Compute true counts: let be the number of data points in .
- Add noise: let , where are i.i.d. random variables.
- Enforce consistency: convert the noisy counts to consistent counts using Algorithm 3.
- Sample: choose any points in each cell , independently of .
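Finally, here is a sketch of the sampling step for the one-dimensional dyadic partition: it places the prescribed number of synthetic points in every bottom-level cell, here simply at the midpoint of each interval (any choice of points within a cell is allowed). The commented pipeline uses the helper functions sketched in Sections 4.1 and 4.2.

```python
import numpy as np

def sample_from_leaf_counts(leaf_counts):
    """Given the consistent counts of the 2**depth bottom-level cells of the
    dyadic partition of [0, 1), output a synthetic dataset with that many
    points in each cell (here simply the midpoint of every interval)."""
    leaf_counts = np.asarray(leaf_counts, dtype=int)
    depth = int(np.log2(len(leaf_counts)))
    midpoints = (np.arange(len(leaf_counts)) + 0.5) * 2.0 ** (-depth)
    return np.repeat(midpoints, leaf_counts)

# Example pipeline, using the helpers sketched in Sections 4.1 and 4.2:
#   counts     = true_counts(data, depth)
#   consistent = enforce_consistency(noisy_counts(counts, sigma))
#   synthetic  = sample_from_leaf_counts(consistent[-1])
```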
4.4. Privacy and accuracy of Algorithm 4
We first prove that Algorithm 4 is differentially private. The proof idea is similar to that of the classical Laplacian mechanism, but now the noise has a different scale at each level, so more delicate calculations are needed.
Theorem 4.1 (Privacy of Algorithm 4).
Having analyzed the privacy of the synthetic data, we now turn to its accuracy. It is determined by the magnitudes of the noise and by the multiscale geometry of the domain . The latter is captured by the diameters of the regions , specifically by their sum at each level, which we denote
(4.1)
and adopt the notation . In addition to , the accuracy is affected by the resolution of the partition, which is the maximum diameter of the cells, denoted by
Theorem 4.2 (Accuracy of Algorithm 4).
Algorithm 4 that transforms true data into synthetic data has the following expected accuracy in the Wasserstein metric:
Here and are the empirical probability distributions on the true and synthetic data, respectively.
The privacy and accuracy guarantees of Algorithm 4 (Theorems 4.1 and 4.2) hold for any choice of noise levels . By optimizing , we can achieve the best accuracy for a given level of privacy.
Theorem 4.3 (Optimized accuracy).
Corollary 4.4 (Optimized accuracy for hypercubes).
When equipped with the metric, with the optimal choice of magnitude levels (A.4) and the optimal choice of
we have
4.5. Proof of Theorem 4.2
For the proof of Theorem 4.2, we introduce a quantitative notion for the incomparability of two vectors on the plane. For vectors , we define
Lemma 4.6 (Flux as incomparability).
is the -distance from to the set of points that are comparable to .
For example, if and , then . Note that has a distance to the vector which is comparable with .
Lemma 4.7 (Flux as transfer).
Suppose we have two bins with and balls in them. Then one can achieve and balls in these bins by:
(a) first making the total number of balls correct by adding a total of balls to the two bins (or removing, if that number is negative);
(b) then transferring balls from one bin to the other.
For example, suppose that one bin has ball and the other has . Then we can achieve and balls in these bins by first adding balls to the first bin and transferring balls from the second to the first bin. As we noted above, is the flux between the vectors and .
Lemma 4.7 can be generalized to the hierarchical binary partition of as follows.
Lemma 4.8.
Consider any data set , and let be its counts. Consider any consistent vector of non-negative integers . Then one can transform into a set that has counts by:
(a) first making the total number of points correct by adding a total of points to (or removing, if that number is negative);
(b) then transferring points from to or vice versa, for all and .
Combining the notion of flux with our algorithm, the following two lemmas are useful in the proof of Theorem 4.2.
Lemma 4.9.
Lemma 4.10.
For any finite multisets such that all elements in are from , one has
Proof.
(Proof of Theorem 4.2) Owing to Lemma 4.8 and Lemma 4.9, the creation of synthetic data from the true data , described by Algorithm 4, can be achieved by the following three steps.
1. Transform the -point input set to an -point set by adding or removing points.
2. Transform to by moving at most many data points for each and between the two parts of the region .
3. Transform to the output data by relocating points within their cells.
We will analyze the accuracy of these steps one at a time.
Analyzing Step 2. The total distance the points are moved at this step is bounded by
(4.2)
Since , it follows that
(4.3)
Combining Steps 1 and 2. Recall that step 1 transforms the input data with into with by adding or removing points, depending on the sign of .
Case 1: . Here is obtained from by adding points, so Lemma 4.10 gives
Combining this with (4.3) by triangle inequality, we conclude that
Case 2: . Here is obtained from by removing a set of points. Furthermore, by our analysis of step 2, is obtained from by moving points the total distance at most . Therefore, (as a multiset) is obtained from by moving points the total distance at most , too. (The points in remain unmoved.) Since , it follows that
Moreover, Lemma 4.10 gives
(Here we used that the multiset has the same number of points as , which is .) Combining the two bounds by triangle inequality, we obtain
(4.4)
In other words, this bound holds in both cases.
Analyzing Step 3. This step is the easiest to analyze: is obtained from by relocating points within their cells, and the maximal diameter of the cells is , so we have . Combining this with (4.4) by the triangle inequality, we conclude that
Acknowledgements
R.V. acknowledges support from NSF DMS–1954233, NSF DMS–2027299, U.S. Army 76649–CS, and NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning. Y.Z. is partially supported by NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning.
The authors thank March Boedihardjo, Girish Kumar, and Thomas Strohmer for helpful discussions.
References
- [1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
- [2] John Abowd, Robert Ashmead, Simson Garfinkel, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, and William Sexton. Census topdown: Differentially private data, incremental schemas, and consistency with public knowledge. US Census Bureau, 2019.
- [3] John M Abowd, Robert Ashmead, Ryan Cumings-Menon, Simson Garfinkel, Micah Heineck, Christine Heiss, Robert Johns, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, et al. The 2020 census disclosure avoidance system topdown algorithm. Harvard Data Science Review, (Special Issue 2), 2022.
- [4] Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms for earth mover’s distance. Theory of Computing Systems, 48:428–442, 2011.
- [5] Boaz Barak, Kamalika Chaudhuri, Cynthia Dwork, Satyen Kale, Frank McSherry, and Kunal Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 273–282, 2007.
- [6] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
- [7] Steven M Bellovin, Preetam K Dutta, and Nathan Reitinger. Privacy and synthetic datasets. Stan. Tech. L. Rev., 22:1, 2019.
- [8] Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to noninteractive database privacy. Journal of the ACM (JACM), 60(2):1–25, 2013.
- [9] Sergey G Bobkov and Michel Ledoux. A simple Fourier analytic proof of the AKT optimal matching theorem. The Annals of Applied Probability, 31(6):2567–2584, 2021.
- [10] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Covariance’s loss is privacy’s gain: Computationally efficient, private and accurate synthetic data. Foundations of Computational Mathematics, pages 1–48, 2022.
- [11] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Privacy of synthetic data: A statistical framework. IEEE Transactions on Information Theory, 69(1):520–527, 2022.
- [12] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Private measures, random walks, and synthetic data. arXiv preprint arXiv:2204.09167, 2022.
- [13] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Private sampling: a noiseless approach for generating differentially private synthetic data. SIAM Journal on Mathematics of Data Science, 4(3):1082–1115, 2022.
- [14] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Covariance loss, Szemeredi regularity, and differential privacy. arXiv preprint arXiv:2301.02705, 2023.
- [15] Sébastien Bubeck and Mark Sellke. A universal law of robustness via isoperimetry. Advances in Neural Information Processing Systems, 34:28811–28822, 2021.
- [16] T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3):1–24, 2011.
- [17] Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. Advances in neural information processing systems, 21, 2008.
- [18] Steffen Dereich, Michael Scheutzow, and Reik Schottstedt. Constructive quantization: Approximation by empirical measures. Annales de l’IHP Probabilités et statistiques, 49(4):1183–1203, 2013.
- [19] Sjoerd Dirksen. Tail bounds via generic chaining. Electron. J. Probab, 20(53):1–29, 2015.
- [20] Josep Domingo-Ferrer, David Sánchez, and Alberto Blanco-Justicia. The limits of differential privacy (and its misuse in data release and machine learning). Communications of the ACM, 64(7):33–35, 2021.
- [21] John C Duchi, Michael I Jordan, and Martin J Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 113(521):182–201, 2018.
- [22] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N Rothblum. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 715–724, 2010.
- [23] Cynthia Dwork, Aleksandar Nikolov, and Kunal Talwar. Efficient algorithms for privately releasing marginals via convex relaxations. Discrete & Computational Geometry, 53:650–673, 2015.
- [24] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
- [25] Dylan J Foster and Alexander Rakhlin. ℓ∞ vector contraction for Rademacher complexity. arXiv preprint arXiv:1911.06468, 6, 2019.
- [26] Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Adaptive metric dimensionality reduction. Theoretical Computer Science, 620:105–118, 2016.
- [27] Moritz Hardt, Katrina Ligett, and Frank McSherry. A simple and practical algorithm for differentially private data release. Advances in neural information processing systems, 25, 2012.
- [28] Mathew E Hauer and Alexis R Santos-Lozada. Differential privacy in the 2020 census will distort covid-19 rates. Socius, 7:2378023121994014, 2021.
- [29] Michael B Hawes. Implementing differential privacy: Seven lessons from the 2020 United States Census. Harvard Data Science Review, 2(2), 2020.
- [30] Seidu Inusah and Tomasz Kozubowski. A discrete analogue of the Laplace distribution. Journal of Statistical Planning and Inference, 136:1090–1102, 03 2006.
- [31] James Jordon, Jinsung Yoon, and Mihaela Van Der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. In International conference on learning representations, 2019.
- [32] Leonid V Kovalev. Lipschitz clustering in metric spaces. The Journal of Geometric Analysis, 32(7):188, 2022.
- [33] Terrance Liu, Giuseppe Vietri, and Steven Z Wu. Iterative methods for private synthetic data: Unifying framework and new methods. Advances in Neural Information Processing Systems, 34:690–702, 2021.
- [34] Shahar Mendelson. Empirical processes with a bounded ψ₁ diameter. Geometric and Functional Analysis, 20(4):988–1027, 2010.
- [35] Laurent Meunier, Blaise J Delattre, Alexandre Araujo, and Alexandre Allauzen. A dynamical system perspective for Lipschitz neural networks. In International Conference on Machine Learning, pages 15484–15500. PMLR, 2022.
- [36] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE global conference on signal and information processing, pages 245–248. IEEE, 2013.
- [37] Dong Su, Jianneng Cao, Ninghui Li, Elisa Bertino, and Hongxia Jin. Differentially private k-means clustering. In Proceedings of the sixth ACM conference on data and application security and privacy, pages 26–37, 2016.
- [38] Michel Talagrand. Matching random samples in many dimensions. The Annals of Applied Probability, pages 846–856, 1992.
- [39] Michel Talagrand. The generic chaining: upper and lower bounds of stochastic processes. Springer Science & Business Media, 2005.
- [40] Justin Thaler, Jonathan Ullman, and Salil Vadhan. Faster algorithms for privately releasing marginals. In Automata, Languages, and Programming: 39th International Colloquium, ICALP 2012, Warwick, UK, July 9-13, 2012, Proceedings, Part I 39, pages 810–821. Springer, 2012.
- [41] VM Tikhomirov. ε-entropy and ε-capacity of sets in functional spaces. In Selected works of AN Kolmogorov, pages 86–170. Springer, 1993.
- [42] Jonathan Ullman and Salil Vadhan. PCPs and the hardness of generating private synthetic data. In Theory of Cryptography: 8th Theory of Cryptography Conference, TCC 2011, Providence, RI, USA, March 28-30, 2011. Proceedings 8, pages 400–416. Springer, 2011.
- [43] Salil Vadhan. The complexity of differential privacy. Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich, pages 347–450, 2017.
- [44] Vijay V Vazirani. Approximation algorithms, volume 1. Springer, 2001.
- [45] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- [46] Giuseppe Vietri, Cedric Archambeau, Sergul Aydore, William Brown, Michael Kearns, Aaron Roth, Ankit Siva, Shuai Tang, and Steven Wu. Private synthetic data for multitask learning and marginal queries. In Advances in Neural Information Processing Systems, 2022.
- [47] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.
- [48] Ulrike von Luxburg and Olivier Bousquet. Distance-based classification with Lipschitz functions. J. Mach. Learn. Res., 5(Jun):669–695, 2004.
- [49] Ziteng Wang, Chi Jin, Kai Fan, Jiaqi Zhang, Junliang Huang, Yiqiao Zhong, and Liwei Wang. Differentially private data releasing for smooth queries. The Journal of Machine Learning Research, 17(1):1779–1820, 2016.
- [50] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019.
Appendix A Additional proofs
A.1. Proof of Proposition 3.2
Proof.
We will apply the chaining argument (see, e.g., [45, Chapter 8]) to deduce a bound similar to Dudley’s inequality.
Step 1: (Finding nets)
Define for and consider an -net of of size . Then for any and any level , we can find the closest element in the net, denoted . In other words, there exists s.t.
Let be a positive integer to be determined later. We have the telescoping sum, together with the triangle inequality,
Note that when is small enough, can be covered by .
Step 2: (Bounding the telescoping sum)
For a fixed , we consider the quantity
For simplicity we will denote as the coefficient . Then we have
Since are independent subexponential random variables, we can apply Bernstein’s inequality to the sum . Let , we have
Then we can use the union bound to control the supremum. Define ,
and hence
We will compute them separately.
Therefore we conclude that for a fixed level ,
Step 3: (Bounding the last entry)
For the last entry in the telescoping sum, similarly, we denote and we have . Then
and the expectation satisfies
Step 4: (Combining the bound and choosing ) Combining the two integrals together, we deduce that for any ,
Then for any , we can always choose such that and bound the sum above by an integral
(A.1)
Taking infimum over completes the proof of the first inequality.
Now assume is the set of all functions with . From [26, Lemma 4.2], we can bound the covering number of by the covering number of as follows:
As a result, for any ,
This completes the proof. ∎
A.2. Proof of Corollary 3.3
Proof.
For with -norm, we have and the covering number
Then, as the domain is connected and centered, we can apply the bound for the covering number of from [48, Theorem 17]:
Applying the inequality above to (A.1), we get
(A.2)
Compute the integral for the case and ,
Choosing finishes the cases for .
A.3. Proof of Proposition 3.4
Proof.
It suffices to prove that the steps from to the signed measure in Algorithm 1 are -differentially private, since the remaining steps are based only on . Since both are supported on , we can identify the two discrete measures with -dimensional vectors in the standard simplex, denoted , respectively. Consider two datasets and that differ in one point. Suppose we obtain and through the first four steps of Algorithm 1 from , respectively. The two vectors differ in one coordinate, and the difference is bounded by .
A.4. Proof of Proposition 3.5
Proof.
For two signed measures supported on , the -distance between and is
For simplicity, we denote , and . Then we note that for any with , only matters in the definition above. Therefore, supposing and are fixed, computing the -distance is equivalent to the following linear programming problem:
s.t.
After a change of variable , we can rewrite it as
s.t.
Next, we consider the dual of the linear programming problem above. Duality theory in linear programming [44, Chapter 12] shows that the original problem and the dual problem have the same optimal value. Let be the dual variable for the linear constraints on and , and let be the dual variable for the equation . As the linear program above is in standard form, by duality it is equivalent to
s.t.
To find the minimizer for a given , we regard as variables and add the constraints that is a probability measure. We can also eliminate the constant in the objective function. So we get the linear programming problem:
s.t.
There are variables in total and linear constraints, and the minimizer is what we want. ∎
A.5. Proof of Theorem 3.6
Proof.
We transform the original data measure in three steps:
Step 1: For the first step in the algorithm, we have . This follows from the definition of -Wasserstein distance.
Step 2: In this step, is no longer a probability measure, and we consider instead:
(A.3)
A.6. Proof of Corollary 3.7
A.7. Proof of Theorem 4.1
Theorem 4.1 can be obtained by applying the parallel composition lemma [24]. Here we present a self-contained proof by considering an inhomogeneous version of the classical Laplacian mechanism [24].
Lemma A.1 (Inhomogeneous Laplace mechanism).
Let be any map, be a fixed vector, and be a random vector with independent coordinates . Then the map is -differentially private, where
Here the supremum is over all pairs of input vectors in that differ in one coordinate, and .
Proof of Lemma A.1.
Suppose differs in exactly one coordinate. Consider the density functions of the inputs having the same output . We have
Therefore, we know is -differentially private. ∎
Proof of Theorem 4.1.
Consider the map that transforms the input data into the vector of counts. Suppose a pair of input data and differ in one point . Consider the corresponding vectors of counts and . For each level , the vectors of counts differ for a single , namely for the that corresponds to the region containing . Moreover, whenever such a difference occurs, we have . Thus, extending the vector to trivially (by converting to for all ), we have
Applying Lemma A.1, we conclude that the map is -differentially private. ∎
A.8. Proof of Theorem 4.3
Proof.
We will use the Lagrange multiplier method to find the optimal choices of . Given the maximal level , recalling Theorem 4.1, we should use our privacy budget as
Therefore, we aim to minimize the accuracy bound with the specified privacy budget, namely
Recall the result in Theorem 4.2. Here are given and is fixed as long as we determine the maximal level . So the minimization problem is
Consider the Lagrangian function
and the corresponding equation
One can easily check that the equations above have a unique solution
(A.4)
and it is indeed a minimal point for .
A.9. Proof of Corollary 4.4
Proof.
Let with the metric. The natural hierarchical binary decomposition of (cutting through the middle) produces subintervals of length for , so for all , and the resolution is . Theorem 4.3 yields -differentially private synthetic data with accuracy
A nearly optimal choice for is , which yields
The optimal noise magnitudes, per (A.4), are . In other words, the noise does not decay with the level.
Let for . The natural hierarchical binary decomposition of (cut through the middle along a coordinate hyperplane) makes subintervals of length for , so for all , and the resolution is . Thus,
Theorem 4.3 yields -differentially private synthetic data with accuracy
A nearly optimal choice for the depth of the partition is , which yields
The optimal noise magnitudes, per (A.4), are
Thus, the noise decays with the level , becoming per region for the smallest regions. ∎
A.10. Proof of Lemma 4.6
Proof.
If are comparable, both values are zero. If is not comparable, we can assume , without loss of generality. The set of points that are comparable to is
Note that the distance from to the first set is and the distance from to the second set is . Then is the smaller of the two distances, which is also the distance from to the union of the two sets. ∎
A.11. Proof of Lemma 4.7
Proof.
Case 1: and are comparable. If , remove balls from bin 1 and balls from bin 2 to achieve the result. If , add balls to bin 1 and balls to bin 2 to achieve the result.
Case 2: and are incomparable. Without loss of generality, we can assume that , .
Assume first that . Then . Then . Removing balls from bin 1 and transferring balls from bin 1 to bin 2 achieves the result. Note that there are enough balls in bin to transfer, since .
Now assume that . Then . Then . Adding balls to bin 2 and transferring balls from bin 1 to bin 2 achieves the result. ∎
A.12. Proof of Lemma 4.8
Proof.
First, we make the total number of points in correct by adding points to (or removing, if that number is negative).
Apply Lemma 4.7 for the two parts of : bin that contains points and bin that contains points. Since already contains the correct total number of points , we can make the two bins contain the correct number of points, i.e. and respectively, by transferring points from one bin to the other.
Apply Lemma 4.7 for the two parts of : bin that contains points and bin that contains points. Since already contains the correct number of points , we can make the two bins contain the correct number of points, i.e. and respectively, by transferring points from one bin to the other.
Similarly, since already has the correct number of points , we can make and contain the correct number of points and by transferring points from one bin to the other.
Continuing in this way, we complete the proof. Note that the steps of the iteration procedure we described are interlocked: each step determines from which subregion the points transferred at the previous step are selected, and to which subregion they are moved. For example, the first step calls for adding (or removing) points to or from , but does not specify how these points are distributed between the two parts and ; the application of Lemma 4.7 at the next step determines this. ∎
A.13. Proof of Lemma 4.9
A.14. Proof of Lemma 4.10
Proof.
Finding the 1-Wasserstein distance in the discrete case is equivalent to solving the optimal transport problem. In fact, we can obtain from by moving atoms of , each having mass , and distributing their mass uniformly over . The distance for each movement is bounded by . Therefore the 1-Wasserstein distance between and is bounded by . ∎
Appendix B Discrete Laplacian distribution
Recall that the classical Laplacian distribution is a continuous distribution with density
A random variable has zero mean and
To deal with counts, it is more convenient to use the discrete Laplacian distribution , see [30], which has probability mass function
where . A random variable has zero mean and
Thus, one can verify that the discrete Laplacian has a smaller variance than its continuous counterpart:
(B.1)
but the gap vanishes for large :
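As a quick numerical illustration (not part of the analysis above), the discrete Laplacian with pmf proportional to α^|k| can be sampled as the difference of two i.i.d. geometric random variables, and one can check empirically that it has mean 0 and variance 2α/(1−α)². The parameterization by α below is our own choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def discrete_laplace(alpha, size=None):
    """Discrete Laplacian noise with pmf proportional to alpha**|k| on the
    integers (0 < alpha < 1), sampled as the difference of two i.i.d.
    geometric random variables."""
    return rng.geometric(1 - alpha, size) - rng.geometric(1 - alpha, size)

# Empirical check: the mean is 0 and the variance is 2*alpha/(1 - alpha)**2.
alpha = 0.7
samples = discrete_laplace(alpha, size=200_000)
print(samples.mean(), samples.var(), 2 * alpha / (1 - alpha) ** 2)
```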