Ryan Rogers
A Unifying Privacy Analysis Framework for Unknown Domain Algorithms in Differential Privacy
Abstract
There are many existing differentially private algorithms for releasing histograms, i.e. counts with corresponding labels, in various settings. Our focus in this survey is to revisit some of the existing differentially private algorithms for releasing histograms over unknown domains, i.e. when the labels of the counts to be released are not known beforehand. The main practical advantage of releasing histograms over an unknown domain is that the algorithm does not need to fill in missing labels that are not present in the original histogram but could appear in a hypothetical neighboring dataset. However, the challenge in designing differentially private algorithms for releasing histograms over an unknown domain is that some outcomes can reveal exactly which input was used, a clear violation of privacy. The goal then is to show that these differentiating outcomes occur with very low probability. We present a unified framework for the privacy analyses of several existing algorithms. Furthermore, our analysis uses approximate concentrated differential privacy from Bun and Steinke [1] rather than differential privacy directly, which can improve the privacy loss parameters, especially when composing many of these algorithms together in an overall system.
1 Introduction
Releasing histograms, counts over some set of items, is one of the most fundamental tasks in data analytics. Given a dataset, we want to compute the number of rows that satisfy some condition, grouped by some column(s) of interest, say the number of rows in each country. Despite the commonality of this task, there are many different differentially private algorithms to use, depending on the setting: do we want to show only the top-k, do we know how many items a user can have in any row, do we know all possible values that a column can take, how often is the data refreshed? We will focus on the setting where we do not have, or even know, the counts of all possible items in a histogram. This is typical in practice, because SQL queries will only return items that have positive counts; the system has no way of knowing which items are absent from the dataset. Unfortunately, differential privacy requires considering a hypothetical neighboring dataset, which might contain items that were previously unseen, and this makes the privacy analysis challenging. Either we need to populate all possible items that could appear in the dataset and fill in the missing counts as zero, or we need to design better, more practical DP algorithms. In the latter case, we would want to be able to ensure DP whether we were computing the top-10 skills in a dataset or the top-10 credit card numbers in the dataset. The latter should return no results, but the algorithm should ensure privacy in both cases. We will refer to the scenario where the DP algorithm does not know the domain of items that could be in a dataset beforehand as the Unknown Domain setting.
In differential privacy there are typically two parameters used to quantify the privacy of an algorithm. The parameter ε is typically referred to as the amount of privacy loss that an algorithm ensures, while δ is commonly referred to as the probability with which the algorithm can have privacy loss larger than ε. When δ > 0, we say an algorithm satisfies approximate differential privacy, while with δ = 0 we say it satisfies pure DP. The parameter δ can then be interpreted as the chance that the algorithm returns a result that clearly violates privacy, e.g. returning an individual's data record from the dataset. Not every approximate DP algorithm satisfies this interpretation, which has resulted in variants of DP based on Rényi divergence. In particular, adding Gaussian noise with fixed standard deviation to a statistic of interest cannot be pure DP, but can be (ε(δ), δ)-DP for any δ > 0; hence the probability of failure need not be fixed in advance in the algorithm. However, there are many algorithms that are shown to be (ε, δ)-differentially private for a particular δ. In particular, designing DP algorithms for releasing histograms in the Unknown Domain setting will require setting a failure probability δ in advance.
Consider the example where we want to know the most popular words typed in an email client. The set of all potential words is massive, and can include slang, typos, and abbreviations, so we only want to consider words that are present in the dataset, rather than the set of all possible words. One approach would be to add noise to all word counts and sort them to return the top ones. However, there could be a word like "RyanRogersSSN123456789". Even if there is noise in the counts, the mere presence of this record is a clear violation of privacy. To prevent this, we can introduce a threshold on the noisy counts, but then there is a chance that the noise is especially large, making a small count go above the threshold. This is where the probability δ comes in: we want to make sure that the threshold is set high enough so that small counts with noise can only appear above the threshold with probability at most δ. However, it does not suffice to just bound the probability of bad outcomes to prove an algorithm is approximate differentially private.
We present a general framework that can be used in the privacy analysis of several algorithms that aim to release counts from an unspecified domain, returning only results that are present in the original dataset, although the framework can be applied to other scenarios. The main idea is to show that a mechanism M satisfies three conditions: (1) only small-probability events can lead to differentiating outcomes, i.e. outcomes that make it clear which input dataset was used, (2) there is another mechanism that can depend on both neighboring datasets and is equal to M on all non-differentiating outcomes, and (3) this related mechanism is pure DP. Although the algorithms we cover were previously shown to satisfy approximate differential privacy, we revisit their analyses, following our general framework. Furthermore, we show that providing a privacy analysis in terms of approximate concentrated DP (CDP) can lead to improved privacy parameters, especially for algorithms based on Gaussian noise or the Exponential Mechanism.
We advocate for presenting privacy guarantees of new or existing algorithms in terms of approximate CDP, rather than approximate DP, as composing CDP parameters is straightforward, while composing approximate DP parameters can be complicated and loose. Note that each algorithm is likely to be part of a larger privacy system whose overall privacy guarantee may be stated in terms of approximate DP. It has become common to compose algorithms using an analysis based on approximate CDP and then convert to an overall approximate DP guarantee at the end. Leaving privacy guarantees of individual algorithms in terms of approximate DP leads to loose bounds from converting each DP parameter to a CDP parameter, composing the CDP parameters, and then converting back to DP at the end.
2 Preliminaries
We now define approximate differential privacy, which depends on neighboring datasets x, x′, denoted x ∼ x′, that differ in the presence or absence of one user's records.
Definition 2.1 (Dwork et al. [10, 9]).
An algorithm M is (ε, δ)-differentially private if, for any measurable set S of outcomes and any neighboring inputs x ∼ x′,

Pr[M(x) ∈ S] ≤ e^ε · Pr[M(x′) ∈ S] + δ.   (1)

If δ = 0, we say M is ε-DP or simply pure DP.
One of the classical pure DP algorithms is the Laplace mechanism, which adds Laplace noise to a statistic. However, to determine the scale of noise needed to ensure DP, we must know the statistic's sensitivity. We define the ℓp-sensitivity of a statistic f that maps a dataset to a vector in ℝ^d as Δp(f) := max_{x ∼ x′} ‖f(x) − f(x′)‖_p, where the max is taken over neighboring datasets x ∼ x′.
We then have the following privacy guarantee for the Laplace mechanism.
Theorem 1 (Dwork et al. [10]).
Let f have ℓ1-sensitivity Δ1 := Δ1(f); then the mechanism M(x) = f(x) + (Z1, …, Zd), where the Zi are i.i.d. Laplace with scale Δ1/ε, is ε-DP for any ε > 0.
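To make the mechanism concrete, here is a minimal Python (NumPy) sketch of the Laplace mechanism as stated above; the function name and interface are ours, not from any particular library.

```python
import numpy as np

def laplace_mechanism(counts, l1_sensitivity, epsilon, seed=None):
    """Add Laplace noise with scale l1_sensitivity / epsilon to each coordinate,
    which is epsilon-DP for a statistic with the given l1-sensitivity (Theorem 1)."""
    rng = np.random.default_rng(seed)
    scale = l1_sensitivity / epsilon
    return np.asarray(counts, dtype=float) + rng.laplace(0.0, scale, size=len(counts))

# Example: a histogram where one user can change a single count by at most 1.
noisy_counts = laplace_mechanism([120, 43, 7], l1_sensitivity=1.0, epsilon=0.5)
```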
Another classical pure DP mechanism is the Exponential Mechanism, which takes a quality score q mapping a dataset and an outcome to a real value, and aims to return outcomes with a high quality score. We define the range of a quality score as Δ(q) := max_{x ∼ x′} { max_y (q(x, y) − q(x′, y)) − min_{y′} (q(x, y′) − q(x′, y′)) }, where the max over datasets is over neighbors x ∼ x′.¹ (¹The original Exponential Mechanism was presented in terms of a quality score's sensitivity, while recent work from [5] showed that the Exponential Mechanism is more naturally defined in terms of the quality score's range. See also Jinshuo Dong's blog post https://dongjs.github.io/2020/02/10/ExpMech.html.)
We then have the following privacy guarantee of the Exponential Mechanism.
Theorem 2 (McSherry and Talwar [15]).
Let q be a quality score with range Δ(q); then the Exponential Mechanism M(x), which returns outcome y with probability proportional to exp(ε · q(x, y)/Δ(q)), is ε-DP for any ε > 0.
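A small Python sketch of the Exponential Mechanism may also help; here we sample in proportion to exp(ε·q/Δ(q)), as in the range-based statement above, and the helper name and interface are ours.

```python
import numpy as np

def exponential_mechanism(quality_scores, epsilon, q_range, seed=None):
    """Return an index sampled with probability proportional to
    exp(epsilon * q / q_range), where q_range bounds the range of the quality score."""
    rng = np.random.default_rng(seed)
    logits = epsilon * np.asarray(quality_scores, dtype=float) / q_range
    logits -= logits.max()                 # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Example: privately select one of four candidate labels by count.
selected = exponential_mechanism([30, 12, 9, 2], epsilon=1.0, q_range=1.0)
```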
We now define approximate concentrated differential privacy (CDP),² which differs slightly from the original definition of [1] but was later shown [21] to be equivalent to it. Similar to approximate DP, it permits a small probability of unbounded Rényi divergence. (²Although [1] defines zCDP, to differentiate it from the CDP of [8], we will use CDP to refer to the version from [1].)
Definition 2.2 (Bun and Steinke [1], Papernot and Steinke [16]).
Suppose δ ∈ [0, 1] and ρ ≥ 0. We say the algorithm M is δ-approximate ρ-CDP if, for any neighboring datasets x ∼ x′, there exist distributions P′, P″, Q′, Q″ such that the outputs are distributed according to the following mixture distributions:

M(x) ∼ (1 − δ)P′ + δP″,   M(x′) ∼ (1 − δ)Q′ + δQ″,

where D_α(P′ ‖ Q′) ≤ ρα and D_α(Q′ ‖ P′) ≤ ρα for all α ≥ 1.
We can also convert approximate differential privacy to approximate CDP and vice versa.
Theorem 3 (Bun and Steinke [1]).
If M is (ε, δ)-DP then it is δ-approximate (ε²/2)-CDP. If M is δ-approximate ρ-CDP then it is (ρ + 2√(ρ log(1/δ′)), δ + δ′)-DP for any δ′ > 0.
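The conversion in Theorem 3 is easy to compute; the following helper (names ours) reflects the two directions as stated above.

```python
import math

def pure_dp_to_cdp(epsilon):
    """An epsilon-DP mechanism is (epsilon**2 / 2)-CDP."""
    return epsilon ** 2 / 2.0

def approx_cdp_to_dp(rho, delta, delta_prime):
    """A delta-approximate rho-CDP mechanism is (epsilon, delta + delta_prime)-DP
    with epsilon = rho + 2 * sqrt(rho * log(1 / delta_prime))."""
    epsilon = rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta_prime))
    return epsilon, delta + delta_prime
```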
The classical CDP mechanism is the Gaussian Mechanism. Note that the Gaussian Mechanism was originally introduced as satisfying approximate DP, but it was then shown to satisfy pure CDP in later work [8, 1].
Theorem 4 (Bun and Steinke [1]).
Let f have ℓ2-sensitivity Δ2 := Δ2(f); then the mechanism M(x) = f(x) + (Z1, …, Zd), where the Zi are i.i.d. N(0, σ²) with σ = Δ2/√(2ρ), is ρ-CDP for any ρ > 0.
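Analogously to the Laplace sketch above, here is a minimal Gaussian mechanism in Python with the σ = Δ2/√(2ρ) calibration from Theorem 4 (names ours).

```python
import numpy as np

def gaussian_mechanism(counts, l2_sensitivity, rho, seed=None):
    """Add Gaussian noise with standard deviation l2_sensitivity / sqrt(2 * rho),
    which is rho-CDP for a statistic with the given l2-sensitivity (Theorem 4)."""
    rng = np.random.default_rng(seed)
    sigma = l2_sensitivity / np.sqrt(2.0 * rho)
    return np.asarray(counts, dtype=float) + rng.normal(0.0, sigma, size=len(counts))
```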
Note that we can apply Theorem 3 to conclude that the ε-DP Exponential Mechanism is (ε²/2)-CDP, but work from Cesar and Rogers [2] showed that the Exponential Mechanism actually has a better CDP parameter.³ (³Also see the previous blog post at https://differentialprivacy.org/exponential-mechanism-bounded-range/.)
Theorem 5 (Cesar and Rogers [2]).
Let q be a quality score with range Δ(q); then the Exponential Mechanism M(x), which returns outcome y with probability proportional to exp(ε · q(x, y)/Δ(q)), is (ε²/8)-CDP for any ε > 0.
We also state the composition property of CDP, showing that the overall privacy parameters degrade after multiple CDP algorithms are run on a dataset.
Theorem 6.
Let M1 be δ1-approximate ρ1-CDP and let M2(·, y) be δ2-approximate ρ2-CDP for every fixed y. Then the adaptive composition M(x) = M2(x, M1(x)) is (δ1 + δ2)-approximate (ρ1 + ρ2)-CDP.
3 Unifying Framework
We now present a general framework that can be used to unify some of the previous analyses for proving approximate DP (CDP) for various algorithms. If we want to show an algorithm M is approximate CDP, we need to consider the randomness in M that can generate differentiating outcomes: outcomes that, given two inputs x and x′, could have come from only one of them. We want to show that the probability that the randomness in M generates these bad outcomes is at most δ. Furthermore, we want to show that for all other, non-differentiating outcomes, there are related mechanisms that match M on inputs x and x′. Lastly, we need to show that these related mechanisms satisfy pure CDP.

To be more precise, the following lemma can be used to show directly that many algorithms previously proven to be approximate DP are also approximate CDP, without resorting to the generic conversion from approximate DP to approximate CDP.
Lemma 3.1.
Let M be a randomized algorithm and fix parameters ρ ≥ 0 and δ ∈ [0, 1). If for each pair of neighboring datasets the algorithm satisfies the following conditions, then it is δ-approximate ρ-CDP:

For each pair of neighboring datasets x ∼ x′, we have the following three conditions.

1. There exist events E and E′ such that Pr[E] ≥ 1 − δ and Pr[E′] ≥ 1 − δ. Let S be the corresponding set of outcomes of both M(x) conditioned on event E and M(x′) conditioned on event E′.

2. There exist distributions P′ and Q′ with common support S such that for all y ∈ S we have the following, where P and Q are the distributions of M(x) and M(x′), respectively:⁴ P′(y) = P(y ∣ E) and Q′(y) = Q(y ∣ E′). (⁴We write P(y) for the Radon–Nikodym derivative of the corresponding distribution with respect to some base measure; see a similar note in [19].)

3. Also we have D_α(P′ ‖ Q′) ≤ ρα and D_α(Q′ ‖ P′) ≤ ρα for all α ≥ 1.
Proof.
Fix neighbors x ∼ x′. Let P and Q be the distributions for M(x) and M(x′), respectively. Consider the following distribution P″, where we write P(· ∣ E) to denote the conditional distribution of M(x) given event E:

P″(y) := (P(y) − (1 − δ)P′(y)) / δ.

We then have that for y ∈ S,

P″(y) = (P(y) − (1 − δ)P(y ∣ E)) / δ ≥ (P(y) − Pr[E] · P(y ∣ E)) / δ ≥ 0,

since Pr[E] ≥ 1 − δ and Pr[E] · P(y ∣ E) ≤ P(y). Furthermore, for y ∉ S we have P′(y) = 0, so that

P″(y) = P(y) / δ ≥ 0.

Hence P″ is a valid probability distribution, and we have P(y) = (1 − δ)P′(y) + δP″(y) for all outcomes y. A similar argument works to show Q(y) = (1 − δ)Q′(y) + δQ″(y) with Q″ defined similarly to P″.

By assumption, we know that D_α(P′ ‖ Q′) ≤ ρα and D_α(Q′ ‖ P′) ≤ ρα for all α ≥ 1. Hence, M is δ-approximate ρ-CDP. ∎
Although a similar lemma can be used for approximate DP, this will typically lead to unnecessarily loose privacy parameters, as we will see later. Hence, providing an approximate CDP analysis of each algorithm will be useful in any privacy system that combines these algorithms because the CDP privacy parameters simply add up and can be converted to approximate DP parameters at the end, if necessary.
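As a sanity check on the decomposition in the proof of Lemma 3.1, the following toy Python snippet builds P′ and P″ for a finite outcome space (the numbers are made up for illustration) and verifies the mixture identity.

```python
import numpy as np

# Toy check of Lemma 3.1's decomposition on a finite outcome space.
P = np.array([0.50, 0.30, 0.15, 0.05])      # hypothetical distribution of M(x)
good = np.array([True, True, True, False])  # last outcome is "differentiating"
delta = 0.05                                 # here Pr[not E] = 0.05 = delta

P_good = np.where(good, P, 0.0) / P[good].sum()     # P'(y) = P(y | E)
P_bad = (P - (1 - delta) * P_good) / delta          # P''(y) from the proof

assert np.all(P_bad >= -1e-12)                      # P'' is non-negative
assert np.isclose(P_bad.sum(), 1.0)                 # ... and a valid distribution
assert np.allclose((1 - delta) * P_good + delta * P_bad, P)
```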
4 Unknown Domain Algorithms
We will denote a histogram h as a set of pairs (i, c_i), each with a label i from some domain and its corresponding count c_i. When we define neighboring histograms, we need to consider how much a user can modify a histogram. Removing a user's contributions from a histogram h can both remove items entirely and decrease counts by some amount. We will say that a histogram is (Δ0, Δ∞)-sensitive if removing or adding a user's data can change the counts of at most Δ0 distinct elements, and each count can change by at most Δ∞. We now turn to some applications of Lemma 3.1 for some existing mechanisms.
4.1 Positive Count Histograms
We start with the setting where we want to return a histogram subject to CDP where we are only given positive count items, hence each count present in the histogram is at least 1. It is straightforward to extend this analysis to the case where we have a histogram with only counts above some known value larger than 1. This is a natural setting for data analytics as GROUP BY queries in SQL only provide items that exist in the dataset, so no zero counts are returned. This is problematic for privacy, as a neighboring histogram can have fewer results, resulting in a differing set of counts to add noise to. We present a general template for the private algorithm in Algorithm 1, where we deliberately leave the noise distribution and the threshold arbitrary.
Previous mechanisms follow this template, specifically from Korolova et al. [13] and Wilson et al. [22] who used Laplace noise and Swanberg et al. [20] who used Gaussian noise. We now prove that Algorithm 1 is approximate CDP using Lemma 3.1.
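A minimal Python sketch of the Algorithm 1 template may help fix ideas; the function below (names and parameters ours) adds either Laplace or Gaussian noise to the positive counts and releases only those that clear the threshold.

```python
import numpy as np

def unknown_domain_histogram(hist, threshold, noise="gaussian", scale=1.0, seed=None):
    """Algorithm 1 template sketch: `hist` maps labels to positive counts that are
    actually present in the data; release (label, noisy count) pairs whose noisy
    count exceeds `threshold`."""
    rng = np.random.default_rng(seed)
    released = {}
    for label, count in hist.items():
        if noise == "gaussian":
            noisy = count + rng.normal(0.0, scale)     # scale = standard deviation
        else:
            noisy = count + rng.laplace(0.0, scale)    # scale = Laplace scale b
        if noisy > threshold:
            released[label] = noisy
    return released

# Small counts (like a single unique record) rarely survive the threshold.
out = unknown_domain_histogram({"data": 214, "privacy": 57,
                                "RyanRogersSSN123456789": 1},
                               threshold=30.0, noise="gaussian", scale=5.0)
```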
Theorem 7.
Assume input histograms are (Δ0, Δ∞)-sensitive. If Algorithm 1 uses Laplace noise with scale b and a threshold T satisfying Δ0 · Pr[Lap(b) > T − Δ∞] ≤ δ, then it is δ-approximate (Δ0Δ∞²/(2b²))-CDP. If Algorithm 1 uses Gaussian noise with standard deviation σ and a threshold T satisfying Δ0 · Pr[N(0, σ²) > T − Δ∞] ≤ δ, then it is δ-approximate (Δ0Δ∞²/(2σ²))-CDP.
Proof.
We will rely on Lemma 3.1 to prove this result. Consider neighbors x ∼ x′ where, without loss of generality, x has one additional user's data compared to x′. By assumption, we know that x and x′ can differ in the counts of at most Δ0 labels, and that each differing count can differ by at most Δ∞. Consider the set of labels that are common between x and x′, along with their corresponding counts; since we assume that x has one additional user's data, this set must include all labels of x′. Let S be the set of outcomes supported on these common labels. Note that the counts of the items that are present in x but not in x′ must be at most Δ∞. We now cover each item in Lemma 3.1.
1. We define the event E to be all the randomness in M(x) that can generate outcomes in S, so E must only include the noise added to items that are common between x and x′, while the noise added to the counts of items in x but not in x′ must keep those noisy counts below the threshold T. We then lower bound the probability of event E. Note that for every item that is in x but not in x′, its count can be no more than Δ∞.
We then consider the two scenarios with either Laplace noise or Gaussian noise.
– (Laplace) With Laplace noise of scale b and the threshold T from the theorem statement, a union bound over the at most Δ0 items in x but not in x′ gives Pr[E] ≥ 1 − Δ0 · Pr[Lap(b) > T − Δ∞] ≥ 1 − δ.
– (Gaussian) Next we consider the Gaussian noise version with standard deviation σ and the threshold T from the theorem statement, which similarly gives Pr[E] ≥ 1 − Δ0 · Pr[N(0, σ²) > T − Δ∞] ≥ 1 − δ.
In either case, the event E′ consists of all randomness in M(x′) that can generate outcomes in S, which is all of the randomness in M(x′) because the label set of x′ is a subset of the common labels; hence Pr[E′] = 1 ≥ 1 − δ.
2. We then consider the mechanism M′ whose domain is the set of common items between x and x′, which adds the same noise to each of those counts and only returns the noisy counts above T with their corresponding labels. We write the distribution of M′(x) as P′ and the distribution of M′(x′) as Q′. Because we add independent noise to each histogram count, we can separate out the noise terms added to the counts of labels that are not common between x and x′; hence we know that P′(y) = P(y ∣ E) for all y ∈ S, by design. Furthermore, Q′(y) = Q(y ∣ E′) for all y ∈ S.
3. Note that M′ is either the Laplace mechanism or the Gaussian mechanism over the common items between x and x′. We then cover each variant separately.
– (Laplace) We first consider the case where we apply Laplace noise with scale b. Note that M′ is a Laplace Mechanism over a histogram that can change in at most Δ0 counts, each by at most Δ∞. Ignoring the threshold in M′, as this is a post-processing of the noisy histogram and does not impact the privacy analysis, the Laplace mechanism is being applied to a histogram with ℓ1-sensitivity Δ0Δ∞. Hence, the Laplace mechanism is (Δ0Δ∞/b)-(pure) DP and hence ((Δ0Δ∞/b)²/2)-(pure) CDP. However, this will not get us the result we want, because the CDP parameter would scale with Δ0². This is due to using the ℓ1-sensitivity of the histogram and then applying Theorem 1.
We now consider the Laplace mechanism only on the common items between x and x′ that also have differing counts. We call the corresponding mechanism M1 and denote by P1 the distribution of M1(x) and by Q1 the distribution of M1(x′). Note that each noisy count in M1 is generated independently, so each count in M1 is a single univariate Laplace mechanism. Each univariate Laplace mechanism is (Δ∞/b)-(pure) DP and hence ((Δ∞/b)²/2)-(pure) CDP. Applying composition from Theorem 6 over the at most Δ0 univariate Laplace mechanisms implies that M1 is (Δ0(Δ∞/b)²/2)-(pure) CDP. Let M0 denote the Laplace Mechanism applied to the counts that were unchanged between x and x′, with distribution P0. For ease of notation, suppose the differing counts are the first j ≤ Δ0 indices of the common items and that x, x′ match on the remaining indices. Hence, for an outcome y we have P′(y) = P1(y1, …, yj) · P0(y_{j+1}, …). Similarly, Q′(y) = Q1(y1, …, yj) · P0(y_{j+1}, …). This gives us, for all α ≥ 1,
D_α(P′ ‖ Q′) = D_α(P1 ‖ Q1) ≤ Δ0(Δ∞/b)²α/2,
and similarly,
D_α(Q′ ‖ P′) = D_α(Q1 ‖ P1) ≤ Δ0(Δ∞/b)²α/2.
– (Gaussian) We next consider the Gaussian noise variant. We will denote by M′ the Gaussian mechanism whose domain is the set of common items between x and x′, with noise standard deviation σ, where only noisy counts above T are returned with their corresponding labels. We write the distribution of M′(x) as P′ and the distribution of M′(x′) as Q′.
Note that M′ is a Gaussian Mechanism over a histogram that can change in at most Δ0 counts, each by at most Δ∞. Ignoring the threshold in M′ (again this does not impact the privacy analysis), the Gaussian mechanism is being applied to a histogram with ℓ2-sensitivity √Δ0 · Δ∞. Hence, the Gaussian mechanism is (Δ0Δ∞²/(2σ²))-(pure) CDP from Theorem 4. Furthermore, post-processing the Gaussian Mechanism is also CDP with the same parameter, so restricting to the noisy counts larger than T gets us back to the distributions P′ and Q′. This gives us D_α(P′ ‖ Q′) ≤ Δ0Δ∞²α/(2σ²) and D_α(Q′ ‖ P′) ≤ Δ0Δ∞²α/(2σ²) for all α ≥ 1.
Hence, using Laplace noise with scale b in the Unknown Domain Histogram algorithm is δ-approximate (Δ0Δ∞²/(2b²))-CDP, and setting ρ = Δ0Δ∞²/(2b²) completes the proof for Laplace noise. Furthermore, using Gaussian noise with standard deviation σ is δ-approximate (Δ0Δ∞²/(2σ²))-CDP, and setting ρ = Δ0Δ∞²/(2σ²) completes the proof.
∎
We want to highlight the improvement we get from the CDP analysis. In a similar analysis using approximate DP, we would remove the items that cannot be returned under both neighboring datasets and be left with the Laplace mechanism over the common items. We could then use the ℓ1-sensitivity Δ0Δ∞ of the resulting histogram, but adding Laplace noise with scale b = Δ0Δ∞/ε would result in ε-DP, which can be converted to CDP using Theorem 3 to get δ-approximate (ε²/2)-CDP, whereas our analysis gives δ-approximate (ε²/(2Δ0))-CDP for the same noise scale. Furthermore, converting approximate CDP guarantees to approximate DP guarantees too early leads to loose privacy parameters when developing privacy systems that use these algorithms. For example, if we use the Gaussian noise variant in Theorem 7 with CDP parameter ρ, we can conclude that it is (ρ + 2√(ρ log(1/δ′)), δ + δ′)-DP for any δ′ > 0; if we only use this DP guarantee and combine it with another Unknown Domain Histogram with Gaussian noise, we can compose to get (ε, 2δ + 2δ′)-DP for any δ′ > 0, where
ε = 2ρ + 4√(ρ log(1/δ′)).
However, if we had never converted to approximate DP until after composing both Unknown Domain Histogram mechanisms, we could have gotten an overall (ε′, 2δ + δ′)-DP guarantee, where
ε′ = 2ρ + 2√(2ρ log(1/δ′)) < ε.
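The following short calculation (a sketch using the conversion from Theorem 3, ignoring the small differences in the δ bookkeeping) illustrates how much is lost by converting each mechanism to approximate DP before composing.

```python
import math

def cdp_to_dp_epsilon(rho, delta_prime):
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta_prime))

rho, delta_prime = 0.1, 1e-6
# Convert each of the two mechanisms to DP first, then add the epsilons (loose):
eps_loose = 2 * cdp_to_dp_epsilon(rho, delta_prime)
# Compose at the CDP level (rho doubles), then convert once at the end (tight):
eps_tight = cdp_to_dp_epsilon(2 * rho, delta_prime)
print(round(eps_loose, 3), round(eps_tight, 3))   # e.g. 4.902 vs 3.525
```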
4.2 Top-k̄ Count Histograms
We now turn to a slight variant of releasing private histograms over a histogram with positive counts. In this setting, we assume that only a limited part of the histogram is available, perhaps due to an existing data analytics system that cannot return all counts. This setting first appeared in [6] and is especially important when designing DP algorithms on top of existing systems that cannot provide the full histogram of counts [18]. We will refer to the histogram consisting of the top-k̄ items, i.e. the items and corresponding counts that are among the k̄ largest, as the top-k̄ histogram. Note that the top-k̄ items can change between neighboring histograms. We now present a general algorithm template, similar to Algorithm 1, in Algorithm 2, which takes an arbitrary threshold T and a top-k̄ histogram.
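Since Algorithm 2 itself is not reproduced here, the following hedged Python sketch shows one plausible reading of the template, consistent with the proof below: Gaussian noise is added both to the available top-k̄ counts and to the threshold, and only items whose noisy count clears the noisy threshold are released. The function name and parameters are ours.

```python
import numpy as np

def topk_unknown_domain(topk_hist, sigma, threshold, seed=None):
    """Hedged sketch of the Algorithm 2 template. `topk_hist` holds only the
    top-k-bar (label, count) pairs that the data system can return."""
    rng = np.random.default_rng(seed)
    noisy_threshold = threshold + rng.normal(0.0, sigma)
    released = {}
    for label, count in topk_hist.items():
        noisy = count + rng.normal(0.0, sigma)
        if noisy > noisy_threshold:
            released[label] = noisy
    return released
```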
We now show that it is indeed approximate CDP.
Theorem 8.
Assume input histograms are (Δ0, Δ∞)-sensitive. If we use Gaussian noise with standard deviation σ and a threshold T chosen appropriately in terms of σ, Δ0, Δ∞, and δ,
then Algorithm 2 is δ-approximate ρ-CDP.
Proof.
We follow the same analysis as in the proof of Theorem 7, which used Lemma 3.1. We again let S be the set of outcomes supported on the labels that are common between neighbors x and x′. Note that we are only considering items with counts in the top-k̄ of each histogram, together with the common items, including zero counts for common labels that fall outside a histogram's top-k̄. Hence, there might be different items between x and x′, as reducing some counts in x might change the order of the top-k̄. From Lemma 5.2 in [6], we know that there can be only a bounded number, in terms of Δ0, of differing items between the top-k̄ histograms of x and x′. We then follow the three items that we need to show in order to apply Lemma 3.1.
1. We denote M as Algorithm 2 and define the event E to include the noise added to the counts of items that can appear in common outcomes, the noise added to the threshold T to obtain the noisy threshold, and the requirement that the noise added to the differing items, those not returnable under x′, keeps their noisy counts below the noisy threshold. We define E′ similarly for M(x′). Note that for any item that is in x but is not an item that can be returned in x′, its count in x can be at most Δ∞ more than its count in x′, which itself is not in the top-k̄ of x′.
The analysis is straightforward because the difference of two Gaussians is itself Gaussian. Hence, we have Pr[E] ≥ 1 − δ with the threshold T set as in the theorem statement.
The analysis to show Pr[E′] ≥ 1 − δ is similar.
2. We then consider the distribution P′ to be the distribution of M(x) conditioned on E; note that P′(y) = P(y ∣ E) for each y ∈ S, and similarly Q′ is the distribution of M(x′) conditioned on E′. We will modify the mechanisms M(x) and M(x′) so that they result in the same mechanism when conditioned on events E and E′, respectively. For each label that differs between x and x′, we change it to a common placeholder label. Because we condition on events E and E′, no outcome can include these placeholder labels. Furthermore, it was shown in [17, Lemma 6.4] that the resulting histograms with common labels will also have at most Δ0 items with differing counts, and those counts can change by at most Δ∞, regardless of how we assign the common labels. We then let P̃ and Q̃ be the resulting distributions after this relabeling.
Next we will need to bound the Rényi divergence between P̃ and Q̃, conditioned on the corresponding events. We will make use of the following result from [12, Corollary 4.3].
Lemma 4.1.
Suppose f and g are two μ-strongly convex functions over a convex set K ⊆ ℝ^d, and f − g is G-Lipschitz over K. For any such K, if we let P ∝ e^{−f} and Q ∝ e^{−g} be two probability distributions on K, then we have for all α > 1: D_α(P ‖ Q) ≤ αG²/(2μ).
This result is useful because it allows us to condition the outcome of a joint Gaussian mechanism on falling in some convex region, which corresponds to releasing only "good" outcomes, i.e. not allowing certain counts to go above some noisy value. For the density of a Gaussian with mean counts h(x) or h(x′) and covariance σ²I, we use f(z) = ‖z − h(x)‖²/(2σ²) and g(z) = ‖z − h(x′)‖²/(2σ²), which are both (1/σ²)-strongly convex over any convex region. Furthermore, f − g is (‖h(x) − h(x′)‖/σ²)-Lipschitz.
3. We now want to bound the Rényi divergence between P′ and Q′, which are conditioned on the events E and E′, respectively. To do this we consider the joint Gaussian distribution that releases all noisy counts over the histograms with common labels, including the count that determines the noisy threshold, but without enforcing the threshold. We only want to allow outcomes in which the differing (placeholder) items do not have noisy counts above the noisy threshold count. We then consider the convex region K where these "bad" noisy counts do not go above the noisy threshold count; this is an intersection of half-spaces and hence convex. We then apply Lemma 4.1 to claim that the mechanism conditioned on this region has the same bound on the Rényi divergence as the unconstrained Gaussian mechanism. Dropping the items with noisy counts lower than the noisy threshold count is simply post-processing, which does not increase the Rényi divergence. Hence, we have D_α(P′ ‖ Q′) ≤ ρα and D_α(Q′ ‖ P′) ≤ ρα for all α > 1.
Setting ρ to the bound from Lemma 4.1 with the f and g above completes the proof. ∎
4.2.1 Exponential Mechanism
For the previous applications, there was a crucial assumption that the input histograms were (Δ0, Δ∞)-sensitive, specifically that the histogram's ℓ0-sensitivity must be bounded by Δ0. However, this might not always be the case, and it would be difficult to enforce such a bound in practice, as it requires that each user have only a certain number of distinct items in the data. One way to limit the impact a single user can have on the result is to limit the number of items that can be returned. The Laplace/Gaussian Mechanism based algorithms can return an arbitrary number of items above the threshold, and in the setting where we only have access to the top-k̄ items, we could return all k̄. Hence, the CDP parameter could be bounded in terms of k̄, but we would like to control how many items can be returned from the counts we have access to. In this case, we can return the top-k results, where k ≤ k̄ is an input to the algorithm. Unfortunately, the previous analysis would still have the CDP parameter scale with k̄ despite only wanting to return k.
To ensure that the privacy loss only scales with the number of items that are returned, we can use the classical Exponential Mechanism [15], as presented in Theorem 2. It was shown that the Exponential Mechanism is part of a general framework of DP mechanisms called report noisy max. This can be summarized as adding noise to the quality scores of all outcomes and returning the outcome with the noisy max; it is critical that only the argmax is returned, not the actual noisy value. It turns out that adding Gumbel noise to the quality scores and returning the max item is equivalent to the Exponential Mechanism, see [6]. Furthermore, it was shown that iteratively applying the Exponential Mechanism to return the top-k outcomes is equivalent to adding Gumbel noise to all quality scores and returning the k outcomes with the largest noisy quality scores [6]. Other mechanisms in the report noisy max framework include adding Laplace noise to the quality scores [7] and adding Exponential noise to the quality scores [4], which turns out to be equivalent to the Permute-and-Flip Mechanism [14].⁵ (⁵See the previous blog post on one-shot top-k DP algorithms: https://differentialprivacy.org/one-shot-top-k/.)
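The Gumbel-noise view of the Exponential Mechanism is easy to state in code; the sketch below (names ours) returns the indices of the k largest Gumbel-perturbed quality scores, which matches k iterative Exponential Mechanism calls without replacement.

```python
import numpy as np

def one_shot_topk_gumbel(quality_scores, k, epsilon, q_range, seed=None):
    """Add Gumbel noise with scale q_range / epsilon to each quality score and
    return the indices of the k largest noisy scores (no counts are released).
    The argmax step alone samples proportionally to exp(epsilon * q / q_range)."""
    rng = np.random.default_rng(seed)
    q = np.asarray(quality_scores, dtype=float)
    noisy = q + rng.gumbel(loc=0.0, scale=q_range / epsilon, size=len(q))
    return np.argsort(-noisy)[:k].tolist()
```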
We then present the Unknown Domain Gumbel algorithm in Algorithm 3 from [6], which takes an additional parameter k, the number of results to return, and importantly returns a ranked list of items, but not their counts.
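A hedged sketch of what Algorithm 3 might look like, under the assumptions that Gumbel noise is added to each available top-k̄ count and to a threshold count, that at most k items beating the noisy threshold are returned in ranked order without counts, and that a ⊥ symbol marks a cut-short list (as in the pay-what-you-get accounting below); the exact threshold calibration in [6] is omitted.

```python
import numpy as np

def unknown_domain_gumbel(topk_hist, k, gumbel_scale, threshold, seed=None):
    """Hedged sketch of the Algorithm 3 template (not the exact algorithm of [6])."""
    rng = np.random.default_rng(seed)
    noisy_threshold = threshold + rng.gumbel(0.0, gumbel_scale)
    noisy = {lbl: c + rng.gumbel(0.0, gumbel_scale) for lbl, c in topk_hist.items()}
    ranked = sorted(noisy, key=noisy.get, reverse=True)

    released = []
    for lbl in ranked[:k]:
        if noisy[lbl] <= noisy_threshold:
            break
        released.append(lbl)                  # ranked labels only, no counts
    if len(released) < k:
        released.append("BOTTOM")             # stand-in for the bottom symbol
    return released
```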
We then state Unknown Domain Gumbel’s privacy guarantee, which will largely follow the analysis in [6], although adapted for CDP.
Theorem 9.
Assume input histograms are (Δ0, Δ∞)-sensitive. If we use Gumbel noise with an appropriately chosen scale and a threshold calibrated to that scale, Δ0, Δ∞, k, and δ,
then Algorithm 3 is δ-approximate ρ-CDP, where ρ is k times the per-call CDP parameter of the Exponential Mechanism from Theorem 5.
Proof.
We will apply our general framework from Lemma 3.1 and some previous results from [6]. We will denote M as the Unknown Domain Gumbel mechanism. We assume WLOG that x and x′ differ by there being one additional user's data in x when compared to x′.
1. We denote S as the set of common outcomes that are possible between the top-k̄ histograms of x and x′, and E as all the randomness in M(x) that can generate outcomes in S. Hence, E must only include the Gumbel noise added to items that are common to x and x′, and the noise that is added to the differing counts must leave their noisy counts either below the top-k or below the noisy threshold that is set. From Lemma 5.5 in [6], we have Pr[E] ≥ 1 − δ when the threshold is set as in the theorem statement.
We similarly define the event E′ for the randomness in M(x′) that can generate outcomes in S, which gives us the same lower bound for Pr[E′].
2. From Lemma 5.6 in [6], we know that there exist distributions P′ and Q′ supported on S such that for all outcomes y ∈ S, we have P(y ∣ E) = P′(y) and Q(y ∣ E′) = Q′(y).
3. These distributions are also related to the Exponential Mechanism. Specifically, P′ is the distribution of iteratively applying the Exponential Mechanism k times on x, restricted to the items that are common between x and x′. Similarly, Q′ is the distribution of iteratively applying the Exponential Mechanism k times on x′ over the items that are common between x and x′. We can then use Theorem 5, along with composition over the k calls (Theorem 6), to conclude that D_α(P′ ‖ Q′) ≤ ρα and D_α(Q′ ‖ P′) ≤ ρα for all α > 1.
Setting ρ to k times the per-call CDP parameter from Theorem 5 completes the proof. ∎
4.2.2 Pay-what-you-get Composition
We now consider the setting where an analyst wants to run multiple top-k queries over an unknown domain. One of the issues with the Unknown Domain Gumbel mechanism in Algorithm 3 is that it can sometimes return fewer than k results, yet the overall CDP privacy parameter scales with k regardless of the number of results returned. Hence, with many top-k queries, the privacy loss would scale with k times the number of queries under traditional composition. However, [6] showed that if we instead set an overall bound, call it K, on the number of results that can be returned across all the intermediate top-k queries, then the privacy loss will scale with K. They referred to this as pay-what-you-get composition: even if an analyst requests the top-100 items and only 1 result is returned, we need only deduct the number of items that were returned (including ⊥) from the overall privacy budget, until K items have been returned. We present the pay-what-you-get composition in Algorithm 4.
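The budget accounting in Algorithm 4 can be sketched as follows; `run_query` stands in for an Unknown Domain Gumbel call (a hypothetical interface), and `k_total` is the overall bound K on returned items.

```python
def pay_what_you_get(queries, k_total, run_query):
    """Hedged sketch of Algorithm 4's accounting: answer top-k queries until a
    total of k_total items (counting the bottom symbol) have been returned."""
    remaining = k_total
    answers = []
    for topk_hist, k in queries:
        if remaining <= 0:
            break
        result = run_query(topk_hist, min(k, remaining))
        remaining -= len(result)      # pay only for what came back, incl. bottom
        answers.append(result)
    return answers
```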
The key observation in proving the overall privacy guarantee of pay-what-you-get is that each Unknown Domain Gumbel Mechanism is related to the Exponential Mechanism. Removing the randomness that can generate outcomes from the differing items between neighboring datasets, the algorithm becomes equivalent to an Exponential Mechanism. Hence, considering only the randomness in each Unknown Domain Gumbel call that can generate outcomes possible under both neighboring histograms, we are left with a string of at most K Exponential Mechanisms.
Theorem 10.
Proof.
We let M be the Pay-what-you-get mechanism applied to neighboring histogram streams (x_1, x_2, …) and (x′_1, x′_2, …), i.e. each histogram x_i and x′_i are neighbors; we denote P as the distribution for M(x_1, x_2, …) and Q as the distribution for M(x′_1, x′_2, …). WLOG, we assume that each x_i has an additional user's data compared to x′_i.
We will denote S as the set of common outcomes between these neighboring streams.
1. Let E = (E_1, E_2, …) be the corresponding randomness in each Unknown Domain Gumbel call on x_i that can generate outcomes in S, and similarly let E′ = (E′_1, E′_2, …) be the corresponding randomness in each call on x′_i that can generate outcomes in S. We then apply Lemma 5.5 in [6] to lower bound the probability of each event E_i. Hence, applying a union bound over all events, we get that Pr[E] and Pr[E′] are each at least 1 − δ, where δ sums the per-query failure probabilities.
2. We then apply Lemma 5.6 in [6], as we did in Theorem 9, to write the distribution of M conditioned on these events in terms of a stream of top-k_i mechanisms that can be further decomposed into Exponential Mechanisms. Given the prior outcomes y_1, …, y_{i−1}, we write P′_i(· ∣ y_1, …, y_{i−1}) to denote the distribution of the top-k_i mechanism evaluated on x_i, restricted to the items common between x_i and x′_i, and similarly Q′_i(· ∣ y_1, …, y_{i−1}) for the top-k_i mechanism evaluated on x′_i restricted to the common items. We can now apply Lemma 5.6 from [6] iteratively to get, for all y = (y_1, y_2, …) ∈ S,
P(y ∣ E) = ∏_i P′_i(y_i ∣ y_1, …, y_{i−1})   and   Q(y ∣ E′) = ∏_i Q′_i(y_i ∣ y_1, …, y_{i−1}).
3. We then further decompose each P′_i into separate Exponential Mechanisms. We write P′_{i,j} for the distribution of the Exponential Mechanism in the i-th round returning its j-th item, and similarly Q′_{i,j}. Hence we have
P(y ∣ E) = ∏_i ∏_j P′_{i,j}(y_{i,j} ∣ ·)   and   Q(y ∣ E′) = ∏_i ∏_j Q′_{i,j}(y_{i,j} ∣ ·).
Note that the two products on the right hand side have at most K many factors, due to the bound on the number of items that can be returned by pay-what-you-get, and each factor is the distribution of an Exponential Mechanism. The same argument decomposes Q(· ∣ E′) as a sequence of Exponential Mechanisms with factors Q′_{i,j}. Hence, writing P′(y) := P(y ∣ E) and Q′(y) := Q(y ∣ E′), we have D_α(P′ ‖ Q′) ≤ ρα and D_α(Q′ ‖ P′) ≤ ρα for all α > 1, with ρ equal to K times the per-call bound from Theorem 5.
Note that we can almost apply Lemma 3.1, but not quite, because the bounds on each Pr[E_i] hold conditioned on the prior outcomes. We then construct the joint distribution P″ as in the proof of Lemma 3.1, so that we can write P as a convex combination: for all outcomes y we have
P(y) = (1 − δ)P′(y) + δP″(y).
Similarly, we have
Q(y) = (1 − δ)Q′(y) + δQ″(y).
This completes the proof. ∎
4.3 Continual Observation
We now present an approach for continually releasing a running counter over various domain elements while ensuring differential privacy. The continual observation privacy model was introduced in [3, 11] and is meant to ensure strong privacy guarantees in the setting where we want to continually provide a running counter on a stream of events x_1, x_2, …, x_n with x_t ∈ {0, 1}. It is then straightforward to extend continual observation counters to the setting of releasing a running histogram over a known set of items, where each x_t is now a subset of the known item set and a user can contribute a limited set of items at each round in the stream. Typically we want to ensure event-level privacy, where we consider the change in outcomes when a single event in the stream changes.
Recent work from [17, 23] has also considered the continual observation setting, but in the case where we want to continually release histograms over an unknown set of items. Consider the motivating example where we want to provide a running counter for the drugs that are purchased at a pharmacy, and each customer can buy at most Δ0 different drugs. There might be many different drugs to provide a counter for, and new drugs can emerge later that were not known about before. Hence, we would like algorithms that do not require a predefined set of items to provide counts over.
We describe the algorithm at a high level, as formally describing it would require additional notation. The main subroutine is the Binary Mechanism from [3, 11], which adds at most ⌊log₂ n⌋ + 1 noise terms to the running count of a stream of length n; the number of noise terms at time t depends on the number of 1s in the binary representation of t. This ensures that for a stream of length at most n, we can release n counts, one after each event, where each released count is a sum of at most ⌊log₂ n⌋ + 1 noisy partial sums and is hence Gaussian with standard deviation at most √(⌊log₂ n⌋ + 1) times the per-partial-sum standard deviation. Although we release n counts, the privacy analysis relies on the fact that we form a table of partial sums, so that one event in the stream can modify at most ⌊log₂ n⌋ + 1 partial sums per item; we can then view the Binary Mechanism as a Gaussian mechanism on the table of partial sums over all common items between neighboring streams, which has bounded ℓ2-sensitivity.
The Unknown Domain Binary Mechanism follows the same approach as the classical Binary Mechanism, which we present at a high level. The idea of the Binary Mechanism is to split a stream of events into dyadic partial sums, guaranteeing that each event is part of no more than ⌊log₂ n⌋ + 1 partial sums and that each released count is a sum of at most ⌊log₂ n⌋ + 1 noisy partial sums, following the binary representation of t. We create separate partial sums for each item appearing in the stream. For instance, the count at t = 3 uses the two partial sums over rounds {1, 2} and {3} for each item, each with its own noise added to it, and the noisy partial sum over {1, 2} is computed once and reused by any later count that includes it. Note that for each new item that first appears at round t, we add fresh noise to each prior partial sum for that item. We then only release items whose corresponding noisy counts are larger than a fixed threshold T.
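To make the partial-sum structure concrete, here is a minimal Python sketch of the classical Binary Mechanism for a single item's 0/1 event stream (Gaussian noise variant, names ours); the Unknown Domain Binary Mechanism applies this per item and additionally thresholds the released noisy counts.

```python
import numpy as np

def binary_mechanism(stream, sigma, seed=None):
    """Release a running count after every event: the count at time t is a sum of
    at most popcount(t) <= log2(t) + 1 noisy dyadic partial sums, and each noisy
    partial sum is generated once and reused by every later count that needs it."""
    rng = np.random.default_rng(seed)
    prefix = np.concatenate([[0], np.cumsum(stream)])   # exact prefix sums
    noise = {}                                          # (start, length) -> noise draw

    def noisy_partial_sum(start, length):               # noisy sum of stream[start:start+length]
        key = (start, length)
        if key not in noise:
            noise[key] = rng.normal(0.0, sigma)
        return prefix[start + length] - prefix[start] + noise[key]

    outputs = []
    for t in range(1, len(stream) + 1):
        total, pos = 0.0, 0
        # decompose [0, t) into dyadic blocks following the binary representation of t
        for bit in range(t.bit_length() - 1, -1, -1):
            block = 1 << bit
            if t & block:
                total += noisy_partial_sum(pos, block)
                pos += block
        outputs.append(total)
    return outputs

# Example: running private count of a 0/1 stream.
counts = binary_mechanism([1, 0, 1, 1, 0, 1, 1, 1], sigma=2.0)
```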
We now present the privacy analysis for Unknown Domain Binary Mechanism from [17].
Theorem 11.
Assume that each stream event contains at most Δ0 items, i.e. |x_t| ≤ Δ0 for all t. Setting the noise standard deviation σ and the threshold T appropriately in terms of Δ0, the stream length n, ρ, and δ
ensures that the Unknown Domain Binary Mechanism is δ-approximate ρ-CDP under event-level adjacent streams.
Proof.
We follow the same analysis as in the earlier theorems, mainly leveraging Lemma 3.1. Let x and x′ be event-level adjacent streams, where x′ has an empty set at the event where they differ; say round t⋆ is where they differ, so that x_{t⋆} ≠ ∅ and x′_{t⋆} = ∅. Let M denote the Unknown Domain Binary Mechanism. We denote S as the set of all outcomes that both M(x) and M(x′) can return. Note that at round t⋆, the stream event x_{t⋆} can introduce as many as Δ0 previously unseen items into the stream. We will write the distribution of M(x) as P and the distribution of M(x′) as Q.
1. We will write E as the randomness in M(x) that can generate outcomes that are common between M(x) and M(x′). We need to ensure that no noisy count of any new item that appears in x but not in x′ goes above the threshold T. At any round, a released count adds together as many as ⌊log₂ n⌋ + 1 independent Gaussian noise terms, whose sum is itself Gaussian. We then apply a union bound over the at most Δ0 possible new items and a further union bound over all possible rounds to get Pr[E] ≥ 1 − δ when T is set as in the theorem statement.
2. We will write P′ for the distribution of the Binary Mechanism evaluated on x restricted to the items common with x′ and whose counts are positive. We then have for all outcomes y ∈ S that P(y ∣ E) = P′(y).
Since Q is the distribution for the neighboring input x′, where x′_{t⋆} = ∅, the corresponding event E′ contains all of the randomness in M(x′), so that Q′(y) = Q(y ∣ E′) = Q(y) for all y ∈ S.
3. Note that P′ and Q′ are simply post-processing functions of the partial-sum table with Gaussian noise added to each cell. Hence, we need only consider the Rényi divergence between the noisy partial-sum tables on the items common to x and x′. We then have D_α(P′ ‖ Q′) ≤ ρα and D_α(Q′ ‖ P′) ≤ ρα for all α > 1, because the released counts are a randomized post-processing (including additional noise terms) of the Gaussian mechanism on the common items' partial-sum table, the same as in the Binary Mechanism.
Setting ρ to the CDP parameter of the Gaussian mechanism on this partial-sum table completes the proof.
∎
5 Conclusion
We have presented a unified framework for proving that several existing algorithms over unknown domain histograms are approximate CDP. In many settings, practitioners want a way to incorporate private algorithms with minimal onboarding. A major bottleneck for incorporating private algorithms into existing systems is requiring a fixed list of items for which to release counts. Furthermore, product teams might be comfortable with noise added to counts, but not with displaying counts for items that never appeared in the dataset. We wanted to show how the privacy analyses of many existing DP algorithms can be unified by fixing neighboring datasets and considering not just the outcomes that can occur under both neighboring inputs, but also the related distributions that can only generate these good outcomes. We think that approximate CDP provides the easiest way to combine these algorithms to get tight privacy loss bounds, as many of the analyses rely on improved pure CDP bounds rather than pure DP bounds. For example, we showed how a CDP analysis of the Laplace mechanism can improve the CDP privacy parameter by composing over only the differing counts, rather than relying on an ℓ1-sensitivity bound as would be the case for DP. We can also use the tighter connection between the Exponential Mechanism and CDP, rather than using the pure DP parameters of the Exponential Mechanism. Lastly, the Gaussian mechanism does not satisfy a pure DP bound, so using CDP is a natural fit, and converting to approximate DP would result in a lossy DP parameter. We hope that this unified framework helps demystify some previous analyses and can be leveraged in designing future private algorithms.
6 Acknowledgements
Special thanks to David Durfee and Thomas Steinke for helpful comments that improved the quality of this survey.
References
- Bun and Steinke [2016] M. Bun and T. Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In M. Hirt and A. Smith, editors, Theory of Cryptography, pages 635–658, Berlin, Heidelberg, 2016. Springer Berlin Heidelberg. ISBN 978-3-662-53641-4. URL https://link.springer.com/chapter/10.1007/978-3-662-53641-4_24.
- Cesar and Rogers [2021] M. Cesar and R. Rogers. Bounding, concentrating, and truncating: Unifying privacy loss composition for data analytics. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pages 421–457. PMLR, 16–19 Mar 2021. URL https://proceedings.mlr.press/v132/cesar21a.html.
- Chan et al. [2012] T.-H. H. Chan, E. Shi, and D. Song. Optimal lower bound for differentially private multi-party aggregation. In European Symposium on Algorithms (ESA), 2012. URL http://eprint.iacr.org/2012/373.pdf.
- Ding et al. [2021] Z. Ding, D. Kifer, S. M. S. N. E., T. Steinke, Y. Wang, Y. Xiao, and D. Zhang. The permute-and-flip mechanism is identical to report-noisy-max with exponential noise, 2021.
- Dong et al. [2020] J. Dong, D. Durfee, and R. Rogers. Optimal differential privacy composition for exponential mechanisms. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2597–2606. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/dong20a.html.
- Durfee and Rogers [2019] D. Durfee and R. M. Rogers. Practical differentially private top-k selection with pay-what-you-get composition. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 3527–3537, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/b139e104214a08ae3f2ebcce149cdf6e-Abstract.html.
- Dwork and Roth [2014] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3 & 4):211–407, 2014. doi: 10.1561/0400000042. URL http://dx.doi.org/10.1561/0400000042.
- Dwork and Rothblum [2016] C. Dwork and G. Rothblum. Concentrated differential privacy. arXiv:1603.01887 [cs.DS], 2016.
- Dwork et al. [2006a] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology - EUROCRYPT 2006, pages 486–503, Berlin, Heidelberg, 2006a. Springer Berlin Heidelberg. ISBN 978-3-540-34547-3. URL https://www.iacr.org/archive/eurocrypt2006/40040493/40040493.pdf.
- Dwork et al. [2006b] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC'06, page 265–284, Berlin, Heidelberg, 2006b. Springer-Verlag. ISBN 3540327312. doi: 10.1007/11681878_14. URL https://doi.org/10.1007/11681878_14.
- Dwork et al. [2010] C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. Differential privacy under continual observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC ’10, page 715–724, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781450300506. doi: 10.1145/1806689.1806787. URL https://doi.org/10.1145/1806689.1806787.
- Gopi et al. [2022] S. Gopi, Y. T. Lee, and D. Liu. Private convex optimization via exponential mechanism. In P.-L. Loh and M. Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 1948–1989. PMLR, 02–05 Jul 2022. URL https://proceedings.mlr.press/v178/gopi22a.html.
- Korolova et al. [2009] A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas. Releasing search queries and clicks privately. In Proceedings of the 18th International Conference on World Wide Web, WWW ’09, page 171–180, New York, NY, USA, 2009. Association for Computing Machinery. ISBN 9781605584874. doi: 10.1145/1526709.1526733. URL https://doi.org/10.1145/1526709.1526733.
- McKenna and Sheldon [2020] R. McKenna and D. Sheldon. Permute-and-flip: A new mechanism for differentially private selection. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
- McSherry and Talwar [2007] F. McSherry and K. Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 94–103, 2007. doi: 10.1109/FOCS.2007.66. URL http://dx.doi.org/10.1109/FOCS.2007.66.
- Papernot and Steinke [2022] N. Papernot and T. Steinke. Hyperparameter tuning with renyi differential privacy. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=-70L8lpp9DF.
- Rivera Cardoso and Rogers [2022] A. Rivera Cardoso and R. Rogers. Differentially private histograms under continual observation: Streaming selection into the unknown. In G. Camps-Valls, F. J. R. Ruiz, and I. Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 2397–2419. PMLR, 28–30 Mar 2022. URL https://proceedings.mlr.press/v151/rivera-cardoso22a.html.
- Rogers et al. [2021] R. Rogers, S. Subramaniam, S. Peng, D. Durfee, S. Lee, S. K. Kancha, S. Sahay, and P. Ahammad. Linkedin’s audience engagements api: A privacy preserving data analytics system at scale. Journal of Privacy and Confidentiality, 11(3), Dec. 2021. doi: 10.29012/jpc.782. URL https://journalprivacyconfidentiality.org/index.php/jpc/article/view/782.
- Steinke [2022] T. Steinke. Composition of differential privacy & privacy amplification by subsampling, 2022.
- Swanberg et al. [2023] M. Swanberg, D. Desfontaines, and S. Haney. Dp-sips: A simpler, more scalable mechanism for differentially private partition selection, 2023.
- Whitehouse et al. [2023] J. Whitehouse, A. Ramdas, R. Rogers, and Z. S. Wu. Fully adaptive composition in differential privacy. In International Conference on Machine Learning. PMLR, 2023.
- Wilson et al. [2020] R. J. Wilson, C. Y. Zhang, W. Lam, D. Desfontaines, D. Simmons-Marengo, and B. Gipson. Differentially private SQL with bounded user contribution. Proceedings on Privacy Enhancing Technologies, 2020(2):230 – 250, 2020. doi: https://doi.org/10.2478/popets-2020-0025. URL https://content.sciendo.com/view/journals/popets/2020/2/article-p230.xml.
- Zhang et al. [2023] B. Zhang, V. Doroshenko, P. Kairouz, T. Steinke, A. Thakurta, Z. Ma, H. Apte, and J. Spacek. Differentially private stream processing at scale, 2023.