Congenial Differential Privacy under Mandated Disclosure
Abstract.
Differentially private data releases are often required to satisfy a set of external constraints that reflect the legal, ethical, and logical mandates to which the data curator is obligated. The enforcement of constraints, when treated as post-processing, adds an extra phase in the production of privatized data. It is well understood in the theory of multi-phase processing that congeniality, a form of procedural compatibility between phases, is a prerequisite for the end users to straightforwardly obtain statistically valid results. Congenial differential privacy is theoretically principled, which facilitates transparency and intelligibility of the mechanism that would otherwise be undermined by ad-hoc post-processing procedures. We advocate for the systematic integration of mandated disclosure into the design of the privacy mechanism via standard probabilistic conditioning on the invariant margins. Conditioning automatically renders congeniality because any extra post-processing phase becomes unnecessary. We provide both initial theoretical guarantees and a Markov chain Monte Carlo algorithm for our proposal. We also discuss intriguing theoretical issues that arise in comparing congenial differential privacy and optimization-based post-processing, as well as directions for further research.
1. Privacy as data processing
1.1. A blurry yet essential picture
The curation and dissemination of large-scale datasets benefit science and society by supplying factual knowledge to assist discoveries and policy decisions, and to promote the transparency of information. As more data become accessible to more entities, however, unobstructed access to information collected from individuals poses the risk of infringing on their privacy. Differential privacy is a mathematical concept that quantifies the extent of disclosure of confidential information in a database. It enjoys several advantages over its previous counterparts in statistical disclosure limitation. Most important of all, especially to data analysts who wish to conduct statistical inference on privatized data releases, is the transparency of the algorithm. The mechanism through which the privatized data release is generated can be spelled out in full, with its statistical properties fully understood. This enables analysts to incorporate it as part of a model, hence permitting the statistical validity of the resulting inference (Gong, 2020).
The protection of confidential data with differential privacy relies on the careful design of a probabilistic mechanism, one that can veil the microscopic identities of individual respondents while preserving the macroscopic aspects of the data with high fidelity. The probabilistic nature of the mechanism is necessary, and is what enables such a tradeoff to be made (Dinur and Nissim, 2003). Typically, a differentially private mechanism injects a random perturbation into an otherwise deterministic query to be applied to the confidential database. One can say, then, that differentially private releases are “blurry” versions of the confidential data, just the same way a skilled Impressionist painter captures the essence of a pond of waterlilies without sketching out the contour of every petal and leaf.
When randomness is involved, however, certain truthful aspects of the data are invariably compromised, no matter how well-designed the privacy algorithm may be. Imagine if a picture of waterlilies were commissioned, not by an art collector, but by a botanist whose sole purpose is to study the structural formation of the plant, such as the exact length and width of its petals. She would be terribly disappointed by the Impressionist painting, even if it was the work of Claude Monet himself!
Circumstances in practice dictate that some aspects of the data release may be deemed unfit to be tampered with. These are usually key statistics reflecting the fundamental purpose of data collection, as required by law, policy, or other external constraints put forth by the stakeholders. The data curator is mandated to disclose these statistics accurately at any expense, while at the same time shielding the remainder of the data release with a veil of privacy. This poses a challenge to the design of the privacy mechanism subject to mandated disclosure. The central question is how to integrate data privatization, an inherently random act, with the mandated disclosure of partial yet deterministic information, while maintaining both the logical consistency of the overall data release and the quality of the privacy protection.
1.2. Congenial privacy
In this work we conceptualize the privacy mechanism as one of the many phases of data processing. The concept of congeniality, or rather uncongeniality (Meng, 1994), is then relevant. The theory of uncongeniality was developed for investigating a seemingly paradoxical phenomenon discovered by researchers dealing with imputations for the U.S. Decennial Census and similar public-use data files. Fay (1992) and Kott (1995) found that one can obtain inconsistent variance estimation for multiple imputation inference (Rubin, 2004), even when both the imputation model and the analysis procedure are valid. The issue turned out to be a mathematical incompatibility between the imputation model and the analysis procedure, even when neither is incompatible with the underlying model that generates the data. In other words, there is no probabilistically coherent model that can simultaneously imply the imputation model and the analysis procedure, a situation termed uncongeniality by Meng (1994). To make matters worse, the imputation model, such as that adopted by the Census Bureau, is typically not disclosed to the analyst of the imputed data, or at least not fully (e.g., due to confidential information used to help better predict the missing values). The lack of transparency makes it impossible for the analyst to correct for the uncongeniality, or worse, even to realize the problem.
The framework of uncongeniality was later generalized to the multivariate setting (Xie and Meng, 2017) and to general multi-phase processing (Blocker and Meng, 2013), which covers the current application under the same overarching principles. Two properties are critical for good privacy practice: transparency and congeniality. When the protection of privacy must observe mandated constraints, our proposal is to use the conditional distribution derived from the original unconstrained differential privacy mechanism, conditioning on the invariant margins. This approach achieves both properties automatically. Transparency is automatic, because the conditional distribution is determined by the original unconstrained distribution and the invariant constraints, both fully disclosed by design. Congeniality stipulates the use of a single coherent probabilistic model to ensure both differential privacy and mandated disclosure. This requirement is automatically satisfied when we use the conditional distribution derived from the original differential privacy mechanism, restricted solely to obey the mandated disclosure. A third advantage is that our proposal does not need any additional choice of procedural ingredients, such as a projection distance, which is required by optimization-based post-processing such as that adopted by the Census TopDown algorithm (Abowd et al., 2019).
There is, however, no free lunch. The first price we pay for congenial privacy is computational. Sampling from a distribution truncated to some space, as determined by the invariant margins, is generally not trivial, especially when the truncated region is of irregular shape. The second price is that we may pay more privacy loss budget than necessary, since the budget designed for the original mechanism depends on the sensitivity of the query measured on the unconstrained space, which is larger than that on the constrained space. When deriving the appropriate class of conditional distributions for the constrained mechanism, the new privacy loss budget should ideally be calibrated directly according to the query's behavior on the constrained space, as opposed to being inherited from the unconstrained mechanism, which guarantees a likely overly conservative level of privacy protection for the entire space. When the analytical complexity and computational requirements of the two approaches to budget calibration are similar, we certainly recommend the former.
1.3. The mechanism of differential privacy
Let $x = (x_1, \ldots, x_n)$ denote a database consisting of $n$ individuals, and $\mathcal{X}^n$ the space on which it is defined. A query function $f$ embodies the knowledge contained in the database that stakeholders – scientists, policy makers and the general public – would like to learn. What determines the value of $f(x)$ is of course $x$, or equivalently all its component values, corresponding to individual respondents included in the database. It is precisely these individual records, or the $x_i$'s, that are the subject of privacy protection. How can the data curator say useful things about $f(x)$, while saying barely anything about each of the $x_i$'s?
The mechanism that can instill privacy into the curator's release appeals to randomness. A random query function $M$ is said to satisfy $\epsilon$-differential privacy (Dwork et al., 2006) if, for all pairs of neighboring datasets $x, x' \in \mathcal{X}^n$, we have that
(1.1) $\Pr\left(M(x) \in B\right) \le e^{\epsilon}\, \Pr\left(M(x') \in B\right)$
for all Borel-measurable sets $B$. In this work, the term neighboring datasets means that $x$ and $x'$ differ by exactly one individual's record, i.e. $x_i \neq x_i'$ for some $i$, but $x_j = x_j'$ for all $j \neq i$. Write $x \sim x'$ if $x$ and $x'$ are neighbors. This concept of neighbor is employed in the definition of $\epsilon$-bounded differential privacy (Abowd et al., 2019), and is distinct from the original formulation, which defined neighbors as a pair of datasets that differ from each other by the addition or deletion of a single record. There are many ways to design an $\epsilon$-differentially private algorithm $M$, among which the most widely known and implemented are the Laplace and the Double Geometric algorithms (Dwork et al., 2006; Fioretto and Van Hentenryck, 2019), both additive $\epsilon$-differentially private mechanisms.
Definition 1.1 (Laplace mechanism).
Let $f: \mathcal{X}^n \to \mathbb{R}^k$ be a deterministic query function. The Laplace mechanism is given by
(1.2) $M(x) = f(x) + (u_1, \ldots, u_k)$,
where the $u_i$'s are independent zero-mean Laplace random variables, each with dispersion parameter $\Delta f / \epsilon$, and
$\Delta f = \max_{x \sim x'} \left\| f(x) - f(x') \right\|_1$
is the global sensitivity of $f$. When the database consists of binary records in an unrestricted domain, and $f$ is the counting or histogram query, $\Delta f = 1$.
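To make Definition 1.1 concrete, here is a minimal Python sketch of the Laplace mechanism; the function name and the example numbers are ours, and the sensitivity must be supplied according to the query and neighboring convention in use.

```python
import numpy as np

def laplace_mechanism(f_x, epsilon, sensitivity, rng=None):
    """Release f(x) + iid zero-mean Laplace noise with scale sensitivity/epsilon.

    `sensitivity` is the global sensitivity of the query under the chosen
    notion of neighboring databases (an input assumption, not computed here).
    """
    rng = np.random.default_rng() if rng is None else rng
    f_x = np.asarray(f_x, dtype=float)
    return f_x + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=f_x.shape)

# Example: a hypothetical two-cell counting query with budget epsilon = 1.
print(laplace_mechanism([130, 126], epsilon=1.0, sensitivity=1.0))
```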
Definition 1.2 (Double Geometric mechanism).
A random query mechanism $M$ is called the Double Geometric mechanism if it has the same functional form as (1.2), with random variables $u_i$ defined on the integers with probability mass function
(1.3) $\Pr(u_i = u) = \dfrac{1-\alpha}{1+\alpha}\, \alpha^{|u|}, \qquad u \in \mathbb{Z},$
where the parameter $\alpha = \exp(-\epsilon / \Delta f) \in (0, 1)$.
Note that the definition of either the Laplace or the Double Geometric mechanism presents not just one, but a collection of mechanisms $\{M_\epsilon : \epsilon > 0\}$, indexed by the privacy loss budget $\epsilon$ allocated to the mechanism in question. When regarded as a sequence of statistical procedures, $\epsilon$ serves as an indicator of the statistical quality of the output (with larger $\epsilon$ for higher quality) that can be used to offer interpretation and to guide its own choice. This point will be immediately useful in Section 1.4.
1.4. Statistical intelligibility
An important reason that differential privacy is embraced by the statistics community is that it defines privacy in the language of probability, induced by the mechanism that injects randomness into the data release. An impactful consequence is that the distributional specification of the mechanism can be made fully transparent without sabotaging the promised protection. This opens the door for systematic analysis of the statistical properties of mechanisms, which is in turn crucial to the accurate interpretation of statistical inference from privatized data releases (Gong, 2019a). The clarity both in definition and in implementation makes up the statistical intelligibility of differential privacy as a data processing procedure.
We discuss the statistical interpretation of the privacy mechanism, which is what served as inspiration for the conditioning approach to constructing the invariant-respecting mechanism in the first place. The degree of protection exerted by a privacy mechanism on the confidential database is seen as a calculated limit on the statistical knowledge it is able to supply, as a function of the privacy loss budget allotted to the mechanism. Compare the quantities
(1.4) $\pi\left(x_i = a\right) \quad \text{versus} \quad \pi\left(x_i = a \mid M_\epsilon(x) = s^*\right),$
which are the analyst's prior probability about the value of the $i$th entry of the dataset, versus her posterior probability if an $\epsilon$-differentially private query released a report $s^*$. Thus, the statistical meaning of $\epsilon$ can be explained as follows.
Theorem 1.3.
Let $x$ be the database, and $\{M_\epsilon : \epsilon > 0\}$ a class of $\epsilon$-differentially private procedures operating on $x$. Then for every $\epsilon > 0$, and every prior probability $\pi$ the analyst harbors about $x_i$,
(1.5) $e^{-\epsilon} \;\le\; \dfrac{\pi\left(x_i = a \mid M_\epsilon(x) = s^*\right)}{\pi\left(x_i = a\right)} \;\le\; e^{\epsilon}.$
Proof.
The posterior probability can be written as
$\pi\left(x_i = a \mid M_\epsilon(x) = s^*\right) = \dfrac{\pi(x_i = a)\, p\left(s^* \mid x_i = a\right)}{\sum_b \pi(x_i = b)\, p\left(s^* \mid x_i = b\right)}.$
The result then follows immediately from the fact that $M_\epsilon$ is $\epsilon$-differentially private, which means that for any pair of values $a$ and $b$,
$e^{-\epsilon} \le p\left(s^* \mid x_i = a\right) / p\left(s^* \mid x_i = b\right) \le e^{\epsilon}.$
∎
Theorem 1.3 says that any release generated by an $\epsilon$-differentially private procedure sharpens the analyst's knowledge about $x_i$ by at most a factor of $e^{\epsilon}$. This interpretation provides a direct link between the differential privacy promise and the actual posterior risk of disclosure due to the release of the random query $M_\epsilon(x)$.
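The bound in (1.5) can be checked numerically. The sketch below is a hypothetical illustration with made-up numbers: an analyst who knows the other records' contribution to a counting query updates a prior on a single binary entry $x_i$ after seeing a Laplace-mechanism release, and the posterior-to-prior ratio stays within $[e^{-\epsilon}, e^{\epsilon}]$.

```python
import numpy as np
from scipy.stats import laplace

eps, sensitivity = 1.0, 1.0
rest_count = 40             # contribution of the other records, assumed known
prior = {0: 0.3, 1: 0.7}    # analyst's prior on the binary entry x_i
release = 41.7              # a hypothetical privatized count s*

# Likelihood of the release under x_i = 0 versus x_i = 1 (Laplace mechanism).
lik = {a: laplace.pdf(release, loc=rest_count + a, scale=sensitivity / eps)
       for a in (0, 1)}
evidence = sum(prior[a] * lik[a] for a in (0, 1))
for a in (0, 1):
    ratio = (prior[a] * lik[a] / evidence) / prior[a]   # posterior / prior
    assert np.exp(-eps) <= ratio <= np.exp(eps)         # the bound in (1.5)
    print(f"x_i = {a}: posterior/prior ratio = {ratio:.3f}")
```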
Recall that the Laplace and the Double Geometric mechanisms are both well-defined for any $\epsilon > 0$. In the limiting case of $\epsilon \to 0$, i.e. as the privacy loss budget becomes increasingly restrictive, both algorithms amount to adding noise with increasingly large variance to the confidential query. At $\epsilon = 0$, neither mechanism remains well-defined, since the distributions of the noise component become improper due to the infinite cardinality of their respective domains. Nevertheless, the definition of $\epsilon$-differential privacy allows for the expression with $\epsilon = 0$. A mechanism is $0$-differentially private if one cannot gain any discriminatory knowledge from its release about the underlying database whatsoever. In other words, the analyst's knowledge about the individual state of $x_i$ must remain the same as her prior. This notion can be explained consistently with Theorem 1.3, by observing that
(1.6) $\lim_{\epsilon \to 0} \dfrac{\pi\left(x_i = a \mid M_\epsilon(x) = s^*\right)}{\pi\left(x_i = a\right)} = 1,$
where the limit is implied by (1.5). This inspires the following deliberate construction of $M_0$ as a $0$-differentially private procedure.
Definition 1.4 ($0$-differentially private procedure).
For a class of $\epsilon$-differentially private procedures $\{M_\epsilon\}$ well-defined for $\epsilon > 0$ but not $\epsilon = 0$, define $M_0$ as the $0$-differentially private procedure such that for every prior probability $\pi$,
(1.7) $\pi\left(x_i = a \mid M_0(x) = s^*\right) = \pi\left(x_i = a\right).$
Definition 1.4 grants conceptual continuity to $\{M_\epsilon : \epsilon \ge 0\}$; $M_0$ is a perfectly meaningful object in the privacy sense but lacks statistical intelligibility from the mechanistic point of view. For practical purposes, $M_0$ should be taken to mean the suppression procedure, which supplies vacuous knowledge to the analyst about the state of affairs of the database. Theoretically, the meaning of $M_0$ as a probabilistic mechanism cannot be supplied by ordinary probabilities, because any probability specification represents a set of specific knowledge about the relative frequencies of any pair of states. However, its meaning can be quantified precisely in the more general framework of imprecise probability, as Section 5 will discuss.
2. Constructing congenial privacy
2.1. Privacy with invariants
While privacy protection is called for, the data curator may simultaneously be mandated to disclose certain aspects of the data exactly as they are observed, without subjecting them to any privacy protection. This collection of information is referred to as invariant information, or invariants. In practice, invariants are often defined according to a set of exact statistics calculated based on the confidential database (Ashmead et al., 2019). For example, suppose $f$ is the histogram query which tabulates the population residing in each county of the state of New Jersey from the 2020 U.S. Census. When producing a differentially private version of the histogram, the Census Bureau is constitutionally mandated to report the total population of each state as enumerated. This means that the privatized histogram must possess the same total population size as $f(x)$, where $x$ denotes the confidential Census microdata.
Suppose that the Double Geometric mechanism is to be applied to the histogram query $f$. Due to the random nature of the perturbations, a single realization of the mechanism will, with high probability, produce a histogram whose total differs from that of $f(x)$. Furthermore, the privatized histogram has a positive probability of containing negative cell counts, which is logically impossible for the confidential query $f(x)$. The challenge of privacy preservation under mandated disclosure is thus to find an alternative mechanism, say $\tilde{M}$, such that every realization of $\tilde{M}(x)$ meets all the requirements of mandated information disclosure, while preserving the promise of differential privacy.
Let $\mathcal{X}_I \subseteq \mathcal{X}^n$ be the set of databases $x$ that obey the given invariants. In turn, $\mathcal{X}_I$ defines the set of values that the query must satisfy as
(2.1) $\mathcal{S}_I = \left\{ f(x) : x \in \mathcal{X}_I \right\}.$
Note that implicitly, $\mathcal{S}_I$ is a set-valued function of the confidential dataset $x$, because the invariant knowledge we intend to impose on the private release is implied by $x$.
A random mechanism $\tilde{M}$ satisfies the mandated disclosure if
(2.2) $\Pr\left( \tilde{M}(x) \in \mathcal{S}_I \right) = 1 \quad \text{for all } x \in \mathcal{X}_I.$
That is, whenever $\tilde{M}$ is applied to a database conformal to the mandated disclosure, its output is, with mathematical certainty, also conformal to the mandated disclosure. The size and complexity of the restricted $\mathcal{X}_I$ (and $\mathcal{S}_I$) relative to their original spaces are crucial to the overall extent to which privacy of the residual information in the database can be protected, a point we will revisit in Section 5.1.
There may be many ways to construct such a random mechanism $\tilde{M}$, but not all are equally desirable. We argue that $\tilde{M}$ should be constructed in a principled manner, and a constructive way to achieve that is to use conditional distributions of unconstrained privacy mechanisms. The resulting mechanism can be easily tuned to retain its differential privacy promise, while ensuring its congenial integration into the data processing pipeline and preserving the statistical intelligibility of its releases.
2.2. Imposing invariants via conditioning
Let $M$ be a valid $\epsilon$-differentially private mechanism, which generates outputs that typically do not obey the invariant requirement $\mathcal{S}_I$, even if $x \in \mathcal{X}_I$. A natural idea to force the requirement onto $M$ is via conditioning. Define a modified privatization mechanism $\tilde{M}$ such that the probability distribution it induces is the same as the conditional distribution of $M(x)$ subject to the constraint that $M(x) \in \mathcal{S}_I$. In what follows, we use the notation $\overset{d}{=}$ to mean that two random quantities are identically distributed, and $(\,\cdot \mid \cdot\,)$ denotes a well-specified conditional distribution. Also assume for now that $\Pr\left(M(x) \in \mathcal{S}_I\right) > 0$. We have the following theorem.
Theorem 2.1.
Let be the confidential database, and the invariant subset to which conforms. The deterministic function is a query, and the implied is defined by (2.1). Let be an -differentially private mechanism based on , and be a constrained mechanism such that
(2.3) |
Then for all invariant-conforming pairs of datasets that are -neighbors, i.e. such that , there exists a real-valued such that for all ,
(2.4) |
Proof.
For a pair of -neighboring and -conforming datasets and any ,
Clearly each of the last two ratios above is bounded above by and below by because is -differentially private. Consequently, if we let
(2.5) |
then and it is a known constant because is public. Then, (2.4) holds for any , and certainly for . ∎
Our proof above might create an impression that only is permissible, but Sections 4 and 5.1 will supply two examples both with . Negative may sound paradoxical, for it seems to suggest that better privacy protection can be achieved by disclosing some information. However, we must be mindful that differential privacy is not about protecting privacy in absolute terms. Rather, it is about controlling the additional disclosure risk from releasing the privatized data to the users (or hackers), relative to what they know before the release. The presence or absence of mandated invariants would ex ante constitute two different bodies of knowledge, hence any additional privacy protection would carry different interpretations too. We also emphasize that Theorem 2.1 generalizes to cases where , such as when it is a linear subspace of and the privacy mechanism is continuous. The proof is a bit more involved in order to properly define , which we will discuss in future work. However, this complication is not a concern for discrete privatization mechanisms, such as within the Census TopDown algorithm (Abowd et al., 2019).
Theorem 2.1 holds broadly for arbitrary kinds of -differentially private mechanisms, as well as any deterministic invariant information about either the database or the query function that can be expressed in a set form. It lends itself to the same kind of posterior interpretation enjoyed by unconstrained differentially private mechanisms. Specifically, if is the constrained differentially private procedure constructed based on the unconstrained procedure , according to the specifications of Theorem 2.1, then for all such that so that , and , the analyst’s posterior probability is bounded in between
This is an interval that bears structural resemblance to (1.5), thanks to the conditional nature of $\tilde{M}$, which allows for statistical information from privacy mechanisms, constrained or otherwise, to be interpreted in the same (hence congenial) way.
The definition of $\epsilon$-differential privacy has the property that, if a mechanism is $\epsilon$-differentially private, then it is also $\epsilon'$-differentially private for all $\epsilon' \ge \epsilon$. If the invariant information does not substantially disrupt the neighboring structure of the sample space of the database, a notion we will make precise later in Section 5, what Theorem 2.1 says is that enforcing the invariant onto the unconstrained mechanism via conditioning costs at most twice the privacy loss budget allotted to the unconstrained mechanism. When a more cost-effective bound is hard to determine, the simplest way to ensure a privacy loss budget of $\epsilon$ for the constrained mechanism is to use a budget of $\epsilon/2$ for the unconstrained mechanism to begin with; see Section 3.
2.3. A Monte Carlo Implementation
Let $p(\cdot \mid f(x))$ denote the probability distribution, either a mass function or a density function, induced by the unconstrained differentially private algorithm, which depends on the confidential query $f(x)$. Further denote by $p(\cdot \mid f(x), \mathcal{S}_I)$ the corresponding conditional distribution constrained on the invariant set $\mathcal{S}_I$. The constrained privacy mechanism requires samples from this conditional distribution. The simplest, though often inefficient or even impractical, method is rejection sampling. Since the constrained target is proportional to $p(s \mid f(x))$ restricted to $s \in \mathcal{S}_I$, one can opt for a proposal density $q$ whose support covers $\mathcal{S}_I$ and for which the target-to-proposal ratio is bounded, and accept a sample with probability proportional to that ratio. This encompasses the option to set $q = p(\cdot \mid f(x))$, the unconstrained privacy kernel itself, and keep sampling until the sample falls into $\mathcal{S}_I$. This strategy is clearly inefficient in general, and impossible when $\Pr\left(M(x) \in \mathcal{S}_I\right) = 0$.
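A minimal sketch of this rejection strategy for a discrete release: propose from the unconstrained Double Geometric mechanism and accept only draws that satisfy a fixed total and nonnegativity. The histogram values and the noise parameter below are made up for illustration.

```python
import numpy as np

def double_geometric(alpha, size, rng):
    """Two-sided geometric noise with pmf proportional to alpha**|u| (cf. (1.3)),
    drawn as the difference of two iid Geometric(1 - alpha) variables."""
    return rng.geometric(1 - alpha, size) - rng.geometric(1 - alpha, size)

def rejection_sample_constrained(f_x, alpha, total, max_tries=1_000_000, rng=None):
    """Naive rejection sampler for the conditional mechanism: keep drawing from
    the unconstrained Double Geometric release until the invariant total and
    nonnegativity hold.  Illustration only; acceptance can be vanishingly rare."""
    rng = np.random.default_rng() if rng is None else rng
    f_x = np.asarray(f_x)
    for _ in range(max_tries):
        s = f_x + double_geometric(alpha, f_x.shape, rng)
        if s.sum() == total and np.all(s >= 0):
            return s
    raise RuntimeError("no accepted draw within max_tries")

f_x = np.array([12, 7, 30, 9])   # hypothetical confidential histogram
print(rejection_sample_constrained(f_x, alpha=np.exp(-0.5), total=int(f_x.sum())))
```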
Efficient algorithms tailor-made to the specifications of $\mathcal{S}_I$ are possible. Here, we present in Algorithm 1 a generic approach based on Metropolized Independent Sampling (MIS; Liu, 1996), for the most common case in which the mandated invariants are expressed in terms of a consistent system of linear equalities and inequalities,
$A\tilde{s} = b, \qquad C\tilde{s} \ge c.$
Here $A$ and $C$ are $m_1 \times k$ and $m_2 \times k$ matrices with ranks $m_1$ and $m_2$ respectively, and $b$ and $c$ vectors of length $m_1$ and $m_2$ respectively. The algorithm requires a proposal index set $\mathcal{J}$, a subset of $\{1, \ldots, k\}$ of size $k - m_1$, chosen such that the equality constraints uniquely determine the remaining coordinates; that is, the submatrix of $A$ consisting of all columns whose indices lie outside $\mathcal{J}$ is invertible. For each $A$, the choice of $\mathcal{J}$ may not be unique, and can have a potential impact on the efficiency of the algorithm.
Input: unconstrained privacy kernel $p(\cdot \mid f(x))$,
confidential query $f(x)$, invariant parameters $(A, b, C, c)$,
proposal distribution $q$, proposal index set $\mathcal{J}$,
initial value $\tilde{s}^{(0)} \in \mathcal{S}_I$, integer nsim;
Iterate: for $t = 1, \ldots, \text{nsim}$, at current state $\tilde{s}^{(t-1)}$:
step 1, propose $\tilde{s}^{*}$:
1-1. sample $\tilde{s}^{*}_{\mathcal{J}} \sim q$;
1-2. solve for $\tilde{s}^{*}_{-\mathcal{J}}$ in $A\tilde{s}^{*} = b$;
1-3. write $\tilde{s}^{*} = \left(\tilde{s}^{*}_{\mathcal{J}}, \tilde{s}^{*}_{-\mathcal{J}}\right)$;
step 2, compute the importance ratio $w(\tilde{s}^{*}) = p\left(\tilde{s}^{*} \mid f(x)\right)\,\mathbf{1}\{C\tilde{s}^{*} \ge c\} \,/\, q\left(\tilde{s}^{*}_{\mathcal{J}}\right)$;
step 3, set $\tilde{s}^{(t)} = \tilde{s}^{*}$ with probability $\min\left\{1,\, w(\tilde{s}^{*}) / w(\tilde{s}^{(t-1)})\right\}$,
otherwise set $\tilde{s}^{(t)} = \tilde{s}^{(t-1)}$.
Output: a set of draws $\tilde{s}^{(t)}$, $t = 1, \ldots, \text{nsim}$.
This algorithm is applicable to both discrete and continuous data and privatization schemes, and it does not require the normalizing constant of the constrained target distribution. In Section 3 below, we use it to construct a mechanism for differentially private demographic contingency tables subject to both linear equality and inequality invariant constraints.
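The following Python sketch implements the logic of Algorithm 1 under the linear-constraint notation above, with nonnegativity as the inequality constraint. The function names, the log-scale weight bookkeeping, and the toy four-cell example are our own choices for illustration, not a transcription of the paper's implementation.

```python
import numpy as np

def log_dg_kernel(u, alpha):
    """Log of the (unnormalized) Double Geometric pmf of a noise vector u."""
    return np.sum(np.abs(u)) * np.log(alpha)

def mis_constrained(f_x, alpha_target, alpha_prop, A, b, J, nsim, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    f_x = np.asarray(f_x, dtype=float)
    k = f_x.size
    Jc = np.setdiff1d(np.arange(k), J)      # coordinates solved from A s = b

    def propose():
        s = np.empty(k)
        s[J] = f_x[J] + (rng.geometric(1 - alpha_prop, len(J))
                         - rng.geometric(1 - alpha_prop, len(J)))
        s[Jc] = np.linalg.solve(A[:, Jc], b - A[:, J] @ s[J])
        return s

    def log_weight(s):
        # target: unconstrained kernel restricted to the invariant set (here,
        # nonnegative integer tables); proposal density covers s[J] only.
        if np.any(s < 0) or not np.allclose(s, np.round(s)):
            return -np.inf
        return (log_dg_kernel(s - f_x, alpha_target)
                - log_dg_kernel(s[J] - f_x[J], alpha_prop))

    s_cur, w_cur = f_x.copy(), 0.0          # f(x) itself satisfies the invariants
    draws = []
    for _ in range(nsim):
        s_new = propose()
        w_new = log_weight(s_new)
        if np.log(rng.uniform()) < w_new - w_cur:   # MIS acceptance, log scale
            s_cur, w_cur = s_new, w_new
        draws.append(s_cur.copy())
    return np.array(draws)

# Toy usage: a four-cell histogram whose total is held invariant.
f_x = np.array([12, 7, 30, 9])
A, b = np.ones((1, 4)), np.array([58.0])
out = mis_constrained(f_x, alpha_target=np.exp(-0.5), alpha_prop=np.exp(-0.4),
                      A=A, b=b, J=np.array([0, 1, 2]), nsim=2000)
print(out[-3:])
```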
3. Contingency Table with Invariants
5 | 6-10 | 11-15 | 16-17 | 18-19 | 20 | 21 | 22-24 | 25-29 | 30-34 | 35-39 | 40-44 | 45-49 | 50-54 | |
Female | 8 | 6 | 3 | 6 | 4 | 4 | 4 | 8 | 5 | 7 | 7 | 6 | 1 | 5 |
Male | 3 | 4 | 5 | 8 | 6 | 4 | 5 | 5 | 5 | 6 | 10 | 7 | 3 | 2 |
55-59 | 60-61 | 62-64 | 65-66 | 67-69 | 70-74 | 75-79 | 80-84 | 85+ | Total | |||||
Female | 4 | 4 | 9 | 6 | 2 | 8 | 8 | 8 | 7 | 130 | ||||
Male | 5 | 11 | 6 | 4 | 7 | 4 | 5 | 3 | 8 | 126 | ||||
Voting | 43 | 213 | 256 |
5 | 6-10 | 11-15 | 16-17 | 18-19 | 20 | 21 | 22-24 | 25-29 | 30-34 | 35-39 | 40-44 | 45-49 | 50-54 | |
Female | 6 | 6 | 4 | 3 | 2 | 10 | 2 | 5 | 6 | 7 | 5 | 6 | 0 | 5 |
Male | 9 | 4 | 5 | 6 | 3 | 4 | 5 | 5 | 3 | 4 | 7 | 8 | 5 | 3 |
55-59 | 60-61 | 62-64 | 65-66 | 67-69 | 70-74 | 75-79 | 80-84 | 85+ | Total | |||||
Female | 4 | 5 | 10 | 8 | 3 | 8 | 8 | 12 | 5 | 130 | ||||
Male | 2 | 23 | 8 | 2 | 6 | 3 | 0 | 3 | 8 | 126 | ||||
Voting | 43 | 213 | 256 |
The table we consider is of dimension $2 \times 23$, with rows representing sex (male/female), and columns representing age bucketed roughly every five years, with finer buckets around key age ranges such as 18, 21 and 60. This data structure corresponds to the 2010 Census Table #P12 and 2020 Census Table #P7 (US Census Bureau, 2020), one of the most referenced types of contingency table releases by the Census Bureau at various geographic levels. For the purpose of computational illustration, this example will use simulation to construct synthetic datasets that represent the confidential Census demographic data.
Let the row-vectorized contingency table be a vector of length $2 \times 23 = 46$; we write $f(x)$ for its confidential version and $\tilde{s}$ for a privatized release. The constraints to be imposed on the differentially private table include
(1) total population;
(2) proportion of female population;
(3) total voting age population (18+ age); and
(4) nonnegative table entries.
Items (1) to (3) constitute equality constraints, and item (4) inequality constraints. The unconstrained privacy mechanism that serves as the basis of our construction is the Double Geometric mechanism, with per-cell noise distribution as defined in (1.3) and a fixed privacy loss budget per cell. The proposal distribution is set to be of the same family as the unconstrained privatization algorithm, but is given a distinct dispersion parameter to tune for the best algorithmic performance, in this case the acceptance probability of the algorithm. The proposal distribution for $\tilde{s}^{*}_{\mathcal{J}}$, the subvector of the $t$th proposal indexed by $\mathcal{J}$, is of the Double Geometric form centered at the corresponding cells of $f(x)$, with its dispersion parameter chosen by tuning. The remaining coordinates of the $t$th proposal are solved according to the equality constraint $A\tilde{s}^{*} = b$.
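For concreteness, one way to encode invariants (1)-(4) for the row-vectorized $2 \times 23$ table is sketched below, using the totals shown in Table 1 (256, 130 and 213). The row ordering (female cells first) and the identification of the voting-age columns are our assumptions.

```python
import numpy as np

n_age = 23
voting_cols = np.arange(4, n_age)       # age buckets from 18-19 onward (assumed)
k = 2 * n_age                           # female cells 0..22, male cells 23..45

A = np.zeros((3, k))
A[0, :] = 1.0                           # (1) total population
A[1, :n_age] = 1.0                      # (2) female population (fixes the proportion)
A[2, voting_cols] = 1.0                 # (3) voting age population, female cells
A[2, n_age + voting_cols] = 1.0         #     ... and male cells
b = np.array([256.0, 130.0, 213.0])     # totals as shown in Table 1

C, c = np.eye(k), np.zeros(k)           # (4) nonnegativity: C s >= c
```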
Table 1 displays a simulated confidential table (top) and a constrained differentially private table (bottom) based on the confidential table, where the draw is produced by Algorithm 1. In this case, setting the proposal dispersion parameter slightly larger than the per-cell privacy loss budget yields the best acceptance probability, as shown in Figure 2 in Appendix C, alongside traceplots (Figure 3) for the second and the first cells of the table, with the former index belonging to the proposal index set $\mathcal{J}$ and the latter not.

We compare the outputs of our congenial mechanism with the unconstrained privacy mechanism, with and without post-processing via nonnegative $\ell_2$ minimization onto the invariant set defined by (1)-(4). A total of 20 confidential contingency tables are simulated, each with i.i.d. cell counts following a common count distribution. For each confidential table, 100 privatized releases were created using each of the following methods:
a) the unconstrained (raw) Double Geometric mechanism with a given privacy loss budget per cell;
b) the above mechanism followed by nonnegative $\ell_2$ (NNL2) minimization onto the subspace defined by the four constraints;
c) our proposed conditional algorithm per Algorithm 1, constructed based on an unconstrained Double Geometric mechanism; and
d) the same unconstrained Double Geometric mechanism as in (a), but with a different privacy loss budget per cell.
The $\ell_1$ and $\ell_2$ distances between a confidential table $s = f(x)$ and $\tilde{s}$, a privatization of $s$, are given respectively by
(3.1) $\ell_1(s, \tilde{s}) = \sum_{i} \left| s_i - \tilde{s}_i \right| \quad \text{and} \quad \ell_2(s, \tilde{s}) = \Big( \sum_{i} \left( s_i - \tilde{s}_i \right)^2 \Big)^{1/2}.$
For each of the four types of privatization and processing mechanisms, we compute averages of both distances over 100 realizations of the privatized release. Figure 1 displays the box plots of these average distances over the 20 simulated confidential tables. We observe that the conditional mechanism exhibits a degree of variability between those of the two unconstrained mechanisms in (a) and (d). On the other hand, the nonnegative projection of the unconstrained mechanism achieves a level of accuracy mostly on par with the mechanism it post-processes. Note that this observation should not be taken as a statement of relative accuracy between the nonnegative minimization and the constrained mechanism, because the effective privacy guarantee that either mechanism enjoys is undetermined, an issue we will discuss further at the end of Section 4. That said, since by Theorem 2.1 the conditional mechanism inflates the privacy loss budget of an unconstrained algorithm by at most a factor of two, the empirical observation suggests that the effective budget lies somewhere in between those of (a) and (d). In the following sections, we will see examples where the inflation is absent or even reversed.
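A sketch of how the average distances in (3.1) can be computed for the raw mechanism is given below; the cell distribution (Poisson) and the noise parameter are stand-ins for the elided simulation settings, not the paper's.

```python
import numpy as np

def dg_noise(alpha, size, rng):
    return rng.geometric(1 - alpha, size) - rng.geometric(1 - alpha, size)

def avg_distances(s, mechanism, n_rep=100, rng=None):
    """Average l1 and l2 distances of (3.1) between a confidential table s
    and n_rep privatized releases mechanism(s, rng)."""
    rng = np.random.default_rng() if rng is None else rng
    d1 = d2 = 0.0
    for _ in range(n_rep):
        s_tilde = mechanism(s, rng)
        d1 += np.abs(s - s_tilde).sum()
        d2 += np.sqrt(((s - s_tilde) ** 2).sum())
    return d1 / n_rep, d2 / n_rep

rng = np.random.default_rng(0)
s = rng.poisson(6.0, size=46)                               # stand-in confidential table
raw = lambda s, rng: s + dg_noise(np.exp(-0.5), s.shape, rng)
print(avg_distances(s, raw, rng=rng))
```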
4. Curator’s post-processing may not be innocent processing
A common practice to ensure unconstrained differentially private releases respect the mandated disclosure requirements is through optimization-based post-processing, which takes the general form
(4.1) $\tilde{s} = \operatorname*{arg\,min}_{t \in \mathcal{S}_I} D\left(t, M(x)\right).$
That is, $\tilde{s}$ is the element in $\mathcal{S}_I$ that is the closest to the unconstrained release $M(x)$ according to some discrepancy measure $D$, typically a distance, such as the two given in (3.1). In the case of the Census Bureau's TopDown algorithm, the post-processing is a composite procedure consisting of first a nonnegative $\ell_2$ minimization followed by an $\ell_1$ minimization onto the integer solutions; see (Abowd et al., 2019).
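As an illustration of the NNL2 step in (4.1), the following sketch projects a raw release onto the nonnegative set satisfying linear equality invariants using a generic solver; the TopDown production solver and the subsequent $\ell_1$ rounding to integers are not reproduced here.

```python
import numpy as np
from scipy.optimize import minimize

def nnl2_project(y, A, b):
    """Nonnegative l2 (NNL2) projection: the nonnegative vector closest to the
    raw release y in l2 distance among those satisfying A z = b.  SLSQP is a
    generic choice; dedicated QP solvers are typically used in production."""
    y = np.asarray(y, dtype=float)
    res = minimize(
        fun=lambda z: np.sum((z - y) ** 2),
        jac=lambda z: 2.0 * (z - y),
        x0=np.clip(y, 0.0, None),
        bounds=[(0.0, None)] * y.size,
        constraints=[{"type": "eq", "fun": lambda z: A @ z - b}],
        method="SLSQP",
    )
    return res.x

y = np.array([13.0, -2.0, 31.0, 12.0])        # a hypothetical raw release
print(nnl2_project(y, A=np.ones((1, 4)), b=np.array([58.0])))
```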
In the literature of differential privacy, there is a widely referenced theorem which establishes that differentially privatized releases are “immune” to post-processing (Dwork and Roth, 2014). The theorem states that if $M$ is an $\epsilon$-differentially private mechanism and $g$ is an arbitrary function, then $g \circ M$ is still $\epsilon$-differentially private. Indeed, for any measurable set $B$,
(4.2) $\Pr\left( g(M(x)) \in B \right) = \Pr\left( M(x) \in g^{-1}(B) \right) \le e^{\epsilon} \Pr\left( M(x') \in g^{-1}(B) \right) = e^{\epsilon} \Pr\left( g(M(x')) \in B \right).$
Thus for every $B$, the maximal increased risk of disclosure from releasing $g(M(x))$ cannot exceed that from releasing $M(x)$ (but the reverse is guaranteed only when $g$ is one-to-one). Intuitively, further blurring an already blurred picture can only make it harder, not easier, to see what is in the original picture.
This intuition, however, is based on the assumption that the further blurring process does not use any knowledge about the original picture. We need to make clear here that imposing invariants on differentially private releases via optimization-based post-processing, in the sense of the operation discussed here, does not in general fall under the jurisdiction of the post-processing theorem. This is because $g$, the function used to impose invariants on the unconstrained output $M(x)$, is implicitly dependent on the confidential dataset $x$, with the dependence induced via $\mathcal{S}_I$, or equivalently the invariant set $\mathcal{X}_I$ to which $x$ belongs. Since $\mathcal{S}_I$ supplies information about the confidential database, whereas the unconstrained mechanism $M$ is by design not preferential towards $\mathcal{S}_I$, any further processing of $M(x)$ that makes nontrivial use of $\mathcal{S}_I$ risks violating the privacy guarantee that $M(x)$ deserves.
The post-processing theorem guarantees that no loss of privacy will be induced to the privatized query via any functional transformation that may be carried out by an ordinary analyst or data user. However, imposing invariants is the kind of post-processing that only the data curator – one who has access to the confidential data – is capable of performing. In the extreme scenario (see Example 5.2) where the invariant forces the privatized disclosure to be precisely equal to the confidential query, for the data curator to achieve this algorithmically is as simple as taking the privatized query and projecting it to the single point in $\mathcal{S}_I$ defined by the confidential value $f(x)$. But this is impossible for a data user who does not know what $f(x)$ is.
One may wonder about the following question. While the invariant $\mathcal{S}_I$ has a dependence on the confidential $x$, $\mathcal{S}_I$ itself is nevertheless public information. Doesn't that make $g$ a fully specified function, just like in the post-processing theorem? The answer is no in general, and the distinction here is a subtle one. When talking about the value of the invariants, it suffices to regard $\mathcal{S}_I$ merely as an announced description of the confidential data. However, as alluded to previously, $\mathcal{S}_I$ is a set-valued map from the database space to subsets of the query space, i.e. $\mathcal{S}_I = \mathcal{S}_I(x)$. Almost always is the case in practice that the functional form of this map is a priori determined, but its value – namely $\mathcal{S}_I$ – can be calculated only after the confidential data is observed. Indeed, the actual specification of $\mathcal{S}_I$, and with it the post-processing function $g_x$, would almost certainly change if $x$ were observed differently. This means that for a measurable set $B$ and a database $x$, the equivalent events in the release space and the query space are now
(4.3) $\left\{ g_x(M(x)) \in B \right\} = \left\{ M(x) \in g_x^{-1}(B) \right\},$
noting that the inverse function $g_x^{-1}$ now depends on $x$.
To see the complication caused by this dependence, write $B_x = g_x^{-1}(B)$ and $B_{x'} = g_{x'}^{-1}(B)$. We then have
(4.4) $\dfrac{\Pr\left( g_x(M(x)) \in B \right)}{\Pr\left( g_{x'}(M(x')) \in B \right)} = \dfrac{\Pr\left( M(x) \in B_x \right)}{\Pr\left( M(x') \in B_{x'} \right)}.$
Although both $B_x$ and $B_{x'}$ are measurable sets, they are not necessarily the same when $x \neq x'$. Hence we cannot use (1.1) to conclude that the right hand side of (4.4) is bounded above by $e^{\epsilon}$. This does not necessarily imply that the post-processing as defined in (4.1) is not $\epsilon$-differentially private. Indeed, we prove in Appendix A that both the $\ell_2$ and (a class of) $\ell_1$ post-processing in Example 4.1 below are $\epsilon$-differentially private. But it does imply that in general, the statistical and privacy properties of the post-processed release are not straightforwardly inherited from those of $M(x)$, and hence need to be established on a case by case basis.
Another major drawback of using optimization-based methods to impose invariants is that the statistical intelligibility of differential privacy is obscured. The post-processing function is often procedurally defined, hence a complex and confidential-data-dependent map from the unconstrained query space to the constrained query space, with almost impenetrable statistical properties, and certainly so for any given database. In contrast, using conditioning to realize differential privacy with mandated disclosure, despite often being computationally demanding by construction, preserves the statistical intelligibility of the privacy mechanism. The constrained privacy mechanism is distributionally – as opposed to procedurally – constructed, preserving the possibility of transparent downstream analysis. It furthermore delivers its privacy guarantee in the same format as does differential privacy without constraints, offering a congenial statistical interpretation that resembles the original.
Below we use an example to compare congenial privacy with two post-processing approaches for the same query function. The example is simple enough for analytical derivations of the distributions of the post-processed mechanisms to be possible. As we will see, the three approaches to imposing invariant constraints yield distinct theoretical behaviors.
Example 4.1 (A two-bin histogram with constrained total).
Suppose the confidential database $x$ is a binary vector of length $n$, and the query of interest tabulates the number of 1 and 0 entries in $x$, i.e.
$f(x) = \Big( \sum_{i=1}^{n} x_i, \;\; n - \sum_{i=1}^{n} x_i \Big).$
Employ the Laplace mechanism as the unconstrained privatization mechanism to protect the two-bin histogram, i.e. release $M(x) = f(x) + (u_1, u_2)$ with $u_1, u_2$ independent zero-mean Laplace noise variables, expending $\epsilon$ in total privacy loss budget. The induced probability density of $M(x)$ is then the product of two Laplace densities centered at the confidential bin counts.
Suppose the invariant to be imposed is that the total of the privatized histogram shall be the same as that of the confidential query itself. That is, for any given $x$, the associated invariant set is
(4.5) $\mathcal{S}_I = \left\{ (s_1, s_2) : s_1 + s_2 = n \right\}.$
In the calculations below, a certain database $x$ is fixed, and we write the invariant total as $n$, the length of $x$.
Congenial privacy.
Under the constraint of the histogram total, our congenial mechanism $\tilde{M}$ is obtained from the conditional distribution of $M(x)$ given that its two components sum to $n$ – equivalently, the conditional distribution of the noise pair $(u_1, u_2)$ given $u_1 + u_2 = 0$.
It turns out that the probability density of is
(4.6) |
and zero otherwise; see Appendix A. That is, our congenial mechanism is simply to draw a noise value $u$ from (4.6), and then release $\left( f_1(x) + u, \; f_2(x) - u \right)$, where $f_1(x)$ and $f_2(x)$ are the two confidential bin counts. Clearly, the privacy property of this release is the same as that of its first component, call it $\tilde{M}_1$, which protects $f_1(x)$ by releasing $f_1(x) + u$, because setting the second component to $n$ minus the first is a deterministic step with no implication on privacy when $n$ is known. But $\tilde{M}_1$ is simply the Laplace mechanism with an $\epsilon$ budget. Consequently, our congenial mechanism maintains the same guarantee as the original unconstrained mechanism, even though the meaning of the protection is different, as we emphasized in Section 2.2.
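A minimal sketch of this congenial mechanism for the two-bin histogram follows. The calibration of the per-bin Laplace scale (via an $\ell_1$-sensitivity of 2 under record swaps) and the resulting conditional scale are our reading of the construction, since the exact constants are not reproduced above.

```python
import numpy as np

def congenial_two_bin(x, epsilon, rng=None):
    """Congenial release of the two-bin histogram f(x) = (#ones, #zeros)
    subject to the invariant that the two released counts sum to n = len(x).

    Assumptions: the unconstrained mechanism adds iid Laplace(b) noise to each
    bin with b = 2 / epsilon; conditioning the noise pair on summing to zero
    gives a single Laplace draw of scale b/2, since the conditional density is
    proportional to the product of the two Laplace densities at (u, -u)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    ones = int(np.sum(x))
    b = 2.0 / epsilon
    u = rng.laplace(scale=b / 2.0)
    return np.array([ones + u, (n - ones) - u])     # sums to n by construction

x = np.random.default_rng(1).integers(0, 2, size=100)
release = congenial_two_bin(x, epsilon=1.0)
print(release, "total:", release.sum())
```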
Post-processing with $\ell_2$ minimization.
Here we minimize the $\ell_2$ distance between $M(x)$ and the post-processed histogram release, subject to its sum being $n$. The solution adds to the first confidential bin count a noise term equal to the average of two independent Laplace random variables (and subtracts the same amount from the second bin). As can be easily seen (for example from its characteristic function), this average is not a Laplace random variable, but in fact follows the distribution
(4.7) |
that is a 50-50 mixture of a Laplace distribution with scale and a signed Gamma distribution (i.e. a regular Gamma distribution multiplied by a fair random sign) with shape and scale ; see Appendix A for derivation. It is worth noting that since a signed Gamma distribution of shape can be written as the sum of two independent Laplace distributions of the same scale, is more variable than a single independent Laplace random variable of the same scale. Hence for any , the privatized release using will be more variable than that of the congenial privatization . Intuitively, this suggests that should not do worse than in terms of privacy protection. Indeed as we will show in Appendix A, also achieves the same -differentially private guarantee.
Post-processing with $\ell_1$ minimization.
If we change the $\ell_2$ norm to the $\ell_1$ norm in the above, the privatization mechanism is no longer unique. There will be infinitely many solutions in the form of
where only needs to satisfy
where $u_{\min}$ and $u_{\max}$ are the minimum and the maximum of two i.i.d. Laplace random variables. In particular, choosing any convex combination of $u_{\min}$ and $u_{\max}$ as the additive noise term to the first entry constitutes a solution, i.e., $\lambda u_{\min} + (1-\lambda) u_{\max}$ for some $\lambda \in [0, 1]$, and then setting the second entry so that the two sum to $n$. For the rest of the article, $\ell_1$ post-processing refers to this convex combination strategy.
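The two post-processing families can be sketched directly in terms of the raw release $y$ and the public total $n$. The code below reflects our reading of the $\ell_2$ projection and the convex-combination $\ell_1$ strategy, with $\lambda = 1/2$ recovering the $\ell_2$ solution.

```python
import numpy as np

def l2_postprocess(y, n):
    """l2 projection of the raw two-bin release y onto {z : z1 + z2 = n}:
    spread the total discrepancy evenly across the two bins."""
    e = (y[0] + y[1] - n) / 2.0
    return np.array([y[0] - e, y[1] - e])

def l1_postprocess(y, n, lam=0.5):
    """One member of the l1-projection family: every point between y[0] and
    n - y[1] minimizes the l1 distance, and lam picks a convex combination of
    the two endpoints; lam = 0.5 coincides with the l2 projection."""
    lo, hi = min(y[0], n - y[1]), max(y[0], n - y[1])
    z1 = lam * lo + (1.0 - lam) * hi
    return np.array([z1, n - z1])

rng = np.random.default_rng(2)
n = 100
y = np.array([57.0, 43.0]) + rng.laplace(scale=2.0, size=2)   # raw release (scale made up)
print(l2_postprocess(y, n), l1_postprocess(y, n, lam=0.2))
```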
For ease of reference, Table 2 collects a comparison of the constrained differentially private histogram and the two post-processing approaches, $\ell_2$ and $\ell_1$, in terms of the expectation and variance of the resulting release for a given database $x$ and confidential query $f(x)$. All expectations are taken with respect to the relevant mechanism.
We can see that our congenial mechanism has the smallest variance. Because the congenial mechanism and the $\ell_2$ approach both carry the same $\epsilon$-privacy guarantee, which cannot be further improved, we can comfortably declare that the $\ell_2$ approach is inadmissible: it is dominated by the congenial mechanism, providing less utility (in terms of statistical precision) without the benefit of increased privacy protection. However, we cannot say that the congenial mechanism dominates the $\ell_1$ approach even though it still leads to smaller variance. This is because, as we will prove in Appendix A, the attained level of privacy guarantee of the $\ell_1$ approach is never worse than $\epsilon$. Hence the increased variance under the $\ell_1$ approach might be acceptable as a price for gaining more privacy protection. In general, comparing the utility of two privatization mechanisms with the same nominal but different actual privacy loss budgets is as thorny an issue as comparing the statistical power of two testing procedures with the same nominal, but different actual, Type I error rates.
5. Discussion
5.1. Finding a better privacy loss bound
While Theorem 2.1 always holds when the privacy loss budget is doubled, this likely sets a loose bound on the privacy loss of the constrained mechanism relative to the unconstrained one, hence declaring an overly “conservative” nominal level of privacy loss induced by $\tilde{M}$. Depending on how the invariant interacts with the distributional properties of the unconstrained mechanism in a specific instance, the bound can be shown to take smaller values, adding more “bang for the buck” to the privacy loss budget, so to speak. Three examples are given below.
Example 5.1 (trivial invariants).
Consider the trivial case where the set of invariants does not actually impose any restriction, i.e., $\mathcal{X}_I = \mathcal{X}^n$. It is then necessarily true that conditioning has no effect, and the “constrained” differentially private mechanism is identical in distribution to the unconstrained one: $\tilde{M}(x) \overset{d}{=} M(x)$. In this case, no budget inflation is needed, and $\tilde{M}$ is $\epsilon$-differentially private.
Example 5.2 (rounding and secrecy).
Let $x$ be a binary vector of length $n$ indicating a group of individuals' possession of a certain feature (yes 1 / no 0), and the query of interest is the number of groups of a given size that can be formed by people who possess the feature. A Double Geometric mechanism is used to protect the query, with a given privacy loss budget (under the global sensitivity of the query).
Suppose the following invariant set is mandated for disclosure:
or equivalently, $\mathcal{S}_I$ is the singleton set that contains nothing but the true value $f(x)$. In this case, the implied constrained privacy mechanism is equivalent to a degenerate distribution: $\tilde{M}(x) = f(x)$ with probability one, for all $x \in \mathcal{X}_I$. Furthermore, for all neighboring datasets $x, x' \in \mathcal{X}_I$, and any measurable subset $B$ of $\mathcal{S}_I$,
$\Pr\left( \tilde{M}(x) \in B \right) = \mathbf{1}\{ f(x) \in B \} = \Pr\left( \tilde{M}(x') \in B \right).$
Therefore in this particular instance, $\tilde{M}$ is in fact $0$-differentially private, corresponding to the best possible case in Theorem 2.1. This means that for those databases conformal to the invariant, $\tilde{M}$ supplies no discriminatory information among them whatsoever. Indeed, if the value of the supposedly private query is already public knowledge, no mechanism can further increase its disclosure risk, therefore achieving complete differential privacy.
Our third example, with a less trivial gain, is provided by the congenial mechanism in Example 4.1. There, although the guaranteed privacy loss budget is still that of the unconstrained mechanism, in applying Theorem 2.1 the neighboring relation must be reinterpreted, because under the constraint of a fixed sum, the nearest neighbors among binary vectors must differ in at least two entries. The privacy bound, once stated per pair of such conforming neighbors, is therefore tighter than the nominal accounting suggests.
This example also reminds us that a major cause of information leakage due to invariants is the structural erosion of the underlying data space, such as making immediate neighbors (as measured on the original space $\mathcal{X}^n$) impossible. In reality, the unconstrained data space is typically regular, and contains $\mathcal{X}_I$ as a proper subset. We should expect to find many $x \in \mathcal{X}_I$, and many (if not many more) $x' \notin \mathcal{X}_I$, such that $x$ and $x'$ are neighbors, near or far. Knowing that the confidential dataset must belong to $\mathcal{X}_I$ categorically rules out the possibility that the $x'$'s can be the confidential dataset, weakening the differential privacy promise by eliminating those neighbors. If the invariant is sufficiently restrictive such that $\mathcal{X}_I$ becomes topologically small relative to $\mathcal{X}^n$, it may be the case that for some $x \in \mathcal{X}_I$, all of its original immediate neighboring datasets are not in $\mathcal{X}_I$,
in which case we say that the neighboring structure of the original data space of the database is substantially disrupted, as seen in Example 4.1. If the disruption is so substantial that neighbors of any distance cease to exist within $\mathcal{X}_I$, we say that the neighborhood structure is completely destroyed.
Then, even for the constrained mechanism $\tilde{M}$, the differential privacy promise becomes vacuously true, since no possible neighboring pairs remain for which the concept of privacy is applicable. However, Example 5.2 demonstrates that a vacuous privacy promise can occur without the neighborhood structure being completely destroyed.
In general, it is conceptually difficult to parse out the share of responsibility for privacy attributable to the data curator under any scenario of mandated disclosure. If certain information is made public, then any information that it logically implies cannot be expected to be protected either. The best that we can expect any privacy mechanism to deliver, then, is protection over the information that truly remains. Notions that serve equivalent purposes to $\mathcal{X}_I$ and $\mathcal{S}_I$ have been proposed in the literature in expositions of new notions of differential privacy, including blowfish and pufferfish privacy (He et al., 2014; Kifer and Machanavajjhala, 2012), where the privacy guarantee is re-conceptualized on the restricted space modulo any structural erosion of the original sample space due to external or auxiliary information available to an attacker. When interpreting the promise of Theorem 2.1, we shall pay due diligence to the case in which immediate neighbors no longer exist, and talk about the differential privacy guarantee only for those neighbors that actually do.
5.2. Other interpretations of privacy
The literature has seen other lines of work that offer interpretations of differential privacy in statistical terms. Notably, the posterior-to-posterior semantics of differential privacy (Dinur and Nissim, 2003; Kasiviswanathan and Smith, 2014; Ashmead et al., 2019) explains the effect of privacy also in the vocabulary of Bayesian posteriors. The posterior-to-posterior semantics establishes differential privacy as a bound for the ratio of posterior probabilities assigned to an individual confidential data entry, when the private mechanism is applied to neighboring datasets that differ in only one entry. The said ratio is between the two quantities
(5.1) $\pi\left( x_i = a \mid M_\epsilon(x) = s^* \right) \quad \text{and} \quad \pi\left( x_i = a \mid M_\epsilon(x') = s^* \right),$
where $x$ and $x'$ are neighboring datasets. What varies between the two posterior quantities is the confidential dataset on which the private query is applied. The datasets $x$ and $x'$ are neighboring datasets, one of which presumably (but not necessarily) contains the true value of the $i$th confidential data entry $x_i$, and the other contains a fabricated value of it.
The comparison in (5.1) raises the question of what is meant by the conditional probability of $x_i$ given a private query constructed from a database that does not contain its true value, as this conditional probability hinges on external knowledge about how a fabricated database may inform the actual confidential database. Our prior-to-posterior semantics formulated in Theorem 1.3 takes a practical point of view and avoids such conceptual complication. We compare the disclosure risk before and after an actual release, reflecting the core idea behind differential privacy.
5.3. Full privacy or vacuous knowledge
As alluded to in Section 1.4, the notion of vacuous knowledge cannot be appropriately captured by ordinary probabilities. The defect reflects a fundamental inability of the language of probability to express a true lack of knowledge, a central struggle in the Bayesian literature that motivated endeavors in search of the so-called “objective priors” (Ghosh, 2011). Neither the uniform distribution nor any other reference distribution is truly information-free, as they all invariably invoke some principle of indifference in relation to a specific criterion (such as the Lebesgue measure, the counting measure, or the likelihood function), which is subject to debate.
To supply the rigorous definition needed to define $M_0$ probabilistically in Section 1.4, we invoke the concept of lower probability functions, and as a special case belief functions (see e.g. Gong, 2019b), both generalized versions of a probability function which amount to a set of probability distributions on a given space. The statistical information contained in $M_0$ is represented by the vacuous lower probability function, denoted $\underline{P}$, which takes the value $1$ only on the full space, and $0$ everywhere else. Equivalently stated in terms of its conjugate upper probability function $\overline{P}$,
(5.2) $\overline{P}(B) = 1 \;\; \text{for all } B \neq \emptyset, \quad \text{and} \quad \overline{P}(\emptyset) = 0.$
That is, the statistical information contained in $M_0$ can be (but is not known to be) concordant with any Borel-measurable probability function; thus the probability of any $B$ is as low as $0$ and as high as $1$, as long as $B$ is neither the full set nor the empty set.
Generally, the conditioning operation involving lower probability functions is not trivial, and it is not unique due to the existence of several applicable rules. But if the lower probability function being conditioned on is vacuous, there is consensus among the different rules as to what posterior distribution should result, namely precisely as stated in (1.7). See (Gong and Meng, 2020) for an extended exposition of conditioning rules involving lower probability and belief functions.
5.4. Future directions
This work points to several future directions of pursuit. On the computational front, how do we construct efficient algorithms to realize congenial privacy, by drawing possibly high-dimensional releases subject to complex constraints? When we use non-perfect Markov chain Monte Carlo to accomplish this task, how do we ensure the declared privacy guarantee is not destroyed because a chain cannot run indefinitely? On the privacy front, for every constrained mechanism constructed through conditioning, how do we find the tightest nominal value that tracks as closely as possible the effective privacy loss budget, which in turn enables fair performance comparisons among invariant-respecting algorithms? Furthermore, how do we achieve an orthogonal decomposition of the public, invariant information from the free, residual information that remains in the confidential microdata, in a logical sense, without having to resort to the probabilistic vocabulary of statistical independence?
Acknowledgements.
The authors thank Jeremy Seeman, Salil Vadhan, Guanyang Wang and three anonymous reviewers for helpful suggestions. Ruobin Gong gratefully acknowledges research support by the National Science Foundation (NSF) DMS-1916002. Xiao-Li Meng also thanks NSF for partial financial support while completing this article.
References
- Abowd et al. (2019) John Abowd, Robert Ashmead, Simson Garfinkel, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, and William Sexton. 2019. Census TopDown: Differentially Private Data, Incremental Schemas, and Consistency with Public Knowledge. Technical Report. US Census Bureau.
- Ashmead et al. (2019) Robert Ashmead, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, and William Sexton. 2019. Effective Privacy After Adjusting for Invariants with Applications to the 2020 Census. Technical Report. US Census Bureau.
- Blocker and Meng (2013) Alexander W Blocker and Xiao-Li Meng. 2013. The potential and perils of preprocessing: Building new foundations. Bernoulli 19, 4 (2013), 1176–1211.
- Dinur and Nissim (2003) Irit Dinur and Kobbi Nissim. 2003. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 202–210.
- Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference. Springer, 265–284.
- Dwork and Roth (2014) Cynthia Dwork and Aaron Roth. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4 (2014), 211–407.
- Fay (1992) Robert E Fay. 1992. When Are Inferences from Multiple Imputation Valid? US Census Bureau.
- Fioretto and Van Hentenryck (2019) Ferdinando Fioretto and Pascal Van Hentenryck. 2019. Differential privacy of hierarchical census data: An optimization approach. In International Conference on Principles and Practice of Constraint Programming. Springer, 639–655.
- Ghosh (2011) Malay Ghosh. 2011. Objective priors: An introduction for frequentists. Statist. Sci. 26, 2 (2011), 187–202.
- Gong (2019a) Ruobin Gong. 2019a. Exact inference with approximate computation for differentially private data via perturbations. arXiv preprint arXiv:1909.12237 (2019).
- Gong (2019b) Ruobin Gong. 2019b. Simultaneous Inference under the Vacuous Orientation Assumption. Proceedings of Machine Learning Research 103 (2019), 225–234.
- Gong (2020) Ruobin Gong. 2020. Transparent Privacy is Principled Privacy. arXiv preprint arXiv:2006.08522 (2020).
- Gong and Meng (2020) Ruobin Gong and Xiao-Li Meng. 2020. Judicious judgment meets unsettling updating: dilation, sure loss, and Simpson’s paradox. Statist. Sci. (2020).
- He et al. (2014) Xi He, Ashwin Machanavajjhala, and Bolin Ding. 2014. Blowfish privacy: Tuning privacy-utility trade-offs using policies. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 1447–1458.
- Kasiviswanathan and Smith (2014) Shiva P Kasiviswanathan and Adam Smith. 2014. On the 'Semantics' of Differential Privacy: A Bayesian Formulation. Journal of Privacy and Confidentiality 6, 1 (2014).
- Kifer and Machanavajjhala (2012) Daniel Kifer and Ashwin Machanavajjhala. 2012. A rigorous and customizable framework for privacy. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems. 77–88.
- Kott (1995) PS Kott. 1995. A paradox of multiple imputation. In Proceedings of the Section on Survey Research Methods. American Statistical Association, 384–389.
- Liu (1996) Jun S Liu. 1996. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and computing 6, 2 (1996), 113–119.
- Meng (1994) Xiao-Li Meng. 1994. Multiple-imputation inferences with uncongenial sources of input. Statist. Sci. (1994), 538–558.
- Rubin (2004) Donald B Rubin. 2004. Multiple imputation for nonresponse in surveys. John Wiley & Sons.
- US Census Bureau (2020) US Census Bureau. 2020. 2010 Demonstration Data Products. https://www.census.gov/programs-surveys/decennial-census/2020-census/planning-management/2020-census-data-products/2010-demonstration-data-products.html [Accessed: 04-08-2020].
- Xie and Meng (2017) Xianchao Xie and Xiao-Li Meng. 2017. Dissecting Multiple Imputation from a Multi-Phase Inference Perspective: What Happens When God’s, Imputer’s and Analyst’s Models Are Uncongenial? Statistica Sinica (2017), 1485–1545.
Appendix A Privacy Guarantees for the Congenial, $\ell_2$, and $\ell_1$ Methods
We first derive the conditional distribution of two i.i.d. Laplace random variables, given that their sum is zero. Let and denote , . Since is linear in , their joint probability density function is given by.
This implies that
which leads to (4.6).
We then derive the density for , where (the case of or is trivial). This covers the projection case, where any is acceptable, and the projection case, where . Since , the Jacobian from to is . Consequently,
To derive , we assume without loss of generality . Consider on three regions:
Summing up these terms and noting the symmetry of , we obtain
(A.1) |
Remark I
The expression (A.1) is fascinating. It shows that the density of a convex combination of i.i.d. Laplace random variables is a “mixture”, but a non-convex combination, of two Laplace densities with different scale parameters, because
That is, although the two weights add up to one, they always take the opposite sign when
Remark II
When the two combination weights are equal, the expression (A.1) is of indeterminate form, but it is well-defined once we take the limit and apply L'Hôpital's rule, yielding
(A.2) |
This can be written as the sum of two independent Laplace random variables, each with half the scale, suggesting that it is more variable than a single Laplace random variable of that scale. We now prove that (A.2), and moreover the entire family of distributions given by (A.1), is $\epsilon$-differentially private. In fact, they attain a level of privacy protection more stringent than $\epsilon$ whenever the two combination weights are unequal. Our proof relies on the following general result, which can be useful for verifying differential privacy guarantees in other situations.
Theorem A.1.
Suppose is a positive real-valued function on a normed vector space , with its norm denoted by . Suppose has the following properties:
(i): is monotone decreasing in ;
(ii): is monotone increasing in , where is a positive constant.
Then for any and , we have
(A.3) |
Proof.
To apply this result, we first note that for of (A.1) with , its derivative is given by
for any . For , we can directly verify from (A.2) that
Hence, condition (i) holds for (A.1) for all .
To establish condition (ii), the choice is the key since it directly governs the degree of privacy guarantee. From the expression (A.3), we want the smallest such that condition (ii) holds, which gives us the tightest bound hence better privacy guarantee. A good strategy here is to start with and let the mathematics tell us how to minimize over . We start with the simplest case with . From the expression (A.2), the smallest that can make monotone increasing is clearly 1. The resulting also has the property that
(A.4) |
This implies that the bound can be approached arbitrarily closely by letting , which means that the privacy loss budget cannot be reduced. For our current application, this means the post-processing by projection is also differentially private at level , but not more stringent than that.
When , we assume without loss of generality . Then for , it is easy to verify that for any ,
where and . Our job is to seek the smallest such that this derivative is non-negative regardless of the value of . Clearly the positivity holds when we set , that is , and hence . To show that this is the smallest possible , we see that when setting , we have
where . Clearly as , the ratio above goes to regardless of the value and as long as they are fixed. Consequently, the same implication from (A.4) follows, that the bound can be approached arbitrarily closely with , hence it cannot be further improved. That is, for post-processing via projection, the actual differential privacy protection achieved is when (and when ). This makes intuitive sense. For example, when the injected noise is drawn from a single distribution, corresponding to a privacy loss budget of .
In summary, for any , the attained privacy loss budget for is .
Appendix B Sampling scheme for the Double Geometric distribution
The Double Geometric mechanism, as introduced in Definition 1.2, utilizes additive noise whose cumulative mass function is given by
with quantile function
Hence, one way to sample a Double Geometric random variable is via inverse probability sampling. That is,
where is given by (1.3). This method is implemented for all numerical examples illustrated in this paper.
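A sketch of this inverse-probability sampler is given below; the closed-form CDF and quantile expressions are our derivation from the pmf (1.3), since the original formulas are not reproduced above.

```python
import numpy as np

def double_geometric_quantile(v, alpha):
    """Quantile function of the Double Geometric pmf (1.3),
    p(u) = (1 - alpha) / (1 + alpha) * alpha**|u|, u integer.
    Derived from the closed-form CDF F(u) = alpha**(-u) / (1 + alpha) for
    u <= -1 and F(u) = 1 - alpha**(u + 1) / (1 + alpha) for u >= 0."""
    v = np.clip(np.asarray(v, dtype=float), 1e-12, 1 - 1e-12)  # guard endpoints
    neg = v <= alpha / (1 + alpha)
    k_neg = np.ceil(-np.log(v * (1 + alpha)) / np.log(alpha))
    k_pos = np.ceil(np.log((1 - v) * (1 + alpha)) / np.log(alpha) - 1)
    return np.where(neg, k_neg, k_pos).astype(int)

def double_geometric_sample(alpha, size=None, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    return double_geometric_quantile(rng.uniform(size=size), alpha)

# Sanity check: empirical mass at 0 should be close to (1 - alpha)/(1 + alpha).
alpha = np.exp(-0.5)
draws = double_geometric_sample(alpha, size=200_000, rng=np.random.default_rng(3))
print(np.mean(draws == 0), (1 - alpha) / (1 + alpha))
```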
Appendix C Performance Diagnostics of the MIS Algorithm
The acceptance rate of Algorithm 1 is shown in Figure 2 as a function of the proposal's inverse scale parameter. The acceptance rate is the highest in this example when the proposal parameter is set just slightly larger than the per-cell privacy loss budget of the unconstrained privacy mechanism.
Figure 3 shows traceplots of draws from Algorithm 1 for, respectively, the second cell (in the proposal index set $\mathcal{J}$) and the first cell (not in the proposal index set) of the constrained differentially private contingency table.

