
Congenial Differential Privacy under Mandated Disclosure

Ruobin Gong, Rutgers University, 110 Frelinghuysen Road, Piscataway, New Jersey 08854, [email protected]  and  Xiao-Li Meng, Harvard University, 1 Oxford St, Cambridge, Massachusetts 02138, [email protected]
(2020)
Abstract.

Differentially private data releases are often required to satisfy a set of external constraints that reflect the legal, ethical, and logical mandates to which the data curator is obligated. The enforcement of constraints, when treated as post-processing, adds an extra phase in the production of privatized data. It is well understood in the theory of multi-phase processing that congeniality, a form of procedural compatibility between phases, is a prerequisite for the end users to straightforwardly obtain statistically valid results. Congenial differential privacy is theoretically principled, which facilitates transparency and intelligibility of the mechanism that would otherwise be undermined by ad-hoc post-processing procedures. We advocate for the systematic integration of mandated disclosure into the design of the privacy mechanism via standard probabilistic conditioning on the invariant margins. Conditioning automatically renders congeniality because any extra post-processing phase becomes unnecessary. We provide both initial theoretical guarantees and a Markov chain algorithm for our proposal. We also discuss intriguing theoretical issues that arise in comparing congenial differential privacy and optimization-based post-processing, as well as directions for further research.

Belief Function, conditioning, invariants, post-processing, statistical intelligibility, uncongeniality, Monte Carlo
journalyear: 2020; copyright: acmlicensed; conference: Proceedings of the 2020 ACM-IMS Foundations of Data Science Conference, October 19–20, 2020, Virtual Event, USA; booktitle: Proceedings of the 2020 ACM-IMS Foundations of Data Science Conference (FODS '20), October 19–20, 2020, Virtual Event, USA; price: 15.00; doi: 10.1145/3412815.3416892; isbn: 978-1-4503-8103-1/20/10

1. Privacy as data processing

1.1. A blurry yet essential picture

The curation and dissemination of large-scale datasets benefit science and society by supplying factual knowledge that assists discoveries and policy decisions and promotes the transparency of information. As more data become accessible to more entities, however, unobstructed access to information collected from individuals poses the risk of infringing on their privacy. Differential privacy is a mathematical concept that quantifies the extent of disclosure of confidential information in a database. It enjoys several advantages over its predecessors in statistical disclosure limitation. Most important of all, especially to data analysts who wish to conduct statistical inference on privatized data releases, is the transparency of the algorithm. The mechanism through which the privatized data release is generated can be spelled out in full, with its statistical properties fully understood. This enables analysts to incorporate the mechanism as part of a model, thereby permitting statistically valid inference from the privatized release (Gong, 2020).

The protection of confidential data with differential privacy relies on the careful design of a probabilistic mechanism, one that can veil the microscopic identities of individual respondents while preserving the macroscopic aspects of the data with high fidelity. The probabilistic nature of the mechanism is necessary, and is what enables such a tradeoff to be made (Dinur and Nissim, 2003). Typically, a differentially private mechanism injects a random perturbation into an otherwise deterministic query to be applied to the confidential database. One can say, then, that differentially private releases are “blurry” versions of the confidential data, just the same way a skilled Impressionist painter captures the essence of a pond of waterlilies without sketching out the contour of every petal and leaf.

When randomness is involved, however, certain truthful aspects of the data are invariably compromised, no matter how well-designed the privacy algorithm may be. Imagine a picture of waterlilies commissioned, not by an art collector, but by a botanist whose sole purpose is to study the structural formation of the plant, such as the exact length and width of its petals. She would be terribly disappointed at the Impressionist painting, even if it were the work of Claude Monet himself!

Circumstances in practice dictate that certain aspects of the data release be deemed unfit to be tampered with. These are usually key statistics reflecting the fundamental purpose of data collection, as required by law, policy, or other external constraints put forth by the stakeholders. The data curator is mandated to disclose these statistics accurately at any expense, while at the same time shielding the remainder of the data release with a veil of privacy. This poses a challenge to the design of the privacy mechanism subject to mandated disclosure. The central question is how to integrate data privatization, an inherently random act, with the mandated disclosure of partial yet deterministic information, while maintaining both the logical consistency of the overall data release and the quality of the privacy protection.

1.2. Congenial privacy

In this work we conceptualize the privacy mechanism as one of the many phases of data processing. The concept of congeniality, or rather uncongeniality (Meng, 1994), is then relevant. The theory of uncongeniality was developed to investigate a seemingly paradoxical phenomenon discovered by researchers dealing with imputations for the U.S. Decennial Census and similar public-use data files. Fay (Fay, 1992) and Kott (Kott, 1995) found that one can have inconsistent variance estimation for multiple imputation inference (Rubin, 2004), even when both the imputation model and the analysis procedure are valid. The issue turned out to be a mathematical incompatibility between the imputation model and the analysis procedure, even if neither is incompatible with the underlying model that generates the data. In other words, there is no probabilistically coherent model that can simultaneously imply the imputation model and the analysis procedure, a situation termed uncongeniality by Meng (Meng, 1994). To make matters worse, the imputation model, such as the one adopted by the Census Bureau, is typically not disclosed to the analyst of the imputed data, or at least not fully (e.g., due to confidential information used to help better predict the missing values). The lack of transparency makes it impossible for the analyst to correct for the uncongeniality, and worse, even to realize the problem.

The framework of uncongeniality was later generalized to the multivariate setting (Xie and Meng, 2017) and to general multi-phase processing (Blocker and Meng, 2013), which covers the current application by the same overarching principles. Two properties are critical for good privacy practice: transparency and congeniality. When the protection of privacy must observe mandated constraints, our proposal is to use conditional distributions as derived from the original unconstrained differential privacy mechanism conditioning on invariant margins. This approach achieves both automatically. Transparency is automatic, because the conditional distribution is determined by the original unconstrained distribution and the invariant constraints, both fully disclosed by design. Congeniality stipulates the use of a single coherent probabilistic model to ensure both differential privacy and mandated disclosure. This requirement is automatically satisfied when we use the conditional distribution derived from the original differential privacy mechanism, restricted solely to obey the mandated disclosure. A third advantage is that our proposal does not need any additional choice of procedural ingredients, such as projection distance, which is required by optimization-based post-processing such as adopted by the Census TopDown algorithm (Abowd et al., 2019).

There is, however, no free lunch. The first price we pay for congenial privacy is computational. Sampling from a distribution truncated on some space, as determined by the invariant margins, is generally not trivial, especially when the truncated region is of irregular shape. The second price is that we may pay more privacy loss budget than necessary, since the budget designed for the original mechanism depends on the sensitivity of the query measured on the unconstrained space, which is larger than that for the constrained space. When deriving the appropriate class of conditional distributions for the constrained mechanism, the new privacy loss budget should ideally be calibrated directly according to the query behavior on the constrained space, as opposed to being inherited from the unconstrained mechanism, which would impose a likely overly conservative level of privacy protection calibrated to the entire space. When the analytical complexity and computational requirements of the two approaches to budget calibration are similar, we certainly recommend the former.

1.3. The mechanism of differential privacy

Let $x=\left(x_{1},\ldots,x_{n}\right)$ denote a database consisting of $n$ individuals, and $\mathcal{X}$ the space on which it is defined. A query function $s:\mathcal{X}\to\mathbb{R}^{d}$ embodies the knowledge contained in the database that stakeholders – scientists, policy makers and the general public – would like to learn. What determines the value of $s\left(x\right)$ is of course $x$, or equivalently all its component $x_{i}$ values, corresponding to the individual respondents included in the database. It is precisely these individual records, the $x_{i}$'s, that are the subject of privacy protection. How can the data curator say useful things about $s\left(x\right)$, while saying barely anything about each of the $x_{i}$'s?

The mechanism that can instill privacy into the curator's release appeals to randomness. A random query function $M:\mathcal{X}\to\mathbb{R}^{d}$ is said to satisfy $\epsilon$-differential privacy (Dwork et al., 2006), if for all pairs of neighboring datasets $\left(x,x^{\prime}\right)\in\mathcal{X}\times\mathcal{X}$, we have that

(1.1) P\left(M\left(x\right)\in B\right)\leq\exp\left(\epsilon\right)P\left(M\left(x^{\prime}\right)\in B\right)

for all Borel-measurable sets $B\in\mathscr{B}\left(\mathbb{R}^{d}\right)$. In this work, the term neighboring datasets means that $x$ and $x^{\prime}$ differ by exactly one individual's record, i.e. for some $j=1,\ldots,n$, $x_{j}\neq x_{j}^{\prime}$ but for all $i\neq j$, $x_{i}=x_{i}^{\prime}$. Write $\mathtt{d}\left(x,x^{\prime}\right)=1$ if $x$ and $x^{\prime}$ are neighbors. This concept of neighbor is employed in the definition of $\epsilon$-bounded differential privacy (Abowd et al., 2019), and is distinct from the original formulation, which defined neighbors as a pair of datasets that differ by the addition or deletion of a single record. There are many ways to design an $\epsilon$-differentially private algorithm $M$, among which the most widely known and implemented are the Laplace and the Double Geometric algorithms (Dwork et al., 2006; Fioretto and Van Hentenryck, 2019), both of which are additive $\epsilon$-differentially private mechanisms.

Definition 1.1 (Laplace mechanism).

Let $s:\mathcal{X}\to\mathbb{R}^{d}$ be a deterministic query function. The Laplace mechanism is given by

(1.2) M\left(x\right):=s\left(x\right)+\left(U_{1},\ldots,U_{d}\right),

where the $U_{i}$'s are independent zero-mean Laplacian random variables, each with dispersion parameter $\epsilon^{-1}\nabla\left(s\right)$, and

\nabla\left(s\right)=\sup_{\left(x,x^{\prime}\right)\in\mathcal{X}\times\mathcal{X}}\left\{\left\|s\left(x\right)-s\left(x^{\prime}\right)\right\|:\mathtt{d}\left(x,x^{\prime}\right)=1\right\}

is the global sensitivity of $s$. When the database consists of binary records in an unrestricted domain, and $s$ is the counting or histogram query, $\nabla\left(s\right)=1$.

Definition 1.2 (Double Geometric mechanism).

A random query mechanism $M$ is called the Double Geometric mechanism if it has the same functional form as (1.2), with the $U_{i}$ random variables defined on the integers with probability mass function

(1.3) p_{i}\left(u\mid\epsilon\right)=\frac{1-a}{1+a}\cdot a^{\left|u\right|},

where the parameter $a=a\left(\epsilon,s\right)=\exp\left(-\epsilon/\nabla\left(s\right)\right)$.
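
To make the two additive mechanisms concrete, the following is a minimal Python sketch (our own illustration; function names are not from the original text), assuming a counting or histogram query with global sensitivity $\nabla(s)=1$. The double geometric draw is generated as the difference of two i.i.d. geometric variables, which has exactly the probability mass function (1.3).

```python
import numpy as np

rng = np.random.default_rng(2020)

def laplace_mechanism(s, epsilon, sensitivity=1.0):
    """Definition 1.1: add independent zero-mean Laplace noise with
    dispersion sensitivity/epsilon to each coordinate of the query s."""
    scale = sensitivity / epsilon
    return s + rng.laplace(loc=0.0, scale=scale, size=len(s))

def double_geometric_mechanism(s, epsilon, sensitivity=1.0):
    """Definition 1.2: add integer-valued double-geometric noise with
    a = exp(-epsilon/sensitivity), realized as the difference of two
    i.i.d. geometric variables supported on {0, 1, 2, ...}."""
    a = np.exp(-epsilon / sensitivity)
    g1 = rng.geometric(p=1 - a, size=len(s)) - 1
    g2 = rng.geometric(p=1 - a, size=len(s)) - 1
    return s + (g1 - g2)

# Example: privatize a small histogram with epsilon = 0.5 per cell.
s = np.array([12, 7, 0, 31])
print(laplace_mechanism(s, epsilon=0.5))
print(double_geometric_mechanism(s, epsilon=0.5))
```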

Note that the definition of either the Laplace or the Double Geometric mechanism presents not just one, but a collection of mechanisms that can be written as $\left\{M_{\epsilon}\right\}$, indexed by the privacy loss budget $\epsilon>0$ allocated to the mechanism in question. When regarded as a sequence of statistical procedures, $\epsilon$ serves as an indicator of the statistical quality of the output (with larger $\epsilon$ for higher quality) that can be used to offer interpretation and to guide its own choice. This point will be immediately useful in Section 1.4.

1.4. Statistical intelligibility

An important reason that differential privacy is embraced by the statistics community is that it defines privacy in the language of probability, induced by the mechanism that injects randomness in the data release. An impactful consequence is that the distributional specification of the mechanism can be made fully transparent without sabotaging the promised protection. This opens the door for systematic analysis of the statistical property of mechanisms, which is in turn crucial to the accurate interpretation of statistical inference from privatized data releases (Gong, 2019a). The clarity both in definition and in implementation makes up the statistical intelligibility of differential privacy as a data processing procedure.

We now discuss the statistical interpretation of the privacy mechanism, which inspired the conditioning approach to constructing the invariant-respecting mechanism in the first place. The degree of protection exerted by a privacy mechanism on the confidential database is seen as a calculated limit on the statistical knowledge it is able to supply, as a function of the privacy loss budget allotted to the mechanism. Compare the quantities

(1.4) \pi\left(x_{i}=\omega\right)\;\;\text{and}\;\;\pi\left(x_{i}=\omega\mid M_{\epsilon}\left(x\right)\in B\right),

which are the analyst's prior probability about the value of the $i$th entry of the dataset, versus her posterior probability if an $\epsilon$-differentially private query released a report $B$. Thus, the statistical meaning of $M_{\epsilon}$ can be explained as follows.

Theorem 1.3.

Let $x=\left(\cdots,x_{i},\cdots\right)\in\mathcal{X}$ be the database, and $\{M_{\epsilon}:\mathcal{X}\to\mathbb{R}^{d},\epsilon>0\}$ a class of $\epsilon$-differentially private procedures operating on $x$. Then for every $B\in\mathscr{B}\left(\mathbb{R}^{d}\right)$, $\epsilon>0$ and every prior probability $\pi$ the analyst harbors about $x_{i}$,

(1.5) \pi\left(x_{i}=\omega\mid M_{\epsilon}\left(x\right)\in B\right)\in\bigg[\exp\left(-\epsilon\right)\pi\left(x_{i}=\omega\right),\ \exp\left(\epsilon\right)\pi\left(x_{i}=\omega\right)\bigg].
Proof.

The posterior probability $\pi\left(x_{i}=\omega\mid M_{\epsilon}\left(x\right)\in B\right)$ can be written as

\frac{P\left(M_{\epsilon}\left(x\right)\in B\mid x_{i}=\omega\right)\pi\left(x_{i}=\omega\right)}{\sum_{\omega^{\prime}}P\left(M_{\epsilon}\left(x\right)\in B\mid x_{i}=\omega^{\prime}\right)\pi\left(x_{i}=\omega^{\prime}\right)}.

The result then follows immediately from the fact that $M_{\epsilon}$ is $\epsilon$-differentially private, which means that for any $B\in\mathscr{B}\left(\mathbb{R}^{d}\right)$,

\exp{(-\epsilon)}\leq\frac{P\left(M_{\epsilon}\left(x\right)\in B\mid x_{i}=\omega^{\prime}\right)}{P\left(M_{\epsilon}\left(x\right)\in B\mid x_{i}=\omega\right)}\leq\exp{(\epsilon)}. ∎

Theorem 1.3 says that any release generated by an $\epsilon$-differentially private procedure sharpens the analyst's knowledge about $x_{i}$ by at most a factor of $\exp(\epsilon)$. This interpretation provides a direct link between the differential privacy promise and the actual posterior risk of disclosure due to the release of the random query $M_{\epsilon}$.
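
For a concrete sense of scale (our own numeric illustration), suppose the analyst's prior probability is $\pi\left(x_{i}=\omega\right)=0.10$ and the privacy loss budget is $\epsilon=0.5$. Then (1.5) confines the posterior, whatever the released set $B$ may be, to

\pi\left(x_{i}=\omega\mid M_{0.5}\left(x\right)\in B\right)\in\left[e^{-0.5}\times 0.10,\ e^{0.5}\times 0.10\right]\approx\left[0.061,\ 0.165\right].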

Recall that the Laplace and the Double Geometric mechanisms are both well-defined for any $\epsilon>0$. However, in the limiting case of $\epsilon\to 0$, i.e. as the privacy loss budget becomes increasingly restrictive, both algorithms amount to adding noise with increasingly large variance to the confidential query. At $\epsilon=0$, neither mechanism remains well-defined, since the distributions of the noise component become improper due to the infinite cardinality of their respective domains. Nevertheless, the definition of $\epsilon$-differential privacy allows for the expression with $\epsilon=0$. A mechanism is 0-differentially private if one cannot gain any discriminatory knowledge from its release about the underlying database whatsoever. In other words, the analyst's knowledge about the individual state of $x_{i}$ must remain the same as her prior. This notion can be explained consistently with Theorem 1.3, by observing that

(1.6) \lim_{\epsilon\to 0}\pi\left(x_{i}=\omega\mid M_{\epsilon}\left(x\right)\in B\right)=\pi\left(x_{i}=\omega\right),

where the limit is implied by (1.5). This inspires the following deliberate construction of $M_{0}$ as a 0-differentially private procedure.

Definition 1.4 (0-differentially private procedure).

For $\{M_{\epsilon}\}$ a class of $\epsilon$-differentially private procedures well-defined for $\epsilon>0$ but not $\epsilon=0$, define $M_{0}$ as the 0-differentially private procedure such that for every prior probability $\pi$,

(1.7) \pi\left(x_{i}=\omega\mid M_{0}\left(x\right)\in B\right)=\pi\left(x_{i}=\omega\right),\;\forall B\in\mathscr{B}\left(\mathbb{R}^{d}\right).

Definition 1.4 grants conceptual continuity to $M_{0}$, a perfectly meaningful object in the privacy sense but one lacking statistical intelligibility from the mechanistic point of view. For practical purposes, $M_{0}$ should be taken to mean the suppression procedure, which supplies vacuous knowledge to the analyst about the state of affairs of the database. Theoretically, the meaning of $M_{0}$ as a probabilistic mechanism cannot be supplied by ordinary probabilities, because any probability specification represents a set of specific knowledge about the relative frequencies of any pair of states. However, its meaning can be quantified precisely in the more general framework of imprecise probability, as Section 5 will discuss.

2. Constructing congenial privacy

2.1. Privacy with invariants

While privacy protection is called for, the data curator may be simultaneously mandated to disclose certain aspects of the data exactly as they are observed, without subjecting them to any privacy protection. This collection of information is referred to as invariant information, or invariants. In practice, invariants are often defined according to a set of exact statistics calculated from the confidential database (Ashmead et al., 2019). For example, suppose $s$ is the histogram query which tabulates the population residing in each county of the state of New Jersey from the 2020 U.S. Census. When producing a differentially private version of the histogram, the Census Bureau is constitutionally mandated to report the total population of each state as enumerated. This means that the privatized histogram $M(x^{*})$ must possess the same total population size as $s(x^{*})$, where $x^{*}$ is the confidential Census microdata; or in notation, $\left\|M\left(x^{*}\right)\right\|=\left\|s\left(x^{*}\right)\right\|$.

Suppose that the Double Geometric mechanism is to be applied to the histogram query $s$. Due to the random nature of the perturbations, a single realization of the mechanism will with high probability produce $\left\|M\left(x^{*}\right)\right\|\neq\left\|s\left(x^{*}\right)\right\|$. Furthermore, $M\left(x^{*}\right)$ has a positive probability of containing negative cell counts, which is logically impossible for the confidential query $s\left(x^{*}\right)$. The challenge of privacy preservation under mandated disclosure is thus to find an alternative mechanism, say $\tilde{M}$, such that every realization of $\tilde{M}$ meets all the requirements of mandated information disclosure, while preserving the promise of differential privacy.

Let $\mathcal{X}^{*}\subset\mathcal{X}$ be the set of $x$'s that obey the given invariants. In turn, $\mathcal{X}^{*}$ defines the set of values that the query must satisfy as

(2.1) \mathcal{S}^{*}=\left\{s\left(x\right)\in\mathbb{R}^{d}:x\in\mathcal{X}^{*}\right\}.

Note that implicitly, $\mathcal{S}^{*}=\mathcal{S}^{*}\left(x^{*}\right)$ is a set-valued function of the confidential dataset $x^{*}$, because the invariant knowledge we intend to impose on the private release is implied by $x^{*}$.

A random mechanism $\tilde{M}$ satisfies the mandated disclosure if

(2.2) \tilde{M}\left(x\right)\in\mathcal{S}^{*},\quad\forall x\in\mathcal{X}^{*}.

That is, whenever applied to a database conformal to the mandated disclosure, $\tilde{M}$ is also conformal to the mandated disclosure with mathematical certainty. The size and complexity of the restricted $\mathcal{S}^{*}$ (and $\mathcal{X}^{*}$) relative to their original spaces are crucial to the overall extent to which privacy of the residual information in the database can be protected, a point we will revisit in Section 5.1.

There may be many ways to construct a random mechanism $\tilde{M}$, but not all are equally desirable. We argue that $\tilde{M}$ should be constructed in a principled manner, and a constructive way to achieve that is to use conditional distributions of unconstrained privacy mechanisms. The resulting mechanism can be easily tuned to retain its differential privacy promise, while ensuring its congenial integration into the data processing pipeline and preserving the statistical intelligibility of its releases.

2.2. Imposing invariants via conditioning

Let $M$ be a valid $\epsilon$-differentially private mechanism, which generates outputs that typically do not obey the invariant requirement $M\left(x\right)\in\mathcal{S}^{*}$, even if $x\in\mathcal{X}^{*}$. A natural idea to force the requirement onto $M$ is via conditioning. Define a modified privatization mechanism $M^{*}$, such that the probability distribution it induces is the same as the conditional distribution of $M$ subject to the constraint that $M\left(x\right)\in\mathcal{S}^{*}$. In what follows, we use the notation $Z\overset{L}{=}W$ to mean that $Z$ and $W$ are identically distributed, and $M\left(x\right)\mid M\left(x\right)\in\mathcal{S}^{*}$ to denote a well-specified conditional distribution $P(M\left(x\right)\mid M\left(x\right)\in\mathcal{S}^{*})$. Also assume for now that $P(M\left(x\right)\in\mathcal{S}^{*})>0$. We have the following theorem.

Theorem 2.1.

Let $x^{*}\in\mathcal{X}$ be the confidential database, and $\mathcal{X}^{*}\subset\mathcal{X}$ the invariant subset to which $x^{*}$ conforms. The deterministic function $s:\mathcal{X}\to\mathbb{R}^{d}$ is a query, and the implied $\mathcal{S}^{*}\in\mathscr{B}\left(\mathbb{R}^{d}\right)$ is defined by (2.1). Let $M$ be an $\epsilon$-differentially private mechanism based on $s$, and $M^{*}$ be a constrained mechanism such that

(2.3) M^{*}\left(x\right)\overset{L}{=}M\left(x\right)\mid M\left(x\right)\in\mathcal{S}^{*}.

Then for all invariant-conforming pairs of datasets that are $k$-neighbors, i.e. $\left(x,x^{\prime}\right)\in\mathcal{X}^{*}\times\mathcal{X}^{*}$ such that $\mathtt{d}\left(x,x^{\prime}\right)=k$, there exists a real-valued $\gamma\in[-1,1]$ such that for all $B\in\mathscr{B}\left(\mathbb{R}^{d}\right)$,

(2.4) P\left(M^{*}\left(x\right)\in B\right)\leq\exp\left(\left(1+\gamma\right)k\epsilon\right)P\left(M^{*}\left(x^{\prime}\right)\in B\right).
Proof.

For a pair of $k$-neighboring and $\mathcal{S}^{*}$-conforming datasets $\left(x,x^{\prime}\right)$ and any $B\in\mathscr{B}\left(\mathbb{R}^{d}\right)$,

\frac{P\left(M^{*}\left(x\right)\in B\right)}{P\left(M^{*}\left(x^{\prime}\right)\in B\right)}=\frac{P\left(M\left(x\right)\in B\mid M\left(x\right)\in\mathcal{S}^{*}\right)}{P\left(M\left(x^{\prime}\right)\in B\mid M\left(x^{\prime}\right)\in\mathcal{S}^{*}\right)}
=\frac{P\left(M\left(x\right)\in B\cap\mathcal{S}^{*}\right)}{P\left(M\left(x^{\prime}\right)\in B\cap\mathcal{S}^{*}\right)}\cdot\frac{P\left(M\left(x^{\prime}\right)\in\mathcal{S}^{*}\right)}{P\left(M\left(x\right)\in\mathcal{S}^{*}\right)}.

Each of the last two ratios above is clearly bounded above by $\exp\left(k\epsilon\right)$ and below by $\exp\left(-k\epsilon\right)$, because $M$ is $\epsilon$-differentially private. Consequently, if we let

(2.5) \gamma^{*}=\frac{1}{k\epsilon}\log\left[\max_{\begin{subarray}{c}\left(x,x^{\prime}\right)\in\mathcal{X}^{*}\times\mathcal{X}^{*}\\ \mathtt{d}\left(x,x^{\prime}\right)=k\end{subarray}}\frac{P\left(M\left(x^{\prime}\right)\in\mathcal{S}^{*}\right)}{P\left(M\left(x\right)\in\mathcal{S}^{*}\right)}\right],

then $\gamma^{*}\in[0,1]$, and it is a known constant because $\mathcal{S}^{*}$ is public. Then, (2.4) holds for any $\gamma\in\left[\gamma^{*},1\right]$, and certainly for $\gamma=1$. ∎

Our proof above might create an impression that only $\gamma\geq 0$ is permissible, but Sections 4 and 5.1 will supply two examples, both with $\gamma<0$. Negative $\gamma$ may sound paradoxical, for it seems to suggest that better privacy protection can be achieved by disclosing some information. However, we must be mindful that differential privacy is not about protecting privacy in absolute terms. Rather, it is about controlling the additional disclosure risk from releasing the privatized data to the users (or hackers), relative to what they know before the release. The presence or absence of mandated invariants would ex ante constitute two different bodies of knowledge, hence any additional privacy protection would carry different interpretations too. We also emphasize that Theorem 2.1 generalizes to cases where $P(M(x)\in\mathcal{S}^{*})=0$, such as when $\mathcal{S}^{*}$ is a linear subspace of $\mathbb{R}^{d}$ and the privacy mechanism is continuous. The proof is a bit more involved in order to properly define $P(M\left(x\right)\mid M\left(x\right)\in\mathcal{S}^{*})$, which we will discuss in future work. However, this complication is not a concern for discrete privatization mechanisms, such as the one within the Census TopDown algorithm (Abowd et al., 2019).

Theorem 2.1 holds broadly for arbitrary kinds of $\epsilon$-differentially private mechanisms, as well as for any deterministic invariant information about either the database or the query function that can be expressed in set form. It lends itself to the same kind of posterior interpretation enjoyed by unconstrained differentially private mechanisms. Specifically, if $M^{*}_{\epsilon}$ is the constrained differentially private procedure constructed from the unconstrained procedure $M_{\epsilon}$ according to the specifications of Theorem 2.1, then for all $x\in\mathcal{X}^{*}$ such that $\exists x^{\prime}\in\mathcal{X}^{*}$ with $\mathtt{d}\left(x,x^{\prime}\right)=1$, and $\forall B\in\mathscr{B}\left(\mathcal{S}^{*}\right)$, the analyst's posterior probability $\pi\left(x_{i}=\omega\mid x\in\mathcal{X}^{*},M_{\epsilon}^{*}\left(x\right)\in B\right)$ is bounded in between

\bigg[\exp\left(-\left(1+\gamma\right)\epsilon\right)\pi\left(x_{i}=\omega\mid x\in\mathcal{X}^{*}\right),\ \exp\left(\left(1+\gamma\right)\epsilon\right)\pi\left(x_{i}=\omega\mid x\in\mathcal{X}^{*}\right)\bigg].

This is an interval that bears structural resemblance to (1.5), thanks to the conditional nature of $M^{*}$, which allows statistical information from privacy mechanisms, constrained or otherwise, to be interpreted in the same (hence congenial) way.

The definition of $\epsilon$-differential privacy has the property that, if a mechanism $M$ is $\epsilon_{1}$-differentially private, then for all $\epsilon_{2}\geq\epsilon_{1}$, $M$ is also $\epsilon_{2}$-differentially private. If the invariant information does not substantially disrupt the neighboring structure of the sample space of the database, a notion we will make precise in Section 5, what Theorem 2.1 says is that enforcing the invariant $\mathcal{X}^{*}$ onto the unconstrained mechanism $M$ via conditioning costs $\left(1+\gamma\right)$ times – and at most twice, since $\gamma$ can always be set to $1$ – the privacy loss budget allotted to $M$. When a more cost-effective value of $\gamma$ is hard to determine, the simplest way to ensure a privacy loss budget of $\epsilon$ for $M^{*}$ is to use a budget of $\epsilon_{0}=\epsilon/2$ for the unconstrained mechanism $M$ to begin with; see Section 3.

2.3. A Monte Carlo Implementation

Let $p$ denote the probability distribution, either a mass function or a density function, induced by the unconstrained differentially private algorithm $M$, which depends on the confidential query $s^{*}$. Further denote by $p^{*}$ the corresponding conditional distribution of $p$ constrained to the invariant set $\mathcal{S}^{*}$. The constrained privacy mechanism requires samples from $p^{*}$. The simplest, though often inefficient or even impractical, method is rejection sampling. Since

p^{*}\left(s\right)=\begin{cases}c^{-1}p\left(s\right)&\text{if }s\in\mathcal{S}^{*}\\ 0&\text{otherwise},\end{cases}

where $c=\int_{\mathcal{S}^{*}}p\left(s\right)ds$, one can opt for a proposal density $q$ with support $\mathcal{S}^{*}$ such that $\sup_{s\in\mathcal{S}^{*}}[p(s)/q(s)]\leq R$, and accept a sample $s\sim q$ with probability $p(s)/Rq(s)$. This encompasses the option of setting $q=p$, the unconstrained privacy kernel itself, and sampling until the draw falls into $\mathcal{S}^{*}$. This strategy is clearly inefficient in general, and impossible when $c=0$.
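
As an illustration only (not part of the original text), the rejection strategy with $q=p$ can be sketched as follows in Python, assuming a discrete double geometric kernel and an invariant set that can be checked with an indicator function; all names and the toy numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2020)

def sample_constrained_by_rejection(sample_p, in_S_star, max_tries=100_000):
    """Draw from p* (p restricted to S*) by proposing from the unconstrained
    kernel p and keeping the first draw that lands in S*. Only sensible when
    P(M(x) in S*) is bounded away from zero."""
    for _ in range(max_tries):
        s = sample_p()
        if in_S_star(s):
            return s
    raise RuntimeError("no accepted draw; the invariant set is too restrictive")

# Toy example: a 3-cell double-geometric release whose total must equal the
# confidential total and whose cells must be nonnegative.
s_conf = np.array([4, 9, 2])
a = np.exp(-1.0)   # epsilon = 1 per cell

def sample_p():
    noise = (rng.geometric(1 - a, 3) - 1) - (rng.geometric(1 - a, 3) - 1)
    return s_conf + noise

in_S_star = lambda s: s.sum() == s_conf.sum() and np.all(s >= 0)
print(sample_constrained_by_rejection(sample_p, in_S_star))
```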

Efficient algorithms tailor-made to the specification of $\mathcal{S}^{*}$ are possible. Here, we present in Algorithm 1 a generic approach based on Metropolized Independent Sampling (MIS; Liu, 1996), for the most common case in which the mandated invariants are expressed in terms of a consistent system of linear equalities and inequalities

\mathcal{S}^{*}=\left\{s\in\mathbb{R}^{d}:As=a,Bs\geq b\right\}.

Here $A$ and $B$ are $d_{A}\times d$ and $d_{B}\times d$ matrices with ranks $d_{A}<d$ and $d_{B}<d$ respectively, and $a$ and $b$ are vectors of length $d_{A}$ and $d_{B}$ respectively. The algorithm requires a proposal index set $\mathcal{I}$, a subset of $\left\{1,\ldots,d\right\}$ of size $d-d_{A}$ such that $\text{rank}\left(A_{[\mathcal{I}^{c}]}\right)=d_{A}$, where $A_{[\mathcal{J}]}$ is the submatrix of $A$ consisting of all columns whose indices belong to $\mathcal{J}$. For each $A$, the choice of $\mathcal{I}$ may not be unique, and can have a potential impact on the efficiency of the algorithm.

Algorithm 1 Metropolized Independent Differentially Private Sampler with Invariants
Input: unconstrained privacy mechanism $p$,
        confidential query $s^{*}$, invariant parameters $(A,a,B,b)$,
        proposal distribution $q$, proposal index set $\mathcal{I}$,
        initial value $s^{(0)}\in\mathcal{S}^{*}$, integer nsim;
Iterate: for $t=0,1,\ldots,\text{nsim}-1$, at iteration $t+1$:
    step 1, propose $\tilde{s}$:
         1-1. sample $\tilde{s}_{\mathcal{I}}\sim q$;
         1-2. solve for $\tilde{s}_{\mathcal{I}^{c}}$ in $A_{\left[\mathcal{I}\right]}\tilde{s}_{\mathcal{I}}+A_{\left[\mathcal{I}^{c}\right]}\tilde{s}_{\mathcal{I}^{c}}=a$;
         1-3. write $\tilde{s}=\left(\tilde{s}_{\mathcal{I}},\tilde{s}_{\mathcal{I}^{c}}\right)$;
    step 2, compute $\alpha(s^{\left(t\right)},\tilde{s})=\min\left\{1,\frac{p(\tilde{s}){\bf 1}\left(B\tilde{s}\geq b\right)q\left(s_{\mathcal{I}}^{(t)}\right)}{p\left(s^{(t)}\right)q\left(\tilde{s}_{\mathcal{I}}\right)}\right\}$;
    step 3, set $s^{\left(t+1\right)}=\tilde{s}$ with probability $\alpha(s^{\left(t\right)},\tilde{s})$,
         otherwise set $s^{\left(t+1\right)}=s^{\left(t\right)}$.
Output: a set of draws $\{s^{\left(t\right)}\}$, $t=1,\ldots,\text{nsim}$.

This algorithm is applicable to both discrete and continuous data and privatization schemes, and it does not require the normalizing constant of $p^{*}$. In Section 3 below, we use it to construct a mechanism for differentially private demographic contingency tables subject to both linear equality and inequality invariant constraints.
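
For readers who prefer code, the following is a compact Python sketch of Algorithm 1 (an illustration under our own naming, not a verbatim reference implementation), assuming the double geometric kernel and proposal of Section 3. The double-geometric log weight is applied to the solved coordinates $\tilde{s}_{\mathcal{I}^{c}}$ as a density-like surrogate when those coordinates are not integer-valued.

```python
import numpy as np

rng = np.random.default_rng(2020)

def dgeom_logpmf(u, eps):
    """Log weight of double-geometric noise (1.3): log[(1-a)/(1+a)] - eps*|u|."""
    a = np.exp(-eps)
    return np.log(1 - a) - np.log(1 + a) - eps * np.abs(u)

def dgeom_draw(size, eps):
    """Double-geometric draw as the difference of two geometric variables."""
    a = np.exp(-eps)
    return (rng.geometric(1 - a, size) - 1) - (rng.geometric(1 - a, size) - 1)

def mis_sampler(s_conf, eps, eps_tilde, A, a_vec, B, b_vec, I, s0, nsim):
    """Sketch of Algorithm 1 (Metropolized Independent Sampling)."""
    d = len(s_conf)
    Ic = np.setdiff1d(np.arange(d), I)
    A_I, A_Ic = A[:, I], A[:, Ic]
    log_p = lambda s: dgeom_logpmf(s - s_conf, eps).sum()
    log_q = lambda sI: dgeom_logpmf(sI - s_conf[I], eps_tilde).sum()

    s, draws = np.array(s0, dtype=float), np.empty((nsim, d))
    for t in range(nsim):
        # Step 1: propose free coordinates from q, then solve A s = a for the rest.
        sI = s_conf[I] + dgeom_draw(len(I), eps_tilde)
        sIc = np.linalg.solve(A_Ic, a_vec - A_I @ sI)
        prop = np.empty(d)
        prop[I], prop[Ic] = sI, sIc
        # Steps 2-3: accept with the MIS ratio, zeroed out when B s >= b fails.
        if np.all(B @ prop >= b_vec):
            log_alpha = (log_p(prop) + log_q(s[I])) - (log_p(s) + log_q(sI))
            if np.log(rng.uniform()) < log_alpha:
                s = prop
        draws[t] = s
    return draws
```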

3. Contingency Table with Invariants

Confidential table:
Age      <5  6-10  11-15  16-17  18-19  20  21  22-24  25-29  30-34  35-39  40-44  45-49  50-54
Female    8     6      3      6      4   4   4      8      5      7      7      6      1      5
Male      3     4      5      8      6   4   5      5      5      6     10      7      3      2
Age    55-59  60-61  62-64  65-66  67-69  70-74  75-79  80-84  85+  Total
Female     4      4      9      6      2      8      8      8    7    130
Male       5     11      6      4      7      4      5      3    8    126
Under-18 population: 43; voting age (18+) population: 213; total population: 256.

Constrained differentially private table:
Age      <5  6-10  11-15  16-17  18-19  20  21  22-24  25-29  30-34  35-39  40-44  45-49  50-54
Female    6     6      4      3      2  10   2      5      6      7      5      6      0      5
Male      9     4      5      6      3   4   5      5      3      4      7      8      5      3
Age    55-59  60-61  62-64  65-66  67-69  70-74  75-79  80-84  85+  Total
Female     4      5     10      8      3      8      8     12    5    130
Male       2     23      8      2      6      3      0      3    8    126
Under-18 population: 43; voting age (18+) population: 213; total population: 256.

Table 1. A confidential sex $\times$ age contingency table (top) and a corresponding constrained differentially private release (bottom), subject to total population, proportion female population, and voting age population constraints (the Total column and the summary lines).

The table we consider is of dimension $2\times 23$, with rows representing sex (male/female), and columns representing age bucketed roughly every four years, with finer buckets around key age ranges such as 18, 21 and 60. This data structure corresponds to the 2010 Census Table #P12 and 2020 Census Table #P7 (US Census Bureau, 2020), one of the most referenced types of contingency table releases by the Census Bureau at various geographic levels. For the purpose of computational illustration, this example uses simulation to construct synthetic datasets that represent the confidential Census demographic data.

Let $s$ be a vector of length $46$, denoting the row-vectorized contingency table. The constraints to be imposed on the differentially private table include

  (1) total population;

  (2) proportion of female population;

  (3) total voting age population (18+ age); and

  (4) nonnegative table entries.

Items (1) to (3) constitute equality constraints, and item (4) inequality constraints. The unconstrained privacy mechanism that serves as the basis of our construction is the Double Geometric mechanism, with distribution function

p\left(s\right)=\prod_{i=1}^{46}p_{i}\left(s_{i}-s_{i}^{*}\mid\epsilon\right),

where $p_{i}$ is as defined in (1.3), and the privacy loss budget is set to $\epsilon=0.5$ per cell. The proposal distribution is set to be of the same family as the unconstrained privatization algorithm, but it is given a distinct dispersion parameter $\tilde{\epsilon}$ to tune for the best algorithmic performance, in this case the acceptance probability of the algorithm. The proposal distribution function for $\tilde{s}_{\mathcal{I}}$, the $(d-d_{A})$-length subvector of the proposal $\tilde{s}$, is

q\left(s\right)=\prod_{i\in\mathcal{I}}p_{i}\left(s_{i}-s_{i}^{*}\mid\tilde{\epsilon}\right),

where $\mathcal{I}$ is chosen to be $\left\{2,\ldots,22,24,\ldots,45\right\}$. The remaining coordinates of the proposal, $\tilde{s}_{\mathcal{I}^{c}}$, are solved for according to the equality constraint $A\tilde{s}=a$.
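
For concreteness, the invariant parameters $(A,a,B,b)$ for this example could be assembled as in the sketch below, under the assumption that the 46-vector lists the female row first and that the voting-age buckets start at 18–19; the exact index bookkeeping used to produce Table 1 may differ, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2020)

d, n_age = 46, 23                       # 2 (sex) x 23 (age); assume female row first
voting = np.arange(4, n_age)            # assumption: buckets from 18-19 onward are 18+

# Simulate a confidential table as in the simulation study below (numpy's
# success probability p = 0.95 gives mean about 5.26 and variance 5.54).
s_conf = rng.negative_binomial(100, 0.95, size=d)

A = np.zeros((3, d))
A[0, :] = 1.0                           # (1) total population
A[1, :n_age] = 1.0                      # (2) female count; with the total fixed,
                                        #     this pins down the female proportion
A[2, voting] = 1.0                      # (3) voting-age population, female row ...
A[2, n_age + voting] = 1.0              #     ... plus the male row
a_vec = A @ s_conf                      # invariant values computed from s*

B, b_vec = np.eye(d), np.zeros(d)       # (4) nonnegative table entries
```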

Table 1 displays a simulated confidential table (top) and a constrained differentially private table (bottom) based on it, where the draw is produced by Algorithm 1. In this case, setting $\tilde{\epsilon}=0.6$ yields the best acceptance probability, at $1.68\%$, as shown in Figure 2 in Appendix C alongside traceplots for the second and the first cells of the table, the former index belonging to $\mathcal{I}$ and the latter not.

Figure 1. Average $L_{1}$ (top) and $L_{2}$ (bottom) distances of a simulated confidential dataset from its privatized releases using four different processing methods.

We compare the outputs of our congenial mechanism with those of the unconstrained privacy mechanism, with and without post-processing via nonnegative $L_{2}$ minimization onto the invariant set $\mathcal{S}^{*}$ defined by (1)-(4). A total of $20$ confidential contingency tables are simulated, each with cells following

s_{i}\overset{i.i.d.}{\sim}\text{Negative Binomial}\left(100,.05\right),

which has mean $E(s_{i})=5.26$ and variance $Var(s_{i})=5.54$. For each confidential table, $100$ privatized releases were created using each of the following methods:

  a) the unconstrained (raw) Double Geometric mechanism with privacy loss budget $\epsilon=1$ per cell;

  b) the above mechanism followed by nonnegative $L_{2}$ (NNL2) minimization onto the subspace defined by the four constraints, i.e. $\mathcal{S}^{*}$;

  c) our proposed $\mathcal{S}^{*}$-conditional algorithm per Algorithm 1, constructed based on an unconstrained Double Geometric mechanism with $\epsilon_{0}=0.5$; and

  d) the same unconstrained Double Geometric mechanism as in (a), but with $\epsilon=0.5$ per cell.

The $L_{1}$ and $L_{2}$ distances between a confidential table $s$ and $\tilde{s}$, a privatization of $s$, are given respectively by

(3.1) L_{1}\left(s,\tilde{s}\right)=\sum_{i}\left|s_{i}-{\tilde{s}}_{i}\right|\quad{\rm and}\quad L_{2}\left(s,\tilde{s}\right)=\sqrt{\sum_{i}\left(s_{i}-{\tilde{s}}_{i}\right)^{2}}.

For each of the four types of privatization and processing mechanisms, we compute averages of both distances over 100 realizations of $\tilde{s}$. Figure 1 displays box plots of these average distances over the 20 simulated copies of the confidential table $s$. We observe that the conditional mechanism, constructed from an unconstrained mechanism with $\epsilon_{0}=0.5$, exhibits a degree of variability between those of the unconstrained mechanisms with privacy loss budgets $\epsilon=0.5$ and $\epsilon=1$. On the other hand, the nonnegative $L_{2}$ projection of the unconstrained mechanism with $\epsilon=1$ achieves a level of accuracy mostly on par with it. Note that this observation should not be taken as a statement about the relative accuracy of the nonnegative $L_{2}$ minimization and the constrained mechanism, because the effective privacy guarantee that either mechanism enjoys is undetermined, an issue we will discuss further at the end of Section 4. That said, since by Theorem 2.1 the conditional mechanism inflates the privacy loss budget of an unconstrained algorithm by a factor of $\left(1+\gamma\right)$, the empirical observation suggests that the effective $\gamma$ may be somewhere between $0$ and $1$. In the following sections, we will see examples where $\gamma=0$ or even $\gamma<0$.

4. Curator’s post-processing may not be innocent processing

A common practice to ensure unconstrained differentially private releases respect the mandated disclosure requirements is through optimization-based post-processing, which takes the general form

(4.1) f\left(M;\mathcal{S}^{*}\right)=\text{argmin}_{s\in\mathcal{S}^{*}}\Delta\left(M,s\right).

That is, $f$ is the element of $\mathcal{S}^{*}$ that is closest to $M$ according to some discrepancy measure $\Delta$, typically a distance, such as the two given in (3.1). In the case of the Census Bureau's TopDown algorithm, $f$ is a composite post-processing procedure consisting of a nonnegative $L_{2}$ minimization followed by an $L_{1}$ minimization onto the integer solutions; see (Abowd et al., 2019).
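
As a generic illustration of the projection step in (4.1), with $\Delta$ the $L_{2}$ distance and nonnegativity bounds (this is only a sketch using an off-the-shelf solver, not the Census TopDown implementation), one could proceed along the following lines:

```python
import numpy as np
from scipy.optimize import minimize

def nnl2_project(m, A, a_vec):
    """Project an unconstrained release m onto
    S* = {s : A s = a, s >= 0} in L2 distance."""
    res = minimize(
        fun=lambda s: np.sum((s - m) ** 2),
        jac=lambda s: 2.0 * (s - m),
        x0=np.clip(m, 0.0, None),
        constraints=[{"type": "eq", "fun": lambda s: A @ s - a_vec}],
        bounds=[(0.0, None)] * len(m),
        method="SLSQP",
    )
    return res.x

# Toy usage: a two-bin release that must sum to 5 and stay nonnegative.
m = np.array([3.7, -1.2])
print(nnl2_project(m, np.ones((1, 2)), np.array([5.0])))   # approx (4.95, 0.05)
```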

In the literature of differential privacy, there is a widely referenced theorem which establishes that differentially privatized releases are “immune” to post-processing (Dwork and Roth, 2014). The theorem states that if $M$ is an $\epsilon$-differentially private mechanism and $g$ is an arbitrary function, then $g\circ M$ is still $\epsilon$-differentially private. Indeed, for any $g$-measurable set $B$,

(4.2) P\left(g(M\left(x\right))\in B\right)=P(M\left(x\right)\in g^{-1}(B)).

Thus for every $x\in\mathcal{X}$, the maximal increased risk of disclosure from releasing $g(M(x))$ cannot exceed that from releasing $M(x)$ (the reverse is guaranteed only when $g$ is one-to-one). Intuitively, further blurring an already blurred picture can only make it harder, not easier, to see what is in the original picture.

This intuition, however, is based on the assumption that the further blurring process does not use any knowledge about the original picture. We need to make clear here that imposing invariants on differentially private releases via optimization-based post-processing, in the sense of the operation discussed here, does not in general fall under the jurisdiction of the post-processing theorem. This is because $f$, the function used to impose invariants on the unconstrained output $M\left(x\right)$, is implicitly dependent on the confidential dataset $x^{*}$, with the dependence induced via $\mathcal{S}^{*}$, or equivalently $\mathcal{X}^{*}$ to which $x^{*}$ belongs. Since $\mathcal{S}^{*}$ supplies information about the confidential database, whereas the unconstrained mechanism $M$ is by design not preferential towards $\mathcal{S}^{*}$, any further processing of $M$ that makes nontrivial use of $\mathcal{S}^{*}$ risks violating the privacy guarantee that $M$ deserves.

The post-processing theorem guarantees that no loss of privacy will be induced on the privatized query by any functional transformation that may be carried out by an ordinary analyst or data user. However, imposing invariants is the kind of post-processing that only the data curator – one who has access to the confidential data – is capable of performing. In the extreme scenario (see Example 5.2) in which the invariant forces the privatized disclosure to be precisely equal to the confidential query, for the data curator to achieve this algorithmically is as simple as taking the privatized query $M$ and projecting it onto the single point in $\mathbb{R}^{d}$ defined by the confidential value $s\left(x^{*}\right)$. But this is impossible for a data user who does not know what $s\left(x^{*}\right)$ is.

One may wonder the following. While the invariant $\mathcal{S}^{*}$ has a dependence on the confidential $x^{*}$, it is itself nevertheless public information. Doesn't that make $f(\cdot;\mathcal{S}^{*})$ a fully specified function, just like $g$ in the post-processing theorem? The answer is no in general, and the distinction here is a subtle one. When talking about the value of the invariants, it suffices to regard $\mathcal{S}^{*}$ merely as an announced description of the confidential data. However, as alluded to previously, $\mathcal{S}^{*}$ is a set-valued map from the database space to subsets of the query space, i.e. $\mathcal{S}^{*}:\mathcal{X}\to\mathscr{B}(\mathbb{R}^{d})$. It is almost always the case in practice that the functional form of the map $\mathcal{S}^{*}$ is determined a priori, but its value – namely $\mathcal{S}^{*}\left(x^{*}\right)$ – can be calculated only after the confidential data are observed. Indeed, the actual specification of $\mathcal{S}^{*}$ would almost certainly change if $x^{*}$ were observed differently. This means that for an $f$-measurable set $B$ and a database $x\in\mathcal{X}$, the equivalent events in the $f$ and the $M$ spaces are now

(4.3) f\left(M\left(x\right);\mathcal{S}^{*}\left(x\right)\right)\in B\;\Leftrightarrow\;M\left(x\right)\in f^{-1}\left(B;\mathcal{S}^{*}(x)\right),

noting that the inverse function $f^{-1}(\cdot;\mathcal{S}^{*}(x))$ now depends on $x$.

To see the complication caused by this dependence, write $\tilde{B}(x)=f^{-1}(B;\mathcal{S}^{*}(x))$ and $f(x)=f(M\left(x\right);\mathcal{S}^{*}(x))$. We then have

(4.4) \frac{P(f(x)\in B)}{P(f(x^{\prime})\in B)}=\frac{P(M(x)\in\tilde{B}(x))}{P(M(x^{\prime})\in\tilde{B}(x^{\prime}))}.

Although both $\tilde{B}(x)$ and $\tilde{B}(x^{\prime})$ are measurable sets, they are not necessarily the same when $x^{\prime}\neq x$. Hence we cannot use (1.1) to conclude that the right hand side of (4.4) is bounded above by $\exp\left(\epsilon\right)$. This does not necessarily imply that the post-processing $f$ as defined in (4.1) is not $\epsilon$-differentially private. Indeed, we prove in Appendix A that both the $L_{2}$ and (a class of) $L_{1}$ post-processing in Example 4.1 below are $\epsilon$-differentially private. But it does imply that in general, the statistical and privacy properties of $f$ are not straightforwardly inherited from those of $M$, and hence they need to be established on a case by case basis.

Another major drawback of using optimization-based methods to impose invariants is that the statistical intelligibility of differential privacy is obscured. The post-processing function $f$ is often procedurally defined, hence a complex and confidential-data-dependent map from the unconstrained query space to the constrained query space, with almost impenetrable statistical properties, certainly so for any given database. In contrast, using conditioning to realize differential privacy with mandated disclosure, though often computationally demanding by construction, preserves the statistical intelligibility of the privacy mechanism. The constrained privacy mechanism is distributionally – as opposed to procedurally – constructed, preserving the possibility of transparent downstream analysis. It furthermore delivers its privacy guarantee in the same format as does differential privacy without constraints, offering a congenial statistical interpretation that resembles the original.

Below we use an example to compare congenial privacy with two post-processing approaches for the same query function. The example is simple enough for analytical derivations of the distributions of the post-processed mechanisms to be possible. As we will see, the three approaches to imposing invariant constraints yield distinct theoretical behaviors.

Example 4.1 (A two-bin histogram with constrained total).

Suppose the confidential database $x$ is a binary vector, and the query of interest tabulates the number of 0 and 1 entries in $x$, i.e.

s\left(x\right)=\left(s_{1}\left(x\right),s_{2}\left(x\right)\right)=\left(\sum_{i}{\bf 1}\left(x_{i}=0\right),\sum_{i}{\bf 1}\left(x_{i}=1\right)\right).

Employ the Laplace mechanism as the unconstrained privatization mechanism to protect the two-bin histogram, i.e.

M\left(x\right)=\left(m_{1}=s_{1}+u_{1},m_{2}=s_{2}+u_{2}\right),\;u_{i}\overset{i.i.d.}{\sim}Lap(2\epsilon^{-1}),

expending in total an $\epsilon$ privacy loss budget. The induced probability density of $M$ is

p\left(m_{1},m_{2}\right)=\left(\frac{\epsilon}{4}\right)^{2}\exp\left\{-\frac{\epsilon}{2}\left(\left|m_{1}-s_{1}\right|+\left|m_{2}-s_{2}\right|\right)\right\}.

Suppose the invariant to be imposed is that the total of the privatized histogram shall be the same as that of the confidential query itself. That is, for any given $x$, the associated invariant set is

(4.5) \mathcal{S}^{*}\left(x\right)=\left\{\left(a_{1},a_{2}\right)\in\mathbb{R}^{2}:a_{1}+a_{2}=s_{1}\left(x\right)+s_{2}\left(x\right)\right\}.

In the calculations below, a certain database $x$ is fixed, and we write the invariant total as $n=\left\|x\right\|$, the length of $x$.

Congenial privacy.

Under the constraint of the histogram total, our congenial $M^{*}\left(x\right)$ is obtained from the conditional distribution

\left(s_{1}+u_{1},s_{2}+u_{2}\right)\mid u_{1}+u_{2}=0.

It turns out that the probability density of $M^{*}$ is

(4.6) P\left(m_{1}=m,m_{2}=n-m\right)=\frac{\epsilon}{2}\exp\left\{-\epsilon\left|m-s_{1}\right|\right\}

and 0 otherwise; see Appendix A. That is, our congenial mechanism $M^{*}$ simply draws a $u$ from $Lap(\epsilon^{-1})$, and then releases $(m_{1}=s_{1}+u,m_{2}=s_{2}-u)$. Clearly, the privacy property of $M^{*}$ is the same as that of its first component, call it $M^{*}_{1}$, which protects $s_{1}$ by releasing $m_{1}$, because setting $m_{2}=n-m_{1}$ is a deterministic step with no implication for privacy when $n$ is known. But $M^{*}_{1}$ is simply the Laplace mechanism with budget $\epsilon$. Consequently, our congenial mechanism maintains the same $\epsilon$ guarantee as the original unconstrained mechanism, even though the meaning of the protection is different, as we emphasized in Section 2.2.
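
In code, this congenial mechanism amounts to a single Laplace draw; a minimal sketch (illustrative names, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(2020)

def congenial_two_bin(s1, s2, eps):
    """Congenial mechanism M* of Example 4.1: a single Laplace draw with
    scale 1/eps, added to one bin and subtracted from the other, so the
    invariant total s1 + s2 is preserved exactly."""
    u = rng.laplace(scale=1.0 / eps)
    return s1 + u, s2 - u

print(congenial_two_bin(s1=130, s2=126, eps=1.0))
```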

Post-processing with $L_{2}$ minimization.

Here we minimize the $L_{2}$ distance between $M\left(x\right)$ and the post-processed histogram release, denoted $f_{L_{2}}$, subject to its sum being $n$. The solution is

f_{L_{2}}\left(M\left(x\right),\mathcal{S}^{*}\right)=\text{argmin}_{s\in\mathcal{S}^{*}}\left\|M\left(x\right)-s\right\|_{2}=\left(\bar{x}+\tilde{u},\bar{x}-\tilde{u}\right),

where $\tilde{u}$ is the average of two independent Laplace random variables with scale $2\epsilon^{-1}$. As can be easily seen (for example from its characteristic function), $\tilde{u}$ is not a Laplace random variable, but in fact follows the distribution

(4.7) \frac{1}{2}Lap(\epsilon^{-1})+\frac{1}{2}SgnGamma(2,\epsilon^{-1}),

that is, a 50-50 mixture of a Laplace distribution with scale $\epsilon^{-1}$ and a signed Gamma distribution (i.e. a regular Gamma distribution multiplied by a fair random sign) with shape $k=2$ and scale $\epsilon^{-1}$; see Appendix A for the derivation. It is worth noting that since a signed Gamma distribution of shape $k=2$ can be written as the sum of two independent Laplace distributions of the same scale, $\tilde{u}$ is more variable than a single Laplace random variable of the same scale. Hence for any $x$, the privatized release using $f_{L_{2}}$ will be more variable than that of the congenial privatization $M^{*}$. Intuitively, this suggests that $f_{L_{2}}$ should not do worse than $M^{*}$ in terms of privacy protection. Indeed, as we show in Appendix A, $f_{L_{2}}$ also achieves the same $\epsilon$-differentially private guarantee.

Mechanism      $E\left(m_{1},m_{2}\right)$                                        $Var\left(m_{1}\right)$
$M^{*}$        $\left(s_{1}\left(x\right),s_{2}\left(x\right)\right)^{\top}$      $\frac{2}{\epsilon^{2}}$
$f_{L_{2}}$    $\left(\bar{s}\left(x\right),\bar{s}\left(x\right)\right)^{\top}$  $\frac{4}{\epsilon^{2}}$
$f_{L_{1}}$    $\left(s_{1}\left(x\right),s_{2}\left(x\right)\right)^{\top}$      $\left[\frac{4}{\epsilon^{2}},\frac{8}{\epsilon^{2}}\right]$

Table 2. Differentially private two-bin histogram with invariant total: expectation and first-component variance of the conditional ($M^{*}$) and post-processed ($f_{L_{2}}$ and $f_{L_{1}}$) histograms.

Post-processing with $L_{1}$ minimization.

If we change the $L_{2}$ norm to the $L_{1}$ norm in the above, the privatization mechanism is no longer unique. There are infinitely many solutions of the form

f_{L_{1}}\left(M\left(x\right),\mathcal{S}^{*}\right)=\text{argmin}_{s\in\mathcal{S}^{*}}\left\|M\left(x\right)-s\right\|_{1}:=\left(\tilde{s},n-\tilde{s}\right),

where $\tilde{s}$ only needs to satisfy

\tilde{s}\in\left[\min\left\{s_{1}+u_{1},\,n-\left(s_{2}+u_{2}\right)\right\},\ \max\left\{s_{1}+u_{1},\,n-\left(s_{2}+u_{2}\right)\right\}\right]\overset{L}{=}\left[s_{1}+\min\left(u_{1},u_{2}\right),\ s_{1}+\max\left(u_{1},u_{2}\right)\right],

where $\min\left(u_{1},u_{2}\right)$ and $\max\left(u_{1},u_{2}\right)$ are the minimum and the maximum of two i.i.d. Laplace random variables. In particular, choosing any convex combination of $u_{1}$ and $u_{2}$ as the additive noise term for the first entry constitutes a solution, i.e., $\tilde{s}_{1}=s_{1}+\beta u_{1}+(1-\beta)u_{2}$ for some $\beta\in[0,1]$, with $\tilde{s}_{2}=n-\tilde{s}_{1}$. For the rest of the article, $L_{1}$ post-processing refers to this convex combination strategy.
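
A minimal sketch of this convex combination strategy follows (illustrative names, hypothetical $\beta$): the released first entry equals $s_{1}+\beta u_{1}-(1-\beta)u_{2}$, the convex combination of the two endpoint minimizers, which matches the strategy above up to the sign symmetry of the Laplace noise.

```python
import numpy as np

rng = np.random.default_rng(2020)

def l1_post_process(s1, s2, eps, beta=0.5):
    """L1 post-processing of Example 4.1: perturb both bins with the
    unconstrained Laplace mechanism (scale 2/eps per bin), then release the
    convex combination of the two endpoint minimizers, which lies in the
    interval of L1-minimizing solutions for the realized noise."""
    n = s1 + s2
    u1, u2 = rng.laplace(scale=2.0 / eps, size=2)
    m1 = beta * (s1 + u1) + (1.0 - beta) * (n - (s2 + u2))
    return m1, n - m1

print(l1_post_process(s1=130, s2=126, eps=1.0, beta=0.5))
```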

For ease of reference, Table 2 collects a comparison of the constrained differentially private histogram $M^{*}$ and the two post-processing approaches, $f_{L_{2}}$ and $f_{L_{1}}$, in terms of the expectation and variance of the resulting release for a given database $x\in\mathcal{X}$ and confidential query $s$. All expectations are taken with respect to the relevant mechanism.

We can see that our congenial mechanism has the smallest variance. Because the congenial mechanism and $f_{L_{2}}$ both carry the same $\epsilon$-privacy guarantee, which cannot be further improved, we can comfortably declare that $f_{L_{2}}$ is inadmissible: it is dominated by the congenial mechanism, providing less utility (in terms of statistical precision) without the benefit of increased privacy protection. However, we cannot say that the congenial mechanism dominates $f_{L_{1}}$, even though it still has smaller variance. This is because, as we prove in Appendix A, the attained level of privacy guarantee of $f_{L_{1}}$ is $\epsilon/(2\max\{\beta,1-\beta\})$, which is never worse than $\epsilon$. Hence the increased variance under $f_{L_{1}}$ might be acceptable as a price for gaining more privacy protection. In general, comparing the utility of two privatization mechanisms with the same nominal but different actual privacy loss budgets is as thorny an issue as comparing the statistical power of two testing procedures with the same nominal, but different actual, Type I error rates.

5. Discussion

5.1. Finding better $\gamma$

While Theorem 2.1 always holds with $\gamma=1$, that choice likely sets a loose bound on the ratio between $P\left(M^{*}\left(x\right)\in B\right)$ and $P\left(M^{*}\left(x^{\prime}\right)\in B\right)$, hence declaring an overly “conservative” nominal level of privacy loss induced by $M^{*}$. Depending on how the invariant $\mathcal{S}^{*}$ interacts with the distributional properties of the unconstrained mechanism $M$ in a specific instance, $\gamma$ can be shown to take smaller values, adding more “bang for the buck” to the privacy loss budget, so to speak. Three examples are given below.

Example 5.1 (trivial invariants).

Consider the trivial case where the set of invariants does not actually impose any restriction, i.e., $\mathcal{X}^{*}=\mathcal{X}$. It is then necessarily true that $\mathcal{S}^{*}=\mathcal{S}$, and the “constrained” differentially private mechanism is identical in distribution to the unconstrained one: $M^{*}\overset{L}{=}M$. In this case, $\gamma=0$ and $M^{*}$ is $\epsilon$-differentially private.

Example 5.2 (rounding and secrecy).

Let $x$ be a binary vector of length $n$ indicating a group of individuals' possession of a certain feature (yes 1 / no 0), and the query of interest is $s(x)=\lceil\sum x_{i}/10\rceil$, the number of groups of size $10$ that can be formed by people who possess the feature. A Double Geometric mechanism $M(x)=s(x)+U$ is used to protect the query, with a privacy loss budget of $\epsilon$ (under the global sensitivity $\nabla(s)=1$).

Suppose the following invariant set is mandated for disclosure:

𝒳={(x1,,xn){0,1}n:xi[41,50]},\mathcal{X}^{*}=\left\{\left(x_{1},\ldots,x_{n}\right)\in\left\{0,1\right\}^{n}:\sum x_{i}\in\left[41,50\right]\right\},

or equivalently, 𝒮={5}\mathcal{S}^{*}=\left\{5\right\} is the singleton set that contains nothing but the true value s(x)=5s(x^{*})=5. In this case, the implied constrained privacy mechanism MM^{*} is equivalent to a degenerate distribution: P(M(x)=5)=1P\left(M^{*}\left(x\right)=5\right)=1 for all x𝒳x\in\mathcal{X}^{*}. Furthermore, for all neighboring datasets (x,x)𝒳×𝒳\left(x,x^{\prime}\right)\in\mathcal{X}^{*}\times\mathcal{X}^{*} and any measurable subset BB of \mathbb{N},

P\left(M^{*}\left(x\right)\in B\right)=\exp\left(0\right)P\left(M^{*}\left(x^{\prime}\right)\in B\right)=\begin{cases}1&\text{if }5\in B,\\ 0&\text{otherwise}.\end{cases}

Therefore in this particular instance, MM^{*} is in fact 0-differentially private, corresponding to γ=1\gamma=-1 in Theorem 2.1. This means that for those databases conforming to the invariant 𝒳\mathcal{X}^{*}, MM^{*} supplies no discriminatory information among them whatsoever. Indeed, if the value of the supposedly private query is already public knowledge, no mechanism can further increase its disclosure risk, thereby achieving complete differential privacy.
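To make the conditioning concrete, the sketch below realizes the constrained mechanism of this example by simple rejection: draw the unconstrained Double Geometric release and keep it only if it falls in the invariant set. The parameterization a = exp(−ϵ) (unit sensitivity), the representation of the noise as a difference of two geometric variables, and the use of rejection in place of Algorithm 1 are all illustrative assumptions; the retained draws are all equal to 5, matching the degenerate distribution described above.

```python
import numpy as np

# Illustrative sketch: conditioning the Double Geometric release of
# Example 5.2 on the invariant S* = {5} via rejection sampling.
rng = np.random.default_rng(1)
eps = 1.0
a = np.exp(-eps)     # assumed Double Geometric parameterization (unit sensitivity)
s_true = 5           # confidential query value s(x) = ceil(sum(x)/10)
S_star = {5}         # mandated invariant: the query value itself is public

def double_geometric(size):
    """Two-sided geometric noise, expressed as a difference of two i.i.d. geometrics."""
    g1 = rng.geometric(1.0 - a, size=size) - 1
    g2 = rng.geometric(1.0 - a, size=size) - 1
    return g1 - g2

def constrained_release():
    """One draw from M*(x): resample the unconstrained release until it lies in S*."""
    while True:
        m = s_true + double_geometric(1)[0]
        if m in S_star:
            return int(m)

draws = [constrained_release() for _ in range(1000)]
print(set(draws))    # {5}: the constrained mechanism is degenerate at the true value
```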

Our third, less trivial, example of γ<0\gamma<0 is provided by the congenial mechanism in Example 4.1. There, although the guaranteed privacy loss budget is still ϵ\epsilon, in applying Theorem 2.1, kk must be set to 22 or greater, because under the constraint of fixed sum, the nearest neighbors among binary vectors (x,x)(x,x^{\prime}) must have 𝚍(x,x)=2\mathtt{d}(x,x^{\prime})=2. Hence the ϵ\epsilon privacy bound implies k(1+γ)=2(1+γ)=1k(1+\gamma)=2(1+\gamma)=1, yielding γ=0.5\gamma=-0.5.

This example also reminds us that a major cause of information leakage due to invariants is the structural erosion of the underlying data space, such as making 𝚍(x,x)=1\mathtt{d}(x,x^{\prime})=1 (as measured on the original space 𝒳\mathcal{X}) impossible. In reality, the unconstrained data space 𝒳\mathcal{X} is typically regular, and contains 𝒳\mathcal{X}^{*} as a proper subset. We should expect to find many x𝒳x\in\mathcal{X}^{*}, and many (if not many more) x𝒳\𝒳x^{\prime}\in\mathcal{X}\backslash\mathcal{X}^{*}, such that xx and xx^{\prime} are neighbors, near or far. Knowing that the confidential dataset must belong to 𝒳\mathcal{X}^{*} categorically rules out the possibility that any such xx^{\prime} could be the confidential dataset, weakening the differential privacy promise by eliminating those neighbors. If the invariant is sufficiently restrictive that 𝒳\mathcal{X}^{*} becomes topologically small relative to 𝒳\mathcal{X}, it may be the case that for some x𝒳x\in\mathcal{X}^{*}, none of its original immediate neighboring datasets lie in 𝒳\mathcal{X}^{*}:

{(x,x)𝒳×𝒳:𝚍(x,x)=1}=,\left\{\left(x,x^{\prime}\right)\in\mathcal{X}^{*}\times\mathcal{X}^{*}:\mathtt{d}\left(x,x^{\prime}\right)=1\right\}=\emptyset,

in which case we say that the neighboring structure of the original data space is substantially disrupted, as seen in Example 4.1. If the disruption is so severe that neighbors of any distance cease to exist, we say that the neighborhood structure is completely destroyed:

{(x,x)𝒳×𝒳:𝚍(x,x)1}=.\left\{\left(x,x^{\prime}\right)\in\mathcal{X}^{*}\times\mathcal{X}^{*}:\mathtt{d}\left(x,x^{\prime}\right)\geq 1\right\}=\emptyset.

Then, even for the constrained mechanism MM^{*}, the ϵ\epsilon-differential privacy promise becomes vacuously true, since no neighboring pairs remain to which the concept of privacy applies. However, Example 5.2 demonstrates that a vacuous privacy promise can occur without the neighborhood structure being completely destroyed.

In general, it is conceptually difficult to parse out the share of responsibility for privacy attributable to the data curator under any scenario of mandated disclosure. If certain information is made public, then any information that it logically implies cannot be expected to be protected either. The best that we can expect any privacy mechanism to deliver, then, is protection over the information that truly remains confidential. Notions serving purposes equivalent to those of 𝒳\mathcal{X}^{*} and 𝒮\mathcal{S}^{*} have been proposed in the literature in expositions of new notions of differential privacy, including blowfish and pufferfish privacy (He et al., 2014; Kifer and Machanavajjhala, 2012), where the privacy guarantee is re-conceptualized on the restricted space modulo any structural erosion of the original sample space due to external or auxiliary information available to an attacker. When interpreting the promise of Theorem 2.1, we should exercise due diligence in the case where immediate neighbors no longer exist, and speak of the ϵ\epsilon-differential privacy guarantee only for those kk-neighbors that actually do.

5.2. Other interpretations of privacy

The literature has seen other lines of work that offer interpretations of differential privacy in statistical terms. Notably, the posterior-to-posterior semantics of differential privacy (Dinur and Nissim, 2003; Kasiviswanathan and Smith, 2014; Ashmead et al., 2019) also explains the effect of privacy in the vocabulary of Bayesian posteriors. It establishes differential privacy as a bound on the ratio of posterior probabilities assigned to an individual confidential data entry when the private mechanism is applied to neighboring datasets that differ in only one entry. The ratio in question compares the two quantities

(5.1) π(xi=ωMϵ(x)B)andπ(xi=ωMϵ(x)B),\pi\left(x_{i}^{*}=\omega\mid M_{\epsilon}\left(x\right)\in B\right)\;\text{and}\;\pi\left(x_{i}^{*}=\omega\mid M_{\epsilon}\left(x^{\prime}\right)\in B\right),

where xx and xx^{\prime} are neighboring datasets, and what varies between the two posterior quantities is the confidential dataset on which the private query is applied: one of the two presumably (but not necessarily) contains the true value of the iith confidential data entry xix^{*}_{i}, while the other contains a fabricated value of it.

The comparison in (5.1) raises the question of what is meant by the conditional probability of xix^{*}_{i} given a private query constructed from a database that does not contain this true value, as this conditional probability hinges on external knowledge about how a fabricated database may inform the actual confidential database. Our prior-to-posterior semantics formulated in Theorem 1.3 takes a practical point of view and avoids such conceptual complications. We compare the disclosure risk before and after an actual release, reflecting the core idea behind differential privacy.

5.3. Full privacy or vacuous knowledge

As alluded to in Section 1.4, the notion of vacuous knowledge cannot be appropriately captured by ordinary probabilities. This defect reflects a fundamental inability of the language of probability to express a true lack of knowledge, a central struggle in the Bayesian literature that has motivated endeavors in search of the so-called “objective priors” (Ghosh, 2011). Neither the uniform distribution nor any other reference distribution is truly information-free, as they all invariably invoke some principle of indifference in relation to a specific criterion (such as the Lebesgue measure, the counting measure, or the likelihood function), which is itself subject to debate.

To supply the rigor needed to define M0(x)M_{0}(x) in Section 1.4 probabilistically, we invoke the concept of lower probability functions and, as a special case, belief functions (see e.g. Gong, 2019b), both generalizations of a probability function that amount to a set of probability distributions on a given space. The statistical information contained in M0M_{0} is represented by the vacuous lower probability function, denoted P¯\underline{P}, which takes the value P¯(B)=1\underline{P}(B)=1 only when B=dB=\mathbb{R}^{d}, and 0 everywhere else. Equivalently stated in terms of its conjugate upper probability function P¯(B)=1P¯(Bc)\overline{P}(B)=1-\underline{P}(B^{c}),

(5.2) \overline{P}\left(B\right)=\begin{cases}1&\text{if }B\in\mathscr{B}\left(\mathbb{R}^{d}\right)\backslash\left\{\emptyset\right\},\\ 0&\text{if }B=\emptyset.\end{cases}

That is, the statistical information contained in M0M_{0} can be (but is not known to be) concordant with any Borel-measurable probability function; thus the probability of any BB can be as low as 0 and as high as 11, as long as BB is neither the full set nor the empty set.

Generally, the conditioning operation for lower probability functions is neither trivial nor unique, owing to the existence of several applicable rules. But if the lower probability function being conditioned on is vacuous, there is consensus among the different rules as to what posterior distribution should result, namely precisely the one stated in (1.7). See (Gong and Meng, 2020) for an extended exposition of conditioning rules involving lower probability and belief functions.

5.4. Future directions

This work points to several future directions of pursuit. On the computational front, how do we construct efficient algorithms to realize congenial privacy by drawing possibly high-dimensional releases subject to complex constraints? When we use non-perfect Markov chain Monte Carlo to accomplish this task, how do we ensure that the declared privacy guarantee is not destroyed by the fact that a chain cannot be run indefinitely? On the privacy front, for every constrained mechanism constructed through conditioning, how do we find the best γ\gamma value, one that tracks as closely as possible the effective privacy loss budget and in turn enables fair performance comparisons among invariant-respecting algorithms? Furthermore, how do we achieve an orthogonal decomposition of the public, invariant information from the free, residual information that remains in the confidential microdata, in a logical sense and without having to resort to the probabilistic vocabulary of statistical independence?

Acknowledgements.
The authors thank Jeremy Seeman, Salil Vadhan, Guanyang Wang and three anonymous reviewers for helpful suggestions. Ruobin Gong gratefully acknowledges research support by the National Science Foundation (NSF) DMS-1916002. Xiao-Li Meng also thanks NSF for partial financial support while completing this article.

References

  • Abowd et al. (2019) John Abowd, Robert Ashmead, Simson Garfinkel, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, and William Sexton. 2019. Census TopDown: Differentially Private Data, Incremental Schemas, and Consistency with Public Knowledge. Technical Report. US Census Bureau.
  • Ashmead et al. (2019) Robert Ashmead, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, and William Sexton. 2019. Effective Privacy After Adjusting for Invariants with Applications to the 2020 Census. Technical Report. US Census Bureau.
  • Blocker and Meng (2013) Alexander W Blocker and Xiao-Li Meng. 2013. The potential and perils of preprocessing: Building new foundations. Bernoulli 19, 4 (2013), 1176–1211.
  • Dinur and Nissim (2003) Irit Dinur and Kobbi Nissim. 2003. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 202–210.
  • Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference. Springer, 265–284.
  • Dwork and Roth (2014) Cynthia Dwork and Aaron Roth. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4 (2014), 211–407.
  • Fay (1992) Robert E Fay. 1992. When Are Inferences from Multiple Imputation Valid? US Census Bureau.
  • Fioretto and Van Hentenryck (2019) Ferdinando Fioretto and Pascal Van Hentenryck. 2019. Differential privacy of hierarchical census data: An optimization approach. In International Conference on Principles and Practice of Constraint Programming. Springer, 639–655.
  • Ghosh (2011) Malay Ghosh. 2011. Objective priors: An introduction for frequentists. Statist. Sci. 26, 2 (2011), 187–202.
  • Gong (2019a) Ruobin Gong. 2019a. Exact inference with approximate computation for differentially private data via perturbations. arXiv preprint arXiv:1909.12237 (2019).
  • Gong (2019b) Ruobin Gong. 2019b. Simultaneous Inference under the Vacuous Orientation Assumption. Proceedings of Machine Learning Research 103 (2019), 225–234.
  • Gong (2020) Ruobin Gong. 2020. Transparent Privacy is Principled Privacy. arXiv preprint arXiv:2006.08522 (2020).
  • Gong and Meng (2020) Ruobin Gong and Xiao-Li Meng. 2020. Judicious judgment meets unsettling updating: dilation, sure loss, and Simpson’s paradox. Statist. Sci. (2020).
  • He et al. (2014) Xi He, Ashwin Machanavajjhala, and Bolin Ding. 2014. Blowfish privacy: Tuning privacy-utility trade-offs using policies. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 1447–1458.
  • Kasiviswanathan and Smith (2014) Shiva P Kasiviswanathan and Adam Smith. 2014. On the ‘semantics’ of differential privacy: A Bayesian formulation. Journal of Privacy and Confidentiality 6, 1 (2014).
  • Kifer and Machanavajjhala (2012) Daniel Kifer and Ashwin Machanavajjhala. 2012. A rigorous and customizable framework for privacy. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems. 77–88.
  • Kott (1995) PS Kott. 1995. A paradox of multiple imputation. In Proceedings of the Section on Survey Research Methods. American Statistical Association, 384–389.
  • Liu (1996) Jun S Liu. 1996. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and computing 6, 2 (1996), 113–119.
  • Meng (1994) Xiao-Li Meng. 1994. Multiple-imputation inferences with uncongenial sources of input. Statist. Sci. (1994), 538–558.
  • Rubin (2004) Donald B Rubin. 2004. Multiple imputation for nonresponse in surveys. John Wiley & Sons.
  • US Census Bureau (2020) US Census Bureau. 2020. 2010 Demonstration Data Products. https://www.census.gov/programs-surveys/decennial-census/2020-census/planning-management/2020-census-data-products/2010-demonstration-data-products.html [Accessed: 04-08-2020].
  • Xie and Meng (2017) Xianchao Xie and Xiao-Li Meng. 2017. Dissecting Multiple Imputation from a Multi-Phase Inference Perspective: What Happens When God’s, Imputer’s and Analyst’s Models Are Uncongenial? Statistica Sinica (2017), 1485–1545.

Appendix A Privacy Guarantees for Congenial, L1L_{1} and L2L_{2} methods

We first derive the conditional distribution of two i.i.d. Laplace random variables, given that their sum is zero. Let u1,u2i.i.d.Lap(2ϵ1)u_{1},u_{2}\overset{i.i.d.}{\sim}Lap\left(2\epsilon^{-1}\right) and denote v=u1v=u_{1}, w=u1+u2w=u_{1}+u_{2}. Since (v,w)\left(v,w\right) is linear in (u1,u2)\left(u_{1},u_{2}\right), their joint probability density function is given by

p(v,w)\displaystyle p\left(v,w\right) \displaystyle\propto p(u1(v,w),u2(v,w))\displaystyle p\left(u_{1}\left(v,w\right),u_{2}\left(v,w\right)\right)
\displaystyle\propto exp(0.5ϵ|v|0.5ϵ|wv|),\displaystyle\exp\left(-0.5\epsilon\left|v\right|-0.5\epsilon\left|w-v\right|\right),

This implies that

p(vw=0)\displaystyle p\left(v\mid w=0\right) \displaystyle\propto p(v,w=0)\displaystyle p\left(v,w=0\right)
\displaystyle\propto exp(ϵ|v|)Lap(ϵ1),\displaystyle\exp\left(-\epsilon\left|v\right|\right)\sim Lap(\epsilon^{-1}),

which leads to (4.6).
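This derivation can be corroborated by a quick simulation: conditioning (approximately, through a narrow tolerance window) on u1 + u2 being near zero should shrink the variance of u1 from the 8/ϵ² of Lap(2/ϵ) to the 2/ϵ² of Lap(1/ϵ). The sketch below is illustrative only; the sample size and tolerance are arbitrary choices.

```python
import numpy as np

# Simulation check: with u1, u2 i.i.d. Lap(2/eps), the conditional law of u1
# given u1 + u2 = 0 should be close to Lap(1/eps), whose variance is 2/eps^2.
rng = np.random.default_rng(2)
eps, n_rep, tol = 1.0, 2_000_000, 0.02

u1 = rng.laplace(scale=2.0 / eps, size=n_rep)
u2 = rng.laplace(scale=2.0 / eps, size=n_rep)

cond = u1[np.abs(u1 + u2) < tol]   # approximate conditioning on a zero sum
print(f"kept {cond.size} draws out of {n_rep}")
print(f"conditional variance   ~ {cond.var():.3f}  (target 2/eps^2 = {2 / eps**2:.3f})")
print(f"unconditional variance ~ {u1.var():.3f}  (Lap(2/eps) has 8/eps^2 = {8 / eps**2:.3f})")
```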

We then derive the density for u~=βu1+(1β)u2\tilde{u}=\beta u_{1}+(1-\beta)u_{2}, where β(0,1)\beta\in(0,1) (the case of β=0\beta=0 or 11 is trivial). This covers the L1L_{1} projection case, where any β[0,1]\beta\in[0,1] is acceptable, and the L2L_{2} projection case, where β=1/2\beta=1/2. Since u1=β1[u~(1β)u2]u_{1}=\beta^{-1}[\tilde{u}-(1-\beta)u_{2}], the Jacobian from (u~,u2)(\tilde{u},u_{2}) to (u1,u2)(u_{1},u_{2}) is β1\beta^{-1}. Consequently,

pϵ(u~,u2)=ϵ216βexp{ϵ2β[|u~(1β)u2|+β|u2|]}.p_{\epsilon}\left(\tilde{u},u_{2}\right)=\frac{\epsilon^{2}}{16\beta}\exp\left\{-\frac{\epsilon}{2\beta}\left[\left|\tilde{u}-(1-\beta)u_{2}\right|+\beta\left|u_{2}\right|\right]\right\}.

To derive pϵ(u~)p_{\epsilon}(\tilde{u}), we assume without loss of generality u~0\tilde{u}\geq 0. Consider pϵ(u~)=pϵ(u~,u2)𝑑u2p_{\epsilon}\left(\tilde{u}\right)=\int p_{\epsilon}\left(\tilde{u},u_{2}\right)du_{2} on three regions:

I1(β)\displaystyle I_{1}(\beta) =ϵ216β0exp{ϵ2β(u~u2)}𝑑u2=ϵ8exp{ϵ2βu~};\displaystyle=\frac{\epsilon^{2}}{16\beta}\int_{-\infty}^{0}\exp\left\{-\frac{\epsilon}{2\beta}(\tilde{u}-u_{2})\right\}du_{2}=\frac{\epsilon}{8}\exp\left\{-\frac{\epsilon}{2\beta}\tilde{u}\right\};
I2(β)\displaystyle I_{2}(\beta) =ϵ216β0u~1βexp{ϵ2β[u~+(2β1)u2]}𝑑u2\displaystyle=\frac{\epsilon^{2}}{16\beta}\int_{0}^{\frac{\tilde{u}}{1-\beta}}\exp\left\{-\frac{\epsilon}{2\beta}\left[\tilde{u}+(2\beta-1)u_{2}\right]\right\}du_{2}
=ϵ8(2β1)[exp{ϵ2βu~}exp{ϵ2(1β)u~}];\displaystyle=\frac{\epsilon}{8(2\beta-1)}\left[\exp\left\{-\frac{\epsilon}{2\beta}\tilde{u}\right\}-\exp\left\{-\frac{\epsilon}{2(1-\beta)}\tilde{u}\right\}\right];
I3(β)\displaystyle I_{3}(\beta) =ϵ216βu~1βexp{ϵ2β[u2u~]}𝑑u2=ϵ8exp{ϵ2(1β)u~}.\displaystyle=\frac{\epsilon^{2}}{16\beta}\int_{\frac{\tilde{u}}{1-\beta}}^{\infty}\exp\left\{-\frac{\epsilon}{2\beta}\left[u_{2}-\tilde{u}\right]\right\}du_{2}=\frac{\epsilon}{8}\exp\left\{-\frac{\epsilon}{2(1-\beta)}\tilde{u}\right\}.

Summing up these terms and noting the symmetry of pϵp_{\epsilon}, we obtain

pϵ(u~)\displaystyle p_{\epsilon}(\tilde{u}) =ϵ4(2β1)[βexp{ϵ2β|u~|}(1β)exp{ϵ2(1β)|u~|}]\displaystyle=\frac{\epsilon}{4(2\beta-1)}\left[\beta\exp\left\{-\frac{\epsilon}{2\beta}|\tilde{u}|\right\}-(1\!-\!\beta)\exp\left\{-\frac{\epsilon}{2(1\!-\!\beta)}|\tilde{u}|\right\}\right]
(A.1) =β2(2β1)Lap(2βϵ1)(1β)2(2β1)Lap(2(1β)ϵ1).\displaystyle=\frac{\beta^{2}}{(2\beta\!-\!1)}Lap\left(2\beta\epsilon^{-1}\right)-\frac{(1\!-\!\beta)^{2}}{(2\beta\!-\!1)}Lap\left(2(1\!-\!\beta)\epsilon^{-1}\right).
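The closed form (A.1) can also be checked numerically, by comparing it against a histogram of simulated draws of βu1 + (1−β)u2. The sketch below does so for one illustrative choice of β ≠ 1/2; the grid, sample size, and seed are arbitrary.

```python
import numpy as np

# Check of (A.1): compare the closed-form density of beta*u1 + (1-beta)*u2
# against the empirical density of simulated draws (beta != 1/2).
rng = np.random.default_rng(3)
eps, beta, n_rep = 1.0, 0.7, 1_000_000

def lap_pdf(x, scale):
    return np.exp(-np.abs(x) / scale) / (2.0 * scale)

def p_eps(x, eps, beta):
    """Equation (A.1): a non-convex 'mixture' of two Laplace densities."""
    w1 = beta**2 / (2.0 * beta - 1.0)
    w2 = (1.0 - beta)**2 / (2.0 * beta - 1.0)
    return w1 * lap_pdf(x, 2.0 * beta / eps) - w2 * lap_pdf(x, 2.0 * (1.0 - beta) / eps)

draws = (beta * rng.laplace(scale=2.0 / eps, size=n_rep)
         + (1.0 - beta) * rng.laplace(scale=2.0 / eps, size=n_rep))

hist, edges = np.histogram(draws, bins=200, range=(-10, 10), density=True)
mid = 0.5 * (edges[:-1] + edges[1:])
print("max |empirical - (A.1)| =", np.max(np.abs(hist - p_eps(mid, eps, beta))))
```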

Remark I

The expression (A.1) is fascinating. It shows that the density of a convex combination of i.i.d. Laplace random variables is a “mixture”, albeit a non-convex combination, of two Laplace densities with different scale parameters, because

β2(2β1)+[(1β)2(2β1)]=1.\frac{\beta^{2}}{(2\beta-1)}+\left[-\frac{(1-\beta)^{2}}{(2\beta-1)}\right]=1.

That is, although the two weights add up to one, they always take opposite signs when β1/2.\beta\not=1/2.

Remark II

When β=1/2\beta=1/2, the expression (A.1) takes a 0/00/0 form, but it is well defined upon taking the limit β1/2\beta\rightarrow 1/2 and applying L’Hospital’s rule, yielding

(A.2) pϵ(u~)=ϵ4[1+ϵ|u~|]exp{ϵ|u~|}.p_{\epsilon}(\tilde{u})=\frac{\epsilon}{4}\left[1+\epsilon|\tilde{u}|\right]\exp\left\{-\epsilon|\tilde{u}|\right\}.

This can be written as

p(u~)\displaystyle p\left(\tilde{u}\right) =\displaystyle= 12ϵ2exp{ϵ|u~|}+12ϵ22|u~|exp{ϵ|u~|}\displaystyle\frac{1}{2}\cdot\frac{\epsilon}{2}\exp\left\{-\epsilon\left|\tilde{u}\right|\right\}+\frac{1}{2}\cdot\frac{\epsilon^{2}}{2}\left|\tilde{u}\right|\exp\left\{-\epsilon\left|\tilde{u}\right|\right\}
\displaystyle\sim 12Lap(ϵ1)+12SgnGamma(2,ϵ1),\displaystyle\frac{1}{2}Lap\left(\epsilon^{-1}\right)+\frac{1}{2}SgnGamma\left(2,\epsilon^{-1}\right),

suggesting that it is more variable than Lap(ϵ1)Lap(\epsilon^{-1}). We now prove that (A.2), and indeed the entire family of distributions given by (A.1) indexed by β(0,1)\beta\in(0,1), is ϵ\epsilon-differentially private. In fact, these distributions attain a level of privacy protection more stringent than ϵ\epsilon whenever β1/2\beta\not=1/2. Our proof relies on the following general result, which can be useful for verifying differential privacy guarantees in other situations.

Theorem A.1.

Suppose f(x)f(x) is a positive real-valued function on a normed vector space 𝒳\mathcal{X}, with its norm denoted by |x||x|. Suppose f(x)f(x) has the following properties:

(i) f(x)f(x) is monotone decreasing in |x||x|;

(ii) gα(x)=f(x)eα|x|g_{\alpha}(x)=f(x)e^{\alpha|x|} is monotone increasing in |x||x|, where α\alpha is a positive constant.

Then for any a𝒳a\in\mathcal{X} and b𝒳b\in\mathcal{X}, we have

(A.3) supx𝒳f(xa)f(xb)eα|ab|.\sup_{x\in\mathcal{X}}\frac{f(x-a)}{f(x-b)}\leq e^{\alpha|a-b|}.
Proof.

For any x𝒳x\in\mathcal{X}, if |xa|>|xb||x-a|>|x-b|, then f(xa)f(xb)f(x-a)\leq f(x-b) by (i) and hence (A.3) holds trivially. If |xa||xb||x-a|\leq|x-b|, then gα(xa)gα(xb)g_{\alpha}(x-a)\leq g_{\alpha}(x-b) by (ii), and hence

f(xa)f(xb)\displaystyle\frac{f(x-a)}{f(x-b)} =gα(|xa|)gα(|xb|)eα(|xb||xa|)\displaystyle=\frac{g_{\alpha}(|x-a|)}{g_{\alpha}(|x-b|)}e^{\alpha(|x-b|-|x-a|)}
eα(|xb||xa|)eα|ab|.\displaystyle\leq e^{\alpha(|x-b|-|x-a|)}\leq e^{\alpha|a-b|}.
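As a simple numerical illustration of how Theorem A.1 can be used, the sketch below takes f to be the Lap(1/ϵ) density, which satisfies condition (i) and, with α = ϵ, condition (ii) (non-strictly, since gα is then constant); a grid search confirms that the worst-case density ratio respects and attains the bound (A.3), recovering the familiar ϵ level of the Laplace mechanism. The grid and the values of a and b are arbitrary choices.

```python
import numpy as np

# Grid-based illustration of Theorem A.1 with f the Lap(1/eps) density:
# conditions (i) and (ii) hold with alpha = eps, and the worst-case ratio
# f(x-a)/f(x-b) should not exceed exp(alpha*|a - b|).
eps, a, b = 1.0, 0.0, 1.0
alpha = eps

def f(x):
    return 0.5 * eps * np.exp(-eps * np.abs(x))

x = np.linspace(-50, 50, 200_001)
ratio = f(x - a) / f(x - b)
print(f"sup ratio on grid = {ratio.max():.4f}, "
      f"bound exp(alpha*|a-b|) = {np.exp(alpha * abs(a - b)):.4f}")
```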

To apply this result, we first note that for pϵ(x)p_{\epsilon}(x) of (A.1) with x>0x>0, its derivative is given by

dpϵ(x)dx\displaystyle\frac{dp_{\epsilon}(x)}{dx} =ϵ28(2β1)[exp{ϵ2(1β)x}exp{ϵ2βx}]<0,\displaystyle=\frac{\epsilon^{2}}{8(2\beta-1)}\left[\exp\left\{-\frac{\epsilon}{2(1\!-\!\beta)}x\right\}-\exp\left\{-\frac{\epsilon}{2\beta}x\right\}\right]<0,

for any β1/2\beta\not=1/2. For β=1/2\beta=1/2, we can directly verify from (A.2) that

dpϵ(x)dx\displaystyle\frac{dp_{\epsilon}(x)}{dx} =ϵ34xexp{ϵx}<0,\displaystyle=-\frac{\epsilon^{3}}{4}x\exp\left\{-\epsilon x\right\}<0,

Hence, condition (i) holds for (A.1) for all β(0,1)\beta\in(0,1).

To establish condition (ii), the choice of α\alpha is key, since it directly governs the degree of privacy guarantee. From the expression (A.3), we want the smallest α\alpha such that condition (ii) holds, which gives the tightest bound and hence the strongest privacy guarantee. A good strategy here is to start with α=cϵ\alpha=c\epsilon and let the mathematics tell us how to minimize over cc. We start with the simplest case, β=1/2\beta=1/2. From the expression (A.2), the smallest cc that can make gαg_{\alpha} monotone increasing is clearly 1. The resulting gαg_{\alpha} also has the property that

(A.4) lim|x|gα(|xa|)gα(|xb|)=1.\displaystyle\lim_{|x|\rightarrow\infty}\frac{g_{\alpha}(|x-a|)}{g_{\alpha}(|x-b|)}=1.

This implies that the bound eϵ|ab|e^{\epsilon|a-b|} can be approached arbitrarily closely by letting |x||x|\rightarrow\infty, which means that the privacy loss budget ϵ\epsilon cannot be reduced. For our current application, this means the post-processing by L2L_{2} projection is also differentially private at level ϵ\epsilon, but not more stringent than that.

When β1/2\beta\not=1/2, we assume without loss of generality β>1/2\beta>1/2. Then for gα(x)=ecϵ|x|pϵ(x)g_{\alpha}(x)=e^{c\epsilon|x|}p_{\epsilon}(x), it is easy to verify that for any x>0x>0,

dgα(x)dx=ϵ28(2β1)[wcexp{wc2βϵx}+[wc+c]exp{wc+c2(1β)ϵx}],\displaystyle\frac{dg_{\alpha}(x)}{dx}\!=\!\frac{\epsilon^{2}}{8(2\beta-1)}\!\left[w_{c}\exp\left\{\frac{w_{c}}{2\beta}\epsilon x\right\}\!+\![w_{c}+c^{\prime}]\exp\left\{-\!\frac{w_{c}+c^{\prime}}{2(1-\beta)}\epsilon x\right\}\right],

where wc=2cβ1w_{c}=2c\beta-1 and c=2(1c)c^{\prime}=2(1-c). Our job is to seek the smallest cc such that this derivative is non-negative regardless of the value of xx. Clearly the positivity holds when we set wc=0w_{c}=0, that is c=(2β)1<1c=(2\beta)^{-1}<1, and hence c>0c^{\prime}>0. To show that this is the smallest possible cc, we see that when setting α=ϵ/(2β)\alpha=\epsilon/(2\beta), we have

gα(|xa|)gα(|xb|)=β(1β)exp{τ|xa|}β(1β)exp{τ|xb|},\displaystyle\frac{g_{\alpha}(|x-a|)}{g_{\alpha}(|x-b|)}=\frac{\beta-(1\!-\!\beta)\exp\left\{-\tau|x-a|\right\}}{\beta-(1\!-\!\beta)\exp\left\{-\tau|x-b|\right\}},

where τ=ϵ(2β1)/(2β(1β))>0\tau=\epsilon(2\beta-1)/(2\beta(1\!-\!\beta))>0. Clearly, as |x||x|\rightarrow\infty, the ratio above goes to 11 regardless of the values of aa and bb, as long as they are fixed. Consequently, the same implication as in (A.4) follows: the bound eα|ab|e^{\alpha|a-b|} can be approached arbitrarily closely with α=ϵ/(2β)\alpha=\epsilon/(2\beta), hence it cannot be further improved. That is, for post-processing via L1L_{1} projection, the actual differential privacy protection achieved is ϵ/(2β)\epsilon/(2\beta) when β>1/2\beta>1/2 (and ϵ/(2(1β))\epsilon/(2(1-\beta)) when β<1/2\beta<1/2). This makes intuitive sense: for example, when β=1\beta=1 the injected noise is drawn from a single Lap(2ϵ1)Lap(2\epsilon^{-1}) distribution, corresponding to a privacy loss budget of ϵ/2\epsilon/2.

In summary, for any β[0,1]\beta\in[0,1], the attained privacy loss budget for M(x)=s(x)+u~M(x)=s(x)+\tilde{u} is ϵ/(2max{β,1β})\epsilon/(2\max\{\beta,1-\beta\}).
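This attained level can be verified numerically as well: for pϵ of (A.1) (and its β = 1/2 limit (A.2)), the logarithm of the worst-case density ratio between releases generated from queries one unit apart should approach ϵ/(2 max{β, 1−β}) as the release value grows. The sketch below performs the check on a finite grid, so the computed values fall slightly below that limit; all settings are illustrative.

```python
import numpy as np

# Numerical check of the attained privacy level of the convex-combination
# post-processing: the log worst-case density ratio at shift |a - b| = 1
# should approach eps/(2*max(beta, 1-beta)).
eps = 1.0

def p_eps(x, eps, beta):
    """Density (A.1), with the beta = 1/2 limit (A.2) handled separately."""
    lap = lambda t, s: np.exp(-np.abs(t) / s) / (2.0 * s)
    if beta == 0.5:
        return 0.25 * eps * (1.0 + eps * np.abs(x)) * np.exp(-eps * np.abs(x))
    w1 = beta**2 / (2.0 * beta - 1.0)
    w2 = (1.0 - beta)**2 / (2.0 * beta - 1.0)
    return w1 * lap(x, 2.0 * beta / eps) - w2 * lap(x, 2.0 * (1.0 - beta) / eps)

x = np.linspace(-200, 200, 400_001)   # the sup is approached as |x| grows
for beta in (0.5, 0.7, 0.9):
    ratio = p_eps(x, eps, beta) / p_eps(x - 1.0, eps, beta)
    print(f"beta={beta}: log sup ratio ~ {np.log(ratio.max()):.3f}, "
          f"eps/(2*max(beta,1-beta)) = {eps / (2 * max(beta, 1 - beta)):.3f}")
```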

Appendix B Sampling scheme for the Double Geometric distribution

The Double Geometric mechanism, as introduced in Definition 1.2, utilizes additive noise whose cumulative distribution function is given by

F\left(u\right)=P\left(U\leq u\right)=\begin{cases}\frac{a^{-u}}{1+a}&u\leq 0,\\ 1-\frac{a^{u+1}}{1+a}&u>0,\end{cases}

with quantile function

F^{-1}\left(v\right)=\begin{cases}\left\lceil\frac{-\log v-\log\left(1+a\right)}{\log\left(a\right)}\right\rceil&v\leq\frac{1}{1+a},\\ \left\lfloor\frac{\log\left(1-v\right)+\log\left(1+a\right)}{\log\left(a\right)}\right\rfloor&v>\frac{1}{1+a}.\end{cases}

Hence, one way to sample a Double Geometric random variable is via inverse transform sampling. That is,

UiUnif(0,1),F1(Ui)pi(ϵ),U_{i}\sim Unif\left(0,1\right),\qquad F^{-1}\left(U_{i}\right)\sim{p_{i}\left(\cdot\mid\epsilon\right)},

where pip_{i} is given by (1.3). This method is implemented for all numerical examples illustrated in this paper.
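A minimal sketch of this inverse transform sampler follows; the parameterization a = exp(−ϵ) (unit sensitivity), the seed, and the sanity check against the probability mass (1−a)/(1+a) at zero implied by F are for illustration only.

```python
import numpy as np

# Inverse transform sampler for the Double Geometric noise, following the
# quantile function F^{-1} displayed above (illustrative parameterization).
rng = np.random.default_rng(4)
eps = 1.0
a = np.exp(-eps)   # assumed noise parameter for unit sensitivity

def double_geometric_quantile(v, a):
    """Quantile function F^{-1}(v) of the Double Geometric distribution."""
    v = np.clip(v, 1e-12, 1.0 - 1e-12)   # guard against the endpoints of (0, 1)
    lower = np.ceil((-np.log(v) - np.log(1.0 + a)) / np.log(a))        # v <= 1/(1+a)
    upper = np.floor((np.log(1.0 - v) + np.log(1.0 + a)) / np.log(a))  # v >  1/(1+a)
    return np.where(v <= 1.0 / (1.0 + a), lower, upper).astype(int)

u = double_geometric_quantile(rng.uniform(size=100_000), a)

# Sanity check: P(U = 0) = F(0) - F(-1) = (1 - a)/(1 + a).
print(f"P(U=0): empirical {np.mean(u == 0):.4f} vs exact {(1 - a) / (1 + a):.4f}")
```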

Appendix C Performance Diagnostics of the MIS Algorithm

The acceptance rate of Algorithm 1 is shown in Figure 2 as a function of the proposal inverse scale parameter ϵ~\tilde{\epsilon}. The acceptance rate is highest in this example when ϵ~\tilde{\epsilon} is set to 0.60.6, just slightly larger than the privacy loss budget of the unconstrained privacy mechanism (ϵ=0.5\epsilon=0.5 per cell); the acceptance rate achieved there is about 1.68%1.68\%.

Figure 3 shows traceplots of 10,00010,000 draws from Algorithm 1 of respectively the second (in proposal index set \mathcal{I}) and the first (not in proposal index set \mathcal{I}) cells of the constrained differentially private contingency table, when ϵ~=0.6\tilde{\epsilon}=0.6.

Figure 2. Algorithm 1 acceptance rate as a function of the proposal parameter ϵ~\tilde{\epsilon}.
Figure 3. Traceplots of 10,00010,000 draws from Algorithm 1 of the second (left) and the first (right) cell of the constrained differentially private contingency table.