
Practical Privacy Filters and Odometers with Rényi Differential Privacy
and Applications to Differentially Private Deep Learning

Mathias Lécuyer
Abstract

Differential Privacy (DP) is the leading approach to privacy preserving deep learning. As such, there are multiple efforts to provide drop-in integration of DP into popular frameworks. These efforts, which add noise to each gradient computation to make it DP, rely on composition theorems to bound the total privacy loss incurred over this sequence of DP computations.

However, existing composition theorems present a tension between efficiency and flexibility. Most theorems require all computations in the sequence to have a predefined DP parameter, called the privacy budget. This prevents the design of training algorithms that adapt the privacy budget on the fly, or that terminate early to reduce the total privacy loss. Alternatively, the few existing composition results for adaptive privacy budgets provide complex bounds on the privacy loss, with constants too large to be practical.

In this paper, we study DP composition under adaptive privacy budgets through the lens of Rényi Differential Privacy, proving a simpler composition theorem with smaller constants, making it practical enough to use in algorithm design. We demonstrate two applications of this theorem for DP deep learning: adapting the noise or batch size online to improve a model’s accuracy within a fixed total privacy loss, and stopping early when fine-tuning a model to reduce total privacy loss.


1 Introduction

Performing statistical analyses or training machine learning models has been shown to leak information about individual data points used in the process. Differential Privacy (DP) is the leading approach to prevent such privacy leakage, by adding noise to any computation performed on the data such that the privacy loss resulting from its output is provably bounded. Due to DP’s strong privacy guarantees, multiple efforts exist to integrate it into libraries, including for statistical queries (Google Differential Privacy; OpenDP), machine learning (IBM Differential Privacy Library), and deep learning (TensorFlow Privacy; Opacus). The DP analysis in these libraries relies on composing the results of a sequence of DP computations, consisting for instance of a sequence of statistical queries, or a sequence of mini-batches of gradients in deep learning. Each computation in the sequence can be seen as consuming some privacy budget, which depends on the size of the noise added to the computation’s result. Composition theorems are then applied to bound the total privacy budget, or privacy loss, used over the sequence.

However, efficient DP composition theorems require that all computations in the entire sequence have a privacy budget fixed a priori. This creates a mismatch between the theory and practical requirements. Indeed, users who interact with a dataset through statistical queries or develop machine learning models will typically want to adapt the next query or model’s DP budget allocation based on previous results, resulting in an adaptive behavior. At the algorithm level, this mismatch is visible in the APIs developed for DP deep learning, which track DP computations as they happen during training. Users are responsible for avoiding early stopping or other types of parameter adaptation. This restriction reduces the design space for DP deep learning algorithms, ruling out early stopping to reduce overall privacy loss, DP budgets that adapt during training to improve the final performance, or online architecture search to adapt the complexity of the model to the data within DP constraints.

The main study of adaptive composition (Rogers et al., 2016) defines two relevant interaction models that yield different asymptotic rates for composition theorems. The privacy filter model enforces an a priori bound on the total privacy loss, and can reach the same rate as non-adaptive composition. The privacy odometer model is more flexible, as it provides a running upper-bound on the privacy loss of the sequence of DP computations performed so far, which can stop at any time; however, there is a small penalty to pay in asymptotic rate. (Rogers et al., 2016) provides composition theorems with optimal rate for both models, but the bounds are complex and have large constants, making them impractical. A recent paper, (Feldman & Zrnic, 2020), shows improved filters but unsatisfactory odometers.

In this paper, we study both privacy filters and odometers through the lens of Rényi Differential Privacy (RDP). RDP is an analysis tool for DP composition that enables intuitive privacy loss tracking and provides efficient composition results under non-adaptive privacy budgets. It has become a key building block in DP deep learning libraries (TensorFlow Privacy; Opacus). We show that RDP can be used to provide privacy filters without incurring any cost for adaptivity. We build on this result to construct privacy odometers that remove limitations of previously known results, in particular in regimes relevant for DP deep learning.

We demonstrate two applications of these theoretical results when training DP deep learning models. First, we leverage privacy filters to show that even basic policies that adapt the noise added to gradient computations, or the batch size, on the fly yield an accuracy improvement of more than $2.5\%$ within a fixed privacy loss bound on CIFAR-10. Second, we demonstrate that in a fine-tuning scenario our privacy odometer enables large reductions in total privacy budget using adaptivity and early stopping.

2 Privacy Background

This section provides the necessary background on DP and composition, and introduces some notation. Specific results on adaptive composition are discussed and compared to our contributions throughout the paper.

2.1 Differential Privacy

Definition.

We first recall the definition of (approximate) Differential Privacy (Dwork et al., 2014). A randomized mechanism $M:\mathcal{D}\rightarrow\mathcal{R}$ satisfies $(\epsilon,\delta)$-DP when, for any neighboring datasets $D$ and $D^{\prime}$ in $\mathcal{D}$, and for any $\mathcal{S}\subset\mathcal{R}$:

P(M(D)\in\mathcal{S})\leq e^{\epsilon}P(M(D^{\prime})\in\mathcal{S})+\delta.

This definition depends on the notion of neighboring datasets, which can be domain specific but is typically chosen as the addition or removal of one data point. Since the neighboring relation is symmetric, the DP inequality also is. A smaller privacy budget $\epsilon>0$ implies stronger guarantees, as one draw from the mechanism’s output gives little information about whether it ran on $D$ or $D^{\prime}$. When $\delta=0$, the guarantee is called pure DP, while $\delta>0$ is a relaxation of pure DP called approximate DP.

A useful way of rephrasing the DP guarantee is to define the privacy loss as, for $m\in\mathcal{R}$:

\textrm{Loss}(m)=\log\left(\frac{P\big(M(D)=m\big)}{P\big(M(D^{\prime})=m\big)}\right).

If with probability at least $(1-\delta)$ over $m\sim M(D)$:

|\textrm{Loss}(m)|\leq\epsilon, \qquad (1)

then the mechanism $M$ is $(\epsilon,\delta)$-DP. We can see in this formulation that $\epsilon$ is a bound on the privacy loss. Approximate DP allows for a $\delta>0$ failure probability on that bound.

Composition.

Composition is a core DP property. Composition theorems bound the total privacy loss incurred over a sequence of DP computations. This is useful both as a low level tool for algorithm design, and to account for the privacy loss of repeated data analyses performed on the same dataset. Basic composition states that a sequence $M_{i}$ of $(\epsilon_{i},\delta_{i})$-DP mechanisms is $(\sum\epsilon_{i},\sum\delta_{i})$-DP. While not efficient, it is simple and provides an intuitive notion of additive privacy budget consumed by DP computations.

Strong composition provides a tighter bound on the privacy loss, showing that the same sequence is $(\epsilon_{g},\delta_{g})$-DP with $\epsilon_{g}=\sqrt{2\log(\frac{1}{\hat{\delta}})\sum\epsilon_{i}^{2}}+\sum\epsilon_{i}(e^{\epsilon_{i}}-1)$ and $\delta_{g}=\hat{\delta}+\sum\delta_{i}$. This result comes with its share of drawbacks though. Each mechanism $M_{i}$ comes with its own trade-off between $\epsilon_{i}$ and $\delta_{i}$, and there is no practical way to track and optimally compose under these trade-offs. This has led to multiple special purpose composition theorems, such as the moments accountant for DP deep learning (Abadi et al., 2016). Strong composition results also require the sequence of DP parameters to be fixed in advance.
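As a quick illustration of the gap between the two results, the following short Python sketch (ours, not from the paper) compares basic and strong composition for $k$ identical $(\epsilon,\delta)$-DP mechanisms:

```python
import math

def basic_composition(eps, delta, k):
    """Basic composition of k identical (eps, delta)-DP mechanisms: budgets add up."""
    return k * eps, k * delta

def strong_composition(eps, delta, k, delta_hat):
    """Strong composition of k identical (eps, delta)-DP mechanisms,
    following the bound quoted above."""
    eps_g = math.sqrt(2 * math.log(1 / delta_hat) * k * eps ** 2) \
        + k * eps * (math.exp(eps) - 1)
    return eps_g, delta_hat + k * delta

if __name__ == "__main__":
    eps, delta, k = 0.1, 1e-6, 300
    print(basic_composition(eps, delta, k))           # epsilon grows linearly in k
    print(strong_composition(eps, delta, k, 1e-6))    # roughly sqrt(k) growth instead
```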

2.2 Rényi Differential Privacy

RDP (Mironov, 2017) is another relaxation of pure DP that always implies approximate DP, though the converse is not always true.

Rényi Differential Privacy (RDP).

RDP expresses its privacy guarantee using the Rényi divergence:

D_{\alpha}(P||Q)\triangleq\frac{1}{\alpha-1}\log\mathbb{E}_{x\sim Q}\Big(\frac{P(x)}{Q(x)}\Big)^{\alpha},

where $\alpha\in\,]1,+\infty]$ is the order of the divergence. A randomized mechanism $M$ is $(\alpha,\epsilon)$-RDP if:

D_{\alpha}(M(D)||M(D^{\prime}))\leq\epsilon.

RDP composition.

A sequence of $(\alpha,\epsilon_{i})$-RDP mechanisms is $(\alpha,\sum\epsilon_{i})$-RDP. We can see that RDP re-introduces an intuitive additive privacy budget $\epsilon$, at each order $\alpha$.

Tracking the RDP budget over multiple orders enables one to account for the specific trade-off curve of each mechanism in a sequence of computations. Indeed, for an $(\alpha,\epsilon)$-RDP mechanism:

P_{m\sim M(D)}\Big(|\textrm{Loss}(m)|>\epsilon+\frac{\log(1/\delta)}{\alpha-1}\Big)\leq\delta, \qquad (2)

which in turn implies $(\epsilon+\frac{\log(1/\delta)}{\alpha-1},\delta)$-DP. Although RDP composition is similar to strong composition in the worst case, tracking the privacy loss over the curve of Rényi orders often yields smaller privacy loss bounds in practice.
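For concreteness, here is a small Python sketch (ours) of this accounting for the Gaussian mechanism, whose RDP curve on a sensitivity-1 query is $\epsilon(\alpha)=\alpha/(2\sigma^{2})$ (Mironov, 2017); it composes the curve over several steps and converts it to $(\epsilon,\delta)$-DP with Equation 2, keeping the best order:

```python
import math

def gaussian_rdp(alpha, sigma, steps=1):
    """RDP curve of `steps` compositions of the Gaussian mechanism on a
    sensitivity-1 query: epsilon(alpha) = alpha / (2 sigma^2) per step (Mironov, 2017)."""
    return steps * alpha / (2 * sigma ** 2)

def rdp_to_dp(rdp_curve, delta):
    """Convert an RDP curve {alpha: epsilon_RDP(alpha)} into an (epsilon, delta)-DP
    guarantee using Equation 2, keeping the best order."""
    return min(eps + math.log(1 / delta) / (alpha - 1)
               for alpha, eps in rdp_curve.items())

orders = [1.25, 1.5, 2, 4, 8, 16, 32, 64]
sigma, steps, delta = 2.0, 500, 1e-5
curve = {a: gaussian_rdp(a, sigma, steps) for a in orders}
print(f"({rdp_to_dp(curve, delta):.2f}, {delta})-DP after {steps} steps")
```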

3 Privacy Filters

We first focus on privacy filters, which allow the composition of adaptive DP budgets within an upper-bound on the privacy loss that is fixed a priori. That is, the sequence of computations’ budgets is not known in advance and can depend on previous results, and the filter ensures that the upper-bound is always valid.

3.1 The Privacy Filter Setup

Complex sequences of interactions with the data, such as the adaptive privacy budget interactions we focus on in this work, can be challenging to express precisely. Following (Rogers et al., 2016), we express them in the form of an algorithm that describes the power and constraints of an analyst (or adversary, since it models worst-case behavior) interacting with the data. Algorithm 1 shows this interaction setup for a privacy filter.

Specifically, an adversary $A$ will interact for $k$ rounds ($k$ does not appear in the bounds and can be chosen arbitrarily large), in one of two “neighboring worlds” indexed by $b\in\{0,1\}$. Typically, the two worlds will represent two neighboring datasets with an observation added or removed, but this formulation allows for more flexibility by letting $A$ choose two neighboring datasets at each round, as well as the DP mechanism to use and its RDP parameters. Each decision can be made conditioned on the previous results.

Our RDP filter $\textrm{FILT}_{\alpha,\epsilon_{\textrm{RDP}}}$ of order $\alpha$ can PASS on a computation at any time, setting $\epsilon_{i}=0$ and returning $\perp$ to $A$. Sections 3.2 and 3.3 describe how to instantiate this filter to provide RDP and DP guarantees. The final output of the algorithm is a view $V^{b}=(v_{1},\ldots,v_{k})$ of all results from the computations. With a slight abuse of notation, we call $V^{b}_{i}$ the distribution over possible outcomes for the $i^{th}$ query, and $v_{i}\sim V^{b}_{i}$ an observed realization.

This way, we can explicitly write the dependencies of the $i^{th}$ query’s distribution as $V^{b}_{i}(v_{i}|v_{\leq i-1})$, where $v_{\leq i}$ is the view of all realized outcomes up to, and including, the $i^{th}$ query. Similarly, the joint distribution over possible views is $V^{b}(v_{\leq n})$. The budget of the $i^{th}$ query, $\epsilon_{i}$, depends only on the queries up to the previous one, which we note explicitly as $\epsilon_{i}(v_{\leq i-1})$. DP seeks to bound the information about $b$ that $A$ can learn from the results of its DP computations. Following Equation 1, we want to bound with high probability under $v\sim V^{0}$ the privacy loss incurred by the sequence of computations $|\textrm{Loss}(v)|=\big|\log\big(\frac{V^{0}(v_{\leq n})}{V^{1}(v_{\leq n})}\big)\big|$.

Algorithm 1 $\textrm{FilterComp}(A,k,b,\alpha;\textrm{FILT}_{\alpha,\epsilon_{\textrm{RDP}}})$
  for $i=1,\ldots,k$ do
     $A=A(v_{1},\ldots,v_{i-1})$ gives neighboring $D_{i}^{0},D_{i}^{1}$, $\epsilon_{i}$, and $M_{i}:\mathcal{D}\rightarrow\mathcal{R}$ an $(\alpha,\epsilon_{i})$-RDP mechanism
     if $\textrm{FILT}_{\alpha,\epsilon_{\textrm{RDP}}}(\epsilon_{1},\ldots,\epsilon_{i},0,\ldots,0)=\textrm{PASS}$ then
        $\epsilon_{i}=0$
        $A$ receives $v_{i}=V^{b}_{i}=\perp$
     else
        $A$ receives $v_{i}\sim V^{b}_{i}=M_{i}(D_{i}^{b})$
     end if
  end for
  return the view $V^{b}=(v_{1},\ldots,v_{k})$.

3.2 Rényi Differential Privacy Analysis

We first analyse the filter $\textrm{FILT}_{\alpha,\epsilon_{\textrm{RDP}}}(\epsilon_{1},\ldots,\epsilon_{i},0,\ldots,0)$ that will PASS when:

\sum_{i}\epsilon_{i}>\epsilon_{\textrm{RDP}}. \qquad (3)

The filter PASSes when the new query would cause the sum of all budgets used so far to go over $\epsilon_{\textrm{RDP}}$. We next show that this filter ensures that Algorithm 1 is $(\alpha,\epsilon_{\textrm{RDP}})$-RDP.

Theorem 1 (RDP Filter).

The interaction mechanism from Algorithm 1, instantiated with the filter from Equation 3, is $(\alpha,\epsilon_{\textrm{RDP}})$-RDP. That is: $D_{\alpha}(V^{0}||V^{1})\leq\epsilon_{\textrm{RDP}}$.

Proof.

We start by writing the integral for $D_{\alpha}(V^{0}||V^{1})$ as:

\begin{align*}
e^{(\alpha-1)D_{\alpha}(V^{0}||V^{1})}
&=\int_{\mathcal{R}_{1}\ldots\mathcal{R}_{n}}\big(V^{0}(v_{\leq n})\big)^{\alpha}\big(V^{1}(v_{\leq n})\big)^{1-\alpha}dv_{1}\ldots dv_{n}\\
&=\int_{\mathcal{R}_{1}}\big(V_{1}^{0}(v_{1})\big)^{\alpha}\big(V_{1}^{1}(v_{1})\big)^{1-\alpha}\ldots\int_{\mathcal{R}_{n}}\big(V_{n}^{0}(v_{n}|v_{\leq n-1})\big)^{\alpha}\big(V_{n}^{1}(v_{n}|v_{\leq n-1})\big)^{1-\alpha}dv_{n}\ldots dv_{1}.
\end{align*}

Focusing on the inner-most integral, we have that:

\begin{align*}
\int_{\mathcal{R}_{n}}\big(V_{n}^{0}(v_{n}|v_{\leq n-1})\big)^{\alpha}\big(V_{n}^{1}(v_{n}|v_{\leq n-1})\big)^{1-\alpha}dv_{n}
\leq e^{(\alpha-1)\epsilon_{n}(v_{\leq n-1})}
\leq e^{(\alpha-1)\big(\epsilon_{\textrm{RDP}}-\sum_{i=1}^{n-1}\epsilon_{i}(v_{\leq i-1})\big)},
\end{align*}

where the first inequality follows because $M_{n}$ is $\big(\alpha,\epsilon_{n}(v_{\leq n-1})\big)$-RDP, and the second because the privacy filter enforces that $\sum_{i=1}^{n}\epsilon_{i}\leq\epsilon_{\textrm{RDP}}$, which implies $\epsilon_{n}\leq\epsilon_{\textrm{RDP}}-\sum_{i=1}^{n-1}\epsilon_{i}$, since $\epsilon_{i}\geq 0$.

Note that now, the inner-most integral is bounded by a function of $v_{\leq n-2}$ and not $v_{n-1}$. Thus, we can write:

\begin{align*}
&e^{(\alpha-1)D_{\alpha}(V^{0}||V^{1})}\\
&\leq\int_{\mathcal{R}_{1}}\big(V_{1}^{0}(v_{1})\big)^{\alpha}\big(V_{1}^{1}(v_{1})\big)^{1-\alpha}\ldots\int_{\mathcal{R}_{n-2}}\big(V_{n-2}^{0}(v_{n-2}|v_{\leq n-3})\big)^{\alpha}\big(V_{n-2}^{1}(v_{n-2}|v_{\leq n-3})\big)^{1-\alpha}\\
&\quad\Big\{e^{(\alpha-1)\big(\epsilon_{\textrm{RDP}}-\sum_{i=1}^{n-1}\epsilon_{i}(v_{\leq i-1})\big)}\int_{\mathcal{R}_{n-1}}\big(V_{n-1}^{0}(v_{n-1}|v_{\leq n-2})\big)^{\alpha}\big(V_{n-1}^{1}(v_{n-1}|v_{\leq n-2})\big)^{1-\alpha}dv_{n-1}\Big\}dv_{n-2}\ldots dv_{1}\\
&\leq\int_{\mathcal{R}_{1}}\big(V_{1}^{0}(v_{1})\big)^{\alpha}\big(V_{1}^{1}(v_{1})\big)^{1-\alpha}\ldots\int_{\mathcal{R}_{n-2}}\big(V_{n-2}^{0}(v_{n-2}|v_{\leq n-3})\big)^{\alpha}\big(V_{n-2}^{1}(v_{n-2}|v_{\leq n-3})\big)^{1-\alpha}\\
&\quad\Big\{e^{(\alpha-1)\big(\epsilon_{\textrm{RDP}}-\sum_{i=1}^{n-1}\epsilon_{i}(v_{\leq i-1})\big)}e^{(\alpha-1)\epsilon_{n-1}(v_{\leq n-2})}\Big\}dv_{n-2}\ldots dv_{1}\\
&=\int_{\mathcal{R}_{1}}\big(V_{1}^{0}(v_{1})\big)^{\alpha}\big(V_{1}^{1}(v_{1})\big)^{1-\alpha}\ldots\int_{\mathcal{R}_{n-2}}\big(V_{n-2}^{0}(v_{n-2}|v_{\leq n-3})\big)^{\alpha}\big(V_{n-2}^{1}(v_{n-2}|v_{\leq n-3})\big)^{1-\alpha}\\
&\quad\Big\{e^{(\alpha-1)\big(\epsilon_{\textrm{RDP}}-\sum_{i=1}^{n-2}\epsilon_{i}(v_{\leq i-1})\big)}\Big\}dv_{n-2}\ldots dv_{1}\\
&\leq e^{(\alpha-1)\epsilon_{\textrm{RDP}}},
\end{align*}

where the first inequality takes the previous bound out of the integral, as it is a constant w.r.t. $v_{n-1}$, the second inequality comes from $M_{n-1}$ being $\big(\alpha,\epsilon_{n-1}(v_{\leq n-2})\big)$-RDP, and the equality combines the two exponentials and cancels $\epsilon_{n-1}$ in the sum. As we can see, the term in the brackets is now only a function of $v_{\leq n-3}$ and not $v_{n-2}$. The last inequality recursively applies the same reasoning to each integral, removing all the $\epsilon_{i}$ in turn. Taking the log of each side of the global inequality concludes the proof. ∎
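As a concrete illustration of the filter from Equation 3, here is a minimal Python sketch of its bookkeeping at a single order $\alpha$ (the class and method names are ours, not part of any library):

```python
class RenyiFilter:
    """Privacy filter from Equation 3, at a single RDP order alpha: adaptively
    chosen budgets are accepted as long as their running sum stays below the cap."""

    def __init__(self, eps_rdp_total):
        self.eps_rdp_total = eps_rdp_total
        self.spent = 0.0

    def try_consume(self, eps_i):
        """Return True (and account for eps_i) if the query may run, False to PASS."""
        if self.spent + eps_i > self.eps_rdp_total:
            return False  # PASS: the query is not executed, its budget counts as 0
        self.spent += eps_i
        return True

# Adaptively chosen budgets against a total RDP budget of 8.0 at this order.
filt = RenyiFilter(eps_rdp_total=8.0)
for eps in [1.0, 2.5, 3.0, 2.0, 1.5]:
    print(eps, "accepted" if filt.try_consume(eps) else "PASS")
```

Note that a PASSed query does not end the interaction: later, smaller queries can still be accepted as long as they fit under the cap.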

3.3 Composing Over the Rényi Curve

A powerful feature of RDP is the ability to compose mechanisms over their whole $(\alpha,\epsilon)$ curve. Since for many mechanisms this curve is not linear, it is unclear in advance which $\alpha$ will yield the best final $(\epsilon,\delta)$-DP guarantee through Equation 2. This composition “over all $\alpha$s” is important, as it enables RDP’s strong composition results.

We can extend our privacy filter and Theorem 1 to this case. Consider a (possibly infinite) sequence of $\alpha$ values to track, denoted $\Lambda$, and an upper bound $\epsilon_{\textrm{RDP}}(\alpha)$ for each of these values. In Algorithm 1, the mechanism $M_{i}$ is now $\big(\alpha,\epsilon_{i}(\alpha)\big)$-RDP for $\alpha\in\Lambda$. The filter $\textrm{FILT}_{\alpha,\epsilon_{\textrm{RDP}}(\alpha)}\big(\epsilon_{1}(\alpha),\ldots,\epsilon_{i}(\alpha),0,\ldots,0\big)$ will now PASS when:

\forall\alpha\in\Lambda,\ \sum_{i}\epsilon_{i}(\alpha)>\epsilon_{\textrm{RDP}}(\alpha). \qquad (4)
Corollary 1 (RDP Filter over the Rényi Curve).

The interaction mechanism from Algorithm 1, instantiated with the filter from Equation 4, is such that $\exists\alpha:D_{\alpha}(V^{0}||V^{1})\leq\epsilon_{\textrm{RDP}}(\alpha)$.

Proof.

We argue by contradiction that $\exists\alpha:\sum_{i}\epsilon_{i}(\alpha)\leq\epsilon_{\textrm{RDP}}(\alpha)$. Assume this was not the case, and consider $j=\arg\max_{i}(\max_{\alpha}\epsilon_{i}(\alpha)\neq 0)$, the maximal index for which the filter did not PASS. Then for all $\alpha$, $\sum_{i\leq j}\epsilon_{i}(\alpha)=\sum_{i}\epsilon_{i}(\alpha)>\epsilon_{\textrm{RDP}}(\alpha)$, so the filter PASSed on query $j$ and $\forall\alpha:\epsilon_{j}(\alpha)=0$, which contradicts the definition of $j$. Applying Theorem 1 to this $\alpha$ concludes the proof. ∎

3.4 $(\epsilon,\delta)$-DP Implications

A key application of Corollary 1 is to build a privacy filter that enforces an $(\epsilon_{\textrm{DP}},\delta)$-DP guarantee, and leverages RDP to ensure strong composition. Intuitively, for each $\alpha$ the corresponding $\epsilon_{\textrm{RDP}}(\alpha)$ is set to the value which, for this $\alpha$, implies the desired $(\epsilon_{\textrm{DP}},\delta)$-DP target. Formally, based on Equation 2 we set the filter from Equation 4 to $\forall\alpha\in\Lambda:\epsilon_{\textrm{RDP}}(\alpha)=\epsilon_{\textrm{DP}}-\frac{\log(1/\delta)}{\alpha-1}$. Applying Corollary 1 followed by Equation 2 shows that under this filter, Algorithm 1 is $(\epsilon_{\textrm{DP}},\delta)$-DP.
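The following Python sketch (ours) combines Equation 4 with this choice of per-order caps: a query is PASSed only if it would exceed the cap at every tracked order, and otherwise its cost is accounted at all orders.

```python
import math

class RenyiCurveFilter:
    """Sketch of the filter from Equation 4, with per-order caps chosen so that
    any surviving order implies the (eps_dp, delta)-DP target through Equation 2."""

    def __init__(self, orders, eps_dp, delta):
        self.caps = {a: eps_dp - math.log(1 / delta) / (a - 1) for a in orders}
        self.spent = {a: 0.0 for a in orders}

    def try_consume(self, eps_curve):
        """eps_curve maps each order to the query's RDP cost. PASS (return False)
        only if the query would exceed the cap at every tracked order."""
        if all(self.spent[a] + eps_curve[a] > self.caps[a] for a in self.caps):
            return False
        for a in self.caps:
            self.spent[a] += eps_curve[a]
        return True

# Gaussian-mechanism steps (cost alpha / (2 sigma^2) per order) under a (2.0, 1e-5)-DP target.
orders = [2, 4, 8, 16, 32]
filt = RenyiCurveFilter(orders, eps_dp=2.0, delta=1e-5)
step_cost = {a: a / (2 * 5.0 ** 2) for a in orders}  # sigma = 5
steps = 0
while filt.try_consume(step_cost):
    steps += 1
print("accepted steps:", steps)
```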

This is significant, as this construction yields strong DP composition with a simple additive (RDP) privacy filter. There is no overhead compared to non-adaptive RDP, which yields tighter results than DP strong composition in many practical settings, such as the composition of Gaussian mechanisms ubiquitous in DP deep learning. In contrast, Theorem 5.1 from (Rogers et al., 2016) is restricted to DP strong composition, and pays a significant multiplicative constant. The result from Theorem 1 already appears in (Feldman & Zrnic, 2020) as a corollary to a more general result. Here we show a proof closer to that of RDP composition, which we re-use in subsequent results. We also explicitly show how to implement filters over the Rényi curve in §3.3.

4 Privacy Odometers

Privacy filters are useful to support sequences of computations with adaptive privacy budgets, under a fixed total privacy loss. In many cases, however, we are interested in a more flexible setup in which we can also stop at any time, and bound the privacy loss of the realized sequence. Examples of such use-cases include early stopping when training or fine-tuning deep learning models, “accuracy first” machine learning (Ligett et al., 2017), or an analyst interacting with the data until a particular question is answered.

4.1 The Privacy Odometer Setup

Algorithm 2 formalizes our privacy odometer setup. The adversary $A$ can now interact with the data in an unrestricted fashion. For the given RDP order $\alpha$, the odometer returns the view and the RDP parameter $\epsilon_{\textrm{RDP}}^{tot}$, which must be a running upper-bound on $D_{\alpha}\big(V^{0}(v_{\leq i})||V^{1}(v_{\leq i})\big)$ since the interaction can stop at any time.

Algorithm 2 $\textrm{OdometerComp}(A,k,b,\alpha,\epsilon_{\textrm{RDP}}^{f\in\mathbb{N}_{1}})$
  for $i=1,\ldots,k$ do
     $A=A(v_{1},\ldots,v_{i-1})$
     $A$ gives neighboring $D_{i}^{0},D_{i}^{1}$, $\epsilon_{i}$, and $M_{i}:\mathcal{D}\rightarrow\mathcal{R}$ an $(\alpha,\epsilon_{i})$-RDP mechanism
     $f=\arg\min_{g}\{\textrm{FILT}_{\alpha,\epsilon_{\textrm{RDP}}^{g}}(\epsilon_{1},\ldots,\epsilon_{i},0,\ldots,0)\neq\textrm{PASS}\}$
     $A$ receives $v_{i}\sim V^{b}_{i}=M_{i}(D_{i}^{b})$
  end for
  return the view $V^{b}=(v_{1},\ldots,v_{k})$, stopping filter $f$, and spent budget $\epsilon_{\textrm{RDP}}^{tot}=\epsilon_{\textrm{RDP}}^{f}$.

Analyzing such an unbounded interaction directly with RDP is challenging, as the RDP parameters themselves can change at each step based on previous realizations of the random variable, and there is no limit enforced on their sum. We sidestep this challenge by instantiating the odometer with a sequence of filter sizes $\epsilon_{\textrm{RDP}}^{f}$, with $f\in\mathbb{N}_{1}$ a strictly positive integer. After each new RDP computation, we compute the smallest $f$ such that a filter $\textrm{FILT}_{\alpha,\epsilon_{\textrm{RDP}}^{f}}$ instantiated with budget $\epsilon_{\textrm{RDP}}^{f}$ would not PASS. We show how to analyse the privacy guarantees of this construct in Sections 4.2 and 4.3, before extending it to support RDP accounting over the Rényi curve in Section 4.4. Finally, Section 4.5 discusses how to instantiate $\epsilon_{\textrm{RDP}}^{f\in\mathbb{N}_{1}}$ to get efficient odometers.
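A minimal Python sketch of this bookkeeping at a single order $\alpha$ (class and variable names are ours; the threshold values below are purely illustrative):

```python
class RenyiOdometer:
    """Sketch of the odometer bookkeeping from Algorithm 2 at a single order alpha.
    `thresholds` is the increasing sequence eps_RDP^1 < eps_RDP^2 < ...; after each
    query we report the smallest filter index f whose budget would not PASS."""

    def __init__(self, thresholds):
        self.thresholds = thresholds
        self.spent = 0.0

    def consume(self, eps_i):
        self.spent += eps_i
        for f, cap in enumerate(self.thresholds, start=1):
            if self.spent <= cap:
                return f, cap  # current filter index and running RDP upper-bound
        raise RuntimeError("RDP consumption exceeds the largest pre-declared filter")

# Doubling filter sizes, as in Section 4.5 (the base value 0.5 is illustrative).
odometer = RenyiOdometer(thresholds=[0.5 * 2 ** (f - 1) for f in range(1, 10)])
for eps in [0.2, 0.3, 0.7, 1.0]:
    print(odometer.consume(eps))
```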

4.2 Rényi Differential Privacy Analysis

To analyse Algorithm 2, we define a truncation operator $\textrm{trunc}_{\leq f}$, which changes the behavior of an adversary $A$ to match that of a privacy filter of size $\epsilon_{\textrm{RDP}}^{f}$. Intuitively, $\textrm{trunc}_{\leq f}(A)$ follows $A$ exactly until $A$ outputs an RDP mechanism that would cause a switch to the $(f+1)^{th}$ filter. When this happens, $\textrm{trunc}_{\leq f}(A)$ stops interacting until the end of the $k$ queries. More precisely, at round $i$, $\textrm{trunc}_{\leq f}(A)$ outputs the same $D_{i}^{0},D_{i}^{1}$, $\epsilon_{i}$, and $M_{i}$ as $A$ as long as $\arg\min_{g}\{\textrm{FILT}_{\alpha,\epsilon_{\textrm{RDP}}^{g}}(\epsilon_{1},\ldots,\epsilon_{i},0,\ldots,0)\neq\textrm{PASS}\}\leq f$. Otherwise, $\textrm{trunc}_{\leq f}(A)$ returns $\epsilon_{i}=0$ and $M_{i}:x\rightarrow\perp$.

We similarly denote $\textrm{trunc}_{\leq f}(V^{b})$ the distribution over possible views when adversary $\textrm{trunc}_{\leq f}(A)$ interacts following Algorithm 2, and $\textrm{trunc}_{\leq f}(v)$ a corresponding truncation of a view generated by adversary $A$. This truncated adversary behaves like a privacy filter, yielding the following result.

Corollary 2 (Truncated RDP Odometer).

The interaction mechanism from Algorithm 2 is such that $D_{\alpha}\big(\textrm{trunc}_{\leq f}(V^{0})||\textrm{trunc}_{\leq f}(V^{1})\big)\leq\epsilon_{\textrm{RDP}}^{f}$.

Proof.

The interaction from Algorithm 2 with $\textrm{trunc}_{\leq f}(A)$ is identical to that of Algorithm 1 with $\textrm{trunc}_{\leq f}(A)$ instantiated with $\textrm{FILT}_{\alpha,\epsilon_{\textrm{RDP}}^{f}}$, since the filter would never PASS. The statement follows from Theorem 1. ∎

We note that the odometer from (Feldman & Zrnic, 2020) (Proposition 5.2) is closer to this truncated odometer than to the odometer from (Rogers et al., 2016). Indeed, the guarantee it gives is for $k$ stacked privacy filters, where $k$ is fixed in advance. If applied with an adaptive $k$ (e.g., early stopping), the result would violate the lower bound proven in (Rogers et al., 2016), Theorem 6.1. We next show how to provide odometer semantics with RDP composition bounds.

4.3 $(\epsilon,\delta)$-DP Implications

We are now ready to bound the privacy loss of Algorithm 2.

Theorem 2 (DP Odometer).

For a given outcome $v$ and $f$ of the interaction described in Algorithm 2, we have that $P\big(|\textrm{Loss}(v)|>\epsilon_{\textrm{RDP}}^{f}+\frac{\log(2f^{2}/\delta)}{\alpha-1}\big)\leq\delta$.

This implies that the interaction is $(\epsilon_{\textrm{RDP}}^{f}+\frac{\log(2f^{2}/\delta)}{\alpha-1},\delta)$-DP.

Proof.

We first show an important relationship between the interaction of Algorithm 2 and its truncated version. Denote $\textrm{PASS}_{f}$ the event that $\textrm{trunc}_{\leq f}(V^{0})$ stopped, that is, $\textrm{trunc}_{\leq f}(v)$ includes at least one $\perp$. Denote $F$ the random variable from which $f$ is drawn (its randomness comes from the randomness in both $A$ and $V$).

Further denote $\widetilde{\textrm{Loss}}(\tilde{v})=\log\big(\frac{P(\textrm{trunc}_{\leq f}(V^{0})=\tilde{v})}{P(\textrm{trunc}_{\leq f}(V^{1})=\tilde{v})}\big)$ the privacy loss of an outcome $\tilde{v}\sim\textrm{trunc}_{\leq f}(V^{0})$ in the truncated version of Algorithm 2. Because truncation at $f$ does not change anything when $F\leq f$, we have that:

\begin{align*}
&P(V^{b}=v\,\big|\,F\leq f)=P(\textrm{trunc}_{\leq f}(V^{b})=v\,\big|\,\lnot\textrm{PASS}_{f})\\
\Rightarrow\ &P\big(|\textrm{Loss}(v)|>c\,\big|\,F\leq f\big)=P\big(|\widetilde{\textrm{Loss}}(v)|>c\,\big|\,\lnot\textrm{PASS}_{f}\big)\\
\Rightarrow\ &P\big(|\textrm{Loss}(v)|>c,F\leq f\big)=P\big(|\widetilde{\textrm{Loss}}(v)|>c,\lnot\textrm{PASS}_{f}\big),
\end{align*}

where the last implication uses the fact that $P(F\leq f)=P(\lnot\textrm{PASS}_{f})$, by definition of the truncation operator and of the $\textrm{PASS}_{f}$ event.

Second, we bound the privacy loss of the non-truncated interaction as follows:

\begin{align*}
P\big(|\textrm{Loss}(v)|>\epsilon_{\textrm{RDP}}^{f}+\tfrac{\log(2f^{2}/\delta)}{\alpha-1}\big)
&=\sum_{f}P\big(|\textrm{Loss}(v)|>\epsilon_{\textrm{RDP}}^{f}+\tfrac{\log(2f^{2}/\delta)}{\alpha-1},F=f\big)\\
&\leq\sum_{f}P\big(|\textrm{Loss}(v)|>\epsilon_{\textrm{RDP}}^{f}+\tfrac{\log(2f^{2}/\delta)}{\alpha-1},F\leq f\big)\\
&=\sum_{f}P\big(|\widetilde{\textrm{Loss}}(v)|>\epsilon_{\textrm{RDP}}^{f}+\tfrac{\log(2f^{2}/\delta)}{\alpha-1},\lnot\textrm{PASS}_{f}\big)\\
&\leq\sum_{f}P\big(|\widetilde{\textrm{Loss}}(v)|>\epsilon_{\textrm{RDP}}^{f}+\tfrac{\log(2f^{2}/\delta)}{\alpha-1}\big)\\
&\leq\sum_{f}\frac{\delta}{2f^{2}}\leq\delta,
\end{align*}

where the second equality follows from the previous step, and the penultimate inequality applies Corollary 2 with $\delta=\frac{\delta}{2f^{2}}$, together with Equation 2. The final inequality uses $\sum_{f\geq 1}\frac{1}{2f^{2}}\leq 1$. ∎

4.4 Composing Over the Rényi Curve

Theorem 2 gives us an RDP-based bound on the privacy loss, with a penalty for providing a running upper-bound over the sequence of $\epsilon_{\textrm{RDP}}^{f}$. To fully leverage RDP strong composition, we once again need to track the RDP budget over the Rényi curve. Applying a union bound shows that, for all $\alpha\in\Lambda$:

P\big(|\textrm{Loss}(v)|>\epsilon_{\textrm{RDP}}^{f}+\frac{\log(|\Lambda|2f^{2}/\delta)}{\alpha-1}\big)\leq\delta. \qquad (5)

Concretely, we track the budget spent by $A$ in Algorithm 2 over a set of RDP orders $\alpha\in\Lambda$, denoted $\epsilon_{\textrm{RDP}}(\alpha)$. When mapping this RDP consumption to a DP bound, for each $\alpha$ we find $f$ such that $\epsilon_{\textrm{RDP}}(\alpha)\leq\epsilon_{\textrm{RDP}}^{f}+\frac{\log(|\Lambda|2f^{2}/\delta)}{\alpha-1}$. The minimum value of the right-hand side over all orders is our upper-bound on the privacy loss.
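A Python sketch (ours) of this mapping, using the doubling filter sizes introduced in Section 4.5 below; the helper names and parameter values are illustrative only:

```python
import math

def odometer_dp_bound(spent, thresholds, delta):
    """Sketch of the mapping described above: `spent` maps each order alpha to the RDP
    budget consumed so far, `thresholds(alpha, f)` returns eps_RDP^f(alpha). For each
    order we take the smallest filter covering the spent budget, add the odometer
    penalty from Equation 5, and keep the best order."""
    n_orders = len(spent)
    best = math.inf
    for alpha, eps_spent in spent.items():
        f = 1
        while thresholds(alpha, f) < eps_spent:
            f += 1
        penalty = math.log(n_orders * 2 * f ** 2 / delta) / (alpha - 1)
        best = min(best, thresholds(alpha, f) + penalty)
    return best

# Doubling filters from Section 4.5: eps_RDP^f(alpha) = 2^(f-1) log(2|Lambda|/delta) / (alpha-1).
orders = [1.5, 2, 4, 8, 16, 32]
delta = 1e-5

def thresholds(alpha, f):
    return 2 ** (f - 1) * math.log(2 * len(orders) / delta) / (alpha - 1)

# RDP spent by 100 Gaussian-mechanism steps with sigma = 5 (illustrative numbers).
spent = {a: 100 * a / (2 * 5 ** 2) for a in orders}
print(f"running DP bound: eps = {odometer_dp_bound(spent, thresholds, delta):.2f} at delta = {delta}")
```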

4.5 Instantiating the Odometer

There are two main choices to make when instantiating our privacy odometer. The first choice is that of $\epsilon_{\textrm{RDP}}^{f}(\alpha)$. At a given order $\alpha$, the best DP guarantee we can hope for is $\epsilon=\frac{\log(2|\Lambda|/\delta)}{\alpha-1}$. We set $\epsilon_{\textrm{RDP}}^{1}(\alpha)=\frac{\log(2|\Lambda|/\delta)}{\alpha-1}$, ensuring that we can always yield a DP guarantee within a factor two of the lowest possible one when $f=1$. We then preserve this factor-two upper-bound by doubling the filter at every $f$, yielding:

\epsilon_{\textrm{RDP}}^{f}(\alpha)=2^{f-1}\frac{\log(2|\Lambda|/\delta)}{\alpha-1}.

Note that the final DP guarantee will not be a step function doubling at every step, as we can choose the best DP guarantee over all $\alpha$s a posteriori.

We then need to choose the set $\Lambda$ of RDP orders to track, with multiple considerations to balance. First, the final DP guarantee implied by Equation 5 has a dependency on $\sqrt{\log(|\Lambda|)}$, so $|\Lambda|$ cannot grow too large. Second, the largest $\alpha$ will constrain the lowest possible DP guarantee. (Rogers et al., 2016) suggests that a minimum DP guarantee of $\epsilon_{\textrm{DP}}=\frac{1}{n^{2}}$, where $n$ is the size of the dataset, is always sufficient. This is because the result of a DP interaction has to be almost independent of the data at such an $\epsilon_{\textrm{DP}}$, rendering accounting at a lower granularity meaningless. In this case, setting $\Lambda=\big\{2^{i},i\in\{1,\ldots,\log_{2}(n^{2})\}\big\}$ yields the same optimal dependency on $n$ as (Rogers et al., 2016), Theorems 6.1 and 6.3. In contrast to the result from (Rogers et al., 2016), the resulting odometer does not degrade for values of $\epsilon_{\textrm{DP}}$ larger than $1$, making it more practical in situations where large budgets are expected.

In practice, many applications have a high DP cost for which the minimal granularity of $\frac{1}{n^{2}}$ is too extreme. Such applications can gain from a higher resolution of RDP orders in the relevant range, yielding a tighter RDP bound. This effect is particularly important due to our exponential discretisation of the RDP budget spent at each order. By choosing the range of RDP orders $\Lambda$, we can thus trade off the $\log(|\Lambda|)$ cost with the RDP order resolution and the minimum budget granularity. This ease of adaptation, combined with efficient RDP accounting, makes our privacy odometer the first to be practical enough to apply to DP deep learning. In the rest of this paper, we use $\Lambda=\{1.25,1.5,1.75,\ldots,9.75,10,16,32\}$ for all results. These are typical values used in RDP accounting for deep learning, with a slightly higher resolution than the one given as an example in (Mironov, 2017).
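The following small Python sketch (ours; the dataset size and $\delta$ are assumptions for illustration) contrasts the two choices of $\Lambda$ and the smallest DP guarantee each can express, $\frac{\log(2|\Lambda|/\delta)}{\alpha_{\max}-1}$:

```python
import math

delta, n = 1e-5, 50_000  # delta and dataset size are assumptions for illustration

# Coarse choice discussed above: powers of two up to log2(n^2), reaching very small budgets.
coarse = [2.0 ** i for i in range(1, int(math.log2(n ** 2)) + 1)]

# Higher-resolution choice used in our experiments: a fine grid up to 10, then 16 and 32.
fine = [1.25 + 0.25 * i for i in range(36)] + [16.0, 32.0]

for name, lam in [("coarse", coarse), ("fine", fine)]:
    floor = math.log(2 * len(lam) / delta) / (max(lam) - 1)
    print(f"{name}: |Lambda| = {len(lam)}, smallest expressible eps_DP ~ {floor:.2e}")
```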

5 Applications to DP Deep Learning

Practical privacy filters and odometers open a new design space for DP, in particular in its applications to deep learning models. We next showcase early applications of both constructs for classification models on the CIFAR-10 dataset (Krizhevsky, 2009). Our goal is not to develop state-of-the-art DP models, but to show the promise of adaptive DP budgets in controlled experiments. Our Opacus-based implementation of privacy filters and odometers, as well as the code to replicate experiments, is available at https://github.com/matlecu/adaptive_rdp.

Figure 1: Adaptive strategies. Panels: (a) Test Set Accuracy; (b) Budget Consumed; (c) Noise Adaptation Example; (d) Batch Adaptation Example. Figures 1(a) and 1(b) respectively show the test set accuracy and DP budget consumed during training, over five independent runs. Adaptive policies can train for longer, for a better average accuracy ($+2.64\%$ for noise adaptation). Figures 1(c) and 1(d) show a representative instance of the evolution of noise and batch size.

5.1 Privacy Filter Applications

Intuitively, adaptivity can be used to save budget during training to perform more steps in total, increasing accuracy under a fixed total privacy loss. By far the most common method to train deep learning models with DP is DP Stochastic Gradient Descent (DP-SGD) (Abadi et al., 2016), which clips the gradients of individual data points to bound their influence on each SGD step, and adds Gaussian noise to the mini-batch’s gradients to ensure that each parameter update is DP. Composition is then used to compute the total privacy loss incurred when training the model.
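As a reference point, here is a simplified PyTorch sketch of one DP-SGD step (a slow per-example loop rather than Opacus's vectorized implementation; function and parameter names are ours):

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer, clip_norm=1.0, sigma=1.0):
    """One DP-SGD step: clip each example's gradient to clip_norm, sum them,
    add Gaussian noise with std sigma * clip_norm, average, and take an SGD step."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(batch_x, batch_y):  # per-example gradients (slow but explicit)
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.norm() ** 2 for g in grads))
        scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * sigma * clip_norm
        p.grad = (s + noise) / len(batch_x)
    optimizer.step()
    optimizer.zero_grad()
```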

Adaptation policy.

Starting from a non-adaptive baseline, every 10 epochs we compute the number of correctly classified train set images. Such a computation needs to be DP, and we use the Gaussian mechanism with standard deviation $\sigma=100$. We account for this budget in the total used to train the model. If the number of correctly classified examples increased by at least three standard deviations, a “significant increase” that is very unlikely to be due to the DP noise alone, we additively decrease the privacy budget used at each mini-batch for the next 10 epochs. If there isn’t such a significant increase, we increase the budget by the same amount, up to that of the non-adaptive baseline. Finally, we only lower the privacy budget when the filter will allow more than 50 epochs under the current budget.

There are two main parameters in DP-SGD that influence the budget of each mini-batch computation: the size of the batch, and the standard deviation of the noise added to the gradients. We apply our policy to each parameter separately. For the noise standard deviation, we add or remove $0.1$. For the batch size, we add or remove $128$ data points in the mini-batch (with a minimum of $256$), which is the size of the sub-mini-batches that we use to build the final batch.
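A minimal Python sketch of the noise-adaptation variant of this policy (function and helper names are ours; it omits the 50-epoch filter check mentioned above):

```python
def adapt_noise(sigma, prev_correct, correct, baseline_sigma=1.0, step=0.1, gauss_std=100.0):
    """Noise-adaptation policy: on a significant accuracy improvement (three standard
    deviations of the DP counting noise), spend less budget per mini-batch by raising
    sigma; otherwise move back toward the baseline budget, capped at baseline_sigma."""
    if correct - prev_correct >= 3 * gauss_std:
        return sigma + step                   # decrease the per-step privacy budget
    return max(baseline_sigma, sigma - step)  # increase it again, up to the baseline

# Inside the training loop, every 10 epochs (dp_count_correct is a hypothetical helper
# that counts correct training predictions with Gaussian noise of std gauss_std):
#   correct = dp_count_correct(model, train_loader, gauss_std=100.0)
#   sigma = adapt_noise(sigma, prev_correct, correct)
#   prev_correct = correct
```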

Results.

Figure 1 shows these two adaptive policies applied to a baseline ResNet9 model, trained for 100 epochs with a fixed learning rate of 0.05, gradient clipping of 1, DP noise $\sigma=1$, and batch size 512. BatchNorm layers are replaced with LayerNorm to support DP. As expected, the policy first decreases the privacy budget, increasing the noise (Fig. 1(c)) or decreasing the batch size (Fig. 1(d)). This translates to a lower DP budget consumption (Fig. 1(b)) and longer training, up to $3\times$ the number of epochs (and up to 15h vs 3.5h on one Tesla M60 GPU). When performance starts to plateau, the DP budget is increased back to its original level. Despite the slower start in accuracy, depicted on Figure 1(a), the end result is significantly better. Over 5 runs, the average accuracy of the baseline is $54.58\%$ (min/max $53.40\%$/$55.24\%$), while adapting the batch size yields $55.56\%$ ($54.93\%$/$56.38\%$) and adapting the noise $57.22\%$ ($56.96\%$/$57.4\%$). The noise adaptation policy always performs better than the best baseline run, by more than $1.72\%$ and with an average increase of $2.64\%$.

Figure 2: Odometer-based privacy loss upper bound (dashed), when fine-tuning an ImageNet model for CIFAR-10. Panels: (a, b) Fine-tuning; (c, d) Adaptive Fine-tuning. We show the accuracy over 5 runs (2(a)) and the odometer privacy loss compared to the RDP curve of a run without early stopping that trains for 50 epochs (2(b)). We then leverage adaptivity to start with lower privacy budgets, and show the resulting accuracy (2(c)) and privacy loss curves (2(d)).

5.2 Privacy Odometer Applications

A natural application of privacy odometers is fine-tuning an existing model with DP, so that the privacy of the fine-tuning dataset is preserved. In this setup, we expect to reach good performance in a small but unknown number of epochs. We would like to stop when the model reaches satisfactory accuracy on the new dataset (or stops improving), without incurring more privacy loss than necessary.

Setup.

We showcase such an adaptive scenario by fine-tuning PyTorch’s pre-trained ResNet18 for ImageNet so that it predicts on CIFAR-10. To this end, we modify the last layer of the original model to output only 10 classes, and resize the images so that they have the expected 224×224 size. When fine-tuning the whole model, we still freeze BatchNorm parameters and apply the layer in test mode, because DP-SGD is not compatible with non-frozen BatchNorm. We also show results when training the last layer only. In each case, we fine-tune the model for 50 epochs with learning rate 0.01, DP noise standard-deviation $\sigma=1$, and mini-batch size 512. Finally, we combine fine-tuning with an adaptive policy that starts from a larger $\sigma=2$ DP noise, and decreases it by 0.1 decrements after every epoch during which the number of correctly predicted train set images did not significantly increase (see §5.1 for details). Full fine-tuning takes 8h on our Tesla M60 GPU, and last layer fine-tuning 2.5h.
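A short PyTorch sketch of this model surgery (ours; the exact torchvision call signatures may differ across versions):

```python
import torch.nn as nn
from torchvision import models

# Load the ImageNet-pre-trained ResNet18 and re-map its head to CIFAR-10's 10 classes.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 10)

# Freeze BatchNorm layers and keep them in eval mode, since DP-SGD is not compatible
# with updating BatchNorm statistics across examples. (model.train() resets eval mode,
# so this loop must be re-applied, or train() overridden, before each training epoch.)
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        module.eval()
        for p in module.parameters():
            p.requires_grad = False

# Variant that fine-tunes the last layer only:
# for p in model.parameters():
#     p.requires_grad = False
# for p in model.fc.parameters():
#     p.requires_grad = True
```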

Results.

Figure 2 shows the accuracy reached when fine-tuning either the last layer or the entire model, as well as the DP budget consumption under RDP and under the odometer. We can think of the RDP curve as a privacy filter filling up until reaching the planned 50 epochs. One can stop at any time, but the entire budget is spent regardless. Focusing on non-adaptive fine-tuning, 2(a) shows that we may want to use the fully fine-tuned model from epoch 20, and the end version of the last-layer fine-tuned model. Either way, we pay the full $\epsilon=5.76$. With the odometer, we can stop at the $20^{th}$ epoch having spent only $\epsilon=4.7$.

In case we have an accuracy goal in mind, say $80\%$, we can do even better by using the adaptive policy to fine-tune the model. Under that policy, the fully fine-tuned model consistently reaches this accuracy in under 8 epochs, for an odometer budget of $\epsilon=1.45$, an enormous saving. Without adaptation, the model reaches $80\%$ accuracy in 6 epochs, but under a higher budget of $\epsilon=3.24$. We also note that the last-layer fine-tuning barely suffers from the adaptive policy, yielding large privacy gains.

6 Related Work

The study of Differential Privacy composition under adaptively chosen privacy parameters was initiated in (Rogers et al., 2016), which formalized privacy odometers and filters for $(\epsilon,\delta)$-DP and showed constructions with optimal rate. The overhead of the privacy filter, as well as the limitations when dealing with larger DP budgets and the Gaussian mechanism, have limited the practical impact of these results. Some applications exist, but are further tailored to restricted settings (Ligett et al., 2017), or use basic composition (Lécuyer et al., 2019). Our filter and odometer alleviate these limitations.

We also build on RDP (Mironov, 2017), a relaxation of pure $\epsilon$-DP which offers strong semantics and enables tighter accounting of privacy composition under non-adaptive privacy parameters. Due to its more intuitive and tighter accounting, RDP has become a major building block in DP libraries for deep learning (Opacus; TensorFlow Privacy).

The closest work to ours is (Feldman & Zrnic, 2020), which studies adaptive budget composition using RDP in the context of individual privacy accounting. While our filter result is identical to theirs, we then leverage it in a fully adaptive odometer construction. We use a nested filter design that is more efficient at packing RDP queries into the filters, and exponentially growing filters to match the optimal composition rate from (Rogers et al., 2016), Theorem 6.1.

7 Conclusion

In this paper, we analyse DP composition with adaptively chosen privacy parameters through the lens of RDP. We show that privacy filters, which allow adaptive privacy budgets within a fixed privacy loss upper-bound, do not incur any overhead compared to the composition of privacy budgets known a priori. We leverage this privacy filter to construct a privacy odometer, which gives a running upper-bound on the privacy loss incurred under adaptive privacy budgets, when the sequence of DP computations can stop at any time. This construction, and its leverage of RDP composition, enables efficient odometers in the large budget regime.

We demonstrate through experiments that this practical privacy filter and odometer enable new DP algorithms, in particular for deep learning. By adapting the DP budget allocated to each gradient step and stopping training early, we improve accuracy under a fixed privacy loss bound, and reduce the total privacy loss incurred when fine-tuning a model. We hope that these results will motivate broader applications, from the design of new DP algorithms, to practical parameter tuning and architecture search for DP machine learning models.

References

  • Abadi et al. (2016) Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In CCS, 2016.
  • Dwork et al. (2014) Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 2014.
  • Feldman & Zrnic (2020) Feldman, V. and Zrnic, T. Individual Privacy Accounting via a Renyi Filter. In arXiv, 2020.
  • Google Differential Privacy. https://github.com/google/differential-privacy/.
  • IBM Differential Privacy Library. https://diffprivlib.readthedocs.io/en/latest/.
  • Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.
  • Lécuyer et al. (2019) Lécuyer, M., Spahn, R., Vodrahalli, K., Geambasu, R., and Hsu, D. Privacy Accounting and Quality Control in the Sage Differentially Private ML Platform. In SOSP, 2019.
  • Ligett et al. (2017) Ligett, K., Neel, S., Roth, A., Waggoner, B., and Wu, S. Z. Accuracy first: Selecting a differential privacy level for accuracy constrained erm. In NeurIPS, 2017.
  • Mironov (2017) Mironov, I. Rényi Differential Privacy. In Computer Security Foundations Symposium (CSF), 2017.
  • Opacus. Train PyTorch models with Differential Privacy. https://opacus.ai/.
  • OpenDP. https://smartnoise.org/.
  • Rogers et al. (2016) Rogers, R., Roth, A., Ullman, J., and Vadhan, S. Privacy Odometers and Filters: Pay-as-You-Go Composition. In NeurIPS, 2016.
  • TensorFlow Privacy. https://github.com/tensorflow/privacy.