Practical Privacy Filters and Odometers with Rényi Differential Privacy
and Applications to Differentially Private Deep Learning
Abstract
Differential Privacy (DP) is the leading approach to privacy-preserving deep learning. As such, there are multiple efforts to provide drop-in integration of DP into popular frameworks. These efforts, which add noise to each gradient computation to make it DP, rely on composition theorems to bound the total privacy loss incurred over this sequence of DP computations.
However, existing composition theorems present a tension between efficiency and flexibility. Most theorems require all computations in the sequence to have a predefined DP parameter, called the privacy budget. This prevents the design of training algorithms that adapt the privacy budget on the fly, or that terminate early to reduce the total privacy loss. Alternatively, the few existing composition results for adaptive privacy budgets provide complex bounds on the privacy loss, with constants too large to be practical.
In this paper, we study DP composition under adaptive privacy budgets through the lens of Rényi Differential Privacy, proving a simpler composition theorem with smaller constants, making it practical enough to use in algorithm design. We demonstrate two applications of this theorem for DP deep learning: adapting the noise or batch size online to improve a model’s accuracy within a fixed total privacy loss, and stopping early when fine-tuning a model to reduce total privacy loss.
1 Introduction
Performing statistical analyses or training machine learning models has been shown to leak information about the individual data points used in the process. Differential Privacy (DP) is the leading approach to prevent such privacy leakage: noise is added to any computation performed on the data, such that the privacy loss resulting from its output is provably bounded. Due to DP's strong privacy guarantees, multiple efforts exist to integrate it into libraries, including for statistical queries (Google Differential Privacy, ; OpenDP, ), machine learning (IBM Differential Privacy Library, ), and deep learning (TensorFlow Privacy, ; Opacus, ). The DP analysis in these libraries relies on composing the results of a sequence of DP computations, such as a sequence of statistical queries, or a sequence of mini-batch gradient computations in deep learning. Each computation in the sequence can be seen as consuming some privacy budget, which depends on the amount of noise added to the computation's result. Composition theorems are then applied to bound the total privacy budget, or privacy loss, used over the sequence.
However, efficient DP composition theorems require that all computations in the entire sequence have a privacy budget fixed a priori. This creates a mismatch between the theory and practical requirements. Indeed, users who interact with a dataset through statistical queries or develop machine learning models will typically want to adapt the next query or model's DP budget allocation based on previous results, resulting in adaptive behavior. At the algorithm level, this mismatch is visible in the APIs developed for DP deep learning, which track DP computations as they happen during training: users are responsible for avoiding early stopping or other types of parameter adaptation. This restriction reduces the design space for DP deep learning algorithms, ruling out early stopping to reduce overall privacy loss, DP budgets that adapt during training to improve final performance, or online architecture search to adapt the complexity of the model to the data within DP constraints.
The main study of adaptive composition (Rogers et al., 2016) defines two relevant interaction models that yield different asymptotic rates for composition theorems. The privacy filter model enforces an a priori bound on the total privacy loss, and can reach the same rate as non-adaptive composition. The privacy odometer model is more costly but more flexible, as it provides a running upper-bound on the privacy loss of the sequence of DP computations performed so far, which can stop at any time; this flexibility comes with a small penalty in asymptotic rate. (Rogers et al., 2016) provides composition theorems with optimal rates for both models, but the bounds are complex and have large constants, making them impractical. A recent paper, (Feldman & Zrnic, 2020), shows improved filters but unsatisfactory odometers.
In this paper, we study both privacy filters and odometers through the lens of Rényi Differential Privacy (RDP). RDP is an analysis tool for DP composition that enables intuitive privacy loss tracking and provides efficient composition results under non-adaptive privacy budgets. It has become a key building block in DP deep learning libraries (TensorFlow Privacy, ; Opacus, ). We show that RDP can be used to provide privacy filters without incurring any cost for adaptivity. We build on this result to construct privacy odometers that remove limitations of previously known results, in particular in regimes relevant to DP deep learning.
We demonstrate two applications of these theoretical results when training DP deep learning models. First, we leverage privacy filters to show that even basic policies for adapting on the fly the noise added to gradient computations, as well as the batch size, yield meaningfully improved accuracy within a fixed privacy loss bound on CIFAR-10. Second, we demonstrate that in a fine-tuning scenario our privacy odometer enables large reductions in total privacy budget through adaptivity and early stopping.
2 Privacy Background
This section provides the necessary background on DP and composition, and introduces some notation. Specific results on adaptive composition are discussed and compared to our contributions throughout the paper.
2.1 Differential Privacy
Definition.
We first recall the definition of (approximate) Differential Privacy (Dwork et al., 2014). A randomized mechanism $M : \mathcal{D} \rightarrow \mathcal{R}$ satisfies $(\epsilon, \delta)$-DP when, for any neighboring datasets $D$ and $D'$ in $\mathcal{D}$, and for any subset of outputs $S \subseteq \mathcal{R}$:
$$P\big(M(D) \in S\big) \le e^{\epsilon}\, P\big(M(D') \in S\big) + \delta .$$
This definition depends on the notion of neighboring datasets, which can be domain specific but is typically chosen as the addition or removal of one data point. Since the neighboring relation is symmetric, the DP inequality also is. A smaller privacy budget $\epsilon$ implies stronger guarantees, as one draw from the mechanism's output gives little information about whether it ran on $D$ or $D'$. When $\delta = 0$, the guarantee is called pure DP, while $\delta > 0$ is a relaxation of pure DP called approximate DP.
A useful way of rephrasing the DP guarantee is to define the privacy loss of an outcome $v$ as:
$$\mathcal{L}(v) = \log \frac{P\big(M(D) = v\big)}{P\big(M(D') = v\big)} .$$
If, with probability at least $1 - \delta$ over $v \sim M(D)$:
$$\mathcal{L}(v) \le \epsilon \qquad (1)$$
then the mechanism is $(\epsilon, \delta)$-DP. We can see in this formulation that $\epsilon$ is a bound on the privacy loss. Approximate DP allows for a failure probability $\delta$ on that bound.
Composition.
Composition is a core DP property. Composition theorems bound the total privacy loss incurred over a sequence of DP computations. This is useful both as a low level tool for algorithm design, and to account for the privacy loss of repeated data analyses performed on the same dataset. Basic composition states that a sequence of $(\epsilon_i, \delta_i)$-DP mechanisms is $(\sum_i \epsilon_i, \sum_i \delta_i)$-DP. While not efficient, it is simple and provides an intuitive notion of additive privacy budget consumed by DP computations.
Strong composition provides a tighter bound on the privacy loss, showing that the same sequence is $(\epsilon_g, \delta_g)$-DP with $\epsilon_g = \sum_i \epsilon_i (e^{\epsilon_i} - 1) + \sqrt{2 \log(1/\hat{\delta}) \sum_i \epsilon_i^2}$ and $\delta_g = \hat{\delta} + \sum_i \delta_i$, for any $\hat{\delta} > 0$. This result comes with its share of drawbacks though. Each mechanism comes with its own trade-off between $\epsilon$ and $\delta$, and there is no practical way to track and optimally compose under these trade-offs. This has led to multiple special purpose composition theorems, such as the moments accountant for DP deep learning (Abadi et al., 2016). Strong composition results also require the sequence of DP parameters to be fixed in advance.
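To make the difference concrete, here is a small numeric sketch comparing basic composition with the classic advanced composition bound for $k$ identical $\epsilon$-DP mechanisms (the homogeneous special case of the bound above). The function names are ours and the snippet is purely illustrative.

```python
import math

def basic_composition(eps, k):
    # Basic composition: k eps-DP mechanisms compose to (k * eps)-DP.
    return k * eps

def advanced_composition(eps, k, delta_slack):
    # Classic advanced composition for k identical eps-DP mechanisms:
    # the sequence is (eps_total, delta_slack)-DP.
    return math.sqrt(2 * k * math.log(1 / delta_slack)) * eps + k * eps * (math.exp(eps) - 1)

k, eps = 100, 0.1
print(basic_composition(eps, k))           # 10.0
print(advanced_composition(eps, k, 1e-5))  # ~5.9: tighter, at the cost of a small delta
```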
2.2 Rényi Differential Privacy
RDP (Mironov, 2017) is also a relaxation of pure DP that always implies approximate DP, though the converse is not always true.
Rényi Differential Privacy (RDP).
RDP expresses its privacy guarantee using the Rényi divergence:
$$D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q} \Big[ \Big(\frac{P(x)}{Q(x)}\Big)^{\alpha} \Big] ,$$
where $\alpha > 1$ is the order of the divergence. A randomized mechanism $M$ is $(\alpha, \epsilon)$-RDP if, for any neighboring datasets $D$ and $D'$:
$$D_\alpha\big(M(D) \,\|\, M(D')\big) \le \epsilon .$$
RDP composition.
A sequence of $(\alpha, \epsilon_i)$-RDP mechanisms is $(\alpha, \sum_i \epsilon_i)$-RDP. We can see that RDP re-introduces an intuitive additive privacy budget $\epsilon$, at each order $\alpha$.
Tracking the RDP budget over multiple orders enables one to account for the specific trade-off curve of each mechanism in a sequence of computations. Indeed, for an $(\alpha, \epsilon)$-RDP mechanism and any $0 < \delta < 1$:
$$\epsilon_{\mathrm{DP}} = \epsilon + \frac{\log(1/\delta)}{\alpha - 1} \qquad (2)$$
which in turn implies $(\epsilon_{\mathrm{DP}}, \delta)$-DP. Although RDP composition is similar to strong composition in the worst case, tracking privacy loss over the curve of Rényi orders often yields smaller privacy loss bounds in practice.
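As an illustration, the following sketch composes the RDP curve of the Gaussian mechanism (with $L_2$ sensitivity 1 and noise $\sigma$, it is $(\alpha, \alpha / 2\sigma^2)$-RDP for every order) over many steps, then converts to $(\epsilon, \delta)$-DP with Equation 2 at the best order. It ignores subsampling amplification, and the function names are ours.

```python
import math

def gaussian_rdp(sigma, alpha):
    # RDP curve of the Gaussian mechanism with L2 sensitivity 1 and noise sigma.
    return alpha / (2 * sigma ** 2)

def rdp_to_dp(eps_rdp, alpha, delta):
    # Equation 2: (alpha, eps_rdp)-RDP implies (eps_rdp + log(1/delta)/(alpha-1), delta)-DP.
    return eps_rdp + math.log(1 / delta) / (alpha - 1)

orders = [1.5, 2, 4, 8, 16, 32, 64]
sigma, steps, delta = 5.0, 1000, 1e-5

# RDP composes additively at every order; pick the order giving the best DP bound.
rdp_total = {a: steps * gaussian_rdp(sigma, a) for a in orders}
eps_dp = min(rdp_to_dp(rdp_total[a], a, delta) for a in orders)
print(f"({eps_dp:.2f}, {delta})-DP after {steps} steps")
```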
3 Privacy Filters
We first focus on privacy filters, which authorize the composition of adaptive DP budgets within an upper-bound on the privacy loss fixed a priori. That is, the sequence of computations' budgets is not known in advance and can depend on previous results; the filter ensures that the upper-bound always holds.
3.1 The Privacy Filter Setup
Complex sequences of interactions with the data, such as the adaptive privacy budget interactions we focus on in this work, can be challenging to express precisely. Following (Rogers et al., 2016), we express them in the form of an algorithm that describes the power and constraints of an analyst (or adversary, since it models worst-case behavior) interacting with the data. Algorithm 1 shows this interaction setup for a privacy filter.
Specifically, an adversary $\mathcal{A}$ will interact for $k$ rounds ($k$ does not appear in the bounds and can be chosen arbitrarily large), in one of two "neighboring worlds" indexed by $b \in \{0, 1\}$. Typically, the two worlds will represent two neighboring datasets with an observation added or removed, but this formulation allows for more flexibility by letting $\mathcal{A}$ choose two neighboring datasets at each round, as well as the DP mechanism to use and its RDP parameters. Each decision can be made conditioned on the previous results.
Our RDP filter of order $\alpha$ can PASS on a computation at any time, setting $\epsilon_i = 0$ and returning $\bot$ to $\mathcal{A}$. Sections 3.2 and 3.3 describe how to instantiate this filter to provide RDP and DP guarantees. The final output of the algorithm is a view $v = (v_1, \ldots, v_k)$ of all results from the $k$ computations. With a slight abuse of notation, we call $V_i^b$ the distribution over possible outcomes for the $i$-th query, and $v_i$ an observed realization.
This way, we can explicitly write the dependencies of the $i$-th query's distribution as $V_i^b(v_{<i})$, where $v_{<i}$ is the view of all realized outcomes up to, and including, the previous query. Similarly, the joint distribution over possible views is $V_{1:k}^b$. The budget $\epsilon_i$ of the $i$-th query depends only on all queries up to the previous one, which we note explicitly as $\epsilon_i(v_{<i})$. DP seeks to bound the information about $b$ that $\mathcal{A}$ can learn from the results of its DP computations. Following Equation 1, we want to bound with high probability under $V_{1:k}^0$ the privacy loss incurred by the sequence of computations, $\mathcal{L}(v) = \log \frac{P(V_{1:k}^0 = v)}{P(V_{1:k}^1 = v)}$.
3.2 Rényi Differential Privacy Analysis
We first analyse the filter that will PASS on the $i$-th query when:
$$\epsilon_i(v_{<i}) + \sum_{t=1}^{i-1} \epsilon_t(v_{<t}) > \epsilon^{\mathrm{filter}} . \qquad (3)$$
The filter PASSes when the new query would cause the sum of all budgets used so far to go over $\epsilon^{\mathrm{filter}}$. We next show that this filter ensures that Algorithm 1 is $(\alpha, \epsilon^{\mathrm{filter}})$-RDP.
Theorem 1 (RDP Filter). The interaction mechanism of Algorithm 1, using the privacy filter from Equation 3, is $(\alpha, \epsilon^{\mathrm{filter}})$-RDP.
Proof.
We start by writing the integral for $e^{(\alpha - 1) D_\alpha(V^0_{1:k} \| V^1_{1:k})}$ as:
$$e^{(\alpha - 1) D_\alpha(V^0_{1:k} \| V^1_{1:k})} = \int_{v_1} \cdots \int_{v_k} \prod_{i=1}^{k} P\big(V^0_i(v_{<i}) = v_i\big)^{\alpha}\, P\big(V^1_i(v_{<i}) = v_i\big)^{1 - \alpha}\, dv_k \cdots dv_1 .$$
Focusing on the inner-most integral, we have that:
$$\int_{v_k} P\big(V^0_k(v_{<k}) = v_k\big)^{\alpha}\, P\big(V^1_k(v_{<k}) = v_k\big)^{1 - \alpha}\, dv_k \;\le\; e^{(\alpha - 1)\, \epsilon_k(v_{<k})} \;\le\; e^{(\alpha - 1)\big(\epsilon^{\mathrm{filter}} - \sum_{t=1}^{k-1} \epsilon_t(v_{<t})\big)} ,$$
where the first inequality follows because the $k$-th mechanism is $(\alpha, \epsilon_k(v_{<k}))$-RDP, and the second because the privacy filter enforces that $\sum_{t=1}^{k} \epsilon_t(v_{<t}) \le \epsilon^{\mathrm{filter}}$, since otherwise the $k$-th query PASSes and $\epsilon_k(v_{<k}) = 0$.
Note that now, the inner-most integral is bounded by a function of $v_{<k-1}$ and not $v_{k-1}$. Thus, we can write:
$$e^{(\alpha - 1) D_\alpha(V^0_{1:k} \| V^1_{1:k})} \le \int_{v_1} \cdots \int_{v_{k-1}} \Big[ \prod_{i=1}^{k-1} P\big(V^0_i(v_{<i}) = v_i\big)^{\alpha} P\big(V^1_i(v_{<i}) = v_i\big)^{1 - \alpha} \Big]\, e^{(\alpha - 1)\big(\epsilon^{\mathrm{filter}} - \sum_{t=1}^{k-1} \epsilon_t(v_{<t})\big)}\, dv_{k-1} \cdots dv_1$$
$$\le \int_{v_1} \cdots \int_{v_{k-2}} \Big[ \prod_{i=1}^{k-2} P\big(V^0_i(v_{<i}) = v_i\big)^{\alpha} P\big(V^1_i(v_{<i}) = v_i\big)^{1 - \alpha} \Big]\, e^{(\alpha - 1)\epsilon_{k-1}(v_{<k-1})}\, e^{(\alpha - 1)\big(\epsilon^{\mathrm{filter}} - \sum_{t=1}^{k-1} \epsilon_t(v_{<t})\big)}\, dv_{k-2} \cdots dv_1$$
$$= \int_{v_1} \cdots \int_{v_{k-2}} \Big[ \prod_{i=1}^{k-2} P\big(V^0_i(v_{<i}) = v_i\big)^{\alpha} P\big(V^1_i(v_{<i}) = v_i\big)^{1 - \alpha} \Big]\, e^{(\alpha - 1)\big(\epsilon^{\mathrm{filter}} - \sum_{t=1}^{k-2} \epsilon_t(v_{<t})\big)}\, dv_{k-2} \cdots dv_1 \;\le\; \cdots \;\le\; e^{(\alpha - 1)\, \epsilon^{\mathrm{filter}}} ,$$
where the first inequality replaces the inner-most integral with the previous bound, the second inequality takes this bound out of the integral over $v_{k-1}$ (as it is a constant w.r.t. $v_{k-1}$) and uses that the $(k-1)$-th mechanism is $(\alpha, \epsilon_{k-1}(v_{<k-1}))$-RDP, and the equality combines the two exponentials and cancels $\epsilon_{k-1}(v_{<k-1})$ in the sum. As we can see, the remaining exponential term is now only a function of $v_{<k-2}$ and not $v_{k-2}$. The last inequality recursively applies the same reasoning to each integral, removing all the $\epsilon_t(v_{<t})$ in turn. Taking the log of each side of the global inequality, and dividing by $\alpha - 1$, concludes the proof. ∎
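The proof suggests a very simple implementation: a filter only needs to keep a running sum of the adaptively chosen RDP budgets and refuse any computation that would push this sum past $\epsilon^{\mathrm{filter}}$. The sketch below is ours (at a single order $\alpha$) and is not the repository's implementation.

```python
class RenyiFilter:
    """Minimal single-order RDP privacy filter implementing the PASS rule of Equation 3."""

    def __init__(self, eps_filter):
        self.eps_filter = eps_filter
        self.spent = 0.0

    def try_consume(self, eps_i):
        # PASS (return False) if accepting eps_i would exceed the filter budget.
        if self.spent + eps_i > self.eps_filter:
            return False
        self.spent += eps_i
        return True

f = RenyiFilter(eps_filter=1.0)
assert f.try_consume(0.4)       # accepted, 0.4 spent
assert f.try_consume(0.5)       # accepted, 0.9 spent
assert not f.try_consume(0.2)   # would reach 1.1 > 1.0: PASS
```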
3.3 Composing Over the Rényi Curve
A powerful feature of RDP is the ability to compose mechanisms over their whole Rényi curve. Since for many mechanisms this curve is not linear, it is unclear in advance which order $\alpha$ would yield the best final $(\epsilon, \delta)$-DP guarantee through Equation 2. This composition "over all $\alpha$s" is important, as it enables RDP's strong composition results.
We can extend our privacy filter and Theorem 1 to this case. Consider a (possibly infinite) sequence of orders to track, noted $\Lambda = (\alpha_j)_j$, and an upper bound $\epsilon^{\mathrm{filter}}_{\alpha_j}$ for each of these orders. In Algorithm 1, the $i$-th mechanism is now $(\alpha_j, \epsilon_{\alpha_j, i}(v_{<i}))$-RDP for each $\alpha_j \in \Lambda$. The filter will now PASS when:
$$\forall \alpha_j \in \Lambda : \ \epsilon_{\alpha_j, i}(v_{<i}) + \sum_{t=1}^{i-1} \epsilon_{\alpha_j, t}(v_{<t}) > \epsilon^{\mathrm{filter}}_{\alpha_j} . \qquad (4)$$
Corollary 1 (RDP Filter over the Rényi Curve). Under the filter from Equation 4, there is at least one order $\alpha_j \in \Lambda$ such that Algorithm 1 is $(\alpha_j, \epsilon^{\mathrm{filter}}_{\alpha_j})$-RDP.
Proof.
We argue by contradiction that there exists $\alpha_j \in \Lambda$ such that $\sum_{i=1}^{k} \epsilon_{\alpha_j, i}(v_{<i}) \le \epsilon^{\mathrm{filter}}_{\alpha_j}$. Assume that was not the case, and consider the maximal index $i^\star$ for which the filter did not PASS. Then for all $i > i^\star$, the filter PASSed and $\epsilon_{\alpha_j, i}(v_{<i}) = 0$ for every $\alpha_j$, while by Equation 4 there is an $\alpha_j$ with $\sum_{i=1}^{i^\star} \epsilon_{\alpha_j, i}(v_{<i}) \le \epsilon^{\mathrm{filter}}_{\alpha_j}$, which is a contradiction. Applying Theorem 1 to this $\alpha_j$ concludes the proof. ∎
3.4 $(\epsilon, \delta)$-DP Implications
A key application of Corollary 1 is to build a privacy filter that enforces an $(\epsilon, \delta)$-DP guarantee, and leverages RDP to ensure strong composition. Intuitively, for each $\alpha_j$ the corresponding $\epsilon^{\mathrm{filter}}_{\alpha_j}$ is set to the value which, for this order, implies the desired $(\epsilon, \delta)$-DP target. Formally, based on Equation 2 we set the filter from Equation 4 to $\epsilon^{\mathrm{filter}}_{\alpha_j} = \epsilon - \frac{\log(1/\delta)}{\alpha_j - 1}$. Applying Corollary 1 followed by Equation 2 shows that under this filter, Algorithm 1 is $(\epsilon, \delta)$-DP.
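A sketch of this construction, assuming the reading of Equation 4 above (PASS only when every tracked order would exceed its budget): each order receives the budget that implies the $(\epsilon, \delta)$-DP target at that order, and the interaction continues as long as at least one order can still certify the target. Class and method names are ours.

```python
import math

class RenyiCurveFilter:
    """Privacy filter over several RDP orders, targeting an (eps_dp, delta)-DP guarantee."""

    def __init__(self, orders, eps_dp, delta):
        # Per-order budget implying the DP target at that order (Section 3.4).
        self.budgets = {a: eps_dp - math.log(1 / delta) / (a - 1) for a in orders}
        self.spent = {a: 0.0 for a in orders}

    def try_consume(self, rdp_costs):
        # rdp_costs maps each tracked order to the query's RDP cost at that order.
        fits = any(self.spent[a] + rdp_costs[a] <= self.budgets[a] for a in self.budgets)
        if not fits:
            return False  # PASS: no order can certify the DP target anymore
        for a in self.budgets:
            self.spent[a] += rdp_costs[a]
        return True

f = RenyiCurveFilter(orders=[2, 8, 32], eps_dp=2.0, delta=1e-5)
step_cost = {a: a / (2 * 5.0 ** 2) for a in [2, 8, 32]}  # one Gaussian step, sigma = 5
steps = 0
while f.try_consume(step_cost):
    steps += 1  # run one more DP computation
```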
This is significant, as this construction yields strong DP composition with a simple additive (RDP) privacy filter. There is no overhead compared to non-adaptive RDP, which yields tighter results than DP strong composition in many practical settings, such as the composition of Gaussian mechanisms ubiquitous in DP deep learning. On the contrary, Theorem 5.1 from (Rogers et al., 2016) is restricted to DP strong composition, and pays a significant multiplicative constant. The result from Theorem 1 already appears in (Feldman & Zrnic, 2020) as a corollary to a more general result. Here we show a proof closer to that of RDP composition, that we will re-use in subsequent results. We also explicitly show how to implement filters over the Rényi curve in §3.3.
4 Privacy Odometers
Privacy filters are useful to support sequences of computations with adaptive privacy budgets, under a fixed total privacy loss. In many cases however, we are interested in a more flexible setup in which we can also stop at any time, and bound the privacy loss of the realized sequence. Examples of such use-cases include early stopping when training or fine-tuning deep learning models, “accuracy first” machine learning (Ligett et al., 2017), or an analyst interacting with the data until a particular question is answered.
4.1 The Privacy Odometer Setup
Algorithm 2 formalizes our privacy odometer setup. The adversary can now interact with the data in an unrestricted fashion. For the given RDP order $\alpha$, the odometer returns the view $v$ and an RDP parameter $\epsilon_\alpha(v)$, which must be a running upper-bound on the RDP budget consumed so far, since the interaction can stop at any time.
Analyzing such an unbounded interaction directly with RDP is challenging, as the RDP parameters themselves can change at each step based on previous realizations of the random variables, and there is no limit enforced on their sum. We sidestep this challenge by initiating the odometer with a sequence of privacy filters of sizes $(F_j)_{j \ge 1}$, with $j$ a strictly positive integer. After each new RDP computation, we compute the smallest $j$ such that a filter initiated with budget $F_j$ would not PASS, and return $\epsilon_\alpha(v) = F_j$. We show how to analyse the privacy guarantees of this construct in sections 4.2 and 4.3, before extending it to support RDP accounting over the Rényi curve in section 4.4. Finally, section 4.5 discusses how to instantiate $(F_j)_{j \ge 1}$ to get efficient odometers.
4.2 Rényi Differential Privacy Analysis
To analyse Algorithm 2, we define a truncation operator $\mathcal{T}_{\epsilon_g}$, which changes the behavior of an adversary $\mathcal{A}$ to match that of a privacy filter of size $\epsilon_g$. Intuitively, $\mathcal{T}_{\epsilon_g}(\mathcal{A})$ follows $\mathcal{A}$ exactly until $\mathcal{A}$ outputs an RDP mechanism that would cause a PASS in such a filter. When this happens, $\mathcal{T}_{\epsilon_g}(\mathcal{A})$ stops interacting until the end of the $k$ queries. More precisely, at round $i$, $\mathcal{T}_{\epsilon_g}(\mathcal{A})$ outputs the same neighboring datasets, mechanism, and RDP budget as $\mathcal{A}$ as long as $\epsilon_i(v_{<i}) + \sum_{t=1}^{i-1} \epsilon_t(v_{<t}) \le \epsilon_g$. Otherwise, it returns $\bot$ and $\epsilon_i(v_{<i}) = 0$.
We similarly note $\mathcal{T}_{\epsilon_g}(V^b)$ the distribution over possible views when adversary $\mathcal{T}_{\epsilon_g}(\mathcal{A})$ interacts following Algorithm 2, and $\mathcal{T}_{\epsilon_g}(v)$ the corresponding truncation of a view $v$ generated by adversary $\mathcal{A}$. This truncated adversary behaves like a privacy filter, yielding the following result.
Corollary 2 (Truncated RDP Odometer).
The interaction mechanism from Algorithm 2, truncated with $\mathcal{T}_{\epsilon_g}$, is such that $D_\alpha\big(\mathcal{T}_{\epsilon_g}(V^0) \,\|\, \mathcal{T}_{\epsilon_g}(V^1)\big) \le \epsilon_g$; that is, it is $(\alpha, \epsilon_g)$-RDP.
Proof. The truncated adversary $\mathcal{T}_{\epsilon_g}(\mathcal{A})$ respects the filter condition from Equation 3 with budget $\epsilon_g$, so the result follows directly from Theorem 1. ∎
We note that the odometer from (Feldman & Zrnic, 2020) (Proposition 5.2) is closer to this truncated odometer than to the odometer from (Rogers et al., 2016). Indeed, the guarantee it gives is for stacked privacy filters, where the number of filters is fixed in advance. If applied with an adaptive number of filters (e.g. for early stopping), the result would violate the lower bound proven in (Rogers et al., 2016), Theorem 6.1. We next show how to provide odometer semantics with RDP composition bounds.
4.3 $(\epsilon, \delta)$-DP Implications
We are now ready to bound the privacy loss of Algorithm 2.
Theorem 2 (DP Odometer).
For a given $\delta > 0$ and outcome $v$ of the interaction described in Algorithm 2, we have that $P_{v \sim V^0}\Big(\mathcal{L}(v) \ge \epsilon_\alpha(v) + \frac{\log(1/\delta)}{\alpha - 1}\Big) \le \delta$.
This implies that the interaction is $\big(\epsilon_\alpha(v) + \frac{\log(1/\delta)}{\alpha - 1},\, \delta\big)$-DP.
Proof.
We first show an important relationship between the interaction of Algorithm 2 and its truncated version. Denote $G$ the event that $\mathcal{T}_{\epsilon_g}(\mathcal{A})$ stopped, that is $\mathcal{T}_{\epsilon_g}(v)$ includes at least one $\bot$. Denote $V^0$ the random variable from which $v$ is drawn (its randomness comes from the randomness in both $\mathcal{A}$ and the mechanisms it plays).
Further note $\mathcal{L}^{\mathcal{T}}(v)$ the privacy loss of an outcome $v$ in the truncated version of Algorithm 2. Because truncation at $\epsilon_g$ does not change anything when $G$ does not occur, we have that:
$$P_{v \sim V^0}\big(\{\mathcal{L}(v) \ge \epsilon\} \cap \bar{G}\big) \;\le\; P_{v \sim \mathcal{T}_{\epsilon_g}(V^0)}\big(\mathcal{L}^{\mathcal{T}}(v) \ge \epsilon\big) ,$$
where the last inequality uses the fact that $\mathcal{L}(v) = \mathcal{L}^{\mathcal{T}}(v)$ and $V^0 = \mathcal{T}_{\epsilon_g}(V^0)$ outside of $G$, by definition of the truncation operator and event.
4.4 Composing Over the Rényi Curve
Theorem 2 gives us an RDP based bound on the privacy loss, with a penalty for providing a running upper-bound over the sequence of adaptive RDP budgets. To fully leverage RDP strong composition, we once again need to track the RDP budget over the Rényi curve. Applying a union bound shows that, with probability at least $1 - \delta$ over $v \sim V^0$, for all $\alpha_j \in \Lambda$:
$$\mathcal{L}(v) \le \epsilon_{\alpha_j}(v) + \frac{\log(|\Lambda|/\delta)}{\alpha_j - 1} . \qquad (5)$$
Concretely, we track the budget spent by $\mathcal{A}$ in Algorithm 2 over a set of RDP orders, denoted $\Lambda$. When mapping this RDP consumption to a DP bound, for each $\alpha_j \in \Lambda$ we compute $\epsilon_{\alpha_j}(v) + \frac{\log(|\Lambda|/\delta)}{\alpha_j - 1}$, the right hand side of Equation 5. The minimum value of this right hand side over all orders is our upper-bound on the privacy loss.
4.5 Instantiating the Odometer
There are two main choices to make when instantiating our privacy odometer. The first choice is that of the filter sizes $(F_j)_{j \ge 1}$. At a given order $\alpha$, the best DP guarantee we can hope for is $\frac{\log(1/\delta)}{\alpha - 1}$, obtained when the RDP budget spent is zero. We set $F_1 = \frac{\log(1/\delta)}{\alpha - 1}$, ensuring that we can always yield a DP guarantee within a factor two of the lowest possible one when $\epsilon_\alpha(v) \le F_1$. We then preserve this factor two upper-bound by doubling the filter size at every level $j$, yielding:
$$F_j = 2^{\,j-1}\, \frac{\log(1/\delta)}{\alpha - 1} .$$
Note that the final DP guarantee will not be a step function doubling at every step, as we can choose the best DP guarantee over all $\alpha$s a posteriori.
We then need to choose the set of RDP orders $\Lambda$ to track, with multiple considerations to balance. First, the final DP guarantee implied by Equation 5 has a dependency on $\log(|\Lambda|)$, so $\Lambda$ cannot grow too large. Second, the largest order will constrain the lowest possible DP guarantee. (Rogers et al., 2016) suggests that a minimum DP guarantee granularity depending only on the size $n$ of the dataset is always sufficient: the result of a DP interaction has to be almost independent of the data at such an $\epsilon$, rendering accounting at a lower granularity meaningless. In this case, choosing the largest order in $\Lambda$ accordingly yields the same optimal dependency on this minimal granularity as (Rogers et al., 2016), Theorems 6.1 and 6.3. In contrast to the result from (Rogers et al., 2016), the resulting odometer does not degrade for larger budget values, making it more practical in situations where large budgets are expected.
In practice, many applications have a high DP cost for which this minimal granularity is too extreme. Such applications can gain from a higher resolution of RDP orders in the relevant range, yielding a tighter RDP bound. This effect is particularly important due to our exponential discretisation of the RDP budget spent at each order. By choosing the range of RDP orders $\Lambda$, we can thus trade off the $\log(|\Lambda|)$ cost against the RDP order resolution and the minimum budget granularity. This ease of adaptation, combined with efficient RDP accounting, makes our privacy odometer the first to be practical enough to apply to DP deep learning. In the rest of this paper, we use the same fixed set of RDP orders $\Lambda$ for all results. These are typical values used in RDP accounting for deep learning, with a slightly higher resolution than the one given as an example in (Mironov, 2017).
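Putting sections 4.4 and 4.5 together, the following sketch tracks the RDP budget spent at each order, rounds it up to the doubling filter sequence reconstructed above, and reports the running $(\epsilon, \delta)$-DP bound with the union bound over orders. The exact filter sequence and the union-bound constant follow our reading of the construction; class and method names are ours.

```python
import math

class RenyiOdometer:
    """Sketch of a privacy odometer over a set of RDP orders."""

    def __init__(self, orders, delta):
        self.orders = orders
        self.delta = delta
        self.spent = {a: 0.0 for a in orders}

    def consume(self, rdp_costs):
        # Account for one adaptively chosen RDP computation.
        for a in self.orders:
            self.spent[a] += rdp_costs[a]

    def _rounded_budget(self, a):
        # Smallest filter F_j = 2^(j-1) * log(1/delta)/(alpha-1) covering the spend.
        f_j = math.log(1 / self.delta) / (a - 1)
        while f_j < self.spent[a]:
            f_j *= 2
        return f_j

    def current_dp_bound(self):
        # Running (eps, delta)-DP bound: best order after the union-bound penalty.
        penalty = math.log(len(self.orders) / self.delta)
        return min(self._rounded_budget(a) + penalty / (a - 1) for a in self.orders)

odo = RenyiOdometer(orders=[2, 4, 8, 16, 32, 64], delta=1e-5)
for _ in range(500):                              # e.g. 500 Gaussian steps with sigma = 5
    odo.consume({a: a / (2 * 5.0 ** 2) for a in odo.orders})
print(odo.current_dp_bound())                     # can be read off after any step
```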
5 Applications to DP Deep Learning
Practical privacy filters and odometers open a new design space for DP, in particular in its applications to deep learning models. We next showcase early applications of both constructs for classification models on the CIFAR-10 dataset (Krizhevsky, 2009). Our goal is not to develop state of the art DP models, but to show the promise of adaptive DP budgets in controlled experiments. Our Opacus based implementation for privacy filters and odometers, as well as the code to replicate experiments, is available at https://github.com/matlecu/adaptive_rdp.
[Figure 1: Adaptive DP-SGD policies on CIFAR-10 (ResNet9): (a) test accuracy, (b) DP budget consumption, (c) adapted DP noise, (d) adapted batch size.]
5.1 Privacy Filter Applications
Intuitively, adaptivity can be used to save budget during training to perform more steps in total, increasing accuracy under a fixed total privacy loss. By far the most common method to train deep learning models with DP is DP Stochastic Gradient Descent (DP-SGD) (Abadi et al., 2016), which clips the gradients of individual data points to bound their influence on each SGD step, and adds Gaussian noise to the mini-batch’s gradients to ensure that each parameter update is DP. Composition is then used to compute the total privacy loss incurred when training the model.
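For reference, here is a minimal microbatched sketch of one DP-SGD step (per-example gradient clipping followed by Gaussian noise). It is a simplification we wrote for illustration: libraries such as Opacus compute per-sample gradients efficiently, and the RDP accounting of each step is handled separately.

```python
import torch

def dp_sgd_step(model, loss_fn, xs, ys, optimizer, clip_norm=1.0, sigma=1.0):
    """One DP-SGD step: clip each example's gradient, sum, add noise, average."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(xs, ys):                  # naive microbatching, one example at a time
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (norm + 1e-12)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)                 # accumulate the clipped gradient

    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * sigma * clip_norm
        p.grad = (s + noise) / len(xs)        # noisy average gradient
    optimizer.step()
    optimizer.zero_grad()
```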
Adaptation policy.
Starting from a non-adaptive baseline, every few epochs we compute the number of correctly classified train set images. Such a computation needs to be DP, and we use the Gaussian mechanism with a fixed standard deviation. We account for this budget in the total used to train the model. If the number of correctly classified examples increased by at least three standard deviations, a "significant increase" that ensures this change is not due to the DP noise, we additively decrease the privacy budget used at each mini-batch for the following epochs. If there isn't such a significant increase, we increase the budget by the same amount, up to that of the non-adaptive baseline. Finally, we only lower the privacy budget when the filter will still allow several more epochs under the current budget.
There are two main parameters in DP-SGD that influence the budget of each mini-batch computation: the size of the batch, and the standard deviation of the noise added to the gradients. We apply our policy to each parameter separately. For the noise standard deviation, we add or remove a fixed increment. For the batch size, we add or remove a fixed number of data points in the mini-batch, with a minimum equal to the size of the sub-mini-batches that we use to build the final batch.
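The noise-adaptation variant of this policy can be sketched as follows. All names, thresholds, and increments here are illustrative placeholders rather than the exact values used in our experiments; the DP cost of the accuracy check and the filter check before lowering the budget are noted in comments.

```python
import random

def dp_correct_count(n_correct, sigma_check):
    # DP estimate of the number of correctly classified train images
    # (Gaussian mechanism; its budget is accounted for in the filter).
    return n_correct + random.gauss(0.0, sigma_check)

def adapt_noise(prev_est, new_est, sigma_check, sigma, sigma_step, sigma_baseline,
                filter_allows_more_epochs):
    """Return the DP noise standard deviation to use for the next epochs."""
    significant = (new_est - prev_est) >= 3 * sigma_check
    if significant and filter_allows_more_epochs:
        # Model still improving: lower the per-step budget (more noise).
        return sigma + sigma_step
    # Otherwise move the per-step budget back up, without exceeding the baseline's.
    return max(sigma_baseline, sigma - sigma_step)
```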
Results.
Figure 1 shows these two adaptive policies applied to a baseline ResNet9 model, trained for a fixed number of epochs with a constant learning rate, gradient clipping norm, DP noise, and batch size. BatchNorm layers are replaced with LayerNorm to support DP. As expected, the policy first decreases the privacy budget, increasing the noise (Fig. 1(c)) or decreasing the batch size (Fig. 1(d)). This translates to lower DP budget consumption (Fig. 1(b)) and longer training, up to the maximum number of epochs (and up to 15h vs 3.5h on one Tesla M60 GPU). When performance starts to plateau, the DP budget is increased back to its original level. Despite the slower start in accuracy, depicted on Figure 1(a), the end result is significantly better: over multiple runs, both adapting the batch size and adapting the noise yield a higher average accuracy than the baseline, and the noise adaptation policy always performs better than the best baseline run.
[Figure 2: Fine-tuning an ImageNet pre-trained ResNet18 on CIFAR-10: test accuracy and DP budget consumption under RDP accounting and under the odometer, for full-model and last-layer fine-tuning, with and without the adaptive noise policy.]
5.2 Privacy Odometer Applications
A natural application of privacy odometers is fine-tuning an existing model with DP, so that the privacy of the fine-tuning dataset is preserved. In this setup, we expect to reach good performance in a small but unknown number of epochs. We would like to stop when the model reaches satisfactory accuracy on the new dataset (or stops improving), without incurring more privacy loss than necessary.
Setup.
We showcase such an adaptive scenario by fine-tuning PyTorch's pre-trained ResNet18 for ImageNet so that it predicts on CIFAR-10. To this end, we modify the last layer of the original model to output only 10 classes, and resize the images so that they have the expected input size. When fine-tuning the whole model, we still freeze BatchNorm parameters and apply these layers in test mode, because DP-SGD is not compatible with non-frozen BatchNorm. We also show results when training the last layer only. In each case, we fine-tune the model for a fixed number of epochs with a constant learning rate, DP noise standard deviation, and mini-batch size. Finally, we combine fine-tuning with an adaptive policy that starts from a larger DP noise, and decreases it by a fixed decrement after every epoch during which the number of correctly predicted train set images did not significantly increase (see §5.1 for details). Full fine-tuning takes 8h on our Tesla M60 GPU, and last-layer fine-tuning 2.5h.
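A minimal sketch of this setup in PyTorch/torchvision (hyperparameters, data loading, and image resizing omitted; assumes a standard torchvision ResNet18):

```python
import torch.nn as nn
import torchvision

# ImageNet pre-trained ResNet18, with a new 10-class head for CIFAR-10.
model = torchvision.models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 10)

# DP-SGD cannot update BatchNorm statistics: freeze the parameters and keep
# the layers in test mode (re-apply .eval() after any call to model.train()).
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        module.eval()
        for p in module.parameters():
            p.requires_grad = False

# Last-layer fine-tuning only: additionally freeze everything but the new head.
# for name, p in model.named_parameters():
#     p.requires_grad = name.startswith("fc.")
```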
Results.
Figure 2 shows the accuracy reached when fine-tuning either the last layer or the entire model, as well as the DP budget consumption under RDP accounting and under the odometer. We can think of the RDP curve as a privacy filter filling up until reaching the planned number of epochs: one can stop at any time, but the entire budget is spent regardless. Focusing on non-adaptive fine-tuning, Fig. 2(a) shows that we may want to use the fully fine-tuned model from an early epoch, or the final version of the last-layer fine-tuned model. Either way, we pay the full filter budget. With the odometer, we can stop at that epoch having spent only a fraction of it.
In case we have an accuracy goal in mind, we can do even better by using the adaptive policy to fine-tune the model. Under that policy, the fully fine-tuned model consistently reaches the target accuracy in fewer epochs and under a much smaller odometer budget than without adaptation, an enormous saving. We also note that the last-layer fine-tuning barely suffers from the adaptive policy, yielding large privacy gains.
6 Related Work
The study of Differential Privacy composition under adaptively chosen privacy parameters was initiated in (Rogers et al., 2016), which formalized privacy odometers and filters for $(\epsilon, \delta)$-DP and showed constructions with optimal rates. The overhead of the privacy filter, as well as the limitations when dealing with larger DP budgets and the Gaussian mechanism, have limited the practical impact of these results. Some applications exist, but they are either tailored to restricted settings (Ligett et al., 2017) or use basic composition (Lécuyer et al., 2019). Our filter and odometer alleviate these limitations.
We also build on RDP (Mironov, 2017), a relaxation of pure $\epsilon$-DP which offers strong semantics and enables tighter accounting of privacy composition under non-adaptive privacy parameters. Due to its more intuitive and tighter accounting, RDP has become a major building block in DP libraries for deep learning (Opacus, ; TensorFlow Privacy, ).
The closest work to ours is (Feldman & Zrnic, 2020), which studies adaptive budget composition using RDP in the context of individual privacy accounting. While our filter result is identical to theirs, we then leverage it in a fully adaptive odometer construction. We use a nested filter design that is more efficient when packing RDP queries into the filters, and exponentially growing filters that match the optimal composition rate from (Rogers et al., 2016), Theorem 6.1.
7 Conclusion
In this paper, we analyse DP composition with adaptively chosen privacy parameters through the lens of RDP. We show that privacy filters, which allow adaptive privacy budgets within a fixed privacy loss upper-bound, do not incur any overhead compared to composition of privacy budgets known a priori. We leverage this privacy filter to construct a privacy odometer, which gives a running upper-bound on the privacy loss incurred under adaptive privacy budgets, when the sequence of DP computations can stop at any time. This construction, and its leverage of RDP composition, enables efficient odometers in the large budget regime.
We demonstrate through experiments that these practical privacy filters and odometers enable new DP algorithms, in particular for deep learning. By adapting the DP budget allocated to each gradient step and stopping training early, we improve accuracy under a fixed privacy loss bound, and reduce the total privacy loss incurred when fine-tuning a model. We hope that these results will motivate broader applications, from the design of new DP algorithms, to practical parameter tuning and architecture search for DP machine learning models.
References
- Abadi et al. (2016) Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In CCS, 2016.
- Dwork et al. (2014) Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 2014.
- Feldman & Zrnic (2020) Feldman, V. and Zrnic, T. Individual Privacy Accounting via a Renyi Filter. In arXiv, 2020.
- (4) Google Differential Privacy. https://github.com/google/differential-privacy/.
- (5) IBM Differential Privacy Library. https://diffprivlib.readthedocs.io/en/latest/.
- Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.
- Lécuyer et al. (2019) Lécuyer, M., Spahn, R., Vodrahalli, K., Geambasu, R., and Hsu, D. Privacy Accounting and Quality Control in the Sage Differentially Private ML Platform. In SOSP, 2019.
- Ligett et al. (2017) Ligett, K., Neel, S., Roth, A., Waggoner, B., and Wu, S. Z. Accuracy first: Selecting a differential privacy level for accuracy constrained erm. In NeurIPS, 2017.
- Mironov (2017) Mironov, I. Rényi Differential Privacy. In Computer Security Foundations Symposium (CSF), 2017.
- (10) Opacus. Train PyTorch models with Differential Privacy. https://opacus.ai/.
- (11) OpenDP. https://smartnoise.org/.
- Rogers et al. (2016) Rogers, R., Roth, A., Ullman, J., and Vadhan, S. Privacy Odometers and Filters: Pay-as-You-Go Composition. In NeurIPS, 2016.
- (13) TensorFlow Privacy. https://github.com/tensorflow/privacy.