Privacy of the last iterate in cyclically-traversed DP-SGD on nonconvex composite losses
Abstract
Differentially-private stochastic gradient descent (DP-SGD) is a family of iterative machine learning training algorithms that privatize gradients to generate a sequence of differentially-private (DP) model parameters. It is also the standard tool used to train DP models in practice, even though most users are only interested in protecting the privacy of the final model. Tight DP accounting for the last iterate would minimize the amount of noise required while maintaining the same privacy guarantee and potentially increasing model utility. However, last-iterate accounting is challenging, and existing works require strong assumptions not satisfied by most implementations. These include assuming (i) the global sensitivity constant is known — to avoid gradient clipping; (ii) the loss function is Lipschitz or convex; and (iii) input batches are sampled randomly.
In this work, we forego any unrealistic assumptions and provide privacy bounds for the most commonly used variant of DP-SGD, in which data is traversed cyclically, gradients are clipped, and only the last model is released. More specifically, we establish new Rényi differential privacy (RDP) upper bounds for the last iterate under realistic assumptions of small stepsize and Lipschitz smoothness of the loss function. Our general bounds also recover the special-case convex bounds when the weak-convexity parameter of the objective function approaches zero and no clipping is performed. The approach itself leverages optimal transport techniques for last iterate bounds, which is a nontrivial task when the data is traversed cyclically and the loss function is nonconvex.
1 Introduction
Differential privacy (DP) is an approach to capturing the sensitivity of an algorithm to any individual user's data and is frequently used in both industrial and government applications (see the book by [15] for a rich introduction). Given a possibly nonprivate computation, a desired level of DP (or privacy budget) is generally achieved by bounding its global sensitivity (the maximum change in the mechanism's output caused by changes in a single user or data point) and then adding noise to its output. This noise is typically calibrated to the sensitivity in order to obscure the contributions of any single input example. Conversely, given a mechanism for making a computation differentially private, a method for determining the level of DP it obtains is often called a DP accounting method.
Differentially-private stochastic gradient descent (DP-SGD) refers to a family of popular first-order methods for training model weights with DP [10, 27, 6, 1]. At a high level, a DP-SGD method first computes the gradients of a given set of per-example loss functions with respect to the model weights and applies a privatization algorithm to obtain a private gradient. The private gradient is then used in some first-order optimization scheme, e.g., SGD, Adam, or AdaGrad, to update the model weights. More precisely, the privatization step consists of (i) scaling the per-example loss gradients (a.k.a. gradient clipping) to reduce sensitivity, (ii) adding independent and identically distributed (i.i.d.) Gaussian noise to each of the scaled gradients, and (iii) summing the noised gradients to obtain the private gradient. In general, the higher the variance of the noise, the lower the utility of the final trained model.
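The privatization steps (i)-(iii) above can be sketched in a few lines. This is an illustrative implementation only (the function and parameter names are our own, not the paper's notation):

```python
import numpy as np

def privatized_gradient(per_example_grads, clip_norm, sigma, rng):
    """Sketch of the gradient privatization steps (i)-(iii): clip each
    per-example gradient to l2-norm clip_norm, add i.i.d. Gaussian
    noise to each scaled gradient, and sum the noised gradients."""
    noised = []
    for g in per_example_grads:
        # (i) gradient clipping: scale so the l2 norm is at most clip_norm
        scale = min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
        clipped = scale * g
        # (ii) add spherical Gaussian noise to each scaled gradient
        noised.append(clipped + sigma * rng.standard_normal(g.shape))
    # (iii) aggregate the noised gradients
    return np.sum(noised, axis=0)

rng = np.random.default_rng(0)
grads = [rng.standard_normal(3) for _ in range(4)]
g_priv = privatized_gradient(grads, clip_norm=1.0, sigma=0.5, rng=rng)
```

With `sigma=0` the output reduces to the sum of clipped gradients, which makes the sensitivity bound (at most `clip_norm` under a one-example swap) easy to see.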
Depending on the optimization scheme and the assumptions on how the user-level loss functions are obtained, existing DP accounting methods for DP-SGD can differ significantly. For example, when only the last iterate of DP-SGD is released, existing accounting methods require both sophisticated machinery and numerous strong assumptions to provide tight DP bounds. Some of these strong assumptions, which almost never hold in practice, include: (i) the input data is sampled randomly at each DP-SGD iteration, (ii) the loss functions are convex, (iii) the global DP-SGD sensitivity is known beforehand, and (iv) the intermediate model weights are bounded.
This work develops tighter privacy analyses for last-iterate DP-SGD under more realistic settings than existing works. Consequently, our analyses enable implementations of DP-SGD that apply Gaussian noise with lower variance than existing work and, in turn, obtain higher utility at the same privacy budget. More specifically, we develop a family of Rényi DP (RDP) bounds on the last iterate of DP-SGD, which are novel in that they:
-
(i)
do not assume knowledge of the global sensitivity constant and, hence, are valid with or without gradient clipping;
-
(ii)
hold for both the nonconvex and convex settings under significantly fewer assumptions than other works;
-
(iii)
are parameterized by a weak-convexity parameter (a function f is m-weakly-convex if f + (m/2)‖·‖² is convex), for which one of the bounds smoothly converges to a similar one in the convex setting as this parameter tends to zero.
1.1 Background
We begin by formally stating the problem of interest, describing common terminology and notation, and specifying the DP-SGD variant under consideration. We then briefly describe the Privacy Amplification by Iteration (PABI) argument of [17] and discuss the difficulties of generalizing this argument to more practical settings.
Problem of interest. We develop RDP bounds for the last iterate of a DP-SGD variant applied to the nonsmooth composite optimization problem
min_{x ∈ ℝ^d}  φ(x) := f(x) + h(x),    (1)
where h is convex and proper lower-semicontinuous and f is continuously differentiable on the domain of h. Notice that the assumption on h encapsulates (i) common nonsmooth regularization functions such as the ℓ1-norm, the nuclear matrix norm, and the elastic net regularizer, and (ii) indicator functions of possibly unbounded closed convex sets. A common setting in practice is when f corresponds to a softmax cross-entropy loss function and h corresponds to an ℓ1- or ℓ2-regularization function.
Common terminology. An input data collection is said to be traversed cyclically (or cyclically-traversed) in batches of size b if the first batches partition the data in order (for simplicity, we assume the batch size divides the number of examples throughout), and the rest of the batches cycle through the same partition in order. Cyclically-traversed DP-SGD is a variant of DP-SGD where the input data is traversed cyclically (cyclic traversal is also known in the literature as incremental gradient [23]). A dataset pass occurs when all of the input data in a cyclically-traversed DP-SGD run has been used, and the next batch of inputs is the same as the first batch at the beginning of the pass. Gradient clipping is the process of orthogonally projecting a gradient vector onto a Euclidean ball of radius C centered at the origin; the parameter C is typically called the ℓ2-norm clip value. In this work, we say a function is a randomized operator if it consists of applying some deterministic operator to an input and adding random noise to the resulting output. An operator is said to be L-Lipschitz if the distance between its outputs is at most L times the distance between the corresponding inputs, and it is said to be nonexpansive if it is 1-Lipschitz.
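Cyclic traversal fixes the batch schedule deterministically; a small sketch makes this concrete (names are illustrative, not the paper's notation):

```python
def cyclic_batches(n, batch_size, num_iterations):
    """Indices of the examples used at each step when a dataset of n
    examples is traversed cyclically in batches of size batch_size
    (assumes batch_size divides n, as in the paper)."""
    assert n % batch_size == 0
    batches_per_pass = n // batch_size
    schedule = []
    for t in range(num_iterations):
        b = t % batches_per_pass  # batch index within the current pass
        schedule.append(list(range(b * batch_size, (b + 1) * batch_size)))
    return schedule

# 6 examples, batches of 2, 5 steps: the 4th step re-uses the first batch,
# marking the start of the second dataset pass
print(cyclic_batches(6, 2, 5))  # [[0, 1], [2, 3], [4, 5], [0, 1], [2, 3]]
```

Note the contrast with Poisson subsampling: here each example's appearance is deterministic, which is precisely why subsampling-based amplification arguments do not apply.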
Common notation. Let [n] := {1, …, n} for any positive integer n. Let ℝ denote the set of real numbers and ℝ^n denote the n-fold Cartesian product of ℝ. Let ⟨·,·⟩ denote the inner product of a Euclidean space and ‖·‖ the induced norm. The domain of a function f is dom f := {x : f(x) < ∞}. The function f is said to be proper if dom f ≠ ∅. A function f is said to be lower semicontinuous if liminf_{x → x₀} f(x) ≥ f(x₀) for every x₀. The set of proper, lower semicontinuous, convex functions on ℝ^n is denoted by Γ₀(ℝ^n). The clipping operator is given by
Clip_C(y) := min{1, C / ‖y‖} · y,    (2)
and the proximal operator of a proper convex function h is defined as
prox_h(x) := argmin_u { h(u) + (1/2) ‖u − x‖² }.    (3)
It is well-known that prox_h is nonexpansive (see, for example, [7, Theorem 6.42]) and that Clip_C is the proximal operator of the (convex) indicator function of the ball of radius C.
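Both operators are easy to implement and their nonexpansiveness can be checked numerically. The sketch below uses the ℓ1-norm regularizer, whose proximal operator is soft-thresholding (a standard fact, stated here as an assumption of the example):

```python
import numpy as np

def clip_op(y, C):
    # Euclidean projection onto the ball of radius C (the clipping operator)
    nrm = np.linalg.norm(y)
    return y if nrm <= C else (C / nrm) * y

def prox_l1(x, lam):
    # proximal operator of lam * ||.||_1, i.e. componentwise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# numerically check nonexpansiveness: ||T(x) - T(y)|| <= ||x - y||
rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    assert np.linalg.norm(clip_op(x, 1.0) - clip_op(y, 1.0)) <= np.linalg.norm(x - y) + 1e-12
    assert np.linalg.norm(prox_l1(x, 0.3) - prox_l1(y, 0.3)) <= np.linalg.norm(x - y) + 1e-12
```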
The ∞-Wasserstein metric W_∞(μ, ν) is the smallest real number w such that, for X ∼ μ and Y ∼ ν, there is a joint distribution on (X, Y) where ‖X − Y‖ ≤ w almost surely, i.e., W_∞(μ, ν) = inf_{γ ∈ Γ(μ,ν)} ess sup_{(x,y)∼γ} ‖x − y‖, where Γ(μ, ν) is the collection of measures on the product space with first and second marginals μ and ν, respectively. For any probability distributions μ and ν with μ ≪ ν, the Rényi divergence of order α ∈ (1, ∞) is
D_α(μ ‖ ν) := (1 / (α − 1)) log ∫ (dμ/dν)^α dν,    (4)
where we take the convention that 0/0 = 0. For α = ∞, we define D_∞(μ ‖ ν) := log ess sup (dμ/dν). For parameters z ≥ 0 and α > 1, the shifted Rényi divergence is
D_α^{(z)}(μ ‖ ν) := inf_{μ′ : W_∞(μ′, μ) ≤ z} D_α(μ′ ‖ ν),    (5)
for any probability distributions μ and ν over ℝ^d. Given random variables X and Y with distributions μ_X and μ_Y, we denote D_α(X ‖ Y) := D_α(μ_X ‖ μ_Y) and D_α^{(z)}(X ‖ Y) := D_α^{(z)}(μ_X ‖ μ_Y).
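For the Gaussian distributions that arise in DP-SGD accounting, the Rényi divergence has a well-known closed form, which the following sketch evaluates (the function name is ours):

```python
import numpy as np

def renyi_gaussians(mu0, mu1, sigma, alpha):
    """Closed-form Renyi divergence of order alpha between two isotropic
    Gaussians N(mu0, sigma^2 I) and N(mu1, sigma^2 I):
    D_alpha = alpha * ||mu0 - mu1||^2 / (2 * sigma^2)."""
    diff = np.asarray(mu0, dtype=float) - np.asarray(mu1, dtype=float)
    return alpha * float(np.dot(diff, diff)) / (2.0 * sigma ** 2)

# the divergence grows linearly in alpha and quadratically in the mean shift
print(renyi_gaussians([0.0], [1.0], sigma=2.0, alpha=8.0))  # 1.0
```

This formula is what converts a bound on the "shift" between two coupled DP-SGD runs into an RDP guarantee.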
We consider the swap model for differential privacy. We say two datasets D and D′ are neighbors, denoted D ∼ D′, if D′ can be obtained from D by swapping one record. A randomized algorithm A is said to be (α, ε)-RDP if, for every pair of neighboring datasets D and D′ in the domain of A, we have D_α(A(D) ‖ A(D′)) ≤ ε.
An algorithm A satisfies (α, ε)-local DP if D_α(A(x) ‖ A(x′)) ≤ ε for all records x and x′. Finally, we use the following variable conventions: ℓ is the number of batches (or iterations) in a dataset pass, E is the number of dataset passes, T is the total number of iterations, n is the total number of per-example losses, b is the batch size, η is the DP-SGD stepsize, and C is the clipping norm.
DP-SGD variant. Algorithm 1 outlines the specific variant of DP-SGD applied to (1). This variant takes as input n per-example loss functions, the number of iterations T, i.i.d. samples from a spherical Gaussian distribution, initial model weights, and batch size, stepsize, and ℓ2-clipping-norm values. Model weights are updated as follows. At time step t, the algorithm (i) selects a batch of examples by cyclically traversing the input data, (ii) computes the average of the clipped per-example gradients at the current iterate, and (iii) updates the iterate using the noised gradient. Finally, the algorithm returns the last iterate.
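A minimal executable sketch of this variant follows. It is our own rendering under stated simplifications (noise is added to the averaged clipped gradient; the exact noise scaling in Algorithm 1 may differ), and all names are illustrative:

```python
import numpy as np

def dp_sgd_last_iterate(grad_fns, prox_h, x0, T, batch_size, eta, C, sigma, rng):
    """Sketch of cyclically-traversed DP-SGD on a composite objective:
    cyclic batches, per-example clipping, Gaussian noise, a proximal
    step for the composite term h, and only the last iterate returned.
    grad_fns[i](x) returns the gradient of the i-th per-example loss."""
    n = len(grad_fns)
    x = np.array(x0, dtype=float)
    for t in range(T):
        b = t % (n // batch_size)                    # cyclic traversal
        idx = range(b * batch_size, (b + 1) * batch_size)
        clipped = []
        for i in idx:
            g = grad_fns[i](x)
            clipped.append(min(1.0, C / max(np.linalg.norm(g), 1e-12)) * g)
        noisy = np.mean(clipped, axis=0) + sigma * rng.standard_normal(x.shape)
        x = prox_h(x - eta * noisy, eta)             # prox-linear update
    return x                                         # last iterate only
```

For instance, with quadratic per-example losses, zero noise, a large clip norm, and h = 0 (identity prox), the iterates converge to the ordinary SGD solution, which is a useful sanity check of the sketch.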
1.2 Outline of approach
We now outline our approach of tackling the problem of interest. A more formal treatment is given in Section 2.
To motivate our approach, we provide a brief overview of previous well-known methods. An early approach to analyzing Algorithm 1 is to develop a bound based on local DP for a single dataset pass and extend this bound to multiple passes (see, for example, [27, 14]). While straightforward, this approach can be overly restrictive in a centralized setting. Privacy Amplification by Subsampling (PABS) (see the subsequent work [13] for a comparison of different sampling methods) improves on the previous approach in certain regimes. Although this method allows for clean privacy accounting, its reliance on Poisson subsampling makes it impractical for large-scale applications (see Appendix D for an extended discussion). The line of work started by [17] addressed the limitations of PABS with Privacy Amplification by Iteration (PABI), which achieves a bound for releasing only the final DP-SGD iterate. This PABI bound improves the baseline bound in certain regimes by incorporating a contraction factor that depends on the loss function's convexity parameters.
Our approach is inspired by PABI, but relaxes several of its convexity assumptions. For added context, we briefly review PABI below. Under the assumption that the loss function of (1) is convex and Lipschitz, and that the composite term is the indicator of a closed convex set, [17] shows that the DP-SGD update in Algorithm 1 with a small constant stepsize and no gradient clipping is a nonexpansive operator. This property can then be combined with the following technical result about nonexpansive operators.
Theorem 1.1.
Suppose we are given iterates generated by X_t = ψ_t(X_{t−1}) + Z_t and X′_t = ψ′_t(X′_{t−1}) + Z_t for t = 1, …, T, where ψ_t and ψ′_t are nonexpansive operators, the Z_t are i.i.d. N(0, σ²I) Gaussian random variables, and the scalars s_t satisfy sup_x ‖ψ_t(x) − ψ′_t(x)‖ ≤ s_t for every t.
For any scalar sequences {a_t} and {z_t} satisfying
z_0 = 0,   z_t = z_{t−1} + s_t − a_t ≥ 0,   a_t ≥ 0,   t = 1, …, T,    (6)
we obtain the following last-iterate shifted Rényi divergence bound:
D_α^{(z_T)}(X_T ‖ X′_T) ≤ (α / (2σ²)) · Σ_{t=1}^{T} a_t².    (7)
More specifically, assuming that the DP-SGD iterates first differ at index i (that is, the two runs share their first i − 1 updates), the operators in the above theorem can be taken to be the corresponding DP-SGD updates on the two neighboring datasets, so that the drift terms vanish before index i. Consequently, one can select the remaining scalars so that the final shift is zero and obtain a closed-form bound in (7), which yields a corresponding RDP bound. The generalization to multiple dataset passes follows similarly, but the final bound scales linearly with the number of dataset passes E.
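This style of accounting is easy to compute. The sketch below is a simplified instance of it (our own simplification, not the paper's general bound): the two runs drift apart by s at a single step i_star, the shift-reduction budget is spread evenly over the remaining steps so the final shift is zero, and the resulting closed-form bound is evaluated:

```python
def pabi_rdp_bound(T, i_star, s, sigma, alpha):
    """Simplified PABI-style accounting: a single drift of size s at
    step i_star, reduced evenly over the remaining T - i_star + 1 steps
    (a_t = s / (T - i_star + 1)), driving the final shift to zero.
    Returns alpha * sum(a_t^2) / (2 * sigma^2)."""
    k = T - i_star + 1          # steps over which the shift is reduced
    a = s / k
    return alpha * k * a * a / (2.0 * sigma ** 2)

# the later the differing batch appears, the fewer steps remain to absorb
# the shift, and the larger the resulting bound
print(pabi_rdp_bound(T=100, i_star=1, s=1.0, sigma=1.0, alpha=2.0))    # 0.01
print(pabi_rdp_bound(T=100, i_star=100, s=1.0, sigma=1.0, alpha=2.0))  # 1.0
```

The first call illustrates the amplification: when many noisy steps follow the differing batch, the bound shrinks by a factor of roughly the number of remaining steps.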
In the more practical setting where (i) the loss function in (1) is nonconvex and not necessarily Lipschitz, (ii) gradient clipping is applied, and (iii) the composite term is nonsmooth, it is no longer clear to what extent the corresponding DP-SGD operators are nonexpansive or how the drift bounds should be obtained. Furthermore, the first inequality of (6) no longer holds, and additional technical issues arise when analyzing the case of multiple dataset passes.
Our approach. We generalize the above argument and combine it with additional analyses of weakly-convex functions and proximal operators to relax several strong assumptions. A sketch of our approach is given below, and formal arguments can be found in subsequent sections.
General operator analysis. In Lemma A.2, we study properties of pairs of operators satisfying
sup_u ‖ψ(u) − ψ′(u)‖ ≤ s   and   ‖ψ(u) − ψ(v)‖ ≤ c ‖u − v‖ + e  for every u and v.    (8)
Note that if an operator is Hölder continuous, then it can be shown [22] that it satisfies the second inequality in (8); more specifically, Hölder continuity with a given exponent and modulus yields (8) with corresponding constants. Using these properties, we then establish in Proposition A.5 that if two iterate sequences are generated by a specific sequence of randomized proximal operators applied to neighboring inputs, then (roughly) the shifted Rényi divergence of the last iterates is bounded by a non-decreasing function of the residuals whose form depends on the constants in (8). More specifically, we obtain a sequence of parameterized shifted Rényi divergence bounds similar to (7), while dealing with the challenge of nonconvexity. In the setting of one dataset pass, we derive the bound by solving a related quadratic programming problem on a similar set of residuals as in (6) (see Appendix B for details).
Lipschitz properties of the DP-SGD update. Denoting the DP-SGD update function (or any SGD-like update as in (12)) as the map taking one iterate to the next in Algorithm 1, we show in Proposition 2.1 that, depending on our assumptions on the loss and stepsize, the operator satisfies the second inequality of (8) for different values of its Lipschitz and residual constants.
More specifically, when the domain of the regularizer is bounded, the additive residual in (8) scales with the clipping norm. On the other hand, when the stepsize is sufficiently small, the additive residual vanishes and the Lipschitz constant is a quantity that (i) tends to its convex-case counterpart as the weak convexity parameter tends to zero, i.e., as the problem becomes more convex; and (ii) tends to one when no clipping is performed and the parameter in (i) tends to zero. This continuity with respect to the weak convexity parameter appears to be new, and it is proved in Appendix A.2 using topological properties of weakly-convex functions and proximal operators.
Privacy bounds for DP-SGD. For neighboring DP-SGD iterate sequences, we combine the above results in Theorems 2.2 and 2.3 to obtain RDP bounds of the form
(9) |
where the remaining quantities are as in Algorithm 1. More specifically, assuming that the DP-SGD iterates are contained within an ℓ2-ball of bounded diameter and each per-example loss is Lipschitz continuous, we obtain (9) with explicit constants for a small enough stepsize. When the iterates are (possibly) unbounded, we obtain (9) with
(10) |
1.3 Related work
We first present high-level descriptions of related works in the convex and nonconvex settings, followed by more general works that use advanced composition to obtain loose bounds on the last iterate. We then conclude with some summary tables and figures, and a discussion of technical nuances that carefully compares our work to existing literature.
Convex setting. Given the challenge of proving tight bounds in the general setting, a number of prior analyses focus on the convex case. The works [17, 16] additionally assume Lipschitz continuity of the loss function to obtain their results. The work of [12] studies multiple passes over the data, but their results only apply to the smooth, strongly convex, full-batch setting without clipping. The work of [28] improves these results and extends them to mini-batches under both “shuffle and partition” and “sampling without replacement” strategies. Similarly, the results of [2], and its extension [3], consider only convex, Lipschitz, smooth functions. The contemporary work of [8] introduces the shifted interpolated process under f-DP, allowing for tight characterizations of privacy amplification by iteration for Rényi and other generalized DP definitions.
Notice that none of the above works study clipping; all assume access to the Lipschitz constant of the loss function and require convexity, limiting their practical viability.
Nonconvex setting. We now discuss papers that do not require convexity of the loss function. [4] analyze the privacy loss dynamics for nonconvex functions, but their analysis differs from ours in two ways. First, they assume that their DP-SGD batches are obtained by Poisson sampling or sampling without replacement. Second, their results require numerically solving a minimization problem that can be hard in practice.
A contemporary work by [11] (appearing after the first version of this preprint) derives bounds under the assumption that the loss gradient is Hölder continuous, and that the loss function is also Lipschitz continuous when it is convex. However, this work needs an additional assumption: that the Hölder constants are known and used in a specific optimization subproblem. While [11] focuses on tight theoretical bounds under specific conditions (such as the full-batch and single-epoch setting), we prioritize bridging the gap between theory and practice by addressing the complexities of real-world deployments.
Non-specialized analyses. Prior works on DP-SGD accounting often rely on loose bounds that allow for the release of intermediate updates [20, 6, 1]. These works rely on differential privacy advanced composition results [19], resulting in noise whose standard deviation scales with the number of iterations [6, 1]. Alternatively, using disjoint batches decreases this dependence from the number of iterations to the number of epochs (see the first row in Table 2). However, the assumption that all intermediate updates are released can be stringent for certain loss regimes [17, 16], where the bound can be contracted based on loss smoothness and convexity parameters if only the last iterate is released.
While our work focuses on the privacy guarantees of DP-SGD, it’s important to acknowledge the parallel research efforts exploring the convergence properties of shuffling methods. Recent studies, such as [23] and [9], have established convergence bounds for strongly convex and/or smooth functions in various settings, albeit without considering privacy. These works provide valuable insights into the optimization behavior of shuffling techniques, which can inform future research at the intersection of privacy and optimization.
Summary tables and graphs. Table 1 lists and labels some assumptions that are commonly used in the RDP literature. Table 2 compares our bounds against other RDP bounds. Note that the multi-epoch noisy-SGD algorithm in [17, Algorithm 1] only considers the case where the number of dataset passes equals the number of batches per dataset pass and does not consider batched gradients; as a consequence of the latter, its corresponding RDP upper bound does not depend on the batch size. Figure 1 compares our bounds against other bounds as a function of the number of iterations performed, under various settings. Note that we do not consider the multi-epoch bound in [11, Theorem A.1], as it requires solving nonlinear programming problems, and we do not compare with the bound in [25], because it exploits the fact that batches are randomly sampled (whereas our bounds assume the input batches are obtained deterministically).
| Label | Description |
|---|---|
| | The regularizer is the indicator of a closed convex set |
| | The domain of the regularizer is bounded with a known diameter |
| | Input data batches are of size one |
| | Input data batches are disjoint |
| | Input data batches are obtained using Poisson subsampling with a known sampling rate |
| | Each per-example loss is Lipschitz continuous with a known constant |
| | Each per-example loss gradient is Lipschitz continuous with a known constant |
| | Each per-example loss gradient is Hölder continuous with known constants |
| | The stepsize needs to be small relative to certain smoothness constants |
| | The given RDP bounds only hold for a small range of parameter values |
| | No gradient clipping is applied or the global sensitivity is known |
| | When the loss is convex, no gradient clipping is applied or the global sensitivity is known |
| Source | Asymptotic RDP upper bound (convex) | Asymptotic RDP upper bound (nonconvex) | Assumptions (see Table 1) |
|---|---|---|---|
| [14, Theorem 3.1] (the bound follows directly from Theorem 8 in [5]) | | same as convex | |
| [25, Section 3.3] | numerical procedure (evaluating an integral using numerical quadrature techniques) | same as convex | |
| [25, Theorem 11] | | same as convex | |
| [17, Theorem 35] | | none | |
| [2, Theorem 3.1] | | none | |
| [11, Theorem A.1] | numerical procedure (solving nonlinear programming problems) | numerical procedure (same as the convex case, but with different parameters) | |
| Ours | | same as convex | |
[Figure 1: comparison of the RDP bounds as a function of the number of iterations performed, under various settings.]
Technical nuances. The bound in (9) with the constant in (10) might appear to follow from subsampled RDP composition results such as [25]. However, those results only apply to DP-SGD variants where the batches are sampled randomly, an assumption that does not hold when batches are cyclically traversed. While established Python libraries like Opacus [29] and TensorFlow Privacy [24] implement and account for randomly sampled batches (such as those obtained by Poisson subsampling), these implementations address a different issue. One has to ensure the optimizer truly samples at random from a pre-specified distribution, which becomes incredibly difficult with large-scale datasets (see Appendix D for an extended discussion). Consequently, any privacy guarantee predicated on this idealized random sampling assumption becomes effectively meaningless when the actual optimization process deviates from it (as is the case with cyclical batch traversal). It is worth mentioning that DP-FTRL [18] was specifically developed to address this gap, and the method accepts a potentially weaker DP guarantee in exchange for practical applicability.
For the special case where (i) only one dataset pass is performed, (ii) the objective function is nonconvex, and (iii) each per-example loss is Lipschitz continuous, the bound in (9) with the constant in (10) still holds. Consequently, we obtain an RDP bound that does not grow with the number of iterations, in contrast to the RDP composition bounds in [20], which scale linearly in the number of iterations or epochs, depending on the sampling assumption. Note that (16) and Theorem 2.3 show that, as the weak convexity parameter tends to zero, our bounds converge to an expression matching the bound for PABI. Figure 1 illustrates the regimes under which we improve on previous work.
The addition of the composite term in (1) is not a trivial extension and greatly complicates the analysis. For example, the objective function in (1) is no longer differentiable in general, and more general analyses must be used to handle this nonsmoothness. For a suitable stepsize and Lipschitz continuous loss gradients, [11] shows that the DP-SGD update in the nonconvex case has a Lipschitz constant that is independent of the weak convexity constant. Consequently, the RDP bounds established in [11] in the setting where the weak convexity constant is positive but near zero (i.e., the function is nearly convex) may overestimate the true RDP bound.
For the convex case, it is worth emphasizing that we do not require each per-example loss to be Lipschitz continuous in order to obtain our bounds (see, for example, [17, 2, 16, 11], which do require this assumption). As a consequence, our analysis is applicable to a substantially wider class of objective functions. Moreover, all other existing bounds in the literature of the form in (9) replace the clipping norm with a Lipschitz constant of the loss in (1), and these bounds are generally proportional to that constant. Consequently, when the clipping norm is much smaller than the Lipschitz constant, e.g., when the loss is a quadratic function on a large compact domain, our bounds are significantly tighter (see Figure 1 for an illustration).
Organization. The remainder of the paper gives a formal presentation of the results, including the key assumptions on (1), the topological properties of the DP-SGD update operator, and the non-asymptotic RDP bounds on the last DP-SGD iterates.
2 Privacy bounds for DP-SGD
This section formally presents the main RDP bounds for the last iterates of Algorithm 1. For conciseness, the lengthy proofs of the main results are given in Appendix A.
We start by precisely stating the assumptions on (1). Consider the following assumptions:
-
(A1)
h is proper, lower semicontinuous, and convex;
-
(A2)
there exist constants m, M ≥ 0 such that the function f is differentiable on an open set containing the domain of h and
−(m/2) ‖u − v‖² ≤ f(u) − f(v) − ⟨∇f(v), u − v⟩ ≤ (M/2) ‖u − v‖²  for every u and v.    (11)
We now give a few remarks about (A1)–(A2). First, (A1) is necessary to ensure that the proximal operator of h is well-defined. Second, it can be shown (see, for example, [7, 26] and [21, Proposition 2.1.55]) that (A2) is equivalent to the assumption that ∇f is Lipschitz continuous with constant max{m, M}. Third, the lower bound in (11) is equivalent to assuming that f + (m/2)‖·‖² is convex and, hence, if m = 0 then f is convex. The parameter m is often called a weak-convexity parameter of f. Fourth, using symmetry arguments and the third remark, if M = 0 then f is concave. Finally, the third remark motivates why we choose two parameters, m and M, in (11). Specifically, we use m (resp. M) to develop results that can be described in terms of the level of convexity (resp. concavity) of the problem.
We now develop some properties of an SGD-like update. Given a stepsize η > 0 and a map g that plays the role of a (possibly clipped and averaged) gradient, define the prox-linear operator
T_{η,g}(x) := prox_{ηh}(x − η g(x)).    (12)
Clearly, when g is the gradient of the smooth part, the above update corresponds to an SGD step applied to the problem of minimizing f + h (with respect to x) under the stepsize η and starting point x. Moreover, while it is straightforward to show that the operator is (1 + ηM)-Lipschitz continuous when f satisfies (A2), where M is the smoothness constant in (A2) (see, for example, [11, Appendix A.6]), we prove in Proposition 2.1(b) that the Lipschitz constant can be improved when η is small. Notice that the former constant does not converge to one when m tends to zero, i.e., when f becomes more convex, while the latter one does.
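The prox-linear update itself is a one-liner; the sketch below (our own naming) also checks the reduction to a plain SGD step when the composite term vanishes:

```python
import numpy as np

def prox_linear_update(x, grad_f, prox_h, eta):
    """Prox-linear (composite SGD) step: a gradient step on the smooth
    part f followed by the proximal operator of the nonsmooth part h."""
    return prox_h(x - eta * grad_f(x), eta)

# with h = 0 the proximal operator is the identity and the update
# reduces to a plain SGD step
grad_f = lambda x: 2.0 * x       # gradient of f(x) = ||x||^2
prox_zero = lambda v, eta: v     # prox of h = 0
x = np.array([1.0, -2.0])
print(prox_linear_update(x, grad_f, prox_zero, eta=0.25))  # [ 0.5 -1. ]
```

Swapping `prox_zero` for, say, a soft-thresholding operator recovers proximal SGD for ℓ1-regularized problems, which is the composite setting the paper studies.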
We now present some important properties of the prox-linear operator.
Proposition 2.1.
Let be as in assumption (A2), and define
(13) |
Then, the following statements hold:
-
(a)
if is bounded with diameter for , then for any we have
(14) -
(b)
if satisfies (A2) and then is -Lipschitz continuous;
-
(c)
if satisfies (A2), , and for every and , then is -Lipschitz continuous on .
Some remarks are in order. First, the constant in (13) is at most one and, hence, the operator is nonexpansive when each per-example loss is convex. Second, the operator is Lipschitz continuous with constant less than one when the loss is concave. Third, as in the first remark, a vanishing weak-convexity parameter implies that the operator is nonexpansive. Finally, under the stated parameter choices, the operator is Lipschitz continuous with the improved constant.
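A small numerical illustration of these Lipschitz remarks, under an assumed concrete loss rather than the paper's general setting: for the m-weakly-convex (in fact concave) quadratic f(x) = −(m/2)x², the gradient-step map is exactly (1 + ηm)-Lipschitz, so the constant 1 + ηm is tight and converges to one as m tends to zero:

```python
import numpy as np

m, eta = 0.5, 0.1
# gradient step on f(x) = -(m/2) x^2:  x -> x - eta * f'(x) = (1 + eta*m) x
step = lambda x: x - eta * (-m * x)

# empirically estimate the Lipschitz constant over many point pairs
xs = np.linspace(-1.0, 1.0, 11)
ratios = [abs(step(a) - step(b)) / abs(a - b)
          for a in xs for b in xs if a != b]
assert max(ratios) <= (1 + eta * m) + 1e-12  # never exceeds 1 + eta*m
```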
For the remainder of this section, suppose h satisfies (A1), and suppose the per-example losses of the neighboring datasets agree except at a single index, i.e., there is exactly one index at which the neighboring datasets differ.
We also make use of the following assumption.
-
(A3)
The per-example loss functions each satisfy assumption (A2), with the same constants, for every index.
We now present the main RDP bounds in terms of the following constants:
(15) |
We first present a bound for the case where the domain of h is bounded.
Theorem 2.2.
We now present the RDP bounds for when is (possibly) unbounded.
Theorem 2.3.
We conclude with a few remarks about the above bounds. First, the bound in (19) is of the same order of magnitude as the bound in [17] in terms of the leading parameters. However, the right-hand side of (19) scales linearly with an additional term that does not appear in [17]. Second, the right-hand sides of (17) and (19) increase (at most) linearly with respect to the number of dataset passes E. Third, a particular substitution in (17) yields a bound that depends linearly on the number of iterations and is invariant to changes in the remaining parameters. In Appendix C, we discuss further choices of the parameters in (19) and their properties.
References
- [1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, 2016.
- [2] Jason Altschuler and Kunal Talwar. Privacy of noisy stochastic gradient descent: More iterations without more privacy loss. Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [3] Jason M. Altschuler, Jinho Bok, and Kunal Talwar. On the privacy of noisy stochastic gradient descent for convex optimization. SIAM Journal on Computing, 2024.
- [4] Shahab Asoodeh and Mario Díaz. Privacy loss of noisy stochastic gradient descent might converge even for non-convex losses. arXiv preprint arXiv:2305.09903, 2023.
- [5] Borja Balle and Yu-Xiang Wang. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning (ICML). PMLR, 2018.
- [6] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Symposium on foundations of computer science, 2014.
- [7] Amir Beck. First-order methods in optimization. SIAM, 2017.
- [8] Jinho Bok, Weijie Su, and Jason M Altschuler. Shifted interpolation for differential privacy. arXiv preprint arXiv:2403.00278, 2024.
- [9] Xufeng Cai and Jelena Diakonikolas. Last iterate convergence of incremental methods and applications in continual learning. arXiv preprint arXiv:2403.06873, 2024.
- [10] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 2011.
- [11] Eli Chien and Pan Li. Convergent privacy loss of noisy-SGD without convexity and smoothness. arXiv preprint arXiv:2410.01068, 2024.
- [12] Rishav Chourasia, Jiayuan Ye, and Reza Shokri. Differential privacy dynamics of Langevin diffusion and noisy gradient descent. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
- [13] Lynn Chua, Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, and Chiyuan Zhang. How private are DP-SGD implementations? In International Conference on Machine Learning (ICML), 2024.
- [14] Lynn Chua, Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, and Chiyuan Zhang. Scalable DP-SGD: Shuffling vs. poisson subsampling. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
- [15] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 2014.
- [16] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In ACM SIGACT Symposium on Theory of Computing (STOC), 2020.
- [17] Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. In Annual Symposium on Foundations of Computer Science (FOCS), 2018.
- [18] Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and private (deep) learning without sampling or shuffling. In International Conference on Machine Learning, pages 5213–5225. PMLR, 2021.
- [19] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International Conference on Machine Learning (ICML), 2015.
- [20] Georgios Kaissis, Moritz Knolle, Friederike Jungmann, Alexander Ziller, Dmitrii Usynin, and Daniel Rueckert. A unified interpretation of the gaussian mechanism for differential privacy through the sensitivity index. arXiv preprint arXiv:2109.10528, 2021.
- [21] Weiwei Kong. Accelerated inexact first-order methods for solving nonconvex composite optimization problems. arXiv preprint arXiv:2104.09685, 2021.
- [22] Jiaming Liang and Renato D.C. Monteiro. A unified analysis of a class of proximal bundle methods for solving hybrid convex composite optimization problems. Mathematics of Operations Research, 49(2):832–855, 2024.
- [23] Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. In International Conference on Machine Learning (ICML), 2024.
- [24] Google LLC. Tensorflow privacy. https://github.com/tensorflow/privacy, 2019.
- [25] Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the sampled Gaussian mechanism. arXiv preprint arXiv:1908.10530, 2019.
- [26] Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018.
- [27] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In IEEE Global Conference on Signal and Information Processing. IEEE, 2013.
- [28] Jiayuan Ye and Reza Shokri. Differentially private learning needs hidden state (or much faster convergence). Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [29] Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, Graham Cormode, and Ilya Mironov. Opacus: User-friendly differential privacy library in PyTorch. arXiv preprint arXiv:2109.12298, 2021.
Appendix A Derivation of main results
This appendix derives the main results, namely, Theorems 2.2 and 2.3. It contains three subappendices. The first one derives important properties of a family of randomized operators, the second specializes these results to the DP-SGD update operator in (12), and the last one gives the proofs of Theorems 2.2 and 2.3 using the previous two subappendices.
A.1 General operator analysis
This subappendix gives some crucial properties about randomized proximal Lipschitz operators, which consist of evaluating a Lipschitz proximal operator followed by adding Gaussian noise. More specifically, it establishes several RDP bounds based on the closeness of neighboring operators.
We first bound the shifted Rényi divergence of a randomized proximal Lipschitz operator. The proof of this result is a straightforward extension of the argument in [17, Theorem 22] from -Lipschitz operators to -Lipschitz operators with additive residuals.
To begin, we present two calculus rules for the shifted Rényi divergence given in (5). In particular, the proof of the second rule is a minor modification of the proof given for [17, Lemma 21].
Lemma A.1.
For random variables and and , it holds that
-
(a)
;
-
(b)
for some , if and satisfy
for any , then
Proof.
(a) See [17, Lemma 20].
(b) By the definitions of and , there exists a joint distribution such that and almost surely. Now, the post-processing property of Rényi divergence implies that
Using our assumptions on and and the triangle inequality, we then have
almost surely. Combining the previous two inequalities yields the desired bound, in view of the definitions of and . ∎
The next result is the aforementioned shifted RDP bound.
Lemma A.2.
For some , suppose and satisfy (8) for any . Moreover, let and . Then, for any scalars and satisfying and random variables and , it holds that
(20) |
Proof.
Note that the second inequality in (8) is equivalent to being -Lipschitz continuous when , and that the conditions in (8) need only hold on .
We next apply (20) to a sequence of points generated by the update
(22) |
under different assumptions on and and a single dataset pass. Before proceeding, we first present a technical lemma.
Lemma A.3.
Given scalars and positive integer , let
(23) |
Then, for every ,
-
(a)
;
-
(b)
and ;
-
(c)
.
Proof.
Let be fixed.
(a) This is immediate from the definition of .
(b) We have that
Evaluating the above expression at clearly gives .
(c) The case of is immediate. For the case of , we use the definitions of and to obtain
∎
We now present the shifted RDP properties of the update in (22). This particular result generalizes the one in [17], which only considers the case of and .
Lemma A.4.
Let , , and be fixed. Given , suppose , , and satisfy (8) with
Moreover, given , let , and define the random variables
If , then the following statements hold:
-
(a)
if , then
(24) -
(b)
if , , and , then
(25)
Proof.
(a) Let . Our goal is to recursively apply (20) with suitable choices of the free parameter at each application. Specifically, let be as in (23), and define
Using Lemma A.3(a)–(b), we first have and, hence, by Lemma A.2, we have
Since Lemma A.3(a)–(b) also implies and we have for , we repeatedly apply Lemma A.2 with for to obtain
It now remains to bound . Using Lemma A.3(c) and the fact that and , we have
Combining this bound with the previous one yields the desired conclusion.
(b) Let . Similar to (a), our goal is to recursively apply (20) with suitable choices of the free parameter at each application. For this setting, let and for . Using the fact that and and Lemma A.2, we first have that
We then repeatedly apply Lemma A.2 with for to obtain
It now remains to bound . Using the definition of and the fact that , it holds that
Combining this bound with the previous one yields the desired conclusion. ∎
We next extend the above result to multiple dataset passes.
Proposition A.5.
Proof.
(a) Let . For convenience, define
Using Lemma A.4(a), we have that for the first iterates,
Similarly, using Lemma A.4(a) with , we have that
for any . Finally, using Lemma A.4(a) with and , we have that
Summing the above three inequalities, using the fact that , and using the definition of , we conclude that
Some remarks about Proposition A.5 are in order. First, part (a) shows that if and only differ at , then is finite for any . Second, part (a) also shows that if and differ cyclically for a cycle length of , then the divergence between and grows linearly with the number of cycles . Third, part (b) gives a bound that is independent of . Finally, both of the bounds in parts (a) and (b) can be viewed as Rényi divergences between the Gaussian random variables and for different values of .
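The Gaussian comparison in the final remark can be made concrete via the standard closed form for the Rényi divergence between isotropic Gaussians with a common noise scale. The sketch below is illustrative only (the function name and parameter values are ours, not from the paper):

```python
import numpy as np

def renyi_gaussian(mu, mu_prime, sigma, alpha):
    """Closed form D_alpha( N(mu, sigma^2 I) || N(mu', sigma^2 I) )
    = alpha * ||mu - mu'||^2 / (2 sigma^2)."""
    dist = np.linalg.norm(np.asarray(mu, float) - np.asarray(mu_prime, float))
    return alpha * dist ** 2 / (2 * sigma ** 2)

# With a fixed mean gap, the divergence grows linearly in the Renyi order
# alpha and shrinks quadratically in the noise scale sigma.
eps = renyi_gaussian([0.0, 0.0], [1.0, 0.0], sigma=1.0, alpha=2.0)
```

In particular, when each dataset pass contributes such a divergence term additively, the total bound scales linearly with the number of passes, matching the second remark above.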
A.2 SGD operator analysis
This subappendix derives some important properties about the DP-SGD update operator in (12) and also contains the proof of Proposition 2.1.
To start, we recall the following well-known bound from convex analysis. Its proof can be found, for example, in [7, Theorem 5.8(iv)].
Lemma A.6.
Let be convex and differentiable. Then satisfies
(28) |
if and only if
We next give a technical bound on , which generalizes the co-coercivity of convex functions to weakly-convex functions.
Lemma A.7.
For any and satisfying (A2), it holds that
Proof.
The below result gives some technical bounds on changes in the proximal function.
Lemma A.8.
Given , let and define
Then, the following statements hold:
-
(a)
;
-
(b)
.
We now develop some technical properties of . The first result presents a bound involving the following quantities for and .
(29) |
Lemma A.9.
Proof.
We are now ready to give the proof of Proposition 2.1.
A.2.1 Proof of Proposition 2.1
Proof.
(a) Let and be arbitrary. Moreover, denote for . Using the definition of , the assumption that for any , and the triangle inequality, we have that
A.3 RDP bounds
The first result shows how the updates in Algorithm 1 are randomized proximal updates applied to the operator in (12) with .
Lemma A.10.
Proof.
This follows immediately from the definition of , the update rules in Algorithm 1, and the fact that is the proximal operator of the (convex) indicator function of the convex set . ∎
We now present some important norm bounds.
Lemma A.11.
Proof.
We now give the proofs of the main RDP bounds.
A.3.1 Proof of Theorem 2.2
Proof.
A.3.2 Proof of Theorem 2.3
Proof.
Appendix B Choice of residuals
This appendix briefly discusses the choice of residuals that are used in the proof of Proposition A.5(a) and Lemma A.2.
In the setup of Proposition A.5(a), it is straightforward to show that if is a nonnegative sequence of scalars such that
then repeatedly applying Lemma A.2 with yields
(35) |
Hence, to obtain the tightest bound of the form in (35), we need to solve the quadratic program
s.t | |||
If we ignore the inequality constraints, the first-order optimality condition of the resulting problem is that there exists such that
The latter identity implies that
which then implies
Hence, to verify that the above choice of is optimal for , it remains to verify that for . Indeed, this follows from Lemma A.3(b) after normalizing for the factor. As a consequence, the right-hand side of (35) is minimized for our choice of above.
Appendix C Parameter choices
Let us now consider some interesting values for , , and .
The result below establishes a useful bound on for sufficiently large values of .
Lemma C.1.
For any and , if then
Proof.
Using the definition of , we have
∎
Corollary C.2.
Proof.
For ease of notation, denote , , , and . We first note that
which implies
Consequently, using Lemma C.1 with and the definitions of and , we have that
Using the above bound and Theorem 2.3 with , we obtain
In view of the fact that Algorithm 1 returns the last iterate (or ), the conclusion follows. ∎
Some remarks about Corollary C.2 are in order. First, increases linearly with the number of dataset passes . Second, the smaller is, the smaller the effect of on is. Third, , which implies that reducing the dependence on in leads to more restrictive choices of . Finally, it is worth emphasizing that the restrictions on can be removed by using (17) directly. However, the resulting bounds are less informative in terms of the topological constants and .
We now present an RDP bound that is independent of when is sufficiently small.
Corollary C.3.
Let , , , , , , , , and be as in Theorem 2.3. If
and no gradient clipping is performed, then we have
Proof.
Appendix D Limitations of Poisson sampling in practice
This appendix discusses the computational limitations of implementing Poisson sampling in practice. It is primarily concerned with the large-scale setting, where datasets may be on the order of hundreds of millions of examples.
Data access. Implementations of Poisson sampling, e.g., Opacus [29], typically employ a pseudorandom number generator to (i) randomly sample a collection of indices from zero to , where is the size of the dataset, and (ii) map these indices to the corresponding examples in the dataset to generate a batch. In order for (ii) to be efficient, many libraries need fast random access to the dataset, which is difficult to achieve without loading the entire dataset into RAM (as reading data from disk can be orders of magnitude more expensive). In contrast, cyclic traversal of batches only requires (relatively small) fixed blocks of the dataset to be loaded into memory for each batch, and need not match indices (as in (i) above) to data.
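The access-pattern contrast can be sketched in a few lines of NumPy; the dataset size, sampling rate, and variable names below are illustrative, not taken from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000   # illustrative dataset size
q = 1e-3        # illustrative per-example sampling rate

# Steps (i)-(ii) of Poisson sampling: draw a random index set, then gather
# the corresponding examples. The gather step needs O(1) random access to
# arbitrary positions in the dataset.
poisson_indices = np.flatnonzero(rng.random(n) < q)

# Cyclic traversal: the t-th batch is a contiguous block of fixed size, so
# only that block must be resident in memory, and no index matching is needed.
batch_size, t = 1_000, 7
start = (t * batch_size) % n
cyclic_indices = np.arange(start, start + batch_size)
```

The random gather is what forces in-RAM datasets; the contiguous slice can be streamed from disk block by block.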
Variable batch size. Independent of the access speed of the dataset examples, Poisson sampling also generates batches of random sizes, which are typically inconvenient to handle in deep learning systems [14]. For example, popular just-in-time compilation-based machine learning libraries such as JAX, PyTorch/Opacus, and TensorFlow may need to retrace their computation graph at every training step, as the batch size cannot be statically inferred or kept constant. Additionally, optimizing training workloads on hardware accelerators such as graphics processing units (GPUs) and tensor processing units (TPUs) becomes difficult, as (i) they require any on-device data to have fixed sizes and (ii) batches generated by Poisson sampling have variable sizes. In contrast, cyclic traversal of batches always generates fixed batch sizes and, consequently, does not suffer from the above issues.
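The batch-size fluctuation itself is easy to see: under Poisson sampling with rate q over n examples, the batch size is Binomial(n, q). The snippet below uses illustrative parameters of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, steps = 50_000, 1e-3, 8   # illustrative values

# Poisson sampling: each example is included independently with probability
# q, so the batch size at every step is Binomial(n, q). It fluctuates around
# n*q and cannot be inferred statically by a compiler or pre-allocated on an
# accelerator.
poisson_sizes = [int((rng.random(n) < q).sum()) for _ in range(steps)]

# Cyclic traversal: every batch has the same, statically known size, so
# compiled computation graphs and device buffers keep a fixed shape.
fixed_batch = 500
fixed_sizes = [fixed_batch] * steps
```

A framework that retraces on every new input shape would recompile once per step in the first case and only once overall in the second.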