
Privacy of the last iterate in cyclically-traversed DP-SGD on nonconvex composite losses

Weiwei Kong (Google Research, Email: [email protected], ORCID: 0000-0002-4700-619X)    Mónica Ribero (Google Research, Email: [email protected])
Abstract

Differentially-private stochastic gradient descent (DP-SGD) is a family of iterative machine learning training algorithms that privatize gradients to generate a sequence of differentially-private (DP) model parameters. It is also the standard tool used to train DP models in practice, even though most users are only interested in protecting the privacy of the final model. Tight DP accounting for the last iterate would minimize the amount of noise required while maintaining the same privacy guarantee and potentially increasing model utility. However, last-iterate accounting is challenging, and existing works require strong assumptions not satisfied by most implementations. These include assuming that (i) the global sensitivity constant is known, obviating gradient clipping; (ii) the loss function is Lipschitz or convex; and (iii) input batches are sampled randomly.

In this work, we forgo these unrealistic assumptions and provide privacy bounds for the most commonly used variant of DP-SGD, in which data is traversed cyclically, gradients are clipped, and only the last model is released. More specifically, we establish new Rényi differential privacy (RDP) upper bounds for the last iterate under the realistic assumptions of a small stepsize and Lipschitz smoothness of the loss function. Our general bounds also recover the special-case convex bounds when the weak-convexity parameter of the objective function approaches zero and no clipping is performed. The approach itself leverages optimal transport techniques for last-iterate bounds, which is a nontrivial task when the data is traversed cyclically and the loss function is nonconvex.

1 Introduction

Differential privacy (DP) is an approach to capture the sensitivity of an algorithm to any individual user’s data and is frequently used in both industrial and government applications (see the book by [15] for a rich introduction). Given a possibly nonprivate computation ff, a desired level of DP (or privacy budget) ε\varepsilon is generally achieved by bounding the global sensitivity (the maximum change in the mechanism’s output caused by changes in a single user or data point) of ff and then adding noise to its output. This noise is typically calibrated to the sensitivity and ε\varepsilon in order to obscure the contributions of a single input example. Conversely, given a mechanism 𝒜\cal A for making a computation differentially private, a method for determining the level of DP obtained by 𝒜\cal A is often called a DP accounting method.

Differentially-private stochastic gradient descent (DP-SGD) refers to a family of popular first-order methods for training model weights with DP[10, 27, 6, 1]. At a high level, a DP-SGD method first computes the gradients of a given set of per-example loss functions with respect to the model weights and applies an algorithm 𝒜{\cal A} to obtain a private gradient 𝒢\cal G. The private gradient 𝒢\cal G is then used in some first-order optimization scheme, e.g., SGD, Adam, or AdaGrad, to update model weights. More precisely, 𝒜{\cal A} consists of (i) scaling the per-example loss gradients (a.k.a. gradient clipping) to reduce sensitivity, (ii) adding independent and identically distributed (i.i.d.) Gaussian noise 𝒵\cal Z to each of the scaled gradients, and (iii) summing the noised gradients to obtain 𝒢\cal G. In general, the higher the variance of 𝒵\cal Z is, the lower the utility of the final trained model.
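As a concrete illustration, the privatization steps (i)–(iii) above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation; the function name and signature are ours:

```python
import numpy as np

def private_gradient(per_example_grads, clip_norm, noise_std, rng):
    """Sketch of the privatization algorithm A: clip each per-example
    gradient to norm at most clip_norm, add i.i.d. Gaussian noise to each
    scaled gradient, and sum the results to obtain the private gradient G."""
    total = np.zeros_like(per_example_grads[0])
    for g in per_example_grads:
        nrm = np.linalg.norm(g)
        if nrm > clip_norm:                                    # (i) gradient clipping
            g = g * (clip_norm / nrm)
        g = g + rng.normal(0.0, noise_std, size=g.shape)       # (ii) Gaussian noise
        total += g                                             # (iii) sum the noised gradients
    return total
```

With `noise_std = 0` the function reduces to a sum of clipped gradients, which makes the clipping behavior easy to verify.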

Depending on the optimization scheme and the assumptions on how the user-level loss functions are obtained, existing DP accounting methods for DP-SGD can differ significantly. For example, when only the last iterate of DP-SGD is released, existing accounting methods require both sophisticated machinery and numerous strong assumptions to provide tight DP bounds. Some of these strong assumptions, which almost never hold in practice, include: (i) the input data is sampled randomly at each DP-SGD iteration, (ii) the loss functions are convex, (iii) the global DP-SGD sensitivity is known beforehand, and (iv) the intermediate model weights are bounded.

This work develops tighter privacy analyses for last-iterate DP-SGD under more realistic settings than existing works. Consequently, our analyses enable implementations of DP-SGD that apply Gaussian noise 𝒵\cal Z with lower variance than existing work and, hence, obtain higher utility at the same privacy budget ε\varepsilon. More specifically, we develop a family of Rényi DP (RDP) bounds on the last iterate of DP-SGD, which are novel in that they:

  • (i)

    do not assume knowledge of the global sensitivity constant and, hence, are valid with or without gradient clipping;

  • (ii)

    hold for both the nonconvex and convex settings under significantly fewer assumptions than other works;

  • (iii)

are parameterized by a weak convexity parameter m0m\geq 0 (a function ff is mm-weakly-convex if f+m2/2f+m\|\cdot\|^{2}/2 is convex), for which one of the bounds smoothly converges to a similar one in the convex setting as m0m\to 0.

1.1 Background

We begin by formally stating the problem of interest, describing common terminology and notation, and specifying the DP-SGD variant under consideration. We then briefly describe the Privacy Amplification by Iteration (PABI) argument of [17] and discuss the difficulties of generalizing this argument to more practical settings.

Problem of interest. We develop RDP bounds for the last iterate of a DP-SGD variant applied to the nonsmooth composite optimization problem

minxn{ϕ(x):=1ki=1kfi(x)+h(x)}\min_{x\in\mathbb{R}^{n}}\left\{\phi(x):=\frac{1}{k}\sum_{i=1}^{k}f_{i}(x)+h(x)\right\} (1)

where hh is convex and proper lower-semicontinuous and fif_{i} is continuously differentiable on the domain of hh. Notice that the assumption on hh encapsulates (i) common nonsmooth regularization functions such as the 1\ell_{1}-norm 1\|\cdot\|_{1}, nuclear matrix norm \|\cdot\|_{*}, elastic net regularizer and (ii) indicator functions on possibly unbounded closed convex sets. A common setting in practice is when (1/k)i=1kfi(x)(1/k)\sum_{i=1}^{k}f_{i}(x) corresponds to a softmax cross-entropy loss function and h(x)h(x) corresponds to an 1\ell_{1}- or 2\ell_{2}-regularization function.

Common terminology. An input data collection X={xi}i=1kX=\{x_{i}\}_{i=1}^{k} is said to be traversed cyclically (or cyclically-traversed) in batches {Bt}\{B_{t}\} of size bb if BtB_{t} contains {xb(t1)+1,,xbt}\{x_{b(t-1)+1},\ldots,x_{bt}\} for the first tk/bt\leq k/b batches (for simplicity, we assume bb divides kk throughout), and the rest of the batches cycle between B1,,Bk/bB_{1},\ldots,B_{k/b} in order. Cyclically-traversed DP-SGD is a variant of DP-SGD where the input data is traversed cyclically (this traversal is also known in the literature as incremental gradient [23]). A dataset pass occurs when all of the input data (e.g., XX above) in a cyclically-traversed DP-SGD run has been used and the next batch of inputs coincides with the first batch at the beginning of the dataset pass. Gradient clipping is the process of orthogonally projecting a gradient vector in n\mathbb{R}^{n} onto a Euclidean ball of radius CC centered at the origin. The parameter CC is typically called the 2\ell_{2}-norm clip value. In this work, we say a function is a randomized operator if it consists of applying some deterministic operator to an input and adding random noise to the resulting output. An operator 𝒯{\cal T} is said to be LL-Lipschitz if 𝒯(x)𝒯(y)Lxy\|{\cal T}(x)-{\cal T}(y)\|\leq L\|x-y\| for every xx and yy, and 𝒯{\cal T} is said to be nonexpansive if it is 1-Lipschitz.
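Under these conventions, the batch schedule of cyclically-traversed DP-SGD is fully deterministic. A minimal sketch (the helper name is ours) that reproduces it:

```python
def cyclic_batches(k, b, T):
    """Batches B_1, ..., B_T (1-indexed example ids) visited when a
    dataset of size k is traversed cyclically in batches of size b
    (b is assumed to divide k, as in the text)."""
    assert k % b == 0
    return [[(b * (t - 1)) % k + r for r in range(1, b + 1)] for t in range(1, T + 1)]
```

For example, with k = 6, b = 2, and T = 4, the fourth batch wraps around and coincides with the first, marking the start of a new dataset pass.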

Common notation. Let [n]={1,,n}[n]=\{1,\ldots,n\} for any positive integer nn. Let \mathbb{R} denote the set of real numbers and n=××\mathbb{R}^{n}=\mathbb{R}\times\cdots\times\mathbb{R} denote the nn-fold Cartesian product of \mathbb{R}. Let (,,n)(\langle\cdot,\cdot\rangle,\mathbb{R}^{n}) denote a Euclidean space over n\mathbb{R}^{n} and denote :=,\|\cdot\|:=\sqrt{\langle\cdot,\cdot\rangle} to be the induced norm. The domain of a function ϕ:n(,]\phi:\mathbb{R}^{n}\mapsto(-\infty,\infty] is domϕ:={zn:ϕ(z)<}\operatorname*{dom}\phi:=\{z\in\mathbb{R}^{n}:\phi(z)<\infty\}. The function ϕ\phi is said to be proper if domϕ\operatorname*{dom}\phi\neq\emptyset. A function ϕ:n(,]\phi:\mathbb{R}^{n}\mapsto(-\infty,\infty] is said to be lower semicontinuous if lim infxx0ϕ(x)ϕ(x0)\liminf_{x\to x_{0}}\phi(x)\geq\phi(x_{0}). The set of proper, lower semicontinuous, convex functions over n\mathbb{R}^{n} is denoted by Conv¯(n)\overline{{\rm Conv}}\,(\mathbb{R}^{n}). The clipping operator is given by

ClipC(y):=ymin{1,Cy},\displaystyle{\rm Clip}_{C}(y):=y\cdot\min\left\{1,\frac{C}{\|y\|}\right\}, (2)

and the proximal operator for a proper convex function ψ\psi is defined as

proxψ(z0)=argminzn{ψ(z)+12zz02}z0n.\operatorname{prox}_{\psi}(z_{0})=\operatorname*{argmin}_{z\in\mathbb{R}^{n}}\left\{\psi(z)+\frac{1}{2}\|z-z_{0}\|^{2}\right\}\quad\forall z_{0}\in\mathbb{R}^{n}. (3)

It is well-known that proxψ(){\rm prox}_{\psi}(\cdot) is nonexpansive (see, for example [7, Theorem 6.42]) and that ClipC(y){\rm Clip}_{C}(y) is the proximal operator for the (convex) indicator function of the set {x:xC}\{x:\|x\|\leq C\}.
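To make these two operators concrete, the sketch below (function names are ours) implements the clipping operator (2) and, as a standard example of a proximal operator (3), soft-thresholding, which is the proximal operator of the 1\ell_{1} regularizer:

```python
import numpy as np

def clip(y, C):
    """Clip_C from (2): orthogonal projection onto the Euclidean
    ball of radius C centered at the origin."""
    nrm = np.linalg.norm(y)
    return y * (C / nrm) if nrm > C else y

def prox_l1(z, lam):
    """prox_{lam * ||.||_1} from (3), evaluated in closed form as
    coordinatewise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```

Both operators can be checked numerically to be nonexpansive by sampling random pairs of points, consistent with the discussion above.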

The \infty-Wasserstein metric 𝒲(μ,ν){\cal W}_{\infty}(\mu,\nu) is the smallest real number ww such that for XμX\sim\mu and YνY\sim\nu, there is a joint distribution on (X,Y)(X,Y) where XYw\|X-Y\|\leq w almost surely, i.e., 𝒲(μ,ν)=infγΓ(μ,ν)esssup(x,y)γxy{\cal W}_{\infty}(\mu,\nu)=\inf_{\gamma\sim\Gamma(\mu,\nu)}\operatorname*{ess\ sup}_{(x,y)\sim\gamma}\|x-y\|, where Γ(μ,ν)\Gamma(\mu,\nu) is the collection of measures on n×n\mathbb{R}^{n}\times\mathbb{R}^{n} with first and second marginals μ\mu and ν\nu, respectively. For any probability distributions μ\mu and ν\nu with νμ\nu\ll\mu, the Rényi divergence of order α(1,)\alpha\in(1,\infty) is

Dα(μν)=1α1log[μ(x)ν(x)]αν(x)𝑑x,D_{\alpha}(\mu\|\nu)=\frac{1}{\alpha-1}\log\int\left[\frac{\mu(x)}{\nu(x)}\right]^{\alpha}\nu(x)\,dx, (4)

where we take the convention that 0/0=00/0=0. For ν≪̸μ\nu\not\ll\mu, we define Dα(μν)=D_{\alpha}(\mu\|\nu)=\infty. For parameters τ0\tau\geq 0 and α1\alpha\geq 1, the shifted Rényi divergence is

Dα(τ)(μν):=infγ{Dα(γν):𝒲(μ,γ)τ}D_{\alpha}^{(\tau)}(\mu\|\nu):=\inf_{\gamma}\left\{D_{\alpha}(\gamma\|\nu):{\cal W}_{\infty}(\mu,\gamma)\leq\tau\right\} (5)

for any probability distributions μ\mu and ν\nu over n\mathbb{R}^{n}. Given random variables XμX\sim\mu and YνY\sim\nu, we denote Dα(XY)=Dα(μν)D_{\alpha}(X\|Y)=D_{\alpha}(\mu\|\nu) and Dα(τ)(XY)=Dα(τ)(μν)D_{\alpha}^{(\tau)}(X\|Y)=D_{\alpha}^{(\tau)}(\mu\|\nu).
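For Gaussians with a common isotropic covariance, the case relevant to DP-SGD noise, the Rényi divergence (4) has the well-known closed form Dα(𝒩(μ1,σ²I)∥𝒩(μ2,σ²I)) = α∥μ1−μ2∥²/(2σ²); this is the source of the α/(2σ²) factor appearing in the bounds later on. The sketch below (function names are ours) checks the closed form against direct quadrature of (4) in one dimension:

```python
import numpy as np

def renyi_gaussian_closed(alpha, mu1, mu2, sigma):
    """Closed form: D_alpha(N(mu1, s^2) || N(mu2, s^2)) = alpha*(mu1-mu2)^2/(2*s^2)."""
    return alpha * (mu1 - mu2) ** 2 / (2 * sigma ** 2)

def renyi_gaussian_numeric(alpha, mu1, mu2, sigma):
    """Evaluate (4) by trapezoidal quadrature for one-dimensional Gaussians
    (numerically stable only for moderate alpha and |mu1 - mu2| / sigma)."""
    grid = np.linspace(mu1 - 20 * sigma, mu1 + 20 * sigma, 200001)
    pdf = lambda x, m: np.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    mu, nu = pdf(grid, mu1), pdf(grid, mu2)
    integrand = (mu / nu) ** alpha * nu
    dx = grid[1] - grid[0]
    return np.log(np.sum(0.5 * (integrand[:-1] + integrand[1:])) * dx) / (alpha - 1)
```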

We consider the swap model for differential privacy. We say two datasets SS and SS^{\prime} are neighbors, denoted as SSS\sim S^{\prime}, if SS^{\prime} can be obtained by swapping one record. A randomized algorithm 𝒜\cal A is said to be (α,ε)(\alpha,\varepsilon)-RDP if, for every pair of neighboring datasets SS and SS^{\prime} in the domain of 𝒜\cal A, we have Dα(𝒜(S)𝒜(S))εD_{\alpha}({\cal A}(S)\|{\cal A}(S^{\prime}))\leq\varepsilon.

𝒜{\cal A} satisfies local DP if for all records xix_{i} and xjx_{j}, Dα(𝒜(xi)𝒜(xj))εD_{\alpha}({\cal A}(x_{i})\|{\cal A}(x_{j}))\leq\varepsilon. Finally, we use the following variable conventions: \ell is the number of batches (or iterations) in a dataset pass, EE is the number of dataset passes, T=ET=E\cdot\ell is the total number of iterations, kk is the total number of per-example losses, bb is the batch size, λ\lambda is the DP-SGD stepsize, and CC is the clipping norm.

DP-SGD variant. Algorithm 1 outlines the specific variant of DP-SGD applied to (1). This variant takes as input kk per-example loss functions {fi}i=1k\{f_{i}\}^{k}_{i=1}, the number of iterations TT, iid samples {Nt}t=1T\{N_{t}\}^{T}_{t=1} from a spherical Gaussian distribution 𝒩(0,σ2I){\cal N}(0,\sigma^{2}I), initial model weights X0X_{0}, and batch size, stepsize, and 2\ell_{2}-clipping norm values. Model weights are updated as follows. At time step tt, the algorithm (i) selects a batch of examples by cyclically traversing {fi}i=1k\{f_{i}\}^{k}_{i=1}, (ii) computes the average gtg_{t} of the clipped per-example gradients at Xt1X_{t-1}, and (iii) updates Xt1X_{t-1} using the resulting noisy gradient. Finally, the algorithm returns the last iterate, XTX_{T}.

Input: {fi}i=1k\{f_{i}\}_{i=1}^{k}, hh, iid samples {Nt}n\{N_{t}\}\subseteq\mathbb{R}^{n} from 𝒩(0,σ2I){\cal N}(0,\sigma^{2}I), X0domhX_{0}\in\operatorname*{dom}h;
Data: batch size bb, stepsize λ\lambda, clipping norm CC, iteration limit TT, steps per dataset pass \ell;
Output: XTdomhX_{T}\in\operatorname*{dom}h;
for t=1,,Tt=1,\ldots,T do
       jtb(t1)modkj_{t}\leftarrow b(t-1)\bmod k
       Bt{jt+1,,jt+b}B_{t}\leftarrow\{j_{t}+1,\ldots,j_{t}+b\}
       gt(1/b)iBtClipC(fi(Xt1))g_{t}\leftarrow(1/b)\sum_{i\in B_{t}}{\rm Clip}_{C}(\nabla f_{i}(X_{t-1}))
       Xtproxλh(Xt1λgt+Nt)X_{t}\leftarrow\operatorname{prox}_{\lambda h}\left(X_{t-1}-\lambda g_{t}+N_{t}\right)
end for
return XTX_{T}
Algorithm 1 Cyclically-traversed last-iterate DP-SGD
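The pseudocode above admits a direct, non-production transcription in NumPy. This is a sketch; the function name and the gradient-oracle interface `grad(i, x)` are ours, and `prox_h` defaults to the identity (i.e., h = 0):

```python
import numpy as np

def dp_sgd_last_iterate(grad, k, b, T, lam, C, sigma, x0,
                        prox_h=lambda z: z, rng=None):
    """Transcription of Algorithm 1. `grad(i, x)` returns the gradient of
    the (0-indexed) per-example loss f_i at x, and `prox_h` is prox_{lam*h}.
    Only the last iterate X_T is returned."""
    rng = rng if rng is not None else np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    for t in range(1, T + 1):
        j = (b * (t - 1)) % k                 # cyclic batch start index
        g = np.zeros_like(x)
        for i in range(j, j + b):             # batch B_t (0-indexed)
            gi = grad(i, x)
            nrm = np.linalg.norm(gi)
            if nrm > C:                       # per-example gradient clipping
                gi = gi * (C / nrm)
            g += gi
        g /= b
        x = prox_h(x - lam * g + rng.normal(0.0, sigma, size=x.shape))
    return x
```

With `sigma = 0` and a large clipping norm, the routine reduces to plain incremental gradient descent, which is a convenient sanity check.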

1.2 Outline of approach

We now outline our approach of tackling the problem of interest. A more formal treatment is given in Section 2.

To motivate our approach, we provide a brief overview of previous well-known methods. An early approach to analyzing Algorithm 1 is to develop a bound based on local DP for a single dataset pass and extend this bound to multiple passes (see, for example, [27, 14]). While straightforward, this approach can be overly restrictive in a centralized setting. Privacy Amplification by Subsampling (PABS) (see the subsequent work [13] for a comparison of different sampling methods) improves on this approach in certain regimes. Although PABS allows for clean privacy accounting, its reliance on Poisson subsampling makes it impractical for large-scale applications (see Appendix D for an extended discussion). The line of work started by [17] addressed the limitations of PABS with Privacy Amplification by Iteration (PABI), which achieves a bound for releasing only the final DP-SGD iterate. This PABI bound improves the baseline bound in certain regimes by incorporating a contraction factor dependent on the loss function’s convexity parameters.

Our approach is inspired by PABI, but relaxes several of its convexity assumptions. For added context, we briefly review PABI below. Under the assumption that the loss function of (1) is convex and QQ-Lipschitz, and that hh is the indicator of a closed convex set, [17] shows that the DP-SGD update in Algorithm 1 with small constant stepsize λ\lambda and no gradient clipping is a nonexpansive operator. This property can then be combined with the following technical result about nonexpansive operators.

Theorem 1.1.

Suppose we are given iterates {Xt}\{X_{t}\} and {Xt}\{X_{t}^{\prime}\}, nonexpansive operators {ϕt}\{\phi_{t}\} and {ϕt}\{\phi_{t}^{\prime}\}, iid Gaussian random variables {Nt}\{N_{t}\}, and scalars {st}\{s_{t}\} satisfying

Xt=ϕt(Xt1)+Nt,Xt=ϕt(Xt1)+Nt,supxϕt(x)ϕt(x)stt[T].X_{t}=\phi_{t}(X_{t-1})+N_{t},\quad X_{t}^{\prime}=\phi_{t}^{\prime}(X_{t-1}^{\prime})+N_{t},\quad\sup_{x}\|\phi_{t}(x)-\phi_{t}^{\prime}(x^{\prime})\|\leq s_{t}\quad\forall t\in[T].

For any scalar sequences {at}\{a_{t}\} and {zt}\{z_{t}\} satisfying

zt=itsiitai0,zt0,at0,t[T],z_{t}=\sum_{i\leq t}s_{i}-\sum_{i\leq t}a_{i}\geq 0,\quad z_{t}\geq 0,\quad a_{t}\geq 0,\quad\forall t\in[T], (6)

we obtain the following last-iterate shifted Rényi divergence bound:

Dα(zT)(XTXT)Dα(X0X0)α2σ2t=1Tat2=:T({at})T1,α1.D_{\alpha}^{(z_{T})}(X_{T}\|X_{T}^{\prime})-D_{\alpha}(X_{0}\|X_{0}^{\prime})\leq\frac{\alpha}{2\sigma^{2}}\sum_{t=1}^{T}a_{t}^{2}=:{\cal R}_{T}(\{a_{t}\})\quad\forall T\geq 1,\quad\forall\alpha\geq 1. (7)

More specifically, assuming that the DP-SGD iterates first differ at index tt^{*}, i.e., XtXtX_{t^{*}}^{\prime}\neq X_{t^{*}} and Xt=XtX_{t}^{\prime}=X_{t} for every t<tt<t^{*}, the operators {ϕt}\{\phi_{t}\} and {ϕt}\{\phi_{t}^{\prime}\} in the above theorem can be formed with st=2λQs_{t^{*}}=2\lambda Q and st=0s_{t}=0 for every ttt\neq t^{*}. Consequently, one can select {at}\{a_{t}\} so that the shift satisfies zT=0z_{T}=0 and obtain a closed form bound T({at})=Θ(α[λQ/σ]2){\cal R}_{T}(\{a_{t}\})=\Theta(\alpha[\lambda Q/\sigma]^{2}) in (7), which yields a corresponding RDP bound when X0=X0X_{0}=X_{0}^{\prime}. The generalization to multiple dataset passes follows similarly but the final bound scales with the number of dataset passes EE. Specifically, we have T({at})=Θ(α[E/][λQ/σ]2){\cal R}_{T}(\{a_{t}\})=\Theta(\alpha[E/\ell]\cdot[\lambda Q/\sigma]^{2}).
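The amortization of the single shift s_{t*} = 2λQ over the remaining steps can be made concrete with a short sketch (the function name is ours): choosing a_t = s_{t*}/(T − t* + 1) for t ≥ t* satisfies (6) with z_T = 0, and (7) then yields a closed-form bound:

```python
def pabi_bound(alpha, sigma, lam, Q, T, t_star):
    """Sketch of the PABI accounting above: place the single shift
    s_{t*} = 2*lam*Q at step t_star, amortize it evenly over steps
    t_star..T (a_t = s/(T - t_star + 1)), and evaluate the bound
    R_T({a_t}) = alpha/(2*sigma^2) * sum_t a_t^2 from (7)."""
    s = 2.0 * lam * Q
    n = T - t_star + 1
    a = [0.0] * (t_star - 1) + [s / n] * n
    # Check the constraints (6): partial sums of a_t never exceed those of s_t.
    assert all(sum(a[:t + 1]) <= (s if t + 1 >= t_star else 0.0) + 1e-12 for t in range(T))
    return alpha / (2.0 * sigma ** 2) * sum(ai ** 2 for ai in a)
```

The resulting value is α s²/(2σ²(T − t* + 1)), which for the worst case t* = T recovers the Θ(α[λQ/σ]²) behavior described above.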

In the more practical setting where (i) the loss function in (1) is nonconvex and not necessarily QQ-Lipschitz, (ii) gradient clipping is applied, and (iii) hh is nonsmooth, it is no longer clear to what extent the corresponding DP-SGD operators {ϕt}\{\phi_{t}\} and {ϕt}\{\phi_{t}^{\prime}\} are nonexpansive or how {st}\{s_{t}\} should be obtained. Furthermore, the first inequality of (6) no longer holds, and additional technical issues arise when analyzing the case of multiple dataset passes.

Our approach. We generalize the above argument and combine it with additional analyses of weakly-convex functions and proximal operators to relax several strong assumptions. A sketch of our approach is given below, and formal arguments can be found in subsequent sections.

General operator analysis. In Lemma A.2, we study properties of operators ϕ\phi and ϕ\phi^{\prime} satisfying

supuϕ(u)ϕ(u)s,Φ(x)Φ(y)Lxy+ζ,Φ{ϕ,ϕ}.\sup_{u}\|\phi^{\prime}(u)-\phi(u)\|\leq s,\quad\|\Phi(x)-\Phi(y)\|\leq L\|x-y\|+\zeta,\quad\forall\Phi\in\{\phi,\phi^{\prime}\}. (8)

Note that if Φ\Phi is Hölder continuous, then it can be shown [22] that it satisfies the second inequality in (8); more specifically, if Φ\Phi is η\eta-Hölder continuous with modulus HH, then (8) holds with L=Hρ/2L=H\rho/2 and ζ=Hη([1η]/ρ)(1η)/η\zeta=H\eta([1-\eta]/\rho)^{(1-\eta)/\eta} for any ρ>0\rho>0. Using these properties, we then establish in Proposition A.5 that if {Yt}\{Y_{t}\} and {Yt}\{Y_{t}^{\prime}\} are generated by a specific sequence of randomized proximal operators using ϕ\phi and ϕ\phi^{\prime}, respectively, then (roughly)

Dα(YTYT)Dα(Y0Y0)ασ2(T),D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}(Y_{0}\|Y_{0}^{\prime})\preceq\frac{\alpha}{\sigma^{2}}\cdot{\cal{F}}(T),

where :{\cal F}:\mathbb{R}\mapsto\mathbb{R} is non-decreasing and dependent on some assumptions on {Yt}\{Y_{t}\} and {Yt}\{Y_{t}^{\prime}\}. More specifically, we obtain a sequence of parameterized shifted Rényi divergence bounds similar to (7), while dealing with the challenge of nonconvexity. In the setting of one dataset pass, we derive the bound by solving a related quadratic programming problem on a similar set of residuals {at}\{a_{t}\} as in (6) (see Appendix B for details).

Lipschitz properties of the DP-SGD update. Denoting 𝒜λ(){\cal A}_{\lambda}(\cdot) as the DP-SGD update function (or any SGD-like update as in (12)), i.e., Xt=𝒜λ(Xt1)X_{t}={\cal A}_{\lambda}(X_{t-1}) for every tt in Algorithm 1, we show in Proposition 2.1 that, depending on our assumptions on hh and the stepsize λ\lambda, the operator 𝒜λ(){\cal A}_{\lambda}(\cdot) satisfies the second inequality of (8) with Φ=𝒜λ\Phi={\cal A}_{\lambda} for different values of LL and ζ\zeta.

More specifically, when the domain of hh is bounded, we have (L,ζ)=(1,2λC)(L,\zeta)=(1,2\lambda C) in (8) for clipping norm CC. On the other hand, when λ\lambda is sufficiently small, we have ζ=0\zeta=0 and LL being a constant that (i) tends to 2\sqrt{2} when the weak convexity parameter mm tends to zero, i.e., ff becomes more convex; and (ii) tends to one when no clipping is performed and mm in (i) tends to zero. This continuity with respect to the weak convexity parameter appears to be new, and it is proved in Appendix A.2 by using topological properties about weakly-convex functions and proximal operators.

Privacy bounds for DP-SGD. For neighboring DP-SGD iterates XTX_{T} and XTX_{T}^{\prime} , we combine the above results in Theorems 2.2 and 2.3 to obtain RDP bounds of the form

Dα(XTXT)ασ2λ(C,b,T,),D_{\alpha}(X_{T}\|X_{T}^{\prime})\preceq\frac{\alpha}{\sigma^{2}}\cdot{\cal{B}}_{\lambda}(C,b,T,\ell), (9)

where CC, bb, TT, \ell are as in Algorithm 1. More specifically, assuming that the DP-SGD iterates are contained within an 2\ell_{2} ball of diameter dhd_{h} and each fi\nabla f_{i} is Lipschitz continuous, we obtain (9) with λ(C,b,T,)=(Lλdh+λC/b)2{\cal B}_{\lambda}(C,b,T,\ell)=(L_{\lambda}d_{h}+\lambda C/b)^{2} for some Lλ=1+κλmL_{\lambda}=\sqrt{1+\kappa\lambda m}, κ4\kappa\leq 4, and small enough λ\lambda. When the iterates are (possibly) unbounded, we obtain (9) with

Bλ(C,b,T,)=(1+[T][Lλ2i=1Lλ2i])(λCb)2.B_{\lambda}(C,b,T,\ell)=\left(1+\left[\frac{T}{\ell}\right]\left[\frac{L^{2\ell}_{\lambda}}{\sum_{i=1}^{\ell}L^{2i}_{\lambda}}\right]\right)\left(\frac{\lambda C}{b}\right)^{2}. (10)
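To see how (10) behaves, the sketch below (function names are ours) evaluates the factor θ_L(ℓ) from Table 2 and the unbounded-iterate bound; as L_λ ↓ 1, θ_{L_λ}(ℓ) ↓ 1/ℓ and (10) approaches (1 + E/ℓ)(λC/b)², where E = T/ℓ is the number of dataset passes:

```python
def theta(L, ell):
    """theta_L(ell) = L^(2(ell-1)) / sum_{i=0}^{ell-1} L^(2i) (cf. Table 2)."""
    return L ** (2 * (ell - 1)) / sum(L ** (2 * i) for i in range(ell))

def bound_unbounded(lam, C, b, T, ell, L):
    """The unbounded-iterate bound B_lambda(C, b, T, ell) in (10). Note that
    L^(2*ell) / sum_{i=1}^{ell} L^(2i) equals theta_L(ell) after cancelling L^2."""
    ratio = L ** (2 * ell) / sum(L ** (2 * i) for i in range(1, ell + 1))
    return (1 + (T / ell) * ratio) * (lam * C / b) ** 2
```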

1.3 Related work

We first present high-level descriptions of related works in the convex and nonconvex settings, followed by more general works that use advanced composition to obtain loose bounds on the last iterate. We then conclude with some summary tables and figures, and a discussion of technical nuances that carefully compares our work to existing literature.

Convex setting. Given the challenge of proving tight bounds in the general setting, a number of prior analyses focus on the convex case. The works [17, 16] additionally assume Lipschitz continuity of the loss function to obtain their results. The work of [12] studies multiple passes over the data, but their results only apply to the smooth, strongly convex, full-batch setting without clipping. The work of [28] improves these results and extends them to mini-batches, both with “shuffle and partition” and “sampling without replacement” strategies. Similarly, the results of [2], and its extension [3], consider only convex, Lipschitz, smooth functions. The contemporary work of [8] introduces the shifted interpolated process under ff-DP, allowing for tight characterizations of privacy amplification by iteration for Rényi and other generalized DP definitions.

Notice that none of the above works study clipping; moreover, all assume access to the Lipschitz constant of the loss function and require convexity, limiting their practical viability.

Nonconvex setting. We now discuss papers that do not require convexity of the loss function. The authors of [4] analyze the privacy loss dynamics for nonconvex functions, but their analysis differs from ours in two ways. First, they assume that their DP-SGD batches are obtained by Poisson sampling or sampling without replacement. Second, their results require numerically solving a minimization problem, which can be hard in practice.

A contemporary work by [11] (appearing after the first version of this preprint) derives bounds under the assumption that the loss gradient is Hölder continuous and, when the loss function is also convex, that the loss function is Lipschitz continuous. However, this work needs the additional assumption that constants L>0L>0 and η(0,1]\eta\in(0,1] satisfying f(x)f(y)Lxyη\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|^{\eta} are known and used in a specific optimization subproblem. While [11] focuses on tight theoretical bounds under specific conditions (such as the full-batch and single-epoch setting), we prioritize bridging the gap between theory and practice by addressing the complexities of real-world deployments.

Non-specialized analyses. Prior works on DP-SGD accounting often rely on loose bounds that allow for the release of intermediate updates [20, 6, 1]. These works rely on advanced composition results for differential privacy [19], resulting in noise whose standard deviation scales as T\sqrt{T} [6, 1]. Alternatively, using disjoint batches decreases the dependence from the number of iterations TT to the number of epochs EE (see the first row in Table 2). However, the assumption that all intermediate updates are released can be stringent for certain loss regimes [17, 16], where this bound can be contracted based on the loss smoothness and convexity parameters if only the last iterate is released.

While our work focuses on the privacy guarantees of DP-SGD, it is important to acknowledge the parallel research efforts exploring the convergence properties of shuffling methods. Recent studies, such as [23] and [9], have established convergence bounds for strongly convex and/or smooth functions in various settings, albeit without considering privacy. These works provide valuable insights into the optimization behavior of shuffling techniques, which can inform future research at the intersection of privacy and optimization.

Summary tables and graphs. Table 1 lists and labels some assumptions that are commonly used in the RDP literature. Table 2 compares our bounds against other RDP bounds. Note that the multi-epoch noisy-SGD algorithm in [17, Algorithm 1] only considers the case where the number of dataset passes equals the number of batches in a dataset pass and does not consider batched gradients. As a consequence of the latter, its corresponding RDP upper bound does not depend on the batch size bb. Figure 1 compares our bounds against other bounds as a function of the number of iterations performed, under various settings. Note that we do not consider the multi-epoch bound in [11, Theorem A.1], as it requires solving TT nonlinear programming problems, and we do not compare with the bound in [25], because it exploits the fact that batches are randomly sampled (whereas our bounds assume the input batches are obtained deterministically).

Label Description
c{\cal I}_{c} The regularizer hh is the indicator of a closed convex set
{\cal H} The domain of a regularizer hh, domh\operatorname*{dom}h, is bounded with diameter dhd_{h}
1{\cal B}_{1} Input data batches are of size one
𝒟{\cal D} Input data batches are disjoint
𝒫{\cal P} Input data batches are obtained using Poisson subsampling with sampling rate 1/1/\ell
𝒮Q0{\cal S}_{Q}^{0} fif_{i} is QQ-Lipschitz continuous for every i[k]i\in[k], and QQ is known
𝒮L1{\cal S}_{L}^{1} fi\nabla f_{i} is LL-Lipschitz continuous for every i[k]i\in[k], and LL is known
H,η1{\cal R}_{H,\eta}^{1} fi\nabla f_{i} is (H,η)(H,\eta)-Hölder continuous for every i[k]i\in[k], and HH and η\eta are known
Λ\Lambda the stepsize needs to be small relative to certain smoothness constants
𝒜{\cal A} the given RDP bounds only hold for a small range of values of α\alpha and σ\sigma
𝒩{\cal N} no gradient clipping is applied or the global sensitivity is known
𝒩~\tilde{{\cal N}} when ϕ\phi is convex, no gradient clipping is applied or the global sensitivity is known
Table 1: List of common assumptions used in RDP accounting literature. For conciseness we let \ell denote the number of iterations/batches in a dataset pass and domh\operatorname*{dom}h denote the domain of hh.
Source Asymptotic (α,ϵ)(\alpha,\epsilon)-RDP upper bounds Assumptions
Convex ϕ\phi Nonconvex ϕ\phi (see Table 1)
[14, Theorem 3.1] (the bound follows directly from Theorem 8 in [5]) αE2[λCσb]2\dfrac{\alpha E}{2}\left[\dfrac{\lambda C}{\sigma b}\right]^{2} same as convex 𝒟{\cal D}
[25, Section 3.3] numerical procedure (obtained by evaluating an integral using numerical quadrature techniques) same as convex 𝒫{\color[rgb]{1,0,0}{\cal P}}
[25, Theorem 11] αE[λCbσ]2\dfrac{\alpha E}{\ell}\left[\dfrac{\lambda C}{b\sigma}\right]^{2} same as convex 𝒜,𝒫{\color[rgb]{1,0,0}{\cal A}},{\color[rgb]{1,0,0}{\cal P}}
[17, Theorem 35] αE[λQσ]2\dfrac{\alpha E}{\ell}\left[\dfrac{\lambda Q}{\sigma}\right]^{2} none c,𝒩,Λ,SQ0,SL1,1{\cal I}_{c},{\cal N},\Lambda,S_{Q}^{0},S_{L}^{1},{\color[rgb]{1,0,0}{\cal B}_{1}}
[2, Theorem 3.1] α[λQσ]2min{E,dhQλk}\alpha\left[\dfrac{\lambda Q}{\sigma}\right]^{2}\min\left\{\dfrac{E}{\ell},\dfrac{d_{h}}{Q\lambda k}\right\} none c,𝒩,Λ,SQ0,SL1,,𝒜{\cal I}_{c},{\cal N},\Lambda,S_{Q}^{0},S_{L}^{1},{\cal H},{\color[rgb]{1,0,0}{\cal A}}
[11, Theorem A.1] numerical procedure (obtained by solving TT nonlinear programming problems) numerical procedure (same procedure as in the convex case, but with different parameters) c,𝒩~,Λ,SQ0,H,η1,{\cal I}_{c},\tilde{{\cal N}},\Lambda,{S}_{Q}^{0},{\cal R}_{H,\eta}^{1},{\cal H}
Ours ασ2[Lλdh+λCb]2\dfrac{\alpha}{\sigma^{2}}\left[L_{\lambda}d_{h}+\dfrac{\lambda C}{b}\right]^{2} same as convex Λ,SL1,,𝒟\Lambda,S_{L}^{1},{\cal H},{\cal D}
αE[λCbσ]2\dfrac{\alpha E}{\ell}\left[\dfrac{\lambda C}{b\sigma}\right]^{2} αEθLλ()[λCσ]2\alpha E\theta_{L_{\lambda}}(\ell)\left[\dfrac{\lambda C}{\sigma}\right]^{2} Λ,SL1,𝒩,𝒟\Lambda,S_{L}^{1},{\cal N},{\cal D}
αEθ2()[λCσ]2\alpha E\theta_{\sqrt{2}}(\ell)\left[\dfrac{\lambda C}{\sigma}\right]^{2} αEθ2Lλ()[λCσ]2\alpha E\theta_{\sqrt{2}L_{\lambda}}(\ell)\left[\dfrac{\lambda C}{\sigma}\right]^{2} Λ,SL1,𝒟\Lambda,S_{L}^{1},{\cal D}
Table 2: Asymptotic ε\varepsilon upper bounds for (α,ε)(\alpha,\varepsilon)-Rényi differential privacy of the last iterate after TT iterations of DP-SGD. Here, λ\lambda is the stepsize, CC is the clipping norm, σ\sigma is the standard deviation of the Gaussian noise, \ell is the number of iterations in one dataset pass, and EE is the number of dataset passes. Also, Lλ:=1+λκmL_{\lambda}:=\sqrt{1+\lambda\kappa m} for some κ4\kappa\leq 4 and weak-convexity parameter mm (see (11)), and θL():=L2(1)/i=01L2i\theta_{L}(\ell):=L^{2(\ell-1)}/\sum_{i=0}^{\ell-1}L^{2i}. Particularly strong assumptions are highlighted in red.
Figure 1: Log-scale last-iterate RDP bounds for Algorithm 1 as a function of the number of iterations. The fixed algorithmic parameters are λ=105\lambda=10^{-5}, C=10C=10, σ=105\sigma=10^{-5}, T=105T=10^{5}, b=101b=10^{1}, and k=104k=10^{4}. The free parameters mm, MM, and QQ are the weak convexity, Lipschitz-smoothness, and Lipschitz constants of ff, respectively, while CC is the clipping norm. The left plot considers weakly-convex losses under the setting of no gradient clipping. The right plot considers the convex case where QQ may differ from CC. The RDP composition bounds are from [20], the PABI bounds are from [17], and the local DP bounds are from [14].

Technical nuances. The bound in (9) with λ{\cal B}_{\lambda} as in (10) and Lλ=1L_{\lambda}=1 might appear to follow from subsampled RDP composition results such as [25]. However, those results only apply to DP-SGD variants where the batches are sampled randomly, an assumption that does not hold when batches are cyclically traversed. While established Python libraries like Opacus [29] and TensorFlow Privacy [24] implement and account for randomly sampled batches (such as those obtained by Poisson subsampling), these implementations address a different issue: one has to ensure the optimizer truly samples at random from a pre-specified distribution, which becomes incredibly difficult with large-scale datasets (see Appendix D for an extended discussion). Consequently, any privacy guarantee predicated on this idealized random-sampling assumption becomes effectively meaningless when the actual optimization process deviates from it (as is the case with cyclical batch traversal). It is worth mentioning that DP-FTRL [18] was specifically developed to address this gap, and the method accepts a potentially weaker DP guarantee in exchange for practical applicability.

For the special case where (i) only one dataset pass is performed, (ii) the objective function is nonconvex, and (iii) each fi\nabla f_{i} is Lipschitz continuous, the bound in (9) with λ{\cal B}_{\lambda} as in (10) holds with T=T=\ell. Consequently, we obtain an RDP bound that does not grow with the number of iterations, in contrast to the RDP composition bounds in [20], which scale linearly in TT or EE, depending on the sampling assumption. Note that (16) and Theorem 2.3 show that as θLλ()1\theta_{L_{\lambda}}(\cdot)\downarrow 1, our bounds converge to an expression that depends linearly on E/E/\ell, matching the bound for PABI. Figure 1 illustrates the regimes under which we improve on previous work.

The addition of the composite term $h(\cdot)$ in (1) is not a trivial extension and greatly complicates the analysis. For example, the objective function $\phi$ in (1) is no longer differentiable in general (even on $\operatorname*{dom}\phi$), and more general analyses must be used to handle this nonsmoothness. Under stepsize $\lambda$ and $L$-Lipschitz continuous $\nabla f_{i}$ for $i\in[k]$, it is shown in [11] that the DP-SGD update in the nonconvex case is $(1+\lambda L)$-Lipschitz continuous, independently of the weak-convexity constant. Consequently, the RDP bounds established in [11] may overestimate the true RDP bound in the setting where the weak-convexity constant $m$ is positive but near zero (i.e., the function is nearly convex).

For the convex case, it is worth emphasizing that we do not require each $f_{i}$ to be Lipschitz continuous in order to bound $\nabla f_{i}$ (see, for example, [17, 2, 16, 11], which do require this assumption). As a consequence, our analysis is applicable to a substantially wider class of objective functions. Moreover, all other existing bounds in the literature of the form in (9) replace the parameter $C$ with a Lipschitz constant $Q$ of $\phi(x)$ from (1), and these bounds are generally proportional to $Q^{2}$. Consequently, when $C\ll Q$, e.g., when $\phi(x)$ is a quadratic function on a large compact domain, our bounds are significantly tighter (see Figure 1 for an illustration).

Organization. The remainder of the paper gives a formal presentation of the results, including the key assumptions on (1), the topological properties of the DP-SGD update operator, and the non-asymptotic RDP bounds on the last DP-SGD iterates.

2 Privacy bounds for DP-SGD

This section formally presents the main RDP bounds for the last iterates of Algorithm 1. For conciseness, the lengthy proofs of the main results are given in Appendix A.

We start by precisely giving the assumptions on (1). Given h:n(,]h:\mathbb{R}^{n}\mapsto(-\infty,\infty] and fi:domhf_{i}:\operatorname*{dom}h\mapsto\mathbb{R} for i[k]i\in[k], consider the following assumptions:

  • (A1)

    hConv¯(n)h\in\overline{{\rm Conv}}\,(\mathbb{R}^{n});

  • (A2)

there exist $m,M\geq 0$ such that for every $i\in[k]$ the function $f_{i}$ is differentiable on an open set containing $\operatorname*{dom}h$ and

    m2xy2fi(x)fi(y)fi(y),xyM2xy2x,ydomh.-\frac{m}{2}\|x-y\|^{2}\leq f_{i}(x)-f_{i}(y)-\left\langle\nabla f_{i}(y),x-y\right\rangle\leq\frac{M}{2}\|x-y\|^{2}\quad\forall x,y\in\operatorname*{dom}h. (11)

We now give a few remarks about (A1)–(A2). First, (A1) is necessary to ensure that $\operatorname{prox}_{h}(\cdot)$ is well-defined. Second, it can be shown (see, for example, [7, 26] and [21, Proposition 2.1.55]) that (A2) is equivalent to the assumption that $\nabla f_{i}$ is $M$-Lipschitz continuous when $m=M$. Third, the lower bound in (11) is equivalent to assuming that $f_{i}(\cdot)+m\|\cdot\|^{2}/2$ is convex and, hence, if $m=0$ then $f_{i}$ is convex. The parameter $m$ is often called a weak-convexity parameter of $f_{i}$. Fourth, using symmetry arguments and the third remark, if $M=0$ then $f_{i}$ is concave. Finally, the third remark motivates why we choose two parameters, $m$ and $M$, in (11). Specifically, we use $m$ (resp. $M$) to develop results that can be described in terms of the level of convexity (resp. concavity) of the problem.
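As a concrete illustration of (A2) (the example and code are ours, not from the paper), the one-dimensional function $f_{i}(x)=\cos(x)$ satisfies (11) with $m=M=1$, since its second derivative ranges over $[-1,1]$. The sketch below checks the two-sided bound on a grid:

```python
import math

def sandwich_ok(f, df, m, M, xs, tol=1e-9):
    """Check the two-sided bound (11) for all pairs (x, y) drawn from xs."""
    for x in xs:
        for y in xs:
            gap = f(x) - f(y) - df(y) * (x - y)
            lo = -0.5 * m * (x - y) ** 2 - tol
            hi = 0.5 * M * (x - y) ** 2 + tol
            if not (lo <= gap <= hi):
                return False
    return True

xs = [i / 10 for i in range(-40, 41)]  # grid on [-4, 4]
# cos is 1-weakly convex and 1-smooth, so (11) holds with m = M = 1 ...
assert sandwich_ok(math.cos, lambda t: -math.sin(t), m=1.0, M=1.0, xs=xs)
# ... but it fails if we pretend that cos were convex (m = 0).
assert not sandwich_ok(math.cos, lambda t: -math.sin(t), m=0.0, M=1.0, xs=xs)
```

The failure in the second assertion is exactly why a weak-convexity parameter $m>0$ is needed for nonconvex losses.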

We now develop some properties of an SGD-like update. Given $\{q_{i}\}\subseteq\overline{{\rm Conv}}\,(\mathbb{R}^{n})$ with $\operatorname*{dom}q_{i}\subseteq\operatorname*{dom}h$ and $B\subseteq[k]$, define the prox-linear operator

𝒜λ(x)=𝒜λ(x,{fi},{qi}):=xλ|B|iBproxqi(fi(x)).{\cal A}_{\lambda}(x)={\cal A}_{\lambda}(x,\{f_{i}\},\{q_{i}\}):=x-\frac{\lambda}{|B|}\sum_{i\in B}\operatorname{prox}_{q_{i}}(\nabla f_{i}(x)). (12)

Clearly, when $\operatorname{prox}_{q_{i}}(y)=y$ for every $y\in\mathbb{R}^{n}$, the above update corresponds to an SGD step applied to the problem of minimizing $\sum_{i=1}^{k}f_{i}(z)$ (with respect to $z$) under the stepsize $\lambda$ and starting point $x$. Moreover, while it is straightforward to show that ${\cal A}_{\lambda}(\cdot)$ is $(1+\lambda\max\{m,M\})$-Lipschitz continuous when $\{f_{i}\}$ satisfies (A2) (see, for example, [11, Appendix A.6]), we prove in Proposition 2.1(b) that the Lipschitz constant can be improved to $\sqrt{1+\kappa\lambda m}$ for some $\kappa\leq 4$ when $\lambda$ is small. Notice that the former constant does not converge to one as $m\to 0$, i.e., as the $f_{i}$ become convex, while the latter one does.
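For intuition (the code and names below are ours), one natural instance of (12) takes each $q_{i}$ to be the indicator of the Euclidean ball of radius $C$: the prox of the indicator of a closed convex set is the Euclidean projection onto that set, so $\operatorname{prox}_{q_{i}}$ becomes exactly the norm clipping used in DP-SGD. A minimal sketch:

```python
import math

def clip(v, C):
    """Euclidean projection onto the radius-C ball, i.e., gradient clipping."""
    n = math.sqrt(sum(vi * vi for vi in v))
    return list(v) if n <= C else [C * vi / n for vi in v]

def prox_linear_update(x, grads, lam, C):
    """The update (12): x - (lam/|B|) * sum over the batch of clipped gradients."""
    avg = [0.0] * len(x)
    for g in grads:
        cg = clip(g, C)
        avg = [a + c / len(grads) for a, c in zip(avg, cg)]
    return [xi - lam * ai for xi, ai in zip(x, avg)]

# A gradient of norm 5 is projected down to norm 1 when C = 1.
g = clip([3.0, 4.0], C=1.0)
assert abs(math.hypot(*g) - 1.0) < 1e-12
# With a large C nothing is clipped, so this is a plain averaged SGD step.
x1 = prox_linear_update([0.0, 0.0], [[3.0, 4.0]], lam=0.1, C=10.0)
assert all(abs(a - b) < 1e-12 for a, b in zip(x1, [-0.3, -0.4]))
```

Here `grads` plays the role of the batch gradients $\{\nabla f_{i}(x)\}_{i\in B}$; the paper's setting applies the prox to $\nabla f_{i}(x)$ directly.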

We now present some important properties of ${\cal A}_{\lambda}(\cdot)$.

Proposition 2.1.

Let (m,M)(m,M) be as in assumption (A2), and define

Lλ=Lλ(m,M):=1+2λm[1+m2(M+m)]λ>0.L_{\lambda}=L_{\lambda}(m,M):=\sqrt{1+2\lambda m\left[1+\frac{m}{2(M+m)}\right]}\quad\forall\lambda>0. (13)

Then, the following statements hold:

  • (a)

    if domqi\operatorname*{dom}q_{i} is bounded with diameter CC for i[k]i\in[k], then for any λ>0\lambda>0 we have

    𝒜λ(x)𝒜λ(y)xy+2λCx,ydomh;\|{\cal A}_{\lambda}(x)-{\cal A}_{\lambda}(y)\|\leq\|x-y\|+2\lambda C\quad\forall x,y\in\operatorname*{dom}h; (14)
  • (b)

    if {fi}\{f_{i}\} satisfies (A2) and λ1/[2(M+m)]\lambda\leq 1/[2(M+m)] then 𝒜λ(){\cal A}_{\lambda}(\cdot) is 2Lλ\sqrt{2}L_{\lambda}-Lipschitz continuous;

  • (c)

    if {fi}\{f_{i}\} satisfies (A2), λ1/(M+m)\lambda\leq 1/(M+m), and fi(x)=proxqi(fi(x))\nabla f_{i}(x)=\operatorname{prox}_{q_{i}}(\nabla f_{i}(x)) for every i[k]i\in[k] and xdomhx\in\operatorname*{dom}h, then 𝒜λ(){\cal A}_{\lambda}(\cdot) is LλL_{\lambda}-Lipschitz continuous on domh\operatorname*{dom}h.

Some remarks are in order. First, $L_{\lambda}(0,M)=1$ and, hence, ${\cal A}_{\lambda}(\cdot)$ is nonexpansive when $f_{i}$ is convex for every $i\in[k]$, $\lambda\leq 1/M$, and $\nabla f_{i}(x)=\operatorname{prox}_{q_{i}}(\nabla f_{i}(x))$. Second, if $\lambda=1/(2m)$ then $L_{\lambda}(m,0)=\sqrt{5/2}$ and, hence, ${\cal A}_{\lambda}(\cdot)$ is $\sqrt{5}$-Lipschitz continuous when $f_{i}$ is concave. Third, similar to the first remark, $L_{0}(m,M)=1$, and hence ${\cal A}_{\lambda}(\cdot)$ becomes nonexpansive in the limit $\lambda\downarrow 0$. Finally, when $m=M$ and $\lambda=1/(2m)$, we have $L_{\lambda}(m,M)=3/2$ and, hence, ${\cal A}_{\lambda}(\cdot)$ is $(3/2)$-Lipschitz continuous in the setting of part (c).
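To make Proposition 2.1(c) concrete (this numerical check is ours and covers only a scalar special case, not a proof), consider $f(x)=ax^{2}/2$ with curvature $a\in[-m,M]$. Without clipping, the update (12) with a single summand maps $x\mapsto(1-\lambda a)x$, so its Lipschitz constant is exactly $|1-\lambda a|$, which should not exceed $L_{\lambda}$ whenever $\lambda\leq 1/(M+m)$:

```python
import math

def L_lam(lam, m, M):
    """The constant L_lambda from (13)."""
    return math.sqrt(1.0 + 2.0 * lam * m * (1.0 + m / (2.0 * (M + m))))

m, M = 0.5, 2.0
lam = 1.0 / (M + m)  # the largest stepsize allowed in Proposition 2.1(c)
L = L_lam(lam, m, M)
# For f(x) = a*x^2/2, the unclipped update x -> x - lam*a*x scales by |1 - lam*a|.
for k in range(101):
    a = -m + (M + m) * k / 100.0
    assert abs(1.0 - lam * a) <= L + 1e-12
# The extreme curvature a = -m attains L_lambda exactly when lam = 1/(M + m).
assert abs((1.0 + lam * m) - L) < 1e-9
```

The last assertion shows that, at least for scalar quadratics, the constant in (13) cannot be improved at the boundary stepsize.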

For the remainder of this section, suppose $h$ satisfies (A1) and let $f_{i}^{\prime}:\operatorname*{dom}h\mapsto\mathbb{R}$ for $i\in[k]$ be such that there exists $i^{*}\in[k]$ where $f_{i}^{\prime}=f_{i}$ for every $i\neq i^{*}$ and $f_{i^{*}}^{\prime}\neq f_{i^{*}}$, i.e., $\{f_{i}\}\sim\{f_{i}^{\prime}\}$. That is, $i^{*}$ is the index at which the neighboring datasets $\{f_{i}\}$ and $\{f_{i}^{\prime}\}$ differ.

We also make use of the following assumption.

  • (A3)

    The functions {fi}\{f_{i}^{\prime}\} satisfy assumption (A2) with fi=fif_{i}=f_{i}^{\prime} for every i[k]i\in[k].

We now present the main RDP bounds in terms of the following constants:

dh:=sup{xy:x,ydomh},θL(0):=0,θL(s):=L2(s1)j=0s1L2js1.d_{h}:=\sup\{\|x-y\|:x,y\in\operatorname*{dom}h\},\quad\theta_{L}(0):=0,\quad\theta_{L}(s):=\frac{L^{2(s-1)}}{\sum_{j=0}^{s-1}L^{2j}}\quad\forall s\geq 1. (15)

We first present a bound where domh\operatorname*{dom}h is bounded with diameter dh<d_{h}<\infty.

Theorem 2.2.

Let {Xt}\{X_{t}\} and {Xt}\{X_{t}^{\prime}\} denote the iterates generated by Algorithm 1 with per-example loss functions {fi}\{f_{i}\} and {fi}\{f_{i}^{\prime}\}, respectively, and fixed λ\lambda, CC, bb, σ\sigma, {Nt}\{N_{t}\}, TT, and X0X_{0} for both sequences of iterates. If λ1/[2(m+M)]\lambda\leq 1/[2(m+M)] and (A1)–(A3) hold, then

Dα(XTXT)\displaystyle D_{\alpha}(X_{T}\|X_{T}^{\prime}) α2σ2(Lλdh+2λCb)2,\displaystyle\leq\frac{\alpha}{2\sigma^{2}}\left(L_{\lambda}d_{h}+\frac{2\lambda C}{b}\right)^{2}, (16)

where LλL_{\lambda} and dhd_{h} are as in (13) and (15), respectively.

We now present the RDP bounds for when domh\operatorname*{dom}h is (possibly) unbounded.

Theorem 2.3.

Let {Xt}\{X_{t}\}, {Xt}\{X_{t}^{\prime}\}, λ\lambda, σ\sigma, bb, CC, and TT be as in Theorem 2.2, and denote :=k/b\ell:=k/b and E:=T/E:=\lfloor T/\ell\rfloor. If λ1/[2(m+M)]\lambda\leq 1/[2(m+M)] and (A1)–(A3) hold, then

Dα(XTXT)4α(λCbσ)2[1+Eθ2Lλ()],D_{\alpha}(X_{T}\|X_{T}^{\prime})\leq 4\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[1+E\cdot\theta_{\sqrt{2}L_{\lambda}}(\ell)\right], (17)

where LλL_{\lambda} and θL()\theta_{L}(\cdot) are as in (13) and (15), respectively. On the other hand, if

maxi[k],t[T]{fi(Xt),fi(Xt)}C,\max_{i\in[k],t\in[T]}\{\|\nabla f_{i}(X_{t})\|,\|\nabla f_{i}(X_{t}^{\prime})\|\}\leq C, (18)

i.e., no gradient clipping is performed, and λ1/(M+m)\lambda\leq{1}/{(M+m)} then

Dα(XTXT)4α(λCbσ)2[1+EθLλ()].D_{\alpha}(X_{T}\|X_{T}^{\prime})\leq 4\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[1+E\cdot\theta_{L_{\lambda}}(\ell)\right]. (19)

We conclude with a few remarks about the above bounds. First, the bound in (19) is of the same order of magnitude as the bound in [17] in terms of $T$ and $\ell$ when $L_{\lambda}=1$. However, the right-hand side of (19) scales with a $\lambda^{2}$ term, which does not appear in [17]. Second, since $\theta_{L_{\lambda}}(\cdot)\leq 1$, the right-hand sides of (17) and (19) increase (at most) linearly with respect to the number of dataset passes $E$. Third, substituting $\sigma=\Theta(C/[b\sqrt{\varepsilon}])$ in (17) yields a bound that depends linearly on $\varepsilon$ and is invariant to changes in $C$ and $b$. In Appendix C, we discuss further choices of the parameters in (19) and their properties.
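The following sketch (function names ours) evaluates the right-hand side of (19) and checks the two limiting behaviors discussed above, with $\theta_{L}(\cdot)$ as in (15):

```python
def theta(L, s):
    """theta_L(s) from (15)."""
    if s == 0:
        return 0.0
    return L ** (2 * (s - 1)) / sum(L ** (2 * j) for j in range(s))

def rdp_bound(alpha, lam, C, b, sigma, E, ell, L):
    """Right-hand side of the no-clipping bound (19)."""
    return 4.0 * alpha * (lam * C / (b * sigma)) ** 2 * (1.0 + E * theta(L, ell))

# theta_L(s) <= 1, so (19) grows at most linearly in the number of passes E.
assert all(theta(1.3, s) <= 1.0 + 1e-12 for s in range(1, 60))
# In the convex limit L_lambda = 1, theta_1(ell) = 1/ell, so (19) reduces to
# 4*alpha*(lam*C/(b*sigma))**2 * (1 + E/ell).
ell, E = 100, 5
bound = rdp_bound(alpha=2.0, lam=1e-2, C=1.0, b=10, sigma=1.0, E=E, ell=ell, L=1.0)
expected = 4.0 * 2.0 * (1e-2 / 10) ** 2 * (1.0 + E * (1.0 / ell))
assert abs(bound - expected) < 1e-12
```

The inequality $\theta_{L}(s)\leq 1$ holds because the numerator $L^{2(s-1)}$ is one of the summands in the denominator.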

References

  • [1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, 2016.
  • [2] Jason Altschuler and Kunal Talwar. Privacy of noisy stochastic gradient descent: More iterations without more privacy loss. Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • [3] Jason M. Altschuler, Jinho Bok, and Kunal Talwar. On the privacy of noisy stochastic gradient descent for convex optimization. SIAM Journal on Computing, 2024.
  • [4] Shahab Asoodeh and Mario Díaz. Privacy loss of noisy stochastic gradient descent might converge even for non-convex losses. arXiv preprint arXiv:2305.09903, 2023.
  • [5] Borja Balle and Yu-Xiang Wang. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning (ICML). PMLR, 2018.
  • [6] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Symposium on foundations of computer science, 2014.
  • [7] Amir Beck. First-order methods in optimization. SIAM, 2017.
  • [8] Jinho Bok, Weijie Su, and Jason M Altschuler. Shifted interpolation for differential privacy. arXiv preprint arXiv:2403.00278, 2024.
  • [9] Xufeng Cai and Jelena Diakonikolas. Last iterate convergence of incremental methods and applications in continual learning. arXiv preprint arXiv:2403.06873, 2024.
  • [10] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 2011.
  • [11] Eli Chien and Pan Li. Convergent privacy loss of noisy-SGD without convexity and smoothness. arXiv preprint arXiv:2410.01068, 2024.
  • [12] Rishav Chourasia, Jiayuan Ye, and Reza Shokri. Differential privacy dynamics of Langevin diffusion and noisy gradient descent. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • [13] Lynn Chua, Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, and Chiyuan Zhang. How private are DP-SGD implementations? In International Conference on Machine Learning (ICML), 2024.
  • [14] Lynn Chua, Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, and Chiyuan Zhang. Scalable DP-SGD: Shuffling vs. poisson subsampling. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
  • [15] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 2014.
  • [16] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In ACM SIGACT Symposium on Theory of Computing (STOC), 2020.
  • [17] Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. In Annual Symposium on Foundations of Computer Science (FOCS), 2018.
  • [18] Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and private (deep) learning without sampling or shuffling. In International Conference on Machine Learning, pages 5213–5225. PMLR, 2021.
  • [19] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International Conference on Machine Learning (ICML), 2015.
  • [20] Georgios Kaissis, Moritz Knolle, Friederike Jungmann, Alexander Ziller, Dmitrii Usynin, and Daniel Rueckert. A unified interpretation of the gaussian mechanism for differential privacy through the sensitivity index. arXiv preprint arXiv:2109.10528, 2021.
  • [21] Weiwei Kong. Accelerated inexact first-order methods for solving nonconvex composite optimization problems. arXiv preprint arXiv:2104.09685, 2021.
  • [22] Jiaming Liang and Renato D.C. Monteiro. A unified analysis of a class of proximal bundle methods for solving hybrid convex composite optimization problems. Mathematics of Operations Research, 49(2):832–855, 2024.
  • [23] Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. In International Conference on Machine Learning (ICML), 2024.
  • [24] Google LLC. Tensorflow privacy. https://github.com/tensorflow/privacy, 2019.
  • [25] Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the sampled Gaussian mechanism. arXiv preprint arXiv:1908.10530, 2019.
  • [26] Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018.
  • [27] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In IEEE Global Conference on Signal and Information Processing. IEEE, 2013.
  • [28] Jiayuan Ye and Reza Shokri. Differentially private learning needs hidden state (or much faster convergence). Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • [29] Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, Graham Cormode, and Ilya Mironov. Opacus: User-friendly differential privacy library in PyTorch. arXiv preprint arXiv:2109.12298, 2021.

Appendix A Derivation of main results

This appendix derives the main results, namely, Theorems 2.2 and 2.3. It contains three subappendices. The first one derives important properties of a family of randomized operators, the second specializes these results to the DP-SGD update operator in (12), and the last one gives the proofs of Theorems 2.2 and 2.3 using the previous two subappendices.

A.1 General operator analysis

This subappendix gives some crucial properties about randomized proximal Lipschitz operators, which consist of evaluating a Lipschitz proximal operator followed by adding Gaussian noise. More specifically, it establishes several RDP bounds based on the closeness of neighboring operators.

We first bound the shifted Rényi divergence of a randomized proximal Lipschitz operator. The proof of this result is a straightforward extension of the argument in [17, Theorem 22] from $1$-Lipschitz operators to $L$-Lipschitz operators with additive residuals.

To begin, we present two calculus rules for the shifted Rényi divergence given in (5). In particular, the proof of the second rule is a minor modification of the proof given for [17, Lemma 21].

Lemma A.1.

For random variables {X,X,Z}\{X^{\prime},X,Z\} and a,s0a,s\geq 0 and α(1,)\alpha\in(1,\infty), it holds that

  • (a)

    Dα(τ)(X+ZX+Z)Dα(τ+a)(XX)+supcn{Dα([Z+c]Z):ca}D_{\alpha}^{(\tau)}(X+Z\|X^{\prime}+Z)\leq D_{\alpha}^{(\tau+a)}(X\|X^{\prime})+\sup_{c\in\mathbb{R}^{n}}\{D_{\alpha}([Z+c]\|Z):\|c\|\leq a\};

  • (b)

    for some L,ζ>0L,\zeta>0, if ϕ\phi^{\prime} and ϕ\phi satisfy

    supuϕ(u)ϕ(u)s,Φ(x)Φ(y)Lxy+ζ,Φ{ϕ,ϕ},\sup_{u}\|\phi^{\prime}(u)-\phi(u)\|\leq s,\quad\|\Phi(x)-\Phi(y)\|\leq L\|x-y\|+\zeta,\quad\forall\Phi\in\{\phi,\phi^{\prime}\},

    for any x,ydomϕdomϕx,y\in\operatorname*{dom}\phi\cap\operatorname*{dom}\phi^{\prime}, then

    Dα(Lτ+ζ+s)(ϕ(X)ϕ(X))Dα(τ)(XX).D_{\alpha}^{(L\tau+\zeta+s)}(\phi(X)\|\phi^{\prime}(X^{\prime}))\leq D_{\alpha}^{(\tau)}(X\|X^{\prime}).
Proof.

(a) See [17, Lemma 20].

(b) By the definitions of $D_{\alpha}^{(\tau)}(\mu\|\nu)$ and ${\cal W}_{\infty}(\mu,\nu)$, there exists a joint distribution $(X,Y)$ such that $D_{\alpha}(Y\|X^{\prime})=D_{\alpha}^{(\tau)}(X\|X^{\prime})$ and $\|X-Y\|\leq\tau$ almost surely. Now, the post-processing property of the Rényi divergence implies that

D_{\alpha}(\phi^{\prime}(Y)\|\phi^{\prime}(X^{\prime}))\leq D_{\alpha}(Y\|X^{\prime})=D_{\alpha}^{(\tau)}(X\|X^{\prime}).

Using our assumptions on ϕ\phi and ϕ\phi^{\prime} and the triangle inequality, we then have

ϕ(X)ϕ(Y)\displaystyle\|\phi(X)-\phi^{\prime}(Y)\| ϕ(X)ϕ(Y)+ϕ(Y)ϕ(Y)\displaystyle\leq\|\phi(X)-\phi(Y)\|+\|\phi(Y)-\phi^{\prime}(Y)\|
LXY+ζ+sLτ+ζ+s,\displaystyle\leq L\|X-Y\|+\zeta+s\leq L\tau+\zeta+s,

almost surely. Combining the previous two inequalities yields the desired bound in view of the definitions of $D_{\alpha}^{(\tau)}(\mu\|\nu)$ and ${\cal W}_{\infty}(\mu,\nu)$. ∎

The next result is the aforementioned shifted RDP bound.

Lemma A.2.

For some $L,\zeta,s\geq 0$, suppose $\phi^{\prime}$ and $\phi$ satisfy (8) for any $x,y\in\operatorname*{dom}\phi\cap\operatorname*{dom}\phi^{\prime}$. Moreover, let $Z\sim{\cal N}(0,\sigma^{2}I)$ and $\psi\in\overline{{\rm Conv}}\,(\mathbb{R}^{n})$. Then, for any scalars $a,\tau\geq 0$ and $\alpha\in(1,\infty)$ satisfying $L\tau+\zeta+s-a\geq 0$, and random variables $Y$ and $Y^{\prime}$, it holds that

Dα(Lτ+ζ+sa)(proxψ(ϕ(Y)+Z)proxψ(ϕ(Y)+Z))Dα(τ)(YY)+αa22σ2.D_{\alpha}^{(L\tau+\zeta+s-a)}(\operatorname{prox}_{\psi}(\phi(Y)+Z)\|\operatorname{prox}_{\psi}(\phi^{\prime}(Y^{\prime})+Z))\leq D_{\alpha}^{(\tau)}(Y\|Y^{\prime})+\frac{\alpha a^{2}}{2\sigma^{2}}. (20)
Proof.

We first have that

\sup_{c\in\mathbb{R}^{n}}\{D_{\alpha}([Z+c]\|Z):\|c\|\leq a\}=\sup_{c\in\mathbb{R}^{n}}\left\{\frac{\alpha\|c\|^{2}}{2\sigma^{2}}:\|c\|\leq a\right\}=\frac{\alpha a^{2}}{2\sigma^{2}}, (21)

from the well-known properties of the Rényi divergence. Using (21), Lemma A.1(a) with $(X,X^{\prime})=(\phi(Y),\phi^{\prime}(Y^{\prime}))$, and Lemma A.1(b) with $(\phi,\phi^{\prime},L,\zeta,s)\in\{(\phi,\phi^{\prime},L,\zeta,s),({\rm prox}_{\psi},{\rm prox}_{\psi},1,0,0)\}$, we have

D_{\alpha}^{(L\tau+\zeta+s-a)}({\rm prox}_{\psi}(\phi(Y)+Z)\|{\rm prox}_{\psi}(\phi^{\prime}(Y^{\prime})+Z))\leq D_{\alpha}^{(L\tau+\zeta+s-a)}(\phi(Y)+Z\|\phi^{\prime}(Y^{\prime})+Z)
\leq D_{\alpha}^{(L\tau+\zeta+s)}(\phi(Y)\|\phi^{\prime}(Y^{\prime}))+\frac{\alpha a^{2}}{2\sigma^{2}}\leq D_{\alpha}^{(\tau)}(Y\|Y^{\prime})+\frac{\alpha a^{2}}{2\sigma^{2}}.

Note that the second inequality in (8) is equivalent to $\Phi$ being $L$-Lipschitz continuous when $\zeta=0$, and that the conditions in (8) need only hold on $\operatorname*{dom}\phi\cap\operatorname*{dom}\phi^{\prime}$.
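Equation (21) rests on the closed form $D_{\alpha}({\cal N}(c,\sigma^{2})\|{\cal N}(0,\sigma^{2}))=\alpha c^{2}/(2\sigma^{2})$. The one-dimensional numerical check below (ours, pure Python) confirms it by directly integrating $\frac{1}{\alpha-1}\log\int p^{\alpha}q^{1-\alpha}$ with a Riemann sum:

```python
import math

def renyi_gauss_numeric(alpha, c, sigma, lo=-30.0, hi=30.0, n=120001):
    """D_alpha(N(c, sigma^2) || N(0, sigma^2)) via a Riemann sum."""
    h = (hi - lo) / (n - 1)
    Z = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(n):
        x = lo + i * h
        p = Z * math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))
        q = Z * math.exp(-(x ** 2) / (2.0 * sigma ** 2))
        total += (p ** alpha) * (q ** (1.0 - alpha)) * h
    return math.log(total) / (alpha - 1.0)

alpha, c, sigma = 3.0, 0.7, 1.2
closed_form = alpha * c ** 2 / (2.0 * sigma ** 2)
assert abs(renyi_gauss_numeric(alpha, c, sigma) - closed_form) < 1e-6
```

The supremum over $\|c\|\leq a$ in (21) is then attained at $\|c\|=a$ since the closed form is increasing in $\|c\|$.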

We next apply (20) to a sequence of points generated by the update

Yproxψ(ϕ(Y)+Z)Y\xleftarrow{}\operatorname{prox}_{\psi}(\phi(Y)+Z) (22)

under different assumptions on ζ\zeta and τ\tau and a single dataset pass. Before proceeding, we first present a technical lemma.

Lemma A.3.

Given scalars L>1L>1 and positive integer T1T\geq 1, let

ST:=i=0T1L2i,bt:=(LTtST)LT1,Rt:=Lt1i=1tbiLti,t0S_{T}:=\sum_{i=0}^{T-1}L^{2i},\quad b_{t}:=\left(\frac{L^{T-t}}{S_{T}}\right)L^{T-1},\quad R_{t}:=L^{t-1}-\sum_{i=1}^{t}b_{i}L^{t-i},\quad t\geq 0 (23)

Then, for every t[T]t\in[T],

  • (a)

    Rt+1=LRtbt+1R_{t+1}=LR_{t}-b_{t+1};

  • (b)

    Rt0R_{t}\geq 0 and RT=0R_{T}=0;

  • (c)

    t=1Tbt2=θL(T)\sum_{t=1}^{T}b_{t}^{2}=\theta_{L}(T).

Proof.

Let t[T]t\in[T] be fixed.

(a) This is immediate from the definition of RtR_{t}.

(b) We have that

S_{T}R_{t}=S_{T}\left(L^{t-1}-\sum_{i=1}^{t}b_{i}L^{t-i}\right)=\sum_{i=0}^{T-1}L^{2i+t-1}-\sum_{i=1}^{t}L^{2T+t-2i-1}=L^{t-1}\left[\sum_{i=0}^{T-1}L^{2i}-\sum_{i=1}^{t}L^{2(T-i)}\right]
=Lt1i=0T1tL2i0.\displaystyle=L^{t-1}\sum_{i=0}^{T-1-t}L^{2i}\geq 0.

Evaluating the above expression at t=Tt=T clearly gives RT=0R_{T}=0.

(c) The case of T=0T=0 is immediate. For the case of T1T\geq 1, we use the definitions of btb_{t} and STS_{T} to obtain

t=1Tbt2\displaystyle\sum_{t=1}^{T}b_{t}^{2} =L2(T1)i=0T1L2i(i=0T1L2i)2=L2(T1)ST=θL(T).\displaystyle=\frac{L^{2(T-1)}\sum_{i=0}^{T-1}L^{2i}}{\left(\sum_{i=0}^{T-1}L^{2i}\right)^{2}}=\frac{L^{2(T-1)}}{S_{T}}=\theta_{L}(T).

We now present the shifted RDP properties of the update in (22). This particular result generalizes the one in [17], which only considers the case of L=1L=1 and ζ=0\zeta=0.

Lemma A.4.

Let L,ζ0L,\zeta\geq 0, T1T\geq 1, and [T]\ell\in[T] be fixed. Given ψConv¯(n)\psi\in\overline{{\rm Conv}}\,(\mathbb{R}^{n}), suppose {ϕt}t=1T\{\phi_{t}\}_{t=1}^{T}, {ϕt}t=1T\{\phi_{t}^{\prime}\}_{t=1}^{T}, and s¯>0\bar{s}>0 satisfy (8) with

ϕ=ϕt,ϕ=ϕt,s={s¯,t=1mod,0,otherwise,t[T].\phi=\phi_{t},\quad\phi^{\prime}=\phi_{t}^{\prime},\quad s=\begin{cases}\bar{s},&t=1\bmod\ell,\\ 0,&\text{otherwise},\end{cases}\quad\forall t\in[T].

Moreover, given Y0,Y0domψY_{0},Y_{0}^{\prime}\in\operatorname*{dom}\psi, let Zt𝒩(0,σ2I)Z_{t}\sim{\cal N}(0,\sigma^{2}I), and define the random variables

Yt:=proxψ(ϕt(Yt1)+Zt),Yt:=proxψ(ϕt(Yt1)+Zt),t1.\begin{gathered}Y_{t}:=\operatorname{prox}_{\psi}(\phi_{t}(Y_{t-1})+Z_{t}),\quad Y_{t}^{\prime}:=\operatorname{prox}_{\psi}(\phi_{t}^{\prime}(Y_{t-1}^{\prime})+Z_{t}),\quad\forall t\geq 1.\end{gathered}

If T=T=\ell, then the following statements hold:

  • (a)

    if ζ=0\zeta=0, then

    Dα(YTYT)Dα(τ)(Y0Y0)α2(Lτ+s¯σ)2θL(T);D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}^{(\tau)}(Y_{0}\|Y_{0}^{\prime})\leq\frac{\alpha}{2}\left(\frac{L\tau+\bar{s}}{\sigma}\right)^{2}\theta_{L}(T); (24)
  • (b)

    if τ=0\tau=0, L=1L=1, and ζ=s¯\zeta=\bar{s}, then

    Dα(YTYT)Dα(Y0Y0)2αT(ζσ)2D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}(Y_{0}\|Y_{0}^{\prime})\leq 2\alpha T\left(\frac{\zeta}{\sigma}\right)^{2} (25)
Proof.

(a) Let s=s¯s=\bar{s}. Our goal is to recursively apply (20) with suitable choices of the free parameter aa at each application. Specifically, let {(bt,Rt,ST)}\{(b_{t},R_{t},S_{T})\} be as in (23), and define

at:=(Lτ+s)btt1.a_{t}:=(L\tau+s)b_{t}\quad\forall t\geq 1.

Using Lemma A.3(a)–(b), we first have Lτ+sa1=(Lτ+s)R10L\tau+s-a_{1}=(L\tau+s)R_{1}\geq 0 and, hence, by Lemma A.2, we have

Dα([Lτ+s]R1)(Y1Y1)=Dα(Lτ+sa1)(Y1Y1)Dα(τ)(Y0Y0)+αa122σ2.D_{\alpha}^{([L\tau+s]R_{1})}(Y_{1}\|Y_{1}^{\prime})=D_{\alpha}^{(L\tau+{s}-a_{1})}(Y_{1}\|Y_{1}^{\prime})\leq D_{\alpha}^{(\tau)}(Y_{0}\|Y_{0}^{\prime})+\frac{\alpha a_{1}^{2}}{2\sigma^{2}}.

Since Lemma A.3(a)–(b) also implies $R_{t}\geq 0$ and we have $s_{t}=0$ for $t\geq 2$, we repeatedly apply Lemma A.2 with $(a,s)=(a_{t},0)$ for $t\geq 2$ to obtain

Dα(τ)(Y0Y0)\displaystyle D_{\alpha}^{(\tau)}(Y_{0}\|Y_{0}^{\prime}) Dα([Lτ+s]R1)(Y1Y1)αa122σ2Dα([Lτ+s]LR1a2)(Y2Y2)α(a12+a22)2σ2\displaystyle\geq D_{\alpha}^{([L\tau+s]R_{1})}(Y_{1}\|Y_{1}^{\prime})-\frac{\alpha a_{1}^{2}}{2\sigma^{2}}\geq D_{\alpha}^{([L\tau+s]LR_{1}-a_{2})}(Y_{2}\|Y_{2}^{\prime})-\frac{\alpha(a_{1}^{2}+a_{2}^{2})}{2\sigma^{2}}
=Dα([Lτ+s]R2)(Y2Y2)α(a12+a22)2σ2\displaystyle=D_{\alpha}^{([L\tau+s]R_{2})}(Y_{2}\|Y_{2}^{\prime})-\frac{\alpha(a_{1}^{2}+a_{2}^{2})}{2\sigma^{2}}\geq\cdots
Dα([Lτ+s]RT)(YTYT)αi=1Tai22σ2=Dα(0)(YTYT)αi=1Tai22σ2\displaystyle\geq D_{\alpha}^{([L\tau+s]R_{T})}(Y_{T}\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}=D_{\alpha}^{(0)}(Y_{T}\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}
=Dα(YTYT)αi=1Tai22σ2.\displaystyle=D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}.

It now remains to bound αi=1Tai2/(2σ2)\alpha\sum_{i=1}^{T}a_{i}^{2}/(2\sigma^{2}). Using Lemma A.3(c) and the fact that T=T=\ell and s¯=s\bar{s}=s, we have

=\frac{\alpha}{2\sigma^{2}}\left[(L\tau+s)^{2}\sum_{i=1}^{T}b_{i}^{2}\right]=\frac{\alpha}{2}\left(\frac{L\tau+\bar{s}}{\sigma}\right)^{2}\theta_{L}(T).

Combining this bound with the previous one yields the desired conclusion.

(b) Let s=s¯s=\bar{s}. Similar to (a), our goal is to recursively apply (20) with suitable choices of the free parameter aa at each application. For this setting, let a1=ζ+sa_{1}=\zeta+s and at=ζa_{t}=\zeta for t2t\geq 2. Using the fact that τ=0\tau=0 and L=1L=1 and Lemma A.2, we first have that

Dα(Y1Y1)=Dα(0)(Y1Y1)=Dα(s+ζa1)(Y1Y1)Dα(0)(Y0Y0)+αa122σ2.D_{\alpha}(Y_{1}\|Y_{1}^{\prime})=D_{\alpha}^{(0)}(Y_{1}\|Y_{1}^{\prime})=D_{\alpha}^{(s+\zeta-a_{1})}(Y_{1}\|Y_{1}^{\prime})\leq D_{\alpha}^{(0)}(Y_{0}\|Y_{0}^{\prime})+\frac{\alpha a_{1}^{2}}{2\sigma^{2}}.

We then repeatedly apply Lemma A.2 with (a,τ)=(at,0)(a,\tau)=(a_{t},0) for t2t\geq 2 to obtain

Dα(0)(Y0Y0)\displaystyle D_{\alpha}^{(0)}(Y_{0}\|Y_{0}^{\prime}) Dα(0)(Y1Y1)αa122σ2Dα(ζa2)(Y2Y2)α(a12+a22)2σ2\displaystyle\geq D_{\alpha}^{(0)}(Y_{1}\|Y_{1}^{\prime})-\frac{\alpha a_{1}^{2}}{2\sigma^{2}}\geq D_{\alpha}^{(\zeta-a_{2})}(Y_{2}\|Y_{2}^{\prime})-\frac{\alpha(a_{1}^{2}+a_{2}^{2})}{2\sigma^{2}}
=Dα(0)(Y2Y2)α(a12+a22)2σ2\displaystyle=D_{\alpha}^{(0)}(Y_{2}\|Y_{2}^{\prime})-\frac{\alpha(a_{1}^{2}+a_{2}^{2})}{2\sigma^{2}}\geq\cdots
Dα(ζaT)(YTYT)αi=1Tai22σ2=Dα(0)(YTYT)αi=1Tai22σ2\displaystyle\geq D_{\alpha}^{(\zeta-a_{T})}(Y_{T}\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}=D_{\alpha}^{(0)}(Y_{T}\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}
=Dα(YTYT)αi=1Tai22σ2.\displaystyle=D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}.

It now remains to bound αi=1Tai2/(2σ2)\alpha\sum_{i=1}^{T}a_{i}^{2}/(2\sigma^{2}). Using the definition of {at}\{a_{t}\} and the fact that ζ=s\zeta=s, it holds that

αi=1Tai22σ2α2σ2[4ζ2+(T1)ζ2]2αT(ζσ)2.\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}\leq\frac{\alpha}{2\sigma^{2}}\left[4\zeta^{2}+(T-1)\zeta^{2}\right]\leq 2\alpha T\left(\frac{\zeta}{\sigma}\right)^{2}.

Combining this bound with the previous one yields the desired conclusion. ∎

We next extend the above result to multiple dataset passes.

Proposition A.5.

Let LL, τ\tau, ζ\zeta, s¯\bar{s}, {Yt}\{Y_{t}\}, {Yt}\{Y_{t}^{\prime}\}, \ell, and TT be as in Lemma A.4. Moreover, let θL()\theta_{L}(\cdot) be as in (15). For any τ0\tau\geq 0 and E=T/E=\lfloor T/\ell\rfloor, the following statements hold:

  • (a)

    if ζ=0\zeta=0, then

    Dα(YTYT)Dα(τ)(Y0Y0)\displaystyle D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}^{(\tau)}(Y_{0}\|Y_{0}^{\prime})
    α2σ2[(Lτ+s¯)2θL()+s¯2{(E1)θL()+θL(TE)}].\displaystyle\leq\frac{\alpha}{2\sigma^{2}}\left[(L\tau+\bar{s})^{2}\theta_{L}(\ell)+\bar{s}^{2}\left\{(E-1)\theta_{L}(\ell)+\theta_{L}(T-E\ell)\right\}\right]. (26)
  • (b)

    if τ=0\tau=0 and ζ=s¯\zeta=\bar{s}, then

    Dα(YTYT)Dα(Y0Y0)2αT(ζσ)2.\displaystyle D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}(Y_{0}\|Y_{0}^{\prime})\leq 2\alpha T\left(\frac{\zeta}{\sigma}\right)^{2}. (27)
Proof.

(a) Let s=s¯s=\bar{s}. For convenience, define

{\cal B}_{1}(\tau,T):=\frac{\alpha}{2}\left(\frac{L\tau+s}{\sigma}\right)^{2}\theta_{L}(T),\quad{\cal B}_{2}:=\frac{\alpha}{2\sigma^{2}}\left[(L\tau+s)^{2}\theta_{L}(\ell)+s^{2}\left\{(E-1)\theta_{L}(\ell)+\theta_{L}(T-E\ell)\right\}\right].

Using Lemma A.4(a), we have that for the first \ell iterates,

Dα(YY)Dα(τ)(Y0Y0)1(τ,).D_{\alpha}(Y_{\ell}\|Y_{\ell}^{\prime})-D_{\alpha}^{(\tau)}(Y_{0}\|Y_{0}^{\prime})\leq{\cal B}_{1}(\tau,\ell).

Similarly, using Lemma A.4(a) with $\tau=0$, we have that

D_{\alpha}(Y_{(k+1)\ell}\|Y_{(k+1)\ell}^{\prime})-D_{\alpha}^{(0)}(Y_{k\ell}\|Y_{k\ell}^{\prime})\leq{\cal B}_{1}(0,\ell),

for any $1\leq k\leq E-1$. Finally, using Lemma A.4(a) with $T-E\ell$ in place of $T$ and with $\tau=0$, we have that

Dα(YTYT)Dα(0)(YEYE)1(0,TE).D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}^{(0)}(Y_{E\ell}\|Y_{E\ell}^{\prime})\leq{\cal B}_{1}(0,T-E\ell).

Summing the above inequalities, using the fact that $D_{\alpha}^{(0)}(X\|Y)=D_{\alpha}(X\|Y)$, and using the definition of ${\cal B}_{2}$, we conclude that

D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}^{(\tau)}(Y_{0}\|Y_{0}^{\prime})\leq{\cal B}_{1}(\tau,\ell)+(E-1){\cal B}_{1}(0,\ell)+{\cal B}_{1}(0,T-E\ell)={\cal B}_{2}.

(b) The proof follows similarly to (a). Repeatedly using Lemma A.4(b) at increments of \ell iterations up to iteration EE\ell, we have that

Dα(YEYE)\displaystyle D_{\alpha}(Y_{E\ell}\|Y_{E\ell}^{\prime}) Dα(Y(E1)Y(E1))+2α(ζσ)2Dα(Y(E2)Y(E2))+4α(ζσ)2\displaystyle\leq D_{\alpha}(Y_{(E-1)\ell}\|Y_{(E-1)\ell}^{\prime})+2\alpha\ell\left(\frac{\zeta}{\sigma}\right)^{2}\leq D_{\alpha}(Y_{(E-2)\ell}\|Y_{(E-2)\ell}^{\prime})+4\alpha\ell\left(\frac{\zeta}{\sigma}\right)^{2}\leq\cdots
Dα(Y0Y0)+2αE(ζσ)2.\displaystyle\leq D_{\alpha}(Y_{0}\|Y_{0}^{\prime})+2\alpha E\ell\left(\frac{\zeta}{\sigma}\right)^{2}.

For the last TET-E\ell iterations, we use Lemma A.4(b) with T=TET=T-E\ell and the above bound to obtain

Dα(YTYT)\displaystyle D_{\alpha}(Y_{T}\|Y_{T}^{\prime}) Dα(YEYE)+2α[TE](ζσ)2Dα(Y0Y0)+2αT(ζσ)2.\displaystyle\leq D_{\alpha}(Y_{E\ell}\|Y_{E\ell}^{\prime})+2\alpha[T-E\ell]\left(\frac{\zeta}{\sigma}\right)^{2}\leq D_{\alpha}(Y_{0}\|Y_{0}^{\prime})+2\alpha T\left(\frac{\zeta}{\sigma}\right)^{2}.

Some remarks about Proposition A.5 are in order. First, part (a) shows that if ϕt\phi_{t} and ϕt\phi_{t}^{\prime} only differ at t=1t=1, then Dα(YTYT)D_{\alpha}(Y_{T}\|Y_{T}^{\prime}) is finite for any TT. Second, part (a) also shows that if ϕt\phi_{t} and ϕt\phi_{t}^{\prime} differ cyclically for a cycle length of \ell, then the divergence between YTY_{T} and YTY_{T}^{\prime} grows linearly with the number of cycles EE. Third, part (b) gives a bound that is independent of LL. Finally, both of the bounds in parts (a) and (b) can be viewed as Rényi divergences between the Gaussian random variables 𝒩(0,σ2I){\cal N}(0,\sigma^{2}I) and 𝒩(μ,σ2I){\cal N}(\mu,\sigma^{2}I) for different values of μ\mu.

In Appendix B, we give a detailed discussion of how the residuals $a$ from Lemma A.2 are chosen to prove Proposition A.5(a). In particular, we prove that the chosen residuals yield the tightest RDP bound that can be achieved by repeatedly applying (20).

A.2 SGD operator analysis

This subappendix derives some important properties about the DP-SGD update operator 𝒜λ()\cal{A}_{\lambda}(\cdot) in (12) and also contains the proof of Proposition 2.1.

To start, we recall the following well-known bound from convex analysis. Its proof can be found, for example, in [7, Theorem 5.8(iv)].

Lemma A.6.

Let F:domhF:\operatorname*{dom}h\mapsto\mathbb{R} be convex and differentiable. Then FF satisfies

F(x)F(y)F(y),xyL2xy2x,ydomhF(x)-F(y)-\left\langle\nabla F(y),x-y\right\rangle\leq\frac{L}{2}\|x-y\|^{2}\quad\forall x,y\in\operatorname*{dom}h (28)

if and only if

F(x)F(y),xy1LF(x)F(y)2x,ydomh.\left\langle\nabla F(x)-\nabla F(y),x-y\right\rangle\geq\frac{1}{L}\|\nabla F(x)-\nabla F(y)\|^{2}\quad\forall x,y\in\operatorname*{dom}h.

We next give a technical bound on fif_{i}, which generalizes the co-coercivity of convex functions to weakly-convex functions.

Lemma A.7.

For any x,ydomhx,y\in\operatorname*{dom}h and fif_{i} satisfying (A2), it holds that

fi(x)fi(y),xym[1+m2(M+m)]xy2+12(M+m)fi(x)fi(y)2.\displaystyle\left\langle\nabla f_{i}(x)-\nabla f_{i}(y),x-y\right\rangle\geq-m\left[1+\frac{m}{2(M+m)}\right]\|x-y\|^{2}+\frac{1}{2(M+m)}\|\nabla f_{i}(x)-\nabla f_{i}(y)\|^{2}.
Proof.

Define F=fi+m2/2F=f_{i}+m\|\cdot\|^{2}/2 and let x,ydomhx,y\in\operatorname*{dom}h be fixed. Moreover, note that FF is convex and satisfies (28) with L=M+mL=M+m. It then follows from Lemma A.6 with L=M+mL=M+m that

1M+mF(x)F(y)2\displaystyle\frac{1}{M+m}\|\nabla F(x)-\nabla F(y)\|^{2} =1M+mfi(x)fi(y)+m(xy)2\displaystyle=\frac{1}{M+m}\|\nabla f_{i}(x)-\nabla f_{i}(y)+m(x-y)\|^{2}
F(x)F(y),xy\displaystyle\leq\left\langle\nabla F(x)-\nabla F(y),x-y\right\rangle
=fi(x)fi(y),xy+mxy2.\displaystyle=\left\langle\nabla f_{i}(x)-\nabla f_{i}(y),x-y\right\rangle+m\|x-y\|^{2}.

Applying the bound a+b2/2a2+b2\|a+b\|^{2}/2\leq\|a\|^{2}+\|b\|^{2} with a=fi(x)fi(y)+m(xy)a=\nabla f_{i}(x)-\nabla f_{i}(y)+m(x-y) and b=m(xy)b=-m(x-y), the above inequality then implies

fi(x)fi(y),xy\displaystyle\left\langle\nabla f_{i}(x)-\nabla f_{i}(y),x-y\right\rangle mxy2+1M+mfi(x)fi(y)+m(xy)2\displaystyle\geq-m\|x-y\|^{2}+\frac{1}{M+m}\|\nabla f_{i}(x)-\nabla f_{i}(y)+m(x-y)\|^{2}
[m+m22(M+m)]xy2+12(M+m)fi(x)fi(y)2.\displaystyle\geq-\left[m+\frac{m^{2}}{2(M+m)}\right]\|x-y\|^{2}+\frac{1}{2(M+m)}\|\nabla f_{i}(x)-\nabla f_{i}(y)\|^{2}.
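As a quick numerical sanity check of Lemma A.7 (illustrative only, not part of the formal argument), one may take f(x) = cos(x), for which assumption (A2) holds with M = m = 1 since cos(x) + x²/2 is convex, and verify the generalized co-coercivity inequality at randomly sampled points:

```python
import numpy as np

# A 1-weakly-convex test function with 1-Lipschitz gradient: f(x) = cos(x),
# so that (A2) holds with M = m = 1.
M, m = 1.0, 1.0
grad_f = lambda x: -np.sin(x)

def cocoercivity_gap(x, y):
    # Left-hand side minus right-hand side of the bound in Lemma A.7;
    # the lemma asserts this gap is nonnegative.
    lhs = (grad_f(x) - grad_f(y)) * (x - y)
    rhs = (-m * (1 + m / (2 * (M + m))) * (x - y) ** 2
           + (grad_f(x) - grad_f(y)) ** 2 / (2 * (M + m)))
    return lhs - rhs

rng = np.random.default_rng(0)
xs, ys = rng.uniform(-10, 10, size=(2, 100000))
assert cocoercivity_gap(xs, ys).min() >= -1e-9
```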

The result below gives some technical bounds on changes in the proximal operator.

Lemma A.8.

Given u,vnu,v\in\mathbb{R}^{n}, let ψConv¯(n)\psi\in\overline{{\rm Conv}}\,(\mathbb{R}^{n}) and define

Δ:=uv,Δp:=proxψ(u)proxψ(v).\Delta:=u-v,\quad\Delta^{p}:=\operatorname{prox}_{\psi}(u)-\operatorname{prox}_{\psi}(v).

Then, the following statements hold:

  • (a)

    Δp2Δ,Δp\|\Delta^{p}\|^{2}\leq\left\langle\Delta,\Delta^{p}\right\rangle;

  • (b)

    ΔpΔ2Δ2Δp2\|\Delta^{p}-\Delta\|^{2}\leq\|\Delta\|^{2}-\|\Delta^{p}\|^{2}.

Proof.

(a) See [7, Theorem 6.42(a)].

(b) Using part (a), we have that

ΔpΔ2\displaystyle\|\Delta^{p}-\Delta\|^{2} =Δp2+Δ22Δ,ΔpΔ2Δp2.\displaystyle=\|\Delta^{p}\|^{2}+\|\Delta\|^{2}-2\left\langle\Delta,\Delta^{p}\right\rangle\leq\|\Delta\|^{2}-\|\Delta^{p}\|^{2}.
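Both parts of Lemma A.8 can be spot-checked numerically; the sketch below (an illustration, not part of the formal development) uses the soft-thresholding operator, i.e., the prox of a scaled absolute value, as a stand-in for a generic ψ ∈ Conv̄(ℝⁿ):

```python
import numpy as np

def prox_abs(v, tau=1.0):
    # Prox of psi = tau * |.| (soft-thresholding), applied entrywise;
    # any proper closed convex psi would serve equally well here.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def check_prox_bounds(u, v):
    # Returns (gap_a, gap_b): the slack in parts (a) and (b) of Lemma A.8;
    # both should be nonnegative.
    delta = u - v
    delta_p = prox_abs(u) - prox_abs(v)
    gap_a = delta @ delta_p - delta_p @ delta_p
    gap_b = (delta @ delta - delta_p @ delta_p) - (delta_p - delta) @ (delta_p - delta)
    return gap_a, gap_b

rng = np.random.default_rng(0)
for _ in range(1000):
    u, v = rng.normal(scale=3.0, size=(2, 5))
    ga, gb = check_prox_bounds(u, v)
    assert ga >= -1e-12 and gb >= -1e-12
```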

We now develop some technical properties of 𝒜λ(){\cal A}_{\lambda}(\cdot). The first result presents a bound involving the following quantities for x,ydomhx,y\in\operatorname*{dom}h and i[k]i\in[k].

d:=xy,Δi:=fi(x)fi(y),Δip:=proxqi(fi(x))proxqi(fi(y)).\begin{gathered}d:=x-y,\quad\Delta_{i}:=\nabla f_{i}(x)-\nabla f_{i}(y),\quad\Delta_{i}^{p}:=\operatorname{prox}_{q_{i}}(\nabla f_{i}(x))-\operatorname{prox}_{q_{i}}(\nabla f_{i}(y)).\end{gathered} (29)
Lemma A.9.

Let x,ydomhx,y\in\operatorname*{dom}h and i[k]i\in[k] be fixed, let dd, Δi\Delta_{i}, and Δip\Delta_{i}^{p} be as in (29) for some {fi}\{f_{i}\}. Moreover, let LλL_{\lambda} be as in (13), and suppose fif_{i} satisfies assumption (A2). If Δip=Δi\Delta_{i}^{p}=\Delta_{i}, then for any λ1/(M+m)\lambda\leq 1/(M+m) we have

dλΔipLλd.\|d-\lambda\Delta_{i}^{p}\|\leq L_{\lambda}\|d\|. (30)

On the other hand, if ΔipΔi\Delta_{i}^{p}\neq\Delta_{i}, then for any λ1/[2(M+m)]\lambda\leq 1/[2(M+m)] we have

dλΔip2Lλd.\|d-\lambda\Delta_{i}^{p}\|\leq\sqrt{2}L_{\lambda}\|d\|. (31)
Proof.

Before proceeding, we first establish a technical inequality. Using Lemma A.7, it holds that for any μ>0\mu>0,

μΔi22d,Δi\displaystyle\mu\left\|\Delta_{i}\right\|^{2}-2\left\langle d,\Delta_{i}\right\rangle μΔi2+2m[1+m2(M+m)]d2Δi2M+m\displaystyle\leq\mu\|\Delta_{i}\|^{2}+2m\left[1+\frac{m}{2(M+m)}\right]\|d\|^{2}-\frac{\|\Delta_{i}\|^{2}}{M+m}
=2m[1+m2(M+m)]d2+(μ1M+m)Δi2.\displaystyle=2m\left[1+\frac{m}{2(M+m)}\right]\|d\|^{2}+\left({\mu-\frac{1}{M+m}}\right)\|\Delta_{i}\|^{2}. (32)

We now prove (30). Supposing that Δip=Δi\Delta_{i}^{p}=\Delta_{i}, we have

dλΔip2=dλΔi2=d2+λ[λΔi22d,Δi].\|d-\lambda\Delta_{i}^{p}\|^{2}=\|d-\lambda\Delta_{i}\|^{2}=\|d\|^{2}+\lambda\left[\lambda\|\Delta_{i}\|^{2}-2\left\langle d,\Delta_{i}\right\rangle\right].

Using (32) with μ=λ\mu=\lambda, the above identity, and the definition of LλL_{\lambda}, it holds that for any λ1/(M+m)\lambda\leq 1/(M+m), we have

dλΔip2\displaystyle\|d-\lambda\Delta_{i}^{p}\|^{2} (1+2λm[1+m2(M+m)])d2+λ(λ1M+m)Δi2\displaystyle\leq\left(1+2\lambda m\left[1+\frac{m}{2(M+m)}\right]\right)\|d\|^{2}+\lambda\left({\lambda-\frac{1}{M+m}}\right)\|\Delta_{i}\|^{2}
=Lλ2d2+λ(λ1M+m)Δi2Lλ2d2.\displaystyle=L_{\lambda}^{2}\|d\|^{2}+\lambda\left({\lambda-\frac{1}{M+m}}\right)\|\Delta_{i}\|^{2}\leq L_{\lambda}^{2}\|d\|^{2}.

We now prove (31). Using Lemma A.8(b) with (Δ,Δp)=(Δi,Δip)(\Delta,\Delta^{p})=(\Delta_{i},\Delta_{i}^{p}) and the inequality a+b22a2+2b2\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2} for a,bna,b\in\mathbb{R}^{n}, it holds that

dλΔip2\displaystyle\left\|d-\lambda\Delta_{i}^{p}\right\|^{2} =dλΔi+λ(ΔiΔip)22dλΔi2+2λ2ΔiΔip2\displaystyle=\left\|d-\lambda\Delta_{i}+\lambda(\Delta_{i}-\Delta_{i}^{p})\right\|^{2}\leq 2\|d-\lambda\Delta_{i}\|^{2}+2\lambda^{2}\|\Delta_{i}-\Delta_{i}^{p}\|^{2}
Lem.A.8(b)2dλΔi2+2λ2Δi22λ2Δip2\displaystyle\overset{Lem.\leavevmode\nobreak\ \ref{lem:prox}(b)}{\leq}2\|d-\lambda\Delta_{i}\|^{2}+2\lambda^{2}\|\Delta_{i}\|^{2}-2\lambda^{2}\|\Delta_{i}^{p}\|^{2}
2dλΔi2+2λ2Δi2\displaystyle\leq 2\|d-\lambda\Delta_{i}\|^{2}+2\lambda^{2}\|\Delta_{i}\|^{2}
=2(d2+λ[2λΔi22d,Δi]).\displaystyle=2\left(\|d\|^{2}+\lambda\left[2\lambda\left\|\Delta_{i}\right\|^{2}-2\left\langle d,\Delta_{i}\right\rangle\right]\right).

Using (32) with μ=2λ\mu=2\lambda, the above inequality, and the definition of LλL_{\lambda}, it holds that for any λ1/[2(M+m)]\lambda\leq 1/[2(M+m)], we have

dλΔip2\displaystyle\|d-\lambda\Delta_{i}^{p}\|^{2} 2(1+2λm[1+m2(M+m)])d2+2λ(2λ1M+m)Δi2\displaystyle\leq 2\left(1+2\lambda m\left[1+\frac{m}{2(M+m)}\right]\right)\|d\|^{2}+2\lambda\left({2\lambda-\frac{1}{M+m}}\right)\|\Delta_{i}\|^{2}
=2Lλ2d2+2λ(2λ1M+m)Δi22Lλ2d2.\displaystyle=2L_{\lambda}^{2}\|d\|^{2}+2\lambda\left({2\lambda-\frac{1}{M+m}}\right)\|\Delta_{i}\|^{2}\leq 2L_{\lambda}^{2}\|d\|^{2}.
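The contraction bound (31) can likewise be spot-checked numerically. The sketch below (illustrative only) instantiates Lemma A.9 in one dimension with f(x) = cos(x), so that M = m = 1, and with qᵢ the indicator of [−C, C], whose prox is scalar clipping:

```python
import numpy as np

# One-dimensional instance of Lemma A.9: f(x) = cos(x) gives M = m = 1,
# and the prox of the indicator of [-C, C] is clipping to that interval.
M, m, C = 1.0, 1.0, 0.3
lam = 1.0 / (2 * (M + m))                       # lam <= 1/[2(M + m)]
L_lam = np.sqrt(1 + 2 * lam * m * (1 + m / (2 * (M + m))))
grad_f = lambda x: -np.sin(x)
clip = lambda g: np.clip(g, -C, C)              # 1-D clipping = projection

rng = np.random.default_rng(0)
xs, ys = rng.uniform(-10, 10, size=(2, 100000))
d = xs - ys
delta_p = clip(grad_f(xs)) - clip(grad_f(ys))
# The bound (31): |d - lam * Delta_i^p| <= sqrt(2) * L_lam * |d|.
assert np.all(np.abs(d - lam * delta_p) <= np.sqrt(2) * L_lam * np.abs(d) + 1e-12)
```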

We are now ready to give the proof of Proposition 2.1.

A.2.1 Proof of Proposition 2.1

Proof.

(a) Let x,ydomhx,y\in\operatorname*{dom}h and λ>0\lambda>0 be arbitrary. Moreover, denote pi()=proxqi()p_{i}(\cdot)=\operatorname{prox}_{q_{i}}(\cdot) for i[k]i\in[k]. Using the definition of 𝒜λ(){\cal A}_{\lambda}(\cdot), the assumption that pi(z)C\|p_{i}(z)\|\leq C for any znz\in\mathbb{R}^{n}, and the triangle inequality, we have that

𝒜λ(x)𝒜λ(y)\displaystyle\|{\cal A}_{\lambda}(x)-{\cal A}_{\lambda}(y)\| =xyλ|B|iB[pi(fi(x))pi(fi(y))]xy+λ|B|iB(pi(fi(x))+pi(fi(y)))\displaystyle=\left\|x-y-\frac{\lambda}{|B|}\sum_{i\in B}[p_{i}(\nabla f_{i}(x))-p_{i}(\nabla f_{i}(y))]\right\|\leq\|x-y\|+\frac{\lambda}{|B|}\sum_{i\in B}(\|p_{i}(\nabla f_{i}(x))\|+\|p_{i}(\nabla f_{i}(y))\|)
xy+2λC.\displaystyle\leq\|x-y\|+2\lambda C.

(b) Let x,ydomhx,y\in\operatorname*{dom}h be arbitrary, and denote ξ():=𝒜λ(,{fi},{qi})\xi(\cdot):={\cal A}_{\lambda}(\cdot,\{f_{i}\},\{q_{i}\}). Moreover, let dd, Δi\Delta_{i}, and Δip\Delta_{i}^{p} be as in (29). Using the fact that i=1|B|vi2|B|i=1|B|vi2\|\sum_{i=1}^{|B|}v_{i}\|^{2}\leq|B|\sum_{i=1}^{|B|}\|v_{i}\|^{2} for any {vi}n\{v_{i}\}\subseteq\mathbb{R}^{n}, we have

ξ(x)ξ(y)2\displaystyle\|\xi(x)-\xi(y)\|^{2} =1|B|2iB{[xλproxqi(fi(x))][yλproxqi(fi(y))]}2\displaystyle=\frac{1}{|B|^{2}}\left\|\sum_{i\in B}\left\{\left[x-\lambda\operatorname{prox}_{q_{i}}(\nabla f_{i}(x))\right]-\left[y-\lambda\operatorname{prox}_{q_{i}}(\nabla f_{i}(y))\right]\right\}\right\|^{2}
=1|B|2iB(dλΔip)21|B|iBdλΔip2.\displaystyle=\frac{1}{|B|^{2}}\left\|\sum_{i\in B}(d-\lambda\Delta_{i}^{p})\right\|^{2}\leq\frac{1}{|B|}\sum_{i\in B}\left\|d-\lambda\Delta_{i}^{p}\right\|^{2}. (33)

Using (33) and (31) in Lemma A.9, we conclude that

ξ(x)ξ(y)2\displaystyle\|\xi(x)-\xi(y)\|^{2} 1|B|iBdλΔip22Lλ2d2=2Lλ2xy2.\displaystyle\leq\frac{1}{|B|}\sum_{i\in B}\left\|d-\lambda\Delta_{i}^{p}\right\|^{2}\leq 2L_{\lambda}^{2}\|d\|^{2}=2L_{\lambda}^{2}\|x-y\|^{2}.

(c) Let ξ()\xi(\cdot), dd, Δi\Delta_{i}, and Δip\Delta_{i}^{p} be as in part (b). Following the same argument as in part (b), we obtain (33). Using (33) and (30) in Lemma A.9, we conclude that

ξ(x)ξ(y)2\displaystyle\|\xi(x)-\xi(y)\|^{2} 1|B|iBdλΔip2Lλ2d2=Lλ2xy2.\displaystyle\leq\frac{1}{|B|}\sum_{i\in B}\left\|d-\lambda\Delta_{i}^{p}\right\|^{2}\leq L_{\lambda}^{2}\|d\|^{2}=L_{\lambda}^{2}\|x-y\|^{2}.

A.3 RDP bounds

This subappendix derives the RDP bounds in Theorems 2.2 and 2.3.

The first result shows how the updates in Algorithm 1 are randomized proximal updates applied to the operator 𝒜λ(){\cal A}_{\lambda}(\cdot) in (12) with qi()=ClipC()q_{i}(\cdot)={\rm Clip}_{C}(\cdot).

Lemma A.10.

Let {Xt}\{X_{t}\}, {Xt}\{X_{t}^{\prime}\}, λ\lambda, bb, σ\sigma, CC, and TT be as in Theorem 2.2. Moreover, denote

ϕt(x):=𝒜λ(x,{fi},{ClipC}),ϕt(x):=𝒜λ(x,{fi},{ClipC}),xdomh,\phi_{t}(x):={\cal A}_{\lambda}(x,\{f_{i}\},\{{\rm Clip}_{C}\}),\quad\phi_{t}^{\prime}(x):={\cal A}_{\lambda}(x,\{f_{i}^{\prime}\},\{{\rm Clip}_{C}\}),\quad\forall x\in\operatorname*{dom}h,

where ClipC(){\rm Clip}_{C}(\cdot) and 𝒜λ(){\cal A}_{\lambda}(\cdot) are as in (2) and (12), respectively. Then, it holds that

Xt=proxλh(ϕt(Xt1)+Nt),Xt=proxλh(ϕt(Xt1)+Nt),t1.X_{t}=\operatorname{prox}_{\lambda h}(\phi_{t}(X_{t-1})+N_{t}),\quad X_{t}^{\prime}=\operatorname{prox}_{\lambda h}(\phi_{t}^{\prime}(X_{t-1}^{\prime})+N_{t}),\quad\forall t\geq 1.
Proof.

This follows immediately from the definition of ϕt\phi_{t}, the update rules in Algorithm 1, and the fact that ClipC(){\rm Clip}_{C}(\cdot) is the proximal operator of the (convex) indicator function of the convex set {x:xC}\{x:\|x\|\leq C\}. ∎
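The fact used above, that clipping is the proximal operator of the ball's indicator, i.e., the Euclidean projection onto {x : ‖x‖ ≤ C}, can be verified numerically via the standard variational characterization of projections (an illustrative sketch, not part of the formal argument):

```python
import numpy as np

C = 1.5

def clip_C(v):
    # Gradient clipping as in (2): rescale v onto the ball of radius C.
    n = np.linalg.norm(v)
    return v if n <= C else (C / n) * v

def projection_slack(v, z):
    # p = clip_C(v) is the Euclidean projection of v onto {x : ||x|| <= C}
    # (equivalently, the prox of the ball's indicator) if and only if
    # <v - p, z - p> <= 0 for every z in the ball; return that inner product.
    p = clip_C(v)
    return (v - p) @ (z - p)

rng = np.random.default_rng(0)
for _ in range(1000):
    v = rng.normal(size=4) * 3
    z = rng.normal(size=4)
    z *= C * rng.uniform() / np.linalg.norm(z)   # a random point inside the ball
    assert projection_slack(v, z) <= 1e-9
```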

We now present some important norm bounds.

Lemma A.11.

Let {Xt}\{X_{t}\}, {Xt}\{X_{t}^{\prime}\}, {ϕt}\{\phi_{t}\}, and {ϕt}\{\phi_{t}^{\prime}\} be as in Theorem 2.2 and denote =k/b\ell=k/b and t:=inft1{t:iBt}t^{*}:=\inf_{t\geq 1}\left\{t:i^{*}\in B_{t}\right\}. Then, it holds that

XtXt=0,ϕs(x)ϕs(x)2λCb,\|X_{t^{*}}-X_{t^{*}}^{\prime}\|=0,\quad\|\phi_{s}(x)-\phi_{s}^{\prime}(x)\|\leq\frac{2\lambda C}{b}, (34)

for every s{j+t:j=0,1,}s\in\{j\ell+t^{*}:j=0,1,...\} and any xdomhx\in\operatorname*{dom}h.

Proof.

The identity in (34) follows from the fact that Xt=XtX_{t}=X_{t}^{\prime} for every ttt\leq t^{*}. For the inequality in (34), it suffices to show the bound for s=ts=t^{*} because the batches BtB_{t} in Algorithm 1 are drawn cyclically. To that end, let xdomhx\in\operatorname*{dom}h be fixed. Using the update rule in Algorithm 1, the fact that fi=fif_{i}=f_{i}^{\prime} for every iii\neq i^{*}, and the fact that ClipC(x)C\|{\rm Clip}_{C}(x)\|\leq C for every xnx\in\mathbb{R}^{n}, we have that

ϕs(x)ϕs(x)\displaystyle\|\phi_{s}(x)-\phi_{s}^{\prime}(x)\| =1biBt{[xλClipC(fi(x))][xλClipC(fi(x))]}\displaystyle=\frac{1}{b}\left\|\sum_{i\in B_{t^{*}}}\left\{\left[x-\lambda{\rm Clip}_{C}(\nabla f_{i}(x))\right]-\left[x-\lambda{\rm Clip}_{C}(\nabla f_{i}^{\prime}(x))\right]\right\}\right\|
=λbClipC(fi(x))ClipC(fi(x))\displaystyle=\frac{\lambda}{b}\|{\rm Clip}_{C}(\nabla f_{i^{*}}(x))-{\rm Clip}_{C}(\nabla f_{i^{*}}^{\prime}(x))\|
λb[ClipC(fi(x))+ClipC(fi(x))]=2λCb.\displaystyle\leq\frac{\lambda}{b}\left[\|{\rm Clip}_{C}(\nabla f_{i^{*}}(x))\|+\|{\rm Clip}_{C}(\nabla f_{i^{*}}^{\prime}(x))\|\right]=\frac{2\lambda C}{b}.
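The sensitivity bound in (34) can also be simulated directly. The sketch below is illustrative only, with arbitrary vectors standing in for the gradients ∇fᵢ(x); it checks that replacing one example moves the noise-free drift of a step by at most 2λC/b:

```python
import numpy as np

b, C, lam, dim = 8, 1.0, 0.1, 3

def clip_C(v):
    n = np.linalg.norm(v)
    return v if n <= C else (C / n) * v

def drift(x, grads):
    # The (noise-free) drift of one DP-SGD step, as in the operator (12),
    # with per-example gradient vectors standing in for grad f_i(x).
    return np.mean([x - lam * clip_C(g) for g in grads], axis=0)

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.normal(size=dim)
    grads = rng.normal(scale=5.0, size=(b, dim))
    grads_prime = grads.copy()
    grads_prime[0] = rng.normal(scale=5.0, size=dim)  # neighboring dataset: one example replaced
    gap = np.linalg.norm(drift(x, grads) - drift(x, grads_prime))
    assert gap <= 2 * lam * C / b + 1e-12
```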

We now give the proofs of the main RDP bounds.

A.3.1 Proof of Theorem 2.2

Proof.

Using Proposition A.5(a) with

Y0=XT1,Y0=XT1,τ=dh,s=2λCb,L=Lλ,=kb,Y_{0}=X_{T-1},\quad Y_{0}^{\prime}=X_{T-1}^{\prime},\quad\tau=d_{h},\quad s=\frac{2\lambda C}{b},\quad L=L_{\lambda},\quad\ell=\frac{k}{b},

and E=T=1E=T=1, we have that

Dα(XTXT)\displaystyle D_{\alpha}(X_{T}\|X_{T}^{\prime}) Dα(dh)(XT1XT1)+α2σ2(Ldh+2λCb)2θL(1)=α2σ2(Ldh+2λCb)2.\displaystyle\leq D_{\alpha}^{(d_{h})}(X_{T-1}\|X_{T-1}^{\prime})+\frac{\alpha}{2\sigma^{2}}\left(Ld_{h}+\frac{2\lambda C}{b}\right)^{2}\theta_{L}(1)=\frac{\alpha}{2\sigma^{2}}\left(Ld_{h}+\frac{2\lambda C}{b}\right)^{2}.

A.3.2 Proof of Theorem 2.3

Proof.

Suppose fifif_{i^{*}}^{\prime}\neq f_{i^{*}} and iBti^{*}\in B_{t^{*}} for indices ii^{*} and tt^{*}. We first prove (17). In view of Proposition 2.1(b), it is clear that the DP-SGD update is 2Lλ\sqrt{2}L_{\lambda}-Lipschitz continuous. Hence, using the bounds in (34) of Lemma A.11 and Proposition A.5(a) with

Y0=Xt,Y0=Xt,τ=0,s=2λCb,L=Lλ,=kb,Y_{0}=X_{t^{*}},\quad Y_{0}^{\prime}=X_{t^{*}}^{\prime},\quad\tau=0,\quad s=\frac{2\lambda C}{b},\quad L=L_{\lambda},\quad\ell=\frac{k}{b},

and T=Tt1T=T-t^{*}-1, we have that

Dα(XTXT)\displaystyle D_{\alpha}(X_{T}\|X_{T}^{\prime}) Dα(0)(XtXt)+2α(λCbσ)2[EθLλ()+θLλ(Tt1E)]\displaystyle\leq D_{\alpha}^{(0)}(X_{t^{*}}\|X_{t^{*}}^{\prime})+2\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[E\theta_{L_{\lambda}}(\ell)+\theta_{L_{\lambda}}(T-t^{*}-1-E\ell)\right]
=2α(λCbσ)2[EθLλ()+θLλ(Tt1E)]2α(λCbσ)2[1+EθLλ()],\displaystyle=2\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[E\theta_{L_{\lambda}}(\ell)+\theta_{L_{\lambda}}(T-t^{*}-1-E\ell)\right]\leq 2\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[1+E\theta_{L_{\lambda}}(\ell)\right],

where the last inequality follows from the fact that θLλ(s)\theta_{L_{\lambda}}(s) is nonincreasing for s1s\geq 1 and θLλ(1)=1\theta_{L_{\lambda}}(1)=1.

We now prove (19). In view of (18) and Proposition 2.1(c), it is clear that the DP-SGD update, in the absence of gradient clipping, is LλL_{\lambda}-Lipschitz continuous. Consequently, the desired bound follows from the same arguments as in the proof of (17) above, but with L=LλL=L_{\lambda} instead of L=2LλL=\sqrt{2}L_{\lambda}. ∎

Appendix B Choice of residuals

This appendix briefly discusses the choice of residuals {at}\{a_{t}\} that are used in the proof of Proposition A.5(a) and Lemma A.2.

In the setup of Proposition A.5(a), it is straightforward to show that if {at}\{a_{t}\} is a nonnegative sequence of scalars such that

R~t:=Lt1(Lτ+s)i=1taiLti0,R~T=0,\tilde{R}_{t}:=L^{t-1}(L\tau+s)-\sum_{i=1}^{t}a_{i}L^{t-i}\geq 0,\quad\tilde{R}_{T}=0,

then repeatedly applying Lemma A.2 with a=ata=a_{t} yields

Dα(YTYT)Dα(τ)(Y0Y0)α2σ2i=1Tai2.D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}^{(\tau)}(Y_{0}\|Y_{0}^{\prime})\leq\frac{\alpha}{2\sigma^{2}}\sum_{i=1}^{T}a_{i}^{2}. (35)

Hence, to obtain the tightest bound of the form in (35), we need to solve the quadratic program

(P)min\displaystyle(P)\quad\min\ 12i=1Tai2\displaystyle\frac{1}{2}\sum_{i=1}^{T}a_{i}^{2}
s.t R~t0t[T1],\displaystyle\tilde{R}_{t}\geq 0\quad\forall t\in[T-1],
R~T=0.\displaystyle\tilde{R}_{T}=0.

If we ignore the inequality constraints, the first order optimality condition of the resulting problem is that there exists ξ\xi\in\mathbb{R} such that

ai=ξLTii[T],R~T=0.a_{i}=\xi L^{T-i}\quad\forall i\in[T],\quad\tilde{R}_{T}=0.

The latter identity implies that

LT1(Lτ+s)=ξi=1TL2(Ti)=ξi=0T1L2iL^{T-1}(L\tau+s)=\xi\sum_{i=1}^{T}L^{2(T-i)}=\xi\sum_{i=0}^{T-1}L^{2i}

which then implies

ai=LT1(Lτ+s)LTij=0T1L2ji[T].a_{i}=\frac{L^{T-1}(L\tau+s)L^{T-i}}{\sum_{j=0}^{T-1}L^{2j}}\quad\forall i\in[T].

Hence, to verify that the above choice of aia_{i} is optimal for (P)(P), it remains to verify that R~t0\tilde{R}_{t}\geq 0 for t[T1]t\in[T-1]. Indeed, this follows from Lemma A.3(b) after normalizing by the Lτ+sL\tau+s factor. As a consequence, the right-hand side of (35) is minimized for our choice of aia_{i} above.
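The closed-form residuals above are easy to validate numerically. The sketch below (illustrative only, with arbitrary parameter values) checks feasibility for (P), i.e., that R̃ₜ ≥ 0 for t < T and R̃_T = 0:

```python
import numpy as np

def optimal_residuals(L, tau, s, T):
    # The closed-form choice a_i = xi * L^(T - i) derived above, with xi
    # determined by the terminal condition that the residual R_T vanishes.
    xi = L ** (T - 1) * (L * tau + s) / sum(L ** (2 * j) for j in range(T))
    return np.array([xi * L ** (T - i) for i in range(1, T + 1)])

def residual_path(a, L, tau, s):
    # R_t = L^(t-1) (L tau + s) - sum_{i=1}^t a_i L^(t-i), for t = 1, ..., T.
    T = len(a)
    return [L ** (t - 1) * (L * tau + s)
            - sum(a[i - 1] * L ** (t - i) for i in range(1, t + 1))
            for t in range(1, T + 1)]

L, tau, s, T = 1.3, 0.5, 1.0, 12
a = optimal_residuals(L, tau, s, T)
R = residual_path(a, L, tau, s)
assert all(r >= -1e-9 for r in R[:-1])   # inequality constraints of (P)
assert abs(R[-1]) < 1e-8                 # terminal equality constraint
```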

Appendix C Parameter choices

Let us now consider some interesting values for λ\lambda, σ\sigma, and \ell.

The result below establishes a useful bound on θL(s)\theta_{L}(s) for sufficiently large values of ss.

Lemma C.1.

For any L>1L>1 and ξ>1\xi>1, if slogLξ/(ξ1)s\geq\log_{L}\sqrt{\xi/(\xi-1)} then θL(s)ξ(1L2).\theta_{L}(s)\leq\xi\left(1-L^{-2}\right).

Proof.

Using the definition of θL()\theta_{L}(\cdot), we have

θL(s)\displaystyle\theta_{L}(s) =L2(s1)i=0s1L2i=L2sL2(s1)L2s1=1L21L2s1L21L2logLξ/(ξ1)\displaystyle=\frac{L^{2(s-1)}}{\sum_{i=0}^{s-1}L^{2i}}=\frac{L^{2s}-L^{2(s-1)}}{L^{2s}-1}=\frac{1-L^{-2}}{1-L^{-2s}}\leq\frac{1-L^{-2}}{1-L^{-2\log_{L}\sqrt{\xi/(\xi-1)}}}
=1L21(ξ1)/ξ=ξ(11L2).\displaystyle=\frac{1-L^{-2}}{1-(\xi-1)/\xi}=\xi\left(1-\frac{1}{L^{2}}\right).
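The bound of Lemma C.1 can be checked numerically over a grid of (L, ξ, s) values; the snippet below is an illustrative sketch using the closed form θ_L(s) = (1 − L⁻²)/(1 − L⁻²ˢ):

```python
import numpy as np

def theta(L, s):
    # theta_L(s) = L^(2(s-1)) / sum_{i=0}^{s-1} L^(2i) = (1 - L^-2) / (1 - L^-2s).
    return (1 - L ** -2.0) / (1 - L ** (-2.0 * s))

def lemma_c1_holds(L, xi, s):
    # Lemma C.1: s >= log_L sqrt(xi/(xi-1)) implies theta_L(s) <= xi (1 - L^-2).
    s_min = np.log(np.sqrt(xi / (xi - 1.0))) / np.log(L)
    return s < s_min or theta(L, s) <= xi * (1 - L ** -2.0) + 1e-9

for L in (1.01, 1.1, 2.0, 5.0):
    for xi in (1.5, 2.0, 10.0):
        assert all(lemma_c1_holds(L, xi, s) for s in np.linspace(0.1, 50, 200))
```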

Corollary C.2.

Let α>1\alpha>1 and ε>0\varepsilon>0 be fixed, and let {Xt}\{X_{t}\}, {Xt}\{X_{t}^{\prime}\}, bb, CC, EE, \ell, λ\lambda, and TT be as in Theorem 2.3. Moreover, define

λ¯(ρ):=12(M+ρ),σ¯ε(ρ):=Cλ¯(ρ)2b1αε(1+[4ρM+ρ]E),¯(ρ):=log2log[1+ρλ¯(ρ)],\overline{\lambda}(\rho):=\frac{1}{2(M+\rho)},\quad\overline{\sigma}_{\varepsilon}(\rho):=\frac{C\cdot\overline{\lambda}(\rho)}{2b}\sqrt{\frac{1}{\alpha\varepsilon}\left(1+\left[\frac{4\rho}{M+\rho}\right]E\right)},\quad\overline{\ell}(\rho):=\frac{\log 2}{\log\left[1+\rho\overline{\lambda}(\rho)\right]},

for every ρ>0\rho>0. If λ=λ¯(m)\lambda=\overline{\lambda}(m), σσ¯ε(m)\sigma\geq\overline{\sigma}_{\varepsilon}(m), ¯(m)\ell\geq\overline{\ell}(m), and no gradient clipping is performed, then

Dα(XTXT)4α[Cλ¯(m)bσ¯ε(m)]2[1+4mEM+m],\displaystyle D_{\alpha}(X_{T}\|X_{T}^{\prime})\leq 4\alpha\left[\frac{C\cdot\overline{\lambda}(m)}{b\cdot\overline{\sigma}_{\varepsilon}(m)}\right]^{2}\left[1+\frac{4mE}{M+m}\right],

and the corresponding instance of Algorithm 1 is (α,ε)(\alpha,\varepsilon)-Rényi-DP.

Proof.

For ease of notation, denote λ¯=λ¯(m)\overline{\lambda}=\overline{\lambda}(m), L=Lλ¯L=L_{\overline{\lambda}}, σ¯=σ¯ε(m)\overline{\sigma}=\overline{\sigma}_{\varepsilon}(m), and ¯=¯(m)\overline{\ell}=\overline{\ell}(m). We first note that

L=Lλ¯=1+mM+m[1+m2(M+m)]1+mλ¯(m),L=L_{\overline{\lambda}}=\sqrt{1+\frac{m}{M+m}\left[1+\frac{m}{2(M+m)}\right]}\geq\sqrt{1+m\overline{\lambda}(m)},

which implies

¯=log2log1+mλ¯log2logL=logL2.\ell\geq\overline{\ell}=\frac{\log\sqrt{2}}{\log\sqrt{1+m\overline{\lambda}}}\geq\frac{\log\sqrt{2}}{\log L}=\log_{L}\sqrt{2}.

Consequently, using Lemma C.1 with (ξ,s)=(2,)(\xi,s)=(2,\ell) and the definitions of Lλ()L_{\lambda}(\cdot) and λ¯()\overline{\lambda}(\cdot), we have that

θL()2(11L2)2mM+m[1+m2(M+m)]4mM+m.\theta_{L}(\ell)\leq 2\left(1-\frac{1}{L^{2}}\right)\leq 2\cdot\frac{m}{M+m}\left[1+\frac{m}{2(M+m)}\right]\leq\frac{4m}{M+m}.

Using the above bound and Theorem 2.3 with (λ,σ,L)=(λ¯,σ¯,Lλ¯)(\lambda,\sigma,L)=(\overline{\lambda},\overline{\sigma},L_{\overline{\lambda}}), we obtain

Dα(XTXT)\displaystyle D_{\alpha}(X_{T}\|X_{T}^{\prime}) 4α(λ¯Cbσ)2[1+EθL()]4α(λ¯Cbσ)2[1+4EmM+m]\displaystyle\leq 4\alpha\left(\frac{\overline{\lambda}C}{b\sigma}\right)^{2}\left[1+E\theta_{L}(\ell)\right]\leq 4\alpha\left(\frac{\overline{\lambda}C}{b\sigma}\right)^{2}\left[1+\frac{4Em}{M+m}\right]
4α(λ¯Cbσ¯)2[1+4EmM+m]ε.\displaystyle\leq 4\alpha\left(\frac{\overline{\lambda}C}{b\overline{\sigma}}\right)^{2}\left[1+\frac{4Em}{M+m}\right]\leq\varepsilon.

In view of the fact that Algorithm 1 returns the last iterate XTX_{T} (or XTX_{T}^{\prime}), the conclusion follows. ∎

Some remarks about Corollary C.2 are in order. First, σ¯ε2(m)\overline{\sigma}_{\varepsilon}^{2}(m) increases linearly with the number of dataset passes EE. Second, the smaller mm is, the smaller the effect of EE on σ¯ε(m)\overline{\sigma}_{\varepsilon}(m). Third, limm0¯(m)=\lim_{m\to 0}\overline{\ell}(m)=\infty, which implies that reducing the dependence on EE in σ¯ε(m)\overline{\sigma}_{\varepsilon}(m) leads to more restrictive choices of \ell. Finally, it is worth emphasizing that the restrictions on \ell can be removed by using (17) directly. However, the resulting bounds are less informative in terms of the constants mm and MM.

We now present an RDP bound that is independent of EE when λ\lambda is sufficiently small.

Corollary C.3.

Let {Xt}\{X_{t}\}, {Xt}\{X_{t}^{\prime}\}, bb, CC, EE, \ell, λ\lambda, σ\sigma, and TT be as in Theorem 2.3. If

λmin{1E,12(m+M)}\lambda\leq\min\left\{\frac{1}{\sqrt{E}},\frac{1}{2(m+M)}\right\}

and no gradient clipping is performed, then we have

Dα(XTXT)4α(Cbσ)2[1+θLλ()].D_{\alpha}(X_{T}\|X_{T}^{\prime})\leq 4\alpha\left(\frac{C}{b\sigma}\right)^{2}\left[1+\theta_{L_{\lambda}}(\ell)\right].
Proof.

Using Theorem 2.3 and the fact that θL()1\theta_{L}(\cdot)\leq 1 for any L>1L>1, we have that

Dα(XTXT)\displaystyle D_{\alpha}(X_{T}\|X_{T}^{\prime}) 4α(λCbσ)2[θLλ(TE)+EθLλ()]4α(λCbσ)2[1+EθLλ()]\displaystyle\leq 4\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[\theta_{L_{\lambda}}(T-E\ell)+E\theta_{L_{\lambda}}(\ell)\right]\leq 4\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[1+E\theta_{L_{\lambda}}(\ell)\right]
4α(Cbσ)2[1E+θLλ()]4α(Cbσ)2[1+θLλ()].\displaystyle\leq 4\alpha\left(\frac{C}{b\sigma}\right)^{2}\left[\frac{1}{E}+\theta_{L_{\lambda}}(\ell)\right]\leq 4\alpha\left(\frac{C}{b\sigma}\right)^{2}\left[1+\theta_{L_{\lambda}}(\ell)\right].

Appendix D Limitations of Poisson sampling in practice

This appendix discusses the computational limitations of implementing Poisson sampling in practice. It is primarily concerned with the large-scale setting, where datasets may be on the order of hundreds of millions of examples.

Data access. Implementations of Poisson sampling, e.g., Opacus [29], typically employ a pseudorandom number generator to (i) randomly sample a collection of indices from zero to N1N-1, where NN is the size of the dataset, and (ii) map these indices to corresponding examples in the dataset to generate a batch of examples. For (ii) to be efficient, many libraries need fast random access to the dataset, which is difficult to achieve without loading the entire dataset into RAM (as reading data from disk can be orders of magnitude more expensive). In contrast, cyclic traversal of batches only requires (relatively small) fixed blocks of the dataset to be loaded into memory for every batch and need not perform a matching of indices (as in (i) above) to data.

Variable batch size. Independent of the access speed of the dataset examples, Poisson sampling also generates batches of random sizes, which are typically inconvenient to handle in deep learning systems [14]. For example, popular just-in-time compilation-based machine learning libraries such as JAX, PyTorch/Opacus, and TensorFlow may need to retrace their computation graphs at every training step because the batch size cannot be statically inferred or kept constant. Additionally, optimizing training workloads on hardware accelerators such as graphics processing units (GPUs) and tensor processing units (TPUs) becomes difficult because (i) they require any in-device data to have fixed sizes and (ii) any input data generated by Poisson sampling will have variable sizes due to the variable batch sizes. In contrast, the cyclic traversal of batches always generates fixed batch sizes and, consequently, does not suffer from the above issues.
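The contrast between the two batching schemes is easy to simulate. The snippet below (illustrative only, with arbitrary dataset and batch sizes) draws realized Poisson batch sizes, which follow a Binomial(N, b/N) distribution, and compares them with the constant batch sizes of cyclic traversal:

```python
import numpy as np

rng = np.random.default_rng(0)
N, b, steps = 1_000_000, 1024, 1000   # dataset size, expected batch size, steps

# Poisson sampling: each example joins a batch independently with
# probability b/N, so the realized batch size is Binomial(N, b/N).
poisson_sizes = rng.binomial(N, b / N, size=steps)

# Cyclic traversal: batch t is a fixed contiguous block of b indices,
# so every batch has exactly b examples and a static shape.
cyclic_sizes = np.full(steps, b)

assert poisson_sizes.std() > 0             # batch sizes fluctuate around b
assert np.unique(cyclic_sizes).size == 1   # a single static batch shape
```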