Privacy of the last iterate in cyclically-traversed DP-SGD on nonconvex composite losses

Weiwei Kong¹¹1Google Research, Email: [email protected], ORCID: 0000-0002-4700-619X Mónica Ribero²²2Google Research, Email: [email protected]

Abstract

Differentially-private stochastic gradient descent (DP-SGD) is a family of iterative machine learning training algorithms that privatize gradients to generate a sequence of differentially-private (DP) model parameters. It is also the standard tool used to train DP models in practice, even though most users are only interested in protecting the privacy of the final model. Tight DP accounting for the last iterate would minimize the amount of noise required while maintaining the same privacy guarantee and potentially increasing model utility. However, last-iterate accounting is challenging, and existing works require strong assumptions not satisfied by most implementations. These include assuming (i) the global sensitivity constant is known — to avoid gradient clipping; (ii) the loss function is Lipschitz or convex; and (iii) input batches are sampled randomly.

In this work, we forego any unrealistic assumptions and provide privacy bounds for the most commonly used variant of DP-SGD, in which data is traversed cyclically, gradients are clipped, and only the last model is released. More specifically, we establish new Rényi differential privacy (RDP) upper bounds for the last iterate under realistic assumptions of small stepsize and Lipschitz smoothness of the loss function. Our general bounds also recover the special-case convex bounds when the weak-convexity parameter of the objective function approaches zero and no clipping is performed. The approach itself leverages optimal transport techniques for last iterate bounds, which is a nontrivial task when the data is traversed cyclically and the loss function is nonconvex.

1 Introduction

Differential privacy (DP) is an approach to capture the sensitivity of an algorithm to any individual user’s data and is frequently used in both industrial and government applications (see the book by [15] for a rich introduction). Given a possibly nonprivate computation $f$ , a desired level of DP (or privacy budget) $\varepsilon$ is generally achieved by bounding the global sensitivity³³3The maximum change in the mechanism’s output caused by changes in a single user or data point. of $f$ and then adding noise to its output. This noise is typically calibrated to the sensitivity and $\varepsilon$ in order to obscure the contributions of a single input example. Conversely, given a mechanism $\cal A$ for making a computation differentially private, a method for determining the level of DP obtained by $\cal A$ is often called a DP accounting method.

Differentially-private stochastic gradient descent (DP-SGD) refers to a family of popular first-order methods for training model weights with DP[10, 27, 6, 1]. At a high level, a DP-SGD method first computes the gradients of a given set of per-example loss functions with respect to the model weights and applies an algorithm ${\cal A}$ to obtain a private gradient $\cal G$ . The private gradient $\cal G$ is then used in some first-order optimization scheme, e.g., SGD, Adam, or AdaGrad, to update model weights. More precisely, ${\cal A}$ consists of (i) scaling the per-example loss gradients (a.k.a. gradient clipping) to reduce sensitivity, (ii) adding independent and identically distributed (i.i.d.) Gaussian noise $\cal Z$ to each of the scaled gradients, and (iii) summing the noised gradients to obtain $\cal G$ . In general, the higher the variance of $\cal Z$ is, the lower the utility of the final trained model.

Depending on the optimization scheme, and the assumptions on how the user-level loss functions are obtained, existing DP accounting methods for DP-SGD can differ significantly. For example, when only the last iterate of DP-SGD is released, existing accounting methods require both sophisticated machinery and numerous strong assumptions to provide tight DP bounds. Some of these strong assumptions, that almost never hold in practice, include (i) the input data is sampled randomly at each DP-SGD iteration, (ii) the loss functions are convex, (iii) the global DP-SGD sensitivity is known beforehand, and (iv) the intermediate model weights are bounded.

This work develops tighter privacy analyses for last-iterate DP-SGD under more realistic settings than existing works. Consequently, our analyses enable implementations of DP-SGD that apply Gaussian noise $\cal Z$ with lower variance than existing work, and as a consequence, obtain higher utility at the same privacy budget $\varepsilon$ . More specifically, we develop a family of Rényi DP (RDP) bounds on the last iterate of DP-SGD, which are novel in that they:

(i)

do not assume knowledge of the global sensitivity constant and, hence, are valid with or without gradient clipping;
(ii)

hold for both the nonconvex and convex settings under significantly fewer assumptions than other works;
(iii)

are parameterized by a weak convexity⁴⁴4A function $f$ is $m$ -weakly-convex if $f+m\|\cdot\|^{2}/2$ is convex. parameter $m\geq 0$ , for which one of the bounds smoothly converges to a similar one in the convex setting as $m\to 0$ .

1.1 Background

We begin by formally stating the problem of interest, describing common terminology and notation, and specifying the DP-SGD variant under consideration. We then briefly describe the Privacy Amplification by Iteration (PABI) argument of [17] and discuss the difficulties of generalizing this argument to more practical settings.

Problem of interest. We develop RDP bounds for the last iterate of a DP-SGD variant applied to the nonsmooth composite optimization problem

\min_{x\in\mathbb{R}^{n}}\left\{\phi(x):=\frac{1}{k}\sum_{i=1}^{k}f_{i}(x)+h(x)\right\}

(1)

where $h$ is convex and proper lower-semicontinuous and $f_{i}$ is continuously differentiable on the domain of $h$ . Notice that the assumption on $h$ encapsulates (i) common nonsmooth regularization functions such as the $\ell_{1}$ -norm $\|\cdot\|_{1}$ , nuclear matrix norm $\|\cdot\|_{*}$ , elastic net regularizer and (ii) indicator functions on possibly unbounded closed convex sets. A common setting in practice is when $(1/k)\sum_{i=1}^{k}f_{i}(x)$ corresponds to a softmax cross-entropy loss function and $h(x)$ corresponds to an $\ell_{1}$ - or $\ell_{2}$ -regularization function.

Common terminology. An input data collection $X=\{x_{i}\}_{i=1}^{k}$ is said to be traversed cyclically (or cyclically-traversed) in batches $\{B_{t}\}$ of size $b$ if $B_{t}$ contains $\{x_{b(t-1)+1},\ldots,x_{bt}\}$ for the first $t\leq k/b$ batches⁵⁵5For simplicity, we assume $b$ divides $k$ throughout., and the the rest of the batches cycle between $B_{1},\ldots,B_{k/b}$ in order. Cyclically-traversed DP-SGD is a variant of DP-SGD where the input data is traversed cyclically⁶⁶6Cyclically traversed is also known in the literature as incremental gradient [23].. A dataset pass occurs when the input data (e.g., $X$ above) in a cyclically-traversed DP-SGD run has been used, and the next batch of inputs is the same as the first batch of inputs at the beginning of the dataset pass. Gradient clipping is the process of orthogonally projecting a gradient vector in $\mathbb{R}^{n}$ to a Euclidean ball of radius $C$ centered at the origin. The parameter $C$ is typically called the $\ell_{2}$ -norm clip value. In this work, we say a function is a randomized operator if it consists of applying some deterministic operator to an input and adding random noise to resulting output. An operator ${\cal T}$ is said to be $L$ -Lipschitz if $\|{\cal T}(x)-{\cal T}(y)\|\leq L\|x-y\|$ for every $x$ and $y$ , and ${\cal T}$ is said to be nonexpansive if it is 1-Lipschitz.

Common notation. Let $[n]=\{1,\ldots,n\}$ for any positive integer $n$ . Let $\mathbb{R}$ denote the set of real numbers and $\mathbb{R}^{n}=\mathbb{R}\times\cdots\times\mathbb{R}$ denote the $n$ -fold Cartesian product of $\mathbb{R}$ . Let $(\langle\cdot,\cdot\rangle,\mathbb{R}^{n})$ denote a Euclidean space over $\mathbb{R}^{n}$ and denote $\|\cdot\|:=\sqrt{\langle\cdot,\cdot\rangle}$ to be the induced norm. The domain of a function $\phi:\mathbb{R}^{n}\mapsto(-\infty,\infty]$ is $\operatorname*{dom}\phi:=\{z\in\mathbb{R}^{n}:\phi(z)<\infty\}$ . The function $\phi$ is said to be proper if $\operatorname*{dom}\phi\neq\emptyset$ . A function $\phi:\mathbb{R}^{n}\mapsto(-\infty,\infty]$ is said to be lower semicontinuous if $\liminf_{x\to x_{0}}\phi(x)\geq\phi(x_{0})$ . The set of proper, lower semicontinuous, convex functions over $\mathbb{R}^{n}$ is denoted by $\overline{{\rm Conv}}\,(\mathbb{R}^{n})$ . The clipping operator is given by

\displaystyle{\rm Clip}_{C}(y):=y\cdot\min\left\{1,\frac{C}{\|y\|}\right\},

(2)

and the proximal operator for a proper convex function $\psi$ is defined as

\operatorname{prox}_{\psi}(z_{0})=\operatorname*{argmin}_{z\in\mathbb{R}^{n}}\left\{\psi(z)+\frac{1}{2}\|z-z_{0}\|^{2}\right\}\quad\forall z_{0}\in\mathbb{R}^{n}.

(3)

It is well-known that ${\rm prox}_{\psi}(\cdot)$ is nonexpansive (see, for example [7, Theorem 6.42]) and that ${\rm Clip}_{C}(y)$ is the proximal operator for the (convex) indicator function of the set $\{x:\|x\|\leq C\}$ .

The $\infty$ -Wasserstein metric ${\cal W}_{\infty}(\mu,\nu)$ is the smallest real number $w$ such that for $X\sim\mu$ and $Y\sim\nu$ , there is a joint distribution on $(X,Y)$ where $\|X-Y\|\leq w$ almost surely, i.e., ${\cal W}_{\infty}(\mu,\nu)=\inf_{\gamma\sim\Gamma(\mu,\nu)}\operatorname*{ess\ sup}_{(x,y)\sim\gamma}\|x-y\|$ , where $\Gamma(\mu,\nu)$ is the collection of measures on $\mathbb{R}^{n}\times\mathbb{R}^{n}$ with first and second marginals $\mu$ and $\nu$ , respectively. For any probability distributions $\mu$ and $\nu$ with $\nu\ll\mu$ , the Rényi divergence of order $\alpha\in(1,\infty)$ is

D_{\alpha}(\mu\|\nu)=\frac{1}{\alpha-1}\log\int\left[\frac{\mu(x)}{\nu(x)}\right]^{\alpha}\nu(x)\,dx,

(4)

where we take the convention that $0/0=0$ . For $\nu\not\ll\mu$ , we define $D_{\alpha}(\mu\|\nu)=\infty$ . For parameters $\tau\geq 0$ and $\alpha\geq 1$ , the shifted Rényi divergence is

D_{\alpha}^{(\tau)}(\mu\|\nu):=\inf_{\gamma}\left\{D_{\alpha}(\gamma\|\nu):{\cal W}_{\infty}(\mu,\gamma)\leq\tau\right\}

(5)

for any probability distributions $\mu$ and $\nu$ over $\mathbb{R}^{n}$ . Given random variables $X\sim\mu$ and $Y\sim\nu$ , we denote $D_{\alpha}(X\|Y)=D_{\alpha}(\mu\|\nu)$ and $D_{\alpha}^{(\tau)}(X\|Y)=D_{\alpha}^{(\tau)}(\mu\|\nu)$ .

We consider the swap model for differential privacy. We say two datasets $S$ and $S^{\prime}$ are neighbors, denoted as $S\sim S^{\prime}$ , if $S^{\prime}$ can be obtained by swapping one record. A randomized algorithm $\cal A$ is said to be $(\alpha,\varepsilon)$ -RDP if, for every pair of neighboring datasets $S$ and $S^{\prime}$ in the domain of $\cal A$ , we have $D_{\alpha}({\cal A}(S)\|{\cal A}(S^{\prime}))\leq\varepsilon$ .

${\cal A}$ satisfies local DP if for all records $x_{i}$ and $x_{j}$ , $D_{\alpha}({\cal A}(x_{i})\|{\cal A}(x_{j}))\leq\varepsilon$ . Finally, we use the following variable conventions: $\ell$ is the number of batches (or iterations) in a dataset pass, $E$ is the number of dataset passes, $T=E\cdot\ell$ is the total number of iterations, $k$ is the total number of per-example losses, $b$ is the batch size, $\lambda$ is the DP-SGD stepsize, and $C$ is the clipping norm.

DP-SGD variant. Algorithm 1 outlines the specific variant of DP-SGD applied to (1). This variant takes as input $k$ per-example loss functions $\{f_{i}\}^{k}_{i=1}$ , the number of iterations $T$ , iid samples $\{N_{t}\}^{T}_{t=1}$ from a spherical Gaussian distribution ${\cal N}(0,\sigma^{2}I)$ , initial model weights $X_{0}$ , batch size, stepsize, and $\ell_{2}$ -clipping norm values. Model weights are updated as follows. At time step $t$ , the algorithm (i) selects a batch of examples by cyclically traversing $\{f_{i}\}^{k}_{i=1}$ , (ii) computes the average $g_{t}$ of clipped per example gradients at $X_{t-1}$ , and (iii) updates $X_{t-1}$ using a noisy gradient. Finally, the algorithm returns the last iterate, $X_{T}$ .

Input:

\{f_{i}\}_{i=1}^{k}

h

, iid samples

\{N_{t}\}\subseteq\mathbb{R}^{n}

from

{\cal N}(0,\sigma^{2}I)

X_{0}\in\operatorname*{dom}h

;

Data: batch size

b

, stepsize

\lambda

, clipping norm

C

, iteration limit

T

, steps per dataset pass

\ell

;

Output:

X_{T}\in\operatorname*{dom}h

;

for $t=1,\ldots,T$ do

j_{t}\leftarrow b(t-1)\bmod k

B_{t}\leftarrow\{j_{t}+1,\ldots,j_{t}+b\}

g_{t}\leftarrow(1/b)\sum_{i\in B_{t}}{\rm Clip}_{C}(\nabla f_{i}(X_{t-1}))

X_{t}\leftarrow\operatorname{prox}_{\lambda h}\left(X_{t-1}-\lambda g_{t}+N_{t}\right)

end for

return $X_{T}$

Algorithm 1 Cyclically-traversed last-iterate DP-SGD

1.2 Outline of approach

We now outline our approach of tackling the problem of interest. A more formal treatment is given in Section 2.

To motivate our approach, we provide a brief overview of previous well-known methods. An early approach of analyzing Algorithm 1 is to develop a bound based on local DP for a single dataset pass and extend this bound for multiple passes (see, for example, [27, 14]). While straightforward, this approach can be overly restrictive in a centralized setting. Privacy Amplification by Subsampling (PABS) (see subsequent work [13] for a comparison of different sampling methods) improves on the previous approach in certain regimes. Although this method allows for clean privacy accounting, its reliance on Poisson subsampling makes it impractical for large-scale applications (see Appendix D for an extended discussion). The work started by [17] addressed the limitations of PABS with Privacy Amplification by Iteration (PABI), which achieves a bound for releasing the final DP-SGD iterate. This PABI bound improves the baseline bound in certain regimes incorporating a contraction factor dependent on the loss function’s convexity parameters.

Our approach is inspired by PABI, but relaxes several of its convexity assumptions. For added context, we briefly review PABI below. Under the assumption that the loss function of (1) is convex and $Q$ -Lipschitz, and that $h$ is the indicator of a closed convex set, [17] shows that the DP-SGD update in Algorithm 1 with small constant stepsize $\lambda$ and no gradient clipping is a nonexpansive operator. This property can be then combined with the following technical result about nonexpansive operators.

Theorem 1.1.

Suppose we are given iterates $\{X_{t}\}$ and $\{X_{t}^{\prime}\}$ , nonexpansive operators $\{\phi_{t}\}$ and $\{\phi_{t}^{\prime}\}$ , iid Gaussian random variables $\{N_{t}\}$ , and scalars $\{s_{t}\}$ satisfying

X_{t}=\phi_{t}(X_{t-1})+N_{t},\quad X_{t}^{\prime}=\phi_{t}^{\prime}(X_{t-1}^{\prime})+N_{t},\quad\sup_{x}\|\phi_{t}(x)-\phi_{t}^{\prime}(x^{\prime})\|\leq s_{t}\quad\forall t\in[T].

For any scalar sequences $\{a_{t}\}$ and $\{z_{t}\}$ satisfying

z_{t}=\sum_{i\leq t}s_{i}-\sum_{i\leq t}a_{i}\geq 0,\quad z_{t}\geq 0,\quad a_{t}\geq 0,\quad\forall t\in[T],

(6)

we obtain the following last-iterate shifted Rényi divergence bound:

D_{\alpha}^{(z_{T})}(X_{T}\|X_{T}^{\prime})-D_{\alpha}(X_{0}\|X_{0}^{\prime})\leq\frac{\alpha}{2\sigma^{2}}\sum_{t=1}^{T}a_{t}^{2}=:{\cal R}_{T}(\{a_{t}\})\quad\forall T\geq 1,\quad\forall\alpha\geq 1.

(7)

More specifically, assuming that the DP-SGD iterates first differ at index $t^{*}$ , i.e., $X_{t^{*}}^{\prime}\neq X_{t^{*}}$ and $X_{t}^{\prime}=X_{t}$ for every $t<t^{*}$ , the operators $\{\phi_{t}\}$ and $\{\phi_{t}^{\prime}\}$ in the above theorem can be formed with $s_{t^{*}}=2\lambda Q$ and $s_{t}=0$ for every $t\neq t^{*}$ . Consequently, one can select $\{a_{t}\}$ so that the shift satisfies $z_{T}=0$ and obtain a closed form bound ${\cal R}_{T}(\{a_{t}\})=\Theta(\alpha[\lambda Q/\sigma]^{2})$ in (7), which yields a corresponding RDP bound when $X_{0}=X_{0}^{\prime}$ . The generalization to multiple dataset passes follows similarly but the final bound scales with the number of dataset passes $E$ . Specifically, we have ${\cal R}_{T}(\{a_{t}\})=\Theta(\alpha[E/\ell]\cdot[\lambda Q/\sigma]^{2})$ .

In the more practical setting where (i) the loss function in (1) is nonconvex and not necessarily $Q$ -Lipschitz, (ii) gradient clipping is applied, and (iii) $h$ is nonsmooth, it is no longer clear to what extent the corresponding DP-SGD operators $\{\phi_{t}\}$ and $\{\phi_{t}^{\prime}\}$ are nonexpansive or how $\{s_{t}\}$ should be obtained. Furthermore, the first inequality of (6) no longer holds, and additional technical issues arise when analyzing the case of multiple dataset passes.

Our approach. We generalize the above argument and combine it with additional analyses of weakly-convex functions and proximal operators to relax several strong assumptions. A sketch of our approach is given below, and formal arguments can be found in subsequent sections.

General operator analysis. In Lemma A.2, we study properties of operators $\phi$ and $\phi^{\prime}$ satisfying

\sup_{u}\|\phi^{\prime}(u)-\phi(u)\|\leq s,\quad\|\Phi(x)-\Phi(y)\|\leq L\|x-y\|+\zeta,\quad\forall\Phi\in\{\phi,\phi^{\prime}\}.

(8)

Note that if $\Phi$ is Hölder continuous, then it can be shown [22] that it satisfies the second inequality in (8)⁷⁷7More specifically, if $\Phi$ is $\eta$ -Hölder continuous with modulus $H$ , then (8) holds with $L=H\rho/2$ and $\zeta=H\eta([1-\eta]/\rho)^{(1-\eta)/\eta}$ for any $\rho>0$ .. Using these properties, we then establish in Proposition A.5 that if $\{Y_{t}\}$ and $\{Y_{t}^{\prime}\}$ are generated by a specific sequence of randomized proximal operators using $\phi$ and $\phi^{\prime}$ , respectively, then (roughly)

D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}(Y_{0}\|Y_{0}^{\prime})\preceq\frac{\alpha}{\sigma^{2}}\cdot{\cal{F}}(T),

where ${\cal F}:\mathbb{R}\mapsto\mathbb{R}$ is non-decreasing and dependent on some assumptions on $\{Y_{t}\}$ and $\{Y_{t}^{\prime}\}$ . More specifically, we obtain a sequence of parameterized shifted Rényi divergence bounds similar to (7), while dealing with the challenge of nonconvexity. In the setting of one dataset pass, we derive the bound by solving a related quadratic programming problem on a similar set of residuals $\{a_{t}\}$ as in (6) (see Appendix B for details).

Lipschitz properties of the DP-SGD update. Denoting ${\cal A}_{\lambda}(\cdot)$ as the DP-SGD update function⁸⁸8Or any SGD-like update as in (12)., i.e., $X_{t}={\cal A}_{\lambda}(X_{t-1})$ for every $t$ in Algorithm 1, we show in Proposition 2.1 that — depending on our assumptions on $h$ and stepsize $\lambda$ — the operator ${\cal A}_{\lambda}(\cdot)$ satisfies the second inequality of (8) with $\Phi=A_{\lambda}$ for different values of $L$ and $\zeta$ .

More specifically, when the domain of $h$ is bounded, we have $(L,\zeta)=(1,2\lambda C)$ in (8) for clipping norm $C$ . On the other hand, when $\lambda$ is sufficiently small, we have $\zeta=0$ and $L$ being a constant that (i) tends to $\sqrt{2}$ when the weak convexity parameter $m$ tends to zero, i.e., $f$ becomes more convex; and (ii) tends to one when no clipping is performed and $m$ in (i) tends to zero. This continuity with respect to the weak convexity parameter appears to be new, and it is proved in Appendix A.2 by using topological properties about weakly-convex functions and proximal operators.

Privacy bounds for DP-SGD. For neighboring DP-SGD iterates $X_{T}$ and $X_{T}^{\prime}$ , we combine the above results in Theorems 2.2 and 2.3 to obtain RDP bounds of the form

D_{\alpha}(X_{T}\|X_{T}^{\prime})\preceq\frac{\alpha}{\sigma^{2}}\cdot{\cal{B}}_{\lambda}(C,b,T,\ell),

(9)

where $C$ , $b$ , $T$ , $\ell$ , are as in Algorithm 1. More specifically, assuming that the DP-SGD iterates are contained within an $\ell_{2}$ ball of diameter $d_{h}$ and each $\nabla f_{i}$ is Lipschitz continuous, we obtain (9) with $B_{\lambda}(C,b,T,\ell)=(L_{\lambda}d_{h}+\lambda C/b)^{2}$ for for some $L_{\lambda}=\sqrt{1+\kappa\lambda m}$ , $\kappa\leq 4$ , and small enough $\lambda$ . When the iterates are (possibly) unbounded, we obtain (9) with

B_{\lambda}(C,b,T,\ell)=\left(1+\left[\frac{T}{\ell}\right]\left[\frac{L^{2\ell}_{\lambda}}{\sum_{i=1}^{\ell}L^{2i}_{\lambda}}\right]\right)\left(\frac{\lambda C}{b}\right)^{2}.

(10)

1.3 Related work

We first present high-level descriptions of related works in the convex and nonconvex settings, followed by more general works that use advanced composition to obtain loose bounds on the last iterate. We then conclude with some summary tables and figures, and a discussion of technical nuances that carefully compares our work to existing literature.

Convex setting. Given the challenge of proving tight bounds in the general setting, a number of prior analyses focus on the convex case. Works by [17, 16] additionally assume Lispchitz continuity of the loss function to obtain results. The work of [12] studies multiple passes over the data, but their results only apply to the smooth, strongly convex, and full batch setting without clipping. The work of [28] improves these results and extend them to mini-batches both with “shuffle and partition” and “sampling without replacement” strategies. Similarly, results in by [2], and its extension [3], consider only convex Lipschitz smooth functions. The contemporary work of [8] introduces the shifted interpolated process under $f$ -DP, allowing for tight characterizations of privacy by iteration for Rényi and other generalized DP definitions.

Notice that none of the above works study clipping, all assume access to the Lipschitz constant of the loss function, and require convexity, limiting their practical viability.

Nonconvex setting. We now discuss papers that do not require convexity of the loss function. [4] analyze the privacy loss dynamics for nonconvex functions, but their analysis differs from ours in two ways. First, they assume that their DP-SGD batches are obtained by Poisson sampling or sampling without replacement. Second, their results require numerically solving a minimization problem that can be hard in practice.

A contemporary work⁹⁹9Appearing after the first version of this preprint. by [11] derives bounds under the assumption that the loss gradient is Hölder continuous and the loss function is Lispchitz continuous when it is also convex. However, this work needs an additional assumption, that constants $L>0$ and $\eta\in(0,1]$ satisfying $\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|^{\eta}$ are known and used in a specific optimization subproblem. While [11] focuses on tight theoretical bounds under specific conditions (such as the full batch and single-epoch setting), we prioritize bridging the gap between theory and practice by addressing the complexities of real-world deployments.

Non-specialized analyses. Prior works on DP-SGD accounting often rely on loose bounds that allow for the release of intermediate updates [20, 6, 1]. These works rely on differential privacy advanced composition results [19], resulting in noise with standard deviation that scales as $\sqrt{T}$ [6, 1]. Alternatively, using disjoint batches decreases the dependence from the number of iterations $T$ to the number of epochs $E$ (see the first row in Table 2). However, the assumption that all intermediate updates are released can be stringent for certain loss regimes [17, 16], where this bound can be contracted based on loss smoothness and convexity parameters, if only the last iterate is released.

While our work focuses on the privacy guarantees of DP-SGD, it’s important to acknowledge the parallel research efforts exploring the convergence properties of shuffling methods. Recent studies, such as [23] and [9], have established convergence bounds for strongly convex and/or smooth functions in various settings, albeit without considering privacy. These works provide valuable insights into the optimization behavior of shuffling techniques, which can inform future research at the intersection of privacy and optimization.

Summary tables and graphs. Table 1 lists and labels some assumptions that are commonly used in the RDP literature. Table 2 compares our bounds against other RDP bounds. Note that the multi-epoch noisy-SGD algorithm in [17, Algorithm 1] only considers the case where the number of dataset passes equals the number of batches in dataset pass and does not consider batched gradients. As a consequence of the latter, its corresponding RDP upper bound does not depend on the batch size $b$ . Figure 1 compares our bounds against other bounds as function of the number of iterations performed, under various settings. Note that we do not consider the multi-epoch bound in [11, Theorem A.1] as it requires solving $T$ nonlinear programming problems, and we do not compare the with the bound in [25] because it exploits the fact that batches are randomly sampled (whereas our bounds assume the input batches must be obtained deterministically).

Label	Description
${\cal I}_{c}$	The regularizer $h$ is the indicator of a closed convex set
${\cal H}$	The domain of a regularizer $h$ , $\operatorname*{dom}h$ , is bounded with diameter $d_{h}$
${\cal B}_{1}$	input data batches are of size one
${\cal D}$	input data batches are disjoint
${\cal P}$	input data batches are obtained using Poisson subsampling with sampling rate $1/\ell$
${\cal S}_{Q}^{0}$	$f_{i}$ is $Q$ -Lipschitz continuous for every $i\in[k]$ , and $Q$ is known
${\cal S}_{L}^{1}$	$\nabla f_{i}$ is $L$ -Lipschitz continuous for every $i\in[k]$ , and $L$ is known
${\cal R}_{H,\eta}^{1}$	$\nabla f_{i}$ is $(H,\eta)$ -Hölder continuous for every $i\in[k]$ , and $H$ and $\eta$ are known
$\Lambda$	the stepsize needs to be small relative to certain smoothness constants
${\cal A}$	the given RDP bounds only hold for a small range of values of $\alpha$ and $\sigma$
${\cal N}$	no gradient clipping is applied or the global sensitivity is known
$\tilde{{\cal N}}$	when $\phi$ is convex, no gradient clipping is applied or the global sensitivity is known

Table 1: List of common assumptions used in RDP accounting literature. For conciseness we let

\ell

denote the number of iterations/batches in a dataset pass and

\operatorname*{dom}h

denote the domain of

h

Source	Asymptotic $(\alpha,\epsilon)-$ RDP upper bounds		Assumptions
Source	Convex $\phi$	Nonconvex $\phi$	(see Table 1)
[14, Theorem 3.1]¹⁰¹⁰10 The bound follows directly from Theorem 8 in [5]	$\dfrac{\alpha E}{2}\left[\dfrac{\lambda C}{\sigma b}\right]^{2}$	same as convex	${\cal D}$
[25, Section 3.3]	numerical procedure¹¹¹¹11Obtained by evaluating an integral using numerical quadrature techniques.	same as convex	${\color[rgb]{1,0,0}{\cal P}}$
[25, Theorem 11]	$\dfrac{\alpha E}{\ell}\left[\dfrac{\lambda C}{b\sigma}\right]^{2}$	same as convex	${\color[rgb]{1,0,0}{\cal A}},{\color[rgb]{1,0,0}{\cal P}}$
[17, Theorem 35]	$\dfrac{\alpha E}{\ell}\left[\dfrac{\lambda Q}{\sigma}\right]^{2}$	none	${\cal I}_{c},{\cal N},\Lambda,S_{Q}^{0},S_{L}^{1},{\color[rgb]{1,0,0}{\cal B}_{1}}$
[2, Theorem 3.1]	$\alpha\left[\dfrac{\lambda Q}{\sigma}\right]^{2}\min\left\{\dfrac{E}{\ell},\dfrac{d_{h}}{Q\lambda k}\right\}$	none	${\cal I}_{c},{\cal N},\Lambda,S_{Q}^{0},S_{L}^{1},{\cal H},{\color[rgb]{1,0,0}{\cal A}}$
[11, Theorem A.1]	numerical procedure¹²¹²12Obtained by solving $T$ nonlinear programming problems.	numerical procedure¹³¹³13Same procedure as in the nonconvex case, but with different parameters.	${\cal I}_{c},\tilde{{\cal N}},\Lambda,{S}_{Q}^{0},{\cal R}_{H,\eta}^{1},{\cal H}$
Ours	$\dfrac{\alpha}{\sigma^{2}}\left[L_{\lambda}d_{h}+\dfrac{\lambda C}{b}\right]^{2}$	same as convex	$\Lambda,S_{L}^{1},{\cal H},{\cal D}$
	$\dfrac{\alpha E}{\ell}\left[\dfrac{\lambda C}{b\sigma}\right]^{2}$	$\alpha E\theta_{L_{\lambda}}(\ell)\left[\dfrac{\lambda C}{\sigma}\right]^{2}$	$\Lambda,S_{L}^{1},{\cal N},{\cal D}$
	$\alpha E\theta_{\sqrt{2}}(\ell)\left[\dfrac{\lambda C}{\sigma}\right]^{2}$	$\alpha E\theta_{\sqrt{2}L_{\lambda}}(\ell)\left[\dfrac{\lambda C}{\sigma}\right]^{2}$	$\Lambda,S_{L}^{1},,{\cal D}$

Table 2: Asymptotic

\varepsilon

upper bounds for

(\alpha,\varepsilon)

-Rényi differential privacy of the last iterate after

T

iterations of DP-SGD. Here,

\lambda

is the stepsize,

C

is the clipping norm,

\sigma

is the standard deviation of the Gaussian noise,

\ell

is the number of iterations in one dataset pass, and

E

is the number of dataset passes. Also,

L_{\lambda}:=\sqrt{1+\lambda\kappa m}

for some

\kappa\leq 4

and weak-convexity parameter

m

(see (11)), and

\theta_{L}(\ell):=L^{2(\ell-1)}/\sum_{i=0}^{\ell-1}L^{2i}

. Particularly strong assumptions are highlighted in red.

Refer to caption — Figure 1: Log-scale last-iterate RDP bounds for Algorithm 1 as a function of number of iterations. The fixed algorithmic parameters are $\lambda=10^{-5}$ , $C=10$ , $\sigma=10^{-5}$ , $T=10^{5}$ , $b=10^{1}$ , and $k=10^{4}$ . The free parameters $m$ , $M$ , and $Q$ are the weak convexity, Lipschitz-smoothness, and Lipschitz constants of $f$ , respectively, while $C$ is the clipping norm. The left plot considers weakly-convex losses under the settings of no gradient clipping. The *right* plot considers the convex case where $Q$ may differ from $C$ . The RDP composition bounds are from [20], the PABI bounds are from [17], and the local DP bounds are from [14].

Technical nuances. The bound in (9) with ${\cal B}_{\lambda}$ as in (10) and $L_{\lambda}=1$ might appear to follow from subsampled RDP composition results such as [25]. However, those results only apply to DP-SGD variants where the batches are sampled randomly, an assumption that does not hold when batches are cyclically traversed. While established Python libraries like Opacus [29] and TensorFlow Privacy [24] implement and account for random sampled batches (such as those obtained by Poisson subsampling), these implementations address a different issue. One has to ensure the optimizer truly samples at random from a pre-specified distribution, which becomes incredibly difficult with large-scale datasets (see Appendix D for an extended discussion). Consequently, any privacy guarantee predicated on this idealized random sampling assumption becomes effectively meaningless when the actual optimization process deviates from it (as is the case with cyclical batch traversal). It is worth mentioning that DP-FTRL [18] was specifically developed to address this gap and the method accepts a potentially weaker DP guarantee in exchange for practical applicability.

For the special case where (i) only one dataset pass is performed, (ii) the objective function is nonconvex , and (iii) each $\nabla f_{i}$ is Lipschitz continuous, the bound in (9) with ${\cal B}_{\lambda}$ as in (10) holds with $T=\ell$ . Consequently, we obtain an RDP bound that does not grow with the number of iterations — in contrast to the RDP composition bounds in [20] which do scale linearly in $T$ or $E$ , depending on the sampling assumption. Note that (16) and Theorem 2.3 show that as $\theta_{L_{\lambda}}(\cdot)\downarrow 1$ , our bounds converge to an expression that depends linearly on $E/\ell$ , matching the bound for PABI. Figure 1 illustrates the regimes under which we improve previous work.

The addition of the composite term $h(\cdot)$ in (1) is not a trivial extension and greatly complicates the analysis. For example, the objective function $\phi$ in (1) is no longer differentiable in general (even on $\operatorname*{dom}\phi$ ), and more general analyses must be used to handle this nonsmoothness. Under stepsize $\lambda$ and $L$ -Lipschitz continuous $\nabla f_{i}$ for $i\in[k]$ , paper [11] shows that the DP-SGD update in the nonconvex case is $(1+\lambda L)$ -Lipschitz continuous, which is independent of the weak convexity constant. Consequently, the established RDP bounds in [11] in the setting where the weak convexity constant $m$ is positive but near zero (i.e., the function is nearly convex) may be an overestimate of the true RDP bound.

For the convex case, it is worth emphasizing that we do not require each to be Lipschitz continuous case in order to bound $\nabla f_{i}$ (see, for example, [17, 2, 16, 11] which do require this assumption). As a consequence, our analysis is applicable to a substantially wider class of objective functions. Moreover, all other existing bounds in the literature of the form in (9) replace the parameter $C$ with a Lipschitz constant $Q$ of $\phi(x)$ from (1), and these bounds are generally proportional to $Q^{2}$ . Consequently, when $C\ll Q$ , e.g., when $\phi(x)$ is a quadratic function on a large compact domain, our bounds are significantly tighter (see Figure 1 for an illustration).

Organization. The remainder of the paper gives a formal presentation of the results, including the key assumptions on (1), the topological properties of the DP-SGD update operator, and the non-asymptotic RDP bounds on the last DP-SGD iterates.

2 Privacy bounds for DP-SGD

This section formally presents the main RDP bounds for the last iterates of Algorithm 1. For conciseness, the lengthy proofs of the main results are given in Appendix A.

We start by precisely giving the assumptions on (1). Given $h:\mathbb{R}^{n}\mapsto(-\infty,\infty]$ and $f_{i}:\operatorname*{dom}h\mapsto\mathbb{R}$ for $i\in[k]$ , consider the following assumptions:

(A1)

$h\in\overline{{\rm Conv}}\,(\mathbb{R}^{n})$ ;

(A2)

there exists $m,M\geq 0$ such that for $i\in[k]$ the function $f_{i}$ is differentiable on an open set containing $\operatorname*{dom}h$ and

-\frac{m}{2}\|x-y\|^{2}\leq f_{i}(x)-f_{i}(y)-\left\langle\nabla f_{i}(y),x-y\right\rangle\leq\frac{M}{2}\|x-y\|^{2}\quad\forall x,y\in\operatorname*{dom}h.

(11)

We now give a few remarks about (A1)–(A2). First, (A1) is necessary to ensure that $\operatorname{prox}_{h}(\cdot)$ is well-defined. Second, it can be shown¹⁴¹⁴14See, for example, [7, 26] and [21, Proposition 2.1.55]. that (A2) is equivalent to the assumption that $\nabla f$ is $M$ -Lipschitz continuous when $m=M$ . Third, the lower bound in (11) is equivalent to assuming that $f_{i}(\cdot)+m\|\cdot\|^{2}/2$ is convex and, hence, if $m=0$ then $f_{i}$ is convex. The parameter $m$ is often called a weak-convexity parameter of $f_{i}$ . Fourth, using symmetry arguments and the third remark, if $M=0$ then $f_{i}$ is concave. Finally, the third remark motivates why we choose two parameters, $m$ and $M$ , in (11). Specifically, we use $m$ (resp. $M$ ) to develop results that can be described in terms of the level of convexity (resp. concavity) of the problem.

We now develop the some properties of an SGD-like update. Given $\{q_{i}\}\subseteq\overline{{\rm Conv}}\,(\mathbb{R}^{n})$ with $\operatorname*{dom}q_{i}\subseteq\operatorname*{dom}h$ and $B\subseteq[k]$ , define the prox-linear operator

{\cal A}_{\lambda}(x)={\cal A}_{\lambda}(x,\{f_{i}\},\{q_{i}\}):=x-\frac{\lambda}{|B|}\sum_{i\in B}\operatorname{prox}_{q_{i}}(\nabla f_{i}(x)).

(12)

Clearly, when $\operatorname{prox}_{q_{i}}(y)=y$ for every $y\in\mathbb{R}^{n}$ , the above update corresponds to a SGD step applied to the problem of minimizing $\sum_{i=1}^{k}f_{i}(z)$ (with respect to $z$ ) under the stepsize $\lambda$ and starting point $x$ . Moreover, while it is straightforward to show that ${\cal A}_{\lambda}(\cdot)$ is $(1+\lambda\max\{m,M\})$ -Lipschitz continuous when $\{f_{i}\}$ satisfies (A2)¹⁵¹⁵15See, for example, [11, Appendix A.6]., we prove in Proposition 2.1(b) that the Lipschitz constant can be improved to $\sqrt{1+\kappa\lambda m}$ for some $\kappa\leq 4$ when $\lambda$ is small. Notice that the former constant does not converge to one when $m\to 0$ , e.g., when $f_{i}$ becomes more convex, while the latter one does.

We are now present some important properties of ${\cal A}_{\lambda}(\cdot)$ .

Proposition 2.1.

Let $(m,M)$ be as in assumption (A2), and define

L_{\lambda}=L_{\lambda}(m,M):=\sqrt{1+2\lambda m\left[1+\frac{m}{2(M+m)}\right]}\quad\forall\lambda>0.

(13)

Then, the following statements hold:

(a)

if $\operatorname*{dom}q_{i}$ is bounded with diameter $C$ for $i\in[k]$ , then for any $\lambda>0$ we have

$\|{\cal A}_{\lambda}(x)-{\cal A}_{\lambda}(y)\|\leq\|x-y\|+2\lambda C\quad\forall x,y\in\operatorname*{dom}h;$ (14)
(b)

if $\{f_{i}\}$ satisfies (A2) and $\lambda\leq 1/[2(M+m)]$ then ${\cal A}_{\lambda}(\cdot)$ is $\sqrt{2}L_{\lambda}$ -Lipschitz continuous;
(c)

if $\{f_{i}\}$ satisfies (A2), $\lambda\leq 1/(M+m)$ , and $\nabla f_{i}(x)=\operatorname{prox}_{q_{i}}(\nabla f_{i}(x))$ for every $i\in[k]$ and $x\in\operatorname*{dom}h$ , then ${\cal A}_{\lambda}(\cdot)$ is $L_{\lambda}$ -Lipschitz continuous on $\operatorname*{dom}h$ .

Some remarks are in order. First, $L_{\lambda}(0,M)=1$ and, hence, ${\cal{A}}_{\lambda}(\cdot)$ is nonexpansive when $f_{i}$ is convex for every $i\in[k]$ , $\lambda\leq 1/M$ , and $\nabla f(x_{i})=\operatorname{prox}_{q_{i}}(\nabla f(x_{i}))$ . Second, if $\lambda=1/(2m)$ then $L_{\lambda}(m,0)=\sqrt{3}$ and, hence, ${\cal{A}}_{\lambda}(\cdot)$ $\sqrt{6}$ -Lispchitz continuous when $f_{i}$ is concave. Third, like the first remark, $L_{0}(m,M)=1$ implies that ${\cal{A}}_{\lambda}(\cdot)$ is nonexpansive. Finally, when $m=M$ and $\lambda=1/(2m)$ , we have $L_{\lambda}(m,M)=\sqrt{5/2}$ and ${\cal{A}}_{\lambda}(\cdot)$ is $\sqrt{5}$ -Lispchitz continuous.

For the remainder of this section, suppose $h$ satisfies (A1) and let $f_{i}^{\prime}:\mathbb{R}^{n}\mapsto(-\infty,\infty]$ for $i\in[k]$ be such that there exists $i^{*}\in[k]$ where $f_{i}^{\prime}=f_{i}$ for every $i\neq i^{*}$ and $f_{i^{*}}^{\prime}\neq f_{i^{*}}$ , i.e., $\{f_{i}\}\sim\{f_{i}^{\prime}\}$ . That is, $i^{*}$ is the index where the neighboring datasets $\{f_{i}\}$ and $\{f_{i}^{\prime}\}$ differ.

We also make use of the follow assumption.

(A3)

The functions $\{f_{i}^{\prime}\}$ satisfy assumption (A2) with $f_{i}=f_{i}^{\prime}$ for every $i\in[k]$ .

We now present the main RDP bounds in terms of the following constants:

d_{h}:=\sup\{\|x-y\|:x,y\in\operatorname*{dom}h\},\quad\theta_{L}(0):=0,\quad\theta_{L}(s):=\frac{L^{2(s-1)}}{\sum_{j=0}^{s-1}L^{2j}}\quad\forall s\geq 1.

(15)

We first present a bound where $\operatorname*{dom}h$ is bounded with diameter $d_{h}<\infty$ .

Theorem 2.2.

Let $\{X_{t}\}$ and $\{X_{t}^{\prime}\}$ denote the iterates generated by Algorithm 1 with per-example loss functions $\{f_{i}\}$ and $\{f_{i}^{\prime}\}$ , respectively, and fixed $\lambda$ , $C$ , $b$ , $\sigma$ , $\{N_{t}\}$ , $T$ , and $X_{0}$ for both sequences of iterates. If $\lambda\leq 1/[2(m+M)]$ and (A1)–(A3) hold, then

\displaystyle D_{\alpha}(X_{T}\|X_{T}^{\prime})

\displaystyle\leq\frac{\alpha}{2\sigma^{2}}\left(L_{\lambda}d_{h}+\frac{2\lambda C}{b}\right)^{2},

(16)

where $L_{\lambda}$ and $d_{h}$ are as in (13) and (15), respectively.

We now present the RDP bounds for when $\operatorname*{dom}h$ is (possibly) unbounded.

Theorem 2.3.

Let $\{X_{t}\}$ , $\{X_{t}^{\prime}\}$ , $\lambda$ , $\sigma$ , $b$ , $C$ , and $T$ be as in Theorem 2.2, and denote $\ell:=k/b$ and $E:=\lfloor T/\ell\rfloor$ . If $\lambda\leq 1/[2(m+M)]$ and (A1)–(A3) hold, then

D_{\alpha}(X_{T}\|X_{T}^{\prime})\leq 4\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[1+E\cdot\theta_{\sqrt{2}L_{\lambda}}(\ell)\right],

(17)

where $L_{\lambda}$ and $\theta_{L}(\cdot)$ are as in (13) and (15), respectively. On the other hand, if

\max_{i\in[k],t\in[T]}\{\|\nabla f_{i}(X_{t})\|,\|\nabla f_{i}(X_{t}^{\prime})\|\}\leq C,

(18)

i.e., no gradient clipping is performed, and $\lambda\leq{1}/{(M+m)}$ then

D_{\alpha}(X_{T}\|X_{T}^{\prime})\leq 4\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[1+E\cdot\theta_{L_{\lambda}}(\ell)\right].

(19)

We conclude with a few remarks about the above bounds. First, the bound in (19) is on the same order of magnitude as the bound in [17] in terms of $T$ and $\ell$ when $L_{\lambda}=1$ . However, the right-hand-side of (19) scales linearly with a $\lambda^{2}$ term, which does not appear in [17]. Second, as $\theta_{L_{\lambda}}(\cdot)\leq 1$ , the right-hand-sides of (17) and (19) increases (at most) linearly with respect to the number of dataset passes $E$ . Third, substituting $\sigma=\Theta({C/[b\sqrt{\epsilon}]})$ in (17) yields a bound that depends linearly on $\varepsilon$ and is invariant to changes in $C$ and $b$ . In Appendix C, we discuss further choices of the parameters in (19) and their properties.

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, 2016.
[2] Jason Altschuler and Kunal Talwar. Privacy of noisy stochastic gradient descent: More iterations without more privacy loss. Advances in Neural Information Processing Systems (NeurIPS), 2022.
[3] Jason M. Altschuler, Jinho Bok, and Kunal Talwar. On the privacy of noisy stochastic gradient descent for convex optimization. SIAM Journal on Computing, 2024.
[4] Shahab Asoodeh and Mario Díaz. Privacy loss of noisy stochastic gradient descent might converge even for non-convex losses. arXiv preprint arXiv:2305.09903, 2023.
[5] Borja Balle and Yu-Xiang Wang. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning (ICML). PMLR, 2018.
[6] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Symposium on foundations of computer science, 2014.
[7] Amir Beck. First-order methods in optimization. SIAM, 2017.
[8] Jinho Bok, Weijie Su, and Jason M Altschuler. Shifted interpolation for differential privacy. arXiv preprint arXiv:2403.00278, 2024.
[9] Xufeng Cai and Jelena Diakonikolas. Last iterate convergence of incremental methods and applications in continual learning. arXiv preprint arXiv:2403.06873, 2024.
[10] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 2011.
[11] Eli Chien and Pan Li. Convergent privacy loss of noisy-SGD without convexity and smoothness. arXiv preprint arXiv:2410.01068, 2024.
[12] Rishav Chourasia, Jiayuan Ye, and Reza Shokri. Differential privacy dynamics of Langevin diffusion and noisy gradient descent. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[13] Lynn Chua, Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, and Chiyuan Zhang. How private are DP-SGD implementations? In International Conference on Machine Learning (ICML), 2024.
[14] Lynn Chua, Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, and Chiyuan Zhang. Scalable DP-SGD: Shuffling vs. poisson subsampling. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
[15] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 2014.
[16] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In ACM SIGACT Symposium on Theory of Computing (STOC), 2020.
[17] Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. In Annual Symposium on Foundations of Computer Science (FOCS), 2018.
[18] Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and private (deep) learning without sampling or shuffling. In International Conference on Machine Learning, pages 5213–5225. PMLR, 2021.
[19] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International Conference on Machine Learning (ICML), 2015.
[20] Georgios Kaissis, Moritz Knolle, Friederike Jungmann, Alexander Ziller, Dmitrii Usynin, and Daniel Rueckert. A unified interpretation of the gaussian mechanism for differential privacy through the sensitivity index. arXiv preprint arXiv:2109.10528, 2021.
[21] Weiwei Kong. Accelerated inexact first-order methods for solving nonconvex composite optimization problems. arXiv preprint arXiv:2104.09685, 2021.
[22] Jiaming Liang and Renato D.C. Monteiro. A unified analysis of a class of proximal bundle methods for solving hybrid convex composite optimization problems. Mathematics of Operations Research, 49(2):832–855, 2024.
[23] Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. In International Conference on Machine Learning (ICML), 2024.
[24] Google LLC. Tensorflow privacy. https://github.com/tensorflow/privacy, 2019.
[25] Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the sampled Gaussian mechanism. arXiv preprint arXiv:1908.10530, 2019.
[26] Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018.
[27] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In IEEE Global Conference on Signal and Information Processing. IEEE, 2013.
[28] Jiayuan Ye and Reza Shokri. Differentially private learning needs hidden state (or much faster convergence). Advances in Neural Information Processing Systems (NeurIPS), 2022.
[29] Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, Graham Cormode, and Ilya Mironov. Opacus: User-friendly differential privacy library in PyTorch. arXiv preprint arXiv:2109.12298, 2021.

Appendix A Derivation of main results

This appendix derives the main results, namely, Theorems 2.2 and 2.3. It contains three subappendices. The first one derives important properties of a family of randomized operators, the second specializes these results to the DP-SGD update operator in (12), and the last one gives the proofs of Theorems 2.2 and 2.3 using the previous two subappendices.

A.1 General operator analysis

This subappendix gives some crucial properties about randomized proximal Lipschitz operators, which consist of evaluating a Lipschitz proximal operator followed by adding Gaussian noise. More specifically, it establishes several RDP bounds based on the closeness of neighboring operators.

We first bound the shifted Rényi divergence of a randomized proximal Lipschitz operator. The proof of this result is a straightforward extension of the argument in [17, Theorem 22] from $1$ -Lipschitz operators to $L$ -Lipschitz operators with additive residuals.

To begin, we present two calculus rules for the shifted Rényi divergence given in (5). In particular, the proof of the second rule is a minor modification of the proof given for [17, Lemma 21].

Lemma A.1.

For random variables $\{X^{\prime},X,Z\}$ and $a,s\geq 0$ and $\alpha\in(1,\infty)$ , it holds that

(a)

$D_{\alpha}^{(\tau)}(X+Z\|X^{\prime}+Z)\leq D_{\alpha}^{(\tau+a)}(X\|X^{\prime})+\sup_{c\in\mathbb{R}^{n}}\{D_{\alpha}([Z+c]\|Z):\|c\|\leq a\}$ ;

(b)

for some $L,\zeta>0$ , if $\phi^{\prime}$ and $\phi$ satisfy

\sup_{u}\|\phi^{\prime}(u)-\phi(u)\|\leq s,\quad\|\Phi(x)-\Phi(y)\|\leq L\|x-y\|+\zeta,\quad\forall\Phi\in\{\phi,\phi^{\prime}\},

for any $x,y\in\operatorname*{dom}\phi\cap\operatorname*{dom}\phi^{\prime}$ , then

D_{\alpha}^{(L\tau+\zeta+s)}(\phi(X)\|\phi^{\prime}(X^{\prime}))\leq D_{\alpha}^{(\tau)}(X\|X^{\prime}).

Proof.

(a) See [17, Lemma 20].

(b) By the definitions of of $D_{\alpha}^{(\tau)}(\mu\|\nu)$ and ${\cal W}_{\infty}(\mu,\nu)$ , there exist a joint distribution $(X,Y)$ such that $D_{\alpha}(Y\|X^{\prime})=D_{\alpha}^{(\tau)}(X\|X^{\prime})$ and $\|X-Y\|\leq\tau$ almost surely. Now, the post-processing property of Rényi divergence implies that

D_{\alpha}(\phi(Y)\|\phi^{\prime}(X^{\prime}))\leq D_{\alpha}(\phi(Y)\|X^{\prime})\leq D_{\alpha}(Y\|X^{\prime})=D_{\alpha}^{(\tau)}(X\|X^{\prime}).

Using our assumptions on $\phi$ and $\phi^{\prime}$ and the triangle inequality, we then have

	$\displaystyle\\|\phi(X)-\phi^{\prime}(Y)\\|$	$\displaystyle\leq\\|\phi(X)-\phi(Y)\\|+\\|\phi(Y)-\phi^{\prime}(Y)\\|$
		$\displaystyle\leq L\\|X-Y\\|+\zeta+s\leq L\tau+\zeta+s,$

almost surely. Combining the previous two inequalities, yields the desired bound in view of the definitions of $D_{\alpha}^{(\tau)}(\mu\|\nu)$ and ${\cal W}_{\infty}(\mu,\nu)$ . ∎

The next result is the aforemention shifted RDP bound.

Lemma A.2.

For some $L,\zeta\geq 0$ , suppose $\phi^{\prime}$ and $\phi$ satisfy (8) for any $x,y\in\operatorname*{dom}\phi\cap\operatorname*{dom}\phi^{\prime}$ . Moreover, let $Z\sim{\cal N}(0,\sigma^{2}I)$ and $\psi\in\overline{{\rm Conv}}\,(\mathbb{R}^{n})$ . Then, for any scalars $a,\tau\geq 0$ and $\alpha\in(1,\infty)$ satisfying $L\tau+\zeta+s-a\geq 0,$ and random variables $Y$ and $Y^{\prime}$ , it holds that

D_{\alpha}^{(L\tau+\zeta+s-a)}(\operatorname{prox}_{\psi}(\phi(Y)+Z)\|\operatorname{prox}_{\psi}(\phi^{\prime}(Y^{\prime})+Z))\leq D_{\alpha}^{(\tau)}(Y\|Y^{\prime})+\frac{\alpha a^{2}}{2\sigma^{2}}.

(20)

Proof.

We first have that

\sup_{\tau\in\mathbb{R}^{n}}\{D_{\alpha}([Z+c]\|Z):\|c\|\leq a\}=\sup_{c\in\mathbb{R}^{n}}\left\{\frac{\alpha c^{2}}{2\sigma^{2}}:\|c\|\leq a\right\}=\frac{\alpha a^{2}}{2\sigma^{2}},

(21)

from the well-known properties of the Rényi divergence. Using (21), Lemma A.1(a) with $(X,X^{\prime})=(\phi(Y),\phi^{\prime}(Y^{\prime}))$ , and Lemma A.1(b) with $(\phi,\phi^{\prime},L,s)\in\{(\phi,\phi^{\prime},L,s),({\rm prox}_{\psi},{\rm prox}_{\psi},1,0)\}$ , we have

	$\displaystyle D_{\alpha}^{(L\tau+s-a)}({\rm prox}_{\psi}(\phi(Y)+Z)\\|{\rm prox}_{\psi}(\phi^{\prime}(Y^{\prime})+Z))\leq D_{\alpha}^{(L\tau+s-a)}(\phi(Y)+Z\\|\phi^{\prime}(Y^{\prime})+Z)$
	$\displaystyle\leq D_{\alpha}^{(L\tau+s)}(\phi(Y)\\|\phi^{\prime}(Y^{\prime}))+\frac{\alpha a^{2}}{2\sigma^{2}}\leq D_{\alpha}^{(\tau)}(Y\\|Y^{\prime})+\frac{\alpha a^{2}}{2\sigma^{2}}.$

∎

Note that the second inequality in (8) is equivalent to $\Phi$ being $L$ -Lipschitz continuous when $\zeta=0$ , and that the conditions in (8) need to only hold on $\operatorname*{dom}\phi\cap\operatorname*{dom}\phi^{\prime}$ .

We next apply (20) to a sequence of points generated by the update

Y\xleftarrow{}\operatorname{prox}_{\psi}(\phi(Y)+Z)

(22)

under different assumptions on $\zeta$ and $\tau$ and a single dataset pass. Before proceeding, we first present a technical lemma.

Lemma A.3.

Given scalars $L>1$ and positive integer $T\geq 1$ , let

S_{T}:=\sum_{i=0}^{T-1}L^{2i},\quad b_{t}:=\left(\frac{L^{T-t}}{S_{T}}\right)L^{T-1},\quad R_{t}:=L^{t-1}-\sum_{i=1}^{t}b_{i}L^{t-i},\quad t\geq 0

(23)

Then, for every $t\in[T]$ ,

(a)

$R_{t+1}=LR_{t}-b_{t+1}$ ;
(b)

$R_{t}\geq 0$ and $R_{T}=0$ ;
(c)

$\sum_{t=1}^{T}b_{t}^{2}=\theta_{L}(T)$ .

Proof.

Let $t\in[T]$ be fixed.

(a) This is immediate from the definition of $R_{t}$ .

(b) We have that

	$\displaystyle S_{T}R_{t}$	$\displaystyle=S_{T}\left(L^{t}-\sum_{i=1}^{t}b_{i}L^{t-i}\right)=\sum_{i=0}^{T-1}L^{2i+t-1}-\sum_{i=1}^{t}L^{2T+t-2i-1}=L^{t-1}\left[\sum_{i=0}^{T-1}L^{2i}-\sum_{i=1}^{t}L^{2(T-i)}\right]$
		$\displaystyle=L^{t-1}\sum_{i=0}^{T-1-t}L^{2i}\geq 0.$

Evaluating the above expression at $t=T$ clearly gives $R_{T}=0$ .

\displaystyle\sum_{t=1}^{T}b_{t}^{2}

\displaystyle=\frac{L^{2(T-1)}\sum_{i=0}^{T-1}L^{2i}}{\left(\sum_{i=0}^{T-1}L^{2i}\right)^{2}}=\frac{L^{2(T-1)}}{S_{T}}=\theta_{L}(T).

∎

We now present the shifted RDP properties of the update in (22). This particular result generalizes the one in [17], which only considers the case of $L=1$ and $\zeta=0$ .

Lemma A.4.

Let $L,\zeta\geq 0$ , $T\geq 1$ , and $\ell\in[T]$ be fixed. Given $\psi\in\overline{{\rm Conv}}\,(\mathbb{R}^{n})$ , suppose $\{\phi_{t}\}_{t=1}^{T}$ , $\{\phi_{t}^{\prime}\}_{t=1}^{T}$ , and $\bar{s}>0$ satisfy (8) with

\phi=\phi_{t},\quad\phi^{\prime}=\phi_{t}^{\prime},\quad s=\begin{cases}\bar{s},&t=1\bmod\ell,\\ 0,&\text{otherwise},\end{cases}\quad\forall t\in[T].

Moreover, given $Y_{0},Y_{0}^{\prime}\in\operatorname*{dom}\psi$ , let $Z_{t}\sim{\cal N}(0,\sigma^{2}I)$ , and define the random variables

\begin{gathered}Y_{t}:=\operatorname{prox}_{\psi}(\phi_{t}(Y_{t-1})+Z_{t}),\quad Y_{t}^{\prime}:=\operatorname{prox}_{\psi}(\phi_{t}^{\prime}(Y_{t-1}^{\prime})+Z_{t}),\quad\forall t\geq 1.\end{gathered}

If $T=\ell$ , then the following statements hold:

(a)

if $\zeta=0$ , then

D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}^{(\tau)}(Y_{0}\|Y_{0}^{\prime})\leq\frac{\alpha}{2}\left(\frac{L\tau+\bar{s}}{\sigma}\right)^{2}\theta_{L}(T);

(24)

(b)

if $\tau=0$ , $L=1$ , and $\zeta=\bar{s}$ , then

$D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}(Y_{0}\|Y_{0}^{\prime})\leq 2\alpha T\left(\frac{\zeta}{\sigma}\right)^{2}$ (25)

Proof.

(a) Let $s=\bar{s}$ . Our goal is to recursively apply (20) with suitable choices of the free parameter $a$ at each application. Specifically, let $\{(b_{t},R_{t},S_{T})\}$ be as in (23), and define

a_{t}:=(L\tau+s)b_{t}\quad\forall t\geq 1.

Using Lemma A.3(a)–(b), we first have $L\tau+s-a_{1}=(L\tau+s)R_{1}\geq 0$ and, hence, by Lemma A.2, we have

D_{\alpha}^{([L\tau+s]R_{1})}(Y_{1}\|Y_{1}^{\prime})=D_{\alpha}^{(L\tau+{s}-a_{1})}(Y_{1}\|Y_{1}^{\prime})\leq D_{\alpha}^{(\tau)}(Y_{0}\|Y_{0}^{\prime})+\frac{\alpha a_{1}^{2}}{2\sigma^{2}}.

Since Lemma A.3(a)–(b) also implies $R_{t}\geq 0$ and we have $s_{t}=0$ for $t\geq 2$ , we repeatedly apply Lemma A.2 with $(a,\tau)=(a_{t},\tau_{t})=(a_{t},0)$ for $t\geq 2$ to obtain

	$\displaystyle D_{\alpha}^{(\tau)}(Y_{0}\\|Y_{0}^{\prime})$	$\displaystyle\geq D_{\alpha}^{([L\tau+s]R_{1})}(Y_{1}\\|Y_{1}^{\prime})-\frac{\alpha a_{1}^{2}}{2\sigma^{2}}\geq D_{\alpha}^{([L\tau+s]LR_{1}-a_{2})}(Y_{2}\\|Y_{2}^{\prime})-\frac{\alpha(a_{1}^{2}+a_{2}^{2})}{2\sigma^{2}}$
		$\displaystyle=D_{\alpha}^{([L\tau+s]R_{2})}(Y_{2}\\|Y_{2}^{\prime})-\frac{\alpha(a_{1}^{2}+a_{2}^{2})}{2\sigma^{2}}\geq\cdots$
		$\displaystyle\geq D_{\alpha}^{([L\tau+s]R_{T})}(Y_{T}\\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}=D_{\alpha}^{(0)}(Y_{T}\\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}$
		$\displaystyle=D_{\alpha}(Y_{T}\\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}.$

It now remains to bound $\alpha\sum_{i=1}^{T}a_{i}^{2}/(2\sigma^{2})$ . Using Lemma A.3(c) and the fact that $T=\ell$ and $\bar{s}=s$ , we have

\displaystyle\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}

\displaystyle=\frac{\alpha}{2\sigma^{2}}\left[(L\tau+s)^{2}\sum_{i=2}^{T}b_{i}^{2}\right]\leq\frac{\alpha}{2}\left(\frac{L\tau+\bar{s}}{\sigma}\right)^{2}\theta_{L}(T).

Combining this bound with the previous one yields the desired conclusion.

(b) Let $s=\bar{s}$ . Similar to (a), our goal is to recursively apply (20) with suitable choices of the free parameter $a$ at each application. For this setting, let $a_{1}=\zeta+s$ and $a_{t}=\zeta$ for $t\geq 2$ . Using the fact that $\tau=0$ and $L=1$ and Lemma A.2, we first have that

D_{\alpha}(Y_{1}\|Y_{1}^{\prime})=D_{\alpha}^{(0)}(Y_{1}\|Y_{1}^{\prime})=D_{\alpha}^{(s+\zeta-a_{1})}(Y_{1}\|Y_{1}^{\prime})\leq D_{\alpha}^{(0)}(Y_{0}\|Y_{0}^{\prime})+\frac{\alpha a_{1}^{2}}{2\sigma^{2}}.

We then repeatedly apply Lemma A.2 with $(a,\tau)=(a_{t},0)$ for $t\geq 2$ to obtain

	$\displaystyle D_{\alpha}^{(0)}(Y_{0}\\|Y_{0}^{\prime})$	$\displaystyle\geq D_{\alpha}^{(0)}(Y_{1}\\|Y_{1}^{\prime})-\frac{\alpha a_{1}^{2}}{2\sigma^{2}}\geq D_{\alpha}^{(\zeta-a_{2})}(Y_{2}\\|Y_{2}^{\prime})-\frac{\alpha(a_{1}^{2}+a_{2}^{2})}{2\sigma^{2}}$
		$\displaystyle=D_{\alpha}^{(0)}(Y_{2}\\|Y_{2}^{\prime})-\frac{\alpha(a_{1}^{2}+a_{2}^{2})}{2\sigma^{2}}\geq\cdots$
		$\displaystyle\geq D_{\alpha}^{(\zeta-a_{T})}(Y_{T}\\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}=D_{\alpha}^{(0)}(Y_{T}\\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}$
		$\displaystyle=D_{\alpha}(Y_{T}\\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}.$

It now remains to bound $\alpha\sum_{i=1}^{T}a_{i}^{2}/(2\sigma^{2})$ . Using the definition of $\{a_{t}\}$ and the fact that $\zeta=s$ , it holds that

\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}\leq\frac{\alpha}{2\sigma^{2}}\left[4\zeta^{2}+(T-1)\zeta^{2}\right]\leq 2\alpha T\left(\frac{\zeta}{\sigma}\right)^{2}.

Combining this bound with the previous one yields the desired conclusion. ∎

We next extend the above result to multiple dataset passes.

Proposition A.5.

Let $L$ , $\tau$ , $\zeta$ , $\bar{s}$ , $\{Y_{t}\}$ , $\{Y_{t}^{\prime}\}$ , $\ell$ , and $T$ be as in Lemma A.4. Moreover, let $\theta_{L}(\cdot)$ be as in (15). For any $\tau\geq 0$ and $E=\lfloor T/\ell\rfloor$ , the following statements hold:

(a)

if $\zeta=0$ , then

	$\displaystyle D_{\alpha}(Y_{T}\\|Y_{T}^{\prime})-D_{\alpha}^{(\tau)}(Y_{0}\\|Y_{0}^{\prime})$
	$\displaystyle\leq\frac{\alpha}{2\sigma^{2}}\left[(L\tau+\bar{s})^{2}\theta_{L}(\ell)+\bar{s}^{2}\left\{(E-1)\theta_{L}(\ell)+\theta_{L}(T-E\ell)\right\}\right].$		(26)

(b)

if $\tau=0$ and $\zeta=\bar{s}$ , then

\displaystyle D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}(Y_{0}\|Y_{0}^{\prime})\leq 2\alpha T\left(\frac{\zeta}{\sigma}\right)^{2}.

(27)

Proof.

(a) Let $s=\bar{s}$ . For convenience, define

\displaystyle{\cal B}_{1}(\tau,T)

\displaystyle:=\frac{\alpha}{2}\left(\frac{L\tau+s}{\sigma}\right)^{2}\theta_{L}(T),\quad{\cal B}_{2}:=\frac{\alpha}{\sigma^{2}}\left[(L\tau+s)^{2}+s^{2}\left\{(E-1)\theta_{L}(\ell)+\theta_{L}(T-E\ell)\right\}\right].

Using Lemma A.4(a), we have that for the first $\ell$ iterates,

D_{\alpha}(Y_{\ell}\|Y_{\ell}^{\prime})-D_{\alpha}^{(\tau)}(Y_{0}\|Y_{0}^{\prime})\leq{\cal B}_{1}(\tau,\ell).

Similarly, using part Lemma A.4(a) with $\tau=0$ , we have that

D_{\alpha}(Y_{[k+1]\ell}\|Y_{[k+1]\ell}^{\prime})-D_{\alpha}^{(0)}(Y_{\ell}\|Y_{\ell}^{\prime})\leq{\cal B}_{1}(0,\ell),

for any $1\leq k\leq E-1$ . Finally, using part Lemma A.4(a) with $T=T-E\ell$ and $\tau=0$ , we have that

D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}^{(0)}(Y_{E\ell}\|Y_{E\ell}^{\prime})\leq{\cal B}_{1}(0,T-E\ell).

Summing the above three inequalities, using the fact that $D_{\alpha}^{(0)}(X\|Y)=D_{\alpha}(X\|Y)$ , and using the definition of ${\cal B}_{2}$ we conclude that

\displaystyle D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}(Y_{0}\|Y_{0}^{\prime})\leq{\cal B}_{1}(\tau,\ell)+(E-1){\cal B}_{1}(0,\ell)+{\cal B}_{1}(0,T-E\ell)={\cal B}_{2}.

(b) The proof follows similarly to (a). Repeatedly using Lemma A.4(b) at increments of $\ell$ iterations up to iteration $E\ell$ , we have that

	$\displaystyle D_{\alpha}(Y_{E\ell}\\|Y_{E\ell}^{\prime})$	$\displaystyle\leq D_{\alpha}(Y_{(E-1)\ell}\\|Y_{(E-1)\ell}^{\prime})+2\alpha\ell\left(\frac{\zeta}{\sigma}\right)^{2}\leq D_{\alpha}(Y_{(E-2)\ell}\\|Y_{(E-2)\ell}^{\prime})+4\alpha\ell\left(\frac{\zeta}{\sigma}\right)^{2}\leq\cdots$
		$\displaystyle\leq D_{\alpha}(Y_{0}\\|Y_{0}^{\prime})+2\alpha E\ell\left(\frac{\zeta}{\sigma}\right)^{2}.$

For the last $T-E\ell$ iterations, we use Lemma A.4(b) with $T=T-E\ell$ and the above bound to obtain

\displaystyle D_{\alpha}(Y_{T}\|Y_{T}^{\prime})

\displaystyle\leq D_{\alpha}(Y_{E\ell}\|Y_{E\ell}^{\prime})+2\alpha[T-E\ell]\left(\frac{\zeta}{\sigma}\right)^{2}\leq D_{\alpha}(Y_{0}\|Y_{0}^{\prime})+2\alpha T\left(\frac{\zeta}{\sigma}\right)^{2}.

∎

Some remarks about Proposition A.5 are in order. First, part (a) shows that if $\phi_{t}$ and $\phi_{t}^{\prime}$ only differ at $t=1$ , then $D_{\alpha}(Y_{T}\|Y_{T}^{\prime})$ is finite for any $T$ . Second, part (a) also shows that if $\phi_{t}$ and $\phi_{t}^{\prime}$ differ cyclically for a cycle length of $\ell$ , then the divergence between $Y_{T}$ and $Y_{T}^{\prime}$ grows linearly with the number of cycles $E$ . Third, part (b) gives a bound that is independent of $L$ . Finally, both of the bounds in parts (a) and (b) can be viewed as Rényi divergences between the Gaussian random variables ${\cal N}(0,\sigma^{2}I)$ and ${\cal N}(\mu,\sigma^{2}I)$ for different values of $\mu$ .

In Appendix B, we give a detailed discussion of how the residuals $a$ from Lemma A.2 are chosen to prove Proposition A.5(a). In particular, we prove that the chosen residuals yield the tightest RDP bound that can achieved by repeatedly applying (20).

A.2 SGD operator analysis

This subappendix derives some important properties about the DP-SGD update operator $\cal{A}_{\lambda}(\cdot)$ in (12) and also contains the proof of Proposition 2.1.

To start, we recall the following well-known bound from convex analysis. Its proof can be found, for example, in [7, Theorem 5.8(iv)].

Lemma A.6.

Let $F:\operatorname*{dom}h\mapsto\mathbb{R}$ be convex and differentiable. Then $F$ satisfies

F(x)-F(y)-\left\langle\nabla F(y),x-y\right\rangle\leq\frac{L}{2}\|x-y\|^{2}\quad\forall x,y\in\operatorname*{dom}h

(28)

if and only if

\left\langle\nabla F(x)-\nabla F(y),x-y\right\rangle\geq\frac{1}{L}\|\nabla F(x)-\nabla F(y)\|^{2}\quad\forall x,y\in\operatorname*{dom}h.

We next give a technical bound on $f_{i}$ , which generalizes the co-coercivity of convex functions to weakly-convex functions.

Lemma A.7.

For any $x,y\in\operatorname*{dom}h$ and $f_{i}$ satisfying (A2), it holds that

\displaystyle\left\langle\nabla f_{i}(x)-\nabla f_{i}(y),x-y\right\rangle\geq-m\left[1+\frac{m}{2(M+m)}\right]\|x-y\|^{2}+\frac{1}{2(M+m)}\|\nabla f_{i}(x)-\nabla f_{i}(y)\|^{2}.

Proof.

Define $F=f_{i}+m\|\cdot\|^{2}/2$ and let $x,y\in\operatorname*{dom}h$ be fixed. Moreover, note that $F$ is convex and satisfies (28) with $L=M+m$ . It then follows from Lemma A.6 with $L=M+m$ that

	$\displaystyle\frac{1}{M+m}\\|\nabla F(x)-\nabla F(y)\\|^{2}$	$\displaystyle=\frac{1}{M+m}\\|\nabla f_{i}(x)-\nabla f_{i}(y)+m(x-y)\\|^{2}$
		$\displaystyle\leq\left\langle\nabla F(x)-\nabla F(y),x-y\right\rangle$
		$\displaystyle=\left\langle\nabla f_{i}(x)-\nabla f_{i}(y),x-y\right\rangle+m\\|x-y\\|^{2}.$

Applying the bound $\|a+b\|^{2}/2\leq\|a\|^{2}+\|b\|^{2}$ with $a=\nabla f_{i}(x)-\nabla f_{i}(y)+m(x-y)$ and $b=-m(x-y)$ , the above inequality then implies

	$\displaystyle\left\langle\nabla f_{i}(x)-\nabla f_{i}(y),x-y\right\rangle$	$\displaystyle\geq-m\\|x-y\\|^{2}+\frac{1}{M+m}\\|\nabla f_{i}(x)-\nabla f_{i}(y)+m(x-y)\\|^{2}$
		$\displaystyle\geq-\left[m+\frac{m^{2}}{2(M+m)}\right]\\|x-y\\|^{2}+\frac{1}{2(M+m)}\\|\nabla f_{i}(x)-\nabla f_{i}(y)\\|^{2}.$

∎

The below result gives some technical bounds on changes in the proximal function.

Lemma A.8.

Given $u,v\in\mathbb{R}^{n}$ , let $\psi\in\overline{{\rm Conv}}\,(\mathbb{R}^{n})$ and define

\Delta:=u-v,\quad\Delta^{p}:=\operatorname{prox}_{\psi}(x)-\operatorname{prox}_{\psi}(y).

Then, the following statements hold:

(a)

$\|\Delta^{p}\|^{2}\leq\left\langle\Delta,\Delta^{p}\right\rangle$ ;
(b)

$\|\Delta^{p}-\Delta\|^{2}\leq\|\Delta\|^{2}-\|\Delta^{p}\|^{2}$ .

Proof.

(a) See [7, Theorem 6.42(a)].

(b) Using part (a), we have that

\displaystyle\|\Delta^{p}-\Delta\|^{2}

\displaystyle=\|\Delta^{p}\|^{2}+\|\Delta\|^{2}-2\left\langle\Delta,\Delta^{p}\right\rangle\leq\|\Delta\|^{2}-\|\Delta^{p}\|^{2}.

∎

We now develop some technical properties of ${\cal A}_{\lambda}(\cdot)$ . The first result presents a bound involving the following quantities for $x,y\in\operatorname*{dom}h$ and $i\in[k]$ .

\begin{gathered}d:=x-y,\quad\Delta_{i}:=\nabla f_{i}(x)-\nabla f_{i}(y),\quad\Delta_{i}^{p}:=\operatorname{prox}_{q_{i}}(\nabla f_{i}(x))-\operatorname{prox}_{q_{i}}(\nabla f_{i}(y)).\end{gathered}

(29)

Lemma A.9.

Let $x,y\in\operatorname*{dom}h$ and $i\in[k]$ be fixed, let $d$ , $\Delta_{i}$ , and $\Delta_{i}^{p}$ be as in (29) for some $\{f_{i}\}$ . Moreover, let $L_{\lambda}$ be as in (13), and suppose $f_{i}$ satisfies assumption (A2). If $\Delta_{i}^{p}=\Delta_{i}$ , then for any $\lambda\leq 1/(M+m)$ we have

\|d-\lambda\Delta_{i}^{p}\|\leq L_{\lambda}\|d\|.

(30)

On the other hand, if $\Delta_{i}^{p}\neq\Delta_{i}$ , then for any $\lambda\leq 1/[2(M+m)]$ we have

\|d-\lambda\Delta_{i}^{p}\|\leq\sqrt{2}L_{\lambda}\|d\|.

(31)

Proof.

Before proceeding, we first establish a technical inequality. Using Lemma A.7, it holds that for any $\mu>0$ ,

	$\displaystyle\mu\left\\|\Delta_{i}\right\\|^{2}-2\left\langle d,\Delta_{i}\right\rangle$	$\displaystyle\leq\mu\\|\Delta_{i}\\|^{2}+2m\left[1+\frac{m}{2(M+m)}\right]\\|d\\|^{2}-\frac{\\|\Delta_{i}\\|^{2}}{M+m}$
		$\displaystyle=2m\left[1+\frac{m}{2(M+m)}\right]\\|d\\|^{2}+\left({\mu-\frac{1}{M+m}}\right)\\|\Delta_{i}\\|^{2}.$		(32)

We now prove (30). Supposing that $\Delta_{i}^{p}=\Delta_{i}$ , we have

\|d-\lambda\Delta_{i}^{p}\|^{2}=\|d-\lambda\Delta_{i}\|^{2}=\|d\|^{2}+\lambda\left[\lambda\|\Delta_{i}\|^{2}-2\left\langle d,\Delta_{i}\right\rangle\right].

Using (32) with $\mu=\lambda$ , the above identity, and the definition of $L_{\lambda}$ , it holds that for any $\lambda\leq 1/(M+m)$ , we have

	$\displaystyle\\|d-\lambda\Delta_{i}^{p}\\|^{2}$	$\displaystyle\leq\left(1+2\lambda m\left[1+\frac{m}{2(M+m)}\right]\right)\\|d\\|^{2}+\lambda\left({\lambda-\frac{1}{M+m}}\right)\\|\Delta_{i}\\|^{2}$
		$\displaystyle=L_{\lambda}^{2}\\|d\\|^{2}+\lambda\left({\lambda-\frac{1}{M+m}}\right)\\|\Delta_{i}\\|^{2}\leq L_{\lambda}^{2}\\|d\\|^{2}.$

We now prove (31). Using Lemma A.8(b) with $(\Delta,\Delta^{p})=(\Delta_{i},\Delta_{i}^{p})$ and the inequality $\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$ for $a,b\in\mathbb{R}^{n}$ , it holds that

	$\displaystyle\left\\|d-\lambda\Delta_{i}^{p}\right\\|^{2}$	$\displaystyle=\left\\|d-\lambda(\Delta_{i}+\Delta_{i}-\Delta_{i}^{p})\right\\|^{2}\leq 2\\|d-\lambda\Delta_{i}\\|^{2}+2\lambda^{2}\\|\Delta_{i}-\Delta_{i}^{p}\\|^{2}$
		$\displaystyle\overset{Lem.\leavevmode\nobreak\ \ref{lem:prox}(b)}{\leq}2\\|d-\lambda\Delta_{i}\\|^{2}+2\lambda^{2}\\|\Delta_{i}\\|^{2}-2\lambda^{2}\\|\Delta_{i}^{p}\\|^{2}$
		$\displaystyle\leq 2\\|d-2\lambda\Delta_{i}\\|^{2}+2\lambda^{2}\\|\Delta_{i}\\|^{2}$
		$\displaystyle=2\left(\\|d\\|^{2}+\lambda\left[2\lambda\left\\|\Delta_{i}\right\\|^{2}-2\left\langle d,\Delta_{i}\right\rangle\right]\right).$

Using (32) with $\mu=2\lambda$ , the above inequality, and the definition of $L_{\lambda}$ , it holds that for any $\lambda\leq 1/[2(M+m)]$ , we have

	$\displaystyle\\|d-\lambda\Delta_{i}^{p}\\|^{2}$	$\displaystyle\leq 2\left(1+2\lambda m\left[1+\frac{m}{2(M+m)}\right]\right)\\|d\\|^{2}+2\left({2\lambda-\frac{1}{M+m}}\right)\\|\Delta_{i}\\|^{2}$
		$\displaystyle=2L_{\lambda}^{2}\\|d\\|^{2}+2\left({2\lambda-\frac{1}{M+m}}\right)\\|\Delta_{i}\\|^{2}\leq 2L_{\lambda}^{2}\\|d\\|^{2}.$

∎

We are now ready to give the proof of Proposition 2.1.

A.2.1 Proof of Proposition 2.1

Proof.

(a) Let $x,y\in\operatorname*{dom}h$ and $\lambda>0$ be arbitrary. Moreover, denote $p_{i}(\cdot)=\operatorname{prox}_{q_{i}}(\cdot)$ for $i\in[k]$ . Using the definition of ${\cal A}_{\lambda}(\cdot)$ , the assumption that $\|p_{i}(z)\|\leq C$ for any $z\in\mathbb{R}^{n}$ , and the triangle inequality, we have that

	$\displaystyle\\|{\cal A}_{\lambda}(x)-{\cal A}_{\lambda}(y)\\|$	$\displaystyle=\left\\|x-y+\frac{\lambda}{\|B\|}\sum_{i\in B}[p_{i}(x)-p_{i}(y)]\right\\|\leq\\|x-y\\|+\frac{\lambda}{\|B\|}\sum_{i\in B}(\\|p_{i}(x)\\|+\\|p_{i}(y)\\|)$
		$\displaystyle\leq\\|x-y\\|+2\lambda C.$

(b) Let $x,y\in\operatorname*{dom}h$ be arbitrary, and denote $\xi(\cdot):={\cal A}_{\lambda}(\cdot,\{f_{i}\},\{q_{i}\})$ . Moreover, let $d$ , $\Delta_{i}$ , and $\Delta_{i}^{p}$ be as in (29). Using the fact that $\|\sum_{i=1}^{|B|}v_{i}\|^{2}\leq|B|\sum_{i=1}^{|B|}\|v_{i}\|^{2}$ for any $\{v_{i}\}\subseteq\mathbb{R}^{n}$ , we have

	$\displaystyle\\|\xi(x)-\xi(y)\\|^{2}$	$\displaystyle=\frac{1}{\|B\|^{2}}\left\\|\sum_{i\in B}\left\{\left[x-\lambda\operatorname{prox}_{q_{i}}(\nabla f_{i}(x))\right]-\left[y-\lambda\operatorname{prox}_{q_{i}}(\nabla f_{i}(y))\right]\right\}\right\\|^{2}$
		$\displaystyle=\frac{1}{\|B\|^{2}}\left\\|\sum_{i\in B}(d-\lambda\Delta_{i}^{p})\right\\|^{2}\leq\frac{1}{\|B\|}\sum_{i\in B}\left\\|d-\lambda\Delta_{i}^{p}\right\\|^{2}.$		(33)

Using (33) and (31) in Lemma A.9, we conclude that

\displaystyle\|\xi(x)-\xi(y)\|^{2}

\displaystyle\leq\frac{1}{|B|}\sum_{i\in B}\left\|d-\lambda\Delta_{i}^{p}\right\|^{2}\leq 2L_{\lambda}^{2}\|d\|^{2}=2L_{\lambda}\|x-y\|^{2}.

(c) Let $\xi(\cdot)$ , $d$ , $\Delta_{i}$ , and $\Delta_{i}^{p}$ be as in part (b). Following the same argument as in part (b), we obtain (33). Using (33) and (30) in Lemma A.9, we conclude that

\displaystyle\|\xi(x)-\xi(y)\|^{2}

\displaystyle\leq\frac{1}{|B|}\sum_{i\in B}\left\|d-\lambda\Delta_{i}^{p}\right\|^{2}\leq L_{\lambda}^{2}\|d\|^{2}=L_{\lambda}\|x-y\|^{2}.

∎

A.3 RDP bounds

This subappendix derives the RDP bounds in Theorems 2.2 and 2.3.

The first result shows how the updates in Algorithm 1 are randomized proximal updates applied to the operator $A_{\lambda}(\cdot)$ in (12) with $q_{i}(\cdot)={\rm Clip}_{C}(\cdot)$ .

Lemma A.10.

Let $\{X_{t}\}$ , $\{X_{t}^{\prime}\}$ , $\lambda$ , $b$ , $\sigma$ , $C$ , and $T$ be as in Theorem 2.2. Moreover, denote

\phi_{t}(x):={\cal A}_{\lambda}(x,\{f_{i}\},\{{\rm Clip}_{C}\}),\quad\phi_{t}^{\prime}(x):={\cal A}_{\lambda}(x,\{f_{i}^{\prime}\},\{{\rm Clip}_{C}\}),\quad\forall x\in\operatorname*{dom}h,

where ${\rm Clip}_{C}(\cdot)$ and ${\cal A}_{\lambda}(\cdot)$ are as in (2) and (12), respectively. Then, it holds that

X_{t}=\operatorname{prox}_{\lambda h}(\phi_{t}(X_{t-1})+N_{t}),\quad X_{t}^{\prime}=\operatorname{prox}_{\lambda h}(\phi_{t}^{\prime}(X_{t-1}^{\prime})+N_{t}),\quad\forall t\geq 1.

Proof.

This follows immediately from the definition of $\phi_{t}$ , the update rules in Algorithm 1, and the fact that ${\rm Clip}_{C}(\cdot)$ is the proximal operator of the (convex) indicator function of the convex set $\{x:\|x\|\leq C\}$ . ∎

We now present some important norm bounds.

Lemma A.11.

Let $\{X_{t}\}$ , $\{X_{t}^{\prime}\}$ , $\{\phi_{t}\}$ , and $\{\phi_{t}^{\prime}\}$ be as in Theorem 2.2 and denote $\ell=k/b$ and $t^{*}:=\inf_{t\geq 1}\left\{t:i^{*}\in B_{t}\right\}$ . Then, it holds that

\|X_{t^{*}}-X_{t^{*}}^{\prime}\|=0,\quad\|\phi_{s}(x)-\phi_{s}^{\prime}(x)\|\leq\frac{2\lambda C}{b},

(34)

for every $s\in\{j\ell+t^{*}:j=0,1,...\}$ and any $x\in\operatorname*{dom}h$ .

Proof.

The identity in (34) follows from the fact that $X_{t}=X_{t}^{\prime}$ for every $t\leq t^{*}$ . For the inequality in (34), it suffices to show the bound for $s=t^{*}$ because the batches $B_{t}$ in Algorithm 1 are drawn cyclically. To that end, let $x\in\operatorname*{dom}h$ be fixed. Using the update rule in Algorithm 1, and the fact that $\|{\rm Clip}_{C}(x)\|\leq C$ for every $x\in\mathbb{R}^{n}$ , we have that

	$\displaystyle\\|\phi_{s}(x)-\phi_{s}^{\prime}(x)\\|$	$\displaystyle=\frac{1}{b}\left\\|\sum_{i\in B_{t^{}}}\left[x-\lambda{\rm Clip}_{C}(\nabla f_{i^{}}(x))\right]-\left[x-\lambda{\rm Clip}_{C}(\nabla f_{i^{*}}^{\prime}(x))\right]\right\\|$
		$\displaystyle=\frac{\lambda}{b}\\|{\rm Clip}_{C}(\nabla f_{i^{}}(x))-{\rm Clip}_{C}(\nabla f_{i^{}}^{\prime}(x))\\|$
		$\displaystyle\leq\frac{\lambda}{b}\left[\\|{\rm Clip}_{C}(\nabla f_{i^{}}(x))\\|+\\|{\rm Clip}_{C}(\nabla f_{i^{}}^{\prime}(x))\\|\right]=\frac{2\lambda C}{b}.$

∎

We now give the proofs of the main RDP bounds.

A.3.1 Proof of Theorem 2.2

Proof.

Using Proposition A.5(a) with

Y_{0}=X_{T-1},\quad Y_{0}^{\prime}=X_{T-1}^{\prime},\quad\tau=d_{h},\quad s=\frac{2\lambda C}{b},\quad L=L_{\lambda},\quad\ell=\frac{k}{b},

and $E=T=1$ , we have that

\displaystyle D_{\alpha}(X_{T}\|X_{T}^{\prime})

\displaystyle\leq D_{\alpha}^{(d_{h})}(X_{T-1}\|X_{T-1}^{\prime})+\frac{\alpha}{2\sigma^{2}}\left(Ld_{h}+\frac{2\lambda C}{b}\right)^{2}\theta_{L}(1)=\frac{\alpha}{2\sigma^{2}}\left(Ld_{h}+\frac{2\lambda C}{b}\right)^{2}.

∎

A.3.2 Proof of Theorem 2.3

Proof.

Suppose $f_{i^{*}}^{\prime}\neq f_{i^{*}}$ and $i^{*}\in B_{t^{*}}$ for indices $i^{*}$ and $t^{*}$ . We first prove (17). In view of Proposition 2.1(b), it is clear that the DP-SGD update is $\sqrt{2}L_{\lambda}$ -Lipschitz continuous. Hence, using Lemma A.11(b) and Proposition A.5(a) with

Y_{0}=X_{t^{*}},\quad Y_{0}^{\prime}=X_{t^{*}}^{\prime},\quad\tau=0,\quad s=\frac{2\lambda C}{b},\quad L=L_{\lambda},\quad\ell=\frac{k}{b},

and $T=T-t^{*}-1$ , we have that

	$\displaystyle D_{\alpha}(X_{T}\\|X_{T}^{\prime})$	$\displaystyle\leq D_{\alpha}^{(0)}(X_{0}\\|X_{0})+2\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[E\theta_{L_{\lambda}}(\ell)+\theta_{L_{\lambda}}(T-t^{*}-1-E\ell)\right]$
		$\displaystyle=2\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[E\theta_{L_{\lambda}}(\ell)+\theta_{L_{\lambda}}(T-t^{*}-1-E\ell)\right]\leq 2\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[1+E\theta_{L_{\lambda}}(\ell)\right],$

where the last inequality follows from the fact that $\theta_{L_{\lambda}}(s)$ is nonincreasing for $s\geq 1$ and $\theta_{L_{\lambda}}(1)=1$ .

We now prove (19). In view of (18) and Proposition 2.1(c), it is clear that the DP-SGD update, in the absence of gradient clipping, is $L_{\lambda}$ -Lipschitz continuous. Consequently, the desired bound follows from the same arguments as in the proof of (19) above, but with $L=L_{\lambda}$ instead of $L=\sqrt{2}L_{\lambda}$ . ∎

Appendix B Choice of residuals

This appendix briefly discusses the choice of residuals $\{a_{t}\}$ that are used in the proof of Proposition A.5(a) and Lemma A.2.

In the setup of Proposition A.5(a), it is straightforward to show that if $\{a_{t}\}$ is a nonnegative sequence of scalars such that

\tilde{R}_{t}:=L^{t-1}(L\tau+s)-\sum_{i=1}^{t}a_{i}L^{t-i}\geq 0,\quad\tilde{R}_{T}=0,

then repeatedly applying Lemma A.2 with $a=a_{t}$ yields

D_{\alpha}(Y_{T}\|Y_{T}^{\prime})-D_{\alpha}^{(\tau)}(Y_{0}\|Y_{0}^{\prime})\leq\frac{\alpha}{2\sigma^{2}}\sum_{i=1}^{T}a_{i}^{2}.

(35)

Hence, to obtain the tightest bound of the form in (35), we need to solve the quadratic program

	$\displaystyle(P)\quad\min\$	$\displaystyle\frac{1}{2}\sum_{i=1}^{T}a_{i}^{2}$
	s.t	$\displaystyle\tilde{R}_{t}\geq 0\quad\forall t\in[T-1],$
		$\displaystyle\tilde{R}_{T}=0.$

If we ignore the inequality constraints, the first order optimality condition of the resulting problem is that there exists $\xi\in\mathbb{R}$ such that

a_{i}=\xi L^{t-i}\quad\forall t\in[T],\quad\tilde{R}_{T}=0.

The latter identity implies that

L^{T-1}(L\tau+s)=\xi\sum_{i=1}^{T}L^{2(T-i)}=\xi\sum_{i=0}^{T-1}L^{2i}

which then implies

a_{i}=\frac{L^{T-1}(L\tau+s)L^{t-i}}{\sum_{i=0}^{T-1}L^{2i}}\quad\forall t\in[T].

Hence, to verify that the above choice of $a_{i}$ is optimal for $(P)$ , it remains to verify that $\tilde{R}_{t}\geq 0$ for $t\in[T-1]$ . Indeed, this follows from Lemma A.3(b) after normalizing for the $L\tau+s$ factor. As a consequence, the right-hand-side of (35) is minimized for our choice of $a_{i}$ above.

Appendix C Parameter choices

Let us now consider some interesting values for $\lambda$ , $\sigma$ , and $\ell$ .

The result below establishes a useful bound on $\theta_{L}(s)$ for sufficiently large enough values of $s$ .

Lemma C.1.

For any $L>1$ and $\xi>1$ , if $s\geq\log_{L}\sqrt{\xi/(\xi-1)}$ then $\theta_{L}(s)\leq\xi\left(1-L^{-2}\right).$

Proof.

Using the definition of $\theta_{L}(\cdot)$ , we have

	$\displaystyle\theta_{L}(s)$	$\displaystyle=\frac{L^{-2(s-1)}}{\sum_{i=0}^{s-1}L^{2i}}=\frac{L^{2s}-L^{2(s-1)}}{L^{2s}-1}=\frac{1-L^{-2}}{1-L^{-2s}}\leq\frac{1-L^{-2}}{1-L^{-2\log_{L}\sqrt{\xi/(\xi-1)}}}$
		$\displaystyle=\frac{1-L^{-2}}{1-(\xi-1)/\xi}=\xi\left(1-\frac{1}{L^{2}}\right).$

∎

Corollary C.2.

Let $\alpha>1$ and $\varepsilon>0$ be fixed, and let $\{X_{t}\}$ , $\{X_{t}^{\prime}\}$ , $b$ , $C$ , $E$ , $\ell$ , $\lambda$ , and $T$ be as in Theorem 2.3. Moreover, define

\overline{\lambda}(\rho):=\frac{1}{2(M+\rho)},\quad\overline{\sigma}_{\varepsilon}(\rho):=\frac{C\cdot\overline{\lambda}(\rho)}{2b}\sqrt{\frac{1}{\alpha\varepsilon}\left(1+\left[\frac{4\rho}{M+\rho}\right]E\right)},\quad\overline{\ell}(\rho):=\frac{\log 2}{\log\left[1+\rho\overline{\lambda}(\rho)\right]},

for every $\rho>0$ . If $\lambda=\overline{\lambda}(m)$ , $\sigma\geq\overline{\sigma}_{\varepsilon}(m)$ , $\ell\geq\overline{\ell}(m)$ , and no gradient clipping is performed, then

\displaystyle D_{\alpha}(X_{T}\|X_{T}^{\prime})\leq 4\alpha\left[\frac{C\cdot\overline{\lambda}(m)}{b\cdot\overline{\sigma}_{\varepsilon}(m)}\right]^{2}\left[1+\frac{4m}{M+m}\right],

and the corresponding instance of Algorithm 1 is $(\alpha,\varepsilon)$ -Rényi-DP.

Proof.

For ease of notation, denote $\overline{\lambda}=\overline{\lambda}(m)$ , $L=L_{\overline{\lambda}}$ , $\overline{\sigma}=\overline{\sigma}(m)$ , and $\overline{\ell}=\overline{\ell}(m)$ . We first note that

L=L_{\overline{\lambda}}=\sqrt{1+\frac{m}{M+m}\left[1+\frac{m}{M+m}\right]}\geq\sqrt{1+m\overline{\lambda}(m)},

which implies

\ell\geq\overline{\ell}=\frac{\log\sqrt{2}}{\log\sqrt{1+m\overline{\lambda}}}=\frac{\log\sqrt{2}}{\log L}=\log_{L}\sqrt{2}.

Consequently, using Lemma C.1 with $(\xi,s)=(2,\ell)$ and the definitions of $L_{\lambda}(\cdot)$ and $\overline{\lambda}(\cdot)$ , we have that

\theta_{L}(\ell)\leq 2\left(1-\frac{1}{L^{2}}\right)=2\left(\frac{2m}{2(M+m)}\left[1+\frac{m}{M+m}\right]\right)\leq\frac{4m}{M+m}.

Using the above bound and Theorem 2.3 with $(\lambda,\sigma,L)=(\overline{\lambda},\overline{\sigma},L_{\overline{\lambda}})$ , we obtain

	$\displaystyle D_{\alpha}(X_{T}\\|X_{T}^{\prime})$	$\displaystyle\leq 4\alpha\left(\frac{\overline{\lambda}C}{b\sigma}\right)^{2}\left[1+E\theta_{L}(\ell)\right]\leq 4\alpha\left(\frac{\overline{\lambda}C}{b\sigma}\right)^{2}\left[1+\frac{4Em}{M+m}\right]$
		$\displaystyle\leq 4\alpha\left(\frac{\overline{\lambda}C}{b\overline{\sigma}}\right)^{2}\left[1+\frac{4Em}{M+m}\right]\leq\varepsilon.$

In view of the fact that Algorithm 1 returns the last iterate $X_{T}$ (or $X_{T}^{\prime}$ ), the conclusion follows. ∎

Some remarks about Corollary C.2 are in order. First, $\sigma_{\varepsilon}^{2}(m)$ increases linearly with the number of dataset passes $E$ . Second, the smaller $m$ is the smaller the effect of $E$ on $\sigma_{\varepsilon}(m)$ is. Fourth, $\lim_{m\to 0}\overline{\ell}(m)=\infty$ which implies that the reducing the dependence on $E$ in $\sigma_{\varepsilon}(m)$ leads to more restrictive choices on $\ell$ . Finally, it is worth emphasizing that the restrictions on $\ell$ can be removed by using (17) directly. However, the resulting bounds are less informative in terms of the topological constants $m$ and $M$ .

We now present an RDP bound that is independent of $E$ when $\lambda$ is sufficiently small.

Corollary C.3.

Let $\{X_{t}\}$ , $\{X_{t}^{\prime}\}$ , $b$ , $C$ , $E$ , $\ell$ , $\lambda$ , $\sigma$ , and $T$ be as in Theorem 2.3. If

\lambda\leq\min\left\{\frac{1}{\sqrt{E}},\frac{1}{2(m+M)}\right\}

and no gradient clipping is performed, then we have

D_{\alpha}(X_{T}\|X_{T}^{\prime})\leq 4\alpha\left(\frac{C}{b\sigma}\right)^{2}\left[1+\theta_{L_{\lambda}}(\ell)\right].

Proof.

Using Theorem 2.3 and the fact that $\theta_{L}(\cdot)\leq 1$ for any $L>1$ , we have that

	$\displaystyle D_{\alpha}(X_{T}\\|X_{T}^{\prime})$	$\displaystyle\leq 4\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[\theta_{L_{\lambda}}(T-E\ell)+E\theta_{L_{\lambda}}(\ell)\right]\leq 4\alpha\left(\frac{\lambda C}{b\sigma}\right)^{2}\left[1+E\theta_{L_{\lambda}}(\ell)\right]$
		$\displaystyle\leq 4\alpha\left(\frac{C}{b\sigma}\right)^{2}\left[\frac{1}{E}+\theta_{L_{\lambda}}(\ell)\right]\leq 4\alpha\left(\frac{C}{b\sigma}\right)^{2}\left[1+\theta_{L_{\lambda}}(\ell)\right].$

∎

Appendix D Limitations of Poisson sampling in practice

This appendix discussing the computational limitation of implementing Poisson sampling in practice. It is primarily concerned with the large-scale setting where datasets may be on the order of hundreds of millions of examples.

Data access. Implementations of Poisson sampling, e.g., Opacus [29], typically employ a pseudorandom number generator to (i) randomly sample a collection of indices from zero to $N-1$ , where $N$ is the size of the dataset and (ii) map these indices to corresponding examples in the dataset to generate a batch of examples. In order for (ii) to be efficient, many libraries need fast random access to the dataset which is difficult to do without loading the entire dataset into RAM (as reading data from disk can be orders of magnitude more expensive). In contrast, cyclic traversal of batches only requires (relatively small) fixed blocks of the dataset to be loaded into memory for every batch and need not perform a matching of indices (such as in (i) above) to data.

Variable batch size. Independent of the access speed of the dataset examples, Poisson sampling also generates batches of random sizes, which are typically inconvenient to handle in deep learning systems [14]. For example, popular just-in-time compilation-based machine learning libraries such as JAX, PyTorch/Opacus, and TensorFlow may need to retrace their computation graph at every training step as the batch size cannot be statically inferred or kept constant. Additionally, optimizing training workloads on hardware accelerators such as graphical processing units (GPUs) and tensor processing units (TPUs) becomes difficult as (i) they require any in-device data to have fixed sizes and (ii) any input data generated by Poisson sampling will have variable sizes due to the effect of variable batch sizes. In contrast, the cyclic traversal of batches will always generate fixed batch sizes and, consequently, will not suffer from the above issues.

	$\displaystyle D_{\alpha}^{(\tau)}(Y_{0}\\|Y_{0}^{\prime})$	$\displaystyle\geq D_{\alpha}^{([L\tau+s]R_{1})}(Y_{1}\\|Y_{1}^{\prime})-\frac{\alpha a_{1}^{2}}{2\sigma^{2}}\geq D_{\alpha}^{([L\tau+s]LR_{1}-a_{2})}(Y_{2}\\|Y_{2}^{\prime})-\frac{\alpha(a_{1}^{2}+a_{2}^{2})}{2\sigma^{2}}$
		$\displaystyle=D_{\alpha}^{([L\tau+s]R_{2})}(Y_{2}\\|Y_{2}^{\prime})-\frac{\alpha(a_{1}^{2}+a_{2}^{2})}{2\sigma^{2}}\geq\cdots$
		$\displaystyle\geq D_{\alpha}^{([L\tau+s]R_{T})}(Y_{T}\\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}=D_{\alpha}^{(0)}(Y_{T}\\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}$
		$\displaystyle=D_{\alpha}(Y_{T}\\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}.$

	$\displaystyle D_{\alpha}^{(0)}(Y_{0}\\|Y_{0}^{\prime})$	$\displaystyle\geq D_{\alpha}^{(0)}(Y_{1}\\|Y_{1}^{\prime})-\frac{\alpha a_{1}^{2}}{2\sigma^{2}}\geq D_{\alpha}^{(\zeta-a_{2})}(Y_{2}\\|Y_{2}^{\prime})-\frac{\alpha(a_{1}^{2}+a_{2}^{2})}{2\sigma^{2}}$
		$\displaystyle=D_{\alpha}^{(0)}(Y_{2}\\|Y_{2}^{\prime})-\frac{\alpha(a_{1}^{2}+a_{2}^{2})}{2\sigma^{2}}\geq\cdots$
		$\displaystyle\geq D_{\alpha}^{(\zeta-a_{T})}(Y_{T}\\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}=D_{\alpha}^{(0)}(Y_{T}\\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}$
		$\displaystyle=D_{\alpha}(Y_{T}\\|Y_{T}^{\prime})-\frac{\alpha\sum_{i=1}^{T}a_{i}^{2}}{2\sigma^{2}}.$

	$\displaystyle\frac{1}{M+m}\\|\nabla F(x)-\nabla F(y)\\|^{2}$	$\displaystyle=\frac{1}{M+m}\\|\nabla f_{i}(x)-\nabla f_{i}(y)+m(x-y)\\|^{2}$
		$\displaystyle\leq\left\langle\nabla F(x)-\nabla F(y),x-y\right\rangle$
		$\displaystyle=\left\langle\nabla f_{i}(x)-\nabla f_{i}(y),x-y\right\rangle+m\\|x-y\\|^{2}.$

	$\displaystyle\mu\left\\|\Delta_{i}\right\\|^{2}-2\left\langle d,\Delta_{i}\right\rangle$	$\displaystyle\leq\mu\\|\Delta_{i}\\|^{2}+2m\left[1+\frac{m}{2(M+m)}\right]\\|d\\|^{2}-\frac{\\|\Delta_{i}\\|^{2}}{M+m}$
		$\displaystyle=2m\left[1+\frac{m}{2(M+m)}\right]\\|d\\|^{2}+\left({\mu-\frac{1}{M+m}}\right)\\|\Delta_{i}\\|^{2}.$		(32)

	$\displaystyle\left\\|d-\lambda\Delta_{i}^{p}\right\\|^{2}$	$\displaystyle=\left\\|d-\lambda(\Delta_{i}+\Delta_{i}-\Delta_{i}^{p})\right\\|^{2}\leq 2\\|d-\lambda\Delta_{i}\\|^{2}+2\lambda^{2}\\|\Delta_{i}-\Delta_{i}^{p}\\|^{2}$
		$\displaystyle\overset{Lem.\leavevmode\nobreak\ \ref{lem:prox}(b)}{\leq}2\\|d-\lambda\Delta_{i}\\|^{2}+2\lambda^{2}\\|\Delta_{i}\\|^{2}-2\lambda^{2}\\|\Delta_{i}^{p}\\|^{2}$
		$\displaystyle\leq 2\\|d-2\lambda\Delta_{i}\\|^{2}+2\lambda^{2}\\|\Delta_{i}\\|^{2}$
		$\displaystyle=2\left(\\|d\\|^{2}+\lambda\left[2\lambda\left\\|\Delta_{i}\right\\|^{2}-2\left\langle d,\Delta_{i}\right\rangle\right]\right).$