
A preliminary conference version of this paper appeared in NeurIPS 2022 [4].

Privacy of Noisy Stochastic Gradient Descent:
More Iterations without More Privacy Loss

Jason M. Altschuler (This work was done while JA was an intern at Apple during Summer 2021. JA was also supported in part by NSF Graduate Research Fellowship 1122374 and a TwoSigma PhD Fellowship.)
MIT
[email protected]
   Kunal Talwar
Apple
[email protected]
Abstract

A central issue in machine learning is how to train models on sensitive user data. Industry has widely adopted a simple algorithm: Stochastic Gradient Descent with noise (a.k.a. Stochastic Gradient Langevin Dynamics). However, foundational theoretical questions about this algorithm’s privacy loss remain open—even in the seemingly simple setting of smooth convex losses over a bounded domain. Our main result resolves these questions: for a large range of parameters, we characterize the differential privacy up to a constant factor. This result reveals that all previous analyses for this setting have the wrong qualitative behavior. Specifically, while previous privacy analyses increase ad infinitum in the number of iterations, we show that after a small burn-in period, running SGD longer leaks no further privacy.

Our analysis departs from previous approaches based on fast mixing, instead using techniques based on optimal transport (namely, Privacy Amplification by Iteration) and the Sampled Gaussian Mechanism (namely, Privacy Amplification by Sampling). Our techniques readily extend to other settings, e.g., strongly convex losses, non-uniform stepsizes, arbitrary batch sizes, and random or cyclic choice of batches.

1 Introduction

Convex optimization is a fundamental task in machine learning. When models are learnt on sensitive data, privacy becomes a major concern—motivating a large body of work on differentially private convex optimization [11, 5, 6, 20, 28, 26, 41, 7]. In practice, the most common approach for training private models is Noisy-SGD, i.e., Stochastic Gradient Descent with noise added each iteration. This algorithm is simple and natural, has optimal utility bounds [5, 6], and is implemented on mainstream machine learning platforms such as Tensorflow (TF Privacy) [1], PyTorch (Opacus) [48], and JAX [8].

Yet, despite the simplicity and ubiquity of this Noisy-SGD algorithm, we do not understand basic questions about its privacy loss (we follow the literature in writing "privacy loss" to refer to the (Rényi) differential privacy parameters), i.e., how sensitive the output of Noisy-SGD is with respect to the training data. Specifically:

Question 1.1.

What is the privacy loss of Noisy-SGD as a function of the number of iterations?

Even in the seemingly simple setting of smooth convex losses over a bounded domain, this fundamental question has remained wide open. In fact, even more basic questions are open:

Question 1.2.

Does the privacy loss of Noisy-SGD increase ad infinitum in the number of iterations?

The purpose of this paper is to understand these fundamental theoretical questions. Specifically, we resolve Questions 1.1 and 1.2 by characterizing the privacy loss of Noisy-SGD up to a constant factor in this (and other) settings for a large range of parameters. Below, we first provide context by describing previous analyses in §1.1, and then describe our result in §1.2 and techniques in §1.3.

1.1 Previous approaches and limitations

Although there is a large body of research devoted to understanding the privacy loss of Noisy-SGD, all existing analyses have (at least) one of the following two drawbacks.

Privacy bounds that increase ad infinitum.

One type of analysis approach yields upper bounds on the DP loss (resp., Rényi DP loss) that scale in the number of iterations $T$ as $\sqrt{T}$ (resp., $T$). This includes the original analyses of Noisy-SGD, which were based on the techniques of Privacy Amplification by Sampling and Advanced Composition [5, 2], as well as alternative analyses based on the technique of Privacy Amplification by Iteration [19]. A key issue with these analyses is that they increase unboundedly in $T$. This limits the number of iterations that Noisy-SGD can be run given a reasonable privacy budget, typically leading to suboptimal optimization error in practice. Is this a failure of existing analysis techniques, or an inherent fact about the privacy loss of Noisy-SGD?

Privacy bounds that apply for large $T$ but require strong additional assumptions.

The second type of analysis approach yields convergent upper bounds on the privacy loss, but requires strong additional assumptions. This approach is based on connections to sampling. The high-level intuition is that Noisy-SGD is a discretization of a continuous-time algorithm with bounded privacy loss. Specifically, Noisy-SGD can be interpreted as the Stochastic Gradient Langevin Dynamics (SGLD) algorithm [45], which is a discretization of a continuous-time Markov process whose stationary distribution is equivalent to the exponential mechanism [30] and thus is differentially private under certain assumptions.

However, making this connection precise requires strong additional assumptions and/or the resolution of longstanding open questions about the mixing time of SGLD (see §1.4 for details). Only recently did a large effort in this direction culminate in the breakthrough work by Chourasia et al. [12], which proves that full-batch Langevin dynamics (a.k.a., Noisy-GD rather than Noisy-SGD) has a privacy loss that converges as $T\to\infty$ in this setting where the smooth losses are additionally assumed to be strongly convex. (Recently, Ye and Shokri [47] and Ryffel et al. [38], in works concurrent to the present paper, extended this result of [12] to SGLD by removing the full-batch assumption; we also obtain the same result by a direct extension of our (completely different) techniques, see §5.1. Note that both these papers [47, 38] still require strongly convex losses, and in fact state in their conclusions that the removal of this assumption is an open problem that "would pave the way for wide adoption by data scientists." Our main result resolves this question.)

Unfortunately, the assumption of strong convexity seems unavoidable with current techniques. Indeed, in the absence of strong convexity, it is not even known if Noisy-SGD converges to a private stationary distribution, let alone if this convergence occurs in a reasonable amount of time. (The tour-de-force work [10] shows mixing in (large) polynomial time, but only in total variation distance which does not have implications for privacy.) There are fundamental challenges for proving such a result. In short, SGD is only a weakly contractive process without strong convexity, which means that its instability increases with the number of iterations [25]—or in other words, it is plausible that Noisy-SGD could run for a long time while memorizing training data, which would of course mean it is not a privacy-preserving algorithm. As such, given state-of-the-art analyses in both sampling and optimization, it is unclear if the privacy loss of Noisy-SGD should even remain bounded; i.e., it is unclear what answer one should even expect for Question 1.2, let alone Question 1.1.

1.2 Contributions

The purpose of this paper is to resolve Questions 1.1 and 1.2. To state our result requires first recalling the parameters of the problem. Throughout, we prefer to state our results in terms of Rényi Differential Privacy (RDP); these RDP bounds are easily translated to DP bounds, as mentioned below in Remark 1.4. See the preliminaries section §2 for definitions of DP and RDP.

We consider the basic Noisy-SGD algorithm run on a dataset $\mathcal{X}=\{x_{1},\dots,x_{n}\}$, where each $x_{i}$ defines a convex, $L$-Lipschitz, and $M$-smooth loss function $f_{i}(\cdot)$ on a convex set $\mathcal{K}$ of diameter $D$. For any stepsize $\eta\leqslant 2/M$, batch size $b$, and initialization $\omega_{0}\in\mathcal{K}$, we iterate $T$ times the update

\omega_{t+1}\leftarrow\Pi_{\mathcal{K}}\big[\omega_{t}-\eta(G_{t}+Z_{t})\big],

where $G_{t}$ denotes the average gradient vector on a random batch of size $b$, $Z_{t}\sim\mathcal{N}(0,\sigma^{2}I_{d})$ is an isotropic Gaussian, and $\Pi_{\mathcal{K}}$ denotes the Euclidean projection onto $\mathcal{K}$. (Note that different papers have different notational conventions for the total noise $\eta Z$: the variance is $\eta^{2}\sigma^{2}$ here and in, e.g., [19]; it is $\eta\sigma^{2}$ in other papers like [12]; and it is $\sigma^{2}$ in others like [5]. To translate results, simply rescale $\sigma$.)
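
For concreteness, here is a minimal NumPy sketch of one run of this algorithm. The gradient oracle grad_f and the choice of $\mathcal{K}$ as a Euclidean ball centered at the origin (so that the projection has a closed form) are illustrative assumptions, not part of the formal setup:

```python
# Minimal sketch of Noisy-SGD (illustrative only).
import numpy as np

def project_ball(w, radius):
    """Euclidean projection onto the ball {w : ||w|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def noisy_sgd(data, grad_f, w0, eta, sigma, b, T, D, rng=np.random.default_rng(0)):
    n, d = len(data), len(w0)
    w = w0.copy()
    for _ in range(T):
        batch = rng.choice(n, size=b, replace=False)    # random batch B_t
        G = np.mean([grad_f(data[i], w) for i in batch], axis=0)
        Z = sigma * rng.standard_normal(d)              # Z_t ~ N(0, sigma^2 I_d)
        w = project_ball(w - eta * (G + Z), D / 2)      # Pi_K[w - eta (G_t + Z_t)]
    return w
```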

Known privacy analyses of Noisy-SGD give an $(\alpha,\varepsilon)$-RDP upper bound of

\varepsilon\lesssim\frac{\alpha L^{2}}{n^{2}\sigma^{2}}\,T \qquad (1.1)

which increases ad infinitum as the number of iterations $T\to\infty$ [5, 2, 19]. Our main result is a tight characterization of the privacy loss, answering Question 1.1. This result also answers Question 1.2 and shows that previous analyses have the wrong qualitative behavior: after a small burn-in period of $\bar{T}\asymp nD/(L\eta)$ iterations, Noisy-SGD leaks no further privacy. See Figure 1 for an illustration.

Figure 1: Even in the basic setting of smooth convex optimization over a bounded domain, existing implementations and theoretical analyses of Noisy-SGD—the standard algorithm for private optimization—leak privacy ad infinitum as the number of iterations $T$ increases. Our main result, Theorem 1.3, establishes that after a small burn-in period of $\bar{T}\asymp n$ iterations, there is no further privacy loss. For simplicity, this plot considers Lipschitz parameter $L=1$ (wlog by rescaling) and Rényi parameter $\alpha$ of constant size (the regime in practice), and omits logarithmic factors; see the main text for our precise (and optimal) dependence on all parameters.
Theorem 1.3 (Informal statement of main result: tight characterization of the privacy of Noisy-SGD).

For a large range of parameters, Noisy-SGD satisfies $(\alpha,\varepsilon)$-RDP for

\varepsilon\lesssim\frac{\alpha L^{2}}{n^{2}\sigma^{2}}\min\left\{T,\frac{Dn}{L\eta}\right\}, \qquad (1.2)

and moreover this bound is tight up to a constant factor.

Observe that the privacy bound in Theorem 1.3 is identical to the previous bound (1.1) when the number of iterations $T$ is small, but stops increasing when $T\geqslant\bar{T}$. Intuitively, $\bar{T}$ can be interpreted as the smallest number of iterations required for two SGD processes run on adjacent datasets to, with reasonable probability, reach opposite ends of the constraint set. (Details: Noisy-SGD updates on adjacent datasets differ by at most $\eta L/b$ if the differing datapoint is in the current batch, which happens with probability $b/n$, and otherwise are identical. Thus, in expectation and with high probability, it takes $\bar{T}\asymp(Dn)/(\eta L)$ iterations for two Noisy-SGD processes to differ by distance $D$.) In other words, $\bar{T}$ is effectively the smallest number of iterations before the final iterates could be maximally distinguishable.

To prove Theorem 1.3, we show a matching upper bound (in §3) and lower bound (in §4). These two bounds are formally stated as Theorems 3.1 and 4.1, respectively. See §1.3 for an overview of our techniques and how they depart from previous approaches.

We conclude this section with several remarks about Theorem 1.3.

Remark 1.4 (Tight DP characterization).

While Theorem 1.3 characterizes the privacy loss of Noisy-SGD in terms of RDP, our results can be restated in terms of standard DP bounds. Specifically, by a standard conversion from RDP to DP (Lemma 2.5), if $\delta$ is not smaller than exponentially small in $b\sigma/n$ and if the resulting bound is $\varepsilon\lesssim 1$, it follows that Noisy-SGD is $(\varepsilon,\delta)$-DP for

\varepsilon\lesssim\frac{L}{n\sigma}\sqrt{\min\left\{T,\frac{Dn}{L\eta}\right\}\log(1/\delta)}. \qquad (1.3)

A matching DP lower bound, for the same large range of parameters as in Theorem 1.3, is proved along the way when we establish our RDP lower bound in §4.
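
For concreteness, the bounds (1.2) and (1.3) are easy to evaluate numerically. The following Python sketch sets the constants hidden by $\lesssim$ to 1 purely for illustration (the actual constants are worked out in Appendix B), and the sample parameter values are arbitrary:

```python
# Illustrative evaluation of the asymptotic bounds (1.2) and (1.3).
import math

def rdp_bound(alpha, L, n, sigma, T, D, eta):
    """RDP bound (1.2): alpha L^2 / (n^2 sigma^2) * min{T, Dn/(L eta)}."""
    T_bar = D * n / (L * eta)
    return alpha * L**2 / (n**2 * sigma**2) * min(T, T_bar)

def dp_bound(L, n, sigma, T, D, eta, delta):
    """DP bound (1.3): (L/(n sigma)) * sqrt(min{T, Dn/(L eta)} * log(1/delta))."""
    T_bar = D * n / (L * eta)
    return (L / (n * sigma)) * math.sqrt(min(T, T_bar) * math.log(1 / delta))

# Once T exceeds T_bar = Dn/(L eta) = 10^5 here, the bound stops growing.
L, n, sigma, D, eta, delta = 1.0, 10_000, 10.0, 1.0, 0.1, 1e-6
for T in [10**4, 10**5, 10**6]:
    print(T, dp_bound(L, n, sigma, T, D, eta, delta))
```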

Figure 2: Analog of Figure 1 for the extensions of our main result to other settings. (a) In the setting of cyclic batch updates, the privacy of Noisy-SGD is identical except for an unavoidable worsening in an initial phase (Theorem 5.8). (b) In the setting of strongly convex losses, the privacy of Noisy-SGD is identical except that the threshold $\bar{T}$ improves (Theorem 5.1); plotted here for $\eta=1/M$.
Discussion of extensions.

Our analysis techniques immediately extend to other settings:

  • Strongly convex losses. If the convexity assumption on the losses is replaced by $m$-strong convexity, then Noisy-SGD enjoys better privacy. Specifically, the bound in Theorem 1.3 extends identically, except that the threshold $\bar{T}$ for no further privacy loss improves from linear to logarithmic in $n$, namely $\tilde{O}(1/(\eta m))$. For example, if the stepsize is $\eta=1/M$, then $\bar{T}=\tilde{O}(\kappa)$, where $\kappa=M/m$ denotes the condition number of the losses. This matches the independent results of [38, 47] discussed in §1.1, and uses completely different techniques. Details in §5.1; see also Figure 2(b) for an illustration of the privacy bound.

  • Cyclic batch updates. Noisy-SGD is sometimes run with batches that are chosen cyclically rather than randomly. The bound in Theorem 1.3 extends identically, except for an initial transient phase in which there is unavoidably a large privacy loss, since the algorithm may choose to update using a sensitive datapoint in an early batch. Details in §5.2; see also Figure 2(a) for an illustration of the privacy bound.

  • Non-uniform stepsizes. Noisy-SGD is sometimes run with decaying stepsizes $\eta_{t}=t^{-c}$ for $c\in(0,1)$. The bound in Theorem 1.3 extends identically, with $\eta$ replaced by $\eta_{T}$. This means that the privacy loss increases after the threshold, but at the slower rate of $T^{c}\ll T$ for RDP (resp., $T^{c/2}\ll T^{1/2}$ for DP). Details in §5.3.

Discussion of assumptions.
  • Boundedness. All previous privacy analyses are oblivious to any sort of diameter bound and therefore unavoidably increase ad infinitum (for strongly convex losses, boundedness is unnecessary for convergent privacy; see §5.1), c.f. our lower bound in Theorem 1.3. Our analysis is the first to exploit boundedness: whereas previous analyses can only argue that having a smaller constraint set does not worsen the privacy loss, we show that it strictly improves the privacy loss, in fact making it convergent as $T\to\infty$. See the techniques section §1.3.

    We emphasize that this diameter bound is a mild constraint and is shared by all existing theoretical bounds for utility/optimization. Indeed, every utility guarantee for (non-strongly) convex losses depends on some similar diameter bound; this is inevitable simply due to the difference between the initialization and the optimum. We also mention that one can solve an unconstrained problem by solving constrained problems with norm bounds, paying only a small logarithmic overhead in the number of solves and a constant overhead in the privacy loss using known techniques [29]. Moreover, in many optimization problems, the solution set is naturally constrained by either the problem formulation or the application.

  • Smoothness. The smoothness assumption on the losses can be relaxed if one runs Noisy-SGD on a smoothed version of the objective. This can be done using standard smoothing techniques, e.g., Gaussian convolution smoothing [19, §5.5], or Moreau-Yosida smoothing by replacing gradient steps with proximal steps [6, §4].

  • Convexity. Convexity appears to be essential for privacy bounds that do not increase ad infinitum in $T$. Our analysis exploits convexity to ensure that Noisy-SGD is a contractive process. However, convexity is too restrictive when training deep networks, and it is an interesting question whether the assumption of convexity can be relaxed. Any such result appears to require new techniques—if it is even true. The key challenge is that for any iterative process whose set of reachable fixed points is non-convex, there will be non-contractivity at the boundary between basins of attraction—and this appears to preclude arguments based on Privacy Amplification by Iteration, which at present is the only non-sampling-based technique with a hope of establishing convergent privacy bounds (see the techniques section §1.3).

  • Lipschitzness. This assumption can be safely removed, as the (necessary) assumptions of smoothness and bounded diameter already imply Lipschitzness (with parameter $L=MD$). We state our result this way in order to isolate more simply where this dependence comes from, and also because the Lipschitz parameter $L$ may be much smaller than $MD$.

    Our analysis only uses Lipschitzness to bound gradient sensitivity, just like all previous analyses that use Privacy Amplification by Sampling. For details on this, as well as an example where gradient sensitivity is more useful in practice than Lipschitzness, see §5.4.

  • Mild assumptions on parameters. The lower bound in Theorem 1.3 makes the mild assumption that the diameter $D$ of the decision set is asymptotically larger than both the movement size from one gradient step and the standard deviation of $\bar{T}$ random increments of Gaussian noise $\mathcal{N}(0,\eta^{2}\sigma^{2})$, so that the noise does not overwhelm the learning.

    The upper bound uses the technique of Privacy Amplification by Sampling, and thus inherits two mild assumptions on the parameters that are also inherent in all previous analyses of Noisy-SGD. One is an upper bound on $\alpha$, and the other is a lower bound on the noise $\sigma$ so that one step of Noisy-SGD provides at least constant privacy to any user. These restrictions affect neither the numerical bounds, which can be computed for any $\alpha$ and $\sigma$ (see §2.3), nor the asymptotic $(\varepsilon,\delta)$-DP bounds in most parameter regimes of interest. We also remark that our results require no such restriction if the batches are chosen cyclically, and thus also if full-batch gradients are used (see Theorem 5.8).

1.3 Techniques

Upper bound on privacy.

Our analysis technique departs from recent approaches based on fast mixing. This allows us to bypass the many longstanding technical challenges discussed in §1.1.

Instead, our analysis combines the techniques of Privacy Amplification by Iteration and Privacy Amplification by Sampling. As discussed in §1.1, previous analyses based on these techniques yield loose privacy bounds that increase ad infinitum in $T$. Indeed, bounds based on Privacy Amplification by Sampling inevitably diverge since they "pay" for releasing the entire sequence of $T$ iterates, each of which leaks more information about the private data [5, 2]. Privacy Amplification by Iteration avoids releasing the entire sequence by directly arguing about the final iterate; however, previous arguments based on this technique are unable to exploit a diameter bound on the constraint set, and therefore inevitably lead to privacy bounds which grow unboundedly in $T$ [19].

Our analysis hinges on the observation that when the iterates are constrained to a bounded domain, we can combine the techniques of Privacy Amplification by Iteration and Privacy Amplification by Sampling in order to only pay for the privacy loss from the final $\bar{T}\asymp(Dn)/(L\eta)$ iterations. To explain this, first recall that by definition, differential privacy of Noisy-SGD means that the final iterate of Noisy-SGD is similar when run on two (adjacent but) different datasets from the same initialization. At an intuitive level, we establish this by arguing about the privacy of the following three scenarios:

  • (i) Run Noisy-SGD for $T$ iterations on different datasets from the same initialization.

  • (ii) Run Noisy-SGD for $\bar{T}$ iterations on different datasets from different initializations.

  • (iii) Run Noisy-SGD for $\bar{T}$ iterations on the same dataset from different initializations.

In order to analyze (i), which as mentioned is the definition of Noisy-SGD being differentially private, we argue that no matter how different the two input datasets are, the two Noisy-SGD iterates at iteration $T-\bar{T}$ are within distance $D$ of each other, since they lie within the constraint set $\mathcal{K}$ of diameter $D$. Thus in particular, the privacy of scenario (i) is at most the privacy of scenario (ii), which has initializations that can be arbitrarily different so long as they are within distance $D$ of each other. In this way, we arrive at a scenario which is independent of $T$. This is crucial for obtaining privacy bounds which do not diverge in $T$; however, previous privacy analyses could not proceed in this way because they could not argue about different initializations.

In order to analyze (ii), we use the technique of Privacy Amplification by Sampling—but only on these final $\bar{T}\ll T$ iterations. In words, this enables us to argue that the noisy gradients that Noisy-SGD uses in the last $\bar{T}$ iterations are indistinguishable up to a certain privacy loss (to be precise, the RDP scales linearly in $\bar{T}$), despite the fact that they are computed on different input datasets. This enables us to reduce analyzing the privacy of scenario (ii) to scenario (iii).

In order to analyze (iii), we use a new diameter-aware version of Privacy Amplification by Iteration (Proposition 3.2). In words, this shows that running the same Noisy-SGD updates on two different initializations masks their initial difference, due to both the noise and the contraction maps that define each Noisy-SGD update. Of course, this indistinguishability improves the longer that Noisy-SGD is run (to be precise, we show that the RDP scales inversely in $\bar{T}$). (This step of the analysis has a natural interpretation in terms of the mixing time of the Langevin Algorithm to its biased stationary distribution; this connection and its implications are explored in detail in [3].)

We arrive, therefore, at a privacy loss which is the sum of the bounds from the Privacy Amplification by Sampling argument in step (ii) and the Privacy Amplification by Iteration argument in step (iii). This privacy loss has a natural tradeoff in the parameter $\bar{T}$: the former bound leads to a privacy loss that increases in $\bar{T}$ (since it pays for the release of $\bar{T}$ noisy gradients), whereas the latter bound leads to a privacy loss that decreases in $\bar{T}$ (since more iterations of the same Noisy-SGD process enable better masking of different initializations). Balancing these two privacy bounds leads to the final choice $\bar{T}\asymp Dn/(L\eta)$. (We remark that this quantity is not just an artefact of our analysis, but is in fact the true answer for when the privacy loss stops increasing, as established by our matching lower bound.)
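
To see where this choice comes from, here is the balancing calculation in rough form (suppressing constants; the two terms are made precise in §3). The sampling argument over the last $\bar{T}$ iterations contributes roughly $\alpha L^{2}\bar{T}/(n^{2}\sigma^{2})$, i.e., $\bar{T}$ steps of (1.1), while the diameter-aware iteration argument contributes roughly $\alpha D^{2}/(\eta^{2}\sigma^{2}\bar{T})$, amortizing the initial shift $D$ over $\bar{T}$ steps of per-step noise $\eta\sigma$. Setting these equal:

\frac{\alpha L^{2}}{n^{2}\sigma^{2}}\,\bar{T}\;=\;\frac{\alpha D^{2}}{\eta^{2}\sigma^{2}\,\bar{T}}\qquad\Longleftrightarrow\qquad\bar{T}\;=\;\frac{Dn}{L\eta},

at which point both terms, and hence the total privacy loss, are of order $\alpha LD/(n\eta\sigma^{2})$, matching (1.2) at $T=\bar{T}$.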

At a high level, our analysis proceeds by carefully running these arguments in parallel.

We emphasize that this analysis is the first to show that the privacy loss of Noisy-SGD can strictly improve if the constraint set is made smaller. In contrast, previous analyses only argue that restricting the constraint set cannot worsen the privacy loss (e.g., by using a post-processing inequality to analyze the projection step). The key technical challenge in exploiting a diameter bound is dealing with the complicated non-linearities that arise when interleaving projections with noisy gradient updates. Our techniques enable such an analysis.

Lower bound on privacy.

We construct two adjacent datasets for which the corresponding Noisy-SGD processes are random walks—one symmetric, one biased—that are confined to an interval of length $D$. In this way, we reduce the question of how large the privacy loss of Noisy-SGD is to the question of how distinguishable a constrained symmetric random walk is from a constrained biased random walk. The core technical challenge is that the distributions of the iterates of these random walks are intractable to reason about explicitly—due to the highly non-linear interactions between the projections and the random increments. Briefly, our key technique here is a way to modify these processes so that, on the one hand, their distinguishability is essentially the same, and on the other hand, no projections occur with high probability—allowing us to explicitly compute their distributions and thus also their distinguishability. Details in §4.

1.4 Other related work

Private sampling.

The mixing time of (stochastic) Langevin Dynamics has been extensively studied in recent years, starting with [13, 15]. A recent focus in this vein is analyzing mixing in more stringent notions of distance, such as the Rényi divergence [43, 22], in part because this is necessary for proving privacy bounds. In addition to the aforementioned results, several other works study sampling from the distribution $\exp(\varepsilon\sum_{i}f_{i}(\omega))$ or its regularized versions from a privacy viewpoint. [31] proposed a mechanism of this kind that works for unbounded $f$ and showed $(\varepsilon,\delta)$-DP. Recently, [24] gave better privacy bounds for such a regularized exponential mechanism, and designed an efficient sampler based only on function evaluation. Also, [23] showed that the continuous Langevin Diffusion has optimal utility bounds for various private optimization problems.

As alluded to in §1.1, there are several core issues with trying to prove DP bounds for Noisy-SGD by directly combining “fast mixing” bounds with “private once mixed” bounds. First, mixing results typically do not apply, e.g., since DP requires mixing in stringent divergences like Rényi, or because realistic settings with constraints, stochasticity, and lack of strong convexity are difficult to analyze—indeed, understanding the mixing time for such settings is a challenging open problem. Second, even when fast mixing bounds do apply, directly combining them with “private once mixed” bounds unavoidably leads to DP bounds that are loose to the point of being essentially useless (e.g., the inevitable dimension dependence in mixing bounds to the stationary distribution of the continuous-time Langevin Diffusion would lead to dimension dependence in DP bounds, which should not occur—as we show). Third, even if a Markov chain were private after mixing, one cannot conclude from this that it is private beforehand—indeed, there are simple Markov chains which are private after mixing, yet are exponentially non-private beforehand [21].

Utility bounds.

In the field of private optimization, one separately analyzes two properties of algorithms: (i) the privacy loss as a function of the number of iterations, and (ii) the utility (a.k.a., optimization error) as a function of the number of iterations. These two properties can then be combined to obtain privacy-utility tradeoffs. The purpose of this paper is to completely resolve (i); this result can then be combined with any bound on (ii).

Utility bounds for SGD are well understood [39], and these analyses have enabled understanding utility bounds for Noisy-SGD in empirical [5] and population [6] settings. However, there is a big difference between the minimax-optimal utility bounds in theory and what is desired in practice. Indeed, while in theory a single pass of Noisy-SGD achieves the minimax-optimal population risk [20], in practice Noisy-SGD benefits from running longer to get more accurate training. In fact, this divergence is well-documented even for non-private SGD, where one epoch is minimax-optimal in theory, but in practice more epochs help. Said simply, this is because typical problems are not worst-case problems (i.e., minimax-optimal theoretical bounds are typically not representative of practice). For these practical settings, in order to run Noisy-SGD longer, it is essential to have privacy bounds which do not increase ad infinitum. Our paper resolves this.

1.5 Organization

Section 2 recalls relevant preliminaries. Our main result Theorem 1.3 is proved in Section 3 (upper bound) and Section 4 (lower bound). Section 5 contains extensions to strongly convex losses, cyclic batch updates, and non-uniform stepsizes. §6 concludes with future research directions motivated by these results. For brevity, several proofs are deferred to Appendix A. For simplicity, we do not optimize constants in the main text; this is done in Appendix B.

2 Preliminaries

In this section, we recall relevant preliminaries about convex optimization (§2.1), differential privacy (§2.2), and two by-now-standard techniques for analyzing the privacy of optimization algorithms—namely, Privacy Amplification by Sampling (§2.3) and Privacy Amplification by Iteration (§2.4).

Notation.

We write $\mathbb{P}_{X}$ to denote the law of a random variable $X$, and $\mathbb{P}_{X|Y=y}$ to denote the law of $X$ given the event $Y=y$. We write $Z_{S:T}$ as shorthand for the vector concatenating $Z_{S},\dots,Z_{T}$. We write $f_{\#}\mu$ to denote the pushforward of a distribution $\mu$ under a function $f$, i.e., the law of $f(X)$ where $X\sim\mu$. We abuse notation slightly by writing $f_{\#}\mu$ even when $f$ is a "random function", by which we mean a Markov transition kernel. We write $\mu\ast\nu$ to denote the convolution of probability distributions $\mu$ and $\nu$, i.e., the law of $X+Y$ where $X\sim\mu$ and $Y\sim\nu$ are independently distributed. We write $\lambda\mu+(1-\lambda)\nu$ to denote the mixture distribution that is $\mu$ with probability $\lambda$ and $\nu$ with probability $1-\lambda$. We write $\mathcal{A}(\mathcal{X})$ to denote the output of an algorithm $\mathcal{A}$ run on input $\mathcal{X}$; this is a probability distribution if $\mathcal{A}$ is a randomized algorithm.

2.1 Convex optimization

Throughout, the loss function corresponding to a data point $x_{i}$ or $x_{i}^{\prime}$ is denoted by $f_{i}$ or $f_{i}^{\prime}$, respectively. While the dependence on the data point is arbitrary, the loss functions are assumed to be convex in the argument $\omega\in\mathcal{K}$ (a.k.a., the machine learning model we seek to train). Throughout, the set $\mathcal{K}$ of possible models is a convex subset of $\mathbb{R}^{d}$. In order to establish both optimization and privacy guarantees, two additional assumptions are required on the loss functions. Recall that a differentiable function $g:\mathcal{K}\to\mathbb{R}$ is said to be:

  • $L$-Lipschitz if $\|g(\omega)-g(\omega^{\prime})\|\leqslant L\|\omega-\omega^{\prime}\|$ for all $\omega,\omega^{\prime}\in\mathcal{K}$.

  • $M$-smooth if $\|\nabla g(\omega)-\nabla g(\omega^{\prime})\|\leqslant M\|\omega-\omega^{\prime}\|$ for all $\omega,\omega^{\prime}\in\mathcal{K}$.

An important implication of smoothness is the following well-known fact; for a proof, see e.g., [35].

Lemma 2.1 (Small gradient steps on smooth convex losses are contractions).

Suppose $f:\mathbb{R}^{d}\to\mathbb{R}$ is a convex, $M$-smooth function. For any $\eta\leqslant 2/M$, the mapping $\omega\mapsto\omega-\eta\nabla f(\omega)$ is a contraction.
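
As a quick numerical sanity check of Lemma 2.1 (illustrative only, not part of the formal development), one can verify contractivity of the gradient step on a randomly drawn smooth convex quadratic, whose gradient and smoothness constant are available in closed form:

```python
# Sanity check of Lemma 2.1 on f(w) = 0.5 w^T A w with A PSD,
# so grad f(w) = A w and the smoothness constant is M = lambda_max(A).
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T                                    # PSD Hessian
M = np.linalg.eigvalsh(A).max()                # smoothness constant
eta = 2.0 / M                                  # largest stepsize allowed by the lemma

w1, w2 = rng.standard_normal(5), rng.standard_normal(5)
step = lambda w: w - eta * (A @ w)             # gradient step w -> w - eta grad f(w)
ratio = np.linalg.norm(step(w1) - step(w2)) / np.linalg.norm(w1 - w2)
print(ratio <= 1 + 1e-12)                      # True: the step is a contraction
```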

2.2 Differential privacy and Rényi differential privacy

Over the past two decades, differential privacy (DP) has become a standard approach for quantifying how much sensitive information an algorithm leaks about the dataset it is run on. Differential privacy was first proposed in [17], and is now widely used in industry (e.g., at Apple [36], Google [18], Microsoft [14], and LinkedIn [37]), as well as in government data collection such as the 2020 US Census [44]. In words, DP measures how distinguishable the output of an algorithm is when run on two "adjacent" datasets, i.e., two datasets which differ on at most one datapoint. (This is sometimes called "replace adjacency". The other canonical notion of dataset adjacency is "remove adjacency", which means that one dataset is the other plus an additional datapoint. All results in this paper extend immediately to both settings; for details see Remark B.2.)

Definition 2.2 (Differential Privacy).

A randomized algorithm $\mathcal{A}$ satisfies $(\varepsilon,\delta)$-DP if for any two adjacent datasets $\mathcal{X},\mathcal{X}^{\prime}$ and any measurable event $S$,

\mathbb{P}\left[\mathcal{A}(\mathcal{X})\in S\right]\leqslant e^{\varepsilon}\,\mathbb{P}\left[\mathcal{A}(\mathcal{X}^{\prime})\in S\right]+\delta.

In order to prove DP guarantees, we work with the related notion of Rényi Differential Privacy (RDP) introduced by [32], since RDP is more amenable to our analysis techniques and is readily converted to DP guarantees. To define RDP, we first recall the definition of Rényi divergence.

Definition 2.3 (Rényi divergence).

The Rényi divergence between two probability measures $\mu$ and $\nu$ of order $\alpha\in(1,\infty)$ is

\mathcal{D}_{\alpha}\left(\mu\;\|\;\nu\right)=\frac{1}{\alpha-1}\log\int\left(\frac{\mu(x)}{\nu(x)}\right)^{\alpha}\nu(x)\,dx,

if $\mu\ll\nu$, and is $\infty$ otherwise. Here we adopt the standard convention that $0/0=0$ and $x/0=\infty$ for $x>0$. The Rényi divergences of order $\alpha\in\{1,\infty\}$ are defined by continuity.
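
For intuition, recall the standard closed form for Gaussians with a common isotropic covariance (this is, e.g., the source of the $\alpha a^{2}/(2\sigma^{2})$ terms appearing later in Lemma 2.15 and Proposition 2.17): for any $\mu_{1},\mu_{2}\in\mathbb{R}^{d}$ and $\sigma>0$,

\mathcal{D}_{\alpha}\left(\mathcal{N}(\mu_{1},\sigma^{2}I_{d})\;\|\;\mathcal{N}(\mu_{2},\sigma^{2}I_{d})\right)=\frac{\alpha\,\|\mu_{1}-\mu_{2}\|^{2}}{2\sigma^{2}}.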

Definition 2.4 (Rényi Differential Privacy).

A randomized algorithm $\mathcal{A}$ satisfies $(\alpha,\varepsilon)$-RDP if for any two adjacent datasets $\mathcal{X},\mathcal{X}^{\prime}$,

\mathcal{D}_{\alpha}\left(\mathcal{A}(\mathcal{X})\;\|\;\mathcal{A}(\mathcal{X}^{\prime})\right)\leqslant\varepsilon.

It is straightforward to convert an RDP bound to a DP bound as follows [32, Proposition 1].

Lemma 2.5 (RDP-to-DP conversion).

Suppose that an algorithm satisfies $(\alpha,\varepsilon_{\alpha})$-RDP for some $\alpha>1$. Then the algorithm satisfies $(\varepsilon_{\delta},\delta)$-DP for any $\delta\in(0,1)$ and $\varepsilon_{\delta}=\varepsilon_{\alpha}+\log(1/\delta)/(\alpha-1)$.
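
Here is a minimal sketch of this conversion in Python, assuming the RDP curve $\alpha\mapsto\varepsilon_{\alpha}$ is available as a callable (the grid of $\alpha$ values is an arbitrary illustrative choice):

```python
# Sketch of Lemma 2.5: convert an RDP curve to an (eps, delta)-DP guarantee
# by optimizing the conversion over the Renyi order alpha on a grid.
import math

def rdp_to_dp(rdp_curve, delta, alphas=None):
    alphas = alphas or [1 + x / 10 for x in range(1, 1000)]  # alpha in (1, 101)
    return min(rdp_curve(a) + math.log(1 / delta) / (a - 1) for a in alphas)

# Example with a Gaussian-mechanism-style curve eps_alpha = alpha * rho:
rho = 0.01
print(rdp_to_dp(lambda a: a * rho, delta=1e-6))  # ~ 2 * sqrt(rho * log(1/delta))
```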

In the following, we recall three basic properties of RDP that we use repeatedly in our analysis. The first property regards convexity. While the KL divergence (the case $\alpha=1$) is jointly convex in its two arguments, for $\alpha\neq 1$ the Rényi divergence is only jointly quasi-convex. See e.g., [42, Theorem 13] for a proof and a further discussion of partial convexity properties of the Rényi divergence.

Lemma 2.6 (Joint quasi-convexity of Rényi divergence).

For any Rényi parameter $\alpha\geqslant 1$, any mixing probability $\lambda\in[0,1]$, and any two pairs of probability distributions $\mu,\nu$ and $\mu^{\prime},\nu^{\prime}$,

\mathcal{D}_{\alpha}\left(\lambda\mu+(1-\lambda)\mu^{\prime}\;\|\;\lambda\nu+(1-\lambda)\nu^{\prime}\right)\leqslant\max\left\{\mathcal{D}_{\alpha}\left(\mu\;\|\;\nu\right),\mathcal{D}_{\alpha}\left(\mu^{\prime}\;\|\;\nu^{\prime}\right)\right\}.

The second property states that pushing two measures forward through the same (possibly random) function cannot increase the Rényi divergence. This is the direct analog of the classic “data-processing inequality” for the KL divergence. See e.g., [42, Theorem 9] for a proof.

Lemma 2.7 (Post-processing property of Rényi divergence).

For any Rényi parameter $\alpha\geqslant 1$, any (possibly random) function $h$, and any probability distributions $\mu,\nu$,

\mathcal{D}_{\alpha}\left(h_{\#}\mu\;\|\;h_{\#}\nu\right)\leqslant\mathcal{D}_{\alpha}\left(\mu\;\|\;\nu\right).

The third property is the appropriate analog of the chain rule for the KL divergence (the direct analog of the chain rule does not hold for Rényi divergences of order $\alpha\neq 1$). This bound has appeared in various forms in previous work (e.g., [2, 32, 16]); we state the following two equivalent versions of this bound since at various points in our analysis it is more convenient to reference one or the other. A proof of the former version is in [32, Proposition 3]. The proof of the latter is similar to [2, Theorem 2]; for the convenience of the reader, we provide a brief proof in Section A.1.

Lemma 2.8 (Strong composition for RDP, v1).

Suppose algorithms $\mathcal{A}_{1}$ and $\mathcal{A}_{2}$ satisfy $(\alpha,\varepsilon_{1})$-RDP and $(\alpha,\varepsilon_{2})$-RDP, respectively. Let $\mathcal{A}$ denote the algorithm which, given a dataset $\mathcal{X}$ as input, outputs the composition $\mathcal{A}(\mathcal{X})=(\mathcal{A}_{1}(\mathcal{X}),\mathcal{A}_{2}(\mathcal{X},\mathcal{A}_{1}(\mathcal{X})))$. Then $\mathcal{A}$ satisfies $(\alpha,\varepsilon_{1}+\varepsilon_{2})$-RDP.

Lemma 2.9 (Strong composition for RDP, v2).

For any Rényi parameter $\alpha\geqslant 1$ and any two sequences of random variables $X_{1},\dots,X_{k}$ and $Y_{1},\dots,Y_{k}$,

\mathcal{D}_{\alpha}\left(\mathbb{P}_{X_{1:k}}\,\big\|\,\mathbb{P}_{Y_{1:k}}\right)\leqslant\sum_{i=1}^{k}\sup_{x_{1:i-1}}\mathcal{D}_{\alpha}\left(\mathbb{P}_{X_{i}|X_{1:i-1}=x_{1:i-1}}\;\|\;\mathbb{P}_{Y_{i}|Y_{1:i-1}=x_{1:i-1}}\right).

2.3 Privacy Amplification by Sampling

A core technique in the DP literature is Privacy Amplification by Sampling [27], which quantifies the idea that a private algorithm, run on a small random sample of the input, becomes more private. There are several ways of formalizing this. For our purpose of analyzing Noisy-SGD, we must understand how distinguishable a noisy stochastic gradient update is when run on two adjacent datasets. This is precisely captured by the "Sampled Gaussian Mechanism", which is a composition of two operations: subsampling and additive Gaussian noise. Below we recall a convenient statement of this in terms of RDP from [33]. We start with a preliminary definition in one dimension.

Definition 2.10 (Rényi Divergence of the Sampled Gaussian Mechanism).

For Rényi parameter $\alpha\geqslant 1$, mixing probability $q\in(0,1)$, and noise parameter $\sigma>0$, define

S_{\alpha}(q,\sigma):=\mathcal{D}_{\alpha}\left(\mathcal{N}(0,\sigma^{2})\,\big\|\,(1-q)\,\mathcal{N}(0,\sigma^{2})+q\,\mathcal{N}(1,\sigma^{2})\right).
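
There is no simple closed form for $S_{\alpha}(q,\sigma)$, but it is a one-dimensional integral and hence straightforward to evaluate numerically, c.f. Remark B.1. A quadrature sketch (using SciPy; illustrative only):

```python
# Numerical evaluation of S_alpha(q, sigma) from Definition 2.10 via quadrature.
# Integration is truncated to +-20 sigma to avoid floating-point underflow;
# the truncated tails are negligible here.
import numpy as np
from scipy.integrate import quad

def S(alpha, q, sigma):
    p0 = lambda x: np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    mix = lambda x: (1 - q) * p0(x) + q * p0(x - 1)   # (1-q) N(0,s^2) + q N(1,s^2)
    integrand = lambda x: (p0(x) / mix(x)) ** alpha * mix(x)
    val, _ = quad(integrand, -20 * sigma, 20 * sigma)
    return np.log(val) / (alpha - 1)

# Compare against the asymptotic bound of Lemma 2.12: 2 alpha q^2 / sigma^2.
alpha, q, sigma = 2.0, 0.01, 4.0
print(S(alpha, q, sigma), "<=", 2 * alpha * q**2 / sigma**2)
```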

Next, we provide a simple lemma that extends this notion to higher dimensions. In words, the proof simply argues that the worst-case distribution $\mu$ is a Dirac supported on a point of maximal norm, and then reduces the multivariate setting to the univariate setting via rotational invariance. The proof is similar to [33, Theorem 4]; for convenience, details are provided in Section A.2.

Lemma 2.11 (Extrema for Rényi Divergence of the Sampled Gaussian Mechanism).

For any Rényi parameter $\alpha\geqslant 1$, mixing probability $q\in(0,1)$, noise level $\sigma>0$, dimension $d\in\mathbb{N}$, and radius $R>0$,

\sup_{\mu\in\mathcal{P}(R\mathbb{B}_{d})}\mathcal{D}_{\alpha}\left(\mathcal{N}(0,\sigma^{2}I_{d})\,\big\|\,(1-q)\,\mathcal{N}(0,\sigma^{2}I_{d})+q\left(\mathcal{N}(0,\sigma^{2}I_{d})\ast\mu\right)\right)=S_{\alpha}(q,\sigma/R),

where above $\mathcal{P}(R\mathbb{B}_{d})$ denotes the set of Borel probability distributions that are supported on the ball of radius $R$ in $\mathbb{R}^{d}$.

Finally, we recall the tight bound [33, Theorem 11] on these quantities. This bound restricts to Rényi parameter at most $\alpha^{*}(q,\sigma)$, which is defined to be the largest $\alpha$ satisfying $\alpha\leqslant M\sigma^{2}/2-\log(\sigma^{2})$ and $\alpha\leqslant(M^{2}\sigma^{2}/2-\log(5\sigma^{2}))/(M+\log(q\alpha)+1/(2\sigma^{2}))$, where $M=\log(1+1/(q(\alpha-1)))$. While we use Lemma 2.12 to prove the asymptotics in our theoretical results, we emphasize that (i) our bounds can be computed numerically for any $\alpha\geqslant 1$, see Remark B.1; and (ii) this upper bound does not preclude $\alpha$ from the typical parameter regime of interest, see the discussion in §1.2.

Lemma 2.12 (Bound on Rényi Divergence of the Sampled Gaussian Mechanism).

Consider Rényi parameter $\alpha>1$, mixing probability $q\in(0,1/5)$, and noise level $\sigma\geqslant 4$. If $\alpha\leqslant\alpha^{*}(q,\sigma)$, then

S_{\alpha}(q,\sigma)\leqslant 2\alpha q^{2}/\sigma^{2}.

2.4 Privacy Amplification by Iteration

The Privacy Amplification by Iteration technique of [19] bounds the privacy loss of an iterative algorithm without "releasing" the entire sequence of iterates—unlike arguments based on Privacy Amplification by Sampling, c.f. the discussion in §1.3. This technique applies to processes which are generated by a Contractive Noisy Iteration (CNI). We begin by recalling this definition, albeit in a slightly more general form that allows for two differences. The first difference is allowing the contractions to be random; albeit simple, this generalization is critical for analyzing Noisy-SGD because a stochastic gradient update is random. The second difference is that we project each iterate; this generalization is solely for convenience as it simplifies the exposition. (Since projections are contractive, these processes could be dubbed Contractive Noisy Contractive Iterations (CNCI); however, we use Definition 2.13 as it more closely mirrors previous usages of CNI. Alternatively, since the composition of contractions is a contraction, the projection can be combined with $\phi_{t}$ in order to obtain a bona fide CNI; however, this requires defining auxiliary shifted processes, which leads to a more complicated analysis (this was done in the original arXiv version v1).)

Definition 2.13 (Contractive Noisy Iteration).

Consider a (random) initial state $X_{0}\in\mathbb{R}^{d}$, a sequence of (random) contractions $\phi_{t}:\mathbb{R}^{d}\to\mathbb{R}^{d}$, a sequence of noise distributions $\xi_{t}$, and a convex set $\mathcal{K}$. The Projected Contractive Noisy Iteration $\mathrm{CNI}(X_{0},\{\phi_{t}\},\{\xi_{t}\},\mathcal{K})$ is the law of the final iterate $X_{T}$ of the process

X_{t+1}=\Pi_{\mathcal{K}}\left[\phi_{t+1}(X_{t})+Z_{t+1}\right],

where $Z_{t+1}$ is drawn independently from $\xi_{t+1}$.

Privacy Amplification by Iteration is based on two key lemmas. We recall both below in the case of Gaussian noise since this suffices for our purposes. We begin with a preliminary definition.

Definition 2.14 (Shifted Rényi Divergence; Definition 8 of [19]).

Let $\mu,\nu$ be probability distributions over $\mathbb{R}^{d}$. For parameters $z\geqslant 0$ and $\alpha\geqslant 1$, the shifted Rényi divergence is

\mathcal{D}_{\alpha}^{(z)}\big(\mu\;\|\;\nu\big)=\inf_{\mu^{\prime}\,:\,W_{\infty}(\mu,\mu^{\prime})\leqslant z}\mathcal{D}_{\alpha}\big(\mu^{\prime}\;\|\;\nu\big).

(Recall that the $\infty$-Wasserstein metric $\mathcal{W}_{\infty}(\mu,\mu^{\prime})$ between two distributions $\mu$ and $\mu^{\prime}$ is the smallest real number $w$ for which the following holds: there exists a joint distribution $\mathbb{P}_{X,X^{\prime}}$ with first marginal $X\sim\mu$ and second marginal $X^{\prime}\sim\mu^{\prime}$, under which $\|X-X^{\prime}\|\leqslant w$ almost surely.)

The shift-reduction lemma [19, Lemma 20] bounds the shifted Rényi divergence between two distributions that are convolved with Gaussian noise.

Lemma 2.15 (Shift-reduction lemma).

Let $\mu,\nu$ be probability distributions on $\mathbb{R}^{d}$. For any $a\geqslant 0$,

\mathcal{D}_{\alpha}^{(z)}\left(\mu\ast\mathcal{N}(0,\sigma^{2}I_{d})\;\big\|\;\nu\ast\mathcal{N}(0,\sigma^{2}I_{d})\right)\leqslant\mathcal{D}_{\alpha}^{(z+a)}\left(\mu\;\big\|\;\nu\right)+\frac{\alpha a^{2}}{2\sigma^{2}}.

The contraction-reduction lemma [19, Lemma 21] bounds the shifted Rényi divergence between the pushforwards of two distributions through similar contraction maps. Below we state a slight generalization of [19, Lemma 21] that allows for random contraction maps. The proof of this generalization is similar, except that here we exploit the quasi-convexity of the Rényi divergence to handle the additional randomness; details in Section A.3.

Lemma 2.16 (Contraction-reduction lemma, for random contractions).

Suppose $\phi,\phi^{\prime}$ are random functions from $\mathbb{R}^{d}$ to $\mathbb{R}^{d}$ such that (i) each is a contraction almost surely; and (ii) there exists a coupling of $(\phi,\phi^{\prime})$ under which $\sup_{z}\|\phi(z)-\phi^{\prime}(z)\|\leqslant s$ almost surely. Then for any probability distributions $\mu$ and $\mu^{\prime}$ on $\mathbb{R}^{d}$,

\mathcal{D}_{\alpha}^{(z+s)}\left(\phi_{\#}\mu\;\|\;\phi_{\#}^{\prime}\mu^{\prime}\right)\leqslant\mathcal{D}_{\alpha}^{(z)}\left(\mu\;\|\;\mu^{\prime}\right).

The original Privacy Amplification by Iteration argument combines these two lemmas to establish the following bound. (Strictly speaking, Proposition 2.17 is a generalization of [19, Theorem 22] since it allows for randomized contractions and projections in the CNI, c.f. Definition 2.13. However, the proof is identical, modulo replacing the original contraction-reduction lemma with its randomized generalization (Lemma 2.16) and analyzing the projection step again using the contraction-reduction lemma.)

Proposition 2.17 (Original PABI bound).

Let $X_{T}$ and $X_{T}^{\prime}$ denote the outputs of $\mathrm{CNI}(X_{0},\{\phi_{t}\},\{\xi_{t}\},\mathcal{K})$ and $\mathrm{CNI}(X_{0},\{\phi_{t}^{\prime}\},\{\xi_{t}\},\mathcal{K})$, where $\xi_{t}=\mathcal{N}(0,\sigma_{t}^{2}I_{d})$. Let $s_{t}:=\sup_{x}\|\phi_{t}(x)-\phi_{t}^{\prime}(x)\|$, and consider any sequence $a_{1},\dots,a_{T}$ such that $z_{t}:=\sum_{i=1}^{t}(s_{i}-a_{i})$ is non-negative for all $t$ and satisfies $z_{T}=0$. Then

\mathcal{D}_{\alpha}\left(\mathbb{P}_{X_{T}}\;\|\;\mathbb{P}_{X_{T}^{\prime}}\right)\leqslant\frac{\alpha}{2}\sum_{t=1}^{T}\frac{a_{t}^{2}}{\sigma_{t}^{2}}.
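
As a worked special case (included here for intuition; an immediate instantiation of Proposition 2.17): suppose the two CNIs differ only in a single update, say $s_{t_{0}}=s$ and $s_{t}=0$ for $t\neq t_{0}$, and $\sigma_{t}=\sigma$ for all $t$. Choosing $a_{t}=s/(T-t_{0})$ for $t>t_{0}$ and $a_{t}=0$ otherwise keeps $z_{t}$ non-negative with $z_{T}=0$, giving

\mathcal{D}_{\alpha}\left(\mathbb{P}_{X_{T}}\;\|\;\mathbb{P}_{X_{T}^{\prime}}\right)\leqslant\frac{\alpha}{2}\cdot(T-t_{0})\cdot\frac{s^{2}/(T-t_{0})^{2}}{\sigma^{2}}=\frac{\alpha s^{2}}{2\sigma^{2}(T-t_{0})},

i.e., the privacy loss of the differing update improves the more iterations are run after it.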

3 Upper bound on privacy

In this section, we prove the upper bound in Theorem 1.3. The formal statement of this result is as follows; see §1.2 for a discussion of the mild assumptions on $\sigma$ and $\alpha$.

Theorem 3.1 (Privacy upper bound for Noisy-SGD).

Let $\mathcal{K}\subset\mathbb{R}^{d}$ be a convex set of diameter $D$, and consider optimizing convex losses over $\mathcal{K}$ that are $L$-Lipschitz and $M$-smooth. For any number of iterations $T$, dataset size $n\in\mathbb{N}$, batch size $b\leqslant n$, stepsize $\eta\leqslant 2/M$, noise parameter $\sigma>8\sqrt{2}L/b$, and initialization $\omega_{0}\in\mathcal{K}$, Noisy-SGD satisfies $(\alpha,\varepsilon)$-RDP for $1<\alpha\leqslant\alpha^{*}(\tfrac{b}{n},\tfrac{b\sigma}{2\sqrt{2}L})$ and

\varepsilon\lesssim\frac{\alpha L^{2}}{n^{2}\sigma^{2}}\min\left\{T,\bar{T}\right\},

where $\bar{T}:=\lceil Dn/(L\eta)\rceil$.

Below, in §3.1 we first isolate a simple ingredient in our analysis as it may be of independent interest. Then in §3.2 we prove Theorem 3.1.

3.1 Privacy Amplification by Iteration bounds that are not vacuous as $T\to\infty$

Recall from the preliminaries section §2.4 that Privacy Amplification by Iteration arguments, while tight for a small number of iterations $T$, provide vacuous bounds as $T\to\infty$ (c.f. Proposition 2.17). The following proposition overcomes this by establishing privacy bounds which are independent of the number of iterations $T$. This result only requires additionally assuming that $\|X_{\tau}-X_{\tau}^{\prime}\|$ is bounded at some intermediate time $\tau$. This is a mild assumption that is satisfied automatically if, e.g., both CNI processes are in a constraint set of bounded diameter.

Proposition 3.2 (New PABI bound that is not vacuous as $T\to\infty$).

Let $X_{T}$, $X_{T}^{\prime}$, and $s_{t}$ be as in Proposition 2.17. Consider any $\tau\in\{0,\dots,T-1\}$ and any sequence $a_{\tau+1},\dots,a_{T}$ such that $z_{t}:=D+\sum_{i=\tau+1}^{t}(s_{i}-a_{i})$ is non-negative for all $t$ and satisfies $z_{T}=0$. If $\mathcal{K}$ has diameter $D$, then:

\mathcal{D}_{\alpha}\left(\mathbb{P}_{X_{T}}\;\|\;\mathbb{P}_{X_{T}^{\prime}}\right)\leqslant\frac{\alpha}{2}\sum_{t=\tau+1}^{T}\frac{a_{t}^{2}}{\sigma_{t}^{2}}.

In words, the main idea behind Proposition 3.2 is simply to change the original Privacy Amplification by Iteration argument—which bounds the shifted divergence at iteration $T$ by the shifted divergence at iteration $T-1$, and so on all the way to the shifted divergence at iteration $0$—by instead stopping the induction earlier. Specifically, only unroll to iteration $\tau$, and then use boundedness of the iterates to control the shifted divergence at that intermediate time $\tau$.

We remark that this new version of PABI uses the shift in the shifted Rényi divergence for a different purpose than previous work: rather than just using the shift to bound the bias incurred from updating on two different losses, here we also use the shift to exploit the boundedness of the constraint set.
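
For example, in the setting of scenario (iii) from §1.3 (same update functions, so that $s_{t}=0$ for all $t>\tau$) with constant noise $\sigma_{t}=\sigma$, taking $a_{t}=D/(T-\tau)$ makes $z_{t}$ decrease linearly from $D$ to $z_{T}=0$, and Proposition 3.2 gives

\mathcal{D}_{\alpha}\left(\mathbb{P}_{X_{T}}\;\|\;\mathbb{P}_{X_{T}^{\prime}}\right)\leqslant\frac{\alpha D^{2}}{2\sigma^{2}(T-\tau)},

a bound that is independent of how the two processes reached time $\tau$ and that improves as $T-\tau$ grows.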

Proof of Proposition 3.2.

Bound the divergence at iteration $T$ by the shifted divergence at iteration $T-1$ as follows:

\mathcal{D}_{\alpha}\left(\mathbb{P}_{X_{T}}\;\|\;\mathbb{P}_{X_{T}^{\prime}}\right)
=\mathcal{D}_{\alpha}^{(z_{T})}\left(\mathbb{P}_{X_{T}}\;\|\;\mathbb{P}_{X_{T}^{\prime}}\right)
=\mathcal{D}_{\alpha}^{(z_{T-1}+s_{T}-a_{T})}\left(\mathbb{P}_{\Pi_{\mathcal{K}}[\phi_{T}(X_{T-1})+Z_{T}]}\;\|\;\mathbb{P}_{\Pi_{\mathcal{K}}[\phi_{T}^{\prime}(X_{T-1}^{\prime})+Z_{T}^{\prime}]}\right)
\leqslant\mathcal{D}_{\alpha}^{(z_{T-1}+s_{T}-a_{T})}\left(\mathbb{P}_{\phi_{T}(X_{T-1})+Z_{T}}\;\|\;\mathbb{P}_{\phi_{T}^{\prime}(X_{T-1}^{\prime})+Z_{T}^{\prime}}\right)
\leqslant\mathcal{D}_{\alpha}^{(z_{T-1}+s_{T})}\left(\mathbb{P}_{\phi_{T}(X_{T-1})}\;\|\;\mathbb{P}_{\phi_{T}^{\prime}(X_{T-1}^{\prime})}\right)+\frac{\alpha a_{T}^{2}}{2\sigma_{T}^{2}}
\leqslant\mathcal{D}_{\alpha}^{(z_{T-1})}\left(\mathbb{P}_{X_{T-1}}\;\|\;\mathbb{P}_{X_{T-1}^{\prime}}\right)+\frac{\alpha a_{T}^{2}}{2\sigma_{T}^{2}}.

Above, the first step is because $z_{T}=0$; the second step is by the iterative construction of $X_{T},X_{T}^{\prime},z_{T}$; the third and final steps are by the contraction-reduction lemma (Lemma 2.16); and the penultimate step is by the shift-reduction lemma (Lemma 2.15).

By repeating the above argument, from $T$ to $T-1$ all the way down to $\tau$, we obtain:

\mathcal{D}_{\alpha}\left(\mathbb{P}_{X_{T}}\;\|\;\mathbb{P}_{X_{T}^{\prime}}\right)\leqslant\mathcal{D}_{\alpha}^{(z_{\tau})}\left(\mathbb{P}_{X_{\tau}}\;\|\;\mathbb{P}_{X_{\tau}^{\prime}}\right)+\frac{\alpha}{2}\sum_{t=\tau+1}^{T}\frac{a_{t}^{2}}{\sigma_{t}^{2}}.

Now observe that the shifted Rényi divergence on the right-hand side vanishes because $z_{\tau}=D$: both $X_{\tau}$ and $X_{\tau}^{\prime}$ lie in the set $\mathcal{K}$ of diameter $D$, hence $W_{\infty}(\mathbb{P}_{X_{\tau}},\mathbb{P}_{X_{\tau}^{\prime}})\leqslant D$ and so $\mathcal{D}_{\alpha}^{(z_{\tau})}(\mathbb{P}_{X_{\tau}}\;\|\;\mathbb{P}_{X_{\tau}^{\prime}})=0$. ∎

3.2 Proof of Theorem 3.1

Step 1: Coupling the iterates

Suppose $\mathcal{X}=\{x_{1},\dots,x_{n}\}$ and $\mathcal{X}^{\prime}=\{x_{1}^{\prime},\dots,x_{n}^{\prime}\}$ are adjacent datasets; that is, they agree ($x_{i}=x_{i}^{\prime}$) on all indices $i\in[n]\setminus\{i^{*}\}$, differing in at most one index $i^{*}\in[n]$. Denote the corresponding loss functions by $f_{i}$ and $f_{i}^{\prime}$, where $f_{i}=f_{i}^{\prime}$ except possibly $f_{i^{*}}\neq f_{i^{*}}^{\prime}$. Consider running Noisy-SGD on either dataset $\mathcal{X}$ or $\mathcal{X}^{\prime}$ for $T$ iterations—call the resulting iterates $\{W_{t}\}_{t=0}^{T}$ and $\{W_{t}^{\prime}\}_{t=0}^{T}$, respectively—where we start from the same point $\omega_{0}\in\mathcal{K}$ and couple the sequence of random batches $\{B_{t}\}_{t=0}^{T-1}$ and the random noise injected in each iteration. That is,

W_{t+1}=\Pi_{\mathcal{K}}\left[W_{t}-\frac{\eta}{b}\sum_{i\in B_{t}}\nabla f_{i}(W_{t})+Y_{t}+Z_{t}\right]
W_{t+1}^{\prime}=\Pi_{\mathcal{K}}\left[W_{t}^{\prime}-\frac{\eta}{b}\sum_{i\in B_{t}}\nabla f_{i}(W_{t}^{\prime})+Y_{t}+Z_{t}^{\prime}\right]

for all $t\in\{0,\dots,T-1\}$, where we have split the Gaussian noise into terms $Y_{t}\sim\mathcal{N}(0,\eta^{2}\sigma_{1}^{2}I_{d})$ and $Z_{t}\sim\mathcal{N}(0,\eta^{2}\sigma_{2}^{2}I_{d})$, and where $Z_{t}^{\prime}\sim\mathcal{N}(0,\eta^{2}\sigma_{2}^{2}I_{d})+\frac{\eta}{b}\left[\nabla f_{i^{*}}^{\prime}(W_{t}^{\prime})-\nabla f_{i^{*}}(W_{t}^{\prime})\right]\cdot\mathds{1}_{i^{*}\in B_{t}}$, for any numbers $\sigma_{1},\sigma_{2}>0$ satisfying $\sigma_{1}^{2}+\sigma_{2}^{2}=\sigma^{2}$. (Simply choosing $\sigma_{1}=\sigma_{2}=\sigma/\sqrt{2}$ yields our minimax-optimal asymptotics, and we do this at the end of the proof. However, optimizing the constants in this "noise splitting" is helpful in practice (details in Appendix B), and as such we leave $\sigma_{1}$ and $\sigma_{2}$ as variables until the end of the proof.) In words, this noise-splitting enables us to use the noise for both the Privacy Amplification by Sampling and Privacy Amplification by Iteration arguments below.

Importantly, notice that in the definition of $W_{t+1}^{\prime}$, the gradient is taken w.r.t. loss functions corresponding to dataset $\mathcal{X}$ rather than $\mathcal{X}^{\prime}$; this is then corrected via the bias in the noise term $Z_{t}^{\prime}$. Notice also that this bias term in $Z_{t}^{\prime}$ is only realized (i.e., $Z_{t}^{\prime}$ is possibly non-centered) with probability $b/n$, because the probability that $i^{*}$ is in a random size-$b$ subset of $[n]$ is

\mathbb{P}\left[i^{*}\in B_{t}\right]=\frac{b}{n}. \qquad (3.1)

Step 2: Interpretation as conditional CNI sequences

Observe that, conditional on the event that $Z_{t}$ and $Z_{t}^{\prime}$ are equal (call their common value $z_{t}$), we have

W_{t+1}=\Pi_{\mathcal{K}}\left[\phi_{t}(W_{t})+Y_{t}\right]
W_{t+1}^{\prime}=\Pi_{\mathcal{K}}\left[\phi_{t}(W_{t}^{\prime})+Y_{t}\right]

where

\phi_{t}(\omega):=\omega-\frac{\eta}{b}\sum_{i\in B_{t}}\nabla f_{i}(\omega)+z_{t}. \qquad (3.2)

Since the following lemma establishes that $\phi_{t}$ is contractive, we conclude that, conditional on the event that $Z_{t}=Z_{t}^{\prime}$ for all $t\geqslant\tau$, the processes $\{W_{t}\}_{t\geqslant\tau}$ and $\{W_{t}^{\prime}\}_{t\geqslant\tau}$ are projected CNIs (c.f. Definition 2.13) with respect to the same update functions. Here, $\tau\in\{0,\dots,T-1\}$ is a parameter that will be chosen shortly. Intuitively, $\tau$ is the horizon before which we bound all privacy leakage only through the fact that $W_{\tau},W_{\tau}^{\prime}$ are within distance $D$ of each other; see the proof overview in §1.3.

Observation 3.3.

The function $\phi_{t}$ defined in (3.2) is contractive.

Proof.

For any $\omega,\omega^{\prime}$,

\lVert\phi_{t}(\omega)-\phi_{t}(\omega^{\prime})\rVert
=\Big\lVert\Big(\omega-\frac{\eta}{b}\sum_{i\in B_{t}}\nabla f_{i}(\omega)\Big)-\Big(\omega^{\prime}-\frac{\eta}{b}\sum_{i\in B_{t}}\nabla f_{i}(\omega^{\prime})\Big)\Big\rVert
\leqslant\frac{1}{b}\sum_{i\in B_{t}}\big\lVert\big(\omega-\eta\nabla f_{i}(\omega)\big)-\big(\omega^{\prime}-\eta\nabla f_{i}(\omega^{\prime})\big)\big\rVert
\leqslant\frac{1}{b}\sum_{i\in B_{t}}\lVert\omega-\omega^{\prime}\rVert
=\lVert\omega-\omega^{\prime}\rVert,

by plugging in the definition of $\phi_{t}$ (note that the common shift $z_{t}$ cancels), using the triangle inequality, and then using the fact that the stochastic gradient update $\omega\mapsto\omega-\eta\nabla f_{i}(\omega)$ is a contraction (Lemma 2.1). ∎

Step 3: Bounding the privacy loss

Recall that we seek to upper bound 𝒟α(WTWT)\mathcal{D}_{\alpha}(\mathbb{P}_{W_{T}}\;\|\;\mathbb{P}_{W_{T}^{\prime}}). We argue that:

\mathcal{D}_{\alpha}\left(\mathbb{P}_{W_{T}}\;\big\|\;\mathbb{P}_{W_{T}^{\prime}}\right) \leqslant \mathcal{D}_{\alpha}\left(\mathbb{P}_{W_{T},Z_{\tau:T-1}}\;\big\|\;\mathbb{P}_{W_{T}^{\prime},Z_{\tau:T-1}^{\prime}}\right) \leqslant \underbrace{\mathcal{D}_{\alpha}\left(\mathbb{P}_{Z_{\tau:T-1}}\;\big\|\;\mathbb{P}_{Z_{\tau:T-1}^{\prime}}\right)}_{①} + \underbrace{\sup_{z}\mathcal{D}_{\alpha}\left(\mathbb{P}_{W_{T}|Z_{\tau:T-1}=z}\;\big\|\;\mathbb{P}_{W_{T}^{\prime}|Z_{\tau:T-1}^{\prime}=z}\right)}_{②} \qquad (3.3)

Above, the first step is by the post-processing inequality for the Rényi divergence (Lemma 2.7), and the second step is by the strong composition rule for the Rényi divergence (Lemma 2.9).

Step 3a: Bounding ①, using Privacy Amplification by Sampling. We argue that

① = \mathcal{D}_{\alpha}\left(\mathbb{P}_{Z_{\tau:T-1}}\;\big\|\;\mathbb{P}_{Z_{\tau:T-1}^{\prime}}\right)
\leqslant \sum_{t=\tau}^{T-1}\sup_{z_{\tau:t-1}}\mathcal{D}_{\alpha}\left(\mathbb{P}_{Z_{t}|Z_{\tau:t-1}=z_{\tau:t-1}}\;\big\|\;\mathbb{P}_{Z_{t}^{\prime}|Z_{\tau:t-1}^{\prime}=z_{\tau:t-1}}\right)
= \sum_{t=\tau}^{T-1}\mathcal{D}_{\alpha}\left(\mathcal{N}(0,\eta^{2}\sigma_{2}^{2}I_{d})\;\big\|\;(1-\tfrac{b}{n})\,\mathcal{N}(0,\eta^{2}\sigma_{2}^{2}I_{d})+\tfrac{b}{n}\,\mathcal{N}(m_{t},\eta^{2}\sigma_{2}^{2}I_{d})\right)
\leqslant (T-\tau)\,S_{\alpha}\left(\frac{b}{n},\frac{b\sigma_{2}}{2L}\right). \qquad (3.4)

Above, the first step is the definition of ①. The second step is by the strong composition rule for the Rényi divergence (Lemma 2.9). The third step is because for any $z_{\tau:t-1}$, the law of $Z_{t}$ conditional on $Z_{\tau:t-1}=z_{\tau:t-1}$ is the Gaussian distribution $\mathcal{N}(0,\eta^{2}\sigma_{2}^{2}I_{d})$; and by (3.1), the law of $Z_{t}^{\prime}$ conditional on $Z_{\tau:t-1}^{\prime}=z_{\tau:t-1}$ is the mixture distribution that is $\mathcal{N}(0,\eta^{2}\sigma_{2}^{2}I_{d})$ with probability $1-b/n$, and otherwise is $\mathcal{N}(m_{t},\eta^{2}\sigma_{2}^{2}I_{d})$ where $m_{t}:=\tfrac{\eta}{b}[\nabla f_{i^{*}}^{\prime}(W_{t}^{\prime})-\nabla f_{i^{*}}(W_{t}^{\prime})]$. The final step is by the bound in Lemma 2.11 on the Rényi divergence of the Sampled Gaussian Mechanism, combined with the observation that $\|m_{t}\|\leqslant 2\eta L/b$, which is immediate from the triangle inequality and the $L$-Lipschitzness of the loss functions.
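The quantity $S_{\alpha}(q,s)$ can be evaluated by one-dimensional quadrature. The sketch below (all parameter values are illustrative assumptions) computes one of the $T-\tau$ identical terms in (3.4), working in log-space to avoid underflow in the tails:

```python
import numpy as np
from scipy.stats import norm

def S_alpha(alpha, q, s):
    # D_alpha( N(0,s^2) || (1-q) N(0,s^2) + q N(1,s^2) ), via a Riemann sum
    x, dx = np.linspace(-40 * s, 1 + 40 * s, 10**6, retstep=True)
    lp = norm.logpdf(x, 0, s)
    lm = np.logaddexp(np.log1p(-q) + lp, np.log(q) + norm.logpdf(x, 1, s))
    return np.log(np.sum(np.exp(alpha * lp + (1 - alpha) * lm)) * dx) / (alpha - 1)

alpha, b, n, L, sigma2 = 4.0, 10, 1000, 1.0, 2.0
print(S_alpha(alpha, b / n, b * sigma2 / (2 * L)))   # per-iteration Renyi loss
```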

Step 3b: Bounding ②, using Privacy Amplification by Iteration. As argued in Step 2, conditional on the event that $Z_{t}=Z_{t}^{\prime}$ for all $t\geqslant\tau$, the processes $\{W_{t}\}_{t\geqslant\tau}$ and $\{W_{t}^{\prime}\}_{t\geqslant\tau}$ are projected CNI with respect to the same update functions. Note also that $\|W_{\tau}-W_{\tau}^{\prime}\|\leqslant D$ since the iterates lie in the constraint set $\mathcal{K}$, which has diameter $D$. Therefore we may apply the new Privacy Amplification by Iteration bound (Proposition 3.2) with $s_{t}\equiv 0$ and $a_{t}\equiv D/(T-\tau)$ to obtain:

② = \sup_{z}\mathcal{D}_{\alpha}\left(\mathbb{P}_{W_{T}|Z_{\tau:T-1}=z}\;\big\|\;\mathbb{P}_{W_{T}^{\prime}|Z_{\tau:T-1}^{\prime}=z}\right)\leqslant\frac{\alpha D^{2}}{2\eta^{2}\sigma_{1}^{2}(T-\tau)}. \qquad (3.5)

Step 4: Putting the bounds together

By plugging into (3.3) the bound (3.4) on ① and the bound (3.5) on ②, we conclude that the algorithm is $(\alpha,\varepsilon)$-RDP for

εminτ{0,,T1}{(Tτ)Sα(bn,bσ22L)+αD22η2σ12(Tτ)}\displaystyle\varepsilon\leqslant\min_{\tau\in\{0,\dots,T-1\}}\left\{(T-\tau)\,S_{\alpha}\left(\frac{b}{n},\frac{b\sigma_{2}}{2L}\right)+\frac{\alpha D^{2}}{2\eta^{2}\sigma_{1}^{2}(T-\tau)}\right\} (3.6)

By Lemma 2.12, $S_{\alpha}(\tfrac{b}{n},\tfrac{b\sigma_{2}}{2L})\leqslant 8\alpha(\tfrac{L}{n\sigma_{2}})^{2}$ for $\alpha\leqslant\alpha^{*}(\tfrac{b}{n},\tfrac{b\sigma_{2}}{2L})$ and $\sigma_{2}\geqslant 8L/b$. (Footnote 12: While Lemma 2.12 requires $b<n/5$, the case $b\geqslant n/5$ has an alternate proof that does not require Lemma 2.12 (or any of its assumptions). Specifically, replace (3.4) with the upper bound $\mathcal{D}_{\alpha}(\mathcal{N}(0,\eta^{2}\sigma_{2}^{2}I_{d})\;\|\;\mathcal{N}(m_{t},\eta^{2}\sigma_{2}^{2}I_{d}))=\alpha\|m_{t}\|^{2}/(2\eta^{2}\sigma_{2}^{2})\leqslant 2\alpha L^{2}/(b^{2}\sigma_{2}^{2})$ by using the well-known formula for the Rényi divergence between Gaussians. This is tight up to a constant factor, and the rest of the proof proceeds identically.)

By setting σ1=σ2=σ/2\sigma_{1}=\sigma_{2}=\sigma/\sqrt{2}, we have that up to a constant factor,

εαL2σ2minτ{0,,T1}{(Tτ)n2+D2η2L2(Tτ)}.\displaystyle\varepsilon\lesssim\frac{\alpha L^{2}}{\sigma^{2}}\min_{\tau\in\{0,\dots,T-1\}}\left\{\frac{(T-\tau)}{n^{2}}+\frac{D^{2}}{\eta^{2}L^{2}(T-\tau)}\right\}.

We bound this minimization as

minτ{0,,T1}{(Tτ)n2+D2η2L2(Tτ)}=minR{1,,T}{Rn2+D2η2L2R}DηLn,\min_{\tau\in\{0,\dots,T-1\}}\left\{\frac{(T-\tau)}{n^{2}}+\frac{D^{2}}{\eta^{2}L^{2}(T-\tau)}\right\}=\min_{R\in\{1,\dots,T\}}\left\{\frac{R}{n^{2}}+\frac{D^{2}}{\eta^{2}L^{2}R}\right\}\lesssim\frac{D}{\eta Ln},

where above the first step is by setting R=TτR=T-\tau, and the second step is by setting R=T¯=DnLηR=\bar{T}=\lceil\tfrac{Dn}{L\eta}\rceil (this can be done if TT¯T\gtrsim\bar{T}). Therefore, by combining the above two displays, we obtain

εαL2n2σ2min{T,T¯}.\varepsilon\lesssim\frac{\alpha L^{2}}{n^{2}\sigma^{2}}\min\left\{T,\bar{T}\right\}.

Here the first term in the minimization comes from the simple bound (1.1) which scales linearly in TT. This completes the proof of Theorem 3.1.
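The qualitative behavior of this bound is easy to see numerically. The sketch below evaluates the minimization in (3.6) for illustrative parameters; as a stand-in for the linear-in-$T$ composition bound (1.1) it uses $T$ times the per-step Rényi loss, which is an assumption made only for this illustration. The output grows linearly in $T$ and then plateaus near $\bar{T}$:

```python
import numpy as np

alpha, L, n, sigma, eta, D = 2.0, 1.0, 1000, 4.0, 0.05, 1.0
sigma1 = sigma2 = sigma / np.sqrt(2)
Q = 8 * alpha * (L / (n * sigma2))**2               # Lemma 2.12 bound on S_alpha

def eps(T):
    simple = T * 8 * alpha * (L / (n * sigma))**2   # stand-in for bound (1.1)
    R = np.arange(1, T + 1)                         # R = T - tau in (3.6)
    split = np.min(R * Q + alpha * D**2 / (2 * eta**2 * sigma1**2 * R))
    return min(simple, split)

for T in [10, 100, 1000, 10**4, 10**5]:
    print(T, eps(T))   # linear growth, then a plateau near T_bar = D*n/(L*eta) = 2e4
```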

4 Lower bound on privacy

In this section, we prove the lower bound in Theorem 1.3. This is stated formally below and holds even for linear loss functions in one dimension, in fact even when all but one of the loss functions are zero. See §1.2 for a discussion of the mild assumptions on the diameter.

Theorem 4.1 (Privacy lower bound for Noisy-SGD).

There exist universal constants $c_{D}$, $c_{\sigma}$, $c_{\alpha}$, $\bar{\alpha}$ (Footnote 13: We prove this for $c_{\sigma}=10^{-3}$, $c_{D}=10^{3}$, $c_{\alpha}=10^{-7}$, $\bar{\alpha}=10^{2}$; no attempt has been made to optimize these constants.) and a family of $L$-Lipschitz linear loss functions over the interval $\mathcal{K}=[-D/2,D/2]\subset\mathbb{R}$ such that the following holds. Consider running Noisy-SGD from arbitrary initialization $\omega_{0}$ with any parameters satisfying $D\geqslant c_{D}\eta L$ and $\sigma^{2}\leqslant c_{\sigma}D^{2}/(\eta^{2}\bar{T})$. Then Noisy-SGD is not $(\bar{\alpha},\varepsilon)$-RDP for

εcαα¯L2n2σ2min{T,T¯},\displaystyle\varepsilon\leqslant c_{\alpha}\frac{\bar{\alpha}L^{2}}{n^{2}\sigma^{2}}\min\left\{T,\bar{T}\right\}, (4.1)

where T¯:=0.75DnLη\bar{T}:=0.75\tfrac{Dn}{L\eta}.

Proof sketch of Theorem 4.1.

(Full details in Section A.4.)

Construction. Consider datasets $\mathcal{X}=\{x_{1},\dots,x_{n-1},x_{n}\}$ and $\mathcal{X}^{\prime}=\{x_{1},\dots,x_{n-1},x_{n}^{\prime}\}$, which differ only in their final datapoint, and corresponding loss functions that are all zero, $f_{1}(\cdot)=\dots=f_{n}(\cdot)=0$, except for $f_{n}^{\prime}(\omega)=L(D-\omega)$. Clearly these functions are linear and $L$-Lipschitz. The intuition behind this construction is that running Noisy-SGD on $\mathcal{X}$ or $\mathcal{X}^{\prime}$ generates a random walk that is clamped to stay within the interval $\mathcal{K}$—but with the key difference that running Noisy-SGD on dataset $\mathcal{X}$ generates a symmetric random walk $\{W_{t}\}$, whereas running Noisy-SGD on dataset $\mathcal{X}^{\prime}$ generates a biased random walk $\{W_{t}^{\prime}\}$ that steps to the right with probability $b/n$ each iteration. That is,

Wt+1=Π𝒦[Wt+Zt]andWt+1=Π𝒦[Wt+Yt+Zt]\displaystyle W_{t+1}=\Pi_{\mathcal{K}}\Big{[}W_{t}+Z_{t}\Big{]}\qquad\text{and}\qquad W_{t+1}^{\prime}=\Pi_{\mathcal{K}}\Big{[}W_{t}^{\prime}+Y_{t}+Z_{t}\Big{]}

where the processes are initialized at ω0=ω0=0\omega_{0}=\omega_{0}^{\prime}=0, each random increment Zt𝒩(0,η2σ2)Z_{t}\sim\mathcal{N}(0,\eta^{2}\sigma^{2}) is an independent Gaussian, and each bias YtY_{t} is ηL/b\eta L/b with probability b/nb/n and otherwise is 0.
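A short Monte Carlo experiment makes the distinguishability visible; this is a sketch only, and the parameters below are illustrative assumptions, not tuned to satisfy the theorem's constants in Footnote 13. The symmetric walk ends positive about half the time, while the biased walk is pushed to the top of $\mathcal{K}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, b, L, eta, D, sigma = 100, 1, 1.0, 0.01, 1.0, 0.02
T = int(0.75 * D * n / (L * eta))              # T_bar = 7500 steps
trials = 1000

def run(biased):
    w = np.zeros(trials)
    for _ in range(T):
        Y = (eta * L / b) * (rng.random(trials) < b / n) if biased else 0.0
        Z = eta * sigma * rng.standard_normal(trials)
        w = np.clip(w + Y + Z, -D / 2, D / 2)  # projection onto K = [-D/2, D/2]
    return w

print((run(False) >= 0).mean())   # ~ 1/2, by symmetry of {W_t}
print((run(True)  >= 0).mean())   # >> 1/2: the biased walk {W_t'} ends high
```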

Key obstacle. The high-level intuition behind this construction is simple to state: the bias of the random walk $\{W_{t}^{\prime}\}$ makes it distinguishable (to the minimax-optimal extent, as we show) from the symmetric random walk $\{W_{t}\}$. However, making this intuition precise is challenging because the distributions of the iterates $W_{t},W_{t}^{\prime}$ are intractable to reason about explicitly—due to the highly non-linear interactions between the projections and the random increments. Thus we must establish the distinguishability of $W_{T},W_{T}^{\prime}$ without reasoning explicitly about their distributions.

Key technical ideas. A first, simple observation is that it suffices to prove Theorem 4.1 in the constant RDP regime of TT¯T\geqslant\bar{T}, since the linear RDP regime of TT¯T\leqslant\bar{T} then follows by the strong composition rule for RDP (Lemma 2.8), as described in Appendix A.4. Thus it suffices to show that the final iterates WT,WTW_{T},W_{T}^{\prime} are distinguishable after TT¯T\geqslant\bar{T} iterations.

A natural attempt to distinguish $W_{T},W_{T}^{\prime}$ for large $T$ is to test positivity. This is intuitively plausible because $\mathbb{P}[W_{T}\geqslant 0]=1/2$ by symmetry, whereas we might expect $\mathbb{P}[W_{T}^{\prime}\geqslant 0]\gg 1/2$ since the bias of $W_{T}^{\prime}$ pushes it to the top half of the interval $\mathcal{K}=[-D/2,D/2]$. Such a discrepancy would establish an $(\varepsilon,\delta)$-DP lower bound which, by the standard RDP-to-DP conversion in Lemma 2.5, would imply the desired $(\alpha,\varepsilon)$-RDP lower bound in Theorem 4.1. However, the issue is how to prove the latter statement $\mathbb{P}[W_{T}^{\prime}\geqslant 0]\gg 1/2$ without an explicit handle on the distribution of $W_{T}^{\prime}$.

To this end, the key technical insight is to define an auxiliary process $W_{t}^{\prime\prime}$ which initializes $\bar{T}$ iterations before the final iteration $T$ at the lowest point of $\mathcal{K}$, namely $W_{T-\bar{T}}^{\prime\prime}=-D/2$, and then updates in an analogously biased way as the $W_{t}^{\prime}$ process, except without projections at the bottom of $\mathcal{K}$. That is,

Wt+1′′:=min(Wt′′+Yt+Zt,D/2).\displaystyle W_{t+1}^{\prime\prime}:=\min\left(W_{t}^{\prime\prime}+Y_{t}+Z_{t},D/2\right).

The point is that on one hand, WtW_{t}^{\prime} stochastically dominates Wt′′W_{t}^{\prime\prime}, so that it suffices to show [WT′′0]1/2\mathbb{P}[W_{T}^{\prime\prime}\geqslant 0]\gg 1/2. And on the other hand, Wt′′W_{t}^{\prime\prime} is easy to analyze because, as we show, with overwhelming probability no projections occur. This lack of projections means that, modulo a probability δ\delta event which is irrelevant for the purpose of (ε,δ)(\varepsilon,\delta)-DP bounds, WT′′W_{T}^{\prime\prime} is the sum of (biased) independent Gaussian increments—hence it has a simple explicitly computable distribution: it is Gaussian.

It remains to establish that (i) with high probability no projections occur in the Wt′′W_{t}^{\prime\prime} process, so that the aforementioned Gaussian approximates the law of WT′′W_{T}^{\prime\prime}, and (ii) this Gaussian is positive with probability 1/2\gg 1/2, so that we may conclude the desired DP lower bound. Briefly, the former amounts to bounding the hitting time of a Gaussian random walk, which is a routine application of martingale concentration. And the latter amounts to computing the parameters of the Gaussian. A back-of-the-envelope calculation shows that the total bias in the Wt′′W_{t}^{\prime\prime} process is t=TT¯T1YtT¯ηL/n=3D/4\sum_{t=T-\bar{T}}^{T-1}Y_{t}\approx\bar{T}\eta L/n=3D/4, and conditional on such an event, the Gaussian approximating WT′′W_{T}^{\prime\prime} has mean roughly D/2+3D/4=D/4-D/2+3D/4=D/4 and variance T¯η2σ2D2/1000\bar{T}\eta^{2}\sigma^{2}\leqslant D^{2}/1000, and therefore is positive with probability 1/2\gg 1/2, as desired. Full proof details are provided in Section A.4.

5 Extensions

This section provides details for the extensions mentioned in §1.2. Specifically, we show how our techniques readily extend to strongly convex losses (§5.1), to cyclic batch updates (§5.2), to non-uniform stepsizes (§5.3), and to regularized losses (§5.4). Our techniques readily extend to any combinations of these settings, e.g. Noisy-SGD with cyclic batches and non-uniform stepsizes on strongly convex losses; the details are straightforward and omitted for brevity.

5.1 Strongly convex losses

Here we show a significantly improved privacy loss if the functions are mm-strongly convex. In particular, we show that after only O~(1/(ηm))\tilde{O}(1/(\eta m)) iterations, there is no further privacy loss. Throughout, the notation O~\tilde{O} suppresses logarithmic factors in the relevant parameters.

Theorem 5.1 (Privacy upper bound for Noisy-SGD, in strongly convex setting).

Let 𝒦d\mathcal{K}\subset\mathbb{R}^{d} be a convex set of diameter DD, and consider optimizing losses over 𝒦\mathcal{K} that are LL-Lipschitz, mm-strongly convex, and MM-smooth. For any number of iterations TT, dataset size nn\in\mathbb{N}, batch size bnb\leqslant n, noise parameter σ>82L/b\sigma>8\sqrt{2}L/b, and initialization ω0𝒦\omega_{0}\in\mathcal{K}, Noisy-SGD with stepsize η<2/M\eta<2/M satisfies (α,ε)(\alpha,\varepsilon)-RDP for 1<α<α(bn,bσ22L)1<\alpha<\alpha^{*}(\tfrac{b}{n},\tfrac{b\sigma}{2\sqrt{2}L}) and

εαL2n2σ2min{T,T¯},\displaystyle\varepsilon\lesssim\frac{\alpha L^{2}}{n^{2}\sigma^{2}}\cdot\min\big{\{}T,\bar{T}\big{\}}\,, (5.1)

where T¯=O~(1/log(1/c))\bar{T}=\tilde{O}(1/\log(1/c)) and c:=maxλ{m,M}|1ηλ|<1c:=\max_{\lambda\in\{m,M\}}|1-\eta\lambda|<1.

We make two remarks about this result.

Remark 5.2 (Interpretation of $c$ and $\bar{T}$).

The operational interpretation of $c$ is that it is the contraction coefficient of a gradient descent iteration with stepsize $\eta$ (c.f., Lemma 5.5 below). In the common setting of $\eta\leqslant 1/M$, we have $c=1-\eta m\leqslant\exp(-\eta m)$, whereby

T¯=O~(1log1/c)=O~(1ηm).\bar{T}=\tilde{O}\left(\frac{1}{\log 1/c}\right)=\tilde{O}\left(\frac{1}{\eta m}\right)\,.

Another important example is η=2/(M+m)\eta=2/(M+m), i.e., when the stepsize is optimized to minimize the contraction coefficient cc. In this case, c=(κ1)/(κ+1)e1/κc=(\kappa-1)/(\kappa+1)\leqslant e^{-1/\kappa} where κ:=M/m1\kappa:=M/m\geqslant 1 denotes the condition number, and thus T¯=O~(κ)\bar{T}=\tilde{O}(\kappa).
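A two-line computation recovers these asymptotics from the definition of $c$; the constants below are illustrative assumptions (with condition number $\kappa=100$):

```python
import math

m, M = 0.1, 10.0                                  # strong convexity / smoothness
for eta in (1 / M, 2 / (M + m)):
    c = max(abs(1 - eta * m), abs(1 - eta * M))   # contraction factor, Lemma 5.5
    print(eta, c, 1 / math.log(1 / c))            # T_bar ~ 1/log(1/c), up to logs
# eta = 1/M     : c = 1 - 1/kappa,           so 1/log(1/c) ~ kappa = 100
# eta = 2/(M+m) : c = (kappa-1)/(kappa+1),   so 1/log(1/c) ~ kappa/2
```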

Remark 5.3 (Bounded diameter is unnecessary for convergent privacy in the strongly convex setting).

Unlike the convex setting, in this strongly convex setting the bounded diameter assumption can be removed. Specifically, for the purposes of DP (rather than RDP), the logarithmic dependence of $\bar{T}$ on $D$ (hidden in the $\tilde{O}$ above) can be replaced by logarithmic dependence on $T$, $\eta$, $L$, and $\sigma$, since with high probability the SGD trajectories move at most $O(T\eta(L+\sigma))$ from the initialization point, and so this quantity can act as an “effective diameter”. The intuition for why a diameter assumption is needed in the convex setting but not in the strongly convex setting is as follows: without strong convexity, SGD is only weakly contractive, so its instability grows with the number of iterations; with strong convexity, SGD is strongly contractive, so movements from far enough in the past are effectively “erased”—a prerequisite for convergent privacy.

The proof of Theorem 5.1 (the setting of strongly convex losses) is similar to the proof of Theorem 3.1 (the setting of convex losses), except for two key differences:

  • (i)

    Strong convexity of the losses ensures that the update function ϕt\phi_{t} in the Contractive Noisy Iterations is strongly contractive. (Formalized in Observation 5.4.)

  • (ii)

    This strong contractivity of the update function ensures exponentially better bounds in the Privacy Amplification by Iteration argument. (Formalized in Proposition 5.6.)

We first formalize the change (i).

Observation 5.4 (Analog of Observation 3.3 for strongly convex losses).

For all tt, the function ϕt\phi_{t} defined in (3.2) is almost surely cc-contractive, where cc is the quantity in Theorem 5.1.

Proof.

Identical to the proof of Observation 3.3, except use the fact that a gradient step is not simply contractive (Lemma 2.1), but in fact strongly contractive when the function is strongly convex (this fact is recalled in Lemma 5.5 below; see, e.g., [9, Theorem 3.12] for a proof). ∎

Lemma 5.5 (Analog of Lemma 2.1 for strongly convex losses).

Suppose f:df:\mathbb{R}^{d}\to\mathbb{R} is an mm-strongly convex, MM-smooth function. For any stepsize η<2/M\eta<2/M, the mapping ωωηf(ω)\omega\mapsto\omega-\eta\nabla f(\omega) is cc-contractive, for c:=maxλ{m,M}|1ηλ|<1c:=\max_{\lambda\in\{m,M\}}|1-\eta\lambda|<1.

Next we formalize the change (ii).

Proposition 5.6 (Analog of Proposition 3.2 for strongly convex losses).

Consider the setup in Proposition 3.2, and additionally assume that $\phi_{t},\phi_{t}^{\prime}$ are almost surely $c$-contractive. Consider any $\tau\in\{0,\dots,T-1\}$ and any reals $a_{\tau+1},\dots,a_{T}$ such that $z_{t}:=c^{t-\tau}D+\sum_{i=\tau+1}^{t}c^{t-i}(s_{i}-a_{i})$ is non-negative for all $t$ and satisfies $z_{T}=0$. Then:

𝒟α(XTXT)α2t=τ+1Tat2σt2.\mathcal{D}_{\alpha}\left(\mathbb{P}_{X_{T}}\;\|\;\mathbb{P}_{X_{T}^{\prime}}\right)\leqslant\frac{\alpha}{2}\sum_{t=\tau+1}^{T}\frac{a_{t}^{2}}{\sigma_{t}^{2}}.
Proof.

Identical to the proof of Proposition 3.2, except use the contraction-reduction lemma for strong contractions (Lemma 5.7 below rather than Lemma 2.16), and the inductive relation zt+1=czt+st+1at+1z_{t+1}=cz_{t}+s_{t+1}-a_{t+1}. ∎

Lemma 5.7 (Analog of Lemma 2.16 for strongly convex losses).

Consider the setup of Lemma 2.16, except additionally assuming that $\phi,\phi^{\prime}$ are each $c$-contractive almost surely. Then

𝒟α(cz+s)(ϕ#μϕ#μ)𝒟α(z)(μμ).\mathcal{D}_{\alpha}^{(cz+s)}\left(\phi_{\#}\mu\|\phi_{\#}^{\prime}\mu^{\prime}\right)\leqslant\mathcal{D}_{\alpha}^{(z)}\left(\mu\|\mu^{\prime}\right).
Proof.

Identical to the proof of Lemma 2.16, except use the fact that ϕ\phi is almost surely a cc-contraction to bound the second term in (A.1), namely by W(ϕ#ν,ϕ#μ)cW(ν,μ)czW_{\infty}(\phi_{\#}\nu,\phi_{\#}\mu)\leqslant cW_{\infty}(\nu,\mu)\leqslant cz. ∎

We now combine the changes (i) and (ii) to prove Theorem 5.1.

Proof of Theorem 5.1.

We record the differences to Steps 1–4 of the proof in the convex setting (Theorem 3.1). Step 1 is identical for coupling the iterates. Step 2 is identical for constructing the conditional CNI, except that in this strongly convex setting, the update functions $\phi_{t}$ are not merely contractive (Observation 3.3), but in fact $c$-contractive (Observation 5.4). We use this to establish an improved bound on the term ② in Step 3. Specifically, invoke Proposition 5.6 with $s_{t}=0$ for all $t\in\{\tau+1,\dots,T\}$, $a_{t}=0$ for all $t\in\{\tau+1,\dots,T-1\}$, and $a_{T}=c^{T-\tau}D$ to obtain

② = \sup_{z}\mathcal{D}_{\alpha}\left(\mathbb{P}_{W_{T}|Z_{\tau:T-1}=z}\;\big\|\;\mathbb{P}_{W_{T}^{\prime}|Z_{\tau:T-1}^{\prime}=z}\right)\leqslant c^{2(T-\tau)}\,\frac{\alpha D^{2}}{2\eta^{2}\sigma_{1}^{2}}. \qquad (5.2)

This allows us to conclude the following analog of (3.6) for the present strongly convex setting:

εminτ{0,,T1}(Tτ)Q+c2(Tτ)αD22η2σ12,\displaystyle\varepsilon\leqslant\min_{\tau\in\{0,\dots,T-1\}}(T-\tau)Q+c^{2(T-\tau)}\frac{\alpha D^{2}}{2\eta^{2}\sigma_{1}^{2}}, (5.3)

where Q:=Sα(bn,bσ22L)Q:=S_{\alpha}(\tfrac{b}{n},\tfrac{b\sigma_{2}}{2L}). Simplifying this in Step 4 requires the following modifications since the asymptotics are different in the present strongly convex setting. Specifically, bound the above by

εQlog1/clog(αD2log1/cη2σ12Q).\displaystyle\varepsilon\lesssim\frac{Q}{\log 1/c}\log\left(\frac{\alpha D^{2}\log 1/c}{\eta^{2}\sigma_{1}^{2}Q}\right).

This follows by setting $\tau=T-\Theta\big(\log\big((\alpha D^{2}\log 1/c)/(\eta^{2}\sigma_{1}^{2}Q)\big)/(\log 1/c)\big)=T-\tilde{\Theta}(1/(\log 1/c))$, which is valid if $\tau\geqslant 0$. By setting $\sigma_{1}=\sigma_{2}=\sigma/\sqrt{2}$ and using the bound on $Q$ from the proof of Theorem 3.1, we conclude the desired bound

\varepsilon\lesssim\frac{\alpha L^{2}}{n^{2}\sigma^{2}}\min\left\{T,\;\frac{1}{\log 1/c}\log\left(\frac{\alpha D^{2}\log 1/c}{\eta^{2}\sigma_{1}^{2}Q}\right)\right\}=\frac{\alpha L^{2}}{n^{2}\sigma^{2}}\cdot\min\Big\{T,\;\tilde{O}\Big(\frac{1}{\log 1/c}\Big)\Big\}. ∎
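To see the exponential burn-in concretely, one can evaluate the minimization (5.3) directly. The following sketch uses illustrative parameters and replaces $Q$ by its Lemma 2.12 bound (both assumptions for illustration); the privacy loss saturates once $c^{2(T-\tau)}$ kills the second term:

```python
import numpy as np

alpha, L, n, sigma, eta, D, m = 2.0, 1.0, 1000, 4.0, 0.1, 1.0, 0.5
c = 1 - eta * m                                   # contraction factor, eta <= 1/M regime
sigma1 = sigma2 = sigma / np.sqrt(2)
Q = 8 * alpha * (L / (n * sigma2))**2             # stand-in for S_alpha

def eps(T):
    R = np.arange(1, T + 1)                       # R = T - tau in (5.3)
    return np.min(R * Q + c**(2 * R) * alpha * D**2 / (2 * eta**2 * sigma1**2))

for T in [10, 100, 1000]:
    print(T, eps(T))    # saturates after roughly (1/(eta*m)) log(...) iterations
```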

5.2 Cyclic batch updates

Here we show how our results extend to Noisy-SGD when the batches are chosen cyclically, in round-robin fashion, rather than randomly. We call this algorithm Noisy-CGD. The previous best privacy bound for this algorithm is an $(\alpha,\varepsilon)$-RDP upper bound with

εαL2σ2(1b2+Tn2),\displaystyle\varepsilon\lesssim\frac{\alpha L^{2}}{\sigma^{2}}\left(\frac{1}{b^{2}}+\frac{T}{n^{2}}\right), (5.4)

which can be obtained from the standard Privacy Amplification by Iteration argument if one extends the proof of [19, Theorem 35] to arbitrary numbers of iterations TT and batch sizes bb. The key difference in our privacy bound, below, is that it does not increase ad infinitum as TT\to\infty. This is enabled by the new diameter-aware Privacy Amplification by Iteration bound (Proposition 3.2).

Theorem 5.8 (Privacy upper bound for Noisy-CGD).

Let $\mathcal{K}\subset\mathbb{R}^{d}$ be a convex set of diameter $D$, and consider optimizing convex losses over $\mathcal{K}$ that are $L$-Lipschitz and $M$-smooth. For any number of iterations $T\in\mathbb{N}$, stepsize $\eta\leqslant 2/M$, noise parameter $\sigma>0$, initialization $\omega_{0}\in\mathcal{K}$, dataset size $n\in\mathbb{N}$, and batch size $b$, Noisy-CGD satisfies $(\alpha,\varepsilon)$-RDP for any $\alpha>1$ and

εαL2σ2(1b2+1n2min{T,T¯}),\displaystyle\varepsilon\lesssim\frac{\alpha L^{2}}{\sigma^{2}}\left(\frac{1}{b^{2}}+\frac{1}{n^{2}}\,\min\left\{T,\bar{T}\right\}\right)\,, (5.5)

where T¯:=DnLη\bar{T}:=\lceil\frac{Dn}{L\eta}\rceil.

Our analysis of Noisy-CGD is simpler than that of Noisy-SGD: because Noisy-CGD updates in a deterministic order, its analysis requires neither Privacy Amplification by Sampling arguments nor the conditioning in the CNI argument from the proof of Theorem 3.1. This also obviates any assumptions on the Rényi parameter $\alpha$ and noise $\sigma$, because the Privacy Amplification by Sampling bound (Lemma 2.12) is not needed here. On the other hand, despite Noisy-CGD being easier to analyze, this algorithm yields substantially weaker privacy guarantees for small numbers of iterations $T$.

Proof of Theorem 5.8.

We prove Theorem 5.8 in the case T>T¯T>\bar{T}, because as mentioned above, the case TT¯T\leqslant\bar{T} follows from (5.4), which is a straightforward extension of [19, Theorem 35].

Suppose $\mathcal{X}=\{x_{1},\dots,x_{n}\}$ and $\mathcal{X}^{\prime}=\{x_{1}^{\prime},\dots,x_{n}^{\prime}\}$ are adjacent datasets; that is, they agree $x_{i}=x_{i}^{\prime}$ on all indices $i\in[n]\setminus\{i^{*}\}$ except for at most one index $i^{*}\in[n]$. Denote the corresponding loss functions by $f_{i}$ and $f_{i}^{\prime}$, where $f_{i}=f_{i}^{\prime}$ except possibly $f_{i^{*}}\neq f_{i^{*}}^{\prime}$. Consider running Noisy-CGD on either dataset $\mathcal{X}$ or $\mathcal{X}^{\prime}$ for $T$ iterations—call the resulting iterates $\{W_{t}\}_{t=0}^{T}$ and $\{W_{t}^{\prime}\}_{t=0}^{T}$, respectively—where we start from the same point $\omega_{0}\in\mathcal{K}$, use the same (deterministic, cyclic) sequence of batches $\{B_{t}\}_{t=0}^{T-1}$, and couple the random noise injected in each iteration. Then

W_{t+1} = \Pi_{\mathcal{K}}\left[\phi_{t}(W_{t})+Z_{t}\right]
W_{t+1}^{\prime} = \Pi_{\mathcal{K}}\left[\phi_{t}^{\prime}(W_{t}^{\prime})+Z_{t}\right]

where $Z_{t}\sim\mathcal{N}(0,\eta^{2}\sigma^{2}I_{d})$ and

ϕt(ω)\displaystyle\phi_{t}(\omega) :=ωηbiBtfi(ω)\displaystyle:=\omega-\frac{\eta}{b}\sum_{i\in B_{t}}\nabla f_{i}\left(\omega\right)
ϕt(ω)\displaystyle\phi_{t}^{\prime}(\omega^{\prime}) :=ωηbiBtfi(ω)\displaystyle:=\omega^{\prime}-\frac{\eta}{b}\sum_{i\in B_{t}}\nabla f_{i}^{\prime}\left(\omega^{\prime}\right)

Observe that $\phi_{t},\phi_{t}^{\prime}$ are contractions, since a stochastic gradient update on a smooth convex loss is contractive and averaging over a batch preserves contractivity (exactly as in Observation 3.3). Therefore the Noisy-CGD iterates are generated by projected Contractive Noisy Iterations $W_{T}\sim\mathrm{CNI}(\omega_{0},\{\phi_{t}\},\{\xi_{t}\},\mathcal{K})$ and $W_{T}^{\prime}\sim\mathrm{CNI}(\omega_{0},\{\phi_{t}^{\prime}\},\{\xi_{t}\},\mathcal{K})$, where the noise distributions are $\xi_{t}=\mathcal{N}(0,\eta^{2}\sigma^{2}I_{d})$.

Therefore we may invoke the new diameter-aware Privacy Amplification by Iteration bound (Proposition 3.2) using the diameter bound D~:=D+2ηLb\tilde{D}:=D+\frac{2\eta L}{b}. To this end, we now set the parameters sts_{t}, ata_{t}, and τ\tau in that proposition. Bound st:=supωϕt(ω)ϕt(ω)2ηLb 1[iBt]s_{t}:=\sup_{\omega}\|\phi_{t}(\omega)-\phi_{t}^{\prime}(\omega)\|\leqslant\frac{2\eta L}{b}\,\mathds{1}[i^{*}\in B_{t}] by the LL-Lipschitz assumption on the losses. Set τ=TT¯\tau=T-\bar{T} and assume for simplicity of exposition that the batch size bb divides the dataset size nn, and also that n/bn/b divides T,T¯T,\bar{T} so that each datapoint is updated the same number of times in T,T¯T,\bar{T} iterations; the general case is a straightforward extension that just requires additional floors/ceilings. Then set aτ+t=D~/T¯a_{\tau+t}=\tilde{D}/\bar{T} for the first i/b1\lceil i^{*}/b\rceil-1 batches before updating the different datapoint ii^{*}, followed by aτ+t=D~/T¯+2ηL/na_{\tau+t}=\tilde{D}/\bar{T}+2\eta L/n for the next T¯n/b\bar{T}-n/b batches before the final update on ii^{*}, and then aτ+t=D~/T¯+(2ηL)/(b(n/bi/b+1))a_{\tau+t}=\tilde{D}/\bar{T}+(2\eta L)/(b(n/b-\lceil i^{*}/b\rceil+1)) until the final iteration TT. Observe that then zt=D~+i=τ+1t(siai)z_{t}=\tilde{D}+\sum_{i=\tau+1}^{t}(s_{i}-a_{i}) satisfies zt0z_{t}\geqslant 0 for all tt and has final value zT=0z_{T}=0, which lets us invoke Proposition 3.2. Therefore

𝒟α(WTWT)α2η2σ2t=τ+1Tat2.\displaystyle\mathcal{D}_{\alpha}\left(\mathbb{P}_{W_{T}}\;\|\;\mathbb{P}_{W_{T}^{\prime}}\right)\leqslant\frac{\alpha}{2\eta^{2}\sigma^{2}}\sum_{t=\tau+1}^{T}a_{t}^{2}. (5.6)

To simplify, plug in the definition of ata_{t}, use the elementary bound (x+y)22(x2+y2)(x+y)^{2}\leqslant 2(x^{2}+y^{2}), and collect terms to obtain

t=τ+1Tat2\displaystyle\sum_{t=\tau+1}^{T}a_{t}^{2} =(ib1)(D~T¯)2+(T¯nb)(D~T¯+2ηLn)2+(nbib+1)(D~T¯+2ηLb(nbib+1))2\displaystyle=\left(\left\lceil\frac{i^{*}}{b}\right\rceil-1\right)\,\left(\frac{\tilde{D}}{\bar{T}}\right)^{2}+\left(\bar{T}-\frac{n}{b}\right)\left(\frac{\tilde{D}}{\bar{T}}+\frac{2\eta L}{n}\right)^{2}+\left(\frac{n}{b}-\left\lceil\frac{i^{*}}{b}\right\rceil+1\right)\,\left(\frac{\tilde{D}}{\bar{T}}+\frac{2\eta L}{b(\tfrac{n}{b}-\lceil\tfrac{i^{*}}{b}\rceil+1)}\right)^{2}
D~2T¯+η2L2(1b2+T¯n2)\displaystyle\lesssim\frac{\tilde{D}^{2}}{\bar{T}}+\eta^{2}L^{2}\left(\frac{1}{b^{2}}+\frac{\bar{T}}{n^{2}}\right)
η2L2(1b2+T¯n2).\displaystyle\asymp\eta^{2}L^{2}\left(\frac{1}{b^{2}}+\frac{\bar{T}}{n^{2}}\right).

Combining the above two displays yields

𝒟α(WTWT)αL2σ2(1b2+T¯n2).\displaystyle\mathcal{D}_{\alpha}\left(\mathbb{P}_{W_{T}}\;\|\;\mathbb{P}_{W_{T}^{\prime}}\right)\lesssim\frac{\alpha L^{2}}{\sigma^{2}}\left(\frac{1}{b^{2}}+\frac{\bar{T}}{n^{2}}\right).

This proves Theorem 5.8 in the case T>T¯T>\bar{T}, as desired. ∎
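The feasibility of the stepwise schedule $\{a_t\}$ used above (i.e., $z_t \geqslant 0$ throughout and $z_T = 0$) can be checked mechanically. Here is a small sketch with illustrative parameters satisfying the divisibility assumptions ($b \mid n$ and $(n/b) \mid \bar{T}$); all concrete values are assumptions for this check:

```python
import numpy as np

n, b, L, eta, D = 12, 3, 1.0, 0.1, 1.0
i_star = 5                                   # index of the differing datapoint
T_bar = int(np.ceil(D * n / (L * eta)))      # 120; n/b = 4 divides T_bar
k = int(np.ceil(i_star / b))                 # batch within each epoch holding i*
Dt = D + 2 * eta * L / b                     # the enlarged diameter tilde-D

# s_t = 2*eta*L/b exactly at the cyclic batches containing i*
s = np.array([2 * eta * L / b if t % (n // b) == k - 1 else 0.0
              for t in range(T_bar)])
a = np.full(T_bar, Dt / T_bar)
a[k - 1 : T_bar - n // b + k - 1] += 2 * eta * L / n          # middle segment
a[T_bar - n // b + k - 1 :] += (2 * eta * L) / (b * (n / b - k + 1))  # final segment

z = Dt + np.cumsum(s - a)
assert (z >= -1e-9).all() and abs(z[-1]) < 1e-9   # z_t >= 0 and z_T = 0
```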

5.3 Non-uniform stepsizes

The analysis of Noisy-SGD in Theorem 1.3 readily extends to non-uniform stepsize schedules {ηt}t=0T1\{\eta_{t}\}_{t=0}^{T-1}. Assume ηt2/M\eta_{t}\leqslant 2/M for all tt so that the gradient update is contractive (Lemma 2.1). Then the analysis in §3 goes through unchanged, and can be improved by choosing non-uniform ata_{t} in the Privacy Amplification by Iteration argument to bound 2. Specifically, since the bound in Proposition 3.2 scales as

② \leqslant \frac{\alpha}{2\sigma_{1}^{2}}\sum_{t=\tau+1}^{T-1}\frac{a_{t}^{2}}{\eta_{t}^{2}},

the proper choice of $a_{t}$ is no longer the uniform choice $a_{t}\equiv D/(T-\tau-1)$, but instead the non-uniform choice $a_{t}=D\eta_{t}/\sum_{s=\tau+1}^{T-1}\eta_{s}$. This yields the following generalization of the privacy bound (3.6): Noisy-SGD satisfies $(\alpha,\varepsilon)$-RDP for

εminτ{0,,T1}(Tτ)Q+(Tτ1)αD22σ12(t=τ+1T1ηt)2.\displaystyle\varepsilon\leqslant\min_{\tau\in\{0,\dots,T-1\}}(T-\tau)Q+(T-\tau-1)\frac{\alpha D^{2}}{2\sigma_{1}^{2}(\sum_{t=\tau+1}^{T-1}\eta_{t})^{2}}. (5.7)

where Q:=Sα(bn,bσ22L)Q:=S_{\alpha}(\tfrac{b}{n},\tfrac{b\sigma_{2}}{2L}).
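For a concrete stepsize schedule, (5.7) can be optimized over $\tau$ numerically. The sketch below does so for $\eta_t \asymp t^{-c}$; the parameter values are illustrative, and $Q$ is again replaced by its Lemma 2.12 bound as an assumption:

```python
import numpy as np

alpha, L, n, sigma, D, T, c = 2.0, 1.0, 1000, 4.0, 1.0, 5000, 0.5
sigma1 = sigma2 = sigma / np.sqrt(2)
Q = 8 * alpha * (L / (n * sigma2))**2             # stand-in for S_alpha
etas = np.arange(1, T + 1, dtype=float)**(-c)     # eta_t ~ t^(-c)

H = np.cumsum(etas[::-1])[::-1]                   # suffix sums: H[i] = sum_{t>=i} eta_t
best = np.inf
for tau in range(T - 1):
    R, S = T - tau, H[tau + 1]                    # S = sum_{t=tau+1}^{T-1} eta_t
    best = min(best, R * Q + (R - 1) * alpha * D**2 / (2 * sigma1**2 * S**2))
print(best)                                       # the RDP bound (5.7)
```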

Further simplifying this RDP bound (5.7) requires optimizing over τ\tau, and this optimization depends on the decay rate of the stepsize schedule {ηt}\{\eta_{t}\}. As an illustrative example, we demonstrate this optimization for the important special case of ηttc\eta_{t}\asymp t^{-c} for c[0,1)c\in[0,1), which generalizes the constant step-size result in Theorem 3.1.

Example 5.9 (Privacy of Noisy-SGD with polynomially decaying stepsize schedule).

Consider stepsize schedule ηttc\eta_{t}\asymp t^{-c} for c[0,1)c\in[0,1). Then up to a constant factor, the privacy bound (5.7) gives

εαDLTcnσ2\varepsilon\lesssim\alpha\frac{DLT^{c}}{n\sigma^{2}}

by a similar optimization over τ\tau as in the proof of Theorem 3.1, with the only differences here being that we take τ=TR1\tau=T-R-1 where RDTc/σα/QR\asymp DT^{c}/\sigma\sqrt{\alpha/Q} and note that t=τ+1T1ηtRTc\sum_{t=\tau+1}^{T-1}\eta_{t}\asymp RT^{-c} since RTR\ll T. We therefore obtain the final privacy bound of

εαL2n2σ2min{T,DnLTc},\varepsilon\lesssim\frac{\alpha L^{2}}{n^{2}\sigma^{2}}\min\left\{T,\frac{Dn}{L}T^{c}\right\},

where the first term in the minimization is from the simple bound (1.1) which scales linearly in TT.

5.4 Regularization

In order to improve generalization error, a common approach for training machine learning models—whether differentially private or not—is to regularize the loss functions. Specifically, solve

minω𝒦1ni=1n(fi(ω)+R(ω))\min_{\omega\in\mathcal{K}}\frac{1}{n}\sum_{i=1}^{n}\Big{(}f_{i}(\omega)+R(\omega)\Big{)}\,

for some convex regularization function $R$. In order to apply existing optimization and privacy analyses to the regularized problem, one could simply use the relevant parameters (Lipschitzness, smoothness, strong convexity) of the regularized losses $F_{i}:=f_{i}+R$ rather than of $f_{i}$. A standard issue is that the Lipschitz constant degrades after regularization; that is, typically the Lipschitz constant of $F_{i}$ can only be bounded by the Lipschitz constant of $f_{i}$ plus that of $R$.

The purpose of this brief subsection is to point out that when using our privacy analysis, the Lipschitz constant does not degrade after regularization. Specifically, although applying our privacy bounds requires using the strong convexity and smoothness constants of the regularized function $F_{i}$ (as is standard in all existing privacy/optimization analyses), one can use the Lipschitz constant of $f_{i}$ rather than that of $F_{i}$. This is because our analysis only uses Lipschitzness in Step 3a to bound the “gradient sensitivity” $\sup_{\omega}\|\nabla f_{i}(\omega)-\nabla f_{i}^{\prime}(\omega)\|$—i.e., the amount by which gradients can differ between losses from adjacent datasets—and this quantity is invariant under regularization because $\nabla F_{i}(\omega)-\nabla F_{i}^{\prime}(\omega)=\nabla(f_{i}+R)(\omega)-\nabla(f_{i}^{\prime}+R)(\omega)=\nabla f_{i}(\omega)-\nabla f_{i}^{\prime}(\omega)$.

6 Discussion

The results of this paper suggest several natural directions for future work:

  • Clipped gradients? In practical settings, Noisy-SGD implementations sometimes “clip” gradients to force their norms to be small, see e.g., [2]. In the case of generalized linear models, the clipped gradients can be viewed as gradients of an auxiliary convex loss [40], in which case our results can be applied directly. However, in general, clipped gradients do not correspond to gradients of a convex loss, in which case our results (as well as all other works in the literature that aim at proving convergent privacy bounds) do not apply. Can this be remedied?

  • Average iterate? Can similar privacy guarantees be established for the average iterate rather than the last iterate? There are fundamental difficulties in trying to prove this: indeed, the average iterate is provably not as private for Noisy-CGD [6].

  • Adaptive stepsizes? Can similar privacy guarantees be established for optimization algorithms with adaptive stepsizes? The main technical obstacle is the impact of the early gradients on the trajectory through the adaptivity, and one would need to control the privacy loss accumulation due to this. This appears to preclude using our analysis techniques, at least in their current form.

  • Beyond convexity? Convergent privacy bounds break down without convexity. This precludes applicability to deep neural networks. Is there any hope of establishing similar results under some sort of mild non-convexity? Due to simple non-convex counterexamples where the privacy of Noisy-SGD diverges, any such extension would have to make additional structural assumptions on the non-convexity (and also possibly change the Noisy-SGD algorithm), although it is unclear how this would even look. Moreover, this appears to require significant new machinery as our techniques are the only known way to solve the convex problem, and they break down in the non-convex setting (see also the discussion in §1.2).

  • General techniques? Can the analysis techniques developed in this paper be used in other settings? Our techniques readily generalize to any iterative algorithm which interleaves contractive steps and noise convolutions. Such algorithms are common in differentially private optimization, and it would be interesting to apply them to variants of Noisy-SGD.

Acknowledgements.

We are grateful to Hristo Paskov for many insightful conversations.

Appendix A Deferred proofs

A.1 Proof of Lemma 2.9

For α>1\alpha>1,

exp((α1)𝒟α(X1:kY1:k))\displaystyle\exp\Big{(}(\alpha-1)\,\mathcal{D}_{\alpha}\left(\mathbb{P}_{X_{1:k}}\;\|\;\mathbb{P}_{Y_{1:k}}\right)\Big{)} =𝒳1××𝒳k[X1:k(x1:k)Y1:k(x1:k)]α𝑑Y1:k(x1:k)\displaystyle=\int_{\mathcal{X}_{1}\times\cdots\times\mathcal{X}_{k}}\left[\frac{\mathbb{P}_{X_{1:k}}(x_{1:k})}{\mathbb{P}_{Y_{1:k}}(x_{1:k})}\right]^{\alpha}d\mathbb{P}_{Y_{1:k}}(x_{1:k})
=𝒳1××𝒳k[i=1kXi|X1:i1=x1:i1(xi)Yi|Y1:i1=x1:i1(xi)]αi=1kdYi|Y1:i1=x1:i1(xi)\displaystyle=\int_{\mathcal{X}_{1}\times\cdots\times\mathcal{X}_{k}}\left[\prod_{i=1}^{k}\frac{\mathbb{P}_{X_{i}|X_{1:i-1}=x_{1:i-1}}(x_{i})}{\mathbb{P}_{Y_{i}|Y_{1:i-1}=x_{1:i-1}}(x_{i})}\right]^{\alpha}\prod_{i=1}^{k}d\mathbb{P}_{Y_{i}|Y_{1:i-1}=x_{1:i-1}}(x_{i})
i=1ksupx1𝒳1,,xi1𝒳i1𝒳i[Xi|X1:i1=x1:i1(xi)Yi|Y1:i1=x1:i1(xi)]α𝑑Yi|Y1:i1=x1:i1(xi)\displaystyle\leqslant\prod_{i=1}^{k}\sup_{x_{1}\in\mathcal{X}_{1},\dots,x_{i-1}\in\mathcal{X}_{i-1}}\int_{\mathcal{X}_{i}}\left[\frac{\mathbb{P}_{X_{i}|X_{1:i-1}=x_{1:i-1}}(x_{i})}{\mathbb{P}_{Y_{i}|Y_{1:i-1}=x_{1:i-1}}(x_{i})}\right]^{\alpha}d\mathbb{P}_{Y_{i}|Y_{1:i-1}=x_{1:i-1}}(x_{i})
=exp((α1)i=1ksupx1𝒳1,,xi1𝒳i1𝒟α(Xi|X1:i1=x1:i1Yi|Y1:i1=x1:i1)).\displaystyle=\exp\left((\alpha-1)\,\sum_{i=1}^{k}\sup_{x_{1}\in\mathcal{X}_{1},\dots,x_{i-1}\in\mathcal{X}_{i-1}}\mathcal{D}_{\alpha}\left(\mathbb{P}_{X_{i}|X_{1:i-1}=x_{1:i-1}}\;\|\;\mathbb{P}_{Y_{i}|Y_{1:i-1}=x_{1:i-1}}\right)\right).

The remaining case of α=1\alpha=1 follows by continuity (or by the Chain Rule for KL divergence).

A.2 Proof of Lemma 2.11

Observe that the mixture distribution (1q)𝒩(0,σ2Id)+q(𝒩(0,σ2Id)μ)(1-q)\mathcal{N}(0,\sigma^{2}I_{d})+q\left(\mathcal{N}(0,\sigma^{2}I_{d})\ast\mu\right) can be further decomposed as the mixture distribution

(1-q)\mathcal{N}(0,\sigma^{2}I_{d})+q\left(\mathcal{N}(0,\sigma^{2}I_{d})\ast\mu\right)=\int\left[(1-q)\mathcal{N}(0,\sigma^{2}I_{d})+q\mathcal{N}(z,\sigma^{2}I_{d})\right]d\mu(z),

where this integral is with respect to the weak topology of measures. Now argue as follows:

supμ𝒫(R𝔹d)𝒟α(𝒩(0,σ2Id)(1q)𝒩(0,σ2Id)+q(𝒩(0,σ2Id)μ))\displaystyle\;\sup_{\mu\in\mathcal{P}(R\mathbb{B}_{d})}\mathcal{D}_{\alpha}\left(\mathcal{N}(0,\sigma^{2}I_{d})\,\big{\|}\,(1-q)\mathcal{N}(0,\sigma^{2}I_{d})+q\left(\mathcal{N}(0,\sigma^{2}I_{d})\ast\mu\right)\right)
=\;\sup_{\mu\in\mathcal{P}(R\mathbb{B}_{d})}\mathcal{D}_{\alpha}\left(\mathcal{N}(0,\sigma^{2}I_{d})\,\big\|\,\int\left[(1-q)\mathcal{N}(0,\sigma^{2}I_{d})+q\mathcal{N}(z,\sigma^{2}I_{d})\right]d\mu(z)\right)
\displaystyle\leqslant supzd:zR𝒟α(𝒩(0,σ2Id)(1q)𝒩(0,σ2Id)+q𝒩(z,σ2Id))\displaystyle\;\sup_{z\in\mathbb{R}^{d}\;:\;\|z\|\leqslant R}\mathcal{D}_{\alpha}\left(\mathcal{N}(0,\sigma^{2}I_{d})\,\big{\|}\,(1-q)\mathcal{N}(0,\sigma^{2}I_{d})+q\mathcal{N}(z,\sigma^{2}I_{d})\right)
=\displaystyle= supr[0,R]𝒟α(𝒩(0,σ2)(1q)𝒩(0,σ2)+q𝒩(r,σ2))\displaystyle\;\sup_{r\in[0,R]}\mathcal{D}_{\alpha}\left(\mathcal{N}(0,\sigma^{2})\,\big{\|}\,(1-q)\mathcal{N}(0,\sigma^{2})+q\mathcal{N}(r,\sigma^{2})\right)
=\displaystyle= 𝒟α(𝒩(0,σ2)(1q)𝒩(0,σ2)+q𝒩(R,σ2))\displaystyle\;\mathcal{D}_{\alpha}\left(\mathcal{N}(0,\sigma^{2})\,\big{\|}\,(1-q)\mathcal{N}(0,\sigma^{2})+q\mathcal{N}(R,\sigma^{2})\right)
=\displaystyle= 𝒟α(𝒩(0,(σ/R)2)(1q)𝒩(0,(σ/R)2)+q𝒩(1,(σ/R)2))\displaystyle\;\mathcal{D}_{\alpha}\left(\mathcal{N}(0,(\sigma/R)^{2})\,\big{\|}\,(1-q)\mathcal{N}(0,(\sigma/R)^{2})+q\mathcal{N}(1,(\sigma/R)^{2})\right)
=\displaystyle= Sα(q,σ/R).\displaystyle\;S_{\alpha}(q,\sigma/R).

Above, the second step is by the quasi-convexity of the Rényi divergence (Lemma 2.6), the third step is by the rotation-invariance of the Gaussian distribution, the fourth step is because this Rényi divergence is non-decreasing in the shift $r$ (so the supremum over $r\in[0,R]$ is attained at $r=R$), and the fifth step is by a change of variables. Finally, observe that all steps in the above display hold with equality if $\mu$ is taken to be the Dirac distribution at a point of norm $R$.

A.3 Proof of Lemma 2.16

Let ν\nu be a probability distribution that certifies 𝒟α(z)(μμ)\mathcal{D}_{\alpha}^{(z)}\left(\mu\|\mu^{\prime}\right); that is, ν\nu satisfies 𝒟α(z)(μμ)=𝒟α(νμ)\mathcal{D}_{\alpha}^{(z)}(\mu\|\mu^{\prime})=\mathcal{D}_{\alpha}(\nu\|\mu^{\prime}) and W(ν,μ)zW_{\infty}(\nu,\mu)\leqslant z.

We claim that

W(ϕ#ν,ϕ#μ)s+z.\displaystyle W_{\infty}\left(\phi_{\#}^{\prime}\nu,\phi_{\#}\mu\right)\leqslant s+z. (A.1)

To establish this, first use the triangle inequality for the Wasserstein metric WW_{\infty} to bound

W(ϕ#ν,ϕ#μ)W(ϕ#ν,ϕ#ν)+W(ϕ#ν,ϕ#μ).W_{\infty}\left(\phi_{\#}^{\prime}\nu,\phi_{\#}\mu\right)\leqslant W_{\infty}\left(\phi_{\#}^{\prime}\nu,\phi_{\#}\nu\right)+W_{\infty}\left(\phi_{\#}\nu,\phi_{\#}\mu\right).

For the first term, pushforward ν\nu under the promised coupling of (ϕ,ϕ)(\phi,\phi^{\prime}) in order to form a feasible coupling for the optimal transport distance that certifies W(ϕ#ν,ϕ#ν)sW_{\infty}(\phi_{\#}^{\prime}\nu,\phi_{\#}\nu)\leqslant s. For the second term, use the fact that ϕ\phi is almost surely a contraction to bound W(ϕ#ν,ϕ#μ)W(ν,μ)zW_{\infty}(\phi_{\#}\nu,\phi_{\#}\mu)\leqslant W_{\infty}(\nu,\mu)\leqslant z.

Now that we have established (A.1), it follows that ϕ#ν\phi_{\#}^{\prime}\nu is a feasible candidate for the optimization problem 𝒟α(z+s)(ϕ#μϕ#μ)\mathcal{D}_{\alpha}^{(z+s)}(\phi_{\#}\mu\|\phi_{\#}^{\prime}\mu^{\prime}). That is,

𝒟α(z+s)(ϕ#μϕ#μ)𝒟α(ϕ#νϕ#μ).\mathcal{D}_{\alpha}^{(z+s)}(\phi_{\#}\mu\|\phi_{\#}^{\prime}\mu^{\prime})\leqslant\mathcal{D}_{\alpha}(\phi_{\#}^{\prime}\nu\|\phi_{\#}^{\prime}\mu^{\prime}).

By the quasi-convexity of the Rényi divergence (Lemma 2.6), the post-processing inequality for the Rényi divergence (Lemma 2.7), and then the construction of ν\nu,

𝒟α(ϕ#νϕ#μ)suphsupp(ϕ)𝒟α(h#νh#μ)𝒟α(νμ)=𝒟α(z)(μμ).\mathcal{D}_{\alpha}\left(\phi_{\#}^{\prime}\nu\;\|\;\phi_{\#}^{\prime}\mu^{\prime}\right)\leqslant\sup_{h\in\operatorname{supp}(\phi^{\prime})}\mathcal{D}_{\alpha}\left(h_{\#}\nu\;\|\;h_{\#}\mu^{\prime}\right)\leqslant\mathcal{D}_{\alpha}(\nu\|\mu^{\prime})=\mathcal{D}_{\alpha}^{(z)}(\mu\|\mu^{\prime}).

Combining the last two displays completes the proof.

A.4 Proof of Theorem 4.1

Assume TT¯T\geqslant\bar{T} because once the theorem is proven in this setting, then the remaining setting TT¯T\leqslant\bar{T} follows by the strong composition rule for RDP (Lemma 2.8) by equivalently re-interpreting the algorithm Noisy-SGD run once with many iterations as running multiple instantiations of Noisy-SGD, each with few iterations. Consider the choice of constants in Footnote 13; then for TT¯T\geqslant\bar{T}, the quantity in (4.1) is greater than 0.0050.005. So it suffices to show that Noisy-SGD does not satisfy (100,0.005)(100,0.005)-RDP. By the conversion from RDP to DP (Lemma 2.5), plus the calculation 0.005+(log100)/99<0.10.005+(\log 100)/99<0.1, it therefore suffices to show that Noisy-SGD does not satisfy (0.1,0.01)(0.1,0.01)-DP.

Consider the construction of datasets $\mathcal{X},\mathcal{X}^{\prime}$, loss functions, and processes $W_{t},W_{t}^{\prime},W_{t}^{\prime\prime}$ in §4. Then, in order to show that Noisy-SGD does not satisfy $(0.1,0.01)$-DP, it suffices to show that $\mathbb{P}\left[W_{T}^{\prime}\geqslant 0\right]\geqslant e^{0.1}\,\mathbb{P}[W_{T}\geqslant 0]+0.01$. We simplify both sides: for the left-hand side, note that $W_{t}^{\prime}$ stochastically dominates $W_{t}^{\prime\prime}$ for all $t$; and on the right-hand side, note that $\mathbb{P}[W_{T}\geqslant 0]=1/2$ by symmetry of the process $\{W_{t}\}$ around $0$. Therefore it suffices to prove that

[WT′′0]?12e0.1+0.01.\displaystyle\mathbb{P}\left[W_{T}^{\prime\prime}\geqslant 0\right]\overset{\text{?}}{\geqslant}\operatorname{\frac{1}{2}}e^{0.1}+0.01. (A.2)

To this end, we make two observations that collectively formalize the Gaussian approximation described in the proof sketch in §4. Below, let Y=t=TT¯T1YtY=\sum_{t=T-\bar{T}}^{T-1}Y_{t} denote the total bias in the process Wt′′W_{t}^{\prime\prime}, and let EE denote the event that both

  • (i)

    Concentration of bias in the process Wt′′W_{t}^{\prime\prime}: it holds that Y[1±Δ]𝔼YY\in[1\raisebox{0.77498pt}{$\scriptstyle\pm$}\Delta]\cdot\mathbb{E}Y, for Δ=0.15\Delta=0.15.

  • (ii)

    No projections in the process Wt′′W_{t}^{\prime\prime}: it holds that maxt{TT¯,,T}Wt′′<D/2\max_{t\in\{T-\bar{T},\dots,T\}}W_{t}^{\prime\prime}<D/2.

Observation A.1 (EE occurs with large probability).

[E]0.9\mathbb{P}[E]\geqslant 0.9.

Proof.

For item (i) of EE, note that B:=bY/(ηL)=t=TT¯T1𝟙Yt0B:=bY/(\eta L)=\sum_{t=T-\bar{T}}^{T-1}\mathds{1}_{Y_{t}\neq 0} is a binomial random variable with T¯\bar{T} trials, each of which has probability b/nb/n, so BB has expectation 𝔼B=bT¯/n=0.75bD/(Lη)\mathbb{E}B=b\bar{T}/n=0.75bD/(L\eta). Thus, by a standard Chernoff bound (see, e.g., [34, Corollary 4.6]), the probability that (i) does not hold is at most

[item (i) fails]=[B[1±Δ]𝔼B]2exp(Δ2𝔼B3)2exp(0.1520.7510003)0.01.\displaystyle\mathbb{P}\Big{[}\text{item (i) fails}\Big{]}=\mathbb{P}\Big{[}B\notin[1\raisebox{0.77498pt}{$\scriptstyle\pm$}\Delta]\cdot\mathbb{E}B\Big{]}\leqslant 2\exp\left(-\frac{\Delta^{2}\cdot\mathbb{E}B}{3}\right)\leqslant 2\exp\left(-\frac{0.15^{2}\cdot 0.75\cdot 1000}{3}\right)\leqslant 0.01\,.

Next, we show that conditional on (i), item (ii) fails with low probability. To this end, note that (ii) is equivalent to the event that s=TT¯t1(Ys+Zs)<D\sum_{s=T-\bar{T}}^{t-1}(Y_{s}+Z_{s})<D for all tt. Thus because s=TT¯t1YsY(1+Δ)𝔼Y=(1+Δ)0.75D0.9D\sum_{s=T-\bar{T}}^{t-1}Y_{s}\leqslant Y\leqslant(1+\Delta)\mathbb{E}Y=(1+\Delta)0.75D\leqslant 0.9D conditional on event (i), we have that

[item (ii) fails | item (i) holds]\displaystyle\mathbb{P}\Big{[}\text{item (ii) fails }\Big{|}\text{ item (i) holds}\Big{]} [maxt{TT¯,,T}s=TT¯t1Zs0.1D].\displaystyle\leqslant\mathbb{P}\Bigg{[}\max_{t\in\{T-\bar{T},\dots,T\}}\sum_{s=T-\bar{T}}^{t-1}Z_{s}\geqslant 0.1D\Bigg{]}.

Now the latter expression has a simple interpretation: it is the probability that a random walk of length $\bar{T}$ with i.i.d. $\mathcal{N}(0,\eta^{2}\sigma^{2})$ increments ever surpasses $0.1D$. By a standard concentration inequality on the hitting time of a random walk (e.g., see the application of Doob’s Submartingale Inequality on [46, Page 139]), this probability is at most

\cdots\leqslant\exp\left(-\frac{(0.1D)^{2}}{2\bar{T}\eta^{2}\sigma^{2}}\right)\leqslant\exp(-5)\leqslant 0.01.

Putting the above bounds together, we conclude the desired claim:

\mathbb{P}\big[E\big]=\mathbb{P}\big[\text{item (i) holds}\big]\cdot\mathbb{P}\big[\text{item (ii) holds }\big|\text{ item (i) holds}\big]\geqslant(1-0.01)^{2}\geqslant 0.9. ∎

Observation A.2 (Gaussian approximation of WT′′W_{T}^{\prime\prime} conditional on EE).

Denote $Z:=\sum_{t=T-\bar{T}}^{T-1}Z_{t}$. Conditional on the event $E$, it holds that $W_{T}^{\prime\prime}\geqslant Z+D/10$.

Proof.

By item (ii) of the event $E$, no projections occur in the process $\{W_{t}^{\prime\prime}\}$, thus $W_{T}^{\prime\prime}=-D/2+\sum_{t=T-\bar{T}}^{T-1}(Y_{t}+Z_{t})=-D/2+Y+Z$. By item (i), the bias satisfies $Y\geqslant(1-\Delta)\mathbb{E}Y=(1-0.15)\cdot 0.75D\geqslant 0.6D$, whence $W_{T}^{\prime\prime}\geqslant Z-D/2+0.6D\geqslant Z+D/10$. ∎

Next, we show how to combine these two observations in order to approximate WT′′W_{T}^{\prime\prime} by a Gaussian, and from this conclude a lower bound on the probability that WT′′W_{T}^{\prime\prime} is positive. We argue that

[WT′′0]\displaystyle\mathbb{P}[W_{T}^{\prime\prime}\geqslant 0] [WT′′0,E]\displaystyle\geqslant\mathbb{P}[W_{T}^{\prime\prime}\geqslant 0,E]
[Z+D/100,E]\displaystyle\geqslant\mathbb{P}[Z+D/10\geqslant 0,E]
=[Z+D/100][Z+D/100,EC]\displaystyle=\mathbb{P}[Z+D/10\geqslant 0]-\mathbb{P}[Z+D/10\geqslant 0,E^{C}]
[Z+D/100]0.1,\displaystyle\geqslant\mathbb{P}[Z+D/10\geqslant 0]-0.1,

where above the second step is by Observation A.2, and the final step is by Observation A.1. Since ZZ is a centered Gaussian with variance T¯η2σ2cσD2=0.001D2\bar{T}\eta^{2}\sigma^{2}\leqslant c_{\sigma}D^{2}=0.001D^{2}, we have

\mathbb{P}\left[Z+D/10\geqslant 0\right]\geqslant\mathbb{P}\left[\mathcal{N}(0.1D,0.001D^{2})\geqslant 0\right]=\mathbb{P}\left[\mathcal{N}(0,1)\geqslant-0.1/\sqrt{0.001}\right]\geqslant 0.999.

By combining the above two displays, we conclude that

\mathbb{P}\left[W_{T}^{\prime\prime}\geqslant 0\right]\geqslant 0.999-0.1\geqslant\tfrac{1}{2}e^{0.1}+0.01.

This establishes the desired DP lower bound (A.2), and therefore proves the theorem.
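The two numeric claims at the end can be reproduced in a couple of lines; this is a sanity check only, not part of the proof:

```python
from math import erf, exp, sqrt

Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))     # standard Gaussian CDF
lhs = Phi(0.1 / sqrt(0.001)) - 0.1               # P[Z + D/10 >= 0] - P[E^c]
rhs = 0.5 * exp(0.1) + 0.01                      # the target in (A.2)
print(lhs, rhs, lhs >= rhs)                      # ~0.899 >= ~0.563: True
```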

Appendix B Non-asymptotics for numerics

In this paper we have established asymptotic expressions for the privacy loss of Noisy-SGD (and variants thereof). While our results are tight up to a constant factor (proven in §4), constant factors are important for the numerical implementation of private machine learning algorithms. As such, we describe here how to numerically optimize these constants—first for full-batch noisy gradient descent in Section B.1 since this case is simplest, then for Noisy-SGD in Section B.2, and finally for strongly convex losses in Section B.3.

We make two remarks that apply to all of the non-asymptotic bounds in this section.

Remark B.1 (Full range of parameters).

The non-asymptotic bounds in this section hold for all Rényi parameters α1\alpha\geqslant 1 and all noise levels σ>0\sigma>0, even though Theorem 3.1 does not. (This is because the proof of Theorem 3.1 uses Lemma 2.12 to obtain asymptotics, and that lemma only holds for certain α\alpha and σ\sigma).

Remark B.2 (Alternative notions of dataset adjacency).

Differential privacy refers to the indistinguishability of an algorithm’s output when run on “adjacent” datasets, and thus privacy bounds differ depending on the notion of adjacency between datasets. There are two standard such notions: two datasets are “replace adjacent” if they agree on all but one datapoint, and are “remove adjacent” if one dataset equals the other plus an additional datapoint. Throughout this paper, we proved bounds for “replace adjacency”; for “remove adjacency”, the bounds improve by a small constant factor. Specifically, replace occurrences of $2L$ by $L$ in the bounds in this section. (Footnote 14: This factor-of-$2$ improvement is due to the fact that the $s_{t}$ term in the Privacy Amplification by Iteration argument can be improved to $\eta L$ under “remove adjacency”, as opposed to $2\eta L$ under “replace adjacency”.)

B.1 Full-batch updates

We stop the proof of Theorem 5.8 before the final asymptotic simplifications which lose small constant factors. Then full-batch GD satisfies $(\alpha,\varepsilon)$-RDP for

\[
\varepsilon \leqslant \frac{\alpha}{2\eta^{2}\sigma^{2}} \min\left\{ T\left(\frac{2\eta L}{n}\right)^{2},\ \min_{\tilde{T}\in\{1,\dots,T\}} \tilde{T}\left(\frac{\tilde{D}}{\tilde{T}}+\frac{2\eta L}{n}\right)^{2} \right\}. \tag{B.1}
\]

Here, the outer minimization combines two bounds. The second bound comes from (5.6) by applying the new Privacy Amplification by Iteration argument (Proposition 3.2) with $\tau = T-\tilde{T}$, $s_{t} = 2\eta L/n$, and $a_{\tau+t} = \tilde{D}/\tilde{T} + 2\eta L/n$, where $\tilde{D} = D + 2\eta L/n$. The first bound comes from simply applying the original Privacy Amplification by Iteration argument (Proposition 2.17) with $s_{t} = a_{t} = 2\eta L/n$.

Numerical computation of privacy bound.

This bound (B.1) can be computed efficiently: the inner minimum over $\tilde{T}$ is attained at $\tilde{T} = (\tilde{D}n)/(2\eta L)$, modulo a possible floor or ceiling, whenever this value is at most $T$ (otherwise clamp $\tilde{T}$ to the range $\{1,\dots,T\}$).
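
As an illustration, the following Python sketch evaluates (B.1) by this rule; the function name and signature are ours (hypothetical), and the variable names follow the paper’s notation.

```python
import math

def rdp_full_batch_gd(alpha, eta, sigma, L, n, D, T):
    """Sketch of the non-asymptotic RDP bound (B.1) for full-batch noisy GD.

    Notation as in the paper: alpha = Renyi order, eta = stepsize,
    sigma = noise level, L = Lipschitz constant, n = dataset size,
    D = diameter, T = number of iterations.
    """
    s = 2 * eta * L / n        # per-iteration sensitivity (replace adjacency)
    D_tilde = D + s            # \tilde{D} = D + 2*eta*L/n
    # First bound: original Privacy Amplification by Iteration.
    bound = T * s ** 2
    # Second bound: the unconstrained minimizer over \tilde{T} is
    # D_tilde / s = (D_tilde * n) / (2 * eta * L); check floor/ceiling,
    # clamped to the range {1, ..., T}.
    t_star = D_tilde / s
    for t in {max(1, min(T, math.floor(t_star))),
              max(1, min(T, math.ceil(t_star)))}:
        bound = min(bound, t * (D_tilde / t + s) ** 2)
    return alpha / (2 * eta ** 2 * sigma ** 2) * bound
```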

B.2 Small-batch updates

We stop the proof of Theorem 3.1 before the final asymptotic simplifications which lose small constant factors. Then Noisy-SGD satisfies $(\alpha,\varepsilon)$-RDP for

\[
\varepsilon \leqslant \min_{\substack{\sigma_{1},\sigma_{2}\geqslant 0\\ \sigma_{1}^{2}+\sigma_{2}^{2}=\sigma^{2}}} \min\left\{ TQ,\ \min_{\tilde{T}\in\{1,\dots,T-1\}} \tilde{T}Q + \frac{\alpha D^{2}}{2\eta^{2}\sigma_{1}^{2}\tilde{T}} \right\}
\]

where $Q = S_{\alpha}(b/n,\, b\sigma_{2}/(2L))$. Above, the outermost minimization is from the noise-splitting in Step 1 of the proof of Theorem 3.1. The next minimization combines two bounds. The latter bound is (3.6), modulo a change of variables $\tilde{T} = T - \tau$. The former bound is by noting that if $\tau = 0$, then the two Contractive Noisy Iterations in the proof of Theorem 3.1 are identical, whereby the term ② equals $0$. Now, by swapping the outermost minimizations in the above bound, we may simplify to

\[
\varepsilon \leqslant \min\left\{ T\cdot S_{\alpha}\!\left(\frac{b}{n},\frac{b\sigma}{2L}\right),\ \min_{\substack{\sigma_{1},\sigma_{2}\geqslant 0\\ \sigma_{1}^{2}+\sigma_{2}^{2}=\sigma^{2}}}\ \min_{\tilde{T}\in\{1,\dots,T-1\}} \tilde{T}Q + \frac{\alpha D^{2}}{2\eta^{2}\sigma_{1}^{2}\tilde{T}} \right\}. \tag{B.2}
\]

Numerical computation of privacy bound.

This bound (B.2) can be efficiently computed via ternary search over $\sigma_{1}\in[0,\sigma]$, where for each candidate $\sigma_{1}$ we compute $Q$ to high precision using the numerically stable formula in [33], and then evaluate the resulting minimization over $\tilde{T}$ by setting it to $\min\bigl(T-1,\ (D/(\eta\sigma_{1}))\sqrt{\alpha/(2Q)}\bigr)$, modulo a possible floor or ceiling. To solve for the smallest noise $\sigma^{2}$ given an RDP budget, one can use the aforementioned computation of the RDP given $\sigma^{2}$ in order to binary search for $\sigma^{2}$ (since this RDP bound is monotonically decreasing in $\sigma$); both computations are sketched below.
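
The following Python sketch implements this recipe under stated assumptions: `sgm_renyi(q, noise, alpha)` is a placeholder for a routine computing the Sampled Gaussian Mechanism divergence $S_{\alpha}$ (e.g., via the numerically stable formula of [33]), and a simple grid search over $\sigma_{1}$ stands in for ternary search.

```python
import math

def rdp_noisy_sgd(alpha, eta, sigma, L, n, b, D, T, sgm_renyi, grid=100):
    """Sketch of the non-asymptotic RDP bound (B.2) for Noisy-SGD.

    `sgm_renyi(q, noise, alpha)` is an assumed helper returning
    S_alpha(q, noise) for the Sampled Gaussian Mechanism; `b` is the batch size.
    """
    # First bound in (B.2): no noise splitting.
    best = T * sgm_renyi(b / n, b * sigma / (2 * L), alpha)
    # Second bound: split sigma^2 = sigma_1^2 + sigma_2^2 (grid over sigma_1).
    for i in range(1, grid):
        sigma1 = sigma * i / grid
        sigma2 = math.sqrt(max(sigma ** 2 - sigma1 ** 2, 0.0))
        Q = sgm_renyi(b / n, b * sigma2 / (2 * L), alpha)
        # Unconstrained minimizer of t*Q + alpha*D^2 / (2*eta^2*sigma1^2*t).
        t_star = (D / (eta * sigma1)) * math.sqrt(alpha / (2 * Q))
        for t in {max(1, min(T - 1, math.floor(t_star))),
                  max(1, min(T - 1, math.ceil(t_star)))}:
            best = min(best,
                       t * Q + alpha * D ** 2 / (2 * eta ** 2 * sigma1 ** 2 * t))
    return best

def smallest_sigma(eps, alpha, eta, L, n, b, D, T, sgm_renyi, lo=1e-3, hi=1e3):
    """Binary search for the smallest noise meeting an RDP budget `eps`,
    using that the bound above is monotonically decreasing in sigma."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if rdp_noisy_sgd(alpha, eta, mid, L, n, b, D, T, sgm_renyi) <= eps:
            hi = mid
        else:
            lo = mid
    return hi
```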

B.3 Strongly convex losses

We stop the proof of Theorem 5.1 before the final asymptotic simplifications which lose small constant factors, and optimize the parameters $a_{t}$ in the application of the new Privacy Amplification by Iteration argument (Proposition 5.6). (Specifically, rather than simply setting $a_{T} = c^{T-\tau}D$ and the rest to $0$, improve the bound on the term ② in (5.2) to $\alpha/(2\eta^{2}\sigma_{1}^{2})$ times the minimal value of $\sum_{t=\tau+1}^{T}a_{t}^{2}$ when optimizing over $a_{\tau+1},\dots,a_{T}$ subject to $\sum_{t=\tau+1}^{T}c^{\tau-t}a_{t}=D$. This has a closed-form solution: $a_{t}=c^{-t}\beta D$, where $\beta = c^{\tau+2}(1-c^{-2})/(1-c^{-2(T-\tau)})$, yielding value $c^{-\tau}\beta D^{2}$.) Then, as argued in Section B.2, Noisy-SGD satisfies $(\alpha,\varepsilon)$-RDP for

\begin{align*}
\varepsilon &\leqslant \min_{\substack{\sigma_{1},\sigma_{2}\geqslant 0\\ \sigma_{1}^{2}+\sigma_{2}^{2}=\sigma^{2}}} \min\left\{ TQ,\ \min_{\tilde{T}\in\{1,\dots,T-1\}} \tilde{T}Q + \frac{\alpha D^{2}(1-c^{2})}{2\eta^{2}\sigma_{1}^{2}(c^{-2\tilde{T}}-1)} \right\} \\
&= \min\left\{ T\cdot S_{\alpha}\!\left(\frac{b}{n},\frac{b\sigma}{2L}\right),\ \min_{\substack{\sigma_{1},\sigma_{2}\geqslant 0\\ \sigma_{1}^{2}+\sigma_{2}^{2}=\sigma^{2}}}\ \min_{\tilde{T}\in\{1,\dots,T-1\}} \tilde{T}Q + \frac{\alpha D^{2}(1-c^{2})}{2\eta^{2}\sigma_{1}^{2}(c^{-2\tilde{T}}-1)} \right\} \tag{B.3}
\end{align*}

where $Q = S_{\alpha}(b/n,\, b\sigma_{2}/(2L))$.

Numerical computation of privacy bound.

This bound (B.3) can be efficiently computed via ternary search over $\sigma_{1}\in[0,\sigma]$, where for each candidate $\sigma_{1}$ we compute $Q$ to high precision using the numerically stable formula in [33], and then evaluate the resulting minimization over $\tilde{T}$ by using first-order optimality to set it to

\[
\tilde{T} = -\frac{\log\left(\frac{2-\beta+\sqrt{(2-\beta)^{2}-4}}{2}\right)}{2\log c}, \qquad \text{where} \qquad \beta = \frac{(1-c^{2})\,\alpha D^{2}\log c}{Q\eta^{2}\sigma_{1}^{2}},
\]

modulo possibly taking a floor or ceiling on $\tilde{T}$, and taking the minimum with $T-1$. To solve for the smallest noise $\sigma^{2}$ given an RDP budget, one can use the aforementioned computation of the RDP given $\sigma^{2}$ in order to binary search for $\sigma^{2}$ (since this RDP bound is monotonically decreasing in $\sigma$).
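
Analogously to Section B.2, here is a minimal Python sketch of (B.3), reusing the assumed `sgm_renyi` helper and a grid search over $\sigma_{1}$ in place of ternary search; `c` denotes the contraction factor.

```python
import math

def rdp_noisy_sgd_strongly_convex(alpha, eta, sigma, L, n, b, D, T, c,
                                  sgm_renyi, grid=100):
    """Sketch of the non-asymptotic RDP bound (B.3) for strongly convex losses.

    `c` < 1 is the per-iteration contraction factor; `sgm_renyi` is the
    assumed Sampled Gaussian Mechanism helper from the Section B.2 sketch.
    """
    best = T * sgm_renyi(b / n, b * sigma / (2 * L), alpha)
    for i in range(1, grid):
        sigma1 = sigma * i / grid
        sigma2 = math.sqrt(max(sigma ** 2 - sigma1 ** 2, 0.0))
        Q = sgm_renyi(b / n, b * sigma2 / (2 * L), alpha)
        # First-order optimality condition for \tilde{T} (see display above).
        beta = ((1 - c ** 2) * alpha * D ** 2 * math.log(c)
                / (Q * eta ** 2 * sigma1 ** 2))          # note: beta < 0
        x = (2 - beta + math.sqrt((2 - beta) ** 2 - 4)) / 2
        t_star = -math.log(x) / (2 * math.log(c))
        for t in {max(1, min(T - 1, math.floor(t_star))),
                  max(1, min(T - 1, math.ceil(t_star)))}:
            decay = c ** (-2 * t) - 1                    # > 0 since c < 1
            best = min(best, t * Q + alpha * D ** 2 * (1 - c ** 2)
                       / (2 * eta ** 2 * sigma1 ** 2 * decay))
    return best
```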

We remark that this bound exploits strong convexity via the contraction factor $c < 1$, which exponentially decays the last term in (B.3). This contraction is best when $c$ is minimal, i.e., when the stepsize is $\eta = 2/(M+m)$, and it appears that one pays a price in privacy for any deviation from this choice of stepsize.
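
To make the dependence on the stepsize concrete, we recall the standard contraction bound for gradient steps on strongly convex losses (a classical fact in convex optimization, restated here for convenience): for an $m$-strongly convex and $M$-smooth loss $f$, the gradient map $x \mapsto x - \eta\nabla f(x)$ contracts distances by
\[
c = \max\{\,|1-\eta m|,\ |1-\eta M|\,\}, \qquad \min_{\eta > 0} c = \frac{M-m}{M+m} \ \text{ attained at } \ \eta = \frac{2}{M+m}.
\]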

References

  • Abadi et al. [2016a] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In Symposium on Operating Systems Design and Implementation, pages 265–283, 2016a.
  • Abadi et al. [2016b] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Conference on Computer and Communications Security, pages 308–318, 2016b.
  • Altschuler and Talwar [2022a] Jason M Altschuler and Kunal Talwar. Resolving the mixing time of the Langevin Algorithm to its stationary distribution for log-concave sampling. arXiv preprint, 2022a.
  • Altschuler and Talwar [2022b] Jason M Altschuler and Kunal Talwar. Privacy of noisy stochastic gradient descent: More iterations without more privacy loss. NeurIPS, 2022b.
  • Bassily et al. [2014] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
  • Bassily et al. [2019] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, volume 32, pages 11282–11291, 2019.
  • Bassily et al. [2021] Raef Bassily, Cristobal Guzman, and Anupama Nandi. Non-Euclidean differentially private stochastic convex optimization. arXiv:2103.01278, 2021.
  • Bradbury et al. [2018] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
  • Bubeck [2015] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
  • Bubeck et al. [2018] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with Projected Langevin Monte Carlo. Discrete & Computational Geometry, 59(4):757–783, 2018.
  • Chaudhuri et al. [2011] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.
  • Chourasia et al. [2021] Rishav Chourasia, Jiayuan Ye, and Reza Shokri. Differential privacy dynamics of Langevin diffusion and noisy gradient descent. In Advances in Neural Information Processing Systems, volume 34, pages 14771–14781, 2021.
  • Dalalyan [2017] Arnak Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 2017.
  • Ding et al. [2017] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. Advances in Neural Information Processing Systems, 30, 2017.
  • Durmus and Moulines [2019] Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25:2854–2882, 2019.
  • Dwork and Rothblum [2016] Cynthia Dwork and Guy N Rothblum. Concentrated differential privacy. arXiv preprint arXiv:1603.01887, 2016.
  • Dwork et al. [2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Conference on Theory of Cryptography, pages 265–284, 2006.
  • Erlingsson et al. [2014] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Conference on Computer and Communications Security, pages 1054–1067, 2014.
  • Feldman et al. [2018] Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. In Symposium on Foundations of Computer Science, pages 521–532. IEEE, 2018.
  • Feldman et al. [2020] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Symposium on the Theory of Computing, pages 439–449, 2020.
  • [21] Arun Ganesh and Kunal Talwar. Personal communication.
  • Ganesh and Talwar [2020] Arun Ganesh and Kunal Talwar. Faster differentially private samplers via Rényi divergence analysis of discretized Langevin MCMC. In Advances in Neural Information Processing Systems, volume 33, pages 7222–7233, 2020.
  • Ganesh et al. [2022] Arun Ganesh, Abhradeep Thakurta, and Jalaj Upadhyay. Langevin diffusion: An almost universal algorithm for private Euclidean (convex) optimization. arXiv:2204.01585v2, 2022.
  • Gopi et al. [2022] Sivakanth Gopi, Yin Tat Lee, and Daogao Liu. Private convex optimization via exponential mechanism. arXiv:2203.00263v1, 2022.
  • Hardt et al. [2016] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234, 2016.
  • Jain and Thakurta [2014] Prateek Jain and Abhradeep Guha Thakurta. (Near) dimension independent risk bounds for differentially private learning. In International Conference on Machine Learning, pages 476–484. PMLR, 2014.
  • Kasiviswanathan et al. [2011] Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
  • Kifer et al. [2012] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Computational Learning Theory, pages 25–1, 2012.
  • Liu and Talwar [2019] Jingcheng Liu and Kunal Talwar. Private selection from private candidates. In Symposium on Theory of Computing, pages 298–309, 2019.
  • McSherry and Talwar [2007] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Symposium on Foundations of Computer Science, pages 94–103, 2007.
  • Minami et al. [2016] Kentaro Minami, Hitomi Arai, Issei Sato, and Hiroshi Nakagawa. Differential privacy without sensitivity. In Advances in Neural Information Processing Systems, volume 29, 2016.
  • Mironov [2017] Ilya Mironov. Rényi differential privacy. In Computer Security Foundations Symposium, pages 263–275. IEEE, 2017.
  • Mironov et al. [2019] Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the Sampled Gaussian Mechanism. arXiv preprint arXiv:1908.10530, 2019.
  • Mitzenmacher and Upfal [2017] Michael Mitzenmacher and Eli Upfal. Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis. Cambridge University Press, 2017.
  • Nesterov [2003] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
  • [36] Apple Press. Apple previews iOS 10, the biggest iOS release ever, 2016. https://www.apple.com/pr/library/2016/06/13Apple-Previews-iOS-10-The-Biggest-iOS-Release-Ever.html.
  • Rogers et al. [2020] Ryan Rogers, Subbu Subramaniam, Sean Peng, David Durfee, Seunghyun Lee, Santosh Kumar Kancha, Shraddha Sahay, and Parvez Ahammad. LinkedIn’s audience engagements API: A privacy preserving data analytics system at scale. arXiv preprint arXiv:2002.05839, 2020.
  • Ryffel et al. [2022] Theo Ryffel, Francis Bach, and David Pointcheval. Differential privacy guarantees for stochastic gradient Langevin dynamics. arXiv:2201.11980, 2022.
  • Shalev-Shwartz et al. [2009] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In Conference on Computational Learning Theory, 2009.
  • Song et al. [2021] Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Evading the curse of dimensionality in unconstrained private GLMs. In International Conference on Artificial Intelligence and Statistics, pages 2638–2646. PMLR, 2021.
  • Talwar et al. [2015] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Nearly optimal private Lasso. In Advances in Neural Information Processing Systems, volume 28, pages 3025–3033, 2015.
  • Van Erven and Harremos [2014] Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
  • Vempala and Wibisono [2019] Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In Advances in Neural Information Processing Systems 32, pages 8092–8104. 2019.
  • Wang [2021] Hansi Lo Wang. For the U.S. census, keeping your data anonymous and useful is a tricky balance. https://www.npr.org/2021/05/19/993247101/for-the-u-s-census-keeping-your-data-anonymous-and-useful-is-a-tricky-balance, 2021.
  • Welling and Teh [2011] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, page 681–688, Madison, WI, USA, 2011.
  • Williams [1991] David Williams. Probability with martingales. Cambridge University Press, 1991.
  • Ye and Shokri [2022] Jiayuan Ye and Reza Shokri. Differentially private learning needs hidden state (or much faster convergence). arXiv:2203.05363, 2022.
  • Yousefpour et al. [2021] Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, et al. Opacus: User-friendly differential privacy library in PyTorch. arXiv preprint arXiv:2109.12298, 2021.