
On the Convergence and Calibration of Deep Learning with Differential Privacy

Zhiqi Bu [email protected]
University of Pennsylvania
Hua Wang [email protected]
University of Pennsylvania
Zongyu Dai [email protected]
University of Pennsylvania
Qi Long [email protected]
University of Pennsylvania
Abstract

Differentially private (DP) training preserves data privacy, usually at the cost of slower convergence (and thus lower accuracy), as well as more severe mis-calibration than its non-private counterpart. To analyze the convergence of DP training, we formulate a continuous-time analysis through the lens of the neural tangent kernel (NTK), which characterizes the per-sample gradient clipping and the noise addition in DP training, for arbitrary network architectures and loss functions. Interestingly, we show that the noise addition only affects the privacy risk but not the convergence or calibration, whereas the per-sample gradient clipping (under both flat and layerwise clipping styles) only affects the convergence and calibration.

Furthermore, we observe that DP models trained with a small clipping norm usually achieve the best accuracy, but are poorly calibrated and thus unreliable. In sharp contrast, DP models trained with a large clipping norm enjoy the same privacy guarantee and similar accuracy, but are significantly better calibrated. Our code can be found at https://github.com/woodyx218/opacus_global_clipping.

1 Introduction

Deep learning has achieved tremendous success in many applications that involve crowdsourced information, e.g., face images, emails, financial status, and medical records. However, using such sensitive data raises severe privacy concerns across image recognition, natural language processing, and other tasks (Cadwalladr & Graham-Harrison, 2018; Rocher et al., 2019; Ohm, 2009; De Montjoye et al., 2013; 2015). For a concrete example, researchers have recently demonstrated multiple successful privacy attacks on deep learning models, in which the attackers can re-identify a member of the dataset using location or purchase records, via the membership inference attack (Shokri et al., 2017; Carlini et al., 2019). In another example, the attackers can extract a person's name, email address, phone number, and physical address from the billion-parameter GPT-2 (Radford et al., 2019) via the extraction attack (Carlini et al., 2020). Therefore, many studies have applied differential privacy (DP) (Dwork et al., 2006; Dwork, 2008; Dwork et al., 2014; Mironov, 2017; Duchi et al., 2013; Dong et al., 2019), a mathematically rigorous approach, to protect against leakage of private information (Abadi et al., 2016; McSherry & Talwar, 2007; McMahan et al., 2017; Geyer et al., 2017). To achieve this gold standard of privacy guarantee, since the seminal work of (Abadi et al., 2016), DP optimizers (including DP-SGD/Adam (Abadi et al., 2016; Bassily et al., 2014; Bu et al., 2019), DP-SGLD (Wang et al., 2015; Li et al., 2019; Zhang et al., 2021), and DP-FedSGD/DP-FedAvg (McMahan et al., 2017)) have been applied to train neural networks while preserving high prediction accuracy.

Algorithmically speaking, DP optimizers have two extra steps in comparison to standard non-DP optimizers: the per-sample gradient clipping and the random noise addition, so that DP optimizers descend in the direction of the clipped, noisy, and averaged gradient (see equation 4.1). These extra steps protect the resulting models against privacy attacks via the Gaussian mechanism (Dwork et al., 2014, Theorem A.1), at the expense of an empirical performance degradation compared to non-DP deep learning, in terms of much slower convergence and lower utility. For example, the state-of-the-art CIFAR10 accuracy with DP is $\approx 70\%$ without pre-training (Papernot et al., 2020) (while non-DP networks can easily achieve over 95% accuracy), and similar performance drops have been observed on facial images, tweets, and many other datasets (Bagdasaryan et al., 2019; Kurakin et al., 2022).

Empirically, many works have evaluated the effects of noise scale, batch size, clipping norm, learning rate, and network architecture on the privacy-accuracy trade-off (Abadi et al., 2016; Papernot et al., 2020). However, despite the prevalent usage of DP optimizers, little is known about their convergence behavior from a theoretical viewpoint, which is necessary to understand and improve deep learning with differential privacy.

We notice some previous attempts (Chen et al., 2020; Song et al., 2021; Bu et al., 2022), which either analyze DP-SGD in the convex setting or rely on extra assumptions in the deep learning setting.

Our Contributions  In this work, we establish a principled framework to analyze the dynamics of DP deep learning, which helps demystify the phenomenon of the privacy-accuracy trade-off.

  • We explicitly characterize the general training dynamics of deep learning with DP-GD in Fact 4.1. We show that DP training fundamentally alters the NTK matrix, which can worsen the convergence.

  • This characterization leads to the convergence analysis for DP training with small or large clipping norm, in Theorem 1 and Theorem 2, respectively.

  • We demonstrate via numerous experiments that a small clipping norm generally leads to more accurate but less calibrated DP models, whereas a large clipping norm effectively mitigates the calibration issue, preserves a similar accuracy, and provides the same privacy guarantee.

  • We conduct the first experiments on DP training and calibration with large Transformer-scale models.

To elaborate on the notion of calibration (Guo et al., 2017; Niculescu-Mizil & Caruana, 2005), a critical performance measure besides accuracy and privacy, we provide a concrete example as follows. A classifier is calibrated if its average accuracy, over all samples it predicts with confidence $p$ (the probability assigned to its output class), is close to $p$ ($0<p<1$). That is, a calibrated classifier's predicted confidence matches its accuracy. We observe that DP models using a small clipping norm are oftentimes too over-confident to be reliable (the predicted confidence is much higher than the actual accuracy), while a large clipping norm is remarkably effective at mitigating the mis-calibration.

2 Background

2.1 Differential privacy notion

We provide the definition of DP (Dwork et al., 2006; 2014) as follows.

Definition 2.1.

A randomized algorithm $M$ is $(\varepsilon,\delta)$-differentially private (DP) if for any neighboring datasets $S,S'$ that differ by an arbitrary sample, and for any event $E$,

$\mathbb{P}[M(S)\in E]\leq \mathrm{e}^{\varepsilon}\,\mathbb{P}[M(S')\in E]+\delta.$   (2.1)

Given a deterministic function $G(S)$, adding noise proportional to $G$'s sensitivity makes it private. This is known as the Gaussian mechanism, as stated in Lemma 2.2 and widely used in DP deep learning.

Lemma 2.2 (Theorem A.1 (Dwork et al., 2014); Theorem 2.7 (Dong et al., 2019)).

Define the $\ell_2$ sensitivity of any function $G$ to be $R:=\sup_{S,S'}\|G(S)-G(S')\|_2$, where the supremum is over all neighboring datasets $(S,S')$. Then the Gaussian mechanism $\hat{G}(S)=G(S)+\sigma R\cdot\mathcal{N}(0,\mathbf{I})$ is $(\epsilon,\delta)$-DP for some $\epsilon$ depending on $(\sigma,p,\delta)$, where $p$ is the sampling ratio (e.g., batch size / total sample size).

We note that the interdependence between $\epsilon$ and $(\sigma,n,p,\delta)$ can be characterized by various privacy accountants, including the Moments accountant (Abadi et al., 2016; Canonne et al., 2020), Gaussian differential privacy (GDP) (Dong et al., 2019; Bu et al., 2019), the Fourier accountant (Koskela et al., 2020), and the Edgeworth accountant (Wang et al., 2022), each based on a different composition theory that accumulates the privacy risk $\epsilon(\sigma,n,p,\delta,T)$ differently over $T$ iterations.

2.2 Deep learning with differential privacy

DP deep learning (Google; Facebook) uses a general optimizer, e.g. SGD or Adam, to update the neural network with the

privatized gradient: $\sum_{i}C_{i}(R)\cdot\frac{\partial\ell_{i}}{\partial\mathbf{w}}+\sigma R\cdot\mathcal{N}(0,\mathbf{I}),$   (2.2)

where $\mathbf{w}$ denotes the trainable parameters of the network, $\frac{\partial\ell_{i}}{\partial\mathbf{w}}$ is the $i$-th per-sample gradient of the loss $\ell$, and $\sigma$ is the noise scale that determines the privacy risk. Specifically, $C_{i}(R)$ is the clipping factor with clipping norm $R$, which restricts the norm of the clipped gradient so that $\|C_{i}(R)\frac{\partial\ell_{i}}{\partial\mathbf{w}}\|\leq R$. There are multiple ways to design such a clipping factor. The most generic clipping (Abadi et al., 2016) uses $C_{i}=\min\{1,R/\|\frac{\partial\ell_{i}}{\partial\mathbf{w}}\|\}$, the automatic clipping (Bu et al., 2022) uses $C_{i}=1/(\|\frac{\partial\ell_{i}}{\partial\mathbf{w}}\|+0.01)$ or the normalization $C_{i}=1/\|\frac{\partial\ell_{i}}{\partial\mathbf{w}}\|$, and the global clipping uses $C_{i}=\mathbb{I}\{\|\frac{\partial\ell_{i}}{\partial\mathbf{w}}\|\leq R\}$, to be defined in Appendix D. In this work, we focus on the traditional clipping (Abadi et al., 2016) and observe that

$\text{clipping/normalization}\Longleftrightarrow C_{i}=R\Big/\left\|\frac{\partial\ell_{i}}{\partial\mathbf{w}}\right\|\;\overset{\text{small }R}{\longleftarrow}\;C_{i}=\min\left\{1,\,R\Big/\left\|\frac{\partial\ell_{i}}{\partial\mathbf{w}}\right\|\right\}\;\overset{\text{large }R}{\longrightarrow}\;C_{i}=1\Longleftrightarrow\text{no clipping}.$

In equation 2.2, the privatized gradient has two unique components compared to the standard non-DP gradient: the per-sample gradient clipping (to bound the sensitivity of the gradient) and the random noise addition (to guarantee the privacy of models). Empirical observations have found that optimizers with per-sample gradient clipping, even when no noise is present, have much worse accuracy at the end of training (Abadi et al., 2016; Bagdasaryan et al., 2019). On the other hand, noise addition (without per-sample clipping), though it slows down the convergence, can lead to comparable or even better accuracy at convergence (Neelakantan et al., 2015). Therefore, it is important to characterize the effects of the clipping and the noising, which are under-studied despite being widely applied in DP deep learning.
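To make these two steps concrete, below is a minimal PyTorch sketch of the privatized gradient in equation 2.2 under the traditional clipping; it assumes the per-sample gradients have already been materialized as an $(n,d)$ tensor, and the function name and the small stabilizer in the denominator are our own illustrative choices rather than any library's API.

```python
import torch

def privatized_gradient(per_sample_grads, R, sigma):
    """Sketch of equation 2.2 with the clipping factor C_i = min{1, R/||g_i||}.

    per_sample_grads: tensor of shape (n, d), one flattened gradient per sample.
    R: clipping norm; sigma: noise scale.
    """
    norms = per_sample_grads.norm(dim=1)                      # ||dl_i/dw|| for each i
    C = torch.clamp(R / (norms + 1e-12), max=1.0)             # C_i = min{1, R/||g_i||}
    clipped_sum = (C.unsqueeze(1) * per_sample_grads).sum(0)  # sum_i C_i dl_i/dw
    noise = sigma * R * torch.randn_like(clipped_sum)         # sigma * R * N(0, I)
    return clipped_sum + noise
```

Libraries such as Opacus implement the same logic layer by layer and also handle the computation of the per-sample gradients themselves.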

3 Warmup: Convergence of Non-Private Gradient Method

We start by reviewing the standard non-DP gradient descent (GD) for an arbitrary neural network and an arbitrary loss. In particular, we analyze the training dynamics of a neural network using the neural tangent kernel (NTK) matrix. (We emphasize that our analysis is not limited to infinitely wide or over-parameterized neural networks. Put differently, we do not assume the NTK matrix $\mathbf{H}$ to be deterministic or nearly time-independent, as was the case in (Arora et al., 2019a; Lee et al., 2019; Du et al., 2018; Allen-Zhu et al., 2019; Zou et al., 2020; Fort et al., 2020; Arora et al., 2019b).)

Suppose a neural network $f$ is governed by weights $\mathbf{w}$, with samples $\bm{x}_{i}$ and labels $y_{i}$ ($i=1,\ldots,n$). (The neural network $f$, and thus the losses $\ell$ and $L$, is assumed to be differentiable following the convention of existing literature (Du et al., 2018; Allen-Zhu et al., 2019; Xie et al., 2020; Bu et al., 2021b), in the sense that a sub-gradient exists everywhere; this differentiability is a necessary foundation of back-propagation for deep learning.) Denote the prediction by $f_{i}=f(\bm{x}_{i},\mathbf{w})$ and the per-sample loss by $\ell_{i}=\ell(f(\bm{x}_{i},\mathbf{w}),y_{i})$, whereas the optimization loss $L$ is the average of the per-sample losses,

$L(\mathbf{w})=\frac{1}{n}\sum_{i=1}^{n}\ell(f(\bm{x}_{i},\mathbf{w}),y_{i}).$

In discrete time, the gradient descent with a learning rate $\eta$ can be written as:

$\mathbf{w}(k+1)=\mathbf{w}(k)-\eta\frac{\partial L}{\partial\mathbf{w}}^{\top}=\mathbf{w}(k)-\frac{\eta}{n}\sum_{i}\frac{\partial\ell_{i}}{\partial\mathbf{w}(k)}.$

In continuous time, the corresponding gradient flow, i.e., the ordinary differential equation (ODE) describing the weight updates with an infinitely small learning rate $\eta\to 0$, is then:

$\dot{\mathbf{w}}(t)=-\frac{\partial L}{\partial\mathbf{w}(t)}^{\top}=-\frac{1}{n}\sum_{i}\frac{\partial\ell_{i}}{\partial\mathbf{w}(t)}.$

Applying the chain rule to the gradient flow, we obtain the following general dynamics of the loss $L$,

$\dot{L}=\frac{\partial L}{\partial\mathbf{w}}\dot{\mathbf{w}}=-\frac{\partial L}{\partial\mathbf{w}}\frac{\partial L}{\partial\mathbf{w}}^{\top}=-\frac{\partial L}{\partial\bm{f}}\frac{\partial\bm{f}}{\partial\mathbf{w}}\frac{\partial\bm{f}}{\partial\mathbf{w}}^{\top}\frac{\partial L}{\partial\bm{f}}^{\top}=-\frac{\partial L}{\partial\bm{f}}\mathbf{H}(t)\frac{\partial L}{\partial\bm{f}}^{\top},$   (3.1)

where $\frac{\partial L}{\partial\bm{f}}=\frac{1}{n}(\frac{\partial\ell_{1}}{\partial f_{1}},\ldots,\frac{\partial\ell_{n}}{\partial f_{n}})\in\mathbb{R}^{1\times n}$, and the Gram matrix $\mathbf{H}(t):=\frac{\partial\bm{f}}{\partial\mathbf{w}}\frac{\partial\bm{f}}{\partial\mathbf{w}}^{\top}\in\mathbb{R}^{n\times n}$ is known as the NTK matrix, which is positive semi-definite and crucial to analyzing the convergence behavior.

To give a concrete example, let $\ell$ be the MSE loss $\ell_{i}(\mathbf{w})=(f(\bm{x}_{i},\mathbf{w})-y_{i})^{2}$ and $L_{\text{MSE}}=\frac{1}{n}\sum_{i}\ell_{i}(\mathbf{w})=\frac{1}{n}\sum_{i}(f_{i}-y_{i})^{2}$; then $\dot{L}_{\text{MSE}}=-4(\bm{f}-\bm{y})^{\top}\mathbf{H}(t)(\bm{f}-\bm{y})/n^{2}$. Furthermore, if $\mathbf{H}(t)$ is positive definite, the MSE loss $L_{\text{MSE}}\to 0$ exponentially fast (Du et al., 2018; Allen-Zhu et al., 2019; Zou et al., 2020), and the cross-entropy loss $L_{\text{CE}}\to 0$ at rate $O(1/t)$ (Allen-Zhu et al., 2019).
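For small problems, the NTK matrix can be formed explicitly from the per-sample Jacobian. The following sketch, assuming a scalar-output PyTorch model and intended for illustration only, computes $\mathbf{H}=\frac{\partial\bm{f}}{\partial\mathbf{w}}\frac{\partial\bm{f}}{\partial\mathbf{w}}^{\top}$:

```python
import torch

def empirical_ntk(model, x):
    """Form H = (df/dw)(df/dw)^T for a scalar-output model on inputs x.

    Builds the n x d Jacobian row by row with autograd, then takes the
    n x n Gram matrix; meant for small n and small models only.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for xi in x:
        fi = model(xi.unsqueeze(0)).squeeze()    # prediction f_i
        grads = torch.autograd.grad(fi, params)  # df_i/dw, one tensor per layer
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)                        # Jacobian of shape (n, d)
    return J @ J.T                               # NTK matrix H, PSD by construction
```

Since $\mathbf{H}=JJ^{\top}$ is a Gram matrix, its eigenvalues are non-negative, consistent with the positive semi-definiteness used above.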

4 Continuous-time Convergence of DP Gradient Descent

In this section, we analyze the weight dynamics and loss dynamics of DP-GD with an arbitrary clipping function in a continuous-time analysis. That is, we study the gradient flow of the training dynamics as the learning rate $\eta$ tends to 0. Our analysis can generalize to other optimizers such as DP-SGD, DP-HeavyBall, and DP-Adam.

4.1 Effect of Noise Addition on Convergence

Our first result is simple yet surprising: the gradient flow of the noisy GD in equation 4.1 with non-zero noise is the same as the gradient flow without the noise in equation 4.2. Put differently, the noise addition has no effect on the convergence of DP optimizers in the limit of the continuous-time analysis. We note that DP-GD shares some similarity with another noisy gradient method, the stochastic gradient Langevin dynamics (SGLD; Welling & Teh, 2011). However, while DP-GD has a noise magnitude proportional to $\eta$ and thus corresponds to a deterministic gradient flow, SGLD has a noise magnitude proportional to $\sqrt{\eta}$, which is much larger in the limit $\eta\to 0$, and thus corresponds to a different continuous-time behavior: its gradient flow is a stochastic differential equation driven by a Brownian motion. We will extend this comparison to the discrete time in Section 4.5.

To elaborate on this point, we consider DP-GD with Gaussian noise, following the notation in equation 2.2,

$\mathbf{w}(k+1)=\mathbf{w}(k)-\frac{\eta}{n}\left(\sum_{i}C_{i}\frac{\partial\ell_{i}}{\partial\mathbf{w}(k)}+\sigma R\cdot\mathcal{N}(0,\mathbf{I})\right).$   (4.1)

Notice that this general dynamics covers both the standard non-DP GD ($\sigma=0$ and $C_{i}=1$ if no clipping, or $C_{i}=c$ if batch clipping) and DP-GD with any clipping function. Through Fact 4.1 (see proof in Appendix B), we claim that the gradient flow of equation 4.1 is the same ODE (not SDE) regardless of the value of $\sigma$. That is, different $\sigma$ always result in the same gradient flow as $\eta/n\to 0$.

Fact 4.1.

For all $\sigma\geq 0$, the gradient descent in equation 4.1 corresponds to the continuous gradient flow

$\dot{\mathbf{w}}(t)=-\frac{1}{n}\sum_{i}\frac{\partial\ell_{i}}{\partial\mathbf{w}(t)}C_{i}(t).$   (4.2)

This result indeed aligns with the conventional wisdom of tuning the clipping norm $R$ first (e.g., setting $\sigma=0$ or small) and then the noise scale $\sigma$, since the convergence is more sensitive to the clipping (see github.com/pytorch/opacus/blob/master/tutorials/building_image_classifier.ipynb and Section 3.3 in (Kurakin et al., 2022)). We validate Fact 4.1 in Figure 1 by experimenting on CIFAR10 with a small learning rate.
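Fact 4.1 can also be checked on a toy problem. The following sketch, a one-dimensional example of our own with loss $L(w)=w^{2}/2$ and no active clipping, runs the update of equation 4.1 with a small learning rate and shows that the final iterate is nearly independent of $\sigma$:

```python
import numpy as np

def dp_gd(sigma, eta=1e-3, steps=20_000, R=1.0, n=100, seed=0):
    """Equation 4.1 on the toy loss L(w) = w^2 / 2 with n identical samples
    and C_i = 1, i.e. w <- w - eta * (w + sigma * R * N(0,1) / n)."""
    rng = np.random.default_rng(seed)
    w = 5.0
    for _ in range(steps):
        w -= eta * (w + sigma * R * rng.standard_normal() / n)
    return w

# The per-step noise eta * sigma * R / n vanishes relative to the signal
# as eta -> 0, so the final iterates nearly coincide across sigma.
for sigma in (0.0, 1.0, 10.0):
    print(sigma, dp_gd(sigma))
```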

Figure 1: For fixed $R=1$, $\eta=0.1$, ViT-base trained with DP-SGD under various noise scales $\sigma$ has similar performance on CIFAR10 (setting in Section 5.3). Here 'non-DP' means both $\sigma=0$ and no clipping. Notice that the loss curves for different $\sigma$ are very similar (though not identical), because we fix the random seed at the beginning of each iteration across runs; this eliminates potential differences from uncontrolled random realizations for a fair comparison.

4.2 Effect of Per-Sample Clipping on Convergence

We move on to analyze the effect of the per-sample clipping on the DP training dynamics in equation 4.2. It has been empirically observed that the per-sample clipping results in worse convergence and accuracy even without the noise (Bagdasaryan et al., 2019). We highlight that the NTK matrix is the key to understanding the convergence behavior. Specifically, the per-sample clipping affects the NTK through its linear-algebraic properties, especially the positive semi-definiteness, which we define below in two notions for a general matrix.

Definition 4.2.

For a (not necessarily symmetric) matrix $A$, it is

  1. positive in quadratic form if and only if $\bm{x}^{\top}A\bm{x}\geq 0$ for every non-zero $\bm{x}$;

  2. positive in eigenvalues if and only if all eigenvalues of $A$ are non-negative.

These two positivity definitions are equivalent for a symmetric or Hermitian matrix, but not for non-symmetric matrices. We illustrate this difference in Appendix A with some concrete examples. Next, we introduce two styles of per-sample clipping, both of which can work with any clipping function.

Flat clipping style.

The DP-GD described in equation 4.1, with the gradient flow in equation 4.2, is equipped with the flat clipping (McMahan et al., 2018). In words, the flat clipping upper-bounds the entire gradient vector by a norm $R$. Using the chain rule, we get

$\dot{L}=\frac{\partial L}{\partial\mathbf{w}}\dot{\mathbf{w}}=-\frac{1}{n^{2}}\sum_{j}\frac{\partial\ell_{j}}{\partial\mathbf{w}}\sum_{i}\frac{\partial\ell_{i}}{\partial\mathbf{w}}^{\top}C_{i}=-\frac{\partial L}{\partial\bm{f}}\mathbf{H}\mathbf{C}\frac{\partial L}{\partial\bm{f}}^{\top},$   (4.3)

where $\mathbf{C}(t)=\mathrm{diag}(C_{1},\cdots,C_{n})$ and $C_{i}(t)$ is defined in Section 2.2.

Layerwise clipping style.

We additionally analyze another per-sample clipping style, the layerwise clipping (Abadi et al., 2016; McMahan et al., 2017; Phan et al., 2017). Unlike the flat clipping, the layerwise clipping upper-bounds the $r$-th layer's gradient vector by a layer-dependent norm $R_{r}$. Therefore, the DP-GD and its gradient flow with the layerwise clipping are:

$\mathbf{w}_{r}(k+1)=\mathbf{w}_{r}(k)-\frac{\eta}{n}\left(\sum_{i}\frac{\partial\ell_{i}}{\partial\mathbf{w}_{r}}C_{i,r}+\sigma R_{r}\cdot\mathcal{N}(0,1)\right)\quad\text{and}\quad\dot{\mathbf{w}_{r}}(t)=-\frac{1}{n}\sum_{i}\frac{\partial\ell_{i}}{\partial\mathbf{w}_{r}}C_{i,r}.$

Then the loss dynamics is obtained by the chain rule:

$\dot{L}=\sum_{r}\frac{\partial L}{\partial\mathbf{w}_{r}}\dot{\mathbf{w}}_{r}=-\frac{\partial L}{\partial\bm{f}}\sum_{r}\mathbf{H}_{r}\mathbf{C}_{r}\frac{\partial L}{\partial\bm{f}}^{\top},$   (4.4)

where the layerwise NTK matrix is $\mathbf{H}_{r}=\frac{\partial\bm{f}}{\partial\mathbf{w}_{r}}\frac{\partial\bm{f}}{\partial\mathbf{w}_{r}}^{\top}$, and $\mathbf{C}_{r}(t)=\mathrm{diag}(C_{1,r},\cdots,C_{n,r})$.

In short, from equation 4.3 and equation 4.4, the per-sample clipping precisely changes the NTK matrix from $\mathbf{H}\equiv\sum_{r}\mathbf{H}_{r}$ in the standard non-DP training to $\mathbf{H}\mathbf{C}$ in DP training with the flat clipping, and to $\sum_{r}\mathbf{H}_{r}\mathbf{C}_{r}$ in DP training with the layerwise clipping. Subsequently, we will show that this may break the NTK's positivity and harm the convergence of DP training.
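As a preview of the next two subsections, the following sketch, with toy matrices taken from the counterexample in Appendix A, shows how multiplying a symmetric positive definite $\mathbf{H}$ by a positive diagonal $\mathbf{C}$ breaks the symmetry and the quadratic-form positivity while keeping the eigenvalues positive:

```python
import numpy as np

H = np.array([[1.0, 1.0],
              [1.0, 2.0]])    # toy NTK: symmetric, positive definite
C = np.diag([1.0, 0.1])       # positive diagonal clipping factors C_i

HC = H @ C                    # the clipped NTK of equation 4.3
print(np.allclose(HC, HC.T))  # False: symmetry is broken
print(np.linalg.eigvals(HC))  # both eigenvalues remain positive

x = np.array([1.0, -2.0])
print(x @ HC @ x)             # -0.4 < 0: not positive in quadratic form
```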

4.3 Small Per-Sample Clipping Norm Breaks NTK Positivity

We show that a small clipping norm $R$ breaks the positive semi-definiteness of the NTK matrix. (It is a fact that the product of a symmetric positive definite matrix and a positive diagonal matrix may be neither symmetric nor positive in quadratic form; this is shown in Appendix A.)

Theorem 1.

For an arbitrary neural network and a loss convex in $f$, suppose at least some per-sample gradients are clipped ($\exists i,\,C_{i}<1$) in the gradient flow of DP-GD, and assume $\mathbf{H}(t)\succ 0$; then:

  1. for the flat clipping style, the loss dynamics is equation 4.3 and the NTK matrix is $\mathbf{H}(t)\mathbf{C}(t)$, which may be neither symmetric nor positive in quadratic form, but is positive in eigenvalues.

  2. for the layerwise clipping style, the loss dynamics is equation 4.4 and the NTK matrix is $\sum_{r}\mathbf{H}_{r}(t)\mathbf{C}_{r}(t)$, which may be neither symmetric nor positive in quadratic form or in eigenvalues.

  3. for both flat and layerwise clipping styles, the loss $L(t)$ may not decrease monotonically.

  4. if the loss $L(t)$ converges with $\dot{L}(t)\to 0$ (note that it is possible that $L(t)$ converges yet $\dot{L}(t)\not\to 0$, e.g. when uniform convergence is not satisfied), then for the flat clipping style it converges to 0; for the layerwise clipping style, it may converge to a non-zero value.

We prove Theorem 1 in Appendix B. It states that the symmetry of the NTK is almost surely broken by clipping with a small clipping norm. If, furthermore, the positive definiteness of the NTK is broken, then severe issues may arise in the loss convergence, as depicted in Figure 1 and Figure 8.

4.4 Large Per-Sample Clipping Norm Preserves NTK Positivity

Now we switch gears to a large clipping norm $R$. Suppose at each iteration $R$ is sufficiently large so that no per-sample gradient is clipped ($C_{i}=1$), i.e. the per-sample clipping is not effective. Thus, the gradient flow of DP-GD is the same as that of non-DP GD. Hence we obtain the following result.

Theorem 2.

For an arbitrary neural network and a loss convex in $f$, suppose none of the per-sample gradients are clipped ($\forall i,\,C_{i}=1$) in the gradient flow of DP-GD, and assume $\mathbf{H}(t)\succ 0$; then:

  1. for both flat and layerwise clipping styles, the loss dynamics is equation 3.1 and the NTK matrix is $\mathbf{H}(t)$, which is symmetric and positive definite.

  2. if the loss $L(t)$ converges with $\dot{L}(t)\to 0$, then for both flat and layerwise clipping styles the loss $L(t)$ decreases monotonically to 0.

We prove Theorem 2 in Appendix B; the benefits of a large clipping norm are assessed in Section 5.2. Our findings from Theorem 1 and Theorem 2 are visualized in the left plot of Figure 10 and summarized in Table 1.

Clipping type | NTK matrix | Symmetric NTK | Positive in quadratic form | Positive in eigenvalues | Loss convergence | Monotone loss decay | To zero loss
No clipping | $\mathbf{H}\equiv\sum_{r}\mathbf{H}_{r}$ | Yes | Yes | Yes | Yes | Yes | Yes
Batch clipping | $c\mathbf{H}\equiv c\sum_{r}\mathbf{H}_{r}$ | Yes | Yes | Yes | Yes | Yes | Yes
Large $R$ clipping (flat & layerwise) | $\mathbf{H}\equiv\sum_{r}\mathbf{H}_{r}$ | Yes | Yes | Yes | Yes | Yes | Yes
Small $R$ clipping (flat) | $\mathbf{H}\mathbf{C}$ | No | No | Yes | Yes | No | Yes
Small $R$ clipping (layerwise) | $\sum_{r}\mathbf{H}_{r}\mathbf{C}_{r}$ | No | No | No | No | No | No
Table 1: Effects of per-sample gradient clipping on the gradient flow, following Theorem 1 and Theorem 2. Here "Yes/No" means guaranteed or not, and the loss refers to the training set. "Loss convergence" is conditioned on $\mathbf{H}(t)\succ 0$.

4.5 Connection to Bayesian Deep Learning

When $R$ is sufficiently large, no per-sample gradient is clipped ($C_{i}=1,\forall i$), and DP-SGD is essentially SGD with independent Gaussian noise. This is indeed the SGLD (up to a different learning rate) that is commonly used to train Bayesian neural networks.

$\text{DP-SGD:}\quad\mathbf{w}(k+1)-\mathbf{w}(k)=-\frac{\eta_{\text{DP-SGD}}}{B}\left(\sum_{i}\frac{\partial\ell_{i}}{\partial\mathbf{w}}+\sigma R\cdot\mathcal{N}(0,I)\right),$
$\text{SGLD:}\quad\mathbf{w}(k+1)-\mathbf{w}(k)=-\frac{\eta_{\text{SGLD}}\,n}{2B}\left(\sum_{i}\frac{\partial\ell_{i}}{\partial\mathbf{w}}\right)+\sqrt{\eta_{\text{SGLD}}}\,\mathcal{N}(0,I),$

where $n$ is the total number of samples and $B$ is the mini-batch size. Clearly, DP-SGD (with the right combination of hyperparameters) is a special form of SGLD, obtained by setting $\eta_{\text{DP-SGD}}=\eta_{\text{SGLD}}\,n/2$ and $\sigma R\frac{n}{2B}=1/\sqrt{\eta_{\text{SGLD}}}$.
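The hyperparameter mapping can be packaged as a small helper; this is a sketch with hypothetical names, using only the two identities above:

```python
import math

def sgld_equivalent(eta_sgld, n, B):
    """Map SGLD hyperparameters to the matching large-R DP-SGD ones, via
    eta_DP = eta_SGLD * n / 2 and sigma * R = 2B / (n * sqrt(eta_SGLD))."""
    eta_dp = eta_sgld * n / 2
    sigma_R = 2 * B / (n * math.sqrt(eta_sgld))
    return eta_dp, sigma_R

# e.g. n = 50,000 samples, mini-batch B = 1,000, SGLD step size 1e-6
print(sgld_equivalent(1e-6, 50_000, 1_000))
```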

Similarly, DP-HeavyBall with large $R$ can be viewed as the stochastic gradient Hamiltonian Monte Carlo. This equivalence opens new doors to understanding DP optimizers by borrowing the rich literature of Bayesian learning. In particular, the uncertainty quantification of Bayesian neural networks explains the remarkable calibration of large-$R$ DP optimization in Section 5.2.

5 Discrete-time DP Optimization: privacy, accuracy, calibration

Now, we focus on the more practical analysis when the learning rate $\eta$ is not infinitely small, i.e. the discrete-time analysis. In this regime, the gradient flow in equation 4.2 may deviate from the dynamics of the actual training, especially when the added noise is not small, e.g. when the privacy budget $\epsilon$ is stringent and thus requires a large $\sigma$.

Nevertheless, state-of-the-art DP accuracy can be achieved under settings that are well approximated by our gradient flow. For example, large pre-trained models such as GPT-2 (0.8 billion parameters) (Bu et al., 2022; Li et al., 2021) and ViT (0.3 billion parameters) (Bu et al., a) are typically trained with small learning rates around 0.0001. In addition, the best DP models are trained with large batch sizes, e.g. (Li et al., 2021) used a batch size of 6000 to train RoBERTa on the MNLI and QQP datasets, and (Kurakin et al., 2022; De et al., 2022; Mehta et al., 2022) used batch sizes from $10^{4}$ to $10^{6}$, i.e. full batch, to achieve state-of-the-art DP accuracy on ImageNet. These settings all result in a very small noise magnitude $\eta\sigma R/n$ in the optimization (here the noise magnitude is per parameter; it is empirically verified that the total noise magnitude for models with millions of parameters can also be small, or even dimension-independent, when the gradients are low-rank (Li et al., 2022)), so that the noise has small effects on the accuracy (and the calibration), as illustrated in Figure 1. Consequently, we focus on analyzing only the effect of different clipping norms $R$.

5.1 Privacy analysis

From Lemma 2.2, we highlight that DP optimizers with all clipping norms have the same privacy guarantee, independent of the choice of the privacy accountant, because the privacy risk $\epsilon$ only depends on the noise scale $\sigma$ (i.e. the noise-to-sensitivity ratio). We summarize this common fact in Fact 5.1, which motivates the ablation study on $R$ in most of the DP deep learning literature. Consequently, one can use a larger clipping norm that benefits the calibration, while remaining equally DP as with a smaller clipping norm.

Fact 5.1 (Abadi et al., 2016).

DP optimizers with the same noise scale $\sigma$ are equally $(\epsilon(\sigma),\delta(\sigma))$-DP, independent of the choice of the clipping norm $R$.

Proof of Fact 5.1.

Firstly, we show that the privatized gradient in equation 2.2 has a privacy guarantee that only depends on $\sigma$, not $R$, regardless of which privacy accountant is adopted. This can be seen because (1) the sum of per-sample clipped gradients $\sum_{i}C_{i}\bm{g}^{(i)}_{t}$ has a sensitivity of $\max_{i\in B_{t}}\|C_{i}\bm{g}_{t}^{(i)}\|\leq R$ by the triangle inequality, and (2) the noise $\sigma R$ is proportional to $R$, hence fixing the noise-to-sensitivity ratio at $\sigma R/R=\sigma$, regardless of the choice of $R$. Therefore, the privacy guarantee is the same and independent of $R$. Secondly, it is well known that post-processing of a DP mechanism is equally DP; thus any optimizer (e.g. SGD or Adam) that leverages the same privatized gradient in equation 2.2 has the same DP guarantee. ∎

5.2 Accuracy and Calibration

In the following sections, we reveal a novel phenomenon: the clipping norm of DP optimizers plays an important role in producing well-calibrated and reliable models.

In $M$-class classification problems, we denote the probability prediction for the $i$-th sample as $\bm{\pi}_{i}\in\mathbb{R}^{M}$ so that $f(\bm{x}_{i})=\mathrm{argmax}(\bm{\pi}_{i})$; then the accuracy is $\mathbf{1}\{f(\bm{x}_{i})=y_{i}\}$. The confidence, i.e., the probability associated with the predicted class, is $\hat{P}_{i}:=\max_{k=1}^{M}[\bm{\pi}_{i}]_{k}$, and a good calibration means the confidence is close to the accuracy. (An over-confident classifier, when predicting wrong at one data point, only reduces its accuracy a little but increases its loss significantly due to the large $-\log(\pi_{y_{i}})$, since too little probability is assigned to the true class.) Formally, we employ three popular calibration metrics from (Naeini et al., 2015): the test loss, i.e. the negative log-likelihood (NLL), the Expected Calibration Error (ECE), and the Maximum Calibration Error (MCE).

$\text{ECE:}\quad\mathbb{E}_{\hat{P}_{i}}\left[\big|\mathbb{P}(f(\bm{x}_{i})=y_{i}\,|\,\hat{P}_{i}=p)-p\big|\right],\qquad\text{MCE:}\quad\max_{p\in[0,1]}\big|\mathbb{P}(f(\bm{x}_{i})=y_{i}\,|\,\hat{P}_{i}=p)-p\big|.$
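In practice, both ECE and MCE are estimated by binning the confidences; a minimal sketch of the standard binned estimator (our own implementation, not tied to any particular library) is:

```python
import numpy as np

def ece_mce(confidences, correct, n_bins=10):
    """Binned estimators of ECE and MCE.

    confidences: predicted confidences max_k [pi_i]_k, in (0, 1].
    correct: boolean array, True where the argmax prediction equals the label.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.sum() / n * gap  # bin-weighted gap -> ECE
            mce = max(mce, gap)          # worst-bin gap -> MCE
    return ece, mce
```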
Figure 2: Confidence histograms on CIFAR10 (left), MNIST (middle), and SNLI (right).
Dataset | ECE %: non-DP | ECE %: DP (small $R$) | ECE %: DP (large $R$) | MCE %: non-DP | MCE %: DP (small $R$) | MCE %: DP (large $R$)
CIFAR10 | 1.3 | 0.9 | 1.1 | 54.8 | 58.6 | 27.5
MNIST | 0.4 | 2.3 | 0.7 | 49.3 | 56.2 | 33.4
SNLI | 13.0 | 22.0 | 17.6* | 34.7 | 62.5 | 28.9*
Table 2: Calibration metrics ECE and MCE for non-DP (no clipping) and DP optimizers. *The SNLI experiment uses the mix-up training described in Section 5.5.

Throughout this paper, we use the GDP privacy accountant for the experiments, with the Private Vision library (Bu et al., a) (improved upon Opacus) and one P100 GPU. We cover a range of model architectures (including convolutional neural networks [CNNs] and Transformers), batch sizes (from 32 to 1000), datasets (with sample sizes from 50,000 to 550,152), and tasks (including image and text classification). More details are available in Appendix C.

5.3 CIFAR10 image data with Vision Transformer

Figure 3: Ablation study on the accuracy, ECE, and MCE (left to right) of CIFAR10 with ViT-base.
Figure 4: Performance on CIFAR10 with ViT-base, batch size 1000, noise scale 1.3, $(\epsilon,\delta)=(1.96,10^{-5})$.

CIFAR10 is an image dataset containing 50000 training samples and 10000 test samples of $32\times 32$ color images in 10 classes. We use the Vision Transformer (ViT-base, 86 million parameters), pre-trained on ImageNet, and train with DP-SGD for a single epoch. This is one of the state-of-the-art models for this DP task (Bu et al., a; b). From Figure 3, the best accuracy is achieved along the diagonal by small $R$ and large $\eta$, a phenomenon commonly observed in (Li et al., 2021; Bu et al., 2022). (The ablation study of $(\eta,R)$ is necessary and widely applied in DP optimization; see Figure 8 in (Li et al., 2021) and Figure 1 in (Bu et al., 2022). Thus, besides the evaluation of accuracy, additionally evaluating the calibration error is almost free.) However, the calibration error (especially the MCE) is worse than that of the standard training in Table 2 and Figure 2. Additionally, the layerwise clipping can further slow down the optimization, as indicated by Theorem 1. We highlight that we choose $(R,\eta)$ proportionally, so that the total noise magnitude $\eta\sigma R$ is fixed across different hyperparameters.

Figure 5: Reliability diagrams (left: non-DP; middle: DP with large $R$; right: DP with small $R$) on CIFAR10 with ViT-base.

On the other hand, DP training with a larger $R$ can lead to significantly better calibration errors, while incurring a negligible reduction in the accuracy ($97.17\%\to 96.61\%$). In Figure 5, the reliability diagram (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005) displays the accuracy as a function of confidence. Graphically speaking, a calibrated classifier is expected to have blue bins close to the diagonal black dotted line. While the non-DP model is generally over-confident and thus not calibrated, the large-$R$ clipping effectively achieves nearly perfect calibration, thanks to its Bayesian learning nature. In contrast, the classifier with small-$R$ clipping is not only mis-calibrated, but also falls into a 'bipolar disorder': it is either over-confident and inaccurate, or under-confident but highly accurate. This disorder is observed to different extents in all experiments in this paper.

5.4 MNIST image data with CNN model

On the MNIST dataset, which contains 60000 training samples and 10000 test samples of $28\times 28$ grayscale images in 10 classes, we use the standard CNN from the DP libraries (Google; Facebook) (see https://github.com/tensorflow/privacy/tree/master/tutorials in TensorFlow and https://github.com/pytorch/opacus/blob/master/examples/mnist.py in PyTorch Opacus; see Section C.1 for the architecture) and train with DP-SGD but without pre-training. In Figure 6, DP training with both clipping norms is $(2.32,10^{-5})$-DP and has similar test accuracy (96% for small $R$ and 95% for large $R$), though the large $R$ leads to a smaller loss (or NLL). In the right plot of Figure 6, we demonstrate how $R$ affects the accuracy and calibration, ceteris paribus, showing a clear accuracy-calibration trade-off based on 5 independent runs. Similar to Figure 5, large-$R$ training again mitigates the mis-calibration in Figure 7.

Figure 6: Loss (left), accuracy (middle), and accuracy with ECE (right) on MNIST with a 4-layer CNN under different clipping norms $R$, batch size 256, noise scale 1.1, learning rate $0.15/R$ for each $R$.
Figure 7: Reliability diagrams (left: non-DP; middle: large $R=200$; right: small $R=1$) on MNIST with a 4-layer CNN.

5.5 SNLI text data with BERT and mix-up training

Figure 8: Loss (left), accuracy (middle), and calibration (right) on SNLI with pre-trained BERT, batch size 32, learning rate 0.0005, noise scale 0.4, clipping norms 0.1 or 20, $(\epsilon,\delta)=(1.25,1/550152)$.

Stanford Natural Language Inference (SNLI; we use SNLI 1.0 from https://nlp.stanford.edu/projects/snli/) is a collection of human-written English sentence pairs, each labeled with one of three classes: entailment, contradiction, or neutral. The dataset has 550152 training samples and 10000 test samples. We use the pre-trained BERT (Bidirectional Encoder Representations from Transformers) from the Opacus tutorial (https://github.com/pytorch/opacus/blob/master/tutorials/building_text_classifier.ipynb), which gives a state-of-the-art privacy-accuracy result. Our BERT contains 108M parameters, and we only train the last Transformer encoder, which has 7M parameters, using DP-AdamW. In particular, we use a mix-up training: we train BERT with small $R$ for 3 epochs ($51.5\times 10^{3}$ iterations, i.e. 95% of the training) and then use large $R$ for an additional 2500 iterations (the last 5% of the training). For comparison, we also train the same model with small $R$ for the entire training process of 54076 iterations.

Figure 9: Reliability diagrams (left: non-DP; middle: large $R=20$; right: small $R=0.1$) on SNLI with BERT. Note that the large $R$ is only used for the last 2500 out of 54000 iterations.

Surprisingly, the existing DP optimizer does not minimize the loss at all, yet the accuracy still improves along the training. We again observe that large-$R$ training has significantly better convergence than small-$R$ training (when switched to large $R$ in the last 2500 steps, the test loss or NLL decreases significantly from 1.79 to 1.08, and the training loss or NLL decreases from 1.81 to 1.47; keeping a small $R$ does not reduce the losses). The resulting models have similar accuracy: small $R$ has 74.1% accuracy, and the mix-up training has 73.1% accuracy; as baselines, non-DP has 85.4% accuracy and the entire training with large $R$ has 65.9% accuracy. All DP models have the same privacy ($\epsilon=1.25$, $\delta=1/550152$), and large-$R$ training has much better calibration in Table 2. We remark that all hyperparameters are the same as in the Opacus tutorial.

5.6 Regression Tasks

On regression tasks, the performance measure and the loss function are unified as the MSE. Figure 10 shows that DP training with large $R$ is comparable to, if not better than, that with small $R$. We experiment on the California Housing data (20640 samples, 8 features) and the Wine Quality data (1599 samples, 11 features, run with full-batch DP-GD). In particular, in the left plot of Figure 10, we observe that small-$R$ training may incur non-monotone convergence, as explained by Theorem 1, which is mitigated by the large-$R$ training. Additional experimental details are available in Section C.4.

Figure 10: Performance of DP optimizers under different clipping norms on the Wine Quality and the California Housing datasets. Experimental details in Section C.4.

6 Discussion

In this paper, we provide a continuous-time convergence analysis for DP deep learning, via the NTK matrix, which applies to general neural network architectures and loss functions. We show that in such a regime, the noise addition only affects the privacy risk but not the convergence, whereas the per-sample clipping only affects the convergence and the calibration (especially under different choices of clipping thresholds), but not the privacy risk.

We then study the accuracy-calibration trade-off formed by DP training with different clipping norms. We show that using a small clipping norm oftentimes trains more accurate but mis-calibrated models, while a large clipping norm provides a comparably accurate yet much better calibrated model. In fact, several follow-up works have demonstrated that DP training with large $R$ is remarkably accurate and well-calibrated on large transformers with $>10^{8}$ parameters (Zhang et al., 2022), and it significantly mitigates the unfairness on various tasks (Esipova et al., 2022), while preserving privacy.

A future direction is to study the discrete-time convergence when both the learning rate and the added noise are not small. One immediate observation is that the noise addition will have an effect on the convergence in this case, which needs further investigation. In addition, the analysis of commonly used mini-batch optimizers is also interesting, since for those optimizers the training dynamics is no longer deterministic, and stochastic differential equations will instead be needed for the analysis. Lastly, the inconsistency between the cross-entropy loss and the prediction accuracy, as well as its connection to the calibration issue, are intriguing; their theoretical understanding awaits future research.

Acknowledgement

We would like to thank Weijie J. Su, Janardhan Kulkarni, Om Thakkar, and Gautam Kamath for constructive and stimulating discussions around the global clipping function. We also thank the Opacus team for maintaining this amazing library. This work was supported in part by NIH through R01GM124111 and RF1AG063481.

References

  • Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp.  308–318, 2016.
  • Allen-Zhu et al. (2019) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019.
  • Arora et al. (2019a) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019a.
  • Arora et al. (2019b) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019b.
  • Bagdasaryan et al. (2019) Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. Differential privacy has disparate impact on model accuracy. In Advances in Neural Information Processing Systems, pp. 15453–15462, 2019.
  • Bassily et al. (2014) Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pp.  464–473. IEEE, 2014.
  • Bu et al. (a) Zhiqi Bu, Jialin Mao, and Shiyun Xu. Scalable and efficient training of large convolutional neural networks with differential privacy. In Advances in Neural Information Processing Systems, a.
  • Bu et al. (b) Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Differentially private bias-term only fine-tuning of foundation models. In Workshop on Trustworthy and Socially Responsible Machine Learning, NeurIPS 2022, b.
  • Bu et al. (2019) Zhiqi Bu, Jinshuo Dong, Qi Long, and Weijie J Su. Deep learning with gaussian differential privacy. arXiv preprint arXiv:1911.11607, 2019.
  • Bu et al. (2021a) Zhiqi Bu, Sivakanth Gopi, Janardhan Kulkarni, Yin Tat Lee, Judy Hanwen Shen, and Uthaipon Tantipongpipat. Fast and memory efficient differentially private-sgd via jl projections. arXiv preprint arXiv:2102.03013, 2021a.
  • Bu et al. (2021b) Zhiqi Bu, Shiyun Xu, and Kan Chen. A dynamical view on optimization algorithms of overparameterized neural networks. In International Conference on Artificial Intelligence and Statistics, pp.  3187–3195. PMLR, 2021b.
  • Bu et al. (2022) Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Automatic clipping: Differentially private deep learning made easier and stronger. arXiv preprint arXiv:2206.07136, 2022.
  • Cadwalladr & Graham-Harrison (2018) Carole Cadwalladr and Emma Graham-Harrison. Revealed: 50 million facebook profiles harvested for cambridge analytica in major data breach. The guardian, 17:22, 2018.
  • Canonne et al. (2020) Clément Canonne, Gautam Kamath, and Thomas Steinke. The discrete gaussian for differential privacy. arXiv preprint arXiv:2004.00010, 2020.
  • Carlini et al. (2019) Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp. 267–284, 2019.
  • Carlini et al. (2020) Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. arXiv preprint arXiv:2012.07805, 2020.
  • Chen et al. (2020) Xiangyi Chen, Steven Z Wu, and Mingyi Hong. Understanding gradient clipping in private sgd: A geometric perspective. Advances in Neural Information Processing Systems, 33, 2020.
  • De et al. (2022) Soham De, Leonard Berrada, Jamie Hayes, Samuel L Smith, and Borja Balle. Unlocking high-accuracy differentially private image classification through scale. arXiv preprint arXiv:2204.13650, 2022.
  • De Montjoye et al. (2013) Yves-Alexandre De Montjoye, César A Hidalgo, Michel Verleysen, and Vincent D Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific reports, 3(1):1–5, 2013.
  • De Montjoye et al. (2015) Yves-Alexandre De Montjoye, Laura Radaelli, Vivek Kumar Singh, et al. Unique in the shopping mall: On the reidentifiability of credit card metadata. Science, 347(6221):536–539, 2015.
  • DeGroot & Fienberg (1983) Morris H DeGroot and Stephen E Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2):12–22, 1983.
  • Dong et al. (2019) Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy. arXiv preprint arXiv:1905.02383, 2019.
  • Du et al. (2018) Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  • Duchi et al. (2013) John C Duchi, Michael I Jordan, and Martin J Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp.  429–438. IEEE, 2013.
  • Dwork (2008) Cynthia Dwork. Differential privacy: A survey of results. In International conference on theory and applications of models of computation, pp.  1–19. Springer, 2008.
  • Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pp.  265–284. Springer, 2006.
  • Dwork et al. (2014) Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
  • Esipova et al. (2022) Maria S Esipova, Atiyeh Ashari Ghomi, Yaqiao Luo, and Jesse C Cresswell. Disparate impact in differential privacy from gradient misalignment. arXiv preprint arXiv:2206.07737, 2022.
  • Facebook. PyTorch Privacy library — Opacus. https://github.com/pytorch/opacus.
  • Fort et al. (2020) Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M Roy, and Surya Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. arXiv preprint arXiv:2010.15110, 2020.
  • Geyer et al. (2017) Robin C Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.
  • Google. TensorFlow Privacy library. https://github.com/tensorflow/privacy.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. PMLR, 2017.
  • Kairouz et al. (2021) Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and private (deep) learning without sampling or shuffling. arXiv preprint arXiv:2103.00039, 2021.
  • Koskela et al. (2020) Antti Koskela, Joonas Jälkö, and Antti Honkela. Computing tight differential privacy guarantees using fft. In International Conference on Artificial Intelligence and Statistics, pp.  2560–2569. PMLR, 2020.
  • Kurakin et al. (2022) Alexey Kurakin, Steve Chien, Shuang Song, Roxana Geambasu, Andreas Terzis, and Abhradeep Thakurta. Toward training at imagenet scale with differential privacy. arXiv preprint arXiv:2201.12328, 2022.
  • Lee et al. (2019) Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
  • Li et al. (2019) Bai Li, Changyou Chen, Hao Liu, and Lawrence Carin. On connecting stochastic gradient mcmc and differential privacy. In The 22nd International Conference on Artificial Intelligence and Statistics, pp.  557–566. PMLR, 2019.
  • Li et al. (2021) Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. In International Conference on Learning Representations, 2021.
  • Li et al. (2022) Xuechen Li, Daogao Liu, Tatsunori B Hashimoto, Huseyin A Inan, Janardhan Kulkarni, Yin-Tat Lee, and Abhradeep Guha Thakurta. When does differentially private learning not suffer in high dimensions? Advances in Neural Information Processing Systems, 35:28616–28630, 2022.
  • McMahan et al. (2017) H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017.
  • McMahan et al. (2018) H Brendan McMahan, Galen Andrew, Ulfar Erlingsson, Steve Chien, Ilya Mironov, Nicolas Papernot, and Peter Kairouz. A general approach to adding differential privacy to iterative training procedures. arXiv preprint arXiv:1812.06210, 2018.
  • McSherry & Talwar (2007) Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pp.  94–103. IEEE, 2007.
  • Mehta et al. (2022) Harsh Mehta, Abhradeep Thakurta, Alexey Kurakin, and Ashok Cutkosky. Large scale transfer learning for differentially private image classification. arXiv preprint arXiv:2205.02973, 2022.
  • Mironov (2017) Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp.  263–275. IEEE, 2017.
  • Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
  • Neelakantan et al. (2015) Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. stat, 1050:21, 2015.
  • Niculescu-Mizil & Caruana (2005) Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pp.  625–632, 2005.
  • Ohm (2009) Paul Ohm. Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA l. Rev., 57:1701, 2009.
  • Papernot et al. (2020) Nicolas Papernot, Abhradeep Thakurta, Shuang Song, Steve Chien, and Úlfar Erlingsson. Tempered sigmoid activations for deep learning with differential privacy. arXiv preprint arXiv:2007.14191, 2020.
  • Phan et al. (2017) NhatHai Phan, Xintao Wu, Han Hu, and Dejing Dou. Adaptive laplace mechanism: Differential privacy preservation in deep learning. In 2017 IEEE International Conference on Data Mining (ICDM), pp.  385–394. IEEE, 2017.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Rocher et al. (2019) Luc Rocher, Julien M Hendrickx, and Yves-Alexandre De Montjoye. Estimating the success of re-identifications in incomplete datasets using generative models. Nature communications, 10(1):1–9, 2019.
  • Shokri et al. (2017) Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. IEEE, 2017.
  • Song et al. (2021) Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Evading the curse of dimensionality in unconstrained private glms. In International Conference on Artificial Intelligence and Statistics, pp.  2638–2646. PMLR, 2021.
  • Wang et al. (2022) Hua Wang, Sheng Gao, Huanyu Zhang, Milan Shen, and Weijie J Su. Analytical composition of differential privacy via the edgeworth accountant. arXiv preprint arXiv:2206.04236, 2022.
  • Wang et al. (2015) Yu-Xiang Wang, Stephen Fienberg, and Alex Smola. Privacy for free: Posterior sampling and stochastic gradient monte carlo. In International Conference on Machine Learning, pp. 2493–2502. PMLR, 2015.
  • Welling & Teh (2011) Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp.  681–688, 2011.
  • Xie et al. (2020) Zeke Xie, Issei Sato, and Masashi Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In International Conference on Learning Representations, 2020.
  • Zhang et al. (2022) Hanlin Zhang, Xuechen Li, Prithviraj Sen, Salim Roukos, and Tatsunori Hashimoto. A closer look at the calibration of differentially private learners. arXiv preprint arXiv:2210.08248, 2022.
  • Zhang et al. (2021) Qiyiwen Zhang, Zhiqi Bu, Kan Chen, and Qi Long. Differentially private bayesian neural networks on accuracy, privacy and reliability. arXiv preprint arXiv:2107.08461, 2021.
  • Zou et al. (2020) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep relu networks. Machine Learning, 109(3):467–492, 2020.

Appendix A Linear Algebra Facts

Fact A.1.

The product $A=M_{1}M_{2}$, where $M_{1}$ is a symmetric and positive definite matrix and $M_{2}$ a positive diagonal matrix, is positive in eigenvalues, but is non-symmetric in general (unless the diagonal matrix is constant) and may be non-positive in quadratic form.

Proof of Fact A.1.

To see the non-symmetry of $A$, suppose there exist $i,j$ such that $(M_{2})_{jj}\neq(M_{2})_{ii}$; then

$(M_{1}M_{2})_{ij}=\sum_{k}(M_{1})_{ik}(M_{2})_{kj}=(M_{1})_{ij}(M_{2})_{jj}=(M_{1})_{ji}(M_{2})_{jj},$
$(M_{1}M_{2})_{ji}=(M_{1})_{ji}(M_{2})_{ii}\neq(M_{1})_{ji}(M_{2})_{jj}.$

Hence $A$ is not symmetric, and in particular not symmetric positive definite. To see that $A$ may be non-positive in quadratic form, we give a counter-example.

$M_{1}=\begin{pmatrix}1&1\\1&2\end{pmatrix},\quad M_{2}=\begin{pmatrix}1&0\\0&0.1\end{pmatrix},\quad A=M_{1}M_{2}=\begin{pmatrix}1&0.1\\1&0.2\end{pmatrix},\quad(1,-2)\,A\begin{pmatrix}1\\-2\end{pmatrix}=-0.4.$

To see that $A$ is positive in eigenvalues, note that an invertible square root $M_{1}^{1/2}$ exists since $M_{1}$ is symmetric and positive definite. Now $A$ is similar to $(M_{1}^{1/2})^{-1}AM_{1}^{1/2}=M_{1}^{1/2}M_{2}M_{1}^{1/2}$; hence the non-symmetric $A$ has the same eigenvalues as the symmetric and positive definite $M_{1}^{1/2}M_{2}M_{1}^{1/2}$. ∎

Fact A.2.

A matrix with all eigenvalues positive may be non-positive in quadratic form.

Proof of Fact A.2.

$A=\begin{pmatrix}-1&3\\-3&8\end{pmatrix},\quad(1,0)\,A\begin{pmatrix}1\\0\end{pmatrix}=-1,$

though the eigenvalues of $A$ are $\frac{1}{2}(7\pm 3\sqrt{5})>0$. ∎

Fact A.3.

A matrix with positive quadratic form may have eigenvalues that are not positive (indeed, not even real).

Proof of A.3.
A=\begin{pmatrix}1&1\\ -1&1\end{pmatrix},\qquad (x,y)\,A\begin{pmatrix}x\\ y\end{pmatrix}=x^{2}+y^{2}>0,

but the eigenvalues of $A$ are $1\pm i$, neither real nor positive. In fact, every eigenvalue of a matrix with positive quadratic form must have positive real part. ∎

Fact A.4.

A sum of products of symmetric positive definite matrices and positive diagonal matrices may have zero or even negative eigenvalues.

Proof of A.4.
\mathbf{H}_{1}=\begin{pmatrix}7/9&2\\ 2&7\end{pmatrix},\quad \mathbf{C}_{1}=\begin{pmatrix}0.9&0\\ 0&0.4\end{pmatrix},\quad \mathbf{H}_{2}=\begin{pmatrix}3&2\\ 2&2\end{pmatrix},\quad \mathbf{C}_{2}=\begin{pmatrix}0.1&0\\ 0&0.6\end{pmatrix}.

Although both $\mathbf{H}_{j}$ are symmetric positive definite, $\mathbf{H}_{1}\mathbf{C}_{1}+\mathbf{H}_{2}\mathbf{C}_{2}=\begin{pmatrix}1&2\\ 2&4\end{pmatrix}$ has a zero eigenvalue. Further, if the $(1,1)$ entry of $\mathbf{H}_{1}$ is decreased to $0.7$, then $\mathbf{H}_{1}\mathbf{C}_{1}+\mathbf{H}_{2}\mathbf{C}_{2}$ has a negative eigenvalue. ∎
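This can again be verified numerically; the sketch below is our addition and checks both the zero and the negative eigenvalue.

import numpy as np

H1 = np.array([[7 / 9, 2.0], [2.0, 7.0]])     # symmetric positive definite
C1 = np.diag([0.9, 0.4])
H2 = np.array([[3.0, 2.0], [2.0, 2.0]])
C2 = np.diag([0.1, 0.6])

print(np.linalg.eigvals(H1 @ C1 + H2 @ C2))   # one eigenvalue is (numerically) zero

H1[0, 0] = 0.7                                # shrink the (1,1) entry of H1
print(np.linalg.eigvals(H1 @ C1 + H2 @ C2))   # now one eigenvalue is negative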

Appendix B Details of Main Results

B.1 Proofs of main results

Proof of 4.1.

Expanding the discrete dynamics in equation 4.1 as

\mathbf{w}(k+1)=\mathbf{w}(k)-\frac{\eta}{n}\sum_{i}\nabla_{\mathbf{w}}\ell_{i}\,C_{i}-\frac{\eta\sigma R}{n}\mathcal{N}(0,I),

and chaining it for $r\geq 1$ steps, we obtain

\mathbf{w}(k+r)-\mathbf{w}(k)=-\sum_{j=0}^{r-1}\frac{\eta}{n}\sum_{i}\nabla_{\mathbf{w}}\ell_{i}(\mathbf{w}(k+j))\,C_{i}-\sum_{j=0}^{r-1}\frac{\eta\sigma R}{n}\mathcal{N}(0,I).

In the limit $\eta\to 0$, we re-index the weights $\mathbf{w}$ by time, with $t=k\eta$ and $s=r\eta$. Consider the above equation between times $t$ and $t+s$: the left-hand side becomes $\mathbf{w}(t+s)-\mathbf{w}(t)$; the first summation on the right-hand side converges to $-\frac{1}{n}\int_{t}^{t+s}\sum_{i}\nabla_{\mathbf{w}}\ell_{i}(\tau)C_{i}(\tau)d\tau$, as long as this integral exists. This can be seen as a rectangle-rule numerical integration with width $\eta$, index $j$, and integrand $\frac{1}{n}\sum_{i}\nabla_{\mathbf{w}}\ell_{i}(\tau)C_{i}(\tau)$. Similarly, the second summation $J(\eta)=\sum_{j=0}^{r-1}\frac{\eta\sigma R}{n}\mathcal{N}(0,I)$ satisfies

\mathbb{E}[J(\eta)]=0\quad\text{ and }\quad\mathrm{Var}(J(\eta))=\frac{\sigma^{2}R^{2}\eta^{2}}{n^{2}}\,r=\eta s\,\frac{\sigma^{2}R^{2}}{n^{2}}\to 0,\text{ as }\eta\to 0.

Therefore, as $\eta\to 0$, the discrete stochastic dynamics in equation 4.1 become the integral equation

\mathbf{w}(t+s)-\mathbf{w}(t)=-\frac{1}{n}\int_{t}^{t+s}{\sum}_{i}\nabla_{\mathbf{w}}\ell_{i}(\tau)C_{i}(\tau)d\tau.

Dividing by $s$ and letting $s\to 0$, this integral equation yields the deterministic gradient flow

\dot{\mathbf{w}}(t)\equiv\lim_{s\to 0}\frac{\mathbf{w}(t+s)-\mathbf{w}(t)}{s}=-\frac{1}{n}\lim_{s\to 0}\frac{1}{s}\int_{t}^{t+s}{\sum}_{i}\nabla_{\mathbf{w}}\ell_{i}(\tau)C_{i}(\tau)d\tau=-\frac{1}{n}{\sum}_{i}\nabla_{\mathbf{w}}\ell_{i}(t)C_{i}(t),

which corresponds to the ordinary differential equation 4.2. ∎
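To illustrate why the noise term drops out in this limit, the following simulation is our addition (the constants are arbitrary): it estimates $\mathrm{Var}(J(\eta))$ empirically for one noise coordinate and recovers the linear decay in $\eta$ derived above.

import numpy as np

rng = np.random.default_rng(0)
sigma, R, n, s = 1.0, 1.0, 100.0, 1.0    # noise scale, clipping norm, sample size, horizon

for eta in [0.1, 0.01, 0.001]:
    r = int(s / eta)                      # number of steps covering [t, t + s]
    # J(eta): sum of r i.i.d. N(0, (eta * sigma * R / n)^2) terms, over 10000 trials
    J = (eta * sigma * R / n) * rng.standard_normal((10000, r)).sum(axis=1)
    print(eta, J.var())                   # ~ eta * s * sigma^2 R^2 / n^2 -> 0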

Proof of Theorem 1.

We prove the statements using the derived gradient flow dynamics equation 4.2.

For Statement 1, from the discussion in Section 4.2, the flat clipping algorithm has $\mathbf{H}(t)\mathbf{C}(t)$ as its NTK. Since $\mathbf{H}(t)$ is positive definite and $\mathbf{C}(t)$ is a positive diagonal matrix, Fact A.1 shows that the product $\mathbf{H}(t)\mathbf{C}(t)$ has positive eigenvalues, yet in general may be asymmetric and not positive in quadratic form.

Similarly, for Statement 2, the NTK of layerwise clipping has the form $\sum_{r}\mathbf{H}_{r}(t)\mathbf{C}_{r}(t)$, which by Fact A.4 is asymmetric in general and may be positive neither in quadratic form nor in eigenvalues.

For Statement 3, by the training dynamics in equation 4.3 for flat clipping and in equation 4.4 for layerwise clipping, $\dot{L}$ equals the negation of a quadratic form of the corresponding NTK. By Statements 1 and 2 of this theorem, such a quadratic form may fail to be positive at some $t$, and hence the loss $L(t)$ is not guaranteed to decrease monotonically.

Lastly, for Statement 4, suppose $L(t)$ converges in the sense that $\dot{L}=\frac{\partial L}{\partial\bm{f}}\dot{\bm{f}}=0$. Suppose $L>0$; then $\frac{\partial L}{\partial\bm{f}}\neq 0$, since $L$ is convex in the prediction $\bm{f}$. In this case, the convergence forces $\dot{\bm{f}}=0$. Observe that

0=\dot{\bm{f}}=\frac{\partial\bm{f}}{\partial\mathbf{w}}\,\dot{\mathbf{w}}=-\frac{1}{n}\frac{\partial\bm{f}}{\partial\mathbf{w}}\frac{\partial\bm{f}}{\partial\mathbf{w}}^{\top}\mathbf{C}\,\frac{\partial L}{\partial\bm{f}}^{\top}=-\frac{1}{n}\,\mathbf{H}\mathbf{C}\,\frac{\partial L}{\partial\bm{f}}^{\top}.

For the flat clipping, the NTK matrix $\mathbf{H}\mathbf{C}$ has positive eigenvalues (by Statement 1) and is thus invertible, so it must be that $\frac{\partial L}{\partial\bm{f}}=\bm{0}$, contradicting the premise that $L>0$. Therefore $L=0$ whenever the loss converges under flat clipping. On the other hand, for the layerwise clipping, the NTK may not be positive in eigenvalues, hence it is possible that $L\neq 0$ when $\dot{L}=0$. ∎

Proof of Theorem 2.

The proof is similar to that of Theorem 1 and is thus omitted. ∎

Appendix C Experimental Details

C.1 MNIST

For MNIST, we use the standard CNN from Tensorflow Privacy and Opacus, as listed below. The training hyperparameters (e.g., batch size) in Section 5.4 are exactly as reported in https://github.com/tensorflow/privacy/tree/master/tutorials, which gives 96.6% accuracy for the small-$R$ clipping in Tensorflow and similar accuracy in Pytorch, where our experiments are conducted. The non-DP network is about 99% accurate. Notice that the tutorial uses a different privacy accountant from the GDP accountant that we use.

import torch.nn as nn
import torch.nn.functional as F

class SampleConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 8, 2, padding=3)
        self.conv2 = nn.Conv2d(16, 32, 4, 2)
        self.fc1 = nn.Linear(32 * 4 * 4, 32)
        self.fc2 = nn.Linear(32, 10)

    def forward(self, x):
        # x of shape [B, 1, 28, 28]
        x = F.relu(self.conv1(x))  # -> [B, 16, 14, 14]
        x = F.max_pool2d(x, 2, 1)  # -> [B, 16, 13, 13]
        x = F.relu(self.conv2(x))  # -> [B, 32, 5, 5]
        x = F.max_pool2d(x, 2, 1)  # -> [B, 32, 4, 4]
        x = x.view(-1, 32 * 4 * 4)  # -> [B, 512]
        x = F.relu(self.fc1(x))  # -> [B, 32]
        x = self.fc2(x)  # -> [B, 10]
        return x
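For completeness, we add a minimal sketch of attaching DP training to this network with Opacus. It is written against the Opacus >= 1.0 API, which differs from the 0.15 API used at the time of our experiments, and the hyperparameter values are illustrative.

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from opacus import PrivacyEngine

model = SampleConvNet()  # the network listed above
optimizer = torch.optim.SGD(model.parameters(), lr=0.25)
train_loader = DataLoader(
    datasets.MNIST(".", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=256,
)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,  # noise scale sigma (illustrative)
    max_grad_norm=1.0,     # per-sample clipping norm R (illustrative)
)
# Training then proceeds as usual; Opacus clips per-sample gradients and adds noise.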

C.2 CIFAR10 with Vision Transformer

In Section 5.3, we adopt the model from the TIMM library. In addition to Figure 2 and Figure 5, we plot in Figure 11 the distribution of the prediction probability on the true class, i.e. $[\bm{\pi}_{i}]_{y_{i}}$ for the $i$-th sample (notice that Figure 2 plots $\max_{k}[\bm{\pi}_{i}]_{k}$). Clearly the small-$R$ clipping gives overly confident predictions: almost half of the time, the true class is assigned close to zero prediction probability. The large-$R$ clipping yields a more balanced prediction probability that is less concentrated at 1.

Figure 11: Prediction probability on the true class on CIFAR10 with Vision Transformer.
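The quantity plotted in Figure 11 can be computed as follows; this is a sketch we add for clarity, where model and loader are placeholders for a trained classifier and its test loader.

import torch
import torch.nn.functional as F

@torch.no_grad()
def true_class_probs(model, loader):
    # Collect [pi_i]_{y_i}, the predicted probability assigned to the true class.
    probs = []
    for x, y in loader:
        pi = F.softmax(model(x), dim=1)            # prediction probabilities
        probs.append(pi[torch.arange(len(y)), y])  # probability on the true class
    return torch.cat(probs)

# Figure 2 instead plots pi.max(dim=1).values, the confidence on the argmax class.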

C.3 SNLI with BERT model

To train the BERT model, we perform the standard pre-processing on the corpus (tokenize the input, cut or pad each sequence to MAX_LENGTH = 128, and convert tokens into unique IDs); a tokenization sketch follows the figure below. We train the BERT model for 3 epochs. Similarly to Section C.2, in addition to Figure 8 and Figure 9, we plot the distribution of the prediction probability on the true class in Figure 12. Again, the small-$R$ clipping is overly confident, with probability mass concentrating on the two extremes, whereas the large-$R$ clipping assigns the prediction probability in a more balanced way.

Figure 12: Histogram of predicted confidence on the true class on SNLI with BERT using large and small clipping norms.
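For reference, the pre-processing can be sketched with the HuggingFace tokenizer as below; this is our addition, and the exact tokenizer class used in our pipeline may differ.

from transformers import BertTokenizer

MAX_LENGTH = 128
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode(premise, hypothesis):
    # Tokenize the sentence pair, truncate/pad to MAX_LENGTH, map tokens to IDs.
    return tokenizer(
        premise,
        hypothesis,
        max_length=MAX_LENGTH,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )

batch = encode("A man inspects the uniform.", "The man is sleeping.")
print(batch["input_ids"].shape)  # torch.Size([1, 128])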

C.4 Regression Experiments

We experiment on the Wine Quality (http://archive.ics.uci.edu/ml/datasets/Wine+Quality; 1279 training samples, 320 test samples, 11 features) and California Housing (http://lib.stat.cmu.edu/datasets/houses.zip; 18576 training samples, 2064 test samples, 8 features) datasets in Section 5.2. For California Housing, we use DP-Adam with batch size 256. Since the Wine Quality dataset is not large, we use the full-batch DP-GD for it.

Across both experiments, we set $\delta=\frac{1}{1.1\times\text{training sample size}}$ and use a four-layer neural network with the following structure, where input_width is the input dimension of each dataset:

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, input_width):
        super().__init__()
        self.fc1 = nn.Linear(input_width, 64, bias=True)
        self.fc2 = nn.Linear(64, 64, bias=True)
        self.fc3 = nn.Linear(64, 32, bias=True)
        self.fc4 = nn.Linear(32, 1, bias=True)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        return self.fc4(x)

The California Housing dataset is used to predict the median price of owner-occupied homes in California. We train with DP-Adam, noise $\sigma=1$, clipping norm $1$, and learning rate $0.0002$. We also train a non-DP GD with the same learning rate. The GDP accountant gives $\epsilon=4.41$ after 50 epochs / 3650 iterations.

The UCI Wine Quality (red wine) dataset is used to predict the wine quality (an integer score between 0 and 10). We train with DP-GD, noise $\sigma=35$, clipping norm $2$, and learning rate $0.03$. We also train a non-DP GD with learning rate $0.001$. The GDP accountant gives $\epsilon=4.40$ after 2000 iterations.

The California Housing and Wine Quality experiments are each repeated over 30 independent runs. In Figure 10, the lines show the average losses and the shaded regions the standard deviations.

Optimizers with Clipping beyond Gradient Descent

We can extend Theorem 1 and Theorem 2 to a wide class of full-batch optimizers besides DP-GD (with $\sigma=0$ or $\sigma\neq 0$). We show that the NTK matrices of these optimizers determine whether the loss is zero when the model converges.

Theorem 3.

For an arbitrary neural network and a loss convex in $f$, suppose we clip the per-sample gradients in the gradient flow of Heavy Ball (HB), Nesterov Accelerated Gradient (NAG), Adam, AdaGrad, RMSprop, or their DP variants, and that $\|v_{t}^{(i)}\|_{2}\leq Z$. Assuming $\mathbf{H}(t)\succ 0$, then

  1. if the loss $L(t)$ converges, it must converge to $0$ for local flat, global flat and global layerwise clipping;

  2. even if the loss $L(t)$ converges, it may converge to a non-zero value for local layerwise clipping.

The proof can be easily extracted from those of Theorem 1 and Theorem 2 and is hence omitted. We highlight that DP optimizers in general correspond to deterministic gradient flows (for DP-GD, see 4.1), as long as the noise injected in each step scales linearly with the step size. Therefore, the gradient flow is the same whether $\sigma>0$ (the noisy case) or $\sigma=0$ (the noiseless case).

We also note that the only difference between layerwise clipping and flat clipping is the form of the NTK matrix, as we showed in Theorem 1 and Theorem 2. In this section, we only present the results for flat clipping, since the generalization to layerwise clipping is straightforward. In fact, part of the results for the global clipping has been implied by Bu et al. (2021b), which establishes the error dynamics for HB and NAG, but only for the MSE loss and a specific network architecture. To analyze a broader class of optimizers under general losses and architectures, we turn to da Silva & Gazeau (2020), which gives the dynamical systems of all the aforementioned optimizers.

C.5 Gradient Methods with Momentum

We study two commonly used momentum methods, the Heavy Ball (Polyak, 1964) and Nesterov's accelerated gradient (Nesterov, 1983). These gradient methods correspond to the gradient flow system (da Silva & Gazeau, 2020, Equation (2.1))

\dot{\mathbf{w}}(t)=-\mathbf{m}(t), (C.1)
\dot{\mathbf{m}}(t)=\sum_{i}\nabla_{\mathbf{w}}\ell_{i}\,C_{i}-r(t)\,\mathbf{m}(t). (C.2)

We note that HB corresponds to a time-independent $r(t)=r$ for some $r>0$, and NAG corresponds to $r(t)=3/t$. At a stationary point, we have $\dot{L}=\dot{\mathbf{w}}=\dot{\mathbf{m}}=0$. Consequently, equation C.1 gives $\mathbf{m}=\bm{0}$ and equation C.2 gives

\sum_{i}\nabla_{\mathbf{w}}\ell_{i}\,C_{i}=r\,\mathbf{m}=\bm{0}. (C.3)

Multiplying both sides by $\frac{\partial\bm{f}}{\partial\mathbf{w}}$, we get

\mathbf{H}\mathbf{C}\,\frac{\partial L}{\partial\bm{f}}^{\top}=\bm{0},

where $\frac{\partial L}{\partial\bm{f}}$ is defined in equation 4.3. If the NTK is positive in eigenvalues, as is the case for local flat and global clipping, we get $\frac{\partial L}{\partial\bm{f}}=\bm{0}$ and $\ell_{i}=0$ for all $i$, since the loss is convex and thus its only stationary point is the global minimum $0$. Hence $L=0$. Otherwise, e.g. for local layerwise clipping, it is possible that $\frac{\partial L}{\partial\bm{f}}\neq\bm{0}$ and $L\neq 0$.
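As a sanity check (our addition, not an experiment from the paper), one can integrate the clipped HB flow C.1-C.2 with forward Euler on a toy interpolation problem and observe the loss converging to zero under flat clipping. The problem below, with per-sample losses $\ell_i(\mathbf{w})=(a_i\cdot\mathbf{w}-y_i)^2/2$, is purely illustrative.

import numpy as np

A = np.array([[1.0, 0.3], [0.2, 1.0]])   # rows a_i, one per sample
y = np.array([1.0, -1.0])
R, r_hb, eta = 0.5, 1.0, 1e-2            # clipping norm, HB friction r, Euler step size

w, m = np.zeros(2), np.zeros(2)
for _ in range(20000):                   # integrate up to time t = 200
    res = A @ w - y                      # per-sample residuals
    G = res[:, None] * A                 # per-sample gradients, shape (n, d)
    C = np.minimum(1.0, R / (np.linalg.norm(G, axis=1) + 1e-12))  # flat clipping
    w = w - eta * m                                               # Euler step of (C.1)
    m = m + eta * ((C[:, None] * G).sum(axis=0) - r_hb * m)       # Euler step of (C.2)

print(0.5 * ((A @ w - y) ** 2).sum())    # the loss converges to ~0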

C.6 Adaptive Gradient Methods with Momentum

We consider Adam, which corresponds to the dynamical system in (da Silva & Gazeau, 2020, Equation (2.1))

\dot{\mathbf{w}}(t)=-\mathbf{m}(t)/\sqrt{\mathbf{v}(t)+\xi}, (C.4)
\dot{\mathbf{m}}(t)=\sum_{i}\nabla_{\mathbf{w}}\ell_{i}\,C_{i}-\frac{1}{\alpha_{1}}\mathbf{m}(t), (C.5)
\dot{\mathbf{v}}(t)=\frac{1}{\alpha_{2}}\Big[\sum_{i}\nabla_{\mathbf{w}}\ell_{i}\,C_{i}\Big]^{2}-\frac{1}{\alpha_{2}}\mathbf{v}(t). (C.6)

Here $\xi\geq 0$ and the square is taken elementwise. At a stationary point, we have $\dot{L}=\dot{\mathbf{w}}=\dot{\mathbf{m}}=\dot{\mathbf{v}}=0$. Consequently, equation C.4 gives $\mathbf{m}=\bm{0}$ and equation C.5 gives $\sum_{i}\nabla_{\mathbf{w}}\ell_{i}C_{i}=\mathbf{m}/\alpha_{1}=\bm{0}$. Multiplying both sides by $\frac{\partial\bm{f}}{\partial\mathbf{w}}$, we again get $\mathbf{H}\mathbf{C}\frac{\partial L}{\partial\bm{f}}^{\top}=\bm{0}$, and the results follow.

C.7 Adaptive Gradient Methods without Momentum

We consider AdaGrad and RMSprop, which correspond to the dynamical system in (da Silva & Gazeau, 2020, Remark 1)

\dot{\mathbf{w}}(t)=-\sum_{i}\nabla_{\mathbf{w}}\ell_{i}\,C_{i}/\sqrt{\mathbf{v}(t)+\xi}, (C.7)
\dot{\mathbf{v}}(t)=p(t)\Big[\sum_{i}\nabla_{\mathbf{w}}\ell_{i}\,C_{i}\Big]^{2}-q(t)\,\mathbf{v}(t), (C.8)

for some $p(t),q(t)$. At a stationary point, we have $\dot{L}=\dot{\mathbf{w}}=\dot{\mathbf{v}}=0$. Consequently, equation C.7 gives $\sum_{i}\nabla_{\mathbf{w}}\ell_{i}C_{i}=\bm{0}$. Multiplying both sides by $\frac{\partial\bm{f}}{\partial\mathbf{w}}$, we again get $\mathbf{H}\mathbf{C}\frac{\partial L}{\partial\bm{f}}^{\top}=\bm{0}$, and the results follow.

C.8 Applying Global Clipping to DP Optimization Algorithms

Here we give some concrete algorithms where we can apply the global clipping method.

Many DP optimizers, both non-adaptive (like Heavy Ball and Nesterov Accelerated Gradient) and adaptive (like Adam and AdaGrad), can easily adopt the global clipping. These optimizers are supported in the Opacus and Tensorflow Privacy libraries. The original form of DP-Adam can be found in Bu et al. (2019).

Algorithm 1 DP-Adam (with local or global per-sample clipping)

Input: Dataset $S=\{(\bm{x}_{1},y_{1}),\ldots,(\bm{x}_{n},y_{n})\}$, loss function $\ell(f(\bm{x}_{i},\mathbf{w}_{t}),y_{i})$.

Parameters: initial weights $\mathbf{w}_{0}$, learning rate $\eta_{t}$, subsampling probability $p$, number of iterations $T$, noise scale $\sigma$, gradient norm bound $R$, maximum norm bound $Z$, momentum parameters $(\beta_{1},\beta_{2})$, initial momentum $m_{0}$, initial past squared gradient $u_{0}$, and a small constant $\xi>0$.

for $t=0,\ldots,T-1$ do
    Take a subsample $B_{t}\subseteq\{1,\ldots,n\}$ from the training set $S$ with subsampling probability $p$
    for $i\in B_{t}$ do
        $v_{t}^{(i)}\leftarrow\nabla_{\mathbf{w}}\ell(f(\bm{x}_{i},\mathbf{w}_{t}),y_{i})$
        Option 1: $C_{local,i}=\min\{1,R/\|v_{t}^{(i)}\|_{2}\}$ ▷ Local clipping factor
        Option 2: $C_{global,i}=R/Z$ if $\|v_{t}^{(i)}\|_{2}\leq Z$, and $0$ otherwise ▷ Global clipping factor
        $\bar{v}_{t}^{(i)}\leftarrow C_{i}\cdot v_{t}^{(i)}$ ▷ Clip the gradient
    $\widetilde{V}_{t}\leftarrow\frac{1}{|B_{t}|}\big(\sum_{i\in B_{t}}\bar{v}_{t}^{(i)}+\sigma R\cdot\mathcal{N}(0,I)\big)$ ▷ Apply Gaussian mechanism
    $m_{t}\leftarrow\beta_{1}m_{t-1}+(1-\beta_{1})\widetilde{V}_{t}$
    $u_{t}\leftarrow\beta_{2}u_{t-1}+(1-\beta_{2})(\widetilde{V}_{t}\odot\widetilde{V}_{t})$ ▷ $\odot$ is the Hadamard product
    $\mathbf{w}_{t+1}\leftarrow\mathbf{w}_{t}-\eta_{t}m_{t}/(\sqrt{u_{t}}+\xi)$ ▷ Descend
Output $\mathbf{w}_{T}$
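A condensed NumPy sketch of one DP-Adam iteration follows; this is our illustration of Algorithm 1 (without bias correction, as in the pseudocode), where per_sample_grads stands for the stacked $v_t^{(i)}$ of the subsample.

import numpy as np

def dp_adam_step(w, m, u, per_sample_grads, rng, *, lr=1e-3, R=1.0, Z=10.0,
                 sigma=1.0, beta1=0.9, beta2=0.999, xi=1e-8, global_clip=False):
    # One iteration of DP-Adam with local (Option 1) or global (Option 2) clipping.
    norms = np.linalg.norm(per_sample_grads, axis=1)
    if global_clip:
        C = np.where(norms <= Z, R / Z, 0.0)      # Option 2: global clipping factor
    else:
        C = np.minimum(1.0, R / (norms + 1e-12))  # Option 1: local clipping factor
    clipped_sum = (C[:, None] * per_sample_grads).sum(axis=0)
    noise = sigma * R * rng.standard_normal(w.shape)
    V = (clipped_sum + noise) / len(per_sample_grads)  # Gaussian mechanism
    m = beta1 * m + (1 - beta1) * V
    u = beta2 * u + (1 - beta2) * V * V                # Hadamard product
    w = w - lr * m / (np.sqrt(u) + xi)                 # descend
    return w, m, u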

Recently, Bu et al. (2021a) proposed to accelerate many DP optimizers with JL projections in a memory-efficient manner; examples include DP-SGD-JL and DP-Adam-JL. The acceleration is achieved by computing the per-sample gradient norms only approximately instead of exactly. This does not affect the subsequent clipping operation, and hence we can replace the local clipping currently used with our global clipping.

Algorithm 2 DP-SGD-JL (with local or global per-sample clipping)

Input: Dataset $S=\{(\bm{x}_{1},y_{1}),\ldots,(\bm{x}_{n},y_{n})\}$, loss function $\ell(f(\bm{x}_{i},\mathbf{w}_{t}),y_{i})$.

Parameters: initial weights $\mathbf{w}_{0}$, learning rate $\eta_{t}$, subsampling probability $p$, number of iterations $T$, noise scale $\sigma$, gradient norm bound $R$, maximum norm bound $Z$, number of JL projections $r$.

for $t=0,\ldots,T-1$ do
    Take a subsample $B_{t}\subseteq\{1,\ldots,n\}$ from the training set $S$ with subsampling probability $p$
    Sample $u_{1},\ldots,u_{r}\sim\mathcal{N}(0,I)$
    for $i\in B_{t}$ do
        $v_{t}^{(i)}\leftarrow\nabla_{\mathbf{w}}\ell(f(\bm{x}_{i},\mathbf{w}_{t}),y_{i})$
        for $j=1$ to $r$ do
            $P_{ij}\leftarrow v_{t}^{(i)}\cdot u_{j}$ (using jvp)
        $M_{i}=\sqrt{\frac{1}{r}\sum_{j=1}^{r}P_{ij}^{2}}$ ▷ $M_{i}$ is an estimate of $\|v_{t}^{(i)}\|_{2}$
        Option 1: $C_{local,i}=\min\{1,R/M_{i}\}$ ▷ Local clipping factor
        Option 2: $C_{global,i}=R/Z$ if $M_{i}\leq Z$, and $0$ otherwise ▷ Global clipping factor
        $\bar{v}_{t}^{(i)}\leftarrow C_{i}\cdot v_{t}^{(i)}$ ▷ Clip the gradient
    $\bar{V}_{t}\leftarrow\sum_{i\in B_{t}}\bar{v}_{t}^{(i)}$ ▷ Sum over batch
    $\widetilde{V}_{t}\leftarrow\bar{V}_{t}+\sigma R\cdot\mathcal{N}(0,I)$ ▷ Apply Gaussian mechanism
    $\mathbf{w}_{t+1}\leftarrow\mathbf{w}_{t}-\frac{\eta_{t}}{|B_{t}|}\widetilde{V}_{t}$ ▷ Descend
Output $\mathbf{w}_{T}$
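The key step in Algorithm 2 is the JL estimate $M_i$ of the per-sample gradient norm. Below is a self-contained sketch of that estimator (our addition); the actual algorithm obtains $P_{ij}$ via jvp without materializing the per-sample gradients, whereas here we use explicit mock gradients for illustration.

import numpy as np

def jl_norm_estimates(per_sample_grads, r, rng):
    # M_i = sqrt(mean_j (v_i . u_j)^2): mean_j P_ij^2 is unbiased for ||v_i||^2,
    # and its square root estimates the per-sample gradient norm.
    d = per_sample_grads.shape[1]
    U = rng.standard_normal((d, r))  # u_1, ..., u_r ~ N(0, I)
    P = per_sample_grads @ U         # P_ij = v_t^(i) . u_j
    return np.sqrt((P ** 2).mean(axis=1))

rng = np.random.default_rng(0)
V = rng.standard_normal((4, 1000))        # four mock per-sample gradients
print(jl_norm_estimates(V, r=50, rng=rng))
print(np.linalg.norm(V, axis=1))          # close to the estimates above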

In another line of research on Bayesian neural networks, where the reliability of networks is emphasized, stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods are applied to quantify the uncertainty of the weights. When DP is required, one popular method is DP stochastic gradient Langevin dynamics (DP-SGLD), to which we can also apply the global clipping.

Algorithm 3 DP-SGLD (with local or global per-sample clipping)

Input: Dataset $S=\{(\bm{x}_{1},y_{1}),\ldots,(\bm{x}_{n},y_{n})\}$, loss function $\ell(f(\bm{x}_{i},\mathbf{w}_{t}),y_{i})$.

Parameters: initial weights $\mathbf{w}_{0}$, learning rate $\eta_{t}$, subsampling probability $p$, number of iterations $T$, gradient norm bound $R$, maximum norm bound $Z$, and a prior $p(\mathbf{w})$.

for $t=0,\ldots,T-1$ do
    Take a subsample $B_{t}\subseteq\{1,\ldots,n\}$ from the training set $S$ with subsampling probability $p$
    for $i\in B_{t}$ do
        $v_{t}^{(i)}\leftarrow\nabla_{\mathbf{w}}\ell(f(\bm{x}_{i},\mathbf{w}_{t}),y_{i})$
        Option 1: $C_{local,i}=\min\{1,R/\|v_{t}^{(i)}\|_{2}\}$ ▷ Local clipping factor
        Option 2: $C_{global,i}=R/Z$ if $\|v_{t}^{(i)}\|_{2}\leq Z$, and $0$ otherwise ▷ Global clipping factor
        $\bar{v}_{t}^{(i)}\leftarrow C_{i}\cdot v_{t}^{(i)}$ ▷ Clip the gradient
    $\bar{V}_{t}\leftarrow\sum_{i\in B_{t}}\bar{v}_{t}^{(i)}$ ▷ Sum over batch
    $\mathbf{w}_{t+1}\leftarrow\mathbf{w}_{t}-\eta_{t}\big(\frac{\bar{V}_{t}}{|B_{t}|}-\frac{\nabla_{\mathbf{w}}\log p(\mathbf{w})}{n}\big)+\mathcal{N}(0,\eta_{t}I)$ ▷ Descend with Gaussian noise
Output $\mathbf{w}_{T}$

Here we treat $\mathbf{w}_{t+1}$ as a posterior sample, instead of as a point estimate. Notice that other SG-MCMC methods, such as SGNHT (Ding et al., 2014), can also be made DP with the global per-sample clipping.

We emphasize that our global clipping applies whenever an optimization algorithm uses per-sample clipping. Therefore, this appendix only gives a few examples of the full capacity of global clipping.

C.9 Applying Global Clipping to DP Federated Learning

Here we present the two federated learning methods of McMahan et al. (2017), DP-FedSGD and DP-FedAvg, with the global or local clipping. Notice that we only demonstrate the flat clippings and SGD; layerwise clippings can be implemented by changing the ClipFn, and the optimizer can be replaced by other choices.

Algorithm 4 DP-FedAvg and DP-FedSGD with global or local clipping.

Main training loop:
Parameters: user selection probability $q\in(0,1]$, number of examples per user $w_{k}$, gradient norm bound $R$, maximum norm bound $Z$, noise scale $\sigma$, UserUpdate (for FedAvg or FedSGD), ClipFn (LocalClip or GlobalClip).

Initialize model $\theta^{0}$, $W=\sum_{k}w_{k}$
for each round $t=0,1,2,\dots$ do
    $\mathcal{C}^{t}\leftarrow$ (sample users with probability $q$)
    for each user $k\in\mathcal{C}^{t}$ in parallel do
        $\Delta^{t+1}_{k}\leftarrow\text{UserUpdate}(k,\theta^{t},\text{ClipFn})$
    $\Delta^{t+1}=\frac{\sum_{k\in\mathcal{C}^{t}}w_{k}\Delta^{t+1}_{k}}{qW}$
    $\theta^{t+1}\leftarrow\theta^{t}+\Delta^{t+1}+\frac{\sigma R}{qW}\mathcal{N}(0,I)$

LocalClip($\Delta$): return $\Delta\cdot\min\{1,R/\|\Delta\|_{2}\}$
GlobalClip($\Delta$): if $\|\Delta\|_{2}>Z$, return $0$; else, return $\Delta\cdot R/Z$

UserUpdateFedAvg($k,\theta^{0}$, ClipFn):
Parameters: $B$, $E$, $\eta$
$\theta\leftarrow\theta^{0}$
for each local epoch $i$ from $1$ to $E$ do
    $\mathcal{B}\leftarrow$ ($k$'s data split into batches of size $B$)
    for batch $b\in\mathcal{B}$ do
        $\theta\leftarrow\theta-\eta\nabla\ell(\theta;b)$
        $\theta\leftarrow\theta^{0}+\text{ClipFn}(\theta-\theta^{0})$
return update $\Delta_{k}=\theta-\theta^{0}$

UserUpdateFedSGD($k,\theta^{0}$, ClipFn):
Parameters: $B$, $\eta$
select a batch $b$ of size $B$ from $k$'s examples
return update $\Delta_{k}=\text{ClipFn}(-\eta\nabla\ell(\theta;b))$
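The two ClipFn options in Algorithm 4, written as plain functions on a flattened update $\Delta$ (our sketch, for illustration only):

import numpy as np

def local_clip(delta, R):
    # Scale the update into the R-ball (the flat/local clipping).
    return delta * min(1.0, R / (np.linalg.norm(delta) + 1e-12))

def global_clip(delta, R, Z):
    # Keep (and rescale) updates with norm <= Z; discard the rest.
    if np.linalg.norm(delta) > Z:
        return np.zeros_like(delta)
    return delta * (R / Z)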

Appendix D Global Clipping and Code Implementation

In an earlier version of this paper, we proposed a new per-sample gradient clipping, termed the global clipping. The global clipping computes $C_{global,i}=\mathbb{I}(\|\bm{g}^{(i)}\|\leq R)$, i.e. it assigns only 0 or 1 as the clipping factor of each per-sample gradient.

As demonstrated in equation 2.2, our global clipping works with any DP optimizer (e.g., DP-Adam, DP-RMSprop, DP-FTRL (Kairouz et al., 2021), DP-SGD-JL (Bu et al., 2021a), etc.), with the same computational complexity as the existing per-sample clipping $C_{i}=\min(R/\|\bm{g}^{(i)}\|,1)$. Building on top of the Pytorch Opacus library (see https://github.com/pytorch/opacus as of 2021/09/09), we only need to add one line of code in

https://github.com/pytorch/opacus/blob/master/opacus/per_sample_gradient_clip.py

To understand our implementation, we can equivalently view

C_{global,i}=\begin{cases}1&\text{ if }C_{i}=1\Longleftrightarrow\|\bm{g}^{(i)}\|<R\Longleftrightarrow\min(R/\|\bm{g}^{(i)}\|,1)=1,\\ 0&\text{ if }C_{i}=R/\|\bm{g}^{(i)}\|\Longleftrightarrow\|\bm{g}^{(i)}\|\geq R\Longleftrightarrow\min(R/\|\bm{g}^{(i)}\|,1)=R/\|\bm{g}^{(i)}\|.\end{cases}

In this formulation, we can implement our global clipping by leveraging the Opacus==0.15 library (which already computes $C_{i}$). This can be realized in multiple ways; for example, we can add the following line after line 179 (within the for loop):

clip_factor = (clip_factor >= 1).float()
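Equivalently, outside of Opacus, the global clipping factors can be computed directly from the per-sample gradient norms; the snippet below is a sketch we add, and the norm values are illustrative.

import torch

R = 1.0
per_sample_norms = torch.tensor([0.5, 2.0, 0.9])     # illustrative ||g^(i)|| values
clip_factor = (R / per_sample_norms).clamp(max=1.0)  # existing (local) factors C_i
global_factor = (clip_factor >= 1).float()           # the added line: 0/1 factors
print(global_factor)                                 # tensor([1., 0., 1.])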

At a high level, the global clipping leaves small per-sample gradients (in magnitude) unclipped and completely removes large ones. This may benefit the optimization, since large per-sample gradients often correspond to samples that are hard to learn, noisy, or adversarial. It is important to set a large clipping norm $R$ for the global clipping, so that the information from small per-sample gradients is not wasted. However, a large clipping norm also makes the global clipping similar to the existing clipping, as most per-sample gradients are then left unclipped. We confirm empirically that, with a large clipping norm, the global clipping and the existing clipping have negligible differences in both convergence and calibration.