
Gradient Leakage Attack Resilient Deep Learning

Wenqi Wei, Ling Liu

Wenqi Wei and Ling Liu are with the School of Computer Science, Georgia Institute of Technology, Atlanta, GA, 30332.
E-mail: [email protected], [email protected]. Manuscript received xxxx xx, xxxx; revised xxxxx xx, xxxx.
Abstract

Gradient leakage attacks are among the most severe privacy threats in deep learning: attackers covertly spy on gradient updates during iterative training without compromising model training quality, and yet secretly reconstruct sensitive training data from the leaked gradients with high attack success rate. Although deep learning with differential privacy is a de facto standard for publishing deep learning models with a differential privacy guarantee, we show that differentially private algorithms with fixed privacy parameters are vulnerable to gradient leakage attacks. This paper investigates alternative approaches to gradient leakage resilient deep learning with differential privacy (DP). First, we analyze existing implementations of deep learning with differential privacy, which use a fixed noise variance to inject constant noise into the gradients in all layers using fixed privacy parameters. Despite the DP guarantee provided, this method suffers from low accuracy and is vulnerable to gradient leakage attacks. Second, we present a gradient leakage resilient deep learning approach with a differential privacy guarantee by using dynamic privacy parameters. Unlike fixed-parameter strategies that result in constant noise variance, different dynamic parameter strategies present alternative techniques to introduce adaptive noise variance and adaptive noise injection, closely aligned with the trend of gradient updates during differentially private model training. Finally, we describe four complementary metrics to evaluate and compare alternative approaches. Extensive experiments on six benchmark datasets show that differentially private deep learning with dynamic privacy parameters outperforms deep learning with fixed DP parameters and existing adaptive clipping approaches in all aspects: compelling accuracy performance, strong differential privacy guarantee, and high attack resilience.

Index Terms:
deep learning, gradient leakage attack, differential privacy

I Introduction

Deep neural networks (DNNs) have demonstrated superior capability of learning complex tasks with high prediction accuracy. With the rich data that companies collect about their users providing a premium environment for model training, two questions remain an overwhelming challenge: (i) how can a model be trained on private collections of sensitive data so that it can be deployed safely, minimizing disclosure of sensitive training data? and (ii) can a DNN model trained with differential privacy be trusted for its outputs against privacy intrusion?

Privacy Risks in Deep Learning. Deep learning is vulnerable to many privacy attacks at both the training phase and the prediction phase, which exploit the large capacity of its many model parameters, sufficient for encoding the details of individual data. Gradient leakage attacks are the dominating privacy threats during the training phase [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], assuming training data is encrypted in storage and transportation. The attacker, without any prior knowledge about the learning model, can breach the secrecy and confidentiality of the training data from the intermediate gradients. There are also privacy threats in the prediction phase, e.g., model inversion [11, 12], attribute inference [13, 14], membership inference [15, 16, 17, 18, 19, 20] and GAN-based reconstruction attacks [21, 22]. These privacy concerns aggravate with the broad deployment of deep learning applications and deep learning as a service. Although recent studies on gradient leakage attacks were in the context of federated learning [1, 2, 4, 5], the attacks are high-risk threats for both centralized cloud and edge clients, because the same spying process and unauthorized reads can be silently employed during model training without being noticed, and the same reconstruction algorithms can be utilized to disclose private training data from the leaked gradients. This paper presents a risk assessment of gradient leakage attacks and a gradient leakage resilient deep learning approach that extends conventional deep learning with differential privacy algorithms with dynamic privacy parameter optimizations.

Deep Learning with Differential Privacy. Differentially private deep learning is the de facto standard for publishing DNN models with a provable privacy guarantee [23]: it is extremely hard to characterize the difference in output between any two models trained using two neighboring inputs differing by at most one element. In other words, by simply observing the output of a DNN model trained using the differentially private learning algorithm, one cannot tell if a single example is used in the training. [24] is the first to implement differentially private deep learning with the moments accountant method for much tighter privacy accounting under random sampling. To regulate the maximum influence of the model under the two neighboring inputs, conventional approaches [24, 25, 26, 27, 28] implement differentially private deep learning by first clipping the gradients and then applying differential privacy-controlled noise to perturb the gradients before employing the stochastic gradient descent (SGD) algorithm, ensuring that each gradient descent step is differentially private. Based on the composition properties of differential privacy algorithms [23], the final model produced upon completion of the total training steps provides a certain level of differential privacy.

Inherent Limitations of Baseline. Inspired by the pioneering work [24], many proposals in the literature [25, 27, 29] and the open source community [30, 31] employ a fixed privacy parameter strategy to decide the clipping method and to define the sensitivity of gradient updates and the noise scale, which results in constant noise injection throughout every step of the entire training process. Although such a rigid setting of privacy parameters has shown reasonable accuracy while providing a certain level of differential privacy guarantee, it suffers from some inherent limitations. First, fixed privacy parameters induce constant differential privacy noise throughout the training, which is unnecessary especially at the later stage of training, and leads to low accuracy utility. Second, even with a differential privacy guarantee, such algorithms require careful privacy parameter selection, as otherwise they remain vulnerable to gradient-leakage-induced privacy violations, breaking the intended privacy protection. Although some developments aim to improve the accuracy of differentially private deep learning by optimizing the noise [24, 32, 33, 26, 34, 35, 36], none of these methods is designed to defend against gradient leakage attacks.

Scope and Contributions. In this paper, we investigate alternative approaches to differentially private deep learning, aiming for strong privacy and high accuracy performance as well as high resilience against gradient leakage attacks. This paper makes three original contributions. First, we analyze existing implementations of deep learning with differential privacy, which inject differential privacy controlled noise to perturb the gradients in all layers using fixed privacy parameters. We show that despite the differential privacy guarantee provided, the deep learning model suffers from low accuracy utility and can be vulnerable to gradient leakage attacks. This motivates us to analyze the inherent limitations of using fixed-parameter strategies in deep learning with differential privacy. Second, we propose a differentially private deep learning approach with adaptive DP parameters to address the inherent limitations of conventional deep learning algorithms with fixed DP parameters. We propose three adaptive DP parameter optimizations, including a dynamic gradient clipping method, dynamic sensitivity, and dynamic noise scale, to allow dynamic DP-controlled noise variance such that more noise is injected in early rounds and smaller noise is used to perturb the gradients in the later rounds. We show that our approach with dynamic DP parameters can achieve high resilience against gradient leakage attacks, competitive accuracy performance, and a stronger differential privacy guarantee. Unlike fixed-parameter strategies that result in constant noise variance, differentially private deep learning with adaptive DP parameters can align noise injection to the trend of gradient updates during differentially private model training. Third, we present four complementary metrics for evaluation and comparative analysis of alternative approaches to differentially private deep learning: (i) comparing accuracy (utility) performance under the same privacy budget, (ii) comparing privacy cost under the same target model accuracy and a fixed noise scale, (iii) comparing privacy cost under the same target model accuracy and a fixed noise variance, and (iv) comparing resilience against gradient leakage attacks. Extensive experiments are conducted on six benchmark datasets with five privacy accounting methods. The results show that deep learning with dynamic differential privacy optimization on sensitivity and noise scale outperforms both the baseline and the other alternative approaches with dynamic parameter optimizations, offering compelling accuracy performance, strong differential privacy guarantee, and high attack resilience.

II Gradient Leakage Threat Model

Gradient leakage attacks are among the most relevant privacy threats in the context of deep learning, in which a curious or malicious insider may conduct unauthorized reads on the gradients and reconstruct private training data based on the gradients obtained by spying over the layer-wise gradients utilized by the per-step SGD in each iteration of the training. Such insider threats do not directly compromise the accuracy of the trained model and thus are much harder to detect and mitigate through reactive defense methods. We argue that effective approaches to deep learning with differential privacy can build one of the best defense methods against such threats. Our threat model makes the following assumptions about the insider adversary: (1) the insider adversary cannot gain access to the encrypted training data prior to training; (2) the insider adversary has no intention of compromising the training procedure or the quality of the trained model; and (3) the insider adversary may gain access to the intermediate model training parameters that are often saved as checkpoint data to allow the resumption of iterative training from a given step. Unlike white-box adversarial example attacks [37, 38, 39, 40], gradient leakage attacks do not need any prior knowledge of the DNN training algorithm and simply use independent reconstruction algorithms on leaked gradients to infer and disclose the private training data [5] while keeping the integrity of the training.

Figure 1: Attack schema.

Figure 2: Attack visualization.

With parallel processing of the data, the module is replicated on each device in the forward pass and each replica handles a portion of the input. During the backward pass, gradients from each replica are summed into the original module. Figure 1 gives a sketch of the gradient leakage attack algorithm under the multi-GPU setting, which configures and executes the reconstruction attack in five steps: (1) It configures the initialization seed ($x^{0}_{rec}(t)$), a dummy data of the same resolution (or attribute structure for text) as the training data. [5] showed significant impact of different initialization seeds on the attack success rate and attack cost (# attack iterations to succeed). (2) The dummy attack seed is fed into the current copy of the model. (3) The gradient of the dummy attack seed is obtained by backpropagation. (4) The gradient loss is computed using a vector distance loss function, e.g., $L_{2}$, between the gradient of the attack seed and the actual gradient from the training. The choice of this reconstruction loss function is another tunable attack parameter. (5) The dummy attack seed is modified by the attack reconstruction learning algorithm. It aims to minimize the vector distance loss by a loss optimizer such that the gradients of the reconstructed seed $x^{\tau}_{rec}(t)$ at attack round $\tau$ will be closer to the actual gradient updates stolen upon the completion of computing the gradients of one input in the batch or of the entire batch. For batch gradients, [1] and [4] maliciously introduce separate weights for each batch example, making it possible to recover a batch of data: $x^{\tau+1}_{rec}(t)_{i \bmod B} \leftarrow x^{\tau}_{rec}(t)_{i \bmod B} - \nabla_{x^{\tau}_{rec}(t)_{i \bmod B}} D$, where $B$ is the batch size and $i$ is the index of the data in the batch. This attack reconstruction learning iterates until it reaches the attack termination condition ($\mathbb{T}$), typically defined by the # attack iterations, e.g., $\mathbb{T}=300$ (also a configurable attack parameter). If the reconstruction loss is smaller than the specified distance threshold, then the reconstruction attack is successful. Figure 2 provides a visualization by examples from four datasets at different attack iterations. The details of these datasets are given in Section VI. All the experiments on gradient leakage in this paper use the patterned random seed initialization (https://github.com/git-disl/ESORICS20-CPL), with an $L_{2}$-based loss function and the L-BFGS optimizer, for high attack success rate (ASR) and fast attack convergence.
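To make the five steps concrete, the following PyTorch sketch reconstructs a single training example from leaked gradients. It is a minimal sketch, assuming the model architecture is known to the attacker and the ground-truth label has already been inferred; the stand-in model, shapes, and hyperparameters are illustrative, not the exact configuration used in our experiments.

import torch
import torch.nn as nn

# Illustrative stand-in model; the paper's CNN is not reproduced here.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
criterion = nn.CrossEntropyLoss()

# Pretend these are the leaked per-step gradients of one private example.
x_private = torch.rand(1, 1, 28, 28)
y_private = torch.tensor([3])
leaked = torch.autograd.grad(criterion(model(x_private), y_private),
                             model.parameters())
leaked = [g.detach() for g in leaked]

# Step (1): dummy attack seed with the same resolution as the training data.
x_rec = torch.rand(1, 1, 28, 28, requires_grad=True)
optimizer = torch.optim.LBFGS([x_rec])  # step (5) loss optimizer

for it in range(300):  # attack termination condition T = 300
    def closure():
        optimizer.zero_grad()
        # Steps (2)-(3): feed the seed into the model copy, backprop its gradient.
        dummy_grads = torch.autograd.grad(
            criterion(model(x_rec), y_private),
            model.parameters(), create_graph=True)
        # Step (4): L2 distance between the dummy and leaked gradients.
        dist = sum(((dg - lg) ** 2).sum() for dg, lg in zip(dummy_grads, leaked))
        dist.backward()
        return dist
    optimizer.step(closure)  # step (5): update the seed to shrink the distance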

III Baseline Solution with Differential Privacy

In this section, we first review the differential privacy concept. Then we present the baseline deep learning with differential privacy approach that utilizes fixed privacy parameters, and point out the inherent problems of fixed privacy parameters in terms of accuracy utility loss and gradient leakage vulnerability.

III-A Preliminary

Definition 1.

Differential privacy [23]: Let $\mathcal{D}$ be the domain of possible input data and $\mathcal{R}$ be the range of all possible outputs. A randomized mechanism $\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R}$ satisfies $(\epsilon,\delta)$-differential privacy if for any two input sets $A\subseteq\mathcal{D}$ and $A'\subseteq\mathcal{D}$, differing in only one entry, $||A-A'||_{0}=1$, Equation 1 holds with $0\leq\delta<1$ and $\epsilon>0$.

$\Pr(\mathcal{M}(A)\in\mathcal{R})\leq e^{\epsilon}\Pr(\mathcal{M}(A')\in\mathcal{R})+\delta.$ (1)

This definition states that given $\delta$, a smaller $\epsilon$ indicates a smaller difference between the output of $\mathcal{M}(A)$ and the output of $\mathcal{M}(A')$. By Lemma 3.17 of [23], $(\epsilon,\delta)$-differential privacy ensures that for any adjacent $A,A'$, the absolute value of the privacy loss will be bounded by $\epsilon$ with probability at least $1-\delta$. When $0\leq\delta<1$, this definition implies that it is most likely that the observed output for $A'$ will be similar to the output for its neighboring input $A$, whereas when $\delta=0$, it indicates that the output observed under $A'$ is highly likely to be observed under $A$. Since $\delta$ is the upper bound on the probability of $\mathcal{M}(A)$ breaking $\epsilon$-differential privacy, a smaller $\delta$ is desired. Following the literature [24, 27, 36, 26], $\delta$ is set to $10^{-5}$ in our experiments such that $\epsilon$-differential privacy holds with probability at least 0.9999.

Definition 2.

Sensitivity [41]: Let $\mathcal{D}$ be the domain of possible input data and $\mathcal{R}$ be the range of all possible outputs. The sensitivity of a function $f:\mathcal{D}\rightarrow\mathcal{R}$ is the maximum amount that the function value varies when a single entry of the input is changed.

$S=\max_{A,A'\subseteq\mathcal{D},||A-A'||_{0}=1}||f(A)-f(A')||_{p}.$ (2)

Definition 2 implies that to produce a randomized differentially private algorithm $\mathcal{M}(f)$ by injecting noise that follows some randomization distribution while preserving the utility of $f$, we need to bound the noise by the maximum change defined as the sensitivity of the function $f$ with neighboring inputs. In this paper, we consider the Gaussian mechanism as our randomization noise distribution, which adds Gaussian noise calibrated to the sensitivity of the function $f$ in $l_{2}$ norm. Hence, we define $S$ by the $l_{2}$ sensitivity under Gaussian noise injection.

Theorem 1.

Gaussian mechanism [23]: Let $\mathcal{D}$ be the domain of possible input data and $\mathcal{R}$ be the range of all possible outputs. With privacy parameter $\epsilon$, applying Gaussian noise $\mathcal{N}(0,\varsigma^{2})$ calibrated to a real-valued function $f:\mathcal{D}\rightarrow\mathcal{R}$ with noise variance $\varsigma^{2}$ such that $\mathcal{M}(A)=f(A)+\mathcal{N}(0,\varsigma^{2})$ is $(\epsilon,\delta)$-differentially private if $\varsigma^{2}>\frac{2\log(1.25/\delta)\cdot S^{2}}{\epsilon^{2}}$.

This theorem indicates that for a constant $c$, when $c^{2}>2\log(1.25/\delta)$, we have the Gaussian mechanism $\mathcal{M}(A)=f(A)+\mathcal{N}(0,\varsigma^{2})$, and if $\varsigma\geq cS/\epsilon$, then $\left|\log\frac{e^{-x^{2}/(2\varsigma^{2})}}{e^{-(x+a)^{2}/(2\varsigma^{2})}}\right|\leq\epsilon$ holds with probability at least $1-\delta$, where $x$ is the random variable following the Gaussian noise distribution with density function $\frac{1}{\varsigma\sqrt{2\pi}}e^{-\frac{x^{2}}{2\varsigma^{2}}}$. Hence, it is straightforward to get the following lemma:

Lemma 1.

Let the noise variance $\varsigma^{2}$ in the Gaussian mechanism be $\sigma^{2}S^{2}$, where $\sigma$ is the noise scale and $S$ is the $l_{2}$ sensitivity. Then the noise scale $\sigma$ satisfies $\sigma^{2}>\frac{2\log(1.25/\delta)}{\epsilon^{2}}$.

According to this lemma, the noise scale $\sigma$ and the privacy loss $\epsilon$ have an inverse correlation given a fixed $\delta$: a large noise scale indicates a small $\epsilon$, and conversely, a small noise scale implies the spending of a large privacy budget $\epsilon$.
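As a concrete illustration of Theorem 1 and Lemma 1, the following minimal sketch calibrates Gaussian noise to a given $(\epsilon,\delta)$ and sensitivity; the function and variable names are ours, introduced only for illustration.

import numpy as np

def noise_scale_lower_bound(epsilon: float, delta: float) -> float:
    # Lemma 1: sigma^2 > 2 * log(1.25 / delta) / epsilon^2
    return np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

def gaussian_mechanism(value: np.ndarray, sensitivity: float,
                       epsilon: float, delta: float) -> np.ndarray:
    sigma = noise_scale_lower_bound(epsilon, delta)
    # Theorem 1: noise standard deviation varsigma = sigma * S
    return value + np.random.normal(0.0, sigma * sensitivity, size=value.shape)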

III-B Privacy Accounting

Based on the DP composition theorem [23], the post-processing theorem [23], and privacy amplification [42], the privacy budget spending can be tracked throughout the iterations of DNN training. Beyond the base composition theorem, there are four representative privacy accounting methods: moments accountant (MA) [24], zCDP [27], advanced composition (AdvC) [43], and optimal composition (OptC) [44].

Theorem 2.

Composition theorem [23]: Let $\mathcal{M}_{i}:\mathcal{D}\rightarrow\mathcal{R}_{i}$ be a randomized function that is $(\epsilon_{i},\delta_{i})$-differentially private. If $\mathcal{M}$ is a sequence of consecutive invocations (executions) of $(\epsilon_{i},\delta_{i})$-differentially private algorithms $\mathcal{M}_{i}$, then $\mathcal{M}$ is $(\sum_{i}\epsilon_{i},\sum_{i}\delta_{i})$-differentially private.

By Theorem 2, a randomized function $\mathcal{M}$ that consists of a sequence of $n$ differentially private mechanisms is differentially private, and its privacy guarantee is determined by the sum of the $n$ individual privacy losses, defined by $\sum_{i}^{n}\epsilon_{i}$. In our experiments, we consider five privacy accounting rules: base composition, advanced composition, optimal composition, moments accountant, and zCDP.

Advanced composition (AdvC [43]): Let $\epsilon_{t},\delta'\geq 0$. The class of $\epsilon_{t}$-differentially private mechanisms satisfies $(\epsilon',\delta')$-differential privacy under $T$-fold composition for:

$\epsilon'=\sum_{t=1}^{T}\frac{(e^{\epsilon_{t}}-1)\epsilon_{t}}{e^{\epsilon_{t}}+1}+\sqrt{2\sum_{t=1}^{T}\epsilon_{t}^{2}\log(1/\delta')}$ (3)

Optimal composition (OptC [44]): Let $\epsilon_{t},\delta'\geq 0$. The class of $\epsilon_{t}$-differentially private mechanisms satisfies $(\epsilon',\delta')$-differential privacy under $T$-fold composition for:

$\epsilon'=\sum_{t=1}^{T}\frac{(e^{\epsilon_{t}}-1)\epsilon_{t}}{e^{\epsilon_{t}}+1}+\sqrt{2\sum_{t=1}^{T}\epsilon_{t}^{2}\log\left(e+\frac{\sqrt{\sum_{t=1}^{T}\epsilon_{t}^{2}}}{\delta'}\right)}$ (4)

Moments accountant (MA [24]): For a mechanism $\mathcal{M}$ with Gaussian noise $\mathcal{N}(0,\sigma^{2}S^{2})$ added at each iteration, where $S$ denotes the sensitivity and $\sigma$ is the noise scale, given a dataset $\mathcal{D}$ with a total of $N$ data points, random sampling with replacement with sampling rate $q=n/N$ where $n$ is the sample size, and the number of iterations $T$, if $q<\frac{1}{16\sigma}$, there exist constants $c_{1}$ and $c_{2}$ so that for any $\epsilon<c_{1}q^{2}T$, $\mathcal{M}$ is $(\epsilon',\delta)$-differentially private for $\delta>0$ and

$\epsilon'\geq c_{2}\frac{q\sqrt{T\log(1/\delta)}}{\sigma}$ (5)

Note that Rényi differential privacy [45, 46] is proposed on top of the moments accountant to keep track of the $\epsilon$ privacy spending of a sequence of randomized mechanisms with elegant composition rules. Rényi differential privacy can be transformed into the standard differential privacy definition. We will use the moments accountant for fair comparison with the other privacy accounting methods.

zCDP [27]: For a mechanism $\mathcal{M}$ with Gaussian noise $\mathcal{N}(0,\sigma_{t}^{2}S^{2})$ added at each iteration, where $S$ denotes the sensitivity and $\sigma_{t}$ is the per-iteration noise scale, given a dataset $\mathcal{D}$ with a total of $N$ data points, random sampling with replacement with sampling rate $q=n/N$ where $n$ is the sample size, and the number of iterations $T$, if $q<\frac{1}{16\sigma}$, $\mathcal{M}$ is $(\epsilon',\delta)$-differentially private for $\delta>0$ and

$\epsilon'=\sum_{t=1}^{T}q^{2}/\sigma_{t}^{2}+2\sqrt{\sum_{t=1}^{T}(q^{2}/\sigma_{t}^{2})\log(1/\delta)}$ (6)

Due to the random sampling involved, both advanced composition and optimal composition consider an amplified per-iteration privacy spending $(\epsilon_{t},\delta_{t})$ and assume a fixed $\delta$ to track $\epsilon$ using the following privacy amplification theorem. Note that for the moments accountant, we follow the implementation provided by the TensorFlow Privacy module [47], as no closed-form expression in terms of the noise scale or the heterogeneous per-step $\epsilon_{t}$ is given for computing the accumulated $\epsilon$.

Theorem 3.

Privacy amplification [42]: Given a dataset $\mathcal{D}$ with $|D|$ data points, subsampling is defined as random sampling with replacement with $n$ as the sample size. If $\mathcal{M}$ is $(\epsilon,\delta)$-differentially private, then the subsampled mechanism with sampling rate $q=n/|D|$ is $(\log(1+q(\exp(\epsilon)-1)),q\delta)$-differentially private.
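To illustrate how Equations 3, 4, and 6 and Theorem 3 are applied, the following is a minimal sketch of closed-form privacy accounting; the helper names are ours, and the moments accountant is omitted because, as noted above, we rely on the TensorFlow Privacy implementation for it.

import numpy as np

def amplified_epsilon(eps_t: np.ndarray, q: float) -> np.ndarray:
    # Theorem 3: per-step epsilon under subsampling with rate q.
    return np.log(1.0 + q * (np.exp(eps_t) - 1.0))

def advanced_composition(eps_t: np.ndarray, delta_p: float) -> float:
    # Equation 3 over heterogeneous per-step epsilons.
    head = np.sum((np.exp(eps_t) - 1.0) * eps_t / (np.exp(eps_t) + 1.0))
    return head + np.sqrt(2.0 * np.sum(eps_t ** 2) * np.log(1.0 / delta_p))

def optimal_composition(eps_t: np.ndarray, delta_p: float) -> float:
    # Equation 4: same leading term, tighter log term than Equation 3.
    head = np.sum((np.exp(eps_t) - 1.0) * eps_t / (np.exp(eps_t) + 1.0))
    log_term = np.log(np.e + np.sqrt(np.sum(eps_t ** 2)) / delta_p)
    return head + np.sqrt(2.0 * np.sum(eps_t ** 2) * log_term)

def zcdp_composition(q: float, sigma_t: np.ndarray, delta: float) -> float:
    # Equation 6 with per-iteration noise scales sigma_t.
    rho = np.sum(q ** 2 / sigma_t ** 2)
    return rho + 2.0 * np.sqrt(rho * np.log(1.0 / delta))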

III-C Baseline with Fixed Parameters

The goal of deep learning with differential privacy is to train an $(\epsilon,\delta)$-differentially private model over $T$ iterations. By the composition theorem, we need to ensure that the per-step SGD, denoted by $f_{t}$, is $(\epsilon_{t},\delta_{t})$-differentially private ($1\leq t\leq T$) with $\epsilon=\sum_{t}^{T}\epsilon_{t}$ and $\delta=\sum_{t}^{T}\delta_{t}$. Hence, for each iteration $t$, we inject differential privacy controlled noise into the gradients before performing the per-step SGD. Given a DNN of $M$ layers, the baseline implementation for ensuring that $f_{t}$ is $(\epsilon_{t},\delta_{t})$-differentially private injects noise into all layers of the model during each training iteration.

Fixed privacy parameters. Most existing approaches to deep learning with differential privacy [24, 27, 36, 48], including the TensorFlow Privacy module [30] and the PyTorch Opacus privacy module [31], employ fixed privacy parameter strategies, such as the constant clipping method with a pre-defined clipping bound (e.g., $C=4$), a fixed pre-defined noise scale (e.g., $\sigma=6$), and a fixed sensitivity $S$ defined by the constant clipping bound $C$. As a result, in addition to distributing the privacy budget $\epsilon$ uniformly across the total $T$ training iterations, the noise variance is fixed during the training, resulting in injecting constant noise into the gradients in each training iteration.

Constant clipping method with fixed clipping bound. Given a training example $i$, let $\nabla W(t)_{im}$ denote the layer-wise per-example gradient vector for the $m^{th}$ layer ($1\leq m\leq M$). The clipping method is applied prior to noise injection to address the problem of gradient explosion [49]. With a constant clipping method using a fixed clipping bound $C$, the layer-wise per-example gradient vector $\nabla W(t)_{im}$ is preserved if its $l_{2}$ norm satisfies $||\nabla W(t)_{im}||_{2}\leq C$. Otherwise, $||\nabla W(t)_{im}||_{2}>C$ holds, and the gradient vector $\nabla W(t)_{im}$ needs to be brought down so that its $l_{2}$ norm is capped by $C$. This is done by multiplying every coordinate of the gradients with the scaling factor $C/||\nabla W(t)_{im}||_{2}$. Such per-example gradient clipping is performed on all $M$ layers for each example $i$ in the batch $B$ of the given iteration. Let $\overline{\nabla}W(t)_{im}$ denote the clipped per-example gradient for layer $m$ of training example $i$ at iteration $t$ ($m\in\{1,2,\dots,M\}$). The clipped per-example gradients are then gathered for batch averaging: $\overline{\nabla}W(t)=\frac{1}{B}\sum_{i}^{B}\overline{\nabla}W(t)_{i}$. Algorithm 1 presents the pseudo-code for the baseline clipping function using the constant clipping method initialized with a preset clipping bound $C$ [24].

Noise injection with fixed sensitivity $S$ and fixed $\sigma$. Next, the $(\epsilon_{t},\delta_{t})$-differential privacy controlled Gaussian noise $\mathcal{N}(0,\sigma^{2}S^{2})$ is injected into each layer of the batch gradients: $\widetilde{\nabla}W(t)=\overline{\nabla}W(t)+\mathcal{N}(0,\sigma^{2}S^{2})$. The per-step SGD function $f_{t}$ performs the gradient descent at iteration $t$ using the Gaussian noise perturbed gradients $\widetilde{\nabla}W(t)$, such that $W(t+1)=W(t)-\eta\widetilde{\nabla}W(t)$. This process of sampling, computing gradients, clipping, and noise injection repeats until reaching the $T$ total iterations.

Input: Per-example gradient $\nabla W_{i}$ with $M$ layers, clipping bound $C$.
1 for layer $m$ in $\{1,\dots,M\}$ do
       // compute the $l_{2}$ norm for layer $m$ of sample $i$.
2      $l2_{im}=||\nabla W_{im}||_{2}$
       // clip the per-example gradients by coordinate for layer $m$ of sample $i$.
3      $\overline{\nabla}W_{im}\leftarrow\nabla W_{im}\cdot\min\{1,\frac{C}{l2_{im}}\}$
4 end for
Output: the clipped per-example gradients for sample $i$: $\overline{\nabla}W_{i}\leftarrow\{\overline{\nabla}W_{im}\},m=1,\dots,M$.
Algorithm 1 CLIP($C$, $\nabla W_{i}$)
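Below is a minimal PyTorch sketch of Algorithm 1 together with the fixed-parameter noise injection step described above; the tensor layout and helper names are ours, introduced for illustration.

import torch

def clip_per_example(grads, C):
    # Algorithm 1: layer-wise l2 clipping of one example's gradients.
    clipped = []
    for g in grads:  # one tensor per layer m = 1..M
        l2 = g.norm(p=2).item()
        clipped.append(g * min(1.0, C / (l2 + 1e-12)))
    return clipped

def baseline_noisy_batch_gradient(per_example_grads, C, sigma):
    # Average the clipped per-example gradients, then add N(0, sigma^2 * S^2)
    # noise with the fixed sensitivity S = C (baseline setting).
    B = len(per_example_grads)
    clipped = [clip_per_example(g, C) for g in per_example_grads]
    avg = [torch.stack(layer).mean(dim=0) for layer in zip(*clipped)]
    return [g + torch.normal(0.0, sigma * C, size=g.shape) for g in avg]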

Limitations of using Fixed Privacy Parameters. First, using a fixed clipping bound to define the sensitivity of gradient changes for all iterations can be problematic, especially for the later iterations of training. This is because the $l_{2}$ norm of the layer-wise per-example gradient vector $\nabla W(t)_{im}$ satisfies $||\nabla W(t)_{im}||_{2}\leq C$ in most cases as the training approaches the end. Second, with the fixed sensitivity $S$ defined by the fixed clipping bound $C$, the constant noise computed based on $S$ and the fixed $\sigma$ can be much larger than $C$ for $\sigma>1$. Third, injecting such large constant noise into the gradients in each iteration of the training may have a detrimental effect on the accuracy performance and slow down the convergence of training, yet it gains no additional privacy protection, because the accumulated privacy spending $\epsilon$ is only inversely correlated with $\sigma$.

Furthermore, algorithms with fixed privacy parameters may be more vulnerable to gradient leakage attacks: any inappropriate setting of fixed privacy parameters brings in vulnerability to gradient leakage attacks. Figure 3 provides four visual examples on four datasets, MNIST, Fashion-MNIST, LFW, and CIFAR10, under three scenarios: non-private, DP-baseline with the fixed parameter setting of clipping bound $C=S=1$ and noise scale $\sigma=1$ as in [50, 29], and DP-baseline with clipping bound $C=S=4$ and noise scale $\sigma=6$ as in [24, 27]. It is observed that an adequate amount of noise is necessary to mask the gradients from malicious or curious inference attackers using reconstruction learning on the leaked gradients.

Figure 3: Illustration of gradient leakage attack resiliency for DP-baseline with different fixed privacy parameter settings.

Given that gradients at early training iterations tend to leak more information than gradients in the later stage of the training [5], an obvious approach is to design a differential privacy algorithm that can inject larger noise at the early stage of training and resort to smaller noise injection as the training is close to the end. Given that the noise variance $\varsigma$ is the product of the sensitivity $S$ and the noise scale $\sigma$, several possible strategies can be promising, such as having the sensitivity calibrated to the $l_{2}$ norm of the gradients, or having a smoothly decaying noise scale such that the noise variance follows the trend of gradient updates across the $T$ training iterations.

IV Dynamic Privacy Parameters

A detrimental limitation of using fixed privacy parameter strategies for deciding the clipping method and for defining the sensitivity and noise scale is that they result in constant noise variance and constant noise injection in each iteration of deep learning, which is a root cause of poor resilience against gradient leakage attacks. In this section, we address these inherent limitations by developing dynamic parameter strategies that determine and configure these privacy parameters in a training-progress-aware manner. This brings a significant advantage: smaller noise variance is used to inject noise at the later stage of the training, improving the convergence speed of training with high accuracy performance; at the same time, larger noise variance is used to inject a larger amount of noise at the early stage of the training, leading to higher resilience against gradient leakage attacks. Given that the noise variance $\varsigma$ is the product of the sensitivity $S$ and the noise scale $\sigma$ (recall Theorem 1), several dynamic strategies are promising, such as having the sensitivity calibrated to the $l_{2}$ norm of the gradients or having a smoothly decaying noise scale such that the noise variance follows the trend of gradient updates across the $T$ training iterations.

IV-A Dynamic Sensitivity

We describe three different strategies for implementing dynamic sensitivity, aiming to derive declining noise variance as the training progresses in iterations: $l_{2}$-norm based sensitivity, denoted by DP-dynS[$l_{2}$-max], a dynamic decaying clipping method, denoted by DP-dynS[$C_{decay}$], and the combination of both, denoted by DP-dynS.

DP-dynS[$l_{2}$-max]. Given that the clipping with a fixed bound $C$ is performed on the $l_{2}$ norm of the layer-wise per-example gradients in a batch, one way to accommodate the noise variance calibrated to the $l_{2}$ norm of the gradients is to use the max $l_{2}$ norm measured on the per-example gradients in a batch $B$ as the sensitivity of the training function $f_{t}$ for iteration $t$ (recall Definition 2). Consider two scenarios: (1) when any of the per-example gradients in a batch is larger than the clipping bound $C$, the sensitivity is set to $C$; however, (2) when the $l_{2}$ norm of all per-example gradients in a batch is smaller than the pre-defined clipping bound $C$, the clipping bound $C$ is unfortunately a loose estimation of the true sensitivity of the function $f_{t}$ at iteration $t$. If we instead define the sensitivity of $f_{t}$ by the max $l_{2}$ norm among the per-example gradients in the batch, we correct the problem in scenario (2). In summary, the $l_{2}$-max sensitivity takes the smaller of the max $l_{2}$ norm and the clipping bound $C$, as shown in the sketch below. Given that the $l_{2}$ norm of the gradients closely follows the trend of gradient changes as the training progresses in iterations, our dynamic $l_{2}$ sensitivity approach will have adaptive sensitivity $S$, and hence adaptive noise variance when a fixed $\sigma$ is used, and consequently inject larger noise at the early stage of training and smaller differential privacy noise at the later stage of training.
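The two scenarios above reduce to taking the smaller of the batch's max per-example $l_{2}$ norm and the clipping bound; a one-function sketch (the function name is ours):

def l2_max_sensitivity(per_example_l2_norms, C):
    # Scenario (1): any norm above C is clipped to C, so S cannot exceed C.
    # Scenario (2): if all norms are below C, the max norm is the tighter bound.
    return min(max(per_example_l2_norms), C)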

Note that unlike mean estimation [51, 52], knowing the $l_{2}$ norm-based sensitivity of the multi-dimensional gradient vector on per-example gradients in deep learning training does not help disclose the gradient value at each coordinate. Meanwhile, the $l_{2}$-max norm actually tracks the sensitivity of the training function at each iteration. The composition theorem only requires each individual training function component to be differentially private, with each training function component taking care of its own sensitivity. This makes the $l_{2}$-max sensitivity a viable choice for sensitivity optimization.

DP-dynS[$C_{decay}$]. The second dynamic sensitivity strategy is to use dynamic decaying clipping. Given that clipping is used to regulate the largest change of gradients during the training, as the training progresses, the maximum changes of the gradients decrease and gradually approach zero when the training starts to converge. Hence, one may use a dynamic decaying clipping method to estimate the maximum changes of the gradients and define the dynamic sensitivity accordingly. Motivated by the dynamic learning rate [53], we propose a set of decaying policies $DECAY_{C}(C_{0},\gamma,t)$ for implementing the decay-clipping based dynamic sensitivity, with $\gamma_{1},\gamma_{2},\gamma_{3}$, and $\gamma_{4}$ being the control parameters for the decaying rate; a sketch of the three policies is given after the list below.

Linear decay: The clipping bound decays in a linear trend, defined as $C_{t}=C_{0}(1-\gamma_{1}t)$, where $\gamma_{1}>0$ is the smooth controlling term for clipping at iteration $t$.

Exponential decay: The clipping bound decays in an exponential trend defined by an exponential function: $C_{t}=C_{0}e^{-\gamma_{2}t}$, where $\gamma_{2}$ is the control term for the exponential trend.

Cyclic decay: The clipping decay follows a decaying triangular cyclic policy originated from the cyclic learning rate [54]. With step size $V$, $C_{t}=C_{0}^{*}(1-\gamma_{3}t)$ for odd $V$ and $C_{t}=C_{0}^{*}(1+\gamma_{3}t)$ for even $V$. $C_{0}^{*}$ itself decays every cyclic triangle: $C_{0}^{*}=C_{0}(1-2\gamma_{4}V)$.
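A minimal sketch of the three decay policies under the notation above; the bookkeeping of the cyclic step size $V$ is simplified here, and the default control parameters are illustrative assumptions:

import math

def decay_clipping(C0, t, policy="linear", gamma=1e-4, V=1, gamma4=0.0):
    # DECAY_C(C0, gamma, t): three policies for the clipping bound C_t.
    if policy == "linear":       # C_t = C0 * (1 - gamma1 * t)
        return C0 * (1.0 - gamma * t)
    if policy == "exponential":  # C_t = C0 * exp(-gamma2 * t)
        return C0 * math.exp(-gamma * t)
    if policy == "cyclic":       # decaying triangular cyclic policy
        C0_star = C0 * (1.0 - 2.0 * gamma4 * V)  # C0* decays per triangle
        sign = -1.0 if V % 2 == 1 else 1.0       # odd V down, even V up
        return C0_star * (1.0 + sign * gamma * t)
    raise ValueError(f"unknown policy: {policy}")

The same set of policies is reused for the noise scale decay $DECAY_{\sigma}$ in Section IV-B.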

DP-dynS. The third dynamic sensitivity strategy is to combine the $l_{2}$-max sensitivity and the clipping decay sensitivity $C_{decay}$ so as to take advantage of the best of both worlds. Concretely, although the $l_{2}$-max sensitivity is a tighter estimation of the maximum changes in gradients, it may still benefit from additional improvements in situations where the $l_{2}$-max sensitivity happens to be defined by the clipping bound $C$. In such situations, by integrating the decay clipping sensitivity with the $l_{2}$-max sensitivity, the decaying clipping bound will be used instead of the initial large clipping bound. By Lemma 1, neither the decay clipping-based dynamic sensitivity nor the $l_{2}$-max sensitivity has a direct impact on the differential privacy guarantee. However, with an accuracy target, the model with combined sensitivity optimizations may reach the target accuracy and terminate the training earlier, resulting in smaller accumulated privacy spending.

Input: input data $D$, batch size $B$, learning rate $\eta$, maximum iteration $T$, decaying trends $\gamma_{C}$ and $\gamma_{\sigma}$, loss function $\mathcal{L}$.
1  initialization: model $W(0)$, noise scale $\sigma_{0}$, clipping bound $C_{0}$.
2  for iteration $t$ in $\{0,1,2,\dots,T-1\}$ do
       // batch processing, sampled over $D$, start $f_{t}$.
       // determine the clipping bound and noise scale for iteration $t$.
3      $C_{t}=DECAY_{C}(C_{0},\gamma_{C},t)$
4      $\sigma_{t}=DECAY_{\sigma}(\sigma_{0},\gamma_{\sigma},t)$
5      for instance $i$ in $\{1,\dots,B\}$ do
           // obtain the per-example gradients for sample $i$ at iteration $t$.
6          $\nabla W(t)_{i}\leftarrow\nabla\mathcal{L}(W(t),i)$
           // obtain the clipped per-example gradients for sample $i$ at iteration $t$.
7          $\overline{\nabla}W(t)_{i}\leftarrow CLIP(C_{t},\nabla W(t)_{i})$
8      end for
       // compute batch gradients on $M$ layers for iteration $t$.
9      $\overline{\nabla}W(t)\leftarrow\frac{1}{B}\sum_{i=1}^{B}\overline{\nabla}W(t)_{i}$
       // compute the max $l_{2}$ norm over the $M$ layers of the per-example gradients for iteration $t$, and assign the sensitivity $S$.
10     $S\leftarrow\max_{i}||\overline{\nabla}W(t)_{i}||_{2}$
       // compute sanitized batch gradients.
11     $\widetilde{\nabla}W(t)\leftarrow\overline{\nabla}W(t)+\mathcal{N}(0,\sigma_{t}^{2}S^{2})$
       // gradient descent.
12     $W(t+1)\leftarrow W(t)-\eta\widetilde{\nabla}W(t)$
       // end $f_{t}$, and start the next iteration until reaching $T$.
13 end for
Output: trained differentially private model $W(T)$
Algorithm 2 DP-dyn[$S$,$\sigma$]
             | baseline     | dynS[$C_{decay}$] | dynS[$l_{2}$-max] | dynS         | dyn$\sigma$ | dyn[$S$,$\sigma$]
$C_{t}$      | $C_{0}$      | decay             | $C_{0}$           | decay        | $C_{0}$     | decay
$S$          | $C_{t}$      | $C_{t}$           | $l_{2}$-max       | $l_{2}$-max  | $C_{t}$     | $l_{2}$-max
$\sigma_{t}$ | $\sigma_{0}$ | $\sigma_{0}$      | $\sigma_{0}$      | $\sigma_{0}$ | decay       | decay
TABLE I: Comparison of privacy parameters in DP-baseline and the five dynamic privacy parameter approaches.

IV-B Dynamic Sensitivity with Dynamic σ\sigma

A dynamic noise scale with a decaying policy is an alternative approach to supporting dynamic Gaussian noise variance over the $T$ training iterations, denoted by DP-dyn$\sigma$. Recall that the per-example gradients in early iterations are more informative and thus more vulnerable to gradient leakage attacks. With a dynamic decaying noise scale, we can ensure a larger noise scale, and thus larger noise variance, in the early stage of training and smoothly decay the scale $\sigma$ as training progresses, such that the per-example gradients are perturbed with larger noise in early iterations and are relatively less exploitable by the attacker, even when a fixed sensitivity defined by the fixed clipping bound $C$ is used. In our prototype implementation, we implement the dynamic decaying noise scale policies with a starting noise scale $\sigma_{0}$ by employing the same set of three decay policies, denoted by $DECAY_{\sigma}(\sigma_{0},\gamma,t)$, used in designing our decay-clipping dynamic sensitivity.

When combining dynamic $\sigma$ with dynamic sensitivity based on both $l_{2}$-max and decay clipping, denoted by DP-dyn[$S$,$\sigma$], we expect the best dynamic parameter optimization for differentially private deep learning. Algorithm 2 provides a sketch of the pseudo-code of DP-dyn[$S$,$\sigma$], which modifies DP-baseline by adding dynamic decay clipping (line 3), a dynamic decay noise scale (line 4), and dynamic $l_{2}$-max sensitivity (line 10) for each iteration of the training. Table I provides a summary comparison of the five different dynamic privacy parameter optimizations and the DP-baseline with respect to the three privacy parameters: clipping $C$, sensitivity $S$, and noise scale $\sigma$. Figure 4 measures the noise variance $\varsigma=S\sigma$ on MNIST with three different settings: (i) fixed $S=C=4$ and fixed $\sigma=6$ (DP-baseline), (ii) $l_{2}$-max sensitivity and fixed $\sigma$ (DP-dynS), and (iii) DP-dyn[$S$,$\sigma$] with $l_{2}$-max sensitivity and a decaying noise scale starting from $\sigma_{0}=15$. Compared to the baseline approach with fixed privacy parameters and our methods with dynamic $l_{2}$-max or dynamic clipping, DP-dyn[$S$,$\sigma$] with both dynamic sensitivity $S$ and dynamic noise scale $\sigma$ offers more resilience and better accuracy, because it adds larger noise in the early stage of training and smaller noise as the training progresses toward convergence.
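The following PyTorch sketch condenses one iteration $f_{t}$ of Algorithm 2. For brevity it clips each example's whole gradient vector rather than layer by layer, uses the linear decay policy for both $C_{t}$ and $\sigma_{t}$, and the decay rates and hyperparameter defaults are illustrative assumptions, not the settings used in our experiments.

import torch
import torch.nn as nn

def dp_dyn_step(model, batch_x, batch_y, t, *, C0=4.0, sigma0=15.0,
                gamma_C=5e-5, gamma_sigma=5e-5, lr=0.1):
    criterion = nn.CrossEntropyLoss()
    C_t = C0 * (1.0 - gamma_C * t)              # line 3: dynamic clipping bound
    sigma_t = sigma0 * (1.0 - gamma_sigma * t)  # line 4: decaying noise scale
    params = list(model.parameters())
    grad_sum = [torch.zeros_like(p) for p in params]
    S = 0.0
    B = batch_x.shape[0]
    for i in range(B):  # lines 5-8: per-example gradients, clipped to C_t
        loss = criterion(model(batch_x[i:i + 1]), batch_y[i:i + 1])
        grads = torch.autograd.grad(loss, params)
        l2 = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()
        scale = min(1.0, C_t / (l2 + 1e-12))
        S = max(S, min(l2, C_t))                # line 10: l2-max sensitivity
        for acc, g in zip(grad_sum, grads):
            acc.add_(g * scale)
    with torch.no_grad():
        for p, acc in zip(params, grad_sum):
            noisy = acc / B + torch.normal(0.0, sigma_t * S, size=p.shape)
            p.add_(-lr * noisy)                 # lines 11-12: sanitize, descend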

Figure 4: Noise variance $\varsigma$ of MNIST training under DP-baseline, DP-dynS and DP-dyn[$S$,$\sigma$].

V Privacy Analysis Metrics

For DP algorithms with fixed-parameter strategies, such as fixed clipping $C=4$, fixed noise scale $\sigma=6$ and fixed sensitivity $S=C$, no matter which one of the five privacy accounting methods outlined in Section III-B we choose to use, the total privacy spending $\epsilon$ will be the same, given the sampling rate and $\delta=10^{-5}$. Hence, the evaluation will focus on measuring and comparing the accuracy performance of the model trained using these DP algorithms. However, for DP algorithms with dynamic privacy parameter strategies, additional metrics should be considered. In this section, we describe four complementary metrics to evaluate and compare the effectiveness of alternative approaches to differentially private deep learning: (1) accuracy with a target privacy budget, (2) privacy with a target accuracy and a fixed noise scale, (3) privacy with a target accuracy and a fixed noise variance, and (4) resilience against gradient leakage attacks.

Accuracy with a target privacy budget. This metric is designed for utility analysis with respect to model accuracy under the same differential privacy spending (budget). When the privacy budget $\epsilon$ is given and fixed for comparing different algorithms, according to the privacy accounting methods outlined in Section III-B, the following parameters are fixed: the total iterations $T$, the sampling rate $q$, and the privacy parameter $\delta$. By Lemma 1, the lower bound of the noise scale $\sigma$ is also fixed. By Theorem 1, the definition of the sensitivity $S$ will directly impact how the noise variance $\varsigma$ is defined. This motivates us to advocate this metric as one of the four principled criteria to compare the baseline DP algorithm that uses fixed privacy parameter strategies (see Section III-C) with the alternative DP algorithms that use different dynamic privacy parameter strategies (see Section IV). Following a similar setting, one may conduct privacy analysis to compare alternative DP algorithms under a fixed privacy budget $\epsilon$ and a pre-defined total of training iterations $T$ on different datasets. This allows us to measure the level of differential privacy spending using all five representative privacy accounting methods (recall Section III-B), and to gain a deeper understanding of the differences among accounting methods in tracking privacy spending over the same $T$ iterations of training on the same training set. The DP algorithm with the highest accuracy performance is the winner in this analysis.

Next, we introduce two additional privacy analysis metrics: (i) privacy with a target accuracy and a fixed noise scale $\sigma$, and (ii) privacy with a target accuracy and a fixed noise variance $\varsigma$. They are designed specifically to evaluate and compare alternative DP algorithms that use dynamic parameter strategies with baseline DP algorithms that use fixed-parameter strategies. Both metrics are designed for conducting privacy analysis under the same utility goal defined by the target accuracy. A proper target accuracy should be chosen such that it can be achieved by every alternative algorithm being compared during privacy analysis. Given a fixed target model accuracy, with a fixed sampling rate $q$ and fixed $\delta=10^{-5}$, some algorithms may terminate the training before reaching the pre-defined total iterations $T$. This may result in spending less privacy budget $\epsilon$, regardless of which one of the five accounting methods is used to track the accumulative privacy spending for privacy analysis.

Privacy with a target accuracy and a fixed $\sigma$. By Lemma 1, the noise scale $\sigma$ defines the lower bound of the accumulative privacy spending $\epsilon$ over the training iterations used to achieve the given target accuracy. Hence, those DP algorithms that can achieve the target model accuracy before reaching the pre-defined $T$ iterations will terminate the training earlier, which may result in smaller accumulated privacy spending. Based on the Gaussian mechanism, when a DP algorithm uses a dynamic sensitivity $S$ with a fixed noise scale $\sigma$, it will result in a dynamically changing noise variance $\varsigma$, following the trend of the dynamic sensitivity. In this case, even though the fixed $\sigma$ leads to the same privacy spending $\epsilon$ under the same $T$ iterations, the dynamic-sensitivity-optimized DP training may achieve the target accuracy and terminate earlier, resulting in smaller accumulated privacy spending. However, for DP algorithms with fixed privacy parameters, e.g., a fixed noise scale $\sigma$ combined with a fixed sensitivity $S$, a constant noise variance $\varsigma$ will be used in each iteration of the training, which may result in excessive noise in the later stage of the training, degrading the utility without any gain in privacy.

Privacy with a target accuracy and a fixed $\varsigma$. This metric allows us to conduct privacy analysis from a very different perspective. Based on Theorem 1 and Lemma 1, the Gaussian noise variance is defined as the product of the sensitivity $S$ and the noise scale $\sigma$. With a fixed noise variance $\varsigma$, all DP algorithms will inject the same amount of noise in each of the total $T$ training iterations, regardless of whether they use fixed-parameter strategies or dynamic parameter strategies. However, the accumulated privacy spending for DP algorithms with dynamic parameter strategies will be smaller than the accumulative privacy cost for DP algorithms with fixed-parameter strategies. Concretely, for DP algorithms with dynamic sensitivity under a fixed noise variance $\varsigma$, the noise scale $\sigma$ can no longer be fixed, and it is determined in a reverse trend of the dynamic sensitivity $S$. As the training progresses, the dynamic sensitivity tends to align closely with the declining trend of gradient updates, and we have to use a larger noise scale to keep the noise variance constant. As a result, DP algorithms with dynamic sensitivity tend to result in smaller privacy spending ($\epsilon$) than DP algorithms with fixed parameters.
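As a sketch of this effect under zCDP accounting (Equation 6): holding the noise variance fixed while the sensitivity declines forces $\sigma_{t}$ upward and shrinks the accumulated $\epsilon$. The declining sensitivity trend and parameter values below are illustrative assumptions, not measurements from our experiments.

import numpy as np

def zcdp_eps_fixed_variance(varsigma, S_t, q, delta):
    # With fixed noise variance, sigma_t = varsigma / S_t rises as S_t falls.
    sigma_t = varsigma / np.asarray(S_t)
    rho = np.sum(q ** 2 / sigma_t ** 2)          # Equation 6
    return rho + 2.0 * np.sqrt(rho * np.log(1.0 / delta))

T, q, delta = 10000, 0.01, 1e-5
fixed_S = np.full(T, 4.0)                  # fixed S = C = 4 with sigma = 6
declining_S = np.linspace(4.0, 1.0, T)     # an assumed dynamic sensitivity trend
print(zcdp_eps_fixed_variance(24.0, fixed_S, q, delta))      # larger epsilon
print(zcdp_eps_fixed_variance(24.0, declining_S, q, delta))  # smaller epsilon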

Resilience against Gradient Leakage Attacks. This metric is designed to measure and compare alternative DP algorithms with respect to attack resilience, which can be defined using both (i) the adverse effect of the attack, measured by the accuracy performance under attack, and (ii) the attack cost, in terms of the time spent to perform reconstruction inference on the leaked gradients and whether the inference results in a successful disclosure of a private training example. A DP algorithm is considered highly resilient if, under the same privacy spending, it outperforms the other alternatives with the highest attack resilience.

VI Experimental Evaluation

dataset | # train | # test | # features | # class | # iter. | batch size | acc.
MNIST | 60000 | 10000 | 28*28 | 10 | 10000 | 600 | 0.989
Fashion-MNIST | 60000 | 10000 | 28*28 | 10 | 10000 | 600 | 0.875
CIFAR10 | 50000 | 10000 | 32*32*3 | 10 | 10000 | 500 | 0.687
LFW | 2267 | 756 | 32*32*3 | 62 | 6000 | 22 | 0.766
Purchase-10 | 10000 | 2000 | 600 | 10 | 5000 | 100 | 0.825
Purchase-50 | 10000 | 2000 | 600 | 50 | 5000 | 100 | 0.667
TABLE II: Benchmark datasets and parameters.

We evaluate alternative approaches to deep learning with differential privacy using six benchmark datasets. Table II lists the six datasets with their training set size, test set size, #features, #classes, total training iterations $T$, batch size $B$, and non-private accuracy. MNIST is a grey-scale hand-written digit image dataset. Fashion-MNIST is an image dataset in which each image is associated with a label from 10 clothing classes such as T-shirt, Trouser, and Pullover. CIFAR10 is a dataset of color images of objects. LFW is a dataset originating from 13233 images of 5749 classes, obtained by resizing the original images of $250\times 250$ to $32\times 32$ and extracting the 'interesting' region. Since most of the classes have a very limited number of data points, we consider the 3023 images from the 62 classes that have more than 20 images per class. A 4:1 train-test ratio is applied. Purchase is a dataset adopted from the Kaggle challenge of "acquire valued shoppers", with shopping histories for several thousand individuals. We consider a simplified version with 197,325 records, each with 600 binary features converted from the non-numerical raw data. Similar to [15], we cluster the data records into 10 and 50 classes for classification. A 4:1 ratio is used for the training and test data split. For the four image datasets, a deep convolutional neural network with two convolutional layers and one fully connected layer is used. For the two attribute datasets, a fully connected model with two hidden layers is applied.

 | | MNIST | Fashion-MNIST | CIFAR10 | LFW
non-private | attack iter | 11.5 | 12.4 | 28.3 | 25
 | ASR | 1 | 1 | 0.973 | 1
 | MSE | 1.50E-05 | 2.59E-05 | 2.20E-04 | 2.20E-04
DP-baseline | attack iter | 13.4 | 14.1 | 30.5 | 26.3
 | ASR | 0.23 | 0.25 | 0.16 | 0.13
 | MSE | 0.547 | 0.588 | 0.401 | 0.352
Quantile clip [55] | attack iter | 12.9 | 14.3 | 30.5 | 27.1
 | ASR | 0.64 | 0.67 | 0.47 | 0.49
 | MSE | 0.097 | 0.121 | 0.203 | 0.229
AdaClip [56] | attack iter | 12.6 | 14.1 | 29.9 | 26.4
 | ASR | 0.72 | 0.76 | 0.58 | 0.61
 | MSE | 0.033 | 0.041 | 0.129 | 0.144
DP-dyn[$S$,$\sigma$] | attack iter | 300 | 300 | 300 | 300
 | ASR | 0 | 0 | 0 | 0
 | MSE | 0.886 | 0.904 | 0.912 | 0.897
TABLE III: Comparing gradient leakage resilience of non-private training with DP-baseline and DP-dyn[$S$,$\sigma$]. The maximum attack iteration is set to 300.

Figure 5: Gradient leakage attack evaluation. Dynamic privacy parameters help improve the robustness.

VI-A Resilience against Gradient Leakages

The first set of experiments measures and compares the resilience against gradient leakage attacks of alternative DP algorithms with fixed-parameter strategies, existing adaptive clipping approaches, and dynamic parameter strategies. The following three attack metrics are used to evaluate the adverse effect and cost of gradient leakage attacks, which in turn can be used to measure the resilience of different DP approaches in the presence of attacks: attack success rate (ASR), #attack iterations (attack iter) for the inference to succeed, with 300 as the default maximum, and attack reconstruction distance in MSE (mean square error). A successful attack reconstruction is defined by an MSE smaller than 0.70. We report the attack iterations and attack reconstruction distance only for the successful attack examples. For attacks with complete failure, we record their reconstruction distance at the pre-defined maximum of 300 reconstruction/inference iterations. Table III reports the attack results for the four image datasets under non-private models, the DP-baseline, DP-dyn[$S$,$\sigma$], and two representative adaptive clipping methods: quantile-based clipping [55] and AdaClip [56]. AdaClip performs clipping bound estimation per coordinate and adaptively adds different noise levels to different dimensions of the gradients, whereas quantile clipping estimates the clipping bound using a quantile of the unclipped gradient norms. The default setting of $C=4$, $\sigma=6$ is used for the DP-baseline algorithms (fixed parameters) and also used as the initial value for quantile-based clipping [55] and DP-dyn[$S$,$\sigma$]. The attack is performed on the gradients from the first training iteration, as gradients at early training iterations tend to leak more information than gradients in the later stage of the training [5]. Figure 5 illustrates the reconstruction results. We make three observations. (1) For all four datasets, the gradients of non-private models are vulnerable to the gradient leakage attack. (2) The quantile-based clipping [55] and AdaClip [56] methods can slightly improve the resilience against the gradient leakage attack compared to non-private training. The two adaptive clipping approaches focus mainly on reducing the noise variance rather than on gradient attack resilience. Therefore the DP-baseline, with large and constant noise, is more effective, with higher resilience against the gradient leakage threats. (3) Compared to both the DP-baseline and the existing adaptive clipping approaches, DP-dyn[$S$,$\sigma$] offers the highest resilience against gradient leakage attacks. This is because DP-dyn[$S$,$\sigma$] adds larger noise at the early iterations of the training, thanks to the combined effect of dynamic sensitivity and dynamic noise scale, which effectively creates difficulty for the gradient leakage attack to succeed. The high attack reconstruction distance in Table III shows strong attack resilience against gradient leakage threats. The visualization in Figure 5 also provides an intuitive illustration of this effect. For the two existing approaches, the adaptive clipping brings down the differential privacy noise compared to the DP-baseline and thus shows less improvement in the resilience of DP training against gradient leakage attacks. In contrast, differential privacy noise injection combined with dynamic sensitivity and dynamic noise scale offers the best defense against gradient leakage threats.

In the rest of the experiments, we provide empirical results and analysis to further illustrate the benefits of the proposed gradient leakage resilient deep learning with dynamic differential privacy parameter optimizations. We first demonstrate how the proposed DP-dyn[$S$,$\sigma$] achieves higher accuracy compared to the DP-baseline at the same ($\epsilon$,$\delta$) differential privacy level. Then we show how DP-dyn[$S$,$\sigma$] obtains a stronger differential privacy guarantee in terms of smaller $\epsilon$ spending at a target accuracy. Finally, we show that in addition to high resilience against gradient leakage, the proposed dynamic privacy parameter optimizations have an acceptable time cost compared to the conventional DP deep learning baseline. Furthermore, our approach consistently offers high test accuracy on all benchmark datasets, compared to both the conventional DP deep learning baseline and the existing adaptive clipping proposals.

 | C=1 | C=2 | C=4 | C=8 | C=16 | C=32
MNIST (non-private: 0.9892)
DP-baseline | 0.9450 | 0.963 | 0.960 | 0.945 | 0.915 | 0.881
DP-dynS[$C_{decay}$] | 0.948 | 0.966 | 0.964 | 0.951 | 0.931 | 0.919
DP-dynS[$l_{2}$-max] | 0.959 | 0.975 | 0.977 | 0.983 | 0.975 | 0.976
DP-dynS | 0.955 | 0.975 | 0.978 | 0.980 | 0.978 | 0.977
Fashion-MNIST (non-private: 0.875)
DP-baseline | 0.825 | 0.832 | 0.833 | 0.831 | 0.812 | 0.765
DP-dynS[$C_{decay}$] | 0.824 | 0.834 | 0.839 | 0.837 | 0.829 | 0.822
DP-dynS[$l_{2}$-max] | 0.830 | 0.840 | 0.845 | 0.844 | 0.840 | 0.840
DP-dynS | 0.829 | 0.841 | 0.848 | 0.845 | 0.840 | 0.840
CIFAR10 (non-private: 0.687)
DP-baseline | 0.576 | 0.595 | 0.608 | 0.588 | 0.538 | 0.397
DP-dynS[$C_{decay}$] | 0.575 | 0.596 | 0.610 | 0.592 | 0.564 | 0.507
DP-dynS[$l_{2}$-max] | 0.583 | 0.603 | 0.616 | 0.609 | 0.591 | 0.592
DP-dynS | 0.582 | 0.604 | 0.621 | 0.612 | 0.595 | 0.595
LFW (non-private: 0.766)
DP-baseline | 0.704 | 0.709 | 0.692 | 0.647 | 0.595 | 0.406
DP-dynS[$C_{decay}$] | 0.692 | 0.714 | 0.703 | 0.675 | 0.643 | 0.606
DP-dynS[$l_{2}$-max] | 0.715 | 0.722 | 0.738 | 0.710 | 0.683 | 0.645
DP-dynS | 0.711 | 0.725 | 0.749 | 0.716 | 0.685 | 0.662
Purchase-10 (non-private: 0.825)
DP-baseline | 0.731 | 0.734 | 0.744 | 0.739 | 0.716 | 0.683
DP-dynS[$C_{decay}$] | 0.727 | 0.738 | 0.749 | 0.744 | 0.729 | 0.721
DP-dynS[$l_{2}$-max] | 0.751 | 0.766 | 0.769 | 0.763 | 0.759 | 0.761
DP-dynS | 0.752 | 0.768 | 0.773 | 0.770 | 0.765 | 0.766
Purchase-50 (non-private: 0.667)
DP-baseline | 0.549 | 0.561 | 0.571 | 0.574 | 0.556 | 0.503
DP-dynS[$C_{decay}$] | 0.545 | 0.569 | 0.577 | 0.581 | 0.579 | 0.562
DP-dynS[$l_{2}$-max] | 0.571 | 0.573 | 0.591 | 0.595 | 0.587 | 0.580
DP-dynS | 0.573 | 0.577 | 0.596 | 0.604 | 0.591 | 0.593
TABLE IV: Effectiveness of dynamic sensitivity optimizations in model accuracy improvement. $\sigma=6$ and $T$ is given in Table II.

VI-B Dynamic Sensitivity: Accuracy Analysis

This set of experiments compares the DP-baseline with three dynamic sensitivity methods on six datasets: DP-dynS[l2-max], DP-dynS[Cdecay], and DP-dynS, where DP-dynS denotes the dynamic sensitivity defined by combining the l2-max sensitivity with dynamic clipping. Given that both DP-baseline and DP-dynS[l2-max] use a constant clipping method with a fixed pre-defined clipping bound, for a fair comparison we vary the clipping bound from 1 to 32 in powers of 2 and set σ=6. Table IV reports the results. We make three observations. First, DP-dynS[l2-max] consistently outperforms DP-baseline under all settings of the clipping bound on all six datasets. By Lemma 1, DP-baseline uses a constant noise variance and injects constant noise in each iteration of the training, regardless of the decreasing trend of gradients in the later stage of training. DP-baseline is also highly sensitive to the setting of the clipping bound on all six datasets: accuracy drops when the clipping bound is set too large (e.g., C=16, 32) or too small (e.g., C=1). In contrast, DP-dynS[l2-max] uses the dynamic l2-max sensitivity, which changes from iteration to iteration and closely aligns with the declining trend of gradients throughout the training. Second, DP-dynS[l2-max] consistently outperforms DP-dynS[Cdecay], showing that the noise variance ς defined by the l2-max sensitivity and noise scale σ offers tighter alignment than the dynamic sensitivity defined by the decaying clipping bound Cdecay; a simple policy of linearly decaying the clipping bound from the initial value C to C/2 is used in this experiment. Third, DP-dynS takes the best from both l2-max sensitivity and dynamic clipping and consistently outperforms all other alternatives on all six datasets. Based on our empirical observations, it is hard to obtain scalable and stable performance for deep learning with DP when using dynamic clipping to approximate the sensitivity S: even with dynamic decaying clipping, as in DP-dynS[Cdecay], AdaClip, or Quantile clipping (see Table XI), the approximation depends on the initial setting of the clipping upper bound and on the minimum clipping bound as training progresses toward convergence, and such maximum and minimum clipping bounds are to some extent dataset and training task dependent, as shown in Table IV. In comparison, our proposed DP-dynS[l2-max] is a scalable and much more stable dynamic DP parameter optimization in terms of attack resilience, accuracy, convergence, and privacy spending.
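For illustration, the sketch below shows one way to realize the three dynamic sensitivity strategies compared in Table IV. We take the l2-max sensitivity here as the largest per-example gradient norm in the batch and DP-dynS as the minimum of that value and the decaying clipping bound; these readings and the NumPy details are simplifying assumptions for exposition, with only the linear decay from C to C/2 and the settings C=4, σ=6 taken from the text.

import numpy as np

C, T = 4.0, 10_000
rng = np.random.default_rng(0)

def clip_decay(t):
    # DP-dynS[Cdecay]: linearly decay the clipping bound from C to C/2 over T.
    return C - (C / 2.0) * (t / T)

def l2_max_sensitivity(per_example_grads):
    # DP-dynS[l2-max]: largest per-example gradient l2 norm in the batch
    # (our simplifying reading of the l2-max sensitivity).
    return max(np.linalg.norm(g) for g in per_example_grads)

def dyn_sensitivity(per_example_grads, t):
    # DP-dynS: combine the l2-max sensitivity with the decaying clipping bound.
    return min(l2_max_sensitivity(per_example_grads), clip_decay(t))

def noisy_mean_grad(per_example_grads, t, sigma=6.0):
    # One DP step: clip each per-example gradient to S, add N(0, (sigma*S)^2).
    S = dyn_sensitivity(per_example_grads, t)
    clipped = [g * min(1.0, S / np.linalg.norm(g)) for g in per_example_grads]
    noise = rng.normal(0.0, sigma * S, size=clipped[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(clipped)

grads = [rng.standard_normal(100) for _ in range(32)]  # toy per-example grads
print(noisy_mean_grad(grads, t=0)[:3])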

Privacy spending (ϵ) acc
BaseC AdvC OptC zCDP MA
MNIST DP-baseline 123.354 7.450 6.740 1.159 0.823 0.960
DP-dynS[Cdecay] 123.354 7.450 6.740 1.159 0.823 0.964
DP-dynS[l2-max] 123.354 7.450 6.740 1.159 0.823 0.977
DP-dynS 123.354 7.450 6.740 1.159 0.823 0.978
Fashion-MNIST DP-baseline 123.354 7.450 6.740 1.159 0.823 0.833
DP-dynS[Cdecay] 123.354 7.450 6.740 1.159 0.823 0.839
DP-dynS[l2-max] 123.354 7.450 6.740 1.159 0.823 0.845
DP-dynS 123.354 7.450 6.740 1.159 0.823 0.848
CIFAR10 DP-baseline 123.354 7.450 6.740 1.159 0.823 0.608
DP-dynS[Cdecay] 123.354 7.450 6.740 1.159 0.823 0.610
DP-dynS[l2-max] 123.354 7.450 6.740 1.159 0.823 0.616
DP-dynS 123.354 7.450 6.740 1.159 0.823 0.621
LFW DP-baseline 74.024 5.503 5.037 0.893 0.636 0.692
DP-dynS[Cdecay] 74.024 5.503 5.037 0.893 0.636 0.703
DP-dynS[l2-max] 123.354 7.450 6.740 1.159 0.823 0.738
DP-dynS 74.024 5.503 5.037 0.893 0.636 0.749
purchase-10 DP-baseline 61.689 4.952 4.546 0.814 0.580 0.744
DP-dynS[Cdecay] 61.689 4.952 4.546 0.814 0.580 0.749
DP-dynS[l2-max] 123.354 7.450 6.740 1.159 0.823 0.769
DP-dynS 61.689 4.952 4.546 0.814 0.580 0.773
purchase-50 DP-baseline 61.689 4.952 4.546 0.814 0.580 0.571
DP-dynS[Cdecay] 61.689 4.952 4.546 0.814 0.580 0.577
DP-dynS[l2-max] 123.354 7.450 6.740 1.159 0.823 0.591
DP-dynS 61.689 4.952 4.546 0.814 0.580 0.596
TABLE V: Measure and compare privacy spending ϵ using the five different privacy composition methods for DP-baseline, DP-dynS[l2-max], DP-dynS[Cdecay] and DP-dynS, with clipping bound C=4, noise scale σ=6, and δ=1e-5.

VI-C Dynamic Sensitivity: Privacy Analysis

Privacy under five accounting methods. The first set of experiments for privacy analysis measures privacy spending using five privacy accounting methods for DP-baseline and the three alternative dynamic sensitivity optimizations: DP-dynS[l2-max], DP-dynS[Cdecay], and DP-dynS. Table V reports the results. We make two observations. First, all five accounting methods show consistent behavior for all four DP algorithms on the six benchmark datasets, with the Moments Accountant providing the tightest tracking of privacy spending, followed by zCDP. Second, DP-dynS consistently outperforms both DP-baseline and the other two dynamic sensitivity algorithms in accuracy, showing that combining the l2-max sensitivity with the decaying clipping method Cdecay is more effective in keeping the sensitivity S aligned with the trend of gradient updates during training.

Privacy under a target accuracy and a fixed σ. In this second set of experiments, we analyze and compare DP-baseline with the three alternative dynamic sensitivity optimizations in terms of their privacy spending under a target accuracy and a fixed noise scale σ. For each of the six datasets, we use the accuracy achieved by DP-baseline at T iterations as the target accuracy (recall Table II). Table VI reports the results. DP-dynS is consistently the winner across all six datasets because it is the first to reach the target accuracy, and hence the first to terminate the training, with the smallest privacy spending. Concretely, DP-dynS takes 5137, 5532, 6336, 3645, 3690, and 3995 iterations to reach the target accuracy for MNIST, Fashion-MNIST, CIFAR10, LFW, Purchase-10, and Purchase-50, respectively. As a result, its accumulated privacy spending is much lower than that of the other three approaches.
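The early-termination protocol behind Table VI can be sketched as follows: train, accumulate the per-iteration privacy loss, and stop as soon as the target accuracy is reached. For brevity, the sketch uses the unamplified Gaussian-mechanism zCDP bound ρ = 1/(2σ²) per step with the standard zCDP-to-(ϵ,δ) conversion; the accountants used in our experiments additionally exploit subsampling amplification, so the number printed here is a looser, illustrative bound, and the accuracy curve is a toy stand-in for the real evaluation loop.

import math

def zcdp_to_eps(rho, delta=1e-5):
    # Standard zCDP-to-(eps, delta) conversion.
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

def train_to_target(sigma, target_acc, max_iters, eval_accuracy):
    rho = 0.0
    for t in range(1, max_iters + 1):
        # ... one DP-SGD step with noise scale sigma would run here ...
        rho += 1.0 / (2.0 * sigma ** 2)   # Gaussian mechanism; zCDP adds up
        if eval_accuracy(t) >= target_acc:
            return t, zcdp_to_eps(rho)    # stop once the target is reached
    return max_iters, zcdp_to_eps(rho)

# Toy accuracy curve standing in for the real evaluation loop (assumption).
fake_acc = lambda t: 0.98 * (1.0 - math.exp(-t / 1500.0))
iters, eps = train_to_target(sigma=6.0, target_acc=0.9596,
                             max_iters=10_000, eval_accuracy=fake_acc)
print(f"stopped at iteration {iters} with zCDP eps = {eps:.3f}")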

Privacy spending (ϵ) iter
BaseC AdvC OptC zCDP MA
MNIST (0.960) DP-baseline 123.354 7.450 6.740 1.159 0.823 10000
DP-dynS[Cdecay] 98.646 6.518 5.930 1.034 0.735 7996
DP-dynS[l2-max] 66.858 5.188 4.756 0.848 0.604 5419
DP-dynS 63.379 5.029 4.615 0.825 0.588 5137
Fashion-MNIST (0.833) DP-baseline 123.354 7.450 6.740 1.159 0.823 10000
DP-dynS[Cdecay] 100.200 6.578 5.983 1.042 0.741 8124
DP-dynS[l2-max] 71.878 5.411 4.955 0.880 0.626 5826
DP-dynS 68.251 5.250 4.812 0.857 0.610 5532
CIFAR10 (0.608) DP-baseline 123.354 7.450 6.740 1.159 0.823 10000
DP-dynS[Cdecay] 109.044 6.919 6.280 1.088 0.773 8841
DP-dynS[l2-max] 80.809 5.794 5.294 0.934 0.664 6550
DP-dynS 78.169 5.682 5.195 0.918 0.653 6336
LFW (0.692) DP-baseline 74.024 5.503 5.037 0.893 0.636 6000
DP-dynS[Cdecay] 62.417 4.985 4.576 0.819 0.583 5059
DP-dynS[l2-max] 47.331 4.254 3.919 0.711 0.508 3836
DP-dynS 44.975 4.132 3.809 0.693 0.495 3645
purchase-10 (0.744) DP-baseline 61.689 4.952 4.546 0.814 0.580 5000
DP-dynS[Cdecay] 58.926 4.822 4.430 0.795 0.567 4776
DP-dynS[l2-max] 46.615 4.217 3.886 0.706 0.504 3778
DP-dynS 45.530 4.161 3.835 0.697 0.498 3690
purchase-50 (0.571) DP-baseline 61.689 4.952 4.546 0.814 0.580 5000
DP-dynS[Cdecay] 59.308 4.840 4.446 0.798 0.568 4807
DP-dynS[l2-max] 50.871 4.433 4.080 0.738 0.526 4123
DP-dynS 49.292 4.354 4.009 0.726 0.518 3995
TABLE VI: Measure and compare differential privacy spending ϵ under the target accuracy and fixed noise scale σ=6 for DP-baseline, DP-dynS[l2-max], DP-dynS[Cdecay] and DP-dynS. Clipping bound C=4, δ=1e-5. The target accuracy is set to the accuracy of DP-baseline on each dataset.
BaseC AdvC OptC zCDP MA acc
MNIST (C=4) DP-baseline 123.354 7.450 6.740 1.159 0.823 0.9596
DP-dynS[Cdecay] 105.173 5.889 5.736 1.119 0.806 0.9596
DP-dynS[l2-max] 34.412 1.866 1.787 0.436 0.314 0.9596
DP-dynS 32.529 1.783 1.694 0.412 0.307 0.9596
MNIST (C=8) DP-baseline 123.354 7.450 6.740 1.159 0.823 0.945
DP-dynS[Cdecay] 114.531 6.552 6.341 0.994 0.675 0.945
DP-dynS[l2-max] 28.828 1.589 1.511 0.378 0.283 0.945
DP-dynS 28.015 1.426 1.412 0.366 0.269 0.945
CIFAR10 (C=4) DP-baseline 123.354 7.450 6.740 1.159 0.823 0.608
DP-dynS[Cdecay] 120.119 7.006 6.662 1.153 0.812 0.608
DP-dynS[l2-max] 116.717 6.292 6.330 1.114 0.790 0.608
DP-dynS 113.458 5.557 5.536 1.044 0.763 0.608
CIFAR10 (C=8) DP-baseline 123.354 7.450 6.740 1.159 0.823 0.588
DP-dynS[Cdecay] 116.304 6.316 6.405 1.137 0.792 0.588
DP-dynS[l2-max] 109.405 5.865 5.886 1.063 0.754 0.588
DP-dynS 105.256 5.849 5.834 1.022 0.748 0.588
MNIST (C=4) DP-baseline 523.375 232.117 258.635 4.109 21.174 0.9773
DP-dynS[Cdecay] 155.677 8.575 9.037 1.370 0.944 0.9773
DP-dynS[l2-max] 123.354 7.450 6.740 1.159 0.823 0.9773
DP-dynS 117.266 7.036 6.687 1.149 0.806 0.9773
TABLE VII: Measure and compare differential privacy spending ϵ under a target accuracy and a fixed noise variance ς for DP-baseline, DP-dynS[l2-max], DP-dynS[Cdecay] and DP-dynS. Clipping bound C=4 or C=8, noise scale σ=6, and δ=1e-5. The target accuracy is set to the accuracy of DP-baseline for both MNIST and CIFAR10 in the first four scenarios. For the last scenario, the target accuracy is set to the accuracy of DP-dynS[l2-max] (0.9773) for MNIST with C=4.

Privacy under a target accuracy and a fixed ς. The third set of experiments for privacy analysis measures the privacy spending under a target accuracy and a fixed noise variance ς. We compare DP-baseline with the three alternative DP algorithms using different dynamic sensitivity strategies, DP-dynS[l2-max], DP-dynS[Cdecay], and DP-dynS, on MNIST and CIFAR10. The accuracy of DP-baseline is used as the target accuracy for both MNIST and CIFAR10. We set σ=6 with C=4 and C=8, and conduct the experiments with two fixed noise variance settings: ς=24 with C=4 and ς=48 with C=8. Based on Theorem 2 and Lemma 1, when the noise variance ς is fixed, the noise scale σ follows the dynamic sensitivity S in a reverse trend: as S continues to decline while the training progresses in #iterations, the noise scale σ must increase to keep the noise variance ς constant. As a result, a larger sensitivity S in the early stage of training leads to a smaller noise scale and larger privacy spending to protect the informative gradients, and as the training approaches the end, the sensitivity S becomes smaller, resulting in a larger σ and smaller privacy spending. Table VII shows the results. Consider the first four scenarios, in which the target accuracy is set to the accuracy of DP-baseline for each dataset. We make three observations: (1) DP-dynS consistently outperforms the other three alternatives for the two noise variance settings on both datasets when given a target accuracy and a fixed noise variance ς. (2) DP-dynS[l2-max] consistently ranks second, with significantly smaller privacy spending than DP-dynS[Cdecay] in all four scenarios under all five privacy accounting methods. This further demonstrates that adaptive clipping with a decay function remains a loose estimation compared to the l2-max sensitivity for differentially private deep learning; hence DP-dynS[Cdecay] consumes more privacy budget than DP-dynS[l2-max] and DP-dynS under both fixed noise variances on both datasets. (3) Comparing the privacy spending on the same dataset under the two fixed noise variances, or on the two datasets under the same fixed noise variance, it is interesting to note that DP-baseline has the same privacy spending under both variance settings on both MNIST and CIFAR10. This is because the privacy spending ϵ is anti-correlated with the noise scale σ (recall Lemma 1) and is not sensitive to the setting of the clipping bound under the constant clipping method. For DP algorithms with dynamic sensitivity, however, the privacy spending is also training data dependent: for each of the two fixed noise variances, both with σ=6, all three dynamic sensitivity algorithms incur higher privacy spending on CIFAR10 and lower privacy spending on MNIST. This confirms the common knowledge that the gradient update trend during training is model-dependent and dataset-dependent.
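A small worked sketch of this inverse relation: fixing the noise variance parameter ς = σ·S at 24 (i.e., C=4 and σ=6) while S declines forces the effective noise scale upward and shrinks the per-step privacy spending. The linear sensitivity schedule below is purely illustrative; with S falling from 4 to 2, σ doubles from 6 to 12 and the per-step zCDP spending ρ = 1/(2σ²) drops fourfold.

varsigma = 24.0                          # fixed sigma * S (C = 4, sigma = 6)

for t in (0, 2500, 5000, 7500, 10000):
    S_t = 4.0 - 2.0 * (t / 10000.0)      # illustrative sensitivity, 4 -> 2
    sigma_t = varsigma / S_t             # effective noise scale grows 6 -> 12
    rho_t = 1.0 / (2.0 * sigma_t ** 2)   # per-step zCDP spending shrinks 4x
    print(f"t={t:5d}  S={S_t:.2f}  sigma={sigma_t:5.2f}  rho={rho_t:.5f}")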

The last scenario is included for privacy analysis under two different target accuracies. For MNIST with C=4, we set the target accuracy to the accuracy of DP-dynS[l2-max] (0.9773) instead of 0.9596 from DP-baseline. We make two interesting observations. (1) DP-dynS remains the winner with a strong privacy guarantee at the smallest privacy spending ϵ, followed by DP-dynS[l2-max]. (2) To achieve the target accuracy of 0.9773, DP-baseline has to enforce a smaller σ to maintain the fixed ς, resulting in much higher privacy spending under all five privacy accounting methods. Under MA, DP-baseline results in ϵ=21.174 for the target accuracy of 0.9773, compared to ϵ=0.823 for the target accuracy of 0.9596 obtained at T=10000. Under zCDP and base composition, DP-baseline spends about 4× the privacy budget to achieve the target accuracy of 0.9773, compared to the privacy spent on achieving the target accuracy of 0.9596 at T=10000.

VI-D Dynamic Noise Scale Optimization

In this section, we evaluate the effectiveness of incorporating the dynamic noise scale into DP-baseline and into the three DP algorithms with dynamic sensitivity optimization.

BaseC AdvC OptC zCDP MA
123.354 7.450 6.740 1.159 0.823
fix σ DP-baseline 0.9596
DP-dynS[Cdecay] 0.9639
DP-dynS[l2-max] 0.9773
DP-dynS 0.9778
iteration 10000
adaptive σ (linear) DP-dynσ 0.962 0.9614 0.9608 0.9618 0.9614
DP-dyn[S[Cdecay],σ] 0.9673 0.965 0.9641 0.9653 0.9649
DP-dyn[S[l2-max],σ] 0.9788 0.978 0.9775 0.9781 0.9778
DP-dyn[S,σ] 0.9791 0.9783 0.9779 0.9783 0.9782
iteration 9646 9481 8908 9494 9450
adaptive σ (exponential) DP-dynσ 0.9616 0.9604 0.9597 0.9608 0.9603
DP-dyn[S[Cdecay],σ] 0.9681 0.9652 0.9647 0.9665 0.9665
DP-dyn[S[l2-max],σ] 0.979 0.9775 0.9775 0.9786 0.9781
DP-dyn[S,σ] 0.9803 0.9781 0.9781 0.9789 0.9787
iteration 9678 9311 8955 9553 9501
adaptive σ (cyclic) DP-dynσ 0.9601 0.9599 0.9596 0.9599 0.9599
DP-dyn[S[Cdecay],σ] 0.9652 0.9644 0.9641 0.9648 0.9645
DP-dyn[S[l2-max],σ] 0.9784 0.9779 0.9776 0.9781 0.9779
DP-dyn[S,σ] 0.9787 0.9781 0.9779 0.9785 0.9784
iteration 8019 7724 7251 7993 7876
TABLE VIII: Accuracy analysis for eight alternative DP algorithms on MNIST under three different dynamic noise scale policies (C=4, δ=1e-5). The target ϵ is from Table V. σ=6 is used for the DP algorithms with fixed σ and as the initial σ for the dynamic σ algorithms under each of the three decay policies.

Accuracy analysis under a target privacy budget ϵ. The first set of experiments evaluates and compares the eight alternative DP algorithms on MNIST under a given target privacy budget ϵ, such that each algorithm terminates when its target privacy budget is exhausted. Unless otherwise stated, δ=1e-5, C=4, and T=10000 for MNIST. Table VIII reports the results. We make four observations. (1) For DP-baseline and the three dynamic sensitivity algorithms under fixed σ=6, we measure and compare their accuracy since they have the same privacy spending when reaching T iterations, as shown in Table V. DP-dynS and DP-dynS[l2-max] are the clear winners, with DP-dynS slightly higher in accuracy. (2) By Lemma 1, the decaying noise scale leads to increased per-iteration ϵ spending in the early iterations of training. As a result, the training terminates early once the accumulated privacy loss reaches the target privacy budget under each accounting method; different privacy composition methods accumulate the heterogeneous privacy spending differently and hence result in different ending iterations. (3) By integrating the dynamic noise scale optimization, one can further improve accuracy under all three noise scale decay policies compared to the corresponding algorithms under fixed σ. DP-dyn[S,σ] is consistently the best performing algorithm under all three σ decay policies. Although DP-dynσ, with only the dynamic noise scale, slightly outperforms DP-baseline under the three decaying σ policies, the dynamic sensitivity approaches consistently outperform DP-dynσ in accuracy, showing the critical role of sensitivity in achieving high training utility for DP algorithms. (4) Empirically, the exponential noise scale decay delivers the best accuracy.
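The three decay policies can be sketched as follows. Only the policy shapes (linear, exponential, cyclic, cf. the cyclic learning rate schedules of [54]) and the initial σ=6 are fixed by our setup; the decay rates, the decay floor, and the cyclic period below are illustrative assumptions.

import math

SIGMA_0, T = 6.0, 10_000

def sigma_linear(t, k=0.3):
    # Linear decay from SIGMA_0 toward (1 - k) * SIGMA_0 at iteration T.
    return SIGMA_0 * (1.0 - k * t / T)

def sigma_exponential(t, k=0.5):
    # Exponential decay with illustrative rate k.
    return SIGMA_0 * math.exp(-k * t / T)

def sigma_cyclic(t, period=2_000, floor=0.7):
    # Cyclic: decay within each period, then reset (cf. cyclic LR [54]).
    phase = (t % period) / period
    return SIGMA_0 * (1.0 - (1.0 - floor) * phase)

for t in (0, 5_000, 9_999):
    print(t, round(sigma_linear(t), 3),
          round(sigma_exponential(t), 3), round(sigma_cyclic(t), 3))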

BaseC AdvC OptC zCDP MA iter
fix σ DP-baseline 123.354 7.450 6.740 1.159 0.823 10000
DP-dynS[Cdecay] 98.646 6.518 5.930 1.034 0.735 7996
DP-dynS[l2-max] 66.858 5.188 4.756 0.848 0.604 5419
DP-dynS 63.379 5.029 4.615 0.825 0.588 5137
linear DP-dyn[S,σ] 51.365 4.538 4.501 0.561 0.459 5532
exponential 48.027 4.017 3.962 0.537 0.453 5541
cyclic 56.776 4.703 4.667 0.683 0.572 5493
TABLE IX: Comparing privacy spending ϵ for the best dynamic parameter combo DP-dyn[S,σ] with DP-baseline and the three dynamic sensitivity optimizations under fixed noise scale. The target accuracy is set to 0.9596, the accuracy of DP-baseline on MNIST with C=4, σ=6, δ=1e-5.

Privacy analysis under a target accuracy. In the next set of experiments, we measure and compare the accumulated privacy spending under a given target accuracy to evaluate the effectiveness of the best dynamic parameter optimization, DP-dyn[S,σ], against DP-baseline and the three alternative dynamic sensitivity algorithms with fixed σ on MNIST: DP-dynS[l2-max], DP-dynS[Cdecay], and DP-dynS. The target accuracy is set to 0.9596, the accuracy of DP-baseline as provided in Table V with C=4 and σ=6. Table IX reports the results. We make two observations. (1) DP-dyn[S,σ] with exponential σ decay provides the best differential privacy guarantee, with the smallest privacy spending ϵ under all five privacy accounting methods. (2) Both DP-dynS and DP-dynS[l2-max] terminate slightly earlier than DP-dyn[S,σ] under the same target accuracy. This is likely because the large early noise introduced by the dynamic noise scale has a marginal effect on convergence, so DP-dyn[S,σ] takes a few more iterations than DP-dynS to reach the target accuracy. Nevertheless, DP-dyn[S,σ] spends the smallest privacy budget to achieve the target accuracy even with the slightly longer training, which also shows that early termination does not always guarantee smaller privacy spending: the dynamic sensitivity and dynamic noise scale together control the right amount of noise to be added for achieving the best privacy under a target accuracy.

VI-E Time Cost Evaluation

This set of experiments compares the time cost of the non-private algorithm and the five alternative approaches to differentially private deep learning on all six benchmark datasets. Table X reports the per-iteration time cost in seconds. We make three observations. (1) Incorporating the dynamic parameters of sensitivity and noise scale incurs negligible additional cost, because every differentially private deep learning algorithm must compute the l2 norm of the gradients in order to perform clipping; the difference between the DP algorithms and the non-private model reflects this per-iteration cost of computing gradient l2 norms. (2) The relative cost is smaller when the model is simpler, with a smaller number of parameters. For example, the two attribute datasets have a lower time cost than the four image datasets. (3) In addition to model complexity, the batch size also impacts this additional time cost: the relative cost is smaller when the batch size is smaller. For example, LFW uses a much smaller batch size than the other three image datasets and is relatively faster per iteration.
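The observation that the per-example norm/clipping pass dominates the DP overhead can be reproduced with a small NumPy microbenchmark; the batch size, gradient dimension, and clipping/noise settings below are illustrative stand-ins, not our experiment configuration.

import time
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 256, 10_000                              # illustrative shapes
per_example_grads = rng.standard_normal((batch, dim))

t0 = time.perf_counter()
mean_grad = per_example_grads.mean(axis=0)            # non-private step
t1 = time.perf_counter()

norms = np.linalg.norm(per_example_grads, axis=1)     # per-example l2 norms
scale = np.minimum(1.0, 4.0 / norms)                  # clip to C = 4
clipped = per_example_grads * scale[:, None]
noisy = clipped.sum(axis=0) + rng.normal(0.0, 6.0 * 4.0, size=dim)
t2 = time.perf_counter()

print(f"non-private mean: {t1 - t0:.5f}s   DP clip + noise: {t2 - t1:.5f}s")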

MNIST Fashion-MNIST CIFAR10 LFW Purchase
non-private 0.138 0.144 0.504 0.069 0.024
DP-baseline 1.63 1.68 1.8 0.149 0.125
DP-dynS[Cdecay] 1.70 1.72 1.83 0.152 0.131
DP-dynS[l2-max] 1.71 1.72 1.83 0.153 0.131
DP-dynS 1.72 1.74 1.84 0.155 0.132
DP-dyn[S,σ] 1.72 1.74 1.85 0.155 0.132
TABLE X: Per-iteration time cost in seconds

VI-F Comparison with Existing Dynamic Clipping

We have shown in Section VI-A that our proposed approach for gradient leakage resilient deep learning with dynamic parameter optimizations offers the strongest resilience against gradient leakage attacks on training data privacy, compared with both the state-of-the-art DP baseline for deep learning and the two representative adaptive clipping enhancements [56, 55]. In this set of experiments, we compare accuracy performance to show that our dynamic parameter optimization on both noise scale and sensitivity outperforms both the baseline DP approach [24] and the two adaptive clipping proposals. Table XI compares DP-dyn[S,σ] with these approaches in terms of accuracy and time cost, reporting the accuracy on the four image datasets and the time cost in seconds per iteration. We make two observations: (1) The proposed dynamic optimization on both sensitivity S and noise scale σ consistently outperforms the baseline DP as well as the two existing proposals that enhance the baseline DP with AdaClip or Quantile clipping, delivering the largest accuracy improvement. (2) Relative to the baseline DP [24], the proposed dynamic parameter optimization approach incurs the smallest additional time cost of the three, while offering the highest accuracy improvement.
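For reference, a simplified sketch of the quantile-based bound estimation [55] contrasted above: the clipping bound is updated geometrically toward a target quantile of the unclipped per-example gradient norms. The actual estimator in [55] additionally privatizes the quantile estimate, which is omitted here; the toy norm distribution, learning rate, and target quantile are assumptions.

import numpy as np

def update_clip_bound(C, unclipped_norms, target_quantile=0.5, lr=0.2):
    # Fraction of examples whose unclipped norm falls below the current bound.
    frac_below = np.mean(unclipped_norms <= C)
    # Geometric update: raise C if too few norms fit under it, lower otherwise.
    return C * np.exp(-lr * (frac_below - target_quantile))

rng = np.random.default_rng(0)
C = 4.0
for step in range(5):
    norms = np.abs(rng.normal(2.0, 0.5, size=256))   # toy gradient norms
    C = update_clip_bound(C, norms)
    print(f"step {step}: estimated clip bound C = {C:.3f}")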

MNIST Fashion-MNIST CIFAR10 LFW
DP-baseline accuracy 0.960 0.833 0.608 0.692
cost (s/iter) 1.63 1.68 1.8 0.149
Quantile clipping [55] accuracy 0.971 0.846 0.614 0.733
cost (s/iter) 2.66 2.75 2.92 0.304
Adaclip [56] accuracy 0.969 0.843 0.611 0.725
cost (s/iter) 1.81 1.84 1.94 0.185
DP-dyn[S,σ] accuracy 0.978 0.848 0.621 0.749
cost (s/iter) 1.72 1.74 1.85 0.155
TABLE XI: Comparing DP-dyn[S,σ] with adaptive clipping in terms of accuracy performance and time cost, with C=4 and σ=6.

VII Related Work

We have given an overview of related work on privacy threats and deep learning with differential privacy in Section I. In this section, we focus on the most relevant literature. The first proposal for deep learning with DP [24] has been deployed in Google TensorFlow [30]. However, one known problem of deep learning with DP is the degradation of model accuracy compared to the non-DP trained model. Several recent efforts aim at improving the accuracy of the approach proposed in [24]. The efforts most relevant to this paper include the zCDP proposal with dynamic privacy budget allocation [27], in contrast to the uniform privacy budget allocation in [24], and the adaptive clipping proposals represented by AdaClip [56] and Quantile clipping [55]. The main contribution of Yu et al. [27] is the new privacy accounting method zCDP, which can compute the privacy spending when the DP training uses approximate differential privacy under CDP, with the parameter ρ controlling the dynamic privacy budget instead of the ϵ privacy budget in [24]. In addition, [27] proposed a dynamic privacy budget allocation solution to improve over [24], which uses a fixed privacy budget in every iteration of DNN training, aiming to improve the accuracy of the trained DNN model with DP. However, [27] did not consider gradient leakage attacks and resilience, and it uses the fixed clipping bound as the approximation of the sensitivity in the same way as [24]. AdaClip estimates the clipping bound based on the coordinates of the gradient norm and adaptively adds different noise levels to different dimensions of the gradients. Quantile clipping is an alternative to AdaClip: it estimates the clipping bound during the training iterations using a quantile of the unclipped gradient norms, instead of AdaClip's coordinate-based estimation. Both approaches are costly in computing dynamic clipping bounds, since both need to compute the clipping estimation over all M layers of a DNN for every example. Also, both use the dynamic clipping bound to approximate the sensitivity while maintaining a fixed noise scale throughout the training iterations to ensure that a sufficient amount of DP noise is added according to the DP theory. Both are designed to improve the accuracy of DNN training with DP but fail to be resilient against gradient leakage. In comparison, our approach is by design both gradient leakage resilient and accuracy improving, thanks to injecting dynamically controlled DP noise. Our dynamic DP parameter optimization approach DP-dyn[S,σ] incorporates dynamic sensitivity and dynamic noise scale to support a decaying noise variance, such that as learning progresses over iterations, we inject a decreasing amount of DP-controlled noise into the gradients instead of a constant amount as in [24]. As a result, our dynamic l2-max sensitivity provides a tighter DP-controlled noise bound than the fixed Gaussian noise variance used in conventional approaches.

VIII Conclusions

We have presented a suite of algorithms with dynamic privacy parameters for gradient leakage resilient deep learning with differential privacy. We first analyzed the limitations of existing algorithms that use fixed-parameter strategies and inject constant differential privacy noise into all layers during each training iteration. We then presented a suite of DP algorithms with dynamic parameter optimizations, including dynamic sensitivity mechanisms, dynamic noise scale mechanisms, and different combinations of dynamic parameter strategies. Extensive experiments on six benchmark datasets demonstrate that the proposed differentially private deep learning with dynamic hybrid sensitivity and dynamic decaying noise scale outperforms existing state-of-the-art approaches and other dynamic parameter alternatives, with competitive accuracy performance, strong differential privacy guarantee, and high resilience against gradient privacy leakage.

Acknowledgement. The authors would like to first thank the associate editor Dr. Grigorios Loukides and the reviewers for their constructive and helpful comments. The authors acknowledge partial support by the National Science Foundation under Grants NSF 2038029, NSF 1564097, and an IBM faculty award.

References

  • [1] L. Zhu, Z. Liu, and S. Han, “Deep leakage from gradients,” in NeurIPS, 2019, pp. 14747–14756.
  • [2] B. Zhao, K. R. Mopuri, and H. Bilen, “idlg: Improved deep leakage from gradients,” arXiv preprint arXiv:2001.02610, 2020.
  • [3] Y. Aono, T. Hayashi, L. Wang, and S. Moriai, “Privacy-preserving deep learning via additively homomorphic encryption,” TIFS, vol. 13, no. 5, pp. 1333–1345, 2017.
  • [4] J. Geiping, H. Bauermeister, H. Dröge, and M. Moeller, “Inverting gradients - how easy is it to break privacy in federated learning?” in NeurIPS, 2020, pp. 16937–16947.
  • [5] W. Wei, L. Liu, M. Loper, K.-H. Chow, M. E. Gursoy, S. Truex, and Y. Wu, “A framework for evaluating client privacy leakages in federated learning,” in ESORICS.   Springer, 2020, pp. 545–566.
  • [6] J. Zhu and M. Blaschko, “R-gap: Recursive gradient attack on privacy,” arXiv preprint arXiv:2010.07733, 2020.
  • [7] H. Weng, J. Zhang, F. Xue, T. Wei, S. Ji, and Z. Zong, “Privacy leakage of real-world vertical federated learning,” arXiv preprint arXiv:2011.09290, 2020.
  • [8] Y. Liu, X. Zhu, J. Wang, and J. Xiao, “A quantitative metric for privacy leakage in federated learning,” in ICASSP.   IEEE, 2021.
  • [9] H. Yin, A. Mallya, A. Vahdat, J. M. Alvarez, J. Kautz, and P. Molchanov, “See through gradients: Image batch recovery via gradinversion,” in CVPR.   IEEE, 2021.
  • [10] J. Qian, H. Nassar, and L. K. Hansen, “Minimal conditions analysis of gradient-based reconstruction in federated learning,” arXiv preprint arXiv:2010.15718, 2020.
  • [11] M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in CCS.   ACM, 2015, pp. 1322–1333.
  • [12] C. Song, T. Ristenpart, and V. Shmatikov, “Machine learning models that remember too much,” in CCS.   ACM, 2017, pp. 587–601.
  • [13] L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov, “Exploiting unintended feature leakage in collaborative learning,” in S&P.   IEEE, 2019, pp. 691–706.
  • [14] K. Ganju, Q. Wang, W. Yang, C. A. Gunter, and N. Borisov, “Property inference attacks on fully connected neural networks using permutation invariant representations,” in CCS.   ACM, 2018, pp. 619–633.
  • [15] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” in S&P.   IEEE, 2017, pp. 3–18.
  • [16] M. Nasr, R. Shokri, and A. Houmansadr, “Comprehensive privacy analysis of deep learning: Stand-alone and federated learning under passive and active white-box inference attacks,” in S&P.   IEEE, 2019, pp. 739–753.
  • [17] B. Hui, Y. Yang, H. Yuan, P. Burlina, N. Z. Gong, and Y. Cao, “Practical blind membership inference attack via differential comparisons,” in NDSS, 2021.
  • [18] A. Salem, Y. Zhang, M. Humbert, P. Berrang, M. Fritz, and M. Backes, “Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models,” in NDSS, 2018.
  • [19] L. Song and P. Mittal, “Systematic evaluation of privacy risks of machine learning models,” in USENIX Security, 2021.
  • [20] G. Liu, J. Zhao, R. Zhang, C. Wang, and L. Liu, “Gradient-leaks: Enabling black-box membership inference attacks against machine leaning models,” TIFS, 2021.
  • [21] B. Hitaj, G. Ateniese, and F. Perez-Cruz, “Deep models under the gan: information leakage from collaborative deep learning,” in CCS.   ACM, 2017, pp. 603–618.
  • [22] Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, and H. Qi, “Beyond inferring class representatives: User-level privacy leakage from federated learning,” in INFOCOM.   IEEE, 2019, pp. 2512–2520.
  • [23] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Foundations and Trends® in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
  • [24] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in CCS.   ACM, 2016, pp. 308–318.
  • [25] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, “Learning differentially private recurrent language models,” in ICLR, 2018.
  • [26] N. Papernot, S. Song, I. Mironov, A. Raghunathan, K. Talwar, and Ú. Erlingsson, “Scalable private learning with pate,” in ICLR, 2018.
  • [27] L. Yu, L. Liu, C. Pu, M. E. Gursoy, and S. Truex, “Differentially private model publishing for deep learning,” in S&P.   IEEE, 2019, pp. 332–349.
  • [28] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, “Differentially private empirical risk minimization.” JMLR, vol. 12, no. 3, 2011.
  • [29] R. C. Geyer, T. Klein, and M. Nabi, “Differentially private federated learning: A client level perspective,” arXiv preprint arXiv:1712.07557, 2017.
  • [30] “Tensorflow privacy module,” https://github.com/tensorflow/privacy/blob/master/tensorflow_privacy/privacy/keras_models/dp_keras_model.py, (accessed Nov 2, 2020).
  • [31] “Pytorch privacy module,” https://github.com/pytorch/opacus/blob/master/opacus/privacy_engine.py, (accessed Nov 2, 2020).
  • [32] R. Bassily, A. Smith, and A. Thakurta, “Private empirical risk minimization: Efficient algorithms and tight error bounds,” in FOCS.   IEEE, 2014, pp. 464–473.
  • [33] N. Phan, X. Wu, H. Hu, and D. Dou, “Adaptive laplace mechanism: Differential privacy preservation in deep learning,” in ICDM.   IEEE, 2017, pp. 385–394.
  • [34] Z. Xu, S. Shi, A. X. Liu, J. Zhao, and L. Chen, “An adaptive and fast convergent approach to differentially private deep learning,” in INFOCOM.   IEEE, 2020, pp. 1867–1876.
  • [35] K. L. van der Veen, R. Seggers, P. Bloem, and G. Patrini, “Three tools for practical differential privacy,” arXiv preprint arXiv:1812.02890, 2018.
  • [36] L. Xiang, J. Yang, and B. Li, “Differentially-private deep learning from an optimization perspective,” in INFOCOM.   IEEE, 2019, pp. 559–567.
  • [37] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
  • [38] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, “The limitations of deep learning in adversarial settings,” in EuroS&P.   IEEE, 2016, pp. 372–387.
  • [39] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in S&P.   IEEE, 2017, pp. 39–57.
  • [40] W. Wei and L. Liu, “Robust deep learning ensemble against deception,” TDSC, 2020.
  • [41] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in TCC.   Springer, 2006, pp. 265–284.
  • [42] B. Balle, G. Barthe, and M. Gaboardi, “Privacy amplification by subsampling: Tight analyses via couplings and divergences,” in NeurIPS, 2018, pp. 6277–6287.
  • [43] C. Dwork, G. N. Rothblum, and S. Vadhan, “Boosting and differential privacy,” in FOCS.   IEEE, 2010, pp. 51–60.
  • [44] P. Kairouz, S. Oh, and P. Viswanath, “The composition theorem for differential privacy,” in ICML, 2015, pp. 1376–1385.
  • [45] I. Mironov, “Rényi differential privacy,” in CSF.   IEEE, 2017, pp. 263–275.
  • [46] Y.-X. Wang, B. Balle, and S. P. Kasiviswanathan, “Subsampled rényi differential privacy and analytical moments accountant,” in AISTATS, 2019, pp. 1226–1235.
  • [47] “Moments accountant implementation,” https://github.com/tensorflow/privacy/blob/master/tensorflow_privacy/privacy/analysis/compute_dp_sgd_privacy.py, (accessed Nov 2, 2020).
  • [48] S. Truex, N. Baracaldo, A. Anwar, T. Steinke, H. Ludwig, R. Zhang, and Y. Zhou, “A hybrid approach to privacy-preserving federated learning,” in AISec, 2019, pp. 1–11.
  • [49] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in ICML, 2013, pp. 1310–1318.
  • [50] M. Nasr, R. Shokri et al., “Improving deep learning with differential privacy using gradient encoding and denoising,” arXiv preprint arXiv:2007.11524, 2020.
  • [51] K. Nissim, S. Raskhodnikova, and A. Smith, “Smooth sensitivity and sampling in private data analysis,” in STOC.   ACM, 2007, pp. 75–84.
  • [52] M. Bun and T. Steinke, “Average-case averages: Private algorithms for smooth sensitivity and mean estimation,” in NeurIPS, 2019, pp. 181–191.
  • [53] C. Darken and J. E. Moody, “Note on learning rate schedules for stochastic optimization.” in NeurIPS, 1990, pp. 832–838.
  • [54] L. N. Smith, “Cyclical learning rates for training neural networks,” in WACV.   IEEE, 2017, pp. 464–472.
  • [55] O. Thakkar, G. Andrew, and H. B. McMahan, “Differentially private learning with adaptive clipping,” arXiv preprint arXiv:1905.03871, 2019.
  • [56] V. Pichapati, A. T. Suresh, F. X. Yu, S. J. Reddi, and S. Kumar, “Adaclip: Adaptive clipping for private sgd,” arXiv preprint arXiv:1908.07643, 2019.