
On the effect of normalization layers on Differentially Private training of deep Neural networks

Ali Davody,  David Ifeoluwa Adelani,  Thomas Kleinbauer, and Dietrich Klakow
Spoken Language Systems Group
Saarland Informatics Campus
Saarland University, Germany
{adavody|didelani|kleiba|dietrich.klakow}@lsv.uni-saarland.de
Abstract

Differentially private stochastic gradient descent (DPSGD) is a variation of stochastic gradient descent based on the Differential Privacy (DP) paradigm which can mitigate privacy threats that arise from the presence of sensitive information in training data. One major drawback of training deep neural networks with DPSGD is a reduction in the model’s accuracy. In this paper, we study the effect of normalization layers on the performance of DPSGD. We demonstrate that normalization layers have a large beneficial impact on the utility of deep neural networks with noisy parameters and should be considered essential ingredients of training with DPSGD. In particular, we propose a novel method for integrating batch normalization with DPSGD without incurring an additional privacy loss. With our method, we are able to train deeper networks and achieve a better utility-privacy trade-off.

1 Introduction

Training deep neural networks typically requires large and representative data collections to achieve high performance. However, depending on the application domain, some datasets may contain sensitive information such as medical records of patients or personal financial data. This has motivated the development of dedicated training methods that address privacy concerns (see, e.g., [30]).

Differential Privacy (DP) [6] provides a concrete, cryptography-inspired notion of privacy. In practice, DP algorithms are obtained from non-private algorithms by means of appropriate randomization [8]. Differential Privacy has been integrated into deep learning [25, 1], where privacy issues arise when the trained model permits the reconstruction of sensitive information contained in the training data. The method proposed in [1] is based on clipping the gradients and adding random noise to them in each iteration of stochastic gradient descent (SGD). Combined with a moments accountant for tracking the privacy loss, this differentially private SGD (DPSGD) technique has enabled deep neural networks to be trained under a modest privacy budget at the cost of a manageable reduction in test accuracy. However, for low privacy budgets (i.e. a small $\varepsilon$, see Section 3), which correspond to a strong privacy guarantee, the accuracy drops significantly under DPSGD.

The central role of the injected noise in DPSGD deserves a more detailed analysis. While adding a small amount of noise during training can benefit the generalization capability of the model, too much noise leads to inferior performance because the output is highly sensitive to perturbations of the parameters. Despite this sensitivity, we show that neural networks augmented with batch/layer normalization layers are remarkably robust against random noise injected into their weights.

In this work, we investigate the impact of batch normalization [11] and layer normalization [2] on the performance of training under privacy constraints. Normalization layers are indispensable components of nearly all state-of-the-art deep neural networks. They are essential for robustly training deep networks without carefully designed custom non-linearities [13] or specific initialization schemes [26], and they improve generalization by preventing overfitting when training very deep networks [31].

One important property of normalization methods is the invariance of the model to re-scaling of the weight matrix [2]. We argue that this invariance suggests robustness against noise injection and confirm this hypothesis empirically. In particular, we show that batch normalization can be integrated with DPSGD without any additional loss in privacy during the training process. We compare our proposal with the current state-of-the-art methods by conducting a series of experiments on Computer Vision and Natural Language Processing tasks. In summary, our contributions are as follows:

  • We demonstrate that normalization layers have a substantial impact on the performance of models with noisy parameters and should be considered essential ingredients in robust differentially private training.

  • We propose an efficient method for using batch normalization layers without incurring an additional privacy loss in the training procedure. To the best of our knowledge, our work is the first to apply a DP mechanism in the presence of batch normalization.

  • We establish new accuracy records for differentially private trained deep networks under DPSGD on the MNIST and CIFAR10 datasets.

The rest of the paper is organized as follows. We study the effect of noise on the performance of models in Section 2. Section 3 introduces our approach for differentially private training of deep networks with batch normalization. We compare our method with existing approaches in Section 4. Finally, in Section 5, we discuss related work on differential privacy.

2 Noise and normalization

In this section, we investigate how random Gaussian noise affects a network’s performance in the presence of normalization layers. More specifically, we sample the weights from a Gaussian distribution $\mathcal{N}(\mu,\sigma^{2})$ with learnable mean parameters $\mu$ and constant variance $\sigma^{2}$. Backpropagation is performed using the standard reparametrization trick [24]. This way of training is very similar to variational Bayesian learning of neural networks [3], where weights are represented by probability distributions rather than fixed values. Unlike the Bayesian approach, though, where the goal is to learn the true posterior distribution of the weights given the training data, here the noise is introduced via an ad-hoc distribution function.

Batch normalization [11] and layer normalization [2] were introduced to speed up deep neural network training by regularizing neuron dynamics via mean and variance statistics and by reducing the variance of the input to each node. Normalization techniques, in combination with other architectural innovations such as residual connections [10], make training of very deep networks feasible.

Both batch and layer normalization ensure zero mean and unit variance in the output of a layer, but they use different statistics. Batch normalization (BN) calculates the mean and variance across the samples in a mini-batch for each neuron independently, while layer normalization (LN) standardizes each summed input to a node using statistics computed over all hidden units of the layer.

More precisely, if we denote the weighted summed inputs to the $l$-th layer by $\bm{z}^{l}=\bm{w}^{T}\,\bm{a}^{l-1}$, where $\bm{a}^{l-1}$ is the activation in layer $l-1$ and $\bm{w}$ is the weight matrix, then the normalization operators rescale and shift $\bm{z}^{l}$ according to:

\tilde{z}_{i}^{l}=\frac{\gamma_{i}}{\sqrt{\sigma_{i}^{l,2}+\epsilon}}\left(z_{i}^{l}-\mu_{i}^{l}\right)+\beta_{i}    (1)

where $\gamma_{i}\in\mathbb{R}$ and $\beta_{i}\in\mathbb{R}$ are learnable parameters which are set to one and zero, respectively, at the beginning of training. The parameters $\sigma_{i}^{l}$ and $\mu_{i}^{l}$ are estimated as follows for batch normalization:

\mu^{\text{BN},l}_{i}=\mathbb{E}_{x\sim p(x)}[z_{i}^{l}],\qquad\sigma^{\text{BN},l,2}_{i}=\mathbb{E}_{x\sim p(x)}[(z_{i}^{l}-\mu_{i}^{l})^{2}],    (2)

and as follows for layer normalization:

\mu^{\text{LN},l}=\frac{1}{n}\sum_{i=1}^{n}z_{i}^{l},\qquad\sigma^{\text{LN},l,2}=\frac{1}{n}\sum_{i=1}^{n}(z_{i}^{l}-\mu^{\text{LN},l})^{2},    (3)

where $n$ is the number of hidden units in the layer. The expectations are estimated using the samples in the training mini-batches. In the case of batch normalization, at test time these statistics are replaced with an exponential running average of the corresponding mean and variance computed during the training phase.
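To make the two estimators concrete, the following PyTorch sketch computes the statistics of Eqs. (2) and (3) and applies the transformation of Eq. (1); it is our own illustration, and the tensor shapes and variable names are not taken from the experimental code.

```python
import torch

def batch_norm(z, gamma, beta, eps=1e-5):
    # BN: mean/variance per neuron, computed across the mini-batch (Eq. 2)
    mu = z.mean(dim=0, keepdim=True)
    var = z.var(dim=0, unbiased=False, keepdim=True)
    return gamma * (z - mu) / torch.sqrt(var + eps) + beta

def layer_norm(z, gamma, beta, eps=1e-5):
    # LN: mean/variance per sample, computed across the hidden units (Eq. 3)
    mu = z.mean(dim=1, keepdim=True)
    var = z.var(dim=1, unbiased=False, keepdim=True)
    return gamma * (z - mu) / torch.sqrt(var + eps) + beta

z = torch.randn(32, 100)                      # a mini-batch of 32 summed inputs
gamma, beta = torch.ones(100), torch.zeros(100)
print(batch_norm(z, gamma, beta).shape, layer_norm(z, gamma, beta).shape)
```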

In this work, we focus on the scale-invariance property of normalization methods. It is well known that both batch and layer normalization are invariant under scaling of the weight matrix (or matrices), $\bm{\theta}\rightarrow\lambda\,\bm{\theta}$, for arbitrary $\lambda>0$ [2]. To demonstrate the effect of this symmetry, consider a deep neural network $f(\bm{\theta}^{*},\cdot)$ which characterizes the relationship from input to output with trained parameters $\bm{\theta}^{*}$. The optimal weights $\bm{\theta}^{*}$ are usually learnt by minimizing a non-convex objective function $\mathcal{L}(\bm{\theta},\cdot)$ over the training dataset using a variant of stochastic gradient descent (SGD). The training procedure also includes tuning hyperparameters such as the learning rate and the number of epochs, which is usually done by maximizing the performance of the network on a validation dataset. The training procedure always leaves some small uncertainty of order $\delta\ll 1$ on the final values of the weights. Consequently, the performance of the model with parameters $\bm{\theta}^{*}$ and $\bm{\theta}^{*}+\mathcal{O}(\delta)$ will be essentially identical. For example, we may change the number of iterations or the learning rate very slightly without harming the performance of the network.

We can make $f(\bm{\theta}^{*},\cdot)$ invariant with respect to scaling of the weight matrix by augmenting it with batch/layer normalization operators after each learnable layer, yielding $f^{\text{BN/LN}}(\bm{\theta}^{*},\cdot)$. The re-scaling invariance implies that if the weight uncertainty in the original network is of order $\mathcal{O}(\delta)$, then in the augmented network it can be of order $\mathcal{O}(\lambda\delta)$ without affecting the overall performance. This suggests that neural networks with batch/layer normalization layers should be robust against noise in their weights. In the rest of this section, we confirm this hypothesis empirically by injecting noise into the weights of augmented networks during training and testing.
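The invariance can also be checked numerically. The following sketch, again only an illustration with arbitrary layer sizes, re-scales the weights of a linear layer by a factor $\lambda$ and verifies that the batch-normalized output is unchanged up to the small $\epsilon$ term inside the normalizer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 20)                  # a batch of inputs
linear = nn.Linear(20, 50)
bn = nn.BatchNorm1d(50, affine=False)    # plain normalization without learnable scale/shift

out = bn(linear(x))

lam = 7.3                                # arbitrary positive re-scaling of the weights
with torch.no_grad():
    linear.weight.mul_(lam)
    linear.bias.mul_(lam)
out_scaled = bn(linear(x))

# The normalized outputs agree up to the small eps term in the normalizer,
# so a perturbation that amounts to a re-scaling is invisible to the network.
print(torch.allclose(out, out_scaled, atol=1e-3))
```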

To test our hypothesis empirically, we train standard fully-connected as well as convolutional neural networks with noise injected into their weights on MNIST [16] and CIFAR-10 [14]. As described above, the weights are sampled from a Gaussian distribution $\mathcal{N}(\mu,\sigma^{2})$ with learnable mean parameters $\mu$ and constant variance $\sigma^{2}$, and backpropagation is performed using the standard reparametrization trick [24].
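A minimal sketch of such a noisy layer is shown below; the initialization and the way the noise level is fixed are illustrative assumptions and need not match our experimental code exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose weights are sampled from N(mu, sigma^2) at every forward pass."""
    def __init__(self, in_features, out_features, sigma=0.1):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.sigma = sigma                      # fixed noise level, not learned

    def forward(self, x):
        # Reparametrization trick: w = mu + sigma * eps with eps ~ N(0, 1),
        # so backpropagation reaches the learnable mean mu through the sampled weight.
        eps = torch.randn_like(self.mu)
        weight = self.mu + self.sigma * eps
        return F.linear(x, weight, self.bias)

# e.g. a noisy fully-connected block followed by batch normalization
block = nn.Sequential(NoisyLinear(784, 300, sigma=1.0), nn.BatchNorm1d(300), nn.ReLU())
out = block(torch.randn(32, 784))
```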

In particular, we thoroughly investigate LeNet-300-100 and LeNet-5 [15] as well as variants of the ResNet [10] and VGG [27] models. The structure of the models is outlined in Table 1. For each model, we construct a normalized, augmented version by adding batch or layer normalization after each trainable layer. All models are implemented in PyTorch [21] and trained with the Adam optimizer [12].

Network | Convolutions | FC layers
LeNet-5 | 6, pool, 16, pool | 120, 84, 10
ResNet-18 | 64, 2x[64, 64], 2x[128, 128], 2x[256, 256], 2x[512, 512] | avg-pool, 10
VGG | 2x64, pool, 2x128, pool, 4x256, pool | avg-pool, 512, 10
Table 1: Architecture of the deep networks that we use for the vision tasks.

Table 2 shows the accuracy of the augmented models on the MNIST test set, averaged over ten runs, compared against the unnormalized baselines. The BN/LN prefixes in this table denote models obtained by adding batch normalization or layer normalization layers to the original architectures, respectively. It is evident from this experiment that all augmented models tolerate the noise while the baselines do not. Indeed, their accuracy does not change at all, within the statistical error, over a large range of noise levels, as promised by the scale-invariance property of the networks. The baseline models, on the other hand, are very sensitive to small weight perturbations. Notably, disturbing the weights of the baseline models with a small noise of order $\sigma=0.3$ already results in non-converging training.

Table 2: MNIST test-set accuracy (% ± standard error) in the presence of noise injected into the weights, for different models. The accuracy of models with normalization layers does not change within the standard deviations. In contrast, baseline models are much more sensitive to the noise and do not converge once the noise level exceeds a threshold.
Model | σ = 0 | σ = 0.01 | σ = 0.1 | σ = 1 | σ = 2
LeNet-300-100 | 98.20 ± 0.07 | 97.70 ± 0.30 | 96.98 ± 0.12 | no convergence | no convergence
BN-LeNet-300-100 | 98.20 ± 0.10 | 98.10 ± 0.10 | 98.07 ± 0.11 | 98.07 ± 0.12 | 98.13 ± 0.08
LN-LeNet-300-100 | 98.04 ± 0.16 | 98.00 ± 0.10 | 98.08 ± 0.14 | 98.04 ± 0.09 | 98.03 ± 0.17
LeNet-5 | 99.20 ± 0.02 | 98.94 ± 0.07 | 98.40 ± 0.03 | no convergence | no convergence
BN-LeNet-5 | 99.20 ± 0.08 | 99.21 ± 0.05 | 99.18 ± 0.06 | 99.24 ± 0.04 | 99.25 ± 0.07
LN-LeNet-5 | 99.16 ± 0.08 | 99.14 ± 0.06 | 99.13 ± 0.07 | 99.21 ± 0.05 | 99.19 ± 0.07
Table 3: CIFAR-10 test-set accuracy (% ± standard error) with noisy weights for a variety of models. We see the same pattern as on the MNIST dataset: models augmented with batch normalization are very robust against noise.
Model | σ = 0 | σ = 0.01 | σ = 0.1 | σ = 1 | σ = 2
ResNet-18 | 93.50 ± 0.04 | 89.05 ± 0.60 | 89.41 ± 0.67 | 86.76 ± 1.30 | 87.83 ± 0.87
ResNet-18-mod | 93.60 ± 0.18 | 88.65 ± 1.22 | 88.16 ± 1.22 | 85.46 ± 1.21 | 83.95 ± 2.51
BN-ResNet-18 | 93.55 ± 0.04 | 93.59 ± 0.18 | 93.46 ± 0.24 | 92.72 ± 0.26 | 92.10 ± 0.28
VGG-16 | 91.21 ± 0.16 | 88.89 ± 0.56 | 84.59 ± 6.11 | 63.12 ± 8.21 | no convergence
VGG-16-mod | 91.18 ± 0.20 | 88.87 ± 0.59 | 82.58 ± 6.91 | 22.58 ± 5.0 | no convergence
BN-VGG-16 | 91.75 ± 0.22 | 91.69 ± 0.20 | 91.56 ± 0.24 | 90.95 ± 0.25 | 90.35 ± 0.14

Experiments on CIFAR-10 with ResNet and VGG networks show similar trends (see Table 3). Unlike the LeNet models, the original ResNet and VGG networks already contain BN layers after all but the last trainable layer. This provides some degree of protection against noise, as demonstrated in Table 3. For example, the accuracy of ResNet-18 at a noise level of 1 drops only to about 87% rather than to the level of random prediction.

To illustrate the role of normalization layers further, we also present results on ResNet-18-mod and VGG-16-mod which are obtained by removing the last normalization layer from the original architectures. As shown by the performance degradation with increasing noise levels in Table 3, these models are more vulnerable to the added noise.

Next, we extended our experiments to a more complex task, natural language text classification, where we observe a similar effect. For this, we trained a BiLSTM model with one linear/dense layer (DL), with and without layer normalization, on the AG News Corpus (http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html), a popular text classification dataset with four categories of news: World, Sports, Business and Sci/Tech. Each class has 30,000 training examples and 1,900 test examples, so in total the dataset consists of 120,000 training examples and 7,600 test examples. We further split the training examples into a training set of 96,000 examples and a validation set of 24,000 examples (used for early stopping and for tuning the adaptive learning rate). We trained the models with different noise levels $\sigma$ for 25 epochs. We find that the higher the noise level, the more epochs are needed to maintain the accuracy of the baseline model. Our results on the language data in Table 4 suggest that layer normalization makes the BiLSTM model robust to noise, with minimal drops in accuracy (1-7%). Our findings lead us to conclude that LSTM and CNN architectures are robust to noise when equipped with layer and batch normalization, respectively.
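For reference, the following sketch shows one possible PyTorch realization of such an LN-BiLSTM-DL classifier; the embedding dimension, hidden size, and pooling strategy are placeholders, as the text does not fix these details.

```python
import torch
import torch.nn as nn

class LNBiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.layer_norm = nn.LayerNorm(2 * hidden_dim)   # omit for the plain BiLSTM-DL baseline
        self.dense = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, tokens):
        embedded = self.embedding(tokens)                # (batch, seq_len, embed_dim)
        outputs, _ = self.bilstm(embedded)               # (batch, seq_len, 2 * hidden_dim)
        pooled = outputs.mean(dim=1)                     # average over time steps
        return self.dense(self.layer_norm(pooled))

model = LNBiLSTMClassifier(vocab_size=50000)
logits = model(torch.randint(0, 50000, (8, 40)))         # a batch of 8 token sequences
```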

It is worth mentioning that achieving the same accuracy in the presence of noise does not come for free, as it affects the training time: the larger the noise, the slower the training. Figure 1 illustrates the evolution of the models’ accuracy on the validation set for different levels of noise. As is evident from these plots, models with different values of noise converge to the same accuracy, albeit at different rates. For example, increasing the noise level from 1 to 10 slows down training by a factor of about 7 for the LeNet and ResNet models. More details can be found in Appendix 2.

Figure 1: Evolution of validation accuracy during training on (a) MNIST and (b) CIFAR-10. A large value of noise slows down the training but does not drastically affect the final performance.
Table 4: AG News Corpus test accuracy with noisy weights for the BiLSTM models. We see the same pattern as on the vision datasets: the model augmented with layer normalization is very robust against noise.
Model | σ = 0 | σ = 0.01 | σ = 0.1 | σ = 1 | σ = 2
BiLSTM-DL | 89.34% | 89.57% | 89.01% | 66.32% | 24.76%
LN-BiLSTM-DL | 89.34% | 88.87% | 88.62% | 85.74% | 82.41%

3 DPSGD and Normalization

Differential privacy (DP) is a systematic approach to quantifying the privacy guarantee of a mechanism that queries a dataset, i.e., to quantifying the leakage of information caused by such queries. DP protects privacy by bounding the influence of any single sample on the outcome of the queries, and it thereby provides a provable guarantee for individuals. We proceed by briefly recalling some preliminaries on differential privacy and then propose our approach for training deep neural networks in a differentially private way.

Let us denote the domain of data points by $\chi$. We call two datasets $D_{1},D_{2}\in\chi$ neighboring if they are identical except for one data point, i.e., $d(D_{1},D_{2})=1$, where $d(\cdot,\cdot)$ is the Hamming distance.

Definition 3.1.

(Differential Privacy [7]). A randomized algorithm $\mathcal{M}:\chi\rightarrow\mathcal{R}$ with domain $\chi$ and range $\mathcal{R}$ is $(\varepsilon,\delta)$-differentially private if for all measurable sets $S\subseteq\mathcal{R}$ and for all neighboring datasets $D_{1}$ and $D_{2}$, it holds that

\Pr[\mathcal{M}(D_{1})\in S]\leq\exp(\varepsilon)\,\Pr[\mathcal{M}(D_{2})\in S]+\delta.    (4)

Intuitively, an $(\varepsilon,\delta)$-differentially private mechanism guarantees that the absolute value of the privacy leakage is bounded by $\varepsilon$ with probability at least $1-\delta$ for adjacent datasets. The higher the value of $\varepsilon$, the higher the chance of data re-identification and thus of information leakage.

A standard approach for achieving differential privacy is to add random noise $r$ to the output of a query, $q(D)+r$, and to calibrate the noise $r$ to the sensitivity of the query. The $L_{p}$ sensitivity is defined as the maximum change in the outcome of a query between two neighboring datasets, and it measures the maximum influence a single data point can have on the result of the query:

S_{p}=\max_{d(D_{1},D_{2})=1}\|q(D_{2})-q(D_{1})\|_{p}.    (5)

The special case where $r$ is calibrated with the $S_{2}$ sensitivity and sampled from the normal distribution is of particular importance and is termed the Gaussian mechanism:

G(D):=q(D)+\mathcal{N}(0,S_{2}^{2}\sigma^{2}).    (6)

Here $\mathcal{N}(0,S_{2}^{2}\sigma^{2})$ denotes the normal distribution with mean zero and standard deviation $S_{2}\sigma$. It can be shown that this mechanism satisfies $(\varepsilon,\delta)$ differential privacy provided that $\sigma\geq\sqrt{2\ln(1.25/\delta)}/\varepsilon$ [8].
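As an illustration, the following sketch releases the mean of a bounded feature with the Gaussian mechanism; the clipping bound and the data are invented for the example.

```python
import math
import torch

def gaussian_mechanism_mean(data, bound, epsilon, delta):
    """Release the mean of values clipped to [-bound, bound] under (epsilon, delta)-DP."""
    clipped = torch.clamp(data, -bound, bound)
    # Replacing one record changes the clipped mean by at most 2*bound/n (L2 sensitivity, Eq. 5).
    sensitivity = 2 * bound / len(data)
    # Noise scale from the bound sigma >= sqrt(2 ln(1.25/delta)) / epsilon (Eq. 6).
    sigma = math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    noise = sensitivity * sigma * torch.randn(1)
    return (clipped.mean() + noise).item()

print(gaussian_mechanism_mean(torch.randn(1000), bound=1.0, epsilon=1.0, delta=1e-5))
```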

Differential privacy has been integrated into deep learning in [25] and subsequently in [1] for the setting where an adversary has access to the network architecture and the learned weights, $f(\bm{\theta}^{*},\cdot)$. In particular, the method in [1] preserves privacy by adding noise to the SGD updates:

\bm{\theta}_{t+1}\leftarrow\bm{\theta}_{t}-\eta\,\mathbf{g}_{t}+\frac{\eta}{L}\,r,    (7)

where $\mathbf{g}_{t}$ is the averaged gradient, $\eta$ is the learning rate, and $r$ is sampled from the Gaussian distribution $\mathcal{N}(0,\sigma^{2})$. To control the influence of individual training samples on the parameters, the gradients are clipped in the $L_{2}$-norm:

\pi(g_{i})=g_{i}\cdot\min(1,C/\|g_{i}\|_{2}),    (8)

where $g_{i}$ is the gradient corresponding to the $i$-th sample and $C$ is the clipping factor. It has been shown in [1] that each step of DPSGD is $(\varepsilon,\delta)$-differentially private once the noise is tuned as $\sigma=C\,z$ with $z=\sqrt{2\ln(1.25/\delta)}/\varepsilon$.
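The following sketch implements one such DPSGD step for a generic PyTorch model, with per-example clipping (Eq. 8) and Gaussian noise (Eq. 7); it is meant to illustrate the mechanism, not to reproduce the exact implementation used in our experiments.

```python
import torch

def dpsgd_step(model, loss_fn, batch_x, batch_y, lr=0.05, C=1.0, z=1.1):
    """One DPSGD update: per-example gradient clipping followed by Gaussian noise."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(batch_x, batch_y):                        # loop over the lot
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        # Clip the full per-example gradient to L2 norm at most C (Eq. 8).
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (C / (norm + 1e-12)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)

    L = len(batch_x)
    with torch.no_grad():
        for p, s in zip(params, summed):
            # Average the clipped gradients, add noise with std C*z, and take a step (Eq. 7).
            noisy = (s + torch.randn_like(s) * C * z) / L
            p.add_(noisy, alpha=-lr)
```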

It is known that batch normalization is not directly compatible with DP training. Indeed, in a non-private setting, one usually keeps track of running averages of the mean and variance statistics (Eq. 2) during the training procedure and reuses this collected information at test time to normalize the inputs to the neurons. More specifically, the update rule for the running averages at each iteration is as follows:

\bm{\mu}^{\text{Batch}}\leftarrow(1-\alpha)\,\bm{\mu}^{\text{Batch}}+\alpha\,\bm{\mu}^{\text{Batch}}_{t},    (9)
\bm{\sigma}^{2,\text{Batch}}\leftarrow(1-\alpha)\,\bm{\sigma}^{2,\text{Batch}}+\alpha\,\bm{\sigma}^{2,\text{Batch}}_{t},    (10)

where $(\bm{\mu}^{\text{Batch}},\bm{\sigma}^{2,\text{Batch}})$ are the estimated mean and variance statistics, $(\bm{\mu}^{\text{Batch}}_{t},\bm{\sigma}^{2,\text{Batch}}_{t})$ are the values newly observed at iteration $t$ of training according to Eq. 2, and $\alpha$ is the momentum of the moving averages.
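In code, this update amounts to the following few lines (the momentum value and shapes are arbitrary examples); note that the running statistics mix in every training batch seen so far.

```python
import torch

alpha = 0.1                                    # momentum of the moving averages
running_mean = torch.zeros(100)
running_var = torch.ones(100)

z = torch.randn(32, 100)                       # summed inputs of one training batch
batch_mean = z.mean(dim=0)
batch_var = z.var(dim=0, unbiased=False)

# Eqs. (9)-(10): the running statistics accumulate information
# about every training batch, and hence about the whole training set.
running_mean = (1 - alpha) * running_mean + alpha * batch_mean
running_var = (1 - alpha) * running_var + alpha * batch_var
```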

Since these running averages are also part of the released model, in a private training we would have to add noise to these statistics at each iteration as well, and distribute the privacy budget between the weights and the moving averages to make the overall procedure differentially private. Additionally, we would need to truncate the summed inputs of the neurons to bound the sensitivity of the means and variances, which are given by [28]:

S_{\mu}=\frac{C^{\prime}}{L},
S_{\sigma^{2}}=C^{\prime 2}\left(\frac{3}{L}-\frac{3}{L^{2}}+\frac{1}{L^{3}}\right),

where $C^{\prime}$ is the clipping threshold for the neuron activations and $L$ is the batch size. Empirically, we have found that if we tune the noise to this worst-case scenario according to the above sensitivities, the performance of the model drops drastically. Therefore, we employ a more sophisticated approach to handling batch normalization, as shown in Algorithm 1.

First of all, we do not track the running averages during the training phase; instead, fresh statistics computed from the current batch are used to normalize the neurons. To be able to apply batch normalization at test time, we concatenate a fixed set $\hat{X}$ of $M$ data points, taken from a public dataset disjoint from the training data, to the input of the network, both in the training and in the test phase. These samples contribute only to the statistics, not to the cost function directly.

  Training phase
  Input: dataset $\mathcal{D}=\{(x_{1},y_{1}),\cdots\}$ of size $N$, a public dataset $\hat{X}=\{\hat{x}_{1},\hat{x}_{2},\cdots,\hat{x}_{M}\}$, loss function $\mathcal{L}(\bm{\theta},\cdot)$, learning rate $\eta_{t}$, noise multiplier $z$, lot size $L$, gradient norm bound $C$, and number of iterations $T$.
  for $t=0$ to $T-1$ do
     • Take a random lot $X\sim\mathcal{D}$ of size $L$ with selection probability $\frac{L}{N}$.
     • Concatenate the public data to the lot: $X\leftarrow X\cup\hat{X}$.
     • Compute the loss for the first $L$ elements: $\mathcal{L}(\bm{\theta}_{t},x_{i})=\mathcal{L}(\bm{\theta}_{t},X)[i]$.
     • Compute the per-example gradients: $\mathbf{g}_{t}(x_{i})\leftarrow\nabla_{\bm{\theta}_{t}}\mathcal{L}(\bm{\theta}_{t},x_{i})$.
     • Clip gradients: $\mathbf{g}_{t}(x_{i})\leftarrow\mathbf{g}_{t}(x_{i})\cdot\min(1,C/\|\mathbf{g}_{t}(x_{i})\|_{2})$.
     • Add noise: $\mathbf{g}_{t}\leftarrow\frac{1}{L}\big(\sum_{i}\mathbf{g}_{t}(x_{i})+\mathcal{N}(0,C^{2}z^{2})\big)$.
     • Update parameters: $\bm{\theta}_{t+1}\leftarrow\bm{\theta}_{t}-\eta_{t}\,\mathbf{g}_{t}$.
  end for
  Test phase
  Input: test dataset $\mathcal{D}_{\text{test}}=\{(x_{1},y_{1}),\cdots\}$ of size $N_{\text{test}}$, the public dataset $\hat{X}$, trained model $f(\bm{\theta}_{T},\cdot)$.
  • Initialize $Y\leftarrow\emptyset$.
  for $i=0$ to $N_{\text{test}}-1$ do
     • Concatenate the public dataset to the data point $x_{i}$: $X\leftarrow x_{i}\cup\hat{X}$.
     • Compute the network output for $x_{i}$: $y_{i}=f(\bm{\theta}_{T},X)[0]$.
     • Append $y_{i}$ to the results: $Y\leftarrow Y\cup\{y_{i}\}$.
  end for
  • Return the outputs $Y$.
Algorithm 1: DPSGD with batch normalization (training and test phases).

Therefore, in the training phase, the cost is computed via $\mathcal{L}(\bm{\theta}_{t},X\cup\hat{X})[:L]$, where $X$ is a batch of size $L$ from the training data and $[:L]$ denotes the slice of the first $L$ elements. At test time, when iterating over the dataset, the same public data points $\hat{X}$ are also concatenated to each test sample $x$, and the output of the network is computed as $f(\bm{\theta}_{T},x\cup\hat{X})[0]$. This leads to a privacy-preserving batch normalization and allows us to compute the normalization statistics over a batch of size $1+M$ without any reference to the training data.
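In PyTorch terms, the training- and test-phase computations amount to something like the following sketch; the function names and the use of the cross-entropy loss are our own simplifications.

```python
import torch
import torch.nn.functional as F

def private_bn_loss(model, batch_x, batch_y, public_x):
    """Training phase: public points shape the BN statistics but not the loss."""
    out = model(torch.cat([batch_x, public_x], dim=0))   # BN statistics over L + M samples
    L = batch_x.shape[0]
    return F.cross_entropy(out[:L], batch_y)             # loss over the first L outputs only

def private_bn_predict(model, x, public_x):
    """Test phase: a single test point padded with the same public data."""
    model.train()                                        # keep using batch statistics
    with torch.no_grad():
        out = model(torch.cat([x.unsqueeze(0), public_x], dim=0))
    return out[0]                                        # output for the test point itself
```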

In the next section, we present the impact of normalization layers on DP training using two image recognition benchmarks, MNIST and CIFAR-10, as well as a text classification task in natural language processing using the AG News Corpus.

4 Experiments

In this section, we report some results of applying our method and compare them with existing DP mechanisms. The purpose of these experiments is two-fold: we show that (1) similar to non-private training (section 2), normalization layers improve the performance of models trained with DPSGD; and (2) that training very deep networks is feasible with our private version of batch normalization.

All models, as well as DPSGD itself, have been implemented in PyTorch [20]. To track the privacy loss over the whole training procedure, we employ the Rényi DP technique [17], which provides a tighter bound on the privacy loss than the strong composition theorem [9]. We use the open-source implementation of the Rényi DP accountant from the TensorFlow Privacy package [29] (https://github.com/tensorflow/privacy). The total privacy loss $\varepsilon$ is computed as a function of the noise multiplier $z$, the dataset size $N$, the lot size $L$, the number of iterations $T$, and $\delta$.
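For illustration, the privacy loss can be obtained from the RDP accountant along the following lines; the module path and function signatures correspond to the TensorFlow Privacy layout at the time of writing and may change in later releases.

```python
from tensorflow_privacy.privacy.analysis.rdp_accountant import compute_rdp, get_privacy_spent

def compute_epsilon(N, L, T, z, delta=1e-5):
    """Total epsilon after T DPSGD iterations with sampling rate L/N and noise multiplier z."""
    orders = [1 + x / 10.0 for x in range(1, 100)] + list(range(12, 64))
    rdp = compute_rdp(q=L / N, noise_multiplier=z, steps=T, orders=orders)
    eps, _, _ = get_privacy_spent(orders, rdp, target_delta=delta)
    return eps

# e.g. an MNIST-sized run: 60,000 examples, lot size 256, 60 epochs
print(compute_epsilon(N=60000, L=256, T=60 * (60000 // 256), z=1.1))
```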

Table 5 reports the accuracy of the LeNet-5 model on the MNIST test set for privacy budgets $\varepsilon$ ranging from high to very low. We trained LeNet-5 with and without normalization layers using DPSGD. For training the model augmented with batch normalization, we employed 128 images of KMNIST [5] as the public dataset. The privacy parameter $\delta$ is set to $10^{-5}$ in all our experiments.

The results illustrate that the use of normalization layers consistently improves the performance of DPSGD for all finite values of the privacy loss. Further, we observe that the effect of batch normalization is greater than that of layer normalization. Remarkably, we gain around 7% and 10% in accuracy at the very low privacy budgets of $\varepsilon=0.1$ and $\varepsilon=0.05$, respectively, with our private batch normalization technique.

We now turn our attention to the CIFAR-10 dataset. Table 6 summarizes the results of DPSGD on the TensorFlow tutorial model considered in [1]. We follow the same experimental setting as [1], i.e., we fine-tune the linear layers of a model pretrained on the CIFAR-100 dataset. For training the model with batch normalization, we also use 128 images of CIFAR-100, which contains completely different images and classes from CIFAR-10, as our public dataset.

As Table 6 illustrates, using batch normalization results in better accuracy than both layer normalization and the plain TensorFlow tutorial model. We also show the results of training a lightweight VGG model in this table (see Appendix A for the details of the architecture). The non-private accuracy of this model is comparable to that of the TF-tutorial model, but it exhibits a much smaller accuracy gap for finite privacy budgets. It should be noted that it is not feasible to train such models without batch normalization, as the training is extremely unstable. Our privacy-friendly batch normalization technique therefore allows training much deeper and more complex networks, and establishes new scores for the performance of differentially private models.

Next, we extended our experiments to the natural language text classification task on the AG News Corpus described in Section 2, training the BiLSTM model with one linear/dense layer (DL), with and without layer normalization (LN), under DPSGD. Table 7 shows the results of this experiment. As before, layer normalization has a large impact on the performance of the model.

Table 5: Test accuracy of three DP training methods on MNIST for different privacy losses ($\delta=10^{-5}$). Best results per budget in the last row.
DP Algorithm | ε = ∞ | ε = 7 | ε = 3 | ε = 1 | ε = 0.5 | ε = 0.1 | ε = 0.05
DPSGD (LeNet-5) | 99.20% | 97.01% | 96.34% | 94.11% | 91.10% | 83.00% | 78.96%
DPSGD (LN-LeNet-5) | 99.20% | 97.35% | 97.05% | 96.68% | 94.81% | 87.45% | 75.76%
DPSGD (BN-LeNet-5) | 99.20% | 98.68% | 98.18% | 97.61% | 96.83% | 90.68% | 88.15%
Table 6: CIFAR-10 test-set accuracy with $\delta=10^{-5}$.
DP Algorithm | ε = ∞ | ε = 8 | ε = 4 | ε = 2
DPSGD (TF-tutorial) [1] | 80.0% | 73.0% | 70.0% | 67.0%
DPSGD (LN-TF-tutorial) | 80.0% | 73.3% | 70.6% | 67.0%
DPSGD (BN-TF-tutorial) | 80.0% | 74.1% | 71.2% | 69.8%
DPSGD (BN-VGG) | 80.7% | 79.5% | 79.1% | 77.4%
Table 7: Test accuracy of differentially private training methods on the AG News Corpus for different privacy losses ($\delta=10^{-5}$).
DP Algorithm | ε = ∞ | ε = 7 | ε = 3 | ε = 1 | ε = 0.5 | ε = 0.1
DPSGD (BiLSTM-DL) | 88.47% | 83.86% | 80.00% | 81.14% | 77.88% | 37.49%
DPSGD (LN-BiLSTM-DL) | 88.18% | 84.34% | 82.51% | 82.03% | 79.16% | 50.09%

5 Related Work

A number of different methods have been developed to preserve privacy in machine learning models. [25] proposed a distributed multiparty learning mechanism that trains a network without sharing the input datasets; however, the resulting privacy guarantee was very loose. [1] developed an efficient differentially private SGD for training networks with a large number of parameters, and tighter bounds provided by Rényi differential privacy [18] can be employed in conjunction with DPSGD. A method for adding less noise to the weights of neural networks by adaptively clipping the gradients is proposed in [23]. [19] shows that learning with DPSGD requires careful choices of model architectures and initializations. Perturbing the objective function as an alternative way to protect privacy has been suggested in [22]. A model based on a local differential privacy mechanism is proposed in [4] to train deep convolutional networks in such a way that a data owner can add a randomization layer before the data leave the owner’s device. A comprehensive overview of work in this area can be found in [32].

6 Conclusion

In this paper, we proposed a novel method for integrating batch normalization with differentially private stochastic gradient descent. Our method makes training very deep neural networks, such as ResNet and VGG, feasible under very strong privacy guarantees. We have also demonstrated that normalization layers are essential ingredients of robust private training.

7 Acknowledgments

We would like to thank Marius Mosbach and Xiaoyu Shen for proof-reading and valuable comments. The presented research has been funded by the European Union’s Horizon 2020 research and innovation programme project COMPRISE (http://www.compriseh2020.eu/) under grant agreement No. 3081705.

References

  • [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.
  • [2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • [3] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
  • [4] M. Chamikara, P. Bertok, I. Khalil, D. Liu, and S. Camtepe. Local differential privacy for deep learning. arXiv preprint arXiv:1908.02997, 2019.
  • [5] T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha. Deep learning for classical japanese literature. CoRR, abs/1812.01718, 2018.
  • [6] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486–503. Springer, 2006.
  • [7] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
  • [8] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
  • [9] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 51–60. IEEE, 2010.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • [13] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. In Advances in neural information processing systems, pages 971–980, 2017.
  • [14] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, CIFAR, 2009.
  • [15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
  • [16] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [17] I. Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.
  • [18] I. Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275, Aug 2017.
  • [19] N. Papernot, S. Chien, S. Song, A. Thakurta, and U. Erlingsson. Making the shoe fit: Architectures, initializations, and tuning for learning with privacy, 2020.
  • [20] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. nips, 2017.
  • [21] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • [22] N. Phan, Y. Wang, X. Wu, and D. Dou. Differential privacy preservation for deep auto-encoders: an application of human behavior prediction. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [23] V. Pichapati, A. T. Suresh, F. X. Yu, S. J. Reddi, and S. Kumar. Adaclip: Adaptive clipping for private sgd, 2019.
  • [24] J. Schulman, N. Heess, T. Weber, and P. Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3528–3536, 2015.
  • [25] R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 1310–1321. ACM, 2015.
  • [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.
  • [28] M. Swanberg, I. Globus-Harris, I. Griffith, A. Ritz, A. Groce, and A. Bray. Improved differentially private analysis of variance. arXiv preprint arXiv:1903.00534, 2019.
  • [29] https://github.com/tensorflow/privacy. Accessed: 2020-January.
  • [30] Y. Yang, L. Wu, G. Yin, L. Li, and H. Zhao. A survey on security and privacy issues in internet-of-things. IEEE Internet of Things Journal, 4(5):1250–1258, 2017.
  • [31] H. Zhang, Y. N. Dauphin, and T. Ma. Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321, 2019.
  • [32] T. Zhu, G. Li, W. Zhou, and S. Y. Philip. Differential privacy and applications, volume 69. Springer, 2017.