
Deep Learning with Data Privacy via Residual Perturbation

Wenqi Tao*, Huaming Ling*, Zuoqiang Shi, and Bao Wang

B. Wang is with the Department of Mathematics and Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT 84112, USA. E-mail: [email protected]. W. Tao, H. Ling, and Z. Shi are with the Department of Mathematics, Tsinghua University, Beijing 100084. E-mail: [email protected], [email protected], [email protected]. Wenqi Tao and Huaming Ling contributed equally.
Abstract

Protecting data privacy in deep learning (DL) is of crucial importance. Several celebrated privacy notions have been established and used for privacy-preserving DL. However, many existing mechanisms achieve privacy at the cost of significant utility degradation and computational overhead. In this paper, we propose a stochastic differential equation-based residual perturbation for privacy-preserving DL, which injects Gaussian noise into each residual mapping of ResNets. Theoretically, we prove that residual perturbation guarantees differential privacy (DP) and reduces the generalization gap of DL. Empirically, we show that residual perturbation is computationally efficient and outperforms the state-of-the-art differentially private stochastic gradient descent (DPSGD) in utility maintenance without sacrificing membership privacy.

Index Terms:
deep learning, data privacy, stochastic differential equations

I Introduction

Many high-capacity deep neural networks (DNNs) are trained with private data, including medical images and financial transaction data [1, 2, 3]. DNNs usually overfit and can memorize the private training data, which leaves trained DNNs exposed to privacy leakage [4, 5, 6, 7, 8]. Given a pre-trained DNN, the membership inference attack can determine whether an instance is in the training set based on the DNN's response [9, 5, 6]; the model extraction attack can learn a surrogate model that matches the target model, given only black-box access to the target model [10, 11]; the model inversion attack can infer certain features of a given input from the output of a target model [12, 13]; and the attribute inference attack can deanonymize anonymized data [14, 15].

Machine learning (ML) with data privacy is crucial in many applications [16, 17, 18, 19]. Several algorithms have been developed to reduce privacy leakage, including differential privacy (DP) [20], federated learning (FL) [21, 22], and $k$-anonymity [23, 24]. Objective, output, and gradient perturbations are popular approaches for ML with a DP guarantee [25, 26, 27, 28, 29]. FL trains centralized ML models through gradient exchange, with the training data distributed on edge devices; however, the gradient exchange can still leak privacy [30, 31]. Most existing mechanisms achieve privacy at a tremendous sacrifice of utility. Moreover, training ML models with the state-of-the-art DP stochastic gradient descent (DPSGD) incurs a tremendous computational cost due to the requirement of computing and clipping the per-sample gradient [32]. It remains of great interest to develop new privacy-preserving ML algorithms that avoid excessive computational overhead and do not degrade the ML models' utility.

I-A Our Contribution

We propose residual perturbation for privacy-preserving deep learning (DL). At the core of residual perturbation is injecting Gaussian noise into each residual mapping [33]; the method is theoretically motivated by stochastic differential equation (SDE) theory. The major advantages of residual perturbation are summarized below:

  • It guarantees DP and can protect the membership privacy of the training data almost perfectly without sacrificing ResNets' utility; it can even improve classification accuracy.

  • It has fewer hyperparameters to tune than DPSGD. Moreover, it is computationally much more efficient than DPSGD; the latter is computationally prohibitive due to the requirement of computing the per-sample gradient.

I-B Related Work

Improving the utility of ML models with DP guarantees is an important task. PATE [34, 35] uses semi-supervised learning accompanied by model transfer between the “student” and “teacher” models to enhance utility. Several variants of the DP notions have also been proposed to improve the theoretical privacy budget and sometimes can also improve the resulting model’s utility  [28, 36, 37, 38]. Some post-processing techniques have also been developed to improve the utility of ML models with negligible computational overhead [39, 40].

The tradeoff between privacy and utility of ML models has been studied in several different contexts. In [41], the authors identify the factors that impact the optimal tradeoff of accuracy and privacy of differentially private DL models by empirically evaluating the commonly used privacy-preserving ML libraries, including PyTorch Opacus and Tensorflow privacy. The authors of [42] investigate whether DPSGD offers better privacy in practice than what is guaranteed by its state-of-the-art theoretical privacy analysis — by taking a quantitative, empirical approach to auditing the privacy afforded by specific implementations of differentially private algorithms. In particular, they estimate an empirical DP lower bound under the data poisoning and membership inference attacks. They further compare this empirical lower bound with the DP budget obtained by the state-of-the-art analytic tools [28, 38, 43, 44]. The approach presented in [42] can demonstrate if a given algorithm with a given DP budget is sufficiently private and provides empirical evidence to guide the development of theoretical DP analysis. The privacy-utility tradeoff for different privacy measures in both global and local privacy has been studied in [45], aiming to determine the minimal privacy loss at a fixed utility expense. The paper [46] investigates a DP ensemble classifier approach that seeks to preserve data privacy while maintaining an acceptable level of utility. For the linear regression model, the privacy-utility tradeoff has been studied by enforcing the privacy of the input data via random projection and additive noise [47].

Gaussian noise injection in residual learning has been used to improve the robustness of ResNets [48, 49, 50]. In this paper, we inject Gaussian noise into each residual mapping to achieve data privacy, and even improved generalization, rather than adversarial robustness. Several related studies in the literature focus on disentangling adversarial robustness and generalization. Early hypotheses state that there exists an inherent tradeoff between robustness and accuracy, i.e., an ML model cannot be both more robust and more accurate than others; these hypotheses rest on comparisons of models with different architectures or training strategies [51] and on theoretical arguments for a toy dataset [52]. In [53], the authors disentangle adversarial robustness and generalization by leveraging the low-dimensional manifold assumption of the data. In particular, through a detailed analysis of regular and on-manifold [54, 55] adversarial examples of the same type as the given class (adversarial examples that leave or stay in the manifold of the given class of data, respectively), the authors of [53] conclude that ML models can be both robust and accurate. In particular, on-manifold robustness is essentially the generalization of the ML model, which can be improved via on-manifold adversarial training. In [56], the authors precisely characterize the effects of data augmentation on the standard error (corresponding to the natural accuracy of the model) in linear regression when the optimal linear predictor has zero standard and robust error (corresponding to the model's robust accuracy). Moreover, it is shown in [56] that robust self-training [57, 58, 59] can improve the linear regression model's robustness without sacrificing its standard error by leveraging extra unlabelled data. Our ensemble of noise-injected neural networks provides another feasible avenue toward ML models that are both robust and accurate.

I-C Organization

We organize this paper as follows: In Sec. II, we introduce the residual perturbation for privacy-preserving DL. In Sec. III, we present the generalization and DP guarantees for residual perturbation. In Sec. IV, we numerically verify the efficiency of residual perturbation in protecting data privacy without degrading the underlying models' utility. Technical proofs and additional experimental details and results are provided in Secs. V and VI. We conclude with final remarks.

I-D Notations

We denote scalars by lower- or upper-case letters, and vectors and matrices by lower- and upper-case boldface letters, respectively. For a vector ${\bm{x}}=(x_{1},\ldots,x_{d})\in\mathbb{R}^{d}$, we use $\|{\bm{x}}\|_{2}=(\sum_{i=1}^{d}|x_{i}|^{2})^{1/2}$ to denote its $\ell_{2}$ norm. For a matrix ${\bm{A}}$, we use $\|{\bm{A}}\|_{2}$ to denote its norm induced by the vector $\ell_{2}$ norm. We denote the standard Gaussian in ${\mathbb{R}}^{d}$ as $\mathcal{N}(\mathbf{0},{\bm{I}})$ with ${\bm{I}}\in{\mathbb{R}}^{d\times d}$ being the identity matrix. The set of real numbers is denoted as ${\mathbb{R}}$ (and the positive reals as ${\mathbb{R}}^{+}$). We use $B(\mathbf{0},R)$ to denote the ball centered at $\mathbf{0}$ with radius $R$.

II Algorithms

II-A Deep Residual Learning and Its Continuous Analogue

Let the training set be $S_{N}:=\{{\bm{x}}_{i},y_{i}\}_{i=1}^{N}$, where $\{{\bm{x}}_{i},y_{i}\}\subset{\mathbb{R}}^{d}\times{\mathbb{R}}$ is a data-label pair. For a given ${\bm{x}}_{i}:={\bm{x}}^{0}$, the forward propagation of a ResNet with $M$ residual mappings can be written as

$$\begin{aligned}{\bm{x}}^{l+1}&={\bm{x}}^{l}+\hat{F}({\bm{x}}^{l},{\bm{W}}^{l}),\quad\mbox{for}\ l=0,\ldots,M-1,\\ \hat{y}_{i}&=f({\bm{x}}^{M}),\end{aligned}\qquad(1)$$

where $\hat{F}(\cdot,{\bm{W}}^{l})$ is the nonlinear mapping of the $l$-th residual mapping parameterized by ${\bm{W}}^{l}$; $f(\cdot)$ is the output activation, and $\hat{y}_{i}$ is the prediction. The heuristic continuous limit of (1) is the following ordinary differential equation (ODE)

$$d{\bm{x}}(t)=F({\bm{x}}(t),{\bm{W}}(t))dt,\quad{\bm{x}}(0)=\hat{{\bm{x}}},\qquad(2)$$

where $t$ is the time variable. ODE (2) can be reversible, and thus its ResNet counterpart might be exposed to data privacy leakage. For instance, we use an image downloaded from the Internet (Fig. 1(a)) as the initial data $\hat{{\bm{x}}}$ in (2). Then we simulate the forward propagation of ResNet by solving (2) from $t=0$ to $t=1$ using the forward Euler solver with a time step size $\Delta t=0.01$ and a given velocity field $F({\bm{x}}(t),{\bm{W}}(t))$ (see Sec. VII for the details of $F({\bm{x}}(t),{\bm{W}}(t))$), which maps the original image to its features (Fig. 1(b)). To recover the original image, we start from the features and use the backward Euler iteration, i.e., $\tilde{{\bm{x}}}(t)=\tilde{{\bm{x}}}(t+\Delta t)-\Delta t\,F(\tilde{{\bm{x}}}(t+\Delta t),t+\Delta t)$, to evolve $\tilde{{\bm{x}}}(t)$ from $t=1$ to $t=0$ with $\tilde{{\bm{x}}}(1)={\bm{x}}(1)$ being the features obtained in the forward propagation. We plot the image recovered from the features in Fig. 1(c); the original image can be almost perfectly recovered.
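For concreteness, the following minimal NumPy sketch reproduces this reversibility phenomenon with a toy velocity field standing in for the trained $F({\bm{x}}(t),{\bm{W}}(t))$; the function and dimensions are illustrative, not the exact setup of Fig. 1.

import numpy as np

def F(x, t):
    # Toy stand-in for the velocity field F(x(t), W(t)); the field used for
    # Fig. 1 is described in Sec. VII.
    return np.tanh(x) + 0.1 * t

dt, steps = 0.01, 100
x0 = np.random.randn(16)             # stand-in for the flattened image x(0)

# Forward Euler: x(t + dt) = x(t) + dt * F(x(t), t), mapping the input to features.
x = x0.copy()
for k in range(steps):
    x = x + dt * F(x, k * dt)
features = x                         # plays the role of x(1)

# Backward iteration: x~(t) = x~(t + dt) - dt * F(x~(t + dt), t + dt).
x_rec = features.copy()
for k in reversed(range(steps)):
    x_rec = x_rec - dt * F(x_rec, (k + 1) * dt)

print("reconstruction error:", np.linalg.norm(x_rec - x0))   # close to zero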

Figure 1: Illustrations of the forward and backward propagation of the training data using the 2D ODE in equation (2) and the SDE in equation (3). (a) ${\bm{x}}(0)$: the original image; (b) and (d) ${\bm{x}}(1)$: the features of the original image generated by the forward propagation using the ODE and the SDE, respectively; (c) and (e) $\tilde{{\bm{x}}}(0)$: the images recovered by reverse-engineering the features shown in (b) and (d), respectively. It is easy to break the privacy of the ODE model but much harder for the SDE model.

II-B Residual Perturbation and its SDE Analogue

In this part, we propose two SDE models to reduce the reversibility of (2); the corresponding residual perturbation analogues can protect data privacy in DL.

Strategy I

Consider the following SDE model:

$$d{\bm{x}}(t)=F({\bm{x}}(t),{\bm{W}}(t))dt+\gamma d{\bm{B}}(t),\quad\gamma>0,\qquad(3)$$

where ${\bm{B}}(t)$ is the standard Brownian motion. We simulate the forward propagation and the reverse-engineering of the input from the output by solving the SDE model (3) with $\gamma=1$, using the same $F({\bm{x}}(t),{\bm{W}}(t))$ and initial data $\hat{{\bm{x}}}$ as before. We use the following forward (4) and backward (5) Euler-Maruyama discretizations [60] of (3) for the forward and backward propagation, respectively:

$${\bm{x}}(t+\Delta t)={\bm{x}}(t)+\Delta t\,F({\bm{x}}(t),{\bm{W}}(t))+\gamma\mathcal{N}(\mathbf{0},\sqrt{\Delta t}\,{\bm{I}}),\qquad(4)$$

$$\tilde{{\bm{x}}}(t)=\tilde{{\bm{x}}}(t+\Delta t)-\Delta t\,F(\tilde{{\bm{x}}}(t+\Delta t),{\bm{W}}(t+\Delta t))+\gamma\mathcal{N}(\mathbf{0},\sqrt{\Delta t}\,{\bm{I}}).\qquad(5)$$

Fig. 1(d) and (e) show the results of the forward and backward propagation by the SDE, respectively. These results show that it is much harder to reverse the features obtained from the SDE evolution. The SDE model suggests injecting Gaussian noise, in both the training and test phases, into each residual mapping of the ResNet to protect data privacy, which results in

$${\bm{x}}^{i+1}={\bm{x}}^{i}+\hat{F}({\bm{x}}^{i},{\bm{W}}^{i})+\gamma{\bm{n}}^{i},\quad\mbox{where}\ {\bm{n}}^{i}\sim\mathcal{N}(\mathbf{0},{\bm{I}}).\qquad(6)$$
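A minimal PyTorch sketch of the residual perturbation (6) is given below; the conv-BN-ReLU structure of the residual mapping is illustrative rather than the exact ResNet8 block used in our experiments, and the key point is that the Gaussian noise is added in both training and test modes.

import torch
import torch.nn as nn

class NoisyResidualBlock(nn.Module):
    """Residual mapping with additive Gaussian noise (Strategy I, Eq. (6))."""
    def __init__(self, channels, gamma=0.5):
        super().__init__()
        self.gamma = gamma
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # x^{i+1} = x^i + F_hat(x^i, W^i) + gamma * n^i,  n^i ~ N(0, I);
        # the noise is injected at both training and test time.
        return x + self.residual(x) + self.gamma * torch.randn_like(x)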
Strategy II

For the second strategy, we consider multiplicative noise instead of the additive noise used in (3) and (6); the corresponding SDE can be written as

$$d{\bm{x}}(t)=F({\bm{x}}(t),{\bm{W}}(t))dt+\gamma{\bm{x}}(t)\odot d{\bm{B}}(t),\quad\gamma>0,\qquad(7)$$

where $\odot$ denotes the Hadamard product. Similarly, we can use the forward and backward Euler-Maruyama discretizations of (7) to propagate the image in Fig. 1(a); we provide these results in Sec. VI-A. The corresponding residual perturbation is

$${\bm{x}}^{i+1}={\bm{x}}^{i}+\hat{F}({\bm{x}}^{i},{\bm{W}}^{i})+\gamma{\bm{x}}^{i}\odot{\bm{n}}^{i},\quad\mbox{where}\ {\bm{n}}^{i}\sim\mathcal{N}(\mathbf{0},{\bm{I}}).\qquad(8)$$

Again, the noise $\gamma{\bm{x}}^{i}\odot{\bm{n}}^{i}$ is injected into each residual mapping in both the training and test stages.
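A companion sketch of the multiplicative variant (8) follows; the wrapper below accepts any residual mapping $\hat{F}(\cdot,{\bm{W}})$ and is illustrative only.

import torch
import torch.nn as nn

class MultNoisyResidualBlock(nn.Module):
    """Residual mapping with multiplicative Gaussian noise (Strategy II, Eq. (8))."""
    def __init__(self, residual: nn.Module, gamma=0.5):
        super().__init__()
        self.residual = residual     # any residual mapping F_hat(., W)
        self.gamma = gamma

    def forward(self, x):
        # x^{i+1} = x^i + F_hat(x^i, W^i) + gamma * (x^i ⊙ n^i),  n^i ~ N(0, I)
        return x + self.residual(x) + self.gamma * x * torch.randn_like(x)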

We will provide theoretical guarantees for these two residual perturbation schemes, i.e., (6) and (8), in Sec. III, and numerically verify their efficacy in Sec. IV.

II-C Utility Enhancement via Model Ensemble

In [49], the authors showed that an ensemble of noise-injected ResNets can improve models’ utility. In this paper, we will also study the model ensemble for utility enhancement. We inherit notations from [49], e.g., we denote an ensemble of two noise-injected ResNet8 as En2ResNet8.
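One simple way to realize such an ensemble is sketched below: average the outputs of $k$ noise-injected copies of the same architecture. The exact ensembling and training scheme of [49] may differ, so this is only an illustration of the idea, and the constructor name in the comment is hypothetical.

import torch
import torch.nn as nn

class NoisyEnsemble(nn.Module):
    """Average the outputs of k independently perturbed copies of a model."""
    def __init__(self, make_model, k=5):
        super().__init__()
        self.members = nn.ModuleList([make_model() for _ in range(k)])

    def forward(self, x):
        # Each member carries its own residual-perturbation noise; averaging
        # the outputs reduces the variance introduced by the noise injection.
        return torch.stack([m(x) for m in self.members], dim=0).mean(dim=0)

# e.g., En5ResNet8 ~ NoisyEnsemble(lambda: resnet8_with_perturbation(gamma=0.5), k=5)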

III Main Theory

III-A Differential Privacy Guarantee for Strategy I

We consider the following function class for ResNets with residual perturbation:

$$\mathcal{F}_{1}:=\left\{f({\bm{x}})={\bm{w}}^{\top}{\bm{x}}^{M}\ \middle|\ {\bm{x}}^{i+1}={\bm{x}}^{i}+\phi\left({\bm{U}}^{i}{\bm{x}}^{i}\right)+\gamma{\bm{n}}^{i}\right\},\qquad(9)$$

where ${\bm{x}}^{0}=\mbox{input data}+\pi{\bm{n}}\in{\mathbb{R}}^{d}$ is the noisy input, with $\pi>0$ being a hyperparameter and ${\bm{n}}\sim\mathcal{N}(\mathbf{0},{\bm{I}})$ being a normal random vector; ${\bm{U}}^{i}\in{\mathbb{R}}^{d\times d}$ is the weight matrix in the $i$-th residual mapping and ${\bm{w}}\in{\mathbb{R}}^{d}$ is the weights of the last layer; ${\bm{n}}^{i}\sim\mathcal{N}(\mathbf{0},{\bm{I}})$, and $\gamma>0$ is also a hyperparameter. $\phi={\rm BN}(\psi)$ with ${\rm BN}$ being batch normalization and $\psi$ being an $L$-Lipschitz and monotonically increasing activation function, e.g., ReLU.

We first recall the definition of differential privacy.

Definition 1 ($(\epsilon,\delta)$-DP).

[20] A randomized mechanism ${\mathcal{M}}:{\mathcal{S}}^{N}\rightarrow{\mathcal{R}}$ satisfies $(\epsilon,\delta)$-DP if for any two datasets $S,S^{\prime}\in{\mathcal{S}}^{N}$ that differ by one element, and any output subset $O\subseteq{\mathcal{R}}$, it holds that ${\mathbb{P}}[{\mathcal{M}}(S)\in O]\leq e^{\epsilon}\cdot{\mathbb{P}}[{\mathcal{M}}(S^{\prime})\in O]+\delta$, where $\delta\in(0,1)$ and $\epsilon>0$.

Theorem 1 below guarantees DP for Strategy I, and we provide its proof in Sec. V-A.

Theorem 1.

Assume that the input to the ResNet lies in $B(\mathbf{0},R)$ and that the expectation of the output of every residual mapping is normally distributed and bounded by a constant $G$ in $\ell_{2}$ norm. Let $T$ be the total number of iterations used for training the ResNet. For any $\epsilon>0$ and $\delta,\lambda\in(0,1)$, the parameters ${\bm{U}}^{i}$ and ${\bm{w}}$ in the ResNet with residual perturbation satisfy $\left((\lambda/(i+1)+(1-\lambda))\epsilon,\delta\right)$-DP and $\left((\lambda/(M+1)+(1-\lambda))\epsilon,\delta\right)$-DP, respectively, provided that $\pi>R\sqrt{(2Tb\alpha)/(N\lambda\epsilon)}$ and $\gamma>G\sqrt{(2Tb\alpha)/(N\lambda\epsilon)}$, where $\alpha=\log(1/\delta)/\left((1-\lambda)\epsilon\right)+1$, and $M$, $N$, and $b$ are the number of residual mappings, the number of training data, and the batch size, respectively. In particular, in the above setting the whole model obtained by injecting noise according to Strategy I satisfies $(\epsilon,\delta)$-DP.
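For a quick numerical reading of Theorem 1, the following helper (hypothetical, not code from our experiments) evaluates the lower bounds on $\pi$ and $\gamma$ for a given budget; the example values below are placeholders, except for $R=G=30$, which we use in Sec. IV-F.

import math

def strategy1_noise(eps, delta, lam, T, b, N, R, G):
    """Lower bounds (pi, gamma) from Theorem 1 for an (eps, delta)-DP budget."""
    alpha = math.log(1.0 / delta) / ((1.0 - lam) * eps) + 1.0
    scale = math.sqrt(2.0 * T * b * alpha / (N * lam * eps))
    return R * scale, G * scale

# Placeholder schedule (not necessarily the one used in Sec. IV): 100 epochs,
# N = 50,000 training images, batch size b = 128.
T = 100 * (50_000 // 128)
print(strategy1_noise(eps=3e3, delta=1e-5, lam=0.5, T=T, b=128, N=50_000, R=30, G=30))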

III-B Theoretical Guarantees for Strategy II

Privacy

To analyze the residual perturbation scheme in (8), we consider the following function class:

$$\mathcal{F}_{2}:=\left\{f\left({\bm{x}}\right)={\bm{w}}^{\top}{\bm{x}}^{M}+\pi{\bm{x}}^{M}{\bm{n}}\ \middle|\ {\bm{x}}^{i+1}={\bm{x}}^{i}+\phi\left({\bm{U}}^{i}{\bm{x}}^{i}\right)+\gamma\tilde{{\bm{x}}}^{i}\odot{\bm{n}}^{i},\ i=0,\ldots,M-1\right\},$$

where ${\bm{n}}^{i}\sim\mathcal{N}(\mathbf{0},{\bm{I}})$ is a normal random vector for $i=0,\ldots,M-1$, and $\|{\bm{w}}\|\leq a$ with $a>0$ being a constant. We denote the entry of ${\bm{x}}^{i}$ that has the largest absolute value as $x^{i}_{\max}$, and define $\tilde{{\bm{x}}}^{i}:=({\rm sgn}(x^{i}_{j})\max(|x^{i}_{j}|,\eta))_{j=1}^{d}$. Due to batch normalization, we assume the $\ell_{2}$ norm of $\phi$ is bounded by a positive constant $B$. The other notations are defined similarly to those in (9). We consider training $\mathcal{F}_{2}$ on two datasets $S$ and $S^{\prime}$, and denote the resulting models as

$$f\left({\bm{x}}|S\right):={\bm{w}}_{1}^{\top}{\bm{x}}^{M}+\pi{\bm{x}}^{M}\odot{\bm{n}}^{M},\quad\text{where}\quad{\bm{x}}^{i+1}={\bm{x}}^{i}+\phi\left({\bm{U}}_{1}^{i}{\bm{x}}^{i}\right)+\gamma\tilde{{\bm{x}}}^{i}\odot{\bm{n}}^{i},\ i=0,\ldots,M-1,\qquad(10)$$

and

$$f\left({\bm{x}}|S^{\prime}\right):={\bm{w}}_{2}^{\top}{\bm{x}}^{M}+\pi{\bm{x}}^{M}\odot{\bm{n}}^{M},\quad\text{where}\quad{\bm{x}}^{i+1}={\bm{x}}^{i}+\phi\left({\bm{U}}_{2}^{i}{\bm{x}}^{i}\right)+\gamma\tilde{{\bm{x}}}^{i}\odot{\bm{n}}^{i},\ i=0,\ldots,M-1.\qquad(11)$$

Then we have the following privacy guarantee.

Theorem 2.

Let $f({\bm{x}}|S)$ and $f({\bm{x}}|S^{\prime})$ be defined in (10) and (11), respectively. Let $\lambda\in(0,1)$, $\delta\in(0,1)$, and $\epsilon>0$. If $\gamma>(B/\eta)\sqrt{(2\alpha M)/(\lambda\epsilon)}$ and $\pi>a\sqrt{(2\alpha M)/(\lambda\epsilon)}$, where $\alpha=\log(1/\delta)/\left((1-\lambda)\epsilon\right)+1$, then ${\mathbb{P}}[f({\bm{x}}|S)\in O]\leq e^{\epsilon}\cdot{\mathbb{P}}[f({\bm{x}}|S^{\prime})\in O]+\delta$ for any input ${\bm{x}}$ and any subset $O$ of the output space.

We prove Theorem 2 in Sec. V-B. Theorem 2 guarantees the privacy of the training data given only black-box access to the model, i.e., the model outputs a prediction for any input without granting adversaries access to the model itself. In particular, we cannot infer whether the model was trained on $S$ or $S^{\prime}$, no matter how we query the model in a black-box fashion. This kind of privacy-preserving prediction has also been studied in [34, 35, 61, 62]. We leave a theoretical DP guarantee for Strategy II for future work.
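Analogously to Theorem 1, the noise levels required by Theorem 2 can be evaluated numerically; the helper below is hypothetical, and the constants $B$, $\eta$, $a$, and $M$ in the example are placeholders.

import math

def strategy2_noise(eps, delta, lam, M, B, eta, a):
    """Lower bounds (gamma, pi) from Theorem 2 for a given (eps, delta) and lambda."""
    alpha = math.log(1.0 / delta) / ((1.0 - lam) * eps) + 1.0
    scale = math.sqrt(2.0 * alpha * M / (lam * eps))
    return (B / eta) * scale, a * scale

print(strategy2_noise(eps=10.0, delta=1e-5, lam=0.5, M=3, B=1.0, eta=0.1, a=1.0))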

Generalization gap

Many works have shown that overfitting in training ML models leads to privacy leakage [6], and that reducing overfitting can mitigate data privacy leakage [5, 7, 8, 6, 63]. In this part, we show that the residual perturbation (8) can reduce overfitting by computing the Rademacher complexity. For simplicity, we consider binary classification problems. Suppose $S_{N}=\{{\bm{x}}_{i},y_{i}\}_{i=1}^{N}$ is drawn from $X\times Y\subset{\mathbb{R}}^{d}\times\{-1,+1\}$ with $X$ and $Y$ being the input data and label spaces, respectively. Assume $\mathcal{D}$ is the underlying (unknown) distribution on $X\times Y$. Let $\mathcal{H}\subset V$ be the hypothesis class of the ML model. We first recall the definition of Rademacher complexity.

Definition 2.

[64] Let $\mathcal{H}:X\rightarrow{\mathbb{R}}$ be the space of real-valued functions on the space $X$. For a given sample $S=\{{\bm{x}}_{1},{\bm{x}}_{2},\ldots,{\bm{x}}_{N}\}$ of size $N$, the empirical Rademacher complexity of $\mathcal{H}$ is defined as

$$R_{S}(\mathcal{H}):=\frac{1}{N}E_{\sigma}\Big[\sup_{h\in\mathcal{H}}\sum_{i=1}^{N}\sigma_{i}h({\bm{x}}_{i})\Big],$$

where $\sigma_{1},\ldots,\sigma_{N}$ are i.i.d. Rademacher random variables with $\mathbb{P}(\sigma_{i}=1)=\mathbb{P}(\sigma_{i}=-1)=1/2$.
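The expectation in Definition 2 can be estimated by Monte Carlo when the supremum is taken over a finite (sampled) set of hypotheses; the sketch below is only an illustration of the definition, using random unit-norm linear hypotheses as a stand-in class.

import numpy as np

def empirical_rademacher(h_values, n_draws=2000, rng=np.random.default_rng(0)):
    """h_values[k, i] = h_k(x_i); Monte Carlo estimate of R_S over the finite set {h_k}."""
    num_h, N = h_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=N)     # i.i.d. Rademacher variables
        total += np.max(h_values @ sigma)           # sup_h sum_i sigma_i h(x_i)
    return total / (n_draws * N)

# Example: N = 50 points in R^5 and 200 random unit-norm linear hypotheses h(x) = w^T x.
X = np.random.default_rng(1).normal(size=(50, 5))
W = np.random.default_rng(2).normal(size=(200, 5))
W /= np.linalg.norm(W, axis=1, keepdims=True)
print(empirical_rademacher(W @ X.T))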

Rademacher complexity is a tool to bound the generalization gap [64]; the smaller the generalization gap, the less the model overfits. For any ${\bm{x}}_{i}\in{\mathbb{R}}^{d}$ and constant $c\geq 0$, we consider the following two function classes ($\mathcal{F}$ and $\mathcal{G}$):

$$\mathcal{F}:=\left\{f({\bm{x}},{\bm{w}})={\bm{w}}{\bm{x}}^{p}(T)\ \middle|\ d{\bm{x}}(t)={\bm{U}}{\bm{x}}(t)dt\ \text{with}\ {\bm{x}}(0)={\bm{x}}_{i}\right\},$$

where ${\bm{w}}\in{\mathbb{R}}^{1\times d}$ and ${\bm{U}}\in{\mathbb{R}}^{d\times d}$ with $\|{\bm{w}}\|_{2},\|{\bm{U}}\|_{2}\leq c$. And

$$\mathcal{G}:=\left\{f({\bm{x}},{\bm{w}})=\mathbb{E}\left[{\bm{w}}{\bm{x}}^{p}(T)\right]\ \middle|\ d{\bm{x}}(t)={\bm{U}}{\bm{x}}(t)dt+\gamma{\bm{x}}(t)\odot dB(t)\right\},$$

where ${\bm{x}}(0)={\bm{x}}_{i}$, ${\bm{w}}\in{\mathbb{R}}^{1\times d}$ and ${\bm{U}}\in{\mathbb{R}}^{d\times d}$ with $\|{\bm{w}}\|_{2},\|{\bm{U}}\|_{2}\leq c$; $0<p<1$ takes a value such that ${\bm{x}}^{p}$ is well defined on ${\mathbb{R}}^{d}$; $\gamma>0$ is a hyperparameter; ${\bm{U}}$ is a circulant matrix corresponding to a convolution layer in DNNs; and $B(t)$ is the standard 1D Brownian motion. The function class $\mathcal{F}$ represents the continuous analogue of a ResNet without inner nonlinear activation functions, and $\mathcal{G}$ denotes $\mathcal{F}$ with the residual perturbation (8).

Theorem 3.

Given the training set $S_{N}=\{{\bm{x}}_{i},y_{i}\}_{i=1}^{N}$, we have $R_{S_{N}}(\mathcal{G})<R_{S_{N}}(\mathcal{F})$.

We provide the proof of Theorem 3 in Sec. V-C, where we will also provide quantitative lower and upper bounds of the above Rademacher complexities. Theorem 3 shows that residual perturbation (8) can reduce the generalization error. We will numerically verify this generalization error reduction for ResNet with residual perturbation in Sec. IV.

IV Experiments

In this section, we will verify that 1) Residual perturbation can protect data privacy, particularly membership privacy. 2) The ensemble of ResNets with residual perturbation improves the classification accuracy. 3) The skip connection is crucial in residual perturbation for DL with data privacy. 4) Residual perturbation can improve the utility of the resulting ML models over DPSGD with a compromise in privacy protection. We focus on Strategy I in this section, and we provide the results of Strategy II in Sec. VI.

IV-A Preliminaries

Datasets

We consider MNIST, CIFAR10/CIFAR100 [65], and the Invasive Ductal Carcinoma (IDC) [66] datasets. MNIST consists of 70K $28\times 28$ gray-scale images, with 60K and 10K of them used for training and testing, respectively. Both CIFAR10 and CIFAR100 contain 60K $32\times 32$ color images, with 50K and 10K of them used for training and testing, respectively. The IDC dataset is a breast cancer-related benchmark dataset, which contains 277,524 patches of size $50\times 50$, with 198,738 labeled negative (0) and 78,786 labeled positive (1). Fig. 2 depicts a few patches from the IDC dataset. For the IDC dataset, we follow [67] and split the whole dataset into training, validation, and test sets. The training set consists of 10,788 positive patches and 29,164 negative patches, and the test set contains 11,595 positive patches and 31,825 negative patches. The remaining patches are used as the validation set. For each dataset, we split its training set into $D_{\rm shadow}$ and $D_{\rm target}$ of the same size. Furthermore, we split $D_{\rm shadow}$ into two halves of the same size, denoted $D_{\rm shadow}^{\rm train}$ and $D_{\rm shadow}^{\rm out}$, and similarly split $D_{\rm target}$ into $D_{\rm target}^{\rm train}$ and $D_{\rm target}^{\rm out}$. This splitting of the training set is used for the membership inference attack, which is discussed below.

Figure 2: Visualization of a few selected images (negative and positive patches) from the IDC dataset.
Membership inference attack

To verify the efficiency of residual perturbation for privacy protection, we consider the membership inference attack [6] in all the experiments below. The membership attack proceeds as follows: 1) train the shadow model on $D_{\rm shadow}^{\rm train}$; 2) apply the trained shadow model to predict all data points in $D_{\rm shadow}$ and obtain the corresponding classification probabilities of belonging to each class; we then take the top three classification probabilities (or two for binary classification) to form the feature vector for each data point. A feature vector is labeled 1 if the corresponding data point is in $D_{\rm shadow}^{\rm train}$, and 0 otherwise. We then train the attack model on all the labeled feature vectors; 3) train the target model on $D_{\rm target}^{\rm train}$ and obtain a feature vector for each point in $D_{\rm target}$. Finally, we use the attack model to decide whether a data point is in $D_{\rm target}^{\rm train}$. A sketch of the feature construction in step 2) is given below.
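The sketch uses illustrative names and assumes a PyTorch shadow model with standard data loaders; it only demonstrates how the top-3 probability features and the member/non-member labels are formed.

import numpy as np
import torch
import torch.nn.functional as F

@torch.no_grad()
def attack_features(model, loader, is_member, device="cpu"):
    """Top-3 softmax probabilities per example, labeled 1 for members and 0 otherwise."""
    feats, labels = [], []
    for x, _ in loader:
        probs = F.softmax(model(x.to(device)), dim=1)
        top3, _ = probs.topk(3, dim=1)              # feature vector of each data point
        feats.append(top3.cpu().numpy())
        labels.append(np.full(len(x), int(is_member)))
    return np.concatenate(feats), np.concatenate(labels)

# The attack model is then trained on features from D_shadow^train (label 1) and
# D_shadow^out (label 0), and evaluated on features computed with the target model.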

Experimental settings

We consider En5ResNet8 (an ensemble of 5 ResNet8 models with residual perturbation) and the standard ResNet8 as the target and shadow models, respectively. We use a multilayer perceptron with a hidden layer of 64 nodes, followed by a softmax function, as the attack model, which is adapted from [6]. We apply the same settings as in [33] to train the target and shadow models on MNIST, CIFAR10, and CIFAR100. For training models on the IDC dataset, we run 100 epochs of SGD with the same setting as before, except that we decay the learning rate by a factor of 4 at the 20th, 40th, 60th, and 80th epochs. Moreover, we run 50 epochs of Adam [68] with a learning rate of 0.1 to train the attack model. For the IDC, MNIST, and CIFAR datasets, we set $\pi$ to half of $\gamma$ in Strategy I, which simplifies hyperparameter calibration and, based on our experiments, gives a good tradeoff between privacy and accuracy.

Performance evaluations

We consider both classification accuracy and the capability of protecting membership privacy. The attack model is a binary classifier, which is used to decide whether a data point is in the training set of the target model. For any ${\bm{x}}\in D_{\rm target}$, we apply the attack model to predict its probability $p$ of belonging to the training set of the target model. Given any fixed threshold $t$, if $p\geq t$ we classify ${\bm{x}}$ as a member of the training set (positive sample), and if $p<t$ we conclude that ${\bm{x}}$ is not in the training set (negative sample); thus we obtain different attack results with different thresholds. Furthermore, we can plot the ROC curve (see details in Sec. IV-E) of the attack model and use the area under the ROC curve (AUC) as an evaluation of the membership inference attack. The target model protects membership privacy perfectly if the AUC is 0.5 (the attack model performs a random guess), and the more the AUC deviates from 0.5, the less private the target model is. Moreover, we use the precision (the fraction of records inferred as members that are indeed members of the training set) and recall (the fraction of the training set that is correctly inferred as members by the attack model) to measure ResNets' capability of protecting membership privacy.
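These metrics can be computed as in the following sketch (scikit-learn is assumed; `p` denotes attack-model scores on $D_{\rm target}$ and `member` indicates true membership in $D_{\rm target}^{\rm train}$; the synthetic example at the end is illustrative only).

import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score

def evaluate_attack(p, member, threshold=0.5):
    """AUC of the attack, plus precision/recall at a fixed decision threshold."""
    auc = roc_auc_score(member, p)          # 0.5 = random guess = perfect membership privacy
    pred = (p >= threshold).astype(int)
    return auc, precision_score(member, pred), recall_score(member, pred)

# Synthetic example: weakly informative attack scores give an AUC slightly above 0.5.
rng = np.random.default_rng(0)
member = rng.integers(0, 2, size=1000)
p = np.clip(0.5 + 0.05 * (member - 0.5) + 0.1 * rng.normal(size=1000), 0.0, 1.0)
print(evaluate_attack(p, member))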

Figure 3: Performance of residual perturbation (Strategy I) for En5ResNet8 with different noise coefficients ($\gamma$) and membership inference attack thresholds on the IDC dataset. Residual perturbation significantly improves membership privacy and reduces the generalization gap. $\gamma=0$ corresponds to the baseline ResNet8. (Unit: %)

IV-B Experiments on the IDC Dataset

In this subsection, we numerically verify that residual perturbation protects data privacy while retaining classification accuracy on the IDC dataset. We select En5ResNet8 as the benchmark architecture, with ResNet8 as its baseline. As shown in Fig. 3, we set four different thresholds to obtain different attack results with three different noise coefficients ($\gamma$), where $\gamma=0$ means the standard ResNet8 without residual perturbation. We also depict the ROC curve for this experiment in Fig. 7(c). En5ResNet8 is remarkably better at protecting membership privacy, and as $\gamma$ increases the model becomes more resistant to the membership attack. Also, as the noise coefficient increases, the gap between training and test accuracies becomes smaller, which resonates with Theorem 3. For instance, when $\gamma=0.2$ the AUC of the attack model is 0.495 and 0.563 for En5ResNet8 and ResNet8, respectively; the classification accuracies of En5ResNet8 and ResNet8 are 0.867 and 0.847, respectively.

Residual perturbation vs. DPSGD

In this part, we compare residual perturbation with the benchmark TensorFlow DPSGD module [69]. We calibrate the hyperparameters, including the initial learning rate (0.1), which decays by a factor of 4 after every 20 epochs, the noise multiplier (1.1), the clipping threshold (1.0), the number of microbatches (128), and the number of epochs (100) (see https://github.com/tensorflow/privacy/tree/master/tutorials), such that the resulting model gives the optimal tradeoff between membership privacy and classification accuracy. DPSGD is significantly more expensive due to the requirement of computing and clipping the per-sample gradient; a sketch of one DPSGD step is given below. We compare the standard ResNet8 trained by DPSGD with the baseline non-private ResNet8 and with En5ResNet8 with residual perturbation ($\gamma=0.5$). Table I lists the AUC of the attack model and the training and test accuracies of the target model; we see that both DPSGD and residual perturbation can protect membership privacy, and residual perturbation can improve accuracy without sacrificing membership privacy protection.
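The sketch below spells out one DPSGD step with explicit per-sample gradient clipping and Gaussian noise; it is a bare-bones illustration of where the overhead comes from, not the TensorFlow Privacy implementation used in Table I.

import torch

def dpsgd_step(model, loss_fn, xb, yb, lr=0.1, clip=1.0, noise_multiplier=1.1):
    """One DPSGD update on a batch (xb, yb); the per-sample loop shows the extra cost."""
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xb, yb):                        # per-sample gradients: the bottleneck
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
                 for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip / (norm + 1e-12), max=1.0)   # clip each per-sample gradient
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noisy = (s + noise_multiplier * clip * torch.randn_like(s)) / len(xb)
            p.add_(-lr * noisy)                     # noisy averaged-gradient step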

TABLE I: Residual perturbation (Strategy I) vs. DPSGD in training ResNet8 and EnResNet8 for IDC classification. Residual perturbation has better accuracy and protects better membership privacy than the baseline ResNet8 (AUC closer to 0.5).
Model AUC Training Acc Test Acc
ResNet8(non-private) 0.563 1.000 0.847
ResNet8 (DPSGD) 0.503 0.831 0.828
En1ResNet8 ($\gamma=0.5$) 0.499 0.882 0.865

IV-C Experiments on the MNIST/CIFAR10/CIFAR100 Datasets

We further test the effectiveness of residual perturbation for ResNet8 and En5ResNet8 on the MNIST/CIFAR10/CIFAR100 datasets. We also contrast residual perturbation with DPSGD in privacy protection and utility maintenance. Fig. 4 plots the performance of En5ResNet8 on CIFAR10 under the above four measures, namely precision, recall, training and test accuracy, and AUC. Again, the ensemble of ResNets with residual perturbation is remarkably less vulnerable to the membership inference attack; e.g., the AUC of the attack model for ResNet8 (non-private) and En5ResNet8 ($\gamma=0.75$) is 0.757 and 0.573, respectively. Also, the classification accuracy of En5ResNet8 (67.5%) is higher than that of the non-private ResNet8 (66.9%) for CIFAR10 classification. Fig. 5 depicts the results of En5ResNet8 for CIFAR100 classification. These results confirm again that residual perturbation can protect membership privacy and improve classification accuracy.

Figure 4: Performance of En5ResNet8 with residual perturbation (Strategy I) using different noise coefficients ($\gamma$) and membership inference attack thresholds on CIFAR10. Residual perturbation not only enhances membership privacy but also improves classification accuracy. $\gamma=0$ corresponds to the baseline ResNet8 without residual perturbation or model ensemble. (Unit: %)
Figure 5: Performance of En5ResNet8 with residual perturbation (Strategy I) using different noise coefficients ($\gamma$) and membership inference attack thresholds on CIFAR100. Again, residual perturbation not only enhances membership privacy but also improves classification accuracy. $\gamma=0$ corresponds to the baseline ResNet8 without residual perturbation or model ensemble. (Unit: %)
Residual perturbation vs. DPSGD

We contrast residual perturbation with DPSGD on MNIST, CIFAR10, and CIFAR100 classification; the results are shown in Tables II, III, and IV, respectively. These results unanimously show that 1) both DPSGD and residual perturbation can effectively protect membership privacy, and 2) residual perturbation is better than DPSGD in maintaining the model’s utility without sacrificing membership privacy protection.

TABLE II: Residual perturbation (Strategy I) vs. DPSGD in training ResNet8 and EnResNet8 for MNIST classification. Residual perturbation has higher accuracy without sacrificing membership privacy protection compared to DPSGD.
Model AUC Training Acc Test Acc
ResNet8(non-private) 0.514 1.000 0.988
ResNet8 (DPSGD) 0.507 0.977 0.975
En1ResNet8 ($\gamma=2.5$) 0.506 0.989 0.978
TABLE III: Residual perturbation (Strategy I) vs. DPSGD in training ResNet8 and EnResNet8 for CIFAR10 classification. Residual perturbation has higher accuracy than the baseline model without sacrificing membership privacy protection.
Model AUC Training Acc Test Acc
ResNet8(non-private) 0.757 1.000 0.669
ResNet8 (DPSGD) 0.512 0.718 0.625
En1ResNet8 ($\gamma=4.5$) 0.511 0.760 0.695
TABLE IV: Residual perturbation (Strategy I) vs. DPSGD in training ResNet8 and EnResNet8 for CIFAR100 classification. Residual perturbation has higher accuracy than DPSGD and protects comparable membership privacy with DPSGD.
Model AUC Training Acc Test Acc
ResNet8(non-private) 0.903 0.999 0.305
ResNet8 (DPSGD) 0.516 0.400 0.288
En1ResNet8 ($\gamma=3.25$) 0.515 0.485 0.383
Effects of the number of models in the ensemble

In this part, we consider the effects of the number of residual-perturbed ResNets in the ensemble. Fig. 6 illustrates the performance of EnResNet8 for CIFAR10 classification, measured by AUC and by training and test accuracy. These results show that tuning the noise coefficient and the number of models in the ensemble is crucial for optimizing the tradeoff between accuracy and privacy.

Figure 6: Performance of residual perturbation (Strategy I) with different noise coefficients ($\gamma$) and different numbers of models in the ensemble. The privacy-utility tradeoff depends on these two factors. (Unit: %)

IV-D On the Importance of Skip Connections

Residual perturbation relies on the irreversibility of the SDEs (3) and (7), and this ansatz rests on the skip connections in the ResNet. We test both the standard ResNet and a modified ResNet without skip connections. For CIFAR10 classification, under the same noise coefficient ($\gamma=0.75$), the test accuracy is 0.675 for En5ResNet8 with skip connections, while it is 0.653 for En5ResNet8 without skip connections. Skip connections make EnResNet more resistant to noise injection, which is crucial for the success of residual perturbation for privacy protection.

IV-E ROC Curves for Experiments on Different Datasets

The receiver operating characteristic (ROC) curve can be used to illustrate the classification ability of a binary classifier. The ROC curve is obtained by plotting the true positive rate against the false positive rate at different thresholds. The true positive rate, also known as recall, is the fraction of the positive set (all the positive samples) that is correctly inferred as positive by the binary classifier. The false positive rate equals 1 minus the specificity, where the specificity is the fraction of the negative set (all the negative samples) that is correctly inferred as negative by the binary classifier. In our case, the attack model is a binary classifier: data points in the training set of the target model are tagged as positive samples, and data points outside the training set of the target model are tagged as negative samples. We plot the ROC curves for the different datasets in Fig. 7. These ROC curves show that if $\gamma$ is sufficiently large, the attack model's prediction is nearly a random guess.

Figure 7: ROC curves of the membership attack against En5ResNet8 with residual perturbation (Strategy I) on (a) CIFAR10, (b) CIFAR100, and (c) IDC. (ga denotes the noise coefficient $\gamma$.)

IV-F Remark on the Privacy Budget

In the experiments above, we set the constants $G$ and $R$ to 30 for Strategy I. For classifying IDC with ResNet8, the theoretical DP budget for Strategy I, based on Theorem 1, is $(\epsilon=1.1\mathrm{e}3,\delta=1\mathrm{e}{-5})$ and the DP budget for DPSGD is $(\epsilon=15.79,\delta=1\mathrm{e}{-5})$. For classifying CIFAR10 with ResNet8, the DP budget for Strategy I, based on Theorem 1, is $(\epsilon=3\mathrm{e}3,\delta=1\mathrm{e}{-5})$ and the DP budget for DPSGD is $(\epsilon=22.33,\delta=1\mathrm{e}{-5})$. Using the algorithm provided in [42], we estimate the empirical lower bound of $\epsilon$ given $\delta=1\mathrm{e}{-5}$ for both Strategy I and DPSGD. The obtained empirical lower bound can guide the improvement of analytic tools for bounding the DP budget: the larger the gap between the theoretical and empirical DP budgets, the more room there is to improve the theoretical budget. For IDC classification, the lower bounds of $\epsilon$ for Strategy I and DPSGD are 3.604e-3 and 6.608e-3, respectively, and for CIFAR10 classification they are 1.055e-1 and 1.558e-1, respectively. These empirical DP budget estimates show that Theorem 1 offers a rather loose DP budget compared to DPSGD. Obtaining a tighter DP bound for Strategy I is significantly harder than for DPSGD, for two reasons. 1) The loss function of the noise-injected ResNet is highly nonlinear and very complex with respect to the weights, and the noise term appears in the loss function because noise is injected into each residual mapping. 2) In our proof, we leverage the framework of subsampled Rényi DP [37] to find a feasible range for the noise variance parameter, and then convert to DP to obtain the value of $\gamma$ for a given DP budget; this procedure significantly reduces the accuracy of the estimated $\gamma$. We leave a tight DP guarantee, in particular reducing the accuracy loss caused by the conversion between Rényi DP and DP, as future work.

V Technical Proofs

V-A Proof of Theorem 1

V-A1 Rényi Differential Privacy (RDP)

We use the notion of RDP to prove the DP guarantees for residual perturbations.

Definition 3.

[36] (Rényi divergence) For any two probability distributions $P$ and $Q$ defined over $\mathcal{D}$, the Rényi divergence of order $\alpha>1$ is

$$D_{\alpha}(P\|Q)=\frac{1}{\alpha-1}\log{\mathbb{E}}_{x\sim Q}(P/Q)^{\alpha}.$$
Definition 4.

[36] ($(\alpha,\epsilon)$-RDP) A randomized mechanism $\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R}$ is $\epsilon$-Rényi DP of order $\alpha$, or $(\alpha,\epsilon)$-RDP, if for any adjacent $\mathcal{S},\mathcal{S}^{\prime}\in\mathcal{D}$ that differ by one entry, we have

$$D_{\alpha}(\mathcal{M}(\mathcal{S})\|\mathcal{M}(\mathcal{S}^{\prime}))\leq\epsilon.$$
Lemma 1.

[36] Let $f:\mathcal{D}\rightarrow\mathcal{R}_{1}$ be $(\alpha,\epsilon_{1})$-RDP and $g:\mathcal{R}_{1}\times\mathcal{D}\rightarrow\mathcal{R}$ be $(\alpha,\epsilon_{2})$-RDP. Then the mechanism $(X,Y)$, where $X\sim f(D)$ and $Y\sim g(X,D)$, satisfies $(\alpha,\epsilon_{1}+\epsilon_{2})$-RDP.

Lemma 2.

[36] (From RDP to $(\epsilon,\delta)$-DP) If $f$ is an $(\alpha,\epsilon)$-RDP mechanism, then it also satisfies $(\epsilon-(\log\delta)/(\alpha-1),\delta)$-differential privacy for any $0<\delta<1$.
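As a numerical illustration of Lemma 2, the helper below converts an assumed Gaussian-mechanism RDP curve $\epsilon_{\rm RDP}(\alpha)=\alpha\Delta^{2}/(2\sigma^{2})$ into an $(\epsilon,\delta)$-DP guarantee by optimizing over $\alpha$; it is illustrative and not part of our analysis.

import math

def rdp_to_dp(sigma, sens, delta, alphas=range(2, 256)):
    """Best (epsilon, delta)-DP obtainable from the Gaussian-mechanism RDP curve."""
    best = float("inf")
    for alpha in alphas:
        eps_rdp = alpha * sens ** 2 / (2.0 * sigma ** 2)         # (alpha, eps_rdp)-RDP
        eps_dp = eps_rdp + math.log(1.0 / delta) / (alpha - 1)   # Lemma 2 conversion
        best = min(best, eps_dp)
    return best

print(rdp_to_dp(sigma=1.1, sens=1.0, delta=1e-5))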

Lemma 3.

[36] (Post-processing lemma) Let $\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R}$ be a randomized algorithm that is $(\alpha,\epsilon)$-RDP, and let $f:\mathcal{R}\rightarrow\mathcal{R}^{\prime}$ be an arbitrary randomized mapping. Then $f(\mathcal{M}(\cdot)):\mathcal{D}\rightarrow\mathcal{R}^{\prime}$ is $(\alpha,\epsilon)$-RDP.

V-A2 Proof of Theorem 1

In this subsection, we give the proof of Theorem 1, i.e., the DP guarantee for Strategy I.

Proof.

We prove Theorem 1 by mathematical induction. Consider two adjacent datasets $\mathcal{S}=\{{\bm{x}}_{1},\cdots,{\bm{x}}_{N-1},{\bm{x}}_{N}\}$ and $\mathcal{S}^{\prime}=\{{\bm{x}}_{1},\cdots,{\bm{x}}_{N-1},\hat{\bm{x}}_{N}\}$ that differ by one entry. For the first residual mapping, it is easy to check that when $\pi\geq R\sqrt{2\alpha/\epsilon_{p}}$, with $\pi$ being the standard deviation of the Gaussian noise injected into the input data, we have $D_{\alpha}({\bm{x}}^{0}_{N}\|\hat{\bm{x}}^{0}_{N})=\alpha\|{\bm{x}}_{N}-\hat{\bm{x}}_{N}\|^{2}/(2\pi^{2})<\epsilon_{p}$. For the remaining residual mappings, we denote the responses of the $i$-th residual mapping to the two inputs ${\bm{x}}_{N}$ and $\hat{\bm{x}}_{N}$ as ${\bm{x}}^{i}_{N}$ and $\hat{\bm{x}}^{i}_{N}$, respectively. Based on our assumption, we have

$${\bm{x}}^{i}_{N}+\phi({\bm{U}}^{i}{\bm{x}}^{i}_{N})\sim\mathcal{N}(\mu_{N,i},\sigma^{2}_{N,i}),\qquad\hat{\bm{x}}^{i}_{N}+\phi({\bm{U}}^{i}\hat{\bm{x}}^{i}_{N})\sim\mathcal{N}(\hat{\mu}_{N,i},\sigma^{2}_{N,i}),\qquad(12)$$

where the means $\mu_{N,i}$ and $\hat{\mu}_{N,i}$ are both bounded by the constant $G$. If $D_{\alpha}({\bm{x}}^{i}_{N}\|\hat{\bm{x}}^{i}_{N})\leq\epsilon_{p}/(i+1)$, then according to the post-processing lemma (Lemma 3), we have

$$D_{\alpha}\big({\bm{x}}^{i}_{N}+\phi({\bm{U}}^{i}{\bm{x}}^{i}_{N})\,\big\|\,\hat{\bm{x}}^{i}_{N}+\phi({\bm{U}}^{i}\hat{\bm{x}}^{i}_{N})\big)=\frac{\alpha\|\mu_{N,i}-\mu^{\prime}_{N,i}\|^{2}}{2\sigma^{2}_{N,i}}\leq\epsilon_{p}/(i+1),\qquad(13)$$

so when the standard deviation of the injected Gaussian noise satisfies $\gamma>\sqrt{2\alpha G^{2}/\epsilon_{p}}$, (13) implies that

$$D_{\alpha}\big({\bm{x}}^{i}_{N}+\phi({\bm{U}}^{i}{\bm{x}}^{i}_{N})+\gamma{\bm{n}}\,\big\|\,\hat{\bm{x}}^{i}_{N}+\phi({\bm{U}}^{i}\hat{\bm{x}}^{i}_{N})+\gamma{\bm{n}}\big)=\frac{\alpha\|\mu_{N,i}-\mu^{\prime}_{N,i}\|^{2}}{2(\sigma^{2}_{N,i}+\gamma^{2})}\leq\frac{\epsilon_{p}}{i+2},\qquad(14)$$

implying that $D_{\alpha}({\bm{x}}^{i+1}_{N}\|\hat{\bm{x}}^{i+1}_{N})<\frac{\epsilon_{p}}{i+2}$. Moreover, note that

$$\nabla_{{\bm{U}}^{i+1}}\ell|_{{\bm{x}}^{i+1}_{N}}=l^{\prime}\left({\bm{x}}^{i+1}_{N}+\phi\left({\bm{U}}^{i+1}{\bm{x}}^{i+1}_{N}\right)\right)\phi^{\prime}({\bm{U}}^{i+1}{\bm{x}}^{i+1}_{N})({\bm{x}}^{i+1}_{N})^{\top},\qquad(15)$$
$$\nabla_{{\bm{U}}^{i+1}}\ell|_{\hat{\bm{x}}^{i+1}_{N}}=l^{\prime}\left(\hat{\bm{x}}^{i+1}_{N}+\phi\left({\bm{U}}^{i+1}\hat{\bm{x}}^{i+1}_{N}\right)\right)\phi^{\prime}({\bm{U}}^{i+1}\hat{\bm{x}}^{i+1}_{N})(\hat{\bm{x}}^{i+1}_{N})^{\top}.$$

By the post-processing lemma (Lemma 3) again, we get

$$D_{\alpha}\big(\nabla_{{\bm{U}}^{i+1}}\ell|_{{\bm{x}}^{i+1}_{N}}\,\big\|\,\nabla_{{\bm{U}}^{i+1}}\ell|_{\hat{\bm{x}}^{i+1}_{N}}\big)<\frac{\epsilon_{p}}{i+2}.$$

Let $\mathcal{B}_{t}$ be the minibatch index set with $|\mathcal{B}_{t}|=b$; we update ${\bm{U}}^{i+1}$ as

$${\bm{U}}^{i+1}_{t+1}|\mathcal{S}={\bm{U}}^{i+1}_{t}-\alpha\frac{1}{b}\sum_{j\in\mathcal{B}_{t}}\nabla_{{\bm{U}}^{i+1}_{t}}\ell({\bm{x}}^{i+1}_{j},{\bm{U}}^{i+1}_{t}),\qquad{\bm{U}}^{i+1}_{t+1}|\mathcal{S}^{\prime}={\bm{U}}^{i+1}_{t}-\alpha\frac{1}{b}\sum_{j\in\mathcal{B}_{t}}\nabla_{{\bm{U}}^{i+1}_{t}}\ell({\bm{x}}^{i+1}_{j},{\bm{U}}^{i+1}_{t}),\qquad(16)$$

where ${\bm{U}}^{i+1}_{t}$ denotes the weights after the $t$-th training iteration. When $N\notin\mathcal{B}_{t}$, it is obvious that $D_{\alpha}({\bm{U}}^{i+1}_{t+1}|\mathcal{S}\,\|\,{\bm{U}}^{i+1}_{t+1}|\mathcal{S}^{\prime})=0$; when $N\in\mathcal{B}_{t}$, the update of ${\bm{U}}^{i+1}$ can be rewritten as

$$\begin{aligned}{\bm{U}}^{i+1}_{t+1}|\mathcal{S}&={\bm{U}}^{i+1}_{t}-\alpha(1/b)\sum_{j\in\mathcal{B}_{t}\setminus\{N\}}\nabla_{{\bm{U}}^{i+1}_{t}}\ell({\bm{x}}^{i+1}_{j},{\bm{U}}^{i+1}_{t})-\alpha(1/b)\nabla_{{\bm{U}}^{i+1}_{t}}\ell({\bm{x}}^{i+1}_{N},{\bm{U}}^{i+1}_{t}),\\ {\bm{U}}^{i+1}_{t+1}|\mathcal{S}^{\prime}&={\bm{U}}^{i+1}_{t}-\alpha(1/b)\sum_{j\in\mathcal{B}_{t}\setminus\{N\}}\nabla_{{\bm{U}}^{i+1}_{t}}\ell({\bm{x}}^{i+1}_{j},{\bm{U}}^{i+1}_{t})-\alpha(1/b)\nabla_{{\bm{U}}^{i+1}_{t}}\ell(\hat{\bm{x}}^{i+1}_{N},{\bm{U}}^{i+1}_{t}).\end{aligned}\qquad(17)$$

By Lemma 3, we have $D_{\alpha}({\bm{U}}^{i+1}_{t+1}|\mathcal{S}\,\|\,{\bm{U}}^{i+1}_{t+1}|\mathcal{S}^{\prime})\leq\epsilon_{p}/(i+2)$. There are only $(Tb)/N$ steps in which the information of ${\bm{x}}_{N}$ or $\hat{\bm{x}}_{N}$ is used. Replacing $\epsilon_{p}$ by $(N\epsilon_{p})/(Tb)$ and using the composition theorem, we obtain that after $T$ steps the output ${\bm{U}}^{i+1}_{T}$ satisfies $(\alpha,\epsilon_{p}/(i+2))$-RDP and ${\bm{w}}$ satisfies $(\alpha,\epsilon_{p}/(M+1))$-RDP. By Lemma 2, we can then establish the DP guarantee for Strategy I, as stated in Theorem 1. ∎

V-B Proof of Theorem 2

In this section, we will provide a proof for Theorem 2.

Proof.

Let $\phi={\rm BN}(\psi)$, where ${\rm BN}$ is the batch normalization operation and $\psi$ is an activation function. Due to the property of batch normalization, we assume the $\ell_{2}$ norm of $\phi$ is bounded by a positive constant $B$. We aim to show that the model in $\mathcal{F}_{2}$ guarantees training-data privacy given only black-box access to it. Consider training the model with two different datasets $S$ and $S^{\prime}$, and denote the resulting models as $f(\cdot|S)$ and $f(\cdot|S^{\prime})$, respectively. Next, we show that for appropriate $\gamma$ and $\pi$, $D_{\alpha}(f({\bm{x}}|S)\|f({\bm{x}}|S^{\prime}))<\epsilon_{p}$ for any input ${\bm{x}}$.

First, we consider the convolution layers, and let ${\rm Conv}({\bm{x}})_{i}$ be the $i$-th entry of the vectorized ${\rm Conv}({\bm{x}})$. Then we have ${\rm Conv}({\bm{x}})_{i}={\bm{x}}_{i}+\phi({\bm{U}}{\bm{x}})_{i}+\gamma\tilde{{\bm{x}}}_{i}\odot{\bm{n}}_{i}$. For the two training datasets, we denote

$${\rm Conv}({\bm{x}})_{i}|S={\bm{x}}_{i}+\phi({\bm{U}}_{1}{\bm{x}})_{i}+\gamma\tilde{{\bm{x}}}_{i}\odot{\bm{n}}_{i}\sim\mathcal{N}({\bm{x}}_{i}+\phi({\bm{U}}_{1}{\bm{x}})_{i},\gamma^{2}{\bm{X}}),$$

and

$${\rm Conv}({\bm{x}})_{i}|S^{\prime}={\bm{x}}_{i}+\phi({\bm{U}}_{2}{\bm{x}})_{i}+\gamma\tilde{{\bm{x}}}_{i}\odot{\bm{n}}_{i}\sim\mathcal{N}({\bm{x}}_{i}+\phi({\bm{U}}_{2}{\bm{x}})_{i},\gamma^{2}{\bm{X}}),$$

where ${\bm{X}}$ is a diagonal matrix whose $j$-th diagonal entry is the square of the $j$-th element of the vector $\tilde{\bm{x}}_{i}$.

Therefore, if $\gamma>(B/\eta)\sqrt{(2\alpha M)/\epsilon_{p}}$, we have

$$\begin{aligned}&D_{\alpha}\left(\mathcal{N}\left({\bm{x}}_{i}+\phi\left({\bm{U}}_{1}{\bm{x}}\right)_{i},\gamma^{2}\tilde{{\bm{x}}}^{2}_{i}\right)\middle\|\mathcal{N}\left({\bm{x}}_{i}+\phi\left({\bm{U}}_{2}{\bm{x}}\right)_{i},\gamma^{2}\tilde{{\bm{x}}}^{2}_{i}\right)\right)\\ &\leq\frac{\alpha(\phi({\bm{U}}_{1}{\bm{x}})_{i}-\phi({\bm{U}}_{2}{\bm{x}})_{i})^{2}}{2\gamma^{2}\eta^{2}}\leq\frac{4\alpha L^{2}B^{2}\|{\bm{x}}\|^{2}_{2}}{2\gamma^{2}\eta^{2}}\leq\frac{2\alpha B^{2}}{\gamma^{2}\eta^{2}}\leq\epsilon_{p}/M.\end{aligned}$$

Furthermore, $\gamma>(b/\eta)\sqrt{(2\alpha M)/\epsilon_{p}}$ guarantees $(\alpha,\epsilon_{p}/M)$-RDP for every convolution layer. For the last fully connected layer, if $\pi>a\sqrt{(2\alpha M)/\epsilon_{p}}$, we have

$$D_{\alpha}\left(\mathcal{N}\left({\bm{w}}_{1}^{\top}{\bm{x}},\pi^{2}\tilde{{\bm{x}}}^{2}_{i}\right)\middle\|\mathcal{N}\left({\bm{w}}_{2}^{\top}{\bm{x}},\pi^{2}\tilde{{\bm{x}}}^{2}_{i}\right)\right)=\frac{\alpha\left({\bm{w}}_{1}^{\top}{\bm{x}}-{\bm{w}}_{2}^{\top}{\bm{x}}\right)^{2}}{2\pi^{2}\|{\bm{x}}\|_{2}}\leq\frac{4\alpha a^{2}\|{\bm{x}}\|^{2}_{2}}{2\pi^{2}\|{\bm{x}}\|_{2}}\leq\frac{2\alpha a^{2}}{\pi^{2}}\leq\epsilon_{p}/M,$$

i.e., $\pi>a\sqrt{(2\alpha M)/\epsilon_{p}}$ guarantees that the fully connected layer is $(\alpha,\epsilon_{p}/M)$-RDP. According to Lemma 1, we have an $(\alpha,\epsilon_{p})$-RDP guarantee for the ResNet with $M$ residual mappings if $\gamma>(Lb/\eta)\sqrt{(2\alpha dM)/\epsilon_{p}}$ and $\pi>a\sqrt{(2\alpha M)/\epsilon_{p}}$. Let $\lambda\in(0,1)$; for any given $(\epsilon,\delta)$ pair, if $\epsilon_{p}\leq\lambda\epsilon$ and $-(\log\delta)/(\alpha-1)\leq(1-\lambda)\epsilon$, then we obtain the $(\epsilon,\delta)$-DP guarantee for the ResNet with $M$ residual mappings using residual perturbation Strategy II. ∎

V-C Proof of Theorem 3

In this section, we prove that the residual perturbation (8) can reduce the generalization error by computing the Rademacher complexity. Let us first recall some related lemmas on SDEs and Rademacher complexity.

V-C1 Some Lemmas

Let $\ell:V\times Y\rightarrow[0,B]$ be the loss function; we assume $\ell$ is bounded and $B$ is a positive constant. In addition, we denote the function class $\ell_{\mathcal{H}}=\{({\bm{x}},y)\rightarrow\ell(h({\bm{x}}),y):h\in\mathcal{H},({\bm{x}},y)\in X\times Y\}$. The goal of the learning problem is to find $h\in\mathcal{H}$ such that the population risk $R(h)=\mathbb{E}_{({\bm{x}},y)\sim\mathcal{D}}[\ell(h({\bm{x}}),y)]$ is minimized. The gap between the population risk and the empirical risk $R_{S_{N}}(h)=(1/N)\sum^{N}_{i=1}\ell(h({\bm{x}}_{i}),y_{i})$ is known as the generalization error. The following lemmas connect the population and empirical risks via the Rademacher complexity.

Lemma 4.

(Ledoux-Talagrand inequality) [70] Let $\mathcal{H}$ be a bounded real-valued function space and let $\phi:\mathbb{R}\rightarrow\mathbb{R}$ be Lipschitz with constant $L$ and $\phi(0)=0$. Then we have

$$\frac{1}{n}E_{\sigma}\left[\sup_{h\in\mathcal{H}}\sum_{i=1}^{n}\sigma_{i}\phi\left(h\left(x_{i}\right)\right)\right]\leq\frac{L}{n}E_{\sigma}\left[\sup_{h\in\mathcal{H}}\sum_{i=1}^{n}\sigma_{i}h\left(x_{i}\right)\right].$$
Lemma 5.

[64] Let $S_{N}=\{\left({\bm{x}}_{1},y_{1}\right),\cdots,\left({\bm{x}}_{N},y_{N}\right)\}$ be samples chosen i.i.d. according to the distribution $\mathcal{D}$, and let the loss function $\ell$ be bounded by $B>0$. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following holds for all $h\in\mathcal{H}$:

$$R\left(h\right)\leq R_{S_{N}}\left(h\right)+2BR_{S_{N}}\left(\ell_{\mathcal{H}}\right)+3B\sqrt{\log(2/\delta)/(2N)}.$$

In addition, according to the Ledoux-Talagrand inequality, and assuming the loss function is $L$-Lipschitz, we have

$$R_{S_{N}}\left(\ell_{\mathcal{H}}\right)\leq LR_{S_{N}}\left(\mathcal{H}\right).$$

Thus the population risk can be bounded by the empirical risk and the Rademacher complexity of the function class $\mathcal{H}$. Since we cannot minimize the population risk directly, we minimize it indirectly by minimizing the empirical risk and the Rademacher complexity of $\mathcal{H}$. Next, we further discuss the Rademacher complexity of the function class $\mathcal{H}$; we first introduce several lemmas.

Lemma 6.

[71] For any given matrix ${\bm{U}}$, the solution to the equation

$$d{\bm{x}}(t)={\bm{U}}{\bm{x}}(t)dt;\quad{\bm{x}}(0)=\hat{{\bm{x}}},$$

has the following expression:

$${\bm{x}}(t)=\exp({\bm{U}}t)\hat{{\bm{x}}}.$$

Also, we can write the solution to the following equation

$$d{\bm{y}}(t)={\bm{U}}{\bm{y}}(t)dt+\gamma{\bm{y}}(t)dB(t);\quad{\bm{y}}(0)=\hat{{\bm{x}}},$$

as

$${\bm{y}}(t)=\exp\left({\bm{U}}t-\frac{1}{2}\gamma^{2}t{\bm{I}}+\gamma B(t){\bm{I}}\right)\hat{{\bm{x}}}.$$

Obviously, we have ${\mathbb{E}}[{\bm{y}}(t)]={\bm{x}}(t)$.
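The identity ${\mathbb{E}}[{\bm{y}}(t)]={\bm{x}}(t)$ can be checked numerically in the scalar case using the closed-form solution above; the Monte Carlo sketch below is only a sanity check with arbitrary constants.

import numpy as np

rng = np.random.default_rng(0)
U, gamma, T, x0, n_paths = 0.3, 0.8, 1.0, 2.0, 200_000

B_T = rng.normal(0.0, np.sqrt(T), size=n_paths)                 # B(T) ~ N(0, T)
y_T = np.exp(U * T - 0.5 * gamma ** 2 * T + gamma * B_T) * x0   # exact SDE solution
print(y_T.mean(), np.exp(U * T) * x0)                           # the two values agree closely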

Lemma 7.

For any circulant matrix ${\mathbf{C}}$ with the first row being $(a_{1},a_{2},\cdots,a_{d})$, we have the eigen-decomposition $\Psi^{H}{\mathbf{C}}\Psi={\rm diag}(\lambda_{1},\dots,\lambda_{d})$, where

$$\sqrt{d}\,\Psi=\begin{pmatrix}1&1&\dots&1\\ 1&m_{1}&\dots&m_{d-1}\\ \vdots&\vdots&\ddots&\vdots\\ 1&m^{d-1}_{1}&\dots&m^{d-1}_{d-1}\end{pmatrix},$$

and the $m_{i}$'s are the roots of unity and $\lambda_{i}=a_{1}+a_{2}m_{i}+\dots+a_{d}m^{d-1}_{i}$.

V-C2 The proof of Theorem 3

Proof.

By the definition of Rademacher complexity (Def. 2), we have

\begin{align*}
R_{S_{N}}\left(\mathcal{F}\right) &=\left(1/N\right)\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{N}\sigma_{i}{\bm{w}}{\bm{x}}^{p}_{i}(T)\right]\\
&=\left(c/N\right)\mathbb{E}_{\sigma}\left[\sup_{\|{\bm{U}}\|_{2}\leq c}\Big\|\sum_{i=1}^{N}\sigma_{i}{\bm{x}}^{p}_{i}\left(T\right)\Big\|_{2}\right].
\end{align*}

Let ${\bm{u}}_{i}=\Psi{\bm{x}}^{p}_{i}$ and denote the $j$-th element of ${\bm{u}}_{i}$ by $u_{i,j}$. Then, by Lemma 7, we have

\begin{align*}
R_{S_{N}}\left(\mathcal{F}\right)
&=\left(c/N\right)\mathbb{E}_{\sigma}\sup_{\|{\bm{U}}\|_{2}\leq c}\Big(\sum_{i,j}\sigma_{i}\sigma_{j}\langle{\bm{x}}^{p}_{i}(T),{\bm{x}}^{p}_{j}(T)\rangle\Big)^{1/2}\\
&=\left(c/N\right)\mathbb{E}_{\sigma}\sup_{|\lambda_{i}|\leq c}\Big\|\Psi^{H}\sum_{i=1}^{N}\sigma_{i}\big(u_{i,1}\exp(\lambda_{1}Tp),\cdots,u_{i,d}\exp(\lambda_{d}Tp)\big)^{\top}\Big\|_{2}\\
&=\left(c/N\right)\mathbb{E}_{\sigma}\sup_{|\lambda_{i}|\leq c}\Big\{\sum_{j=1}^{d}\Big[\sum_{i=1}^{N}\sigma_{i}u_{i,j}\exp\left(\lambda_{j}Tp\right)\Big]^{2}\Big\}^{1/2}\\
&=\left(c/N\right)\exp\left(cTp\right)\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^{N}\sigma_{i}{\bm{u}}_{i}\Big\|_{2}\\
&=\left(c/N\right)\exp\left(cTp\right)\mathbb{E}_{\sigma}\Big\|\Psi\sum_{i=1}^{N}\sigma_{i}{\bm{x}}^{p}_{i}\Big\|_{2}\\
&=\left(c/N\right)\exp\left(cTp\right)\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^{N}\sigma_{i}{\bm{x}}^{p}_{i}\Big\|_{2}.
\end{align*}

Note that $\mathbb{E}\left({\bm{w}}{\bm{x}}^{p}_{i}(T)\right)={\bm{w}}\mathbb{E}\left({\bm{x}}^{p}_{i}(T)\right)$. According to Lemma 6, and similarly to the derivation for the function class $\mathcal{F}$, we have

\begin{align*}
R_{S_{N}}\left(\mathcal{G}\right)
&=\left(c/N\right)\mathbb{E}_{\sigma}\sup_{\|{\bm{U}}\|_{2}\leq c}\Big(\sum_{i,j}\sigma_{i}\sigma_{j}\langle\mathbb{E}{\bm{x}}^{p}_{i}(T),\mathbb{E}{\bm{x}}^{p}_{j}(T)\rangle\Big)^{1/2}\\
&=\left(c/N\right)\mathbb{E}_{\sigma}\sup_{|\lambda_{i}|\leq c}\Big\|\Psi^{H}\sum_{i=1}^{N}\sigma_{i}\big(u_{i,1}\exp(\lambda_{1}Tp-p(1-p)\gamma^{2}T/2),\\
&\hspace{7.5em}\cdots,u_{i,d}\exp(\lambda_{d}Tp-p(1-p)\gamma^{2}T/2)\big)^{\top}\Big\|_{2}\\
&=\left(c/N\right)\mathbb{E}_{\sigma}\sup_{|\lambda_{i}|\leq c}\Big\{\sum_{j=1}^{d}\Big[\sum_{i=1}^{N}\sigma_{i}u_{i,j}\exp\big(\lambda_{j}Tp-p(1-p)\gamma^{2}T/2\big)\Big]^{2}\Big\}^{1/2}\\
&=\left(c/N\right)\exp\big(cTp-p(1-p)\gamma^{2}T/2\big)\mathbb{E}_{\sigma}\Big\{\sum_{j=1}^{d}\Big[\sum_{i=1}^{N}\sigma_{i}u_{i,j}\Big]^{2}\Big\}^{1/2}\\
&=\left(c/N\right)\exp\big(cTp-p(1-p)\gamma^{2}T/2\big)\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^{N}\sigma_{i}{\bm{u}}_{i}\Big\|_{2}\\
&=\left(c/N\right)\exp\big(cTp-p(1-p)\gamma^{2}T/2\big)\mathbb{E}_{\sigma}\Big\|\Psi\sum_{i=1}^{N}\sigma_{i}{\bm{x}}^{p}_{i}\Big\|_{2}\\
&=\left(c/N\right)\exp\big(cTp-p(1-p)\gamma^{2}T/2\big)\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^{N}\sigma_{i}{\bm{x}}^{p}_{i}\Big\|_{2}\\
&<R_{S_{N}}\left(\mathcal{F}\right).
\end{align*}

Therefore, $R_{S_{N}}(\mathcal{G})<R_{S_{N}}(\mathcal{F})$; that is, the ensemble of Gaussian-noise-injected ResNets has a smaller Rademacher complexity, and hence a smaller generalization error bound, than the standard ResNet. ∎
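The two bounds above differ by the factor $\exp(-p(1-p)\gamma^{2}T/2)\leq 1$. For the illustrative (hypothetical) choice $p=1/2$, $\gamma=2$, and $T=1$,
\[
\exp\big(-p(1-p)\gamma^{2}T/2\big)=\exp\big(-\tfrac{1}{4}\cdot 4\cdot\tfrac{1}{2}\big)=\exp(-1/2)\approx 0.61,
\]
i.e., the Rademacher complexity bound of the noise-injected ensemble is roughly 40% smaller than that of the standard ResNet in this example.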

VI Experiments on Strategy II

VI-A Forward and Backward Propagation Using (7)

Fig. 8 plots the forward and backward propagation of an image using the SDE model (7). Again, we cannot recover the input by simply applying the backward Euler-Maruyama discretization to the features generated by the forward Euler-Maruyama discretization.

Figure 8: Illustrations of the forward and backward propagation of the training data using the SDE model (7). (a) ${\bm{x}}(1)$ (SDE): features of the original image generated by forward propagation through the SDE; (b) $\tilde{{\bm{x}}}(0)$ (SDE): image recovered by reverse-engineering the features shown in (a).
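To make this concrete, the following minimal sketch (ours; it uses a hypothetical scalar velocity field $\tanh$ in place of $F({\bm{x}},{\bm{W}})$ and arbitrary parameters) forward-propagates a feature vector with the Euler-Maruyama scheme and then attempts the naive backward pass. Because an adversary does not know the realized Brownian path, re-drawing the noise in the backward pass fails to recover the input.

import numpy as np

# Forward Euler-Maruyama propagation of dx = f(x) dt + gamma x dB, followed by
# a naive backward pass that redraws the noise (the adversary does not know the
# Brownian path used in the forward pass). Illustrative sketch only.
rng = np.random.default_rng(0)
f = np.tanh                          # hypothetical stand-in for F(x(t), W(t))
gamma, T, n_steps = 1.0, 1.0, 100
dt = T / n_steps

x0 = rng.standard_normal(64)         # "input features"
x = x0.copy()
for _ in range(n_steps):             # forward pass
    x = x + f(x) * dt + gamma * x * np.sqrt(dt) * rng.standard_normal(x.shape)

x_rev = x.copy()
for _ in range(n_steps):             # naive reversal with freshly drawn noise
    x_rev = x_rev - f(x_rev) * dt - gamma * x_rev * np.sqrt(dt) * rng.standard_normal(x_rev.shape)

print(np.linalg.norm(x_rev - x0) / np.linalg.norm(x0))  # large relative reconstruction error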

VI-A1 Experiments on the IDC Dataset

In this subsection, we evaluate the performance of residual perturbation (Strategy II) in protecting membership privacy while retaining classification accuracy on the IDC dataset. We use the same ResNet models as those used for the first residual perturbation (6). The results, listed in Fig. 9, confirm that residual perturbation (8) can effectively protect data privacy while maintaining, or even improving, the classification accuracy. In addition, we depict the ROC curve for this experiment in Fig. 12(c). We note that, as the noise coefficient increases, the gap between training and testing accuracies narrows, which is consistent with Theorem 3.

Figure 9: Performance of residual perturbation (8) (Strategy II) for En5ResNet8 with different noise coefficients ($\gamma$) and membership inference attack thresholds on the IDC dataset. Residual perturbation significantly improves membership privacy and reduces the generalization gap. $\gamma=0$ corresponds to the baseline ResNet8. (Unit: %)
Residual perturbation vs. DPSGD

We have shown that the first residual perturbation (6) outperforms DPSGD in protecting membership privacy and improving classification accuracy. In this part, we further show that the second residual perturbation (8) also outperforms the benchmark DPSGD under the above settings. Table V lists the AUC of the attack model and the training and test accuracies of the target model; we see that the second residual perturbation also improves classification accuracy while providing comparable membership privacy.

TABLE V: Residual perturbation (8) (Strategy II) vs. DPSGD in training ResNet8 for IDC dataset classification. The ensemble of ResNet8 with residual perturbation is more accurate (higher test accuracy) and provides membership privacy comparable to DPSGD.
Model AUC Training Acc Test Acc
ResNet8 (DPSGD) 0.503 0.831 0.828
En1ResNet8 0.509 0.880 0.868
En5ResNet8 0.500 0.895 0.872
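For reference, the AUC values reported above come from a threshold-based membership inference attack. The following minimal sketch (ours; the confidence scores are synthetic placeholders, not the paper's measurements) shows how such an attack's AUC can be computed from per-example confidences of members and non-members; an AUC near 0.5 means the attack does no better than random guessing.

import numpy as np
from sklearn.metrics import roc_auc_score

# Threshold-based membership inference: score each example by the target
# model's confidence on its true class and measure how well a threshold
# separates training members from held-out non-members.
def attack_auc(conf_members, conf_nonmembers):
    scores = np.concatenate([conf_members, conf_nonmembers])
    labels = np.concatenate([np.ones(len(conf_members)), np.zeros(len(conf_nonmembers))])
    return roc_auc_score(labels, scores)

# Synthetic confidences for illustration only.
rng = np.random.default_rng(0)
conf_train = np.clip(rng.normal(0.90, 0.08, 5000), 0.0, 1.0)  # members
conf_test = np.clip(rng.normal(0.80, 0.15, 5000), 0.0, 1.0)   # non-members
print(attack_auc(conf_train, conf_test))                      # > 0.5 indicates membership leakage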

VI-A2 Experiments on the CIFAR10/CIFAR100 Datasets

In this subsection, we test the second residual perturbation (8) on the CIFAR10/CIFAR100 datasets with the same model and settings as before. Fig. 10 plots the performance of En5ResNet8 on the CIFAR10 dataset. These results show that the ensemble of ResNets with residual perturbation (8) is significantly more robust to the membership inference attack. For instance, the AUC of the attack model for ResNet8 and En5ResNet8 ($\gamma=2.0$) is $0.757$ and $0.526$, respectively. Also, the classification accuracy of En5ResNet8 ($\gamma=2.0$) is higher than that of ResNet8: $71.2$% versus $66.9$% on CIFAR10. Fig. 11 shows the results of En5ResNet8 for CIFAR100 classification. These results again confirm that residual perturbation (8) can protect membership privacy and improve classification accuracy.

Figure 10: Performance of En5ResNet8 with residual perturbation (8) (Strategy II) using different noise coefficients ($\gamma$) and membership inference attack thresholds on CIFAR10. Residual perturbation (8) not only enhances membership privacy but also improves classification accuracy. $\gamma=0$ corresponds to the baseline ResNet8 without residual perturbation or model ensemble. (Unit: %)
Figure 11: Performance of En5ResNet8 with residual perturbation (8) (Strategy II) using different noise coefficients ($\gamma$) and membership inference attack thresholds on CIFAR100. Again, residual perturbation (8) not only enhances membership privacy but also improves classification accuracy. $\gamma=0$ corresponds to the baseline ResNet8 without residual perturbation or model ensemble. (Unit: %)

VI-A3 ROC Curves for the Experiments on Different Datasets

Fig. 12 plots the ROC curves for the experiments on different datasets using Strategy II. These ROC curves again show that, when $\gamma$ is sufficiently large, the attack model's prediction is close to a random guess.

Figure 12: ROC curves for residual perturbation (8) (Strategy II) on different datasets: (a) CIFAR10 (En5ResNet8), (b) CIFAR100 (En5ResNet8), (c) IDC (En5ResNet8). (nc: noise coefficient)

VII More Experimental Details

We give the detailed construction of the velocity field $F({\bm{x}}(t),{\bm{W}}(t))$ in (3) and (7), which is used to generate Figs. 1 and 8, in Algorithm 1.

Algorithm 1 The expression of $F({\bm{x}}(t),{\bm{W}}(t))$
Input: image $={\bm{x}}(t)$; rows, cols, channels = image.shape.
Output: $F({\bm{x}}(t),{\bm{W}}(t))=$ noise-injected image (NII).
for k in range(channels) do
    for i in range(rows) do
        for j in range(cols) do
            offset(j) = int((j + 50.0*cos(2*pi*i/180)) % cols);
            offset(i) = int((i + 50.0*sin(2*pi*i/180)) % rows);
            NII[i, j, k] = image[(i + offset(j)) % rows, (j + offset(i)) % cols, k];
return NII
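For completeness, the following NumPy sketch gives our vectorized reading of Algorithm 1; the broadcasting of the row-dependent offsets and the order of the modulo and integer cast are our assumptions where the pseudocode is ambiguous.

import numpy as np

# Vectorized reading of Algorithm 1 (our interpretation of the pseudocode):
# F(x(t), W(t)) permutes pixels by fixed sinusoidal offsets of amplitude 50.
def velocity_field(image):
    rows, cols, _ = image.shape
    i = np.arange(rows)[:, None]                               # row index, broadcast over columns
    j = np.arange(cols)[None, :]                               # column index, broadcast over rows
    offset_j = ((j + 50.0 * np.cos(2 * np.pi * i / 180)) % cols).astype(int)
    offset_i = ((i + 50.0 * np.sin(2 * np.pi * i / 180)) % rows).astype(int)
    src_r = (i + offset_j) % rows                              # source row of each target pixel
    src_c = (j + offset_i) % cols                              # source column of each target pixel
    return image[src_r, src_c, :]                              # noise-injected image (NII)

# Usage: F = velocity_field(x_t) for an (rows, cols, channels) array x_t.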

VIII Concluding Remarks

We proposed residual perturbations, whose theoretical foundation lies in the theory of stochastic differential equations, to protect data privacy in deep learning. Theoretically, we proved that residual perturbation guarantees differential privacy and reduces the generalization gap. Numerically, we showed that residual perturbations effectively protect membership privacy on several benchmark datasets. The proposed residual perturbation provides a feasible avenue for developing machine learning models that are more private and accurate than the baseline approaches. Nevertheless, the current theoretical differential privacy budget is far from tight; developing analytical tools for establishing a tighter privacy budget is an interesting future direction.

References

  • [1] M.-C. Yuen, I. King, and K.-S. Leung, “A Survey of Crowdsourcing Systems,” in Proceedings of the IEEE International Conference on Social Computing (SocialCom 2011), 2011.
  • [2] W. Feng, Z. Yan, H. Zhang, K. Zeng, Y. Xiao, and Y. Hou, “A Survey on Security, Privacy and Trust in Mobile Crowdsourcing,” IEEE Internet of Things Journal, 2017.
  • [3] Y. Liu, K. Gadepalli, M. Norouzi, G. E. Dahl, and M. C. Stumpe, “Detecting Cancer Metastases on Gigapixel Pathology Images,” arXiv:1703.02442, 2017.
  • [4] M. Fredrikson, S. Jha, and T. Ristenpart, “Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures,” in 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS 2015), 2015.
  • [5] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” in 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18, IEEE, 2017.
  • [6] A. Salem, Y. Zhang, M. Humbert, P. Berrang, M. Fritz, and M. Backes, “Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models,” arXiv preprint arXiv:1806.01246, 2018.
  • [7] S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha, “Privacy risk in machine learning: Analyzing the connection to overfitting,” in 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 268–282, IEEE, 2018.
  • [8] A. Sablayrolles, M. Douze, C. Schmid, and H. Jégou, “Déjà vu: An empirical evaluation of the memorization properties of convnets,” arXiv preprint arXiv:1809.06396, 2018.
  • [9] M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page, and T. Ristenpart, “Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing,” in 23rd USENIX Security Symposium (USENIX Security 14), pp. 17–32, 2014.
  • [10] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Stealing machine learning models via prediction APIs,” in 25th USENIX Security Symposium (USENIX Security 16), pp. 601–618, 2016.
  • [11] N. Z. Gong and B. Liu, “Attribute inference attacks in online social networks,” ACM Transactions on Privacy and Security (TOPS), vol. 21, no. 1, pp. 1–30, 2018.
  • [12] M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322–1333, 2015.
  • [13] M. Al-Rubaie and J. M. Chang, “Reconstruction attacks against mobile-based continuous authentication systems in the cloud,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 12, pp. 2648–2663, 2016.
  • [14] N. Z. Gong and B. Liu, “You are who you know and how you behave: Attribute inference attacks via users’ social friends and behaviors,” in 25th USENIX Security Symposium (USENIX Security 16), pp. 979–995, 2016.
  • [15] X. Zheng, Z. Cai, and Y. Li, “Data linkage in smart internet of things systems: A consideration from a privacy perspective,” IEEE Communications Magazine, vol. 56, no. 9, pp. 55–61, 2018.
  • [16] Y. Lindell and B. Pinkas, “Privacy preserving data mining,” in Annual International Cryptology Conference, pp. 36–54, Springer, 2000.
  • [17] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar, “Can machine learning be secure?,” in Proceedings of the 2006 ACM Symposium on Information, computer and communications security, pp. 16–25, 2006.
  • [18] E. Hesamifard, H. Takabi, M. Ghasemi, and R. N. Wright, “Privacy-preserving machine learning as a service,” Proceedings on Privacy Enhancing Technologies, vol. 2018, no. 3, pp. 123–142, 2018.
  • [19] H. Bae, D. Jung, and S. Yoon, “Anomigan: Generative adversarial networks for anonymizing private medical data,” arXiv preprint arXiv:1901.11313, 2019.
  • [20] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of cryptography conference, pp. 265–284, Springer, 2006.
  • [21] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al., “Communication-efficient learning of deep networks from decentralized data,” arXiv preprint arXiv:1602.05629, 2016.
  • [22] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, 2016.
  • [23] L. Sweeney, “k-anonymity: A model for protecting privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557–570, 2002.
  • [24] K. El Emam and F. K. Dankar, “Protecting privacy using k-anonymity,” Journal of the American Medical Informatics Association, vol. 15, no. 5, pp. 627–637, 2008.
  • [25] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, “Differentially private empirical risk minimization,” Journal of Machine Learning Research, vol. 12, no. Mar, pp. 1069–1109, 2011.
  • [26] R. Bassily, A. Smith, and A. Thakurta, “Private empirical risk minimization: Efficient algorithms and tight error bounds,” in 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pp. 464–473, IEEE, 2014.
  • [27] R. Shokri and V. Shmatikov, “Privacy-Preserving Deep Learning,” in 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS 2015), 2015.
  • [28] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318, 2016.
  • [29] E. Bagdasaryan, O. Poursaeed, and V. Shmatikov, “Differential privacy has disparate impact on model accuracy,” in Advances in Neural Information Processing Systems, pp. 15453–15462, 2019.
  • [30] L. Zhu, Z. Liu, and S. Han, “Deep leakage from gradients,” in Advances in Neural Information Processing Systems, pp. 14747–14756, 2019.
  • [31] Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, and H. Qi, “Beyond inferring class representatives: User-level privacy leakage from federated learning,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pp. 2512–2520, IEEE, 2019.
  • [32] M. Abadi, A. Chu, I. Goodfellow, H. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep Learning with Differential Privacy,” in 23rd ACM Conference on Computer and Communications Security (CCS 2016), 2016.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • [34] N. Papernot, M. Abadi, L. Erlingsson, I. Goodfellow, and K. Talwar, “Semisupervised Knowledge Transfer for Deep Learning from Private Training Data,” in 5th International Conference on Learning Representation (ICLR 2017), 2017.
  • [35] N. Papernot, S. Song, I. Mironov, A. Raghunathan, and L. Erlingsson, “Scalable Private Learning with PATE,” in International Conference on Learning Representations (ICLR 2018), 2018.
  • [36] I. Mironov, “Rényi differential privacy,” in 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263–275, IEEE, 2017.
  • [37] Y. X. Wang, B. Balle, and S. Kasiviswanathan, “Subsampled Rényi Differential Privacy and Analytical Moments Accountant,” arXiv preprint arXiv:1808.00087, 2018.
  • [38] J. Dong, A. Roth, and W. J. Su, “Gaussian differential privacy,” arXiv preprint arXiv:1905.02383, 2019.
  • [39] B. Wang, Q. Gu, M. Boedihardjo, F. Barekat, and S. J. Osher, “Dp-lssgd: A stochastic optimization method to lift the utility in privacy-preserving erm,” arXiv preprint arXiv:1906.12056, 2019.
  • [40] Z. Liang, B. Wang, Q. Gu, S. Osher, and Y. Yao, “Exploring private federated learning with laplacian smoothing,” arXiv preprint arXiv:2005.00218, 2020.
  • [41] O. Kotevska, F. Alamudun, and C. Stanley, “Optimal balance of privacy and utility with differential privacy deep learning frameworks,” in 2021 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 425–430, 2021.
  • [42] M. Jagielski, J. Ullman, and A. Oprea, “Auditing differentially private machine learning: How private is private sgd?,” Advances in Neural Information Processing Systems, vol. 33, pp. 22205–22216, 2020.
  • [43] I. Mironov, K. Talwar, and L. Zhang, “Rényi differential privacy of the sampled Gaussian mechanism,” arXiv preprint arXiv:1908.10530, 2019.
  • [44] L. Yu, L. Liu, C. Pu, M. E. Gursoy, and S. Truex, “Differentially private model publishing for deep learning,” in 2019 IEEE symposium on security and privacy (SP), pp. 332–349, IEEE, 2019.
  • [45] H. Zhong and K. Bu, “Privacy-utility trade-off,” arXiv preprint arXiv:2204.12057, 2022.
  • [46] K. Mivule, C. Turner, and S.-Y. Ji, “Towards a differential privacy and utility preserving machine learning classifier,” Procedia Computer Science, vol. 12, pp. 176–181, 2012.
  • [47] M. Showkatbakhsh, C. Karakus, and S. Diggavi, “Privacy-utility trade-off of linear regression under random projections and additive noise,” in 2018 IEEE International Symposium on Information Theory (ISIT), pp. 186–190, IEEE, 2018.
  • [48] A. S. Rakin, Z. He, and D. Fan, “Parametric noise injection: Trainable randomness to improve deep neural network robustness against adversarial attack,” arXiv preprint arXiv:1811.09310, 2018.
  • [49] B. Wang, Z. Shi, and S. Osher, “Resnets ensemble via the feynman-kac formalism to improve natural and robust accuracies,” in Advances in Neural Information Processing Systems, pp. 1655–1665, 2019.
  • [50] X. Liu, T. Xiao, S. Si, Q. Cao, S. Kumar, and C.-J. Hsieh, “Neural SDE: Stabilizing neural ODE networks with stochastic noise,” arXiv preprint arXiv:1906.02355, 2019.
  • [51] D. Su, H. Zhang, H. Chen, J. Yi, P.-Y. Chen, and Y. Gao, “Is robustness the cost of accuracy?–a comprehensive study on the robustness of 18 deep image classification models,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 631–648, 2018.
  • [52] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry, “Robustness may be at odds with accuracy,” in International Conference on Learning Representations, 2019.
  • [53] D. Stutz, M. Hein, and B. Schiele, “Disentangling adversarial robustness and generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6976–6987, 2019.
  • [54] J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow, “Adversarial spheres,” arXiv preprint arXiv:1801.02774, 2018.
  • [55] Z. Zhao, D. Dua, and S. Singh, “Generating natural adversarial examples,” in International Conference on Learning Representations, 2018.
  • [56] A. Raghunathan, S. M. Xie, F. Yang, J. Duchi, and P. Liang, “Understanding and mitigating the tradeoff between robustness and accuracy,” in Proceedings of the 37th International Conference on Machine Learning (H. D. III and A. Singh, eds.), vol. 119 of Proceedings of Machine Learning Research, pp. 7909–7919, PMLR, 13–18 Jul 2020.
  • [57] Y. Carmon, A. Raghunathan, L. Schmidt, P. Liang, and J. C. Duchi, “Unlabeled data improves adversarial robustness,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 11192–11203, 2019.
  • [58] A. Najafi, S.-i. Maeda, M. Koyama, and T. Miyato, “Robustness to adversarial perturbations in learning from incomplete data,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 5541–5551, 2019.
  • [59] J.-B. Alayrac, J. Uesato, P.-S. Huang, A. Fawzi, R. Stanforth, and P. Kohli, “Are labels required for improving adversarial robustness?,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [60] D. J. Higham, “An algorithmic introduction to numerical simulation of stochastic differential equations,” SIAM review, vol. 43, no. 3, pp. 525–546, 2001.
  • [61] C. Dwork and V. Feldman, “Privacy-preserving prediction,” in Conference On Learning Theory, pp. 1693–1702, PMLR, 2018.
  • [62] R. Bassily, O. Thakkar, and A. Thakurta, “Model-agnostic private learning via stability,” arXiv preprint arXiv:1803.05101, 2018.
  • [63] B. Wu, S. Zhao, G. Sun, X. Zhang, Z. Su, C. Zeng, and Z. Liu, “P3sgd: Patient privacy preserving sgd for regularizing deep cnns in pathological image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2099–2108, 2019.
  • [64] P. L. Bartlett and S. Mendelson, “Rademacher and Gaussian complexities: Risk bounds and structural results,” Journal of Machine Learning Research, 2002.
  • [65] A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
  • [66] “Invasive ductal carcinoma (idc) histology image dataset,”
  • [67] B. Wu, C. Chen, S. Zhao, C. Chen, Y. Yao, G. Sun, L. Wang, X. Zhang, and J. Zhou, “Characterizing membership privacy in stochastic gradient langevin dynamics,” arXiv preprint arXiv:1910.02249, 2019.
  • [68] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [69] H. B. McMahan, G. Andrew, U. Erlingsson, S. Chien, I. Mironov, N. Papernot, and P. Kairouz, “A general approach to adding differential privacy to iterative training procedures,” arXiv preprint arXiv:1812.06210, 2018.
  • [70] M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes. Springer, 2002.
  • [71] F. C. Klebaner, Introduction to stochastic calculus with applications. Springer, 2005.
Wenqi Tao is a Ph.D. student in mathematics at Tsinghua University. His research focuses on graph-based machine learning and deep learning.
Huaming Ling is a Ph.D. student in mathematics at Tsinghua University. His research focuses on deep learning and stochastic optimization algorithms.
Zuoqiang Shi received Ph.D. in 2008 from Tsinghua University. He is an associate professor of mathematics at Tsinghua University.
Bao Wang received Ph.D. in 2016 from Michigan State University. He is an assistant professor of mathematics at the University of Utah. He is a recipient of the Chancellor’s award for postdoc research at UCLA. His research interests include scientific computing and deep learning.