
Deep Learning with Data Privacy via Residual Perturbation

Wenqi Tao*, Huaming Ling*, Zuoqiang Shi, and Bao Wang

B. Wang is with the Department of Mathematics and Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT 84112, USA. E-mail: [email protected]. W. Tao, H. Ling, and Z. Shi are with the Department of Mathematics, Tsinghua University, Beijing 100084. E-mail: [email protected], [email protected], [email protected]. Wenqi Tao and Huaming Ling contributed equally.
Abstract

Protecting data privacy in deep learning (DL) is of crucial importance. Several celebrated privacy notions have been established and used for privacy-preserving DL. However, many existing mechanisms achieve privacy at the cost of significant utility degradation and computational overhead. In this paper, we propose a stochastic differential equation-based residual perturbation for privacy-preserving DL, which injects Gaussian noise into each residual mapping of ResNets. Theoretically, we prove that residual perturbation guarantees differential privacy (DP) and reduces the generalization gap of DL. Empirically, we show that residual perturbation is computationally efficient and outperforms the state-of-the-art differentially private stochastic gradient descent (DPSGD) in utility maintenance without sacrificing membership privacy.

Index Terms:
deep learning, data privacy, stochastic differential equations

I Introduction

Many high-capacity deep neural networks (DNNs) are trained with private data, including medical images and financial transaction data [1, 2, 3]. DNNs usually overfit and can memorize the private training data, which leaves trained DNNs exposed to privacy leakage [4, 5, 6, 7, 8]. Given a pre-trained DNN, the membership inference attack can determine whether an instance is in the training set based on the DNN's response [9, 5, 6]; the model extraction attack can learn a surrogate model that matches the target model, given only black-box access to the target model [10, 11]; the model inversion attack can infer certain features of a given input from the output of a target model [12, 13]; and the attribute inference attack can deanonymize anonymized data [14, 15].

Machine learning (ML) with data privacy is crucial in many applications [16, 17, 18, 19]. Several algorithms have been developed to reduce privacy leakage, including differential privacy (DP) [20], federated learning (FL) [21, 22], and $k$-anonymity [23, 24]. Objective, output, and gradient perturbations are popular approaches for ML with a DP guarantee [25, 26, 27, 28, 29]. FL trains centralized ML models through gradient exchange, with the training data distributed on edge devices; however, the gradient exchange can still leak privacy [30, 31]. Most existing mechanisms achieve privacy at a tremendous sacrifice of utility. Moreover, training ML models with the state-of-the-art DP stochastic gradient descent (DPSGD) incurs a tremendous computational cost due to the requirement of computing and clipping the per-sample gradient [32]. It remains of great interest to develop new privacy-preserving ML algorithms that avoid excessive computational overhead and do not degrade the ML models' utility.

I-A Our Contribution

We propose residual perturbation for privacy-preserving deep learning (DL). At the core of residual perturbation is injecting Gaussian noise into each residual mapping [33]; the method is theoretically motivated by stochastic differential equation (SDE) theory. The major advantages of residual perturbation are summarized below:

  • It guarantees DP and can protect the membership privacy of the training data almost perfectly without sacrificing ResNets' utility; it can even improve classification accuracy.

  • It has fewer hyperparameters to tune than DPSGD. Moreover, it is computationally much more efficient than DPSGD; the latter is computationally prohibitive due to the requirement of computing the per-sample gradient.

I-B Related Work

Improving the utility of ML models with DP guarantees is an important task. PATE [34, 35] uses semi-supervised learning accompanied by model transfer between the “student” and “teacher” models to enhance utility. Several variants of the DP notions have also been proposed to improve the theoretical privacy budget and sometimes can also improve the resulting model’s utility  [28, 36, 37, 38]. Some post-processing techniques have also been developed to improve the utility of ML models with negligible computational overhead [39, 40].

The tradeoff between privacy and utility of ML models has been studied in several different contexts. In [41], the authors identify the factors that impact the optimal tradeoff of accuracy and privacy of differentially private DL models by empirically evaluating the commonly used privacy-preserving ML libraries, including PyTorch Opacus and Tensorflow privacy. The authors of [42] investigate whether DPSGD offers better privacy in practice than what is guaranteed by its state-of-the-art theoretical privacy analysis — by taking a quantitative, empirical approach to auditing the privacy afforded by specific implementations of differentially private algorithms. In particular, they estimate an empirical DP lower bound under the data poisoning and membership inference attacks. They further compare this empirical lower bound with the DP budget obtained by the state-of-the-art analytic tools [28, 38, 43, 44]. The approach presented in [42] can demonstrate if a given algorithm with a given DP budget is sufficiently private and provides empirical evidence to guide the development of theoretical DP analysis. The privacy-utility tradeoff for different privacy measures in both global and local privacy has been studied in [45], aiming to determine the minimal privacy loss at a fixed utility expense. The paper [46] investigates a DP ensemble classifier approach that seeks to preserve data privacy while maintaining an acceptable level of utility. For the linear regression model, the privacy-utility tradeoff has been studied by enforcing the privacy of the input data via random projection and additive noise [47].

Gaussian noise injection in residual learning has been used to improve the robustness of ResNets [48, 49, 50]. In this paper, we inject Gaussian noise into each residual mapping to achieve data privacy, and even improved generalization, rather than adversarial robustness. Several related studies in the literature focus on disentangling adversarial robustness and generalization. Early hypotheses state that there exists an inherent tradeoff between robustness and accuracy, i.e., an ML model cannot be both more robust and more accurate than others; these hypotheses rest on comparisons of models with different architectures or training strategies [51] and on theoretical arguments for a toy dataset [52]. In [53], the authors disentangle adversarial robustness and generalization by leveraging the low-dimensional manifold assumption of the data. In particular, through a detailed analysis of regular and on-manifold [54, 55] adversarial examples of the same type as the given class (adversarial examples that leave or stay in the manifold of the given class of data, respectively), the authors of [53] conclude that ML models can be both robust and accurate. In particular, on-manifold robustness is essentially the generalization of the ML model, which can be improved via on-manifold adversarial training. In [56], the authors precisely characterize the effects of data augmentation on the standard error (corresponding to the natural accuracy of the model) in linear regression when the optimal linear predictor has zero standard and robust error (corresponding to the model's robust accuracy). Moreover, it is shown in [56] that robust self-training [57, 58, 59] can improve the linear regression model's robustness without sacrificing its standard error by leveraging extra unlabelled data. Our ensemble of noise-injected neural networks provides another feasible avenue toward ML models that are both robust and accurate.

I-C Organization

We organize this paper as follows: In Sec. II, we introduce the residual perturbation for privacy-preserving DL. In Sec. III, we present the generalization and DP guarantees for residual perturbation. In Sec. IV, we numerically verify the efficiency of residual perturbation in protecting data privacy without degrading the underlying models' utility. Technical proofs and additional experimental details and results are provided in Secs. V and VI. We conclude with final remarks.

I-D Notations

We denote scalars by lower- or upper-case letters, and vectors and matrices by lower- and upper-case boldface letters, respectively. For a vector ${\bm{x}}=(x_{1},\ldots,x_{d})\in\mathbb{R}^{d}$, we use $\|{\bm{x}}\|_{2}=(\sum_{i=1}^{d}|x_{i}|^{2})^{1/2}$ to denote its $\ell_{2}$ norm. For a matrix ${\bm{A}}$, we use $\|{\bm{A}}\|_{2}$ to denote its norm induced by the vector $\ell_{2}$ norm. We denote the standard Gaussian in ${\mathbb{R}}^{d}$ as $\mathcal{N}(\mathbf{0},{\bm{I}})$ with ${\bm{I}}\in{\mathbb{R}}^{d\times d}$ being the identity matrix. The set of real numbers is denoted as ${\mathbb{R}}$ (and the positive reals as ${\mathbb{R}}^{+}$). We use $B(\mathbf{0},R)$ to denote the ball centered at $\mathbf{0}$ with radius $R$.

II Algorithms

II-A Deep Residual Learning and Its Continuous Analogue

Let the training set be $S_{N}:=\{{\bm{x}}_{i},y_{i}\}_{i=1}^{N}$, where $\{{\bm{x}}_{i},y_{i}\}\subset{\mathbb{R}}^{d}\times{\mathbb{R}}$ is a data-label pair. For a given ${\bm{x}}_{i}:={\bm{x}}^{0}$, the forward propagation of a ResNet with $M$ residual mappings can be written as

$$\begin{aligned}{\bm{x}}^{l+1}&={\bm{x}}^{l}+\hat{F}({\bm{x}}^{l},{\bm{W}}^{l}),\quad\mbox{for}\ l=0,\ldots,M-1,\\ \hat{y}_{i}&=f({\bm{x}}^{M}),\end{aligned}\qquad(1)$$

where $\hat{F}(\cdot,{\bm{W}}^{l})$ is the nonlinear mapping of the $l$-th residual mapping parameterized by ${\bm{W}}^{l}$; $f(\cdot)$ is the output activation, and $\hat{y}_{i}$ is the prediction. The heuristic continuous limit of (1) is the following ordinary differential equation (ODE)

$$d{\bm{x}}(t)=F({\bm{x}}(t),{\bm{W}}(t))dt,\quad{\bm{x}}(0)=\hat{{\bm{x}}},\qquad(2)$$

where $t$ is the time variable. ODE (2) can be reversible, and thus its ResNet counterpart might be exposed to data privacy leakage. For instance, we use an image downloaded from the Internet (Fig. 1(a)) as the initial data $\hat{{\bm{x}}}$ in (2). Then we simulate the forward propagation of ResNet by solving (2) from $t=0$ to $t=1$ using the forward Euler solver with a time step size $\Delta t=0.01$ and a given velocity field $F({\bm{x}}(t),{\bm{W}}(t))$ (see Sec. VII for the details of $F({\bm{x}}(t),{\bm{W}}(t))$), which maps the original image to its features (Fig. 1(b)). To recover the original image, we start from the features and use the backward Euler iteration, i.e., $\tilde{{\bm{x}}}(t)=\tilde{{\bm{x}}}(t+\Delta t)-\Delta t\,F(\tilde{{\bm{x}}}(t+\Delta t),t+\Delta t)$, to evolve $\tilde{{\bm{x}}}(t)$ from $t=1$ to $t=0$ with $\tilde{{\bm{x}}}(1)={\bm{x}}(1)$ being the features obtained in the forward propagation. We plot the image recovered from the features in Fig. 1(c); the original image can be almost perfectly recovered.
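For concreteness, the following minimal NumPy sketch reproduces this reversibility phenomenon with a toy velocity field standing in for the trained $F({\bm{x}}(t),{\bm{W}}(t))$; the function and dimensions are illustrative, not the exact setup of Fig. 1.

import numpy as np

def F(x, t):
    # Toy stand-in for the velocity field F(x(t), W(t)); the field used for
    # Fig. 1 is described in Sec. VII.
    return np.tanh(x) + 0.1 * t

dt, steps = 0.01, 100
x0 = np.random.randn(16)             # stand-in for the flattened image x(0)

# Forward Euler: x(t + dt) = x(t) + dt * F(x(t), t), mapping the input to features.
x = x0.copy()
for k in range(steps):
    x = x + dt * F(x, k * dt)
features = x                         # plays the role of x(1)

# Backward iteration: x~(t) = x~(t + dt) - dt * F(x~(t + dt), t + dt).
x_rec = features.copy()
for k in reversed(range(steps)):
    x_rec = x_rec - dt * F(x_rec, (k + 1) * dt)

print("reconstruction error:", np.linalg.norm(x_rec - x0))   # close to zero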

Figure 1: Illustrations of the forward and backward propagation of the training data using the 2D ODE in equation (2) and the SDE in equation (3). (a) ${\bm{x}}(0)$: the original image; (b) and (d) ${\bm{x}}(1)$: the features of the original image generated by the forward propagation using the ODE and the SDE, respectively; (c) and (e) $\tilde{{\bm{x}}}(0)$: the images recovered by reverse-engineering the features shown in (b) and (d), respectively. It is easy to break the privacy of the ODE model but much harder for the SDE model.

II-B Residual Perturbation and its SDE Analogue

In this part, we propose two SDE models to reduce the reversibility of (2); the corresponding residual perturbation analogues can protect data privacy in DL.

Strategy I

Consider the following SDE model:

$$d{\bm{x}}(t)=F({\bm{x}}(t),{\bm{W}}(t))dt+\gamma d{\bm{B}}(t),\quad\gamma>0,\qquad(3)$$

where ${\bm{B}}(t)$ is the standard Brownian motion. We simulate the forward propagation and the reverse-engineering of the input from the output by solving the SDE model (3) with $\gamma=1$, using the same $F({\bm{x}}(t),{\bm{W}}(t))$ and initial data $\hat{{\bm{x}}}$ as before. We use the following forward (4) and backward (5) Euler-Maruyama discretizations [60] of (3) for the forward and backward propagation, respectively:

$${\bm{x}}(t+\Delta t)={\bm{x}}(t)+\Delta t\,F({\bm{x}}(t),{\bm{W}}(t))+\gamma\mathcal{N}(\mathbf{0},\sqrt{\Delta t}\,{\bm{I}}),\qquad(4)$$

$$\tilde{{\bm{x}}}(t)=\tilde{{\bm{x}}}(t+\Delta t)-\Delta t\,F(\tilde{{\bm{x}}}(t+\Delta t),{\bm{W}}(t+\Delta t))+\gamma\mathcal{N}(\mathbf{0},\sqrt{\Delta t}\,{\bm{I}}).\qquad(5)$$

Fig. 1(d) and (e) show the results of the forward and backward propagation by the SDE, respectively. These results show that it is much harder to reverse the features obtained from the SDE evolution. The SDE model suggests injecting Gaussian noise, in both the training and test phases, into each residual mapping of the ResNet to protect data privacy, which results in

$${\bm{x}}^{i+1}={\bm{x}}^{i}+\hat{F}({\bm{x}}^{i},{\bm{W}}^{i})+\gamma{\bm{n}}^{i},\quad\mbox{where}\ {\bm{n}}^{i}\sim\mathcal{N}(\mathbf{0},{\bm{I}}).\qquad(6)$$
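A minimal PyTorch sketch of the residual perturbation (6) is given below; the conv-BN-ReLU structure of the residual mapping is illustrative rather than the exact ResNet8 block used in our experiments, and the key point is that the Gaussian noise is added in both training and test modes.

import torch
import torch.nn as nn

class NoisyResidualBlock(nn.Module):
    """Residual mapping with additive Gaussian noise (Strategy I, Eq. (6))."""
    def __init__(self, channels, gamma=0.5):
        super().__init__()
        self.gamma = gamma
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # x^{i+1} = x^i + F_hat(x^i, W^i) + gamma * n^i,  n^i ~ N(0, I);
        # the noise is injected at both training and test time.
        return x + self.residual(x) + self.gamma * torch.randn_like(x)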
Strategy II

For the second strategy, we consider multiplicative noise instead of the additive noise used in (3) and (6); the corresponding SDE can be written as

$$d{\bm{x}}(t)=F({\bm{x}}(t),{\bm{W}}(t))dt+\gamma{\bm{x}}(t)\odot d{\bm{B}}(t),\quad\gamma>0,\qquad(7)$$

where $\odot$ denotes the Hadamard product. Similarly, we can use the forward and backward Euler-Maruyama discretizations of (7) to propagate the image in Fig. 1(a); we provide these results in Sec. VI-A. The corresponding residual perturbation is

$${\bm{x}}^{i+1}={\bm{x}}^{i}+\hat{F}({\bm{x}}^{i},{\bm{W}}^{i})+\gamma{\bm{x}}^{i}\odot{\bm{n}}^{i},\quad\mbox{where}\ {\bm{n}}^{i}\sim\mathcal{N}(\mathbf{0},{\bm{I}}).\qquad(8)$$

Again, the noise $\gamma{\bm{x}}^{i}\odot{\bm{n}}^{i}$ is injected into each residual mapping in both the training and test stages.
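A companion sketch of the multiplicative variant (8) follows; the wrapper below accepts any residual mapping $\hat{F}(\cdot,{\bm{W}})$ and is illustrative only.

import torch
import torch.nn as nn

class MultNoisyResidualBlock(nn.Module):
    """Residual mapping with multiplicative Gaussian noise (Strategy II, Eq. (8))."""
    def __init__(self, residual: nn.Module, gamma=0.5):
        super().__init__()
        self.residual = residual     # any residual mapping F_hat(., W)
        self.gamma = gamma

    def forward(self, x):
        # x^{i+1} = x^i + F_hat(x^i, W^i) + gamma * (x^i ⊙ n^i),  n^i ~ N(0, I)
        return x + self.residual(x) + self.gamma * x * torch.randn_like(x)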

We will provide theoretical guarantees for these two residual perturbation schemes, i.e., (6) and (8), in Sec. III, and numerically verify their efficacy in Sec. IV.

II-C Utility Enhancement via Model Ensemble

In [49], the authors showed that an ensemble of noise-injected ResNets can improve models’ utility. In this paper, we will also study the model ensemble for utility enhancement. We inherit notations from [49], e.g., we denote an ensemble of two noise-injected ResNet8 as En2ResNet8.
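One simple way to realize such an ensemble is sketched below: average the outputs of $k$ noise-injected copies of the same architecture. The exact ensembling and training scheme of [49] may differ, so this is only an illustration of the idea, and the constructor name in the comment is hypothetical.

import torch
import torch.nn as nn

class NoisyEnsemble(nn.Module):
    """Average the outputs of k independently perturbed copies of a model."""
    def __init__(self, make_model, k=5):
        super().__init__()
        self.members = nn.ModuleList([make_model() for _ in range(k)])

    def forward(self, x):
        # Each member carries its own residual-perturbation noise; averaging
        # the outputs reduces the variance introduced by the noise injection.
        return torch.stack([m(x) for m in self.members], dim=0).mean(dim=0)

# e.g., En5ResNet8 ~ NoisyEnsemble(lambda: resnet8_with_perturbation(gamma=0.5), k=5)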

III Main Theory

III-A Differential Privacy Guarantee for Strategy I

We consider the following function class for ResNets with residual perturbation:

$$\mathcal{F}_{1}:=\left\{f({\bm{x}})={\bm{w}}^{\top}{\bm{x}}^{M}\ \middle|\ {\bm{x}}^{i+1}={\bm{x}}^{i}+\phi\left({\bm{U}}^{i}{\bm{x}}^{i}\right)+\gamma{\bm{n}}^{i}\right\},\qquad(9)$$

where ${\bm{x}}^{0}=\mbox{input data}+\pi{\bm{n}}\in{\mathbb{R}}^{d}$ is the noisy input, with $\pi>0$ being a hyperparameter and ${\bm{n}}\sim\mathcal{N}(\mathbf{0},{\bm{I}})$ being a normal random vector; ${\bm{U}}^{i}\in{\mathbb{R}}^{d\times d}$ is the weight matrix in the $i$-th residual mapping and ${\bm{w}}\in{\mathbb{R}}^{d}$ is the weights of the last layer; ${\bm{n}}^{i}\sim\mathcal{N}(\mathbf{0},{\bm{I}})$, and $\gamma>0$ is also a hyperparameter. $\phi={\rm BN}(\psi)$ with ${\rm BN}$ being batch normalization and $\psi$ being an $L$-Lipschitz and monotonically increasing activation function, e.g., ReLU.

We first recall the definition of differential privacy.

Definition 1 ($(\epsilon,\delta)$-DP).

[20] A randomized mechanism ${\mathcal{M}}:{\mathcal{S}}^{N}\rightarrow{\mathcal{R}}$ satisfies $(\epsilon,\delta)$-DP if for any two datasets $S,S^{\prime}\in{\mathcal{S}}^{N}$ that differ by one element, and any output subset $O\subseteq{\mathcal{R}}$, it holds that ${\mathbb{P}}[{\mathcal{M}}(S)\in O]\leq e^{\epsilon}\cdot{\mathbb{P}}[{\mathcal{M}}(S^{\prime})\in O]+\delta$, where $\delta\in(0,1)$ and $\epsilon>0$.

Theorem 1 below guarantees DP for Strategy I, and we provide its proof in Sec. V-A.

Theorem 1.

Assume that the input to the ResNet lies in $B(\mathbf{0},R)$ and that the expectation of the output of every residual mapping is normally distributed and bounded by a constant $G$ in $\ell_{2}$ norm. Let $T$ be the total number of iterations used for training the ResNet. For any $\epsilon>0$ and $\delta,\lambda\in(0,1)$, the parameters ${\bm{U}}^{i}$ and ${\bm{w}}$ in the ResNet with residual perturbation satisfy $\left((\lambda/(i+1)+(1-\lambda))\epsilon,\delta\right)$-DP and $\left((\lambda/(M+1)+(1-\lambda))\epsilon,\delta\right)$-DP, respectively, provided that $\pi>R\sqrt{(2Tb\alpha)/(N\lambda\epsilon)}$ and $\gamma>G\sqrt{(2Tb\alpha)/(N\lambda\epsilon)}$, where $\alpha=\log(1/\delta)/\left((1-\lambda)\epsilon\right)+1$, and $M$, $N$, and $b$ are the number of residual mappings, the number of training data, and the batch size, respectively. In particular, in the above setting the whole model obtained by injecting noise according to Strategy I satisfies $(\epsilon,\delta)$-DP.
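For a quick numerical reading of Theorem 1, the following helper (hypothetical, not code from our experiments) evaluates the lower bounds on $\pi$ and $\gamma$ for a given budget; the example values below are placeholders, except for $R=G=30$, which we use in Sec. IV-F.

import math

def strategy1_noise(eps, delta, lam, T, b, N, R, G):
    """Lower bounds (pi, gamma) from Theorem 1 for an (eps, delta)-DP budget."""
    alpha = math.log(1.0 / delta) / ((1.0 - lam) * eps) + 1.0
    scale = math.sqrt(2.0 * T * b * alpha / (N * lam * eps))
    return R * scale, G * scale

# Placeholder schedule (not necessarily the one used in Sec. IV): 100 epochs,
# N = 50,000 training images, batch size b = 128.
T = 100 * (50_000 // 128)
print(strategy1_noise(eps=3e3, delta=1e-5, lam=0.5, T=T, b=128, N=50_000, R=30, G=30))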

III-B Theoretical Guarantees for Strategy II

Privacy

To analyze the residual perturbation scheme in (8), we consider the following function class:

$$\mathcal{F}_{2}:=\left\{f\left({\bm{x}}\right)={\bm{w}}^{\top}{\bm{x}}^{M}+\pi{\bm{x}}^{M}{\bm{n}}\ \middle|\ {\bm{x}}^{i+1}={\bm{x}}^{i}+\phi\left({\bm{U}}^{i}{\bm{x}}^{i}\right)+\gamma\tilde{{\bm{x}}}^{i}\odot{\bm{n}}^{i},\ i=0,\ldots,M-1\right\},$$

where ${\bm{n}}^{i}\sim\mathcal{N}(\mathbf{0},{\bm{I}})$ is a normal random vector for $i=0,\ldots,M-1$, and $\|{\bm{w}}\|\leq a$ with $a>0$ being a constant. We denote the entry of ${\bm{x}}^{i}$ that has the largest absolute value as $x^{i}_{\max}$, and define $\tilde{{\bm{x}}}^{i}:=({\rm sgn}(x^{i}_{j})\max(|x^{i}_{j}|,\eta))_{j=1}^{d}$. Due to batch normalization, we assume the $\ell_{2}$ norm of $\phi$ is bounded by a positive constant $B$. The other notations are defined similarly to those in (9). We consider training $\mathcal{F}_{2}$ on two datasets $S$ and $S^{\prime}$, and denote the resulting models as

$$f\left({\bm{x}}|S\right):={\bm{w}}_{1}^{\top}{\bm{x}}^{M}+\pi{\bm{x}}^{M}\odot{\bm{n}}^{M},\quad\text{where}\quad{\bm{x}}^{i+1}={\bm{x}}^{i}+\phi\left({\bm{U}}_{1}^{i}{\bm{x}}^{i}\right)+\gamma\tilde{{\bm{x}}}^{i}\odot{\bm{n}}^{i},\ i=0,\ldots,M-1,\qquad(10)$$

and

$$f\left({\bm{x}}|S^{\prime}\right):={\bm{w}}_{2}^{\top}{\bm{x}}^{M}+\pi{\bm{x}}^{M}\odot{\bm{n}}^{M},\quad\text{where}\quad{\bm{x}}^{i+1}={\bm{x}}^{i}+\phi\left({\bm{U}}_{2}^{i}{\bm{x}}^{i}\right)+\gamma\tilde{{\bm{x}}}^{i}\odot{\bm{n}}^{i},\ i=0,\ldots,M-1.\qquad(11)$$

Then we have the following privacy guarantee.

Theorem 2.

Let $f({\bm{x}}|S)$ and $f({\bm{x}}|S^{\prime})$ be defined in (10) and (11), respectively. Let $\lambda\in(0,1)$, $\delta\in(0,1)$, and $\epsilon>0$. If $\gamma>(B/\eta)\sqrt{(2\alpha M)/(\lambda\epsilon)}$ and $\pi>a\sqrt{(2\alpha M)/(\lambda\epsilon)}$, where $\alpha=\log(1/\delta)/\left((1-\lambda)\epsilon\right)+1$, then ${\mathbb{P}}[f({\bm{x}}|S)\in O]\leq e^{\epsilon}\cdot{\mathbb{P}}[f({\bm{x}}|S^{\prime})\in O]+\delta$ for any input ${\bm{x}}$ and any subset $O$ of the output space.

We prove Theorem 2 in Sec. V-B. Theorem 2 guarantees the privacy of the training data given only black-box access to the model, i.e., the model outputs a prediction for any input without granting adversaries access to the model itself. In particular, we cannot infer whether the model was trained on $S$ or $S^{\prime}$, no matter how we query the model in a black-box fashion. This kind of privacy-preserving prediction has also been studied in [34, 35, 61, 62]. We leave a theoretical DP guarantee for Strategy II for future work.
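Analogously to Theorem 1, the noise levels required by Theorem 2 can be evaluated numerically; the helper below is hypothetical, and the constants $B$, $\eta$, $a$, and $M$ in the example are placeholders.

import math

def strategy2_noise(eps, delta, lam, M, B, eta, a):
    """Lower bounds (gamma, pi) from Theorem 2 for a given (eps, delta) and lambda."""
    alpha = math.log(1.0 / delta) / ((1.0 - lam) * eps) + 1.0
    scale = math.sqrt(2.0 * alpha * M / (lam * eps))
    return (B / eta) * scale, a * scale

print(strategy2_noise(eps=10.0, delta=1e-5, lam=0.5, M=3, B=1.0, eta=0.1, a=1.0))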

Generalization gap

Many works have shown that overfitting in training ML models leads to privacy leakage [6], and that reducing overfitting can mitigate data privacy leakage [5, 7, 8, 6, 63]. In this part, we show that the residual perturbation (8) can reduce overfitting by computing the Rademacher complexity. For simplicity, we consider binary classification problems. Suppose $S_{N}=\{{\bm{x}}_{i},y_{i}\}_{i=1}^{N}$ is drawn from $X\times Y\subset{\mathbb{R}}^{d}\times\{-1,+1\}$ with $X$ and $Y$ being the input data and label spaces, respectively. Assume $\mathcal{D}$ is the underlying (unknown) distribution on $X\times Y$. Let $\mathcal{H}\subset V$ be the hypothesis class of the ML model. We first recall the definition of Rademacher complexity.

Definition 2.

[64] Let $\mathcal{H}:X\rightarrow{\mathbb{R}}$ be the space of real-valued functions on the space $X$. For a given sample $S=\{{\bm{x}}_{1},{\bm{x}}_{2},\ldots,{\bm{x}}_{N}\}$ of size $N$, the empirical Rademacher complexity of $\mathcal{H}$ is defined as

$$R_{S}(\mathcal{H}):=\frac{1}{N}E_{\sigma}\Big[\sup_{h\in\mathcal{H}}\sum_{i=1}^{N}\sigma_{i}h({\bm{x}}_{i})\Big],$$

where $\sigma_{1},\ldots,\sigma_{N}$ are i.i.d. Rademacher random variables with $\mathbb{P}(\sigma_{i}=1)=\mathbb{P}(\sigma_{i}=-1)=1/2$.
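The expectation in Definition 2 can be estimated by Monte Carlo when the supremum is taken over a finite (sampled) set of hypotheses; the sketch below is only an illustration of the definition, using random unit-norm linear hypotheses as a stand-in class.

import numpy as np

def empirical_rademacher(h_values, n_draws=2000, rng=np.random.default_rng(0)):
    """h_values[k, i] = h_k(x_i); Monte Carlo estimate of R_S over the finite set {h_k}."""
    num_h, N = h_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=N)     # i.i.d. Rademacher variables
        total += np.max(h_values @ sigma)           # sup_h sum_i sigma_i h(x_i)
    return total / (n_draws * N)

# Example: N = 50 points in R^5 and 200 random unit-norm linear hypotheses h(x) = w^T x.
X = np.random.default_rng(1).normal(size=(50, 5))
W = np.random.default_rng(2).normal(size=(200, 5))
W /= np.linalg.norm(W, axis=1, keepdims=True)
print(empirical_rademacher(W @ X.T))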

Rademacher complexity is a tool to bound the generalization gap [64]; the smaller the generalization gap, the less the model overfits. For any ${\bm{x}}_{i}\in{\mathbb{R}}^{d}$ and constant $c\geq 0$, we consider the following two function classes ($\mathcal{F}$ and $\mathcal{G}$):

$$\mathcal{F}:=\left\{f({\bm{x}},{\bm{w}})={\bm{w}}{\bm{x}}^{p}(T)\ \middle|\ d{\bm{x}}(t)={\bm{U}}{\bm{x}}(t)dt\ \text{with}\ {\bm{x}}(0)={\bm{x}}_{i}\right\},$$

where ${\bm{w}}\in{\mathbb{R}}^{1\times d}$ and ${\bm{U}}\in{\mathbb{R}}^{d\times d}$ with $\|{\bm{w}}\|_{2},\|{\bm{U}}\|_{2}\leq c$. And

$$\mathcal{G}:=\left\{f({\bm{x}},{\bm{w}})=\mathbb{E}\left[{\bm{w}}{\bm{x}}^{p}(T)\right]\ \middle|\ d{\bm{x}}(t)={\bm{U}}{\bm{x}}(t)dt+\gamma{\bm{x}}(t)\odot dB(t)\right\},$$

where ${\bm{x}}(0)={\bm{x}}_{i}$, ${\bm{w}}\in{\mathbb{R}}^{1\times d}$ and ${\bm{U}}\in{\mathbb{R}}^{d\times d}$ with $\|{\bm{w}}\|_{2},\|{\bm{U}}\|_{2}\leq c$; $0<p<1$ takes a value such that ${\bm{x}}^{p}$ is well defined on ${\mathbb{R}}^{d}$; $\gamma>0$ is a hyperparameter; ${\bm{U}}$ is a circulant matrix corresponding to a convolution layer in DNNs; and $B(t)$ is the standard 1D Brownian motion. The function class $\mathcal{F}$ represents the continuous analogue of a ResNet without inner nonlinear activation functions, and $\mathcal{G}$ denotes $\mathcal{F}$ with the residual perturbation (8).

Theorem 3.

Given the training set $S_{N}=\{{\bm{x}}_{i},y_{i}\}_{i=1}^{N}$, we have $R_{S_{N}}(\mathcal{G})<R_{S_{N}}(\mathcal{F})$.

We provide the proof of Theorem 3 in Sec. V-C, where we will also provide quantitative lower and upper bounds of the above Rademacher complexities. Theorem 3 shows that residual perturbation (8) can reduce the generalization error. We will numerically verify this generalization error reduction for ResNet with residual perturbation in Sec. IV.

IV Experiments

In this section, we will verify that 1) Residual perturbation can protect data privacy, particularly membership privacy. 2) The ensemble of ResNets with residual perturbation improves the classification accuracy. 3) The skip connection is crucial in residual perturbation for DL with data privacy. 4) Residual perturbation can improve the utility of the resulting ML models over DPSGD with a compromise in privacy protection. We focus on Strategy I in this section, and we provide the results of Strategy II in Sec. VI.

IV-A Preliminaries

Datasets

We consider MNIST, CIFAR10/CIFAR100 [65], and the Invasive Ductal Carcinoma (IDC) [66] datasets. MNIST consists of 70K $28\times 28$ gray-scale images, with 60K and 10K of them used for training and testing, respectively. Both CIFAR10 and CIFAR100 contain 60K $32\times 32$ color images, with 50K and 10K of them used for training and testing, respectively. The IDC dataset is a breast cancer-related benchmark dataset, which contains 277,524 patches of size $50\times 50$, with 198,738 labeled negative (0) and 78,786 labeled positive (1). Fig. 2 depicts a few patches from the IDC dataset. For the IDC dataset, we follow [67] and split the whole dataset into training, validation, and test sets. The training set consists of 10,788 positive patches and 29,164 negative patches, and the test set contains 11,595 positive patches and 31,825 negative patches. The remaining patches are used as the validation set. For each dataset, we split its training set into $D_{\rm shadow}$ and $D_{\rm target}$ of the same size. Furthermore, we split $D_{\rm shadow}$ into two halves of the same size, denoted $D_{\rm shadow}^{\rm train}$ and $D_{\rm shadow}^{\rm out}$, and similarly split $D_{\rm target}$ into $D_{\rm target}^{\rm train}$ and $D_{\rm target}^{\rm out}$. This splitting of the training set is used for the membership inference attack, which is discussed below.

Figure 2: Visualization of a few selected images (negative and positive patches) from the IDC dataset.
Membership inference attack

To verify the efficiency of residual perturbation for privacy protection, we consider the membership inference attack [6] in all the experiments below. The membership attack proceeds as follows: 1) train the shadow model on $D_{\rm shadow}^{\rm train}$; 2) apply the trained shadow model to predict all data points in $D_{\rm shadow}$ and obtain the corresponding classification probabilities of belonging to each class; we then take the top three classification probabilities (or two for binary classification) to form the feature vector for each data point. A feature vector is labeled 1 if the corresponding data point is in $D_{\rm shadow}^{\rm train}$, and 0 otherwise. We then train the attack model on all the labeled feature vectors; 3) train the target model on $D_{\rm target}^{\rm train}$ and obtain a feature vector for each point in $D_{\rm target}$. Finally, we use the attack model to decide whether a data point is in $D_{\rm target}^{\rm train}$. A sketch of the feature construction in step 2) is given below.
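The sketch uses illustrative names and assumes a PyTorch shadow model with standard data loaders; it only demonstrates how the top-3 probability features and the member/non-member labels are formed.

import numpy as np
import torch
import torch.nn.functional as F

@torch.no_grad()
def attack_features(model, loader, is_member, device="cpu"):
    """Top-3 softmax probabilities per example, labeled 1 for members and 0 otherwise."""
    feats, labels = [], []
    for x, _ in loader:
        probs = F.softmax(model(x.to(device)), dim=1)
        top3, _ = probs.topk(3, dim=1)              # feature vector of each data point
        feats.append(top3.cpu().numpy())
        labels.append(np.full(len(x), int(is_member)))
    return np.concatenate(feats), np.concatenate(labels)

# The attack model is then trained on features from D_shadow^train (label 1) and
# D_shadow^out (label 0), and evaluated on features computed with the target model.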

Experimental settings

We consider En5ResNet8 (an ensemble of 5 ResNet8 models with residual perturbation) and the standard ResNet8 as the target and shadow models, respectively. We use a multilayer perceptron with a hidden layer of 64 nodes, followed by a softmax function, as the attack model, which is adapted from [6]. We apply the same settings as in [33] to train the target and shadow models on MNIST, CIFAR10, and CIFAR100. For training models on the IDC dataset, we run 100 epochs of SGD with the same setting as before, except that we decay the learning rate by a factor of 4 at the 20th, 40th, 60th, and 80th epochs. Moreover, we run 50 epochs of Adam [68] with a learning rate of 0.1 to train the attack model. For the IDC, MNIST, and CIFAR datasets, we set $\pi$ to half of $\gamma$ in Strategy I, which simplifies hyperparameter calibration and, based on our experiments, gives a good tradeoff between privacy and accuracy.

Performance evaluations

We consider both classification accuracy and the capability of protecting membership privacy. The attack model is a binary classifier, which is used to decide whether a data point is in the training set of the target model. For any ${\bm{x}}\in D_{\rm target}$, we apply the attack model to predict its probability $p$ of belonging to the training set of the target model. Given any fixed threshold $t$, if $p\geq t$ we classify ${\bm{x}}$ as a member of the training set (positive sample), and if $p<t$ we conclude that ${\bm{x}}$ is not in the training set (negative sample); thus we obtain different attack results with different thresholds. Furthermore, we can plot the ROC curve (see details in Sec. IV-E) of the attack model and use the area under the ROC curve (AUC) as an evaluation of the membership inference attack. The target model protects membership privacy perfectly if the AUC is 0.5 (the attack model performs a random guess), and the more the AUC deviates from 0.5, the less private the target model is. Moreover, we use the precision (the fraction of records inferred as members that are indeed members of the training set) and recall (the fraction of the training set that is correctly inferred as members by the attack model) to measure ResNets' capability of protecting membership privacy.
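These metrics can be computed as in the following sketch (scikit-learn is assumed; `p` denotes attack-model scores on $D_{\rm target}$ and `member` indicates true membership in $D_{\rm target}^{\rm train}$; the synthetic example at the end is illustrative only).

import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score

def evaluate_attack(p, member, threshold=0.5):
    """AUC of the attack, plus precision/recall at a fixed decision threshold."""
    auc = roc_auc_score(member, p)          # 0.5 = random guess = perfect membership privacy
    pred = (p >= threshold).astype(int)
    return auc, precision_score(member, pred), recall_score(member, pred)

# Synthetic example: weakly informative attack scores give an AUC slightly above 0.5.
rng = np.random.default_rng(0)
member = rng.integers(0, 2, size=1000)
p = np.clip(0.5 + 0.05 * (member - 0.5) + 0.1 * rng.normal(size=1000), 0.0, 1.0)
print(evaluate_attack(p, member))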

Figure 3: Performance of residual perturbation (Strategy I) for En5ResNet8 with different noise coefficients ($\gamma$) and membership inference attack thresholds on the IDC dataset. Residual perturbation significantly improves membership privacy and reduces the generalization gap. $\gamma=0$ corresponds to the baseline ResNet8. (Unit: %)

IV-B Experiments on the IDC Dataset

In this subsection, we numerically verify that residual perturbation protects data privacy while retaining classification accuracy on the IDC dataset. We select En5ResNet8 as the benchmark architecture, with ResNet8 as its baseline. As shown in Fig. 3, we set four different thresholds to obtain different attack results with three different noise coefficients ($\gamma$), where $\gamma=0$ means the standard ResNet8 without residual perturbation. We also depict the ROC curve for this experiment in Fig. 7(c). En5ResNet8 is remarkably better at protecting membership privacy, and as $\gamma$ increases the model becomes more resistant to the membership attack. Also, as the noise coefficient increases, the gap between training and test accuracies becomes smaller, which resonates with Theorem 3. For instance, when $\gamma=0.2$ the AUC of the attack model is 0.495 and 0.563 for En5ResNet8 and ResNet8, respectively; the classification accuracies of En5ResNet8 and ResNet8 are 0.867 and 0.847, respectively.

Residual perturbation vs. DPSGD

In this part, we compare residual perturbation with the benchmark TensorFlow DPSGD module [69]. We calibrate the hyperparameters, including the initial learning rate (0.1), which decays by a factor of 4 after every 20 epochs, the noise multiplier (1.1), the clipping threshold (1.0), the number of microbatches (128), and the number of epochs (100) (see https://github.com/tensorflow/privacy/tree/master/tutorials), such that the resulting model gives the optimal tradeoff between membership privacy and classification accuracy. DPSGD is significantly more expensive due to the requirement of computing and clipping the per-sample gradient; a sketch of one DPSGD step is given below. We compare the standard ResNet8 trained by DPSGD with the baseline non-private ResNet8 and with En5ResNet8 with residual perturbation ($\gamma=0.5$). Table I lists the AUC of the attack model and the training and test accuracies of the target model; we see that both DPSGD and residual perturbation can protect membership privacy, and residual perturbation can improve accuracy without sacrificing membership privacy protection.
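The sketch below spells out one DPSGD step with explicit per-sample gradient clipping and Gaussian noise; it is a bare-bones illustration of where the overhead comes from, not the TensorFlow Privacy implementation used in Table I.

import torch

def dpsgd_step(model, loss_fn, xb, yb, lr=0.1, clip=1.0, noise_multiplier=1.1):
    """One DPSGD update on a batch (xb, yb); the per-sample loop shows the extra cost."""
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xb, yb):                        # per-sample gradients: the bottleneck
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
                 for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip / (norm + 1e-12), max=1.0)   # clip each per-sample gradient
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noisy = (s + noise_multiplier * clip * torch.randn_like(s)) / len(xb)
            p.add_(-lr * noisy)                     # noisy averaged-gradient step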

TABLE I: Residual perturbation (Strategy I) vs. DPSGD in training ResNet8 and EnResNet8 for IDC classification. Residual perturbation has better accuracy and protects better membership privacy than the baseline ResNet8 (AUC closer to 0.5).
Model AUC Training Acc Test Acc
ResNet8(non-private) 0.563 1.000 0.847
ResNet8 (DPSGD) 0.503 0.831 0.828
En1ResNet8 ($\gamma=0.5$) 0.499 0.882 0.865

IV-C Experiments on the MNIST/CIFAR10/CIFAR100 Datasets

We further test the effectiveness of residual perturbation for ResNet8 and En5ResNet8 on the MNIST/CIFAR10/CIFAR100 datasets. We also contrast residual perturbation with DPSGD in privacy protection and utility maintenance. Fig. 4 plots the performance of En5ResNet8 on CIFAR10 under the above four measures, namely precision, recall, training and test accuracy, and AUC. Again, the ensemble of ResNets with residual perturbation is remarkably less vulnerable to the membership inference attack; e.g., the AUC of the attack model for ResNet8 (non-private) and En5ResNet8 ($\gamma=0.75$) is 0.757 and 0.573, respectively. Also, the classification accuracy of En5ResNet8 (67.5%) is higher than that of the non-private ResNet8 (66.9%) for CIFAR10 classification. Fig. 5 depicts the results of En5ResNet8 for CIFAR100 classification. These results confirm again that residual perturbation can protect membership privacy and improve classification accuracy.

Figure 4: Performance of En5ResNet8 with residual perturbation (Strategy I) using different noise coefficients ($\gamma$) and membership inference attack thresholds on CIFAR10. Residual perturbation not only enhances membership privacy but also improves classification accuracy. $\gamma=0$ corresponds to the baseline ResNet8 without residual perturbation or model ensemble. (Unit: %)
Figure 5: Performance of En5ResNet8 with residual perturbation (Strategy I) using different noise coefficients ($\gamma$) and membership inference attack thresholds on CIFAR100. Again, residual perturbation not only enhances membership privacy but also improves classification accuracy. $\gamma=0$ corresponds to the baseline ResNet8 without residual perturbation or model ensemble. (Unit: %)
Residual perturbation vs. DPSGD

We contrast residual perturbation with DPSGD on MNIST, CIFAR10, and CIFAR100 classification; the results are shown in Tables II, III, and IV, respectively. These results unanimously show that 1) both DPSGD and residual perturbation can effectively protect membership privacy, and 2) residual perturbation is better than DPSGD in maintaining the model’s utility without sacrificing membership privacy protection.

TABLE II: Residual perturbation (Strategy I) vs. DPSGD in training ResNet8 and EnResNet8 for MNIST classification. Residual perturbation has higher accuracy without sacrificing membership privacy protection compared to DPSGD.
Model AUC Training Acc Test Acc
ResNet8(non-private) 0.514 1.000 0.988
ResNet8 (DPSGD) 0.507 0.977 0.975
En1ResNet8 ($\gamma=2.5$) 0.506 0.989 0.978
TABLE III: Residual perturbation (Strategy I) vs. DPSGD in training ResNet8 and EnResNet8 for CIFAR10 classification. Residual perturbation has higher accuracy than the baseline model without sacrificing membership privacy protection.
Model AUC Training Acc Test Acc
ResNet8(non-private) 0.757 1.000 0.669
ResNet8 (DPSGD) 0.512 0.718 0.625
En1ResNet8 ($\gamma=4.5$) 0.511 0.760 0.695
TABLE IV: Residual perturbation (Strategy I) vs. DPSGD in training ResNet8 and EnResNet8 for CIFAR100 classification. Residual perturbation has higher accuracy than DPSGD and protects comparable membership privacy with DPSGD.
Model AUC Training Acc Test Acc
ResNet8(non-private) 0.903 0.999 0.305
ResNet8 (DPSGD) 0.516 0.400 0.288
En1ResNet8 ($\gamma=3.25$) 0.515 0.485 0.383
Effects of the number of models in the ensemble

In this part, we consider the effects of the number of residual-perturbed ResNets in the ensemble. Fig. 6 illustrates the performance of EnResNet8 for CIFAR10 classification, measured by AUC and by training and test accuracy. These results show that tuning the noise coefficient and the number of models in the ensemble is crucial for optimizing the tradeoff between accuracy and privacy.

Figure 6: Performance of residual perturbation (Strategy I) with different noise coefficients ($\gamma$) and different numbers of models in the ensemble. The privacy-utility tradeoff depends on these two factors. (Unit: %)

IV-D On the Importance of Skip Connections

Residual perturbation relies on the irreversibility of the SDEs (3) and (7), and this ansatz rests on the skip connections in the ResNet. We test both the standard ResNet and a modified ResNet without skip connections. For CIFAR10 classification, under the same noise coefficient ($\gamma=0.75$), the test accuracy is 0.675 for En5ResNet8 with skip connections, while it is 0.653 for En5ResNet8 without skip connections. Skip connections make EnResNet more resistant to noise injection, which is crucial for the success of residual perturbation for privacy protection.

IV-E ROC Curves for Experiments on Different Datasets

The receiver operating characteristic (ROC) curve can be used to illustrate the classification ability of a binary classifier. The ROC curve is obtained by plotting the true positive rate against the false positive rate at different thresholds. The true positive rate, also known as recall, is the fraction of the positive set (all the positive samples) that is correctly inferred as positive by the binary classifier. The false positive rate equals 1 minus the specificity, where the specificity is the fraction of the negative set (all the negative samples) that is correctly inferred as negative by the binary classifier. In our case, the attack model is a binary classifier: data points in the training set of the target model are tagged as positive samples, and data points outside the training set of the target model are tagged as negative samples. We plot the ROC curves for the different datasets in Fig. 7. These ROC curves show that if $\gamma$ is sufficiently large, the attack model's prediction is nearly a random guess.

Figure 7: ROC curves of the membership attack against En5ResNet8 with residual perturbation (Strategy I) on (a) CIFAR10, (b) CIFAR100, and (c) IDC. (ga denotes the noise coefficient $\gamma$.)

IV-F Remark on the Privacy Budget

In the experiments above, we set the constants $G$ and $R$ to 30 for Strategy I. For classifying IDC with ResNet8, the theoretical DP budget for Strategy I, based on Theorem 1, is $(\epsilon=1.1\mathrm{e}3,\delta=1\mathrm{e}{-5})$ and the DP budget for DPSGD is $(\epsilon=15.79,\delta=1\mathrm{e}{-5})$. For classifying CIFAR10 with ResNet8, the DP budget for Strategy I, based on Theorem 1, is $(\epsilon=3\mathrm{e}3,\delta=1\mathrm{e}{-5})$ and the DP budget for DPSGD is $(\epsilon=22.33,\delta=1\mathrm{e}{-5})$. Using the algorithm provided in [42], we estimate the empirical lower bound of $\epsilon$ given $\delta=1\mathrm{e}{-5}$ for both Strategy I and DPSGD. The obtained empirical lower bound can guide the improvement of analytic tools for bounding the DP budget: the larger the gap between the theoretical and empirical DP budgets, the more room there is to improve the theoretical budget. For IDC classification, the lower bounds of $\epsilon$ for Strategy I and DPSGD are 3.604e-3 and 6.608e-3, respectively, and for CIFAR10 classification they are 1.055e-1 and 1.558e-1, respectively. These empirical DP budget estimates show that Theorem 1 offers a rather loose DP budget compared to DPSGD. Obtaining a tighter DP bound for Strategy I is significantly harder than for DPSGD, for two reasons. 1) The loss function of the noise-injected ResNet is highly nonlinear and very complex with respect to the weights, and the noise term appears in the loss function because noise is injected into each residual mapping. 2) In our proof, we leverage the framework of subsampled Rényi DP [37] to find a feasible range for the noise variance parameter, and then convert to DP to obtain the value of $\gamma$ for a given DP budget; this procedure significantly reduces the accuracy of the estimated $\gamma$. We leave a tight DP guarantee, in particular reducing the accuracy loss caused by the conversion between Rényi DP and DP, as future work.

V Technical Proofs

V-A Proof of Theorem 1

V-A1 Rényi Differential Privacy (RDP)

We use the notion of RDP to prove the DP guarantees for residual perturbations.

Definition 3.

[36] (Rényi divergence) For any two probability distributions $P$ and $Q$ defined over $\mathcal{D}$, the Rényi divergence of order $\alpha>1$ is

$$D_{\alpha}(P\|Q)=\frac{1}{\alpha-1}\log{\mathbb{E}}_{x\sim Q}(P/Q)^{\alpha}.$$
Definition 4.

[36] ($(\alpha,\epsilon)$-RDP) A randomized mechanism $\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R}$ is $\epsilon$-Rényi DP of order $\alpha$, or $(\alpha,\epsilon)$-RDP, if for any adjacent $\mathcal{S},\mathcal{S}^{\prime}\in\mathcal{D}$ that differ by one entry, we have

$$D_{\alpha}(\mathcal{M}(\mathcal{S})\|\mathcal{M}(\mathcal{S}^{\prime}))\leq\epsilon.$$
Lemma 1.

[36] Let $f:\mathcal{D}\rightarrow\mathcal{R}_{1}$ be $(\alpha,\epsilon_{1})$-RDP and $g:\mathcal{R}_{1}\times\mathcal{D}\rightarrow\mathcal{R}$ be $(\alpha,\epsilon_{2})$-RDP. Then the mechanism $(X,Y)$, where $X\sim f(D)$ and $Y\sim g(X,D)$, satisfies $(\alpha,\epsilon_{1}+\epsilon_{2})$-RDP.

Lemma 2.

[36] (From RDP to $(\epsilon,\delta)$-DP) If $f$ is an $(\alpha,\epsilon)$-RDP mechanism, then it also satisfies $(\epsilon-(\log\delta)/(\alpha-1),\delta)$-differential privacy for any $0<\delta<1$.
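As a numerical illustration of Lemma 2, the helper below converts an assumed Gaussian-mechanism RDP curve $\epsilon_{\rm RDP}(\alpha)=\alpha\Delta^{2}/(2\sigma^{2})$ into an $(\epsilon,\delta)$-DP guarantee by optimizing over $\alpha$; it is illustrative and not part of our analysis.

import math

def rdp_to_dp(sigma, sens, delta, alphas=range(2, 256)):
    """Best (epsilon, delta)-DP obtainable from the Gaussian-mechanism RDP curve."""
    best = float("inf")
    for alpha in alphas:
        eps_rdp = alpha * sens ** 2 / (2.0 * sigma ** 2)         # (alpha, eps_rdp)-RDP
        eps_dp = eps_rdp + math.log(1.0 / delta) / (alpha - 1)   # Lemma 2 conversion
        best = min(best, eps_dp)
    return best

print(rdp_to_dp(sigma=1.1, sens=1.0, delta=1e-5))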

Lemma 3.

[36] (Post-processing lemma) Let $\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R}$ be a randomized algorithm that is $(\alpha,\epsilon)$-RDP, and let $f:\mathcal{R}\rightarrow\mathcal{R}^{\prime}$ be an arbitrary randomized mapping. Then $f(\mathcal{M}(\cdot)):\mathcal{D}\rightarrow\mathcal{R}^{\prime}$ is $(\alpha,\epsilon)$-RDP.

V-A2 Proof of Theorem 1

In this subsection, we give the proof of Theorem 1, i.e., the DP guarantee for Strategy I.

Proof.

We prove Theorem 1 by mathematical induction. Consider two adjacent datasets $\mathcal{S}=\{{\bm{x}}_{1},\cdots,{\bm{x}}_{N-1},{\bm{x}}_{N}\}$ and $\mathcal{S}^{\prime}=\{{\bm{x}}_{1},\cdots,{\bm{x}}_{N-1},\hat{\bm{x}}_{N}\}$ that differ by one entry. For the first residual mapping, it is easy to check that when $\pi\geq R\sqrt{2\alpha/\epsilon_{p}}$, with $\pi$ being the standard deviation of the Gaussian noise injected into the input data, we have $D_{\alpha}({\bm{x}}^{0}_{N}\|\hat{\bm{x}}^{0}_{N})=\alpha\|{\bm{x}}_{N}-\hat{\bm{x}}_{N}\|^{2}/(2\pi^{2})<\epsilon_{p}$. For the remaining residual mappings, we denote the responses of the $i$-th residual mapping to the two inputs ${\bm{x}}_{N}$ and $\hat{\bm{x}}_{N}$ as ${\bm{x}}^{i}_{N}$ and $\hat{\bm{x}}^{i}_{N}$, respectively. Based on our assumption, we have

$${\bm{x}}^{i}_{N}+\phi({\bm{U}}^{i}{\bm{x}}^{i}_{N})\sim\mathcal{N}(\mu_{N,i},\sigma^{2}_{N,i}),\qquad\hat{\bm{x}}^{i}_{N}+\phi({\bm{U}}^{i}\hat{\bm{x}}^{i}_{N})\sim\mathcal{N}(\hat{\mu}_{N,i},\sigma^{2}_{N,i}),\qquad(12)$$

where the means $\mu_{N,i}$ and $\hat{\mu}_{N,i}$ are both bounded by the constant $G$. If $D_{\alpha}({\bm{x}}^{i}_{N}\|\hat{\bm{x}}^{i}_{N})\leq\epsilon_{p}/(i+1)$, then according to the post-processing lemma (Lemma 3), we have

$$D_{\alpha}\big({\bm{x}}^{i}_{N}+\phi({\bm{U}}^{i}{\bm{x}}^{i}_{N})\,\big\|\,\hat{\bm{x}}^{i}_{N}+\phi({\bm{U}}^{i}\hat{\bm{x}}^{i}_{N})\big)=\frac{\alpha\|\mu_{N,i}-\mu^{\prime}_{N,i}\|^{2}}{2\sigma^{2}_{N,i}}\leq\epsilon_{p}/(i+1),\qquad(13)$$

so when the standard deviation of the injected Gaussian noise satisfies $\gamma>\sqrt{2\alpha G^{2}/\epsilon_{p}}$, (13) implies that

$$D_{\alpha}\big({\bm{x}}^{i}_{N}+\phi({\bm{U}}^{i}{\bm{x}}^{i}_{N})+\gamma{\bm{n}}\,\big\|\,\hat{\bm{x}}^{i}_{N}+\phi({\bm{U}}^{i}\hat{\bm{x}}^{i}_{N})+\gamma{\bm{n}}\big)=\frac{\alpha\|\mu_{N,i}-\mu^{\prime}_{N,i}\|^{2}}{2(\sigma^{2}_{N,i}+\gamma^{2})}\leq\frac{\epsilon_{p}}{i+2},\qquad(14)$$

implying that $D_{\alpha}({\bm{x}}^{i+1}_{N}\|\hat{\bm{x}}^{i+1}_{N})<\frac{\epsilon_{p}}{i+2}$. Moreover, note that

$$\nabla_{{\bm{U}}^{i+1}}\ell|_{{\bm{x}}^{i+1}_{N}}=l^{\prime}\left({\bm{x}}^{i+1}_{N}+\phi\left({\bm{U}}^{i+1}{\bm{x}}^{i+1}_{N}\right)\right)\phi^{\prime}({\bm{U}}^{i+1}{\bm{x}}^{i+1}_{N})({\bm{x}}^{i+1}_{N})^{\top},\qquad(15)$$
$$\nabla_{{\bm{U}}^{i+1}}\ell|_{\hat{\bm{x}}^{i+1}_{N}}=l^{\prime}\left(\hat{\bm{x}}^{i+1}_{N}+\phi\left({\bm{U}}^{i+1}\hat{\bm{x}}^{i+1}_{N}\right)\right)\phi^{\prime}({\bm{U}}^{i+1}\hat{\bm{x}}^{i+1}_{N})(\hat{\bm{x}}^{i+1}_{N})^{\top}.$$

By the post-processing lemma (Lemma 3) again, we get

$$D_{\alpha}\big(\nabla_{{\bm{U}}^{i+1}}\ell|_{{\bm{x}}^{i+1}_{N}}\,\big\|\,\nabla_{{\bm{U}}^{i+1}}\ell|_{\hat{\bm{x}}^{i+1}_{N}}\big)<\frac{\epsilon_{p}}{i+2}.$$

Let $\mathcal{B}_{t}$ be the minibatch index set with $|\mathcal{B}_{t}|=b$; we update ${\bm{U}}^{i+1}$ as

$${\bm{U}}^{i+1}_{t+1}|\mathcal{S}={\bm{U}}^{i+1}_{t}-\alpha\frac{1}{b}\sum_{j\in\mathcal{B}_{t}}\nabla_{{\bm{U}}^{i+1}_{t}}\ell({\bm{x}}^{i+1}_{j},{\bm{U}}^{i+1}_{t}),\qquad{\bm{U}}^{i+1}_{t+1}|\mathcal{S}^{\prime}={\bm{U}}^{i+1}_{t}-\alpha\frac{1}{b}\sum_{j\in\mathcal{B}_{t}}\nabla_{{\bm{U}}^{i+1}_{t}}\ell({\bm{x}}^{i+1}_{j},{\bm{U}}^{i+1}_{t}),\qquad(16)$$

where ${\bm{U}}^{i+1}_{t}$ denotes the weights after the $t$-th training iteration. When $N\notin\mathcal{B}_{t}$, it is obvious that $D_{\alpha}({\bm{U}}^{i+1}_{t+1}|\mathcal{S}\,\|\,{\bm{U}}^{i+1}_{t+1}|\mathcal{S}^{\prime})=0$; when $N\in\mathcal{B}_{t}$, the update of ${\bm{U}}^{i+1}$ can be rewritten as

$$\begin{aligned}{\bm{U}}^{i+1}_{t+1}|\mathcal{S}&={\bm{U}}^{i+1}_{t}-\alpha(1/b)\sum_{j\in\mathcal{B}_{t}\setminus\{N\}}\nabla_{{\bm{U}}^{i+1}_{t}}\ell({\bm{x}}^{i+1}_{j},{\bm{U}}^{i+1}_{t})-\alpha(1/b)\nabla_{{\bm{U}}^{i+1}_{t}}\ell({\bm{x}}^{i+1}_{N},{\bm{U}}^{i+1}_{t}),\\ {\bm{U}}^{i+1}_{t+1}|\mathcal{S}^{\prime}&={\bm{U}}^{i+1}_{t}-\alpha(1/b)\sum_{j\in\mathcal{B}_{t}\setminus\{N\}}\nabla_{{\bm{U}}^{i+1}_{t}}\ell({\bm{x}}^{i+1}_{j},{\bm{U}}^{i+1}_{t})-\alpha(1/b)\nabla_{{\bm{U}}^{i+1}_{t}}\ell(\hat{\bm{x}}^{i+1}_{N},{\bm{U}}^{i+1}_{t}).\end{aligned}\qquad(17)$$

By Lemma 3, we have $D_{\alpha}({\bm{U}}^{i+1}_{t+1}|\mathcal{S}\,\|\,{\bm{U}}^{i+1}_{t+1}|\mathcal{S}^{\prime})\leq\epsilon_{p}/(i+2)$. There are only $(Tb)/N$ steps in which the information of ${\bm{x}}_{N}$ or $\hat{\bm{x}}_{N}$ is used. Replacing $\epsilon_{p}$ by $(N\epsilon_{p})/(Tb)$ and using the composition theorem, we obtain that after $T$ steps the output ${\bm{U}}^{i+1}_{T}$ satisfies $(\alpha,\epsilon_{p}/(i+2))$-RDP and ${\bm{w}}$ satisfies $(\alpha,\epsilon_{p}/(M+1))$-RDP. By Lemma 2, we can then establish the DP guarantee for Strategy I, as stated in Theorem 1. ∎

V-B Proof of Theorem 2

In this section, we will provide a proof for Theorem 2.

Proof.

Let $\phi={\rm BN}(\psi)$, where ${\rm BN}$ is the batch normalization operation and $\psi$ is an activation function. Due to the property of batch normalization, we assume the $\ell_{2}$ norm of $\phi$ is bounded by a positive constant $B$. We aim to show that the model in $\mathcal{F}_{2}$ guarantees training-data privacy given only black-box access to it. Consider training the model with two different datasets $S$ and $S^{\prime}$, and denote the resulting models as $f(\cdot|S)$ and $f(\cdot|S^{\prime})$, respectively. Next, we show that for appropriate $\gamma$ and $\pi$, $D_{\alpha}(f({\bm{x}}|S)\|f({\bm{x}}|S^{\prime}))<\epsilon_{p}$ for any input ${\bm{x}}$.

First, we consider the convolution layers, and let ${\rm Conv}({\bm{x}})_{i}$ be the $i$-th entry of the vectorized ${\rm Conv}({\bm{x}})$. Then we have ${\rm Conv}({\bm{x}})_{i}={\bm{x}}_{i}+\phi({\bm{U}}{\bm{x}})_{i}+\gamma\tilde{{\bm{x}}}_{i}\odot{\bm{n}}_{i}$. For the two training datasets, we denote

$${\rm Conv}({\bm{x}})_{i}|S={\bm{x}}_{i}+\phi({\bm{U}}_{1}{\bm{x}})_{i}+\gamma\tilde{{\bm{x}}}_{i}\odot{\bm{n}}_{i}\sim\mathcal{N}({\bm{x}}_{i}+\phi({\bm{U}}_{1}{\bm{x}})_{i},\gamma^{2}{\bm{X}}),$$

and

$${\rm Conv}({\bm{x}})_{i}|S^{\prime}={\bm{x}}_{i}+\phi({\bm{U}}_{2}{\bm{x}})_{i}+\gamma\tilde{{\bm{x}}}_{i}\odot{\bm{n}}_{i}\sim\mathcal{N}({\bm{x}}_{i}+\phi({\bm{U}}_{2}{\bm{x}})_{i},\gamma^{2}{\bm{X}}),$$

where ${\bm{X}}$ is a diagonal matrix whose $j$-th diagonal entry is the square of the $j$-th element of the vector $\tilde{\bm{x}}_{i}$.

Therefore, if $\gamma>(B/\eta)\sqrt{(2\alpha M)/\epsilon_{p}}$, we have

$$\begin{aligned}&D_{\alpha}\left(\mathcal{N}\left({\bm{x}}_{i}+\phi\left({\bm{U}}_{1}{\bm{x}}\right)_{i},\gamma^{2}\tilde{{\bm{x}}}^{2}_{i}\right)\middle\|\mathcal{N}\left({\bm{x}}_{i}+\phi\left({\bm{U}}_{2}{\bm{x}}\right)_{i},\gamma^{2}\tilde{{\bm{x}}}^{2}_{i}\right)\right)\\ &\leq\frac{\alpha(\phi({\bm{U}}_{1}{\bm{x}})_{i}-\phi({\bm{U}}_{2}{\bm{x}})_{i})^{2}}{2\gamma^{2}\eta^{2}}\leq\frac{4\alpha L^{2}B^{2}\|{\bm{x}}\|^{2}_{2}}{2\gamma^{2}\eta^{2}}\leq\frac{2\alpha B^{2}}{\gamma^{2}\eta^{2}}\leq\epsilon_{p}/M.\end{aligned}$$

Furthermore, $\gamma>(b/\eta)\sqrt{(2\alpha M)/\epsilon_{p}}$ guarantees $(\alpha,\epsilon_{p}/M)$-RDP for every convolution layer. For the last fully connected layer, if $\pi>a\sqrt{(2\alpha M)/\epsilon_{p}}$, we have

$$D_{\alpha}\left(\mathcal{N}\left({\bm{w}}_{1}^{\top}{\bm{x}},\pi^{2}\tilde{{\bm{x}}}^{2}_{i}\right)\middle\|\mathcal{N}\left({\bm{w}}_{2}^{\top}{\bm{x}},\pi^{2}\tilde{{\bm{x}}}^{2}_{i}\right)\right)=\frac{\alpha\left({\bm{w}}_{1}^{\top}{\bm{x}}-{\bm{w}}_{2}^{\top}{\bm{x}}\right)^{2}}{2\pi^{2}\|{\bm{x}}\|_{2}}\leq\frac{4\alpha a^{2}\|{\bm{x}}\|^{2}_{2}}{2\pi^{2}\|{\bm{x}}\|_{2}}\leq\frac{2\alpha a^{2}}{\pi^{2}}\leq\epsilon_{p}/M,$$

i.e., $\pi>a\sqrt{(2\alpha M)/\epsilon_{p}}$ guarantees that the fully connected layer is $(\alpha,\epsilon_{p}/M)$-RDP. According to Lemma 1, we have an $(\alpha,\epsilon_{p})$-RDP guarantee for the ResNet with $M$ residual mappings if $\gamma>(Lb/\eta)\sqrt{(2\alpha dM)/\epsilon_{p}}$ and $\pi>a\sqrt{(2\alpha M)/\epsilon_{p}}$. Let $\lambda\in(0,1)$; for any given $(\epsilon,\delta)$ pair, if $\epsilon_{p}\leq\lambda\epsilon$ and $-(\log\delta)/(\alpha-1)\leq(1-\lambda)\epsilon$, then we obtain the $(\epsilon,\delta)$-DP guarantee for the ResNet with $M$ residual mappings using residual perturbation Strategy II. ∎

V-C Proof of Theorem 3

In this section, we prove that the residual perturbation (8) can reduce the generalization error by computing the Rademacher complexity. Let us first recall some related lemmas on SDEs and Rademacher complexity.

V-C1 Some Lemmas

Let $\ell:V\times Y\rightarrow[0,B]$ be the loss function; we assume $\ell$ is bounded and $B$ is a positive constant. In addition, we denote the function class $\ell_{\mathcal{H}}=\{({\bm{x}},y)\rightarrow\ell(h({\bm{x}}),y):h\in\mathcal{H},({\bm{x}},y)\in X\times Y\}$. The goal of the learning problem is to find $h\in\mathcal{H}$ such that the population risk $R(h)=\mathbb{E}_{({\bm{x}},y)\sim\mathcal{D}}[\ell(h({\bm{x}}),y)]$ is minimized. The gap between the population risk and the empirical risk $R_{S_{N}}(h)=(1/N)\sum^{N}_{i=1}\ell(h({\bm{x}}_{i}),y_{i})$ is known as the generalization error. The following lemmas connect the population and empirical risks via the Rademacher complexity.

Lemma 4.

(Ledoux-Talagrand inequality) [70] Let $\mathcal{H}$ be a bounded real-valued function space and let $\phi:\mathbb{R}\rightarrow\mathbb{R}$ be Lipschitz with constant $L$ and $\phi(0)=0$. Then we have

$$\frac{1}{n}E_{\sigma}\left[\sup_{h\in\mathcal{H}}\sum_{i=1}^{n}\sigma_{i}\phi\left(h\left(x_{i}\right)\right)\right]\leq\frac{L}{n}E_{\sigma}\left[\sup_{h\in\mathcal{H}}\sum_{i=1}^{n}\sigma_{i}h\left(x_{i}\right)\right].$$
Lemma 5.

[64] Let $S_{N}=\{\left({\bm{x}}_{1},y_{1}\right),\cdots,\left({\bm{x}}_{N},y_{N}\right)\}$ be samples chosen i.i.d. according to the distribution $\mathcal{D}$, and let the loss function $\ell$ be bounded by $B>0$. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following holds for all $h\in\mathcal{H}$:

$$R\left(h\right)\leq R_{S_{N}}\left(h\right)+2BR_{S_{N}}\left(\ell_{\mathcal{H}}\right)+3B\sqrt{\log(2/\delta)/(2N)}.$$

In addition, according to the Ledoux-Talagrand inequality, and assuming the loss function is $L$-Lipschitz, we have

$$R_{S_{N}}\left(\ell_{\mathcal{H}}\right)\leq LR_{S_{N}}\left(\mathcal{H}\right).$$

Thus the population risk can be bounded by the empirical risk and the Rademacher complexity of the function class $\mathcal{H}$. Since we cannot minimize the population risk directly, we minimize it indirectly by minimizing the empirical risk and the Rademacher complexity of $\mathcal{H}$. Next, we further discuss the Rademacher complexity of the function class $\mathcal{H}$; we first introduce several lemmas.

Lemma 6.

[71] For any given matrix ${\bm{U}}$, the solution to the equation

$$d{\bm{x}}(t)={\bm{U}}{\bm{x}}(t)dt;\quad{\bm{x}}(0)=\hat{{\bm{x}}},$$

has the following expression:

$${\bm{x}}(t)=\exp({\bm{U}}t)\hat{{\bm{x}}}.$$

Also, we can write the solution to the following equation

$$d{\bm{y}}(t)={\bm{U}}{\bm{y}}(t)dt+\gamma{\bm{y}}(t)dB(t);\quad{\bm{y}}(0)=\hat{{\bm{x}}},$$

as

$${\bm{y}}(t)=\exp\left({\bm{U}}t-\frac{1}{2}\gamma^{2}t{\bm{I}}+\gamma B(t){\bm{I}}\right)\hat{{\bm{x}}}.$$

Obviously, we have ${\mathbb{E}}[{\bm{y}}(t)]={\bm{x}}(t)$.
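The identity ${\mathbb{E}}[{\bm{y}}(t)]={\bm{x}}(t)$ can be checked numerically in the scalar case using the closed-form solution above; the Monte Carlo sketch below is only a sanity check with arbitrary constants.

import numpy as np

rng = np.random.default_rng(0)
U, gamma, T, x0, n_paths = 0.3, 0.8, 1.0, 2.0, 200_000

B_T = rng.normal(0.0, np.sqrt(T), size=n_paths)                 # B(T) ~ N(0, T)
y_T = np.exp(U * T - 0.5 * gamma ** 2 * T + gamma * B_T) * x0   # exact SDE solution
print(y_T.mean(), np.exp(U * T) * x0)                           # the two values agree closely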

Lemma 7.

For any circulant matrix ${\mathbf{C}}$ with the first row being $(a_{1},a_{2},\cdots,a_{d})$, we have the eigen-decomposition $\Psi^{H}{\mathbf{C}}\Psi={\rm diag}(\lambda_{1},\dots,\lambda_{d})$, where

$$\sqrt{d}\,\Psi=\begin{pmatrix}1&1&\dots&1\\ 1&m_{1}&\dots&m_{d-1}\\ \vdots&\vdots&\ddots&\vdots\\ 1&m^{d-1}_{1}&\dots&m^{d-1}_{d-1}\end{pmatrix},$$

and the $m_{i}$'s are the roots of unity and $\lambda_{i}=a_{1}+a_{2}m_{i}+\dots+a_{d}m^{d-1}_{i}$.

V-C2 The proof of Theorem 3

Proof.

By the definition of Rademacher complexity (Def. 2), we have

\begin{align*}
R_{S_{N}}\left(\mathcal{F}\right) &=\left(1/N\right)\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{N}\sigma_{i}{\bm{w}}{\bm{x}}^{p}_{i}(T)\right]\\
&=\left(c/N\right)\mathbb{E}_{\sigma}\left[\sup_{\|{\bm{U}}\|_{2}\leq c}\Big\|\sum_{i=1}^{N}\sigma_{i}{\bm{x}}^{p}_{i}\left(T\right)\Big\|_{2}\right].
\end{align*}

Let ${\bm{u}}_{i}=\Psi{\bm{x}}^{p}_{i}$ and denote the $j$-th element of ${\bm{u}}_{i}$ by $u_{i,j}$. Then, by Lemma 7, we have

\begin{align*}
R_{S_{N}}\left(\mathcal{F}\right)
&=\left(c/N\right)\mathbb{E}_{\sigma}\sup_{\|{\bm{U}}\|_{2}\leq c}\Big(\sum_{i,j}\sigma_{i}\sigma_{j}\langle{\bm{x}}^{p}_{i}(T),{\bm{x}}^{p}_{j}(T)\rangle\Big)^{1/2}\\
&=\left(c/N\right)\mathbb{E}_{\sigma}\sup_{|\lambda_{i}|\leq c}\Big\|\Psi^{H}\sum_{i=1}^{N}\sigma_{i}\big(u_{i,1}\exp(\lambda_{1}Tp),\cdots,u_{i,d}\exp(\lambda_{d}Tp)\big)^{\top}\Big\|_{2}\\
&=\left(c/N\right)\mathbb{E}_{\sigma}\sup_{|\lambda_{i}|\leq c}\Big\{\sum_{j=1}^{d}\Big[\sum_{i=1}^{N}\sigma_{i}u_{i,j}\exp\left(\lambda_{j}Tp\right)\Big]^{2}\Big\}^{1/2}\\
&=\left(c/N\right)\exp\left(cTp\right)\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^{N}\sigma_{i}{\bm{u}}_{i}\Big\|_{2}\\
&=\left(c/N\right)\exp\left(cTp\right)\mathbb{E}_{\sigma}\Big\|\Psi\sum_{i=1}^{N}\sigma_{i}{\bm{x}}^{p}_{i}\Big\|_{2}\\
&=\left(c/N\right)\exp\left(cTp\right)\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^{N}\sigma_{i}{\bm{x}}^{p}_{i}\Big\|_{2}.
\end{align*}

Note that $\mathbb{E}\left({\bm{w}}{\bm{x}}^{p}_{i}(T)\right)={\bm{w}}\mathbb{E}\left({\bm{x}}^{p}_{i}(T)\right)$. According to Lemma 6, and similarly to the derivation for the function class $\mathcal{F}$, we have

\begin{align*}
R_{S_{N}}\left(\mathcal{G}\right)
&=\left(c/N\right)\mathbb{E}_{\sigma}\sup_{\|{\bm{U}}\|_{2}\leq c}\Big(\sum_{i,j}\sigma_{i}\sigma_{j}\langle\mathbb{E}{\bm{x}}^{p}_{i}(T),\mathbb{E}{\bm{x}}^{p}_{j}(T)\rangle\Big)^{1/2}\\
&=\left(c/N\right)\mathbb{E}_{\sigma}\sup_{|\lambda_{i}|\leq c}\Big\|\Psi^{H}\sum_{i=1}^{N}\sigma_{i}\big(u_{i,1}\exp(\lambda_{1}Tp-p(1-p)\gamma^{2}T/2),\\
&\hspace{7.5em}\cdots,u_{i,d}\exp(\lambda_{d}Tp-p(1-p)\gamma^{2}T/2)\big)^{\top}\Big\|_{2}\\
&=\left(c/N\right)\mathbb{E}_{\sigma}\sup_{|\lambda_{i}|\leq c}\Big\{\sum_{j=1}^{d}\Big[\sum_{i=1}^{N}\sigma_{i}u_{i,j}\exp\big(\lambda_{j}Tp-p(1-p)\gamma^{2}T/2\big)\Big]^{2}\Big\}^{1/2}\\
&=\left(c/N\right)\exp\big(cTp-p(1-p)\gamma^{2}T/2\big)\mathbb{E}_{\sigma}\Big\{\sum_{j=1}^{d}\Big[\sum_{i=1}^{N}\sigma_{i}u_{i,j}\Big]^{2}\Big\}^{1/2}\\
&=\left(c/N\right)\exp\big(cTp-p(1-p)\gamma^{2}T/2\big)\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^{N}\sigma_{i}{\bm{u}}_{i}\Big\|_{2}\\
&=\left(c/N\right)\exp\big(cTp-p(1-p)\gamma^{2}T/2\big)\mathbb{E}_{\sigma}\Big\|\Psi\sum_{i=1}^{N}\sigma_{i}{\bm{x}}^{p}_{i}\Big\|_{2}\\
&=\left(c/N\right)\exp\big(cTp-p(1-p)\gamma^{2}T/2\big)\mathbb{E}_{\sigma}\Big\|\sum_{i=1}^{N}\sigma_{i}{\bm{x}}^{p}_{i}\Big\|_{2}\\
&<R_{S_{N}}\left(\mathcal{F}\right).
\end{align*}

Therefore, $R_{S_{N}}(\mathcal{G})<R_{S_{N}}(\mathcal{F})$; that is, the ensemble of Gaussian-noise-injected ResNets has a smaller Rademacher complexity, and hence a smaller generalization error bound, than the standard ResNet. ∎
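The two bounds above differ by the factor $\exp(-p(1-p)\gamma^{2}T/2)\leq 1$. For the illustrative (hypothetical) choice $p=1/2$, $\gamma=2$, and $T=1$,
\[
\exp\big(-p(1-p)\gamma^{2}T/2\big)=\exp\big(-\tfrac{1}{4}\cdot 4\cdot\tfrac{1}{2}\big)=\exp(-1/2)\approx 0.61,
\]
i.e., the Rademacher complexity bound of the noise-injected ensemble is roughly 40% smaller than that of the standard ResNet in this example.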

VI Experiments on Strategy II

VI-A Forward and Backward Propagation Using (7)

Fig. 8 plots the forward and backward propagation of an image using the SDE model (7). Again, we cannot recover the input by simply applying the backward Euler-Maruyama discretization to the features generated by the forward Euler-Maruyama discretization.

Figure 8: Illustrations of the forward and backward propagation of the training data using the SDE model (7). (a) ${\bm{x}}(1)$ (SDE): features of the original image generated by forward propagation through the SDE; (b) $\tilde{{\bm{x}}}(0)$ (SDE): image recovered by reverse-engineering the features shown in (a).
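To make this concrete, the following minimal sketch (ours; it uses a hypothetical scalar velocity field $\tanh$ in place of $F({\bm{x}},{\bm{W}})$ and arbitrary parameters) forward-propagates a feature vector with the Euler-Maruyama scheme and then attempts the naive backward pass. Because an adversary does not know the realized Brownian path, re-drawing the noise in the backward pass fails to recover the input.

import numpy as np

# Forward Euler-Maruyama propagation of dx = f(x) dt + gamma x dB, followed by
# a naive backward pass that redraws the noise (the adversary does not know the
# Brownian path used in the forward pass). Illustrative sketch only.
rng = np.random.default_rng(0)
f = np.tanh                          # hypothetical stand-in for F(x(t), W(t))
gamma, T, n_steps = 1.0, 1.0, 100
dt = T / n_steps

x0 = rng.standard_normal(64)         # "input features"
x = x0.copy()
for _ in range(n_steps):             # forward pass
    x = x + f(x) * dt + gamma * x * np.sqrt(dt) * rng.standard_normal(x.shape)

x_rev = x.copy()
for _ in range(n_steps):             # naive reversal with freshly drawn noise
    x_rev = x_rev - f(x_rev) * dt - gamma * x_rev * np.sqrt(dt) * rng.standard_normal(x_rev.shape)

print(np.linalg.norm(x_rev - x0) / np.linalg.norm(x0))  # large relative reconstruction error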

VI-A1 Experiments on the IDC Dataset

In this subsection, we evaluate the performance of residual perturbation (Strategy II) in protecting membership privacy while retaining classification accuracy on the IDC dataset. We use the same ResNet models as those used for the first residual perturbation (6). The results, listed in Fig. 9, confirm that residual perturbation (8) can effectively protect data privacy while maintaining, or even improving, the classification accuracy. In addition, we depict the ROC curve for this experiment in Fig. 12(c). We note that, as the noise coefficient increases, the gap between training and testing accuracies narrows, which is consistent with Theorem 3.

Figure 9: Performance of residual perturbation (8) (Strategy II) for En5ResNet8 with different noise coefficients ($\gamma$) and membership inference attack thresholds on the IDC dataset. Residual perturbation significantly improves membership privacy and reduces the generalization gap. $\gamma=0$ corresponds to the baseline ResNet8. (Unit: %)
Residual perturbation vs. DPSGD

We have shown that the first residual perturbation (6) outperforms DPSGD in protecting membership privacy and improving classification accuracy. In this part, we further show that the second residual perturbation (8) also outperforms the benchmark DPSGD under the above settings. Table V lists the AUC of the attack model and the training and test accuracies of the target model; we see that the second residual perturbation also improves classification accuracy while providing comparable membership privacy.

TABLE V: Residual perturbation (8) (Strategy II) vs. DPSGD in training ResNet8 for IDC dataset classification. The ensemble of ResNet8 with residual perturbation is more accurate (higher test accuracy) and provides membership privacy comparable to DPSGD.
Model AUC Training Acc Test Acc
ResNet8 (DPSGD) 0.503 0.831 0.828
En1ResNet8 0.509 0.880 0.868
En5ResNet8 0.500 0.895 0.872
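For reference, the AUC values reported above come from a threshold-based membership inference attack. The following minimal sketch (ours; the confidence scores are synthetic placeholders, not the paper's measurements) shows how such an attack's AUC can be computed from per-example confidences of members and non-members; an AUC near 0.5 means the attack does no better than random guessing.

import numpy as np
from sklearn.metrics import roc_auc_score

# Threshold-based membership inference: score each example by the target
# model's confidence on its true class and measure how well a threshold
# separates training members from held-out non-members.
def attack_auc(conf_members, conf_nonmembers):
    scores = np.concatenate([conf_members, conf_nonmembers])
    labels = np.concatenate([np.ones(len(conf_members)), np.zeros(len(conf_nonmembers))])
    return roc_auc_score(labels, scores)

# Synthetic confidences for illustration only.
rng = np.random.default_rng(0)
conf_train = np.clip(rng.normal(0.90, 0.08, 5000), 0.0, 1.0)  # members
conf_test = np.clip(rng.normal(0.80, 0.15, 5000), 0.0, 1.0)   # non-members
print(attack_auc(conf_train, conf_test))                      # > 0.5 indicates membership leakage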

VI-A2 Experiments on the CIFAR10/CIFAR100 Datasets

In this subsection, we test the second residual perturbation (8) on the CIFAR10/CIFAR100 datasets with the same model and settings as before. Fig. 10 plots the performance of En5ResNet8 on the CIFAR10 dataset. These results show that the ensemble of ResNets with residual perturbation (8) is significantly more robust to the membership inference attack. For instance, the AUC of the attack model for ResNet8 and En5ResNet8 ($\gamma=2.0$) is $0.757$ and $0.526$, respectively. Also, the classification accuracy of En5ResNet8 ($\gamma=2.0$) is higher than that of ResNet8: $71.2$% versus $66.9$% on CIFAR10. Fig. 11 shows the results of En5ResNet8 for CIFAR100 classification. These results again confirm that residual perturbation (8) can protect membership privacy and improve classification accuracy.

Figure 10: Performance of En5ResNet8 with residual perturbation (8) (Strategy II) using different noise coefficients ($\gamma$) and membership inference attack thresholds on CIFAR10. Residual perturbation (8) not only enhances membership privacy but also improves classification accuracy. $\gamma=0$ corresponds to the baseline ResNet8 without residual perturbation or model ensemble. (Unit: %)
Figure 11: Performance of En5ResNet8 with residual perturbation (8) (Strategy II) using different noise coefficients ($\gamma$) and membership inference attack thresholds on CIFAR100. Again, residual perturbation (8) not only enhances membership privacy but also improves classification accuracy. $\gamma=0$ corresponds to the baseline ResNet8 without residual perturbation or model ensemble. (Unit: %)

VI-A3 ROC Curves for the Experiments on Different Datasets

Fig. 12 plots the ROC curves for the experiments on different datasets using Strategy II. These ROC curves again show that, when $\gamma$ is sufficiently large, the attack model's prediction is close to a random guess.

Figure 12: ROC curves for residual perturbation (8) (Strategy II) on different datasets: (a) CIFAR10 (En5ResNet8), (b) CIFAR100 (En5ResNet8), (c) IDC (En5ResNet8). (nc: noise coefficient)

VII More Experimental Details

We give the detailed construction of the velocity field $F({\bm{x}}(t),{\bm{W}}(t))$ in (3) and (7), which is used to generate Figs. 1 and 8, in Algorithm 1.

Algorithm 1 The expression of $F({\bm{x}}(t),{\bm{W}}(t))$
Input: image $={\bm{x}}(t)$; rows, cols, channels = image.shape.
Output: $F({\bm{x}}(t),{\bm{W}}(t))=$ noise-injected image (NII).
for k in range(channels) do
    for i in range(rows) do
        for j in range(cols) do
            offset(j) = int((j + 50.0*cos(2*pi*i/180)) % cols);
            offset(i) = int((i + 50.0*sin(2*pi*i/180)) % rows);
            NII[i, j, k] = image[(i + offset(j)) % rows, (j + offset(i)) % cols, k];
return NII
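For completeness, the following NumPy sketch gives our vectorized reading of Algorithm 1; the broadcasting of the row-dependent offsets and the order of the modulo and integer cast are our assumptions where the pseudocode is ambiguous.

import numpy as np

# Vectorized reading of Algorithm 1 (our interpretation of the pseudocode):
# F(x(t), W(t)) permutes pixels by fixed sinusoidal offsets of amplitude 50.
def velocity_field(image):
    rows, cols, _ = image.shape
    i = np.arange(rows)[:, None]                               # row index, broadcast over columns
    j = np.arange(cols)[None, :]                               # column index, broadcast over rows
    offset_j = ((j + 50.0 * np.cos(2 * np.pi * i / 180)) % cols).astype(int)
    offset_i = ((i + 50.0 * np.sin(2 * np.pi * i / 180)) % rows).astype(int)
    src_r = (i + offset_j) % rows                              # source row of each target pixel
    src_c = (j + offset_i) % cols                              # source column of each target pixel
    return image[src_r, src_c, :]                              # noise-injected image (NII)

# Usage: F = velocity_field(x_t) for an (rows, cols, channels) array x_t.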

VIII Concluding Remarks

We proposed residual perturbations, whose theoretical foundation lies in the theory of stochastic differential equations, to protect data privacy in deep learning. Theoretically, we proved that residual perturbation guarantees differential privacy and reduces the generalization gap. Numerically, we showed that residual perturbations effectively protect membership privacy on several benchmark datasets. The proposed residual perturbation provides a feasible avenue for developing machine learning models that are more private and accurate than the baseline approaches. Nevertheless, the current theoretical differential privacy budget is far from tight; developing analytical tools for establishing a tighter privacy budget is an interesting future direction.

References

  • [1] M.-C. Yuen, I. King, and K.-S. Leung, “A Survey of Crowdsourcing Systems,” in Proceedings of the IEEE International Conference on Social Computing (SocialCom 2011), 2011.
  • [2] W. Feng, Z. Yan, H. Zhang, K. Zeng, Y. Xiao, and Y. Hou, “A Survey on Security, Privacy and Trust in Mobile Crowdsourcing,” IEEE Internet of Things Journal, 2017.
  • [3] Y. Liu, K. Gadepalli, M. Norouzi, G. E. Dahl, and M. C. Stumpe, “Detecting Cancer Metastases on Gigapixel Pathology Images,” arXiv:1703.02442, 2017.
  • [4] M. Fredrikson, S. Jha, and T. Ristenpart, “Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures,” in 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS 2015), 2015.
  • [5] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” in 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18, IEEE, 2017.
  • [6] A. Salem, Y. Zhang, M. Humbert, P. Berrang, M. Fritz, and M. Backes, “Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models,” arXiv preprint arXiv:1806.01246, 2018.
  • [7] S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha, “Privacy risk in machine learning: Analyzing the connection to overfitting,” in 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 268–282, IEEE, 2018.
  • [8] A. Sablayrolles, M. Douze, C. Schmid, and H. Jégou, “Déjà vu: An empirical evaluation of the memorization properties of convnets,” arXiv preprint arXiv:1809.06396, 2018.
  • [9] M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page, and T. Ristenpart, “Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing,” in 23rd USENIX Security Symposium (USENIX Security 14), pp. 17–32, 2014.
  • [10] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Stealing machine learning models via prediction APIs,” in 25th USENIX Security Symposium (USENIX Security 16), pp. 601–618, 2016.
  • [11] N. Z. Gong and B. Liu, “Attribute inference attacks in online social networks,” ACM Transactions on Privacy and Security (TOPS), vol. 21, no. 1, pp. 1–30, 2018.
  • [12] M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322–1333, 2015.
  • [13] M. Al-Rubaie and J. M. Chang, “Reconstruction attacks against mobile-based continuous authentication systems in the cloud,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 12, pp. 2648–2663, 2016.
  • [14] N. Z. Gong and B. Liu, “You are who you know and how you behave: Attribute inference attacks via users’ social friends and behaviors,” in 25th USENIX Security Symposium (USENIX Security 16), pp. 979–995, 2016.
  • [15] X. Zheng, Z. Cai, and Y. Li, “Data linkage in smart internet of things systems: A consideration from a privacy perspective,” IEEE Communications Magazine, vol. 56, no. 9, pp. 55–61, 2018.
  • [16] Y. Lindell and B. Pinkas, “Privacy preserving data mining,” in Annual International Cryptology Conference, pp. 36–54, Springer, 2000.
  • [17] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar, “Can machine learning be secure?,” in Proceedings of the 2006 ACM Symposium on Information, computer and communications security, pp. 16–25, 2006.
  • [18] E. Hesamifard, H. Takabi, M. Ghasemi, and R. N. Wright, “Privacy-preserving machine learning as a service,” Proceedings on Privacy Enhancing Technologies, vol. 2018, no. 3, pp. 123–142, 2018.
  • [19] H. Bae, D. Jung, and S. Yoon, “Anomigan: Generative adversarial networks for anonymizing private medical data,” arXiv preprint arXiv:1901.11313, 2019.
  • [20] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of cryptography conference, pp. 265–284, Springer, 2006.
  • [21] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al., “Communication-efficient learning of deep networks from decentralized data,” arXiv preprint arXiv:1602.05629, 2016.
  • [22] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, 2016.
  • [23] L. Sweeney, “k-anonymity: A model for protecting privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557–570, 2002.
  • [24] K. El Emam and F. K. Dankar, “Protecting privacy using k-anonymity,” Journal of the American Medical Informatics Association, vol. 15, no. 5, pp. 627–637, 2008.
  • [25] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, “Differentially private empirical risk minimization,” Journal of Machine Learning Research, vol. 12, no. Mar, pp. 1069–1109, 2011.
  • [26] R. Bassily, A. Smith, and A. Thakurta, “Private empirical risk minimization: Efficient algorithms and tight error bounds,” in 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pp. 464–473, IEEE, 2014.
  • [27] R. Shokri and V. Shmatikov, “Privacy-Preserving Deep Learning,” in 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS 2015), 2015.
  • [28] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318, 2016.
  • [29] E. Bagdasaryan, O. Poursaeed, and V. Shmatikov, “Differential privacy has disparate impact on model accuracy,” in Advances in Neural Information Processing Systems, pp. 15453–15462, 2019.
  • [30] L. Zhu, Z. Liu, and S. Han, “Deep leakage from gradients,” in Advances in Neural Information Processing Systems, pp. 14747–14756, 2019.
  • [31] Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, and H. Qi, “Beyond inferring class representatives: User-level privacy leakage from federated learning,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pp. 2512–2520, IEEE, 2019.
  • [32] M. Abadi, A. Chu, I. Goodfellow, H. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep Learning with Differential Privacy,” in 23rd ACM Conference on Computer and Communications Security (CCS 2016), 2016.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • [34] N. Papernot, M. Abadi, L. Erlingsson, I. Goodfellow, and K. Talwar, “Semisupervised Knowledge Transfer for Deep Learning from Private Training Data,” in 5th International Conference on Learning Representation (ICLR 2017), 2017.
  • [35] N. Papernot, S. Song, I. Mironov, A. Raghunathan, and L. Erlingsson, “Scalable Private Learning with PATE,” in International Conference on Learning Representations (ICLR 2018), 2018.
  • [36] I. Mironov, “Rényi differential privacy,” in 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263–275, IEEE, 2017.
  • [37] Y. X. Wang, B. Balle, and S. Kasiviswanathan, “Subsampled Rényi Differential Privacy and Analytical Moments Accountant,” arXiv preprint arXiv:1808.00087, 2018.
  • [38] J. Dong, A. Roth, and W. J. Su, “Gaussian differential privacy,” arXiv preprint arXiv:1905.02383, 2019.
  • [39] B. Wang, Q. Gu, M. Boedihardjo, F. Barekat, and S. J. Osher, “Dp-lssgd: A stochastic optimization method to lift the utility in privacy-preserving erm,” arXiv preprint arXiv:1906.12056, 2019.
  • [40] Z. Liang, B. Wang, Q. Gu, S. Osher, and Y. Yao, “Exploring private federated learning with laplacian smoothing,” arXiv preprint arXiv:2005.00218, 2020.
  • [41] O. Kotevska, F. Alamudun, and C. Stanley, “Optimal balance of privacy and utility with differential privacy deep learning frameworks,” in 2021 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 425–430, 2021.
  • [42] M. Jagielski, J. Ullman, and A. Oprea, “Auditing differentially private machine learning: How private is private sgd?,” Advances in Neural Information Processing Systems, vol. 33, pp. 22205–22216, 2020.
  • [43] I. Mironov, K. Talwar, and L. Zhang, “Rényi differential privacy of the sampled Gaussian mechanism,” arXiv preprint arXiv:1908.10530, 2019.
  • [44] L. Yu, L. Liu, C. Pu, M. E. Gursoy, and S. Truex, “Differentially private model publishing for deep learning,” in 2019 IEEE symposium on security and privacy (SP), pp. 332–349, IEEE, 2019.
  • [45] H. Zhong and K. Bu, “Privacy-utility trade-off,” arXiv preprint arXiv:2204.12057, 2022.
  • [46] K. Mivule, C. Turner, and S.-Y. Ji, “Towards a differential privacy and utility preserving machine learning classifier,” Procedia Computer Science, vol. 12, pp. 176–181, 2012.
  • [47] M. Showkatbakhsh, C. Karakus, and S. Diggavi, “Privacy-utility trade-off of linear regression under random projections and additive noise,” in 2018 IEEE International Symposium on Information Theory (ISIT), pp. 186–190, IEEE, 2018.
  • [48] A. S. Rakin, Z. He, and D. Fan, “Parametric noise injection: Trainable randomness to improve deep neural network robustness against adversarial attack,” arXiv preprint arXiv:1811.09310, 2018.
  • [49] B. Wang, Z. Shi, and S. Osher, “Resnets ensemble via the feynman-kac formalism to improve natural and robust accuracies,” in Advances in Neural Information Processing Systems, pp. 1655–1665, 2019.
  • [50] X. Liu, T. Xiao, S. Si, Q. Cao, S. Kumar, and C.-J. Hsieh, “Neural SDE: Stabilizing neural ODE networks with stochastic noise,” arXiv preprint arXiv:1906.02355, 2019.
  • [51] D. Su, H. Zhang, H. Chen, J. Yi, P.-Y. Chen, and Y. Gao, “Is robustness the cost of accuracy?–a comprehensive study on the robustness of 18 deep image classification models,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 631–648, 2018.
  • [52] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry, “Robustness may be at odds with accuracy,” in International Conference on Learning Representations, 2019.
  • [53] D. Stutz, M. Hein, and B. Schiele, “Disentangling adversarial robustness and generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6976–6987, 2019.
  • [54] J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow, “Adversarial spheres,” arXiv preprint arXiv:1801.02774, 2018.
  • [55] Z. Zhao, D. Dua, and S. Singh, “Generating natural adversarial examples,” in International Conference on Learning Representations, 2018.
  • [56] A. Raghunathan, S. M. Xie, F. Yang, J. Duchi, and P. Liang, “Understanding and mitigating the tradeoff between robustness and accuracy,” in Proceedings of the 37th International Conference on Machine Learning (H. D. III and A. Singh, eds.), vol. 119 of Proceedings of Machine Learning Research, pp. 7909–7919, PMLR, 13–18 Jul 2020.
  • [57] Y. Carmon, A. Raghunathan, L. Schmidt, P. Liang, and J. C. Duchi, “Unlabeled data improves adversarial robustness,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 11192–11203, 2019.
  • [58] A. Najafi, S.-i. Maeda, M. Koyama, and T. Miyato, “Robustness to adversarial perturbations in learning from incomplete data,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 5541–5551, 2019.
  • [59] J.-B. Alayrac, J. Uesato, P.-S. Huang, A. Fawzi, R. Stanforth, and P. Kohli, “Are labels required for improving adversarial robustness?,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [60] D. J. Higham, “An algorithmic introduction to numerical simulation of stochastic differential equations,” SIAM review, vol. 43, no. 3, pp. 525–546, 2001.
  • [61] C. Dwork and V. Feldman, “Privacy-preserving prediction,” in Conference On Learning Theory, pp. 1693–1702, PMLR, 2018.
  • [62] R. Bassily, O. Thakkar, and A. Thakurta, “Model-agnostic private learning via stability,” arXiv preprint arXiv:1803.05101, 2018.
  • [63] B. Wu, S. Zhao, G. Sun, X. Zhang, Z. Su, C. Zeng, and Z. Liu, “P3sgd: Patient privacy preserving sgd for regularizing deep cnns in pathological image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2099–2108, 2019.
  • [64] P. L. Bartlett and S. Mendelson, “Rademacher and Gaussian complexities: Risk bounds and structural results,” Journal of Machine Learning Research, 2002.
  • [65] A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
  • [66] “Invasive ductal carcinoma (idc) histology image dataset,”
  • [67] B. Wu, C. Chen, S. Zhao, C. Chen, Y. Yao, G. Sun, L. Wang, X. Zhang, and J. Zhou, “Characterizing membership privacy in stochastic gradient langevin dynamics,” arXiv preprint arXiv:1910.02249, 2019.
  • [68] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [69] H. B. McMahan, G. Andrew, U. Erlingsson, S. Chien, I. Mironov, N. Papernot, and P. Kairouz, “A general approach to adding differential privacy to iterative training procedures,” arXiv preprint arXiv:1812.06210, 2018.
  • [70] M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes. Springer, 2002.
  • [71] F. C. Klebaner, Introduction to stochastic calculus with applications. Springer, 2005.
Wenqi Tao is a Ph.D. student in mathematics at Tsinghua University. His research focuses on graph-based machine learning and deep learning.
Huaming Ling is a Ph.D. student in mathematics at Tsinghua University. His research focuses on deep learning and stochastic optimization algorithms.
Zuoqiang Shi received Ph.D. in 2008 from Tsinghua University. He is an associate professor of mathematics at Tsinghua University.
Bao Wang received Ph.D. in 2016 from Michigan State University. He is an assistant professor of mathematics at the University of Utah. He is a recipient of the Chancellor’s award for postdoc research at UCLA. His research interests include scientific computing and deep learning.