Locally Differentially Private Distributed Deep Learning via Knowledge Distillation
Abstract
Deep learning often requires a large amount of data. In real-world applications, e.g., healthcare applications, the data collected by a single organization (e.g., hospital) is often limited, and the majority of massive and diverse data is often segregated across multiple organizations. This motivates distributed deep learning, where the data user would like to build DL models using the data segregated across multiple different data owners. However, this could lead to severe privacy concerns due to the sensitive nature of the data, making the data owners reluctant to participate. We propose LDP-DL, a privacy-preserving distributed deep learning framework via local differential privacy and knowledge distillation, where each data owner learns a teacher model using its own (local) private dataset, and the data user learns a student model to mimic the output of the ensemble of the teacher models. In the experimental evaluation, a comprehensive comparison has been made among our proposed approach (i.e., LDP-DL), DP-SGD, PATE and DP-FL, using three popular deep learning benchmark datasets (i.e., CIFAR10, MNIST and FashionMNIST). The experimental results show that LDP-DL consistently outperforms the other competitors in terms of privacy budget and model accuracy.
Index Terms:
Local differential privacy, distributed deep learning, knowledge distillation, active learning
1 Introduction
Deep learning (DL) has been shown to achieve extraordinary results in a variety of real-world applications, such as skin lesion analysis [1], active authentication [2], facial recognition [3, 4], botnet detection [5, 6] and community detection [7]. In the traditional DL environment, training data is held by a single organization in a centralized fashion, which executes the DL algorithms. In general, a DL model would be more accurate and robust if it has been trained with more massive and more diverse data. However, in certain real-world applications, e.g., healthcare applications, the data collected by a single organization (e.g., hospital) is often limited, and the majority of massive and diverse data is often segregated across multiple organizations. This motivates conducting DL in a distributed fashion, where the data user (e.g., researcher/organization) would like to build DL models using the data segregated across multiple different data owners (e.g., organizations). However, the data owners would be reluctant to participate in the data user’s distributed deep learning if the data user’s protocol cannot resolve their privacy concerns. For instance, it has been shown that private information could be inferred during the learning process [8], and the membership of certain training data could be traced back from the resulting trained model [9]. Hence, it is imperative to design an effective privacy-preserving distributed deep learning approach.
Designing an effective and efficient privacy-preserving distributed deep learning approach is highly challenging. To date, a few approaches [10, 11, 12] have been proposed for privacy-preserving (distributed) deep learning. Papernot et al. [10] propose PATE, a “teacher-student” paradigm for privacy-preserving deep learning, where each data owner learns a teacher model using its own (local) private dataset, and the data user aims to learn a student model using the unlabelled public data (but no direct access to the data owners’ private data) to mimic the output of the ensemble of the teacher models, i.e., the student learns to make the same predictions as the majority of the teachers. To ensure privacy, PATE [10] assumes a trusted aggregator to provide a differentially private query interface, where the data user could query the ensemble of the teacher models (from the data owners) using the unlabelled public data to obtain the labels for the training of the student model. However, a fully trusted aggregator barely exists in most real-world distributed deep learning scenarios. Chase et al. [11] propose a private collaborative neural network learning approach that combines secure multi-party computation (MPC), differential privacy (DP) and secret sharing. Since the MPC protocol is implemented via a garbled circuit whose size is subject to the number of parameters (i.e., the size of the gradient) of the neural network, it tends to be less efficient and not scalable while training larger neural networks. Also, in [11], using secret sharing requires at least two non-colluding honest data users, which might not be practical.
To address the challenges mentioned above, in this paper, we propose LDP-DL, a privacy-preserving distributed deep learning framework via local differential privacy [13] and knowledge distillation [14]. Our approach adopts the same “teacher-student” paradigm as described in PATE [10], where each data owner learns a teacher model using its own (local) private dataset, and the data user aims to learn a student model to mimic the output of the ensemble of the teacher models using the unlabelled public data. Knowledge distillation [14] is applied to the ensemble of the teacher models to enable faster and more accurate knowledge transfer to the student model, and to leverage the advantage of having multiple data owners (teacher models). To ensure privacy, our approach employs local differential privacy on the data owners’ side, i.e., on the query results of each teacher model, which does not require any trusted aggregator (compared to [10]). Since more queries to the teacher models tend to result in more privacy leakage (i.e., cost more privacy budget), we also design an active query sampling approach that actively selects a subset of the unlabelled public dataset for the data user to query from the data owners. In the experimental evaluation, a comprehensive comparison has been made among our proposed approach (i.e., LDP-DL), DP-SGD [15], PATE [10] and DP-FL [12], using three popular deep learning benchmark datasets (i.e., CIFAR10 [16], MNIST [17] and FashionMNIST [18]). The experimental results show that our LDP-DL framework consistently outperforms the other competitors in terms of privacy budget and model accuracy.
To summarize, our work has the following contributions:
We present a novel, effective and efficient privacy-preserving distributed deep learning framework using local differential privacy and knowledge distillation.
We present an active sampling approach to efficiently reduce the total number of queries from the data user to each data owner, so as to reduce the total cost of the privacy budget.
A comprehensive experimental evaluation among our approach, DP-SGD [15], PATE [10] and DP-FL [12] has been conducted on three benchmark datasets (i.e., CIFAR10 [16], MNIST [17] and FashionMNIST [18]). For the sake of reproducibility and convenience of future studies on privacy-preserving distributed deep learning, we have released our prototype implementation of LDP-DL, information regarding the experiment datasets and the code of our comparison experiments (https://github.com/nogrady/LDP-DL).
The rest of this paper is organized as follows: Section 2 presents the preliminaries including local differential privacy and knowledge distillation. Section 3 presents the problem statement and notations of privacy-preserving distributed deep learning, and describes our proposed framework. Section 4 presents the experimental evaluation. Section 5 presents the related literature review. Section 6 concludes.
2 Preliminaries
2.1 Local Differential Privacy
Differential Privacy (DP) [19, 20] aims to protect the privacy of individuals while releasing aggregated information about the database, which prevents membership inference attacks [9] by adding randomness to the algorithm outcome. Two databases $D$ and $D'$ are neighbors if they differ in only one entry. The formal definition is given as follows:
Definition 1.
$(\epsilon, \delta)$-Differential Privacy [19, 21]: A randomized mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private if for every two neighboring databases $D$, $D'$ and for any subset $S \subseteq Range(\mathcal{M})$:

$$\Pr[\mathcal{M}(D) \in S] \leq e^{\epsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta \qquad (1)$$

where $\Pr[\cdot]$ denotes the probability of an event, and $Range(\mathcal{M})$ denotes the set of all possible outputs of mechanism $\mathcal{M}$. Smaller values of $\epsilon$ and $\delta$ indicate that the output distributions on $D$ and $D'$ are closer, and thus stronger privacy protection. When $\delta = 0$, the mechanism satisfies $\epsilon$-DP, which provides a stronger privacy guarantee than $(\epsilon, \delta)$-DP with $\delta > 0$.
Local Differential Privacy (LDP) [13] is the local setting of DP, which does not require any trusted aggregator. In LDP, individuals (i.e., data owners) send their data to the data aggregator after privatizing the data by perturbation. Hence, these techniques provide plausible deniability for individuals (i.e., data owners). The data aggregator collects all the perturbed values and estimates statistics such as the frequency of each value in the population. The formal definition is given as follows:
Definition 2.
$\epsilon$-Local Differential Privacy [13]: A randomized mechanism $\mathcal{M}$ satisfies $\epsilon$-LDP if for any two inputs $v$, $v'$ and for any output $y \in Range(\mathcal{M})$:

$$\Pr[\mathcal{M}(v) = y] \leq e^{\epsilon} \cdot \Pr[\mathcal{M}(v') = y] \qquad (2)$$
Compared with DP, LDP provides more protection to the data owners. Rather than sending the private data directly to a trusted aggregator, the data owners could perturb their private data with a mechanism that satisfies $\epsilon$-LDP, and then release the perturbed data. As such, LDP provides a stronger privacy protection, since the aggregator (i.e., data user) only receives the perturbed data and the true values of the private data never leave the hands of the data owners.
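To make the guarantee concrete, the following minimal Python sketch shows binary randomized response, a classic mechanism that satisfies $\epsilon$-LDP; it only illustrates the definition and is not part of LDP-DL.

```python
import math
import random

def randomized_response(bit: int, epsilon: float) -> int:
    """Report a private bit under epsilon-LDP via binary randomized response.

    The true bit is reported with probability e^eps / (e^eps + 1) and flipped
    otherwise, so the likelihood ratio of any output between two different
    inputs is bounded by e^eps.
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if random.random() < p_truth else 1 - bit

# Each data owner perturbs locally; only the perturbed bits are released.
reports = [randomized_response(b, epsilon=1.0) for b in (0, 1, 1, 0, 1)]
```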
2.2 Knowledge Distillation
Knowledge Distillation (KD) [24, 14, 25] was originally designed for deep neural network (DNN) compression and knowledge transfer. KD usually considers a “teacher-student” paradigm, where the teacher model is a DNN (or an ensemble of a set of DNNs) that performs well on a given dataset, and the student model is another neural network that may or may not have the same architecture as the teacher model, but aims to mimic the performance of the teacher model(s) using another public dataset. Hinton et al. [14] propose an end-to-end knowledge distillation framework with a loss function, namely the Distillation Loss, where the output of the teacher model is used as the soft target (i.e., soft label) for the student model, and the overall loss function is presented below:
$$\mathcal{L} = \alpha \cdot \mathcal{L}_{CE}\big(y, \sigma(z_s)\big) + (1 - \alpha) \cdot \mathcal{L}_{CE}\big(\sigma(z_t / T), \sigma(z_s / T)\big) \qquad (3)$$

where $y$ is the true label of data $x$, $z_s$ is the output (logits) of the student model, $z_t$ is the output of the teacher model, $\sigma(z_s / T)$ is the softened label of the student at temperature $T$, $\sigma(z_t / T)$ is the softened label of the teacher at temperature $T$, and usually $T > 1$.
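For reference, the following PyTorch-style sketch implements one common form of the distillation loss in Eq. (3); the weighting factor alpha and the $T^2$ scaling of the soft term follow the usual convention of [14] and are assumptions rather than the exact configuration used in this paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      true_labels: torch.Tensor,
                      T: float = 4.0, alpha: float = 0.5) -> torch.Tensor:
    """Common form of Eq. (3): hard-label cross-entropy plus a softened
    divergence term between teacher and student predictions at temperature T."""
    # Hard-label term on the student's normal (T = 1) predictions.
    hard_loss = F.cross_entropy(student_logits, true_labels)
    # Soft-label term: match the teacher's softened distribution.
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                         F.softmax(teacher_logits / T, dim=1),
                         reduction="batchmean") * (T * T)
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```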
3 Methodology
3.1 Problem Statement
In this work, we aim to develop a privacy-preserving distributed deep learning framework. As shown in Fig. 1, we consider the following problem: Given $N$ data owners, each data owner $i$ ($i = 1, \dots, N$) holds a set of private samples $D_i = \{(x_j^{(i)}, y_j^{(i)})\}_{j=1}^{n_i}$, where $x_j^{(i)}$ is a private sample and $y_j^{(i)}$ is the label associated with sample $x_j^{(i)}$; the untrustworthy data user would like to learn a DNN model with the help of all the data owners, and a public dataset $P$ that comes from the same distribution (i.e., the same problem) as the data owners’ private datasets, but does not have the label information. In our problem setting, each data owner has two privacy requirements: (i) the value of the individual private data should not be shared with the data user, and (ii) any inference of the individual private data from the intermediate communication messages and the data user’s DNN model should be prevented.
[Fig. 1: Overview of the problem setting of privacy-preserving distributed deep learning.]
3.2 Threat Model
In our problem, we assume (i) the data user is untrustworthy, and (ii) the data owners are honest-but-curious, where each data owner follows the protocol honestly, but tries to use the protocol transcripts to extract new information. We assume the value of the individual private data is what the adversaries would like to acquire during the whole protocol. Hence, the adversaries could be the data user, the participating data owners or an outside attacker that has access to the intermediate communication messages or the data user’s DNN model. We also assume that the adversaries may have arbitrary background knowledge and might collude with each other. Our work aims to protect the privacy of each data owner’s individual private data while providing reasonable utility to the data user’s DNN model. Since we assume that the data user is untrustworthy, it is in the data user’s own interest whether to correctly execute the algorithm or not. However, while using our proposed framework, if the untrustworthy data user behaves dishonestly, it will not compromise the data owners’ privacy, but will only hurt the utility of the data user’s DNN model. Furthermore, since the data owners are assumed to be honest-but-curious, poisoning [26], backdoor [27, 28] or trojan [29] attacks (e.g., data owners actively and maliciously modifying their inputs to influence the performance of the data user’s DNN model) are beyond the scope of this work.
[Fig. 2: Overview of the proposed LDP-DL framework.]
3.3 Privacy-preserving Distributed Deep Learning
Our proposed privacy-preserving distributed learning framework, as shown in Fig. 2, consists of four stages that work synergistically between the data owners and the data user. Alg. 1 shows the pseudo-code of our algorithm. Firstly, each data owner trains a teacher model using his/her own private dataset (i.e., lines 1-2), and the data user initializes the student model with random or pretrained (i.e., ImageNet [30]) parameters (i.e., line 3). The student model and the teacher models do not have to use the same DNN architecture. Secondly, in each iteration, the data user efficiently selects a subset of the available public dataset (i.e., lines 5-6) that could better improve the performance of the current student model in the upcoming training, using our active query sampling approach. The active query sampling component aims to reduce the total number of queries to each teacher model, thus saving the privacy budget. Thirdly, the data user uses the selected subset of the public data (no labels) to query the teacher model of each corresponding selected data owner to obtain the “knowledge” (the data’s soft label), and all the query results are sanitised by our local differential privacy techniques before being sent back to the data user (i.e., lines 7-15). Since the data user might select a huge number of query samples, it is not realistic to use all the selected samples to query all the data owners, which would cost much privacy budget and communication, but might not help much with the utility (per our experimental results). Therefore, we predefine a parameter $t$ to control the upper bound of the number of data owners that each selected public sample can query (lines 10-11). Last but not least, the data user aggregates the received sanitised query results (i.e., the distilled knowledge) of each data sample, and leverages the knowledge distillation technique (using the subset of the public data and the distilled knowledge) to update/train the student model. The details of the three most important components in our framework (i.e., private query from teacher models, building the student model via knowledge transfer, and active query sampling) are described in the subsequent sections.
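Since Alg. 1 is not reproduced here, the following self-contained Python sketch illustrates one query round covering stages two to four; the interfaces (teachers as callables returning soft labels, perturb as an LDP mechanism such as the Piecewise Mechanism in Section 3.4) and all names are illustrative assumptions rather than the paper's actual implementation.

```python
import random
from typing import Callable, List, Sequence

def ldp_dl_query_round(
    teachers: Sequence[Callable],   # each maps a public sample -> soft label (list of floats)
    perturb: Callable,              # LDP mechanism: (soft_label, epsilon_q) -> perturbed soft label
    queries: Sequence,              # actively selected public samples (Section 3.6)
    t_max: int,                     # upper bound t on teachers queried per sample
    epsilon_q: float,               # per-query privacy budget (Section 3.4.1)
) -> List[List[float]]:
    """One query round of LDP-DL: each selected public sample queries at most
    t_max randomly chosen teachers, every answer is perturbed locally before
    leaving the data owner, and the data user averages the noisy soft labels."""
    aggregated = []
    for x in queries:
        selected = random.sample(list(teachers), min(t_max, len(teachers)))
        noisy = [perturb(teacher(x), epsilon_q) for teacher in selected]
        dim = len(noisy[0])
        aggregated.append([sum(v[j] for v in noisy) / len(noisy) for j in range(dim)])
    return aggregated  # used as distilled knowledge to train the student model
```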
3.4 Private Query from Teacher Models
In our proposed algorithm, upon receiving the query data from the data user, each data owner evaluates it using his/her own teacher model and gets the data’s soft label. Each data owner perturbs the query data’s soft label (using LDP techniques) and then sends the perturbed value to the data user to transfer the distilled knowledge. The data user then aggregates the perturbed query results of each data sample to obtain the aggregated noisy soft label (i.e., averaged over the query results sent by all the selected data owners) of each data sample. As such, we could formulate this as a locally differentially private mean estimation problem, where we would like to protect the data owners’ private data from inference attacks given the perturbed query results, and ensure the aggregated noisy soft label is as close as possible to the real value. As described in Section 2.1, while applying LDP, the adversaries could not distinguish the true value from a perturbed value with high confidence (adjusted by the privacy budget $\epsilon$). To protect the privacy of the data owners’ private data, a randomization method which satisfies $\epsilon$-LDP is adopted. On the other hand, the performance of the aggregation of the perturbed data could be maintained within an error bound [23, 31], which provides us a way to control the utility of the distilled knowledge. Furthermore, since all the soft labels are multidimensional numerical values, different from the hard labels which are categorical values, we cannot directly adopt the encoding-based LDP techniques [23].
To achieve our goal, we adopt the Piecewise Mechanism (PM) [13], which is designed to perturb multidimensional numerical values and has an asymptotically optimal error bound for the mean estimation problem. Alg. 2 shows the PM for one-dimensional numerical data (i.e., PM-ONE). To simplify our explanation, in this section, the value to be perturbed (i.e., the soft label of one class) is denoted as $v$, $v \in [-1, 1]$. PM-ONE (Alg. 2) takes a one-dimensional numerical value $v$ as the input, and returns its perturbed value $v^* \in [-C, C]$, where $C = \frac{e^{\epsilon/2}+1}{e^{\epsilon/2}-1}$, and $v^*$ has a relatively high probability (i.e., $\frac{e^{\epsilon/2}}{e^{\epsilon/2}+1}$) of falling close to $v$. As shown in [13], while applying PM-ONE over $n$ users, the error of the mean estimation is $O\big(\frac{\sqrt{\log(1/\beta)}}{\epsilon \sqrt{n}}\big)$ with at least $1-\beta$ probability, which is an asymptotically optimal error bound.
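Below is a minimal Python sketch of PM-ONE following the description of the mechanism in [13]; it assumes the input has already been scaled into $[-1, 1]$.

```python
import math
import random

def pm_one(v: float, epsilon: float) -> float:
    """Piecewise Mechanism for a single value v in [-1, 1] (sketch of Alg. 2).

    Returns an epsilon-LDP perturbed value in [-C, C] that is an unbiased
    estimate of v."""
    assert -1.0 <= v <= 1.0
    e_half = math.exp(epsilon / 2.0)
    C = (e_half + 1.0) / (e_half - 1.0)
    # The "high-probability" interval [l, r] depends on v.
    l = (C + 1.0) / 2.0 * v - (C - 1.0) / 2.0
    r = l + C - 1.0
    if random.random() < e_half / (e_half + 1.0):
        # With high probability, report a value close to v.
        return random.uniform(l, r)
    # Otherwise, report a value from the remaining region [-C, l) U (r, C].
    left_len = l - (-C)
    right_len = C - r
    u = random.uniform(0.0, left_len + right_len)
    return -C + u if u < left_len else r + (u - left_len)
```

By construction, the probability density ratio between any two inputs is bounded by $e^{\epsilon}$, and the expected output equals the input, so averaging many perturbed reports yields an unbiased mean estimate.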
Alg. 3 shows the PM for multidimensional numerical data (i.e., PM), where for each data sample of $d$ dimensions, it randomly selects $k$ (i.e., $k = \max(1, \min(d, \lfloor \epsilon/2.5 \rfloor))$) attributes to perturb. Alg. 3 is designed to reduce the amount of noise in the task of mean estimation for multidimensional numerical data. While using Alg. 2 to perturb all $d$ attributes, each attribute evenly shares a privacy budget of $\epsilon/d$, and the total amount of noise in the mean estimation is $O\big(\frac{d\sqrt{\log(d/\beta)}}{\epsilon \sqrt{n}}\big)$, which is super-linear to $d$. However, it has been shown [13] that while using Alg. 3, the error of the mean estimation is $O\big(\frac{\sqrt{d \log(d/\beta)}}{\epsilon \sqrt{n}}\big)$ with at least $1-\beta$ probability, which is still an asymptotically optimal error bound.
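A sketch of the multidimensional version (Alg. 3) is shown below; the attribute-sampling rule $k = \max(1, \min(d, \lfloor \epsilon/2.5 \rfloor))$ and the $d/k$ scaling follow [13], and perturb_one stands for a one-dimensional $\epsilon$-LDP mechanism such as the PM-ONE sketch above.

```python
import math
import random
from typing import Callable, List, Sequence

def pm_multi(values: Sequence[float], epsilon: float,
             perturb_one: Callable[[float, float], float]) -> List[float]:
    """Piecewise Mechanism for a d-dimensional vector in [-1, 1]^d (sketch of Alg. 3).

    Only k randomly chosen attributes are perturbed, each with budget epsilon/k,
    and scaled by d/k so that the aggregated mean stays unbiased; the remaining
    attributes are reported as 0.
    """
    d = len(values)
    k = max(1, min(d, int(math.floor(epsilon / 2.5))))
    perturbed = [0.0] * d
    for j in random.sample(range(d), k):
        perturbed[j] = (d / k) * perturb_one(values[j], epsilon / k)
    return perturbed
```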
3.4.1 Privacy Budget Analysis of LDP-DL
As described in Section 3.1, in our proposed framework, there are $|P|$ public data samples and $N$ data owners in total. If each public data sample could query the teacher models at most $t$ times (Section 3.3), each teacher model will be queried at most $\frac{|P| \cdot t}{N}$ times on average. Suppose for each private query, the perturbed query result satisfies $\epsilon_q$-LDP. According to the composition property of LDP [32], to meet the requirement of $\epsilon$-LDP for each data owner’s private data, we need to satisfy $\frac{|P| \cdot t}{N} \cdot \epsilon_q \leq \epsilon$. Since each data owner would participate in the private query at most $\frac{|P| \cdot t}{N}$ times on average, it requires $\epsilon_q \leq \frac{N\epsilon}{|P| \cdot t}$. Then, the noise of each query result becomes $O\big(\frac{|P| \cdot t}{N\epsilon}\big)$, which is linear to $\frac{|P| \cdot t}{N}$. Since each public data sample would be queried at most $t$ times, the noise of the mean estimation of each public data sample’s soft label (i.e., distilled knowledge) would be $O\big(\frac{|P| \cdot \sqrt{t}}{N\epsilon}\big)$. Since $\epsilon$ is the privacy budget that should be controlled by the data owner’s preference, and $t$ has a direct influence on the precision of each query result’s mean estimation that should be decided by the data user’s empirical study, to reduce the overall noise, it is better to increase the number of participating data owners (i.e., $N$) or decrease the size of the set of public data (i.e., $|P|$) utilized for private query (as described in Section 3.6).
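As a worked instance of this accounting (the numbers below are purely illustrative, not the paper's settings), the per-query budget $\epsilon_q$ implied by sequential composition can be computed as follows:

```python
def per_query_budget(epsilon_total: float, num_public: int,
                     t_max: int, num_owners: int) -> float:
    """Per-query LDP budget so that, with each teacher answering about
    num_public * t_max / num_owners queries on average, sequential
    composition stays within epsilon_total for every data owner."""
    avg_queries_per_owner = num_public * t_max / num_owners
    return epsilon_total / avg_queries_per_owner

# e.g., 1,000 public samples, t = 5 queries each, 500 data owners, total budget 8:
# each teacher answers ~10 queries on average, so each query gets epsilon_q = 0.8.
print(per_query_budget(8.0, 1000, 5, 500))
```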
3.5 Build Student Model via Knowledge Transfer
After the data user receives all the query results from the data owners, our proposed framework uses the knowledge distillation technique to transfer the knowledge learned from the queried teacher models to the student model. Our usage of knowledge distillation is slightly different from its conventional usage (as described in Section 2.2), where (i) we only focus on the knowledge transfer perspective of KD, but not the model compression, thus the student model and all the teacher models could use different and arbitrary DNN architectures; and (ii) in our case, the public dataset does not have the true label information, thus we cannot directly use Equation (3). Hence, in our framework, the student model is trained to minimize the gap between its own predicted soft label and the aggregated soft label from the teacher models, i.e., the knowledge distillation loss:
$$\mathcal{L}_{KD} = \mathcal{L}_{CE}\big(\sigma(z_s / T_1), \tilde{y}\big) + \mathcal{L}_{CE}\big(\sigma(z_s / T_2), \tilde{y}\big) \qquad (4)$$
where $\sigma(z_s / T)$ is the soft label predicted by the student model, $\tilde{y}$ is the aggregated soft label from the teacher models, $T$ is the temperature parameter, and $\sigma(\cdot)$ denotes the softmax function. The temperature parameter is usually set to 1. While $T > 1$, the probabilities of the classes whose normal values are near zero would be increased. To better distill the knowledge to the student, two temperature values are adopted in our KD loss (i.e., $T_1 = 1$ and $T_2 > 1$).
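A possible PyTorch-style realization of this loss is sketched below; it matches the aggregated (noisy) soft labels at two temperatures with no hard-label term, consistent with the description above, but the exact form of Eq. (4), the $T^2$ scaling and the renormalization of the noisy soft labels are our assumptions.

```python
import torch
import torch.nn.functional as F

def student_kd_loss(student_logits: torch.Tensor,
                    aggregated_soft_labels: torch.Tensor,
                    T: float = 4.0) -> torch.Tensor:
    """Sketch of Eq. (4): match the aggregated (noisy) teacher soft labels at
    two temperatures, T1 = 1 and T2 = T > 1; no hard labels are available."""
    target = aggregated_soft_labels.clamp(min=1e-8)       # LDP noise may push values below 0
    target = target / target.sum(dim=1, keepdim=True)     # renormalize into a distribution
    loss_t1 = F.kl_div(F.log_softmax(student_logits, dim=1),
                       target, reduction="batchmean")
    loss_t2 = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                       target, reduction="batchmean") * (T * T)
    return loss_t1 + loss_t2
```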
[Fig. 3: Illustration of the active query sampling process.]
3.6 Active Query Sampling
As analyzed in Section 3.4.1, one direction to reduce the overall noise of the soft label estimation (and thus enhance the overall performance) is to decrease the size of the set of public data (i.e., $|P|$) utilized for private query. In this section, we present an active query sampling approach that could actively and adaptively choose samples from the public dataset batch-by-batch to query the teacher models. As shown in Fig. 3, we adopt a “least confidence” strategy [33], where in each iteration we attempt to select a set of query samples from the public dataset on which the student model shows the “least confidence”. To be specific, our active query sampling follows the procedure described below:
1. Select an initial subset $P_0$ of unlabeled public data uniformly at random from the public dataset $P$. Update $P \leftarrow P \setminus P_0$.

2. Use $P_0$ to query the teacher models, and use the distilled knowledge to train the initial student model $S_0$.

3. In iteration $i$, for each available public data sample $x \in P$, evaluate it on the current student model $S_{i-1}$. Let $p_c(x)$ denote the probability of $x$ belonging to class $c$ predicted by $S_{i-1}$, and suppose there are $C$ classes in total. Let $p_{\max}(x) = \max_{c \in \{1, \dots, C\}} p_c(x)$ be the largest value (posterior probability) among the predicted class probabilities of $x$. Then, repeat the procedure below for a predefined number of times (i.e., the query batch size) to select the query samples $P_i$ (initialized as $\emptyset$); a code sketch of this selection step is given after this list:

$$x^* = \operatorname*{arg\,min}_{x \in P} p_{\max}(x), \quad P_i \leftarrow P_i \cup \{x^*\}, \quad P \leftarrow P \setminus \{x^*\} \qquad (5)$$

Then, use $P_i$ to query the teacher models, and use the distilled knowledge to continue training the student model $S_{i-1}$ to obtain $S_i$.

4. Repeat 3), until the student model meets the performance requirement or no more public data is available (i.e., $P = \emptyset$).
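As referenced in step 3), a PyTorch sketch of the least-confidence selection in Eq. (5) is given below; selecting the batch of samples with the smallest maximum posterior is equivalent to repeating the arg-min selection, and the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def least_confidence_select(student: torch.nn.Module,
                            public_batch: torch.Tensor,
                            num_queries: int) -> torch.Tensor:
    """Return indices of the num_queries public samples on which the current
    student model is least confident (smallest maximum posterior probability)."""
    student.eval()
    probs = F.softmax(student(public_batch), dim=1)        # (n, num_classes)
    max_posterior, _ = probs.max(dim=1)                    # confidence per sample
    _, indices = torch.topk(max_posterior, k=num_queries, largest=False)
    return indices                                         # indices into public_batch
```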
4 Experimental Evaluation
In this section, we evaluate the effectiveness of our proposed method, LDP-DL, on three popular image benchmark datasets (i.e., CIFAR-10 [16], MNIST [17] and Fashion-MNIST [18]) with three basic LDP mechanisms (i.e., the Piecewise mechanism [13], Duchi’s mechanism [22] and the Laplace mechanism [34]). We also evaluate the performance of the Active Query Sampling (AQS) component of our approach. Then, we compare LDP-DL with three state-of-the-art approaches, i.e., DP-SGD [15], PATE [10] and DP-FL [12].
4.1 Experiment Environment
All the experiments were conducted on a PC with an Intel Core i9-7980XE processor, 128GB RAM and an Nvidia GeForce GTX 1080Ti graphics card, running the 64-bit Ubuntu 18.04 LTS operating system. All the experiments were implemented using Python 3.7.
4.2 Experiment Datasets
Three popular benchmark image datasets are utilized to conduct our experimental evaluation:
1. CIFAR-10 [16]: CIFAR-10 (Canadian Institute For Advanced Research) is a widely used benchmark dataset to evaluate deep learning algorithms. This dataset is a subset of the 80 million tiny images dataset. It contains 60,000 32 x 32 color photographs of objects in 10 different classes, such as frogs, birds, cats, ships, etc. For each class, there are 6,000 images in total, where the testing set includes exactly 1,000 images randomly selected from each class, and the training set contains the remaining 5,000 images in a random order.

2. MNIST [17]: MNIST (Modified National Institute of Standards and Technology database) is a collection of handwritten digits that is commonly used in the field of image processing and machine learning. This dataset is created by “re-mixing” samples from the NIST dataset. It contains 70,000 28 x 28 grayscale images in 10 different classes, i.e., the 10 digits from 0 to 9. The handwritten digits have been size-normalized and centered in each image. The 70,000 samples have been split into 60,000 training samples and 10,000 testing samples.

3. Fashion-MNIST [18]: This dataset is a collection of Zalando’s article images, which is created as a drop-in (more challenging) replacement for MNIST to better represent modern computer vision tasks. It contains 70,000 28 x 28 grayscale images in 10 different classes. Each class is a kind of clothing, such as T-shirt, dress, trouser, sneaker, etc. There are 60,000 training samples and 10,000 testing samples.
4.3 Experimental Setup
In our experiments, we assume the data owners’ teacher models use ResNet50, and the data user’s student model uses ResNet18. For each experiment dataset, we assume each data owner has 4,000 private samples to train his/her teacher model (i.e., ResNet50). The data user queries 200 public samples in each iteration of the Active Query Sampling (AQS) process, and performs 5 query iterations in total to train his/her student model (i.e., ResNet18). Each teacher and student model has been trained for 20 epochs with a batch size of 32.
As discussed in Section 3.4.1, there are three major parameters that we would like to tune and evaluate in our approach: the privacy budget ($\epsilon$), the number of queries of each public sample ($t$) and the total number of participating data owners ($N$). We use various combinations of $\epsilon$, $t$ and $N$ to evaluate our approach, where $\epsilon$ is varied from 1 to 10, and $t$ and $N$ are each varied over a range of values.
Then, we evaluate the performance of our approach with and without the Active Query Sampling (AQS) process. Also, we evaluate the performance of our approach’s private query using three basic LDP mechanisms, including the Piecewise mechanism [13], Duchi’s mechanism [22] and the Laplace mechanism [34]. All the experiments have been repeated 10 times and we report the average results.
[Fig. 4, Fig. 5, Fig. 6: Accuracy of LDP-DL on the three benchmark datasets under different privacy budgets ($\epsilon$) (panel (a)), numbers of queries ($t$) (panel (b)) and numbers of data owners ($N$) (panel (c)), for the Piecewise, Duchi’s and Laplace mechanisms, with and without AQS.]
4.4 Effectiveness Analysis
In this section, we evaluate the effectiveness of our approach on three benchmark image datasets with different parameters and basic LDP mechanisms (as shown in Fig. 4, Fig. 5 and Fig. 6). We can observe that the Piecewise mechanism always performs better than Duchi’s mechanism and the Laplace mechanism in our framework. Moreover, the results with our AQS process consistently outperform the ones without our AQS process, which demonstrates that our proposed AQS could dramatically save the privacy budget and prevent the privacy budget from exploding during privacy-preserving distributed deep learning training.
4.4.1 Effectiveness Analysis of Different Parameters
The performance of our proposed LDP-DL framework is affected by multiple parameters ($\epsilon$, $t$ and $N$). To evaluate the effect of a single parameter, as shown in Fig. 4, Fig. 5 and Fig. 6, the other parameters are set to constant values. From Fig. 4, Fig. 5 and Fig. 6, we observe that:
1. The total privacy budget ($\epsilon$): This parameter controls the noise scale of the private queries from each data owner’s teacher model. We investigate $\epsilon$ from 1 to 10. Fig. 4(a), Fig. 5(a) and Fig. 6(a) show the impact of the privacy budget on the results of our approach. As the privacy budget increases, the accuracy increases, since less noise would be added to the data owners’ perturbed distilled knowledge. In LDP mechanisms, a greater $\epsilon$ results in smaller-scaled noise, and vice versa. While querying multiple data owners’ teacher models, the aggregation of the query results with smaller-scaled noise gives more information to the data user’s student model. Namely, the aggregated value is closer to the actual value, which benefits the training of the data user’s student model. As such, a greater $\epsilon$ results in better accuracy.
2. The number of queries ($t$): This parameter controls the number of data owners’ teacher models to be queried for each unlabelled public data sample. Fig. 4(b), Fig. 5(b) and Fig. 6(b) illustrate the results under different numbers of queries of each public data sample. As the number of queries increases, the accuracy decreases, which is in line with our analysis in Section 3.4.1, because increasing the number of queries of each public data sample actually results in more noise being added to each query result while maintaining the same total privacy budget. Specifically, as shown in our results, as $t$ increases, the accuracy of the data user’s student model only slightly declines while utilizing either the Piecewise mechanism or Duchi’s mechanism. However, for the Laplace mechanism, the accuracy decreases significantly as $t$ increases, because different LDP mechanisms result in different noise scales in terms of $t$. Compared with the Laplace mechanism, using the Piecewise mechanism or Duchi’s mechanism decreases the influence of $t$ on the performance of our LDP-DL framework.
3. The number of data owners ($N$): This parameter indicates the number of data owners participating in our LDP-DL framework. As analyzed in Section 3.4.1, increasing the number of participating data owners can reduce the overall noise of the aggregated information. Fig. 4(c), Fig. 5(c) and Fig. 6(c) show the influence of the number of participating data owners on the performance of our approach. As more data owners participate in our framework, the accuracy tends to increase. Since the total privacy budget is controlled by each data owner’s preference, to obtain an appropriate performance of the data user’s student model, the number of participating data owners in our LDP-DL framework should be set to a sufficient value.
Table I: Results of different approaches under the same privacy budgets.

| Approach | CIFAR10 [16] Accuracy | Privacy Budget | MNIST [17] Accuracy | Privacy Budget | FashionMNIST [18] Accuracy | Privacy Budget |
|---|---|---|---|---|---|---|
| LDP-DL | 77.5% | 5 | 98.1% | 5 | 83.4% | 5 |
| LDP-DL | 79.7% | 8 | 98.8% | 8 | 85.7% | 8 |
| DP-SGD [15] | 73.0% | 8 | 97.0% | 8 | - | - |
| PATE [10] | 73.6% | 5 | 97.7% | 5 | 81.5% | 5 |
| PATE [10] | 76.0% | 8 | 98.2% | 8 | 84.7% | 8 |
| DP-FL [12] | 75.9% | 5 | 96.4% | 5 | 82.6% | 5 |
| DP-FL [12] | 78.7% | 8 | 97.2% | 8 | 83.6% | 8 |
4.5 In Comparison with Existing Approaches
In this section, we have compared our LDP-DL framework with 3 state-of-the-art approaches: DP-SGD [15], PATE [10] and DP-FL [12].
• DP-SGD [15]: The Differentially Private Stochastic Gradient Descent (DP-SGD) algorithm trains the deep neural network with differential privacy under a centralized setting. It utilizes the Gaussian mechanism on random subsets of examples to produce averaged noisy gradients for model optimization. This approach does not have publicly available code. Therefore, we directly refer to the results reported in the original paper.

• PATE [10]: Private Aggregation of Teacher Ensembles (PATE) proposes a distributed teacher-student framework. The privacy guarantee comes from the perturbation of the teachers’ voting aggregation. The ensemble decision based on the noisy voting provides the labels of the student model’s training data. The student model is trained via semi-supervised learning with GANs. PATE [10] is evaluated using the code provided by the paper’s authors.

• DP-FL [12]: Differentially Private Federated Learning (DP-FL) describes a federated optimization algorithm in a private manner. Instead of directly averaging the distributed clients’ model updates, an alternative approach that uses random sampling and the Gaussian mechanism on the sum of the clients’ updates is introduced to approximate the averaging. The curator collects the noisy updates to optimize the central server model. Since the original paper aims at protecting privacy at the client level but not at the sample level, DP-FL [12] is evaluated using the code published by the authors with some minor changes to enable privacy preservation at the sample level.
Table I shows the results of different approaches under the same privacy budgets. The results of LDP-DL are evaluated under fixed settings of $t$ and $N$. For the other approaches, we strictly follow the settings mentioned in the corresponding papers and keep the common parameters (such as the number of data owners (clients) and the privacy budgets) at the same level. Under the same level of privacy budgets, we can observe that LDP-DL consistently outperforms the other competitors. The improvement can be attributed to the knowledge distillation and the active query sampling. Knowledge distillation leverages richer information while transferring the knowledge from the teacher models to a student model. Meanwhile, active query sampling efficiently reduces the total number of queries from the data user (i.e., the student model) to the data owners (i.e., the teacher models). As such, the total cost of the privacy budget is reduced dramatically.
5 Related Work
Local Differential Privacy (LDP) has been proposed [22] to remove the trusted curator required by centralized differential privacy. LDP also gives the data owners more control over the information that leaves their hands, under a stricter and more realistic privacy model. LDP for statistical information collection and estimation has been well studied in the past decades [35, 36, 23, 37, 38, 39, 40, 41].
Recently, more works have proposed to apply DP or LDP in data mining and machine learning applications, such as clustering [42], Bayesian inference [43], frequent itemset mining [44] and probability distribution estimation [45, 46, 47, 44]. However, only a few recent works aim to use LDP in deep learning. For instance, Abadi et al. [15] propose to train deep neural networks via stochastic gradient descent with differential privacy under a centralized setting. However, it not only requires an impractical trusted third party to serve as the trusted curator, but also has the privacy budget exploding issue (i.e., it requires an impractically huge privacy budget to train a meaningful deep learning model).
Papernot et al. [10] propose PATE, a “teacher-student” paradigm for privacy-preserving deep learning, where each data owner learns a teacher model using its own (local) private dataset, and the data user aims to learn a student model using the unlabelled public data (but no direct access to the data owners’ private data) to mimic the output of the ensemble of the teacher models, i.e., the student learns to make the same predictions as the majority of the teachers. To ensure privacy, PATE [10] assumes a trusted aggregator to provide a differentially private query interface, where the data user could query the ensemble of the teacher models (from the data owners) using the unlabelled public data to obtain the labels for the training of the student model. However, a fully trusted aggregator barely exists in most real-world distributed deep learning scenarios. Chase et al. [11] propose a private collaborative neural network learning approach that combines secure multi-party computation (MPC), differential privacy (DP) and secret sharing. Since the MPC protocol is implemented via a garbled circuit whose size is subject to the number of parameters (i.e., the size of the gradient) of the neural network, it tends to be less efficient and not scalable while training larger neural networks. Also, in [11], using secret sharing requires at least two non-colluding honest data users, which might not be practical. In [12], the authors present a federated optimization algorithm in a private manner. Instead of directly averaging the distributed clients’ model updates, an alternative approach that uses random sampling and the Gaussian mechanism on the sum of the clients’ updates is introduced to approximate the averaging. The curator collects the noisy updates to optimize the central server model. However, this approach only focuses on training small deep learning models (i.e., training only one or very few iterations) and easier datasets (i.e., it is not tested on any image datasets).
Our approach aims to solve the challenges left by the previous approaches. The differences can be summarized in three aspects: (i) our approach aims to enable training large deep neural networks (e.g., ResNet) on popular benchmark image datasets (e.g., CIFAR-10); (ii) our approach designs a proactive mechanism (i.e., the active query sampling) to efficiently reduce the overall privacy budget and prevent the privacy budget from exploding while training large deep neural networks; (iii) our approach is not based on federated learning, thus it does not have to satisfy the requirements of performing federated learning (e.g., clients being online around the same time period).
6 Conclusion
In this paper, we proposed LDP-DL, a novel, effective and efficient privacy-preserving distributed deep learning framework using local differential privacy and knowledge distillation. We also presented an active sampling approach to efficiently reduce the total number of queries from the data user to each data owner, so as to reduce the total cost of the privacy budget. In the experimental evaluation, a comprehensive comparison has been made among our algorithm and three state-of-the-art privacy-preserving deep learning approaches. Extensive experiments have been conducted on three benchmark image datasets. Our results show that LDP-DL consistently outperforms the other competitors in terms of privacy budget and model accuracy.
References
- [1] F. Perez, S. Avila, and E. Valle, “Solo or ensemble? choosing a cnn architecture for melanoma classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
- [2] P.-Y. Wu, C.-C. Fang, J. M. Chang, and S.-Y. Kung, “Cost-effective kernel ridge regression implementation for keystroke-based active authentication system,” IEEE transactions on cybernetics, vol. 47, no. 11, pp. 3916–3927, 2016.
- [3] H. Nguyen, D. Zhuang, P.-Y. Wu, and M. Chang, “Autogan-based dimension reduction for privacy preservation,” Neurocomputing, 2019.
- [4] D. Zhuang, S. Wang, and J. M. Chang, “Fripal: Face recognition in privacy abstraction layer,” in 2017 IEEE Conference on Dependable and Secure Computing. IEEE, 2017, pp. 441–448.
- [5] D. Zhuang and J. M. Chang, “Peerhunter: Detecting peer-to-peer botnets through community behavior analysis,” in 2017 IEEE Conference on Dependable and Secure Computing. IEEE, 2017, pp. 493–500.
- [6] ——, “Enhanced peerhunter: Detecting peer-to-peer botnets through network-flow level community behavior analysis,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 6, pp. 1485–1500, 2018.
- [7] D. Zhuang, M. J. Chang, and M. Li, “Dynamo: Dynamic community detection by incrementally maximizing modularity,” IEEE Transactions on Knowledge and Data Engineering, 2019.
- [8] M. Nasr, R. Shokri, and A. Houmansadr, “Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning,” in 2019 IEEE Symposium on Security and Privacy (SP), May 2019, pp. 739–753.
- [9] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 3–18.
- [10] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar, “Semi-supervised knowledge transfer for deep learning from private training data,” arXiv preprint arXiv:1610.05755, 2016.
- [11] M. Chase, R. Gilad-Bachrach, K. Laine, K. E. Lauter, and P. Rindal, “Private collaborative neural network learning.” IACR Cryptology ePrint Archive, vol. 2017, p. 762, 2017.
- [12] R. C. Geyer, T. Klein, and M. Nabi, “Differentially private federated learning: A client level perspective,” arXiv preprint arXiv:1712.07557, 2017.
- [13] N. Wang, X. Xiao, Y. Yang, J. Zhao, S. C. Hui, H. Shin, J. Shin, and G. Yu, “Collecting and analyzing multidimensional data with local differential privacy,” in 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 2019, pp. 638–649.
- [14] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- [15] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 308–318.
- [16] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
- [17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [18] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
- [19] C. Dwork, “Differential privacy,” Automata, languages and programming, pp. 1–12, 2006.
- [20] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy,” Foundations and Trends® in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
- [21] C. Dwork, K. Talwar, A. Thakurta, and L. Zhang, “Analyze gauss: optimal bounds for privacy-preserving principal component analysis,” in Proceedings of the forty-sixth annual ACM symposium on Theory of computing, 2014, pp. 11–20.
- [22] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Local privacy and statistical minimax rates,” in 2013 IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE, 2013, pp. 429–438.
- [23] T. Wang, J. Blocki, N. Li, and S. Jha, “Locally differentially private protocols for frequency estimation,” in 26th USENIX Security Symposium (USENIX Security 17), 2017, pp. 729–745.
- [24] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in neural information processing systems, 2014, pp. 2654–2662.
- [25] A. Polino, R. Pascanu, and D. Alistarh, “Model compression via distillation and quantization,” arXiv preprint arXiv:1802.05668, 2018.
- [26] B. Biggio, B. Nelson, and P. Laskov, “Poisoning attacks against support vector machines,” arXiv preprint arXiv:1206.6389, 2012.
- [27] T. Gu, B. Dolan-Gavitt, and S. Garg, “Badnets: Identifying vulnerabilities in the machine learning model supply chain,” arXiv preprint arXiv:1708.06733, 2017.
- [28] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted backdoor attacks on deep learning systems using data poisoning,” arXiv preprint arXiv:1712.05526, 2017.
- [29] Y. Liu, Y. Xie, and A. Srivastava, “Neural trojans,” in 2017 IEEE International Conference on Computer Design (ICCD). IEEE, 2017, pp. 45–48.
- [30] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
- [31] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Minimax optimal procedures for locally private estimation,” Journal of the American Statistical Association, vol. 113, no. 521, pp. 182–201, 2018.
- [32] F. McSherry and K. Talwar, “Mechanism design via differential privacy,” in 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07). IEEE, 2007, pp. 94–103.
- [33] B. Settles, “Active learning literature survey,” University of Wisconsin-Madison Department of Computer Sciences, Tech. Rep., 2009.
- [34] Y. Yang, Z. Zhang, G. Miklau, M. Winslett, and X. Xiao, “Differential privacy in data publication and analysis,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012, pp. 601–606.
- [35] S. L. Warner, “Randomized response: A survey technique for eliminating evasive answer bias,” Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–69, 1965.
- [36] P. Kairouz, S. Oh, and P. Viswanath, “Extremal mechanisms for local differential privacy,” arXiv preprint arXiv:1407.1338, 2014.
- [37] R. Bassily and A. Smith, “Local, private, efficient protocols for succinct histograms,” in Proceedings of the forty-seventh annual ACM symposium on Theory of computing, 2015, pp. 127–135.
- [38] Ú. Erlingsson, V. Pihur, and A. Korolova, “Rappor: Randomized aggregatable privacy-preserving ordinal response,” in Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, 2014, pp. 1054–1067.
- [39] G. Cormode, S. Jha, T. Kulkarni, N. Li, D. Srivastava, and T. Wang, “Privacy at scale: Local differential privacy in practice,” in Proceedings of the 2018 International Conference on Management of Data, 2018, pp. 1655–1658.
- [40] B. Ding, J. Kulkarni, and S. Yekhanin, “Collecting telemetry data privately,” arXiv preprint arXiv:1712.01524, 2017.
- [41] A. Bittau, Ú. Erlingsson, P. Maniatis, I. Mironov, A. Raghunathan, D. Lie, M. Rudominer, U. Kode, J. Tinnes, and B. Seefeld, “Prochlo: Strong privacy for analytics in the crowd,” in Proceedings of the 26th Symposium on Operating Systems Principles, 2017, pp. 441–459.
- [42] K. Nissim and U. Stemmer, “Clustering algorithms for the centralized and local models,” in Algorithmic Learning Theory. PMLR, 2018, pp. 619–653.
- [43] E. Yilmaz, M. Al-Rubaie, and J. M. Chang, “Locally differentially private naive bayes classification,” arXiv preprint arXiv:1905.01039, 2019.
- [44] T. Wang, N. Li, and S. Jha, “Locally differentially private frequent itemset mining,” in 2018 IEEE Symposium on Security and Privacy (SP). IEEE, 2018, pp. 127–143.
- [45] P. Kairouz, K. Bonawitz, and D. Ramage, “Discrete distribution estimation under local privacy,” in International Conference on Machine Learning. PMLR, 2016, pp. 2436–2444.
- [46] T. Murakami, H. Hino, and J. Sakuma, “Toward distribution estimation under local differential privacy with small samples,” Proceedings on Privacy Enhancing Technologies, vol. 2018, no. 3, pp. 84–104, 2018.
- [47] M. Ye and A. Barg, “Optimal schemes for discrete distribution estimation under locally differential privacy,” IEEE Transactions on Information Theory, vol. 64, no. 8, pp. 5662–5676, 2018.
Di Zhuang (S’15) is currently a Security and Privacy Engineer at Snap Inc. He received his Ph.D. degree in electrical engineering, and B.E. degree in computer science and information security from University of South Florida, Tampa and Nankai University, Tianjin, China, respectively. His research interests include network security, social network science, and privacy preserving machine learning. He is a member of IEEE.
Mingchen Li received his M.S. degree in electrical engineering from Illinois Institute of Technology. He is currently pursuing the Ph.D. degree in electrical engineering with University of South Florida, Tampa. His research interests include cyber security, synthetic data generation, privacy enhancing technologies, machine learning and data analytics.
J. Morris Chang (SM’08) is a professor in the Department of Electrical Engineering at the University of South Florida. He received the Ph.D. degree from the North Carolina State University. His past industrial experiences include positions at Texas Instruments, Microelectronic Center of North Carolina and AT&T Bell Labs. He received the University Excellence in Teaching Award at Illinois Institute of Technology in 1999. His research interests include: cyber security, wireless networks, and energy efficient computer systems. In the last six years, his research projects on cyber security have been funded by DARPA. Currently, he is leading a DARPA project under Brandeis program focusing on privacy-preserving computation over Internet. He is a handling editor of Journal of Microprocessors and Microsystems and an editor of IEEE IT Professional. He is a senior member of IEEE.