
A Blockchain-Based Approach for Saving and Tracking Differential-Privacy Cost

Yang Zhao, Jun Zhao, Jiawen Kang, Zehang Zhang, Dusit Niyato, Shuyu Shi, and Kwok-Yan Lam

Yang Zhao, Jun Zhao, Jiawen Kang, Zehang Zhang, Dusit Niyato, and Kwok-Yan Lam are with the School of Computer Science and Engineering, Nanyang Technological University, Singapore, 639798 (emails: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]). Shuyu Shi is with the Department of Computer Science and Engineering, Nanjing University, Nanjing, China, 210008 (email: [email protected]).
Abstract

An increasing amount of users' sensitive information is now being collected for analytics purposes. Differential privacy has been widely studied in the literature to protect the privacy of users' information, with the privacy parameters bounding the information about the dataset leaked by the noisy output. Oftentimes, a dataset needs to be used for answering multiple queries, so the level of privacy protection may degrade as more queries are answered. Thus, it is crucial to keep track of the privacy spending, which should not exceed the given privacy budget. Moreover, if a query has been answered before and is asked again on the same dataset, we may reuse the previous noisy response for the current query to save the privacy cost. In view of the above, we design an algorithm to reuse previous noisy responses if the same query is asked repeatedly. In particular, considering that different requests of the same query may have different privacy requirements, our algorithm can set the optimal reuse fraction of the old noisy response and add new noise to minimize the accumulated privacy cost. Furthermore, we design and implement a blockchain-based system for tracking and saving differential-privacy cost. As a result, the owner of the dataset has full knowledge of how the dataset has been used and can be confident that no new privacy cost is incurred for answering queries once the specified privacy budget is exhausted.

Index Terms:
Blockchain, differential privacy, data analytics, Gaussian mechanism.

I Introduction

Massive volumes of users' sensitive information, such as large-scale Internet of Things (IoT) data, are being collected for data analytics and machine learning. Some IoT data contain users' confidential information, for example, energy consumption or location data, which may expose a family's habits [1, 2, 3, 4, 5]. To protect personal privacy, many countries have strict policies on how technology companies collect and process users' data. However, companies need to analyze users' data to improve service quality. To preserve privacy while revealing useful information about datasets, differential privacy (DP) has been proposed [6, 7, 8]. Intuitively, by incorporating some noise, the output of an algorithm under DP will not change significantly due to the presence or absence of one user's information in the dataset. Since its introduction [6, 7], DP has attracted much interest from both academia [9, 10, 11, 12, 13] and industry [14, 15, 16]. For example, Apple has incorporated DP into its mobile operating system iOS [14], and Google has implemented a DP tool called RAPPOR in the Chrome browser to collect information [15].

Roughly speaking, a randomized mechanism achieving $(\epsilon,\delta)$-DP [6] means that except with a (typically small) probability $\delta$, altering a record in a database cannot change the probability that an output is seen by more than a multiplicative factor $e^{\epsilon}$. Thus, the information about the dataset leaked by the noisy output of an $(\epsilon,\delta)$-DP algorithm is bounded by the privacy parameters $\epsilon$ and $\delta$. Smaller $\epsilon$ and $\delta$ mean stronger privacy protection and less information leakage. Note that non-zero information leakage is necessary to achieve non-zero utility. Usually, a dataset may be used for answering multiple queries (e.g., for multiple analytics tasks), thus accumulating the information leakage and degrading the privacy protection level, which can be intuitively understood as an increase in privacy spending. Therefore, it is necessary to record the privacy cost and prevent it from exceeding the privacy budget, which quantifies the privacy risk when a differentially private scheme is applied to real-world applications. Besides, we reduce the privacy cost by reusing old noisy responses to answer the current query if the query was answered before.

Traditionally, the privacy cost incurred by answering queries on a dataset is claimed by the dataset holder. Users whose information is in the dataset have no visibility into how it is used, and the privacy consumption may have already exceeded the privacy budget. To solve this problem, the emerging blockchain technology provides a new way to manage the privacy cost. A blockchain is a chain of blocks storing cryptographic and tamper-resistant transaction records without a centralized server [17, 18]. With the blockchain recording how the dataset is used for answering queries, users have full knowledge of how their information is analyzed and can easily check the consumption of the privacy budget. The dataset holder has the motivation to adopt our blockchain-based approach to provide the following accountability guarantee to users whose information is in the dataset: if the dataset holder uses the dataset beyond the set of queries recorded by the blockchain, measures can be taken to catch the dataset holder cheating, because transactions written into the blockchain are tamper-resistant. Yang et al. [19] propose to leverage blockchain to track the differential privacy budget, but they do not propose a mechanism to reuse noise. In contrast, we design a DP mechanism that effectively reuses previous queries' results to reuse noise and reduce the privacy cost. In Section VI-A, we present more detailed comparisons.

In view of the above, we propose a blockchain-based Algorithm 1 and implement it to track and manage the differential-privacy cost, using blockchain to make the privacy spending transparent to the data owner. Consequently, the data owner can track how the dataset is used by checking blockchain transactions' information, including each query's type, the noisy response used to answer each query, the associated noise level added to the true query result, and the remaining privacy budget. In addition to providing transparency of privacy management, another advantage of our blockchain-based system is as follows: once the specified privacy budget is exhausted, a smart contract implemented on the blockchain ensures, verifiably, that no new privacy cost will be incurred. Furthermore, since the blockchain stores the noisy response used to answer each query, we also design an algorithm to minimize the accumulated privacy cost by reusing previous noisy responses if the same query is asked again. Our algorithm (via a rigorous proof) sets the optimal reuse fraction of the old noisy response and adds new noise (if necessary), considering that different requests of the same query may be sent with different privacy requirements. In our blockchain-based system, reusing noisy responses not only saves privacy cost but also reduces communication overhead when the noisy response can be generated without contacting the server hosting the dataset.

Contributions. The major contributions of this paper are summarized as follows:

  • First, a novel privacy-preserving algorithm with a rigorous mathematical proof is designed to minimize the accumulated privacy cost under a limited privacy budget by reusing previous noisy responses when the same query is received. Thus, a dataset can be used to answer more queries while limiting privacy leakage, which is essential for datasets with frequent queries, e.g., medical record datasets.

  • Second, our designed approach significantly reduces the number of requests to the server by taking advantage of recorded noisy results.

  • Third, we implement the proposed system and algorithm according to a detailed sequence diagram, and conduct experiments using a real-world dataset. Numerical results demonstrate that our proposed system and algorithm are effective in saving the privacy cost while maintaining accuracy.

Organization. The rest of the paper is organized as follows. Section II introduces preliminaries about differential privacy and blockchains. Section III presents the system design, including our proposed noise reuse algorithm. Section IV describes challenges in implementing our system. In Section V, we discuss experimental results to validate the effectiveness of our system. Section VI surveys related work, and Section VII provides additional discussions. Section VIII concludes this paper and identifies future directions.

Notation. Throughout the paper, $\mathbb{P}[\cdot]$ denotes the probability, and $\mathbb{F}[\cdot]$ stands for the probability density function. The notation $\mathcal{N}(0,A)$ denotes a Gaussian random variable with zero mean and variance $A$, and means a fresh Gaussian noise when it is used to generate a noisy query response. Notations used in the rest of the paper are summarized in Table I.

Table I: Summary of notations
$(\epsilon,\delta)$: privacy parameters
$\mathbb{P}[\cdot]$: probability
$\mathbb{F}[\cdot]$: probability density function
$D$: dataset
$D'$: neighbouring dataset of $D$
$\Delta_Q$: $\ell_2$-sensitivity of query $Q$
$\sigma$: standard deviation of the Gaussian noise
$\widetilde{Q}_m(D)$: noisy query response for query $Q_m$ on dataset $D$
$Y_1,Y_2,\ldots,Y_m$: randomized mechanisms
$r_{\textup{optimal}}$: the optimal reuse fraction
$L_Y(D,D';y)$: privacy loss
$\mathcal{N}(0,A)$: Gaussian random variable with zero mean and variance $A$
$V$: variance

II Preliminaries

We organize this section on preliminaries as follows. In Section II-A, we introduce the formal definition of differential privacy. In Section II-B, we explain the concepts of blockchain, Ethereum and smart contract.

II-A Differential Privacy

Differential privacy intuitively means that an adversary cannot determine with high confidence whether the randomized output comes from a dataset $D$ or its neighboring dataset $D'$, which differs from $D$ by one record. The formal definition of $(\epsilon,\delta)$-differential privacy is given in Definition 1, and the notion of neighboring datasets is discussed in Remark 2.

Definition 1 ($(\epsilon,\delta)$-Differential privacy [20]).

A randomized mechanism $Y$, which generates a randomized output given a dataset as the input, achieves $(\epsilon,\delta)$-differential privacy if

$$\mathbb{P}[Y(D)\in\mathcal{Y}] \leq e^{\epsilon}\,\mathbb{P}[Y(D')\in\mathcal{Y}]+\delta, \quad\quad (1)$$

for $D$ and $D'$ iterating through all pairs of neighboring datasets, and for $\mathcal{Y}$ iterating through all subsets of the output range, where $\mathbb{P}[\cdot]$ denotes the probability, and the probability space is over the coin flips of the randomized mechanism $Y$.

Remark 1.

The notion of $(\epsilon,\delta)$-differential privacy with $\delta=0$ becomes $\epsilon$-differential privacy. $\epsilon$-Differential privacy and $(\epsilon,\delta)$-differential privacy are also referred to as pure and approximate differential privacy, respectively, in many studies [9, 10, 11].

Remark 2 (Notion of neighboring datasets).

Two datasets $D$ and $D'$ are called neighboring if they differ in only one tuple. There are several variants of this notion. In the first case, the sizes of $D$ and $D'$ differ by one, so that $D'$ is obtained by adding one record to $D$ or deleting one record from $D$. In the second case, $D$ and $D'$ have the same size (say $n$) and have different records at only one of the $n$ positions. Finally, the notion of neighboring datasets can also be defined to include both of the above cases. Our results in this paper apply to all of the above cases.

Among various mechanisms to achieve DP, the Gaussian mechanism for real-valued queries proposed in [6] has received much attention. The improved result given by [20] is Lemma 1.

Lemma 1 (Theorem A.1 by Dwork and Roth [20]).

To answer a query $Q$ with $\ell_2$-sensitivity $\Delta_Q$, adding a zero-mean Gaussian noise with standard deviation $\sqrt{2\ln\frac{1.25}{\delta}}\times\frac{\Delta_Q}{\epsilon}$ (denoted by $\operatorname{Gaussian}(\Delta_Q,\epsilon,\delta)$ hereafter in this paper) to each dimension of the true query result achieves $(\epsilon,\delta)$-differential privacy. The above $\ell_2$-sensitivity $\Delta_Q$ of a query $Q$ is defined as the maximal $\ell_2$ distance between the true query results for any two neighboring datasets $D$ and $D'$ that differ in one record; i.e., $\Delta_Q=\max_{\textrm{neighboring }D,D'}\|Q(D)-Q(D')\|_2$.

More discussions on the $\ell_2$-sensitivity of a query are given in Section III-H. Section VII-A discusses the setting of the privacy parameters $\epsilon$ and $\delta$.
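To make Lemma 1 concrete, the following is a minimal Python sketch of the Gaussian mechanism; the query result, sensitivity, and privacy parameters in the usage line are hypothetical values for illustration.

```python
import math
import random

def gaussian_sigma(delta_q, eps, delta):
    # Lemma 1: noise standard deviation Gaussian(delta_q, eps, delta)
    # that makes the mechanism (eps, delta)-differentially private.
    return math.sqrt(2 * math.log(1.25 / delta)) * delta_q / eps

def gaussian_mechanism(true_result, delta_q, eps, delta):
    # Perturb a one-dimensional true query result; for multidimensional
    # results, independent noise would be added to each dimension.
    return true_result + random.gauss(0, gaussian_sigma(delta_q, eps, delta))

# Hypothetical example: an average-income query with sensitivity 202
# (the sensitivity computed in Section V-A), answered under (0.5, 1e-4)-DP.
print(gaussian_mechanism(41000.0, delta_q=202, eps=0.5, delta=1e-4))
```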

II-B Blockchain, Ethereum and Smart Contracts

Blockchain. The blockchain technology is widely used in systems requiring high security and transparency, such as Bitcoin and Ethereum [21]. The blockchain effectively solves the double-spending problem in Bitcoin transactions by using a peer-to-peer network: transaction information is hashed into a chain of hash-based Proof-of-Work (PoW, used by Bitcoin), the consensus mechanism used to confirm transactions and append new blocks to the chain. Once a record is formed, it cannot be changed without redoing the Proof-of-Work.

Besides, the blockchain constantly grows as 'completed' blocks are appended. Blocks containing the most recent transactions are added to the chain in chronological order [22]. Each blockchain node can hold a copy of the blockchain, which allows participants to track their transactions without centralized control.

Ethereum. Ethereum is a blockchain platform that allows users to create decentralized end-to-end applications [23]. The miners in Ethereum use the Proof-of-Work consensus algorithm to verify and synchronize transactions. Besides, Ethereum can run smart contracts, elaborated below.

Smart Contract. The smart contract was first proposed by Nick Szabo as a computerized transaction protocol that can execute the terms of a contract automatically [24]. It digitizes contracts and maintains credible transactions without a third party. With the development of blockchains such as Ethereum, smart contracts are stored on the blockchain as scripts. A blockchain with a Turing-complete programming language allows everyone to customize smart contract scripts for transactions [25]. Smart contracts are triggered when transactions are created on the blockchain to finish specific tasks or services.

III System Description

Our blockchain-based system provides differentially private responses to queries while minimizing the privacy cost via noise reuse. We design a web application to implement our Algorithm 1, which generates noisy responses to queries with the minimal privacy cost by setting the optimal reuse fraction of the old noisy response and adding new noise (if necessary). For clarity, we defer Algorithm 1 and its discussion to Section III-D. The design of the system is illustrated in Fig. 1, and we discuss the details in the following. In Section V, we will discuss the implementation and experiments of our blockchain-based system and present more figures about the implementation. In particular, Fig. 3 there shows a screenshot of our blockchain-based privacy management system [26], while Fig. 4 presents outputs while using the system.

Figure 1: The proposed blockchain-based system architecture for differential-privacy cost management.

III-A System Architecture

Our system consists of the client, the blockchain with its smart contract, and the server, detailed below.

Client: The primary function of the client is to transfer users' queries to the blockchain smart contract. The client computes, from the privacy parameters $\epsilon$ and $\delta$, the standard deviation the server requires to generate the Gaussian noise, and forwards the query to the blockchain. The client also displays the query result to the analyst after receiving the noisy response to the query.

Blockchain Smart Contract: The blockchain serves as middleware between the client and the server and decides which queries should be submitted to the server. The blockchain records the remaining privacy budget, the query type, the noisy response used to answer the query, the privacy parameters, and the amount of corresponding noise. If the remaining privacy budget is sufficient, the smart contract executes the query match function against the recorded history; otherwise, it rejects the query. If the current query does not match any query in the history, the smart contract calls the server to calculate the result. If the query has been received before, the smart contract does not call the server when the noisy response can be generated entirely from old noisy answers, and calls the server when access to the dataset is still needed to generate the noisy response.

Server: The data provider hosts the server. The server provides APIs to answer analysts' queries. When the API is called, the server queries the dataset to calculate the respective answer. After the true value $Q(D)$ is calculated, the server adds noise to perturb the answer and returns the noisy answer to the blockchain.

In the rest of the paper, we use Blockchain, Client, and Server to denote the blockchain, client, and server, respectively.

III-B System Functionality

Match query with query history and generate noisy response: Blockchain compares the current query's type with the saved query types to retrieve previous query results. If Blockchain sees the query for the first time, it forwards the query to Server, which returns a perturbed result satisfying differential privacy to Blockchain. If the current query's type matches previous answers' query types, Blockchain compares the computed amount of noise with all previously saved amounts of noise under the same query type. Based on the comparison result, Blockchain either completely reuses old responses or calls Server.

Manage privacy budget: Blockchain updates the privacy budget as queries are answered, and ensures that no new privacy cost will be incurred for answering queries once the specified privacy budget is exhausted.

Algorithm 1 Our proposed algorithm to answer the $m$-th query and adjust the remaining privacy budget.
  Input: $D$: dataset; $Q_m$: the $m$-th query; $(\epsilon_m,\delta_m)$: requested privacy parameters for query $Q_m$; $(\sqrt{\epsilon\textup{\_squared\_remaining\_budget}},\delta_{\textup{budget}})$: remaining privacy budget (at the beginning, it is $(\sqrt{\epsilon\textup{\_squared\_budget}},\delta_{\textup{budget}})$ for $\epsilon\textup{\_squared\_budget}={\epsilon_{\textup{budget}}}^2$); $\Delta_{Q_m}$: $\ell_2$-sensitivity of query $Q_m$;
  Output: $\widetilde{Q}_m(D)$: noisy query response for query $Q_m$ on dataset $D$ under $(\epsilon_m,\delta_m)$-differential privacy;
1:  $\sigma_m\leftarrow\operatorname{Gaussian}(\Delta_{Q_m},\epsilon_m,\delta_m)$; //Comment: From Lemma 1, it holds that $\operatorname{Gaussian}(\Delta_{Q_m},\epsilon_m,\delta_m):=\sqrt{2\ln\frac{1.25}{\delta_m}}\times\frac{\Delta_{Q_m}}{\epsilon_m}$.
2:  if the query $Q_m$ is seen for the first time then
3:     Client computes $\epsilon\textup{\_squared\_cost}$ such that $\operatorname{Gaussian}(\Delta_{Q_m},\sqrt{\epsilon\textup{\_squared\_cost}},\delta_{\textup{budget}})=\sigma_m$;
4:     //Comment: This means $\sqrt{2\ln\frac{1.25}{\delta_{\textup{budget}}}}\times\frac{\Delta_{Q_m}}{\sqrt{\epsilon\textup{\_squared\_cost}}}=\sigma_m$, where $\sigma_m$ as $\operatorname{Gaussian}(\Delta_{Q_m},\epsilon_m,\delta_m)$ is $\sqrt{2\ln\frac{1.25}{\delta_m}}\times\frac{\Delta_{Q_m}}{\epsilon_m}$.
5:     Client computes $\epsilon\textup{\_squared\_remaining\_budget}\leftarrow\epsilon\textup{\_squared\_remaining\_budget}-\epsilon\textup{\_squared\_cost}$;
6:     if $\epsilon\textup{\_squared\_remaining\_budget}\geq 0$ then
7:        return $\widetilde{Q}_m(D)\leftarrow Q_m(D)+\mathcal{N}(0,1)\times\sigma_m$; //Comment: We refer to this as Case 1) in Section III-D. If $Q_m$ is multidimensional, independent Gaussian noise will be added to each dimension.
8:        Blockchain records $\langle Q_m\textup{'s query type},\epsilon_m,\delta_m,\sigma_m,\widetilde{Q}_m(D)\rangle$; //Comment: This information is kept together with a cryptographic hash of the dataset $D$, which Blockchain stores so it knows which records are for the same dataset $D$.
9:     else
10:        return an error of insufficient privacy budget;
11:     end if
12:  else
13:     Suppose $Q_m$ is a type-$t$ query. Blockchain compares $\sigma_m$ with the values in $\boldsymbol{\Sigma}_t:=\{\sigma_j:\sigma_j$ has been recorded in Blockchain and $Q_j$ is a type-$t$ query$\}$ (i.e., $\boldsymbol{\Sigma}_t$ consists of the noise amounts for previous instances of type-$t$ queries), resulting in the following subcases.
14:     if there exists $\sigma_j\in\boldsymbol{\Sigma}_t$ such that $\sigma_m=\sigma_j$ then
15:        Blockchain returns $\widetilde{Q}_m(D)\leftarrow\widetilde{Q}_j(D)$; //Comment: We refer to this as Case 2A) in Section III-D.
16:     else if $\sigma_m<\min(\boldsymbol{\Sigma}_t)$ then
17:        //Comment: The case of partially reusing an old noise:
18:        Client computes $\epsilon\textup{\_squared\_cost}$ such that $[\operatorname{Gaussian}(\Delta_{Q_m},\sqrt{\epsilon\textup{\_squared\_cost}},\delta_{\textup{budget}})]^{-2}={\sigma_m}^{-2}-[\min(\boldsymbol{\Sigma}_t)]^{-2}$;
19:        Client computes $\epsilon\textup{\_squared\_remaining\_budget}\leftarrow\epsilon\textup{\_squared\_remaining\_budget}-\epsilon\textup{\_squared\_cost}$;
20:        if $\epsilon\textup{\_squared\_remaining\_budget}\geq 0$ then
21:           Blockchain computes $\textup{NoiseReuseRatio}\leftarrow\frac{{\sigma_m}^2}{[\min(\boldsymbol{\Sigma}_t)]^2}$ and $\textup{AdditionalNoise}\leftarrow\mathcal{N}(0,1)\times\sqrt{{\sigma_m}^2-\frac{{\sigma_m}^4}{[\min(\boldsymbol{\Sigma}_t)]^2}}$;
22:           Blockchain contacts Server to compute $\widetilde{Q}_m(D)\leftarrow Q_m(D)+\textup{NoiseReuseRatio}\times[\widetilde{Q}_{t,\textup{min}}(D)-Q_m(D)]+\textup{AdditionalNoise}$, where $\widetilde{Q}_{t,\textup{min}}(D)$ denotes the noisy response (kept in Blockchain) corresponding to $\min(\boldsymbol{\Sigma}_t)$; //Comment: We refer to this as Case 2B) in Section III-D.
23:           Blockchain records $\langle Q_m\textup{'s query type},\epsilon_m,\delta_m,\sigma_m,\widetilde{Q}_m(D)\rangle$;
24:        else
25:           return an error of insufficient privacy budget;
26:        end if
27:     else
28:        //Comment: The case of fully reusing an old noise:
29:        With $\sigma_{\ell}$ denoting the maximal value in $\boldsymbol{\Sigma}_t$ that is smaller than $\sigma_m$, Blockchain reuses $\widetilde{Q}_{\ell}(D)$, which denotes the noisy response (kept in Blockchain) corresponding to $\sigma_{\ell}$;
30:        Blockchain computes $\widetilde{Q}_m(D)\leftarrow\widetilde{Q}_{\ell}(D)+\mathcal{N}(0,1)\times\sqrt{{\sigma_m}^2-{\sigma_{\ell}}^2}$; //Comment: We refer to this as Case 2C) in Section III-D.
31:        Blockchain records $\langle Q_m\textup{'s query type},\epsilon_m,\delta_m,\sigma_m,\widetilde{Q}_m(D)\rangle$;
32:     end if
33:  end if
Table II: An example to explain Algorithm 1.

Query ($Q_m$'s type) | $\sigma_m$ computed by Line 1 of Alg. 1 | Case involved in Alg. 1
$Q_1$ (type-1) | $\sigma_1=1$ | 1): $\widetilde{Q}_1\leftarrow Q_1+\mathcal{N}(0,1)\times\sigma_1$, with accessing $D$
$Q_2$ (type-2) | $\sigma_2=3$ | 1): $\widetilde{Q}_2\leftarrow Q_2+\mathcal{N}(0,1)\times\sigma_2$, with accessing $D$
$Q_3$ (type-3) | $\sigma_3=2$ | 1): $\widetilde{Q}_3\leftarrow Q_3+\mathcal{N}(0,1)\times\sigma_3$, with accessing $D$
$Q_4$ (type-1) | $\sigma_4=2.5$ | 2C): $\widetilde{Q}_4$ reuses $\widetilde{Q}_1$, without accessing $D$
$Q_5$ (type-2) | $\sigma_5=2$ | 2B): $\widetilde{Q}_5$ reuses $\widetilde{Q}_2$, with accessing $D$
$Q_6$ (type-1) | $\sigma_6=0.5$ | 2B): $\widetilde{Q}_6$ reuses $\widetilde{Q}_1$, with accessing $D$
$Q_7$ (type-3) | $\sigma_7=2$ | 2A): $\widetilde{Q}_7$ reuses $\widetilde{Q}_3$, without accessing $D$
$Q_8$ (type-2) | $\sigma_8=2.5$ | 2C): $\widetilde{Q}_8$ reuses $\widetilde{Q}_5$, without accessing $D$
$Q_9$ (type-2) | $\sigma_9=1.5$ | 2B): $\widetilde{Q}_9$ reuses $\widetilde{Q}_5$, with accessing $D$
$Q_{10}$ (type-1) | $\sigma_{10}=0.25$ | 2B): $\widetilde{Q}_{10}$ reuses $\widetilde{Q}_6$, with accessing $D$
$Q_{11}$ (type-2) | $\sigma_{11}=1$ | 2B): $\widetilde{Q}_{11}$ reuses $\widetilde{Q}_9$, with accessing $D$
$Q_{12}$ (type-1) | $\sigma_{12}=0.75$ | 2C): $\widetilde{Q}_{12}$ reuses $\widetilde{Q}_6$, without accessing $D$
$Q_{13}$ (type-3) | $\sigma_{13}=1.5$ | 2B): $\widetilde{Q}_{13}$ reuses $\widetilde{Q}_7$, with accessing $D$

III-C Adversary Model

The adversary model for our system is similar to that of [19]. We assume two kinds of adversaries:

First, adversaries can obtain perturbed query results. They may try to infer users' real information from the perturbed query results.

Second, adversaries may attempt to modify the privacy budget. For example, they may try to decrease the recorded privacy spending so that the privacy budget can be exceeded, leaking privacy. In our case, however, the privacy budget is recorded on the blockchain, and adversaries cannot tamper with it once it is stored there.

III-D Our Algorithm 1 based on Reusing Noise

We present our solution for reusing noise in Algorithm 1 and explain it in Section III-E. We consider real-valued queries so that the Gaussian mechanism can be used. Extensions to non-real-valued queries are left as future work, where we can apply the exponential mechanism of [12].

To clarify the notation, $Q_i$ denotes the $i$-th query (ordered chronologically) and is answered by a randomized algorithm $\widetilde{Q}_i$. A type-$t$ query is a query whose type is $t$. Queries asked at different times can have the same query type, which is why we can reuse noise in Algorithm 1.

Suppose a dataset $D$ has been used to answer $m-1$ queries $Q_1,Q_2,\ldots,Q_{m-1}$, where the $i$-th query $Q_i$ for $i=1,2,\ldots,m-1$ is answered under $(\epsilon_i,\delta_i)$-differential privacy (by reusing noise, generating fresh noise, or combining both). For $i=1,2,\ldots,m$, we define $\sigma_i:=\operatorname{Gaussian}(\Delta_{Q_i},\epsilon_i,\delta_i)$, where $\Delta_{Q_i}$ denotes the $\ell_2$-sensitivity of $Q_i$, whose discussion we defer to Section III-H. As presented in Algorithm 1, we have several cases, discussed below. For a better understanding of these cases, we later discuss the example given in Table II.

  1. Case 1):

     If $Q_m$ is seen for the first time, we obtain the noisy response $\widetilde{Q}_m(D)$ by adding a zero-mean Gaussian noise with standard deviation $\operatorname{Gaussian}(\Delta_{Q_m},\epsilon_m,\delta_m)$ independently to each dimension of the true result $Q_m(D)$ (if the privacy budget allows), as given by Line 7 of Algorithm 1, where $\operatorname{Gaussian}(\Delta_{Q_m},\epsilon_m,\delta_m):=\sqrt{2\ln\frac{1.25}{\delta_m}}\times\frac{\Delta_{Q_m}}{\epsilon_m}$ from Lemma 1.

  2. Case 2):

     If $Q_m$ has been received before, suppose $Q_m$ is a type-$t$ query, and among the previous $m-1$ queries $Q_1,Q_2,\ldots,Q_{m-1}$, let $\boldsymbol{\Sigma}_t$ consist of the noise amounts for previous instances of type-$t$ queries; i.e., $\boldsymbol{\Sigma}_t:=\{\sigma_j:\sigma_j$ has been recorded in Blockchain and $Q_j$ is a type-$t$ query$\}$.

     Blockchain compares $\sigma_m$ with the values in $\boldsymbol{\Sigma}_t$, resulting in the following subcases.

     1. Case 2A):

       If there exists $\sigma_j\in\boldsymbol{\Sigma}_t$ such that $\sigma_m=\sigma_j$, then $\widetilde{Q}_m(D)$ is set as $\widetilde{Q}_j(D)$.

     2. Case 2B):

       This case considers that $\sigma_m$ is less than $\min(\boldsymbol{\Sigma}_t)$, the minimum of $\boldsymbol{\Sigma}_t$. Let $\widetilde{Q}_{t,\textup{min}}(D)$ denote the noisy response (kept in Blockchain) corresponding to $\min(\boldsymbol{\Sigma}_t)$; specifically, if $\min(\boldsymbol{\Sigma}_t)=\sigma_j$ for some $j$, then $\widetilde{Q}_{t,\textup{min}}(D)=\widetilde{Q}_j(D)$. Under $\sigma_m<\min(\boldsymbol{\Sigma}_t)$, to minimize the privacy cost, we reuse a $\frac{{\sigma_m}^2}{[\min(\boldsymbol{\Sigma}_t)]^2}$ fraction of the noise in $\widetilde{Q}_{t,\textup{min}}(D)$ to generate $\widetilde{Q}_m(D)$ (if the privacy budget allows). This follows from Theorem 1's Result (ii), to be presented in Section III-E. Specifically, under $\min(\boldsymbol{\Sigma}_t)>\sigma_m$, as given by Line 22 of Algorithm 1, $\widetilde{Q}_m(D)$ is set by $\widetilde{Q}_m(D)\leftarrow Q_m(D)+\frac{{\sigma_m}^2}{[\min(\boldsymbol{\Sigma}_t)]^2}\times[\widetilde{Q}_{t,\textup{min}}(D)-Q_m(D)]+\mathcal{N}(0,1)\times\sqrt{{\sigma_m}^2-\frac{{\sigma_m}^4}{[\min(\boldsymbol{\Sigma}_t)]^2}}$. Note that if $Q_m$ is multidimensional, independent Gaussian noise will be added to each dimension according to the above formula; this also applies to other places in this paper.

     3. Case 2C):

       This case considers that $\sigma_m$ is greater than $\min(\boldsymbol{\Sigma}_t)$ and differs from all values in $\boldsymbol{\Sigma}_t$. Let $\sigma_{\ell}$ be the maximal value in $\boldsymbol{\Sigma}_t$ that is smaller than $\sigma_m$; i.e., $\sigma_{\ell}=\max\{\sigma_j:\sigma_j\in\boldsymbol{\Sigma}_t\textup{ and }\sigma_j<\sigma_m\}$. Then $\widetilde{Q}_m(D)$ is set as $\widetilde{Q}_{\ell}(D)+\mathcal{N}(0,1)\times\sqrt{{\sigma_m}^2-{\sigma_{\ell}}^2}$. This will become clear from Theorem 1's Result (ii), to be presented in Section III-E.

An example to explain Algorithm 1. Table II provides an example for a better understanding of Algorithm 1; a Python sketch of the case dispatch follows below. We consider three types of queries. In particular, $Q_1,Q_4,Q_6,Q_{10},Q_{12}$ are type-1 queries; $Q_2,Q_5,Q_8,Q_9,Q_{11}$ are type-2 queries; and $Q_3,Q_7,Q_{13}$ are type-3 queries.
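The following is a minimal Python sketch of Algorithm 1's case dispatch for one-dimensional queries, under simplifying assumptions: the `true_answer` callback stands in for Server's access to the dataset $D$, the on-chain ledger is an in-memory dictionary, and the names `PrivacyLedger` and `answer` are ours, not part of the implementation in Section V.

```python
import math
import random

def gaussian_sigma(delta_q, eps, delta):
    # Lemma 1 (as in the Section II-A sketch): noise std for (eps, delta)-DP.
    return math.sqrt(2 * math.log(1.25 / delta)) * delta_q / eps

class PrivacyLedger:
    def __init__(self, eps_budget, delta_budget, true_answer):
        self.eps_sq_remaining = eps_budget ** 2
        self.delta_budget = delta_budget
        self.true_answer = true_answer   # callback standing in for Server
        self.history = {}                # query type -> [(sigma, noisy response)]

    def _charge(self, eps_sq_cost):
        if self.eps_sq_remaining - eps_sq_cost < 0:
            raise RuntimeError("insufficient privacy budget")
        self.eps_sq_remaining -= eps_sq_cost

    def answer(self, qtype, delta_q, eps_m, delta_m):
        sigma_m = gaussian_sigma(delta_q, eps_m, delta_m)
        recorded = self.history.setdefault(qtype, [])
        sigmas = [s for s, _ in recorded]
        two_ln = 2 * math.log(1.25 / self.delta_budget)
        if not sigmas:
            # Case 1): first occurrence; charge eps_squared_cost (Lines 3-7).
            self._charge(two_ln * delta_q**2 / sigma_m**2)
            resp = self.true_answer(qtype) + random.gauss(0, sigma_m)
        elif sigma_m in sigmas:
            # Case 2A): exact match; reuse as-is, no new privacy cost (Line 15).
            return recorded[sigmas.index(sigma_m)][1]
        elif sigma_m < min(sigmas):
            # Case 2B): partial reuse; only the variance increment is charged
            # (Lines 18-22).
            s_min, q_min = min(recorded)
            self._charge(two_ln * delta_q**2 * (sigma_m**-2 - s_min**-2))
            q_true = self.true_answer(qtype)
            ratio = sigma_m**2 / s_min**2
            fresh = random.gauss(0, math.sqrt(sigma_m**2 - sigma_m**4 / s_min**2))
            resp = q_true + ratio * (q_min - q_true) + fresh
        else:
            # Case 2C): fully reuse the largest recorded sigma below sigma_m and
            # top up with fresh noise; no new privacy cost (Lines 29-30).
            s_l, q_l = max((s, q) for s, q in recorded if s < sigma_m)
            resp = q_l + random.gauss(0, math.sqrt(sigma_m**2 - s_l**2))
        recorded.append((sigma_m, resp))
        return resp

# Hypothetical usage: a toy dataset answer of 0.42 for every query type.
ledger = PrivacyLedger(eps_budget=8, delta_budget=1e-4, true_answer=lambda t: 0.42)
print(ledger.answer("frequency_of_us_citizens", delta_q=0.0002, eps_m=0.5, delta_m=1e-4))
```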

III-E Explaining the Noise Reuse Rules of Algorithm 1

Our noise-reuse rules in Algorithm 1 are designed to minimize the accumulated privacy cost. To explain this, inspired by [13], we define the privacy loss to quantify the privacy cost. We analyze the privacy loss to characterize how privacy degrades in a fine-grained manner, instead of using the composition theorem of Kairouz et al. [27]. Although [27] gives the state-of-the-art results for the composition of differentially private algorithms, those results do not exploit the specific mechanisms used to achieve differential privacy. By analyzing the privacy loss of Gaussian mechanisms specifically, we obtain a smaller privacy cost.

For a randomized algorithm $Y$, neighboring datasets $D$ and $D'$, and output $y$, the privacy loss $L_Y(D,D';y)$ represents the multiplicative difference between the probabilities that the same output $y$ is observed when the randomized algorithm $Y$ is applied to $D$ and $D'$. Specifically, we define

$$L_Y(D,D';y):=\ln\frac{\mathbb{F}[Y(D)=y]}{\mathbb{F}[Y(D')=y]}, \quad\quad (2)$$

where $\mathbb{F}[\cdot]$ denotes the probability density function.

For simplicity, we use the probability density function $\mathbb{F}[\cdot]$ in Eq. (2) above by assuming that the randomized algorithm $Y$ has continuous output. If $Y$ has discrete output, we replace $\mathbb{F}[\cdot]$ by the probability mass function $\mathbb{P}[\cdot]$.

When $y$ follows the probability distribution of the random variable $Y(D)$, $L_Y(D,D';y)$ follows the probability distribution of the random variable $L_Y(D,D';Y(D))$, which we write as $L_Y(D,D')$ for simplicity.

We denote the composition of randomized mechanisms $Y_1,Y_2,\ldots,Y_m$ for a positive integer $m$ by $Y_1\|Y_2\|\ldots\|Y_m$. For the composition, the privacy loss with respect to neighboring datasets $D$ and $D'$ when the outputs of the randomized mechanisms $Y_1,Y_2,\ldots,Y_m$ are $y_1,y_2,\ldots,y_m$ is defined by

$$L_{Y_1\|Y_2\|\ldots\|Y_m}(D,D';y_1,y_2,\ldots,y_m):=\ln\frac{\mathbb{F}\big[\cap_{i=1}^{m}[Y_i(D)=y_i]\big]}{\mathbb{F}\big[\cap_{i=1}^{m}[Y_i(D')=y_i]\big]}.$$

When $y_i$ follows the probability distribution of the random variable $Y_i(D)$ for each $i\in\{1,2,\ldots,m\}$, clearly $L_{Y_1\|Y_2\|\ldots\|Y_m}(D,D';y_1,y_2,\ldots,y_m)$ follows the probability distribution of the random variable $L_{Y_1\|Y_2\|\ldots\|Y_m}(D,D';Y_1(D),Y_2(D),\ldots,Y_m(D))$, which we write as $L_{Y_1\|Y_2\|\ldots\|Y_m}(D,D')$ for simplicity.

With the privacy loss defined above, we now analyze how to reuse noise when a series of queries is answered under differential privacy. To this end, we present Theorem 1, which gives the optimal ratio of reusing noise to minimize the privacy cost.

Theorem 1 (Optimal ratio of reusing noise to minimize privacy cost).

Suppose that before answering query $Q_m$ and after answering $Q_1,Q_2,\ldots,Q_{m-1}$, the privacy loss $L_{\widetilde{Q}_1\|\widetilde{Q}_2\|\ldots\|\widetilde{Q}_{m-1}}(D,D')$ is given by $\mathcal{N}(\frac{A(D,D')}{2},A(D,D'))$ for some $A(D,D')$. For the $m$-th query $Q_m$, suppose that $Q_m$ is the same as $Q_j$ for some $j\in\{1,2,\ldots,m-1\}$, and we reuse an $r$ fraction of the noise in $\widetilde{Q}_j(D)$ to generate $\widetilde{Q}_m(D)$, where $r$ is a constant to be decided with $0\leq r\leq 1$ satisfying ${\sigma_m}^2-r^2{\sigma_j}^2>0$. If $\widetilde{Q}_j(D)-Q_j(D)$ follows a Gaussian probability distribution with mean 0 and standard deviation $\sigma_j$, we generate the noisy response $\widetilde{Q}_m(D)$ to answer query $Q_m$ as follows:

$$\widetilde{Q}_m(D)\leftarrow Q_m(D)+r[\widetilde{Q}_j(D)-Q_j(D)]+\mathcal{N}(0,{\sigma_m}^2-r^2{\sigma_j}^2), \quad\quad (3)$$

so that $\widetilde{Q}_m(D)-Q_m(D)$ follows a Gaussian probability distribution with mean 0 and standard deviation $\sigma_m$.

Note that $\Delta_{Q_m}$ and $\Delta_{Q_j}$ are the same since $Q_m$ and $Q_j$ are the same. Then we have the following results.

  • (i)

    After answering the $m$ queries $Q_1,Q_2,\ldots,Q_m$, the privacy loss $L_{\widetilde{Q}_1\|\widetilde{Q}_2\|\ldots\|\widetilde{Q}_m}(D,D')$ will be $\mathcal{N}(\frac{B_r(D,D')}{2},B_r(D,D'))$ for $B_r(D,D'):=A(D,D')+\frac{[\|Q_m(D)-Q_m(D')\|_2]^2(1-r)^2}{{\sigma_m}^2-r^2{\sigma_j}^2}$.

  • (ii)

    We clearly require $r\geq 0$ and ${\sigma_m}^2-r^2{\sigma_j}^2\geq 0$ in (3) above (note that $\mathcal{N}(0,0)\equiv 0$). To minimize the total privacy cost (which is equivalent to minimizing $B_r(D,D')$ above), the optimal $r$ is given by

    $$r_{\textup{optimal}}=\begin{cases}1,&\textup{if }\sigma_m\geq\sigma_j,\\ \big(\frac{\sigma_m}{\sigma_j}\big)^2,&\textup{if }\sigma_m<\sigma_j,\end{cases} \quad\quad (4)$$

    so that substituting Eq. (4) into the expression of $B_r(D,D')$ gives

    $$B_{r_{\textup{optimal}}}(D,D')=\begin{cases}A(D,D'),&\textup{if }\sigma_m\geq\sigma_j;\\ A(D,D')+[\|Q_m(D)-Q_m(D')\|_2]^2\left(\frac{1}{{\sigma_m}^2}-\frac{1}{{\sigma_j}^2}\right),&\textup{if }\sigma_m<\sigma_j.\end{cases} \quad\quad (5)$$

    Note that if $\sigma_m=\sigma_j$ for some $j\in\{1,2,\ldots,m-1\}$, we have $r_{\textup{optimal}}=1$ and just set $\widetilde{Q}_m(D)$ as $\widetilde{Q}_j(D)$.

Proof.

The proof is in Appendix -A. ∎

Eq. (4) of Theorem 1 directly yields the noise reuse ratio $\frac{{\sigma_m}^2}{[\min(\boldsymbol{\Sigma}_t)]^2}$ of Case 2B) in Algorithm 1 (see Line 22 of Algorithm 1), and the noise reuse ratio 1 of Cases 2A) and 2C) (see Lines 15 and 30 of Algorithm 1).
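As a quick sanity check of Eq. (3) and Eq. (4), the following Python snippet (with arbitrary illustrative values $\sigma_j=2$ and $\sigma_m=1.5$) empirically verifies that reusing the fraction $r=(\sigma_m/\sigma_j)^2$ of the old noise plus fresh noise yields total noise of standard deviation $\sigma_m$, as Theorem 1 requires.

```python
import math
import random

sigma_j, sigma_m = 2.0, 1.5          # old noise larger than requested: Case 2B)
r = (sigma_m / sigma_j) ** 2         # optimal reuse fraction, Eq. (4)

fresh_std = math.sqrt(sigma_m**2 - r**2 * sigma_j**2)
samples = [r * random.gauss(0, sigma_j) + random.gauss(0, fresh_std)
           for _ in range(200000)]   # total noise in Q_m's response, Eq. (3)

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(f"empirical std {math.sqrt(var):.3f} vs target sigma_m {sigma_m}")
```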

By considering r=0r=0 in Result (i) of Theorem 1, we obtain Corollary 1, which presents the classical result on the privacy loss of a single run of the Gaussian mechanism.

Corollary 1.

By considering $m=1$ in Result (i) of Theorem 1, we have that for a randomized algorithm $\widetilde{Q}$ which adds a Gaussian noise amount $\sigma$ to a query $Q$, the privacy loss with respect to neighboring datasets $D$ and $D'$ is given by $\mathcal{N}(\frac{A(D,D')}{2},A(D,D'))$ for $A(D,D'):=\frac{[\|Q(D)-Q(D')\|_2]^2}{\sigma^2}$.

Corollary 1 has been shown in many prior studies [20, 9, 10] on the Gaussian mechanism for differential privacy.

By considering r=0r=0 in Result (i) of Theorem 1, we obtain Corollary 2, which presents the privacy loss of the naive algorithm where the noisy response to each query is generated independently using fresh noise.

Corollary 2 (Privacy loss of the naive algorithm where each query is answered independently).

Suppose a dataset has been used to answer $n$ queries $Q_1,Q_2,\ldots,Q_n$ under differential privacy. Specifically, for $i=1,2,\ldots,n$, to answer the $i$-th query $Q_i$ under $(\epsilon_i,\delta_i)$-differential privacy, a noisy response $\widetilde{Q}_i$ is generated by adding independent Gaussian noise of amount $\sigma_i:=\operatorname{Gaussian}(\Delta_{Q_i},\epsilon_i,\delta_i)$ to the true query result $Q_i$, where $\Delta_{Q_i}$ is the $\ell_2$-sensitivity of $Q_i$. Then after answering the $n$ queries $Q_1,Q_2,\ldots,Q_n$ independently as above, the privacy loss with respect to neighboring datasets $D$ and $D'$ is given by $\mathcal{N}(\frac{F(D,D')}{2},F(D,D'))$ for $F(D,D'):=\sum_{i=1}^{n}\frac{[\|Q_i(D)-Q_i(D')\|_2]^2}{{\sigma_i}^2}$.

III-F Explaining Privacy Cost Update in Algorithm 1

Among the above cases, Cases 2A) and 2C) do not incur additional privacy cost, since they only reuse previous noisy results and, in Case 2C), add fresh Gaussian noise independent of the dataset, without accessing the dataset $D$. In contrast, Cases 1) and 2B) incur additional privacy cost, since they need to access the dataset $D$ to compute the true query result $Q_m(D)$. Hence, in Algorithm 1, the privacy cost is updated in Cases 1) and 2B), but not in Cases 2A) and 2C). In this section, we explain why the privacy cost is updated in Algorithm 1 according to Lines 3 and 5 for Case 1), and Lines 18 and 19 for Case 2B).

When our Algorithm 1 is used, we let the above randomized mechanism $Y_i$ be our noisy-response function $\widetilde{Q}_i$. When $\widetilde{Q}_1,\widetilde{Q}_2,\ldots,\widetilde{Q}_{i-1}$ on dataset $D$ are instantiated as $y_1,y_2,\ldots,y_{i-1}$, if the generation of $\widetilde{Q}_i$ on dataset $D$ uses $\widetilde{Q}_j$ for some $j<i$, then the auxiliary information $\textup{aux}_i$ in the input to $\widetilde{Q}_i$ contains $y_j$ ($\textup{aux}_1$ is $\emptyset$). For the consecutive use of our Algorithm 1, it will become clear that the privacy loss, defined by $L_{\widetilde{Q}_1,\widetilde{Q}_2,\ldots,\widetilde{Q}_m}(y_1,y_2,\ldots,y_m):=\ln\max_{\textup{neighboring datasets }D,D'}\frac{\mathbb{F}[\cap_{i=1}^{m}[\widetilde{Q}_i(D)=y_i]]}{\mathbb{F}[\cap_{i=1}^{m}[\widetilde{Q}_i(D')=y_i]]}$, follows a Gaussian probability distribution with mean $\frac{V}{2}$ and variance $V$ for some $V$, denoted by $\mathcal{N}(\frac{V}{2},V)$. For privacy loss of this form, the corresponding differential-privacy level is given by the following lemma.

Lemma 2.

If the privacy loss of a randomized mechanism $Y$ with respect to neighboring datasets $D$ and $D'$ is given by $\mathcal{N}(\frac{V(D,D')}{2},V(D,D'))$ for some $V(D,D')$, then $Y$ achieves $(\epsilon,\delta)$-differential privacy for $\epsilon$ and $\delta$ satisfying $\max_{\textup{neighboring datasets }D,D'}V(D,D')=[\operatorname{Gaussian}(1,\epsilon,\delta)]^{-2}$.

Proof.

The proof details are in Appendix -B. ∎

Based on the privacy loss defined above, we have the following theorem which explains the rules to update the privacy cost in our Algorithm 1.

Theorem 2.

We consider the consecutive use of Algorithm 1 here. Suppose that after answering $Q_1,Q_2,\ldots,Q_{m-1}$ and before answering query $Q_m$, the privacy loss with respect to neighboring datasets $D$ and $D'$ is given by $\mathcal{N}(\frac{A(D,D')}{2},A(D,D'))$ for some $A(D,D')$, and the corresponding privacy level can be given by $(\epsilon_{\textup{old}},\delta_{\textup{budget}})$-differential privacy. Then in Algorithm 1, after answering all $m$ queries $Q_1,Q_2,\ldots,Q_{m-1},Q_m$, we have:

  • the privacy loss with respect to neighboring datasets $D$ and $D'$

    ① will still be $\mathcal{N}(\frac{A(D,D')}{2},A(D,D'))$ in Cases 2A) and 2C),

    ② will be $\mathcal{N}(\frac{B(D,D')}{2},B(D,D'))$ in Case 1) for $B(D,D'):=A(D,D')+\frac{[\|Q_m(D)-Q_m(D')\|_2]^2}{{\sigma_m}^2}$,

    ③ will be $\mathcal{N}(\frac{C(D,D')}{2},C(D,D'))$ in Case 2B) for $C(D,D'):=A(D,D')+[\|Q_m(D)-Q_m(D')\|_2]^2\times\left[\frac{1}{{\sigma_m}^2}-\frac{1}{[\min(\boldsymbol{\Sigma}_t)]^2}\right]$;

  • the corresponding privacy level can be given by $(\epsilon_{\textup{new}},\delta_{\textup{budget}})$-differential privacy with the following $\epsilon_{\textup{new}}$:

    ④ $\epsilon_{\textup{new}}=\epsilon_{\textup{old}}$ in Cases 2A) and 2C),

    ⑤ ${\epsilon_{\textup{new}}}^2={\epsilon_{\textup{old}}}^2+\epsilon\textup{\_squared\_cost}$ in Case 1) for $\epsilon\textup{\_squared\_cost}$ satisfying $\operatorname{Gaussian}(\Delta_{Q_m},\sqrt{\epsilon\textup{\_squared\_cost}},\delta_{\textup{budget}})=\sigma_m$,

    ⑥ ${\epsilon_{\textup{new}}}^2={\epsilon_{\textup{old}}}^2+\epsilon\textup{\_squared\_cost}$ in Case 2B) for $\epsilon\textup{\_squared\_cost}$ satisfying $[\operatorname{Gaussian}(\Delta_{Q_m},\sqrt{\epsilon\textup{\_squared\_cost}},\delta_{\textup{budget}})]^{-2}={\sigma_m}^{-2}-[\min(\boldsymbol{\Sigma}_t)]^{-2}$.

Theorem 2 explains the rules to update the privacy cost in Algorithm 1. Specifically, Result ⑤ gives Lines 3 and 5 for Case 1), and Result ⑥ gives Lines 18 and 19 for Case 2B).

Proof.

The proof is in Appendix -C. ∎
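To make the update rules in Results ⑤ and ⑥ concrete, here is a small Python sketch (the function names are ours) converting the variance increments of Theorem 2 into the $\epsilon\textup{\_squared\_cost}$ values charged in Lines 3 and 18 of Algorithm 1, together with Lemma 2's conversion from a loss variance $V$ to an $\epsilon$ value.

```python
import math

def eps_squared_cost_case1(delta_q, sigma_m, delta_budget):
    # Result 5 / Line 3: solve Gaussian(delta_q, sqrt(cost), delta_budget) = sigma_m.
    return 2 * math.log(1.25 / delta_budget) * delta_q**2 / sigma_m**2

def eps_squared_cost_case2b(delta_q, sigma_m, sigma_min, delta_budget):
    # Result 6 / Line 18: only the increment 1/sigma_m^2 - 1/min(Sigma_t)^2 is charged.
    return 2 * math.log(1.25 / delta_budget) * delta_q**2 * (sigma_m**-2 - sigma_min**-2)

def eps_from_loss_variance(v, delta_budget):
    # Lemma 2: loss N(V/2, V) corresponds to (eps, delta_budget)-DP with
    # [Gaussian(1, eps, delta_budget)]^{-2} = V, i.e. eps = sqrt(2 ln(1.25/delta) V).
    return math.sqrt(2 * math.log(1.25 / delta_budget) * v)

# Illustration with a unit-sensitivity query answered at sigma=1 and then at
# sigma=0.5: reuse (Case 2B) costs less than two independent answers.
ours = eps_squared_cost_case1(1.0, 1.0, 1e-4) + eps_squared_cost_case2b(1.0, 0.5, 1.0, 1e-4)
naive = eps_squared_cost_case1(1.0, 1.0, 1e-4) + eps_squared_cost_case1(1.0, 0.5, 1e-4)
print(ours, "<", naive)
```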

III-G Analyzing the Total Privacy Cost

Based on Theorem 2, we now analyze the total privacy cost when our system calls Algorithm 1 consecutively.

At the beginning, when no query has been answered, we have $V=0$ (note that $\mathcal{N}(0,0)\equiv 0$). Then, by induction via Corollary 1 and Theorem 2, for the consecutive use of Algorithm 1, the privacy loss is always of the form $\mathcal{N}(\frac{V}{2},V)$ for some $V$. In our Algorithm 1, the privacy loss changes only when the query being answered belongs to Cases 1) and 2B). More formally, we have the following theorem.

Theorem 3.

Among queries $Q_1,Q_2,\ldots,Q_n$, let $N_1$, $N_{2A}$, $N_{2B}$, and $N_{2C}$ be the sets of $i\in\{1,2,\ldots,n\}$ such that $Q_i$ is in Cases 1), 2A), 2B), and 2C), respectively. For queries in Case 2B), let $T_{2B}$ be the set of query types. In Case 2B), for query type $t\in T_{2B}$, let the number of type-$t$ queries be $m_t$, and let these type-$t$ queries be $Q_{j_{t,1}},Q_{j_{t,2}},\ldots,Q_{j_{t,m_t}}$ for indices $j_{t,1},j_{t,2},\ldots,j_{t,m_t}$ (ordered chronologically) all belonging to $N_{2B}$. From Case 2B) of Algorithm 1, we have $\sigma_{j_{t,1}}>\sigma_{j_{t,2}}>\ldots>\sigma_{j_{t,m_t}}$, and for $k\in\{2,3,\ldots,m_t\}$, $\widetilde{Q}_{j_{t,k}}$ is answered by reusing a $\frac{{\sigma_{j_{t,k}}}^2}{{\sigma_{j_{t,k-1}}}^2}$ fraction of the old noise in $\widetilde{Q}_{j_{t,k-1}}$; more specifically, $\widetilde{Q}_{j_{t,k}}=Q_{j_{t,k}}+\frac{{\sigma_{j_{t,k}}}^2}{{\sigma_{j_{t,k-1}}}^2}[\widetilde{Q}_{j_{t,k-1}}-Q_{j_{t,k-1}}]+\mathcal{N}(0,{\sigma_{j_{t,k}}}^2-\frac{{\sigma_{j_{t,k}}}^4}{{\sigma_{j_{t,k-1}}}^2})$ from Line 22 of Algorithm 1, discussed in Section III-E for Case 2B). We also consider that $\widetilde{Q}_{j_{t,1}}$ is answered by reusing a $\frac{{\sigma_{j_{t,1}}}^2}{{\sigma_{j_{t,0}}}^2}$ fraction of the old noise in $\widetilde{Q}_{j_{t,0}}$. Let the $\ell_2$-sensitivity of a type-$t$ query be $\Delta(\textup{type-}t)$.

In the example provided in Table II, we have $N_1=\{1,2,3\}$, $N_{2A}=\{7\}$, $N_{2B}=\{5,6,9,10,11,13\}$, and $N_{2C}=\{8,12\}$, with $T_{2B}=\{\textup{type-}1,\textup{type-}2,\textup{type-}3\}$. In Case 2B), the number of type-1 queries is $m_1=2$, and these type-1 queries are $Q_6$ and $Q_{10}$, so $j_{1,1}=6$ and $j_{1,2}=10$ (also $j_{1,0}=1$ since $\widetilde{Q}_6$ reuses $\widetilde{Q}_1$); the number of type-2 queries is $m_2=3$, and these type-2 queries are $Q_5,Q_9$, and $Q_{11}$, so $j_{2,1}=5$, $j_{2,2}=9$, and $j_{2,3}=11$ (also $j_{2,0}=2$ since $\widetilde{Q}_5$ reuses $\widetilde{Q}_2$); the number of type-3 queries is $m_3=1$, and this type-3 query is $Q_{13}$, so $j_{3,1}=13$ (also $j_{3,0}=3$ since $\widetilde{Q}_{13}$ reuses $\widetilde{Q}_3$).

Then after Algorithm 1 is used to answer all $n$ queries, with query $Q_i$ being answered under $(\epsilon_i,\delta_i)$-differential privacy, we have:

  • The total privacy loss with respect to neighboring datasets $D$ and $D'$ is given by $\mathcal{N}(\frac{G(D,D')}{2},G(D,D'))$, where

    $$G(D,D'):=\sum_{i\in N_1}\frac{[\|Q_i(D)-Q_i(D')\|_2]^2}{{\sigma_i}^2}+\sum_{t\in T_{2B}}\Bigg\{\frac{[\|Q_{j_{t,m_t}}(D)-Q_{j_{t,m_t}}(D')\|_2]^2}{{\sigma_{j_{t,m_t}}}^2}-\frac{[\|Q_{j_{t,0}}(D)-Q_{j_{t,0}}(D')\|_2]^2}{{\sigma_{j_{t,0}}}^2}\Bigg\}, \quad\quad (6)$$

    where the first summation is the contribution from queries in Case 1), and the second summation is the contribution from queries in Case 2B). Since, when $D$ and $D'$ iterate over the space of neighboring datasets, the maximum of $\|Q_i(D)-Q_i(D')\|_2$ is $Q_i$'s $\ell_2$-sensitivity $\Delta_{Q_i}$, and the maxima of both $\|Q_{j_{t,m_t}}(D)-Q_{j_{t,m_t}}(D')\|_2$ and $\|Q_{j_{t,0}}(D)-Q_{j_{t,0}}(D')\|_2$ are $\Delta(\textup{type-}t)$ (as $Q_{j_{t,m_t}}$ and $Q_{j_{t,0}}$ are both type-$t$ queries), we obtain

    $$\max_{\textup{neighboring datasets }D,D'}G(D,D')=\sum_{i\in N_1}\frac{{\Delta_{Q_i}}^2}{{\sigma_i}^2}+\sum_{t\in T_{2B}}\left[\frac{[\Delta(\textup{type-}t)]^2}{{\sigma_{j_{t,m_t}}}^2}-\frac{[\Delta(\textup{type-}t)]^2}{{\sigma_{j_{t,0}}}^2}\right]. \quad\quad (7)$$

    In the example provided in Table II, $\max_{\textup{neighboring datasets }D,D'}G(D,D')$ is given by $\frac{{\Delta_{Q_1}}^2}{{\sigma_1}^2}+\frac{{\Delta_{Q_2}}^2}{{\sigma_2}^2}+\frac{{\Delta_{Q_3}}^2}{{\sigma_3}^2}+\left[\frac{[\Delta(\textup{type-}1)]^2}{{\sigma_{10}}^2}-\frac{[\Delta(\textup{type-}1)]^2}{{\sigma_1}^2}\right]+\left[\frac{[\Delta(\textup{type-}2)]^2}{{\sigma_{11}}^2}-\frac{[\Delta(\textup{type-}2)]^2}{{\sigma_2}^2}\right]+\left[\frac{[\Delta(\textup{type-}3)]^2}{{\sigma_{13}}^2}-\frac{[\Delta(\textup{type-}3)]^2}{{\sigma_3}^2}\right]=\frac{[\Delta(\textup{type-}1)]^2}{{\sigma_{10}}^2}+\frac{[\Delta(\textup{type-}2)]^2}{{\sigma_{11}}^2}+\frac{[\Delta(\textup{type-}3)]^2}{{\sigma_{13}}^2}$, where the simplification uses $\Delta_{Q_1}=\Delta(\textup{type-}1)$, $\Delta_{Q_2}=\Delta(\textup{type-}2)$, and $\Delta_{Q_3}=\Delta(\textup{type-}3)$.

  • From Lemma 2, the total privacy cost of our Algorithm 1 can be given by $(\epsilon_{\textup{ours}},\delta_{\textup{budget}})$-differential privacy for $\epsilon_{\textup{ours}}$ satisfying

    $$[\operatorname{Gaussian}(1,\epsilon_{\textup{ours}},\delta_{\textup{budget}})]^{-2}=\max_{\textup{neighboring datasets }D,D'}G(D,D'), \quad\quad (8)$$

    or $(\epsilon,\delta)$-differential privacy for any $\epsilon$ and $\delta$ satisfying $[\operatorname{Gaussian}(1,\epsilon,\delta)]^{-2}=\max_{\textup{neighboring datasets }D,D'}G(D,D')$.

Proof.

The proof is in Appendix -D. ∎

Remark 3.

Theorem 3 shows that our Algorithm 1 incurs less privacy cost than the naive algorithm where the $n$ queries are answered independently. As given in Corollary 2, the naive algorithm's privacy loss with respect to neighboring datasets $D$ and $D'$ is given by $\mathcal{N}(\frac{F(D,D')}{2},F(D,D'))$ for $F(D,D'):=\sum_{i=1}^{n}\frac{[\|Q_i(D)-Q_i(D')\|_2]^2}{{\sigma_i}^2}$. Clearly, $F(D,D')\geq G(D,D')$ for $G(D,D')$ given by Eq. (6) above. From Lemma 2, the privacy cost of the naive algorithm can be given by $(\epsilon_{\textup{naive}},\delta_{\textup{budget}})$-differential privacy for $\epsilon_{\textup{naive}}$ satisfying $[\operatorname{Gaussian}(1,\epsilon_{\textup{naive}},\delta_{\textup{budget}})]^{-2}=\max_{\textup{neighboring datasets }D,D'}F(D,D')$, which together with Eq. (8) in Theorem 3 and the expression of $\operatorname{Gaussian}(\cdot,\cdot,\cdot)$ in Lemma 1 implies $\frac{\epsilon_{\textup{ours}}}{\epsilon_{\textup{naive}}}=\sqrt{\frac{\max_{\textup{neighboring datasets }D,D'}G(D,D')}{\max_{\textup{neighboring datasets }D,D'}F(D,D')}}\leq 1$, where equality holds only when all $n$ queries are different, so that no noise reuse occurs in our Algorithm 1.
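As an illustrative check of Remark 3 on the Table II example (assuming, hypothetically, unit $\ell_2$-sensitivity for all three query types), the following Python snippet computes the ratio $\epsilon_{\textup{ours}}/\epsilon_{\textup{naive}}=\sqrt{G/F}$:

```python
import math

# sigma_m values from Table II (query index -> sigma).
sigma = {1: 1, 2: 3, 3: 2, 4: 2.5, 5: 2, 6: 0.5, 7: 2, 8: 2.5, 9: 1.5,
         10: 0.25, 11: 1, 12: 0.75, 13: 1.5}

# Naive algorithm (Corollary 2): every query pays Delta^2 / sigma_i^2 (Delta = 1 here).
F = sum(1 / s**2 for s in sigma.values())

# Algorithm 1 (Theorem 3): the Case 2B) terms telescope per type, leaving
# 1/sigma_10^2 + 1/sigma_11^2 + 1/sigma_13^2 for the Table II example.
G = 1 / sigma[10]**2 + 1 / sigma[11]**2 + 1 / sigma[13]**2

print(f"eps_ours / eps_naive = {math.sqrt(G / F):.3f}")  # about 0.82 here
```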

III-H Computing the $\ell_2$-Sensitivity of a Query

The $\ell_2$-sensitivity of a query $Q$ is defined as the maximal $\ell_2$ distance between the (true) query results for any neighboring datasets $D$ and $D'$ that differ in one record: $\Delta_Q=\max_{\textup{neighboring datasets }D,D'}\|Q(D)-Q(D')\|_2$. For a one-dimensional real-valued query $Q$, $\Delta_Q$ is simply the maximal absolute difference between $Q(D)$ and $Q(D')$ for any neighboring datasets $D$ and $D'$. In Section V for performance evaluation, we define neighboring datasets by considering the modification of one entry. Then, if the dataset has $n$ users' information and the domain of each user's income is within the interval $[\textup{min\_income},\textup{max\_income}]$, the sensitivity $\Delta_Q$ for the query $Q$ being the average income of all users is $\frac{\textup{max\_income}-\textup{min\_income}}{n}$, since this is the maximal variation in the output when a user's record changes. Similarly, $\Delta_Q$ for the query $Q$ being the percentage of female users is $\frac{1}{n}$.
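As a short illustration of these computations (using the dataset size and income domain assumed in Section V-A):

```python
def sensitivity_average(min_val, max_val, n):
    # Changing one entry moves an average over n records by at most (max - min)/n.
    return (max_val - min_val) / n

def sensitivity_fraction(n):
    # Changing one entry flips at most one count, moving a fraction by 1/n.
    return 1 / n

print(sensitivity_average(-10000, 1000000, 5000))  # 202.0, "average personal income"
print(sensitivity_fraction(5000))                  # 0.0002, e.g. a frequency query
```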

IV Implementation Challenges of Our Blockchain-Based System

We now discuss challenges and countermeasures during the design and implementation of our blockchain-based system.

Smart Contract fetches external data. Ethereum blockchain applications, such as Bitcoin scripts and smart contracts are unable to access and fetch directly the external data they need. However, in our application, Blockchain needs to fetch data from Server then returns them to Client. This requires smart contract to send the HTTP POST request. Hence, we use the Provable, a service integrated with a number of blockchain protocols and can be accessed by non-blockchain applications as well. It guarantees that data fetched from the original data-source is genuine and untampered.

By using Provable, smart contracts can directly access data from websites or APIs. In our case, Blockchain can send HTTP requests with parameters to Server, and then process and store the data after Server responds successfully.

Mathematical operations with Solidity. Blockchain is written in the Solidity language, which is designed to target the Ethereum Virtual Machine. However, the current Solidity language lacks built-in functions for complex mathematical operations, such as taking the square root or the logarithm. We write a function to implement the square-root operation. To avoid using Lemma 1 to compute logarithms in Blockchain, we generate the Gaussian noise in Client and pass the value to Blockchain as one of the parameters of the function QueryMatch. Besides, the current Solidity version cannot operate on float or double types. To keep the precision, we scale up the noise value during calculation and then scale it back down before returning the noisy data to analysts.
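The following Python sketch illustrates the two workarounds just described: a Babylonian (Newton) integer square root of the kind one implements by hand in Solidity, and fixed-point scaling that emulates fractional values with integer arithmetic. The scale factor 10^6 is an illustrative choice, not necessarily the one used in our contract.

def isqrt_babylonian(x):
    # Integer Newton iteration; Solidity has no built-in sqrt, so the
    # contract implements the same loop over unsigned integers.
    if x == 0:
        return 0
    z, y = x, (x + 1) // 2
    while y < z:
        z, y = y, (y + x // y) // 2
    return z

SCALE = 10**6  # fixed-point scale, since Solidity cannot operate on float/double

def scaled_sqrt(v):
    # Scale by SCALE^2 so the integer root carries a factor of SCALE,
    # then scale back down before returning the value.
    return isqrt_babylonian(int(v * SCALE * SCALE)) / SCALE

print(scaled_sqrt(2.0))  # ~1.414213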

Figure 2: The proposed blockchain-based system working flow for differential-privacy cost management.

V Implementation and Experiments

In this section, we perform experiments to validate that the proposed system and algorithm are effective in saving privacy cost, following the system flow shown in Fig. 2. Specifically, a user sends a query through the UI, and Client receives the query and forwards it to the Blockchain smart contract. After the smart contract checks the stored data, it decides whether to return a noisy response to Client directly or to forward the request to Server. If Server receives the request, it generates and returns a noisy response to the smart contract.
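At the heart of the smart contract's decision is the noise-reuse rule from Theorem 1. The Python sketch below illustrates that rule for a single query type; the function and variable names are ours for illustration only and do not correspond to the actual QueryMatch code.

import math, random

def answer_query(true_value, sigma_new, cache):
    # cache holds (noisy_answer, sigma_old) for this query type, or None.
    if cache is None:
        # Case 1): first occurrence of this query type -- fresh noise.
        noisy = true_value + random.gauss(0.0, sigma_new)
        return noisy, (noisy, sigma_new)
    noisy_old, sigma_old = cache
    if sigma_new >= sigma_old:
        # Cases 2A)/2C): the stored answer is at least as noisy as
        # required, so it is fully reused (r = 1) at zero extra privacy cost.
        return noisy_old, cache
    # Case 2B): partial reuse with the optimal fraction r = (sigma_new/sigma_old)^2
    # from Theorem 1(ii); fresh noise tops the variance up to sigma_new^2,
    # since Var(r * old noise) = r^2 * sigma_old^2.
    r = (sigma_new / sigma_old) ** 2
    fresh_sd = math.sqrt(sigma_new**2 - r**2 * sigma_old**2)
    noisy = true_value + r * (noisy_old - true_value) + random.gauss(0.0, fresh_sd)
    return noisy, (noisy, sigma_new)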

V-A Experiment Setup

We prototype a web application based on the system description in Section III. We write Client in JavaScript, whereas the Blockchain smart contract is written in Solidity. Besides, Web3 is used as the JavaScript API to exchange information between Client and the Blockchain smart contract, and Node.js with the Express web framework is leveraged to set up Server. In addition, MongoDB is used as the database that hosts the real-world dataset. Our smart contracts are deployed on the Ropsten [28] testnet with the MetaMask extension of the Chrome browser. The Ropsten testnet is a testing blockchain environment maintained by Ethereum, and it implements the same Proof-of-Work protocol as the main Ethereum network. Fig. 3 shows a screenshot of our blockchain-based privacy management system, and Fig. 4 presents the outputs while sending queries through the system.

We evaluate the performance of the proposed differential privacy mechanism on a real-world dataset containing American Community Survey samples extracted from the Integrated Public Use Microdata Series at https://www.ipums.org. There are 5000 records in the dataset. Each record includes the numerical attributes “Total personal income”, “Total family income”, and “Age”, and the categorical attributes “Race” and “Citizenship status”. We set the privacy budget as ϵbudget=8\epsilon_{\textup{budget}}=8 and δbudget=104\delta_{\textup{budget}}=10^{-4}, values commonly used to protect the privacy of a dataset [29, 30]. We consider five types of queries: “average personal income”, “average total family income”, “frequency of US citizens”, “frequency of white race”, and “frequency of age more than 60”. For the privacy parameters of each query QiQ_{i}, we sample ϵi\epsilon_{i} uniformly from [0.1,1.1][0.1,1.1] and sample δi\delta_{i} uniformly from [105,104][10^{-5},10^{-4}]. The sensitivities of these queries are 202, 404, 0.0002, 0.0002, and 0.0002, respectively, computed following Section III-H. For the query “average personal income”: since a user’s total personal income ranges from 5000-5000 to 700000700000 in the dataset mentioned above, we assume the domain of total personal income is [10000,1000000][-10000,1000000] for all possible datasets. The sensitivity is then (1000000(10000))/5000=202(1000000-(-10000))/5000=202, and the mechanism protects the privacy of all data within [10000,1000000][-10000,1000000]; thus, it protects the privacy of the dataset in our experiment. For the query “average total family income”, we assume the domain of total family income is [20000,2000000][-20000,2000000] for all possible datasets because the range of total family income is [5000,1379500][-5000,1379500] in the dataset we use. The sensitivity is (2000000(20000))/5000=404(2000000-(-20000))/5000=404. Hence, noise generated with the sensitivity of 404404 protects the privacy of all data within [20000,2000000][-20000,2000000], and therefore of the dataset we use as well. The sensitivity of the queries “frequency of US citizens”, “frequency of white race”, and “frequency of age more than 60” is 1/5000=0.00021/5000=0.0002.

Figure 3: Screenshot of the blockchain-based privacy management system demo.
Figure 4: Display of outputs with the ϵ\epsilon privacy cost.
Figure 5: Performance comparison of the sum of privacy costs.
Figure 6: Performance comparison of the sum of relative errors.

V-B Experimental Results

The benchmark of our experiment is a naive scheme whose smart contract does not implement Algorithm 1. That is, every query is forwarded by the smart contract to Server to obtain a fresh noisy response. Hence, no noise is reused and no differential-privacy cost is saved in the naive scheme.

First, we use an experiment to validate that our proposed Algorithm 1 is effective in saving privacy cost. We design a performance-comparison experiment that tracks the privacy cost under our Algorithm 1 and under the naive scheme. Specifically, we deploy two smart contracts, implementing our Algorithm 1 and the naive scheme respectively, on the Ropsten testnet. Then we send 150 requests, randomly selected from the five query types, from Client of the web application and record the privacy cost of each query. As shown in Fig. 5, compared with the naive scheme, the proposed algorithm saves significant privacy cost. When the number of queries is 150, the differential-privacy cost of Algorithm 1 is about 52%\% less than that of the naive algorithm. We also observe that the privacy cost in the proposed scheme increases slowly as the number of queries increases and even tends to converge to a specific value. The reason is that, in Algorithm 1, for each query type we can always partially or fully reuse previous noisy answers when the query type is asked a second time or more. Therefore, in our scheme, many queries are answered without incurring additional privacy cost because their noisy responses fully reuse previous noisy answers.
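The near-flat growth in Fig. 5 can also be anticipated analytically. By Theorem 3 and Remark 3, the cost of our scheme is driven by G(D,D)G(D,D^{\prime}), in which each query type effectively contributes only through the smallest noise scale ever used for it (see the telescoping sums in Appendix -D), whereas the naive cost F(D,D)F(D,D^{\prime}) accumulates one term per query. The following back-of-the-envelope Python simulation of the two quantities uses our experimental settings and the Gaussian(,,)\operatorname{Gaussian}(\cdot,\cdot,\cdot) expression of Lemma 1; the resulting ratio need not match Fig. 5 exactly, since the on-chain query sequence differs.

import math, random

def gaussian_sigma(delta_q, eps, delta):
    # Classical Gaussian-mechanism calibration (Lemma 1).
    return math.sqrt(2 * math.log(1.25 / delta)) * delta_q / eps

random.seed(0)
sens = {"avg_personal_income": 202, "avg_family_income": 404,
        "freq_citizen": 0.0002, "freq_white": 0.0002, "freq_age_60": 0.0002}

naive_cost = 0.0   # F(D,D'): one term dQ^2 / sigma^2 per answered query
min_sigma = {}     # ours: only the least-noisy answer per query type matters
for _ in range(150):
    q = random.choice(list(sens))
    eps = random.uniform(0.1, 1.1)
    delta = random.uniform(1e-5, 1e-4)
    sigma = gaussian_sigma(sens[q], eps, delta)
    naive_cost += (sens[q] / sigma) ** 2
    min_sigma[q] = min(min_sigma.get(q, float("inf")), sigma)

ours_cost = sum((sens[q] / s) ** 2 for q, s in min_sigma.items())
print(math.sqrt(ours_cost / naive_cost))  # = eps_ours / eps_naive by Remark 3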

Second, to show that the proposed Algorithm 1 preserves accuracy, we design another experiment that compares the sum of relative errors. We use the same smart contracts as in the previous experiment and accumulate the relative error incurred by each query. Fig. 6 shows that the sum of relative errors of Algorithm 1 is comparable with that of the naive scheme. Since the relative errors of the two schemes are similar, our results demonstrate that the proposed Algorithm 1 maintains accuracy.

In summary, Fig. 5 and Fig. 6 together demonstrate that our Algorithm 1 saves privacy cost significantly without sacrificing the accuracy of the query answers.

Third, we evaluate the latency of our system, which is affected by the blockchain, MetaMask, the network condition, and the server. To reduce the network transmission time, we set up a local testnet at http://localhost:3000/ using the Ganache-cli client with blockTime set to 15s [31]. Fig. 7 shows that the latency increases as the number of queries increases. The throughput capacity of Ethereum is 20\sim60 Transactions Per Second (TPS) [32, 33, 34, 35, 36, 37]; when the number of queries reaches 6060, the latency increases significantly. In addition to Ethereum's throughput, both MetaMask and the capacity of the deployed device affect the latency. We test the case in which the query results have already been saved into the system, so the smart contract does not need to send requests to the server and can answer by reusing previous query results. The worst case is that the smart contract has to send a request for every query, which takes a longer time to obtain the result because of the third-party service Provable.

Figure 7: Latency vs the number of queries.

Fourth, we evaluate the relationship between the query utility and the privacy budget. As defined in [38], a mechanism is (α,β)(\alpha,\beta)-useful if |Q~m(D)Qm(D)|α|\widetilde{Q}_{m}(D)-Q_{m}(D)|\leq\alpha with probability at least 1β1-\beta. Thus, a small α\alpha means that the difference between the perturbed result and the actual result is small, which reflects a high utility. The noise added to a query is calibrated as σ=Gaussian(ΔQ,ϵ,δ){\sigma}=\operatorname{Gaussian}(\Delta_{Q},\epsilon,\delta), where Gaussian(ΔQ,ϵ,δ)=2ln1.25δ×ΔQϵ\operatorname{Gaussian}(\Delta_{Q},\epsilon,\delta)=\sqrt{2\ln\frac{1.25}{\delta}}\times\frac{\Delta_{Q}}{\epsilon}. We set δ=105\delta=10^{-5} and ϵ[1,8]\epsilon\in[1,8]. Appendix -E proves that when we set β=0.05\beta=0.05, we have α=2σ\alpha=2\sigma. Fig. 8 and Fig. 9 illustrate how the utility and the noise change as the privacy budget ϵ\epsilon increases. Fig. 8 shows that the value of α\alpha decreases as the privacy budget ϵ\epsilon increases, meaning that the utility increases. In addition, the amount of added noise also reflects the query utility: the less noise is added to a query response, the more utility the response gains. Fig. 9 shows how the noise changes with the privacy budget: as the privacy budget increases, the noise decreases, which means that the query utility increases. The amount of noise depends on the privacy budget and the sensitivity value. The queries “Frequency of US citizens”, “Frequency of white race”, and “Frequency of age more than 60” have the same sensitivity value 0.00020.0002, so the noise added to their responses is the same when the privacy budgets they use are equal.

Figure 8: Utility vs the privacy budget.
Figure 9: Noise vs the privacy budget.
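A short Python calculation reproduces the trends in Fig. 8 and Fig. 9 for the frequency queries (ΔQ=0.0002\Delta_{Q}=0.0002, δ=105\delta=10^{-5}, and β=0.05\beta=0.05, hence α=2σ\alpha=2\sigma by Appendix -E):

import math

def gaussian_sigma(delta_q, eps, delta=1e-5):
    # sigma = sqrt(2 ln(1.25/delta)) * dQ / eps  (Lemma 1)
    return math.sqrt(2 * math.log(1.25 / delta)) * delta_q / eps

for eps in range(1, 9):
    sigma = gaussian_sigma(0.0002, eps)
    print(f"eps={eps}: noise sigma={sigma:.6f}, alpha={2 * sigma:.6f}")

Both σ\sigma and α\alpha scale as 1/ϵ1/\epsilon, so doubling the privacy budget halves the noise and halves α\alpha, i.e., increases the utility.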

Fifth, we evaluate the efficiency of computing resources used by the DP scheme with and without the blockchain. In the scheme without the blockchain's involvement, data analysts call the server directly to obtain noisy answers. We simulate this case by using the “Apache JMeter” API testing software to send queries with differential-privacy parameters directly to the server. We consider the CPU usage as the metric for evaluating the efficiency of computing resources [39, 40].

Experimental Settings: We run the applications on a MacBook Pro with a 2.3 GHz Quad-Core Intel Core i5 and 8 GB of 2133 MHz LPDDR3 memory. In the experiment, we first deploy a local private blockchain testnet using the “Go-Ethereum” platform with three mining nodes. We then host a separate local server for testing the CPU usage of processing the API requests sent from “Apache JMeter”. We randomly send 50 queries for testing and, at the same time, use the “Activity Monitor” software to track and record the CPU usage [41].

Experimental Results: Fig. 10 compares the CPU usage of the schemes with and without the blockchain. We observe that the CPU usage of our blockchain-based scheme is higher than that of the scheme in which the server handles queries directly, but it remains acceptable. The Node.js and Express.js web application frameworks, which we use to build the web server application, are CPU-intensive. One approach to save computing resources when using the blockchain is to stop mining when there is no incoming task and to resume mining only when necessary. Therefore, our proposed scheme is efficient and practical with an acceptable computation cost.

Figure 10: Comparison of DP scheme’s CPU usage with or without smart contract.

VI Related Work

In this section, we first compare our work with a closely related study [19] and then discuss other related work.

VI-A Comparison with Yang et al. [19]

Yang et al. [19] utilize blockchain and differential-privacy technologies to achieve security and privacy protection during data sharing. We summarize the differences between our work and [19] as follows.

  • Although Algorithm 1 of [19] claims to satisfy ϵ\epsilon-differential privacy, it does not, since the noisy output’s domain (i.e., the set of all possible values) depends on the input. The explanation is as follows. In [19], for two neighboring datasets DD and DD^{\prime}, there exists a subset 𝒴\mathcal{Y} of outputs such that [Q~(D)𝒴]>0{\mathbb{P}}\left[{\widetilde{Q}(D)\in\mathcal{Y}}\right]>0 but [Q~(D)𝒴]=0{\mathbb{P}}\left[{\widetilde{Q}(D^{\prime})\in\mathcal{Y}}\right]=0. This means [Q~(D)𝒴][Q~(D)𝒴]=>eϵ\frac{{\mathbb{P}}\left[{\widetilde{Q}(D)\in\mathcal{Y}}\right]}{{\mathbb{P}}\left[{\widetilde{Q}(D^{\prime})\in\mathcal{Y}}\right]}=\infty>e^{\epsilon}, which violates ϵ\epsilon-differential privacy for any ϵ<\epsilon<\infty.

  • [19] does not discuss how to choose the small additional privacy parameter in its Algorithm 1.

  • In [19], when a query is asked for the first time, the Laplace mechanism of [7] for ϵ\epsilon-differential privacy is used to add Laplace noise to the true query result. Afterwards, [19] adds new Laplace noise to the previous noisy output, which makes the new noisy response no longer follow a Laplace distribution, since the sum of independent Laplace random variables is not Laplace distributed. Hence, the analysis in [19] does not hold.

We consider (ϵ,δ)(\epsilon,\delta)-differential privacy using Gaussian noise. The advantage of Gaussian noise over Laplace noise lies in the easier privacy analysis for the composition of different privacy-preserving algorithms: the sum of independent Gaussian random variables still follows a Gaussian distribution, whereas the sum of independent Laplace random variables does not follow a Laplace distribution.
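The additivity underlying this easier analysis can be stated in one line: for independent X𝒩(0,σ12)X\sim\mathcal{N}(0,\sigma_{1}^{2}) and Y𝒩(0,σ22)Y\sim\mathcal{N}(0,\sigma_{2}^{2}), we have X+Y𝒩(0,σ12+σ22)X+Y\sim\mathcal{N}(0,\sigma_{1}^{2}+\sigma_{2}^{2}), whereas the Laplace family enjoys no such closure under addition. This is exactly what allows Algorithm 1 to top up a reused noisy answer with fresh Gaussian noise and still obtain an exactly Gaussian total noise.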

VI-B Other Related Work

Differential privacy, a strong mathematical framework for guaranteeing a database's privacy, has attracted much attention in recent years. Blockchain is a fast-growing technology that provides security and privacy in a decentralized manner [42, 18, 43, 44, 45, 46]. Feng et al. [47] summarize prior studies on privacy protection in blockchain systems, including methodologies for identity and transaction privacy preservation. In the following, we introduce more recent studies that utilize blockchain or privacy techniques to protect identities, data, and transactions.

Leveraging Blockchains for Identity Privacy/Security Protection. A few studies have focused on leveraging the blockchain to guarantee privacy/security in access-control management or identity protection. For example, Zyskind et al. [48] and Xia et al. [49] both use blockchain for access-control management. Zyskind et al. [48] create a decentralized personal data management system to address users' concerns about privacy when using third-party mobile platforms. Xia et al. [49] propose a permissioned blockchain-based data sharing framework that allows only verified users to access cloud data. Lu et al. [50] develop ZebraLancer, a private and anonymous decentralized crowdsourcing system, which overcomes data leakage and identity breach in traditional decentralized crowdsourcing. The above studies focus on identity privacy, since the blockchain is anonymous, but they do not consider privacy protection for the database.

Leveraging Blockchains for Data Privacy/Security Protection. In addition to identity privacy preservation, Hu et al. [51] replace the central server with a smart contract and construct a decentralized privacy-preserving search scheme for computing over encrypted data, which ensures data privacy and prevents the misbehavior of a malicious centralized server. Luongo et al. [52] use secure multi-party computation to design a privacy primitive named Keep, which allows contracts to manage and use private data without exposing the data to the public blockchain, thereby protecting smart contracts on public blockchains. Alternatively, we use the differential-privacy standard to guarantee privacy. Moreover, blockchains are widely used to secure data sharing in IoT scenarios [43, 44, 49].

Leveraging Blockchains for Transaction Privacy/Security Protection. Moreover, some previous studies use blockchain to guarantee security and privacy in transactions. For example, Henry et al. [17] propose that the blockchain should use mechanisms that piggyback on the overlay network, which already announces transactions, to de-link users' network-level information, instead of relying on an external service such as Tor to protect users' privacy. Gervais et al. [32] propose a quantitative framework to analyze the security of Proof-of-Work blockchains, whose inputs include security, consensus, and network parameters. Herrera-Joancomartí and Pérez-Solà [53] focus on privacy in Bitcoin transactions. Sani et al. [54] propose Xyreum, a new high-performance and scalable blockchain for securing transactions in the Industrial Internet of Things.

Leveraging differential privacy for protecting privacy of IoT data. IoT devices collect users' usage status periodically, which may contain sensitive information such as energy consumption or location data. To avoid privacy leakage, some studies use differential-privacy mechanisms to protect the privacy of such data [1, 2, 3, 4, 5]. For example, Tudor et al. [1] propose Bes, a streaming-based framework for privately disclosing IoT data. Hassan et al. [2] treat each IoT node as a blockchain node to guarantee the IoT nodes' security, and they leverage differential-privacy mechanisms to protect the data privacy of each node. To prevent adversaries from intercepting the Internet traffic from/to a smart home gateway or profiling residents' behaviors through digital traces, Liu et al. [4] leverage differential privacy to develop a framework that defends against such attacks. However, none of these studies discusses how to reuse the differential-privacy budget.

Differentially Private Algorithms. Several differentially private algorithms have been proposed to provide privacy protection. Xiao et al. [55] propose an algorithm that correlates the Laplace noise added to different queries to improve the overall accuracy. Given a series of counting queries, the mechanism proposed by Li and Miklau [56] selects a subset of queries to answer privately and uses their noisy answers to derive answers for the remaining queries. For a set of non-overlapping counting queries, Kellaris and Papadopoulos [57] pre-process the counts by elaborately grouping and smoothing them via averaging to reduce the sensitivity and thus the amount of injected noise. Given a workload of queries, Yaroslavtsev et al. [58] introduce a solution that balances accuracy and efficiency by answering some queries more accurately than others.

VII Discussion and Future Work

In this section, we discuss the meaning of the differential-privacy parameters, the privacy of the smart contract, and the privacy of queries.

VII-A Differential Privacy Parameters

The value of the differential-privacy parameter ϵ\epsilon represents the level of protection provided by differential-privacy mechanisms. McSherry et al. [59, 60] quantify the strength of differential privacy as follows: when ϵ=0\epsilon=0, the differential-privacy mechanism provides perfect privacy; when ϵ0.1\epsilon\leq 0.1, the protection is considered strong; and when ϵ10\epsilon\geq 10, the privacy protection is weak. The privacy parameter δ\delta bounds the (small) probability with which the ϵ\epsilon-differential-privacy guarantee is allowed to fail, so it should be very small. We sample δ\delta uniformly from [105,104][10^{-5},10^{-4}] for each query.

VII-B Privacy for Smart Contract

A smart contract is publicly visible once it is deployed to a public blockchain, so in our experiment attackers can obtain the algorithm implemented in the smart contract. However, even with knowledge of the algorithm, they still cannot recover the accurate responses from the noisy results. Several approaches can be used to protect the privacy of the smart contract, as follows:

First, Kosba et al. [61] propose Hawk, a framework for building privacy-preserving smart contracts. Hawk enables programmers to write private smart contracts without handling cryptography themselves; the framework automatically generates a cryptographic protocol at compile time.

Second, we can partially address this problem by deploying Ethereum on a private blockchain, possibly combined with the Proof-of-Authority [62] consensus mechanism. When Ethereum is deployed on a private blockchain, the private blockchain can enforce access control, so attackers would need to break the access control before they could access the smart contract. Therefore, when using a private blockchain, we rely on access control to protect the smart contract's privacy.

Since protecting the smart contract's privacy is complex, we leave it as future work.

VII-C Privacy for Queries

Differential-privacy (DP) mechanisms assume that data analysts are untrusted while the curator who holds the database is trusted. The trusted curator stores the database and responds to statistical queries made by an untrusted analyst, so DP does not protect the privacy of the data analysts' queries. Moreover, DP supports statistical queries, which may not themselves contain much sensitive information. Nevertheless, when using the smart contract, some data analysts may worry about the privacy of their queries. There are three ways to protect the privacy of queries:

First, we may use a private blockchain and conduct the experiments with the Proof-of-Authority consensus mechanism; Ethereum also supports deploying smart contracts on a private blockchain. In this case, the smart contract can be considered trusted, so the sensitive information in the queries will not leak.

Second, we may combine other cryptographic techniques with differential privacy. For example, Agarwal et al. [63] design encrypted databases (EDBs) that support differentially private statistical queries. In their setting, both the curator and the analyst are untrusted; the curator can securely outsource the database to an untrusted server by leveraging EDBs to encrypt the operations. Since privacy protection for queries is complicated and may involve further privacy and security techniques, we also leave the privacy of queries as future work.

Third, leakage through queries can be avoided by fixing the types of queries. Since the privacy budget is quite limited, data analysts cannot ask too many questions anyway, so one solution is to control the types of queries. The system builder may take some time to select the questions commonly used by data analysts and then provide a dropdown list from which analysts select their questions. In this case, our system will not leak the privacy of queries because the queries are standardized.

VII-D Difference between Our Proposed Scheme and Normal DP Schemes

We pre-define queries to calculate the sensitivity values efficiently and to save users' time; the sensitivity values of different queries would have to be pre-defined even if we did not use a blockchain. When a query arrives many times with different differential-privacy parameters, our scheme plays an essential role in saving the differential-privacy budget. For example, many companies may send the same query to a dataset because they perform similar data-analysis tasks; in this case, part of the privacy budget can be saved. Since the privacy budget is a scarce resource of a dataset, it is worthwhile to use our scheme.

However, when a query is seen for the first time, our scheme can only treat it as a new query. Since the differential-privacy budget is quite limited and the sensitivities have to be calculated beforehand, a dataset cannot support too many different statistical queries. If our system were deployed in the real world, then, similar to Fig. 3, a list of supported queries could be provided to control the variation of queries instead of letting data analysts freely type in arbitrary queries.

VIII Conclusion

In this paper, we have presented a blockchain-based approach for tracking and saving differential-privacy cost. In our design, we propose an algorithm that reuses noise fully or partially across different instances of the same query type to minimize the accumulated privacy cost, and we establish its privacy saving via a rigorous mathematical proof. Moreover, we design and implement a blockchain-based system and conduct real-world experiments that confirm the effectiveness of the proposed approach.

References

  • [1] V. Tudor, V. Gulisano, M. Almgren, and M. Papatriantafilou, “Bes: Differentially private event aggregation for large-scale IoT-based systems,” Future Generation Computer Systems, vol. 108, pp. 1241–1257, 2020.
  • [2] M. U. Hassan, M. H. Rehmani, and J. Chen, “Privacy preservation in blockchain based iot systems: Integration issues, prospects, challenges, and future research directions,” Future Generation Computer Systems, vol. 97, pp. 512–529, 2019.
  • [3] J. Xiong, J. Ren, L. Chen, Z. Yao, M. Lin, D. Wu, and B. Niu, “Enhancing privacy and availability for data clustering in intelligent electrical service of IoT,” IEEE Internet of Things Journal, vol. 6, no. 2, pp. 1530–1540, 2018.
  • [4] J. Liu, C. Zhang, and Y. Fang, “Epic: A differential privacy framework to defend smart homes against internet traffic analysis,” IEEE Internet of Things Journal, vol. 5, no. 2, pp. 1206–1217, 2018.
  • [5] K. Gai, Y. Wu, L. Zhu, Z. Zhang, and M. Qiu, “Differential privacy-based blockchain for industrial Internet-of-Things,” IEEE Transactions on Industrial Informatics, vol. 16, no. 6, pp. 4156–4165, 2019.
  • [6] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor, “Our data, ourselves: Privacy via distributed noise generation,” in International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), 2006, pp. 486–503.
  • [7] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography Conference (TCC), 2006, pp. 265–284.
  • [8] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Theoretical Computer Science, vol. 9, no. 3-4, pp. 211–407, 2014.
  • [9] C. Dwork and G. N. Rothblum, “Concentrated differential privacy,” arXiv preprint arXiv:1603.01887, 2016.
  • [10] M. Bun and T. Steinke, “Concentrated differential privacy: Simplifications, extensions, and lower bounds,” in Theory of Cryptography Conference (TCC), 2016, pp. 635–658.
  • [11] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, “Generalization in adaptive data analysis and holdout reuse,” in Conference on Neural Information Processing Systems (NIPS), 2015, pp. 2341–2349.
  • [12] F. McSherry and K. Talwar, “Mechanism design via differential privacy,” in IEEE Symposium on Foundations of Computer Science (FOCS), 2007, pp. 94–103.
  • [13] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in ACM Conference on Computer and Communications Security (CCS), 2016, pp. 308–318.
  • [14] J. Tang, A. Korolova, X. Bai, X. Wang, and X. Wang, “Privacy loss in Apple’s implementation of differential privacy on macOS 10.12,” arXiv preprint arXiv:1709.02753, 2017.
  • [15] Ú. Erlingsson, V. Pihur, and A. Korolova, “RAPPOR: Randomized aggregatable privacy-preserving ordinal response,” in ACM Conference on Computer and Communications Security (CCS), 2014, pp. 1054–1067.
  • [16] B. Ding, J. Kulkarni, and S. Yekhanin, “Collecting telemetry data privately,” in Conference on Neural Information Processing Systems (NIPS), 2017, pp. 3571–3580.
  • [17] R. Henry, A. Herzberg, and A. Kate, “Blockchain access privacy: Challenges and directions,” IEEE Security & Privacy, vol. 16, no. 4, pp. 38–45, 2018.
  • [18] J. Kang, Z. Xiong, D. Niyato, S. Xie, and J. Zhang, “Incentive mechanism for reliable federated learning: A joint optimization approach to combining reputation and contract theory,” IEEE Internet of Things Journal, vol. 6, no. 6, pp. 10 700–10 714, Dec 2019.
  • [19] M. Yang, A. Margheri, R. Hu, and V. Sassone, “Differentially private data sharing in a cloud federation with blockchain,” IEEE Cloud Computing, vol. 5, no. 6, pp. 69–79, 2018.
  • [20] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Foundations and Trends® in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
  • [21] S. Nakamoto, “Bitcoin: A Peer-to-Peer Electronic Cash System,” https://bitcoin.org/bitcoin.pdf, 2008.
  • [22] Investopedia, “Blockchain,” https://www.investopedia.com/terms/b/blockchain.asp.
  • [23] G. Wood, “Ethereum: A secure decentralised generalised transaction ledger,” Ethereum Project Yellow Paper, 2014.
  • [24] N. Szabo, “Smart Contracts,” http://www.fon.hum.uva.nl/rob/Courses/InformationInSpeech/CDROM/Literature/LOTwinterschool2006/szabo.best.vwh.net/smart.contracts.html, 1994.
  • [25] V. Buterin, “A next-generation smart contract and decentralized application platform,” white paper, 2014.
  • [26] L. M. Han, Y. Zhao, and J. Zhao, “Blockchain-based differential privacy cost management system,” arXiv preprint arXiv:2006.04693, 2020.
  • [27] P. Kairouz, S. Oh, and P. Viswanath, “The composition theorem for differential privacy,” IEEE Transactions on Information Theory, vol. 63, no. 6, pp. 4037–4049, 2017.
  • [28] “Ropsten’s official github page.” https://github.com/ethereum/ropsten, accessed on 9 January 2019.
  • [29] D. Sánchez, J. Domingo-Ferrer, and S. Martínez, “Improving the utility of differential privacy via univariate microaggregation,” in International Conference on Privacy in Statistical Databases.   Springer, 2014, pp. 130–142.
  • [30] N. Wang, X. Xiao, Y. Yang, J. Zhao, S. C. Hui, H. Shin, J. Shin, and G. Yu, “Collecting and analyzing multidimensional data with local differential privacy,” in 2019 IEEE 35th International Conference on Data Engineering (ICDE).   IEEE, 2019, pp. 638–649.
  • [31] “Ganache,” Accessed on 2020. [Online]. Available: https://www.trufflesuite.com/ganache
  • [32] A. Gervais, G. Karame, K. Wüst, V. Glykantzis, H. Ritzdorf, and S. Capkun, “On the security and performance of proof of work blockchains,” in ACM SIGSAC Conference on Computer and Communication Security (CCS), 2016.
  • [33] B. Cao, Y. Li, L. Zhang, L. Zhang, S. Mumtaz, Z. Zhou, and M. Peng, “When Internet of Things meets blockchain: Challenges in distributed consensus,” IEEE Network, vol. 33, no. 6, pp. 133–139, 2019.
  • [34] D. Vujičić, D. Jagodić, and S. Ranđić, “Blockchain technology, bitcoin, and ethereum: A brief overview,” in 2018 17th international symposium infoteh-jahorina (infoteh).   IEEE, 2018, pp. 1–6.
  • [35] J. Stark, “Making sense of ethereum’s layer 2 scaling solutions: state channels, plasma, and truebit,” 2018.
  • [36] H. Tang, Y. Shi, and P. Dong, “Public blockchain evaluation using entropy and TOPSIS,” Expert Systems with Applications, vol. 117, pp. 204–210, 2019.
  • [37] M. H. Miraz and D. C. Donald, “LApps: technological, legal and market potentials of blockchain lightning network applications,” in Proceedings of the 2019 3rd International Conference on Information System and Data Mining, 2019, pp. 185–189.
  • [38] A. Blum, K. Ligett, and A. Roth, “A learning theory approach to noninteractive database privacy,” Journal of the ACM (JACM), vol. 60, no. 2, pp. 1–25, 2013.
  • [39] P. Zheng, Z. Zheng, X. Luo, X. Chen, and X. Liu, “A detailed and real-time performance monitoring framework for blockchain systems,” in 2018 IEEE/ACM 40th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP).   IEEE, 2018, pp. 134–143.
  • [40] S. Morishima and H. Matsutani, “Accelerating blockchain search of full nodes using gpus,” in 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP).   IEEE, 2018, pp. 244–248.
  • [41] S. King and S. Nadal, “Ppcoin: Peer-to-peer crypto-currency with proof-of-stake,” self-published paper, August, vol. 19, p. 1, 2012.
  • [42] X. Li, H. Li, H. Yan, Z. Cheng, W. Sun, and H. Zhu, “Mitigating query-flooding parameter duplication attack on regression models with high-dimensional gaussian mechanism,” 2020.
  • [43] J. Kang, Z. Xiong, D. Niyato, D. Ye, D. I. Kim, and J. Zhao, “Toward secure blockchain-enabled internet of vehicles: Optimizing consensus management using reputation and contract theory,” IEEE Transactions on Vehicular Technology, vol. 68, no. 3, pp. 2906–2920, March 2019.
  • [44] J. Kang, R. Yu, X. Huang, M. Wu, S. Maharjan, S. Xie, and Y. Zhang, “Blockchain for secure and efficient data sharing in vehicular edge computing and networks,” IEEE Internet of Things Journal, vol. 6, no. 3, pp. 4660–4670, June 2019.
  • [45] Y. C. Tsai, R. Tso, Z.-Y. Liu, and K. Chen, “An improved non-interactive zero-knowledge range proof for decentralized applications,” in 2019 IEEE International Conference on Decentralized Applications and Infrastructures (DAPPCON).   IEEE, 2019, pp. 129–134.
  • [46] A. Fernández Anta, “Atomic appends: Selling cars and coordinating armies with multiple blockchains,” 2019.
  • [47] Q. Feng, D. He, S. Zeadally, M. K. Khan, and N. Kumar, “A survey on privacy protection in blockchain system,” Journal of Network and Computer Applications, vol. 126, pp. 45–58, 2019.
  • [48] G. Zyskind, O. Nathan, and A. Pentland, “Decentralizing privacy: Using blockchain to protect personal data,” in IEEE Security and Privacy Workshops (SPW), 2015, pp. 180–184.
  • [49] Q. Xia, E. B. Sifah, A. Smahi, S. Amofa, and X. Zhang, “BBDS: Blockchain-based data sharing for electronic medical records in cloud environments,” Information, vol. 8, no. 2, p. 44, 2017.
  • [50] Y. Lu, Q. Tang, and G. Wang, “Zebralancer: Private and anonymous crowdsourcing system atop open blockchain,” in 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS).   IEEE, 2018, pp. 853–865.
  • [51] S. Hu, C. Cai, Q. Wang, C. Wang, X. Luo, and K. Ren, “Searching an encrypted cloud meets blockchain: A decentralized, reliable and fair realization,” in IEEE INFOCOM 2018-IEEE Conference on Computer Communications.   IEEE, 2018, pp. 792–800.
  • [52] M. Luongo and C. Pon, “The Keep network: A privacy layer for public blockchains,” https://keep.network/whitepaper, Tech. Rep., 2017.
  • [53] J. Herrera-Joancomartí and C. Pérez-Solà, “Privacy in bitcoin transactions: New challenges from blockchain scalability solutions,” in Modeling Decisions for Artificial Intelligence, 2016, pp. 26–44.
  • [54] A. S. Sani, D. Yuan, W. Bao, P. L. Yeoh, Z. Y. Dong, B. Vucetic, and E. Bertino, “Xyreum: A high-performance and scalable blockchain for iiot security and privacy,” in 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS).   IEEE, 2019, pp. 1920–1930.
  • [55] X. Xiao, G. Bender, M. Hay, and J. Gehrke, “iReduct: Differential privacy with reduced relative errors,” in ACM International Conference on Management of Data (SIGMOD), 2011, pp. 229–240.
  • [56] C. Li and G. Miklau, “An adaptive mechanism for accurate query answering under differential privacy,” Proceedings of the VLDB Endowment, vol. 5, no. 6, pp. 514–525, 2012.
  • [57] G. Kellaris and S. Papadopoulos, “Practical differential privacy via grouping and smoothing,” Proceedings of the VLDB Endowment, vol. 6, no. 5, pp. 301–312, 2013.
  • [58] G. Yaroslavtsev, G. Cormode, C. M. Procopiuc, and D. Srivastava, “Accurate and efficient private release of datacubes and contingency tables,” in International Conference on Data Engineering (ICDE), 2013, pp. 745–756.
  • [59] F. McSherry and R. Mahajan, “Differentially-private network trace analysis,” ACM SIGCOMM Computer Communication Review, vol. 40, no. 4, pp. 123–134, 2010.
  • [60] P. Aaby, J. M. De Acuna, R. Macfarlane, and W. J. Buchanan, “Privacy parameter variation using RAPPOR on a malware dataset,” in 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE).   IEEE, 2018, pp. 938–945.
  • [61] A. Kosba, A. Miller, E. Shi, Z. Wen, and C. Papamanthou, “Hawk: The blockchain model of cryptography and privacy-preserving smart contracts,” in 2016 IEEE symposium on security and privacy (SP).   IEEE, 2016, pp. 839–858.
  • [62] “Proof of Authority,” Accessed on 2020. [Online]. Available: https://github.com/paritytech/parity/wiki/Proof-of-Authority-Chains.
  • [63] A. Agarwal, M. Herlihy, S. Kamara, and T. Moataz, “Encrypted databases for differential privacy,” Proceedings on Privacy Enhancing Technologies, vol. 2019, no. 3, pp. 170–190, 2019.
  • [64] B. Balle and Y.-X. Wang, “Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising,” in International Conference on Machine Learning (ICML), 2018.
  • [65] F. Pukelsheim, “The three sigma rule,” The American Statistician, vol. 48, no. 2, pp. 88–91, 1994.

-A Proof of Theorem 1

Proof.

(i) As noted in the statement of Theorem 1, we suppose that before answering query QmQ_{m} and after answering Q1,Q2,,Qm1Q_{1},Q_{2},\ldots,Q_{m-1}, the privacy loss LQ~1Q~2Q~m1(D,D)L_{\widetilde{Q}_{1}\|\widetilde{Q}_{2}\|\ldots\|\widetilde{Q}_{m-1}}(D,D^{\prime}) is given by 𝒩(A(D,D)2,A(D,D))\mathcal{N}(\frac{A(D,D^{\prime})}{2},A(D,D^{\prime})) for some A(D,D)A(D,D^{\prime}). Later we will show the existence of such A(D,D)A(D,D^{\prime}). Then when yiy_{i} follows the probability distribution of random variable Q~i(D)\widetilde{Q}_{i}(D) for each i{1,2,,m1}i\in\{1,2,\ldots,m-1\}, we have the following for the privacy loss LQ~1Q~2Q~m1(D,D;y1,y2,,ym1)L_{\widetilde{Q}_{1}\|\widetilde{Q}_{2}\|\ldots\|\widetilde{Q}_{m-1}}(D,D^{\prime};y_{1},y_{2},\ldots,y_{m-1}):

LQ~1Q~2Q~m1(D,D;y1,y2,,ym1)\displaystyle L_{\widetilde{Q}_{1}\|\widetilde{Q}_{2}\|\ldots\|\widetilde{Q}_{m-1}}(D,D^{\prime};y_{1},y_{2},\ldots,y_{m-1})
:=ln𝔽[i=1m1[Q~i(D)=yi]]𝔽[i=1m1[Q~i(D)=yi]]𝒩(A(D,D)2,A(D,D)),\displaystyle:=\ln\frac{\mathbb{F}\left[\cap_{i=1}^{m-1}\left[\widetilde{Q}_{i}(D)=y_{i}\right]\right]}{\mathbb{F}\left[\cap_{i=1}^{m-1}\left[\widetilde{Q}_{i}(D^{\prime})=y_{i}\right]\right]}\sim\mathcal{N}(\frac{A(D,D^{\prime})}{2},A(D,D^{\prime})), (9)

where we use “\sim” to mean “obeying the distribution”.

Now we need to analyze the privacy loss after answering the mm queries Q1,Q2,,QmQ_{1},Q_{2},\ldots,Q_{m}. We look at the privacy loss LQ~1Q~2Q~m(D,D;y1,y2,,ym)L_{\widetilde{Q}_{1}\|\widetilde{Q}_{2}\|\ldots\|\widetilde{Q}_{m}}(D,D^{\prime};y_{1},y_{2},\ldots,y_{m}) defined as follows:

LQ~1Q~2Q~m(D,D;y1,y2,,ym)\displaystyle L_{\widetilde{Q}_{1}\|\widetilde{Q}_{2}\|\ldots\|\widetilde{Q}_{m}}(D,D^{\prime};y_{1},y_{2},\ldots,y_{m})
:=ln𝔽[i=1m[Q~i(D)=yi]]𝔽[i=1m[Q~i(D)=yi]].\displaystyle:=\ln\frac{\mathbb{F}\left[\cap_{i=1}^{m}\left[\widetilde{Q}_{i}(D)=y_{i}\right]\right]}{\mathbb{F}\left[\cap_{i=1}^{m}\left[\widetilde{Q}_{i}(D^{\prime})=y_{i}\right]\right]}. (10)

Hence, we use (9) to analyze (10). From (3), since Q~m(D)\widetilde{Q}_{m}(D) is generated by reusing Q~j(D)\widetilde{Q}_{j}(D) and generating additional noise (if necessary), where jj is an integer in {1,2,,m1}\{1,2,\ldots,m-1\} as noted in the statement of Theorem 1, we have

LQ~1Q~2Q~m(D,D;y1,y2,,ym)\displaystyle L_{\widetilde{Q}_{1}\|\widetilde{Q}_{2}\|\ldots\|\widetilde{Q}_{m}}(D,D^{\prime};y_{1},y_{2},\ldots,y_{m})
=ln𝔽[i=1m1[Q~i(D)=yi]]𝔽[Q~m(D)=ym|Q~j(D)=yj]𝔽[i=1m1[Q~i(D)=yi]]𝔽[Q~m(D)=ym|Q~j(D)=yj]\displaystyle=\ln\frac{\mathbb{F}\left[\cap_{i=1}^{m-1}\left[\widetilde{Q}_{i}(D)=y_{i}\right]\right]{\mathbb{F}}\left[{\widetilde{Q}_{m}(D)=y_{m}~{}|~{}\widetilde{Q}_{j}(D)=y_{j}}\right]}{\mathbb{F}\left[\cap_{i=1}^{m-1}\left[\widetilde{Q}_{i}(D^{\prime})=y_{i}\right]\right]{\mathbb{F}}\left[{\widetilde{Q}_{m}(D^{\prime})=y_{m}~{}|~{}\widetilde{Q}_{j}(D^{\prime})=y_{j}}\right]}
=ln𝔽[i=1m1[Q~i(D)=yi]]𝔽[i=1m1[Q~i(D)=yi]]\displaystyle=\ln\frac{\mathbb{F}\left[\cap_{i=1}^{m-1}\left[\widetilde{Q}_{i}(D)=y_{i}\right]\right]}{\mathbb{F}\left[\cap_{i=1}^{m-1}\left[\widetilde{Q}_{i}(D^{\prime})=y_{i}\right]\right]}
+ln𝔽[Q~m(D)=ym|Q~j(D)=yj]𝔽[Q~m(D)=ym|Q~j(D)=yj].\displaystyle\quad+\ln\frac{{\mathbb{F}}\left[{\widetilde{Q}_{m}(D)=y_{m}~{}|~{}\widetilde{Q}_{j}(D)=y_{j}}\right]}{{\mathbb{F}}\left[{\widetilde{Q}_{m}(D^{\prime})=y_{m}~{}|~{}\widetilde{Q}_{j}(D^{\prime})=y_{j}}\right]}. (11)

We now discuss the first term ln𝔽[i=1m1[Q~i(D)=yi]]𝔽[i=1m1[Q~i(D)=yi]]\ln\frac{\mathbb{F}\left[\cap_{i=1}^{m-1}\left[\widetilde{Q}_{i}(D)=y_{i}\right]\right]}{\mathbb{F}\left[\cap_{i=1}^{m-1}\left[\widetilde{Q}_{i}(D^{\prime})=y_{i}\right]\right]} and the second term ln𝔽[Q~m(D)=ym|Q~j(D)=yj]𝔽[Q~m(D)=ym|Q~j(D)=yj]\ln\frac{{\mathbb{F}}\left[{\widetilde{Q}_{m}(D)=y_{m}~{}|~{}\widetilde{Q}_{j}(D)=y_{j}}\right]}{{\mathbb{F}}\left[{\widetilde{Q}_{m}(D^{\prime})=y_{m}~{}|~{}\widetilde{Q}_{j}(D^{\prime})=y_{j}}\right]} in the last row of (11). To begin with, from (9), the first term in the last row of (11) follows the Gaussian distribution 𝒩(A(D,D)2,A(D,D))\mathcal{N}(\frac{A(D,D^{\prime})}{2},A(D,D^{\prime})). Next, we analyze the second term in the last row of (11).

When Q~j(D)\widetilde{Q}_{j}(D) and Q~m(D)\widetilde{Q}_{m}(D) take the values yjy_{j} and ymy_{m} respectively, Q~j(D)Qj(D)\widetilde{Q}_{j}(D)-{Q}_{j}(D) and Q~m(D)Qm(D)r[Q~j(D)Qj(D)]\widetilde{Q}_{m}(D)-Q_{m}(D)-r[\widetilde{Q}_{j}(D)-Q_{j}(D)] take the values gjg_{j} and gmg_{m} defined as follows:

gj\displaystyle g_{j} :=yjQj(D),\displaystyle:=y_{j}-{Q}_{j}(D), (12)
gm\displaystyle g_{m} :=ymQm(D)r[yjQj(D)].\displaystyle:=y_{m}-Q_{m}(D)-r[y_{j}-Q_{j}(D)]. (13)

For DD^{\prime} being a neighboring dataset of DD, we further define

hj\displaystyle h_{j} :=Qj(D)Qj(D),\displaystyle:={Q}_{j}(D)-{Q}_{j}(D^{\prime}), (14)
hm\displaystyle h_{m} :=Qm(D)Qm(D),\displaystyle:={Q}_{m}(D)-{Q}_{m}(D^{\prime}), (15)

so that

gj+hj\displaystyle g_{j}+h_{j} =yjQj(D),\displaystyle=y_{j}-{Q}_{j}(D^{\prime}), (16)
gm+hmrhj\displaystyle g_{m}+h_{m}-rh_{j} =ymQm(D)r[yjQj(D)].\displaystyle=y_{m}-Q_{m}(D^{\prime})-r[y_{j}-Q_{j}(D^{\prime})]. (17)

Note that hjh_{j} and hmh_{m} are equal since QjQ_{j} and QmQ_{m} are the same query. From the above analysis, we obtain:

𝔽[Q~m(D)=ym|Q~j(D)=yj]\displaystyle{\mathbb{F}}\left[{\widetilde{Q}_{m}(D)=y_{m}~{}|~{}\widetilde{Q}_{j}(D)=y_{j}}\right]
=𝔽[Q~m(D)Qm(D)r[Q~j(D)Qj(D)]=gm|Q~j(D)=yj]\displaystyle={\mathbb{F}}\left[{\begin{array}[]{l}\widetilde{Q}_{m}(D)-Q_{m}(D)-r[\widetilde{Q}_{j}(D)-Q_{j}(D)]=g_{m}\\ ~{}|~{}\widetilde{Q}_{j}(D)=y_{j}\end{array}}\right] (20)
=(b)12π(σm2r2σj2)egm22(σm2r2σj2),\displaystyle\stackrel{{\scriptstyle\textup{(b)}}}{{=}}\frac{1}{\sqrt{2\pi({\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2})}}e^{-\frac{{g_{m}}^{2}}{2({\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2})}}, (21)

where step (b) follows since Q~j(D)Qj(D)\widetilde{Q}_{j}(D)-{Q}_{j}(D) is a zero-mean Gaussian random variable with variance σj2\sigma_{j}^{2} and Q~m(D)Qm(D)r[Q~j(D)Qj(D)]\widetilde{Q}_{m}(D)-Q_{m}(D)-r[\widetilde{Q}_{j}(D)-Q_{j}(D)] is a zero-mean Gaussian random variable with variance σm2r2σj2{\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2}.

Similarly, for dataset DD^{\prime}, we have:

𝔽[Q~m(D)=ym|Q~j(D)=yj]\displaystyle{\mathbb{F}}\left[{\widetilde{Q}_{m}(D^{\prime})=y_{m}~{}|~{}\widetilde{Q}_{j}(D^{\prime})=y_{j}}\right]
=𝔽[Q~m(D)Qm(D)r[Q~j(D)Qj(D)]=gm+hmrhj|Q~j(D)=yj]\displaystyle={\mathbb{F}}\left[{\begin{array}[]{l}\widetilde{Q}_{m}(D^{\prime})-Q_{m}(D^{\prime})-r[\widetilde{Q}_{j}(D^{\prime})-Q_{j}(D^{\prime})]\\ =g_{m}+h_{m}-rh_{j}\\ ~{}|~{}\widetilde{Q}_{j}(D^{\prime})=y_{j}\end{array}}\right] (25)
=(b)12π(σm2r2σj2)e(gm+hmrhj)22(σm2r2σj2),\displaystyle\stackrel{{\scriptstyle\textup{(b)}}}{{=}}\frac{1}{\sqrt{2\pi({\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2})}}e^{-\frac{(g_{m}+h_{m}-rh_{j})^{2}}{2({\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2})}}, (26)

where step (b) follows since Q~j(D)Qj(D)\widetilde{Q}_{j}(D^{\prime})-{Q}_{j}(D^{\prime}) is a zero-mean Gaussian random variable with variance σj2\sigma_{j}^{2} and Q~m(D)Qm(D)r[Q~j(D)Qj(D)]\widetilde{Q}_{m}(D^{\prime})-Q_{m}(D^{\prime})-r[\widetilde{Q}_{j}(D^{\prime})-Q_{j}(D^{\prime})] is a zero-mean Gaussian random variable with variance σm2r2σj2{\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2}.

Then,

ln𝔽[Q~m(D)=ym|Q~j(D)=yj]𝔽[Q~m(D)=ym|Q~j(D)=yj]\displaystyle\ln\frac{{\mathbb{F}}\left[{\widetilde{Q}_{m}(D)=y_{m}~{}|~{}\widetilde{Q}_{j}(D)=y_{j}}\right]}{{\mathbb{F}}\left[{\widetilde{Q}_{m}(D^{\prime})=y_{m}~{}|~{}\widetilde{Q}_{j}(D^{\prime})=y_{j}}\right]}
=ln12π(σm2r2σj2)egm22(σm2r2σj2)12π(σm2r2σj2)e(gm+hmrhj)22(σm2r2σj2)\displaystyle=\ln\frac{~{}~{}\frac{1}{\sqrt{2\pi({\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2})}}e^{-\frac{{g_{m}}^{2}}{2({\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2})}}~{}~{}}{~{}~{}\frac{1}{\sqrt{2\pi({\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2})}}e^{-\frac{(g_{m}+h_{m}-rh_{j})^{2}}{2({\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2})}}~{}~{}}
=(gm+hmrhj)2gm22(σm2r2σj2)\displaystyle=\frac{(g_{m}+h_{m}-rh_{j})^{2}-{g_{m}}^{2}}{2({\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2})}
=gm(hmrhj)σm2r2σj2+(hmrhj)22(σm2r2σj2).\displaystyle=\frac{g_{m}(h_{m}-rh_{j})}{{\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2}}+\frac{(h_{m}-rh_{j})^{2}}{2({\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2})}.~{} (27)

The above (27) presents the second term in the last row of (11). At first glance, it may seem that the first term ln𝔽[i=1m1[Q~i(D)=yi]]𝔽[i=1m1[Q~i(D)=yi]]\ln\frac{\mathbb{F}\left[\cap_{i=1}^{m-1}\left[\widetilde{Q}_{i}(D)=y_{i}\right]\right]}{\mathbb{F}\left[\cap_{i=1}^{m-1}\left[\widetilde{Q}_{i}(D^{\prime})=y_{i}\right]\right]} and the second term ln𝔽[Q~m(D)=ym|Q~j(D)=yj]𝔽[Q~m(D)=ym|Q~j(D)=yj]\ln\frac{{\mathbb{F}}\left[{\widetilde{Q}_{m}(D)=y_{m}~{}|~{}\widetilde{Q}_{j}(D)=y_{j}}\right]}{{\mathbb{F}}\left[{\widetilde{Q}_{m}(D^{\prime})=y_{m}~{}|~{}\widetilde{Q}_{j}(D^{\prime})=y_{j}}\right]} in the last row of (11) are dependent since they both involve yjy_{j}. However, we have shown from (27) above that the second term in the last row of (11) depends only on the random variable gmg_{m} (note that terms in (27) other than gmg_{m} are all given), which is the amount of additional Gaussian noise used to generate Q~m(D)\widetilde{Q}_{m}(D) according to (3) and (13); i.e., the second term in the last row of (11) is actually independent of the first term in the last row of (11). From (9), the first term in the last row of (11) follows the Gaussian distribution 𝒩(A(D,D)2,A(D,D))\mathcal{N}(\frac{A(D,D^{\prime})}{2},A(D,D^{\prime})). Next, we show that (27) presenting the second term in the last row of (11) also follows a Gaussian distribution.

Since gmg_{m} follows a zero-mean Gaussian distribution with variance σm2r2σj2{\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2}, clearly gm(hmrhj)σm2r2σj2\frac{g_{m}(h_{m}-rh_{j})}{{\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2}} follows a zero-mean Gaussian distribution with variance given by

[(hmrhj)σm2r2σj2]2×(σm2r2σj2)\displaystyle\bigg{[}\frac{(h_{m}-rh_{j})}{{\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2}}\bigg{]}^{2}\times({\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2})
=(hmrhj)2σm2r2σj2.\displaystyle=\frac{(h_{m}-rh_{j})^{2}}{{\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2}}. (28)

Since QmQ_{m} and QjQ_{j} are the same, we obtain from Eq. (14) and Eq. (15) that hj=hm=Qm(D)Qm(D)h_{j}=h_{m}={Q}_{m}(D)-{Q}_{m}(D^{\prime}), which we use to write Eq. (28) as

[Qm(D)Qm(D)2]2(1r)2σm2r2σj2.\displaystyle\frac{[\|Q_{m}(D)-Q_{m}(D^{\prime})\|_{2}]^{2}(1-r)^{2}}{{\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2}}. (29)

Summarizing the above, the total privacy loss LQ~1Q~2Q~m(D,D)L_{\widetilde{Q}_{1}\|\widetilde{Q}_{2}\|\ldots\|\widetilde{Q}_{m}}(D,D^{\prime}) is the sum of two independent Gaussian terms and hence follows 𝒩(Br(D,D)2,Br(D,D))\mathcal{N}(\frac{B_{r}(D,D^{\prime})}{2},B_{r}(D,D^{\prime})), where

Br(D,D):=A(D,D)+[Qm(D)Qm(D)2]2(1r)2σm2r2σj2.\displaystyle B_{r}(D,D^{\prime}):=A(D,D^{\prime})+\frac{[\|Q_{m}(D)-Q_{m}(D^{\prime})\|_{2}]^{2}(1-r)^{2}}{{\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2}}.~{} (30)

As noted in the statement of Theorem 1, we suppose that before answering query QmQ_{m} and after answering Q1,Q2,,Qm1Q_{1},Q_{2},\ldots,Q_{m-1}, the privacy loss LQ~1Q~2Q~m1(D,D)L_{\widetilde{Q}_{1}\|\widetilde{Q}_{2}\|\ldots\|\widetilde{Q}_{m-1}}(D,D^{\prime}) is given by 𝒩(A(D,D)2,A(D,D))\mathcal{N}(\frac{A(D,D^{\prime})}{2},A(D,D^{\prime})) for some A(D,D)A(D,D^{\prime}). With the above result (30), we can actually show that there indeed exists such A(D,D)A(D,D^{\prime}). This follows from mathematical induction. For the base case; i.e., when only one query is answered, the result follows from Lemma 3 of [64]. The induction step is given by the above result (30). Hence, we have shown the existence of A(D,D)A(D,D^{\prime}). With this result and (30), we have completed proving Result (i) of Theorem 1.

(ii) The optimal rr is obtained by minimizing Br(D,D)B_{r}(D,D^{\prime}), and hence by minimizing (1r)2σm2r2σj2\frac{(1-r)^{2}}{{\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2}}. Analyzing the monotonicity of this expression, we derive the optimal rr as in Eq. (4). Up to the positive factor [Qm(D)Qm(D)2]2[\|Q_{m}(D)-Q_{m}(D^{\prime})\|_{2}]^{2}, the first-order derivative of Br(D,D)B_{r}(D,D^{\prime}) with respect to rr is:

Br(D,D)=2(rσj2σm2)(r1)(r2σj2σm2)2.\displaystyle B_{r}(D,D^{\prime})^{\prime}=\frac{-2(r{\sigma_{j}}^{2}-{\sigma_{m}}^{2})(r-1)}{(r^{2}{\sigma_{j}}^{2}-{\sigma_{m}}^{2})^{2}}.~{} (31)
  • Case 1: if σmσj\sigma_{m}\geq\sigma_{j}, then Br(D,D)0B_{r}(D,D^{\prime})^{\prime}\geq 0 when r[1,(σm/σj)2]r\in[1,(\sigma_{m}/\sigma_{j})^{2}], and Br(D,D)<0B_{r}(D,D^{\prime})^{\prime}<0 when r(,1)((σm/σj)2,+)r\in(-\infty,1)\cup((\sigma_{m}/\sigma_{j})^{2},+\infty). Hence, the optimal rr minimizing Br(D,D)B_{r}(D,D^{\prime}) is r=1r=1.

  • Case 2: if σm<σj\sigma_{m}<\sigma_{j}, then Br(D,D)0B_{r}(D,D^{\prime})^{\prime}\geq 0 when r[(σm/σj)2,1]r\in[(\sigma_{m}/\sigma_{j})^{2},1], and Br(D,D)<0B_{r}(D,D^{\prime})^{\prime}<0 when r(,(σm/σj)2)(1,+)r\in(-\infty,(\sigma_{m}/\sigma_{j})^{2})\cup(1,+\infty). Hence, the optimal rr minimizing Br(D,D)B_{r}(D,D^{\prime}) is r=(σm/σj)2r=(\sigma_{m}/\sigma_{j})^{2}. (The two boundary points 11 and (σm/σj)2(\sigma_{m}/\sigma_{j})^{2} are exactly the zeros of the numerator of Eq. (31).)

Thus, we obtain optimal values of rr as Eq. (4). ∎
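The optimal choices in Eq. (4) can also be checked numerically. A small Python grid search over the valid range where σm2r2σj2>0{\sigma_{m}}^{2}-r^{2}{\sigma_{j}}^{2}>0 (the function names below are ours) confirms both cases:

def cost(r, sm, sj):
    # The r-dependent factor of B_r(D, D').
    return (1 - r) ** 2 / (sm**2 - r**2 * sj**2)

def best_r(sm, sj, steps=100000):
    hi = min(1.0, sm / sj * (1 - 1e-9))  # stay strictly inside the valid range
    return min((i * hi / steps for i in range(steps + 1)),
               key=lambda r: cost(r, sm, sj))

print(best_r(2.0, 1.0))  # sigma_m >= sigma_j: optimum at r = 1
print(best_r(1.0, 2.0))  # sigma_m <  sigma_j: optimum at r = (sigma_m/sigma_j)^2 = 0.25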

-B Proof of Lemma 2

Proof.

Consider a query RR whose 2\ell_{2}-sensitivity is 11. Let R~\widetilde{R} be the mechanism that adds Gaussian noise with standard deviation μ:=1maxneighboring datasets D,DV(D,D)\mu:=\frac{1}{\sqrt{\max_{\textup{neighboring datasets $D,D^{\prime}$}}V(D,D^{\prime})}} to RR. From Corollary 1, the privacy loss of the randomized mechanism R~\widetilde{R} with respect to neighboring datasets DD and DD^{\prime} is given by 𝒩(U(D,D)2,U(D,D))\mathcal{N}(\frac{U(D,D^{\prime})}{2},U(D,D^{\prime})) for U(D,D):=[R(D)R(D)2]2μ2U(D,D^{\prime}):=\frac{[\|R(D)-R(D^{\prime})\|_{2}]^{2}}{{\mu}^{2}}. Since the 2\ell_{2}-sensitivity of RR (i.e., R(D)R(D)2\|R(D)-R(D^{\prime})\|_{2}) is 11, maxneighboring datasets D,DV(D,D)\max_{\textup{neighboring datasets $D,D^{\prime}$}}V(D,D^{\prime}) and maxneighboring datasets D,DU(D,D)\max_{\textup{neighboring datasets $D,D^{\prime}$}}U(D,D^{\prime}) are the same. In addition, from Theorem 5 of [64], letting YY (resp., R~\widetilde{R}) satisfy (ϵ,δ)(\epsilon,\delta)-differential privacy can be converted to a condition on maxneighboring datasets D,DV(D,D)\max_{\textup{neighboring datasets $D,D^{\prime}$}}V(D,D^{\prime}) (resp., maxneighboring datasets D,DU(D,D)\max_{\textup{neighboring datasets $D,D^{\prime}$}}U(D,D^{\prime})). Then letting YY satisfy (ϵ,δ)(\epsilon,\delta)-differential privacy is the same as letting R~\widetilde{R} satisfy (ϵ,δ)(\epsilon,\delta)-differential privacy. From Lemma 1, R~\widetilde{R} achieves (ϵ,δ)(\epsilon,\delta)-differential privacy when μ=Gaussian(1,ϵ,δ)\mu=\operatorname{Gaussian}(1,\epsilon,\delta); i.e., when maxneighboring datasets D,DV(D,D)=[Gaussian(1,ϵ,δ)]2\max_{\textup{neighboring datasets $D,D^{\prime}$}}V(D,D^{\prime})=[\operatorname{Gaussian}(1,\epsilon,\delta)]^{-2}. Summarizing the above, we complete the proof of Lemma 2. ∎

-C Proof of Theorem 2

Proof.

We use Theorem 1 to show Results ① ② and ③ of Theorem 2. Proof of ①: In Case 2A) and Case 2C), QmQ_{m} can reuse previous noise. Hence, the privacy loss will still be 𝒩(A(D,D)2,A(D,D))\mathcal{N}(\frac{A(D,D^{\prime})}{2},A(D,D^{\prime})) according to Eq. (5).
Proof of ②: In Case 1), QmQ_{m} cannot reuse previous noisy answers, and the new noise follows 𝒩(0,σm2)\mathcal{N}(0,{\sigma_{m}}^{2}). Thus, B(D,D):=A(D,D)+[Qm(D)Qm(D)2]2σm2B(D,D^{\prime}):=A(D,D^{\prime})+\frac{[\|Q_{m}(D)-Q_{m}(D^{\prime})\|_{2}]^{2}}{{\sigma_{m}}^{2}}.
Proof of ③: In Case 2B), QmQ_{m} can reuse previous noisy answers partially, so we can prove it using Eq. (5).

Then, Lemma 2 further implies Results ④ ⑤ and ⑥ of Theorem 2.
Proof of ④: QmQ_{m} can fully reuse the old noisy result in Cases 2A) and 2C). Thus, the privacy level does not change.
Proof of ⑤: From Lemma 2, we have maxneighboring datasets D,DA(D,D)=[Gaussian(1,ϵold,δbudget)]2\max_{\textup{neighboring datasets $D,D^{\prime}$}}A(D,D^{\prime})=[\operatorname{Gaussian}(1,\epsilon_{\textup{old}},\delta_{\textup{budget}})]^{-2} and maxneighboring datasets D,D{A(D,D)+[Qm(D)Qm(D)2]2\max_{\textup{neighboring datasets $D,D^{\prime}$}}\bigg{\{}A(D,D^{\prime})+[\|Q_{m}(D)-Q_{m}(D^{\prime})\|_{2}]^{2}×1σm2}=[Gaussian(1,ϵnew,δbudget)]2\\ \times\frac{1}{{\sigma_{m}}^{2}}\bigg{\}}=[\operatorname{Gaussian}(1,\epsilon_{\textup{new}},\delta_{\textup{budget}})]^{-2}. The above two equations yield [Gaussian(1,ϵnew,δbudget)]2[Gaussian(1,ϵold,δbudget)]2[\operatorname{Gaussian}(1,\epsilon_{\textup{new}},\delta_{\textup{budget}})]^{-2}-[\operatorname{Gaussian}(1,\epsilon_{\textup{old}},\delta_{\textup{budget}})]^{-2}
=maxneighboring datasets D,D[Qm(D)Qm(D)2]2×1σm2=ΔQm2×1σm2=\max_{\textup{neighboring datasets $D,D^{\prime}$}}[\|Q_{m}(D)-Q_{m}(D^{\prime})\|_{2}]^{2}\times\frac{1}{{\sigma_{m}}^{2}}={\Delta_{Q_{m}}}^{2}\times\frac{1}{{\sigma_{m}}^{2}}. Hence, by the expression of Gaussian(,,)\operatorname{Gaussian}(\cdot,\cdot,\cdot) in Lemma 1, Gaussian(ΔQm\Delta_{Q_{m}}, ϵ_squared_cost\epsilon\textup{\_squared\_cost}, δbudget\delta_{\textup{budget}}) = σm\sigma_{m}.
Proof of ⑥: From Lemma 2, we have maxneighboring datasets D,DA(D,D)=[Gaussian(1,ϵold,δbudget)]2\max_{\textup{neighboring datasets $D,D^{\prime}$}}A(D,D^{\prime})=[\operatorname{Gaussian}(1,\epsilon_{\textup{old}},\delta_{\textup{budget}})]^{-2} and maxneighboring datasets D,D{A(D,D)+[Qm(D)Qm(D)2]2\max_{\textup{neighboring datasets $D,D^{\prime}$}}\bigg{\{}A(D,D^{\prime})+[\|Q_{m}(D)-Q_{m}(D^{\prime})\|_{2}]^{2} ×[1σm21[min(𝚺t)]2]}=[Gaussian(1,ϵnew,δbudget)]2\times\left[\frac{1}{{\sigma_{m}}^{2}}-\frac{1}{[\min(\boldsymbol{\Sigma}_{t})]^{2}}\right]\bigg{\}}=[\operatorname{Gaussian}(1,\epsilon_{\textup{new}},\delta_{\textup{budget}})]^{-2}. The above two equations yield [Gaussian(1,ϵnew,δbudget)]2[Gaussian(1,ϵold,δbudget)]2[\operatorname{Gaussian}(1,\epsilon_{\textup{new}},\delta_{\textup{budget}})]^{-2}-[\operatorname{Gaussian}(1,\epsilon_{\textup{old}},\delta_{\textup{budget}})]^{-2}
=maxneighboring datasets D,D[Qm(D)Qm(D)2]2×[1σm21[min(𝚺t)]2]=ΔQm2×[1σm21[min(𝚺t)]2]=\max_{\textup{neighboring datasets $D,D^{\prime}$}}[\|Q_{m}(D)-Q_{m}(D^{\prime})\|_{2}]^{2}\times\left[\frac{1}{{\sigma_{m}}^{2}}-\frac{1}{[\min(\boldsymbol{\Sigma}_{t})]^{2}}\right]={\Delta_{Q_{m}}}^{2}\times\left[\frac{1}{{\sigma_{m}}^{2}}-\frac{1}{[\min(\boldsymbol{\Sigma}_{t})]^{2}}\right]. Then using the expression of Gaussian(ΔQ,ϵ,δ)\operatorname{Gaussian}(\Delta_{Q},\epsilon,\delta) from Lemma 1, we further obtain Result ⑥. ∎
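Result ⑤ says that the ϵ\epsilon-costs in our scheme compose by adding squares. Under the expression of Gaussian(,,)\operatorname{Gaussian}(\cdot,\cdot,\cdot) in Lemma 1, we have [Gaussian(1,ϵ,δ)]2=ϵ2/(2ln(1.25/δ))[\operatorname{Gaussian}(1,\epsilon,\delta)]^{-2}=\epsilon^{2}/(2\ln(1.25/\delta)), so the accounting of Result ⑤ can be sketched in a few lines of Python (the numerical inputs below are illustrative):

import math

def eps_after_fresh_answer(eps_old, delta_budget, delta_q, sigma_m):
    # [G(1, eps_new, d)]^{-2} - [G(1, eps_old, d)]^{-2} = dQ^2 / sigma_m^2,
    # with [G(1, eps, d)]^{-2} = eps^2 / (2 ln(1.25/d)); solving for eps_new:
    c = 2 * math.log(1.25 / delta_budget)
    return math.sqrt(eps_old**2 + c * (delta_q / sigma_m) ** 2)

print(eps_after_fresh_answer(1.0, 1e-4, 202, 500))  # ~2.02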

-D Proof of Theorem 3

Proof.

First, from Theorem 2, after Algorithm 1 is used to answer all nn queries with query QiQ_{i} being answered under (ϵi,δi)(\epsilon_{i},\delta_{i})-differential privacy, the total privacy loss with respect to neighboring datasets DD and DD^{\prime} is given by 𝒩(G(D,D)2,G(D,D))\mathcal{N}(\frac{G(D,D^{\prime})}{2},G(D,D^{\prime})) for some G(D,D)G(D,D^{\prime}).

Next, we use Theorem 2 to further show that the expression of G(D,D)G(D,D^{\prime}) is given by Eq. (6). From Theorem 2, among all queries, only the queries belonging to Cases 1) and 2B) contribute to G(D,D)G(D,D^{\prime}). Below we discuss the two contributions in turn.

With N1N_{1} denoting the set of i{1,2,,n}i\in\{1,2,\ldots,n\} such that QiQ_{i} is in Case 1), we know from Result ② of Theorem 2 that the contribution of the queries in Case 1) to G(D,D)G(D,D^{\prime}) is given by

iN1[Qi(D)Qi(D)2]2σi2.\displaystyle\sum_{i\in N_{1}}\frac{[\|Q_{i}(D)-Q_{i}(D^{\prime})\|_{2}]^{2}}{{\sigma_{i}}^{2}}.~{} (32)

Below we use Result ③ of Theorem 2 to compute the contribution of the Case 2B) queries to $G(D,D^{\prime})$. With $T_{2B}$ denoting the set of query types in Case 2B), we discuss each query type $t\in T_{2B}$ in turn.

From Result ③ of Theorem 2, the contribution to $G(D,D^{\prime})$ by answering $Q_{j_{t,1}}$ under differential privacy is

\[
[\|Q_{j_{t,1}}(D)-Q_{j_{t,1}}(D^{\prime})\|_{2}]^{2}\left(\frac{1}{\sigma_{j_{t,1}}^{2}}-\frac{1}{\sigma_{j_{t,0}}^{2}}\right).
\]

Similarly, the contribution to $G(D,D^{\prime})$ by answering $Q_{j_{t,2}}$ under differential privacy is

\[
[\|Q_{j_{t,2}}(D)-Q_{j_{t,2}}(D^{\prime})\|_{2}]^{2}\left(\frac{1}{\sigma_{j_{t,2}}^{2}}-\frac{1}{\sigma_{j_{t,1}}^{2}}\right).
\]

The same analysis applies to the remaining type-$t$ queries in Case 2B). In particular, for each $s\in\{1,2,\ldots,m_{t}\}$, the contribution to $G(D,D^{\prime})$ by answering $Q_{j_{t,s}}$ under differential privacy is

\[
[\|Q_{j_{t,s}}(D)-Q_{j_{t,s}}(D^{\prime})\|_{2}]^{2}\left(\frac{1}{\sigma_{j_{t,s}}^{2}}-\frac{1}{\sigma_{j_{t,s-1}}^{2}}\right). \qquad (33)
\]

Summing (33) over $s\in\{1,2,\ldots,m_{t}\}$, we obtain that for each query type $t\in T_{2B}$, the contribution to $G(D,D^{\prime})$ by answering $Q_{j_{t,1}},Q_{j_{t,2}},\ldots,Q_{j_{t,m_{t}}}$ under differential privacy is

\[
\sum_{s\in\{1,2,\ldots,m_{t}\}}[\|Q_{j_{t,s}}(D)-Q_{j_{t,s}}(D^{\prime})\|_{2}]^{2}\left(\frac{1}{\sigma_{j_{t,s}}^{2}}-\frac{1}{\sigma_{j_{t,s-1}}^{2}}\right). \qquad (34)
\]

Since $Q_{j_{t,0}},Q_{j_{t,1}},\ldots,Q_{j_{t,m_{t}}}$ are all type-$t$ queries, the sensitivities $\|Q_{j_{t,s}}(D)-Q_{j_{t,s}}(D^{\prime})\|_{2}$ coincide for all $s\in\{0,1,\ldots,m_{t}\}$. Hence, the sum in (34) telescopes as

\[
\begin{aligned}
&\sum_{s\in\{1,2,\ldots,m_{t}\}}\left\{\frac{[\|Q_{j_{t,s}}(D)-Q_{j_{t,s}}(D^{\prime})\|_{2}]^{2}}{\sigma_{j_{t,s}}^{2}}-\frac{[\|Q_{j_{t,s-1}}(D)-Q_{j_{t,s-1}}(D^{\prime})\|_{2}]^{2}}{\sigma_{j_{t,s-1}}^{2}}\right\} \\
&\quad=\frac{[\|Q_{j_{t,m_{t}}}(D)-Q_{j_{t,m_{t}}}(D^{\prime})\|_{2}]^{2}}{\sigma_{j_{t,m_{t}}}^{2}}-\frac{[\|Q_{j_{t,0}}(D)-Q_{j_{t,0}}(D^{\prime})\|_{2}]^{2}}{\sigma_{j_{t,0}}^{2}}.
\end{aligned} \qquad (35)
\]
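As a quick sanity check on this telescoping step, the short sketch below sums the per-step contributions (33) for a single query type and compares the result with the two-term expression in (35); the noise scales and the sensitivity are arbitrary illustrative values.

# Check that the per-step contributions in (33) telescope to (35).
sigmas = [3.0, 2.5, 1.8, 1.2]   # sigma_{j_{t,0}}, ..., sigma_{j_{t,m_t}}
delta_t = 2.0                   # common type-t sensitivity Delta(type-t)
stepwise = sum(delta_t ** 2 * (1 / sigmas[s] ** 2 - 1 / sigmas[s - 1] ** 2)
               for s in range(1, len(sigmas)))
telescoped = delta_t ** 2 * (1 / sigmas[-1] ** 2 - 1 / sigmas[0] ** 2)
print(abs(stepwise - telescoped) < 1e-12)  # True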

Summing (35) over $t\in T_{2B}$, the contribution to $G(D,D^{\prime})$ by answering all queries in Case 2B) is

\[
\sum_{t\in T_{2B}}\left\{\frac{[\|Q_{j_{t,m_{t}}}(D)-Q_{j_{t,m_{t}}}(D^{\prime})\|_{2}]^{2}}{\sigma_{j_{t,m_{t}}}^{2}}-\frac{[\|Q_{j_{t,0}}(D)-Q_{j_{t,0}}(D^{\prime})\|_{2}]^{2}}{\sigma_{j_{t,0}}^{2}}\right\}. \qquad (36)
\]

Then $G(D,D^{\prime})$, as the sum of (32) and (36), is given by Eq. (6).
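The resulting worst-case value of $G(D,D^{\prime})$ is then straightforward to evaluate in code. The helper below is a hypothetical sketch (the name g_max and the tuple layouts are ours): it adds the Case 1) term (32) and the Case 2B) term (36) at their worst-case sensitivities, as in Eq. (7).

def g_max(case1, case2b):
    # case1: list of (Delta_{Q_i}, sigma_i) pairs, one per Case 1) query.
    # case2b: list of (Delta(type-t), sigma_{j_{t,0}}, sigma_{j_{t,m_t}})
    # triples, one per query type t in T_{2B}.
    g = sum((d / s) ** 2 for d, s in case1)
    g += sum(d ** 2 * (1.0 / s_last ** 2 - 1.0 / s_first ** 2)
             for d, s_first, s_last in case2b)
    return g

Each Case 2B) summand is nonnegative whenever the effective noise scale shrinks across repeated answers of a query type ($\sigma_{j_{t,m_{t}}}\leq\sigma_{j_{t,0}}$), which is the regime the reuse mechanism operates in.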

Summarizing the above, we have proved that after Algorithm 1 is used to answer all $n$ queries under differential privacy, the total privacy loss with respect to neighboring datasets $D$ and $D^{\prime}$ follows $\mathcal{N}\big(\frac{G(D,D^{\prime})}{2},G(D,D^{\prime})\big)$ for $G(D,D^{\prime})$ in Eq. (6). Furthermore, under

\[
\max_{\textup{neighboring datasets }D,D^{\prime}}\|Q_{i}(D)-Q_{i}(D^{\prime})\|_{2}=\Delta_{Q_{i}}
\]

and

\[
\max_{\textup{neighboring datasets }D,D^{\prime}}\|Q_{j_{t,m_{t}}}(D)-Q_{j_{t,m_{t}}}(D^{\prime})\|_{2}=\max_{\textup{neighboring datasets }D,D^{\prime}}\|Q_{j_{t,0}}(D)-Q_{j_{t,0}}(D^{\prime})\|_{2}=\Delta(\textup{type-}t),
\]

we use Eq. (6) to obtain $\max_{\textup{neighboring datasets }D,D^{\prime}}G(D,D^{\prime})$ as given by Eq. (7).

Finally, from Lemma 2, the total privacy cost of our Algorithm 1 can be expressed as $(\epsilon_{\textup{ours}},\delta_{\textup{budget}})$-differential privacy for $\epsilon_{\textup{ours}}$ satisfying

\[
[\operatorname{Gaussian}(1,\epsilon_{\textup{ours}},\delta_{\textup{budget}})]^{-2}=\max_{\textup{neighboring datasets }D,D^{\prime}}G(D,D^{\prime}),
\]

or as $(\epsilon,\delta)$-differential privacy for any $\epsilon$ and $\delta$ satisfying $[\operatorname{Gaussian}(1,\epsilon,\delta)]^{-2}=\max_{\textup{neighboring datasets }D,D^{\prime}}G(D,D^{\prime})$. ∎
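Under the classical closed form $\operatorname{Gaussian}(1,\epsilon,\delta)=\sqrt{2\ln(1.25/\delta)}/\epsilon$ (again an illustrative assumption; the expression in Lemma 1 may be tighter), this final condition can be inverted for $\epsilon_{\textup{ours}}$ explicitly, as in this hypothetical helper:

import math

def eps_total(g_max_value, dlt_budget):
    # Solve [Gaussian(1, eps, dlt)]^(-2) = G for eps under the classical
    # scale sigma = sqrt(2 ln(1.25/dlt)) / eps, i.e.
    # eps^2 / (2 ln(1.25/dlt)) = G  =>  eps = sqrt(2 G ln(1.25/dlt)).
    return math.sqrt(2.0 * g_max_value * math.log(1.25 / dlt_budget))

For example, eps_total(0.05, 1e-5) returns roughly 1.08.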

-E Utility of the Gaussian Mechanism

Proof.

The noisy response to the one-dimensional query $Q_{m}$ is $\widetilde{Q}_{m}(D)=Q_{m}(D)+N(0,\sigma^{2})$. Letting the probability of $\|\widetilde{Q}_{m}(D)-Q_{m}(D)\|_{p}\leq\alpha$ be $1-\beta$, we have

\[
\begin{aligned}
1-\beta &= \mathbb{P}\left[\|\widetilde{Q}_{m}(D)-Q_{m}(D)\|_{p}\leq\alpha\right] \\
&= \mathbb{P}\left[|N(0,\sigma^{2})|\leq\alpha\right] \\
&= \mathbb{P}\left[-\alpha\leq N(0,\sigma^{2})\leq\alpha\right] \\
&= \mathbb{P}\left[N(0,\sigma^{2})\leq\alpha\right]-\mathbb{P}\left[N(0,\sigma^{2})\leq-\alpha\right] \\
&= \frac{1}{2}\left[1+\operatorname{erf}\left(\frac{\alpha}{\sigma\sqrt{2}}\right)\right]-\frac{1}{2}\left[1+\operatorname{erf}\left(\frac{-\alpha}{\sigma\sqrt{2}}\right)\right] \\
&= \operatorname{erf}\left(\frac{\alpha}{\sigma\sqrt{2}}\right),
\end{aligned} \qquad (37)
\]

where $\operatorname{erf}(\cdot)$ denotes the error function, and the last step of Eq. (37) uses the fact that $\operatorname{erf}(\cdot)$ is an odd function.

Eq. (37) recovers the two-sigma rule of the Gaussian distribution [65], namely that about $95\%$ of the probability mass lies within two standard deviations of the mean. Thus, if we set $\alpha=2\sigma$, then $\beta=1-\operatorname{erf}(\sqrt{2})\approx 0.05$. ∎
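This two-sigma figure can be reproduced with a one-line numerical check using Python's standard math.erf:

import math
# beta = 1 - erf(alpha / (sigma * sqrt(2))); with alpha = 2 * sigma this is
# 1 - erf(sqrt(2)) ~ 0.0455, consistent with the two-sigma rule.
sigma = 1.0
alpha = 2.0 * sigma
print(1.0 - math.erf(alpha / (sigma * math.sqrt(2.0))))  # ~0.0455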