
Privacy-Preserving Image Classification in the Local Setting

Sen Wang, J. Morris Chang
University of South Florida
[email protected], [email protected]
Abstract

Image data is produced in great volume by individuals and commercial vendors in daily life, and it has been used across various domains, such as advertising, medical, and traffic analysis. Recently, image data has also become important for social utility, such as emergency response. However, privacy concerns have become the biggest obstacle to further exploitation of image data, since images can reveal sensitive information such as personal identities and locations. The recently developed Local Differential Privacy (LDP) offers a promising solution, which allows data owners to randomly perturb their input to provide plausible deniability of the data before release. In this paper, we consider a two-party image classification problem in which data owners hold images and an untrustworthy data user would like to fit a machine learning model with these images as input. To protect image privacy, we propose to locally perturb the image representation before revealing it to the data user. We analyze how the perturbation satisfies $\epsilon$-LDP and how it affects the data utility of count-based and distance-based machine learning algorithms, and we propose a supervised image feature extractor, DCAConv, which produces an image representation with a scalable domain size. Our experiments show that DCAConv can maintain high data utility while preserving privacy across multiple image benchmark datasets.

Introduction

Image data is produced and shared in great volume in our daily life, and its usage has been demonstrated through various applications, such as preference analysis and advertising (?), medical (?) and traffic (?) analysis. Lately, image data has become important for social utility (?) as well, such as emergency response. For example, after the Boston Marathon bombing, the authorities spent great effort gathering onsite photos from people in the area and poring through thousands of photos to search for the suspects. However, privacy concerns are the biggest obstacle that prevents people from sharing such information, since an image may reveal where they are and what they are doing. Thus, exploring privacy-preserving machine learning techniques for image data is of paramount importance.

In this paper, we study an image classification problem in which there are two parties (?): the image owners, who hold a set of images, and the untrustworthy data user, who would like to fit a machine learning model over these images. As an example, in the Marathon bombing case, the authorities serve as the data user, who asks for the onsite images to search for the suspects, and the people who provided the images serve as the data owners, who wish to support the investigation while preserving their privacy. During the last decade, Differential Privacy (Dwork 2006) has been one of the most popular schemes for privacy protection; it prevents inference about individuals from their participation in a computation. It defines a mechanism such that the computation over the data is robust to the change of any individual sample, achieved by introducing uncertainty into the algorithm. Recently, Local Differential Privacy (LDP) (Duchi, Jordan, and Wainwright 2013; Kairouz, Oh, and Viswanath 2014) extends the technique into a local setting, which provides a stronger privacy protection by allowing data owners to perturb their input before release.

Differentially private machine learning in the local setting has rarely been studied. Recently, (?) and (?) investigated privacy-preserving distributed machine learning; in both works, the data owners interactively work with the data user to learn the model. Unlike these previous works, we are interested in a "non-interactive" manner, in which the data owners perturb their data only once and then share it with the data user. Compared to the interactive setting, the non-interactive approach is more scalable and incurs less communication cost. Randomized response (Warner 1965) was proposed to provide plausible deniability for individuals responding to sensitive surveys. It has been proved that this mechanism satisfies $\epsilon$-LDP and is universally optimal for mutual information optimization problems (Kairouz, Oh, and Viswanath 2014), which provides a solid foundation for its usage in machine learning. In this paper, we are interested in adopting randomized response to locally perturb the input and in fitting count-based and distance-based machine learning models with such input. Specifically, we study the utility and privacy trade-off with respect to the Naive Bayes and KNN classifiers and show how the model utility is affected by the privacy parameters. Furthermore, for image classification, we propose DCAConv, a supervised image feature extractor that improves the utility under $\epsilon$-LDP.

Overall, the contributions of our work are threefold:

  • We propose to use LDP in the privacy-preserving image classification problem. To the best of our knowledge, most LDP-based work focuses on frequency estimation of categorical data, and we are the first to investigate a privacy-preserving mechanism that fits image data in the classification problem.

  • We analyze the utility and privacy trade-off for count-based and distance-based machine learning models, i.e., the Naive Bayes and KNN classifiers. As a result, we show how the model utility is affected by the privacy parameters.

  • We develop DCAConv, a supervised image feature extractor, which represents the image with a scalable domain size. DCAConv is evaluated on image classification over multiple benchmark datasets, including texture and facial datasets. The results confirm that DCAConv can maintain high data utility while preserving privacy.

The rest of the paper is organized as follows. Preliminaries are given in Section II. The problem definition and proposed solution are introduced in Section III. We describe DCAConv, the supervised image feature extractor, in Section IV. The experiments and evaluation are in Section V. Related work is presented in Section VI. Section VII concludes the paper.

Preliminaries

Local Differential Privacy

Differential privacy (DP) ensures that an adversary should not be able to reliably infer any individual sample from the computation over a dataset, even with unbounded computational power and access to every other sample. Given two databases $A, A^{\prime}$, we say that $A, A^{\prime}$ are neighbors if they differ in at most one row. The formal definition of an $(\epsilon,\delta)$-differentially private mechanism over $A$ is given below:

Definition 1 ($(\epsilon,\delta)$-differential privacy (??)): A randomized mechanism $\mathcal{F}$ is $(\epsilon,\delta)$-differentially private if for every two neighboring databases $A, A^{\prime}$ and for any $\mathcal{O}\subseteq Range(\mathcal{F})$,

$Pr[\mathcal{F}(A)\in\mathcal{O}]\leq e^{\epsilon}Pr[\mathcal{F}(A^{\prime})\in\mathcal{O}]+\delta$ (1)

where $Pr[\cdot]$ denotes the probability of an event and $Range(\mathcal{F})$ denotes the set of all possible outputs of the algorithm $\mathcal{F}$. The smaller $\epsilon$ and $\delta$ are, the closer $Pr[\mathcal{F}(A)\in\mathcal{O}]$ and $Pr[\mathcal{F}(A^{\prime})\in\mathcal{O}]$ are, and the stronger the privacy protection. When $\delta=0$, the mechanism $\mathcal{F}$ is $\epsilon$-differentially private, which is a stronger privacy guarantee than $(\epsilon,\delta)$-differential privacy with $\delta>0$.

Similar to DP, a stronger, local notion of privacy was investigated by Duchi et al. (Duchi, Jordan, and Wainwright 2013), namely Local Differential Privacy. It considers a scenario in which a data curator needs to collect data from data owners and infer statistical information from these data. The data owners are assumed to trust no one and perturb their data before sharing it with the data curator. The formal definition is given below:

Definition 2 ($\epsilon$-Local Differential Privacy (??)): A randomized mechanism $\mathcal{G}$ satisfies $\epsilon$-Local Differential Privacy ($\epsilon$-LDP) if for any inputs $v_{1}$ and $v_{2}$ and for any $\mathcal{S}\subseteq Range(\mathcal{G})$:

$Pr[\mathcal{G}(v_{1})\in\mathcal{S}]\leq e^{\epsilon}Pr[\mathcal{G}(v_{2})\in\mathcal{S}]$ (2)

Compared to DP, LDP provides a stronger privacy model, but entails more noise.

Randomized Response

Randomized response (Warner 1965) is a decades-old technique designed to collect statistical information from sensitive survey questions. The original technique is used for frequency estimation of a particular input. In social research, the interviewee is asked to answer a sensitive question, such as marijuana usage. To protect privacy, the interviewee spins a spinner unobserved by the interviewer. Rather than answering the sensitive question directly, the interviewee only reports yes or no according to whether the spinner points to the true answer. Suppose the interviewer wants to estimate the proportion of marijuana users and asks the interviewee whether he has ever used marijuana, and suppose the spinner points to yes with probability $p$ and to no with probability $1-p$. Then an unbiased maximum likelihood estimate of the true population proportion is given by $\frac{p-1}{2p-1}+\frac{n_{1}}{(2p-1)n}$, where $n$ is the number of participating interviewees and $n_{1}$ is the number reporting yes. It has been shown (?) that this mechanism satisfies $(\ln\frac{p}{1-p})$-LDP.
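To make the estimator concrete, the following minimal sketch (Python with NumPy; the simulation setup and function names are our own illustration, not part of the original survey design) perturbs binary answers with Warner's mechanism and recovers the population proportion:

```python
import numpy as np

rng = np.random.default_rng(0)

def warner_perturb(truth, p):
    """Report the true answer with probability p, the opposite otherwise."""
    keep = rng.random(truth.shape) < p
    return np.where(keep, truth, 1 - truth)

def warner_estimate(reports, p):
    """Unbiased estimate (p-1)/(2p-1) + n1/((2p-1)n) of the true proportion."""
    n, n1 = len(reports), reports.sum()
    return (p - 1) / (2 * p - 1) + n1 / ((2 * p - 1) * n)

truth = (rng.random(100000) < 0.3).astype(int)  # simulated 30% true "yes" rate
reports = warner_perturb(truth, p=0.75)         # satisfies ln(0.75/0.25)-LDP
print(warner_estimate(reports, p=0.75))         # close to 0.3
```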

Randomized response can also be generalized to multiple choices, where the domain of the choices is $[d], d>2$. Suppose the randomized mechanism is $\mathcal{G}$; for all $v,s\in[d]$, the perturbation is defined below:

$Pr[\mathcal{G}(v)=s]=\begin{cases}p=\frac{e^{\epsilon}}{d-1+e^{\epsilon}}, &\text{if }v=s\\ q=\frac{1-p}{d-1}=\frac{1}{d-1+e^{\epsilon}}, &\text{if }v\neq s\end{cases}$ (3)
Proof.

(?) For any inputs $v_{1}$ and $v_{2}$ and any output $s$:

$\frac{Pr[\mathcal{G}(v_{1})=s]}{Pr[\mathcal{G}(v_{2})=s]}\leq\frac{p}{q}=e^{\epsilon}$ (4)

In terms of maintaining the original statistical information, even though (?) proves that randomized response is universally optimal for mutual information and $f$-divergence optimization, there is no clear guide on how to employ this mechanism in a particular machine learning algorithm.
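As a reference point for later sections, here is a minimal sketch (our own Python/NumPy illustration) of the $d$-ary mechanism in Equation 3, which reports the true value with probability $p$ and each other value with probability $q$:

```python
import numpy as np

rng = np.random.default_rng(0)

def k_rr(v, d, eps):
    """d-ary randomized response over the domain {0, ..., d-1} (Equation 3)."""
    p = np.exp(eps) / (d - 1 + np.exp(eps))  # probability of answering honestly
    if rng.random() < p:
        return v
    # otherwise report one of the other d-1 values uniformly at random
    return rng.choice([s for s in range(d) if s != v])

# Each report satisfies eps-LDP: for any output s,
# Pr[G(v1)=s] / Pr[G(v2)=s] <= p/q = e^eps (Equation 4).
reports = [k_rr(v=3, d=16, eps=1.0) for _ in range(10)]
```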

Discriminant Component Analysis

Principal Component Analysis (PCA) is a well-known tool for unsupervised learning that finds the subspace preserving the most variance of the original data; however, it is not effective for supervised learning, since the labeling information is ignored in the computation. Discriminant Component Analysis (DCA) (Kung 2017), essentially a supervised PCA, was proposed recently as a complement for supervised learning: the signal-subspace components of DCA are associated with the discriminant power regarding classification effectiveness, while the noise-subspace components of DCA are tightly coupled with recoverability. Since the rank of the signal subspace is limited by the number of classes, DCA can support classification with a small number of components.

Consider a $K$-class dataset consisting of $N$ samples $X=[x_{1},x_{2},\dots,x_{N}]^{T}$, in which $x_{i}\in\mathbb{R}^{m}$. Each sample $x_{i}$ is associated with a class label $y_{i}$, where $y_{i}\in\{C_{1},C_{2},\dots,C_{K}\}$. Let $\mu$ denote the centroid of the total mass, $\mu_{k}$ the centroid of the samples in class $C_{k}$, and $N_{k}$ the number of samples in class $C_{k}$. The signal matrix is represented by the between-class scatter matrix:

$S_{B}=\sum^{K}_{k=1}N_{k}(\mu_{k}-\mu)(\mu_{k}-\mu)^{T}$ (5)

The noise matrix is characterized by the following within-class scatter matrix:

$S_{W}=\sum^{K}_{k=1}\sum_{i:y_{i}=C_{k}}(x_{i}-\mu_{k})(x_{i}-\mu_{k})^{T}$ (6)

The total scatter matrix $\bar{S}$ can be written as follows:

$\bar{S}=S_{B}+S_{W}$ (7)

In DCA, two ridge parameters $\rho$ and $\rho^{\prime}$ are incorporated in the noise matrix $S_{W}$ and the signal matrix $S_{B}$ as follows:

$S_{B}^{\prime}=S_{B}+\rho^{\prime}\mathbb{I}$ (8)
$S_{W}^{\prime}=S_{W}+\rho\mathbb{I}$ (9)

where $\mathbb{I}$ is the identity matrix. Therefore, the regularized scatter matrix is given by:

$\bar{S}^{\prime}=S_{B}^{\prime}+S_{W}^{\prime}=\bar{S}+(\rho+\rho^{\prime})\mathbb{I}$ (10)

DCA performs spectral decomposition of the pre-whitened scatter matrix $D_{DCA}$:

$D_{DCA}=(S_{W}^{\prime})^{-1}\bar{S}^{\prime}=U\Lambda U^{T}$ (11)

where $\Lambda$ holds the monotonically decreasing eigenvalues and $U$ holds the associated eigenvectors. DCA chooses the $k$ eigenvectors with the largest eigenvalues to form a projection matrix $W_{dca}\in\mathbb{R}^{m\times k}$, in which each column is an eigenvector. The raw data is then transformed by multiplying with the projection matrix:

$X_{proj}=X\times W_{dca}$ (12)
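The following sketch (our own Python/NumPy illustration; the function name, ridge defaults, and eigensolver choice are assumptions rather than the paper's implementation) computes the DCA projection matrix from Equations 5-12:

```python
import numpy as np

def dca_fit(X, y, n_components, rho=1e-3, rho_prime=1e-3):
    """Compute the DCA projection matrix W_dca from data X (N x m), labels y."""
    N, m = X.shape
    mu = X.mean(axis=0)
    S_B, S_W = np.zeros((m, m)), np.zeros((m, m))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # Eq. 5
        S_W += (Xc - mu_c).T @ (Xc - mu_c)               # Eq. 6
    S_W_r = S_W + rho * np.eye(m)                        # Eq. 9
    S_bar_r = S_B + rho_prime * np.eye(m) + S_W_r        # Eq. 10
    # Eq. 11: spectral decomposition of the pre-whitened scatter matrix
    evals, evecs = np.linalg.eig(np.linalg.inv(S_W_r) @ S_bar_r)
    order = np.argsort(evals.real)[::-1]                 # largest eigenvalues first
    return evecs.real[:, order[:n_components]]           # W_dca (m x k)

# Eq. 12: the transformed data is X_proj = X @ dca_fit(X, y, k)
```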

Problem Definition and Solution Analysis

Problem Definition

Figure 1: Problem Overview. An image held by Data Owner $O_{i}$ is represented as a vector $x_{i}=\{x_{i1},x_{i2},\dots,x_{im}\}$, $x_{ij}\in[d]$; each vector is associated with a label $y_{i}$. The Data User would like to fit a classifier with such vectors from all $n$ Data Owners. To protect the privacy of $x_{i}$, the perturbation is required to satisfy $\epsilon_{j}$-LDP for the $j$th feature.

As Fig. 1 shows, we consider the following problem in this paper: there are $n$ Data Owners, and each Data Owner $O_{i}$ holds a set of images, each represented as an $m$-dimensional feature vector. For ease of analysis, we assume that $O_{i}$ possesses only one image, denoted as $x_{i}=\{x_{i1},x_{i2},\dots,x_{im}\}$, $x_{ij}\in[d]$. Each image $x_{i}$ is associated with a label $y_{i}$; for instance, if the images held by the data owners are handwritten digits, $y_{i}$ indicates the digit depicted in $x_{i}$. The untrustworthy Data User would like to fit an image classifier with $x_{i},y_{i}$, $i\in\{1,2,\dots,n\}$ as input while satisfying $\epsilon_{j}$-LDP for the $j$th feature of $x_{i}$. Note that the images held by each data owner are perturbed independently, so there is no difference between perturbing a set of images and a single image at $O_{i}$. With respect to image classification, the effectiveness of both Naive Bayes and KNN classifiers has been demonstrated in previous work (Boiman, Shechtman, and Irani 2008; McCann and Lowe 2012), and both are studied in this work.

For the threat model, the adversary intends to learn an individual image $x_{i}$ and could be the untrustworthy data user, a participating data owner, or an outside attacker. We assume that adversaries may have arbitrary background knowledge and that they may collude. The goal is to maintain the data utility in terms of image classification while protecting the data privacy.

Frequency and Mean Estimation

In this paper, we mainly investigate using randomized response to satisfy $\epsilon_{j}$-LDP for $x_{ij}$, each feature being perturbed independently. Given $x_{i},y_{i}$, we assume that the class label $y_{i}\in\{C_{1},C_{2},\dots,C_{K}\}$ and that $x_{i}^{\prime}$ is the vector after perturbation. First, we show how to build the Naive Bayes classifier by leveraging the aggregation of the perturbed input.

In our problem, for each training sample $x_{i}$, each feature $x_{ij}$ is independently perturbed by randomized response at $O_{i}$, and $x_{i}^{\prime}$ is sent to the Data User. The frequency of each observed value $v$ serves as the building block of the classifier training; more specifically, the frequency of value $v$ in the $j$th feature among the samples belonging to $C_{k}$ should be counted, which is denoted as $c_{j}(v)$. However, due to the noise injected by randomized response, the observations do not reflect the true proportion of value $v$. Thus, given the number of observations of $v$ in the $j$th feature, denoted as $\sum_{j}v$, $c_{j}(v)$ is estimated as (Wang et al. 2017):

$\hat{c_{j}(v)}=\frac{\sum_{j}v-nq_{j}}{p_{j}-q_{j}}$ (13)
$=\frac{\sum_{j}v-n(1-p_{j})/(d-1)}{p_{j}-(1-p_{j})/(d-1)}$ (14)

where $p_{j}=Pr[x_{ij}^{\prime}=v \mid x_{ij}=v]$, as in Equation 3, and $n$ is the number of observations in $C_{k}$. It can further be shown that $\hat{c_{j}(v)}$ is an unbiased estimator. Assuming $f_{v,j}$ is the true proportion with which value $v$ occurs in the $j$th feature, $E[\hat{c_{j}(v)}]$ is:

$E[\hat{c_{j}(v)}]=E\left[\frac{\sum_{j}v-nq_{j}}{p_{j}-q_{j}}\right]$ (15)
$=\frac{nf_{v,j}p_{j}+n(1-f_{v,j})q_{j}-nq_{j}}{p_{j}-q_{j}}$ (16)
$=nf_{v,j}$ (17)

And $Var[\hat{c_{j}(v)}]$ is given below:

$Var[\hat{c_{j}(v)}]=\frac{nq_{j}(1-q_{j})}{(p_{j}-q_{j})^{2}}+\frac{nf_{v,j}(1-p_{j}-q_{j})}{p_{j}-q_{j}}$ (18)
$=n\left(\frac{d-2+e^{\epsilon_{j}}}{(e^{\epsilon_{j}}-1)^{2}}+\frac{f_{v,j}(d-2)}{e^{\epsilon_{j}}-1}\right)$ (19)

where $Var[\hat{c_{j}(v)}]$ is in $O(\frac{nd}{e^{\epsilon}})$. Furthermore, $\hat{\mu_{j}}$ is the estimated mean of feature $j$ given the frequency estimate of each observed value:

$\hat{\mu_{j}}=\frac{\sum_{v}v\hat{c_{j}(v)}}{n}$ (20)

Correspondingly, $E[\hat{\mu_{j}}]$ and $Var[\hat{\mu_{j}}]$ are computed as follows:

$E[\hat{\mu_{j}}]=E\left[\frac{\sum_{v}v\hat{c_{j}(v)}}{n}\right]$ (21)
$=\frac{E[\sum_{v}v\hat{c_{j}(v)}]}{n}$ (22)
$=\frac{\sum_{v}vE[\hat{c_{j}(v)}]}{n}$ (23)
$=\sum_{v}vf_{v,j}$ (24)
$Var[\hat{\mu_{j}}]=Var\left[\frac{\sum_{v}v\hat{c_{j}(v)}}{n}\right]$ (25)
$=\frac{Var[\sum_{v}v\hat{c_{j}(v)}]}{n^{2}}$ (26)
$=\frac{\sum_{v}v^{2}Var[\hat{c_{j}(v)}]}{n^{2}}$ (27)

It can be seen that $\hat{\mu_{j}}$ is an unbiased estimator and that $Var[\hat{\mu_{j}}]$ is in $O(\frac{d}{ne^{\epsilon}})$, so the Mean-Squared-Error (MSE) of $\hat{\mu_{j}}$ equals $Var[\hat{\mu_{j}}]$. In practice, the unbiased estimator for the sum of counts, $\sum_{v}\hat{c_{j}(v)}$, is adopted, which results in the same order of MSE.
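As an illustration, the following sketch (Python/NumPy; the simulation harness and names are ours, not the paper's) recovers the per-value counts and the feature mean from reports perturbed by the $d$-ary mechanism, following Equations 13 and 20:

```python
import numpy as np

def estimate_counts(reports, d, eps):
    """Unbiased per-value count estimates c_j(v) from d-ary RR reports (Eq. 13)."""
    n = len(reports)
    p = np.exp(eps) / (d - 1 + np.exp(eps))
    q = (1 - p) / (d - 1)
    observed = np.bincount(reports, minlength=d)  # raw counts of each value
    return (observed - n * q) / (p - q)

def estimate_mean(reports, d, eps):
    """Estimated mean of the feature from the corrected counts (Eq. 20)."""
    c_hat = estimate_counts(reports, d, eps)
    return (np.arange(d) * c_hat).sum() / len(reports)
```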

A different mean estimator, $\hat{\mu_{j}^{rr}}$, is given by replacing the denominator $n$ with the corrected sum of frequencies:

$\hat{\mu_{j}^{rr}}=\frac{\sum_{v}v\hat{c_{j}(v)}}{\sum_{v}\hat{c_{j}(v)}}$ (28)

Correspondingly, writing $X=\sum_{v}v\hat{c_{j}(v)}$ for the numerator and $Y=\sum_{v}\hat{c_{j}(v)}$ for the denominator, and using multivariate Taylor expansions, $E[\hat{\mu_{j}^{rr}}]$ and $Var[\hat{\mu_{j}^{rr}}]$ are computed as follows:

$E[X]=E[\sum_{v}v\hat{c_{j}(v)}]=\sum_{v}E[v\hat{c_{j}(v)}]=\sum_{v}vE[\hat{c_{j}(v)}]\leq nd^{2}$ (29)
$Var[X]=Var[\sum_{v}v\hat{c_{j}(v)}]=\sum_{v}Var[v\hat{c_{j}(v)}]=\sum_{v}v^{2}Var[\hat{c_{j}(v)}]$ (30)
$E[Y]=E[\sum_{v}\hat{c_{j}(v)}]=\sum_{v}E[\hat{c_{j}(v)}]=n$ (31)
$Var[Y]=Var[\sum_{v}\hat{c_{j}(v)}]=\sum_{v}Var[\hat{c_{j}(v)}]=nd\left(\frac{d-2+e^{\epsilon_{j}}}{(e^{\epsilon_{j}}-1)^{2}}\right)$ (32)
$E[\frac{X}{Y}]\approx\frac{E[X]}{E[Y]}-\frac{Cov[X,Y]}{E[Y]^{2}}+\frac{E[X]}{E[Y]^{3}}Var[Y]$ (33)
$=\frac{E[X]E[Y]^{2}}{E[Y]^{3}}-\frac{Cov[X,Y]E[Y]}{E[Y]^{3}}+\frac{E[X]Var[Y]}{E[Y]^{3}}$ (34)
$=\frac{E[X]E[Y]^{2}-Cov[X,Y]E[Y]+E[X]Var[Y]}{E[Y]^{3}}$ (35)
$\leq\frac{n^{3}d^{2}-nCov[X,Y]+n^{2}d^{3}\frac{d-2+e^{\epsilon_{j}}}{(e^{\epsilon_{j}}-1)^{2}}}{n^{3}}$ (36)
$Var[\frac{X}{Y}]\approx\frac{Var[X]}{E[Y]^{2}}-\frac{2E[X]}{E[Y]^{3}}Cov[X,Y]+\frac{E[X]^{2}}{E[Y]^{4}}Var[Y]$ (37)
$=\frac{Var[X]E[Y]^{2}}{E[Y]^{4}}-\frac{2E[X]E[Y]Cov[X,Y]}{E[Y]^{4}}+\frac{E[X]^{2}Var[Y]}{E[Y]^{4}}$ (38)
$=\frac{Var[X]E[Y]^{2}-2E[X]E[Y]Cov[X,Y]+E[X]^{2}Var[Y]}{E[Y]^{4}}$ (39)
$Cov[X,Y]=E[XY]-E[X]E[Y]$ (40)
$MSE[\frac{X}{Y}]=Var[\frac{X}{Y}]+\left(E[\frac{X}{Y}]-\frac{X}{Y}\right)^{2}$ (41)
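A direct way to realize this ratio estimator (our own sketch, reusing `estimate_counts` from the earlier example) is:

```python
import numpy as np

def estimate_mean_rr(reports, d, eps):
    """Ratio estimator of Eq. 28: value-weighted corrected counts over their sum."""
    c_hat = estimate_counts(reports, d, eps)  # from the earlier sketch
    return (np.arange(d) * c_hat).sum() / c_hat.sum()
```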

Nearest Centroid-Based Classifier

The Nearest Centroid Classifier (NCC) is a classification model that assigns to a testing sample the label of the class of training samples whose mean is closest to it under a certain distance metric. Given training data $x_{i},y_{i}$, $i\in\{1,2,\dots,n\}$, the per-class centroid is computed as follows:

$\mu_{k}=\frac{1}{N_{k}}\sum_{i:y_{i}=C_{k}}x_{i}$ (42)

where $N_{k}$ is the number of training samples belonging to class $C_{k}$. For a testing sample $x_{t}$, the predicted class is given by

$\hat{y_{t}}=\operatorname*{argmin}_{k}\ell_{p}(x_{t},\mu_{k})$ (43)

where $\ell_{p}(\cdot)$ is the $p$-norm distance between $x_{t}$ and $\mu_{k}$. To build the NCC from the perturbed data, the estimated centroid is adopted:

$\hat{\mu_{k}}=\frac{\sum_{v\in C_{k}}v\hat{c_{j}(v)}}{N_{k}}$ (44)
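A minimal sketch of this classifier (Python/NumPy, our own illustration; `estimate_counts` comes from the earlier sketch) builds per-class, per-feature centroids from the estimated counts and predicts by nearest centroid:

```python
import numpy as np

def fit_centroids(reports, labels, d, eps):
    """Estimate per-class centroids from perturbed reports (Eq. 44).

    reports: (n, m) integer matrix of d-ary RR outputs; labels: (n,) class ids.
    """
    classes = np.unique(labels)
    centroids = np.zeros((len(classes), reports.shape[1]))
    for ci, c in enumerate(classes):
        R = reports[labels == c]                      # reports belonging to C_k
        for j in range(reports.shape[1]):
            c_hat = estimate_counts(R[:, j], d, eps)  # earlier sketch
            centroids[ci, j] = (np.arange(d) * c_hat).sum() / len(R)
    return classes, centroids

def predict(x_t, classes, centroids, p=2):
    """Assign the label of the closest centroid under the p-norm (Eq. 43)."""
    dists = np.linalg.norm(centroids - x_t, ord=p, axis=1)
    return classes[np.argmin(dists)]
```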

DCAConv - Supervised Image Feature Extraction

Figure 2: DCAConv: Supervised Image Feature Extractor

As analyzed in the previous section, for both Naive Bayes and KNN classifiers, the domain size of the input determines the utility gain under randomized response perturbation. For Naive Bayes, the utility can be measured by the variance of the estimated frequency. For the KNN classifier, the utility is measured by the probability of preserving the true proximity between the testing data points and the perturbed training points. In terms of randomized response, it is easy to see that the utility gain is maximized when the domain size $d=2$. However, the binary format limits the transmitted information, especially for image data. On the other hand, increasing $d$ demands a higher price in terms of privacy. Thus, the domain size $d$ is the key factor in the trade-off between utility and privacy.

For image data, the naive way to protect privacy is to apply randomized response directly to the pixel features; nevertheless, the utility is severely damaged due to the large domain size. Thus, the challenge is to find a proper representation of the image data: on one hand, the representation should preserve as much information as possible; on the other hand, its domain size should balance the utility and privacy. The success of CNNs (Krizhevsky, Sutskever, and Hinton 2012) on image classification has drawn much attention in the last decade; a CNN transforms the image into multiple levels of representation by extracting abstract semantics. However, it is hard to apply randomized response to a CNN directly, since a CNN usually requires continuous values in its data flow. Furthermore, it relies on either a Softmax or a Sigmoid layer to determine the final output distribution, and the model is trained to optimize specific loss functions. Inspired by PCANet (Chan et al. 2015), we are interested in designing a mechanism that transforms the image into a representation with a scalable domain size; meanwhile, the domain size can be kept as small as possible to limit the impact of randomized response on utility. The outcome is DCAConv, shown in Fig. 2, which consists of convolution layers, a binary quantization layer, and a pooling layer. The convolution filters are computed using DCA, and the generated discriminant components serve as the filters. The quantization layer converts the output of the convolution operation to binary bits. Finally, images are represented by the mapping of the binary bits and passed through the pooling layer. The details are presented below.

DCA Convolution Layer

Suppose there are $N$ images $\{E_{i}\}^{N}_{i=1}$, each of the same size $h\times w$; the patch size is $k_{h}\times k_{w}$, and the number of filters in the $r$th convolution layer is $L_{r}$. $E_{i}$ is denoted as $[e_{i1},e_{i2},\dots,e_{i\widetilde{h}\widetilde{w}}]$, where $e_{ij}\in\mathbb{R}^{k_{h}\times k_{w}}$ denotes the $j$th vectorized patch in $E_{i}$, $\widetilde{h}=(h-k_{h}+2z)/t+1$, $\widetilde{w}=(w-k_{w}+2z)/t+1$, $t$ is the stride size, and $z$ is the size of the zero padding. Additionally, assuming there are $K$ classes, each image $E_{i}$ is associated with a class label $y_{i}$, where $y_{i}\in\{C_{1},C_{2},\dots,C_{K}\}$, and each class $C_{k}$ contains $N_{k}$ images. For the first convolution layer, the overall patch mean $\mu$ and the within-class patch mean $\mu_{k}$ are calculated as below:

$\mu=\frac{\sum_{i=1}^{N}\sum_{j=1}^{\widetilde{h}\widetilde{w}}e_{ij}}{N}$ (45)
$\mu_{k}=\frac{\sum_{i\in C_{k}}\sum_{j=1}^{\widetilde{h}\widetilde{w}}e_{ij}}{N_{k}}$ (46)

Then the between-class scatter matrix $S_{B}$, the within-class scatter matrix $S_{W}$, and the regularized scatter matrix are calculated as below:

$S_{B}=\sum^{K}_{k=1}N_{k}(\mu_{k}-\mu)(\mu_{k}-\mu)^{T}$ (47)
$S_{W}=\sum^{K}_{k=1}\sum_{i\in C_{k}}(E_{i}-\mu_{k})(E_{i}-\mu_{k})^{T}$ (48)
$\bar{S}^{\prime}=S_{B}^{\prime}+S_{W}^{\prime}=S_{B}+\rho^{\prime}\mathbb{I}+S_{W}+\rho\mathbb{I}$ (49)

After obtaining $S_{W}^{\prime}$ and $\bar{S}^{\prime}$, DCA is performed over the patches to compute a set of discriminant components as the orthonormal filters, where:

$(S_{W}^{\prime})^{-1}\bar{S}^{\prime}=U\Lambda U^{T}$ (50)

where $U$ holds the associated eigenvectors. The leading eigenvectors capture the main discriminant power of the mean-removed training patches. The first $L_{1}$ eigenvectors are then taken as the orthonormal filters of the first convolution layer, where the $l$th filter is represented as $W^{l}_{1}\in\mathbb{R}^{k_{h}\times k_{w}}$.
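The sketch below (Python/NumPy; the patch-extraction helper and parameter names are our own assumptions) illustrates how the first-layer filters could be derived by running the `dca_fit` routine from the earlier sketch over the vectorized patches:

```python
import numpy as np

def extract_patches(E, kh, kw, stride=1):
    """Collect vectorized kh x kw patches from one image E (h x w)."""
    h, w = E.shape
    return np.array([E[r:r + kh, c:c + kw].ravel()
                     for r in range(0, h - kh + 1, stride)
                     for c in range(0, w - kw + 1, stride)])

def dca_filters(images, labels, kh, kw, L1):
    """First-layer DCAConv filters: the top-L1 DCA components of the patches."""
    patches, patch_labels = [], []
    for E, y in zip(images, labels):
        P = extract_patches(E, kh, kw)
        patches.append(P - P.mean(axis=0))  # remove the patch mean
        patch_labels.extend([y] * len(P))
    W = dca_fit(np.vstack(patches), np.array(patch_labels), n_components=L1)
    return [W[:, l].reshape(kh, kw) for l in range(L1)]  # L1 orthonormal filters
```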

Similar to a CNN, multiple convolution layers can be stacked following the first layer, as Fig. 2 shows. Taking a second convolution layer as an example, it repeats a procedure similar to that of the first layer. Given the $l_{1}$th filter in the first layer, for image $E_{i}$:

$E_{i}^{l_{1}}=E_{i}*W^{l_{1}}_{1},\quad i=1,2,\dots,N,\ l_{1}=1,2,\dots,L_{1}$ (51)

where $*$ denotes 2D convolution. Zero padding can be applied to $E_{i}^{l_{1}}$ so that it has the same size as $E_{i}$. After that, the patch mean of $E_{i}^{l_{1}}$ is removed, DCA is performed over the concatenated patches, and the filters for the second convolution layer are computed. Once DCA is finished, the first $L_{2}$ eigenvectors are taken as the filters, where the $l$th filter in the second layer is represented as $W^{l}_{2}\in\mathbb{R}^{k_{h}\times k_{w}}$. Each $E_{i}^{l_{1}}$ is convolved with these $L_{2}$ filters, as Fig. 2 shows.

Binary Quantization

After the second convolution layer, there are $L_{1}\times L_{2}$ outputs in total, and they are converted to a binary representation using the Heaviside step function $H(\cdot)$:

$H(x)=\begin{cases}0, & x\leq 0\\ 1, & x>0\end{cases}$ (52)

where $x=E_{i}^{l_{1}}*W^{l_{2}}_{2}$, $l_{1}=1,2,\dots,L_{1}$, $l_{2}=1,2,\dots,L_{2}$.

Mapping

Once the output of the second convolution layer is converted to binary, each $E_{i}^{l_{1}}$ yields $\{H(E_{i}^{l_{1}}*W^{l_{2}}_{2})\}_{l_{2}=1}^{L_{2}}$, a vector of $L_{2}$ binary bits. Since $E_{i}^{l_{1}}$ has the same size as $E_{i}$ with zero padding, for each pixel the vector of $L_{2}$ binary bits is viewed as an $L_{2}$-digit binary number and is "mapped" back into a single integer:

$T_{i}^{l_{1}}=\sum^{L_{2}}_{l_{2}=1}2^{l_{2}-1}H(E_{i}^{l_{1}}*W^{l_{2}}_{2})$ (53)

Max Pooling

The output of the binary quantization layer is $L_{1}$ "images" whose "pixels" are $L_{2}$-bit integers. To further reduce the size of the representation, a max pooling layer is applied to each $T_{i}^{l_{1}}$ after the binary quantization layer.
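The following sketch (Python with NumPy and SciPy; `convolve2d` and the helper names are our own choices) combines the second-layer convolution, the binary quantization of Eq. 52, the integer mapping of Eq. 53, and max pooling:

```python
import numpy as np
from scipy.signal import convolve2d

def quantize_and_map(E_l1, filters2):
    """Per pixel, map the L2 Heaviside bits to one integer (Eqs. 52-53)."""
    T = np.zeros(E_l1.shape, dtype=int)
    for l2, W in enumerate(filters2):
        response = convolve2d(E_l1, W, mode="same")  # zero-padded 2D convolution
        T += (response > 0).astype(int) << l2        # bit l2 is H(E_i^l1 * W_2^l2)
    return T                                         # entries lie in [0, 2^L2 - 1]

def max_pool(T, win=2, stride=1):
    """Max pooling over win x win windows with the given stride."""
    h, w = T.shape
    return np.array([[T[r:r + win, c:c + win].max()
                      for c in range(0, w - win + 1, stride)]
                     for r in range(0, h - win + 1, stride)])
```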

Perturbation

Randomized response is applied to the output of the pooling layer, and the domain size $d$ is determined by $L_{2}$, where $d=2^{L_{2}}$. As mentioned earlier, in DCA the rank of the signal subspace is limited by the number of classes; more specifically, $L_{r}\leq K$, where $L_{r}$ is the number of filters in the $r$th convolution layer and $K$ is the number of classes. By employing DCA in the design, the domain of the final output is scalable; moreover, DCA guarantees a high utility with a small number of components.
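Putting the pieces together, a data owner could produce an $\epsilon$-LDP report as in the sketch below (our own illustration that chains the earlier helper sketches; none of these names come from the paper):

```python
import numpy as np
from scipy.signal import convolve2d

def perturb_representation(E, filters1, filters2, eps):
    """DCAConv representation of image E, perturbed feature-wise with d-ary RR."""
    d = 2 ** len(filters2)                     # output domain size, d = 2^L2
    features = []
    for W1 in filters1:                        # first convolution layer
        E_l1 = convolve2d(E, W1, mode="same")
        T = quantize_and_map(E_l1, filters2)   # Eqs. 52-53, earlier sketch
        features.extend(max_pool(T).ravel())   # pooling, earlier sketch
    # each feature is perturbed independently (k_rr from the earlier sketch)
    return np.array([k_rr(int(v), d, eps) for v in features])
```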

Experiment

We evaluate DCAConv on three popular benchmark image datasets: MNIST, Fashion-MNIST (https://github.com/zalandoresearch/fashion-mnist), and YaleB (vision.ucsd.edu/~iskwak/ExtYaleDatabase/ExtYaleB.html).

The MNIST dataset contains $28\times 28$ grayscale images of handwritten digits from 0 to 9, with 60,000 training samples and 10,000 testing samples.

The Fashion-MNIST dataset contains grayscale article images, each associated with a label from 10 classes. The dataset is intended to replace the overused MNIST dataset and shares the same image size and training/testing split structure.

The YaleB face dataset contains grayscale photos of 27 subjects, with 594 photos per subject on average. We select samples based on the lighting condition and head position, which results in around 250 samples per subject. The face is extracted and scaled to $70\times 70$.

To study the utility and privacy trade-off in terms of the domain size $d$ and the privacy parameter $\epsilon$, we apply randomized response to multiple image representations, such as raw pixels and Histograms of Oriented Gradients (HOG); the perturbed input is then used to fit the Naive Bayes and KNN classifiers. For DCAConv, the domain size is altered by choosing different numbers of filters in the last convolution layer, and the classification results demonstrate the effectiveness of DCAConv over the other image feature representations. All experiments are repeated 10 times, and the mean and standard deviation of the classification accuracy are shown in the figures.

Figure 4 (panels a-c): Comparison of DCAConv, pixel feature, and BOVW. DCA_KNN denotes the classification result of DCAConv with KNN, RAW_KNN the pixel feature with KNN, and BOVW_KNN the BOVW with KNN. For DCAConv and the pixel feature, the output domain is $d=16$. For BOVW, MNIST has $d=40$, Fashion-MNIST has $d=39$, and YaleB has $d=44$. DCAConv hyperparameters: filter size 7$\times$7, filter stride 1$\times$1, $L_{1}=5$, $L_{2}=4$, pooling window 2$\times$2, pooling stride 1$\times$1.

DCAConv vs Pixel vs HOG

We first measure the classification accuracy with three image representations: raw pixels, DCAConv, and HOG, where a Bag of Visual Words (BOVW) model is built on top of the HOG descriptor. More specifically, the HOG descriptor provides a fixed-length image representation, and each feature is contrast-normalized to improve accuracy. Once the descriptors are generated, K-Means clustering is performed over all descriptors, and the center of each cluster is used as a vocabulary of the visual dictionary. Finally, the frequency histogram over the visual vocabularies serves as the BOVW. Unlike DCAConv, the domain size of the BOVW is fixed once the number of clusters is determined, and no significant difference is observed after tuning the number of clusters. In the experiment, the domain size of the BOVW is determined by the maximum frequency observed among all features (e.g., MNIST has a domain size of 40, Fashion-MNIST 39, and YaleB 44); otherwise, the probability of answering honestly would be uneven among the features. For DCAConv, there are two convolution layers, and various numbers of $L_{2}$ filters are tested. To make a fair comparison, the raw pixels are downscaled to the same domain size as the DCAConv output. For example, the raw pixel values are downscaled to $[0,15]$ using a MinMaxScaler (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) to match DCAConv with 4 $L_{2}$ filters. We do not observe significant degradation of the classification accuracy from the information loss caused by the downscaling. Fig. 4 displays the accuracy for the Naive Bayes and KNN classifiers; the horizontal axis specifies the $\epsilon$ for each individual feature, and the vertical axis shows the classification accuracy.

From the figure, it can be seen that both classifiers perform only slightly better than random guessing when $\epsilon\leq 0.01$, and the accuracy increases once $\epsilon\geq 0.1$. When $\epsilon$ grows to $1.0$, both classifiers provide reasonably good accuracy for DCAConv and the raw pixels; however, BOVW still suffers from low accuracy, as its domain size is much larger than that of the other two representations. Tab. 1 provides the KNN accuracy of DCAConv when $\epsilon\geq 0.1$, where the last column gives the ground truth with no randomized response applied. As analyzed in Section III, the probability of preserving the true $\ell_{2}$ distance exceeds $\frac{1}{2}$ when $\epsilon\geq\ln(d+1)$, and the results confirm that the accuracy approaches the ground truth for such $\epsilon$; e.g., for all 3 datasets, the accuracy is within $5\%$ of the ground truth when $\epsilon\geq 2.83$. The comparison between the Naive Bayes and KNN classifiers shows that Naive Bayes achieves higher accuracy when $\epsilon\leq 1.0$, which may be because the noise is eliminated by the frequency estimator, while KNN still suffers from a low probability of preserving the true proximity of the data points at the same $\epsilon$. However, as $\epsilon$ increases, the perturbation lessens and KNN starts to dominate in classification accuracy.

Finally, for the texture dataset, the downscaled pixel feature provides accuracy comparable to DCAConv. However, for more complicated data, such as facial data, the pixel feature fails to provide good discriminant capacity in the KNN algorithm; under the same $\epsilon$, BOVW performs worse than the other two representations due to its large domain size.

Classifier     Dataset        $\epsilon$: 0.1    0.5    1.0    1.5    2.0    2.5    3.0    3.5    4.0    $\infty$
Naive Bayes    MNIST                     77.52  86.35  87.15  87.19  87.07  86.97  86.92  86.92  86.90  86.90
               Fashion-MNIST             58.96  68.27  68.71  68.88  68.89  68.94  68.94  68.90  68.90  68.80
               YaleB                     10.52  44.27  68.29  75.32  78.89  79.43  80.63  80.70  81.32  81.71
KNN            MNIST                     24.72  68.92  81.96  86.05  88.00  89.37  89.95  90.27  90.46  90.50
               Fashion-MNIST             20.56  48.58  57.35  63.40  68.21  71.54  74.14  75.64  76.73  78.70
               YaleB                      6.01  49.56  90.76  95.18  95.88  95.91  95.88  95.77  95.77  95.92
Table 1: Classification accuracy (%) with DCAConv, where $d=16$ and $\ln(d+1)\approx 2.83$. The last column provides the ground truth with no randomized response applied. Due to the space limit, the standard deviation of the measured accuracy is omitted here.
Classifier     Dataset        $\epsilon$: 0.1    0.5    1.0    1.5    2.0    2.5    3.0    3.5    4.0    $\infty$
Naive Bayes    MNIST                     76.38  77.30  77.21  77.10  77.04  76.96  76.91  76.89  76.83  76.60
               Fashion-MNIST             58.96  68.27  68.71  68.88  68.89  68.94  68.94  68.90  68.90  68.80
               YaleB                     24.01  44.76  43.94  43.38  43.16  42.91  42.86  42.74  42.74  42.74
KNN            MNIST                     65.96  85.25  88.27  89.21  89.69  89.95  90.04  90.19  90.24  90.20
               Fashion-MNIST             69.66  70.90  70.62  70.52  70.43  70.39  70.39  70.37  70.32  70.30
               YaleB                     79.33  95.50  95.35  95.46  95.58  95.55  95.61  95.66  95.75  95.80
Table 2: Classification accuracy (%) with DCAConv, where $d=2$ and $\ln(d+1)\approx 1.10$. The last column provides the ground truth with no randomized response applied. Due to the space limit, the standard deviation of the measured accuracy is omitted here.

DCAConv vs BNN

Figure 6 (panels a-c): Comparison of DCAConv and BNN. DCA_KNN denotes the result of the DCAConv output with KNN, BNN_KNN the result of the BNN output with KNN. Output domain $d=2$; number of neighbors in KNN: 100. DCAConv hyperparameters: filter size 7$\times$7, filter stride 1$\times$1, $L_{1}=5$, $L_{2}=1$, pooling window 2$\times$2, pooling stride 1$\times$1.

We evaluate DCAConv in an extreme case, where the output is set to binary, and compare the utility of the representation to that of a CNN variant. As mentioned earlier, the reason it is hard to apply randomized response to a CNN is that its data flow is continuous. However, there are studies that replace the weights and activations with discrete values to improve power efficiency. The Binarized Neural Network (BNN) (Courbariaux et al. 2016) is one such work; it modifies the convolution layers and activation functions to transform the data flow from floating-point numbers to $\{+1,-1\}$. For the implementation of the BNN, we take the suggested values for parameters such as the number of hidden units in the fully connected layer, but use the same convolution layer settings as DCAConv, such as the same number of filters. Once the training of the BNN is finished, we take the activated binarized output before the fully connected layer and perturb it using randomized response. The classification results are shown in Fig. 6, where the vertical axis specifies the classification accuracy. With the binary representation, the probability of preserving the true proximity quickly dominates once $\epsilon\geq 1.1$, and the performance on the perturbed data almost reaches the ground truth. Similarly, Tab. 2 provides the Naive Bayes and KNN accuracy of DCAConv when $\epsilon\geq 0.1$ with domain size $d=2$; the classification accuracy of KNN almost matches the ground truth on all 3 datasets when $\epsilon\geq 1.5$. However, unlike KNN, there is no clear threshold of $\epsilon$ at which the Naive Bayes classifier provides a utility guarantee.

Related Work

Image privacy has been studied widely. Privacy-aware image classification has been addressed in (Zerr et al. 2012) and (Spyromitros-Xioufis et al. 2016), which are interested in determining whether or not an image has privacy-related content. By exploring the "public/private" tags provided by end users, they train a privacy classifier to help end users make decisions in the context of image sharing. A privacy-preserving image feature extraction system, SecSIFT, is proposed in (Qin et al. 2014); it allows a user to delegate SIFT feature extraction to a cloud computing platform. In the design, the image owners encrypt the image and upload the encrypted data to the Cloud, and the computation is distributed to multiple independent Cloud entities that do not collude with each other. Once the feature extraction is finished, the encrypted feature descriptors are returned to the image owners. The authors argue that neither the plaintext data nor the locations of the feature points of the image are leaked to the Cloud. To the best of our knowledge, the work most similar to this paper is (Erkin et al. 2009), a privacy-enhanced face recognition system. It considers a two-party problem: the server owns a database of face images, and the client holds a facial image. The proposed protocol allows the client to efficiently hide both the biometrics and the result from the server that performs the matching operation, using secure multiparty computation. The difference from our work is that it assumes a centralized setting where the data has already been collected by the server; moreover, the technique (e.g., without adding noise for differential privacy) provides inadequate privacy protection.

LDP was first proposed in (Duchi, Jordan, and Wainwright 2013), where minimax bounds for learning problems are derived under the local privacy definition. Kairouz et al. (Kairouz, Oh, and Viswanath 2014) study the trade-off between utility and LDP in terms of a family of extremal mechanisms. Although two simple extremal mechanisms, the binary and randomized response mechanisms, are proposed with theoretical utility guarantees, it is still hard to adopt these methods in real-world machine learning problems. Wang et al. (Wang et al. 2017) propose a framework that generalizes several LDP protocols in the literature and analyze each one with regard to frequency estimation. For real-world applications, RAPPOR (Erlingsson, Pihur, and Korolova 2014), designed by Google, is the first system to practically adopt LDP to collect statistics from end users. Apple and Microsoft have also proposed their own tools that use LDP to collect usage, typing history, and telemetry data (Ding, Kulkarni, and Yekhanin 2017).

Recently, an iterative method (?) satisfying $\epsilon$-LDP was proposed for empirical risk minimization, where each data owner locally computes and perturbs the stochastic gradient descent (SGD) update of the parameters, then shares the noisy version with the aggregator. The authors demonstrate the computation with three models: linear regression, logistic regression, and SVM. Another client-server machine learning framework is proposed to learn Generalized Linear Models (Pihur et al. 2018), where the server simultaneously delivers models to the distributed clients and asks for parameter updates. More specifically, the server maintains $k$ versions of the machine learning model, randomly selects one version to update its parameters, and replaces another version with it. To enforce $\epsilon$-LDP, the client uses the Laplace mechanism to perturb the parameters before returning the model to the server. Unlike both works, we consider a "non-interactive" scenario in which the data owners perturb their input only once and then share the noisy input with the data user. Compared to the interactive fashion, the non-interactive way requires less communication cost and offers greater scalability. Overall, most current studies of LDP focus either on the theoretical properties of the perturbation methods or on simple statistics collection, such as mean and frequency estimation; machine learning in the local setting remains rarely studied.

Conclusion

In this paper, we study a privacy-preserving image classification problem in which distributed data owners hold a set of images and an untrustworthy data user would like to fit a classifier over the images held by the owners. To protect the privacy, we propose to use randomized response to perturb the image representation locally, which satisfies $\epsilon$-Local Differential Privacy. We address two questions. First, we analyze the utility and privacy trade-off with respect to the domain size $d$ and the parameter $\epsilon$ for the Naive Bayes and KNN classifiers: for the Naive Bayes classifier, the utility is bounded by the frequency estimation of the individual features; for KNN, we show that the probability of retaining the true proximity between the testing sample and the perturbed training samples exceeds $\frac{1}{2}$ when $\epsilon\geq\ln(d+1)$. Second, to balance the trade-off between utility and privacy, we propose a supervised image feature extractor, DCAConv, which produces an image representation with a scalable domain size. Experiments over three benchmark image datasets evaluate the classification accuracy of the perturbed image representations with respect to the Naive Bayes and KNN algorithms, confirming our analysis regarding the domain size and $\epsilon$ and showing the effectiveness of DCAConv under various privacy constraints.

References

  • [Boiman, Shechtman, and Irani 2008] Boiman, O.; Shechtman, E.; and Irani, M. 2008. In defense of nearest-neighbor based image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Chan et al. 2015] Chan, T.-H.; Jia, K.; Gao, S.; Lu, J.; Zeng, Z.; and Ma, Y. 2015. Pcanet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing.
  • [Chan, Liang, and Vasconcelos 2008] Chan, A. B.; Liang, Z.-S. J.; and Vasconcelos, N. 2008. Privacy preserving crowd monitoring: Counting people without people models or tracking. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
  • [Courbariaux et al. 2016] Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
  • [Ding, Kulkarni, and Yekhanin 2017] Ding, B.; Kulkarni, J.; and Yekhanin, S. 2017. Collecting telemetry data privately. In Advances in Neural Information Processing Systems.
  • [Duchi, Jordan, and Wainwright 2013] Duchi, J. C.; Jordan, M. I.; and Wainwright, M. J. 2013. Local privacy and statistical minimax rates. In IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS).
  • [Dwork et al. 2014] Dwork, C.; Talwar, K.; Thakurta, A.; and Zhang, L. 2014. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In 46th Annual ACM Symposium on Theory of Computing.
  • [Dwork 2006] Dwork, C. 2006. Differential privacy. In 33rd International Colloquium on Automata, Languages and Programming.
  • [Erkin et al. 2009] Erkin, Z.; Franz, M.; Guajardo, J.; Katzenbeisser, S.; Lagendijk, I.; and Toft, T. 2009. Privacy-preserving face recognition. In International Symposium on Privacy Enhancing Technologies Symposium.
  • [Erlingsson, Pihur, and Korolova 2014] Erlingsson, Ú.; Pihur, V.; and Korolova, A. 2014. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security.
  • [Kairouz, Oh, and Viswanath 2014] Kairouz, P.; Oh, S.; and Viswanath, P. 2014. Extremal mechanisms for local differential privacy. In Advances in neural information processing systems.
  • [Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems.
  • [Kung 2017] Kung, S.-Y. 2017. Discriminant component analysis for privacy protection and visualization of big data. Multimedia Tools and Applications.
  • [Leon et al. 2012] Leon, P.; Ur, B.; Shay, R.; Wang, Y.; Balebako, R.; and Cranor, L. 2012. Why johnny can’t opt out: a usability evaluation of tools to limit online behavioral advertising. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
  • [Lepinski et al. 2015] Lepinski, M.; Levin, D.; McCarthy, D.; Watro, R.; Lack, M.; Hallenbeck, D.; and Slater, D. 2015. Privacy-enhanced android for smart cities applications. In EAI International Conference on Smart Urban Mobility Services.
  • [McCann and Lowe 2012] McCann, S., and Lowe, D. G. 2012. Local naive bayes nearest neighbor for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Nguyên et al. 2016] Nguyên, T. T.; Xiao, X.; Yang, Y.; Hui, S. C.; Shin, H.; and Shin, J. 2016. Collecting and analyzing data from smart device users with local differential privacy. arXiv preprint arXiv:1606.05053.
  • [Pihur et al. 2018] Pihur, V.; Korolova, A.; Liu, F.; Sankuratripati, S.; Yung, M.; Huang, D.; and Zeng, R. 2018. Differentially-private "draw and discard" machine learning. arXiv preprint arXiv:1807.04369.
  • [Qin et al. 2014] Qin, Z.; Yan, J.; Ren, K.; Chen, C. W.; and Wang, C. 2014. Towards efficient privacy-preserving image feature extraction in cloud computing. In Proceedings of the 22nd ACM international conference on Multimedia.
  • [Spyromitros-Xioufis et al. 2016] Spyromitros-Xioufis, E.; Papadopoulos, S.; Popescu, A.; and Kompatsiaris, Y. 2016. Personalized privacy-aware image classification. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval.
  • [Wang et al. 2016] Wang, S.; Huang, L.; Wang, P.; Deng, H.; Xu, H.; and Yang, W. 2016. Private weighted histogram aggregation in crowdsourcing. In International Conference on Wireless Algorithms, Systems, and Applications.
  • [Wang et al. 2017] Wang, T.; Blocki, J.; Li, N.; and Jha, S. 2017. Locally differentially private protocols for frequency estimation. In Proceedings of the 26th USENIX Security Symposium.
  • [Warner 1965] Warner, S. L. 1965. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association.
  • [Xia et al. 2016] Xia, Z.; Wang, X.; Zhang, L.; Qin, Z.; Sun, X.; and Ren, K. 2016. A privacy-preserving and copy-deterrence content-based image retrieval scheme in cloud computing. IEEE Transactions on Information Forensics and Security.
  • [Zerr et al. 2012] Zerr, S.; Siersdorfer, S.; Hare, J.; and Demidova, E. 2012. Privacy-aware image classification and search. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval.