
Estimating g-Leakage via Machine Learning

Marco Romanelli (Inria, École Polytechnique, IPP, Università di Siena; Palaiseau, France), Konstantinos Chatzikokolakis (University of Athens; Athens, Greece), Catuscia Palamidessi (Inria, École Polytechnique, IPP; Palaiseau, France), and Pablo Piantanida (CentraleSupelec, CNRS, Université Paris Saclay; Gif-sur-Yvette, France)
(2020)
Abstract.

This paper considers the problem of estimating the information leakage of a system in the black-box scenario, i.e., when the system's internals are unknown to the learner, or too complicated to analyze, and the only available information is a set of input-output pairs, obtained by submitting queries to the system or provided by a third party. The frequentist approach relies on counting the frequencies of the outputs to estimate the input-output conditional probabilities; however, this method is not accurate when the domain of possible outputs is large. To overcome this difficulty, the estimation of the Bayes error of the ideal classifier was recently investigated using Machine Learning (ML) models, and it has been shown to be more accurate thanks to the ability of those models to learn the input-output correspondence. However, the Bayes vulnerability is only suitable to describe one-try attacks. A more general and flexible measure of leakage is the g-vulnerability, which encompasses several different types of adversaries, with different goals and capabilities. We propose a novel approach to perform black-box estimation of the g-vulnerability using ML, which does not require estimating the conditional probabilities and is suitable for a large class of ML algorithms. First, we formally show the learnability for all data distributions. Then, we evaluate the performance via various experiments using k-Nearest Neighbors and Neural Networks. Our approach outperforms the frequentist one when the domain of observables is large.

g-vulnerability estimation; machine learning; neural networks
journalyear: 2020; conference: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS '20), November 9–13, 2020, Virtual Event, USA; doi: 10.1145/3372297.3423363; isbn: 978-1-4503-7089-9/20/11; ccs: Security and privacy (Formal security models; Formal methods and theory of security; Information flow control); Computing methodologies (Neural networks; Machine learning)

1. Introduction

The information leakage of a system is a fundamental concern of computer security, and measuring the amount of sensitive information that an adversary can obtain by observing the outputs of a given system is of the utmost importance to understand whether such leakage can be tolerated or must be considered a major security flaw. Much research effort has been dedicated to studying and proposing solutions to this problem, see for instance (Clark et al., 2001; Köpf and Basin, 2007; Chatzikokolakis et al., 2008a; Boreale, 2009; Smith, 2009; Alvim et al., 2012, 2014; Chatzikokolakis et al., 2019). So far, this area of research, known as quantitative information flow (QIF), has mainly focused on the so-called white-box scenario, i.e. assuming that the channel of the system is known, or can be computed by analyzing the system’s internals. This channel consists of the conditional probabilities of the outputs (observables) given the inputs (secrets).

The white-box assumption, however, is not always realistic: sometimes the system is unknown, or it is too complex for an analytic computation to be feasible. It is therefore important to also consider black-box approaches, where we only assume the availability of a finite set of input-output pairs generated by the system. A further characteristic of a black-box model is its "degree of inaccessibility", namely whether we are allowed to interact with the system to obtain these pairs, or whether they are simply provided by a third party. Both scenarios are realistic, and in this paper we consider both.

The estimation of the internal probabilities of a system's channel has been investigated in (Chothia and Guha, 2011) and (Chothia et al., 2014) via a frequentist paradigm, i.e., by computing the frequencies of the outputs given the inputs. However, this approach does not scale to applications in which the output space is very large, since a prohibitively large number of samples would be necessary to achieve good results, and it fails on continuous alphabets unless strong assumptions are made on the distributions. In order to overcome this limitation, the authors of (Cherubin et al., 2019) exploited the fact that Machine Learning (ML) algorithms provide better scalability for black-box measurements. Intuitively, the advantage of the ML approach over the frequentist one is its generalization power: while the frequentist method can only draw conclusions based on counts over the available samples, ML is able to extrapolate from the samples and provide better predictions (generalization) for the rest of the universe.

In particular, (Cherubin et al., 2019) proposed to use k-Nearest Neighbors (k-NN) to measure a basic QIF metric, the Bayes vulnerability (Smith, 2009). This is the expected probability of success of an adversary that has exactly one attempt at its disposal (one-try) and tries to maximize the chance of guessing the right secret. The Bayes vulnerability corresponds to the converse of the error of the ideal Bayes classifier that, given any sample (observable), tries to predict its corresponding class (secret). Hence the idea is to build a model that approximates such a classifier, and to estimate its expected error. The main takeaway is that any ML rule which is universally consistent (i.e., approximates the ideal Bayes classifier) comes with a guarantee on the accuracy of the estimation: the error in the estimation of the leakage tends to vanish as the number of training samples grows large.

The method of (Cherubin et al., 2019), however, is limited to the Bayes vulnerability, which models only one particular kind of adversary. As argued in (Alvim et al., 2012), there are many other realistic ones, for instance adversaries whose aim is to guess only a part of the secret, or a property of the secret, or that have multiple tries at their disposal. To represent a larger class of attacks, (Alvim et al., 2012) introduced the so-called g-vulnerability. This metric is very general, and in particular it encompasses the Bayes vulnerability.

Symbol : Description
x ∈ 𝒳 : a secret
w ∈ 𝒲 : a guess
y ∈ 𝒴 : an observable output by the system
X : random variable for secrets, taking values x ∈ 𝒳
W : random variable for guesses, taking values w ∈ 𝒲
Y : random variable for observables, taking values y ∈ 𝒴
|𝒮| : size of a set 𝒮
𝒫(𝒮) : distributions over a set of symbols 𝒮
ℋ : class of learning functions f
π, P_X : prior distribution over the secret space
π̂, P̂_X : empirical prior distribution over the secret space
C : channel matrix
π ▷ C : joint distribution from prior π and channel C
P_XY : joint probability distribution
P̂_XY : empirical joint probability distribution
P_{Y|X} : conditional probability of Y given X
P̂_{Y|X} : empirical conditional probabilities
ℙ : probability measure
𝔼[·] : expected value
g(w,x) : gain function, taking a guess w and a secret x as inputs
G : gain matrix of size |𝒲| × |𝒳| for a specific g
V_g : g-vulnerability
V(f) : g-vulnerability functional
V̂_n(f) : empirical g-vulnerability functional evaluated on n samples
Table 1. Table of symbols.

In this paper, we propose an approach to the black-box estimation of g-vulnerability via ML. The idea is to reduce the problem to that of approximating the Bayes classifier, so that any universally consistent ML algorithm can be used for the purpose. This reduction essentially takes into account the impact of the gain function in the generation of the training data, and we propose two methods to obtain this effect, which we call channel pre-processing and data pre-processing, respectively. We evaluate our approach via experiments on various channels and gain functions. In order to show the generality of our approach, we use two different ML algorithms, namely k-NN and Artificial Neural Networks (ANN), and we compare their performances. The experimental results show that our approach provides accurate estimations, and that it outperforms the frequentist approach by far when the domain of observables is large.

1.1. Our contribution

  • We propose a novel approach to the black-box estimation of g-vulnerability based on ML. To the best of our knowledge, this is the first time that a method to estimate g-vulnerability in a black-box fashion is introduced.

  • We provide statistical guarantees showing the learnability of the g-vulnerability for all distributions, and we derive distribution-free bounds on the accuracy of its estimation.

  • We validate the performance of our method via several experiments using k-NN and ANN models. The code is available at the URL https://github.com/LEAVESrepo/leaves.

1.2. Related work

One important aspect to keep in mind when measuring leakage is the kind of attack that we want to model. In (Köpf and Basin, 2007), Köpf and Basin identified various kinds of adversaries and showed that they can be captured by known entropy measures; in particular, they focused on the adversaries corresponding to Shannon entropy and guessing entropy. In (Smith, 2009), Smith proposed another notion, the Rényi min-entropy, to measure the system's leakage when the attacker has only one try at its disposal and attempts to make its best guess. The Rényi min-entropy is the logarithm of the Bayes vulnerability, which is the expected probability that the adversary guesses the secret correctly. The Bayes vulnerability is the converse of the Bayes error, which was already proposed as a measure of leakage in (Chatzikokolakis et al., 2008b).

Alvim et al. (Alvim et al., 2012) generalized the notion of Bayes vulnerability to that of g-vulnerability, by introducing a parameter (the gain function g) that describes the adversary's payoff. The g-vulnerability is the expected gain of the adversary in a one-try attack.

The idea of estimating the g-vulnerability using ML techniques is inspired by the seminal work (Cherubin et al., 2019), which used k-NN algorithms to estimate the Bayes vulnerability, a paradigm shift with respect to the previous frequentist approaches (Chatzikokolakis et al., 2010; Chothia and Guha, 2011; Chothia et al., 2014). Notably, (Cherubin et al., 2019) showed that universally consistent learning rules achieve much more precise estimations than the frequentist approach on large or even continuous output domains, which would be intractable otherwise. However, (Cherubin et al., 2019) was limited to the Bayes adversary. In contrast, our approach handles any adversary that can be modeled by a gain function g. Another novelty w.r.t. (Cherubin et al., 2019) is that we also consider ANN algorithms, which in various experiments appear to perform better than the k-NN ones.

Bordenabe and Smith (Bordenabe and Smith, 2016) investigated the indirect leakage induced by a channel (i.e., leakage on sensitive information not in the domain of the channel), and proved a fundamental equivalence between Dalenius min-entropy leakage under arbitrary correlations and g-leakage under arbitrary gain functions. This result is similar to our Theorem 4.2, and it opens the way to the possible extension of our approach to this more general leakage scenario.

2. Preliminaries

In this section, we recall some useful notions from QIF and ML.

2.1. Quantitative information flow

Let 𝒳 be a set of secrets and 𝒴 a set of observations. The adversary's initial knowledge about the secrets is modeled by a prior distribution in 𝒫(𝒳) (namely P_X). A system is modeled as a probabilistic channel from 𝒳 to 𝒴, described by a stochastic matrix C, whose elements C_xy give the probability to observe y ∈ 𝒴 when the input is x ∈ 𝒳 (namely P_{Y|X}). Running C with input π induces a joint distribution on 𝒳 × 𝒴, denoted by π ▷ C.

In the g-leakage framework (Alvim et al., 2012), an adversary is described by a set 𝒲 of guesses (or actions) that it can make about the secret, and by a gain function g(w,x) expressing the gain of selecting the guess w when the real secret is x. The prior g-vulnerability is the expected gain of an optimal guess, given a prior distribution on secrets:

(1)   $V_g(\pi) \;\stackrel{\text{def}}{=}\; \max_{w\in\mathcal{W}} \sum_{x\in\mathcal{X}} \pi_x \cdot g(w,x)\,.$

In the posterior case, the adversary observes the output of the system, which allows it to improve its guess. Its expected gain is given by the posterior g-vulnerability, defined as

(2)   $V_g(\pi,C) \;\stackrel{\text{def}}{=}\; \sum_{y\in\mathcal{Y}} \max_{w\in\mathcal{W}} \sum_{x\in\mathcal{X}} \pi_x \cdot C_{xy} \cdot g(w,x)\,.$

Finally, the multiplicative and additive g-leakage quantify how much a specific channel C increases the vulnerability of the system (originally the multiplicative version of g-leakage was defined as the log of the expression given here; in recent literature the log is no longer used, and in any case the two definitions are equivalent for system comparison, since log is monotonic):

(3)   $\mathcal{L}^{\mathrm{M}}_g(\pi,C) \;\stackrel{\text{def}}{=}\; \frac{V_g(\pi,C)}{V_g(\pi)}\,, \qquad \mathcal{L}^{\mathrm{A}}_g(\pi,C) \;\stackrel{\text{def}}{=}\; V_g(\pi,C) - V_g(\pi)\,.$

The choice of the gain function g allows us to model a variety of different adversarial scenarios. The simplest case is the identity gain function g_id, given by 𝒲 = 𝒳, g_id(w,x) = 1 iff x = w and 0 otherwise. This gain function models an adversary that tries to guess the secret exactly in one try; V_{g_id} is the Bayes vulnerability, which corresponds to the complement of the Bayes error (cfr. (Alvim et al., 2012)).

However, the interest in g-vulnerability lies in the fact that many more adversarial scenarios can be captured by a proper choice of g. For instance, taking 𝒲 = 𝒳^k with g(w,x) = 1 iff x ∈ w and 0 otherwise models an adversary that tries to guess the secret correctly in k tries. Moreover, guessing the secret approximately can easily be expressed by constructing g from a metric d on 𝒳; this is a standard approach in the area of location privacy (Shokri et al., 2011, 2012), where g(w,x) is taken to be inversely proportional to the Euclidean distance between w and x. Several other gain functions are discussed in (Alvim et al., 2012), while (Alvim et al., 2016) shows that any vulnerability function satisfying basic axioms can be expressed as V_g for a properly constructed g.

The main focus of this paper is estimating the posterior g-vulnerability of a system from input-output samples. Note that, given V_g(π, C), estimating ℒ^M_g(π, C) and ℒ^A_g(π, C) is straightforward, since V_g(π) only depends on the prior (not on the system), and it can be either computed analytically or estimated from the samples.
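As a concrete illustration of definitions (1)-(3), the following Python sketch (our own illustration, assuming the white-box quantities π, C and a gain matrix G are available as numpy arrays) computes the prior and posterior g-vulnerability and the two leakage measures:

import numpy as np

def prior_g_vulnerability(pi, G):
    # V_g(pi) = max_w sum_x pi[x] * G[w, x]
    return np.max(G @ pi)

def posterior_g_vulnerability(pi, C, G):
    # V_g(pi, C) = sum_y max_w sum_x pi[x] * C[x, y] * G[w, x]
    J = pi[:, None] * C                      # joint distribution pi ▷ C, shape |X| x |Y|
    return np.sum(np.max(G @ J, axis=0))

def g_leakage(pi, C, G):
    v_prior = prior_g_vulnerability(pi, G)
    v_post = posterior_g_vulnerability(pi, C, G)
    return v_post / v_prior, v_post - v_prior    # multiplicative and additive leakage

Of course, in the black-box setting considered in this paper C is not available, and the rest of the paper is devoted to estimating V_g(π, C) without it.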

2.2. Artificial Neural Networks

We provide a short review of the aspects of ANN that are relevant for this work. For further details, we refer to (Bishop, 2007; Goodfellow et al., 2016; Hastie et al., 2001). Neural networks are usually modeled as directed graphs with weights on the connections and nodes that forward information through “activation functions”, often introducing non-linearity (such as sigmoids or soft-max). In particular, we consider an instance of learning known as supervised learning, where input samples are provided to the network model together with target labels (supervision). From the provided data and by means of iterative updates of the connection weights, the network learns how the data and respective labels are distributed. The training procedure, known as back-propagation, is an optimization process aimed at minimizing a loss function that quantifies the quality of the network’s prediction w.r.t. the data.

Classification problems are a typical example of tasks for which supervised learning works well. Samples are provided together with target labels which represent the classes they belong to. A model can be trained using these samples and, later on, it can be used to predict the class of new samples.

According to the No Free Lunch theorem (NFL) (Wolpert, 1996), it cannot be said that, in general, ANN are better than other ML methods. However, it is well known that the NFL can be overcome when additional information on the data or on the particular problem is available; nowadays, for most applications and available data, especially in multidimensional domains, ANN models outperform other methods and represent the state of the art.

2.3. k-Nearest Neighbors

The k-NN algorithm is one of the simplest algorithms used to classify a new sample given a training set of samples labelled as belonging to specific classes. This algorithm assumes that the space of the features is equipped with a notion of distance. The basic idea is the following: every time we need to classify a new sample, we find the kk samples whose features are closest to those of the new one (nearest neighbors). Once the kk nearest neighbors are selected, a majority vote over their class labels is performed to decide which class should be assigned to the new sample. For further details, as well as for an extensive analysis of the topic, we refer to the chapters about k-NN in (Shalev-Shwartz and Ben-David, 2014; Hastie et al., 2001).
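For concreteness, a minimal sketch of the majority-vote rule just described (plain k-NN on a generic metric; this is only an illustration, not the exact implementation used in the experiments) is:

import numpy as np
from collections import Counter

def knn_predict(train_feats, train_labels, query, k, dist):
    # distances from the query to every training sample
    d = np.array([dist(query, f) for f in train_feats])
    # indices of the k nearest neighbors
    nearest = np.argsort(d)[:k]
    # majority vote over the neighbors' labels
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]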

3. Learning g-vulnerability: Statistical Bounds

This section introduces the mathematical problem of learning the g-vulnerability, characterizing universal learnability in the present framework. We derive distribution-free bounds on the accuracy of the estimation, which imply the statistical consistency of the estimator.

3.1. Main definitions

We consider the problem of estimating the g-vulnerability from samples via ML models, and we show that the analysis of this estimation can be conducted in the general statistical framework of maximizing an expected functional using observed samples. The idea can be described using three components:

  • A generator of random secrets x ∈ 𝒳, with |𝒳| < ∞, drawn independently from a fixed but unknown distribution P_X(x);

  • a channel that returns an observable y ∈ 𝒴, with |𝒴| < ∞, for every input x, according to a conditional distribution P_{Y|X}(y|x), also fixed and unknown;

  • a learning machine capable of implementing a set of rules f ∈ ℋ, where ℋ denotes the class of functions f : 𝒴 → 𝒲 and 𝒲 is the finite set of guesses.

Moreover, let us note that

(4)   $g : \mathcal{W}\times\mathcal{X} \rightarrow [a,b],$

where a and b are finite real values, and 𝒳 and 𝒲 are finite sets. The problem of learning the g-vulnerability is that of choosing the function f : 𝒴 → 𝒲 which maximizes the functional V(f), representing the expected gain, defined as:

(5)   $V(f) \;\stackrel{\text{def}}{=}\; \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} g\big(f(y),x\big)\,P_{XY}(x,y).$

Note that f(y) corresponds to the "guess" w, for the given y, in (2). The maximum of V(f) is the g-vulnerability, namely:

(6)   $V_g \;\stackrel{\text{def}}{=}\; \max_{f\in\mathcal{H}} V(f).$

3.2. Principle of the empirical g-vulnerability maximization

Since we are in the black-box scenario, the joint probability distribution P_XY ≡ π ▷ C is unknown. We assume, however, the availability of m independent and identically distributed (i.i.d.) samples drawn according to P_XY, which can be used as a training set 𝒟_m to solve the maximization of f over ℋ; additionally, n i.i.d. samples are available to be used as a validation set 𝒯_n to estimate the average in (5). (We call 𝒯_n a validation set rather than a test set, since we use it to estimate the g-vulnerability with the learned f*_m, rather than to measure the quality of f*_m.) These sets are denoted by 𝒟_m = {(x_1, y_1), …, (x_m, y_m)} and 𝒯_n = {(x_{m+1}, y_{m+1}), …, (x_{m+n}, y_{m+n})}, respectively.

In order to maximize the g-vulnerability functional (5) for an unknown probability measure P_XY, the following principle is usually applied: the expected g-vulnerability functional V(f) is replaced by the empirical g-vulnerability functional

(7)   $\widehat{V}_n(f) \;\stackrel{\text{def}}{=}\; \frac{1}{n}\sum_{(x,y)\in\mathcal{T}_n} g\big(f(y),x\big),$

which is evaluated on 𝒯_n rather than P_XY. This estimator is clearly unbiased, in the sense that 𝔼[V̂_n(f)] = V(f).
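In practice, once a rule f has been learned, evaluating (7) is just a matter of averaging gains over the validation set. A minimal sketch (with f given as a function from observables to indices of guesses and the gain g as a matrix G, both hypothetical names) is:

def empirical_g_vulnerability(f, samples, G):
    # samples: list of (x, y) pairs drawn from P_XY (the validation set T_n)
    # G[w, x]: gain of guessing w when the secret is x
    return sum(G[f(y), x] for x, y in samples) / len(samples)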

Let f*_m denote the empirical optimal rule given by

(8)   $f^\star_m \;\stackrel{\text{def}}{=}\; \operatorname*{arg\,max}_{f\in\mathcal{H}} \widehat{V}_m(f), \qquad \widehat{V}_m(f) \;\stackrel{\text{def}}{=}\; \frac{1}{m}\sum_{(x,y)\in\mathcal{D}_m} g\big(f(y),x\big),$

which is evaluated on 𝒟_m rather than P_XY. The function f*_m is the optimizer according to 𝒟_m, namely the best way, among the functions f : 𝒴 → 𝒲 in the class ℋ, to approximate V_g by maximizing V̂_m(f). This principle is known in statistics as Empirical Risk Maximization (ERM).

Intuitively, we would like f*_m to give a good approximation of the g-vulnerability, in the sense that its expected gain

(9)   $V(f^\star_m) \;=\; \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} g\big(f^\star_m(y),x\big)\,P_{XY}(x,y)$

should be close to V_g. Note that the difference

(10)   $V_g - V(f^\star_m) \;=\; \max_{f\in\mathcal{H}} V(f) - V(f^\star_m)$

is always non-negative and represents the gap due to selecting a possibly suboptimal function f*_m. Unfortunately, we are not able to compute V(f*_m) either, since P_XY is unknown. In its place, we have to use its estimation V̂_n(f*_m). Hence, the real estimation error is |V_g − V̂_n(f*_m)|. Note that:

(11)   $\big|V_g - \widehat{V}_n(f^\star_m)\big| \;\leq\; \big(V_g - V(f^\star_m)\big) + \big|\widehat{V}_n(f^\star_m) - V(f^\star_m)\big|,$

where the first term on the right-hand side represents the error induced by using the trained model f*_m, and the second represents the error induced by estimating the g-vulnerability over the n samples of the validation set.

By using basics principles from statistical learning theory, we study two main questions:

  • When does the estimator V̂_n(f*_m) work? What are the conditions for its statistical consistency?

  • How well does V̂_n(f*_m) approximate V_g? In other words, how fast does the sequence of largest empirical g-vulnerability values converge to the largest g-vulnerability? This is related to the so-called rate of generalization of a learning machine that implements the ERM principle.

3.3. Bounds on the estimation accuracy

In the following we use the notation σ²_f = Var(g(f(Y), X)), where Var(Z) stands for the variance of the random variable Z. Namely, σ²_f is the variance of the gain for a given function f. We also use ℙ to represent the probability induced by sampling the training and validation sets according to the distribution P_XY.

The next proposition, proved in Appendix B.1, states that we can probabilistically bound the two parts of (11) in terms of the sizes m and n of the training and validation sets.

Proposition 3.1 (Uniform deviations).

For all ε > 0,

(12)   $\mathbb{P}\!\left(\big|\widehat{V}_n(f^\star_m) - V(f^\star_m)\big| \geq \varepsilon\right) \;\leq\; 2\,\mathbb{E}\exp\!\left(-\frac{n\varepsilon^2}{2\sigma^2_{f^\star_m} + 2(b-a)\varepsilon/3}\right),$

where the expectation is taken w.r.t. the random training set, and

(13)   $\mathbb{P}\!\left(V_g - V(f^\star_m) \geq \varepsilon\right) \;\leq\; 2\sum_{f\in\mathcal{H}}\exp\!\left(-\frac{m\varepsilon^2}{8\sigma^2_f + 4(b-a)\varepsilon/3}\right).$

Inequality (12) shows that the estimation error due to using the validation-set estimate V̂_n(f*_m) instead of the true expected gain V(f*_m) vanishes with the number of validation samples. Expression (13), in turn, implies 'learnability' of an optimal f, i.e., the suboptimality of the rule f*_m learned from the training set vanishes with the number of training samples.

The bound in (13), however, depends on the underlying distribution and on the structural properties of the class ℋ through the variance. It does not explicitly depend on the optimization algorithm (e.g., stochastic gradient descent) used to learn the function f*_m from the training set; this is because the bound assumes the worst case over all optimization procedures. The selection of a "good" subset of candidate functions, having a high probability of containing an almost optimal classifier, is a very active area of research in ML (Chizat and Bach, 2020), and hopefully results will become available soon which, together with ours, will provide a practical method to estimate the bound on the errors. In Appendix F we discuss heuristics to choose a "good" model, together with a new set of experiments showing the impact of different architectures.

Theorem 3.2.

The averaged estimation error of the g-vulnerability can be bounded as follows:

(14)   $\mathbb{E}\big|V_g - \widehat{V}_n(f^\star_m)\big| \;\leq\; V_g - \mathbb{E}\big[V(f^\star_m)\big] + \mathbb{E}\big|V(f^\star_m) - \widehat{V}_n(f^\star_m)\big|,$

where the expectations are understood over all possible training and validation sets drawn according to P_XY. Furthermore, let σ² = max_{f∈ℋ} Var(g(f(Y), X)); then:

(15)   $\mathbb{E}\big|V(f^\star_m) - \widehat{V}_n(f^\star_m)\big| \;\leq\; \frac{4\eta}{n}\exp\!\left(-\frac{n\sigma^2}{2\eta}\right) + \sqrt{\frac{2\sigma^2\eta\pi}{n}}\;\operatorname{erf}\!\left(\sigma^2\Big/\sqrt{\tfrac{2\sigma^2\eta\pi}{n}}\right),$

where η = 1 + (b − a)/3, for σ² ≤ ε, and, otherwise,

(16)   $V_g - \mathbb{E}\big[V(f^\star_m)\big] \;\leq\; |\mathcal{H}|\,\frac{8(1+\eta)}{m}\exp\!\left(-\frac{m\sigma^2}{4(1+\eta)}\right) + |\mathcal{H}|\,\sqrt{\frac{4\sigma^2(1+\eta)\pi}{m}}\;\operatorname{erf}\!\left(\sigma^2\Big/\sqrt{\tfrac{4\sigma^2(1+\eta)\pi}{m}}\right),$

with $\operatorname{erf}(\theta) \stackrel{\text{def}}{=} \frac{2}{\sqrt{\pi}}\int_0^{\theta}\exp(-\mu^2)\,d\mu$.

Interestingly, the term V_g − 𝔼[V(f*_m)] is the average error induced when estimating the function f*_m using the m samples of the training set, while 𝔼|V(f*_m) − V̂_n(f*_m)| indicates the average error incurred when estimating the g-vulnerability using the n samples of the validation set. Clearly, the scaling of these two bounds with the number of samples is very different, which can be made evident by using order notation:

(17)   $\sup_{P_{XY}} \mathbb{E}\big|V(f^\star_m) - \widehat{V}_n(f^\star_m)\big| \;\in\; \mathcal{O}\!\left(\frac{1}{\sqrt{n}}\right),$
(18)   $\sup_{P_{XY}} \big\{V_g - \mathbb{E}[V(f^\star_m)]\big\} \;\in\; \mathcal{O}\!\left(\frac{|\mathcal{H}|}{\sqrt{m}}\right).$

These bounds indicate that the error in (17) vanishes much faster than the error in (18); thus the size of the training set, in general, should be kept larger than the size of the validation set, i.e., n ≪ m.

3.4. Sample complexity

We now study how large the training and validation sets should be in order to get a good estimation. For ε, δ > 0, we define the sample complexity as the smallest integers M(ε, δ) and N(ε, δ) sufficient to guarantee that the gap between the true g-vulnerability and the estimate V̂_n(f*_m) is at most ε with probability at least 1 − δ:

Definition 3.3.

For ε, δ > 0, let (M(ε, δ), N(ε, δ)) be the set of pairs of smallest sizes (m, n) of training and validation sets such that:

(19)   $\sup_{P_{XY}} \mathbb{P}\!\left[\,\big|V_g - \widehat{V}_n(f^\star_m)\big| > \varepsilon\,\right] \;\leq\; \delta.$

The next result says that we can bound the sample complexity in terms of ε, δ, σ, and |b − a| (see Appendix B.3 for the proof).

Corollary 3.4.

The sample complexity of the ERM algorithm for the g-vulnerability is bounded from above by the set of values satisfying:

(20)   $M(\varepsilon,\delta) \;\leq\; \frac{8\sigma^2 + 4(b-a)\varepsilon/3}{\varepsilon^2}\,\ln\!\left(\frac{2|\mathcal{H}|}{\delta-\Delta}\right),$
(21)   $N(\varepsilon,\delta) \;\leq\; \frac{2\sigma^2 + 2(b-a)\varepsilon/3}{\varepsilon^2}\,\ln\!\left(\frac{2}{\Delta}\right),$

for all Δ such that 0 < Δ < δ.
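As an illustration, the bounds (20) and (21) can be evaluated numerically once ε, δ, Δ, σ², |ℋ| and b − a are fixed; the following sketch (with purely illustrative parameter names) returns the corresponding upper bounds on the training and validation set sizes:

import numpy as np

def sample_complexity(eps, delta, Delta, sigma2, H_size, b_minus_a):
    # upper bounds from (20) and (21); requires 0 < Delta < delta
    M = (8 * sigma2 + 4 * b_minus_a * eps / 3) / eps**2 * np.log(2 * H_size / (delta - Delta))
    N = (2 * sigma2 + 2 * b_minus_a * eps / 3) / eps**2 * np.log(2 / Delta)
    return int(np.ceil(M)), int(np.ceil(N))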

The theoretical results of this section are very general and cannot take full advantage of the specific properties of a particular model or data distribution. In particular, it is important to emphasize that the upper bound in (11) is independent of the learned function f*_m, because |V̂_n(f*_m) − V(f*_m)| ≤ max_{f∈ℋ} |V̂_n(f) − V(f)| and because of (39); thus these bounds are independent of the specific algorithm and training sets used to solve the optimization in (8). Furthermore, the f maximizing |V(f) − V̂_n(f)| in those inequalities is not necessarily the one the algorithm would choose. Hence the bounds given in Theorem 3.2 and Corollary 3.4 are in general not tight. However, these theoretical bounds provide a worst-case measure under which learnability holds for all data sets.

In the next section, we will propose an approach for selecting f*_m and estimating V_g. The experiments in Section 5 suggest that our method usually estimates V_g much more accurately than what is indicated by Theorem 3.2.

4. From g-vulnerability to Bayes vulnerability via pre-processing

This is the core section of the paper, which describes our approach to selecting the rule f*_m used to estimate V_g.

In principle, one could train a neural network to learn f*_m by using −V̂_m(f) as the loss function and minimizing it over the m training samples (cfr. Equation 8). However, this would require V̂_m(f) to be a differentiable function of the weights of the neural network, so that its gradient can be computed during back-propagation. The problem is that the g component of V̂_m(f) is essentially a non-differentiable function, so it would need to be approximated by a suitable differentiable (surrogate) function (as is done, e.g., for the Bayes error via the cross-entropy). Finding an adequate differentiable function to replace each possible g may be a challenging task in practice, and if the surrogate does not preserve the original dynamics of the gradient of g with respect to f, the learned f will be far from optimal.

In order to circumvent this issue, we propose a different approach, which presents two main advantages:

  1. (1)

    it reduces the problem of learning f*_m to a standard classification problem, therefore it does not require a different loss function to be implemented for each adversarial scenario;

  2. (2)

    it can be implemented by using any universally consistent learning algorithm (i.e., any ML algorithm approximating the ideal Bayes classifier).

The reduction described in the above list (item 1) is based on the idea that, in the g-leakage framework, the adversary's goal is not to directly infer the actual secret x, but rather to select the optimal guess w about the secret. As a consequence, the training of the ML classifier that produces f*_m should not be done on pairs of type (x, y), but rather on pairs of type (w, y), expressing the fact that the best guess, in the particular run which produced y, is w. This shift from (x, y) to (w, y) is obtained via a pre-processing step, and we propose two distinct and systematic ways to perform this transformation, called data pre-processing and channel pre-processing, respectively. The two methods are illustrated in the following sections.

We recall that, in line with Section 3, we restrict, w.l.o.g., to non-negative g's. If g takes negative values, then it can be shifted by adding −min_{w,x} g(w,x), without consequences for the g-leakage value (cfr. (Alvim et al., 2012, 2014)). Furthermore, we assume that there exists at least one pair (x, w) such that π_x · g(w,x) > 0; otherwise V_g would be 0 and the problem of estimating it would be trivial.

4.1. Data pre-processing

The data pre-processing technique is completely black-box, in the sense that it does not need any access to the channel. We only assume the availability of a set of pairs of type (x, y), sampled according to π ▷ C, the input-output distribution of the channel. This set could be provided by a third party, for example. We divide the set into 𝒟_m (training) and 𝒯_n (validation), containing m and n pairs, respectively.

For the sake of simplicity, to describe this technique we assume that g takes only integer values, in addition to being non-negative. The construction for the general case is discussed in Appendix C.3.

Input: 𝒟_m; Output: 𝒟'_m';
1. 𝒟'_m' := ∅;
2. For each x, y, let u_xy be the number of copies of (x, y) in 𝒟_m;
3. For each x, y, w, add u_xy · g(w, x) copies of (w, y) to 𝒟'_m'.
Algorithm 1. Algorithm for data pre-processing

The idea behind the data pre-processing technique is that the effect of the gain function can be represented in the transformed dataset by amplifying the impact of the guesses in proportion to their reward. For example, consider a pair (x, y) in 𝒟_m, and assume that the reward for the guess w is g(w, x) = 5, while for another guess w' it is g(w', x) = 1. Then in the transformed dataset 𝒟'_m' this pair will contribute 5 copies of (w, y) and only 1 copy of (w', y). The transformation is described in Algorithm 1. Note that in general it causes an expansion of the original dataset.
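A direct, if naive, Python implementation of Algorithm 1 (assuming, as in the text, non-negative integer gains, and with g given as a function) might look as follows:

from collections import Counter

def data_preprocessing(D_m, W, g):
    # D_m: list of (x, y) pairs; W: iterable of guesses; g(w, x): non-negative integer gain
    counts = Counter(D_m)                      # u_xy: number of copies of (x, y) in D_m
    D_prime = []
    for (x, y), u_xy in counts.items():
        for w in W:
            D_prime.extend([(w, y)] * (u_xy * g(w, x)))
    return D_prime                             # the expanded training set D'_m'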

Estimation of V_g

Given 𝒟_m, we construct the set 𝒟'_m' of pairs (w, y) according to Algorithm 1. Then, we use 𝒟'_m' to train a classifier f*_m', using an algorithm that approximates the ideal Bayes classifier. As proved below, f*_m' gives the same mapping 𝒴 → 𝒲 as the optimal empirical rule f*_m on 𝒟_m (cfr. Subsection 3.2). Finally, we use f*_m and 𝒯_n to compute the estimation of V_g(π, C) as in (7), with f replaced by f*_m.

Correctness

We first need some notation. For each (w, y), define:

(22)   $U(w,y) \;\stackrel{\text{def}}{=}\; \sum_x \pi_x \cdot C_{xy} \cdot g(w,x)\,,$

which represents the "ideal" proportion of copies of (w, y) that 𝒟'_m' should contain. From U(w, y) we can derive the ideal joint distribution on 𝒲 × 𝒴 and the marginal on 𝒲:

(23)   $P_{WY}(w,y) \;\stackrel{\text{def}}{=}\; \frac{U(w,y)}{\alpha}\,, \qquad \text{where } \alpha \;\stackrel{\text{def}}{=}\; \sum_{y,w} U(w,y)\,,$

(note that α > 0 because of the assumptions on π and g), and

(24)   $\xi_w \;\stackrel{\text{def}}{=}\; \sum_y P_{WY}(w,y)\,.$

The channel of the conditional probabilities of y given w is:

(25)   $E_{wy} \;\stackrel{\text{def}}{=}\; \frac{P_{WY}(w,y)}{\xi_w}\,.$

Note that P_WY = ξ ▷ E. By construction, it is clear that the set 𝒟'_m' generated by Algorithm 1 could have been generated, with the same probability, by sampling ξ ▷ E. The following theorem, whose proof is in Appendix C.1, establishes that the g-vulnerability of π ▷ C is equivalent to the Bayes vulnerability of ξ ▷ E, and hence it is correct to estimate f*_m as an empirical Bayes classifier f*_m' trained on 𝒟'_m'.
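When π, C and the gain matrix G are known, the quantities (22)-(25) can be computed directly; the following sketch makes the construction explicit (it is only needed for the correctness argument, not for the black-box estimation itself, and it assumes that every guess has strictly positive mass ξ_w):

import numpy as np

def ideal_joint(pi, C, G):
    U = G @ (pi[:, None] * C)        # U[w, y] = sum_x pi[x] * C[x, y] * g(w, x), see (22)
    alpha = U.sum()                  # normalization constant, see (23)
    P_WY = U / alpha                 # ideal joint distribution on W x Y
    xi = P_WY.sum(axis=1)            # marginal on W, see (24)
    E = P_WY / xi[:, None]           # channel from W to Y, see (25)
    return alpha, P_WY, xi, E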

Theorem 4.1 (Correctness of data pre-processing).

Given a prior π, a channel C, and a gain function g, we have

$V_g(\pi,C) \;=\; \alpha \cdot V_{g_{\mathrm{id}}}(\xi,E)\,,$

where α, ξ and E are those defined in (23), (24) and (25), respectively, and g_id is the identity gain function (cfr. Section 2), i.e., the gain function corresponding to the Bayesian adversary.

Estimation error

To reason about the error we need to consider the optimal empirical classifiers. Assuming that we can perfectly match the proportions U(w, y) above with the counts u_xy · g(w, x) produced by Algorithm 1, we can repeat the same reasoning as above, thus obtaining V̂_m(f) = α · V̂_m'(f), where V̂_m(f) is the empirical functional defined in (8), and V̂_m'(f) is the corresponding empirical functional for g_id evaluated on 𝒟'_m':

(26)   $\widehat{V}_{m'}(f) \;\stackrel{\text{def}}{=}\; \frac{1}{m'}\sum_{(w,y)\in\mathcal{D}'_{m'}} g_{\mathrm{id}}\big(f(y),w\big)$

f*_m' is the maximizer of this functional, i.e., f*_m' = argmax_{f∈ℋ} V̂_m'(f). Therefore we have:

$f^\star_m = \operatorname*{arg\,max}_{f\in\mathcal{H}} \widehat{V}_m(f) = \operatorname*{arg\,max}_{f\in\mathcal{H}} \big(\alpha\cdot\widehat{V}_{m'}(f)\big) = \operatorname*{arg\,max}_{f\in\mathcal{H}} \widehat{V}_{m'}(f) = f^\star_{m'}.$

A bound on the estimation error of this method can therefore be obtained by applying the theory developed in the previous section to the Bayes classification problem. Recall that the estimation error of f*_m is |V_g − V̂_n(f*_m)|. With respect to the estimation error of the corresponding Bayes classifier, we have a magnification by a factor α, as shown by the following derivation, where V̂_n' represents the empirical functional for the Bayes classifier:

$\big|V_g - \widehat{V}_n(f^\star_m)\big| = \big|\alpha\,V_{g_{\mathrm{id}}} - \alpha\,\widehat{V}_{n'}(f^\star_{m'})\big| = \alpha\,\big|V_{g_{\mathrm{id}}} - \widehat{V}_{n'}(f^\star_{m'})\big|.$

However, the normalized estimation error (cfr. Section 5) remains the same, because both numerator and denominator are magnified by the same factor α.

Concerning the probability that the error is above a certain threshold ε, we have the same bounds as those for the Bayes classifier in Proposition 3.1 and in the other results of the previous section, where ε is replaced by αε, m and n by m' and n', σ² by α²σ², and |b − a| = 1 (because it is a Bayes classifier). It may sound a bit surprising that the error in the estimation of the g-vulnerability is not much worse than in the estimation of the Bayes error, but we recall that we are essentially solving the same problem, only with every quantity magnified by a factor α. Also, we are assuming that we can match the proportions U(w, y) perfectly with the counts generated from the u_xy. When g ranges over a large domain this may not be possible, and we should rather resort to the channel pre-processing method described in the next section.

4.2. Channel pre-processing

For this technique we assume black-box access to the system, i.e., that we can execute the system while controlling each input, and collect the corresponding output.

The core idea behind this technique is to transform the input of C into entries of type w, and to ensure that the distribution on the w's reflects the corresponding rewards expressed by g.

More formally, let us define a distribution τ on 𝒲 as follows:

(27)   $\tau_w \;\stackrel{\text{def}}{=}\; \frac{\sum_x \pi_x \cdot g(w,x)}{\beta}\,, \qquad \text{where } \beta \;\stackrel{\text{def}}{=}\; \sum_{x,w} \pi_x \cdot g(w,x)\,,$

(note that β is strictly positive because of the assumptions on g and π), and let us define the following matrix R from 𝒲 to 𝒳:

(28)   $R_{wx} \;\stackrel{\text{def}}{=}\; \frac{1}{\beta}\cdot\frac{1}{\tau_w}\cdot\pi_x\cdot g(w,x)\,.$

It is easy to check that R is a stochastic matrix, hence the composition RC is a channel. It is important to emphasize the following:

Remark. In the above definitions, β, τ and R depend solely on g and π, and not on C.
The above property is crucial to our goals, because in the black-box approach we are not supposed to rely on the knowledge of C's internals. We now illustrate how we can estimate V_g using the pre-processed channel RC.
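Since β, τ and R depend only on g and π, they can be precomputed before ever querying the system. A minimal sketch of the construction in (27) and (28), with the gain given as a matrix G (our own notation) and assuming every guess has strictly positive expected gain, is:

import numpy as np

def channel_preprocessing(pi, G):
    # G[w, x] = g(w, x); the channel C is not needed here
    beta = (G * pi[None, :]).sum()                   # beta = sum_{x,w} pi[x] * g(w, x)
    tau = (G @ pi) / beta                            # distribution on guesses, see (27)
    R = (pi[None, :] * G) / (beta * tau[:, None])    # stochastic matrix from W to X, see (28)
    return beta, tau, R

Training pairs of type (w, y) are then obtained by sampling w according to τ, drawing an input x according to the row R_w, and feeding x to the system to collect the output y.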

Estimation of V_g

Given RC and τ, we build a set 𝒟''_m'' consisting of pairs of type (w, y) sampled from τ ▷ RC. We also construct a set 𝒯_n of pairs of type (x, y) sampled from π ▷ C. Then, we use 𝒟''_m'' to train a classifier f*_m, using an algorithm that approximates the ideal Bayes classifier. Finally, we use f*_m and 𝒯_n to compute the estimation of V_g(π, C) as in (7), with f replaced by f*_m.

Alternatively, we could estimate V_g(π, C) by computing the empirical Bayes error of f*_m on a validation set of pairs of type (w, y) sampled from τ ▷ RC, but the estimation would be less precise. Intuitively, this is because RC is more "noisy" than C.

Correctness

The correctness of the channel pre-processing method is given by the following theorem, which shows that we can learn f*_m by training a Bayes classifier on a set sampled from τ ▷ RC.

Theorem 4.2 (Correctness of channel pre-processing).

Given a prior π and a gain function g, we have that, for any channel C:

$V_g(\pi,C) \;=\; \beta \cdot V_{g_{\mathrm{id}}}(\tau,RC)\,,$

where β, τ and R are those defined in (27) and (28).

Interestingly, a result similar to Theorem 4.2 is also given in (Bordenabe and Smith, 2016), although the context is completely different from ours: the focus of (Bordenabe and Smith, 2016) is to study how the leakage of C on X may also induce leakage of other sensitive information Z that has nothing to do with C (in the sense that Z is not information manipulated by C). We intend to explore this connection in the context of a possible extension of our approach to this more general scenario.

Estimation error

Concerning the estimation error, the results are essentially the same as in the previous section (with α replaced by β). As for the bound on the probability of error, the results are worse, because the original variance σ² is magnified by the channel pre-processing, which introduces a further factor of randomness in the sampling of the training data in (12); in practice, this makes the bound more difficult to estimate.

4.3. Pros and cons of the two methods

The fundamental advantage of data pre-processing is that it allows us to estimate V_g from just samples of the system, without even black-box access. In contrast to channel pre-processing, however, this method is particularly sensitive to the values of the gain function g. Large gain values will increase the size of 𝒟'_m', with a consequent increase of the computational cost of estimating the g-vulnerability. Moreover, if g takes real values, then we need to apply the technique described in Appendix C.3, which can also lead to a large increase of the dataset. In contrast, the channel pre-processing method has the advantage of controlling the size of the training set, but it can be applied only when it is possible to interact with the channel by providing inputs and collecting outputs. Finally, from the precision point of view, we expect the estimation based on data pre-processing to be more accurate when g consists of small integers, because channel pre-processing introduces some extra noise in the channel.

5. Evaluation

In this section we evaluate our approach to the estimation of the g-vulnerability. We consider four different scenarios:

  1. (1)

    𝒳 is a set of (synthetic) numeric data, the channel C consists of geometric noise, and g is the multiple-guesses gain function, representing an adversary that is allowed to make several attempts to discover the secret.

  2. (2)

    𝒳 is a set of locations from the Gowalla dataset (Gow, 2011), C is the optimal noise of Shokri et al. (Shokri et al., 2012), and g is one of the functions used to evaluate the privacy loss in (Shokri et al., 2012), namely a function anti-monotonic on the distance, representing the idea that the closer the adversary's guess is to the target (i.e., the real location), the more he gains.

  3. (3)

    𝒳 is the Cleveland heart disease dataset (Dua and Graff, 2017), C is a differentially private (DP) mechanism (Dwork et al., 2006; Dwork, 2006), and g assigns higher values to worse heart conditions, modeling an adversary that aims at discovering whether a patient is at risk (for instance, to deny his application for health insurance).

  4. (4)

    𝒳 is a set of passwords of 128 bits, and C is a password checker that leaks the time before the check fails, but mitigates timing attacks by applying some random delay and the bucketing technique (see, for example, (Köpf and Dürmuth, 2009)). The function g represents the part of the password under attack.

For each scenario, we proceed in the following way:

  • We consider 3 different sample sizes for the training sets that are used to train the ML models and learn the 𝒴 → 𝒲 remapping. This is to evaluate how the precision of the estimate depends on the amount of available data, and on its relation with the size of 𝒴.

  • In order to evaluate the variance of the precision, for each training size we create 5 different training sets, and

  • for each trained model we estimate the g-vulnerability using 50 different validation sets.

5.1. Representation of the results and metrics

We graphically represent the results of the experiments as box plots, using one box for each size. More precisely, given a specific size, let V̂_n^{ij} be the g-vulnerability estimate on the j-th validation set computed with a model trained over the i-th training set (where i ∈ {1, …, 5} and j ∈ {1, …, 50}). Let V_g be the real g-vulnerability of the system. We define the normalized estimation error δ_ij and the mean δ̄ of the δ_ij's as follows:

(29)   $\delta_{ij} \;\stackrel{\text{def}}{=}\; \frac{\big|\widehat{V}^{ij}_n - V_g\big|}{V_g}\,, \qquad \text{with} \qquad \overline{\delta} \;\stackrel{\text{def}}{=}\; \frac{1}{250}\sum_{i=1}^{5}\sum_{j=1}^{50}\delta_{ij}\,.$

In the graphs, the δ_ij's are reported next to the box corresponding to the size, and δ̄ is the black horizontal line inside the box.

Thanks to the normalization, the δ_ij's allow us to compare results among different scenarios and different levels of (real) g-vulnerability. Also, we argue that the percentage of the error is more meaningful than its absolute value. The interested reader can nevertheless find in Appendix H plots showing the estimates of the g-vulnerability and their distance from the real value.

We also consider the following typical measures of precision:

(30)   $\text{dispersion} \;\stackrel{\text{def}}{=}\; \sqrt{\frac{1}{250}\sum_{i=1}^{5}\sum_{j=1}^{50}\big(\delta_{ij}-\overline{\delta}\big)^2}\,,$
(31)   $\text{total error} \;\stackrel{\text{def}}{=}\; \sqrt{\frac{1}{250}\sum_{i=1}^{5}\sum_{j=1}^{50}\delta_{ij}^2}\,.$

The dispersion is an average measure of how far the normalized estimation errors are from their mean value when using same-size training and validation sets. On the other hand, the total error is an average measure of the normalized estimation error, when using same-size training and validation sets.
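In terms of code, the metrics (29)-(31) amount to a few lines (here delta collects the normalized errors of the 5 × 50 estimates; names are our own):

import numpy as np

def evaluation_metrics(V_hat, V_g):
    # V_hat: array of shape (5, 50) with the estimates over the 5 training and 50 validation sets
    delta = np.abs(V_hat - V_g) / V_g                            # normalized errors, see (29)
    mean_delta = delta.mean()
    dispersion = np.sqrt(((delta - mean_delta) ** 2).mean())     # see (30)
    total_error = np.sqrt((delta ** 2).mean())                   # see (31)
    return mean_delta, dispersion, total_error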

In order to make a fair comparison between the two pre-processing methods, intuitively we need to use training and validation sets of "equivalent size". For the validation part, since the "best" f has already been found and we therefore no longer need any pre-processing, "equivalent size" just means the same size. But what does it mean in the case of training sets built with different pre-processing methods? Assume that we have a set of data 𝒟_m coming from a third-party collector (recall that m represents the size of the set), and let 𝒟'_m' be the result of the data pre-processing on 𝒟_m. Now, let 𝒟''_m'' be the dataset obtained by drawing samples according to the channel pre-processing method. Should we impose m'' = m or m'' = m'? We argue that the right choice is the first one, because the amount of "real" data collected is m. Indeed, 𝒟'_m' is generated synthetically from 𝒟_m and cannot contain more information about C than 𝒟_m, despite its larger size.

5.2. Learning algorithms

We consider two ML algorithms in the experiments: k-Nearest Neighbors (k-NN) and Artificial Neural Networks (ANN). We have, however, made a slight modification to the k-NN algorithm, for the following reason: recall that, depending on the particular gain function, the data pre-processing method might create many instances where a certain observable y is repeated multiple times, paired with different w's. For k-NN, a very common choice is to use a number of neighbors equal to the natural logarithm of the total number of training samples. When the data pre-processing is applied, this means that k = log(m') nearest neighbors will be considered for the classification decision. Since log(m') grows slowly with respect to m', it might happen that k-NN fails to find the subset of neighbors from which the best remapping can be learned. To address this problem, we modify the k-NN algorithm in the following way: instead of looking for neighbors among all the m' samples, we only consider a subset of l ≤ m' samples in which each value y appears only once. After the log(l) neighbors have been found among these l samples, we select w according to a majority vote over the m' tuples (w, y) created through the remapping.
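The following sketch is our own reconstruction of this modified rule: it deduplicates the observables, finds the log(l) nearest neighbors among the deduplicated samples, and then takes the majority vote over all the pre-processed tuples whose observable is one of those neighbors.

import numpy as np
from collections import Counter

def modified_knn_predict(D_prime, query_y, dist):
    # D_prime: list of (w, y) tuples produced by the data pre-processing
    ys = list({y for _, y in D_prime})                    # l distinct observables
    k = max(1, int(np.log(len(ys))))                      # log(l) neighbors
    neighbors = set(sorted(ys, key=lambda y: dist(query_y, y))[:k])
    votes = Counter(w for w, y in D_prime if y in neighbors)
    return votes.most_common(1)[0][0]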

The distance on which the notion of neighbor is based depends on the experiment. We use the standard distance between numbers in the first and fourth experiments, the Euclidean distance in the second one, and the Manhattan distance in the third one, which is a standard choice for DP.

Concerning the ANN models, their specifics are given in Appendix D. Note that, for the sake of fairness, we use the same architecture for both pre-processing methods, although we adapt the number of epochs and the batch size to the particular dataset we are dealing with.

5.3. Frequentist approach

In the experiments, we compare our method with the frequentist one. This approach was originally proposed in (Chatzikokolakis et al., 2010) for estimating mutual information, and successively extended to min-entropy leakage (Chothia et al., 2013). Although not considered in the literature, the extension to the case of g-vulnerability is straightforward. The method consists in estimating the probabilities that constitute the channel matrix C, and then calculating analytically the g-vulnerability on C. The precise definition is in Appendix E.

In (Cherubin et al., 2019) it was observed that in the case of the Bayes error the frequentist approach performs poorly when the size of the observable domain |𝒴||\mathcal{Y}| is large with respect to the available data. We want to study whether this is the case also for gg-vulnerability.

For the multiple-guesses experiment the comparison is illustrated in the next section. For the other experiments, due to lack of space, it is reported in Appendix H.

5.4. Experiment 1: multiple guesses


Figure 1. The channel of Experiment 1. The two curves represent the distributions $P_{Y|X}(\cdot|x)$ for two adjacent secrets: $x=5$ and $x=6$.

We consider a system in which the secrets $\mathcal{X}$ are the integers between $0$ and $9$, and the observables $\mathcal{Y}$ are the integers between $0$ and $15999$. Hence $|\mathcal{X}|=10$ and $|\mathcal{Y}|=16$K. The rows of the channel $C$ are geometric distributions centered on the corresponding secret:

(32) Cxy=PY|X(y|x)=λexp(ν|r(x)y|),C_{xy}=P_{Y|X}(y|x)=\lambda\,\exp(-\nu|r(x)-y|)\,,

where:

  • $\nu$ is a parameter that determines how concentrated around $y=r(x)$ the distribution is. In this experiment we set $\nu=0.002$;

  • $r$ is an auxiliary function that maps $\mathcal{X}$ to the scale of $\mathcal{Y}$, and centers $\mathcal{X}$ on $\mathcal{Y}$. Here $r(x)=1000\,x+3499.5$;

  • $\lambda=(e^{\nu}-1)/(e^{\nu}+1)$ is a normalization factor.

Figure 1 illustrates the shape of $C_{xy}$, showing the distributions $P_{Y|X}(\cdot|x)$ for two adjacent secrets $x=5$ and $x=6$. We consider an adversary that can make two attempts to discover the secret (two-tries adversary), and we define the corresponding gain function as follows. A guess $w\in\mathcal{W}$ is one of the possible combinations of $2$ different secrets from $\mathcal{X}$, i.e., $w=\{x_0,x_1\}$ with $x_0,x_1\in\mathcal{X}$ and $x_0\neq x_1$. Therefore $|\mathcal{W}|=\binom{10}{2}=45$. The gain function $g$ is then

(33) g(w,x)={1if xw0otherwise.g(w,x)=\begin{cases}1&\mbox{if }x\in w\\ 0&\mbox{otherwise}\,.\end{cases}

For this experiment we consider a uniform prior distribution $\pi$ on $\mathcal{X}$. The true $g$-vulnerability for these particular $\nu$ and $\pi$ turns out to be $V_g=0.892$. As training set sizes we consider $10$K, $30$K and $50$K samples, and $50$K samples for the validation sets.
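The following minimal sketch (illustrative Python, not the authors' code) rebuilds the channel of (32), the two-tries gain of (33) and the uniform prior, and computes the true $g$-vulnerability as $V_g=\sum_y\max_w\sum_x\pi_x\,C_{xy}\,g(w,x)$; the result should be close to the value $0.892$ reported above.

import numpy as np
from itertools import combinations

nu = 0.002
X, Y = np.arange(10), np.arange(16000)
lam = (np.exp(nu) - 1) / (np.exp(nu) + 1)
r = lambda x: 1000 * x + 3499.5
C = lam * np.exp(-nu * np.abs(r(X)[:, None] - Y[None, :]))          # 10 x 16000 channel matrix
pi = np.full(10, 1 / 10)                                             # uniform prior

W = list(combinations(range(10), 2))                                 # the 45 two-secret guesses
G = np.array([[1.0 if int(x) in w else 0.0 for x in X] for w in W])  # |W| x |X| gain matrix

joint = pi[:, None] * C                                              # entries pi_x * C_xy
Vg = np.sum(np.max(G @ joint, axis=0))                               # sum_y max_w sum_x pi_x C_xy g(w,x)
print(Vg)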

5.4.1. Data pre-processing

(cfr. Section 4.1). The plot in Figure 3 shows the performances of the k-NN and ANN models in terms of normalized estimation error, while Figure 2 shows the comparison with the frequentist approach. As we can see, the precision of the frequentist method is much lower, thus confirming that the trend observed in (Cherubin et al., 2019) for the Bayes vulnerability holds also for gg-vulnerability.


Figure 2. Multiple guesses scenario, comparison between the frequentist and the ML estimations with data pre-processing.


Figure 3. Multiple guesses scenario, magnification of the part of Figure 2 on the k-NN and ANN estimations.

It is worth noting that, in this experiment, the pre-processing of each sample $(x,y)$ creates $9$ samples (matching $y$ with each possible $w\in\mathcal{W}$ such that $w=\{x,x^{\prime}\}$ with $x^{\prime}\neq x$). This means that the sample size of the pre-processed sets is $9$ times the size of the original ones. For functions $g$ representing more than $2$ tries this pre-processing method may create training sets that are too large. In the next section we will see that the alternative channel pre-processing method can be a good compromise.
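For concreteness, a sketch of this expansion for the two-tries gain (hypothetical helper, based on the description above; the general construction is in Section 4.1):

from itertools import combinations

def preprocess_two_tries(samples):
    # samples: list of (x, y) pairs; each pair is expanded into the 9 pairs (w, y) with x in w
    guesses = list(combinations(range(10), 2))
    return [(w, y) for (x, y) in samples for w in guesses if x in w]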

5.4.2. Channel pre-processing

(cfr. Section 4.2) The results for channel pre-processing are reported in Figure 4. As we can see, the estimation is worse than with data pre-processing, especially for the k-NN algorithm. This was to be expected, since the random sampling to match the effect of $g$ introduces a further level of confusion, as explained in Section 4.2. Nevertheless, these results are still much better than the frequentist case, so channel pre-processing is a good alternative to apply when data pre-processing would generate training sets that are too large, which could be the case when the matrix representing $g$ contains large numbers with a small common divisor.


Figure 4. Multiple guesses scenario, k-NN and ANN estimation with channel pre-processing

Additional plots are provided in Appendix H.

5.5. Experiment 2: location privacy

In this section we estimate the $g$-vulnerability of a typical system for location privacy protection. We use data from the open Gowalla dataset (Gow, 2011), which contains the coordinates of users’ check-ins. In particular, we consider a square region in San Francisco, USA, centered in (latitude, longitude) = $(37.755, -122.440)$, and with $5$ km long sides. In this area Gowalla contains $35162$ check-ins.

We discretize the region into $400$ cells with sides $250$ m long, and we assume that the adversary’s goal is to discover the cell of a check-in. The frequency of the Gowalla check-ins per cell is represented by the heat-map in Figure 5. From these frequencies we can directly estimate the distribution representing the prior on the secrets (ElSalamouny and Palamidessi, 2020).
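As an illustration, the following sketch discretizes check-in coordinates into the $20\times 20$ grid of $250$ m cells and computes an empirical-frequency prior over the cells. It is only a simplified example: the data layout (one $(latitude, longitude)$ pair per check-in) and the rough degree-to-km conversions are assumptions, not part of the original setup.

import numpy as np

LAT0, LON0, SIDE_KM, CELLS = 37.755, -122.440, 5.0, 20
KM_PER_DEG_LAT = 111.0
KM_PER_DEG_LON = 111.0 * np.cos(np.radians(LAT0))

def cell_of(lat, lon):
    # distance in km from the south-west corner of the 5 km x 5 km box
    dy = (lat - LAT0) * KM_PER_DEG_LAT + SIDE_KM / 2
    dx = (lon - LON0) * KM_PER_DEG_LON + SIDE_KM / 2
    row = min(max(int(dy / 0.25), 0), CELLS - 1)
    col = min(max(int(dx / 0.25), 0), CELLS - 1)
    return row * CELLS + col                          # cell id in 0..399

def empirical_prior(checkins):                        # iterable of (lat, lon) pairs inside the box
    counts = np.zeros(CELLS * CELLS)
    for lat, lon in checkins:
        counts[cell_of(lat, lon)] += 1
    return counts / counts.sum()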

Figure 5. Heat-map representing the Gowalla check-ins distribution in the area of interest; the density of check-ins in each cell is reported in the color bar on the side

The channel $C$ that we consider here is the optimal obfuscation mechanism proposed in (Shokri et al., 2012) to protect location privacy under a utility constraint. We recall that the framework of (Shokri et al., 2012) assumes two loss functions, one for utility and one for privacy. The utility loss of a mechanism, for a certain prior, is defined as the expected utility loss of the noisy data generated according to the prior and the mechanism. The privacy loss is defined in a similar way, except that the attacker is allowed to “remap” the noisy data so as to maximize the privacy loss. For our experiment, we use the Euclidean distance as the loss function for utility, and the $g$ function defined in the next paragraph as the loss function for privacy. For further details on the construction of the optimal mechanism we refer to (Shokri et al., 2012).

We define $\mathcal{X},\mathcal{Y}$ and $\mathcal{W}$ to be the set of cells. Hence $|\mathcal{X}|=|\mathcal{Y}|=|\mathcal{W}|=400$. We consider a $g$ function representing the precision of the guess in terms of Euclidean distance: the closer the guess is to the real location, the higher the attacker's gain. Specifically, our $g$ is illustrated in Figure 6, where the central cell represents the real location $x$; for a generic “guess” cell $w$, the number written in $w$ represents $g(w,x)$. Formally, $g$ is defined as $g(w,x)=\lfloor\gamma\exp(-\alpha\,d(w,x)/l)\rceil$, where $\gamma=4$ is the maximal gain, $\alpha=0.95$ is a normalization coefficient controlling the skewness of the exponential, $d$ is the Euclidean distance and $l=250$ is the length of the cells' side. The symbol $\lfloor\cdot\rceil$ denotes rounding to the closest integer.

Figure 6. The “diamond” shape created by the gain function around the real secret; the values represent the gains assigned to each guessed cell ww when xx is the central cell.
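A minimal sketch of this gain function (illustrative Python, cells indexed as row*20+col on the $20\times 20$ grid, with the parameters given above): the gain is $4$ on the real cell and decays to $0$ with the distance, reproducing the diamond of Figure 6.

import numpy as np

GAMMA, ALPHA, L, N = 4.0, 0.95, 250.0, 20

def gain(w, x):
    wr, wc, xr, xc = w // N, w % N, x // N, x % N
    d = L * np.hypot(wr - xr, wc - xc)            # Euclidean distance in meters
    return int(np.rint(GAMMA * np.exp(-ALPHA * d / L)))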

In this experiment we consider training set sizes of $100$, $1$K and $10$K samples respectively. After applying the data pre-processing transformation, the size of the resulting datasets is approximately $18$ times that of the original ones. This was to be expected, since the sum of the values of $g$ in Figure 6 is $20$. Note that this sum and the increase factor of the dataset do not necessarily coincide, because the latter is also influenced by the prior and by the mechanism.

Figure 7 and Figure 8 show the performance of k-NN and ANN for both data and channel pre-processing. As expected, the data pre-processing method is more precise than the channel pre-processing one, although only slightly. The ANN model is also slightly better than the k-NN in most of the cases.


Figure 7. Location privacy scenario, k-NN and ANN estimation with data pre-processing


Figure 8. Location privacy scenario, k-NN and ANN estimation with channel pre-processing

5.6. Experiment 3: differential privacy

In this section we consider a popular application of DP: individual data protection in medical datasets from which we wish to extract some statistics via counting queries. It is well known that the release of exact information from the database, even if it is only the result of statistical computation on the aggregated data, can leak sensitive information about the individuals. The solution proposed by DP is to obfuscate the released information with carefully crafted noise that obeys certain properties. The goal is to make it difficult to detect whether a certain individual is in the database or not. In other words, two adjacent datasets (i.e., datasets that differ only in the presence of one individual) should have almost the same likelihood of producing a certain observable result.

In our experiment, we consider the Cleveland heart disease dataset (Dua and Graff, 2017), which consists of $303$ records of patients with a medical heart condition. Each condition is labeled by an integer indicating its severity: from $0$, representing a healthy patient, to $4$, representing a patient whose life is at risk.

We assume that the hospital publishes the histogram of the patients' records, i.e., the number of occurrences of each label. To protect the patients' privacy, the hospital sanitizes the histogram by adding geometric noise (a typical DP mechanism) to each label's count. More precisely, if the count of a label is $z_1$, the probability that the corresponding published number is $z_2$ is defined by the distribution in (32), where $x$ and $y$ are replaced by $z_1$ and $z_2$ respectively, and $r$ is the identity. Note that $z_1$, the real count, is an integer between $0$ and $303$, while its noisy version $z_2$ ranges over all integers. As for the value of $\nu$, in this experiment we set it to $1$.
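A sketch of this sanitization step (illustrative Python, not the authors' code). The two-sided geometric noise of (32), with $\nu=1$ and $r$ the identity, can be obtained as the difference of two one-sided geometric samples; the histogram below is made up for illustration.

import numpy as np

def geometric_noise(nu, rng):
    p = 1 - np.exp(-nu)
    return int(rng.geometric(p) - rng.geometric(p))   # two-sided geometric sample

def sanitize(histogram, nu=1.0, seed=None):
    rng = np.random.default_rng(seed)
    return [z + geometric_noise(nu, rng) for z in histogram]

noisy = sanitize([160, 55, 40, 35, 13])               # a noisy 5-tuple: the observable y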

The space of secrets $\mathcal{X}$ consists of two elements: the full dataset, and the dataset with one record removed. These are adjacent in the sense of DP, and, as customary in DP, we assume that the record on which the two datasets differ is the target of the adversary. The space of observables $\mathcal{Y}$ is the set of the $5$-tuples produced by the noisy counts of the $5$ labels. $\mathcal{W}$ is set to be the same as $\mathcal{X}$.

We assume that the adversary is especially interested in finding out whether the patient has a serious condition. The function gg reflects this preference by assigning higher value to higher labels. Specifically, we set 𝒲=𝒳\mathcal{W}=\mathcal{X} and

(34) g(w,x)={0,if wx1,if w=xx{0,1,2}2,if w=xx{3,4},g(w,x)=\begin{cases}0,&\mbox{if }w\neq x\\ 1,&\mbox{if }w=x\land x\in\{0,1,2\}\\ 2,&\mbox{if }w=x\land x\in\{3,4\},\end{cases}

For the estimation, we consider 10K, 30K and 50K samples for the training sets, and 50K samples for the validation set.


Figure 9. Differential privacy scenario, k-NN and ANN estimation with data pre-processing


Figure 10. Differential privacy scenario, k-NN and ANN estimation with channel pre-processing

For the experiments with k-NN we choose the Manhattan distance, which is a typical choice for DP (the Manhattan distance on histograms corresponds to the total variation distance on the distributions obtained by normalizing these histograms). In the case of data pre-processing, the size of the transformed training set $\mathcal{D}^{\prime}_{m^{\prime}}$ is about $1.2$ times that of the original $\mathcal{D}_m$. The performances of the data and channel pre-processing are shown in Figure 9 and Figure 10, respectively. Surprisingly, in this case the channel pre-processing method outperforms the data pre-processing one, although only slightly. Additional plots, including the results for the frequentist approach, can be found in Appendix H.

5.7. Experiment 4: password checker

In this experiment we consider a password checker, namely a program that tests whether a given string corresponds to the password stored in the system. We assume that string and password are sequences of $128$ bits, and that the program is “leaky”, in the sense that it checks the two sequences bit by bit and stops as soon as it finds a mismatch, reporting failure. It is well known that this opens the way to a timing attack (a kind of side-channel attack), so we assume that the system tries to mitigate the threat by adding a random delay, sampled from a Laplace distribution, and then bucketing the reported time into $128$ bins corresponding to the positions in the sequence (or, equivalently, by sampling the delay from a geometric distribution, cfr. Equation 32). Hence the channel $C$ is a $2^{128}\times 128$ stochastic matrix.

The typical attacker is an interactive one, which figures out larger and larger prefixes of the password by testing one bit at a time. We assume that the attacker has already figured out the first $6$ bits of the sequence and is trying to figure out the $7$-th. Thus the prior $\pi$ is distributed (uniformly, we assume) only on the sequences formed by the known $6$-bit prefix and all the possible remaining $122$ bits, while the $g$ function assigns $1$ to the sequences whose $7$-th bit agrees with the stored password, and $0$ otherwise. Thus $g$ is a partition gain function (Alvim et al., 2012), and its particularity is that for this kind of function data pre-processing and channel pre-processing coincide. This is because $g(w,x)$ is either $0$ or $1$, so in both cases we generate exactly one pair $(w,y)$ for each pair $(x,y)$ for which $g(w,x)=1$. Note that in this case the data pre-processing transformation does not increase the size of the training set, and the channel pre-processing transformation does not introduce any additional noise. The $RC$ matrix (cfr. Section 4.1) is a $2\times 128$ stochastic matrix.
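A sketch of one query of this system (hypothetical simulation, not the authors' code): the secret $x$ is the stored password, the attacker submits a fixed test string sharing the known $6$-bit prefix, and the observable $y$ is the bucket of the first mismatching position, perturbed by two-sided geometric noise as in (32) and clipped to the $128$ buckets.

import numpy as np

rng = np.random.default_rng(0)
N_BITS, KNOWN = 128, 6
prefix = rng.integers(0, 2, KNOWN)                        # the 6 bits already recovered
test_string = np.concatenate([prefix, rng.integers(0, 2, N_BITS - KNOWN)])

def draw_secret():                                        # prior: known prefix, rest uniform
    return np.concatenate([prefix, rng.integers(0, 2, N_BITS - KNOWN)])

def observe(secret, nu=1.0):
    mismatches = np.flatnonzero(secret != test_string)
    t = mismatches[0] if mismatches.size else N_BITS - 1  # position of the first mismatch
    p = 1 - np.exp(-nu)
    noise = rng.geometric(p) - rng.geometric(p)           # geometric delay (cfr. Eq. (32))
    return int(np.clip(t + noise, 0, N_BITS - 1))

x = draw_secret()                                         # one training pair (x, y)
y = observe(x)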


Figure 11. Password checker scenario, k-NN and ANN estimation with data and channel pre-processing

The experiments are done with training sets of 10K, 30K and 50K samples. The results are reported in Figure 11. We note that the estimation error is quite small, especially in the ANN case. This is because the learning problem is particularly simple: by considering the $g$-leakage and the pre-processing, we have managed to reduce the problem to learning a function of type $\mathcal{Y}\rightarrow\mathcal{W}$, rather than $\mathcal{Y}\rightarrow\mathcal{X}$, and there is a huge difference in size between $\mathcal{W}$ and $\mathcal{X}$ (the former has size $2$ and the latter $2^{128}$). The frequentist approach also does quite well (cfr. Appendix H), and this is because $\mathcal{Y}$ is small. With a finer bucketing (on top of the Laplace delay), or no bucketing at all, we expect the difference between the accuracy of the frequentist and of the ML estimation to be much larger.

5.8. Discussion

In almost all the experiments, our method gives much better results than the frequentist approach (see Figure 2 and the other plots in Appendix H). The exception of the second experiment can be explained by the fact that the observable space is not very large, a scenario in which the frequentist approach can be successful because the available data is enough to estimate the real distribution. In general, with the frequentist approach there is no real learning; therefore, if $|\mathcal{Y}|$ is large and the training set contains few samples, we cannot make a good guess for observables never seen before (Cherubin et al., 2019). With ML, on the contrary, we can still make an informed guess, as ML models are able to generalize from samples, especially when the Bayes error is small.

We observe that ANN outperforms k-NN in all experiments. This is because ANN models are usually better at generalizing, and hence provide better classifiers. In particular, k-NN does not perform well when the distributions are not smooth with respect to the metric on which the neighbor relation is based (Cherubin et al., 2019). The data pre-processing method gives better results than the channel pre-processing one in all experiments except the third one (DP), in which the difference is very small. The main advantage of the channel pre-processing method arises when the gain function is such that data pre-processing would generate a training set that is too large, as explained in Section 4.3.

The experiments show that our method is not too sensitive to the size of $|\mathcal{Y}|$. On the other hand, the size of $|\mathcal{X}|$ is important, because ML classifiers are in general more precise (i.e., they better approximate the ideal Bayes classifier) when the number of classes is small. This affects the estimation error of both the Bayes vulnerability and the $g$-vulnerability. However, for the latter there is also the additional problem of the magnification due to $g$. To better understand this point, consider a modification of the last experiment, and assume that the password checker is not leaky, i.e., the observables are only $\mathit{fail}$ or $\mathit{success}$. A pair $(x,\mathit{success})$ would have a negligible probability of appearing in the training set, hence our method would most likely estimate the vulnerability to be $0$. This is fine if we are trying to estimate the Bayes vulnerability, which is also negligible. But the $g$-vulnerability may not be negligible, in particular if we consider a $g$ that gives an enormous gain to the success case. If we can ensure that all the pairs $(x,y)$ are represented in the training set in proportion to their probability in $P_{XY}$, then the above “magnification” in $g$-vulnerability is not a problem, because our method ensures that the corresponding pairs $(w,y)$ are magnified in the same proportion.

6. Conclusion and future work

We have proposed an approach to estimate the $g$-vulnerability of a system under the black-box assumption, using machine learning. The basic idea is to reduce the problem to that of learning the Bayes classifier on a set of pre-processed training data, and we have proposed two techniques for this transformation, with different advantages and disadvantages. We have then evaluated our approach in various scenarios, showing favorable results. We have compared our approach with the frequentist one, showing that the performances are similar on small observable domains, while ours performs better on large ones. This is in line with what was already observed in (Cherubin et al., 2019) for the estimation of the Bayes error.

As future work, we plan to test our framework on more real-life scenarios such as website fingerprinting attacks (Cherubin et al., 2017; Cherubin, 2017) and the AES cryptographic algorithm (de Chérisey et al., 2019). We would also like to consider the more general case, often considered in information-flow security, of channels that have both “high” and “low” inputs, where the former are the secrets and the latter are data visible to, or even controlled by, the adversary. Finally, a more ambitious goal is to use our approach to minimize the $g$-vulnerability of complex systems, using a GAN-based approach, along the lines of (Romanelli et al., 2020).

Acknowledgements.
This research was supported by DATAIA “Programme d’Investissement d’Avenir” (ANR-17-CONV-0003). It was also supported by the ANR project REPAS, and by the Inria/DRI project LOGIS. The work of Catuscia Palamidessi was supported by the project HYPATIA, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program, grant agreement n. 835294. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 792464.

References

  • Gow (2011) 2011. Gowalla dataset. https://snap.stanford.edu/data/loc-Gowalla.html.
  • Alvim et al. (2014) Mário S. Alvim, Konstantinos Chatzikokolakis, Annabelle McIver, Carroll Morgan, Catuscia Palamidessi, and Geoffrey Smith. 2014. Additive and Multiplicative Notions of Leakage, and Their Capacities. In IEEE 27th Computer Security Foundations Symposium, CSF 2014, Vienna, Austria, 19-22 July, 2014. IEEE, 308–322. https://doi.org/10.1109/CSF.2014.29
  • Alvim et al. (2016) Mário S. Alvim, Konstantinos Chatzikokolakis, Annabelle McIver, Carroll Morgan, Catuscia Palamidessi, and Geoffrey Smith. 2016. Axioms for Information Leakage. In Proceedings of the 29th IEEE Computer Security Foundations Symposium (CSF). 77–92. https://doi.org/10.1109/CSF.2016.13
  • Alvim et al. (2012) Mário S. Alvim, Konstantinos Chatzikokolakis, Catuscia Palamidessi, and Geoffrey Smith. 2012. Measuring Information Leakage Using Generalized Gain Functions. In 25th IEEE Computer Security Foundations Symposium, CSF 2012, Cambridge, MA, USA, June 25-27, 2012, Stephen Chong (Ed.). IEEE Computer Society, 265–279. http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6265867
  • Bishop (2007) Christopher M. Bishop. 2007. Pattern recognition and machine learning, 5th Edition. Springer. I–XX, 1–738 pages. http://www.worldcat.org/oclc/71008143
  • Bordenabe and Smith (2016) Nicolás E. Bordenabe and Geoffrey Smith. 2016. Correlated Secrets in Quantitative Information Flow. In CSF. IEEE Computer Society, 93–104. http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7518122
  • Boreale (2009) Michele Boreale. 2009. Quantifying information leakage in process calculi. Inf. Comput 207, 6 (2009), 699–725.
  • Boucheron et al. (2013) S. Boucheron, G. Lugosi, and P. Massart. 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford.
  • Chatzikokolakis et al. (2010) Konstantinos Chatzikokolakis, Tom Chothia, and Apratim Guha. 2010. Statistical Measurement of Information Leakage. In TACAS (Lecture Notes in Computer Science, Vol. 6015). Springer, 390–404.
  • Chatzikokolakis et al. (2019) Konstantinos Chatzikokolakis, Natasha Fernandes, and Catuscia Palamidessi. 2019. Comparing systems: max-case refinement orders and application to differential privacy. In Proceedings of the 32nd IEEE Computer Security Foundations Symposium. Hoboken, United States, 442–457. https://doi.org/10.1109/CSF.2019.00037
  • Chatzikokolakis et al. (2008a) Konstantinos Chatzikokolakis, Catuscia Palamidessi, and Prakash Panangaden. 2008a. Anonymity protocols as noisy channels. Inf. Comput 206, 2-4 (2008), 378–401.
  • Chatzikokolakis et al. (2008b) Konstantinos Chatzikokolakis, Catuscia Palamidessi, and Prakash Panangaden. 2008b. On the Bayes risk in information-hiding protocols. Journal of Computer Security 16, 5 (2008), 531–571. https://doi.org/10.3233/JCS-2008-0333
  • Cherubin (2017) Giovanni Cherubin. 2017. Bayes, not Naïve: Security Bounds on Website Fingerprinting Defenses. PoPETs 2017, 4 (2017), 215–231.
  • Cherubin et al. (2019) Giovanni Cherubin, Konstantinos Chatzikokolakis, and Catuscia Palamidessi. 2019. F-BLEAU: Fast Black-box Leakage Estimation. IEEE Symposium on Security and Privacy abs/1902.01350 (2019). http://arxiv.org/abs/1902.01350
  • Cherubin et al. (2017) Giovanni Cherubin, Jamie Hayes, and Marc Juárez. 2017. Website Fingerprinting Defenses at the Application Layer. PoPETs 2017, 2 (2017), 186–203. https://doi.org/10.1515/popets-2017-0023
  • Chizat and Bach (2020) Lénaïc Chizat and Francis Bach. 2020. Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss. In Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria] (Proceedings of Machine Learning Research, Vol. 125), Jacob D. Abernethy and Shivani Agarwal (Eds.). PMLR, 1305–1338. http://proceedings.mlr.press/v125/chizat20a.html
  • Chothia and Guha (2011) Tom Chothia and Apratim Guha. 2011. A Statistical Test for Information Leaks Using Continuous Mutual Information. In CSF. IEEE Computer Society, 177–190. http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=5991608
  • Chothia et al. (2013) Tom Chothia, Yusuke Kawamoto, and Chris Novakovic. 2013. A tool for estimating information leakage. In International Conference on Computer Aided Verification (CAV). Springer, 690–695.
  • Chothia et al. (2014) Tom Chothia, Yusuke Kawamoto, and Chris Novakovic. 2014. LeakWatch: Estimating Information Leakage from Java Programs. In ESORICS (2) (Lecture Notes in Computer Science, Vol. 8713), Miroslaw Kutylowski and Jaideep Vaidya (Eds.). Springer, 219–236.
  • Clark et al. (2001) David Clark, Sebastian Hunt, and Pasquale Malacaria. 2001. Quantitative Analysis of the Leakage of Confidential Data. Electr. Notes Theor. Comput. Sci 59, 3 (2001), 238–251.
  • Cybenko (1992) George Cybenko. 1992. Approximation by superpositions of a sigmoidal function. MCSS 5, 4 (1992), 455.
  • de Chérisey et al. (2019) Eloi de Chérisey, Sylvain Guilley, Olivier Rioul, and Pablo Piantanida. 2019. Best Information is Most Successful - Mutual Information and Success Rate in Side-Channel Analysis. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2019, 2 (2019), 49–79. https://doi.org/10.13154/tches.v2019.i2.49-79
  • Devroye et al. (1996) Luc Devroye, László Györfi, and Gábor Lugosi. 1996. Vapnik-Chervonenkis Theory. Springer New York, New York, NY, 187–213. https://doi.org/10.1007/978-1-4612-0711-5_12
  • Dua and Graff (2017) Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository (Heart Disease Data Set). https://archive.ics.uci.edu/ml/datasets/heart+Disease
  • Dwork (2006) Cynthia Dwork. 2006. Differential Privacy. In 33rd International Colloquium on Automata, Languages and Programming (ICALP 2006) (Lecture Notes in Computer Science, Vol. 4052), Michele Bugliesi, Bart Preneel, Vladimiro Sassone, and Ingo Wegener (Eds.). Springer, 1–12. http://dx.doi.org/10.1007/11787006_1
  • Dwork et al. (2006) Cynthia Dwork, Frank Mcsherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In In Proceedings of the Third Theory of Cryptography Conference (TCC) (Lecture Notes in Computer Science, Vol. 3876), Shai Halevi and Tal Rabin (Eds.). Springer, 265–284.
  • ElSalamouny and Palamidessi (2020) Ehab ElSalamouny and Catuscia Palamidessi. 2020. Full Convergence of the Iterative Bayesian Update and Applications to Mechanisms for Privacy Protection. arXiv:1909.02961 [cs.CR] To appear in the proceedings of EuroS&P.
  • Goodfellow et al. (2016) Ian J. Goodfellow, Yoshua Bengio, and Aaron C. Courville. 2016. Deep Learning. MIT Press. 1–775 pages. http://www.deeplearningbook.org/
  • Hastie et al. (2001) T. Hastie, R. Tibshirani, and J. Friedman. 2001. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag.
  • Köpf and Basin (2007) Boris Köpf and David A. Basin. 2007. An information-theoretic model for adaptive side-channel attacks. In Proceedings of the 2007 ACM Conference on Computer and Communications Security, CCS 2007, Alexandria, Virginia, USA, October 28-31, 2007, Peng Ning, Sabrina De Capitani di Vimercati, and Paul F. Syverson (Eds.). ACM, 286–296.
  • Köpf and Dürmuth (2009) Boris Köpf and Markus Dürmuth. 2009. A Provably Secure and Efficient Countermeasure against Timing Attacks. In Proceedings of the 2009 22nd IEEE Computer Security Foundations Symposium (CSF ’09). IEEE Computer Society, USA, 324–335. https://doi.org/10.1109/CSF.2009.21
  • Lecun et al. (1998) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
  • Romanelli et al. (2020) Marco Romanelli, Catuscia Palamidessi, and Konstantinos Chatzikokolakis. 2020. Generating Optimal Privacy-Protection Mechanisms via Machine Learning, In Proceedings of the IEEE International Symposium on Computer Security Foundations (CSF). CoRR. arXiv:1904.01059 http://arxiv.org/abs/1904.01059
  • Sekeh et al. (2020) S. Y. Sekeh, B. Oselio, and A. O. Hero. 2020. Learning to Bound the Multi-Class Bayes Error. IEEE Transactions on Signal Processing 68 (2020), 3793–3807.
  • Shalev-Shwartz and Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding machine learning : from theory to algorithms. http://www.worldcat.org/search?qt=worldcat_org_all&q=9781107057135
  • Shokri et al. (2011) Reza Shokri, George Theodorakopoulos, Jean-Yves Le Boudec, and Jean-Pierre Hubaux. 2011. Quantifying Location Privacy. In IEEE Symposium on Security and Privacy. IEEE Computer Society, 247–262. http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=5955408;http://www.computer.org/csdl/proceedings/sp/2011/4402/00/index.html
  • Shokri et al. (2012) Reza Shokri, George Theodorakopoulos, Carmela Troncoso, Jean-Pierre Hubaux, and Jean-Yves Le Boudec. 2012. Protecting location privacy: optimal strategy against localization attacks. In the ACM Conference on Computer and Communications Security, CCS’12, Raleigh, NC, USA, October 16-18, 2012, Ting Yu, George Danezis, and Virgil D. Gligor (Eds.). ACM, 617–627. http://dl.acm.org/citation.cfm?id=2382196
  • Smith (2009) Geoffrey Smith. 2009. On the Foundations of Quantitative Information Flow. In Foundations of Software Science and Computational Structures, 12th International Conference, FOSSACS 2009, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009, York, UK, March 22-29, 2009. Proceedings (Lecture Notes in Computer Science, Vol. 5504), Luca de Alfaro (Ed.). Springer, 288–302.
  • Tumer and Ghosh (2003) Kagan Tumer and Joydeep Ghosh. 2003. Bayes Error Rate Estimation Using Classifier Ensembles. International Journal of Smart Engineering System Design 5, 2 (2003), 95–109. https://doi.org/10.1080/10255810305042 arXiv:https://doi.org/10.1080/10255810305042
  • Wolpert (1996) David H. Wolpert. 1996. The Lack of A Priori Distinctions Between Learning Algorithms. Neural Computation 8, 7 (1996), 1341–1390.

Appendix A Auxiliary Results

Proposition A.1 (Bernstein’s inequality (Boucheron et al., 2013)).

Let $Z_1,\dots,Z_n\sim Z$ be i.i.d. random variables such that $Z\in[a,b]$ almost surely, let $S_n=\frac{1}{n}\sum_{i=1}^{n}Z_i-\mathbb{E}\left[Z\right]$, and let $v=\text{Var}(Z)$ be the variance of $Z$. Then, for any $\varepsilon>0$, we have:

(35) [Snε]exp(nε22v+2(ba)ε/3).\displaystyle\mathbb{P}\left[S_{n}\geq\varepsilon\right]\leq\exp\left(-\frac{n\,\varepsilon^{2}}{2\,v+\nicefrac{{2\left(b-a\right)\varepsilon}}{{3}}}\right).

Compared to Hoeffding's inequality, it is easy to check that, in regimes where $\varepsilon$ is small, Bernstein's inequality offers tighter bounds whenever $v\ll(b-a)^2$.

Lemma A.2.

Let $Z$ be a real-valued random variable with $\sigma^2=\text{Var}(Z)$, such that for all $0<t\leq\sigma^2$,

(36) (Zt)2qexp(t2r2).\mathbb{P}(Z\geq t)\leq 2q\exp\left(-\frac{t^{2}}{r^{2}}\right).

Then,

(37) 0σ2(Zt)𝑑tqrπerf(σ2r),\int_{0}^{\sigma^{2}}\mathbb{P}(Z\geq t)dt\leq qr\sqrt{\pi}\operatorname{erf}\left(\frac{\sigma^{2}}{r}\right),

where, for large xx,

(38) erf(x)1exp(x2)xπ+𝒪(x1exp(x2)).\displaystyle\operatorname{erf}(x)\approx 1-\frac{\exp(-x^{2})}{x\sqrt{\pi}}+\mathcal{O}\left(x^{-1}\exp(-x^{2})\right).
Proof.
0σ2(Zt)𝑑t\displaystyle\int_{0}^{\sigma^{2}}\mathbb{P}(Z\geq t)dt 0σ22qexp(t2r2)𝑑t\displaystyle\leq\int_{0}^{\sigma^{2}}2q\exp\left(-\frac{t^{2}}{r^{2}}\right)dt
=qrπerf(σ2r),\displaystyle=qr\sqrt{\pi}\operatorname{erf}\left(\frac{\sigma^{2}}{r}\right),

and eq. (38) follows from the Taylor expansion of the $\operatorname{erf}$ function. ∎

Appendix B Proofs for the statistical bounds

The following lemma is a simple adaptation of the theorems on uniform deviations of relative frequencies from probabilities in (Devroye et al., 1996).

Lemma B.1.

The following inequality holds:

(39) VgV(fm)\displaystyle V_{g}-V(f^{\star}_{m}) 2maxf|V^m(f)V(f)|.\displaystyle\leq 2\max\limits_{f\in\mathcal{H}}\big{|}\widehat{V}_{m}(f)-V(f)\big{|}.
Proof.
Vg\displaystyle V_{g} V(fm)=V(f)V^m(fm)+V^m(fm)V(fm)\displaystyle-V(f^{\star}_{m})=V(f^{\star})-\widehat{V}_{m}(f^{\star}_{m})+\widehat{V}_{m}(f^{\star}_{m})-V(f^{\star}_{m})
(40) V(f)V^m(fm)+|V^m(fm)V(fm)|\displaystyle\leq V(f^{\star})-\widehat{V}_{m}(f^{\star}_{m})+\big{|}\widehat{V}_{m}(f^{\star}_{m})-V(f^{\star}_{m})\big{|}
(41) V(f)V^m(f)+|V^m(fm)V(fm)|\displaystyle\leq V(f^{\star})-\widehat{V}_{m}(f^{\star})+\big{|}\widehat{V}_{m}(f^{\star}_{m})-V(f^{\star}_{m})\big{|}
(42) |V(f)V^m(f)|+|V^m(fm)V(fm)|\displaystyle\leq\big{|}V(f^{\star})-\widehat{V}_{m}(f^{\star})\big{|}+\big{|}\widehat{V}_{m}(f^{\star}_{m})-V(f^{\star}_{m})\big{|}
(43) maxf|V^m(f)V(f)|+maxf|V^m(f)V(f)|\displaystyle\leq\max\limits_{f\in\mathcal{H}}\big{|}\widehat{V}_{m}(f)-V(f)\big{|}+\max\limits_{f\in\mathcal{H}}\big{|}\widehat{V}_{m}(f)-V(f)\big{|}
(44) 2maxf|V^m(f)V(f)|.\displaystyle\leq 2\max\limits_{f\in\mathcal{H}}\big{|}\widehat{V}_{m}(f)-V(f)\big{|}.

B.1. Proof of 3.1

See 3.1

Proof.

We first prove (12). We have:

(|V^n(fm)V(fm)|ε)\displaystyle\mathbb{P}\left(\big{|}\widehat{V}_{n}(f^{\star}_{m})-V(f^{\star}_{m})\big{|}\geq\varepsilon\right)
(45) =𝔼𝒟mPXYm(|V^n(fm)V(fm)|ε|𝒟m)\displaystyle=\mathbb{E}_{\mathcal{D}_{m}\sim P_{XY}^{m}}\mathbb{P}\left(\big{|}\widehat{V}_{n}(f^{\star}_{m})-V(f^{\star}_{m})\big{|}\geq\varepsilon\,\,|\,\mathcal{D}_{m}\right)
(46) 2𝔼𝒟mPXYmexp(nε22σfm2+2(ba)ε/3),\displaystyle\leq 2\mathbb{E}_{\mathcal{D}_{m}\sim P_{XY}^{m}}\exp\left(-\frac{n\,\varepsilon^{2}}{2\sigma^{2}_{f^{\star}_{m}}+\nicefrac{{2\left(b-a\right)\varepsilon}}{{3}}}\right),

where (45) follows from the definition of $\mathbb{P}$: we consider the expectation of the probability over all training sets $\mathcal{D}_m$ sampled from $P_{XY}$, and then for each $\mathcal{D}_m$ we take the probability over all possible validation sets $\mathcal{T}_n$ sampled again from $P_{XY}$. Consider the sequence of $(X_i,Y_i)$'s sampled from $P_{XY}$ that constitute the validation set, and define the random variables $Z_i=g(f^{\star}_m(Y_i),X_i)$. Note that $\mathcal{D}_m$ determines $f^{\star}_m$, hence, for a given $\mathcal{D}_m$, the $Z_i$ are i.i.d. The inequality (46) then follows by applying A.1. Indeed, since the $Z_i$'s are i.i.d., they all have the same expectation and the same variance, hence $S_n=\frac{1}{n}\sum_{i=1}^{n}Z_i-\mathbb{E}\left[Z\right]=\widehat{V}_n(f^{\star}_m)-V(f^{\star}_m)$, and $v=\text{Var}(Z)=\sigma^2_{f^{\star}_m}$. The factor $2$ in front of the $\exp$ is because we consider the absolute value of $S_n$.

As for the bound (13), we have:

(47) (VgV(fm)ε)\displaystyle\mathbb{P}\left(V_{g}-{V}(f^{\star}_{m})\geq\varepsilon\right) (maxf|V^m(f)V(f)|ε/2)\displaystyle\leq\mathbb{P}\left(\max\limits_{f\in\mathcal{H}}\big{|}\widehat{V}_{m}(f)-{V}(f)\big{|}\geq\varepsilon/2\right)
(48) =(f{|V^m(f)V(f)|ε/2})\displaystyle=\mathbb{P}\left(\bigcup\limits_{f\in\mathcal{H}}\big{\{}\big{|}\widehat{V}_{m}(f)-V(f)\big{|}\geq\varepsilon/2\big{\}}\right)
(49) f(|V^m(f)V(f)|ε/2)\displaystyle\leq\sum\limits_{f\in\mathcal{H}}\mathbb{P}\left(\big{|}\widehat{V}_{m}(f)-V(f)\big{|}\geq\varepsilon/2\right)
(50) f2exp(mε28σf2+4(ba)ε/3),\displaystyle\leq\sum\limits_{f\in\mathcal{H}}2\exp\left(-\frac{m\,\varepsilon^{2}}{8\sigma^{2}_{f}+\nicefrac{{4\left(b-a\right)\varepsilon}}{{3}}}\right),

where (47) follows from B.1, steps (48) and (49) are standard, and (50) follows from the same reasoning that we have applied to prove (12). Here we do not take the expectation on the training sets because in each term of the summation the ff is fixed.

B.2. Proof of Theorem 3.2

See 3.2

Proof.

Observe that

𝔼|VgV^n(fm)|\displaystyle\mathbb{E}\big{|}V_{g}-\widehat{V}_{n}(f^{\star}_{m})\big{|} =𝔼|VgV(fm)+V(fm)V^n(fm)|\displaystyle=\mathbb{E}\big{|}V_{g}-{V}(f^{\star}_{m})+{V}(f^{\star}_{m})-\widehat{V}_{n}(f^{\star}_{m})\big{|}
$\leq V_{g}-\mathbb{E}\big{[}{V}(f^{\star}_{m})\big{]}+\mathbb{E}\big{|}{V}(f^{\star}_{m})-\widehat{V}_{n}(f^{\star}_{m})\big{|},$

which follows from the triangular inequality.

First, let us call $\sigma^2$ the worst-case variance defined above, which, according to Popoviciu's inequality, is upper-bounded by $\nicefrac{(b-a)^2}{4}$. Second, let us note that one main advantage of deriving bounds from Bernstein's inequality is that, when $v\leq\varepsilon$, the upper bound in eq. (35) behaves as $\exp(-n\varepsilon)$ instead of $\exp(-n\varepsilon^2)$. Moreover, eq. (46) is upper-bounded by $2\exp\left(-\nicefrac{n\,\varepsilon^2}{2\sigma^2+\nicefrac{2(b-a)\varepsilon}{3}}\right)$. This said, we consider two cases:

i) 𝝈𝟐𝜺\boldsymbol{\sigma^{2}\leq\varepsilon}:

(51) 𝔼|V(fm)V^n(fm)|σ22exp(nε22σ2+2(ba)ε/3)𝑑ε\displaystyle\mathbb{E}|V(f_{m}^{\star})-\widehat{V}_{n}(f_{m}^{\star})|\leq\int_{\sigma^{2}}^{\infty}2\exp\left(-\frac{n\varepsilon^{2}}{2\sigma^{2}+\nicefrac{{2(b-a)\varepsilon}}{{3}}}\right)d\varepsilon
(52) σ22exp(nε2+2(ba)/3)𝑑ε\displaystyle\leq\int_{\sigma^{2}}^{\infty}2\exp\left(-\frac{n\varepsilon}{2+\nicefrac{{2(b-a)}}{{3}}}\right)d\varepsilon
(53) =4(1+(ba)/3)nexp(nσ22(1+(ba)/3)),\displaystyle=\frac{4(1+\nicefrac{{(b-a)}}{{3}})}{n}\exp\left(-\frac{n\sigma^{2}}{2(1+\nicefrac{{(b-a)}}{{3}})}\right),

and then,

(54) Vg𝔼[V(fm)]σ2f2exp(mε28σf2+4(ba)ε/3)dε\displaystyle V_{g}-\mathbb{E}\big{[}{V}(f^{\star}_{m})\big{]}\leq\int_{\sigma^{2}}^{\infty}\sum\limits_{f\in\mathcal{H}}2\exp\left(-\frac{m\,\varepsilon^{2}}{8\sigma^{2}_{f}+\nicefrac{{4\left(b-a\right)\varepsilon}}{{3}}}\right)d\varepsilon
(55) fσ22exp(mε28σ2+4(ba)ε/3)𝑑ε\displaystyle\leq\sum\limits_{f\in\mathcal{H}}\int_{\sigma^{2}}^{\infty}2\exp\left(-\frac{m\,\varepsilon^{2}}{8\sigma^{2}+\nicefrac{{4\left(b-a\right)\varepsilon}}{{3}}}\right)d\varepsilon
(56) fσ22exp(mε8+4(ba)/3)𝑑ε\displaystyle\leq\sum\limits_{f\in\mathcal{H}}\int_{\sigma^{2}}^{\infty}2\exp\left(-\frac{m\,\varepsilon}{8+\nicefrac{{4\left(b-a\right)}}{{3}}}\right)d\varepsilon
(57) ||8(2+(ba)/3)mexp(3mσ24(6+(ba))),\displaystyle\leq|\mathcal{H}|\frac{8(2+\nicefrac{{(b-a)}}{{3}})}{m}\exp\left(-\frac{3m\sigma^{2}}{4(6+(b-a))}\right),

ii) 𝝈𝟐>𝜺\boldsymbol{\sigma^{2}>\varepsilon}:

(58) 𝔼|V(fm)V^n(fm)|0σ22exp(nε22σ2+2(ba)ε/3)𝑑ε\displaystyle\mathbb{E}|V(f_{m}^{\star})-\widehat{V}_{n}(f_{m}^{\star})|\leq\int_{0}^{\sigma^{2}}2\exp\left(-\frac{n\varepsilon^{2}}{2\sigma^{2}+\nicefrac{{2(b-a)\varepsilon}}{{3}}}\right)d\varepsilon
(59) 0σ22exp(nε22σ2+2(ba)σ2/3)𝑑ε\displaystyle\leq\int_{0}^{\sigma^{2}}2\exp\left(-\frac{n\varepsilon^{2}}{2\sigma^{2}+\nicefrac{{2(b-a)\sigma^{2}}}{{3}}}\right)d\varepsilon
(60) =2σ2(1+(ba)/3)nπerf(σ22σ2(1+(ba)/3)n),\displaystyle=\sqrt{\frac{2\sigma^{2}(1+\nicefrac{{(b-a)}}{{3}})}{n}}\sqrt{\pi}\operatorname{erf}\left(\frac{\sigma^{2}}{\sqrt{\frac{2\sigma^{2}(1+\nicefrac{{(b-a)}}{{3}})}{n}}}\right),

considering r=2σ2(1+(ba)/3)nr=\sqrt{\frac{2\sigma^{2}(1+\nicefrac{{(b-a)}}{{3}})}{n}}, q=1q=1 and applying lemma A.2. And finally,

(61) Vg𝔼[V(fm)]f0σ22exp(mε28σf2+4(ba)ε/3)𝑑ε\displaystyle V_{g}-\mathbb{E}\big{[}{V}(f^{\star}_{m})\big{]}\leq\sum\limits_{f\in\mathcal{H}}\int_{0}^{\sigma^{2}}2\exp\left(-\frac{m\,\varepsilon^{2}}{8\sigma^{2}_{f}+\nicefrac{{4\left(b-a\right)\varepsilon}}{{3}}}\right)d\varepsilon
(62) 2||0σ2exp(mε28σ2+4(ba)ε/3)𝑑ε\displaystyle\leq 2|\mathcal{H}|\int_{0}^{\sigma^{2}}\exp\left(-\frac{m\,\varepsilon^{2}}{8\sigma^{2}+\nicefrac{{4\left(b-a\right)\varepsilon}}{{3}}}\right)d\varepsilon
(63) 2||0σ2exp(mε28σ2+4(ba)σ2/3)𝑑ε\displaystyle\leq 2|\mathcal{H}|\int_{0}^{\sigma^{2}}\exp\left(-\frac{m\,\varepsilon^{2}}{8\sigma^{2}+\nicefrac{{4\left(b-a\right)\sigma^{2}}}{{3}}}\right)d\varepsilon
(64) =||8σ2+4σ2(ba)/3mπerf(σ28σ2+4σ2(ba)/3m),\displaystyle=|\mathcal{H}|\sqrt{\frac{8\sigma^{2}+\nicefrac{{4\sigma^{2}(b-a)}}{{3}}}{m}}\sqrt{\pi}\operatorname{erf}\left(\frac{\sigma^{2}}{\sqrt{\frac{8\sigma^{2}+\nicefrac{{4\sigma^{2}(b-a)}}{{3}}}{m}}}\right),

according to lemma A.2 with r=8σ2+4σ2(ba)/3mr=\sqrt{\frac{8\sigma^{2}+\nicefrac{{4\sigma^{2}(b-a)}}{{3}}}{m}} and q=||q=|\mathcal{H}|.

B.3. Proof of 3.4

See 3.4

Proof.

We first notice that

(|VgV^n(fm)|ε)\displaystyle\mathbb{P}\left(|V_{g}-\widehat{V}_{n}(f^{\star}_{m})|\geq\varepsilon\right)
(65) (VgV(fm)ε)+(|V(fm)V^n(fm)|ε),\displaystyle\leq\mathbb{P}\left(V_{g}-{V}(f^{\star}_{m})\geq\varepsilon\right)+\mathbb{P}\left(|{V}(f^{\star}_{m})-\widehat{V}_{n}(f^{\star}_{m})|\geq\varepsilon\right),

and thus from (12), (13) in Proposition 3.1, we have:

(|VgV^n(fm)|ε)\displaystyle\mathbb{P}\left(|V_{g}-\widehat{V}_{n}(f^{\star}_{m})|\geq\varepsilon\right)
(66) 2||exp(mε28σ2+4(ba)ε/3)+2exp(nε22σ2+2(ba)ε/3).\displaystyle\leq 2|\mathcal{H}|\exp\left(-\frac{m\,\varepsilon^{2}}{8\,\sigma^{2}+\nicefrac{{4\,\left(b-a\right)\varepsilon}}{{3}}}\right)+2\exp\left(-\frac{n\,\varepsilon^{2}}{2\sigma^{2}+\nicefrac{{2\,\left(b-a\right)\varepsilon}}{{3}}}\right).

Let us require:

(67) 2||exp(mε28σ2+4(ba)ε/3)\displaystyle 2\,|\mathcal{H}|\,\exp\left(-\frac{m\,\varepsilon^{2}}{8\,\sigma^{2}+\nicefrac{{4\,\left(b-a\right)\varepsilon}}{{3}}}\right) (δΔ),\displaystyle\leq(\delta-\Delta),
(68) 2exp(nε22σ2+2(ba)ε/3)\displaystyle 2\,\exp\left(-\frac{n\,\varepsilon^{2}}{2\,\sigma^{2}+\nicefrac{{2\,\left(b-a\right)\varepsilon}}{{3}}}\right) Δ,\displaystyle\leq\Delta,

which satisfies the desired condition:

(69) (|VgV^n(fm)|ε)\displaystyle\mathbb{P}\left(|V_{g}-\widehat{V}_{n}(f^{\star}_{m})|\geq\varepsilon\right) δ,\displaystyle\leq\delta,

for any 0<Δ<δ0<\Delta<\delta. Finally, from the previous inequality we can derive lower bounds on nn and mm:

(70) m\displaystyle m 8σ2+4(ba)ε/3ε2ln(2||δΔ),\displaystyle\geq\frac{8\,\sigma^{2}+\nicefrac{{4\,\left(b-a\right)\varepsilon}}{{3}}}{\varepsilon^{2}}\ln\left(\frac{2\,|\mathcal{H}|}{\delta-\Delta}\right),
(71) n\displaystyle n 2σ2+2(ba)ε/3ε2ln(2Δ),\displaystyle\geq\frac{2\,\sigma^{2}+\nicefrac{{2\,\left(b-a\right)\varepsilon}}{{3}}}{\varepsilon^{2}}\ln\left(\frac{2}{\Delta}\right),

which by definition of sample complexity shows the corollary. ∎
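As an illustration of how (70) and (71) translate into concrete sizes, the following sketch evaluates the two bounds for some hypothetical values of the parameters (gains in $[a,b]=[0,1]$, worst-case variance $\sigma^2=\nicefrac{1}{4}$, $|\mathcal{H}|=10^4$, $\varepsilon=\delta=0.05$ and $\Delta=\nicefrac{\delta}{2}$); all the numbers are made up for the sake of the example.

import numpy as np

a, b, sigma2, H = 0.0, 1.0, 0.25, 1e4
eps, delta = 0.05, 0.05
Delta = delta / 2

m = (8 * sigma2 + 4 * (b - a) * eps / 3) / eps**2 * np.log(2 * H / (delta - Delta))
n = (2 * sigma2 + 2 * (b - a) * eps / 3) / eps**2 * np.log(2 / Delta)
print(int(np.ceil(m)), int(np.ceil(n)))        # lower bounds on the training / validation sizes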

Appendix C Pre-processing

C.1. Data pre-processing

See 4.1

Proof.
(72) $V_{g_{\mathrm{id}}}(\xi,E)=\sum_{y}\max_{w}\sum_{w^{\prime}}\xi_{w^{\prime}}\cdot E_{w^{\prime}y}\cdot g_{\mathrm{id}}(w,w^{\prime})$
(73) $=\sum_{y}\max_{w}(\xi_{w}E_{wy})=\sum_{y}\max_{w}P_{WY}(w,y)=\sum_{y}\max_{w}\frac{U(w,y)}{\alpha}$
(74) $=\frac{1}{\alpha}\cdot\sum_{y}\max_{w}\sum_{x}\pi_{x}\cdot C_{xy}\cdot g(w,x)=\nicefrac{1}{\alpha}\cdot V_{g}(\pi,C)$

C.2. Channel pre-processing

See 4.2

Proof.

In this proof we use a notation that highlights the structure of the pre-processing. We denote by $G$ the matrix form of $g$, i.e., $G_{wx}=g(w,x)$, and by $\Psi^{\pi}$ the square matrix with $\pi$ on its diagonal and $0$ elsewhere. We have that $\beta=\|G\Psi^{\pi}\|_{1}=\sum_{w,x}G_{wx}\pi_{x}$, which is strictly positive because of the assumptions on $g$ and $\pi$. Furthermore, we have

$\tau^{T}=\beta^{-1}G\Psi^{\pi}\mathbf{1},\qquad R=\beta^{-1}(\Psi^{\tau})^{-1}G\Psi^{\pi},$

where $\mathbf{1}$ is the vector of $1$s and $\tau^{T}$ denotes the transpose of the vector $\tau$. Note that $(\Psi^{\tau})^{-1}$ is a diagonal matrix with entries $\tau_w^{-1}$ on its diagonal. If $\tau_w=0$ then the row $R_{w,\cdot}$ is not properly defined; but its choice does not affect $V_{g_{\mathrm{id}}}(\tau,RC)$ since the corresponding prior is $0$, so we can choose $R_{w,\cdot}$ arbitrarily (or, equivalently, remove the action $w$: it can never be optimal since it gives $0$ gain). It is easy to check that $\tau$ is a proper distribution and $R$ is a proper channel:

wτw\displaystyle\textstyle\sum_{w}\tau_{w} =wβ1xGwxπx\displaystyle~{}=~{}\textstyle\sum_{w}\beta^{-1}\sum_{x}G_{wx}\pi_{x} =β1β\displaystyle~{}=~{}\beta^{-1}\beta =1,\displaystyle~{}=~{}1~{},
xRw,x\displaystyle\textstyle\sum_{x}R_{w,x} =x1τwβ1Gwxπx\displaystyle~{}=~{}\textstyle\sum_{x}\frac{1}{\tau_{w}}\beta^{-1}G_{wx}\pi_{x} =τwτw\displaystyle~{}=~{}\frac{\tau_{w}}{\tau_{w}} =1.\displaystyle~{}=~{}1~{}.

Moreover, it holds that:

$\beta\Psi^{\tau}R=\beta\Psi^{\tau}\beta^{-1}(\Psi^{\tau})^{-1}G\Psi^{\pi}=G\Psi^{\pi}.$

The main result follows from the trace-based formulation of posterior gg-vulnerability (Alvim et al., 2012), since for any channel CC and strategy SS, the above equation directly yields

$V_{g}(\pi,C)=\max_{S}\textbf{tr}(G\Psi^{\pi}CS)=\beta\cdot\max_{S}\textbf{tr}(\Psi^{\tau}RCS)=\beta\cdot V_{g_{\mathrm{id}}}(\tau,RC),$

where tr()\textbf{tr}(\cdot) is the matrix trace. ∎
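A minimal numeric sketch (illustrative Python, not the authors' code) of this construction: given a gain matrix $G$ and a prior $\pi$, it builds $\beta$, $\tau$ and $R$ as above and checks the identity $V_g(\pi,C)=\beta\,V_{g_{\mathrm{id}}}(\tau,RC)$ on a small random example.

import numpy as np

rng = np.random.default_rng(0)
W, X, Y = 4, 3, 5
G = rng.integers(1, 4, (W, X)).astype(float)            # gain matrix g(w, x) > 0
pi = rng.dirichlet(np.ones(X))                           # prior on secrets
C = rng.dirichlet(np.ones(Y), size=X)                    # X x Y channel (rows sum to 1)

beta = (G * pi).sum()
tau = (G * pi).sum(axis=1) / beta                        # prior on guesses
R = (G * pi) / (beta * tau[:, None])                     # W x X channel (rows sum to 1)

Vg = np.sum(np.max(G @ (pi[:, None] * C), axis=0))       # V_g(pi, C)
Vgid = np.sum(np.max(tau[:, None] * (R @ C), axis=0))    # Bayes vulnerability of (tau, RC)
assert np.isclose(Vg, beta * Vgid)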

C.3. Data pre-processing when gg is not integer

Approximating gg so that it only takes values 0\in\mathbb{Q}_{\geq 0} allows us to represent each gain as a quotient of two integers, namely

Numerator(Gw,x)/Denominator (Gw,x).\nicefrac{{\text{Numerator}(G_{w,x})}}{{\text{Denominator }(G_{w,x})}}.

Let us also define

(75) K=deflcmwx(Denominator(Gw,x)),K\stackrel{{\scriptstyle\text{def}}}{{=}}\operatorname{lcm}_{wx}(\text{Denominator}(G_{w,x})),

where $\operatorname{lcm}(\cdot)$ is the least common multiple. Multiplying $G$ by $K$ gives the integer version of the gain matrix, which can replace the original one. It is clear that the computation of the least common multiple, as well as the increase in the amount of data produced when building the dataset with a gain matrix forced to be integer, may constitute a significant computational burden.
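A small sketch of this integer-scaling step (illustrative Python, assuming Python 3.9+ for math.lcm):

from fractions import Fraction
from math import lcm

def integerize(G, max_denominator=1000):
    # approximate every gain by a fraction, then scale by the lcm of the denominators (Eq. (75))
    F = [[Fraction(g).limit_denominator(max_denominator) for g in row] for row in G]
    K = lcm(*[f.denominator for row in F for f in row])
    return [[int(f * K) for f in row] for row in F], K

G_int, K = integerize([[0.5, 1.25], [2.0, 0.2]])         # -> [[10, 25], [40, 4]], K = 20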

Appendix D ANN models

We list here the specifics of the ANN models used in the experiments. All the models are simple feed-forward networks whose layers are fully connected. The activation functions of the hidden neurons are rectified linear units (ReLU), while the output layer has a softmax activation function.

The loss function minimized during training is the cross entropy, a popular choice in classification problems. The remapping $\mathcal{Y}\rightarrow\mathcal{W}$ can in fact be considered a classification problem in which, given an observable, the model learns to make the best guess.

For each experiment, the models have been tuned by cross-validation on one training set randomly chosen among the largest available ones (in terms of number of samples). The specs are listed experiment by experiment in Table 2.

Experiment       | Pre-processing | learning rate | hidden layers | epochs         | hidden units per layer | batch size
Multiple guesses | Data           | $10^{-3}$     | 3             | 700            | [100, 100, 100]        | 1000
Multiple guesses | Channel        | $10^{-3}$     | 3             | 500            | [100, 100, 100]        | 1000
Location Priv.   | Data           | $10^{-3}$     | 3             | 1000           | [500, 500, 500]        | 200, 500, 1000
Location Priv.   | Channel        | $10^{-3}$     | 3             | 200, 500, 1000 | [500, 500, 500]        | 20, 200, 500
Diff. Priv.      | Data           | $10^{-3}$     | 3             | 500            | [100, 100, 100]        | 200
Diff. Priv.      | Channel        | $10^{-3}$     | 3             | 500            | [100, 100, 100]        | 200
Psw SCA          | -              | $10^{-3}$     | 3             | 700            | [100, 100, 100]        | 1000
Table 2. Hyper-parameter settings for each of the experiments above. When multiple values are provided for a parameter, each value corresponds to a specific size of the training set (sorted from the smallest to the largest number of samples).
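For concreteness, one possible implementation of such a model for the multiple-guesses experiment with data pre-processing is sketched below (not the authors' code; the choice of the Adam optimizer and of a scalar input for the observable are assumptions).

import tensorflow as tf

def build_model(n_guesses=45):
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(1,)))                 # scalar observable y (assumption)
    for _ in range(3):                                     # 3 hidden layers of 100 ReLU units
        model.add(tf.keras.layers.Dense(100, activation="relu"))
    model.add(tf.keras.layers.Dense(n_guesses, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy")
    return model

# training as in Table 2: model.fit(y_train, w_train, epochs=700, batch_size=1000)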

Appendix E Frequentist approach description

In the frequentist approach the elements of the channel, namely the conditional probabilities PY|X(y|x){P}_{Y|X}(y|x), are estimated directly in the following way: the empirical prior probability of xx, π^x\widehat{\pi}_{x}, is computed by counting the number of occurrences of xx in the training set and dividing the result by the total number of elements. Analogously, the empirical joint probability P^XY(x,y)\widehat{P}_{XY}(x,y) is computed by counting the number of occurrences of the pair (x,y)(x,y) and dividing the result by the total number of elements in the set. The estimation C^xy\widehat{C}_{xy} of Cxy{C}_{xy} is then defined as

(76) C^xy=P^XY(x,y)π^(x).\widehat{C}_{xy}=\frac{\widehat{P}_{XY}(x,y)}{\widehat{\pi}(x)}.

In order to have a fair comparison with our approach, which takes advantage of the fact that we have several training and validation sets at our disposal, while preserving at the same time the spirit of the frequentist approach, we proceed as follows. Let us consider a training set $\mathcal{D}_m$, which we use to learn the best remapping $\mathcal{Y}\rightarrow\mathcal{W}$, and a validation set $\mathcal{T}_n$, which is then used to actually estimate the $g$-vulnerability. We first compute $\widehat{\pi}$ using $\mathcal{D}_m$. For each $y\in\mathcal{Y}$ and each $x\in\mathcal{X}$, the empirical probability $\widehat{P}_{X|Y}(x|y)$ is also computed using $\mathcal{D}_m$: it is given by the number of times $x$ appears in a pair with $y$ divided by the number of occurrences of $y$. In case a certain $y$ is in $\mathcal{T}_n$ but not in $\mathcal{D}_m$, it is assigned the secret $x^{\prime}=\operatorname*{arg\,max}_{x\in\mathcal{X}}\widehat{\pi}_x$, so that $\widehat{P}_{X|Y}(x^{\prime}|y)=1$ and $\widehat{P}_{X|Y}(x|y)=0$ for all $x\neq x^{\prime}$. It is now possible to find the best mapping for each $y$, defined as $w(y)=\operatorname*{arg\,max}_{w\in\mathcal{W}}\sum_{x\in\mathcal{X}}\widehat{P}_{X|Y}(x|y)\,g(w,x)$. Next, we compute the empirical joint distribution $\widehat{Q}_{XY}$ on $\mathcal{T}_n$: $\widehat{Q}_{XY}(x,y)$ is the number of occurrences of $(x,y)$ divided by the total number of samples in $\mathcal{T}_n$. We finally estimate the $g$-vulnerability on the validation samples according to:

(77) V^n=y𝒴x𝒳Q^XY(x,y)g(w(y),x).\widehat{V}_{n}=\sum\limits_{y\in\mathcal{Y}}\sum\limits_{x\in\mathcal{X}}\widehat{Q}_{XY}(x,y)g(w(y),x).
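A sketch of this estimator (hypothetical Python helper, following the description above):

from collections import Counter, defaultdict

def frequentist_estimate(train, valid, W, g):
    # train, valid: lists of (x, y) pairs; W: list of guesses; g(w, x): the gain function
    pi_hat = Counter(x for x, _ in train)                  # empirical prior (as counts)
    x_by_y = defaultdict(Counter)
    for x, y in train:
        x_by_y[y][x] += 1                                  # counts of x given y
    x_default = pi_hat.most_common(1)[0][0]                # secret assigned to unseen observables

    def best_guess(y):
        if y in x_by_y:
            tot = sum(x_by_y[y].values())
            return max(W, key=lambda w: sum(c / tot * g(w, x) for x, c in x_by_y[y].items()))
        return max(W, key=lambda w: g(w, x_default))

    return sum(g(best_guess(y), x) for x, y in valid) / len(valid)   # Eq. (77)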

Appendix F ANN: Model selection and impact on the estimation

In this section we are going to:

  • briefly summarize the widely known background of the model selection problem from the machine learning standpoint;

  • show, through a new set of experiments, how this problem affects the leakage estimation;

  • propose a heuristic which can help the practitioner in the choice of the model on the same line as classical machine learning techniques.

The problem of model selection in machine learning is still an open one and, although a state-of-the-art algorithm does not exist, heuristics can be proposed to guide the practitioner in the hard task of choosing one model over the others. First, let us underline that the choice of a specific model in the context of machine learning, and specifically of neural networks (and deep learning), must go through a hyper-parameter optimization procedure. In fact, while nowadays neural nets, and especially deep models, represent the state-of-the-art solution to most automatic decision problems, it is also true that, with respect to other simpler methods, they introduce the need for hyper-parameter optimization. Techniques such as grid and random search, as well as Bayesian optimization, have been suggested over the years, especially when not too many parameters need to be tuned. Two aspects must be considered:

  1. (1)

    hyper-parameter optimization relies on a trial-and-error strategy,

  2. (2)

    the results are highly dependent on the data distribution and how well the samples represent such distribution.

In particular, if we consider the typical classification framework for neural nets (on which we build our framework), we expect the network to reproduce in output the distribution $P_{class|input}$ from the observed data. In this context, the practitioner should be careful to avoid two main problems which might affect the models, namely under-fitting and over-fitting. Both problems undermine the generalization capabilities of the models: the former occurs when the model is too simple to correctly represent the distribution; the latter occurs when the model is over-complicated, especially for the given amount of samples, and fits the training data “too well”, which does not translate into good performance on other samples drawn from the same distribution.

In order to understand how these problems impact our framework, we propose an analysis of the first experiment presented in the paper, considering different network models and focusing on different choices of the number of hidden layers. In fig. 12 we compare a model with no hidden layers, hl0, and the three-hidden-layer model presented in the paper, hl3. Using neural networks without hidden layers is not common. Indeed, theoretical results (cfr. (Cybenko, 1992)) state that the simplest universal approximator can be modeled as a neural network with one hidden layer. Although this holds in theory, in practice it is well known that this might require layers with too many neurons, and therefore architectures with multiple hidden layers have been gaining ground in real-world applications.

Therefore, we do not expect too much from the network with no hidden layers and, indeed, the results in fig. 12(a) and fig. 12(b) show that the estimation capabilities of this shallow model are very far from the performance obtained with the three-layer model.

(a) Estimated vulnerability.
 
(b) Normalized estimation error.
 
Figure 12. Multiple guesses scenario, comparison between a model with no hidden layers (hl0) and the three hidden layered model present in the paper (hl3).

The model with no hidden layers is too simple to reproduce the input data distribution, and therefore it does not generalize well in the problem of predicting the best (i.e., most probable) $w$ for a given input $y$.

Let us now focus on the results in fig. 13, where we compare the results of three different models with one, two and three hidden layers (respectively hl1, hl2, and hl3). With this specific experiment we want to:

  • analyze the behavior of different models when we change the number of hidden layers (keeping the number of hidden neurons per layer fixed to 100) and consider different sizes of the training set;

  • introduce a possible heuristic to guide practitioners in the model selection task.

As we can see, observing the box-plots in fig. 13(a) and fig. 13(b) from left to right, when the amount of samples is relatively small the shallow model, i.e. hl1, performs slightly better than the other two models. This is because, since the provided samples are not enough to accurately describe the distribution, a shallow model is prone to under-fitting the training data, producing what appears to be a more general decision. However, as the number of samples increases and, consequently, the data samples represent the distribution better, a deeper model is able to reproduce the data distribution with increased accuracy. This results in better performance when predicting the best $w$ for a given input $y$. As one can see, the three-hidden-layer model keeps improving as the amount of training samples increases. The improvement is limited for the two-hidden-layer model, while for the shallowest model, i.e. hl1, there is no meaningful improvement at all.

Figure 13. Multiple guesses scenario, comparison between a model with one hidden layer (hl1), a model with two hidden layers (hl2), and the three-hidden-layer model presented in the paper (hl3). (a) Estimated vulnerability. (b) Normalized estimation error.

Therefore, provided that enough samples are available, a good heuristic for the practitioner is to pick the model that maximizes the estimated leakage, since it represents the strongest adversary. In practice, this boils down to trying several models, increasing their complexity as long as the increased complexity translates into a higher leakage estimate. The same criterion applies to models with different architectures: if switching from dense to convolutional layers results in a higher leakage estimate, then the convolutional model should be preferred. Even though model selection is still an open problem in machine learning, and we provide no guarantee that the optimal model will eventually be retrieved, this can be considered a valid empirical approach. Indeed, the no-free-lunch principle in machine learning tells us that no learning algorithm outperforms all others on every data distribution, so no model choice can be guaranteed to be optimal. Therefore, when a finite amount of samples is available, the practitioner should evaluate several models and rely on a heuristic, such as the one we suggest, in order to select the best one; a sketch of this selection loop is given below.
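A minimal sketch of the heuristic, assuming each candidate builder returns a compiled Keras classifier (for instance via the build_classifier sketch above) and that accuracy on held-out data stands in for the leakage estimate; all names are illustrative.

def select_model(candidate_builders, x_train, y_train, x_val, y_val):
    """Keep the candidate whose leakage estimate on held-out data is highest."""
    best_model, best_estimate = None, float("-inf")
    for build in candidate_builders:          # ordered from simplest to most complex
        model = build()
        model.fit(x_train, y_train, epochs=50, batch_size=128, verbose=0)
        _, estimate = model.evaluate(x_val, y_val, verbose=0)   # leakage proxy
        if estimate > best_estimate:          # higher estimate = stronger empirical adversary
            best_model, best_estimate = model, estimate
    return best_model, best_estimate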

In order to strengthen this point, and to address the question of how different architectures impact the estimation, we provide one further experimental result.

Figure 14. MNIST experiment: dense network vs. LeNet model estimation of the Bayes risk. A smaller risk corresponds to a higher leakage and a more powerful adversary, able to take more advantage of the information revealed by the data.

Inspired by (Sekeh et al., 2020), we consider the problem of estimating the Bayes error rate (BER, a.k.a. Bayes risk or Bayes leakage) on the MNIST dataset (a benchmark dataset for ML classification problems). In (Sekeh et al., 2020), the authors use an empirical method to estimate bounds on the BER and, in order to do so, they need to perform dimensionality reduction (they try both principal component analysis, or PCA, and auto-encoding via an auto-encoder neural network). Indeed MNIST, being a dataset of 28×28 pixel images, contains high-dimensional samples, and all three applied methods (bound estimation, k-NN and random forest) benefit from dimensionality reduction. Given that the sample distribution of MNIST is unknown, in (Sekeh et al., 2020) the authors check whether the three estimations confirm each other, and indeed they do. However, it is well known that in many image classification tasks the state of the art is represented by deep learning and, in particular, by convolutional neural networks. While in (Sekeh et al., 2020) the authors study how the addition of new layers to the model affects the bound estimation, we focus on the comparison between two network models, a dense one and a convolutional one.

In particular, the dense network model we consider has one hidden layer with 128 hidden units and ReLU activation function. Such an architecture corresponds to a model with 101700 training parameters (connection weights and bias weights). In order to reduce the tendency to over-fit the training data, we improve the model with two dropout layers, one before the hidden layer and one after, with dropout rates of 20% and 50% respectively.

The convolutional neural network we consider is based on the one proposed in (Lecun et al., 1998), which goes by the name of LeNet. In particular, this implementation only requires 38552 training parameters, roughly one third of those required by the dense model, which is shallower but has many more hidden neurons per layer. This convolutional network consists of two 2D convolutional layers with 3×3 kernels, alternated with two average pooling layers. These are followed by two dense layers with 60 and 42 hidden neurons respectively, both preceded by dropout layers with dropout rates of 40% and 30% respectively.
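The two architectures can be sketched in TensorFlow/Keras as follows; the dense model follows the description above, while for the LeNet-style model the convolutional filter counts (6 and 16) are assumptions in the spirit of the original LeNet, since they are not reported here, so the exact parameter count may differ.

import tensorflow as tf

def dense_model():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.2),                     # dropout before the hidden layer
        tf.keras.layers.Dense(128, activation="relu"),    # single hidden layer
        tf.keras.layers.Dropout(0.5),                     # dropout after the hidden layer
        tf.keras.layers.Dense(10, activation="softmax"),  # one output node per class
    ])

def lenet_model():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(6, (3, 3), activation="relu"),    # filter count assumed
        tf.keras.layers.AveragePooling2D(),
        tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),   # filter count assumed
        tf.keras.layers.AveragePooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.4),
        tf.keras.layers.Dense(60, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(42, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])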

Both the dense and the LeNet model have a softmax final layer with 10 nodes, one for each class. We train both models on the 50000 MNIST training samples. We split the remaining 10000 samples into 10 subsets of 1000 samples each. We use the trained models to estimate the BER on these 10 subsets, obtaining the results in fig. 14, where the estimated vulnerability corresponds to the estimated BER. The average values, represented by the black horizontal lines within the box-plots, are directly comparable to the results in (Sekeh et al., 2020).
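The evaluation step can be sketched as follows: the BER of a trained model is estimated as its empirical misclassification rate on each held-out subset; x_test and y_test denote the 10000 held-out samples and are illustrative names.

import numpy as np

def estimate_ber(model, x_subset, y_subset):
    """Empirical misclassification rate of a trained model on one held-out subset."""
    predictions = np.argmax(model.predict(x_subset, verbose=0), axis=1)
    return float(np.mean(predictions != y_subset))

# x_test, y_test: the 10000 held-out MNIST samples, split into 10 disjoint subsets of 1000.
ber_estimates = [
    estimate_ber(model, x_test[i * 1000:(i + 1) * 1000], y_test[i * 1000:(i + 1) * 1000])
    for i in range(10)
]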

We notice that:

  • compared with the results in the aforementioned paper, the dense network’s performance is comparable to that of the random forest and k-NN algorithms, while LeNet’s performance is comparable to that of the convolutional net considered by the authors;

  • the LeNet model’s BER estimate is lower, i.e. the modeled Bayesian adversary is stronger than in the case of the dense net;

  • according to the heuristic proposed above, and given that for MNIST we can only estimate the Bayes leakage from samples, the results of the LeNet model are the more relevant ones, since its leakage estimate is higher than the one obtained through the bound;

  • it is therefore important, when relying only on data for leakage estimation, to compare different models and to follow the state of the art when designing the adversary that exploits the system’s leakage.

Appendix G Majority vote

Figure 15. Comparison between our leakage estimate and the one obtained with a majority-vote based model. In both cases each model is trained on 10K samples. (a) Estimated vulnerability. (b) Normalized estimation error.

In this section we show an alternative procedure to perform leakage estimation. Given a set of models, instead of using each one of them to obtain an estimate, we derive a new model from them and use this new model to estimate the leakage. Following (Tumer and Ghosh, 2003), we obtain the new model by taking a majority vote on the predictions of the models in the set: each model receives the input and outputs a class, and the class eventually assigned to the observable is the one most frequently predicted by the ensemble (or a random one in case of ties, since we consider the simplest way to aggregate models).

We consider the first experiment proposed in our paper, in particular the 5 models obtained by training on the 5 i.i.d. training sets of 10K samples each.
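A minimal sketch of this aggregation, assuming models is the list of the five trained classifiers and that ties are broken uniformly at random; the function name and the array shapes are illustrative.

import numpy as np

def majority_vote(models, x, num_classes, rng=None):
    """Class assigned to each observable by a majority vote over the ensemble."""
    if rng is None:
        rng = np.random.default_rng()
    # Per-model class predictions, shape (num_models, num_samples).
    votes = np.stack([np.argmax(m.predict(x, verbose=0), axis=1) for m in models])
    ensemble_predictions = np.empty(votes.shape[1], dtype=int)
    for i in range(votes.shape[1]):
        counts = np.bincount(votes[:, i], minlength=num_classes)
        winners = np.flatnonzero(counts == counts.max())
        ensemble_predictions[i] = rng.choice(winners)    # random tie-breaking
    return ensemble_predictions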

As we can see in fig. 15, the box-plot corresponding to the estimation not based on the majority vote is the same as the one already reported (cf. fig. 1). Compared to it, the one based on the majority vote shows a higher (and therefore more precise) average estimation and a lower dispersion.

Figure 16. Comparison between the majority vote model and a model trained on all the samples available to the models involved in the majority vote. (a) Estimated vulnerability. (b) Normalized estimation error.

However, it is important to notice that, when we consider a majority vote ensemble in this case, since we are exploiting the “expertise” of several models trained on i.i.d. training sets of 10K samples, we are actually exploiting the knowledge coming from 50K samples. We therefore analyze the case in which, as already done for fig. 1, the same 50K samples, obtained by merging the 5 datasets of 10K samples each, are used to train a single model. The results are shown in fig. 16. In this case, it is clear that taking the majority vote of multiple weak classifiers according to the simple procedure described above gives worse leakage estimation performance than using all the samples to train a single strong classifier.

Appendix H Supplementary plots

In the next pages, we show supplementary plots for each experiment presented in section 5. In particular, we include the plots comparing the estimated vulnerability with the real one, and the plots comparing the frequentist approach with ours.

Figure 17. Supplementary plots for the multiple-guesses experiment. (a) Vulnerability estimation for ANN and k-NN with data pre-processing. (b) Vulnerability estimation for ANN and k-NN with channel pre-processing. (c) Vulnerability estimation for the frequentist approach. (d) Vulnerability estimation for ANN and k-NN with data pre-processing, and the frequentist approach. (e) Vulnerability estimation for ANN and k-NN with channel pre-processing, and the frequentist approach. (f) Normalized estimation error for ANN and k-NN with channel pre-processing, and the frequentist approach.
Figure 18. Supplementary plots for the password-checker experiment. (a) Vulnerability estimation for ANN and k-NN with data and channel pre-processing. (b) Vulnerability estimation for the frequentist approach. (c) Normalized estimation error for the frequentist approach. (d) Vulnerability estimation for ANN and k-NN with data and channel pre-processing, and the frequentist approach. (e) Normalized estimation error for ANN and k-NN with data and channel pre-processing, and the frequentist approach.
Figure 19. Supplementary plots for the location-privacy experiment. (a) Vulnerability estimation for ANN and k-NN with data pre-processing. (b) Vulnerability estimation for ANN and k-NN with channel pre-processing. (c) Vulnerability estimation for the frequentist approach. (d) Normalized estimation error for the frequentist approach. (e) Vulnerability estimation for ANN and k-NN with data pre-processing, and the frequentist approach. (f) Normalized estimation error for ANN and k-NN with data pre-processing, and the frequentist approach. (g) Vulnerability estimation for ANN and k-NN with channel pre-processing, and the frequentist approach. (h) Normalized estimation error for ANN and k-NN with channel pre-processing, and the frequentist approach.
Figure 20. Supplementary plots for the differential-privacy experiment. (a) Vulnerability estimation for ANN and k-NN with data pre-processing. (b) Vulnerability estimation for ANN and k-NN with channel pre-processing. (c) Vulnerability estimation for the frequentist approach. (d) Normalized estimation error for the frequentist approach. (e) Vulnerability estimation for ANN and k-NN with data pre-processing, and the frequentist approach. (f) Normalized estimation error for ANN and k-NN with data pre-processing, and the frequentist approach. (g) Vulnerability estimation for ANN and k-NN with channel pre-processing, and the frequentist approach. (h) Normalized estimation error for ANN and k-NN with channel pre-processing, and the frequentist approach.

Figure 17 refers to the multiple-guesses scenario, Figure 18 to the password-checker experiment, Figure 19 to the location-privacy one, and Figure 20 to the differential-privacy one.