Strongly universally consistent nonparametric regression and classification with privatised data
In this paper we revisit the classical problem of nonparametric regression, but impose local differential privacy constraints. Under such constraints, the raw data $(X_1, Y_1), \ldots, (X_n, Y_n)$, taking values in $\mathbb{R}^d \times \mathbb{R}$, cannot be directly observed, and all estimators are functions of the randomised output of a suitable privacy mechanism. The statistician is free to choose the form of the privacy mechanism, and here we add Laplace-distributed noise to a discretisation of the location of a feature vector $X$ and to the value of its response variable $Y$. Based on these randomised data, we design a novel estimator of the regression function, which can be viewed as a privatised version of the well-studied partitioning regression estimator. Our main result is that the estimator is strongly universally consistent. Our methods and analysis also give rise to a strongly universally consistent binary classification rule for locally differentially private data.
AMS Classification: 62G08, 62G20.
Key words and phrases: regression estimate, classification, local differential privacy, universal consistency
1 Introduction
In recent years there has been a surge of interest in data-analysis methodology that achieves strong statistical performance without compromising the privacy and security of individual data holders. This has often been driven by applications in modern technology, for example at Google (erlingsson2014rappor), Apple (tang2017privacy), and Microsoft (ding2017collecting), but the study goes back at least as far as warner1965randomized, and such methods are often used in more traditional fields such as clinical trials (vu2009differential, dankar2013practicing) and census data (machanavajjhala2008privacy, dwork2019differential). While there has long been an awareness that sensitive data must be anonymised, it has become apparent only relatively recently that simply removing names and addresses is insufficient in many cases (e.g. sweeney2002k, rocher2019estimating). The concept of differential privacy (dwork2006calibrating) was introduced to quantify rigorously the amount of private information about individuals that published statistics contain. Statistical treatments of this framework include wasserman2010statistical, lei2011differentially, avella2019differentially, and cai2019cost.
Although it is a suitable constraint for many problems, differentially private procedures often require the presence of a third party, who must be trusted to handle the raw data before statistics are published. To address this shortcoming, the local differential privacy constraint (see, for example, kairouz2014extremal, duchi2018minimax, and the references therein) was introduced to provide a setting in which analysis must be carried out so that each raw data point is only ever seen by its original holder. The simplest example of a locally differentially private mechanism is the randomised response (warner1965randomized) for binary data, but mechanisms have also been developed for tasks such as classification (berrett2019classification), generalised linear modelling (duchi2018minimax), empirical risk minimisation (wang2018empirical), density estimation (butucea2020local), functional estimation (rohde2018geometrizing) and goodness-of-fit testing (berrett2020locally).
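For concreteness, the randomised response mechanism can be sketched in a few lines of Python; this is an illustrative implementation under naming of our own, not code from any of the cited works. A binary value is reported truthfully with probability $e^\varepsilon/(1+e^\varepsilon)$, so that the likelihood ratio of any output under the two possible inputs is at most $e^\varepsilon$.

```python
import math
import random

def randomised_response(bit: int, epsilon: float) -> int:
    """Report `bit` truthfully with probability exp(eps) / (1 + exp(eps)),
    and flip it otherwise; the output satisfies eps-local differential
    privacy, since P(Z = z | bit) / P(Z = z | 1 - bit) <= exp(eps)."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if random.random() < p_truth else 1 - bit
```

With $\varepsilon = 0$ the output is independent of the input (a fair coin flip), while as $\varepsilon \to \infty$ the true bit is reported with probability tending to one.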
Regression is a cornerstone of modern statistical analysis, routinely used across the sciences and beyond. We recall that, in the standard stochastic model, a regression estimator predicts, for an observed $d$-dimensional random feature vector, an unknown random response with finite second moment. The regression function, given by the conditional expectation of the response given the feature vector, achieves minimum mean squared error. Typically, the statistician does not know the underlying stochastic structure, but has access to a corresponding finite sample of independent and identically distributed design-response vectors in $\mathbb{R}^d \times \mathbb{R}$, and on this basis estimates the regression function. The background will be given at the beginning of Section 2, and in the following we shall refer several times to the monograph of gyorfi2006distribution. A binary classification (pattern recognition) rule predicts, for a feature vector, an unknown random response taking values in $\{0, 1\}$. The so-called Bayes decision rule achieves minimum error probability (the Bayes error). Given a finite sample of i.i.d. design-response vectors in $\mathbb{R}^d \times \{0, 1\}$, the Bayes rule is approximated. We formulate the setup in Section 4, while the monograph of devroye2013probabilistic contains a detailed theory of nonparametric classification.
While regression has been relatively well-studied in the non-local model of differential privacy (e.g. cai2019cost), results in the local model are scarce. zheng2017collect studies sparse linear regression, kernel ridge regression and GLMs. smith2017interaction, wang2018empirical study parametric empirical risk minimisation. wang2018high studies sparse linear regression. duchi2018minimax, duchi2018right study GLMs. The recent work farokhi2020deconvoluting concerns a relaxed version of the locally private regression model where responses can be observed exactly, and empirically studies a Nadaraya–Watson-type estimator, but we are unaware of any other work on locally private nonparametric regression. The simpler problem of binary classification is studied in (berrett2019classification), but there are significant additional challenges in designing a suitable estimator for the regression problem.
In this paper we introduce and investigate a new method for nonparametric regression under $\varepsilon$-local differential privacy constraints, and also present a corresponding classification rule. For regression, our procedure combines a simple non-interactive privacy mechanism with a cubic partitioning regression estimate that modifies the regressogram, which was originally introduced by tukey1947non and has been well studied since (see, e.g., gyorfi2006distribution, Chapter 4 and Section 23.1, and the references therein). In Section 3 we describe the procedure and state that the sequence of estimates is strongly universally consistent, in the sense that the $L_2(\mu)$-risk converges almost surely to zero in the large-sample limit for any data-generating distribution for which the response has a finite second moment. Let us mention that in the degenerate case without privacy the estimator reduces to the strongly universally consistent partitioning estimator of gyorfi1991universal. Since the problem of classification is strictly easier than regression, our methods and analysis also give rise to a strongly universally consistent binary classification rule for locally differentially private data.
The remainder of the paper is organised as follows. In Section 2 we introduce the necessary background on regression and local differential privacy. In Section 3 we introduce our privacy mechanism and estimators, and state our main results in the regression setting. In Section 4 we study the consequences of the results of Section 3 for binary classification. All proofs are deferred to Section 5. We intend to investigate the rates of convergence for the locally private regression problem in a subsequent paper.
2 Preliminaries
2.1 Background and non-private setting
Let $(X, Y)$ be a pair of random variables such that the feature vector $X$ takes values in $\mathbb{R}^d$ and its response variable $Y$ is a real-valued random variable with $\mathbb{E}[Y^2] < \infty$. We denote by $\mu$ the distribution of the feature vector $X$, that is, for all measurable sets $A \subseteq \mathbb{R}^d$, we have $\mu(A) = \mathbb{P}(X \in A)$. Then the regression function
\[ m(x) = \mathbb{E}[Y \mid X = x] \tag{1} \]
is well defined for $\mu$-almost all $x$. For each measurable function $f \colon \mathbb{R}^d \to \mathbb{R}$ one has
\[ \mathbb{E}\big[(f(X) - Y)^2\big] = \mathbb{E}\big[(m(X) - Y)^2\big] + \int_{\mathbb{R}^d} (f(x) - m(x))^2 \,\mu(dx), \]
therefore, with the notation
\[ L^* = \mathbb{E}\big[(m(X) - Y)^2\big], \]
we have
\[ \mathbb{E}\big[(f(X) - Y)^2\big] - L^* = \int_{\mathbb{R}^d} (f(x) - m(x))^2 \,\mu(dx). \tag{2} \]
We measure the performance of an estimator $m_n$ of $m$ through the loss
\[ \int_{\mathbb{R}^d} (m_n(x) - m(x))^2 \,\mu(dx), \]
which, by (2), may be interpreted as the excess prediction risk for a new observation $(X, Y)$.
In this paper we are mainly concerned with regression estimates based on partitions of the sample space, which were originally studied by tukey1947non. Let $\mathcal{P}_n = \{A_{n,1}, A_{n,2}, \ldots\}$ be a cubic partition of $\mathbb{R}^d$ such that the cells $A_{n,j}$ are cubes of volume $h_n^d$. If $x_{n,j}$ denotes the center of the cube $A_{n,j}$, then introduce the discretization of $x \in \mathbb{R}^d$ by the quantiser
\[ q_n(x) = x_{n,j} \quad \text{if } x \in A_{n,j}. \]
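A minimal Python sketch of such a quantiser follows; for illustration we assume the cubic partition is anchored at the origin, and the function name is ours.

```python
import math

def quantise(x, h):
    """Map a point x in R^d to the centre of the cube of side length h,
    from the cubic partition of R^d anchored at the origin, containing x."""
    return tuple((math.floor(xj / h) + 0.5) * h for xj in x)
```

For example, with $h = 0.5$ the point $(0.2, 0.7)$ falls in the cube $[0, 0.5) \times [0.5, 1)$ and is mapped to its centre $(0.25, 0.75)$.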
The raw data will be independent and identically distributed copies
\[ (X_1, Y_1), \ldots, (X_n, Y_n) \]
of the random vector $(X, Y)$, and the estimators that we consider will be (randomised) functions of the binned data, defined by
\[ (q_n(X_1), Y_1), \ldots, (q_n(X_n), Y_n). \]
Using this binned data, when we do not have to satisfy privacy constraints, one may create a scheme for a public data set as follows: there are $n$ individuals in the study, such that individual $i$ generates the sample pair $(X_i, Y_i)$ and submits the discretised version $(q_n(X_i), Y_i)$ to a data collector. The data collector calculates the empirical distributions
\[ \mu_n(A_{n,j}) = \frac{1}{n} \sum_{i=1}^n I_{\{X_i \in A_{n,j}\}} \]
and
\[ \nu_n(A_{n,j}) = \frac{1}{n} \sum_{i=1}^n Y_i I_{\{X_i \in A_{n,j}\}}. \]
Then, the public data set
\[ \big\{ (x_{n,j}, \mu_n(A_{n,j}), \nu_n(A_{n,j})) : \mu_n(A_{n,j}) > 0 \big\} \]
is published. This data set has the favourable property that its size is much smaller than $n$ (cf. lugosi1999adaptive).
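One plausible form of this aggregation step, sketched in Python with helper names of our own: for each occupied cube we record its centre, the empirical frequency of the cube, and the empirical mean response within it.

```python
import math
from collections import defaultdict

def public_data_set(samples, h):
    """Aggregate raw pairs (X_i, Y_i) into one record per occupied cube of
    side length h: cube centre -> (empirical frequency, mean response)."""
    def centre(x):  # centre of the cube containing x (partition at origin)
        return tuple((math.floor(xj / h) + 0.5) * h for xj in x)
    n = len(samples)
    counts = defaultdict(int)
    sums = defaultdict(float)
    for x, y in samples:
        c = centre(x)
        counts[c] += 1
        sums[c] += y
    return {c: (counts[c] / n, sums[c] / counts[c]) for c in counts}
```

The returned dictionary has one entry per occupied cube, which is typically far fewer than the $n$ original observations.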
For binned data, the partitioning regression estimate is defined by
\[ m_n(x) = \frac{\sum_{i=1}^n Y_i I_{\{q_n(X_i) = q_n(x)\}}}{\sum_{i=1}^n I_{\{q_n(X_i) = q_n(x)\}}}, \]
where $0/0$ is $0$ by definition and $I$ denotes the indicator function. In order to have strong universal consistency, we modify the partitioning regression estimate as follows:
Theorem 1.
(Theorem 23.3 in gyorfi2006distribution.) If
\[ \lim_{n \to \infty} h_n = 0 \quad \text{and} \quad \lim_{n \to \infty} \frac{n h_n^d}{\log n} = \infty, \]
then the estimate is strongly universally consistent, i.e.,
\[ \lim_{n \to \infty} \int_{\mathbb{R}^d} (m_n(x) - m(x))^2 \,\mu(dx) = 0 \tag{3} \]
a.s. for any distribution of $(X, Y)$ with $\mathbb{E}[Y^2] < \infty$.
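The basic (unmodified) partitioning estimate described above can be sketched in Python as follows, with the convention $0/0 = 0$ for empty cells; the function name is ours.

```python
import math

def partitioning_estimate(samples, h, x):
    """Regressogram: average the responses Y_i of those samples whose
    feature X_i falls in the same cube of side length h as the query
    point x; an empty cell returns 0 by the convention 0/0 = 0."""
    def cell(p):  # integer index of the cube containing p
        return tuple(math.floor(pj / h) for pj in p)
    target = cell(x)
    in_cell = [y for xi, y in samples if cell(xi) == target]
    return sum(in_cell) / len(in_cell) if in_cell else 0.0
```

Note that the estimate is a step function, constant on each cube of the partition.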
There is a huge literature on the weak and strong universal consistency of regression estimates. Weak universal consistency means the convergence
\[ \lim_{n \to \infty} \mathbb{E} \int_{\mathbb{R}^d} (m_n(x) - m(x))^2 \,\mu(dx) = 0 \]
for any distribution of $(X, Y)$ with $\mathbb{E}[Y^2] < \infty$. For the weak universal consistency of local averaging regression estimates, which include partitioning estimates, kernel estimates and nearest neighbor estimates, we refer to Chapters 4-6 in gyorfi2006distribution.
2.2 Local differential privacy
When working under privacy constraints, our estimator will not have direct access to the raw data $(X_1, Y_1), \ldots, (X_n, Y_n)$, or even to the binned data $(q_n(X_1), Y_1), \ldots, (q_n(X_n), Y_n)$. Instead, it will only be allowed to depend on randomised data $Z_1, \ldots, Z_n$, defined on some measurable space $(\mathcal{Z}, \mathcal{B})$, that have been generated conditional on the raw data. A privacy mechanism will be a conditional distribution $Q$ with the interpretation that the privatised data are drawn according to
\[ (Z_1, \ldots, Z_n) \sim Q\big(\cdot \mid (X_1, Y_1), \ldots, (X_n, Y_n)\big). \]
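For intuition, the scalar Laplace mechanism, the basic building block behind adding Laplace-distributed noise to the discretised location and to the response, can be sketched as follows. This is an illustrative Python version under assumptions of our own: the input is clipped to a range $[-T, T]$, so that Laplace noise of scale $2T/\varepsilon$ renders the released scalar $\varepsilon$-locally differentially private.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from the centred Laplace distribution by inverse-CDF."""
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def privatise_scalar(value: float, epsilon: float, clip: float) -> float:
    """Clip `value` to [-clip, clip] (sensitivity 2 * clip) and add Laplace
    noise of scale 2 * clip / epsilon, giving an eps-locally private release."""
    clipped = max(-clip, min(clip, value))
    return clipped + laplace_noise(2.0 * clip / epsilon)
```

The clipping bounds the sensitivity of the released value, while the noise scale is calibrated so that densities of the output under any two inputs differ by a factor of at most $e^\varepsilon$.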
This privacy mechanism will be said to be sequentially interactive (duchi2018minimax) if it respects the graphical structure